From 1b4e19675dfd47553a183da73d28041165ec47e2 Mon Sep 17 00:00:00 2001
From: Goekdeniz-Guelmez
Date: Sun, 19 Jan 2025 00:48:45 +0100
Subject: [PATCH] update LORA.md

---
 llms/mlx_lm/LORA.md | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/llms/mlx_lm/LORA.md b/llms/mlx_lm/LORA.md
index 9eac9d7f..3ae78a01 100644
--- a/llms/mlx_lm/LORA.md
+++ b/llms/mlx_lm/LORA.md
@@ -12,12 +12,14 @@ LoRA (QLoRA).[^qlora] LoRA fine-tuning works with the following model families:
 - Gemma
 - OLMo
 - MiniCPM
+- Mamba
 - InternLM2
 
 ## Contents
 
 - [Run](#Run)
   - [Fine-tune](#Fine-tune)
+  - [DPO Training](#DPO-Training)
   - [Evaluate](#Evaluate)
   - [Generate](#Generate)
 - [Fuse](#Fuse)
@@ -76,6 +78,33 @@ You can specify the output location with `--adapter-path`.
 You can resume fine-tuning with an existing adapter with
 `--resume-adapter-file <path_to_adapters.safetensors>`.
 
+### DPO Training
+
+Direct Preference Optimization (DPO) training lets you fine-tune a model on human preference data. To use DPO training, set the training mode to `dpo`:
+
+```shell
+mlx_lm.lora \
+    --model <path_to_model> \
+    --train \
+    --training-mode dpo \
+    --data <path_to_data> \
+    --beta 0.1
+```
+
+DPO training accepts the following additional parameters:
+
+- `--beta`: Controls the strength of the DPO loss (default: 0.1)
+- `--dpo-loss-type`: The DPO loss function, one of "sigmoid" (default), "hinge", "ipo", or "dpop"
+- `--is-reference-free`: Enable reference-free DPO training
+- `--delta`: Margin parameter for the hinge loss (default: 50.0)
+- `--reference-model-path`: Path to a reference model for DPO training
+
+For DPO training, the data should be in JSONL format, with one preference example per line:
+
+```jsonl
+{"prompt": "User prompt", "chosen": "Preferred response", "rejected": "Less preferred response"}
+```
+
 ### Evaluate
 
 To compute test set perplexity use:
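
For context on what the new `--beta` and `--is-reference-free` flags control, the following is a minimal sketch of the default "sigmoid" DPO objective, written against `mlx.core`. The function and argument names are illustrative assumptions, not identifiers introduced by this patch; it assumes the trainer already supplies per-sequence log-probabilities from the policy and reference models.

```python
# A minimal sketch of the "sigmoid" DPO loss; names are illustrative only.
import mlx.core as mx


def dpo_sigmoid_loss(
    policy_chosen_logps: mx.array,    # log p_theta(chosen | prompt), shape (batch,)
    policy_rejected_logps: mx.array,  # log p_theta(rejected | prompt), shape (batch,)
    ref_chosen_logps: mx.array,       # same quantities under the reference model
    ref_rejected_logps: mx.array,
    beta: float = 0.1,                # corresponds to the --beta flag
    reference_free: bool = False,     # corresponds to --is-reference-free
) -> mx.array:
    # Log-ratio of the preferred over the rejected completion under the policy.
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    # With reference-free DPO the reference term is treated as zero.
    if reference_free:
        ref_logratio = mx.zeros_like(policy_logratio)
    else:
        ref_logratio = ref_chosen_logps - ref_rejected_logps
    # beta scales how strongly the policy is pushed away from the reference.
    logits = beta * (policy_logratio - ref_logratio)
    # "sigmoid" DPO loss: -log sigmoid(logits), averaged over the batch.
    return -mx.mean(mx.log(mx.sigmoid(logits)))


# Example with dummy log-probabilities for a batch of two preference pairs:
loss = dpo_sigmoid_loss(
    mx.array([-12.0, -9.5]), mx.array([-14.0, -9.0]),
    mx.array([-12.5, -9.7]), mx.array([-13.5, -9.2]),
    beta=0.1,
)
```

The other loss types listed in the patch ("hinge", "ipo", "dpop") swap out the final `-log sigmoid(...)` term; per the flag description, `--delta` is the margin used by the hinge variant.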
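
The preference data can also be produced programmatically. The sketch below writes a `train.jsonl` in the prompt/chosen/rejected layout documented above; the `data/` directory and the records themselves are made-up examples, following the usual `mlx_lm` convention of pointing `--data` at a directory containing `train.jsonl` and `valid.jsonl`.

```python
# Illustrative only: write a tiny preference dataset in the JSONL layout
# documented above (one {"prompt", "chosen", "rejected"} object per line).
import json
from pathlib import Path

examples = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "France is a country in Europe.",
    },
    {
        "prompt": "Summarize: MLX is an array framework for Apple silicon.",
        "chosen": "MLX is an array framework designed for Apple silicon.",
        "rejected": "I don't know.",
    },
]

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

# One JSON object per line, as expected for JSONL training data.
with open(data_dir / "train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```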