update LORA.md

This commit is contained in:
parent 582f979dfd
commit 1b4e19675d

@@ -12,12 +12,14 @@ LoRA (QLoRA).[^qlora] LoRA fine-tuning works with the following model families:

- Gemma
- OLMo
- MiniCPM
- Mamba
- InternLM2

## Contents

- [Run](#Run)
- [Fine-tune](#Fine-tune)
- [DPO Training](#DPO-Training)
- [Evaluate](#Evaluate)
- [Generate](#Generate)
- [Fuse](#Fuse)

@@ -76,6 +78,33 @@ You can specify the output location with `--adapter-path`.

You can resume fine-tuning with an existing adapter with
`--resume-adapter-file <path_to_adapters.safetensors>`.
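
For example, a resumed run might look like the following sketch; the adapter file path is illustrative and should point at the `adapters.safetensors` saved by an earlier run:

```shell
# Continue training from a previously saved adapter (the path is illustrative).
mlx_lm.lora \
    --model <path_to_model> \
    --train \
    --data <path_to_data> \
    --resume-adapter-file adapters/adapters.safetensors
```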

### DPO Training

Direct Preference Optimization (DPO) training allows you to fine-tune models using human preference data. To use DPO training, set the training mode to `dpo`:

```shell
mlx_lm.lora \
    --model <path_to_model> \
    --train \
    --training-mode dpo \
    --data <path_to_data> \
    --beta 0.1
```

DPO training accepts the following additional parameters; an example invocation combining several of them is sketched after the list:

- `--beta`: Controls the strength of the DPO loss (default: 0.1)
- `--dpo-loss-type`: Choose from "sigmoid" (default), "hinge", "ipo", or "dpop" loss functions
- `--is-reference-free`: Enable reference-free DPO training
- `--delta`: Margin parameter for the hinge loss (default: 50.0)
- `--reference-model-path`: Path to a reference model for DPO training
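
As an illustration only (the values and paths are placeholders, not recommendations), a run that switches to the hinge loss and points at an explicit reference model could combine these flags like so:

```shell
# Illustrative DPO run using the hinge loss and an explicit reference model.
# The --delta value and the bracketed paths are placeholders.
mlx_lm.lora \
    --model <path_to_model> \
    --train \
    --training-mode dpo \
    --data <path_to_data> \
    --dpo-loss-type hinge \
    --delta 50.0 \
    --reference-model-path <path_to_reference_model>
```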

For DPO training, the data should be in JSONL format with the following structure:

```jsonl
{"prompt": "User prompt", "chosen": "Preferred response", "rejected": "Less preferred response"}
```
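
As a minimal sketch, assuming the `--data` directory uses the usual `train.jsonl` / `valid.jsonl` file names, a toy preference dataset could be created like this (the example records are made up):

```shell
# Create a toy preference dataset; the file names assume the conventional
# train.jsonl / valid.jsonl layout for the --data directory.
mkdir -p data
cat > data/train.jsonl <<'EOF'
{"prompt": "What is the capital of France?", "chosen": "The capital of France is Paris.", "rejected": "I am not sure."}
EOF
cat > data/valid.jsonl <<'EOF'
{"prompt": "Name an even prime number.", "chosen": "2 is the only even prime number.", "rejected": "All primes are odd."}
EOF
```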

### Evaluate

To compute test set perplexity use: