
# Fine-Tuning with LoRA or QLoRA

You can use the `mlx-lm` package to fine-tune an LLM with low-rank
adaptation (LoRA) for a target task.[^lora] The example also supports quantized
LoRA (QLoRA).[^qlora] LoRA fine-tuning works with the following model families:

- Mistral
- Llama
- Phi2
- Mixtral
- Qwen2
- OLMo

## Contents

* [Run](#Run)
  * [Fine-tune](#Fine-tune)
  * [Evaluate](#Evaluate)
  * [Generate](#Generate)
* [Fuse and Upload](#Fuse-and-Upload)
* [Data](#Data)
* [Memory Issues](#Memory-Issues)

## Run

The main command is `mlx_lm.lora`. To see a full list of command-line options run:

```shell
python -m mlx_lm.lora --help
```

In the commands below, the `--model` argument can be any compatible Hugging
Face repo or a local path to a converted model.

You can also specify a YAML config with `-c`/`--config`. For more on the format see the
[example YAML](examples/lora_config.yaml). For example:

```shell
python -m mlx_lm.lora --config /path/to/config.yaml
```

If command-line flags are also used, they will override the corresponding
values in the config.
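
The precedence described above can be pictured with a small sketch. This is not the code `mlx_lm.lora` uses: `resolve_options` and `DEFAULTS` are hypothetical names, the keys are assumed to mirror the flags with underscores, and the two default values are the ones quoted in the Memory Issues section.

```python
# Hypothetical defaults for illustration; the real ones live inside mlx_lm.lora.
DEFAULTS = {"batch_size": 4, "lora_layers": 16}


def resolve_options(defaults, config, cli):
    """Merge option sources with precedence: defaults < config file < CLI flags."""
    resolved = dict(defaults)
    resolved.update(config)
    # Flags that were not passed are represented as None here and must not
    # overwrite a value that came from the config file or the defaults.
    resolved.update({key: value for key, value in cli.items() if value is not None})
    return resolved


config = {"batch_size": 2, "iters": 600}     # as if parsed from the YAML file
cli = {"batch_size": None, "lora_layers": 8}  # only --lora-layers was passed
options = resolve_options(DEFAULTS, config, cli)
print(options)  # {'batch_size': 2, 'lora_layers': 8, 'iters': 600}
```

Here the batch size comes from the config file, the number of LoRA layers from the command line, and nothing falls back to a default because every key was set by one of the two.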

### Fine-tune

To fine-tune a model use:

```shell
python -m mlx_lm.lora \
    --model <path_to_model> \
    --train \
    --data <path_to_data> \
    --iters 600
```

The `--data` argument must point to a directory that contains `train.jsonl` and
`valid.jsonl` when using `--train`, and `test.jsonl` when using `--test`. For more
details on the data format see the section on [Data](#Data).

For example, to fine-tune Mistral 7B you can use
`--model mistralai/Mistral-7B-v0.1`.

If `--model` points to a quantized model, then the training will use QLoRA,
otherwise it will use regular LoRA.

By default, the adapter weights are saved in `adapters.npz`. You can specify
the output location with `--adapter-file`.

You can resume fine-tuning with an existing adapter with
`--resume-adapter-file <path_to_adapters.npz>`. 

### Evaluate

To compute test set perplexity use:

```shell
python -m mlx_lm.lora \
    --model <path_to_model> \
    --adapter-file <path_to_adapters.npz> \
    --data <path_to_data> \
    --test
```
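
As a reminder of what the reported number means, the sketch below uses the standard definition of perplexity: the exponential of the mean per-token negative log-likelihood. It is assumed, not confirmed here, that `mlx_lm.lora` follows this definition. The `perplexity` helper is a hypothetical name.

```python
import math


def perplexity(token_nlls):
    """Return exp(mean negative log-likelihood) over a list of per-token values."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among four options, so its perplexity is (up to rounding) 4.
nlls = [-math.log(0.25)] * 8
ppl = perplexity(nlls)
print(f"{ppl:.3f}")
```

Lower is better: a perplexity of 1 would mean the model predicts every test token with certainty.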

### Generate

For generation use `mlx_lm.generate`:

```shell
python -m mlx_lm.generate \
    --model <path_to_model> \
    --adapter-file <path_to_adapters.npz> \
    --prompt "<your_model_prompt>"
```

## Fuse and Upload

You can generate a model fused with the low-rank adapters using the
`mlx_lm.fuse` command. This command also allows you to upload the fused model
to the Hugging Face Hub.

To see supported options run:

```shell
python -m mlx_lm.fuse --help
```

To generate the fused model run:

```shell
python -m mlx_lm.fuse --model <path_to_model>
```

By default, this loads the adapters from `adapters.npz` and saves the fused
model to `lora_fused_model/`. Both are configurable.

To upload a fused model, supply the `--upload-repo` and `--hf-path` arguments
to `mlx_lm.fuse`. The latter is the repo name of the original model, which is
useful for the sake of attribution and model versioning.

For example, to fuse and upload a model derived from Mistral-7B-v0.1, run: 

```shell
python -m mlx_lm.fuse \
    --model mistralai/Mistral-7B-v0.1 \
    --upload-repo mlx-community/my-4bit-lora-mistral \
    --hf-path mistralai/Mistral-7B-v0.1
```

## Data

The LoRA command expects you to provide a dataset with `--data`. The MLX
Examples GitHub repo has an [example of the WikiSQL
data](https://github.com/ml-explore/mlx-examples/tree/main/lora/data) in the
correct format.

For fine-tuning (`--train`), the data loader expects a `train.jsonl` and a
`valid.jsonl` to be in the data directory. For evaluation (`--test`), the data
loader expects a `test.jsonl` in the data directory. Each line in the `*.jsonl`
file should look like:

```
{"text": "This is an example for the model."}
```

Other keys are ignored by the loader.
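
One way to produce the three files from a list of strings is sketched below. The `write_splits` helper, the split fractions, and the `my_data` directory name are all assumptions made for illustration. Only the file names and the `"text"` key come from the description above.

```python
import json
import random
from pathlib import Path


def write_splits(texts, out_dir, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle texts and write train/valid/test .jsonl files into out_dir."""
    texts = list(texts)
    random.Random(seed).shuffle(texts)
    n_valid = max(1, int(len(texts) * valid_frac))
    n_test = max(1, int(len(texts) * test_frac))
    splits = {
        "valid": texts[:n_valid],
        "test": texts[n_valid:n_valid + n_test],
        "train": texts[n_valid + n_test:],
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out / f"{name}.jsonl", "w", encoding="utf-8") as f:
            for text in rows:
                # One JSON object per line, with the example under the "text" key.
                f.write(json.dumps({"text": text}) + "\n")
    return {name: len(rows) for name, rows in splits.items()}


examples = [f"Example sentence number {i}." for i in range(20)]
counts = write_splits(examples, "my_data")
print(counts)
```

The resulting directory can then be passed as `--data my_data`.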

## Memory Issues

Fine-tuning a large model with LoRA requires a machine with a decent amount
of memory. Here are some tips to reduce memory use should you need to do so:

1. Try quantization (QLoRA). You can use QLoRA by generating a quantized model
   with `mlx_lm.convert` and the `-q` flag. Run `python -m mlx_lm.convert --help`
   for more details.

2. Try using a smaller batch size with `--batch-size`. The default is `4`, so
   setting this to `2` or `1` will reduce memory consumption, though it may
   slow training down a little.

3. Reduce the number of layers to fine-tune with `--lora-layers`. The default
   is `16`, so you can try `8` or `4`. This reduces the amount of memory
   needed for back propagation. It may also reduce the quality of the
   fine-tuned model if you are fine-tuning with a lot of data.

4. Longer examples require more memory. If it makes sense for your data, you
   can break your examples into smaller sequences when making the
   `{train, valid, test}.jsonl` files.

5. Gradient checkpointing lets you trade off memory use (less) for computation
   (more) by recomputing instead of storing intermediate values needed by the
   backward pass. You can use gradient checkpointing by passing the
   `--grad-checkpoint` flag. Gradient checkpointing will be more helpful for
   larger batch sizes or sequence lengths with smaller or quantized models.

For example, for a machine with 32 GB the following should run reasonably fast:

```shell
python -m mlx_lm.lora \
    --model mistralai/Mistral-7B-v0.1 \
    --train \
    --batch-size 1 \
    --lora-layers 4 \
    --data wikisql
```

The above command on an M1 Max with 32 GB runs at about 250
tokens per second, using the MLX Example
[`wikisql`](https://github.com/ml-explore/mlx-examples/tree/main/lora/data)
data set.
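
Tip 4 (breaking long examples into smaller sequences) can be sketched as follows. This is a rough illustration only: it splits on whitespace-separated words rather than model tokens, which collapses the original spacing and line breaks, and `split_text`, `chunk_jsonl_lines`, and `max_words` are hypothetical names. Whether a mid-example cut makes sense depends on your data.

```python
import json


def split_text(text, max_words=256):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def chunk_jsonl_lines(lines, max_words=256):
    """Yield JSONL lines whose "text" values hold at most max_words words."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for chunk in split_text(record["text"], max_words):
            yield json.dumps({"text": chunk})


# A ten-word example split into chunks of at most four words.
long_line = json.dumps({"text": " ".join(f"w{i}" for i in range(10))})
chunks = list(chunk_jsonl_lines([long_line], max_words=4))
print(chunks)
```

Writing the yielded lines to a new `train.jsonl` gives shorter sequences per example, at the cost of each chunk losing the context that preceded it.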


[^lora]: Refer to the [arXiv paper](https://arxiv.org/abs/2106.09685) for more details on LoRA.
[^qlora]: Refer to the paper [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314).