@@ -1,8 +1,8 @@
-# LoRA
+# Fine-Tuning with LoRA or QLoRA

 This is an example of using MLX to fine-tune either a Llama 7B[^llama] or a
 Mistral 7B[^mistral] model with low rank adaptation (LoRA)[^lora] for a target
-task.
+task. The example also supports quantized LoRA (QLoRA).[^qlora]

 In this example we'll use the WikiSQL[^wikisql] dataset to train the LLM to
 generate SQL queries from natural language. However, the example is intended to
@@ -43,10 +43,13 @@ Convert the model with:

 ```
 python convert.py \
-    --torch-model <path_to_torch_model> \
-    --mlx-model <path_to_mlx_model>
+    --torch-path <path_to_torch_model> \
+    --mlx-path <path_to_mlx_model>
 ```

+If you wish to use QLoRA, then convert the model with 4-bit quantization using
+the `-q` option.
+
 ## Run

 The main script is `lora.py`. To see a full list of options run
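As a concrete sketch of the conversion step in the hunk above, combining the renamed `--torch-path`/`--mlx-path` flags with the `-q` option (the angle-bracket paths remain placeholders):

```
# Convert a PyTorch checkpoint to MLX format and quantize it to 4 bits
# so that the later fine-tuning step runs as QLoRA rather than regular LoRA.
python convert.py \
    --torch-path <path_to_torch_model> \
    --mlx-path <path_to_mlx_model> \
    -q
```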
@@ -65,8 +68,11 @@ python lora.py --model <path_to_model> \
     --iters 600
 ```

+If `--model` points to a quantized model, then the training will use QLoRA,
+otherwise it will use regular LoRA.
+
 Note, the model path should have the MLX weights, the tokenizer, and the
-`params.json` configuration which will all be output by the `convert.py` script.
+`config.json` which will all be output by the `convert.py` script.

 By default, the adapter weights are saved in `adapters.npz`. You can specify
 the output location with `--adapter-file`.
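A minimal training invocation matching this hunk might look like the sketch below; the `--train` flag is assumed from the script's full option list and does not appear in the hunk itself:

```
# Fine-tune from a converted model directory (MLX weights, tokenizer,
# and config.json). If that directory holds a quantized model, training
# uses QLoRA; otherwise it uses regular LoRA.
# Note: --train is an assumption, not shown in the hunk above.
python lora.py --model <path_to_model> \
    --train \
    --iters 600 \
    --adapter-file adapters.npz
```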
@@ -137,16 +143,20 @@ Note other keys will be ignored by the loader.
 Fine-tuning a large model with LoRA requires a machine with a decent amount
 of memory. Here are some tips to reduce memory use should you need to do so:

-1. Try using a smaller batch size with `--batch-size`. The default is `4` so
+1. Try quantization (QLoRA). You can use QLoRA by generating a quantized model
+   with `convert.py` and the `-q` flag. See the [Setup](#setup) section for
+   more details.
+
+2. Try using a smaller batch size with `--batch-size`. The default is `4` so
    setting this to `2` or `1` will reduce memory consumption. This may slow
    things down a little, but will also reduce the memory use.

-2. Reduce the number of layers to fine-tune with `--lora-layers`. The default
+3. Reduce the number of layers to fine-tune with `--lora-layers`. The default
    is `16`, so you can try `8` or `4`. This reduces the amount of memory
    needed for back propagation. It may also reduce the quality of the
    fine-tuned model if you are fine-tuning with a lot of data.

-3. Longer examples require more memory. If it makes sense for your data, one thing
+4. Longer examples require more memory. If it makes sense for your data, one thing
    you can do is break your examples into smaller
    sequences when making the `{train, valid, test}.jsonl` files.

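To make the tips above concrete, a reduced-memory run might combine the flags like this (again, `--train` is assumed rather than shown in the diff):

```
# Smaller batch and fewer LoRA layers than the defaults of 4 and 16
# to cut memory use, at some cost in speed and possibly model quality.
python lora.py --model <path_to_model> \
    --train \
    --batch-size 1 \
    --lora-layers 4 \
    --iters 600
```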
@@ -164,6 +174,7 @@ The above command on an M1 Max with 32 GB runs at about 250 tokens-per-second.


 [^lora]: Refer to the [arXiv paper](https://arxiv.org/abs/2106.09685) for more details on LoRA.
+[^qlora]: Refer to the paper [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
 [^llama]: Refer to the [arXiv paper](https://arxiv.org/abs/2302.13971) and [blog post](https://ai.meta.com/blog/large-language-model-llama-meta-ai/) for more details.
 [^mistral]: Refer to the [blog post](https://mistral.ai/news/announcing-mistral-7b/) and [github repository](https://github.com/mistralai/mistral-src) for more details.
 [^wikisql]: Refer to the [GitHub repo](https://github.com/salesforce/WikiSQL/tree/master) for more information about WikiSQL.