commit 37b41cec60 (parent 4fa659acbd)
Author: Awni Hannun
Date: 2024-01-04 21:05:59 -08:00
Committed by: GitHub

    qlora

8 changed files with 137 additions and 51 deletions


@@ -1,8 +1,8 @@
-# LoRA
+# Fine-Tuning with LoRA or QLoRA
This is an example of using MLX to fine-tune either a Llama 7B[^llama] or a
Mistral 7B[^mistral] model with low rank adaptation (LoRA)[^lora] for a target
-task.
+task. The example also supports quantized LoRA (QLoRA).[^qlora]
In this example we'll use the WikiSQL[^wikisql] dataset to train the LLM to
generate SQL queries from natural language. However, the example is intended to
@@ -43,10 +43,13 @@ Convert the model with:
```
python convert.py \
-   --torch-model <path_to_torch_model> \
-   --mlx-model <path_to_mlx_model>
+   --torch-path <path_to_torch_model> \
+   --mlx-path <path_to_mlx_model>
```
+If you wish to use QLoRA, then convert the model with 4-bit quantization using
+the `-q` option.
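
As a minimal sketch, the QLoRA conversion just combines the conversion command above with the `-q` flag; the paths are the same placeholders used earlier:

```
python convert.py \
   --torch-path <path_to_torch_model> \
   --mlx-path <path_to_mlx_model> \
   -q
```
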
## Run
The main script is `lora.py`. To see a full list of options run
@@ -65,8 +68,11 @@ python lora.py --model <path_to_model> \
--iters 600
```
+If `--model` points to a quantized model, then the training will use QLoRA;
+otherwise it will use regular LoRA.
Note that the model path should have the MLX weights, the tokenizer, and the
-`params.json` configuration which will all be output by the `convert.py` script.
+`config.json` which will all be output by the `convert.py` script.
By default, the adapter weights are saved in `adapters.npz`. You can specify
the output location with `--adapter-file`.
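
For instance, a training run that writes the adapters to a custom file might look like the sketch below; the `--train` flag is an assumption (it is not shown in the truncated command above), and the output file name is just an example:

```
# sketch only: --train and the adapter file name are assumptions
python lora.py --model <path_to_mlx_model> \
   --train \
   --iters 600 \
   --adapter-file sql_adapters.npz
```
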
@@ -137,16 +143,20 @@ Note other keys will be ignored by the loader.
Fine-tuning a large model with LoRA requires a machine with a decent amount
of memory. Here are some tips to reduce memory use should you need to do so:
-1. Try using a smaller batch size with `--batch-size`. The default is `4` so
+1. Try quantization (QLoRA). You can use QLoRA by generating a quantized model
+   with `convert.py` and the `-q` flag. See the [Setup](#setup) section for
+   more details.
+2. Try using a smaller batch size with `--batch-size`. The default is `4` so
setting this to `2` or `1` will reduce memory consumption. This may slow
things down a little, but will also reduce the memory use.
-2. Reduce the number of layers to fine-tune with `--lora-layers`. The default
+3. Reduce the number of layers to fine-tune with `--lora-layers`. The default
is `16`, so you can try `8` or `4`. This reduces the amount of memory
needed for back propagation. It may also reduce the quality of the
fine-tuned model if you are fine-tuning with a lot of data.
-3. Longer examples require more memory. If it makes sense for your data, one thing
+4. Longer examples require more memory. If it makes sense for your data, one thing
you can do is break your examples into smaller
sequences when making the `{train, valid, test}.jsonl` files.
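
The tips above can be combined. A reduced-memory sketch, assuming a quantized model from tip 1 and the same (not shown here) `--train` flag:

```
# sketch: quantized model plus a smaller batch and fewer LoRA layers
python lora.py --model <path_to_quantized_mlx_model> \
   --train \
   --batch-size 1 \
   --lora-layers 4
```
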
@@ -164,6 +174,7 @@ The above command on an M1 Max with 32 GB runs at about 250 tokens-per-second.
[^lora]: Refer to the [arXiv paper](https://arxiv.org/abs/2106.09685) for more details on LoRA.
+[^qlora]: Refer to the paper [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
[^llama]: Refer to the [arXiv paper](https://arxiv.org/abs/2302.13971) and [blog post](https://ai.meta.com/blog/large-language-model-llama-meta-ai/) for more details.
[^mistral]: Refer to the [blog post](https://mistral.ai/news/announcing-mistral-7b/) and [github repository](https://github.com/mistralai/mistral-src) for more details.
[^wikisql]: Refer to the [GitHub repo](https://github.com/salesforce/WikiSQL/tree/master) for more information about WikiSQL.