commit 37b41cec60 (parent 4fa659acbd)
Author: Awni Hannun
Date: 2024-01-04 21:05:59 -08:00
Committed by: GitHub

    qlora

8 changed files with 137 additions and 51 deletions


@@ -1,8 +1,8 @@
-# LoRA
+# Fine-Tuning with LoRA or QLoRA
This is an example of using MLX to fine-tune either a Llama 7B[^llama] or a
Mistral 7B[^mistral] model with low rank adaptation (LoRA)[^lora] for a target
-task.
+task. The example also supports quantized LoRA (QLoRA).[^qlora]
In this example we'll use the WikiSQL[^wikisql] dataset to train the LLM to
generate SQL queries from natural language. However, the example is intended to
@@ -43,10 +43,13 @@ Convert the model with:
```
python convert.py \
-   --torch-model <path_to_torch_model> \
-   --mlx-model <path_to_mlx_model>
+   --torch-path <path_to_torch_model> \
+   --mlx-path <path_to_mlx_model>
```
+If you wish to use QLoRA, then convert the model with 4-bit quantization using
+the `-q` option.
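
As a minimal sketch, the QLoRA conversion just combines the conversion command above with the `-q` flag; the paths are the same placeholders used earlier:

```
python convert.py \
   --torch-path <path_to_torch_model> \
   --mlx-path <path_to_mlx_model> \
   -q
```
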
## Run
The main script is `lora.py`. To see a full list of options run
@@ -65,8 +68,11 @@ python lora.py --model <path_to_model> \
--iters 600
```
+If `--model` points to a quantized model, then the training will use QLoRA;
+otherwise it will use regular LoRA.
Note that the model path should have the MLX weights, the tokenizer, and the
-`params.json` configuration which will all be output by the `convert.py` script.
+`config.json` which will all be output by the `convert.py` script.
By default, the adapter weights are saved in `adapters.npz`. You can specify
the output location with `--adapter-file`.
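
For instance, a training run that writes the adapters to a custom file might look like the sketch below; the `--train` flag is an assumption (it is not shown in the truncated command above), and the output file name is just an example:

```
# sketch only: --train and the adapter file name are assumptions
python lora.py --model <path_to_mlx_model> \
   --train \
   --iters 600 \
   --adapter-file sql_adapters.npz
```
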
@@ -137,16 +143,20 @@ Note other keys will be ignored by the loader.
Fine-tuning a large model with LoRA requires a machine with a decent amount
of memory. Here are some tips to reduce memory use should you need to do so:
-1. Try using a smaller batch size with `--batch-size`. The default is `4` so
+1. Try quantization (QLoRA). You can use QLoRA by generating a quantized model
+   with `convert.py` and the `-q` flag. See the [Setup](#setup) section for
+   more details.
+2. Try using a smaller batch size with `--batch-size`. The default is `4` so
setting this to `2` or `1` will reduce memory consumption. This may slow
things down a little, but will also reduce the memory use.
-2. Reduce the number of layers to fine-tune with `--lora-layers`. The default
+3. Reduce the number of layers to fine-tune with `--lora-layers`. The default
is `16`, so you can try `8` or `4`. This reduces the amount of memory
needed for back propagation. It may also reduce the quality of the
fine-tuned model if you are fine-tuning with a lot of data.
-3. Longer examples require more memory. If it makes sense for your data, one thing
+4. Longer examples require more memory. If it makes sense for your data, one thing
you can do is break your examples into smaller
sequences when making the `{train, valid, test}.jsonl` files.
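
The tips above can be combined. A reduced-memory sketch, assuming a quantized model from tip 1 and the same (not shown here) `--train` flag:

```
# sketch: quantized model plus a smaller batch and fewer LoRA layers
python lora.py --model <path_to_quantized_mlx_model> \
   --train \
   --batch-size 1 \
   --lora-layers 4
```
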
@@ -164,6 +174,7 @@ The above command on an M1 Max with 32 GB runs at about 250 tokens-per-second.
[^lora]: Refer to the [arXiv paper](https://arxiv.org/abs/2106.09685) for more details on LoRA.
+[^qlora]: Refer to the paper [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
[^llama]: Refer to the [arXiv paper](https://arxiv.org/abs/2302.13971) and [blog post](https://ai.meta.com/blog/large-language-model-llama-meta-ai/) for more details.
[^mistral]: Refer to the [blog post](https://mistral.ai/news/announcing-mistral-7b/) and [github repository](https://github.com/mistralai/mistral-src) for more details.
[^wikisql]: Refer to the [GitHub repo](https://github.com/salesforce/WikiSQL/tree/master) for more information about WikiSQL.