
Fine-Tuning with LoRA or QLoRA
You can use the mlx-lm package to fine-tune an LLM with low rank
adaptation (LoRA) for a target task [1]. The example also supports quantized
LoRA (QLoRA) [2]. LoRA fine-tuning works with the following model families:
- Mistral
- Llama
- Phi2
- Mixtral
- Qwen2
- Gemma
- OLMo
- MiniCPM
- InternLM2
Run
The main command is mlx_lm.lora. To see a full list of command-line options run:
mlx_lm.lora --help
Note, in the following the --model argument can be any compatible Hugging
Face repo or a local path to a converted model.
You can also specify a YAML config with -c/--config. For more on the format see the
example YAML. For example:
mlx_lm.lora --config /path/to/config.yaml
If command-line flags are also used, they will override the corresponding values in the config.
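The precedence rule above can be illustrated with a minimal sketch. This is not the mlx-lm implementation: the config keys, the merge_config helper, and the convention that an unset flag is None are all assumptions made for illustration.

```python
import argparse

# Hypothetical values, as if they had been loaded from a YAML config file.
yaml_config = {"model": "mistralai/Mistral-7B-v0.1", "iters": 1000, "batch_size": 4}


def merge_config(yaml_config, cli_args):
    """Return a copy of the config where explicitly passed flags take precedence."""
    merged = dict(yaml_config)
    for key, value in cli_args.items():
        if value is not None:  # None means the flag was not given on the command line
            merged[key] = value
    return merged


parser = argparse.ArgumentParser()
parser.add_argument("--iters", type=int, default=None)
parser.add_argument("--batch-size", type=int, default=None)

# Simulate running with "--iters 600" alongside the config file.
cli_args = vars(parser.parse_args(["--iters", "600"]))
merged = merge_config(yaml_config, cli_args)
print(merged)
```

Here iters comes from the command line, while model and batch_size keep their config values.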
Fine-tune
To fine-tune a model use:
mlx_lm.lora \
--model <path_to_model> \
--train \
--data <path_to_data> \
--iters 600
The --data argument must specify a path to a directory containing train.jsonl
and valid.jsonl when using --train, and test.jsonl when using --test. For more
details on the data format see the section on Data.
For example, to fine-tune a Mistral 7B you can use --model mistralai/Mistral-7B-v0.1.
If --model points to a quantized model, then the training will use QLoRA,
otherwise it will use regular LoRA.
By default, the adapter config and weights are saved in adapters/. You can
specify the output location with --adapter-path.
You can resume fine-tuning with an existing adapter with
--resume-adapter-file <path_to_adapters.safetensors>.
Evaluate
To compute test set perplexity use:
mlx_lm.lora \
--model <path_to_model> \
--adapter-path <path_to_adapters> \
--data <path_to_data> \
--test
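For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that calculation on made-up log-probabilities. It does not reproduce how mlx_lm.lora batches or reports the test loss.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Made-up per-token log-probabilities (natural log), for illustration only.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = perplexity(log_probs)
print(f"perplexity = {ppl:.3f}")
```

Lower perplexity means the model assigns higher probability to the test tokens.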
Generate
For generation use mlx_lm.generate:
mlx_lm.generate \
--model <path_to_model> \
--adapter-path <path_to_adapters> \
--prompt "<your_model_prompt>"
Fuse
You can generate a model fused with the low-rank adapters using the
mlx_lm.fuse command. This command also allows you to optionally:
- Upload the fused model to the Hugging Face Hub.
- Export the fused model to GGUF. Note GGUF support is limited to Mistral, Mixtral, and Llama style models in fp16 precision.
To see supported options run:
mlx_lm.fuse --help
To generate the fused model run:
mlx_lm.fuse --model <path_to_model>
This will by default load the adapters from adapters/, and save the fused
model in the path lora_fused_model/. All of these are configurable.
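Conceptually, fusing folds the low-rank update into the base weight so that no separate adapter is needed at inference time. The following is a minimal sketch of that idea with tiny pure-Python matrices. The names W, A, B, and scale, and the helper functions, are illustrative assumptions and do not mirror the internals of mlx_lm.fuse.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(a, b, scale=1.0):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # base weight, 3x3
B = [[1.0], [0.0], [2.0]]                                # low-rank factor, 3x1
A = [[0.5, 1.0, 0.0]]                                    # low-rank factor, 1x3
scale = 2.0                                              # assumed adapter scale
x = [[1.0], [2.0], [3.0]]                                # input column vector

# Unfused: base output plus the scaled adapter path, W x + scale * B (A x).
unfused = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Fused: fold the update into the weight once, then apply a single matmul.
W_fused = add_scaled(W, matmul(B, A), scale)
fused = matmul(W_fused, x)

print(unfused, fused)
```

Both paths give the same output, which is why a fused model can be used without the adapter files.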
To upload a fused model, supply the --upload-repo and --hf-path arguments
to mlx_lm.fuse. The latter is the repo name of the original model, which is
useful for the sake of attribution and model versioning.
For example, to fuse and upload a model derived from Mistral-7B-v0.1, run:
mlx_lm.fuse \
--model mistralai/Mistral-7B-v0.1 \
--upload-repo mlx-community/my-lora-mistral-7b \
--hf-path mistralai/Mistral-7B-v0.1
To export a fused model to GGUF, run:
mlx_lm.fuse \
--model mistralai/Mistral-7B-v0.1 \
--export-gguf
This will save the GGUF model in lora_fused_model/ggml-model-f16.gguf. You
can specify the file name with --gguf-path.
Data
The LoRA command expects you to provide a dataset with --data. The MLX
Examples GitHub repo has an example of the WikiSQL data in the
correct format.
Datasets can be specified in *.jsonl files locally or loaded from Hugging
Face.
Local Datasets
For fine-tuning (--train), the data loader expects a train.jsonl and a
valid.jsonl to be in the data directory. For evaluation (--test), the data
loader expects a test.jsonl in the data directory.
Currently, *.jsonl files support three data formats: chat,
completions, and text. Here are three examples of these formats:
chat:
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello."
    },
    {
      "role": "assistant",
      "content": "How can I assist you today?"
    }
  ]
}
completions:
{
  "prompt": "What is the capital of France?",
  "completion": "Paris."
}
text:
{
  "text": "This is an example for the model."
}
Note, the format is automatically determined by the dataset. Note also, keys in each line not expected by the loader will be ignored.
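The examples above are spread over several lines for readability, but a JSON Lines file stores one complete JSON object per line. The sketch below writes a small completions dataset in that form and shows one way the format could be inferred from the keys of a record. The write_jsonl and detect_format helpers are hypothetical and are not part of mlx-lm, and the temporary directory stands in for your data directory.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one compact JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def detect_format(record):
    """Guess the data format from the keys of one record (illustrative only)."""
    if "messages" in record:
        return "chat"
    if "prompt" in record and "completion" in record:
        return "completions"
    if "text" in record:
        return "text"
    raise ValueError("Unsupported data format")


completions = [
    {"prompt": "What is the capital of France?", "completion": "Paris."},
    {"prompt": "What is the capital of Italy?", "completion": "Rome."},
]

data_dir = Path(tempfile.mkdtemp())  # stand-in for the directory passed to --data
write_jsonl(data_dir / "train.jsonl", completions)

with open(data_dir / "train.jsonl", encoding="utf-8") as f:
    lines = f.read().splitlines()
first = json.loads(lines[0])
print(len(lines), detect_format(first))
```

The same helper can be used to produce valid.jsonl and test.jsonl from held-out records.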
Hugging Face Datasets
To use Hugging Face datasets, first install the datasets package:
pip install datasets
Specify the Hugging Face dataset arguments in a YAML config. For example:
hf_dataset:
  name: "billsum"
  prompt_feature: "text"
  completion_feature: "summary"
- Use prompt_feature and completion_feature to specify keys for a completions dataset. Use text_feature to specify the key for a text dataset.
- To specify the train, valid, or test splits, set the corresponding {train,valid,test}_split argument.
- Arguments specified in config will be passed as keyword arguments to datasets.load_dataset.
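To make the role of prompt_feature and completion_feature concrete, here is a minimal sketch of how dataset columns could be mapped onto the completions format. The rows are made up, and to_completions is a hypothetical helper rather than the mlx-lm loader. In real use the rows would come from the Hugging Face dataset instead of a literal list.

```python
def to_completions(rows, prompt_feature, completion_feature):
    """Map arbitrary dataset columns onto prompt/completion pairs."""
    return [
        {"prompt": row[prompt_feature], "completion": row[completion_feature]}
        for row in rows
    ]


# Made-up rows with the column names used in the config example above.
rows = [
    {"text": "A bill to amend section 1.", "summary": "Amends section 1.", "title": "Bill A"},
    {"text": "A bill to repeal section 2.", "summary": "Repeals section 2.", "title": "Bill B"},
]

pairs = to_completions(rows, prompt_feature="text", completion_feature="summary")
print(pairs[0])
```

Columns that are not named by either setting, such as title here, are simply left out of the result.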
In general, for the chat and completions formats, Hugging Face chat
templates are used. This applies the model's chat template by default. If the
model does not have a chat template, then Hugging Face will use a default. For
example, the final text in the chat example above with Hugging Face's default
template becomes:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello.<|im_end|>
<|im_start|>assistant
How can I assist you today?<|im_end|>
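The rendered text above follows a simple pattern: each message is wrapped in <|im_start|> and <|im_end|> markers with the role on the first line. The sketch below reproduces that pattern in plain Python for illustration. The render_chatml function is a hypothetical stand-in that assumes a newline after each closing marker, and it does not call the tokenizer's actual chat template.

```python
def render_chatml(messages):
    """Render messages with <|im_start|>/<|im_end|> markers, one block per message."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello."},
    {"role": "assistant", "content": "How can I assist you today?"},
]

rendered = render_chatml(messages)
print(rendered)
```

A model that ships its own chat template may use different markers, so treat this only as a picture of the default shown above.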
If you are unsure of the format to use, the chat or completions formats are
good to start with. For custom requirements on the format of the dataset, use
the text format to assemble the content yourself.
Memory Issues
Fine-tuning a large model with LoRA requires a machine with a decent amount of memory. Here are some tips to reduce memory use should you need to do so:
- Try quantization (QLoRA). You can use QLoRA by generating a quantized model with convert.py and the -q flag. See the Setup section for more details.
- Try using a smaller batch size with --batch-size. The default is 4, so setting this to 2 or 1 will reduce memory consumption, though it may slow things down a little.
- Reduce the number of layers to fine-tune with --lora-layers. The default is 16, so you can try 8 or 4. This reduces the amount of memory needed for back propagation. It may also reduce the quality of the fine-tuned model if you are fine-tuning with a lot of data.
- Longer examples require more memory. If it makes sense for your data, one thing you can do is break your examples into smaller sequences when making the {train, valid, test}.jsonl files.
- Gradient checkpointing lets you trade off memory use (less) for computation (more) by recomputing instead of storing intermediate values needed by the backward pass. You can use gradient checkpointing by passing the --grad-checkpoint flag. Gradient checkpointing will be more helpful for larger batch sizes or sequence lengths with smaller or quantized models.
For example, for a machine with 32 GB the following should run reasonably fast:
mlx_lm.lora \
--model mistralai/Mistral-7B-v0.1 \
--train \
--batch-size 1 \
--lora-layers 4 \
--data wikisql
The above command on an M1 Max with 32 GB runs at about 250
tokens-per-second, using the MLX Example wikisql data set.
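One of the tips above, breaking long examples into shorter sequences, can be sketched as follows. This assumes the text data format and uses whitespace-separated words as a rough proxy for tokens. The split_into_chunks helper and the max_words limit are hypothetical, and a real limit would be better measured with the model's tokenizer.

```python
import json


def split_into_chunks(text, max_words):
    """Split text into consecutive pieces of at most max_words words each."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words]) for i in range(0, len(words), max_words)
    ]


long_example = "one two three four five six seven eight nine ten"
chunks = split_into_chunks(long_example, max_words=4)

# Each chunk becomes its own record in the text data format.
records = [{"text": chunk} for chunk in chunks]
for record in records:
    print(json.dumps(record))
```

Splitting on sentence or paragraph boundaries would usually preserve more context than a fixed word count, if your data allows it.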
[1] Refer to the arXiv paper for more details on LoRA.
[2] Refer to the paper QLoRA: Efficient Finetuning of Quantized LLMs.