custom data with lora

This commit is contained in:
Awni Hannun
2023-12-15 09:56:10 -08:00
parent a3ecda22fe
commit 985f413f99
7 changed files with 1356 additions and 25 deletions


@@ -6,7 +6,19 @@ task.
In this example we'll use the WikiSQL[^wikisql] dataset to train the LLM to
generate SQL queries from natural language. However, the example is intended to
be general should you wish to modify the task.
be general should you wish to use a custom dataset.
## Contents
* [Setup](#setup)
* [Run](#run)
* [Fine-tune](#fine-tune)
* [Evaluate](#evaluate)
* [Generate](#generate)
* [Results](#results)
* [Custom Data](#custom-data)
* [Memory Issues](#memory-issues)
## Setup
@@ -57,6 +69,9 @@ Note, the model path should have the MLX weights, the tokenizer, and the
By default, the adapter weights are saved in `adapters.npz`. You can specify
the output location with `--adapter_file`.
You can resume fine-tuning with an existing adapter with
`--resume_adapter_file` to specify the location of the adapter weights.
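For example, a sketch of saving to and resuming from a custom adapter file (the model path is a placeholder, and the `--train` flag follows the fine-tuning instructions elsewhere in this README):
```
# Fine-tune and save the adapter weights to a custom file
python lora.py --model <path_to_model> \
               --train \
               --adapter_file my_adapters.npz

# Resume fine-tuning from the saved adapter weights
python lora.py --model <path_to_model> \
               --train \
               --resume_adapter_file my_adapters.npz \
               --adapter_file my_adapters.npz
```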
#### Evaluate
To compute test set perplexity, use the `--test` flag.
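A plausible invocation looks like the following (a sketch; the model path is a placeholder, and `adapters.npz` is the default adapter file mentioned above):
```
python lora.py --model <path_to_model> \
               --adapter_file adapters.npz \
               --test
```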
@@ -98,6 +113,41 @@ training and validation loss at a few points over the course of training.
The model trains at around 475 tokens per second on an M2 Ultra.
## Custom Data
You can make your own dataset for fine-tuning with LoRA. Specify the
dataset with `--data=<my_data_directory>`, and check the subdirectory `data/`
to see the expected format.
For fine-tuning, the data loader expects a `train.jsonl` and a `valid.jsonl` to
be in the data directory. For evaluation (`--test`), the data loader expects a
`test.jsonl` in the directory. Each line in the `*.jsonl` file should look
like:
```
{"text": "This is an example for the model."}
```
Note that other keys will be ignored by the loader.
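For example, assuming a directory `my_data/` containing `train.jsonl` and `valid.jsonl` in the format above, a fine-tuning run on the custom dataset might look like (the model path is a placeholder):
```
python lora.py --model <path_to_model> \
               --train \
               --data my_data
```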
## Memory Issues
Fine-tuning a large model with LoRA requires a machine with a decent amount
of memory. Here are some tips to reduce memory use should you need to do so:
1. Try using a smaller batch size with `--batch-size`. The default is `4`, so
   setting this to `2` or `1` will reduce memory consumption at the cost of
   slightly slower training.
2. Reduce the number of layers to fine-tune with `--lora-layers`. The default
   is `16`, so you can try `8` or `4`. This reduces the amount of memory
   needed for back-propagation, but it may also reduce the quality of the
   fine-tuned model if you are fine-tuning with a lot of data. An example
   combining this with a smaller batch size is shown after this list.
3. Longer examples require more memory. If it makes sense for your data,
   break long examples into smaller sequences when making the
   `{train, valid, test}.jsonl` files.
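For example, a memory-conscious run combining the first two tips might look like (a sketch; the model and data paths are placeholders):
```
python lora.py --model <path_to_model> \
               --train \
               --data my_data \
               --batch-size 1 \
               --lora-layers 4
```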
[^lora]: Refer to the [arXiv paper](https://arxiv.org/abs/2106.09685) for more details on LoRA.
[^llama]: Refer to the [arXiv paper](https://arxiv.org/abs/2302.13971) and [blog post](https://ai.meta.com/blog/large-language-model-llama-meta-ai/) for more details.
[^mistral]: Refer to the [blog post](https://mistral.ai/news/announcing-mistral-7b/) and [GitHub repository](https://github.com/mistralai/mistral-src) for more details.