custom data with lora

This commit is contained in:
Awni Hannun
2023-12-15 09:56:10 -08:00
parent a3ecda22fe
commit 985f413f99
7 changed files with 1356 additions and 25 deletions


@@ -6,7 +6,19 @@ task.
In this example we'll use the WikiSQL[^wikisql] dataset to train the LLM to
generate SQL queries from natural language. However, the example is intended to
be general should you wish to modify the task.
be general should you wish to use a custom dataset.
## Contents
* [Setup](#setup)
* [Run](#run)
* [Fine-tune](#fine-tune)
* [Evaluate](#evaluate)
* [Generate](#generate)
* [Results](#results)
* [Custom Data](#custom-data)
* [Memory Issues](#memory-issues)
## Setup
@@ -57,6 +69,9 @@ Note, the model path should have the MLX weights, the tokenizer, and the
By default, the adapter weights are saved in `adapters.npz`. You can specify
the output location with `--adapter_file`.
You can resume fine-tuning with an existing adapter with
`--resume_adapter_file` to specify the location of the adapter weights.
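For example, a sketch of saving to and resuming from a custom adapter file (the model path is a placeholder, and the `--train` flag follows the fine-tuning instructions elsewhere in this README):
```
# Fine-tune and save the adapter weights to a custom file
python lora.py --model <path_to_model> \
               --train \
               --adapter_file my_adapters.npz

# Resume fine-tuning from the saved adapter weights
python lora.py --model <path_to_model> \
               --train \
               --resume_adapter_file my_adapters.npz \
               --adapter_file my_adapters.npz
```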
#### Evaluate
To compute test set perplexity, use the `--test` flag.
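A plausible invocation looks like the following (a sketch; the model path is a placeholder, and `adapters.npz` is the default adapter file mentioned above):
```
python lora.py --model <path_to_model> \
               --adapter_file adapters.npz \
               --test
```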
@@ -98,6 +113,41 @@ training and validation loss at a few points over the course of training.
The model trains at around 475 tokens per second on an M2 Ultra.
## Custom Data
You can make your own dataset for fine-tuning with LoRA. Specify the
dataset with `--data=<my_data_directory>`, and check the subdirectory `data/`
to see the expected format.
For fine-tuning, the data loader expects a `train.jsonl` and a `valid.jsonl` to
be in the data directory. For evaluation (`--test`), the data loader expects a
`test.jsonl` in the directory. Each line in the `*.jsonl` file should look
like:
```
{"text": "This is an example for the model."}
```
Note that other keys will be ignored by the loader.
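For example, assuming a directory `my_data/` containing `train.jsonl` and `valid.jsonl` in the format above, a fine-tuning run on the custom dataset might look like (the model path is a placeholder):
```
python lora.py --model <path_to_model> \
               --train \
               --data my_data
```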
## Memory Issues
Fine-tuning a large model with LoRA requires a machine with a decent amount
of memory. Here are some tips to reduce memory use should you need to do so:
1. Try using a smaller batch size with `--batch-size`. The default is `4`, so
   setting this to `2` or `1` will reduce memory consumption at the cost of
   slightly slower training.
2. Reduce the number of layers to fine-tune with `--lora-layers`. The default
   is `16`, so you can try `8` or `4`. This reduces the amount of memory
   needed for back-propagation, but it may also reduce the quality of the
   fine-tuned model if you are fine-tuning with a lot of data. An example
   combining this with a smaller batch size is shown after this list.
3. Longer examples require more memory. If it makes sense for your data,
   break long examples into smaller sequences when making the
   `{train, valid, test}.jsonl` files.
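For example, a memory-conscious run combining the first two tips might look like (a sketch; the model and data paths are placeholders):
```
python lora.py --model <path_to_model> \
               --train \
               --data my_data \
               --batch-size 1 \
               --lora-layers 4
```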
[^lora]: Refer to the [arXiv paper](https://arxiv.org/abs/2106.09685) for more details on LoRA.
[^llama]: Refer to the [arXiv paper](https://arxiv.org/abs/2302.13971) and [blog post](https://ai.meta.com/blog/large-language-model-llama-meta-ai/) for more details.
[^mistral]: Refer to the [blog post](https://mistral.ai/news/announcing-mistral-7b/) and [GitHub repository](https://github.com/mistralai/mistral-src) for more details.