Custom local dataset features (#1085)

* Generalize prompt_feature and completion_feature for use in local datasets to facilitate compatibility with many other training dataset formats.
* Persist configured prompt/completion key
* rebase + nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2025-12-16 02:08:55 +08:00 · 2025-01-13 13:01:18 -05:00
parent bf2da36fc6
commit 0228c46434
2 changed files with 56 additions and 16 deletions
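
As a rough illustration of the behavior this commit documents, the sketch below reads `completions`-style JSONL records using configurable key names. It is not the actual `mlx_lm` loader: the function `load_completion_pairs` and its signature are hypothetical, and only the names `prompt_feature` and `completion_feature` and their defaults (`"prompt"`, `"completion"`) come from the change itself.

```python
import json


def load_completion_pairs(lines, prompt_feature="prompt", completion_feature="completion"):
    """Hypothetical helper: extract (prompt, completion) pairs from JSONL lines.

    The key names are configurable, mirroring the `prompt_feature` and
    `completion_feature` config options; any other keys in a record are ignored.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines; each example must sit on a single line.
            continue
        record = json.loads(line)
        pairs.append((record[prompt_feature], record[completion_feature]))
    return pairs


# A record that uses "input"/"output" instead of the default keys,
# plus an extra key ("source") that the loader does not expect.
custom_lines = [
    '{"input": "What is the capital of France?", "output": "Paris.", "source": "quiz"}'
]
custom_pairs = load_completion_pairs(
    custom_lines, prompt_feature="input", completion_feature="output"
)

# A record that uses the default keys.
default_lines = ['{"prompt": "What is the capital of France?", "completion": "Paris."}']
default_pairs = load_completion_pairs(default_lines)
```

Both calls should produce the same pair, which is the point of the change: only the key names differ between dataset formats, not the resulting training example.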
--- a/llms/mlx_lm/LORA.md
+++ b/llms/mlx_lm/LORA.md
@@ -241,14 +241,25 @@ Refer to the documentation for the model you are fine-tuning for more details.
 {"prompt": "What is the capital of France?", "completion": "Paris."}
 ```

+For the `completions` data format, a different key can be used for the prompt
+and completion by specifying the following in the YAML config:
+
+```yaml
+prompt_feature: "input"
+completion_feature: "output"
+```
+
+Here, `"input"` is the expected key instead of the default `"prompt"`, and
+`"output"` is the expected key instead of `"completion"`. 
+
 `text`:

 ```jsonl
 {"text": "This is an example for the model."}
 ```

-Note, the format is automatically determined by the dataset. Note also, keys in
-each line not expected by the loader will be ignored.
+Note, the format is automatically determined by the dataset. Note also, keys
+in each line not expected by the loader will be ignored.

 > [!NOTE]
 > Each example in the datasets must be on a single line. Do not put more than
@@ -270,7 +281,7 @@ Otherwise, provide a mapping of keys in the dataset to the features MLX LM
 expects. Use a YAML config to specify the Hugging Face dataset arguments. For
 example:

-```
+```yaml
 hf_dataset:
  name: "billsum"
  prompt_feature: "text"