Configuration-based use of HF hub-hosted datasets for training (#701)

* Add hf_dataset configuration for using HF hub-hosted datasets for (Q)LoRA training
* Pre-commit formatting
* Fix YAML config example
* Print DS info
* Include name
* Add hf_dataset parameter default
* Remove TextHFDataset and CompletionsHFDataset and use Dataset and CompletionsDataset instead, adding a text_key constructor argument to the former (and changing it to work with a provided data structure instead of just from a JSON file), and prompt_key and completion_key arguments to the latter with defaults for backwards compatibility.
* nits
* update docs

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2025-10-24 06:28:07 +08:00 · 2024-06-26 13:20:50 -04:00
parent 1d701a1831
commit df6bc09d74
7 changed files with 140 additions and 28 deletions
--- a/llms/mlx_lm/examples/lora_config.yaml
+++ b/llms/mlx_lm/examples/lora_config.yaml
@@ -69,3 +69,11 @@ lora_parameters:
 #  warmup: 100 # 0 for no warmup
 #  warmup_init: 1e-7 # 0 if not specified
 #  arguments: [1e-5, 1000, 1e-7] # passed to scheduler
+
+#hf_dataset:
+#  name: "billsum"
+#  train_split: "train[:1000]"
+#  valid_split: "train[-100:]"
+#  prompt_feature: "text"
+#  completion_feature: "summary"
+
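
For reviewers, here is a rough sketch of how the `prompt_feature` and `completion_feature` keys in the commented-out example could map hub records onto prompt/completion pairs. It is not the code from this commit: `to_prompt_completion` and the in-memory `sample` record are hypothetical, and the sketch avoids any network access. In real use the records would presumably come from something like `datasets.load_dataset("billsum", split="train[:1000]")`.

```python
from typing import Any, Dict, Iterable, List

# Mirrors the commented-out hf_dataset block in lora_config.yaml.
hf_dataset: Dict[str, str] = {
    "name": "billsum",
    "train_split": "train[:1000]",
    "valid_split": "train[-100:]",
    "prompt_feature": "text",
    "completion_feature": "summary",
}


def to_prompt_completion(
    records: Iterable[Dict[str, Any]],
    prompt_key: str,
    completion_key: str,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep only the two configured features per record."""
    return [
        {"prompt": record[prompt_key], "completion": record[completion_key]}
        for record in records
    ]


# Stand-in for one hub record (assumption: the real dataset has more fields).
sample = [{"text": "A bill to ...", "summary": "Short summary.", "title": "x"}]

pairs = to_prompt_completion(
    sample,
    hf_dataset["prompt_feature"],
    hf_dataset["completion_feature"],
)
print(pairs)
```

Fields not named in the config (here `title`) are simply dropped, which matches the intent of selecting one prompt column and one completion column.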