Commit Graph

8 Commits

Author SHA1 Message Date
Chime Ogbuji
5865899c81
Completion only fine-tuning of instruction models with collections of HF datasets (#1103)
- Optional completion only fine-tuning with `--mask-prompt`
- Collections of Hugging Face datasets

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2025-02-09 20:12:34 -08:00
Victor Nogueira
df1406735b
Fix dataset variable name, in datasets.py (#1212) 2025-01-21 14:12:43 -08:00
Chime Ogbuji
0228c46434
Custom local dataset features (#1085)
* Generalize prompt_feature and completion_feature for use in local datasets to facilitate compatibility with many other training dataset formats.

* Persist configured prompt/completion key

* rebase + nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2025-01-13 10:01:18 -08:00
Awni Hannun
c4833a2f55
fix encoding with special tokens + chat template (#1189) 2025-01-03 10:50:59 -08:00
madroid
aa1c8abdc6
LoRA: Support HuggingFace dataset via data parameter (#996)
* LoRA: support huggingface dataset via `data` argument

* LoRA: Extract the load_custom_hf_dataset function

* LoRA: split small functions

* fix spelling errors

* handle load hf dataset error

* fix pre-commit lint

* update data argument help

* nits and doc

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-30 07:36:21 -07:00
madroid
7ec2021bb9
LoRA: support tools(function calling) format datasets (#995)
* LoRA: support fine-tuning tools datasets

* LoRA: Split small function

* LoRA: add tools format to lora docs

* LoRA: pre-commit fix

* Revert "LoRA: pre-commit fix"

This reverts commit b94b7e0fe7.

* Revert "LoRA: Split small function"

This reverts commit 3f6a5f19fd.

* LoRA: remove ToolsDataset

In a JSONL file, not all data is required to include the tools value.

* nit in readme

* nit in readme

* nit in readme

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-28 10:41:36 -07:00
Chime Ogbuji
df6bc09d74
Configuration-based use of HF hub-hosted datasets for training (#701)
* Add hf_dataset configuration for using HF hub-hosted datasets for (Q)LoRA training

* Pre-commit formatting

* Fix YAML config example

* Print DS info

* Include name

* Add hf_dataset parameter default

* Remove TextHFDataset and CompletionsHFDataset and use Dataset and CompletionsDataset instead, adding a text_key constructor argument to the former (and changing it to work with a provided data structure instead of just from a JSON file), and prompt_key and completion_key arguments to the latter with defaults for backwards compatibility.

* nits

* update docs

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-26 10:20:50 -07:00
madroid
b0bcd86a40
Support for OpenAI’s fine-tuning dataset format (#548)
* LoRA: move load_dataset to tuner/datasets.py file

* LoRA: support OpenAI chat format datasets

see https://platform.openai.com/docs/guides/fine-tuning/example-format

* LoRA: support OpenAI completion format datasets

* LoRA: formatting dataset timing to reduce memory footprint

* Refactor dataset item access in PromptCompletionDataset

* Update mlx_lm/LORA.md

* Update mlx_lm/LORA.md

* check Unsupported data format

* add tests, fine-tune doc

* add tests, fine-tune doc

* add jinja2 for chat template

* nits in readme

* nits in readme

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-03-19 16:45:46 -07:00