Commit Graph

23 Commits

Author SHA1 Message Date
Awni Hannun
eda597bdef simplify 2025-02-09 19:37:11 -08:00
Awni Hannun
bb2c8bcf96 more nits 2025-02-09 18:00:17 -08:00
Awni Hannun
6e9542a934 put offset in prompt, simplify 2025-02-09 17:31:23 -08:00
Awni Hannun
6ace6dc6b2 simplify collections 2025-02-09 08:33:42 -08:00
Chime Ogbuji
b9748e9ee4 Generalize the get_item method to all CompletionDatasets 2025-02-09 07:44:17 -08:00
Chime Ogbuji
7989d0a874 Move response template to LoRA configuration 2025-02-09 07:43:37 -08:00
Chime Ogbuji
cb87f6f22c Add response template (or token) argument
Used to calculate a mask covering everything up to and including the response prompt, so the loss is computed only on what follows (the continuation/completion)
2025-02-09 07:43:01 -08:00
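To make the intent of the response-template mask concrete, here is a minimal sketch in Python. The function name, the token-level template matching, and the 0/1 mask convention are illustrative assumptions, not the code added in this commit.

```python
from typing import List


def completion_mask(tokens: List[int], response_template: List[int]) -> List[int]:
    """Return a 0/1 mask: 0 through the end of the response template,
    1 for every token after it (the continuation/completion)."""
    n, m = len(tokens), len(response_template)
    boundary = 0
    for i in range(n - m + 1):
        if tokens[i : i + m] == response_template:
            boundary = i + m  # mask everything up to the end of the template
            break
    return [0] * boundary + [1] * (n - boundary)


# Example: template tokens [42, 43] mark where the assistant response begins.
print(completion_mask([1, 2, 42, 43, 7, 8, 9], [42, 43]))  # [0, 0, 0, 0, 1, 1, 1]
```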
Chime Ogbuji
f989401881 Default for hf_datasets configuration 2025-02-09 07:41:24 -08:00
Chime Ogbuji
69282ab7fc Minor fix 2025-02-09 07:41:24 -08:00
Chime Ogbuji
4890870053 Add ability to fetch raw prompt and completion text from completion datasets 2025-02-09 07:41:23 -08:00
Chime Ogbuji
a5b866cf73 Fix index calculation 2025-02-09 07:41:01 -08:00
Chime Ogbuji
a4a86ad898 Fix iteration over HF dataset collection 2025-02-09 07:41:01 -08:00
Chime Ogbuji
78c33e5037 Fix keyword argument invocation 2025-02-09 07:41:00 -08:00
Chime Ogbuji
387c45efa2 Fixes to references to hf_datasets 2025-02-09 07:40:09 -08:00
Chime Ogbuji
14a75f3f03 Generalize HF datasets to a collection of HF datasets via datasets, add support for custom chat HF datasets (#1088), and fix (#1087) 2025-02-09 07:38:40 -08:00
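A rough idea of what a configuration with a collection of HF datasets might load into. Every key and dataset name below (datasets, hf_dataset, name, prompt_feature, completion_feature) is an assumption inferred from this and the surrounding commits, not taken from the branch itself.

```python
# Hypothetical config shape for a collection of HF datasets (assumed keys),
# expressed as the Python dict a YAML config would load into.
config = {
    "datasets": [
        {"hf_dataset": {"name": "org/dataset-a",
                        "prompt_feature": "question",
                        "completion_feature": "answer"}},
        {"hf_dataset": {"name": "org/dataset-b",
                        "prompt_feature": "instruction",
                        "completion_feature": "output"}},
    ],
}
```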
Chime Ogbuji
79a042768f Replace iterate_input_masked_batches with iterate_delineated_batches, an updated attempt to better sync with iterate_batches logic 2025-02-09 07:12:54 -08:00
Victor Nogueira
df1406735b Fix dataset variable name in datasets.py (#1212) 2025-01-21 14:12:43 -08:00
Chime Ogbuji
0228c46434 Custom local dataset features (#1085)
* Generalize prompt_feature and completion_feature for use in local datasets to facilitate compatibility with many other training dataset formats.

* Persist configured prompt/completion key

* rebase + nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2025-01-13 10:01:18 -08:00
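A hedged sketch of the idea behind generalizing prompt_feature and completion_feature: a local JSONL file keeps its own field names, and the configured keys map each record onto a prompt/completion pair. The loader below is illustrative only and is not the mlx-lm implementation.

```python
import json

# Illustrative only: read local JSONL records whose prompt/completion live
# under user-configured keys (the prompt_feature/completion_feature idea).
def load_completions(path, prompt_feature="prompt", completion_feature="completion"):
    pairs = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            pairs.append((record[prompt_feature], record[completion_feature]))
    return pairs

# e.g. a dataset stored as {"question": ..., "answer": ...} per line:
# load_completions("train.jsonl", prompt_feature="question", completion_feature="answer")
```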
Awni Hannun
c4833a2f55 fix encoding with special tokens + chat template (#1189) 2025-01-03 10:50:59 -08:00
madroid
aa1c8abdc6 LoRA: Support HuggingFace dataset via data parameter (#996)
* LoRA: support huggingface dataset via `data` argument

* LoRA: Extract the load_custom_hf_dataset function

* LoRA: split small functions

* fix spelling errors

* handle load hf dataset error

* fix pre-commit lint

* update data argument help

* nits and doc

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-30 07:36:21 -07:00
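A hedged sketch of the mechanism the PR title describes: when the data argument does not point at a local directory, treat it as a Hugging Face Hub dataset id and load it with the datasets library. The function name, return shape, and the placeholder dataset id are assumptions for illustration, not the mlx-lm implementation.

```python
from pathlib import Path
from datasets import load_dataset  # Hugging Face `datasets` package

# Illustrative sketch only: fall back to the HF Hub when --data is not a
# local directory containing {train, valid, test}.jsonl files.
def resolve_data(data: str):
    if Path(data).is_dir():
        return {"source": "local", "path": Path(data)}
    return {"source": "hub", "dataset": load_dataset(data)}  # e.g. "<hf_dataset_name>"
```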
madroid
7ec2021bb9 LoRA: support tools (function calling) format datasets (#995)
* LoRA: support fine-tuning tools datasets

* LoRA: Split small function

* LoRA: add tools format to lora docs

* LoRA: pre-commit fix

* Revert "LoRA: pre-commit fix"

This reverts commit b94b7e0fe7.

* Revert "LoRA: Split small function"

This reverts commit 3f6a5f19fd.

* LoRA: remove ToolsDataset

In a JSONL file, not every record is required to include the tools value.

* nit in readme

* nit in readme

* nit in readme

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-28 10:41:36 -07:00
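For context, a hedged example of what one line of a tools-format JSONL file might look like, written as the Python dict it deserializes to. The field layout follows the OpenAI function-calling schema these commits reference, and the specific function (get_weather) is purely illustrative; per the "remove ToolsDataset" note above, individual records may omit the tools field.

```python
import json

record = {
    "messages": [
        {"role": "user", "content": "What is the weather in Paris?"},
        {"role": "assistant", "tool_calls": [
            {"type": "function",
             "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}},
        ]},
    ],
    # Optional: not every record has to define tools.
    "tools": [
        {"type": "function",
         "function": {"name": "get_weather",
                      "description": "Look up the current weather for a city",
                      "parameters": {"type": "object",
                                     "properties": {"city": {"type": "string"}},
                                     "required": ["city"]}}},
    ],
}
print(json.dumps(record))  # one JSON object per line in train.jsonl
```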
Chime Ogbuji
df6bc09d74 Configuration-based use of HF hub-hosted datasets for training (#701)
* Add hf_dataset configuration for using HF hub-hosted datasets for (Q)LoRA training

* Pre-commit formatting

* Fix YAML config example

* Print DS info

* Include name

* Add hf_dataset parameter default

* Remove TextHFDataset and CompletionsHFDataset in favor of Dataset and CompletionsDataset: add a text_key constructor argument to the former (and change it to accept a provided data structure instead of only a JSON file), and add prompt_key and completion_key arguments to the latter, with defaults for backwards compatibility.

* nits

* update docs

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-26 10:20:50 -07:00
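A hedged sketch of an hf_dataset configuration entry, shown as the Python dict a YAML config would load into. The dataset id, split values, and exact key names are assumptions based on this commit and its follow-ups, not a verbatim copy of the documented configuration.

```python
# Hypothetical hf_dataset entry in a LoRA training config (assumed keys).
config = {
    "hf_dataset": {
        "name": "billsum",               # HF Hub dataset id (illustrative)
        "train_split": "train[:1000]",
        "valid_split": "train[-100:]",
        "prompt_feature": "text",        # column used as the prompt
        "completion_feature": "summary", # column used as the completion
    },
}
```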
madroid
b0bcd86a40 Support for OpenAI’s fine-tuning dataset format (#548)
* LoRA: move load_dataset to tuner/datasets.py file

* LoRA: support OpenAI chat format datasets

see https://platform.openai.com/docs/guides/fine-tuning/example-format

* LoRA: support OpenAI completion format datasets

* LoRA: adjust the timing of dataset formatting to reduce memory footprint

* Refactor dataset item access in PromptCompletionDataset

* Update mlx_lm/LORA.md

* Update mlx_lm/LORA.md

* check for unsupported data formats

* add tests, fine-tune doc

* add tests, fine-tune doc

* add jinja2 for chat template

* nits in readme

* nits in readme

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-03-19 16:45:46 -07:00
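A hedged illustration of the two record shapes these commits add support for (OpenAI-style chat records and prompt/completion records), written as Python dicts that would be serialized one JSON object per line; the example contents are made up.

```python
import json

chat_record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello."},
        {"role": "assistant", "content": "Hi! How can I help?"},
    ]
}
completion_record = {
    "prompt": "What is the capital of France?",
    "completion": "Paris.",
}
print(json.dumps(chat_record))        # one line of a chat-format train.jsonl
print(json.dumps(completion_record))  # one line of a completions-format train.jsonl
```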