Configuration-based use of HF hub-hosted datasets for training (#701)

* Add hf_dataset configuration for using HF hub-hosted datasets for (Q)LoRA training

* Pre-commit formatting

* Fix YAML config example

* Print DS info

* Include name

* Add hf_dataset parameter default

* Remove TextHFDataset and CompletionsHFDataset and use Dataset and CompletionsDataset instead, adding a text_key constructor argument to the former (and changing it to work with a provided data structure instead of just from a JSON file), and prompt_key and completion_key arguments to the latter with defaults for backwards compatibility.

* nits

* update docs

---------

Co-authored-by: Awni Hannun <awni@apple.com>
This commit is contained in:
Chime Ogbuji
2024-06-26 13:20:50 -04:00
committed by GitHub
parent 1d701a1831
commit df6bc09d74
7 changed files with 140 additions and 28 deletions

View File

@@ -32,7 +32,7 @@ jobs:
pip install --upgrade pip
pip install unittest-xml-reporting
cd llms/
pip install -e .
pip install -e ".[testing]"
- run:
name: Run Python tests
command: |