Fix bug in upload + docs nit (#981)

* fix bug in upload + docs nit

* nit
Awni Hannun 2024-09-07 14:46:57 -07:00 committed by GitHub
parent c3e3411756
commit 6c2369e4b9
2 changed files with 8 additions and 24 deletions

@@ -166,44 +166,28 @@ Currently, `*.jsonl` files support three data formats: `chat`,
 `chat`:
 ```jsonl
-{
-  "messages": [
-    {
-      "role": "system",
-      "content": "You are a helpful assistant."
-    },
-    {
-      "role": "user",
-      "content": "Hello."
-    },
-    {
-      "role": "assistant",
-      "content": "How can I assist you today?"
-    }
-  ]
-}
+{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello."}, {"role": "assistant", "content": "How can I assist you today?"}]}
 ```
 `completions`:
 ```jsonl
-{
-  "prompt": "What is the capital of France?",
-  "completion": "Paris."
-}
+{"prompt": "What is the capital of France?", "completion": "Paris."}
 ```
 `text`:
 ```jsonl
-{
-  "text": "This is an example for the model."
-}
+{"text": "This is an example for the model."}
 ```
 Note, the format is automatically determined by the dataset. Note also, keys in
 each line not expected by the loader will be ignored.
+> [!NOTE]
+> Each example in the datasets must be on a single line. Do not put more than
+> one example per line and do not split an example across multiple lines.
 ### Hugging Face Datasets
 To use Hugging Face datasets, first install the `datasets` package:
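
The format rules in the documentation hunk above (one JSON object per line, format inferred from the keys, unexpected keys ignored) can be sketched roughly as follows. This is a minimal illustration, not the project's loader: `detect_format` is a hypothetical helper, and the order in which keys are checked is an assumption, since the diff does not show the real detection logic.

```python
import json


def detect_format(line: str) -> str:
    """Guess which of the three documented formats one JSONL line uses.

    Hypothetical helper; the key precedence below is assumed.
    """
    example = json.loads(line)  # each line must be a complete JSON object
    if "messages" in example:
        return "chat"
    if "prompt" in example and "completion" in example:
        return "completions"
    if "text" in example:
        return "text"
    raise ValueError("unrecognized example format")


lines = [
    '{"messages": [{"role": "user", "content": "Hello."}]}',
    '{"prompt": "What is the capital of France?", "completion": "Paris."}',
    # The extra "source" key is simply not consulted by the check above.
    '{"text": "This is an example for the model.", "source": "extra"}',
]
formats = [detect_format(line) for line in lines]
print(formats)
```

Parsing line by line is also why a pretty-printed example cannot work in a `.jsonl` file: a line holding only `{` is not a complete JSON object, so `json.loads` would reject it.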

@@ -581,7 +581,7 @@ def upload_to_hub(path: str, upload_repo: str, hf_path: str):
 prompt="hello"
 if hasattr(tokenizer, "apply_chat_template") and tokenizer.chat_template is not None:
-    messages = [{"role": "user", "content": prompt}]
+    messages = [{{"role": "user", "content": prompt}}]
     prompt = tokenizer.apply_chat_template(
         messages, tokenize=False, add_generation_prompt=True
     )
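
The doubled braces make sense if this snippet sits inside a Python f-string, for example a generated usage example. The diff does not show the enclosing string, so that is an assumption. In an f-string, single braces open a replacement field, while `{{` and `}}` render as literal `{` and `}`. A minimal sketch, with `render_usage_snippet` as a hypothetical stand-in for the real template:

```python
def render_usage_snippet(upload_repo: str) -> str:
    """Render a small usage snippet from an f-string template.

    Hypothetical stand-in; not the project's actual template text.
    """
    # {upload_repo} is substituted; {{ and }} become literal { and }.
    return f"""repo = "{upload_repo}"
messages = [{{"role": "user", "content": prompt}}]
"""


snippet = render_usage_snippet("my-org/my-model")
print(snippet)
```

The rendered text contains `[{"role": "user", "content": prompt}]` with single braces, which is what a reader copying the snippet needs.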