mlx_lm: Add Streaming Capability to Generate Function (#807)

* Add streaming feature to text generation function

* Separate streaming and regular generation functions

---------

Co-authored-by: Awni Hannun <awni@apple.com>
Author: Michał Kurc
Date: 2024-06-03 18:04:39 +02:00
Committed by: GitHub
Parent: 8353bbbf93
Commit: 43d6deb3c1

4 changed files with 89 additions and 37 deletions
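
The second bullet in the commit message describes the shape of the change: a generator-based `stream_generate`, with the regular `generate` built on top of it. Below is a minimal sketch of that split, not the actual mlx_lm implementation; the `next_tokens` helper is a hypothetical stand-in for the library's internal sampling loop.

```python
from typing import Iterator


def stream_generate(model, tokenizer, prompt: str, max_tokens: int = 100) -> Iterator[str]:
    # Yield only the newly decoded text after each sampled token.
    # next_tokens is a hypothetical stand-in for mlx_lm's internal sampling loop.
    token_ids, shown = [], ""
    for token in next_tokens(model, tokenizer, prompt, max_tokens):
        token_ids.append(token)
        text = tokenizer.decode(token_ids)
        yield text[len(shown):]
        shown = text


def generate(model, tokenizer, prompt: str, max_tokens: int = 100, verbose: bool = False) -> str:
    # The regular function consumes the stream and joins the pieces.
    pieces = []
    for piece in stream_generate(model, tokenizer, prompt, max_tokens=max_tokens):
        if verbose:
            print(piece, end="", flush=True)
        pieces.append(piece)
    return "".join(pieces)
```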

@@ -27,7 +27,7 @@ You can use `mlx-lm` as a module:
 ```python
 from mlx_lm import load, generate
 
-model, tokenizer = load("mistralai/Mistral-7B-Instruct-v0.1")
+model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
 
 response = generate(model, tokenizer, prompt="hello", verbose=True)
 ```
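
As a follow-up to the hunk above: `generate` returns the completion as a string, so the output can be captured rather than only printed via `verbose=True`. A small usage sketch under that assumption:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Capture the completion instead of relying on verbose printing;
# max_tokens caps the generation length.
response = generate(model, tokenizer, prompt="hello", max_tokens=100)
print(response)
```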
@@ -46,13 +46,14 @@ You can convert models in the Python API with:
 ```python
 from mlx_lm import convert
 
-upload_repo = "mlx-community/My-Mistral-7B-v0.1-4bit"
+repo = "mistralai/Mistral-7B-Instruct-v0.3"
+upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"
 
-convert("mistralai/Mistral-7B-v0.1", quantize=True, upload_repo=upload_repo)
+convert(repo, quantize=True, upload_repo=upload_repo)
 ```
 
-This will generate a 4-bit quantized Mistral-7B and upload it to the
-repo `mlx-community/My-Mistral-7B-v0.1-4bit`. It will also save the
+This will generate a 4-bit quantized Mistral 7B and upload it to the repo
+`mlx-community/My-Mistral-7B-Instruct-v0.3-4bit`. It will also save the
 converted model in the path `mlx_model` by default.
 
 To see a description of all the arguments you can do:
@@ -61,12 +62,30 @@ To see a description of all the arguments you can do:
 ```python
 >>> help(convert)
 ```
 
+#### Streaming
+
+For streaming generation, use the `stream_generate` function. This returns a
+generator object which streams the output text. For example,
+
+```python
+from mlx_lm import load, stream_generate
+
+repo = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
+model, tokenizer = load(repo)
+
+prompt = "Write a story about Einstein"
+
+for t in stream_generate(model, tokenizer, prompt, max_tokens=512):
+    print(t, end="", flush=True)
+print()
+```
+
 ### Command Line
 
 You can also use `mlx-lm` from the command line with:
 
 ```
-mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.1 --prompt "hello"
+mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.3 --prompt "hello"
 ```
 
 This will download a Mistral 7B model from the Hugging Face Hub and generate
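
Because `stream_generate` in the new section yields plain text pieces, the stream can also be collected into a single string when incremental printing is not needed. A sketch under that assumption:

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Join the streamed pieces into one response instead of printing them as they arrive.
prompt = "Write a story about Einstein"
response = "".join(stream_generate(model, tokenizer, prompt, max_tokens=512))
```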
@@ -81,7 +100,7 @@ mlx_lm.generate --help
 To quantize a model from the command line run:
 
 ```
-mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.1 -q
+mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 -q
 ```
 
 For more options run:
@@ -96,7 +115,7 @@ You can upload new models to Hugging Face by specifying `--upload-repo` to
 ```
 mlx_lm.convert \
-    --hf-path mistralai/Mistral-7B-v0.1 \
+    --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
     -q \
     --upload-repo mlx-community/my-4bit-mistral
 ```
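
The same convert, quantize, and upload flow is available through the Python API shown earlier in this README; a sketch mirroring the CLI invocation above:

```python
from mlx_lm import convert

# Quantize to 4 bits (the default when quantize=True) and upload the result.
convert(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantize=True,
    upload_repo="mlx-community/my-4bit-mistral",
)
```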