mlx_lm: Add Streaming Capability to Generate Function (#807)

* Add streaming feature to text generation function

* Separate streaming and regular generation functions

---------

Co-authored-by: Awni Hannun <awni@apple.com>
Author: Michał Kurc
Date: 2024-06-03 18:04:39 +02:00
Committed by: GitHub
Parent: 8353bbbf93
Commit: 43d6deb3c1

4 changed files with 89 additions and 37 deletions
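
The second bullet in the commit message describes the shape of the change: a generator-based `stream_generate`, with the regular `generate` built on top of it. Below is a minimal sketch of that split, not the actual mlx_lm implementation; the `next_tokens` helper is a hypothetical stand-in for the library's internal sampling loop.

```python
from typing import Iterator


def stream_generate(model, tokenizer, prompt: str, max_tokens: int = 100) -> Iterator[str]:
    # Yield only the newly decoded text after each sampled token.
    # next_tokens is a hypothetical stand-in for mlx_lm's internal sampling loop.
    token_ids, shown = [], ""
    for token in next_tokens(model, tokenizer, prompt, max_tokens):
        token_ids.append(token)
        text = tokenizer.decode(token_ids)
        yield text[len(shown):]
        shown = text


def generate(model, tokenizer, prompt: str, max_tokens: int = 100, verbose: bool = False) -> str:
    # The regular function consumes the stream and joins the pieces.
    pieces = []
    for piece in stream_generate(model, tokenizer, prompt, max_tokens=max_tokens):
        if verbose:
            print(piece, end="", flush=True)
        pieces.append(piece)
    return "".join(pieces)
```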

@@ -27,7 +27,7 @@ You can use `mlx-lm` as a module:
 ```python
 from mlx_lm import load, generate
 
-model, tokenizer = load("mistralai/Mistral-7B-Instruct-v0.1")
+model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
 
 response = generate(model, tokenizer, prompt="hello", verbose=True)
 ```
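
As a follow-up to the hunk above: `generate` returns the completion as a string, so the output can be captured rather than only printed via `verbose=True`. A small usage sketch under that assumption:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Capture the completion instead of relying on verbose printing;
# max_tokens caps the generation length.
response = generate(model, tokenizer, prompt="hello", max_tokens=100)
print(response)
```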
@@ -46,13 +46,14 @@ You can convert models in the Python API with:
 ```python
 from mlx_lm import convert
 
-upload_repo = "mlx-community/My-Mistral-7B-v0.1-4bit"
+repo = "mistralai/Mistral-7B-Instruct-v0.3"
+upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"
 
-convert("mistralai/Mistral-7B-v0.1", quantize=True, upload_repo=upload_repo)
+convert(repo, quantize=True, upload_repo=upload_repo)
 ```
 
-This will generate a 4-bit quantized Mistral-7B and upload it to the
-repo `mlx-community/My-Mistral-7B-v0.1-4bit`. It will also save the
+This will generate a 4-bit quantized Mistral 7B and upload it to the repo
+`mlx-community/My-Mistral-7B-Instruct-v0.3-4bit`. It will also save the
 converted model in the path `mlx_model` by default.
 
 To see a description of all the arguments you can do:
@@ -61,12 +62,30 @@ To see a description of all the arguments you can do:
 ```python
 >>> help(convert)
 ```
 
+#### Streaming
+
+For streaming generation, use the `stream_generate` function. This returns a
+generator object which streams the output text. For example,
+
+```python
+from mlx_lm import load, stream_generate
+
+repo = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
+model, tokenizer = load(repo)
+
+prompt = "Write a story about Einstein"
+
+for t in stream_generate(model, tokenizer, prompt, max_tokens=512):
+    print(t, end="", flush=True)
+print()
+```
+
 ### Command Line
 
 You can also use `mlx-lm` from the command line with:
 
 ```
-mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.1 --prompt "hello"
+mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.3 --prompt "hello"
 ```
 
 This will download a Mistral 7B model from the Hugging Face Hub and generate
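
Because `stream_generate` in the new section yields plain text pieces, the stream can also be collected into a single string when incremental printing is not needed. A sketch under that assumption:

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Join the streamed pieces into one response instead of printing them as they arrive.
prompt = "Write a story about Einstein"
response = "".join(stream_generate(model, tokenizer, prompt, max_tokens=512))
```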
@@ -81,7 +100,7 @@ mlx_lm.generate --help
 To quantize a model from the command line run:
 
 ```
-mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.1 -q
+mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 -q
 ```
 
 For more options run:
@@ -96,7 +115,7 @@ You can upload new models to Hugging Face by specifying `--upload-repo` to
 ```
 mlx_lm.convert \
-    --hf-path mistralai/Mistral-7B-v0.1 \
+    --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
     -q \
     --upload-repo mlx-community/my-4bit-mistral
 ```
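
The same convert, quantize, and upload flow is available through the Python API shown earlier in this README; a sketch mirroring the CLI invocation above:

```python
from mlx_lm import convert

# Quantize to 4 bits (the default when quantize=True) and upload the result.
convert(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantize=True,
    upload_repo="mlx-community/my-4bit-mistral",
)
```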