## Generate Text with LLMs and MLX
The easiest way to get started is to install the `mlx-lm` package:
**With `pip`**:
```sh
pip install mlx-lm
```
**With `conda`**:
```sh
conda install -c conda-forge mlx-lm
```
The `mlx-lm` package also supports:
- [LoRA and QLoRA fine-tuning](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/LORA.md)
- [Merging models](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/MERGE.md)
- [HTTP model serving](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/SERVER.md)
### Python API
You can use `mlx-lm` as a module:
```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

response = generate(model, tokenizer, prompt="hello", verbose=True)
```
To see a description of all the arguments, run:
```
>>> help(generate)
```
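A quick sketch of passing extra generation options; this assumes `generate` accepts the same `max_tokens` keyword that `stream_generate` takes in the streaming example below:
```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Cap the response length; max_tokens is assumed to match the keyword
# accepted by stream_generate in the streaming example below.
response = generate(model, tokenizer, prompt="hello", max_tokens=256, verbose=True)
```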
Check out the [generation example](https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm/examples/generate_response.py) to see how to use the API in more detail.
The `mlx-lm` package also comes with functionality to quantize and optionally
upload models to the Hugging Face Hub.
You can convert models in the Python API with:
```python
from mlx_lm import convert

repo = "mistralai/Mistral-7B-Instruct-v0.3"
upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"

convert(repo, quantize=True, upload_repo=upload_repo)
```
This will generate a 4-bit quantized Mistral 7B and upload it to the repo
`mlx-community/My-Mistral-7B-Instruct-v0.3-4bit`. It will also save the
converted model in the path `mlx_model` by default.
To see a description of all the arguments, run:
```
>>> help(convert)
```
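If you only want a local quantized copy, you can leave out `upload_repo`. A minimal sketch, assuming `load` also accepts the local output path (`mlx_model` by default):
```python
from mlx_lm import convert, load, generate

# Quantize locally without uploading; the converted model is written to
# the default `mlx_model` directory mentioned above.
convert("mistralai/Mistral-7B-Instruct-v0.3", quantize=True)

# Load the local copy and sanity-check it with a short generation
# (assumes load accepts a local path as well as a Hub repo).
model, tokenizer = load("mlx_model")
print(generate(model, tokenizer, prompt="hello"))
```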
#### Streaming
For streaming generation, use the `stream_generate` function. This returns a
generator object which streams the output text. For example,
```python
from mlx_lm import load, stream_generate
repo = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
model, tokenizer = load(repo)
prompt = "Write a story about Einstein"
for t in stream_generate(model, tokenizer, prompt, max_tokens=512):
print(t, end="", flush=True)
print()
```
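If you also want the complete response as a single string, you can collect the streamed chunks as they arrive. A small sketch built on the same call as above:
```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Stream to the terminal while collecting the chunks into one string.
chunks = []
for t in stream_generate(model, tokenizer, "Write a story about Einstein", max_tokens=512):
    print(t, end="", flush=True)
    chunks.append(t)
print()

full_response = "".join(chunks)
```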
### Command Line
You can also use `mlx-lm` from the command line with:
```
mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.3 --prompt "hello"
```
This will download a Mistral 7B model from the Hugging Face Hub and generate
text using the given prompt.
For a full list of options run:
```
mlx_lm.generate --help
```
To quantize a model from the command line run:
```
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 -q
```
For more options run:
```
mlx_lm.convert --help
```
You can upload new models to Hugging Face by specifying `--upload-repo` to
`convert`. For example, to upload a quantized Mistral-7B model to the
[MLX Hugging Face community](https://huggingface.co/mlx-community) you can do:
```
mlx_lm.convert \
    --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --upload-repo mlx-community/my-4bit-mistral
```
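Once uploaded, the quantized model can be loaded back from the Hub like any other model. For example, with the Python API (using the example repo name from the command above, assuming the upload has completed):
```python
from mlx_lm import load, generate

# Load the quantized model uploaded above (example repo name from the command).
model, tokenizer = load("mlx-community/my-4bit-mistral")
print(generate(model, tokenizer, prompt="hello"))
```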
### Supported Models
The example supports Hugging Face format Mistral, Llama, and Phi-2 style
models. If the model you want to run is not supported, file an
[issue](https://github.com/ml-explore/mlx-examples/issues/new) or better yet,
submit a pull request.
Here are a few examples of Hugging Face models that work with this example:
- [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- [deepseek-ai/deepseek-coder-6.7b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct)
- [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat)
- [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)
- [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
- [Qwen/Qwen-7B](https://huggingface.co/Qwen/Qwen-7B)
- [pfnet/plamo-13b](https://huggingface.co/pfnet/plamo-13b)
- [pfnet/plamo-13b-instruct](https://huggingface.co/pfnet/plamo-13b-instruct)
- [stabilityai/stablelm-2-zephyr-1_6b](https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b)
- [internlm/internlm2-7b](https://huggingface.co/internlm/internlm2-7b)
Most
[Mistral](https://huggingface.co/models?library=transformers,safetensors&other=mistral&sort=trending),
[Llama](https://huggingface.co/models?library=transformers,safetensors&other=llama&sort=trending),
[Phi-2](https://huggingface.co/models?library=transformers,safetensors&other=phi&sort=trending),
and
[Mixtral](https://huggingface.co/models?library=transformers,safetensors&other=mixtral&sort=trending)
style models should work out of the box.
For some models (such as `Qwen` and `plamo`) the tokenizer requires you to
enable the `trust_remote_code` option. You can do this by passing
`--trust-remote-code` in the command line. If you don't specify the flag
explicitly, you will be prompted to trust remote code in the terminal when
running the model.
For `Qwen` models you must also specify the `eos_token`. You can do this by
passing `--eos-token "<|endoftext|>"` in the command
line.
These options can also be set in the Python API. For example:
```python
model, tokenizer = load(
"qwen/Qwen-7B",
tokenizer_config={"eos_token": "< |endoftext|>", "trust_remote_code": True},
)
```