## Generate Text with LLMs and MLX

The easiest way to get started is to install the `mlx-lm` package:

**With `pip`**:

```sh
pip install mlx-lm
```

**With `conda`**:

```sh
conda install -c conda-forge mlx-lm
```

The `mlx-lm` package also has:

- [LoRA and QLoRA fine-tuning](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/LORA.md)
- [Merging models](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/MERGE.md)
- [HTTP model serving](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/SERVER.md) (see the sketch below)
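
The server exposes an HTTP API for text generation. As a rough sketch, assuming the server's default port and OpenAI-style chat endpoint (see the SERVER.md link above for the exact flags and request schema):

```
mlx_lm.server --model mistralai/Mistral-7B-Instruct-v0.1

# In another terminal, query the chat completions endpoint:
curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "hello"}], "max_tokens": 100}'
```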

### Python API

You can use `mlx-lm` as a module:

```python
from mlx_lm import load, generate

model, tokenizer = load("mistralai/Mistral-7B-Instruct-v0.1")

response = generate(model, tokenizer, prompt="hello", verbose=True)
```
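
For instruction-tuned models, generation is usually better if you wrap the prompt in the model's chat template first. A minimal sketch, assuming the tokenizer exposes the standard Hugging Face `apply_chat_template` method:

```python
from mlx_lm import load, generate

model, tokenizer = load("mistralai/Mistral-7B-Instruct-v0.1")

# Format the message with the model's chat template before generating.
messages = [{"role": "user", "content": "Write a haiku about the ocean."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```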

To see a description of all the arguments, you can do:

```
>>> help(generate)
```

The `mlx-lm` package also comes with functionality to quantize and optionally
upload models to the Hugging Face Hub.

You can convert models in the Python API with:

```python
from mlx_lm import convert

upload_repo = "mlx-community/My-Mistral-7B-v0.1-4bit"

convert("mistralai/Mistral-7B-v0.1", quantize=True, upload_repo=upload_repo)
```

This will generate a 4-bit quantized Mistral-7B and upload it to the repo
`mlx-community/My-Mistral-7B-v0.1-4bit`. It will also save the converted
model in the path `mlx_model` by default.
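
Once converted, the local copy can be loaded directly by path. A minimal sketch, assuming the default `mlx_model` output directory:

```python
from mlx_lm import load, generate

# Load the quantized model from the local conversion output directory.
model, tokenizer = load("mlx_model")
response = generate(model, tokenizer, prompt="hello", verbose=True)
```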

To see a description of all the arguments, you can do:

```
>>> help(convert)
```

### Command Line

You can also use `mlx-lm` from the command line with:

```
mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.1 --prompt "hello"
```

This will download a Mistral 7B model from the Hugging Face Hub and generate
text using the given prompt.
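
A few commonly used flags control the output length and the sampling temperature. As a sketch (confirm the exact flag names with `--help` below):

```
mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.1 \
    --prompt "hello" \
    --max-tokens 256 \
    --temp 0.7
```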

For a full list of options, run:

```
mlx_lm.generate --help
```

To quantize a model from the command line, run:

```
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.1 -q
```

For more options, run:

```
mlx_lm.convert --help
```

You can upload new models to Hugging Face by specifying `--upload-repo` to
`convert`. For example, to upload a quantized Mistral-7B model to the
[MLX Hugging Face community](https://huggingface.co/mlx-community) you can do:

```
mlx_lm.convert \
    --hf-path mistralai/Mistral-7B-v0.1 \
    -q \
    --upload-repo mlx-community/my-4bit-mistral
```

### Supported Models

The example supports Hugging Face format Mistral, Llama, and Phi-2 style
models. If the model you want to run is not supported, file an
[issue](https://github.com/ml-explore/mlx-examples/issues/new) or better yet,
submit a pull request.

Here are a few examples of Hugging Face models that work with this example:

- [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- [deepseek-ai/deepseek-coder-6.7b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct)
- [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat)
- [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)
- [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
- [Qwen/Qwen-7B](https://huggingface.co/Qwen/Qwen-7B)
- [pfnet/plamo-13b](https://huggingface.co/pfnet/plamo-13b)
- [pfnet/plamo-13b-instruct](https://huggingface.co/pfnet/plamo-13b-instruct)
- [stabilityai/stablelm-2-zephyr-1_6b](https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b)
- [internlm/internlm2-7b](https://huggingface.co/internlm/internlm2-7b)

Most
[Mistral](https://huggingface.co/models?library=transformers,safetensors&other=mistral&sort=trending),
[Llama](https://huggingface.co/models?library=transformers,safetensors&other=llama&sort=trending),
[Phi-2](https://huggingface.co/models?library=transformers,safetensors&other=phi&sort=trending),
and
[Mixtral](https://huggingface.co/models?library=transformers,safetensors&other=mixtral&sort=trending)
style models should work out of the box.

For some models (such as `Qwen` and `plamo`) the tokenizer requires you to
enable the `trust_remote_code` option. You can do this by passing
`--trust-remote-code` on the command line. If you don't specify the flag
explicitly, you will be prompted to trust remote code in the terminal when
running the model.

For `Qwen` models you must also specify the `eos_token`. You can do this by
passing `--eos-token "<|endoftext|>"` on the command line.
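
For example, to run a `Qwen` model from the command line with both options set:

```
mlx_lm.generate \
    --model Qwen/Qwen-7B \
    --prompt "hello" \
    --trust-remote-code \
    --eos-token "<|endoftext|>"
```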

These options can also be set in the Python API. For example:

```python
model, tokenizer = load(
    "qwen/Qwen-7B",
    tokenizer_config={"eos_token": "<|endoftext|>", "trust_remote_code": True},
)
```