remove mlx lm (#1353)
llms/README.md
@@ -1,300 +1,6 @@
# DEPRECATED

# MOVE NOTICE

The mlx-lm package has moved to a [new repo](https://github.com/ml-explore/mlx-lm).

The package here will be removed soon. Send new contributions and issues to the MLX LM repo.

## Generate Text with LLMs and MLX

The easiest way to get started is to install the `mlx-lm` package:

**With `pip`**:

```sh
pip install mlx-lm
```

**With `conda`**:

```sh
conda install -c conda-forge mlx-lm
```

The `mlx-lm` package also has:

- [LoRA, QLoRA, and full fine-tuning](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/LORA.md)
- [Merging models](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/MERGE.md)
- [HTTP model serving](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/SERVER.md)

### Quick Start

To generate text with an LLM use:

```bash
mlx_lm.generate --prompt "Hi!"
```

To chat with an LLM use:

```bash
mlx_lm.chat
```

This will give you a chat REPL that you can use to interact with the LLM. The
chat context is preserved during the lifetime of the REPL.

Commands in `mlx-lm` typically take command line options which let you specify
the model, sampling parameters, and more. Use `-h` to see a list of available
options for a command, e.g.:

```bash
mlx_lm.generate -h
```
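For example, a minimal sketch combining the model and prompt options used
elsewhere in this README (check the `-h` output for the exact flags your
version supports):

```bash
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --prompt "Hi!"
```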
### Python API

You can use `mlx-lm` as a module:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Write a story about Einstein"

messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```

To see a description of all the arguments you can do:

```
>>> help(generate)
```

Check out the [generation
example](https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm/examples/generate_response.py)
to see how to use the API in more detail.
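As a small illustration, the keyword arguments listed by `help(generate)` can
be passed straight through; a minimal sketch that caps the output length
(assuming `max_tokens` is among them, as it is for `stream_generate` below):

```python
# Continuing from the example above: limit the generation to 256 tokens.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```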
The `mlx-lm` package also comes with functionality to quantize and optionally
upload models to the Hugging Face Hub.

You can convert models using the Python API:

```python
from mlx_lm import convert

repo = "mistralai/Mistral-7B-Instruct-v0.3"
upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"

convert(repo, quantize=True, upload_repo=upload_repo)
```

This will generate a 4-bit quantized Mistral 7B and upload it to the repo
`mlx-community/My-Mistral-7B-Instruct-v0.3-4bit`. It will also save the
converted model in the path `mlx_model` by default.

To see a description of all the arguments you can do:

```
>>> help(convert)
```
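The converted weights can then be loaded like any other model; a minimal
sketch, assuming the default `mlx_model` output path mentioned above:

```python
from mlx_lm import load, generate

# Load the locally converted model instead of a Hub repo.
model, tokenizer = load("mlx_model")
print(generate(model, tokenizer, prompt="Hi!"))
```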
#### Streaming

For streaming generation, use the `stream_generate` function. This yields
a generation response object.

For example,

```python
from mlx_lm import load, stream_generate

repo = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
model, tokenizer = load(repo)

prompt = "Write a story about Einstein"

messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

for response in stream_generate(model, tokenizer, prompt, max_tokens=512):
    print(response.text, end="", flush=True)
print()
```

#### Sampling

The `generate` and `stream_generate` functions accept `sampler` and
`logits_processors` keyword arguments. A sampler is any callable which accepts
a possibly batched logits array and returns an array of sampled tokens. The
`logits_processors` must be a list of callables which take the token history
and current logits as input and return the processed logits. The logits
processors are applied in order.

Some standard sampling functions and logits processors are provided in
`mlx_lm.sample_utils`.
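As a minimal sketch of the two interfaces described above (a hand-rolled
temperature sampler and a logits processor that crudely penalizes previously
generated tokens; in practice the ready-made helpers in `mlx_lm.sample_utils`
are the better starting point):

```python
import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

def temperature_sampler(logits: mx.array) -> mx.array:
    # A sampler: takes a possibly batched logits array and returns an
    # array of sampled token ids.
    temp = 0.7
    return mx.random.categorical(logits * (1 / temp))

def penalize_repeats(tokens: mx.array, logits: mx.array) -> mx.array:
    # A logits processor: takes the token history and the current logits
    # and returns processed logits. Here: down-weight tokens already seen.
    if len(tokens) > 0:
        logits[:, tokens] = logits[:, tokens] - 1.0
    return logits

text = generate(
    model,
    tokenizer,
    prompt="Write a story about Einstein",
    sampler=temperature_sampler,
    logits_processors=[penalize_repeats],
)
```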
### Command Line

You can also use `mlx-lm` from the command line with:

```
mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.3 --prompt "hello"
```

This will download a Mistral 7B model from the Hugging Face Hub and generate
text using the given prompt.

For a full list of options run:

```
mlx_lm.generate --help
```

To quantize a model from the command line run:

```
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 -q
```

For more options run:

```
mlx_lm.convert --help
```

You can upload new models to Hugging Face by specifying `--upload-repo` to
`convert`. For example, to upload a quantized Mistral-7B model to the
[MLX Hugging Face community](https://huggingface.co/mlx-community) you can do:

```
mlx_lm.convert \
    --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --upload-repo mlx-community/my-4bit-mistral
```

Models can also be converted and quantized directly in the
[mlx-my-repo](https://huggingface.co/spaces/mlx-community/mlx-my-repo) Hugging
Face Space.

### Long Prompts and Generations

`mlx-lm` has some tools to scale efficiently to long prompts and generations:

- A rotating fixed-size key-value cache.
- Prompt caching.

To use the rotating key-value cache pass the argument `--max-kv-size n` where
`n` can be any integer. Smaller values like `512` will use very little RAM but
result in worse quality. Larger values like `4096` or higher will use more RAM
but have better quality.
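For example, combining this flag with the command line options shown above
(the cache size of `1024` is just an illustrative middle ground):

```bash
mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.3 --prompt "hello" --max-kv-size 1024
```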
Caching prompts can substantially speed up reusing the same long context with
different queries. To cache a prompt use `mlx_lm.cache_prompt`. For example:

```bash
cat prompt.txt | mlx_lm.cache_prompt \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --prompt - \
    --prompt-cache-file mistral_prompt.safetensors
```

Then use the cached prompt with `mlx_lm.generate`:

```
mlx_lm.generate \
    --prompt-cache-file mistral_prompt.safetensors \
    --prompt "\nSummarize the above text."
```

The cached prompt is treated as a prefix to the supplied prompt. Also notice
that when using a cached prompt, the model to use is read from the cache and
need not be supplied explicitly.

Prompt caching can also be used in the Python API to avoid recomputing the
prompt. This is useful in multi-turn dialogues or across requests that use the
same context. See the
[example](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/examples/chat.py)
for more usage details.
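A minimal sketch of the idea, assuming the `make_prompt_cache` helper and the
`prompt_cache` keyword used in the linked chat example:

```python
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# One cache object is reused across turns so the shared context is only
# processed once.
prompt_cache = make_prompt_cache(model)

for question in ["What is MLX?", "How do I install it?"]:
    messages = [{"role": "user", "content": question}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    print(generate(model, tokenizer, prompt=prompt, prompt_cache=prompt_cache))
```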
### Supported Models

`mlx-lm` supports thousands of Hugging Face format LLMs. If the model you want to
run is not supported, file an
[issue](https://github.com/ml-explore/mlx-examples/issues/new) or better yet,
submit a pull request.

Here are a few examples of Hugging Face models that work with this example:

- [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- [deepseek-ai/deepseek-coder-6.7b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct)
- [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat)
- [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)
- [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
- [Qwen/Qwen-7B](https://huggingface.co/Qwen/Qwen-7B)
- [pfnet/plamo-13b](https://huggingface.co/pfnet/plamo-13b)
- [pfnet/plamo-13b-instruct](https://huggingface.co/pfnet/plamo-13b-instruct)
- [stabilityai/stablelm-2-zephyr-1_6b](https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b)
- [internlm/internlm2-7b](https://huggingface.co/internlm/internlm2-7b)
- [tiiuae/falcon-mamba-7b-instruct](https://huggingface.co/tiiuae/falcon-mamba-7b-instruct)

Most
[Mistral](https://huggingface.co/models?library=transformers,safetensors&other=mistral&sort=trending),
[Llama](https://huggingface.co/models?library=transformers,safetensors&other=llama&sort=trending),
[Phi-2](https://huggingface.co/models?library=transformers,safetensors&other=phi&sort=trending),
and
[Mixtral](https://huggingface.co/models?library=transformers,safetensors&other=mixtral&sort=trending)
style models should work out of the box.

For some models (such as `Qwen` and `plamo`) the tokenizer requires you to
enable the `trust_remote_code` option. You can do this by passing
`--trust-remote-code` on the command line. If you don't specify the flag
explicitly, you will be prompted to trust remote code in the terminal when
running the model.

For `Qwen` models you must also specify the `eos_token`. You can do this by
passing `--eos-token "<|endoftext|>"` on the command line.
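For example, a sketch combining both flags on the command line, reusing the
Qwen model listed above:

```bash
mlx_lm.generate \
    --model Qwen/Qwen-7B \
    --trust-remote-code \
    --eos-token "<|endoftext|>" \
    --prompt "hello"
```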
These options can also be set in the Python API. For example:

```python
model, tokenizer = load(
    "qwen/Qwen-7B",
    tokenizer_config={"eos_token": "<|endoftext|>", "trust_remote_code": True},
)
```

### Large Models

> [!NOTE]
> This requires macOS 15.0 or higher to work.

Models which are large relative to the total RAM available on the machine can
be slow. `mlx-lm` will attempt to make them faster by wiring the memory
occupied by the model and cache.

If you see the following warning message:

> [WARNING] Generating with a model that requires ...

then the model will likely be slow on the given machine. If the model fits in
RAM then it can often be sped up by increasing the system wired memory limit.
To increase the limit, set the following `sysctl`:

```bash
sudo sysctl iogpu.wired_limit_mb=N
```

The value `N` should be larger than the size of the model in megabytes but
smaller than the memory size of the machine.
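For example, a hypothetical machine with 64 GB of RAM running a model that
occupies roughly 36 GB could raise the limit to about 40 GB (40960 MB), which
is above the model size and below the total memory:

```bash
# 40960 MB is larger than the ~36 GB model but smaller than 64 GB of RAM.
sudo sysctl iogpu.wired_limit_mb=40960
```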
The package has been removed from the MLX Examples repo. Send new contributions
and issues to the MLX LM repo.