# Generate Text with LLMs and MLX
The easiest way to get started is to install the `mlx-lm` package:

With `pip`:

```
pip install mlx-lm
```

With `conda`:

```
conda install -c conda-forge mlx-lm
```
## Python API
You can use `mlx-lm` as a module:

```python
from mlx_lm import load, generate

model, tokenizer = load("mistralai/Mistral-7B-v0.1")
response = generate(model, tokenizer, prompt="hello", verbose=True)
```
To see a description of all the arguments you can do:

```
>>> help(generate)
```
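For instance, you will typically want to control the sampling temperature and the number of generated tokens. The exact argument names may vary between versions, so treat the following as a sketch (here `max_tokens` and `temp` are assumed) and check `help(generate)` for the signature your install exposes:

```python
# Sketch only: keyword argument names are assumed, confirm them with help(generate).
response = generate(
    model,
    tokenizer,
    prompt="Write a haiku about the ocean.",
    max_tokens=256,  # assumed: cap on the number of generated tokens
    temp=0.7,        # assumed: sampling temperature (0.0 = greedy)
    verbose=True,
)
```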
The `mlx-lm` package also comes with functionality to quantize and optionally
upload models to the Hugging Face Hub.

You can convert models in the Python API with:

```python
from mlx_lm import convert

upload_repo = "mlx-community/My-Mistral-7B-v0.1-4bit"

convert("mistralai/Mistral-7B-v0.1", quantize=True, upload_repo=upload_repo)
```
This will generate a 4-bit quantized Mistral-7B and upload it to the repo
`mlx-community/My-Mistral-7B-v0.1-4bit`. It will also save the converted model
in the path `mlx_model` by default.

To see a description of all the arguments you can do:

```
>>> help(convert)
```
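If you omit `upload_repo`, nothing is pushed to the Hub and the converted weights simply land in the local `mlx_model` directory, which `load` should accept like a Hub repo id. A minimal sketch, assuming `load` supports local paths:

```python
from mlx_lm import convert, load, generate

# Convert and quantize locally only; no upload_repo means nothing is uploaded.
convert("mistralai/Mistral-7B-v0.1", quantize=True)

# Load the converted model from the default output directory (assumed to work
# the same way as passing a Hugging Face repo id).
model, tokenizer = load("mlx_model")
response = generate(model, tokenizer, prompt="hello", verbose=True)
```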
## Command Line
You can also use `mlx-lm` from the command line with:

```
python -m mlx_lm.generate --model mistralai/Mistral-7B-v0.1 --prompt "hello"
```
This will download a Mistral 7B model from the Hugging Face Hub and generate
text using the given prompt.

For a full list of options run:

```
python -m mlx_lm.generate --help
```
To quantize a model from the command line run:

```
python -m mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.1 -q
```

For more options run:

```
python -m mlx_lm.convert --help
```
You can upload new models to Hugging Face by specifying `--upload-repo` to
`convert`. For example, to upload a quantized Mistral-7B model to the
MLX Hugging Face community you can do:

```
python -m mlx_lm.convert \
    --hf-path mistralai/Mistral-7B-v0.1 \
    -q \
    --upload-repo mlx-community/my-4bit-mistral
```
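Once the upload finishes, the quantized model can be used like any other Hub model. For example, reusing the illustrative repo name from above:

```
python -m mlx_lm.generate --model mlx-community/my-4bit-mistral --prompt "hello"
```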
## Supported Models
The example supports Hugging Face format Mistral, Llama, and Phi-2 style models. If the model you want to run is not supported, file an issue or better yet, submit a pull request.
Here are a few examples of Hugging Face models that work with this example:
- mistralai/Mistral-7B-v0.1
- meta-llama/Llama-2-7b-hf
- deepseek-ai/deepseek-coder-6.7b-instruct
- 01-ai/Yi-6B-Chat
- microsoft/phi-2
- mistralai/Mixtral-8x7B-Instruct-v0.1
- Qwen/Qwen-7B
- pfnet/plamo-13b
- pfnet/plamo-13b-instruct
Most Mistral, Llama, Phi-2, and Mixtral style models should work out of the box.
For some models (such as `Qwen` and `plamo`) the tokenizer requires you to
enable the `trust_remote_code` option. You can do this by passing
`--trust-remote-code` in the command line. If you don't specify the flag
explicitly, you will be prompted to trust remote code in the terminal when
running the model.
For `Qwen` models you must also specify the `eos_token`. You can do this by
passing `--eos-token "<|endoftext|>"` in the command line.
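Putting both flags together, a command line invocation for a Qwen model might look like the following (the prompt is only an illustration):

```
python -m mlx_lm.generate \
    --model Qwen/Qwen-7B \
    --trust-remote-code \
    --eos-token "<|endoftext|>" \
    --prompt "hello"
```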
These options can also be set in the Python API. For example:

```python
model, tokenizer = load(
    "qwen/Qwen-7B",
    tokenizer_config={"eos_token": "<|endoftext|>", "trust_remote_code": True},
)
```