
# Llama
An example of generating text with Llama (1 or 2) using MLX.
Llama is a set of open source language models from Meta AI Research[^1][^2] ranging from 7B to 70B parameters. This example also supports Meta's Llama Chat and Code Llama models, as well as the 1.1B TinyLlama models from SUTD.[^3]
### Setup
Install the dependencies:
```shell
pip install -r requirements.txt
```
Next, download and convert the model. If you do not have access to the model weights you will need to request access from Meta.
> [!TIP]
> Alternatively, you can also download a few converted checkpoints from the
> MLX Community organization on Hugging Face and skip the conversion step.
You can download the TinyLlama models directly from Hugging Face.
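If you go the Hugging Face route, `huggingface-cli` (from the `huggingface_hub` package) is one way to fetch a checkpoint. The repo name below is illustrative only; browse the MLX Community organization for available models:

```shell
# The repo name is an example, not a guaranteed checkpoint name
huggingface-cli download mlx-community/Llama-2-7b-chat-mlx --local-dir mlx_model
```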
Convert the weights with:
```shell
python convert.py --torch-path <path_to_torch_model>
```
To generate a 4-bit quantized model use the `-q` flag:

```shell
python convert.py --torch-path <path_to_torch_model> -q
```
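The `-q` flag relies on MLX's built-in quantization support. As a minimal sketch of that API, with a toy model standing in for Llama (the layer sizes and group size below are illustrative assumptions, not the script's exact settings):

```python
import mlx.core as mx
import mlx.nn as nn

# Toy stand-in for the Llama network; nn.quantize swaps the Linear
# layers for quantized equivalents in place.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
nn.quantize(model, group_size=64, bits=4)  # 4-bit weights, 64-element groups

x = mx.random.normal((1, 512))
print(model(x).shape)  # the quantized model runs the same forward pass
```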
For TinyLlama use:

```shell
python convert.py --torch-path <path_to_torch_model> --model-name tiny_llama
```
By default, the conversion script will make the directory `mlx_model` and save the converted `weights.npz`, `tokenizer.model`, and `config.json` there.
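To sanity-check a conversion you can load those artifacts directly. A minimal sketch, assuming the default `mlx_model` output directory:

```python
import json
import mlx.core as mx

# mx.load reads the .npz into a dict mapping parameter names to arrays
weights = mx.load("mlx_model/weights.npz")
with open("mlx_model/config.json") as f:
    config = json.load(f)

print(f"{len(weights)} weight arrays; config keys: {sorted(config)}")
```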
### Run
Once you've converted the weights to MLX format, you can interact with the Llama model:
```shell
python llama.py --prompt "hello"
```
Run `python llama.py --help` for more details.
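For a sense of what happens under the hood, here is a hedged sketch of temperature sampling with a key-value cache; the `model(x, cache=...)` signature is an assumption for illustration, not `llama.py`'s exact interface:

```python
import mlx.core as mx

def generate(model, prompt_tokens, max_tokens=100, temp=0.8):
    # Assumed interface: model(tokens, cache=...) -> (logits, cache)
    tokens = list(prompt_tokens)
    x = mx.array(tokens)[None]  # add a batch dimension
    cache = None
    for _ in range(max_tokens):
        logits, cache = model(x, cache=cache)     # reuse the KV cache
        logits = logits[:, -1, :] * (1 / temp)    # scale by temperature
        next_tok = mx.random.categorical(logits)  # sample one token id
        tokens.append(next_tok.item())
        x = next_tok[None]                        # feed back the new token
    return tokens
```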
[^1]: For Llama v1 refer to the arXiv paper and blog post for more details.
[^2]: For Llama v2 refer to the blog post.
[^3]: For TinyLlama refer to the GitHub repository.