LLMs in MLX with GGUF

An example generating text using GGUF format models in MLX.[1]

Note

MLX can read most quantization formats from GGUF, but only a few are supported natively: Q4_0, Q4_1, and Q8_0. Weights stored in any other quantization format are cast to float16 when loaded.
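
For reference, here is a minimal sketch of inspecting a GGUF file with MLX. It assumes mlx.core.load accepts GGUF files and a return_metadata flag (available in recent MLX releases); the file name is simply the one from the example further down.

import mlx.core as mx

# Load the tensors and the GGUF metadata (return_metadata requires a recent MLX).
weights, metadata = mx.load("mistral-7b-v0.1.Q8_0.gguf", return_metadata=True)

# Supported quantizations (Q4_0, Q4_1, Q8_0) are kept quantized; anything else
# shows up as float16 after loading.
print("tensor count:", len(weights))
print("dtypes:", {str(w.dtype) for w in weights.values()})
print("architecture:", metadata.get("general.architecture"))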

Setup

Install the dependencies:

pip install -r requirements.txt

Run

Run with:

python generate.py \
  --repo <hugging_face_repo> \
  --gguf <file.gguf> \
  --prompt "Write a quicksort in Python"

For example, to generate text with Mistral 7B use:

python generate.py \
  --repo TheBloke/Mistral-7B-v0.1-GGUF \
  --gguf mistral-7b-v0.1.Q8_0.gguf \
  --prompt "Write a quicksort in Python"

Run python generate.py --help for more options.
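
The --repo and --gguf arguments name a file on the Hugging Face Hub. If you want to fetch the file yourself and point MLX at it directly, a hedged sketch using huggingface_hub (an assumed dependency; this is not necessarily how generate.py implements the download) looks like this:

from huggingface_hub import hf_hub_download
import mlx.core as mx

# Download the same GGUF file the example command above refers to.
gguf_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-v0.1-GGUF",
    filename="mistral-7b-v0.1.Q8_0.gguf",
)

# Read the weights with MLX; supported quantizations stay quantized.
weights = mx.load(gguf_path)
print(f"loaded {len(weights)} arrays from {gguf_path}")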

Models that have been tested and work include:

- TheBloke/Mistral-7B-v0.1-GGUF (with the mistral-7b-v0.1.Q8_0.gguf file used in the example above)

[1]: For more information on GGUF, see the GGUF documentation.