# LLMs in MLX with GGUF
An example generating text using GGUF format models in MLX.[^1]
> [!NOTE]
> MLX can read most quantization formats from GGUF directly. However, only a few
> quantizations are natively supported: `Q4_0`, `Q4_1`, and `Q8_0`. Unsupported
> quantizations will be cast to `float16`.
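You can also load a GGUF file directly to see what a given quantization turns into. Below is a minimal sketch, assuming you already have a local GGUF file (the filename is just the Mistral example used later) and a recent MLX release where `mlx.core.load` accepts GGUF files with `return_metadata`:

```python
import mlx.core as mx

# Load tensors and GGUF metadata from a local file.
weights, metadata = mx.load("mistral-7b-v0.1.Q8_0.gguf", return_metadata=True)

# Per the note above, unsupported quantizations are cast to float16 on load;
# inspecting dtypes is a quick way to see what you actually got.
for name in list(weights)[:5]:
    print(name, weights[name].shape, weights[name].dtype)
```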
## Setup

Install the dependencies:

```bash
pip install -r requirements.txt
```
## Run

Run with:

```bash
python generate.py \
  --repo <hugging_face_repo> \
  --gguf <file.gguf> \
  --prompt "Write a quicksort in Python"
```
For example, to generate text with Mistral 7B use:

```bash
python generate.py \
  --repo TheBloke/Mistral-7B-v0.1-GGUF \
  --gguf mistral-7b-v0.1.Q8_0.gguf \
  --prompt "Write a quicksort in Python"
```
Run `python generate.py --help` for more options.
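The `--repo`/`--gguf` pair identifies a file on the Hugging Face Hub. If you want to fetch the file yourself (for example, to inspect it before running the script), a minimal sketch using `huggingface_hub` with the same repo and filename as the Mistral example above:

```python
# Sketch: manually download the GGUF file referenced by --repo / --gguf.
# Assumes `huggingface_hub` is installed; the repo and filename match the
# Mistral example above.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-v0.1-GGUF",
    filename="mistral-7b-v0.1.Q8_0.gguf",
)
print(gguf_path)  # local cache path of the downloaded GGUF file
```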
Models that have been tested and work include:
- TheBloke/Mistral-7B-v0.1-GGUF, for quantized models use:
  - mistral-7b-v0.1.Q8_0.gguf
  - mistral-7b-v0.1.Q4_0.gguf
- TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF, for quantized models use:
  - tinyllama-1.1b-chat-v1.0.Q8_0.gguf
  - tinyllama-1.1b-chat-v1.0.Q4_0.gguf
[^1]: For more information on GGUF see the documentation.