Quantize example (#162)

* testing quantization

* conversion + quantization working

* one config processor

* quantization in mistral / nits in llama

* args for quantization

* llama / mistral conversion in good shape

* phi2 quantized

* mixtral

* qwen conversion
Awni Hannun
2023-12-21 12:59:37 -08:00
committed by GitHub
parent 4c9db80ed2
commit 3cf436b529
17 changed files with 553 additions and 126 deletions


@@ -30,24 +30,32 @@ Face](https://huggingface.co/TinyLlama).
Convert the weights with:
```
-python convert.py --model-path <path_to_torch_model>
+python convert.py --torch-path <path_to_torch_model>
```
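For reference, the core of the conversion is loading the PyTorch checkpoint and re-saving the tensors as NumPy arrays. A minimal sketch, assuming a single `consolidated.00.pth` checkpoint; the paths and float16 cast are illustrative, not the script's exact code:
```
import numpy as np
import torch

# Load the original PyTorch checkpoint on the CPU (illustrative path).
state = torch.load("<path_to_torch_model>/consolidated.00.pth", map_location="cpu")

# Cast to float16 and convert every tensor to a NumPy array.
weights = {name: t.to(torch.float16).numpy() for name, t in state.items()}

# Save in the .npz format that the MLX example loads.
np.savez("weights.npz", **weights)
```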
+To generate a 4-bit quantized model use the `-q` flag:
+```
+python convert.py --torch-path <path_to_torch_model> -q
+```
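For a sense of what the `-q` flag produces: MLX quantization packs each weight matrix into small groups with per-group scales and biases. A minimal sketch of quantizing and dequantizing one matrix with `mx.quantize`; the group size and bit width shown are common defaults, assumed here rather than read from the script:
```
import mlx.core as mx

w = mx.random.normal((512, 512))  # stand-in for a full-precision weight matrix

# 4-bit quantization with groups of 64 elements along each row.
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)

# Dequantize to inspect the reconstruction error.
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=4)
print(mx.abs(w - w_hat).max())
```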
For TinyLlama use
```
-python convert.py --model-path <path_to_torch_model> --model-name tiny_llama
+python convert.py --torch-path <path_to_torch_model> --model-name tiny_llama
```
-The conversion script will save the converted weights in the same location.
+By default, the conversion script will make the directory `mlx_model` and save
+the converted `weights.npz`, `tokenizer.model`, and `config.json` there.
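The three files in `mlx_model` can be read back directly; a small sketch of loading them (the exact keys in `config.json` vary by model, so the `dim` lookup is just an example):
```
import json

import mlx.core as mx
from sentencepiece import SentencePieceProcessor

weights = mx.load("mlx_model/weights.npz")   # dict of parameter name -> mx.array
with open("mlx_model/config.json") as f:
    config = json.load(f)
tokenizer = SentencePieceProcessor(model_file="mlx_model/tokenizer.model")

print(len(weights), "arrays, model dim:", config.get("dim"))
print(tokenizer.encode("hello"))
```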
### Run
Once you've converted the weights to MLX format, you can interact with the
-LlaMA model:
+Llama model:
```
-python llama.py <path_to_model> <path_to_tokenizer.model> --prompt "hello"
+python llama.py --prompt "hello"
```
Run `python llama.py --help` for more details.
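Under the hood, generation is repeated next-token prediction. A toy greedy-decoding loop for illustration only; `model` and `tokenizer` here are hypothetical stand-ins, not the classes defined in `llama.py`:
```
import mlx.core as mx

def greedy_generate(model, tokenizer, prompt, max_tokens=64):
    # Assumptions: `model(tokens)` returns logits of shape (1, seq_len, vocab)
    # and `tokenizer` is a SentencePiece processor.
    tokens = mx.array([tokenizer.bos_id()] + tokenizer.encode(prompt))[None]
    for _ in range(max_tokens):
        logits = model(tokens)
        next_token = mx.argmax(logits[:, -1, :], axis=-1)
        if next_token.item() == tokenizer.eos_id():
            break
        tokens = mx.concatenate([tokens, next_token[None]], axis=1)
    return tokenizer.decode(tokens[0].tolist()[1:])
```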