feat(mlx-lm): export the GGUF (fp16) format model weights from fuse.py (#555)

* wip

* wip

* feat: convert mlx model to gguf f16

* chore: convert norm layer to float32 to avoid overflow issue

* chore: add support for mixtral

* chore: clean up

* chore: remove unused import statement

* chore: clean up weight name mapping

* version and readme

* actual version bump

---------

Co-authored-by: Awni Hannun <awni@apple.com>
Anchen authored 2024-03-22 04:34:11 +11:00, committed by GitHub
parent 8f906c859a
commit fe96ef342f
4 changed files with 351 additions and 6 deletions
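The commit message above summarizes the feature at a high level; roughly, the intended workflow is to fine-tune LoRA adapters and then fuse and export them to GGUF. A minimal sketch of that flow, assuming the default adapter file locations and a hypothetical dataset path:

```shell
# Fine-tune LoRA adapters (hypothetical dataset path), then fuse them into the
# base model and export the result as an fp16 GGUF file.
python -m mlx_lm.lora \
    --model mistralai/Mistral-7B-v0.1 \
    --train \
    --data ./my_data

python -m mlx_lm.fuse \
    --model mistralai/Mistral-7B-v0.1 \
    --export-gguf
```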


@@ -9,6 +9,7 @@ LoRA (QLoRA).[^qlora] LoRA fine-tuning works with the following model families:
- Phi2
- Mixtral
- Qwen2
- Gemma
- OLMo
## Contents
@@ -17,7 +18,7 @@ LoRA (QLoRA).[^qlora] LoRA fine-tuning works with the following model families:
* [Fine-tune](#Fine-tune)
* [Evaluate](#Evaluate)
* [Generate](#Generate)
-* [Fuse and Upload](#Fuse-and-Upload)
+* [Fuse](#Fuse)
* [Data](#Data)
* [Memory Issues](#Memory-Issues)
@@ -93,11 +94,14 @@ python -m mlx_lm.generate \
--prompt "<your_model_prompt>"
```
-## Fuse and Upload
+## Fuse
You can generate a model fused with the low-rank adapters using the
-`mlx_lm.fuse` command. This command also allows you to upload the fused model
-to the Hugging Face Hub.
+`mlx_lm.fuse` command. This command also allows you to optionally:
+- Upload the fused model to the Hugging Face Hub.
+- Export the fused model to GGUF. Note GGUF support is limited to Mistral,
+  Mixtral, and Llama style models in fp16 precision.
To see supported options run:
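A minimal way to do that, assuming the module exposes the standard argparse `--help` flag:

```shell
python -m mlx_lm.fuse --help
```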
@@ -127,6 +131,17 @@ python -m mlx_lm.fuse \
--hf-path mistralai/Mistral-7B-v0.1
```
+To export a fused model to GGUF, run:
+```shell
+python -m mlx_lm.fuse \
+    --model mistralai/Mistral-7B-v0.1 \
+    --export-gguf
+```
+This will save the GGUF model in `lora_fused_model/ggml-model-f16.gguf`. You
+can specify the file name with `--gguf-path`.
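For instance, a run that writes the fp16 GGUF file under a custom name (the output file name below is illustrative) might look like:

```shell
# Fuse the adapters and export to GGUF with a custom output file name.
python -m mlx_lm.fuse \
    --model mistralai/Mistral-7B-v0.1 \
    --export-gguf \
    --gguf-path mistral-7b-lora-f16.gguf
```

The export is fp16 only, and the resulting file can be loaded by GGUF-compatible runtimes such as llama.cpp.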
## Data
The LoRA command expects you to provide a dataset with `--data`. The MLX