feat(mlx-lm): export the GGUF (fp16) format model weights from fuse.py (#555)

* wip

* wip

* feat: convert mlx model to gguf f16

* chore: convert norm layer to float32 to avoid overflow issue

* chore: add support for mixtral

* chore: clean up

* chore: remove unused import statement

* chore: clean up weight name mapping

* version and readme

* actual version bump

---------

Co-authored-by: Awni Hannun <awni@apple.com>
Anchen authored 2024-03-22 04:34:11 +11:00, committed by GitHub
parent 8f906c859a
commit fe96ef342f
4 changed files with 351 additions and 6 deletions
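The commit message above summarizes the feature at a high level; roughly, the intended workflow is to fine-tune LoRA adapters and then fuse and export them to GGUF. A minimal sketch of that flow, assuming the default adapter file locations and a hypothetical dataset path:

```shell
# Fine-tune LoRA adapters (hypothetical dataset path), then fuse them into the
# base model and export the result as an fp16 GGUF file.
python -m mlx_lm.lora \
    --model mistralai/Mistral-7B-v0.1 \
    --train \
    --data ./my_data

python -m mlx_lm.fuse \
    --model mistralai/Mistral-7B-v0.1 \
    --export-gguf
```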


@@ -9,6 +9,7 @@ LoRA (QLoRA).[^qlora] LoRA fine-tuning works with the following model families:
- Phi2
- Mixtral
- Qwen2
- Gemma
- OLMo
## Contents
@@ -17,7 +18,7 @@ LoRA (QLoRA).[^qlora] LoRA fine-tuning works with the following model families:
* [Fine-tune](#Fine-tune)
* [Evaluate](#Evaluate)
* [Generate](#Generate)
-* [Fuse and Upload](#Fuse-and-Upload)
+* [Fuse](#Fuse)
* [Data](#Data)
* [Memory Issues](#Memory-Issues)
@@ -93,11 +94,14 @@ python -m mlx_lm.generate \
--prompt "<your_model_prompt>"
```
-## Fuse and Upload
+## Fuse
You can generate a model fused with the low-rank adapters using the
-`mlx_lm.fuse` command. This command also allows you to upload the fused model
-to the Hugging Face Hub.
+`mlx_lm.fuse` command. This command also allows you to optionally:
+- Upload the fused model to the Hugging Face Hub.
+- Export the fused model to GGUF. Note GGUF support is limited to Mistral,
+  Mixtral, and Llama style models in fp16 precision.
To see supported options run:
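A minimal way to do that, assuming the module exposes the standard argparse `--help` flag:

```shell
python -m mlx_lm.fuse --help
```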
@@ -127,6 +131,17 @@ python -m mlx_lm.fuse \
--hf-path mistralai/Mistral-7B-v0.1
```
+To export a fused model to GGUF, run:
+```shell
+python -m mlx_lm.fuse \
+    --model mistralai/Mistral-7B-v0.1 \
+    --export-gguf
+```
+This will save the GGUF model in `lora_fused_model/ggml-model-f16.gguf`. You
+can specify the file name with `--gguf-path`.
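For instance, a run that writes the fp16 GGUF file under a custom name (the output file name below is illustrative) might look like:

```shell
# Fuse the adapters and export to GGUF with a custom output file name.
python -m mlx_lm.fuse \
    --model mistralai/Mistral-7B-v0.1 \
    --export-gguf \
    --gguf-path mistral-7b-lora-f16.gguf
```

The export is fp16 only, and the resulting file can be loaded by GGUF-compatible runtimes such as llama.cpp.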
## Data
The LoRA command expects you to provide a dataset with `--data`. The MLX