Commit Graph

296 Commits

Author SHA1 Message Date
Anchen
82e3338987 chore(mlx-lm): add max token arg for mlx_lm.chat (#1089)
* chore(mlx-lm): add max token arg for mlx_lm.chat

* chore: update the default max token value
2024-11-04 06:06:34 -08:00
Angelos Katharopoulos
331148d8ec Enable distributed LoRA training (#821) 2024-11-02 18:02:31 -07:00
Awni Hannun
0f799947d0 fix (#1079) 2024-11-01 16:30:32 -07:00
Awni Hannun
e510987870 Clear cache every now and then (#1081)
* clear cache every now and then

* don't need user arg anymore
2024-11-01 14:15:32 -07:00
Alex Barron
85ffd2c96a Quantized KV Cache (#1075)
* add QuantizedKVCache

* simplify

* add tests

* single sdpa function

* fix sed

* in place

* fix tests

* support different k and v head dims
2024-10-31 16:59:52 -07:00
Awni Hannun
9f34fdbda4 Wire models in MLX LM (#1069)
* wired in MLX LM

* fix synch

* comment + nit

* version

* mlx lm version

* bump to 0.19.2
2024-10-31 08:17:14 -07:00
Awni Hannun
8fe9539af7 Fix detokenizer space match for quote (#1072)
* fix + test

* remove transformer flax/torch warning

* format
2024-10-27 15:06:07 -07:00
hschaeufler
ab4bf05c6e Update lora_config.yaml with new param: num_layers (#1068) 2024-10-26 09:34:46 -07:00
Awni Hannun
9000e280ae fix mamba models conversion (#1065) 2024-10-22 15:44:08 -07:00
madroid
d1d480867b LoRA: update tools datasets docs (#1063)
* LoRA: update tools datasets docs

* nits

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-10-22 12:19:11 -07:00
Awni Hannun
66e7bcb886 override dtype with quant (#1062) 2024-10-22 09:56:45 -07:00
aronson
743763bc2e Handle empty string case in maybe_trim_space (#1055)
* Handle empty string case in maybe_trim_space

* nit

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-10-20 20:46:43 -07:00
Awni Hannun
605c4854f1 Prompt caching in mlx_lm.server (#1026)
* caching in server

* nits

* fix tests

* don't throw if no metal

* comments
2024-10-14 10:57:22 -07:00
Awni Hannun
8dca1a2f60 Tokenizer updates + tests (#1024)
* tokenizer updates + tests

* nit

* add can_trim_prompt_cache

* nits
2024-10-14 10:48:46 -07:00
Awni Hannun
c799133998 Make llm async eval less brittle (#1040)
* Make llm async eval less brittle

* nit
2024-10-14 10:25:24 -07:00
Shunta Saito
7612c646f3 Fix PLaMo model to support Grouped Query Attention (#1037) 2024-10-12 15:26:50 -07:00
Awni Hannun
4360e7ccec clear cache during prompt processing (#1027) 2024-10-09 16:48:32 -07:00
Awni Hannun
b7373cb44f fix long prompt generations (#1023) 2024-10-09 11:09:36 -07:00
Awni Hannun
fca087be49 More cache improvements (#1015)
* fix rotating kv cache for chat use case

* reorg + fixes to caching, unify prompt caching across types and use cases for e.g. caching during a chat

* nit in chat

* fix tests

* fix tests

* fix tests

* docs

* chat command

* comments + docs

* Define meta_state on all Cache implementations

* fixes + trim_prompt_cache api

* fix default model

---------

Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
2024-10-07 20:45:51 -07:00
madroid
36c1d8e8dc Server: support function calling (#1003) 2024-10-02 12:36:07 -07:00
nathan
0866e23a67 repetiton_penalty and logits_bias just using logits_processors (#1004)
* refactor of repetition_penalty and logits_bias to use logits_processor

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-30 08:49:03 -07:00
Zai Thottakath
418d9a5511 Feature: QDoRA (#891)
* feat: QDoRA with tests and a small bug fix for recalculation of self.m

* some simplifications and fixes

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-30 08:01:11 -07:00
madroid
aa1c8abdc6 LoRA: Support HuggingFace dataset via data parameter (#996)
* LoRA: support huggingface dataset via `data` argument

* LoRA: Extract the load_custom_hf_dataset function

* LoRA: split small functions

* fix spelling errors

* handle load hf dataset error

* fix pre-commit lint

* update data argument help

* nits and doc

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-30 07:36:21 -07:00
Gökdeniz Gülmez
50e5ca81a8 Adding full finetuning (#903)
* Adding full model weights finetuning

* Updating the LORA.md and ACKNOWLEDGMENTS.md files.

* removing --use-dora and --fulll-training and adding --fine-tune-type

* some clean up

* reformating and fixing dora training

* updated CONFIG_DEFAULTS

* update config example

* update in the config example fie

* Update LORA.md

* merge and commit

* adding argument for dora linear layer

* clean up

* clean up in the example yaml file

* fix

* final fix before sending

* small addition to re md file

* fix for loading the fully trained model by saving all the files and configs correctly

* clean up

* removing the unnesesairy files

* changing lora layers back to 16

* removed max file size

* nits

* resolve merge

* some consistency changes

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-29 17:12:47 -07:00
madroid
7ec2021bb9 LoRA: support tools(function calling) format datasets (#995)
* LoRA: support fine-tuning tools datasets

* LoRA: Split small function

* LoRA: add tools format to lora docs

* LoRA: pre-commit fix

* Revert "LoRA: pre-commit fix"

This reverts commit b94b7e0fe7.

* Revert "LoRA: Split small function"

This reverts commit 3f6a5f19fd.

* LoRA: remove ToolsDataset

In a JSONL file, not all data is required to include the tools value.

* nit in readme

* nit in readme

* nit in readme

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-28 10:41:36 -07:00
nathan
ace2bb5890 Add logits_processor option to generate_step function (#983)
* Add logits_processor option for the generation as in huggingface transformers library

* concatenation correction

* Rename the tokens variable for clarity

* remove the logit_bias argument from generate_step method

* fix the variable name

* nits + test

* test

* add back logit bias + test

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-28 10:08:49 -07:00
jamesm131
d812516d3d Add /v1/models endpoint to mlx_lm.server (#984)
* Add 'models' endpoint to server

* Add test for new 'models' server endpoint

* Check hf_cache for mlx models

* update tests to check hf_cache for models

* simplify test

* doc

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-28 07:21:11 -07:00
Gökdeniz Gülmez
76710f61af Adding support for mamba (#940)
* initial commit

* initial commit

* Adding first lines

* adding x, and dt projection layers

* adding the clamping mechanism

* First succesful inference

* last commit for today - added custom geenrate function and it works as expected, will try training and then with loading a model from the hub

* clean up

* save up

* almost

* update

* update

* fixed cache handeling

* fixed loading

* added seperate generat_step method in the model and also in the utils to automaticaly use the generate step mthod in the model class

* quick update

* still not working

* save

* still not working

* initial commit

* utils.py logits = logits[:, -1, :] TypeError: tuple indices must be integers or slices, not tuple

* update

* update

* Fixing the Batching Depfwise Comnvolution and multi token input

* fixing generate and logits outputs

* Done!

* Fixing the cache handling, generating works now trying training

* update ACKNOWLEDGEMENTS

* removing the model_type if stuff in the _step loop in generate_step and adding MambaCache in base.py for training easier generations and removing mamba in tuner/utils.

* quick clean up

* update trainer/utils for right initialisation of the layers for LoRA, but not working.

* clean up

* Forther update to trainer/utils for correct layer selection. Successfull training

* removing extra mamba-infer.py file

* clean up, reformating will come later

* reformat and big clean up, final commit

* some speedups and cleanups

* fix test

* nits

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-28 07:02:53 -07:00
Angelos Katharopoulos
796d5e40e4 Fix export to gguf (#993) 2024-09-20 13:33:45 -07:00
Awni Hannun
f530f56df2 don't use internal exception (#990) 2024-09-17 16:22:48 -07:00
Awni Hannun
6c2369e4b9 Fix bug in upload + docs nit (#981)
* fix bug in upload + docs nit

* nit
2024-09-07 14:46:57 -07:00
Awni Hannun
c3e3411756 Update LLM generation docs to use chat template (#973)
* fix docs

* add template to model cards as well

* revert

* version
2024-09-07 06:06:15 -07:00
Angelos Katharopoulos
324184d670 Fix the cache_prompt (#979) 2024-09-06 20:19:27 -07:00
madroid
bd29aec299 Support HuggingFace model tree (#957)
* Hub: Update quantization configuration fields

* Hub: add base_model metadata

* Hub: add quantization_config for model tree Quantized type

* Hub: update quantization_config value

* Hub: remove config print
2024-09-04 06:19:32 -07:00
Chime Ogbuji
83a209e200 Add prompt piping (#962)
* Initial commit of --prompt-only and prompt from STDIN feature

* Switch to using --verbose instead of --prompt-only

* Fix capitalization typo

* Fix reference to changed option name

* Update exception text
2024-09-03 13:29:10 -07:00
James Zhao
bf921afcbe Make sure to import the correct "version" module when installing mlx_whisper and mlx_lm from local source code. (#969)
* Make sure to import the correct "version" module when installing the
mlx_whisper package from local source code.

* Make sure to import the correct "version" module when installing the mlx_lm package from local source code

* fix

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-03 13:16:21 -07:00
Awni Hannun
3c6e8b11af fix (#965) 2024-08-30 05:56:27 -07:00
L
fc93c55723 feat(mlx_lm): Nemotron (#949)
* feat: Nemotron

https://huggingface.co/nvidia/Minitron-4B-Base

This is basically Llama with partial RoPE and LayerNorm instead of
BatchNorm. Also they add 1 to the LayerNorm weight for some reason.

* fixup! feat: Nemotron

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-08-29 21:08:57 -07:00
Awni Hannun
b1186e2a81 Docs on prompt scaling (#963)
* docs on prompt scaling

* remove unused var

* nits
2024-08-29 15:05:17 -07:00
Angelos Katharopoulos
1003a8b2dd Add the ability to load the KV cache from a file (#956) 2024-08-28 22:11:45 -07:00
Angelos Katharopoulos
7f8c961287 Fix setattr for the TokenizerWrapper (#961) 2024-08-28 14:47:33 -07:00
Prince Canuma
b5e18ef1e3 Add Phi-3.5-MoE (#946)
* add phimoe

* add phimoe to tunner

* add switch_mlp

* fix SuScaled args

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-08-24 06:52:33 -07:00
Awni Hannun
6731254e76 Use fast rope (#945)
* use fast rope

* fix llama

* use fast rope for llama3.1

* requires unreleased mlx

* fix su

* fix deepseek v2

* only one of base or freqs

* nit

* fix

* hard code freqs
2024-08-23 13:18:51 -07:00
Awni Hannun
58591a1b41 fine tune deepseek (#932) 2024-08-22 10:41:21 -07:00
L
0164d2058b feat: DeepSeek MoE v1 (#942)
* feat: deepseek v1

DeepSeek is still releasing models on the DeepSeek V1 architecture.

```sh
mlx_lm.convert --hf-path deepseek-ai/DeepSeek-Prover-V1.5-RL --mlx-path DeepSeek-Prover-V1.5-RL-8bit --q-bits 8 -q
mlx_lm.generate --model DeepSeek-Prover-V1.5-RL-8bit --ignore-chat-template --max-tokens 512 --prompt 'import Mathlib
import Aesop

set_option maxHeartbeats 0

open BigOperators Real Nat Topology Rat

/-- The second and fourth terms of a geometric sequence are $2$ and $6$. Which of the following is a possible first term?
Show that it is $\frac{2\sqrt{3}}{3}$.-/
theorem amc12b_2003_p6 (a r : ℝ) (u : ℕ → ℝ) (h₀ : ∀ k, u k = a * r ^ k) (h₁ : u 1 = 2)
  (h₂ : u 3 = 6) : u 0 = 2 / Real.sqrt 3 ∨ u 0 = -(2 / Real.sqrt 3) := by'
```

* nits

* nits

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-08-17 07:18:09 -07:00
Awni Hannun
7be292c0c9 Handle longer prompt/generation (#931)
* rebase

* nits

* nit

* fix rotating cache with step prefill

* update version
2024-08-16 15:28:39 -07:00
Zai Thottakath
4e01700816 Allow the entire model to be targed for LoRA and DoRA fine tuning: LoRA and DoRA embeddings with small DoRALinear bug fix (#914)
* feature: LoRA adapter for Embeddings

* feature: wire in LoRAEmbedding into the tuner. Allow the embedding and non model.layers Linear layers to be targeted for fine tuning

* feature: DoRA adapter for Embeddings

* feature: wire in DoRAEmbedding

* bugfix: ensure self.m is recalculated when the linear layer is changed in DoRALinear.from_linear

* refactor: prefer from_base over from_linear or from_embedding. prefer fuse over to_linear or to_embedding

* cleanup: remove unused imports in test_dora.py

* refactor: remove unnecessary non_layer_modules

* cleanup: remove wrong comments for lora embedding dropout. remove uncessary parens in dora embedding dropout

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-08-16 07:38:36 -07:00
Chime Ogbuji
c50971e860 Min P implementation (#926)
* Min P implementation

* Change default to 0 (no min_p)

* nits

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-08-15 15:45:02 -07:00
Awni Hannun
9b83004631 Faster sampling with mx.compile (#937)
* faster sampling with compile

* fix test
2024-08-15 11:29:09 -07:00
Awni Hannun
95840f32e2 Fix whipser conversion for safetensors models (#935)
* fix whipser conversion for safetensor only. error in mlx lm for existing paths

* fix tests
2024-08-14 10:22:04 -07:00