Sindhu Satish
ec06c04f4f
revert revision changes and retain qwen2 support
2025-01-29 07:39:31 -08:00
Sindhu Satish
ba6c7d3aba
Qwen2 support
2025-01-29 07:39:31 -08:00
Sindhu Satish
b0520e7708
Bug fix - Qwen2 support
2025-01-29 06:21:36 -08:00
Sindhu Satish
dd1690df81
bug fix
2025-01-29 06:00:06 -08:00
Sindhu Satish
e89a131668
Include revision version for HF models while loading
2025-01-29 05:53:18 -08:00
Awni Hannun
f44a52e2dc
batched min p and fix spec gen sampling ( #1222 )
2025-01-27 15:40:31 -08:00
Xingjun.Wang
514502da22
Support snapshot_download for ModelScope ( #1194 )
...
* add MLX_USE_MODELSCOPE env
* update
* update snapshot_download
* update
* remove modelscope dependency and add import check
* update
* nits
* fix
---------
Co-authored-by: wangxingjun778 <jason@U-C7X6TX5G-2239.local>
Co-authored-by: Awni Hannun <awni@apple.com>
2025-01-10 15:29:34 -08:00
Awni Hannun
93c5cfd781
Add a speculative decoding generator ( #1155 )
...
* add a speculative decoding generator
* fix
* fixes
* optional kwarg pop
2025-01-10 15:27:08 -08:00
Awni Hannun
5cae0a60e6
deepseek v3 model with pipeline parallelism ( #1191 )
...
* deepseekv3
* use upload_large_file instead of deprecated multi commit
* add pipeline generation and example
* comment
* get fp16 working
* use mlx==0.22
2025-01-09 15:55:53 -08:00
Pedro Cuenca
b8f0cacfa8
Use upload_large_folder ( #1193 )
2025-01-07 09:18:31 -08:00
Awni Hannun
c4833a2f55
fix encoding with special tokens + chat template ( #1189 )
2025-01-03 10:50:59 -08:00
Prince Canuma
dfa4dd6c93
Add support for cohere2 ( #1157 )
...
* add support for cohere2
* revert to act_fn to silu
* fix tests and sliding window attention
* add tests
* add to tuner
* fix sliding window
* add coauthor :)
Co-authored-by: n8programs <43304488+N8python@users.noreply.github.com>
* Add rotating kvcache to save space
* some nits
* style
* nits
---------
Co-authored-by: n8programs <43304488+N8python@users.noreply.github.com>
Co-authored-by: N8 <n8@n8programs.com>
Co-authored-by: Awni Hannun <awni@apple.com>
2024-12-16 08:01:03 -08:00
Awni Hannun
2ba0e36683
[mlx-lm] Use top p in server ( #1144 )
...
* use top p in server
* couple other fixes
2024-12-12 11:12:21 -08:00
madroid
06af3c9b0e
Add finish_reason in GenerationResponse ( #1153 )
2024-12-12 10:37:40 -08:00
madroid
12083c4b7e
Support for multiple EOS tokens ( #1141 )
...
* Support for multiple EOS tokens
* Change _eos_token_ids type from list to set
* Remove model_config & add eos_token_id
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2024-12-09 08:53:58 -08:00
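The "Support for multiple EOS tokens" change above stores stop ids in a set (`_eos_token_ids`) instead of a single `eos_token_id`. A minimal sketch of that idea, with a hypothetical `StopChecker` class (not from the actual codebase) standing in for the tokenizer-side logic:

```python
class StopChecker:
    """Track multiple end-of-sequence token ids in a set.

    Sketch only: mirrors the commit's change from a single eos_token_id
    to an _eos_token_ids set; names here are illustrative.
    """

    def __init__(self, eos_token_ids):
        # A set gives O(1) membership checks per generated token.
        self._eos_token_ids = set(eos_token_ids)

    def is_stop(self, token_id):
        # Generation halts if the sampled token is any of the EOS ids.
        return token_id in self._eos_token_ids
```

Checking against a set rather than a list keeps the per-token cost constant even when a model declares several stop tokens.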
Alex Barron
2211b27388
Mixed Quantizations ( #1132 )
...
* saving/loading mixed quantizations
* comment
* add bits per weight
* more concise bpw
* count bias too
2024-12-08 14:21:50 -08:00
Awni Hannun
1963df8565
Allow prompt callback to generate_step ( #1133 )
...
* allow prompt callback and use in cache_prompt
* nit
* comments
* bump version
2024-12-03 16:17:14 -08:00
Neil Mehta
cefe793ae0
Accept mx.array type for prompt argument for stream_generate ( #1125 )
...
* Accept mx.array type for prompt argument for stream_generate
* Fix formatting
2024-11-26 16:51:55 -08:00
Awni Hannun
cfc29c29f4
Put prompt processing in same stream ( #1122 )
...
* put prompt processing in same stream
* patch
2024-11-25 09:47:00 -08:00
madroid
a5e173802e
docs: update stream_generate return type annotation ( #1121 )
...
Improve documentation clarity by:
1. Fix return type annotation to correctly reflect GenerationResponse
2. Simplify docstring by referencing GenerationResponse class
3. Remove redundant field descriptions
2024-11-25 08:10:14 -08:00
Awni Hannun
0f135396ae
Generation refactor: part 2 ( #1099 )
...
* unify with stream_generate
* fixes
* nit
* some cleanup, warnings, tests
* fix test + faster min p + test
* version
2024-11-23 11:47:06 -08:00
Alban Lecocq
bd6d910ca3
[MLX LM] Fix f-string formatting in memory warning message ( #1105 )
...
* Fix missing f-prefix for string interpolation in model size warning
* Ensures proper display of memory values in MB for model and max size
2024-11-13 06:14:03 -08:00
Awni Hannun
657b4cc0aa
[MLX LM] Sampler refactor + a few improvements ( #1094 )
...
* starting
* refactor sampler/processor and a few improvements
* fix stream
* fix stream generate
* fix eos handling in stream generate
2024-11-07 16:15:24 -08:00
ilyasch2
3b526f0aa1
Add support for falcon-mamba ( #1074 )
...
* Add support for falcon-mamba
* nits
* nit
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2024-11-04 12:23:30 -08:00
Awni Hannun
e510987870
Clear cache every now and then ( #1081 )
...
* clear cache every now and then
* don't need user arg anymore
2024-11-01 14:15:32 -07:00
Alex Barron
85ffd2c96a
Quantized KV Cache ( #1075 )
...
* add QuantizedKVCache
* simplify
* add tests
* single sdpa function
* fix sed
* in place
* fix tests
* support different k and v head dims
2024-10-31 16:59:52 -07:00
Awni Hannun
9f34fdbda4
Wire models in MLX LM ( #1069 )
...
* wired in MLX LM
* fix synch
* comment + nit
* version
* mlx lm version
* bump to 0.19.2
2024-10-31 08:17:14 -07:00
Awni Hannun
66e7bcb886
override dtype with quant ( #1062 )
2024-10-22 09:56:45 -07:00
Awni Hannun
c799133998
Make llm async eval less brittle ( #1040 )
...
* Make llm async eval less brittle
* nit
2024-10-14 10:25:24 -07:00
Awni Hannun
4360e7ccec
clear cache during prompt processing ( #1027 )
2024-10-09 16:48:32 -07:00
Awni Hannun
b7373cb44f
fix long prompt generations ( #1023 )
2024-10-09 11:09:36 -07:00
Awni Hannun
fca087be49
More cache improvements ( #1015 )
...
* fix rotating kv cache for chat use case
* reorg + fixes to caching, unify prompt caching across types and use cases for e.g. caching during a chat
* nit in chat
* fix tests
* fix tests
* fix tests
* docs
* chat command
* comments + docs
* Define meta_state on all Cache implementations
* fixes + trim_prompt_cache api
* fix default model
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
2024-10-07 20:45:51 -07:00
nathan
0866e23a67
repetition_penalty and logits_bias just using logits_processors ( #1004 )
...
* refactor of repetition_penalty and logits_bias to use logits_processor
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-30 08:49:03 -07:00
Gökdeniz Gülmez
50e5ca81a8
Adding full finetuning ( #903 )
...
* Adding full model weights finetuning
* Updating the LORA.md and ACKNOWLEDGMENTS.md files.
* removing --use-dora and --full-training and adding --fine-tune-type
* some clean up
* reformatting and fixing dora training
* updated CONFIG_DEFAULTS
* update config example
* update in the config example file
* Update LORA.md
* merge and commit
* adding argument for dora linear layer
* clean up
* clean up in the example yaml file
* fix
* final fix before sending
* small addition to the README file
* fix for loading the fully trained model by saving all the files and configs correctly
* clean up
* removing the unnecessary files
* changing lora layers back to 16
* removed max file size
* nits
* resolve merge
* some consistency changes
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-29 17:12:47 -07:00
nathan
ace2bb5890
Add logits_processor option to generate_step function ( #983 )
...
* Add logits_processor option for the generation as in huggingface transformers library
* concatenation correction
* Rename the tokens variable for clarity
* remove the logit_bias argument from generate_step method
* fix the variable name
* nits + test
* test
* add back logit bias + test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-28 10:08:49 -07:00
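The "Add logits_processor option to generate_step" entry above, together with the later refactor that reimplements repetition_penalty and logits_bias on top of logits_processors, relies on processors being callables of the form `(generated_tokens, logits) -> logits`. A minimal sketch of one such processor, a repetition penalty, using plain Python lists for clarity (the real code operates on mx.array logits, and the function names here are illustrative):

```python
def repetition_penalty_processor(penalty=1.2):
    """Build a logits processor that discourages already-generated tokens.

    Sketch only: follows the (generated_tokens, logits) -> logits callable
    shape described in the commits; not the actual mlx-lm implementation.
    """

    def process(tokens, logits):
        out = list(logits)
        for t in set(tokens):
            # Shrink positive logits and amplify negative ones for tokens
            # that have already appeared, making them less likely.
            out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
        return out

    return process
```

Because every transformation shares one callable signature, repetition penalty, logit bias, and user-supplied processors can all be chained in a single list and applied in order at each decoding step.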
Awni Hannun
f530f56df2
don't use internal exception ( #990 )
2024-09-17 16:22:48 -07:00
Awni Hannun
6c2369e4b9
Fix bug in upload + docs nit ( #981 )
...
* fix bug in upload + docs nit
* nit
2024-09-07 14:46:57 -07:00
Awni Hannun
c3e3411756
Update LLM generation docs to use chat template ( #973 )
...
* fix docs
* add template to model cards as well
* revert
* version
2024-09-07 06:06:15 -07:00
madroid
bd29aec299
Support HuggingFace model tree ( #957 )
...
* Hub: Update quantization configuration fields
* Hub: add base_model metadata
* Hub: add quantization_config for model tree Quantized type
* Hub: update quantization_config value
* Hub: remove config print
2024-09-04 06:19:32 -07:00
Angelos Katharopoulos
1003a8b2dd
Add the ability to load the KV cache from a file ( #956 )
2024-08-28 22:11:45 -07:00
Awni Hannun
7be292c0c9
Handle longer prompt/generation ( #931 )
...
* rebase
* nits
* nit
* fix rotating cache with step prefill
* update version
2024-08-16 15:28:39 -07:00
Chime Ogbuji
c50971e860
Min P implementation ( #926 )
...
* Min P implementation
* Change default to 0 (no min_p)
* nits
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2024-08-15 15:45:02 -07:00
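The "Min P implementation" entry above adds a sampler that keeps only tokens whose probability is at least `min_p` times the top token's probability, then samples from the renormalized survivors (with a default of 0, i.e. disabled). A pure-Python sketch of that idea, assuming this filtering rule; the real code works on mx.array logits and is compiled:

```python
import math
import random


def min_p_sample(logits, min_p=0.1, seed=0):
    """Sample a token id under min-p filtering. Illustrative sketch only."""
    # Numerically stable softmax over the raw logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep tokens with probability >= min_p * p_max.
    threshold = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= threshold]
    # Renormalize over the survivors and draw one of them.
    norm = sum(p for _, p in kept)
    r = random.Random(seed).random() * norm
    acc = 0.0
    for i, p in kept:
        acc += p
        if r <= acc:
            return i
    return kept[-1][0]
```

Unlike top-p, the cutoff scales with the model's confidence: when one token dominates, nearly everything else is filtered; when the distribution is flat, most tokens survive.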
Awni Hannun
9b83004631
Faster sampling with mx.compile ( #937 )
...
* faster sampling with compile
* fix test
2024-08-15 11:29:09 -07:00
Awni Hannun
95840f32e2
Fix whisper conversion for safetensors models ( #935 )
...
* fix whisper conversion for safetensors-only models; error in mlx-lm for existing paths
* fix tests
2024-08-14 10:22:04 -07:00
Anchen
7a3ab1620a
support loading models via custom get_model_classes ( #899 )
...
* feature(mlx_lm): support load model by custom get classes
* rename the param
2024-07-25 11:01:17 -07:00
Awni Hannun
20e221f7f7
Add recurrent gemma ( #856 )
...
* add recurrent gemma
* fix window cache
2024-07-07 12:10:04 -07:00
Chime Ogbuji
1d701a1831
Logprobs info to completion API ( #806 )
...
* Initial implementation
* Fix handling of return_step_logits in return
* Fixed OpenAI parameter expectations and logprob structure and datatypes
* pre-commit black formatting
* Remove unused parameter
* fix log probs
* fix colorize
* nits in server
* nits in server
* Fix top_logprobs structure (a dict) and include tokens in logprobs response
* nits
* fix types
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-23 10:35:13 -07:00
Michał Kurc
43d6deb3c1
mlx_lm: Add Streaming Capability to Generate Function ( #807 )
...
* Add streaming feature to text generation function
* separate stream and regular functions
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-03 09:04:39 -07:00
Awni Hannun
ca7ce60c91
Rename block sparse to gather ( #793 )
...
* rename block sparse to gather
* pin mlx version
2024-05-23 19:47:35 -07:00
Angelos Katharopoulos
9f671228cd
Block sparse MM MoEs ( #782 )
...
- Adds SwitchLinear
- Adds QuantizedSwitchLinear
2024-05-21 15:58:08 -07:00