Commit Graph

309 Commits

Author SHA1 Message Date
Awni Hannun
6731254e76
Use fast rope (#945)
* use fast rope

* fix llama

* use fast rope for llama3.1

* requires unreleased mlx

* fix su

* fix deepseek v2

* only one of base or freqs

* nit

* fix

* hard code freqs
2024-08-23 13:18:51 -07:00
Awni Hannun
58591a1b41
fine tune deepseek (#932) 2024-08-22 10:41:21 -07:00
L
0164d2058b
feat: DeepSeek MoE v1 (#942)
* feat: deepseek v1

DeepSeek is still releasing models on the DeepSeek V1 architecture.

```sh
mlx_lm.convert --hf-path deepseek-ai/DeepSeek-Prover-V1.5-RL --mlx-path DeepSeek-Prover-V1.5-RL-8bit --q-bits 8 -q
mlx_lm.generate --model DeepSeek-Prover-V1.5-RL-8bit --ignore-chat-template --max-tokens 512 --prompt 'import Mathlib
import Aesop

set_option maxHeartbeats 0

open BigOperators Real Nat Topology Rat

/-- The second and fourth terms of a geometric sequence are $2$ and $6$. Which of the following is a possible first term?
Show that it is $\frac{2\sqrt{3}}{3}$.-/
theorem amc12b_2003_p6 (a r : ℝ) (u : ℕ → ℝ) (h₀ : ∀ k, u k = a * r ^ k) (h₁ : u 1 = 2)
  (h₂ : u 3 = 6) : u 0 = 2 / Real.sqrt 3 ∨ u 0 = -(2 / Real.sqrt 3) := by'
```

* nits

* nits

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-08-17 07:18:09 -07:00
Awni Hannun
7be292c0c9
Handle longer prompt/generation (#931)
* rebase

* nits

* nit

* fix rotating cache with step prefill

* update version
2024-08-16 15:28:39 -07:00
Zai Thottakath
4e01700816
Allow the entire model to be targed for LoRA and DoRA fine tuning: LoRA and DoRA embeddings with small DoRALinear bug fix (#914)
* feature: LoRA adapter for Embeddings

* feature: wire in LoRAEmbedding into the tuner. Allow the embedding and non model.layers Linear layers to be targeted for fine tuning

* feature: DoRA adapter for Embeddings

* feature: wire in DoRAEmbedding

* bugfix: ensure self.m is recalculated when the linear layer is changed in DoRALinear.from_linear

* refactor: prefer from_base over from_linear or from_embedding. prefer fuse over to_linear or to_embedding

* cleanup: remove unused imports in test_dora.py

* refactor: remove unnecessary non_layer_modules

* cleanup: remove wrong comments for lora embedding dropout. remove uncessary parens in dora embedding dropout

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-08-16 07:38:36 -07:00
Chime Ogbuji
c50971e860
Min P implementation (#926)
* Min P implementation

* Change default to 0 (no min_p)

* nits

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-08-15 15:45:02 -07:00
Awni Hannun
9b83004631
Faster sampling with mx.compile (#937)
* faster sampling with compile

* fix test
2024-08-15 11:29:09 -07:00
Awni Hannun
95840f32e2
Fix whipser conversion for safetensors models (#935)
* fix whipser conversion for safetensor only. error in mlx lm for existing paths

* fix tests
2024-08-14 10:22:04 -07:00
Awni Hannun
33905447f9
Whisper updates to allow HF models (#923)
* simplify conversion and update convert for HF models

* use npz for compat

* fixes

* fixes

* fix gguf

* allow user supplied path
2024-08-09 11:11:58 -07:00
tidely
df744c98e6
Predict stop sequence matches during streaming (#541)
* Predict stop sequence matches during streaming

Check for overlap of stop sequences and the tokens array for potential sequence matches after more tokens get generated. Generate tokens until we can confirm that the stop sequence is not met.

* fix typo

* Change sequence_overlap logic

* range isn't inclusive, add 1 to max_overlap

* Add test_server.py

Added a test for the sequence_overlap method

* nits

* eos sequence

* finalize

---------

Co-authored-by: Y4hL <43219534+Y4hL@users.noreply.github.com>
Co-authored-by: Awni Hannun <awni@apple.com>
2024-08-06 15:24:15 -07:00
Khush Gupta
8fa12b0058
Adapters loading (#902)
* Added functionality to load in adapters through post-requests so you do not need to restart the server

* ran pre-commit

* nits

* fix test

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-08-01 16:18:18 -07:00
madroid
85dc76f6e0
Server: support stream_options (#913)
* Server: support stream_options

see https://x.com/OpenAIDevs/status/1787573348496773423

* Server: support stream_options

* Server: check None type
2024-07-26 08:58:52 -07:00
otriscon
46da74fea2
Unify attention mask in LLMs (#911)
* Unify attention mask creation in LLMs.

Currently, each model implementation in `mlx-examples/llms/models` has ad-hoc
code to create a mask for the attention mechanism. This usually takes the form:

```
    mask = None
    if h.shape[1] > 1:
        mask = nn.MultiHeadAttention.create_additive_causal_mask(h.shape[1])
        mask = mask.astype(h.dtype)
```

This correctly creates a mask only if the input consists of more than one token.
But this code assumes the multi-token input is at the beginning of inference.
If, for example, we are evaluating multiple tokens because of speculative
decoding or prompt cache reuse, this mask will not have the correct shape and
and will cause the raising of an exception in the attention computation.

Some of the models correctly implement the mask creation with code like this:

```
    mask = None
    if h.shape[1] > 1:
        mask = create_additive_causal_mask(
            h.shape[1], cache[0].offset if cache is not None else 0
        )
        mask = mask.astype(h.dtype)
```

This commit unifies the attention mask creation for all models with a new
function `create_attention_mask`, reducing code duplication and helping all
models support inference performance enhancements like those mentioned above.

* Allow batches in LLM key-value cache

The current implementation of the LLM key-value cache assumes that
the input batch is of size 1. Input batching (evaluating multiple
alterative inputs at the same time) can be a valuable tool for
speculative sampling and other techniques.

This change removes the hard-coded batch size from the code that
resizes the key-value cache.

* Simplify causal mask creation

Use the same codepath regardless of whether there's an offset or
not. Addresses [this comment](https://github.com/ml-explore/mlx-examples/pull/911#discussion_r1691459717).

* Use old-style type annotation to avoid linter error
2024-07-25 16:45:22 -07:00
Anchen
7a3ab1620a
support load model by custom get_model_classes (#899)
* feature(mlx_lm): support load model by custom get classes

* rename the param
2024-07-25 11:01:17 -07:00
Alex Cheema
cd8efc7fbc
Add support for Llama-3.1 (#907)
* add dynamicNTK scaling rope

* remove unused var

* fix rope base

* llama3.1 fixes

* TODO for rope eval

* vectorise llama3 base freq calculation

* removed the arbitrary 2.0 rope_scale default case

* fix slow llama3.1 generation by evaluating stateless part of DynamicNTKScalingRoPE in init

* nits + format

* use mx.pi

* fix tests and add test for 3.1

---------

Co-authored-by: Prince Canuma <prince.gdt@gmail.com>
Co-authored-by: Awni Hannun <awni@apple.com>
2024-07-23 13:21:32 -07:00
Prince Canuma
3f337e0f0a
Add Mistral NeMo (fix) (#895)
* fix head_dim

* Update llms/mlx_lm/models/llama.py

* fix kv error

* formatting

* Delete test.py

---------

Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
2024-07-22 06:09:24 -07:00
Prince Canuma
3d365b612a
Add support for InternLM-2.5 (#871)
* fix internlm-2

* formatting

* add dynamic ntk rope

* formatting

* move dynamic scaling rope to intermlm2.py

* add default max_position_embeddings
2024-07-17 16:38:22 -07:00
Anchen
561dcf5643
Add support for deepseek coder v2 lite (#882)
* feat: add support for deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct

* fix softmax + some cleanup

* more nits

* fix rope

* fix original_max_position_embeddings in rope

* fix original_max_position_embeddings in rope config

* add group greedy

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-07-17 07:23:28 -07:00
Awni Hannun
f0c6c6e226
keep the server in a valid state (#889) 2024-07-15 18:35:36 -07:00
JosefAlbers
bfc1f2763b
longrope (#886) 2024-07-12 07:19:11 -07:00
Chime Ogbuji
8bf397e450
Pass use_dora parameter to linear_to_lora_layers (#885) 2024-07-11 14:34:34 -07:00
nicolov
fbe3247772
Add GPT-neox model (#863) 2024-07-11 06:13:17 -07:00
Alex Wozniakowski
63800c8feb
Example of response generation with optional arguments (#853)
* Generate response with optional arguments

* Reference response generation example

* Include transformers and sentencepiece

* Update example to run Mistral-7B-Instruct-v0.3

* Link to generation example

* Style changes from pre-commit
2024-07-09 06:49:59 -07:00
Awni Hannun
68e88d42fb
Fix server for openai package (#877)
* fix

* fixes for 9b
2024-07-08 12:34:31 -07:00
Awni Hannun
20e221f7f7
Add recurrent gemma (#856)
* add recurrent gemma

* fix window cache
2024-07-07 12:10:04 -07:00
n8programs
1e05aef344
Add logit soft capping to gemma, and fix precision issues (#857)
* Add logit soft capping to gemma, and fix precision issues

Gemma was babbling nonsense - so I figured out it was due to not having logit softcapping and precision issues causing NaNs (so I implemented the softcapping and added more float32 inference). gemma-27b-it-4bit now works flawlessly (or near-flawlessly, no sliding-window attention).

* get rid of comments

* get rid of last comments (sry lol)

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-07-02 07:52:39 -07:00
Angelos Katharopoulos
f212b770d8
Server loads the model on demand from the request (#851) 2024-06-27 11:37:57 -07:00
Awni Hannun
538339b599
gemma2 (#855) 2024-06-27 10:06:28 -07:00
Awni Hannun
9f10728145
fix yi (#852) 2024-06-27 06:38:19 -07:00
Chime Ogbuji
df6bc09d74
Configuration-based use of HF hub-hosted datasets for training (#701)
* Add hf_dataset configuration for using HF hub-hosted datasets for (Q)LoRA training

* Pre-commit formatting

* Fix YAML config example

* Print DS info

* Include name

* Add hf_dataset parameter default

* Remove TextHFDataset and CompletionsHFDataset and use Dataset and CompletionsDataset instead, adding a text_key constructor argument to the former (and changing it to work with a provided data structure instead of just from a JSON file), and prompt_key and completion_key arguments to the latter with defaults for backwards compatibility.

* nits

* update docs

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-26 10:20:50 -07:00
Chime Ogbuji
1d701a1831
Logprobs info to completion API (#806)
* Initial implementation

* Fix handling of return_step_logits in return

* Fixed OpenAI parameter expectations and logprob structure and datatypes

* pre-commit black formatting

* Remove unused parameter

* fix log probs

* fix colorize

* nits in server

* nits in server

* Fix top_logprobs structure (a dict) and include tokens in logprobs response

* nits

* fix types

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-23 10:35:13 -07:00
Yi Wang
a7598e9456
Fix mypy errors with models/{qwen2,qwen2_moe,startcoder2}.py (#835)
* Fix starcoder.py

* Fix qwen2

* Remvoe unnecessary assert not None
2024-06-14 09:44:50 -07:00
Awni Hannun
d8b073e3a7
Add eos token to lora fine-tunes (#818)
* add eos token to lora fine-tunes

* Comment
2024-06-12 07:44:21 -07:00
Nada Amin
3cc58e17fb
Tweaks to run dspy-produced calls to the server, with gemma template. (#810)
* Tweaks to run dspy-produced calls to the server, with gemma template.

following comment https://github.com/stanfordnlp/dspy/issues/385#issuecomment-1998939936

can try it out with:
```sh
python -m server --model mlx-community/gemma-1.1-7b-it-4bit --port 1143
```
modulo patching the relative imports in server.py
```
-from .tokenizer_utils import TokenizerWrapper
-from .utils import generate_step, load
+from mlx_lm.tokenizer_utils import TokenizerWrapper
+from mlx_lm.utils import generate_step, load
```

and then, ont the dspy side:
```python
import dspy
lm = dspy.OpenAI(model_type="chat", api_base="http://localhost:11434/v1/", api_key="not_needed", max_tokens=250)
lm("hello")
```

* simpler way to validate float or int

* remove logic that works around incompatible templates, too gemma specific

* tweak messages for common denominator

* use generate.py workaround for DBXR

* put behind flag

* oops

* Solution to chat template issue: pass in a custom template!

The template should likely adhere to the OpenAI chat model.
Here is such a template for Gemma.

--chat-template "{{ bos_token }}{% set extra_system = '' %}{% for message in messages %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{% if role == 'system' %}{% set extra_system = extra_system + message['content'] %}{% else %}{% if role == 'user' and extra_system %}{% set message_system = 'System: ' + extra_system %}{% else %}{% set message_system = '' %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message_system + message['content'] | trim + '<end_of_turn>\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"

* remove convoluted solution

* Tweak for when None is provided explicitly, and must be set to [] too.

For example, the outlines library provides None explicitly.

* style

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-12 07:17:06 -07:00
Yi Wang
6da07fb1b0
make models/phi3.py and models/phi3small.py compatible with mypy (#833) 2024-06-12 06:53:55 -07:00
JosefAlbers
fda41545a6
Su-RoPE(Rotary Position Embedding) for Phi-3 (#813)
* Su-RoPE

* nits

* Update su_rope.py

* Update su_rope.py

Per GPT4: "The error TypeError: 'type' object is not subscriptable is caused by using the type hint list[float] in a version of Python that does not support it. This syntax is only available in Python 3.9 and later."

* Ran isort

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-11 06:20:04 -07:00
Yi Wang
a54dfd698e
Correct the type annotation of cache in llama.py (#828)
* Update

* Fix isort
2024-06-10 15:18:34 -07:00
Yi Wang
bb8227f181
Correct type annotation of llama.ModelArgs.num_key_value_heads (#827) 2024-06-10 14:47:31 -07:00
Robin Glauser
4872727f14
Fixing "NameError: name 'resume_adapter_file' is not defined" (#817)
args. is missing from resume_adapter_file so the name is not defined.
2024-06-05 10:07:31 -07:00
Michał Kurc
43d6deb3c1
mlx_lm: Add Streaming Capability to Generate Function (#807)
* Add streaming feature to text generation function

* separate stream and regular functions

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-03 09:04:39 -07:00
Derek Lewis
89b0b75250
GPT2 Support (#798)
* GPT-2 model support

* Add test for gpt2 model

* Fix weight sanitizing for quantization

* use approx gelu

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-02 16:33:20 -07:00
madroid
c457a3f88b
LoRA: Extract small function (#614)
* LoRA: Extract pre_processing_model  function

* LoRA: Extract small functions(train_model,evaluate_model)

* move test case to test_tuner_utils.py

* nits

* nits

* remove extra param, validate at it 0

* version

* fix test

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-02 06:38:42 -07:00
Awni Hannun
81318ad4a8
Port of phi3small (#794)
* start port of phi3small

* fix phi3

* use block sparsity

* compile activation

* nits in readme / mlx lm version
2024-05-31 12:54:14 -07:00
Awni Hannun
09aaeac72c
fix moe conversion (#802) 2024-05-31 12:36:05 -07:00
Behnam Moh
f49c5f2829
fixed the requirements (#803) 2024-05-29 06:14:19 -07:00
Chen Xin
aac98ca6f4
support internlm2 (#797)
* support internlm2

* only attention projections

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-05-27 06:22:21 -07:00
Awni Hannun
ca7ce60c91
Rename block sparse to gather (#793)
* rename block sparse to gather

* pin mlx version
2024-05-23 19:47:35 -07:00
Prince Canuma
69700d8431
Add support for Phi-3 Medium (#790)
* update to support phi-3 medium

* fuse qkv split
2024-05-22 16:47:06 -07:00
Prince Canuma
b044ce2acf
Add support for ibm granite (#758)
* add support for granite 3-8B config

* add gpt_bigcode

* add positional embedding condition.

* add support for granite 3-8B config

* add gpt_bigcode

* add positional embedding condition.

* remove unused function

* rebase fix

* move position emebedding to mask creation

* add to tuner and format

* add support for granite 3-8B config

* add gpt_bigcode

* add positional embedding condition.

* add support for granite 3-8B config

* add gpt_bigcode

* add positional embedding condition.

* rebase fix

* move position emebedding to mask creation

* add to tuner and format

* refactor mask

* remove dropout layers
2024-05-21 20:16:31 -07:00
Awni Hannun
9fc6efbd90
version bump + some fixes (#792) 2024-05-21 20:09:35 -07:00
Angelos Katharopoulos
9f671228cd
Block sparse MM MoEs (#782)
- Adds SwitchLinear
- Adds QuantizedSwitchLinear
2024-05-21 15:58:08 -07:00
AtakanTekparmak
199df9e110
fix: Added dedicated error handling to load and get_model_path (#775)
* fix: Added dedicated error handling to load and get_model_path

Added proper error handling to load and get_model_path by adding a dedicated exception class, because when the local path is not right, it still throws the huggingface RepositoryNotFoundError

* fix: Changed error message and resolved lack of import

* fix: Removed redundant try-catch block

* nits in message

* nits in message

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-05-20 06:39:05 -07:00
alexC-nonsense4k
42458914c8
support dora finetune in mlx-examples/llms/mlx_lm (#779)
* support dora finetune

* solve problems in lora.py and tuner.utils.py

* add use_dora (bool) in functions of load adapters

* delete all unsupported quantization code and fix all the calculate problems in mlx_lm/tuner/dora.py

* Using stop_gradient to prevent gradients from flowing through ‘norm’ during backpropagation

* set DEFAULT_USE_DORA in mlx_lm/generate.py

* add annotation for all the use_dora

* mlx_lm/fuse.py support fuse dora layers and fix a bug of to_linear() in mlx_lm/tuner/dora.py

* simplify code of juding type of a fused layer in mlx_lm/fuse.py

* add use_dora in mlx_lm/fuse.py when apply_lora_layers()

* style + nits

* style + nits

* more updates

---------

Co-authored-by: chenyifei08 <chenyifei08@baidu.com>
Co-authored-by: Awni Hannun <awni@apple.com>
2024-05-16 08:21:26 -07:00
Awni Hannun
69181e0058
Support non incremental kv cache growth (#766) 2024-05-15 12:56:24 -07:00
JosefAlbers
10853b57d9
Add model_config parameter to load() and load_model() (#770)
* Add `model_config` parameter to `load()` and `load_model()`

For easy editing of the loaded model configuration (e.g., for changing RoPE theta or scaling of Phi-3 model)

Example:

```python
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Phi-3-mini-4k-instruct-4bit-no-q-embed", model_config={"rope_theta":50000.0})
response = generate(model, tokenizer, prompt, max_tokens=MAX_TOKENS)
```

* Possible bug (default_loss)

* Revert "Possible bug (default_loss)"

This reverts commit 70a55ace18.

* Fix default_loss for lora

* 1. move load_model's new optional `model_config` arg to the end (fetch_from_hub()'s `model = load_model(model_path, lazy)`) 2. fix indentations (`black` hook)
2024-05-10 10:13:34 -07:00
Awni Hannun
6f0a69e682
fix lora for openelm (#773) 2024-05-10 09:51:41 -07:00
Awni Hannun
fad9598372
Fix llama cache check (#763)
* fix llama cache check

* add test
2024-05-08 08:35:54 -07:00
Awni Hannun
ee60e2a9d5
Kv cache (#643)
* in place kv_cache

* fix

* fix kv cache size

* partially fix kv cache dtype

* step kv cache

* multiple of step size

* more teests + kv cache

* more kv cache

* udpate all models to use kv cache
2024-05-08 08:18:13 -07:00
Kevin Wang
c0019c4908
Pad mask with zeros for non-square attention matrices (#715)
* Pad mask with zeros for non-square attention matrices

The current implementation of the mask assumes the attention matrix is square, which is true if there is no cache. However, if one wishes to produce multiple tokens at a time, such as in speculative decoding implementations, a rectangular mask is necessary.

This change pads the bottom of the mask with zeros so multi-token decoding with a cache works correctly.

* Directly create mask instead of padding

* Update llama.py
2024-05-04 16:32:25 -07:00
Anchen
f30413b63c
chore(mlx-lm): fix the number of validation batches configuration. (#752)
* chore: fix number of validation batches

* clean up

* address comment
2024-05-04 06:52:42 -07:00
Awni Hannun
2bf11c4633
Use stable url for MNIST (#749)
* use stable url

* remove deprecated flag
2024-05-03 17:13:05 -07:00
Konstantin Kerekovski
d1c35fa684
Add MLX Cache Limit setting for mlx_lm.generate and mlx_lm.server CLI (#744)
* Add support for setting MLX cache limit in GB

* Add support for setting MLX cache limit in GB in mlx_lm.server

* format

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-05-03 12:42:48 -07:00
Ivan Fioravanti
b468091f7f
Add model management functionality for local caches (#736)
* Add model management functionality for local caches

This commit introduces a set of command-line utilities for managing MLX models downloaded and saved locally in Hugging Face cache. The functionalities include scanning existing models, retrieving detailed information about a specific model, and deleting a model by its name.

* Added mlx_lm.model to setup.py

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-05-03 12:20:13 -07:00
Awni Hannun
92430df0a0
Fix lora for qwen moe (#743)
* fix lora for qwen moe

* use max seq length in test as well
2024-05-02 21:55:09 -07:00
madroid
5079af62db
Update model card describe (#654)
* Update model card describe

- Add full link jump
- Add the address of the model uploader's Hugging Face homepage

* Add user_info to reduce whoami calls

* Remove the -U argument

* remove HF user info

* run pre-commit
2024-05-02 21:22:04 -07:00
madroid
6775d6cb3f
Whisper: Add pip distribution configuration to support pip installations. (#739)
* Whisper: rename whisper to mlx_whisper

* Whisper: add setup.py config for publish

* Whisper: add assets data to setup config

* Whisper: pre-commit for setup.py

* Whisper: Update README.md

* Whisper: Update README.md

* nits

* fix package data

* nit in readme

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-05-01 09:00:02 -07:00
Karim Elmaaroufi
4bf2eb17f2
Validate server params & fix logit bias bug (#731)
* Bug fix in logit bias

* Add parameter validations

* Fix typo

* Update docstrings to match MLX styling

* Black style + fix a validation bug
2024-04-30 07:27:40 -07:00
Jaward Sesay
7c0962f4e2
Add Supported Quantized Phi-3-mini-4k-instruct gguf Weight (#717)
* support for phi-3 4bits quantized gguf weights

* Added link to 4 bits quantized model

* removed some prints

* Added correct comment

* Added correct comment

* removed print

Since last condition already prints warning for when quantization is None
2024-04-29 20:11:32 -07:00
Thomas Lazarus
5513c4e57d
Fixes Typo in Starcoder2 (#740) 2024-04-29 13:14:45 -07:00
Javier de la Rosa
510d2bde49
Force multi_commits when uploading to HF (#729) 2024-04-28 19:07:17 -07:00
锦此
699de35b03
Update lora_config.yaml (#735)
Update LoRa config YAML, replacing the adapter file argument with the adapter path argument.
2024-04-28 10:24:34 -07:00
Prince Canuma
c012eb173f
Add support for OpenELM (#719)
* add openELM

* update splitting logic

* update qkv logic and, transformer and MLP block

* code formatting and fix args

* fix array slicing and remove unused var :)

* add to tuner

* use mx.split for slicing qkv

* merge with phi3

* remove rope scaling logic

* code formatting
2024-04-25 16:49:28 -07:00
Gökdeniz Gülmez
2c1c9e9024
MiniCPM implementation (#685)
* Added support for the MiniCPM architecture

* Added support for the MiniCPM architecture

* Updated utils.py and LORA.md

* Updated utils.py and LORA.md

* Update implementation details for MiniCPM architecture

* Cleaning up

* fixed the missing lm.head layer problem

* Refactor Model class to dynamically handle tied and untied word embeddings

* Quick update

* added a dynamic rope scaling base calucaltion

* Added support for the MiniCPM architecture

* Added support for the MiniCPM architecture

* Updated utils.py and LORA.md

* Updated utils.py and LORA.md

* Update implementation details for MiniCPM architecture

* Cleaning up

* fixed the missing lm.head layer problem

* Refactor Model class to dynamically handle tied and untied word embeddings

* added a dynamic rope scaling base calucaltion

* quick fix and clean up

* clean up again

* removed the MiniCPMNorm class as its not used

* forgot something, sorry

* format

* version bump

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-04-25 15:29:28 -07:00
Awni Hannun
685012c2ad
Couple fixes for LoRA (#711)
* don't overwrite in test only mode

* only load model specific safetensors
2024-04-25 14:16:13 -07:00
Kristian Muñiz
109ee2f2f8
Use CORS headers for streaming for MLX Server (#716) 2024-04-25 07:26:04 -07:00
Kevin Wang
8a265f0d54
Fix incorrect type annotation (#720)
A `Tuple` is missing in this type annotation.
2024-04-24 15:52:43 -07:00
Prince Canuma
abcd891851
Add support for phi-3 (#712)
* Add phi-3 modelling

* fix rope scaling warning

* add tests and update tuner utils

* update name and remove sanitize

* fix lora
2024-04-23 09:20:00 -07:00
Aaron Ng
8d5cf5b0c8
use logging in mlx server (#705) 2024-04-22 07:50:06 -07:00
Anchen
749cabf299
fix: unicode decoding (#702) 2024-04-21 08:58:23 -07:00
Karim Elmaaroufi
1484598de1
Add support for logit bias (#697) 2024-04-21 06:53:56 -07:00
Awni Hannun
6abdbe3be8
Fix quant in gguf (#698)
* fix quant in gguf

* fix whisper
2024-04-19 20:07:11 -07:00
Awni Hannun
574ad7f6fe
fix dequantization (#693) 2024-04-19 10:46:59 -07:00
Awni Hannun
2146bcd7ee
Quantize embedding / Update quantize API (#680)
* more async eval

* quantize embedding / update quantize api

* more updates for quantize

* update for quantize embeddings

* update sd quant API

* update sdxl quants

* error for datasets < batch_size

* async

* fix config loading

* fix quant

* fix tests

* fix req

* remove lm head if tie weights is true

* fix test
2024-04-18 18:16:10 -07:00
Anchen
f5f189e48a
fix(mlx-lm): broken server.py (#690)
* fix server.py

* fix var referenced before assignment

* add test

* clean up
2024-04-18 14:26:18 -07:00
Phúc H. Lê Khắc
35206806ac
Create executables for generate, lora, server, merge, convert (#682)
* feat: create executables mlx_lm.<cmd>

* nits in docs

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-04-16 16:08:49 -07:00
dmdaksh
7d7e236061
- Removed unused Python imports (#683)
- bert/model.py:10: tree_unflatten
  - bert/model.py:2: dataclass
  - bert/model.py:8: numpy
  - cifar/resnet.py:6: Any
  - clip/model.py:15: tree_flatten
  - clip/model.py:9: Union
  - gcn/main.py:8: download_cora
  - gcn/main.py:9: cross_entropy
  - llms/gguf_llm/models.py:12: tree_flatten, tree_unflatten
  - llms/gguf_llm/models.py:9: numpy
  - llms/mixtral/mixtral.py:12: tree_map
  - llms/mlx_lm/models/dbrx.py:2: Dict, Union
  - llms/mlx_lm/tuner/trainer.py:5: partial
  - llms/speculative_decoding/decoder.py:1: dataclass, field
  - llms/speculative_decoding/decoder.py:2: Optional
  - llms/speculative_decoding/decoder.py:5: mlx.nn
  - llms/speculative_decoding/decoder.py:6: numpy
  - llms/speculative_decoding/main.py:2: glob
  - llms/speculative_decoding/main.py:3: json
  - llms/speculative_decoding/main.py:5: Path
  - llms/speculative_decoding/main.py:8: mlx.nn
  - llms/speculative_decoding/model.py:6: tree_unflatten
  - llms/speculative_decoding/model.py:7: AutoTokenizer
  - llms/tests/test_lora.py:13: yaml_loader
  - lora/lora.py:14: tree_unflatten
  - lora/models.py:11: numpy
  - lora/models.py:3: glob
  - speechcommands/kwt.py:1: Any
  - speechcommands/main.py:7: mlx.data
  - stable_diffusion/stable_diffusion/model_io.py:4: partial
  - whisper/benchmark.py:5: sys
  - whisper/test.py:5: subprocess
  - whisper/whisper/audio.py:6: Optional
  - whisper/whisper/decoding.py:8: mlx.nn
2024-04-16 07:50:32 -07:00
Angelos Katharopoulos
e55a9e8cb4
Add an SPM detokenizer that doesn't trim initial space (#681) 2024-04-15 14:15:25 -07:00
Awni Hannun
d3f8e4aee9
Fix argpartition call in Mixtral and other MOES (#676)
* Update mixtral.py

* fix all moes

---------

Co-authored-by: yuhai-china <yuhai.china@gmail.com>
2024-04-12 11:00:56 -07:00
Awni Hannun
9c5554d8ee
Use async eval (#670)
* Use async eval

* bump

* bump

* remove workaround for bfloat cumsum
2024-04-11 13:18:23 -07:00
devonthomas35
9f472dc985
Update transformers for ⌘-R+ (#668) 2024-04-11 07:28:12 -07:00
da-z
5a4cad34ef
Always resume downloads (#674)
* Always resume downloads

* format

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-04-11 06:52:32 -07:00
Angelos Katharopoulos
1278994b56
Add streaming detokenizers (#651) 2024-04-08 22:36:01 -07:00
Awni Hannun
c68aa3c7c3
Stable lm 2 (#666)
* stable lm 2

* test and lora

* version bump

* merge stable models
2024-04-08 14:18:55 -07:00
Awni Hannun
1e2f7f50b6
fix for empty initial string (#665) 2024-04-08 10:40:05 -07:00
Awni Hannun
c386dd5f5a
Fix for cohere plus (#650)
* fix for cohere plus

* version bump
2024-04-05 14:11:24 -07:00
Awni Hannun
2bd64b78cf
Save lora config (#636)
* lora config

* comments

* version bump
2024-04-02 13:52:53 -07:00
Prince Canuma
d661440dbb
Add support for qwen2moe (#640)
* add sparsemoe block and update decoder logic

* update file name to match HF

* update name

* Code formatting

* update gates calculation

* add support for Qwen2MoE.

* fix pytest

* code formatting and fix missing comma in utils

* Remove decoder sparse step.

Co-authored-by: bozheng-hit <dsoul0621@gmail.com>

* remove gate layer anti-quantisation

* remove unused argument

---------

Co-authored-by: bozheng-hit <dsoul0621@gmail.com>
2024-04-02 11:33:29 -07:00
Awni Hannun
78c431dc25
cleanup whisper a little (#639) 2024-03-30 13:13:58 -07:00
Chime Ogbuji
f6283ef7ce
Configurable LR schedulers (#604)
* Initial config handler and test

* Added means to run from CLI

* Update lora config loading and tests

* Constrain scheduler config (warmup and minimum LR) for each kind

* Update reference to moved schedule_config module

* Minor fix

* Fix typos

* Moved build_schedule and tests

* nits in schedule config

* flake

* fix path

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-03-29 13:41:10 -07:00
Awni Hannun
b80adbcc3e
DBRX (#628)
* dbrx

* format

* format

* comments

* change scores slightly

* remove inadvertant import
2024-03-28 21:03:53 -07:00