* Add pfnet/plamo-2-1b
* Fix cache.py to support non-top level layers
* Use mlx's BaseModelArgs
* Fix model
* Use sanitize()
* Remove unnecessary changes
* Add plamo2.py
* Apply formatter
* Fix some part
* Allow a cache obj defined externally
* Fix channel first weights to channel last for right use of MLX's conv1d
* Remove unused code part
* Give all inputs when it's the first time call of model
* Fix import
* Include .jsonl files to download from Huggingface hub
* Fix reference to layers
* Remove unnecessary code and add a test for plamo2
* Do not pass mask to prepare_inputs_for_generation
* Fix to use repeat instead of tile
* Add state property to PlamoCache
* Add __iter__ and __next__ methods to PlamoCache
* cleanup
* cleanup
* fix
---------
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* deepseekv3
* use upload_large_file instead of deprecated multi comit
* add pipeline generation and example
* comment
* get fp16 working
* use mlx==0.22
* Adds EXAONE architecture.
* nits + format
* format
* clean up and fix rope
* clean up and fix rope
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* fix rotating kv cache for chat use case
* reorg + fixes to caching, unify prompt caching across types and use cases for e.g. caching during a chat
* nit in chat
* fix tests
* fix tests
* fix tests
* docs
* chat command
* comments + docs
* Define meta_state on all Cache implementations
* fixes + trim_prompt_cache api
* fix default model
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* initial commit
* initial commit
* Adding first lines
* adding x, and dt projection layers
* adding the clamping mechanism
* First succesful inference
* last commit for today - added custom geenrate function and it works as expected, will try training and then with loading a model from the hub
* clean up
* save up
* almost
* update
* update
* fixed cache handeling
* fixed loading
* added seperate generat_step method in the model and also in the utils to automaticaly use the generate step mthod in the model class
* quick update
* still not working
* save
* still not working
* initial commit
* utils.py logits = logits[:, -1, :] TypeError: tuple indices must be integers or slices, not tuple
* update
* update
* Fixing the Batching Depfwise Comnvolution and multi token input
* fixing generate and logits outputs
* Done!
* Fixing the cache handling, generating works now trying training
* update ACKNOWLEDGEMENTS
* removing the model_type if stuff in the _step loop in generate_step and adding MambaCache in base.py for training easier generations and removing mamba in tuner/utils.
* quick clean up
* update trainer/utils for right initialisation of the layers for LoRA, but not working.
* clean up
* Forther update to trainer/utils for correct layer selection. Successfull training
* removing extra mamba-infer.py file
* clean up, reformating will come later
* reformat and big clean up, final commit
* some speedups and cleanups
* fix test
* nits
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* feat: Nemotron
https://huggingface.co/nvidia/Minitron-4B-Base
This is basically Llama with partial RoPE and LayerNorm instead of
BatchNorm. Also they add 1 to the LayerNorm weight for some reason.
* fixup! feat: Nemotron
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* use fast rope
* fix llama
* use fast rope for llama3.1
* requires unreleased mlx
* fix su
* fix deepseek v2
* only one of base or freqs
* nit
* fix
* hard code freqs
* feat: deepseek v1
DeepSeek is still releasing models on the DeepSeek V1 architecture.
```sh
mlx_lm.convert --hf-path deepseek-ai/DeepSeek-Prover-V1.5-RL --mlx-path DeepSeek-Prover-V1.5-RL-8bit --q-bits 8 -q
mlx_lm.generate --model DeepSeek-Prover-V1.5-RL-8bit --ignore-chat-template --max-tokens 512 --prompt 'import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
/-- The second and fourth terms of a geometric sequence are $2$ and $6$. Which of the following is a possible first term?
Show that it is $\frac{2\sqrt{3}}{3}$.-/
theorem amc12b_2003_p6 (a r : ℝ) (u : ℕ → ℝ) (h₀ : ∀ k, u k = a * r ^ k) (h₁ : u 1 = 2)
(h₂ : u 3 = 6) : u 0 = 2 / Real.sqrt 3 ∨ u 0 = -(2 / Real.sqrt 3) := by'
```
* nits
* nits
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Unify attention mask creation in LLMs.
Currently, each model implementation in `mlx-examples/llms/models` has ad-hoc
code to create a mask for the attention mechanism. This usually takes the form:
```
mask = None
if h.shape[1] > 1:
mask = nn.MultiHeadAttention.create_additive_causal_mask(h.shape[1])
mask = mask.astype(h.dtype)
```
This correctly creates a mask only if the input consists of more than one token.
But this code assumes the multi-token input is at the beginning of inference.
If, for example, we are evaluating multiple tokens because of speculative
decoding or prompt cache reuse, this mask will not have the correct shape and
and will cause the raising of an exception in the attention computation.
Some of the models correctly implement the mask creation with code like this:
```
mask = None
if h.shape[1] > 1:
mask = create_additive_causal_mask(
h.shape[1], cache[0].offset if cache is not None else 0
)
mask = mask.astype(h.dtype)
```
This commit unifies the attention mask creation for all models with a new
function `create_attention_mask`, reducing code duplication and helping all
models support inference performance enhancements like those mentioned above.
* Allow batches in LLM key-value cache
The current implementation of the LLM key-value cache assumes that
the input batch is of size 1. Input batching (evaluating multiple
alterative inputs at the same time) can be a valuable tool for
speculative sampling and other techniques.
This change removes the hard-coded batch size from the code that
resizes the key-value cache.
* Simplify causal mask creation
Use the same codepath regardless of whether there's an offset or
not. Addresses [this comment](https://github.com/ml-explore/mlx-examples/pull/911#discussion_r1691459717).
* Use old-style type annotation to avoid linter error
* add dynamicNTK scaling rope
* remove unused var
* fix rope base
* llama3.1 fixes
* TODO for rope eval
* vectorise llama3 base freq calculation
* removed the arbitrary 2.0 rope_scale default case
* fix slow llama3.1 generation by evaluating stateless part of DynamicNTKScalingRoPE in init
* nits + format
* use mx.pi
* fix tests and add test for 3.1
---------
Co-authored-by: Prince Canuma <prince.gdt@gmail.com>
Co-authored-by: Awni Hannun <awni@apple.com>
* Add logit soft capping to gemma, and fix precision issues
Gemma was babbling nonsense - so I figured out it was due to not having logit softcapping and precision issues causing NaNs (so I implemented the softcapping and added more float32 inference). gemma-27b-it-4bit now works flawlessly (or near-flawlessly, no sliding-window attention).
* get rid of comments
* get rid of last comments (sry lol)
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>