Improve documentation clarity by:
1. Fix return type annotation to correctly reflect GenerationResponse
2. Simplify docstring by referencing GenerationResponse class
3. Remove redundant field descriptions
The `object` field values were "chat.completions" and "chat.completions.chunk"
but should be "chat.completion" and "chat.completion.chunk"
for compatibility with clients expecting an OpenAI API.
In particular, this solves a problem in which aider 0.64.1 reports
hitting a token limit on any completion request, no matter how small,
despite apparently correct counts in the usage property.
Refer to:
https://platform.openai.com/docs/api-reference/chat/object
> object string
> The object type, which is always chat.completion.
https://platform.openai.com/docs/api-reference/chat/streaming
> object string
> The object type, which is always chat.completion.chunk.
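
For illustration, a minimal sketch of how a server might tag its responses with the two object types quoted above. The helper name `make_response` and every field other than `object` are assumptions made for this example, not code from this change.

```python
import time
import uuid

# Object type strings quoted from the OpenAI API reference above.
COMPLETION_OBJECT = "chat.completion"
CHUNK_OBJECT = "chat.completion.chunk"


def make_response(text, stream=False):
    """Build a minimal chat response payload (hypothetical helper)."""
    # Streaming chunks carry text under "delta"; full responses under "message".
    content_key = "delta" if stream else "message"
    return {
        "id": f"chatcmpl-{uuid.uuid4()}",
        "object": CHUNK_OBJECT if stream else COMPLETION_OBJECT,
        "created": int(time.time()),
        "choices": [
            {"index": 0, content_key: {"role": "assistant", "content": text}}
        ],
    }
```

Keeping the two strings as named constants makes a stray plural such as "chat.completions" easier to spot in review.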
* fix rotating kv cache for chat use case
* reorg + fixes to caching, unify prompt caching across types and use cases for e.g. caching during a chat
* nit in chat
* fix tests
* fix tests
* fix tests
* docs
* chat command
* comments + docs
* Define meta_state on all Cache implementations
* fixes + trim_prompt_cache api
* fix default model
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* feat: QDoRA with tests and a small bug fix for recalculation of self.m
* some simplifications and fixes
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Adding full model weights finetuning
* Updating the LORA.md and ACKNOWLEDGMENTS.md files.
* removing --use-dora and --fulll-training and adding --fine-tune-type
* some clean up
* reformatting and fixing dora training
* updated CONFIG_DEFAULTS
* update config example
* update in the config example file
* Update LORA.md
* merge and commit
* adding argument for dora linear layer
* clean up
* clean up in the example yaml file
* fix
* final fix before sending
* small addition to re md file
* fix for loading the fully trained model by saving all the files and configs correctly
* clean up
* removing the unnecessary files
* changing lora layers back to 16
* removed max file size
* nits
* resolve merge
* some consistency changes
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* LoRA: support fine-tuning tools datasets
* LoRA: Split small function
* LoRA: add tools format to lora docs
* LoRA: pre-commit fix
* Revert "LoRA: pre-commit fix"
This reverts commit b94b7e0fe7.
* Revert "LoRA: Split small function"
This reverts commit 3f6a5f19fd.
* LoRA: remove ToolsDataset
In a JSONL file, not all data is required to include the tools value.
* nit in readme
* nit in readme
* nit in readme
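The optional `tools` value mentioned above could be handled when reading a JSONL file roughly as follows. This is a minimal sketch: the `load_records` helper and the `messages` / `tools` key layout are assumptions for illustration, not code from this change.

```python
import json


def load_records(lines):
    """Parse JSONL lines, treating the "tools" key as optional."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        obj = json.loads(line)
        # A record without "tools" is valid; represent it as None.
        records.append((obj["messages"], obj.get("tools")))
    return records


sample = [
    '{"messages": [{"role": "user", "content": "hi"}]}',
    '{"messages": [{"role": "user", "content": "weather?"}], "tools": [{"type": "function"}]}',
]
records = load_records(sample)
```

Because a missing key simply yields `None`, one dataset class can serve files that mix records with and without tools.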
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Add logits_processor option for the generation as in huggingface transformers library
* concatenation correction
* Rename the tokens variable for clarity
* remove the logit_bias argument from generate_step method
* fix the variable name
* nits + test
* test
* add back logit bias + test
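As a rough sketch of the idea behind the logits processor option above, assume each processor is a callable that receives the tokens generated so far and the next-token logits, and returns adjusted logits. That signature, the helper names, and the use of plain Python lists are all assumptions for illustration.

```python
def make_logit_bias_processor(logit_bias):
    """Return a processor that adds a fixed bias to selected token ids."""

    def processor(tokens, logits):
        # tokens: ids generated so far (unused here); logits: one score per token id.
        out = list(logits)
        for token_id, bias in logit_bias.items():
            out[token_id] += bias
        return out

    return processor


def apply_processors(processors, tokens, logits):
    """Run each processor in order on the logits for the next token."""
    for proc in processors:
        logits = proc(tokens, logits)
    return logits
```

Under this framing a logit bias is just one processor among others, so both options can coexist.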
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Add 'models' endpoint to server
* Add test for new 'models' server endpoint
* Check hf_cache for mlx models
* update tests to check hf_cache for models
* simplify test
* doc
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* initial commit
* initial commit
* Adding first lines
* adding x, and dt projection layers
* adding the clamping mechanism
* First successful inference
* last commit for today - added custom generate function and it works as expected, will try training and then with loading a model from the hub
* clean up
* save up
* almost
* update
* update
* fixed cache handling
* fixed loading
* added separate generate_step method in the model and also in the utils to automatically use the generate step method in the model class
* quick update
* still not working
* save
* still not working
* initial commit
* utils.py logits = logits[:, -1, :] TypeError: tuple indices must be integers or slices, not tuple
* update
* update
* Fixing the Batching Depthwise Convolution and multi token input
* fixing generate and logits outputs
* Done!
* Fixing the cache handling, generating works now trying training
* update ACKNOWLEDGEMENTS
* removing the model_type if stuff in the _step loop in generate_step and adding MambaCache in base.py for training easier generations and removing mamba in tuner/utils.
* quick clean up
* update trainer/utils for right initialisation of the layers for LoRA, but not working.
* clean up
* Further update to trainer/utils for correct layer selection. Successful training
* removing extra mamba-infer.py file
* clean up, reformatting will come later
* reformat and big clean up, final commit
* some speedups and cleanups
* fix test
* nits
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Initial commit of --prompt-only and prompt from STDIN feature
* Switch to using --verbose instead of --prompt-only
* Fix capitalization typo
* Fix reference to changed option name
* Update exception text
* Make sure to import the correct "version" module when installing the
mlx_whisper package from local source code.
* Make sure to import the correct "version" module when installing the mlx_lm package from local source code
* fix
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* feat: Nemotron
https://huggingface.co/nvidia/Minitron-4B-Base
This is basically Llama with partial RoPE and LayerNorm instead of
RMSNorm. Also they add 1 to the LayerNorm weight for some reason.
* fixup! feat: Nemotron
* nits
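The "add 1 to the LayerNorm weight" detail above can be sketched in plain Python as follows. The function name `layer_norm_1p`, the `eps` default, and the list-based vectors are assumptions for illustration, not the model code itself.

```python
import math


def layer_norm_1p(x, weight, bias, eps=1e-5):
    """LayerNorm over a vector whose scale is (weight + 1), not weight."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [
        (v - mean) * inv_std * (w + 1.0) + b
        for v, w, b in zip(x, weight, bias)
    ]
```

With this parameterization, a stored weight of zero corresponds to an identity scale.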
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* use fast rope
* fix llama
* use fast rope for llama3.1
* requires unreleased mlx
* fix su
* fix deepseek v2
* only one of base or freqs
* nit
* fix
* hard code freqs
* feat: deepseek v1
DeepSeek is still releasing models on the DeepSeek V1 architecture.
```sh
mlx_lm.convert --hf-path deepseek-ai/DeepSeek-Prover-V1.5-RL --mlx-path DeepSeek-Prover-V1.5-RL-8bit --q-bits 8 -q
mlx_lm.generate --model DeepSeek-Prover-V1.5-RL-8bit --ignore-chat-template --max-tokens 512 --prompt 'import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
/-- The second and fourth terms of a geometric sequence are $2$ and $6$. Which of the following is a possible first term?
Show that it is $\frac{2\sqrt{3}}{3}$.-/
theorem amc12b_2003_p6 (a r : ℝ) (u : ℕ → ℝ) (h₀ : ∀ k, u k = a * r ^ k) (h₁ : u 1 = 2)
(h₂ : u 3 = 6) : u 0 = 2 / Real.sqrt 3 ∨ u 0 = -(2 / Real.sqrt 3) := by'
```
* nits
* nits
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* feature: LoRA adapter for Embeddings
* feature: wire in LoRAEmbedding into the tuner. Allow the embedding and non model.layers Linear layers to be targeted for fine tuning
* feature: DoRA adapter for Embeddings
* feature: wire in DoRAEmbedding
* bugfix: ensure self.m is recalculated when the linear layer is changed in DoRALinear.from_linear
* refactor: prefer from_base over from_linear or from_embedding. prefer fuse over to_linear or to_embedding
* cleanup: remove unused imports in test_dora.py
* refactor: remove unnecessary non_layer_modules
* cleanup: remove wrong comments for lora embedding dropout. remove unnecessary parens in dora embedding dropout
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Predict stop sequence matches during streaming
Check for overlap of stop sequences and the tokens array for potential sequence matches after more tokens get generated. Generate tokens until we can confirm that the stop sequence is not met.
* fix typo
* Change sequence_overlap logic
* range isn't inclusive, add 1 to max_overlap
* Add test_server.py
Added a test for the sequence_overlap method
* nits
* eos sequence
* finalize
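The overlap check described above for stop sequences can be sketched as a suffix/prefix comparison. The name `sequence_overlap` comes from the test note above; the signature and body shown here are a minimal sketch under that assumption, not necessarily the committed code.

```python
def sequence_overlap(s1, s2):
    """Return True if some non-empty suffix of s1 equals a prefix of s2."""
    max_overlap = min(len(s1), len(s2))
    # range() excludes its upper bound, hence max_overlap + 1.
    return any(s1[-i:] == s2[:i] for i in range(1, max_overlap + 1))
```

While this returns True for the generated tokens and some stop sequence, a streaming server would hold back output and keep generating until the match is confirmed or ruled out.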
---------
Co-authored-by: Y4hL <43219534+Y4hL@users.noreply.github.com>
Co-authored-by: Awni Hannun <awni@apple.com>
* Added functionality to load in adapters through post-requests so you do not need to restart the server
* ran pre-commit
* nits
* fix test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Unify attention mask creation in LLMs.
Currently, each model implementation in `mlx-examples/llms/models` has ad-hoc
code to create a mask for the attention mechanism. This usually takes the form:
```
mask = None
if h.shape[1] > 1:
    mask = nn.MultiHeadAttention.create_additive_causal_mask(h.shape[1])
    mask = mask.astype(h.dtype)
```
This correctly creates a mask only if the input consists of more than one token.
But this code assumes the multi-token input is at the beginning of inference.
If, for example, we are evaluating multiple tokens because of speculative
decoding or prompt cache reuse, this mask will not have the correct shape and
will raise an exception in the attention computation.
Some of the models correctly implement the mask creation with code like this:
```
mask = None
if h.shape[1] > 1:
    mask = create_additive_causal_mask(
        h.shape[1], cache[0].offset if cache is not None else 0
    )
    mask = mask.astype(h.dtype)
```
This commit unifies the attention mask creation for all models with a new
function `create_attention_mask`, reducing code duplication and helping all
models support inference performance enhancements like those mentioned above.
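To make the shape issue concrete, here is a dependency-free sketch of an offset-aware additive causal mask. The name `create_attention_mask` is taken from the text above, but the signature (plain integers instead of hidden states and a cache object) and the nested-list representation are assumptions for illustration.

```python
def create_causal_mask(n, offset=0):
    """Additive causal mask of shape (n, offset + n) as nested lists.

    Query i sits at absolute position offset + i and may attend to
    every key position j <= offset + i.
    """
    neg_inf = float("-inf")
    return [
        [0.0 if j <= offset + i else neg_inf for j in range(offset + n)]
        for i in range(n)
    ]


def create_attention_mask(n_tokens, cache_offset=0):
    """Return None for single-token input, else an offset-aware causal mask."""
    if n_tokens <= 1:
        return None
    return create_causal_mask(n_tokens, cache_offset)
```

With two new tokens and three cached positions the mask has shape (2, 5), which matches the keys seen by attention; a mask built without the offset would be (2, 2) and fail to broadcast.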
* Allow batches in LLM key-value cache
The current implementation of the LLM key-value cache assumes that
the input batch is of size 1. Input batching (evaluating multiple
alternative inputs at the same time) can be a valuable tool for
speculative sampling and other techniques.
This change removes the hard-coded batch size from the code that
resizes the key-value cache.
* Simplify causal mask creation
Use the same codepath regardless of whether there's an offset or
not. Addresses [this comment](https://github.com/ml-explore/mlx-examples/pull/911#discussion_r1691459717).
* Use old-style type annotation to avoid linter error
* add dynamicNTK scaling rope
* remove unused var
* fix rope base
* llama3.1 fixes
* TODO for rope eval
* vectorise llama3 base freq calculation
* removed the arbitrary 2.0 rope_scale default case
* fix slow llama3.1 generation by evaluating stateless part of DynamicNTKScalingRoPE in init
* nits + format
* use mx.pi
* fix tests and add test for 3.1
---------
Co-authored-by: Prince Canuma <prince.gdt@gmail.com>
Co-authored-by: Awni Hannun <awni@apple.com>
* Generate response with optional arguments
* Reference response generation example
* Include transformers and sentencepiece
* Update example to run Mistral-7B-Instruct-v0.3
* Link to generation example
* Style changes from pre-commit
* Add logit soft capping to gemma, and fix precision issues
Gemma was babbling nonsense. The cause was missing logit soft capping plus precision issues that produced NaNs, so this adds soft capping and runs more of the inference in float32. gemma-27b-it-4bit now works flawlessly (or near-flawlessly, since there is still no sliding-window attention).
* get rid of comments
* get rid of last comments (sry lol)
* nits
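Logit soft capping, as mentioned above, squashes values smoothly into a bounded range with a scaled tanh. A minimal sketch on plain Python lists; the function name `softcap` and the cap value used in the usage line are assumptions for illustration.

```python
import math


def softcap(logits, cap):
    """Squash each logit smoothly into the interval [-cap, cap]."""
    return [cap * math.tanh(x / cap) for x in logits]


# Small values pass through almost unchanged; large ones saturate near the cap.
capped = softcap([1.0, 1000.0, -1000.0], 30.0)
```

Bounding the logits this way keeps very large values from overflowing in low-precision arithmetic.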
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Add hf_dataset configuration for using HF hub-hosted datasets for (Q)LoRA training
* Pre-commit formatting
* Fix YAML config example
* Print DS info
* Include name
* Add hf_dataset parameter default
* Remove TextHFDataset and CompletionsHFDataset and use Dataset and CompletionsDataset instead, adding a text_key constructor argument to the former (and changing it to work with a provided data structure instead of just from a JSON file), and prompt_key and completion_key arguments to the latter with defaults for backwards compatibility.
* nits
* update docs
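The configurable keys described above can be pictured with the following simplified sketch. The class and argument names follow the text, but the bodies and the default key values are assumptions for illustration; tokenization and any other processing are left out.

```python
class Dataset:
    """Wrap a list of records and expose one text field per item."""

    def __init__(self, data, text_key="text"):
        self._data = data
        self._text_key = text_key

    def __getitem__(self, idx):
        return self._data[idx][self._text_key]

    def __len__(self):
        return len(self._data)


class CompletionsDataset:
    """Wrap prompt/completion records with configurable key names."""

    def __init__(self, data, prompt_key="prompt", completion_key="completion"):
        self._data = data
        self._prompt_key = prompt_key
        self._completion_key = completion_key

    def __getitem__(self, idx):
        record = self._data[idx]
        return record[self._prompt_key], record[self._completion_key]

    def __len__(self):
        return len(self._data)


# Records as they might come from a hub-hosted dataset with custom column names.
rows = [{"question": "2+2?", "answer": "4"}]
pairs = CompletionsDataset(rows, prompt_key="question", completion_key="answer")
```

Taking an in-memory data structure instead of a file path is what lets the same classes serve both local JSON files and hub-hosted datasets.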
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Initial implementation
* Fix handling of return_step_logits in return
* Fixed OpenAI parameter expectations and logprob structure and datatypes
* pre-commit black formatting
* Remove unused parameter
* fix log probs
* fix colorize
* nits in server
* nits in server
* Fix top_logprobs structure (a dict) and include tokens in logprobs response
* nits
* fix types
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Tweaks to run dspy-produced calls to the server, with gemma template.
following comment https://github.com/stanfordnlp/dspy/issues/385#issuecomment-1998939936
can try it out with:
```sh
python -m server --model mlx-community/gemma-1.1-7b-it-4bit --port 11434
```
modulo patching the relative imports in server.py
```
-from .tokenizer_utils import TokenizerWrapper
-from .utils import generate_step, load
+from mlx_lm.tokenizer_utils import TokenizerWrapper
+from mlx_lm.utils import generate_step, load
```
and then, on the dspy side:
```python
import dspy
lm = dspy.OpenAI(model_type="chat", api_base="http://localhost:11434/v1/", api_key="not_needed", max_tokens=250)
lm("hello")
```
* simpler way to validate float or int
* remove logic that works around incompatible templates, too gemma specific
* tweak messages for common denominator
* use generate.py workaround for DBRX
* put behind flag
* oops
* Solution to chat template issue: pass in a custom template!
The template should likely adhere to the OpenAI chat model.
Here is such a template for Gemma.
--chat-template "{{ bos_token }}{% set extra_system = '' %}{% for message in messages %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{% if role == 'system' %}{% set extra_system = extra_system + message['content'] %}{% else %}{% if role == 'user' and extra_system %}{% set message_system = 'System: ' + extra_system %}{% else %}{% set message_system = '' %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message_system + message['content'] | trim + '<end_of_turn>\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"
* remove convoluted solution
* Tweak for when None is provided explicitly, and must be set to [] too.
For example, the outlines library provides None explicitly.
* style
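The explicit-None case above can be handled by normalizing the field before use. A minimal sketch, assuming the request body is a dict and the field in question is named `stop`; the helper name and the field name are assumptions for illustration.

```python
def get_stop_words(body):
    """Return the stop sequences as a list, even if the client sent null."""
    # dict.get's default only applies when the key is absent, so an
    # explicit None still needs the `or []` fallback.
    stop = body.get("stop") or []
    # The OpenAI API allows a single string here; wrap it in a list.
    if isinstance(stop, str):
        stop = [stop]
    return stop
```

This treats a missing key, an explicit null, and an empty list identically, so downstream code can always iterate over the result.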
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Su-RoPE
* nits
* Update su_rope.py
* Update su_rope.py
Per GPT4: "The error TypeError: 'type' object is not subscriptable is caused by using the type hint list[float] in a version of Python that does not support it. This syntax is only available in Python 3.9 and later."
* Ran isort
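For reference, the portable spelling of the type hint quoted above uses `typing.List`, which also works on Python versions before 3.9. The function below is a hypothetical example and not code from `su_rope.py`.

```python
from typing import List


def scale_all(factors: List[float], scale: float) -> List[float]:
    """Multiply every factor by scale (hypothetical helper)."""
    return [f * scale for f in factors]
```

On Python 3.9 and later the built-in `list[float]` is equivalent, so the `typing` form only matters while older interpreters are supported.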
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* GPT-2 model support
* Add test for gpt2 model
* Fix weight sanitizing for quantization
* use approx gelu
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* LoRA: Extract pre_processing_model function
* LoRA: Extract small functions(train_model,evaluate_model)
* move test case to test_tuner_utils.py
* nits
* nits
* remove extra param, validate at it 0
* version
* fix test
---------
Co-authored-by: Awni Hannun <awni@apple.com>