* Add logit soft capping to gemma, and fix precision issues
Gemma was generating nonsense. The cause was missing logit soft capping plus precision issues that produced NaNs, so this adds soft capping and runs more of inference in float32. gemma-27b-it-4bit now works flawlessly (or near-flawlessly, since there is no sliding-window attention yet).
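For reference, logit soft capping is commonly written as `cap * tanh(logits / cap)`, which squashes values smoothly instead of clipping them. A minimal pure-Python sketch; the `softcap` helper and the cap value of 30.0 are illustrative assumptions, not taken from this change:

```python
import math

def softcap(logits, cap):
    """Smoothly bound each logit to roughly [-cap, cap] using tanh."""
    return [cap * math.tanh(x / cap) for x in logits]

# Extreme values saturate near +/-cap; small values pass through almost unchanged.
capped = softcap([-1000.0, -1.0, 0.0, 1.0, 1000.0], cap=30.0)
```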
* get rid of comments
* get rid of last comments (sry lol)
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Add hf_dataset configuration for using HF hub-hosted datasets for (Q)LoRA training
* Pre-commit formatting
* Fix YAML config example
* Print DS info
* Include name
* Add hf_dataset parameter default
* Remove TextHFDataset and CompletionsHFDataset and use Dataset and CompletionsDataset instead, adding a text_key constructor argument to the former (and changing it to work with a provided data structure instead of just from a JSON file), and prompt_key and completion_key arguments to the latter with defaults for backwards compatibility.
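A rough sketch of the resulting shape, assuming in-memory records and default key names of "text", "prompt" and "completion" (the defaults and the method bodies are assumptions for illustration, not the actual implementation):

```python
from typing import Dict, List, Tuple

class Dataset:
    """Wraps a list of records and reads the text from a configurable key."""

    def __init__(self, data: List[Dict[str, str]], text_key: str = "text"):
        self._data = data
        self._text_key = text_key

    def __getitem__(self, idx: int) -> str:
        return self._data[idx][self._text_key]

    def __len__(self) -> int:
        return len(self._data)

class CompletionsDataset:
    """Wraps prompt/completion records with configurable key names."""

    def __init__(
        self,
        data: List[Dict[str, str]],
        prompt_key: str = "prompt",
        completion_key: str = "completion",
    ):
        self._data = data
        self._prompt_key = prompt_key
        self._completion_key = completion_key

    def __getitem__(self, idx: int) -> Tuple[str, str]:
        record = self._data[idx]
        return record[self._prompt_key], record[self._completion_key]

    def __len__(self) -> int:
        return len(self._data)

# A hub dataset with non-default column names only needs different keys.
qa = CompletionsDataset(
    [{"question": "2+2?", "answer": "4"}],
    prompt_key="question",
    completion_key="answer",
)
```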
* nits
* update docs
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Initial implementation
* Fix handling of return_step_logits in return
* Fixed OpenAI parameter expectations and logprob structure and datatypes
* pre-commit black formatting
* Remove unused parameter
* fix log probs
* fix colorize
* nits in server
* nits in server
* Fix top_logprobs structure (a dict) and include tokens in logprobs response
* nits
* fix types
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Tweaks to run dspy-produced calls to the server, with gemma template.
following comment https://github.com/stanfordnlp/dspy/issues/385#issuecomment-1998939936
can try it out with:
```sh
python -m server --model mlx-community/gemma-1.1-7b-it-4bit --port 11434
```
modulo patching the relative imports in server.py
```
-from .tokenizer_utils import TokenizerWrapper
-from .utils import generate_step, load
+from mlx_lm.tokenizer_utils import TokenizerWrapper
+from mlx_lm.utils import generate_step, load
```
and then, on the dspy side:
```python
import dspy
lm = dspy.OpenAI(model_type="chat", api_base="http://localhost:11434/v1/", api_key="not_needed", max_tokens=250)
lm("hello")
```
* simpler way to validate float or int
* remove logic that works around incompatible templates, too gemma specific
* tweak messages for common denominator
* use generate.py workaround for DBRX
* put behind flag
* oops
* Solution to chat template issue: pass in a custom template!
The template should likely adhere to the OpenAI chat model.
Here is such a template for Gemma.
--chat-template "{{ bos_token }}{% set extra_system = '' %}{% for message in messages %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{% if role == 'system' %}{% set extra_system = extra_system + message['content'] %}{% else %}{% if role == 'user' and extra_system %}{% set message_system = 'System: ' + extra_system %}{% else %}{% set message_system = '' %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message_system + message['content'] | trim + '<end_of_turn>\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"
* remove convoluted solution
* Tweak for when None is provided explicitly, and must be set to [] too.
For example, the outlines library provides None explicitly.
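The underlying pitfall is that `dict.get(key, default)` only uses the default when the key is absent, not when it is present with value None. A small sketch of the normalization, assuming a request body with an OpenAI-style `stop` field (the `get_stop_words` helper is hypothetical):

```python
def get_stop_words(body):
    """Return the stop list, treating a missing key and an explicit None alike."""
    # body.get("stop", []) would still return None for {"stop": None},
    # so fall back to an empty list for any falsy value.
    return body.get("stop") or []
```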
* style
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Su-RoPE
* nits
* Update su_rope.py
* Update su_rope.py
Per GPT4: "The error TypeError: 'type' object is not subscriptable is caused by using the type hint list[float] in a version of Python that does not support it. This syntax is only available in Python 3.9 and later."
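A version-independent spelling uses `typing.List`, which also works on Python 3.8. The function below is a hypothetical stand-in, not code from su_rope.py:

```python
from typing import List

def scale_all(values: List[float], factor: float) -> List[float]:
    """Multiply every value by factor; annotated with typing.List for Python < 3.9."""
    return [v * factor for v in values]
```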
* Ran isort
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* GPT-2 model support
* Add test for gpt2 model
* Fix weight sanitizing for quantization
* use approx gelu
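The commit does not say which approximation is meant. As an illustration only, the widely used tanh approximation can be compared against the exact erf-based GELU like this (both helper names are assumptions):

```python
import math

def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh_approx(x):
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))
```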
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* LoRA: Extract pre_processing_model function
* LoRA: Extract small functions(train_model,evaluate_model)
* move test case to test_tuner_utils.py
* nits
* nits
* remove extra param, validate at it 0
* version
* fix test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* add support for granite 3-8B config
* add gpt_bigcode
* add positional embedding condition.
* remove unused function
* rebase fix
* move position embedding to mask creation
* add to tuner and format
* refactor mask
* remove dropout layers
* fix: Added dedicated error handling to load and get_model_path
Added proper error handling to load and get_model_path via a dedicated exception class. Previously, an incorrect local path still surfaced as the huggingface RepositoryNotFoundError.
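A minimal sketch of the pattern, assuming a hypothetical `ModelNotFoundError` class and a simplified local-only lookup (the Hub lookup implied by the commit is omitted here):

```python
from pathlib import Path

class ModelNotFoundError(Exception):
    """Raised when a model path cannot be resolved (hypothetical name)."""

def resolve_local_model_path(path_or_repo: str) -> Path:
    """Return the path if it exists locally, else raise a dedicated error."""
    path = Path(path_or_repo)
    if not path.exists():
        # A dedicated exception tells the caller the local path is wrong,
        # rather than leaking an unrelated repository-lookup error.
        raise ModelNotFoundError(
            f"Model not found for path or repo: {path_or_repo}. "
            "Check that the local path is correct."
        )
    return path
```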
* fix: Changed error message and resolved lack of import
* fix: Removed redundant try-catch block
* nits in message
* nits in message
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* support dora finetune
* solve problems in lora.py and tuner/utils.py
* add use_dora (bool) to the adapter-loading functions
* delete all unsupported quantization code and fix the calculation problems in mlx_lm/tuner/dora.py
* Use stop_gradient to prevent gradients from flowing through `norm` during backpropagation
* set DEFAULT_USE_DORA in mlx_lm/generate.py
* add annotation for all the use_dora
* mlx_lm/fuse.py: support fusing dora layers and fix a bug in to_linear() in mlx_lm/tuner/dora.py
* simplify the code that judges the type of a fused layer in mlx_lm/fuse.py
* add use_dora in mlx_lm/fuse.py when apply_lora_layers()
* style + nits
* style + nits
* more updates
---------
Co-authored-by: chenyifei08 <chenyifei08@baidu.com>
Co-authored-by: Awni Hannun <awni@apple.com>
* Support `--add_eos_token` argument to empower users to control the addition of the eos token during LoRA training, addressing issues like incomplete text generation.
* Support `--add_eos_token`, code format
---------
Co-authored-by: Zhan ChengLong <zhanchenglong@bytedance.com>
* Add `model_config` parameter to `load()` and `load_model()`
For easy editing of the loaded model configuration (e.g., for changing RoPE theta or scaling of Phi-3 model)
Example:
```python
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Phi-3-mini-4k-instruct-4bit-no-q-embed", model_config={"rope_theta":50000.0})
response = generate(model, tokenizer, prompt, max_tokens=MAX_TOKENS)
```
* Possible bug (default_loss)
* Revert "Possible bug (default_loss)"
This reverts commit 70a55ace18.
* Fix default_loss for lora
* 1. move load_model's new optional `model_config` arg to the end (fetch_from_hub()'s `model = load_model(model_path, lazy)`) 2. fix indentations (`black` hook)
* Pad mask with zeros for non-square attention matrices
The current implementation of the mask assumes the attention matrix is square, which is true if there is no cache. However, if one wishes to produce multiple tokens at a time, such as in speculative decoding implementations, a rectangular mask is necessary.
This change pads the bottom of the mask with zeros so multi-token decoding with a cache works correctly.
* Directly create mask instead of padding
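A pure-Python sketch of building the rectangular mask directly, assuming an additive mask (0.0 where attention is allowed, a large negative value elsewhere) and a hypothetical `create_causal_mask` helper with an `offset` equal to the number of cached tokens:

```python
def create_causal_mask(num_new, offset=0, neg=-1e9):
    """Return a (num_new, offset + num_new) additive mask as nested lists.

    Row i is the query at absolute position offset + i. It may attend to
    keys 0..offset + i (entry 0.0) but not to later keys (entry neg).
    """
    total = offset + num_new
    return [
        [0.0 if key <= offset + query else neg for key in range(total)]
        for query in range(num_new)
    ]

# Two new tokens decoded on top of three cached tokens: a 2 x 5 mask.
mask = create_causal_mask(num_new=2, offset=3)
```

With `offset=0` this reduces to the usual square lower-triangular mask.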
* Update llama.py
* Add support for setting MLX cache limit in GB
* Add support for setting MLX cache limit in GB in mlx_lm.server
* format
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Add model management functionality for local caches
This commit introduces a set of command-line utilities for managing MLX models downloaded and saved locally in the Hugging Face cache. The functionality includes scanning existing models, retrieving detailed information about a specific model, and deleting a model by name.
* Added mlx_lm.model to setup.py
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Update model card description
- Add full link jump
- Add the address of the model uploader's Hugging Face homepage
* Add user_info to reduce whoami calls
* Remove the -U argument
* remove HF user info
* run pre-commit
* support for phi-3 4-bit quantized gguf weights
* Added link to 4-bit quantized model
* removed some prints
* Added correct comment
* Added correct comment
* removed print
Since the last condition already prints a warning when quantization is None