* add support for granite 3-8B config
* add gpt_bigcode
* add positional embedding condition.
* add support for granite 3-8B config
* add gpt_bigcode
* add positional embedding condition.
* remove unused function
* rebase fix
* move position emebedding to mask creation
* add to tuner and format
* add support for granite 3-8B config
* add gpt_bigcode
* add positional embedding condition.
* add support for granite 3-8B config
* add gpt_bigcode
* add positional embedding condition.
* rebase fix
* move position emebedding to mask creation
* add to tuner and format
* refactor mask
* remove dropout layers
* fix: Added dedicated error handling to load and get_model_path
Added proper error handling to load and get_model_path by adding a dedicated exception class, because when the local path is not right, it still throws the huggingface RepositoryNotFoundError
* fix: Changed error message and resolved lack of import
* fix: Removed redundant try-catch block
* nits in message
* nits in message
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* support dora finetune
* solve problems in lora.py and tuner.utils.py
* add use_dora (bool) in functions of load adapters
* delete all unsupported quantization code and fix all the calculate problems in mlx_lm/tuner/dora.py
* Using stop_gradient to prevent gradients from flowing through ‘norm’ during backpropagation
* set DEFAULT_USE_DORA in mlx_lm/generate.py
* add annotation for all the use_dora
* mlx_lm/fuse.py support fuse dora layers and fix a bug of to_linear() in mlx_lm/tuner/dora.py
* simplify code of juding type of a fused layer in mlx_lm/fuse.py
* add use_dora in mlx_lm/fuse.py when apply_lora_layers()
* style + nits
* style + nits
* more updates
---------
Co-authored-by: chenyifei08 <chenyifei08@baidu.com>
Co-authored-by: Awni Hannun <awni@apple.com>
* Add `model_config` parameter to `load()` and `load_model()`
For easy editing of the loaded model configuration (e.g., for changing RoPE theta or scaling of Phi-3 model)
Example:
```python
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Phi-3-mini-4k-instruct-4bit-no-q-embed", model_config={"rope_theta":50000.0})
response = generate(model, tokenizer, prompt, max_tokens=MAX_TOKENS)
```
* Possible bug (default_loss)
* Revert "Possible bug (default_loss)"
This reverts commit 70a55ace18.
* Fix default_loss for lora
* 1. move load_model's new optional `model_config` arg to the end (fetch_from_hub()'s `model = load_model(model_path, lazy)`) 2. fix indentations (`black` hook)
* Pad mask with zeros for non-square attention matrices
The current implementation of the mask assumes the attention matrix is square, which is true if there is no cache. However, if one wishes to produce multiple tokens at a time, such as in speculative decoding implementations, a rectangular mask is necessary.
This change pads the bottom of the mask with zeros so multi-token decoding with a cache works correctly.
* Directly create mask instead of padding
* Update llama.py
* Add support for setting MLX cache limit in GB
* Add support for setting MLX cache limit in GB in mlx_lm.server
* format
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Add model management functionality for local caches
This commit introduces a set of command-line utilities for managing MLX models downloaded and saved locally in Hugging Face cache. The functionalities include scanning existing models, retrieving detailed information about a specific model, and deleting a model by its name.
* Added mlx_lm.model to setup.py
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Update model card describe
- Add full link jump
- Add the address of the model uploader's Hugging Face homepage
* Add user_info to reduce whoami calls
* Remove the -U argument
* remove HF user info
* run pre-commit
* support for phi-3 4bits quantized gguf weights
* Added link to 4 bits quantized model
* removed some prints
* Added correct comment
* Added correct comment
* removed print
Since last condition already prints warning for when quantization is None
* Added support for the MiniCPM architecture
* Added support for the MiniCPM architecture
* Updated utils.py and LORA.md
* Updated utils.py and LORA.md
* Update implementation details for MiniCPM architecture
* Cleaning up
* fixed the missing lm.head layer problem
* Refactor Model class to dynamically handle tied and untied word embeddings
* Quick update
* added a dynamic rope scaling base calucaltion
* Added support for the MiniCPM architecture
* Added support for the MiniCPM architecture
* Updated utils.py and LORA.md
* Updated utils.py and LORA.md
* Update implementation details for MiniCPM architecture
* Cleaning up
* fixed the missing lm.head layer problem
* Refactor Model class to dynamically handle tied and untied word embeddings
* added a dynamic rope scaling base calucaltion
* quick fix and clean up
* clean up again
* removed the MiniCPMNorm class as its not used
* forgot something, sorry
* format
* version bump
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Initial config handler and test
* Added means to run from CLI
* Update lora config loading and tests
* Constrain scheduler config (warmup and minimum LR) for each kind
* Update reference to moved schedule_config module
* Minor fix
* Fix typos
* Moved build_schedule and tests
* nits in schedule config
* flake
* fix path
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* use nn.RMSNorm, use sdpa, cleanup
* bump mlx versions
* minor update
* use fast layer norm
* version bump
* update requirement for whisper
* update requirement for gguf
* chore(mlx-lm): clean up the top p imp
* chore: clean up
* chore: add test
* chore: address comments
* chore: clean up docs string
* chore: clean up test
* wip
* wip
* feat: convert mlx model to gguf f16
* chore: conver norm layer to float32 to avoid overflow issue
* chore: add support for mixtral
* chore: clean up
* chore: remove unused import statement
* chore: clean up weight name mapping
* version and readme
* actual version bump
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Add dropout parameter to lora configuration
A dropout parameter has been added to the lora configuration settings in lora_config.yaml. The LoRALinear class in utils.py has been updated to take this new parameter. Additionally, a AttributeError: 'types.SimpleNamespace' object has no attribute 'prompt' related to `args.prompt` has been removed from lora.py.
* Update lora_config.yaml
Set dropout to 0.0 in the sample config file
* format
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* chore(mlx-lm): fix print_trainable_parameters for quant models
* chore: clean up
* refactor: use layer type to check quant bits
* chore: address comment
* Add --lora-all-linear option to apply LoRa to all linear transfer block layers
* Moved to YAML config and added specification of rank & alpha
* nits in conifg, more tests
* nit
* run tests for prs
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Convert mlx_lm.lora to use YAML configuration
* pre-commit run fixes
* Fix loading of config file
* Remove invalid YAML from doc
* Update command-line options and YAML parameter overriding, per feedback in #503
* Minor wording change
* Positional argument
* Moved config to a (-c/--config) flag
* Removed CLI option defaults (since CLI options take precedence and their defaults are in CONFIG_DEFAULTS)
* pre-commit format updates
* Fix handling of CLI option defaults
* Prevent None values of unspecified CLI options from overwriting values from CONFIG_DEFAULTS
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Use named tuple from typing for typehints
* Add type hints
* Simplify expression
* Type hint fix
* Improved do_POST logic
Use a map of endpoints to methods to reduce redundancy in code
* Fix format
* Improve redundancy
Call method dynamically instead of writing out all arguments twice
* Send response instead of returning
* Fix typo
* Revert change
* Make adapter_file as Optional
* Mark formatter as optional
* format
* Create message generator
Store response data that stays static for the duration of the response inside of the object:
system_fingerprint
request_id
object_type
requested_model
Created a message generator, that dynamically creates messages from the metadata stored inside of the object, and the data from the model pipeline
* Remove leftover
* Update parameters to reflect new object structure
No longer pass all arguments between functions, but use the stores values inside of the object
* Parse body before calling request specific methods
* Call super init
* Update server.py
* Fixed outdated documentation parameter name
* Add documentation
* Fix sending headers twice
During testing I found that when using the streaming option, headers have always been sent twice. This should fix that
* Simplify streaming code by using guard clauses
Don't wrap wfile writes in try blocks, the server class has its own try block to prevent crashing
* Bug fix
* Use Content-Length header
Let the completion type specific methods finish sending the headers. This allows us to send the Content-Length header as the model returns a completion.
* Update utils.py
* Add top_p documentation
* Type hint model and tokenizer as required
* Use static system fingerprint
System fingerprint now stays the same across requests
* Make type hint more specific
* Bug Fix
Supplying less than 2 models to merge would raise ValueError and calls len on unbound "models". Should be "model_paths" instead.
Mark upload_repo as optional
* Move more of the shared code into do_POST
Processing stop_id_sequences is done no matter the request endpoint or type, move it into the shared section. handle_ methods now just return the prompt in mx.array form.
* Store stop_id_sequences as lists instead of np
During testing I found that letting the tokenizer return values as python lists and converting them to mlx arrays was around 20% faster than having the tokenizer convert them to np, and from np to mlx. This allows makes it so numpy no longer needs to be imported.
* Update stop_id_sequences docs
* Turn if check to non-inclusive
Only continue if buffer is smaller
* Documentation fix
* Cleared method names
Instead of handle_stream and generate_competion, we should name it handle_completion.
Instead of handle_completions and handle_chat_completions, we should name it handle_text_completions, since both are completions, calling it text completions should make it more descriptive
* Make comment clearer
* fix format
* format
* Add Starcoder2 model and update utils.py
* Refactor model arguments and modules in starcoder2.py
* Refactor FeedForward class to MLP in starcoder2.py
* Fix typo
* pre-commit
* Refactor starcoder2.py: Update model arguments and modules
* Fix LM head and MLP layers
* Rename input layer norm
* Update bias in linear layers
* Refactor token embeddings in Starcoder2Model
* Rename to standard HF attention layer name
* Add LayerNorm
* Add transposed token embeddings (like in Gemma)
* Refactor MLP and TransformerBlock classes
* Add tie_word_embeddings option to ModelArgs and update Model implementation
* Add conditional check for tying word embeddings in Starcoder2Model
* Fix bias in lm_head linear layer
* Remove unused LayerNorm in stablelm
* Update transformers dependency to use GitHub repository
* fix lm head bug, revert transformer req
* Update RoPE initialization in Attention class
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* StableLM now part of Transformers as stablelm rather than stablelm_epoch; changed config to match new changes
* removing old file
* reference new stablelm
* Add metadata when saving safetensors
Add metadata format="pt" for safetensors so that model's are accessible to `transformers` users as well.
* save with metadata format mlx
Save the model weights with metadata format of "mlx".
* Updated llms/mlx_lm/generate.py
* Don't serve local directory
BaseHTTPRequestHandler serves the current directory by default. Definitely not intended behaviour. Remove the "do_HEAD" and "do_GET" methods.
* Fix typo in method name
I assume hanlde_stream was intended to be called handle_stream
* Fix outdated typehint
load_model returns nn.Module, however fetch_from_hub was not updated to reflect the change
* Add some more type hints
* Add warnings for using in prod
Add a warning to README and runtime, discouraging use in production. The warning is the same as on the python docs for HTTPServer https://docs.python.org/3/library/http.server.html
* format
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* LoRA:Refactor TrainingCallback to enhance flexibility and extensibility
This commit refactors the TrainingCallback class to accept a dictionary parameter for both on_train_loss_report and on_val_loss_report methods. By switching from multiple parameters to a single dict parameter, this change significantly improves the class's flexibility and makes it easier to extend with new training or validation metrics in the future without altering the method signatures. This approach simplifies the addition of new information to be logged or processed and aligns with best practices for scalable and maintainable code design.
* LoRA: Add printing and callbacks for learning rate during training
Mixtral models throw the following exception
```
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/mlx_lm/generate.py", line 119, in <module>
main(args)
File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/mlx_lm/generate.py", line 96, in main
model, tokenizer = load(args.model, tokenizer_config=tokenizer_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/mlx_lm/utils.py", line 278, in load
model = load_model(model_path)
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/mlx_lm/utils.py", line 221, in load_model
model_class, model_args_class = _get_classes(config=config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/mlx_lm/utils.py", line 46, in _get_classes
arch = importlib.import_module(f"mlx_lm.models.{model_type}")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/lib/python3.11/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 940, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/mlx_lm/models/mixtral.py", line 11, in <module>
@dataclass
^^^^^^^^^
File "/opt/homebrew/anaconda3/lib/python3.11/dataclasses.py", line 1230, in dataclass
return wrap(cls)
^^^^^^^^^
File "/opt/homebrew/anaconda3/lib/python3.11/dataclasses.py", line 1220, in wrap
return _process_class(cls, init, repr, eq, order, unsafe_hash,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/lib/python3.11/dataclasses.py", line 1027, in _process_class
_init_fn(all_init_fields,
File "/opt/homebrew/anaconda3/lib/python3.11/dataclasses.py", line 545, in _init_fn
raise TypeError(f'non-default argument {f.name!r} '
TypeError: non-default argument 'model_type' follows default argument
```
* LoRA: Improve validation error for LoRA layer count exceeding model layer
This commit enhances the error handling when the specified LoRA layer count exceeds the total number of layers in the model. It clarifies the error message to provide actionable feedback for users, guiding them to adjust their input parameters accordingly.
* format + nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* lazy model import in mlx_lm
* change lora loading
* fix olmo lora
* remove a bunch of unused stuff from plamo
* move phixtral to mlx-lm and out of llms/
* Add checkpoints directory for adapter weights
The code was modified to create a checkpoints directory if it doesn't exist yet. Adapter weights are now saved to this checkpoints directory during the training iterations.
Corrected indentation of Save adapter weights code because it was part of "if eval"
* Fixing a blank added by mistake
* update a few examples to use compile
* update mnist
* add compile to vae and rename some stuff for simplicity
* update reqs
* use state in eval
* GCN example with RNG + dropout
* add a bit of prefetching
* initial commit
* style fixes
* update of ACKNOWLEDGMENTS
* fixed comment
* minor refactoring; removed unused imports
* added cifar and cvae to top-level README.md
* removed mention of cuda/mps in argparse
* fixed training status output
* load_weights() with strict=True
* pretrained model update
* fixed imports and style
* requires mlx>=0.0.9
* updated with results using mlx 0.0.9
* removed mention of private repo
* simplify and combine to one file, more consistency with other exmaples
* few more nits
* nits
* spell
* format
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* chore(mlx-lm): add model weight index in save_weights
* Update llms/mlx_lm/utils.py
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Update llms/mlx_lm/utils.py
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* chore: save total siZe as param size isntead of file size
* chore: clean up format
---------
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
A new argument "--max_seq_length" has been added to the command-line parser and passed as a parameter to the main function of the lora.py script. This allows users to specify and control the maximum sequence length during training.
* LoRA: Remove unnecessary model type judgments
1. Supported models are already checked in the load_model function in utils, no need to repeat the check in lora
2. The checks in lora are not synchronized with those in utils
* LoRA: add LoRA supported models in mlx_lm utils
* feat(mlx-lm): add de-quant for fuse
* chore: disable quant in to linear when de-quant enabled
* chore: add better error handling for adapter file not found
* chore(mlx-lm): pass max seq len to evaluate in training loop
* chore: make sure the batch seq not exceed max len
* chore: update comment
* chore: add warning before truncate input
* Draft of tiny llama from gguf
* Transpose all
* No transposition with new layout
* Read config from gguf
* Create tokenizer from gguf
* move gguf and update to be similar to hf_llm
* change model to HF style + updates to REAMDE
* nits in REAMDE
* nit readme
* only use mlx for metadata
* fix eos/bos tokenizer
* fix tokenization
* quantization runs
* 8-bit works
* tokenizer fix
* bump mlx version
---------
Co-authored-by: Juarez Bochi <juarez.bochi@grammarly.com>
Co-authored-by: Awni Hannun <awni@apple.com>
* fix the chinese character generation as same as PR #321
* reuse the generate logic to utils.py
* format
* verbose defualt
* fix conflicst with colorize and character check
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Convert HF weights of PLaMo and load it to a plamo model in mlx
* Fix model inference part
* Add bos at the beginning of the prompt
* Fix convert.py to copy tokenizer.model into the converted dir
* Use the required insturction format in generate.py when "--instruct" option is specified
* Change filenames and update existing scripts
* Add README
* Add requirements.txt
* Fix plamo.py to stop generation when EOS appears
* Add quantization to convert.py
* Use mlx>=0.0.9 for mx.core.outer() in PLaMo model
* Update acknowledgements.md
* Fix card text in upload_to_hub()
* Not use prompt template when --instruct is not specified
* Ask if you trust_remote_code for loading tokenizer of PLaMo
* Check the user trusts the remote code when converting
* Remove plamo directory
* Update README
* Add PLaMo model file
* Fix the handling of cache in PLaMo and update README
* Ask if trust_remote_code only when the model is PLaMo
* Remove resolve_trust_remote_code from convert.py and use the latest transformers
* Remove code not to add EOS
* Update README to fix an example not to use noncommercial version of the model
* Remove unused imports
* Remove unnecessary description about the instruct model of PLaMo from README
* format, nits in README
* typo
---------
Co-authored-by: Shunta Saito <shunta@mitmul-mbp.local>
Co-authored-by: Awni Hannun <awni@apple.com>
* Add colorized output option to generate script
Two new functions were added to the script that allow output to be colorized based on the T[0] probability. Changes were made to the `generate_step` function in utils.py to permit colorization. Additionally, an argument for colorization was introduced to the command-line parser.
* Rename 'colorize' parameter with 'return_probability' in generate_step
* add an option to apply the tokenizer chat template
* fix the option to apply the tokenizer chat template
* better error messages for chat template issues
* apply the chat template by default when possible
* nit in comment'
* rebase
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* refactor(qwen): moving qwen into mlx-lm
* chore: update doc
* chore: fix type hint
* add qwen model support in convert
* chore: fix doc
* chore: only load model in quantize_model
* chore: make the convert script only copy tokenizer files instead of load it and save
* chore: update docstring
* chore: remove unnecessary try catch
* chore: clean up for tokenizer and update transformers 4.37
* nits in README
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* refactor: moving phi2 example into hf_llm
* chore: clean up
* chore: update phi2 model args so it can load args from config
* fix phi2 + nits + readme
* allow any HF repo, update README
* fix bug in llama
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* * Add --local flag for reading models from filesystem and related code for doing so
* Disable uploading to huggingface if --local flag is set
* Remove code related to .bin files and merge fetch_from_local and fetch_from_hub into one function.
* Update llms/hf_llm/convert.py
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* format / nits
---------
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
Co-authored-by: Awni Hannun <awni@apple.com>
* refactor: make the phi2 example can be directly load the model from hf without convert needed
* chore: add super().__init__() for all module, otherwise will cause error in lora
* refactor: merge deepseek coder example into hf_llm example
* remove deepseek example
* chore: fix format in readme
* chore: remove default rope_scaling dict and use get to access type and factor to avoid key error
* Update llms/hf_llm/models.py
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* chore: fix lint
---------
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* feat: add example for deepseek coder
* chore: remove hardcoded rope_scaling_factor
* feat: add quantization support
* chore: update readme
* chore: clean up the rope scalling factor param in create cos sin theta
* feat: add repetition_penalty
* style /consistency changes to ease future integration
* nits in README
* one more typo
---------
Co-authored-by: Awni Hannun <awni@apple.com>