* chore(mlx-lm): clean up the top-p implementation
* chore: clean up
* chore: add test
* chore: address comments
* chore: clean up docs string
* chore: clean up test
* wip
* wip
* feat: convert mlx model to gguf f16
* chore: convert norm layer to float32 to avoid overflow issue
* chore: add support for mixtral
* chore: clean up
* chore: remove unused import statement
* chore: clean up weight name mapping
* version and readme
* actual version bump
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Add dropout parameter to lora configuration
A dropout parameter has been added to the LoRA configuration settings in lora_config.yaml. The LoRALinear class in utils.py has been updated to take this new parameter. Additionally, an `AttributeError: 'types.SimpleNamespace' object has no attribute 'prompt'` related to `args.prompt` has been fixed in lora.py.
* Update lora_config.yaml
Set dropout to 0.0 in the sample config file
* format
---------
Co-authored-by: Awni Hannun <awni@apple.com>
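As a rough illustration of what a dropout parameter on a LoRA layer does, here is a minimal pure-Python sketch. It is not the LoRALinear class from utils.py: the function names, the argument layout, and the choice to apply dropout only to the low-rank branch are assumptions made for the example.

```python
import random


def dropout(x, p, rng):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    if p == 0.0:
        return list(x)
    keep = 1.0 - p
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def lora_linear(x, weight, lora_a, lora_b, scale, p, rng):
    """Hypothetical forward pass: y = W x + scale * B (A dropout(x)).

    Assumption: dropout touches only the input of the low-rank branch,
    so the frozen base projection W x is unaffected.
    """
    base = matvec(weight, x)
    update = matvec(lora_b, matvec(lora_a, dropout(x, p, rng)))
    return [b + scale * u for b, u in zip(base, update)]


# With p = 0.0 (the value set in the sample config) the layer is deterministic.
rng = random.Random(0)
weight = [[1.0, 0.0], [0.0, 1.0]]  # 2x2 base weight
lora_a = [[1.0, 1.0]]              # rank-1 down projection (1x2)
lora_b = [[1.0], [0.0]]            # rank-1 up projection (2x1)
y = lora_linear([1.0, 2.0], weight, lora_a, lora_b, scale=2.0, p=0.0, rng=rng)
print(y)
```

Setting dropout to 0.0 leaves existing behaviour unchanged, which is presumably why it is the sample default.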
* chore(mlx-lm): fix print_trainable_parameters for quant models
* chore: clean up
* refactor: use layer type to check quant bits
* chore: address comment
* Add --lora-all-linear option to apply LoRA to all linear transformer block layers
* Moved to YAML config and added specification of rank & alpha
* nits in config, more tests
* nit
* run tests for prs
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Convert mlx_lm.lora to use YAML configuration
* pre-commit run fixes
* Fix loading of config file
* Remove invalid YAML from doc
* Update command-line options and YAML parameter overriding, per feedback in #503
* Minor wording change
* Positional argument
* Moved config to a (-c/--config) flag
* Removed CLI option defaults (since CLI options take precedence and their defaults are in CONFIG_DEFAULTS)
* pre-commit format updates
* Fix handling of CLI option defaults
* Prevent None values of unspecified CLI options from overwriting values from CONFIG_DEFAULTS
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
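The precedence described in this change (explicit CLI options over the config file over CONFIG_DEFAULTS, with unspecified CLI options left as None so they overwrite nothing) can be sketched as follows. This is only an illustration: the option names, the default values, and the `resolve_config` helper are hypothetical, and the parsed YAML file is replaced by a plain dict to keep the example self-contained.

```python
import argparse

# Hypothetical defaults; the real CONFIG_DEFAULTS keys and values may differ.
CONFIG_DEFAULTS = {"model": "mlx_model", "batch_size": 4, "lora_layers": 16}


def build_parser():
    parser = argparse.ArgumentParser()
    # No `default=` on the options: an unspecified option parses as None,
    # which distinguishes "not given" from "given".
    parser.add_argument("-c", "--config", type=str)
    parser.add_argument("--model", type=str)
    parser.add_argument("--batch-size", type=int)
    parser.add_argument("--lora-layers", type=int)
    return parser


def resolve_config(argv, file_config):
    """Merge in increasing priority: defaults, config file, explicit CLI options."""
    args = vars(build_parser().parse_args(argv))
    args.pop("config", None)  # a real script would load this path as YAML
    merged = dict(CONFIG_DEFAULTS)
    merged.update(file_config)
    # Skip None so unspecified CLI options never clobber earlier values.
    merged.update({k: v for k, v in args.items() if v is not None})
    return merged


file_config = {"batch_size": 8}  # stands in for the parsed YAML file
resolved = resolve_config(["--lora-layers", "4"], file_config)
print(resolved)
```

Here `model` comes from the defaults, `batch_size` from the config file, and `lora_layers` from the command line.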
* Update README.md
The default location where convert.py saved files was wrong. It was also inconsistent with how the later script test.py tries to use them (and with the naming convention it assumes).
I don't actually see a quick way to automate this since, as written, the target directory is set directly by an argument. It would probably be best to rewrite it so that the argument acts as an override, while the default behaviour constructs a file path from the set and unset arguments. This is also complex because "defaults" are assumed in the naming convention as well.
* Update README.md
Created an actual script that'll run and do this correctly.
* Update README.md
Typo fix: mlx-models should have been mlx_models. This conforms with the convention used later in the mlx-examples/whisper code.
* Update README.md
Removed the larger script and changed it back to the simpler script as before.
* nits in readme
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Use named tuple from typing for typehints
* Add type hints
* Simplify expression
* Type hint fix
* Improved do_POST logic
Use a map of endpoints to methods to reduce redundancy in code
* Fix format
* Improve redundancy
Call method dynamically instead of writing out all arguments twice
* Send response instead of returning
* Fix typo
* Revert change
* Make adapter_file as Optional
* Mark formatter as optional
* format
* Create message generator
Store response data that stays static for the duration of the response inside of the object:
system_fingerprint
request_id
object_type
requested_model
Created a message generator that dynamically creates messages from the metadata stored inside the object and the data from the model pipeline
* Remove leftover
* Update parameters to reflect new object structure
No longer pass all arguments between functions, but use the stored values inside of the object
* Parse body before calling request specific methods
* Call super init
* Update server.py
* Fixed outdated documentation parameter name
* Add documentation
* Fix sending headers twice
During testing I found that, with the streaming option, headers were always sent twice. This should fix that.
* Simplify streaming code by using guard clauses
Don't wrap wfile writes in try blocks, the server class has its own try block to prevent crashing
* Bug fix
* Use Content-Length header
Let the completion type specific methods finish sending the headers. This allows us to send the Content-Length header as the model returns a completion.
* Update utils.py
* Add top_p documentation
* Type hint model and tokenizer as required
* Use static system fingerprint
System fingerprint now stays the same across requests
* Make type hint more specific
* Bug Fix
The check that raises ValueError when fewer than 2 models are supplied to merge called len on the unbound name "models". It should use "model_paths" instead.
Mark upload_repo as optional
* Move more of the shared code into do_POST
Processing stop_id_sequences is done no matter the request endpoint or type, move it into the shared section. handle_ methods now just return the prompt in mx.array form.
* Store stop_id_sequences as lists instead of np
During testing I found that letting the tokenizer return values as Python lists and converting them to mlx arrays was around 20% faster than having the tokenizer convert them to np, and from np to mlx. This also means numpy no longer needs to be imported.
* Update stop_id_sequences docs
* Turn if check to non-inclusive
Only continue if buffer is smaller
* Documentation fix
* Clearer method names
Instead of handle_stream and generate_competion, we should name it handle_completion.
Instead of handle_completions alongside handle_chat_completions, we should name it handle_text_completions. Since both are completions, calling it text completions is more descriptive.
* Make comment clearer
* fix format
* format
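The do_POST changes in this series (a map of endpoints to methods, the body parsed once before the request-specific method runs, and handle_ methods that only return the prompt) can be sketched roughly as below. This is a simplified stand-in rather than the code in server.py: the class name, the endpoint paths, and the return values are assumptions, and the HTTP plumbing is omitted so the example runs on its own.

```python
import json


class APIHandlerSketch:
    """Hypothetical stand-in for the request handler; not the real server.py class."""

    def __init__(self):
        # Map each endpoint path (assumed paths) to the method that builds its prompt.
        self.endpoints = {
            "/v1/completions": self.handle_text_completions,
            "/v1/chat/completions": self.handle_chat_completions,
        }

    def do_POST(self, path, raw_body):
        handler = self.endpoints.get(path)
        if handler is None:
            return 404, {"error": "Not Found"}
        # Shared work happens once, before the endpoint-specific method runs.
        self.body = json.loads(raw_body)
        prompt = handler()  # look up and call the method dynamically
        return 200, {"prompt": prompt}

    def handle_text_completions(self):
        return self.body["prompt"]

    def handle_chat_completions(self):
        # Join chat messages into one prompt string (illustrative only).
        return "\n".join(m["content"] for m in self.body["messages"])


handler = APIHandlerSketch()
status, payload = handler.do_POST("/v1/completions", '{"prompt": "hello"}')
print(status, payload)
```

Adding an endpoint then means adding one dictionary entry and one method, instead of another branch in do_POST.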
* Add Starcoder2 model and update utils.py
* Refactor model arguments and modules in starcoder2.py
* Refactor FeedForward class to MLP in starcoder2.py
* Fix typo
* pre-commit
* Refactor starcoder2.py: Update model arguments and modules
* Fix LM head and MLP layers
* Rename input layer norm
* Update bias in linear layers
* Refactor token embeddings in Starcoder2Model
* Rename to standard HF attention layer name
* Add LayerNorm
* Add transposed token embeddings (like in Gemma)
* Refactor MLP and TransformerBlock classes
* Add tie_word_embeddings option to ModelArgs and update Model implementation
* Add conditional check for tying word embeddings in Starcoder2Model
* Fix bias in lm_head linear layer
* Remove unused LayerNorm in stablelm
* Update transformers dependency to use GitHub repository
* fix lm head bug, revert transformer req
* Update RoPE initialization in Attention class
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* StableLM now part of Transformers as stablelm rather than stablelm_epoch; changed config to match new changes
* removing old file
* reference new stablelm
* Add metadata when saving safetensors
Add metadata format="pt" for safetensors so that models are accessible to `transformers` users as well.
* save with metadata format mlx
Save the model weights with metadata format of "mlx".
* Updated llms/mlx_lm/generate.py
* Don't serve local directory
BaseHTTPRequestHandler serves the current directory by default. Definitely not intended behaviour. Remove the "do_HEAD" and "do_GET" methods.
* Fix typo in method name
I assume hanlde_stream was intended to be called handle_stream
* Fix outdated typehint
load_model returns nn.Module, however fetch_from_hub was not updated to reflect the change
* Add some more type hints
* Add warnings for using in prod
Add a warning to the README and at runtime, discouraging use in production. The warning is the same as in the Python docs for HTTPServer: https://docs.python.org/3/library/http.server.html
* format
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>