* Update model card description
- Add full, clickable links
- Add the address of the model uploader's Hugging Face homepage
* Add user_info to reduce whoami calls
* Remove the -U argument
* remove HF user info
* run pre-commit
* chore(mlx-lm): clean up the top-p implementation
* chore: clean up
* chore: add test
* chore: address comments
* chore: clean up docs string
* chore: clean up test
* Use NamedTuple from typing for type hints
* Add type hints
* Simplify expression
* Type hint fix
* Improved do_POST logic
Use a map of endpoints to handler methods to reduce code redundancy
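A minimal sketch of the endpoint-map idea, using illustrative handler names rather than the actual server.py implementation:

```python
from http.server import BaseHTTPRequestHandler


class APIHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Map each endpoint to the method that handles it instead of an
        # if/elif chain.
        endpoints = {
            "/v1/completions": self.handle_text_completions,
            "/v1/chat/completions": self.handle_chat_completions,
        }
        if self.path not in endpoints:
            self.send_error(404, "Not Found")
            return
        endpoints[self.path]()

    def handle_text_completions(self):
        ...

    def handle_chat_completions(self):
        ...
```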
* Fix format
* Reduce redundancy
Call the method dynamically instead of writing out all the arguments twice
* Send response instead of returning
* Fix typo
* Revert change
* Mark adapter_file as Optional
* Mark formatter as optional
* format
* Create message generator
Store response data that stays static for the duration of the response inside the object:
- system_fingerprint
- request_id
- object_type
- requested_model
Added a message generator that dynamically creates messages from the metadata stored inside the object and the data from the model pipeline
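A hypothetical sketch of this structure (field and method names are assumptions, not the actual server.py code):

```python
import time
import uuid


class ResponseState:
    def __init__(self, requested_model, object_type):
        # Data that stays static for the duration of the response.
        self.system_fingerprint = f"fp_{uuid.uuid4()}"
        self.request_id = f"cmpl-{uuid.uuid4()}"
        self.object_type = object_type
        self.requested_model = requested_model
        self.created = int(time.time())

    def generate_response(self, text, finish_reason=None):
        # Build one message from the stored metadata plus the new text
        # coming out of the model pipeline.
        return {
            "id": self.request_id,
            "object": self.object_type,
            "created": self.created,
            "model": self.requested_model,
            "system_fingerprint": self.system_fingerprint,
            "choices": [
                {"index": 0, "text": text, "finish_reason": finish_reason}
            ],
        }
```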
* Remove leftover
* Update parameters to reflect new object structure
No longer pass all arguments between functions; use the values stored inside the object instead
* Parse body before calling request specific methods
* Call super init
* Update server.py
* Fixed outdated documentation parameter name
* Add documentation
* Fix sending headers twice
During testing I found that when using the streaming option, headers were always sent twice. This should fix that.
* Simplify streaming code by using guard clauses
Don't wrap wfile writes in try blocks; the server class has its own try block to prevent crashing
* Bug fix
* Use Content-Length header
Let the completion-type-specific methods finish sending the headers. This allows us to send the Content-Length header once the model returns a completion.
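A sketch of the idea with an assumed method name; the point is that the completion-specific method can compute an exact Content-Length before calling end_headers():

```python
import json
from http.server import BaseHTTPRequestHandler


class CompletionHandler(BaseHTTPRequestHandler):
    def handle_completion(self, response: dict):
        body = json.dumps(response).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # The full completion is known here, so Content-Length can be exact.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```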
* Update utils.py
* Add top_p documentation
* Type hint model and tokenizer as required
* Use static system fingerprint
System fingerprint now stays the same across requests
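A minimal sketch of a static fingerprint, assuming it is generated once at module load so every request reports the same value:

```python
import uuid

# Generated once per server process; reused for every response.
SYSTEM_FINGERPRINT = f"fp_{uuid.uuid4()}"
```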
* Make type hint more specific
* Bug Fix
Supplying fewer than 2 models to merge would raise a ValueError and call len on the unbound name "models"; it should be "model_paths" instead (see the sketch below).
Mark upload_repo as optional
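A hedged sketch of the corrected check (the function shape and error message are assumptions): validate the bound name model_paths rather than the unbound models.

```python
def merge_models(model_paths: list):
    # Guard against merging fewer than two models, using the name that is
    # actually bound at this point.
    if len(model_paths) < 2:
        raise ValueError(
            f"Expected at least 2 models to merge, got {len(model_paths)}."
        )
    # ... merge logic ...
```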
* Move more of the shared code into do_POST
Processing stop_id_sequences happens regardless of the request endpoint or type, so move it into the shared section. The handle_ methods now just return the prompt as an mx.array.
* Store stop_id_sequences as lists instead of np
During testing I found that letting the tokenizer return values as Python lists and converting them to mlx arrays was around 20% faster than having the tokenizer convert them to numpy and then converting from numpy to mlx. This also means numpy no longer needs to be imported.
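An illustrative sketch of the change (helper names are hypothetical): keep the stop id sequences as plain Python lists from the tokenizer and convert to mx.array only where an array is actually needed.

```python
import mlx.core as mx


def make_stop_id_sequences(tokenizer, stop_words):
    # The tokenizer returns plain Python lists of token ids; no numpy
    # round-trip is involved.
    return [tokenizer.encode(w) for w in stop_words]


def as_mx_array(sequence):
    # Convert a single sequence when an mlx array is required.
    return mx.array(sequence)
```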
* Update stop_id_sequences docs
* Make the if check non-inclusive
Only continue if the buffer is smaller
* Documentation fix
* Clearer method names
Instead of handle_stream and generate_completion, name it handle_completion.
Since both handle_completions and handle_chat_completions are completions, rename handle_completions to handle_text_completions to make it more descriptive.
* Make comment clearer
* fix format
* format
* Add metadata when saving safetensors
Add metadata format="pt" for safetensors so that models are accessible to `transformers` users as well.
* save with metadata format mlx
Save the model weights with metadata format of "mlx".
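A minimal sketch, assuming an mlx version whose save_safetensors accepts a metadata argument:

```python
import mlx.core as mx

weights = {"layers.0.weight": mx.zeros((8, 8))}
# Tag the file so other tooling can identify how the weights were saved.
mx.save_safetensors("model.safetensors", weights, metadata={"format": "mlx"})
```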
* Updated llms/mlx_lm/generate.py
* Don't serve local directory
BaseHTTPRequestHandler serves the current directory by default, which is definitely not the intended behaviour. Remove the "do_HEAD" and "do_GET" methods.
* Fix typo in method name
I assume hanlde_stream was intended to be called handle_stream
* Fix outdated type hint
load_model returns nn.Module, but fetch_from_hub was not updated to reflect the change
* Add some more type hints
* Add warnings for using in prod
Add a warning to the README and at runtime, discouraging use in production. The warning is the same as in the Python docs for HTTPServer: https://docs.python.org/3/library/http.server.html
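A sketch of what the runtime warning could look like (the exact wording in server.py may differ):

```python
import warnings

warnings.warn(
    "mlx_lm.server is not recommended for production as it only implements "
    "basic security checks."
)
```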
* format
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* lazy model import in mlx_lm
* change lora loading
* fix olmo lora
* remove a bunch of unused stuff from plamo
* move phixtral to mlx-lm and out of llms/
* chore(mlx-lm): add model weight index in save_weights
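A hypothetical sketch of such an index, following the Hugging Face model.safetensors.index.json convention (the actual save_weights code may differ):

```python
import json


def save_weight_index(shards, path="model.safetensors.index.json"):
    # shards: {"model-00001-of-00002.safetensors": {name: array, ...}, ...}
    weight_map = {}
    total_size = 0
    for shard_file, weights in shards.items():
        for name, array in weights.items():
            weight_map[name] = shard_file
            total_size += array.nbytes
    index = {"metadata": {"total_size": total_size}, "weight_map": weight_map}
    with open(path, "w") as f:
        json.dump(index, f, indent=4)
```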
* Update llms/mlx_lm/utils.py
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Update llms/mlx_lm/utils.py
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* chore: save total size as param size instead of file size
* chore: clean up format
---------
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* LoRA: Remove unnecessary model type checks
1. Supported models are already checked in the load_model function in utils; there is no need to repeat the check in lora
2. The checks in lora are not synchronized with those in utils
* LoRA: add LoRA supported models in mlx_lm utils
* feat(mlx-lm): add de-quant for fuse
* chore: disable quant when converting to linear if de-quant is enabled
* chore: add better error handling for adapter file not found
* fix Chinese character generation the same way as PR #321
* reuse the generate logic in utils.py
* format
* verbose default
* fix conflicts with colorize and character check
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Convert HF weights of PLaMo and load them into a PLaMo model in mlx
* Fix model inference part
* Add bos at the beginning of the prompt
* Fix convert.py to copy tokenizer.model into the converted dir
* Use the required instruction format in generate.py when the "--instruct" option is specified
* Change filenames and update existing scripts
* Add README
* Add requirements.txt
* Fix plamo.py to stop generation when EOS appears
* Add quantization to convert.py
* Use mlx>=0.0.9 for mx.core.outer() in PLaMo model
* Update acknowledgements.md
* Fix card text in upload_to_hub()
* Don't use the prompt template when --instruct is not specified
* Ask whether to trust_remote_code when loading the PLaMo tokenizer
* Check that the user trusts the remote code when converting
* Remove plamo directory
* Update README
* Add PLaMo model file
* Fix the handling of cache in PLaMo and update README
* Ask if trust_remote_code only when the model is PLaMo
* Remove resolve_trust_remote_code from convert.py and use the latest transformers
* Remove the code so that EOS is not added
* Update README so the example does not use the noncommercial version of the model
* Remove unused imports
* Remove unnecessary description about the instruct model of PLaMo from README
* format, nits in README
* typo
---------
Co-authored-by: Shunta Saito <shunta@mitmul-mbp.local>
Co-authored-by: Awni Hannun <awni@apple.com>
* Add colorized output option to generate script
Two new functions were added to the script that allow output to be colorized based on the T[0] probability. Changes were made to the `generate_step` function in utils.py to permit colorization. Additionally, an argument for colorization was introduced to the command-line parser.
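An illustrative sketch of the colorization idea (not the exact functions added): pick an ANSI color from the top-token probability.

```python
def color_by_probability(text: str, prob: float) -> str:
    # Green for confident tokens, yellow for middling ones, red otherwise.
    if prob > 0.8:
        code = "\033[92m"
    elif prob > 0.5:
        code = "\033[93m"
    else:
        code = "\033[91m"
    return f"{code}{text}\033[0m"


print(color_by_probability("hello", 0.9))
```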
* Rename 'colorize' parameter to 'return_probability' in generate_step
* refactor(qwen): move qwen into mlx-lm
* chore: update doc
* chore: fix type hint
* add qwen model support in convert
* chore: fix doc
* chore: only load model in quantize_model
* chore: make the convert script only copy tokenizer files instead of loading and saving them
* chore: update docstring
* chore: remove unnecessary try catch
* chore: clean up tokenizer handling and update to transformers 4.37
* nits in README
---------
Co-authored-by: Awni Hannun <awni@apple.com>