* Added functionality to load adapters through POST requests, so the server does not need to be restarted
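A minimal sketch of what such a request could look like from the client side. The `adapters` field name, the `/v1/completions` path, and the helper name are assumptions for illustration; the commit message does not spell them out.

```python
import json
from urllib.request import Request


def build_completion_request(base_url: str, prompt: str, adapter_path: str) -> Request:
    """Hypothetical helper: build (but do not send) a POST that names an adapter."""
    payload = {
        "prompt": prompt,
        # Assumed field name: tells the server which adapter to load for this request.
        "adapters": adapter_path,
    }
    return Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request("http://localhost:8080", "hello", "path/to/adapters")
```

Passing `req` to `urllib.request.urlopen` would send it to a running server.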
* ran pre-commit
* nits
* fix test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Initial implementation
* Fix handling of return_step_logits in return
* Fixed OpenAI parameter expectations and logprob structure and datatypes
* pre-commit black formatting
* Remove unused parameter
* fix log probs
* fix colorize
* nits in server
* nits in server
* Fix top_logprobs structure (a dict) and include tokens in logprobs response
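A sketch of the logprobs shape this describes: `top_logprobs` holds one dict per generated token, and the tokens themselves are included alongside. The helper name and sample values are hypothetical, and representing tokens as integer ids is an assumption.

```python
from typing import Dict, List


def build_logprobs(
    tokens: List[int],
    token_logprobs: List[float],
    top_logprobs: List[Dict[int, float]],
) -> dict:
    """Hypothetical helper: bundle per-token data into one logprobs payload."""
    if not (len(tokens) == len(token_logprobs) == len(top_logprobs)):
        raise ValueError("all per-token lists must have the same length")
    return {
        "tokens": tokens,                  # the sampled tokens
        "token_logprobs": token_logprobs,  # log probability of each sampled token
        "top_logprobs": top_logprobs,      # per position: candidate token -> log probability
    }


payload = build_logprobs(
    [101, 202],
    [-0.1, -1.2],
    [{101: -0.1, 7: -2.5}, {202: -1.2, 9: -1.9}],
)
```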
* nits
* fix types
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Tweaks to run dspy-produced calls to the server, with gemma template.
following comment https://github.com/stanfordnlp/dspy/issues/385#issuecomment-1998939936
can try it out with:
```sh
python -m server --model mlx-community/gemma-1.1-7b-it-4bit --port 11434
```
modulo patching the relative imports in server.py
```diff
-from .tokenizer_utils import TokenizerWrapper
-from .utils import generate_step, load
+from mlx_lm.tokenizer_utils import TokenizerWrapper
+from mlx_lm.utils import generate_step, load
```
and then, on the dspy side:
```python
import dspy
lm = dspy.OpenAI(model_type="chat", api_base="http://localhost:11434/v1/", api_key="not_needed", max_tokens=250)
lm("hello")
```
* simpler way to validate float or int
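One way such a check can be written is a single `isinstance` call with a tuple of types. The helper name, the `minimum` argument, and the decision to reject `bool` are assumptions for illustration, not the server's actual code.

```python
from typing import Optional, Union

Number = Union[int, float]


def validate_number(name: str, value: object, minimum: Optional[Number] = None) -> Number:
    """Hypothetical helper: accept a float or an int with one isinstance check."""
    # bool is a subclass of int, so it is rejected explicitly here
    # (an assumption about the desired behaviour).
    if isinstance(value, bool) or not isinstance(value, (float, int)):
        raise ValueError(f"{name} must be a float or an int")
    if minimum is not None and value < minimum:
        raise ValueError(f"{name} must be at least {minimum}")
    return value
```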
* remove logic that works around incompatible templates, too gemma specific
* tweak messages for common denominator
* use generate.py workaround for DBRX
* put behind flag
* oops
* Solution to chat template issue: pass in a custom template!
The template should likely adhere to the OpenAI chat model.
Here is such a template for Gemma.
--chat-template "{{ bos_token }}{% set extra_system = '' %}{% for message in messages %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{% if role == 'system' %}{% set extra_system = extra_system + message['content'] %}{% else %}{% if role == 'user' and extra_system %}{% set message_system = 'System: ' + extra_system %}{% else %}{% set message_system = '' %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message_system + message['content'] | trim + '<end_of_turn>\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"
* remove convoluted solution
* Tweak for when None is provided explicitly: it must be set to [] as well.
For example, the outlines library provides None explicitly.
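A sketch of the distinction involved: `dict.get(key, [])` only applies its default when the key is missing, so an explicit `None` slips through. The `stop` field name and the helper are assumptions for illustration.

```python
from typing import List


def normalize_stop(body: dict) -> List[str]:
    """Hypothetical helper: treat a missing key and an explicit None the same way."""
    stop = body.get("stop")
    if stop is None:  # key absent, or client sent an explicit null
        return []
    if isinstance(stop, str):  # a single stop string becomes a one-element list
        return [stop]
    return list(stop)
```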
* style
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Add support for setting MLX cache limit in GB
* Add support for setting MLX cache limit in GB in mlx_lm.server
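The MLX cache limit is set in bytes, so a value given in GB has to be converted first. A minimal sketch, assuming 1 GB means 1024**3 bytes; the helper name is hypothetical, and the MLX call is left as a comment so the snippet runs without MLX installed.

```python
def gb_to_bytes(limit_gb: float) -> int:
    """Hypothetical helper: convert a limit in GB to bytes (assuming 1 GB = 1024**3 bytes)."""
    if limit_gb < 0:
        raise ValueError("cache limit must be non-negative")
    return int(limit_gb * 1024**3)


limit_bytes = gb_to_bytes(2)
# Assumed usage with MLX (not executed here):
#   import mlx.core as mx
#   mx.metal.set_cache_limit(limit_bytes)
```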
* format
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Use named tuple from typing for typehints
* Add type hints
* Simplify expression
* Type hint fix
* Improved do_POST logic
Use a map of endpoints to methods to reduce redundancy in code
* Fix format
* Reduce redundancy
Call method dynamically instead of writing out all arguments twice
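The endpoint-to-method map can be sketched without any HTTP machinery. The `Router` class, the paths, and the return values are assumptions for illustration; only the idea of looking the handler up in a dict and calling it dynamically comes from the commit.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class Router:
    """Hypothetical stand-in for the request handler's dispatch logic."""

    def __init__(self) -> None:
        # One table from endpoint path to bound method replaces an if/elif chain.
        self.endpoints: Dict[str, Callable[[dict], Any]] = {
            "/v1/completions": self.handle_text_completions,
            "/v1/chat/completions": self.handle_chat_completions,
        }

    def dispatch(self, path: str, body: dict) -> Tuple[int, Optional[Any]]:
        handler = self.endpoints.get(path)
        if handler is None:
            return 404, None  # unknown endpoint
        return 200, handler(body)  # call the selected method dynamically

    def handle_text_completions(self, body: dict) -> Tuple[str, Any]:
        return "text", body["prompt"]

    def handle_chat_completions(self, body: dict) -> Tuple[str, Any]:
        return "chat", body["messages"]
```

Adding an endpoint then means adding one entry to the table rather than another branch.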
* Send response instead of returning
* Fix typo
* Revert change
* Mark adapter_file as Optional
* Mark formatter as optional
* format
* Create message generator
Store response data that stays static for the duration of the response inside of the object:
system_fingerprint
request_id
object_type
requested_model
Created a message generator that dynamically creates messages from the metadata stored inside of the object and the data from the model pipeline
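A sketch of the pattern: the static fields are stored once on the object, and each message is built from them plus whatever the model produced. The class name, the `created` field, the id format, and the layout of `choices` are assumptions; the four stored field names come from the list above.

```python
import time
import uuid
from typing import Optional


class ResponseBuilder:
    """Hypothetical object holding the per-response static metadata."""

    def __init__(self, requested_model: str, object_type: str, system_fingerprint: str) -> None:
        self.requested_model = requested_model
        self.object_type = object_type
        self.system_fingerprint = system_fingerprint
        self.request_id = f"cmpl-{uuid.uuid4()}"  # assumed id format
        self.created = int(time.time())

    def generate_response(self, text: str, finish_reason: Optional[str] = None) -> dict:
        # Only `text` and `finish_reason` vary between messages of one response.
        return {
            "id": self.request_id,
            "system_fingerprint": self.system_fingerprint,
            "object": self.object_type,
            "model": self.requested_model,
            "created": self.created,
            "choices": [{"index": 0, "text": text, "finish_reason": finish_reason}],
        }
```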
* Remove leftover
* Update parameters to reflect new object structure
No longer pass all arguments between functions, but use the stored values inside of the object
* Parse body before calling request specific methods
* Call super init
* Update server.py
* Fixed outdated documentation parameter name
* Add documentation
* Fix sending headers twice
During testing I found that, with the streaming option, headers were always sent twice. This should fix that
* Simplify streaming code by using guard clauses
Don't wrap wfile writes in try blocks, the server class has its own try block to prevent crashing
* Bug fix
* Use Content-Length header
Let the completion type specific methods finish sending the headers. This allows us to send the Content-Length header as the model returns a completion.
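A sketch of why the headers have to wait for the completion: `Content-Length` is the size of the encoded body in bytes, which is only known once the response is complete. The helper name is hypothetical.

```python
import json
from typing import Dict, Tuple


def encode_response(payload: dict) -> Tuple[Dict[str, str], bytes]:
    """Hypothetical helper: serialize the payload, then derive headers from the bytes."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Byte length, not character length: non-ASCII text encodes to more bytes.
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = encode_response({"text": "café"})
```

Inside a `BaseHTTPRequestHandler` method, each header would go through `self.send_header(...)`, followed by `self.end_headers()` and `self.wfile.write(body)`.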
* Update utils.py
* Add top_p documentation
* Type hint model and tokenizer as required
* Use static system fingerprint
System fingerprint now stays the same across requests
* Make type hint more specific
* Bug Fix
Supplying fewer than 2 models to merge should raise ValueError, but the check calls len on the unbound name "models". It should be "model_paths" instead.
Mark upload_repo as optional
* Move more of the shared code into do_POST
Processing stop_id_sequences is done no matter the request endpoint or type, move it into the shared section. handle_ methods now just return the prompt in mx.array form.
* Store stop_id_sequences as lists instead of np
During testing I found that letting the tokenizer return values as Python lists and converting them to mlx arrays was around 20% faster than having the tokenizer convert them to np, and from np to mlx. This also means numpy no longer needs to be imported.
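With stop_id_sequences held as plain Python lists, a stop check needs nothing beyond list slicing. A minimal sketch; the function name and the suffix-match strategy are assumptions, not the server's actual stopping logic.

```python
from typing import List


def ends_with_stop(tokens: List[int], stop_id_sequences: List[List[int]]) -> bool:
    """Hypothetical check: do the generated tokens end with any stop sequence?"""
    for stop_ids in stop_id_sequences:
        if not stop_ids:
            continue  # an empty stop sequence never matches
        if len(tokens) >= len(stop_ids) and tokens[-len(stop_ids):] == stop_ids:
            return True
    return False
```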
* Update stop_id_sequences docs
* Turn if check to non-inclusive
Only continue if buffer is smaller
* Documentation fix
* Cleared method names
Instead of handle_stream and generate_completion, we should name it handle_completion.
handle_completions should be named handle_text_completions: both it and handle_chat_completions are completions, so calling it text completions is more descriptive
* Make comment clearer
* fix format
* format
* Don't serve local directory
BaseHTTPRequestHandler serves the current directory by default. Definitely not intended behaviour. Remove the "do_HEAD" and "do_GET" methods.
* Fix typo in method name
I assume hanlde_stream was intended to be called handle_stream
* Fix outdated typehint
load_model returns nn.Module, however fetch_from_hub was not updated to reflect the change
* Add some more type hints
* Add warnings for using in prod
Add a warning to the README and at runtime, discouraging use in production. The warning matches the one in the Python docs for http.server: https://docs.python.org/3/library/http.server.html
* format
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>