mlx-examples

mirror of https://github.com/ml-explore/mlx-examples.git synced 2025-06-24 17:31:18 +08:00

Author	SHA1	Message	Date
Khush Gupta	8fa12b0058	Adapters loading (#902) * Added functionality to load in adapters through POST requests so you do not need to restart the server * ran pre-commit * nits * fix test --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-08-01 16:18:18 -07:00
madroid	85dc76f6e0	Server: support stream_options (#913) * Server: support stream_options see https://x.com/OpenAIDevs/status/1787573348496773423 * Server: support stream_options * Server: check None type	2024-07-26 08:58:52 -07:00
Awni Hannun	f0c6c6e226	keep the server in a valid state (#889)	2024-07-15 18:35:36 -07:00
Awni Hannun	68e88d42fb	Fix server for `openai` package (#877) * fix * fixes for 9b	2024-07-08 12:34:31 -07:00
Angelos Katharopoulos	f212b770d8	Server loads the model on demand from the request (#851)	2024-06-27 11:37:57 -07:00
Chime Ogbuji	1d701a1831	Logprobs info to completion API (#806) * Initial implementation * Fix handling of return_step_logits in return * Fixed OpenAI parameter expectations and logprob structure and datatypes * pre-commit black formatting * Remove unused parameter * fix log probs * fix colorize * nits in server * nits in server * Fix top_logprobs structure (a dict) and include tokens in logprobs response * nits * fix types --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-06-23 10:35:13 -07:00
Nada Amin	3cc58e17fb	Tweaks to run dspy-produced calls to the server, with gemma template. (#810) * Tweaks to run dspy-produced calls to the server, with gemma template. following comment https://github.com/stanfordnlp/dspy/issues/385#issuecomment-1998939936 can try it out with: ```sh python -m server --model mlx-community/gemma-1.1-7b-it-4bit --port 11434 ``` modulo patching the relative imports in server.py ``` -from .tokenizer_utils import TokenizerWrapper -from .utils import generate_step, load +from mlx_lm.tokenizer_utils import TokenizerWrapper +from mlx_lm.utils import generate_step, load ``` and then, on the dspy side: ```python import dspy lm = dspy.OpenAI(model_type="chat", api_base="http://localhost:11434/v1/", api_key="not_needed", max_tokens=250) lm("hello") ``` * simpler way to validate float or int * remove logic that works around incompatible templates, too gemma specific * tweak messages for common denominator * use generate.py workaround for DBXR * put behind flag * oops * Solution to chat template issue: pass in a custom template! The template should likely adhere to the OpenAI chat model. Here is such a template for Gemma. --chat-template "{{ bos_token }}{% set extra_system = '' %}{% for message in messages %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{% if role == 'system' %}{% set extra_system = extra_system + message['content'] %}{% else %}{% if role == 'user' and extra_system %}{% set message_system = 'System: ' + extra_system %}{% else %}{% set message_system = '' %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message_system + message['content'] | trim + '<end_of_turn>\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}" * remove convoluted solution * Tweak for when None is provided explicitly, and must be set to [] too. For example, the outlines library provides None explicitly. * style --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-06-12 07:17:06 -07:00
Konstantin Kerekovski	d1c35fa684	Add MLX Cache Limit setting for mlx_lm.generate and mlx_lm.server CLI (#744) * Add support for setting MLX cache limit in GB * Add support for setting MLX cache limit in GB in mlx_lm.server * format --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-05-03 12:42:48 -07:00
Karim Elmaaroufi	4bf2eb17f2	Validate server params & fix logit bias bug (#731) * Bug fix in logit bias * Add parameter validations * Fix typo * Update docstrings to match MLX styling * Black style + fix a validation bug	2024-04-30 07:27:40 -07:00
Kristian Muñiz	109ee2f2f8	Use CORS headers for streaming for MLX Server (#716)	2024-04-25 07:26:04 -07:00
Aaron Ng	8d5cf5b0c8	use logging in mlx server (#705)	2024-04-22 07:50:06 -07:00
Anchen	749cabf299	fix: unicode decoding (#702)	2024-04-21 08:58:23 -07:00
Karim Elmaaroufi	1484598de1	Add support for logit bias (#697)	2024-04-21 06:53:56 -07:00
Anchen	f5f189e48a	fix(mlx-lm): broken server.py (#690) * fix server.py * fix var referenced before assignment * add test * clean up	2024-04-18 14:26:18 -07:00
Phúc H. Lê Khắc	35206806ac	Create executables for generate, lora, server, merge, convert (#682) * feat: create executables mlx_lm.<cmd> * nits in docs --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-04-16 16:08:49 -07:00
Awni Hannun	2bd64b78cf	Save lora config (#636) * lora config * comments * version bump	2024-04-02 13:52:53 -07:00
Matt Wronkiewicz	373dd6f2a2	Set finish_reason in response (#592)	2024-03-19 20:21:26 -07:00
sweetcard	e2205beb66	Update server.py to add --trust-remote-code to server (#578) * Update server.py Add --trust-remote-code to server * format code by running pre-commit --------- Co-authored-by: flymonk <zhou.feng@gsafer.com>	2024-03-14 07:05:19 -07:00
Y4hL	b8e5eda4fd	Refactoring of mlx_lm example (#501) * Use named tuple from typing for typehints * Add type hints * Simplify expression * Type hint fix * Improved do_POST logic Use a map of endpoints to methods to reduce redundancy in code * Fix format * Improve redundancy Call method dynamically instead of writing out all arguments twice * Send response instead of returning * Fix typo * Revert change * Make adapter_file as Optional * Mark formatter as optional * format * Create message generator Store response data that stays static for the duration of the response inside of the object: system_fingerprint request_id object_type requested_model Created a message generator, that dynamically creates messages from the metadata stored inside of the object, and the data from the model pipeline * Remove leftover * Update parameters to reflect new object structure No longer pass all arguments between functions, but use the stored values inside of the object * Parse body before calling request specific methods * Call super init * Update server.py * Fixed outdated documentation parameter name * Add documentation * Fix sending headers twice During testing I found that when using the streaming option, headers have always been sent twice. This should fix that * Simplify streaming code by using guard clauses Don't wrap wfile writes in try blocks, the server class has its own try block to prevent crashing * Bug fix * Use Content-Length header Let the completion type specific methods finish sending the headers. This allows us to send the Content-Length header as the model returns a completion. * Update utils.py * Add top_p documentation * Type hint model and tokenizer as required * Use static system fingerprint System fingerprint now stays the same across requests * Make type hint more specific * Bug Fix Supplying less than 2 models to merge would raise ValueError and calls len on unbound "models". Should be "model_paths" instead. Mark upload_repo as optional * Move more of the shared code into do_POST Processing stop_id_sequences is done no matter the request endpoint or type, move it into the shared section. handle_ methods now just return the prompt in mx.array form. * Store stop_id_sequences as lists instead of np During testing I found that letting the tokenizer return values as python lists and converting them to mlx arrays was around 20% faster than having the tokenizer convert them to np, and from np to mlx. This also makes it so numpy no longer needs to be imported. * Update stop_id_sequences docs * Turn if check to non-inclusive Only continue if buffer is smaller * Documentation fix * Cleared method names Instead of handle_stream and generate_competion, we should name it handle_completion. Instead of handle_completions and handle_chat_completions, we should name it handle_text_completions, since both are completions, calling it text completions should make it more descriptive * Make comment clearer * fix format * format	2024-03-06 06:24:31 -08:00
Anchen	3655bfc3bd	chore(mlx-lm): fix broken server.py script (#519)	2024-03-03 06:04:54 -08:00
Y4hL	ea92f623d6	Prevent llms/mlx_lm from serving the local directory as a webserver (#498) * Don't serve local directory BaseHTTPRequestHandler serves the current directory by default. Definitely not intended behaviour. Remove the "do_HEAD" and "do_GET" methods. * Fix typo in method name I assume hanlde_stream was intended to be called handle_stream * Fix outdated typehint load_model returns nn.Module, however fetch_from_hub was not updated to reflect the change * Add some more type hints * Add warnings for using in prod Add a warning to README and runtime, discouraging use in production. The warning is the same as on the python docs for HTTPServer https://docs.python.org/3/library/http.server.html * format * nits --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-02-27 19:40:42 -08:00
Awni Hannun	95f82e67a2	Fix import warning (#479) * fix import warning * fix version import * remove api, move convert to utils * also update circle to run external PRs	2024-02-27 08:47:56 -08:00
Anchen	82f3f31d93	chore(mlx-lm): refactor server.py to utilize generate_step from utils for consistency (#491) * chore(mlx-lm): refactor server.py to utilize generate_step from utils for consistency * chore(mlx-lm): update server doc * chore: remove unused generate func	2024-02-27 06:25:24 -08:00
Anchen	19a21bfce4	chore: add /v1/completions for server (#489)	2024-02-26 20:59:33 -08:00
Anchen	88458c4e40	feat(mlx-lm): add OpenAI-like API server (#429) * feat(mlx-lm): add OpenAI-like API server * chore: fix sse format * chore: add top_p support * chore: fix the load import * chore: add workaround for missing space in stream decoding * chore: fix typo * chore: add error handling for streaming * chore: using slicing instead of replace * chore: set host, port via args and improve handle stream token logic * chore: refactor stop sequence function * chore: rename stopping_criteria * fix: unable to load kernel contiguous_scan_inclusive_sum_bfloat16_bfloat16 * chore: fix the streaming unicode issue * Update llms/mlx_lm/server.py Co-authored-by: Awni Hannun <awni.hannun@gmail.com> * refactor: move stopping_criteria out of generate func --------- Co-authored-by: Awni Hannun <awni.hannun@gmail.com>	2024-02-18 14:01:28 -08:00

25 Commits