Commit Graph

34 Commits

Author SHA1 Message Date
Awni Hannun
c4833a2f55
fix encoding with special tokens + chat template (#1189) 2025-01-03 10:50:59 -08:00
Awni Hannun
2ba0e36683
[mlx-lm] Use top p in server (#1144)
* use top p in server

* couple other fixes
2024-12-12 11:12:21 -08:00
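The server passes the request's top_p through to the sampler. For illustration only, a minimal, self-contained sketch of nucleus (top-p) sampling; this is not the project's sampler code:

```python
import numpy as np

def top_p_sample(logits: np.ndarray, top_p: float = 0.9) -> int:
    """Illustrative nucleus sampling: keep the smallest set of tokens whose
    cumulative probability reaches top_p, then sample from that set."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                    # tokens by probability, descending
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1    # smallest prefix covering top_p
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(np.random.choice(keep, p=kept_probs))
```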
Kevin Conner
0ffdb6dd20
Fix object property value in mlx_lm.server chat completions response to match OpenAI spec (#1119)
These were "chat.completions" and "chat.completions.chunk"
but should be "chat.completion" and "chat.completion.chunk"
for compatibility with clients expecting an OpenAI API.

In particular, this solves a problem in which aider 0.64.1 reports
hitting a token limit on any completion request, no matter how small,
despite apparently correct counts in the usage property.

Refer to:

https://platform.openai.com/docs/api-reference/chat/object

> object string
> The object type, which is always chat.completion.

https://platform.openai.com/docs/api-reference/chat/streaming

> object string
> The object type, which is always chat.completion.chunk.
2024-11-24 16:37:37 -08:00
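For reference, a minimal sketch of the two corrected response shapes; every field other than "object" is an illustrative placeholder:

```python
# Illustrative response skeletons showing the corrected "object" values.
non_streaming = {
    "id": "chatcmpl-123",
    "object": "chat.completion",        # was "chat.completions"
    "choices": [{"index": 0, "message": {"role": "assistant", "content": "Hi"}}],
}
streaming_chunk = {
    "id": "chatcmpl-123",
    "object": "chat.completion.chunk",  # was "chat.completions.chunk"
    "choices": [{"index": 0, "delta": {"content": "Hi"}}],
}
```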
Awni Hannun
0f135396ae
Generation refactor: part 2 (#1099)
* unify with stream_generate

* fixes

* nit

* some cleanup, warnings, tests

* fix test + faster min p + test

* version
2024-11-23 11:47:06 -08:00
Awni Hannun
657b4cc0aa
[MLX LM] Sampler refactor + a few improvements (#1094)
* starting

* refactor sampler/processor and a few improvements

* fix stream

* fix stream generate

* fix eos handling in stream generate
2024-11-07 16:15:24 -08:00
Awni Hannun
605c4854f1
Prompt caching in mlx_lm.server (#1026)
* caching in server

* nits

* fix tests

* don't throw if no metal

* comments
2024-10-14 10:57:22 -07:00
madroid
36c1d8e8dc
Server: support function calling (#1003) 2024-10-02 12:36:07 -07:00
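A hedged sketch of a function-calling request against the server; the port, model name, and tool definition are placeholders, and the payload simply follows the OpenAI tools schema:

```python
import json
from urllib.request import Request, urlopen

# Illustrative payload following the OpenAI "tools" schema; model, port, and the
# tool itself are placeholders, not values taken from this repository.
payload = {
    "model": "mlx-community/Meta-Llama-3-8B-Instruct-4bit",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
req = Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(urlopen(req).read()))
```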
jamesm131
d812516d3d
Add /v1/models endpoint to mlx_lm.server (#984)
* Add 'models' endpoint to server

* Add test for new 'models' server endpoint

* Check hf_cache for mlx models

* update tests to check hf_cache for models

* simplify test

* doc

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-09-28 07:21:11 -07:00
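A hedged usage sketch for the new endpoint; the port is an assumption for whatever mlx_lm.server was started with:

```python
import json
from urllib.request import urlopen

# List the locally cached MLX models the server reports.
with urlopen("http://localhost:8080/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))
```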
tidely
df744c98e6
Predict stop sequence matches during streaming (#541)
* Predict stop sequence matches during streaming

Check for overlap between the stop sequences and the tokens array to catch matches that only complete after more tokens are generated. Keep generating tokens until we can confirm that the stop sequence is not met.

* fix typo

* Change sequence_overlap logic

* range isn't inclusive, add 1 to max_overlap

* Add test_server.py

Added a test for the sequence_overlap method

* nits

* eos sequence

* finalize

---------

Co-authored-by: Y4hL <43219534+Y4hL@users.noreply.github.com>
Co-authored-by: Awni Hannun <awni@apple.com>
2024-08-06 15:24:15 -07:00
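A minimal sketch of the overlap check described above: hold back tokens while any suffix of the generated tokens could still be the start of a stop sequence. Illustration only, not the project's exact implementation:

```python
from typing import Sequence

def sequence_overlap(s1: Sequence, s2: Sequence) -> bool:
    """Return True if some suffix of s1 matches a prefix of s2."""
    max_overlap = min(len(s1), len(s2))
    # range is exclusive, so add 1 to include max_overlap itself
    return any(s1[-i:] == s2[:i] for i in range(1, max_overlap + 1))

# If a stop sequence only partially matches the tail of the generated tokens,
# withhold the overlapping tokens until more output confirms or rules out the match.
assert sequence_overlap([1, 2, 3], [3, 4])      # "3" could start the stop sequence
assert not sequence_overlap([1, 2, 3], [4, 5])  # safe to stream everything
```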
Khush Gupta
8fa12b0058
Adapters loading (#902)
* Added functionality to load adapters through POST requests so you do not need to restart the server

* ran pre-commit

* nits

* fix test

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-08-01 16:18:18 -07:00
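A hedged request sketch for per-request adapter loading; the "adapters" field name, path, port, and model are assumptions for illustration, so check the server docs for the exact interface:

```python
import json
from urllib.request import Request, urlopen

# Illustrative request that points the server at a LoRA adapter without a restart.
# The "adapters" field name, the path, the port, and the model are all assumptions.
payload = {
    "model": "mlx-community/Meta-Llama-3-8B-Instruct-4bit",
    "adapters": "path/to/adapters",
    "messages": [{"role": "user", "content": "Hello"}],
}
req = Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(urlopen(req).read().decode())
```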
madroid
85dc76f6e0
Server: support stream_options (#913)
* Server: support stream_options

see https://x.com/OpenAIDevs/status/1787573348496773423

* Server: support stream_options

* Server: check None type
2024-07-26 08:58:52 -07:00
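stream_options is the OpenAI field that appends a final usage chunk to a streamed response; a hedged payload sketch with a placeholder model name:

```python
# Illustrative streaming request; with stream_options.include_usage set,
# the stream ends with a chunk carrying token usage counts.
payload = {
    "model": "mlx-community/Meta-Llama-3-8B-Instruct-4bit",  # placeholder model
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
    "stream_options": {"include_usage": True},
}
```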
Awni Hannun
f0c6c6e226
keep the server in a valid state (#889) 2024-07-15 18:35:36 -07:00
Awni Hannun
68e88d42fb
Fix server for openai package (#877)
* fix

* fixes for 9b
2024-07-08 12:34:31 -07:00
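With this fix the official openai Python client can target the local server; a hedged usage sketch with placeholder port and model (the API key is unused by the server):

```python
from openai import OpenAI

# Any placeholder API key works; the server does not check it.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="mlx-community/Meta-Llama-3-8B-Instruct-4bit",  # placeholder model name
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```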
Angelos Katharopoulos
f212b770d8
Server loads the model on demand from the request (#851) 2024-06-27 11:37:57 -07:00
Chime Ogbuji
1d701a1831
Logprobs info to completion API (#806)
* Initial implementation

* Fix handling of return_step_logits in return

* Fixed OpenAI parameter expectations and logprob structure and datatypes

* pre-commit black formatting

* Remove unused parameter

* fix log probs

* fix colorize

* nits in server

* nits in server

* Fix top_logprobs structure (a dict) and include tokens in logprobs response

* nits

* fix types

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-23 10:35:13 -07:00
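A hedged request sketch for log probabilities; the fields mirror the OpenAI API, and the model name and port are placeholders:

```python
# Illustrative completion request asking for log probabilities.
payload = {
    "model": "mlx-community/Meta-Llama-3-8B-Instruct-4bit",
    "prompt": "The capital of France is",
    "max_tokens": 5,
    "logprobs": 3,  # also return the top-3 alternatives per generated token
}
# choices[0]["logprobs"] in the response then carries the generated tokens,
# their log probabilities, and a per-token dict of top alternatives.
```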
Nada Amin
3cc58e17fb
Tweaks to run dspy-produced calls to the server, with gemma template. (#810)
* Tweaks to run dspy-produced calls to the server, with gemma template.

following comment https://github.com/stanfordnlp/dspy/issues/385#issuecomment-1998939936

can try it out with:
```sh
python -m server --model mlx-community/gemma-1.1-7b-it-4bit --port 1143
```
modulo patching the relative imports in server.py
```
-from .tokenizer_utils import TokenizerWrapper
-from .utils import generate_step, load
+from mlx_lm.tokenizer_utils import TokenizerWrapper
+from mlx_lm.utils import generate_step, load
```

and then, on the dspy side:
```python
import dspy
lm = dspy.OpenAI(model_type="chat", api_base="http://localhost:11434/v1/", api_key="not_needed", max_tokens=250)
lm("hello")
```

* simpler way to validate float or int

* remove logic that works around incompatible templates, too gemma specific

* tweak messages for common denominator

* use generate.py workaround for DBXR

* put behind flag

* oops

* Solution to chat template issue: pass in a custom template!

The template should likely adhere to the OpenAI chat model.
Here is such a template for Gemma.

--chat-template "{{ bos_token }}{% set extra_system = '' %}{% for message in messages %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{% if role == 'system' %}{% set extra_system = extra_system + message['content'] %}{% else %}{% if role == 'user' and extra_system %}{% set message_system = 'System: ' + extra_system %}{% else %}{% set message_system = '' %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message_system + message['content'] | trim + '<end_of_turn>\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"

* remove convoluted solution

* Tweak for when None is provided explicitly, and must be set to [] too.

For example, the outlines library provides None explicitly.

* style

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-12 07:17:06 -07:00
Konstantin Kerekovski
d1c35fa684
Add MLX Cache Limit setting for mlx_lm.generate and mlx_lm.server CLI (#744)
* Add support for setting MLX cache limit in GB

* Add support for setting MLX cache limit in GB in mlx_lm.server

* format

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-05-03 12:42:48 -07:00
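Such a setting bounds MLX's buffer cache; a hedged sketch of the underlying call, with an illustrative GB-to-bytes conversion:

```python
import mlx.core as mx

cache_limit_gb = 4  # illustrative value, e.g. taken from a CLI flag
# Cap the memory MLX keeps in its buffer cache; the API takes bytes.
mx.metal.set_cache_limit(cache_limit_gb * 1024 * 1024 * 1024)
```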
Karim Elmaaroufi
4bf2eb17f2
Validate server params & fix logit bias bug (#731)
* Bug fix in logit bias

* Add parameter validations

* Fix typo

* Update docstrings to match MLX styling

* Black style + fix a validation bug
2024-04-30 07:27:40 -07:00
Kristian Muñiz
109ee2f2f8
Use CORS headers for streaming for MLX Server (#716) 2024-04-25 07:26:04 -07:00
Aaron Ng
8d5cf5b0c8
use logging in mlx server (#705) 2024-04-22 07:50:06 -07:00
Anchen
749cabf299
fix: unicode decoding (#702) 2024-04-21 08:58:23 -07:00
Karim Elmaaroufi
1484598de1
Add support for logit bias (#697) 2024-04-21 06:53:56 -07:00
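A minimal sketch of what an OpenAI-style logit_bias map does to the raw logits before sampling; illustration only, not the project's logits processor:

```python
import numpy as np

def apply_logit_bias(logits: np.ndarray, logit_bias: dict) -> np.ndarray:
    """Add a per-token bias ({token_id: bias}, OpenAI-style) to the raw logits."""
    biased = logits.copy()
    for token_id, bias in logit_bias.items():
        biased[token_id] += bias
    return biased

logits = np.zeros(8)
print(apply_logit_bias(logits, {3: 5.0, 5: -100.0}))  # favor token 3, ban token 5
```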
Anchen
f5f189e48a
fix(mlx-lm): broken server.py (#690)
* fix server.py

* fix var referenced before assignment

* add test

* clean up
2024-04-18 14:26:18 -07:00
Phúc H. Lê Khắc
35206806ac
Create executables for generate, lora, server, merge, convert (#682)
* feat: create executables mlx_lm.<cmd>

* nits in docs

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-04-16 16:08:49 -07:00
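Executables like mlx_lm.&lt;cmd&gt; are typically declared as packaging entry points; a hedged sketch, since the project's actual setup may differ:

```python
# setup.py sketch: each console script maps a command name to a module's main().
from setuptools import setup

setup(
    name="example-package",
    entry_points={
        "console_scripts": [
            "mlx_lm.generate = mlx_lm.generate:main",
            "mlx_lm.server = mlx_lm.server:main",
        ]
    },
)
```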
Awni Hannun
2bd64b78cf
Save lora config (#636)
* lora config

* comments

* version bump
2024-04-02 13:52:53 -07:00
Matt Wronkiewicz
373dd6f2a2
Set finish_reason in response (#592) 2024-03-19 20:21:26 -07:00
sweetcard
e2205beb66
Update server.py to add --trust-remote-code to server (#578)
* Update server.py

Add --trust-remote-code to server

* format code by running pre-commit

---------

Co-authored-by: flymonk <zhou.feng@gsafer.com>
2024-03-14 07:05:19 -07:00
Y4hL
b8e5eda4fd
Refactoring of mlx_lm example (#501)
* Use named tuple from typing for typehints

* Add type hints

* Simplify expression

* Type hint fix

* Improved do_POST logic

Use a map of endpoints to methods to reduce redundancy in code

* Fix format

* Improve redundancy

Call method dynamically instead of writing out all arguments twice

* Send response instead of returning

* Fix typo

* Revert change

* Make adapter_file Optional

* Mark formatter as optional

* format

* Create message generator

Store response data that stays static for the duration of the response inside of the object:

system_fingerprint
request_id
object_type
requested_model

Created a message generator that dynamically creates messages from the metadata stored inside of the object and the data from the model pipeline

* Remove leftover

* Update parameters to reflect new object structure

No longer pass all arguments between functions, but use the stored values inside of the object

* Parse body before calling request specific methods

* Call super init

* Update server.py

* Fixed outdated documentation parameter name

* Add documentation

* Fix sending headers twice

During testing I found that when using the streaming option, headers were always sent twice. This should fix that

* Simplify streaming code by using guard clauses

Don't wrap wfile writes in try blocks; the server class has its own try block to prevent crashing

* Bug fix

* Use Content-Length header

Let the completion type specific methods finish sending the headers. This allows us to send the Content-Length header as the model returns a completion.

* Update utils.py

* Add top_p documentation

* Type hint model and tokenizer as required

* Use static system fingerprint

System fingerprint now stays the same across requests

* Make type hint more specific

* Bug Fix

Supplying fewer than 2 models to merge would raise a ValueError and call len on the unbound "models". It should be "model_paths" instead.

Mark upload_repo as optional

* Move more of the shared code into do_POST

Processing stop_id_sequences is done regardless of the request endpoint or type, so move it into the shared section. handle_ methods now just return the prompt in mx.array form.

* Store stop_id_sequences as lists instead of np

During testing I found that letting the tokenizer return values as python lists and converting them to mlx arrays was around 20% faster than having the tokenizer convert them to np and then from np to mlx. This also means numpy no longer needs to be imported.

* Update stop_id_sequences docs

* Make the if check non-inclusive

Only continue if the buffer is smaller

* Documentation fix

* Clearer method names

Instead of handle_stream and generate_completion, we should name it handle_completion.

Instead of handle_completions and handle_chat_completions, we should name it handle_text_completions; since both are completions, calling it text completions makes it more descriptive

* Make comment clearer

* fix format

* format
2024-03-06 06:24:31 -08:00
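A minimal sketch of the endpoint-to-method map described above, using the handler names from this commit; illustration only, not the project's handler:

```python
from http.server import BaseHTTPRequestHandler

class APIHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Shared parsing happens once here; each endpoint maps to one method.
        endpoints = {
            "/v1/completions": self.handle_text_completions,
            "/v1/chat/completions": self.handle_chat_completions,
        }
        handler = endpoints.get(self.path)
        if handler is None:
            self.send_error(404)
            return
        handler()

    def handle_text_completions(self):
        ...

    def handle_chat_completions(self):
        ...
```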
Anchen
3655bfc3bd
chore(mlx-lm): fix broken server.py script (#519) 2024-03-03 06:04:54 -08:00
Y4hL
ea92f623d6
Prevent llms/mlx_lm from serving the local directory as a webserver (#498)
* Don't serve local directory

BaseHTTPRequestHandler serves the current directory by default. Definitely not intended behaviour. Remove the "do_HEAD" and "do_GET" methods.

* Fix typo in method name

I assume hanlde_stream was intended to be called handle_stream

* Fix outdated typehint

load_model returns nn.Module; however, fetch_from_hub was not updated to reflect the change

* Add some more type hints

* Add warnings for using in prod

Add a warning to README and runtime, discouraging use in production. The warning is the same as on the python docs for HTTPServer https://docs.python.org/3/library/http.server.html

* format

* nits

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-02-27 19:40:42 -08:00
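A hedged sketch of the runtime warning described above, echoing the caveat in the Python http.server docs; the exact wording here is illustrative:

```python
import warnings

# Mirrors the http.server caveat: fine for local experiments, not for production.
warnings.warn(
    "mlx_lm.server is not recommended for production as it only implements "
    "basic security checks."
)
```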
Awni Hannun
95f82e67a2
Fix import warning (#479)
* fix import warning
* fix version import
* remove api, move convert to utils
* also update circle to run external PRs
2024-02-27 08:47:56 -08:00
Anchen
82f3f31d93
chore(mlx-lm): refactor server.py to utilize generate_step from utils for consistency (#491)
* chore(mlx-lm): refactor server.py to utilize generate_step from utils for consistency

* chore(mlx-lm): update server doc

* chore: remove unused generate func
2024-02-27 06:25:24 -08:00
Anchen
19a21bfce4
chore: add /v1/completions for server (#489) 2024-02-26 20:59:33 -08:00
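A hedged usage sketch for the new /v1/completions endpoint; the port and model name are placeholders:

```python
import json
from urllib.request import Request, urlopen

payload = {
    "model": "mlx-community/Meta-Llama-3-8B-Instruct-4bit",  # placeholder
    "prompt": "Write a haiku about the ocean.",
    "max_tokens": 64,
    "temperature": 0.7,
}
req = Request(
    "http://localhost:8080/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(urlopen(req).read())["choices"][0]["text"])
```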
Anchen
88458c4e40
feat(mlx-lm): add openAI like api server (#429)
* feat(mlx-lm): add openAI like api server

* chore: fix sse format

* chore: add top_p support

* chore: fix the load import

* chore: add workaround for missing space in stream decoding

* chore: fix typo

* chore: add error handling for streaming

* chore: using slicing instead of replace

* chore: set host, port via args and improve handle stream token logic

* chore: refactor stop sequence function

* chore: rename stopping_criteria

* fix: unable to load kernel contiguous_scan_inclusive_sum_bfloat16_bfloat16

* chore: fix the streaming unicode issue

* Update llms/mlx_lm/server.py

Co-authored-by: Awni Hannun <awni.hannun@gmail.com>

* refactor: move stopping_criteria out of generate func

---------

Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
2024-02-18 14:01:28 -08:00
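A hedged end-to-end sketch of streaming from the OpenAI-like endpoint this PR introduced; the port and model name are placeholders, and the SSE parsing assumes the standard "data: ..." / "data: [DONE]" framing:

```python
import json
from urllib.request import Request, urlopen

# Illustrative streaming request against a locally running server.
payload = {
    "model": "mlx-community/Mistral-7B-Instruct-v0.2-4bit",  # placeholder model
    "messages": [{"role": "user", "content": "Say hello."}],
    "stream": True,
}
req = Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    for raw in resp:                      # server-sent events, one "data: ..." per line
        line = raw.decode().strip()
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```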