HTTP Model Server

You can use mlx-lm to serve an HTTP API for generating text with any supported model. The HTTP API is intended to be similar to the OpenAI chat API.

Note

The MLX LM server is not recommended for production as it only implements basic security checks.

Start the server with:

python -m mlx_lm.server --model <path_to_model_or_hf_repo>

For example:

python -m mlx_lm.server --model mistralai/Mistral-7B-Instruct-v0.1

This starts a text generation server on port 8080 of localhost using Mistral 7B Instruct. If the model is not already in the local cache, it is downloaded from the provided Hugging Face repo.

To see a full list of options, run:

python -m mlx_lm.server --help

You can make a request to the model by running:

curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'
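
The same request can be made from Python. The following is a minimal sketch using only the standard library, not part of mlx-lm itself. The helper names build_request and chat are hypothetical, the URL assumes the default port 8080 on localhost, and the sketch assumes a non-streaming response body is JSON, as it is in the OpenAI chat API that the server is modeled on.

```python
import json
import urllib.request

# Assumed address: the default port 8080 on localhost, as in the curl example.
URL = "http://localhost:8080/v1/chat/completions"


def build_request(messages, temperature=0.7, url=URL):
    """Build (but do not send) the POST request that the curl command makes."""
    body = json.dumps({"messages": messages, "temperature": temperature})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages, **kwargs):
    """Send the request and return the decoded response.

    Assumption: the non-streaming response is a JSON document. Its layout
    is not described here, so it is returned without further processing.
    """
    with urllib.request.urlopen(build_request(messages, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, calling chat([{"role": "user", "content": "Say this is a test!"}]) sends the same body as the curl command above.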

Request Fields

  • messages: An array of message objects representing the conversation history. Each message object should have a role (e.g. user, assistant) and content (the message text).

  • role_mapping: (Optional) A dictionary to customize the role prefixes in the generated prompt. If not provided, the default mappings are used.

  • stop: (Optional) An array of strings or a single string. These are sequences of tokens on which generation should stop.

  • max_tokens: (Optional) An integer specifying the maximum number of tokens to generate. Defaults to 100.

  • stream: (Optional) A boolean indicating if the response should be streamed. If true, responses are sent as they are generated. Defaults to false.

  • temperature: (Optional) A float specifying the sampling temperature. Defaults to 1.0.

  • top_p: (Optional) A float specifying the nucleus sampling parameter. Defaults to 1.0.

  • repetition_penalty: (Optional) Applies a penalty to repeated tokens. Defaults to 1.0.

  • repetition_context_size: (Optional) The size of the context window for applying repetition penalty. Defaults to 20.
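
To illustrate how these fields fit together, here is a sketch of a request body builder in Python. The ChatRequest class is hypothetical (it is not part of mlx-lm); it only mirrors the field names and defaults listed above. Because the default role mappings are not listed here, role_mapping is left unset and omitted from the body unless provided.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Union


@dataclass
class ChatRequest:
    """Hypothetical container mirroring the request fields listed above."""

    messages: list
    role_mapping: Optional[dict] = None  # None: let the server use its defaults
    stop: Union[str, list, None] = None  # a single string or a list of strings
    max_tokens: int = 100
    stream: bool = False
    temperature: float = 1.0
    top_p: float = 1.0
    repetition_penalty: float = 1.0
    repetition_context_size: int = 20

    def __post_init__(self):
        # Each message object should carry both a role and its content.
        for message in self.messages:
            if "role" not in message or "content" not in message:
                raise ValueError("each message needs 'role' and 'content'")

    def to_json(self) -> str:
        # Drop unset optional fields so the server applies its own defaults.
        payload = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(payload)


request = ChatRequest(
    messages=[{"role": "user", "content": "Say this is a test!"}],
    stop=["\n\n"],
    max_tokens=50,
)
print(request.to_json())
```

The printed string can be passed as the -d argument to curl or sent as the POST body from any HTTP client.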