
# HTTP Model Server

You can use `mlx-lm` to serve an HTTP API for generating text with any
supported model. The HTTP API is intended to be similar to the [OpenAI chat
API](https://platform.openai.com/docs/api-reference).

> [!NOTE]  
> The MLX LM server is not recommended for production as it only implements
> basic security checks.

Start the server with: 

```shell
mlx_lm.server --model <path_to_model_or_hf_repo>
```

For example:

```shell
mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit
```

This starts a text generation server on port `8080` of `localhost` using
Mistral 7B Instruct. The model is downloaded from the provided Hugging Face
repo if it is not already in the local cache.

To see a full list of options run:

```shell
mlx_lm.server --help
```

You can make a request to the model by running:

```shell
curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'
```
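The same request can also be sent from Python using only the standard library.
The following is a minimal sketch, assuming the server is running at
`localhost:8080` as in the example above. The helper names
`build_chat_request` and `send_chat_request` are hypothetical and are not part
of `mlx-lm`.

```python
import json
import urllib.request

# Assumed server address: the default host and port mentioned above.
BASE_URL = "http://localhost:8080"


def build_chat_request(messages, base_url=BASE_URL, **fields):
    """Build (but do not send) a POST request for /v1/chat/completions.

    Hypothetical helper; it is not part of mlx-lm.
    """
    payload = {"messages": messages, **fields}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request(
    [{"role": "user", "content": "Say this is a test!"}],
    temperature=0.7,
)
# With a server running, the reply could be fetched with:
# reply = send_chat_request(request)
```

Building the request separately from sending it makes it possible to inspect
the URL and JSON body without a running server.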

### Request Fields

- `messages`: An array of message objects representing the conversation
  history. Each message object should have a role (e.g. user, assistant) and
  content (the message text).

- `role_mapping`: (Optional) A dictionary to customize the role prefixes in
  the generated prompt. If not provided, the default mappings are used.

- `stop`: (Optional) An array of strings or a single string. These are
  sequences of tokens on which the generation should stop.

- `max_tokens`: (Optional) An integer specifying the maximum number of tokens
  to generate. Defaults to `100`.

- `stream`: (Optional) A boolean indicating if the response should be
  streamed. If true, responses are sent as they are generated. Defaults to
  false.

- `temperature`: (Optional) A float specifying the sampling temperature.
  Defaults to `1.0`.

- `top_p`: (Optional) A float specifying the nucleus sampling parameter.
  Defaults to `1.0`.

- `repetition_penalty`: (Optional) Applies a penalty to repeated tokens.
  Defaults to `1.0`.

- `repetition_context_size`: (Optional) The size of the context window for
  applying repetition penalty. Defaults to `20`.

- `logit_bias`: (Optional) A dictionary mapping token IDs to their bias
  values. Defaults to `None`.

- `logprobs`: (Optional) An integer specifying the number of top tokens and
  corresponding log probabilities to return for each output in the generated
  sequence. If set, this can be any value between 1 and 10, inclusive.

- `model`: (Optional) A string path to a local model or Hugging Face repo id.
  If the path is local, it must be relative to the directory the server was
  started in.

- `adapters`: (Optional) A string path to low-rank adapters. The path must be
  relative to the directory the server was started in.
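As an illustration of how several of these fields combine, here is a small
client-side sketch that assembles a request body. The function `make_payload`
is hypothetical, its default values simply mirror the defaults listed above,
and the server performs its own validation regardless of what a client checks.

```python
import json


def make_payload(messages, stop=None, max_tokens=100, stream=False,
                 temperature=1.0, top_p=1.0, logprobs=None):
    """Assemble a request body from a subset of the fields listed above.

    Hypothetical client-side helper; it is not part of mlx-lm.
    """
    # The documented range for `logprobs` is 1 to 10, inclusive.
    if logprobs is not None and not 1 <= logprobs <= 10:
        raise ValueError("logprobs must be between 1 and 10, inclusive")
    payload = {
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": stream,
        "temperature": temperature,
        "top_p": top_p,
    }
    # `stop` may be a single string or an array of strings; send a list.
    if stop is not None:
        payload["stop"] = [stop] if isinstance(stop, str) else list(stop)
    # Only include `logprobs` when it was explicitly requested.
    if logprobs is not None:
        payload["logprobs"] = logprobs
    return payload


payload = make_payload(
    [{"role": "user", "content": "Count to five."}],
    stop="\n\n",
    max_tokens=50,
    logprobs=3,
)
print(json.dumps(payload, indent=2))
```

Optional fields that are left unset are omitted from the body, so the server
falls back to its own defaults for them.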

### Response Fields

- `id`: A unique identifier for the chat.

- `system_fingerprint`: A unique identifier for the system.

- `object`: Any of "chat.completion", "chat.completion.chunk" (for
  streaming), or "text.completion".

- `model`: The model repo or path (e.g. `"mlx-community/Llama-3.2-3B-Instruct-4bit"`).

- `created`: A time-stamp for when the request was processed.

- `choices`: A list of outputs. Each output is a dictionary containing the fields:
    - `index`: The index in the list.
    - `logprobs`: A dictionary containing the fields:
        - `token_logprobs`: A list of the log probabilities for the generated
          tokens.
        - `tokens`: A list of the generated token ids.
        - `top_logprobs`: A list of lists. Each inner list contains the top
          `logprobs` tokens (if requested) with their corresponding log
          probabilities.
    - `finish_reason`: The reason the completion ended. This is either
      `"stop"` or `"length"`.
    - `message`: The text response from the model.

- `usage`: A dictionary containing the fields:
    - `prompt_tokens`: The number of prompt tokens processed.
    - `completion_tokens`: The number of tokens generated.
    - `total_tokens`: The total number of tokens, i.e. the sum of the above two fields.
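The sketch below shows one way a client might read these fields. The helper
`summarize_completion` is hypothetical, and the sample response is hand-written
for illustration rather than captured from a server. In particular, whether
`message` arrives as plain text or as an OpenAI-style object with a `content`
key is an assumption, so the helper accepts both.

```python
def summarize_completion(response):
    """Pull the text, finish reason, and token count out of a response.

    Hypothetical helper; it is not part of mlx-lm.
    """
    choice = response["choices"][0]
    message = choice["message"]
    # Accept either plain text or an object with a `content` key.
    text = message["content"] if isinstance(message, dict) else message
    return {
        "text": text,
        "finish_reason": choice["finish_reason"],
        # "length" indicates generation stopped at the token limit.
        "hit_token_limit": choice["finish_reason"] == "length",
        "total_tokens": response["usage"]["total_tokens"],
    }


# Hand-written sample for illustration only, not captured server output.
sample_response = {
    "id": "chatcmpl-example",
    "object": "chat.completion",
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "created": 1700000000,
    "choices": [
        {
            "index": 0,
            "finish_reason": "stop",
            "message": {"role": "assistant", "content": "This is a test!"},
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 6, "total_tokens": 18},
}

summary = summarize_completion(sample_response)
print(summary["text"], summary["finish_reason"], summary["total_tokens"])
```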

### List Models

Use the `v1/models` endpoint to list available models:

```shell
curl localhost:8080/v1/models -H "Content-Type: application/json"
```

This will return a list of locally available models where each model in the
list contains the following fields:

- `id`: The Hugging Face repo id.
- `created`: A time-stamp representing the model creation time.
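
A client might turn that response into a simple list of ids and creation
times, as in the sketch below. The helper `list_models` is hypothetical. Two
things are assumptions here: that `created` is a Unix timestamp in seconds,
and that the list is either returned bare or wrapped under a `data` key in the
OpenAI style. The sample body is hand-written for illustration.

```python
from datetime import datetime, timezone


def list_models(body):
    """Return (id, creation time) pairs from a /v1/models response body.

    Hypothetical helper; it is not part of mlx-lm.
    """
    # Accept a bare list or an OpenAI-style wrapper with a `data` key.
    models = body["data"] if isinstance(body, dict) else body
    return [
        # Assumes `created` is a Unix timestamp in seconds.
        (m["id"], datetime.fromtimestamp(m["created"], tz=timezone.utc))
        for m in models
    ]


# Hand-written sample for illustration only, not captured server output.
sample_body = {
    "object": "list",
    "data": [
        {
            "id": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
            "object": "model",
            "created": 1700000000,
        }
    ],
}

for model_id, created in list_models(sample_body):
    print(model_id, created.isoformat())
```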