# HTTP Model Server
You use `mlx-lm` to make an HTTP API for generating text with any supported
model. The HTTP API is intended to be similar to the [OpenAI chat
API](https://platform.openai.com/docs/api-reference).
> [!NOTE]
> The MLX LM server is not recommended for production as it only implements
> basic security checks.
Start the server with:
```shell
mlx_lm.server --model <path_to_model_or_hf_repo>
```
For example:
```shell
mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit
```
This will start a text generation server on port `8080` of the `localhost`
using Mistral 7B instruct. The model will be downloaded from the provided
Hugging Face repo if it is not already in the local cache.
To see a full list of options run:
```shell
mlx_lm.server --help
```
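
For instance, assuming the standard `--port` flag (confirm against the `--help`
output above), you can serve on a different port:

```shell
mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --port 8081
```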
You can make a request to the model by running:
```shell
curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'
```
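
The same endpoint supports streaming via the `stream` request field (described
below). A minimal sketch, assuming the server emits OpenAI-style `data:`
server-sent-event chunks:

```shell
# -N turns off curl's output buffering so chunks print as they arrive
curl -N localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "stream": true
   }'
```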
### Request Fields
- `messages`: An array of message objects representing the conversation
history. Each message object should have a role (e.g. user, assistant) and
content (the message text).
- `role_mapping`: (Optional) A dictionary to customize the role prefixes in
the generated prompt. If not provided, the default mappings are used.
- `stop`: (Optional) An array of strings or a single string. These are
sequences of tokens on which the generation should stop.
- `max_tokens`: (Optional) An integer specifying the maximum number of tokens
to generate. Defaults to `100`.
- `stream`: (Optional) A boolean indicating if the response should be
streamed. If true, responses are sent as they are generated. Defaults to
false.
- `temperature`: (Optional) A float specifying the sampling temperature.
Defaults to `1.0`.
- `top_p`: (Optional) A float specifying the nucleus sampling parameter.
Defaults to `1.0`.
- `repetition_penalty`: (Optional) Applies a penalty to repeated tokens.
Defaults to `1.0`.
- `repetition_context_size`: (Optional) The size of the context window for
applying repetition penalty. Defaults to `20`.
- `logit_bias`: (Optional) A dictionary mapping token IDs to their bias
values. Defaults to `None`.
- `logprobs`: (Optional) An integer specifying the number of top tokens and
corresponding log probabilities to return for each output in the generated
sequence. If set, this can be any value between 1 and 10, inclusive.
- `model`: (Optional) A string path to a local model or Hugging Face repo id.
If the path is local, it must be relative to the directory the server was
started in.
- `adapters`: (Optional) A string path to low-rank adapters. The path must be
relative to the directory the server was started in.
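
For example, a single request combining several of the optional fields above
might look like this (the specific values are illustrative, not
recommendations):

```shell
curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "messages": [{"role": "user", "content": "Write a haiku about the sea."}],
     "max_tokens": 60,
     "temperature": 0.7,
     "top_p": 0.9,
     "repetition_penalty": 1.1,
     "stop": ["\n\n"],
     "logprobs": 3
   }'
```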
### Response Fields
- `id`: A unique identifier for the chat.
- `system_fingerprint`: A unique identifier for the system.
- `object`: Any of "chat.completion", "chat.completion.chunk" (for
streaming), or "text.completion".
- `model`: The model repo or path (e.g. `"mlx-community/Llama-3.2-3B-Instruct-4bit"`).
- `created`: A time-stamp for when the request was processed.
- `choices`: A list of outputs. Each output is a dictionary containing the fields:
  - `index`: The index in the list.
  - `logprobs`: A dictionary containing the fields:
    - `token_logprobs`: A list of the log probabilities for the generated
      tokens.
    - `tokens`: A list of the generated token ids.
    - `top_logprobs`: A list of lists. Each list contains the `logprobs`
      top tokens (if requested) with their corresponding probabilities.
  - `finish_reason`: The reason the completion ended. This can be either
    `"stop"` or `"length"`.
  - `message`: The text response from the model.
- `usage`: A dictionary containing the fields:
  - `prompt_tokens`: The number of prompt tokens processed.
  - `completion_tokens`: The number of tokens generated.
  - `total_tokens`: The total number of tokens, i.e. the sum of the above two fields.
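
As a quick way to inspect these fields, you can pipe a request through `jq`
(this sketch assumes `jq` is installed and that `message` follows the
OpenAI shape with a nested `content` field):

```shell
curl -s localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say this is a test!"}]}' \
  | jq '{id, model, created, reply: .choices[0].message.content, usage}'
```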
### List Models
Use the `v1/models` endpoint to list available models:
```shell
curl localhost:8080/v1/models -H "Content-Type: application/json"
```
This will return a list of locally available models where each model in the
list contains the following fields:
- `id`: The Hugging Face repo id.
- `created`: A time-stamp representing the model creation time.
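
For example, to print just these fields (assuming the endpoint wraps the
models in an OpenAI-style `data` array and `jq` is available):

```shell
curl -s localhost:8080/v1/models | jq '.data[] | {id, created}'
```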