
These were "chat.completions" and "chat.completions.chunk" but should be "chat.completion" and "chat.completion.chunk" for compatibility with clients expecting an OpenAI API. In particular, this solves a problem in which aider 0.64.1 reports hitting a token limit on any completion request, no matter how small, despite apparently correct counts in the usage property. Refer to: https://platform.openai.com/docs/api-reference/chat/object > object string > The object type, which is always chat.completion. https://platform.openai.com/docs/api-reference/chat/streaming > object string > The object type, which is always chat.completion.chunk.
4.4 KiB
HTTP Model Server
You use `mlx-lm`
to make an HTTP API for generating text with any supported
model. The HTTP API is intended to be similar to the OpenAI chat
API.
Note
The MLX LM server is not recommended for production as it only implements basic security checks.
Start the server with:
```shell
mlx_lm.server --model <path_to_model_or_hf_repo>
```
For example:
```shell
mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit
```
This will start a text generation server on port 8080
of the localhost
using Mistral 7B instruct. The model will be downloaded from the provided
Hugging Face repo if it is not already in the local cache.
To see a full list of options run:
```shell
mlx_lm.server --help
```
You can make a request to the model by running:
```shell
curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'
```
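The same request can also be made programmatically. The following is a minimal sketch using the third-party `requests` package (installed separately); it simply mirrors the curl call above and prints the generated message.

```python
# Minimal sketch of the same request from Python using the third-party
# "requests" package. The payload mirrors the curl example above.
import requests

reply = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Say this is a test!"}],
        "temperature": 0.7,
    },
).json()

# "choices" and "message" are described under "Response Fields" below.
print(reply["choices"][0]["message"])
```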
Request Fields
- `messages`: An array of message objects representing the conversation
  history. Each message object should have a role (e.g. user, assistant) and
  content (the message text).

- `role_mapping`: (Optional) A dictionary to customize the role prefixes in
  the generated prompt. If not provided, the default mappings are used.

- `stop`: (Optional) An array of strings or a single string. These are
  sequences of tokens on which the generation should stop.

- `max_tokens`: (Optional) An integer specifying the maximum number of tokens
  to generate. Defaults to `100`.

- `stream`: (Optional) A boolean indicating if the response should be
  streamed. If true, responses are sent as they are generated. Defaults to
  false.

- `temperature`: (Optional) A float specifying the sampling temperature.
  Defaults to `1.0`.

- `top_p`: (Optional) A float specifying the nucleus sampling parameter.
  Defaults to `1.0`.

- `repetition_penalty`: (Optional) Applies a penalty to repeated tokens.
  Defaults to `1.0`.

- `repetition_context_size`: (Optional) The size of the context window for
  applying the repetition penalty. Defaults to `20`.

- `logit_bias`: (Optional) A dictionary mapping token IDs to their bias
  values. Defaults to `None`.

- `logprobs`: (Optional) An integer specifying the number of top tokens and
  corresponding log probabilities to return for each output in the generated
  sequence. If set, this can be any value between 1 and 10, inclusive.

- `model`: (Optional) A string path to a local model or Hugging Face repo id.
  If the path is local, it must be relative to the directory the server was
  started in.

- `adapters`: (Optional) A string path to low-rank adapters. The path must be
  relative to the directory the server was started in.
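As a sketch of how the optional fields above combine, the request below sets a handful of them explicitly; the values are arbitrary illustrations rather than recommended settings.

```python
# Illustrative request combining several of the optional fields; the values
# are arbitrary examples, not recommended settings.
import requests

payload = {
    "messages": [{"role": "user", "content": "Write one sentence about the ocean."}],
    "max_tokens": 50,           # cap generation length (default 100)
    "temperature": 0.3,         # sampling temperature (default 1.0)
    "top_p": 0.9,               # nucleus sampling parameter (default 1.0)
    "repetition_penalty": 1.1,  # penalize repeated tokens (default 1.0)
    "stop": ["\n\n"],           # stop generating at a blank line
    "logprobs": 5,              # return the top-5 log probabilities per token
    "stream": False,            # return the full response in one piece
}

reply = requests.post("http://localhost:8080/v1/chat/completions", json=payload).json()
print(reply["choices"][0]["message"])
```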
Response Fields
- `id`: A unique identifier for the chat.

- `system_fingerprint`: A unique identifier for the system.

- `object`: Any of `"chat.completion"`, `"chat.completion.chunk"` (for
  streaming), or `"text.completion"`.

- `model`: The model repo or path (e.g. `"mlx-community/Llama-3.2-3B-Instruct-4bit"`).

- `created`: A time-stamp for when the request was processed.

- `choices`: A list of outputs. Each output is a dictionary containing the
  fields:
  - `index`: The index in the list.
  - `logprobs`: A dictionary containing the fields:
    - `token_logprobs`: A list of the log probabilities for the generated
      tokens.
    - `tokens`: A list of the generated token ids.
    - `top_logprobs`: A list of lists. Each list contains the `logprobs` top
      tokens (if requested) with their corresponding probabilities.
  - `finish_reason`: The reason the completion ended. This can be either
    `"stop"` or `"length"`.
  - `message`: The text response from the model.

- `usage`: A dictionary containing the fields:
  - `prompt_tokens`: The number of prompt tokens processed.
  - `completion_tokens`: The number of tokens generated.
  - `total_tokens`: The total number of tokens, i.e. the sum of the above two
    fields.
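To make the shape concrete, the short sketch below pulls the documented fields out of a parsed, non-streaming response, continuing from the `reply` dictionary in the earlier examples; the exact values will of course vary from run to run.

```python
# Reading the documented fields from a parsed, non-streaming response,
# continuing from the "reply" dictionary in the examples above.
print(reply["object"])                      # "chat.completion" when not streaming
print(reply["model"])                       # model repo or path
print(reply["created"])                     # time-stamp of the request

choice = reply["choices"][0]
print(choice["finish_reason"])              # "stop" or "length"
print(choice["message"])                    # the model's text response

print(reply["usage"]["prompt_tokens"])      # prompt tokens processed
print(reply["usage"]["completion_tokens"])  # tokens generated
print(reply["usage"]["total_tokens"])       # sum of the two counts above
```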
List Models
Use the v1/models
endpoint to list available models:
```shell
curl localhost:8080/v1/models -H "Content-Type: application/json"
```
This will return a list of locally available models where each model in the list contains the following fields:
- `id`: The Hugging Face repo id.
- `created`: A time-stamp representing the model creation time.
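As a small sketch, the listing can also be fetched from Python. This assumes the models are wrapped in an OpenAI-style `data` array, which may differ between server versions, so the snippet falls back to treating the body as a bare list.

```python
# Fetch the model listing and print the documented fields for each entry.
# Assumption: the models sit under an OpenAI-style "data" key; otherwise the
# response body is treated as a bare list.
import requests

body = requests.get("http://localhost:8080/v1/models").json()
models = body["data"] if isinstance(body, dict) else body
for model in models:
    print(model["id"], model["created"])
```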