Mirror of https://github.com/ml-explore/mlx-examples.git (synced 2025-08-30 02:53:41 +08:00).

Commit `8e3f04f66c` (parent `cd8efc7fbc`): `mlx_lm.server --model mlx-community/Meta-Llama-3.1-8B-Instruct-8bit --trust-remote-code --port 8722`

New file: `llms/mlx_lm/Makefile` (11 lines)
```makefile
# Start the server
run:
	mlx_lm.server --model mlx-community/Meta-Llama-3.1-8B-Instruct-8bit --trust-remote-code --port 8722

# Kill the running server ($$ escapes the dollar sign so awk, not make, expands $2)
k:
	ps -ef | grep 'mlx_lm.server' | awk '{print $$2}' | xargs kill -9

# Hit a local API endpoint (separate service on port 9000) that uses the server
w:
	curl -X GET "http://127.0.0.1:9000/api/ai/WriteBlogRandomlyWithLLM?model=MLXLMServer" -H "Request-Origion:SwaggerBootstrapUi" -H "accept:*/*"

# Note: conda activate in a recipe only affects that recipe's subshell
c:
	conda activate m3mlx
```
This package also supports fine tuning with LoRA or QLoRA. For more information
see the [LoRA documentation](LORA.md).
## Install mlx_lm locally

```shell
cd llms
pip install -e .
```

Output:

```
Looking in indexes: https://bytedpypi.byted.org/simple/
Obtaining file:///Users/bytedance/ai/mlx-examples/llms
  Preparing metadata (setup.py) ... done
Requirement already satisfied: mlx>=0.14.1 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from mlx-lm==0.16.0) (0.16.0)
Requirement already satisfied: numpy in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from mlx-lm==0.16.0) (1.26.4)
Requirement already satisfied: transformers>=4.39.3 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (4.41.1)
Requirement already satisfied: protobuf in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from mlx-lm==0.16.0) (5.27.0)
Requirement already satisfied: pyyaml in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from mlx-lm==0.16.0) (6.0.1)
Requirement already satisfied: jinja2 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from mlx-lm==0.16.0) (3.1.4)
Requirement already satisfied: filelock in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from transformers>=4.39.3->transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (3.14.0)
Requirement already satisfied: huggingface-hub<1.0,>=0.23.0 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from transformers>=4.39.3->transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (0.23.4)
Requirement already satisfied: packaging>=20.0 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from transformers>=4.39.3->transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (24.0)
Requirement already satisfied: regex!=2019.12.17 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from transformers>=4.39.3->transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (2024.5.15)
Requirement already satisfied: requests in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from transformers>=4.39.3->transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (2.32.2)
Requirement already satisfied: tokenizers<0.20,>=0.19 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from transformers>=4.39.3->transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (0.19.1)
Requirement already satisfied: safetensors>=0.4.1 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from transformers>=4.39.3->transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (0.4.3)
Requirement already satisfied: tqdm>=4.27 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from transformers>=4.39.3->transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (4.66.4)
Requirement already satisfied: sentencepiece!=0.1.92,>=0.1.91 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (0.2.0)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from jinja2->mlx-lm==0.16.0) (2.1.5)
Requirement already satisfied: fsspec>=2023.5.0 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from huggingface-hub<1.0,>=0.23.0->transformers>=4.39.3->transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (2024.5.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from huggingface-hub<1.0,>=0.23.0->transformers>=4.39.3->transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (4.12.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from requests->transformers>=4.39.3->transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from requests->transformers>=4.39.3->transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from requests->transformers>=4.39.3->transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (2.2.1)
Requirement already satisfied: certifi>=2017.4.17 in /opt/miniconda3/envs/m3mlx/lib/python3.11/site-packages (from requests->transformers>=4.39.3->transformers[sentencepiece]>=4.39.3->mlx-lm==0.16.0) (2024.2.2)
Installing collected packages: mlx-lm
  Attempting uninstall: mlx-lm
    Found existing installation: mlx-lm 0.15.3
    Uninstalling mlx-lm-0.15.3:
      Successfully uninstalled mlx-lm-0.15.3
  Running setup.py develop for mlx-lm
Successfully installed mlx-lm-0.16.0
```
## Run MLX LM Server

```shell
cd llms/mlx_lm
```
Start the server with:

> See [SERVER.md](llms/mlx_lm/SERVER.md) for details.

```shell
mlx_lm.server --model <path_to_model_or_hf_repo>
```

For example:

```shell
mlx_lm.server --model mlx-community/Meta-Llama-3.1-8B-Instruct-8bit --trust-remote-code --port 8722
mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --trust-remote-code --port 8722
mlx_lm.server --model mlx-community/internlm2_5-7b-chat-8bit --trust-remote-code --port 8722
```

This starts a text generation server on `localhost` using the specified model,
listening on the given port (`8722` in the examples above; the default is
`8080`). The model will be downloaded from the provided Hugging Face repo if it
is not already in the local cache.
To see a full list of options run:

```shell
mlx_lm.server --help
```
You can make a request to the model by running:

```shell
curl localhost:8722/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
output:

```json
{
  "id": "chatcmpl-74e66064-8727-411a-ada3-d5287b2c83a2",
  "system_fingerprint": "fp_73a731bd-bd00-4dcd-8fac-8f3f452210a2",
  "object": "chat.completions",
  "model": "default_model",
  "created": 1721634359,
  "choices": [
    {
      "index": 0,
      "logprobs": {
        "token_logprobs": [
          -2.4453125, -1.28125, -1.421875, -0.25, -7.53125, -1.15625, -4.09375, -0.390625, -3.0625, -0.84375,
          -2.53125, -0.125, -0.40625, -0.015625, -0.15625, -0.265625, -1.015625, -1.6484375, -1.0625, -0.40625,
          -4.390625, -0.296875, -1.078125, -3.0625, -0.328125, -0.21875, -0.390625, -2.015625, -3.46875, 0.0,
          -0.765625, -2.609375, -1.921875, -1.078125, -1.859375, -1.625, -0.09375, -0.015625, -1.5625, -2.1015625,
          -1.65625, -0.21875, 0.0, 0.0, -1.640625, -0.0625, 0.0, -1.234375, -0.6875, -0.53125,
          -0.078125, -0.03125, -1.015625, -0.109375, -3.4765625, -0.015625, -2.140625, -1.34375, -1.0625, -2.21875,
          -1.046875, -0.046875, -0.375, -1.0, -1.0625, -3.21875, -0.5, -0.234375, -0.15625, -2.015625,
          -1.265625, -0.390625, -2.265625, -0.0625, -1.59375, -3.5625, -0.59375, -0.46875, -1.0, -1.3515625,
          -0.296875, -1.4375, 0.0, -1.1875, -0.46875, -0.15625, -0.375, -0.0625, -0.0625, -3.90625,
          -0.9375, -0.5625, -0.25, -2.53125, -0.28125, -2.640625, -0.59375, -0.75, -0.53125, -0.71875
        ],
        "top_logprobs": [],
        "tokens": [
          39584, 346, 5846, 725, 3716, 489, 4330, 25341, 16375, 3103,
          1226, 725, 395, 3556, 1593, 43916, 465, 2423, 57436, 334,
          19109, 446, 395, 16375, 22006, 55098, 465, 53057, 51040, 334,
          465, 848, 285, 3235, 53057, 4144, 334, 465, 461, 2423,
          57436, 830, 285, 3235, 5168, 334, 465, 461, 2136, 505,
          395, 1420, 17338, 465, 312, 281, 5128, 285, 2423, 5128,
          1883, 938, 334, 55098, 10363, 6069, 410, 1420, 328, 410,
          2863, 46301, 2119, 517, 2014, 334, 4872, 285, 3235, 2423,
          11740, 334, 465, 29581, 560, 410, 1420, 4736, 505, 6662,
          12590, 281, 1239, 1377, 3089, 22865, 560, 810, 6025, 3328
        ]
      },
      "finish_reason": "length",
      "message": {
        "role": "assistant",
        "content": "Sure! Here's What I'd Say Given That It's a Test:\n\n---\n\n**Test Scenario: Validation of a Given Statement**\n\n**Scenario Outline:** \n- **Scenario Name:** \"Test Scenario\"\n- **Description:** \"This is a test!\"\n\n**1. Pre-Test Preparations:**\n\nBefore starting the test, the following preparations must be made: \n\n- **Test Environment:** Ensure that the test environment is setup correctly. This may include ensuring that all necessary software"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 100,
    "total_tokens": 116
  }
}
```
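Because the server exposes an OpenAI-compatible `/v1/chat/completions` endpoint, the same request can be made from Python. A minimal sketch using only the standard library (the port `8722` and the request fields match the curl example above; `build_payload` and `chat` are hypothetical helper names, not part of mlx_lm):

```python
import json
import urllib.request

def build_payload(prompt, temperature=0.7, max_tokens=100):
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:8722"):
    """Send the request to a running mlx_lm.server and return the reply text."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With the server running, `chat("Say this is a test!")` returns the assistant message content from a response shaped like the JSON above.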
### Request Fields

- `messages`: An array of message objects representing the conversation
  history. Each message object should have a role (e.g. `user`, `assistant`) and
  content (the message text).

- `role_mapping`: (Optional) A dictionary to customize the role prefixes in
  the generated prompt. If not provided, the default mappings are used.

- `stop`: (Optional) An array of strings or a single string. These are
  sequences of tokens on which the generation should stop.

- `max_tokens`: (Optional) An integer specifying the maximum number of tokens
  to generate. Defaults to `100`.

- `stream`: (Optional) A boolean indicating if the response should be
  streamed. If true, responses are sent as they are generated. Defaults to
  false.

- `temperature`: (Optional) A float specifying the sampling temperature.
  Defaults to `1.0`.

- `top_p`: (Optional) A float specifying the nucleus sampling parameter.
  Defaults to `1.0`.

- `repetition_penalty`: (Optional) Applies a penalty to repeated tokens.
  Defaults to `1.0`.

- `repetition_context_size`: (Optional) The size of the context window for
  applying the repetition penalty. Defaults to `20`.

- `logit_bias`: (Optional) A dictionary mapping token IDs to their bias
  values. Defaults to `None`.

- `logprobs`: (Optional) An integer specifying the number of top tokens and
  corresponding log probabilities to return for each output token in the
  generated sequence. If set, this can be any value between 1 and 10,
  inclusive.
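As a worked example of using the `token_logprobs` field from the sample response above, the mean negative log probability of the returned tokens can be exponentiated to get a per-token perplexity. A small sketch, assuming (as is conventional) that the values are natural-log probabilities:

```python
import math

def perplexity(token_logprobs):
    """Per-token perplexity: exp of the mean negative log probability."""
    avg = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-avg)

# A list of all-zero logprobs gives perplexity 1.0:
# the model was certain of every token.
```

Lower values mean the model found its own output more predictable; feeding in the 100 values from the response above gives the perplexity of that completion.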
### Text Models

- [MLX LM](llms/README.md): a package for LLM text generation, fine-tuning, and more.
- [Transformer language model](transformer_lm) training.
- Minimal examples of large-scale text generation with [LLaMA](llms/llama),
  [Mistral](llms/mistral), and more in the [LLMs](llms) directory.
- A mixture-of-experts (MoE) language model with [Mixtral 8x7B](llms/mixtral).
- Parameter-efficient fine-tuning with [LoRA or QLoRA](lora).
- Text-to-text multi-task Transformers with [T5](t5).
- Bidirectional language understanding with [BERT](bert).

### Image Models

- Image classification using [ResNets on CIFAR-10](cifar).
- Generating images with [Stable Diffusion or SDXL](stable_diffusion).
- Convolutional variational autoencoder [(CVAE) on MNIST](cvae).

### Audio Models

- Speech recognition with [OpenAI's Whisper](whisper).

### Multimodal models

- Joint text and image embeddings with [CLIP](clip).
- Text generation from image and text inputs with [LLaVA](llava).

### Other Models

- Semi-supervised learning on graph-structured data with [GCN](gcn).
- Real NVP [normalizing flow](normalizing_flow) for density estimation and
  sampling.

### Hugging Face

Note: You can now directly download a few converted checkpoints from the [MLX
Community](https://huggingface.co/mlx-community) organization on Hugging Face.
We encourage you to join the community and [contribute new
models](https://github.com/ml-explore/mlx-examples/issues/155).

## Contributing

We are grateful for all of [our
contributors](ACKNOWLEDGMENTS.md#Individual-Contributors). If you contribute
to MLX Examples and wish to be acknowledged, please add your name to the list
in your pull request.

## Citing MLX Examples

The MLX software suite was initially developed with equal contribution by Awni
Hannun, Jagrit Digani, Angelos Katharopoulos, and Ronan Collobert. If you find
MLX Examples useful in your research and wish to cite it, please use the
following BibTeX entry:

```
@software{mlx2023,
  author = {Awni Hannun and Jagrit Digani and Angelos Katharopoulos and Ronan Collobert},
  title = {{MLX}: Efficient and flexible machine learning on Apple silicon},
  url = {https://github.com/ml-explore},
  version = {0.0},
  year = {2023},
}
```