mlx-examples

mirror of https://github.com/ml-explore/mlx-examples.git synced 2025-08-30 02:53:41 +08:00

chenguangjian.jk c4e0f04b90 mlx_lm.server --model mlx-community/Qwen2.5-14B-Instruct-8bit --trust-remote-code --port 8722		2024-10-02 04:20:20 +08:00
examples	Adding full finetuning (#903 )	2024-09-29 17:12:47 -07:00
models	Adding support for mamba (#940 )	2024-09-28 07:02:53 -07:00
tuner	Feature: QDoRA (#891 )	2024-09-30 08:01:11 -07:00
__init__.py	Make sure to import the correct "version" module when installing mlx_whisper and mlx_lm from local source code. (#969 )	2024-09-03 13:16:21 -07:00
_version.py	Update LLM generation docs to use chat template (#973 )	2024-09-07 06:06:15 -07:00
cache_prompt.py	Fix the cache_prompt (#979 )	2024-09-06 20:19:27 -07:00
convert.py	Create executables for generate, lora, server, merge, convert (#682 )	2024-04-16 16:08:49 -07:00
fuse.py	Adding full finetuning (#903 )	2024-09-29 17:12:47 -07:00
generate.py	Add prompt piping (#962 )	2024-09-03 13:29:10 -07:00
gguf.py	Fix export to gguf (#993 )	2024-09-20 13:33:45 -07:00
kill.sh	mlx_lm.server --model mlx-community/Mistral-Nemo-Instruct-2407-8bit --trust-remote-code --port 8722	2024-07-25 10:55:25 +08:00
LORA.md	LoRA: Support HuggingFace dataset via data parameter (#996 )	2024-09-30 07:36:21 -07:00
lora.py	LoRA: Support HuggingFace dataset via data parameter (#996 )	2024-09-30 07:36:21 -07:00
Makefile	mlx_lm.server --model mlx-community/Qwen2.5-14B-Instruct-8bit --trust-remote-code --port 8722	2024-10-02 04:20:20 +08:00
MANAGE.md	Add model management functionality for local caches (#736 )	2024-05-03 12:20:13 -07:00
manage.py	Add model management functionality for local caches (#736 )	2024-05-03 12:20:13 -07:00
MERGE.md	Create executables for generate, lora, server, merge, convert (#682 )	2024-04-16 16:08:49 -07:00
merge.py	Create executables for generate, lora, server, merge, convert (#682 )	2024-04-16 16:08:49 -07:00
py.typed	Add `py.typed` to support PEP-561 (type-hinting) (#389 )	2024-01-30 21:17:38 -08:00
README.md	mlx_lm.server --model mlx-community/Qwen2.5-14B-Instruct-8bit --trust-remote-code --port 8722	2024-10-02 04:20:20 +08:00
requirements.txt	Use fast rope (#945 )	2024-08-23 13:18:51 -07:00
sample_utils.py	Min P implementation (#926 )	2024-08-15 15:45:02 -07:00
SERVER.md	Add /v1/models endpoint to mlx_lm.server (#984 )	2024-09-28 07:21:11 -07:00
server.py	Add /v1/models endpoint to mlx_lm.server (#984 )	2024-09-28 07:21:11 -07:00
tokenizer_utils.py	Fix setattr for the TokenizerWrapper (#961 )	2024-08-28 14:47:33 -07:00
UPLOAD.md	Mlx llm package (#301 )	2024-01-12 10:25:56 -08:00
utils.py	repetiton_penalty and logits_bias just using logits_processors (#1004 )	2024-09-30 08:49:03 -07:00

README.md

Generate Text with MLX and 🤗 Hugging Face

This is an example of large language model text generation that can pull models from the Hugging Face Hub.

For more information on this example, see the README in the parent directory.

This package also supports fine tuning with LoRA or QLoRA. For more information see the LoRA documentation.

Install mlx_lm locally

# go to the mlx-examples directory, sync fork:
# https://github.com/LLMAppArchitect/mlx-lm/tree/main

git pull 
cd llms
pip install -e .

Run MLX LLM Server

cd llms/mlx_lm

Start the server with the command below (see SERVER.md for details):

mlx_lm.server --model <path_to_model_or_hf_repo>

For example:

mlx_lm.server --model mlx-community/Meta-Llama-3.1-8B-Instruct-8bit --trust-remote-code --port 8722
mlx_lm.server --model mlx-community/Mistral-Nemo-Instruct-2407-8bit --trust-remote-code --port 8722
mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --trust-remote-code --port 8722
mlx_lm.server --model mlx-community/internlm2_5-7b-chat-8bit --trust-remote-code --port 8722

Each of these commands starts a text generation server on port 8722 of localhost using the specified model (without --port, the server listens on port 8080). The model will be downloaded from the provided Hugging Face repo if it is not already in the local cache.

To see a full list of options run:

mlx_lm.server --help

You can make a request to the model by running:

curl localhost:8722/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7,
     "max_tokens": 100
   }'
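
The same request can also be sent from Python. The following is a minimal sketch that uses only the standard library. The helper names build_request and send are hypothetical (they are not part of mlx_lm), and the URL assumes the server was started with --port 8722 as in the examples above.

```python
import json
import urllib.request

# Assumption: the server was started with --port 8722 as in the examples above.
URL = "http://localhost:8722/v1/chat/completions"


def build_request(prompt, temperature=0.7, max_tokens=100, url=URL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    # json.dumps always emits strictly valid JSON (no trailing commas).
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a server running, send(build_request("Say this is a test!"))["choices"][0]["message"]["content"] returns the generated text. Building the body with json.dumps avoids hand-written JSON mistakes such as a trailing comma after the last field.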

Example output:

{
  "id": "chatcmpl-74e66064-8727-411a-ada3-d5287b2c83a2",
  "system_fingerprint": "fp_73a731bd-bd00-4dcd-8fac-8f3f452210a2",
  "object": "chat.completions",
  "model": "default_model",
  "created": 1721634359,
  "choices": [
    {
      "index": 0,
      "logprobs": {
        "token_logprobs": [
          -2.4453125,
          -1.28125,
          -1.421875,
          -0.25,
          -7.53125,
          -1.15625,
          -4.09375,
          -0.390625,
          -3.0625,
          -0.84375,
          -2.53125,
          -0.125,
          -0.40625,
          -0.015625,
          -0.15625,
          -0.265625,
          -1.015625,
          -1.6484375,
          -1.0625,
          -0.40625,
          -4.390625,
          -0.296875,
          -1.078125,
          -3.0625,
          -0.328125,
          -0.21875,
          -0.390625,
          -2.015625,
          -3.46875,
          0.0,
          -0.765625,
          -2.609375,
          -1.921875,
          -1.078125,
          -1.859375,
          -1.625,
          -0.09375,
          -0.015625,
          -1.5625,
          -2.1015625,
          -1.65625,
          -0.21875,
          0.0,
          0.0,
          -1.640625,
          -0.0625,
          0.0,
          -1.234375,
          -0.6875,
          -0.53125,
          -0.078125,
          -0.03125,
          -1.015625,
          -0.109375,
          -3.4765625,
          -0.015625,
          -2.140625,
          -1.34375,
          -1.0625,
          -2.21875,
          -1.046875,
          -0.046875,
          -0.375,
          -1.0,
          -1.0625,
          -3.21875,
          -0.5,
          -0.234375,
          -0.15625,
          -2.015625,
          -1.265625,
          -0.390625,
          -2.265625,
          -0.0625,
          -1.59375,
          -3.5625,
          -0.59375,
          -0.46875,
          -1.0,
          -1.3515625,
          -0.296875,
          -1.4375,
          0.0,
          -1.1875,
          -0.46875,
          -0.15625,
          -0.375,
          -0.0625,
          -0.0625,
          -3.90625,
          -0.9375,
          -0.5625,
          -0.25,
          -2.53125,
          -0.28125,
          -2.640625,
          -0.59375,
          -0.75,
          -0.53125,
          -0.71875
        ],
        "top_logprobs": [],
        "tokens": [
          39584,
          346,
          5846,
          725,
          3716,
          489,
          4330,
          25341,
          16375,
          3103,
          1226,
          725,
          395,
          3556,
          1593,
          43916,
          465,
          2423,
          57436,
          334,
          19109,
          446,
          395,
          16375,
          22006,
          55098,
          465,
          53057,
          51040,
          334,
          465,
          848,
          285,
          3235,
          53057,
          4144,
          334,
          465,
          461,
          2423,
          57436,
          830,
          285,
          3235,
          5168,
          334,
          465,
          461,
          2136,
          505,
          395,
          1420,
          17338,
          465,
          312,
          281,
          5128,
          285,
          2423,
          5128,
          1883,
          938,
          334,
          55098,
          10363,
          6069,
          410,
          1420,
          328,
          410,
          2863,
          46301,
          2119,
          517,
          2014,
          334,
          4872,
          285,
          3235,
          2423,
          11740,
          334,
          465,
          29581,
          560,
          410,
          1420,
          4736,
          505,
          6662,
          12590,
          281,
          1239,
          1377,
          3089,
          22865,
          560,
          810,
          6025,
          3328
        ]
      },
      "finish_reason": "length",
      "message": {
        "role": "assistant",
        "content": "Sure! Here's What I'd Say Given That It's a Test:\n\n---\n\n**Test Scenario: Validation of a Given Statement**\n\n**Scenario Outline:** \n- **Scenario Name:** \"Test Scenario\"\n- **Description:** \"This is a test!\"\n\n**1. Pre-Test Preparations:**\n\nBefore starting the test, the following preparations must be made: \n\n- **Test Environment:** Ensure that the test environment is setup correctly. This may include ensuring that all necessary software"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 100,
    "total_tokens": 116
  }
}
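
In the sample response, finish_reason is "length" and completion_tokens equals the requested max_tokens of 100, which suggests the reply was cut off by the token limit rather than ending naturally. The sketch below shows one way to pull the useful parts out of such a response once it has been decoded into a Python dict. The function name summarize is hypothetical, and the perplexity figure assumes that token_logprobs holds natural-log probabilities of the sampled tokens.

```python
import math


def summarize(response):
    """Extract the reply text, stop reason, and token usage from a response dict."""
    choice = response["choices"][0]
    summary = {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": response["usage"]["total_tokens"],
    }
    # token_logprobs holds one log probability per generated token.
    logprobs = (choice.get("logprobs") or {}).get("token_logprobs") or []
    if logprobs:
        mean = sum(logprobs) / len(logprobs)
        summary["mean_logprob"] = mean
        # Assumption: natural-log values, so perplexity = exp(-mean log prob).
        summary["perplexity"] = math.exp(-mean)
    return summary
```

A finish_reason of "length" is a hint to raise max_tokens or to continue the conversation with a follow-up message if the full answer is needed.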

Request Fields

messages: An array of message objects representing the conversation history. Each message object should have a role (e.g. user, assistant) and content (the message text).
role_mapping: (Optional) A dictionary to customize the role prefixes in the generated prompt. If not provided, the default mappings are used.
stop: (Optional) An array of strings or a single string. These are sequences of tokens on which the generation should stop.
max_tokens: (Optional) An integer specifying the maximum number of tokens to generate. Defaults to 100.
stream: (Optional) A boolean indicating if the response should be streamed. If true, responses are sent as they are generated. Defaults to false.
temperature: (Optional) A float specifying the sampling temperature. Defaults to 1.0.
top_p: (Optional) A float specifying the nucleus sampling parameter. Defaults to 1.0.
repetition_penalty: (Optional) Applies a penalty to repeated tokens. Defaults to 1.0.
repetition_context_size: (Optional) The size of the context window for applying repetition penalty. Defaults to 20.
logit_bias: (Optional) A dictionary mapping token IDs to their bias values. Defaults to None.
logprobs: (Optional) An integer specifying the number of top tokens and corresponding log probabilities to return for each output in the generated sequence. If set, this can be any value between 1 and 10, inclusive.
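
The field list above can be turned into a small client-side check before a request is sent. This is a sketch only: build_payload is a hypothetical helper, not part of mlx_lm, and the checks mirror the descriptions above rather than the server's own validation.

```python
def build_payload(messages, **options):
    """Return a request body that uses only the fields documented above."""
    allowed = {
        "role_mapping", "stop", "max_tokens", "stream", "temperature",
        "top_p", "repetition_penalty", "repetition_context_size",
        "logit_bias", "logprobs",
    }
    unknown = set(options) - allowed
    if unknown:
        raise ValueError(f"unknown request fields: {sorted(unknown)}")
    for message in messages:
        if "role" not in message or "content" not in message:
            raise ValueError("each message needs 'role' and 'content'")
    logprobs = options.get("logprobs")
    if logprobs is not None and not 1 <= logprobs <= 10:
        raise ValueError("logprobs must be between 1 and 10, inclusive")
    # The server accepts a single string or a list for stop; normalizing
    # to a list here is a client-side convenience, not a requirement.
    if isinstance(options.get("stop"), str):
        options["stop"] = [options["stop"]]
    return {"messages": list(messages), **options}
```

For example, build_payload([{"role": "user", "content": "Hi"}], max_tokens=50, stop="\n") yields a body with stop set to ["\n"], while passing logprobs=11 raises a ValueError before any network call is made.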

Text Models

MLX LM, a package for LLM text generation, fine-tuning, and more.
Transformer language model training.
Minimal examples of large-scale text generation with LLaMA, Mistral, and more in the LLMs directory.
A mixture-of-experts (MoE) language model with Mixtral 8x7B.
Parameter efficient fine-tuning with LoRA or QLoRA.
Text-to-text multi-task Transformers with T5.
Bidirectional language understanding with BERT.

Image Models

Image classification using ResNets on CIFAR-10.
Generating images with Stable Diffusion or SDXL.
Convolutional variational autoencoder (CVAE) on MNIST.

Audio Models

Speech recognition with OpenAI's Whisper.

Multimodal models

Joint text and image embeddings with CLIP.
Text generation from image and text inputs with LLaVA.

Other Models

Semi-supervised learning on graph-structured data with GCN.
Real NVP normalizing flow for density estimation and sampling.

Hugging Face

Note: You can now directly download a few converted checkpoints from the MLX Community organization on Hugging Face. We encourage you to join the community and contribute new models.

Contributing

We are grateful for all of our contributors. If you contribute to MLX Examples and wish to be acknowledged, please add your name to the list in your pull request.

Citing MLX Examples

The MLX software suite was initially developed with equal contribution by Awni Hannun, Jagrit Digani, Angelos Katharopoulos, and Ronan Collobert. If you find MLX Examples useful in your research and wish to cite it, please use the following BibTeX entry:

@software{mlx2023,
  author = {Awni Hannun and Jagrit Digani and Angelos Katharopoulos and Ronan Collobert},
  title = {{MLX}: Efficient and flexible machine learning on Apple silicon},
  url = {https://github.com/ml-explore},
  version = {0.0},
  year = {2023},
}