
# MLX-LM Benchmark Tool

The MLX-LM Benchmark Tool is a command-line utility for measuring and comparing the performance of MLX-format language models. It generates synthetic prompt tokens and records performance metrics for each run, similar to llama.cpp's `llama-bench` tool.

## Features

- Measures multiple performance metrics (see the sketch after this list):
  - Model load time (seconds)
  - Prompt token processing speed (tokens per second, TPS)
  - Generation token processing speed (TPS)
  - Total execution time (seconds)
  - Memory usage (GB)
- Supports testing combinations of multiple model configurations
- Customizable prompt token count and generation token count
- Multiple output formats (CSV, JSON, JSONL, Markdown)
- Configurable number of repetitions for averaging results
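For intuition, here is a minimal sketch of the kind of measurement the tool automates, using the public `mlx_lm` Python API (`load`/`generate`). The model path and token counts are placeholders; the actual tool reports prompt and generation TPS separately and also records peak memory.

```python
# Minimal sketch of the measurement the tool automates; wall-clock only.
import time
from mlx_lm import load, generate

t0 = time.perf_counter()
model, tokenizer = load("path/to/model")  # placeholder model path
load_time = time.perf_counter() - t0

prompt = "hello " * 64  # stand-in for synthetic prompt tokens
n_gen = 128

t0 = time.perf_counter()
generate(model, tokenizer, prompt=prompt, max_tokens=n_gen)
gen_time = time.perf_counter() - t0

print(f"load: {load_time:.3f}s, ~{n_gen / gen_time:.1f} tok/s end-to-end")
```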

## Installation

Ensure you have MLX-LM installed:

```shell
pip install mlx-lm
```
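To verify the entry point is available (this assumes the standard argparse `--help` flag):

```shell
mlx_lm.bench --help
```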

## Usage

Basic usage:

```shell
mlx_lm.bench -m [MODEL_PATH] -p [PROMPT_TOKENS] -n [GEN_TOKENS] -r [REPETITIONS]
```

### Parameters

- `-m`, `--model`: Path(s) to the MLX model(s) to benchmark; multiple values may be given, comma-separated
- `-p`, `--n-prompt`: Input sequence length (ISL), i.e. the number of synthetic prompt tokens; multiple values may be given, comma-separated
- `-n`, `--n-gen`: Output sequence length (OSL), i.e. the number of tokens to generate; multiple values may be given, comma-separated
- `-r`, `--repetitions`: Number of benchmark repetitions to average results over
- `-o`, `--output-format`: Output format for benchmark results (`csv`, `json`, `jsonl`, `md`)
- `-f`, `--output-filename`: Output filename (without extension)
- `--gen-args`: Additional keyword arguments for the `generate()` function, in `key=value` format (a parsing sketch follows this list)
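The tool forwards these pairs as keyword arguments to `generate()`. The exact parsing inside `mlx_lm.bench` may differ; the following is only a plausible sketch of the `key=value` convention:

```python
# Hypothetical sketch of key=value parsing; the real CLI's type coercion
# may differ.
from ast import literal_eval

def parse_gen_args(pairs: list[str]) -> dict:
    kwargs = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        try:
            kwargs[key] = literal_eval(value)  # numbers, booleans, ...
        except (ValueError, SyntaxError):
            kwargs[key] = value  # keep unparseable values as raw strings
    return kwargs

print(parse_gen_args(["kv_group_size=64"]))  # -> {'kv_group_size': 64}
```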

## Example

Benchmark two Qwen models with two different generation token counts:

```shell
mlx_lm.bench -m $HOME/Files/mlx/Qwen/Qwen2.5-3B-Instruct-Q4,$HOME/Files/mlx/Qwen/Qwen2.5-7B-Instruct-Q4 -p 1 -n 16,32 -r 2 -o md
```

Sample output:

| Model | Model Load Time (s) | Prompt Tokens | Prompt TPS | Response Tokens | Response TPS | Execution Time (s) | Memory Usage (GB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-3B-Instruct-Q4 | 0.469 | 1 | 140.084 | 16 | 184.93 | 0.094 | 1.75 |
| Qwen2.5-3B-Instruct-Q4 | 0.469 | 1 | 137.294 | 32 | 178.829 | 0.186 | 1.75 |
| Qwen2.5-7B-Instruct-Q4 | 0.537 | 1 | 110.817 | 16 | 139.308 | 0.124 | 6.02 |
| Qwen2.5-7B-Instruct-Q4 | 0.537 | 1 | 109.005 | 32 | 134.764 | 0.247 | 6.02 |

## Advanced Usage

Run more complex benchmarks:

```shell
# Test multiple models with various prompt and generation length combinations
mlx_lm.bench -m path/to/model1,path/to/model2 -p 1,8,64,128 -n 16,128,512 -r 3 -o json -f detailed_results

# Pass additional arguments to the generation function
mlx_lm.bench -m path/to/model -p 128 -n 128 -r 3 --gen-args kv_group_size=64
```
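Once results are on disk they are easy to post-process. The sketch below rests on assumptions: that the tool appends a `.json` extension to the `-f` name and writes a list of records keyed by the Output Metrics column names; adjust it to the actual file layout.

```python
# Hypothetical post-processing of `-o json -f detailed_results` output.
# Assumes detailed_results.json holds a list of records whose keys match
# the Output Metrics column names below -- verify against your actual file.
import json

with open("detailed_results.json") as f:
    results = json.load(f)

# Rank runs by generation speed, fastest first.
for row in sorted(results, key=lambda r: r["Response TPS"], reverse=True):
    print(f'{row["Model"]}: {row["Response TPS"]:.1f} tok/s')
```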

## Output Metrics

Benchmark results include the following metrics (a consistency check follows the list):

- Model: Path of the model being tested
- Model Load Time (s): Time required to load the model, in seconds
- Prompt Tokens: Number of prompt tokens processed
- Prompt TPS: Prompt processing speed, in tokens per second
- Response Tokens: Number of response tokens generated
- Response TPS: Response generation speed, in tokens per second
- Execution Time (s): Total execution time, in seconds
- Memory Usage (GB): Peak memory usage, in GB
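These numbers are internally consistent: in the sample table above, the 3B model's 32-token run implies roughly 32 / 178.829 ≈ 0.179 s of generation plus 1 / 137.294 ≈ 0.007 s of prompt processing, which matches the reported 0.186 s execution time.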