## Generate Text with LLMs and MLX

The easiest way to get started is to install the `mlx-lm` package:

**With `pip`**:

```sh
pip install mlx-lm
```

**With `conda`**:

```sh
conda install -c conda-forge mlx-lm
```

The `mlx-lm` package also has:

- [LoRA, QLoRA, and full fine-tuning](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/LORA.md)
- [Merging models](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/MERGE.md)
- [HTTP model serving](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/SERVER.md)

### Quick Start

To generate text with an LLM use:

```bash
mlx_lm.generate --prompt "Hi!"
```

To chat with an LLM use:

```bash
mlx_lm.chat
```

This will give you a chat REPL that you can use to interact with the LLM. The chat context is preserved during the lifetime of the REPL.

Commands in `mlx-lm` typically take command line options which let you specify the model, sampling parameters, and more. Use `-h` to see a list of available options for a command, e.g.:

```bash
mlx_lm.generate -h
```

### Python API

You can use `mlx-lm` as a module:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Write a story about Einstein"

messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```

To see a description of all the arguments you can do:

```
>>> help(generate)
```

Check out the [generation example](https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm/examples/generate_response.py) to see how to use the API in more detail.
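`generate` also accepts sampling controls such as temperature and top-p. Below is a minimal sketch, assuming a recent `mlx-lm` where samplers are built with `make_sampler` from `mlx_lm.sample_utils` and passed via a `sampler` argument; older versions expose these differently, so check `help(generate)` for your install:

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler  # assumed location; may vary by version

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Sample with a higher temperature and nucleus (top-p) filtering
# instead of the default greedy decoding.
sampler = make_sampler(temp=0.8, top_p=0.9)

text = generate(
    model,
    tokenizer,
    prompt="Write a story about Einstein",
    sampler=sampler,
    verbose=True,
)
```

Similar options are exposed on the command line (e.g. `--temp` and `--top-p`); run `mlx_lm.generate -h` to see what your version supports.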
The `mlx-lm` package also comes with functionality to quantize and optionally upload models to the Hugging Face Hub.

You can convert models using the Python API:

```python
from mlx_lm import convert

repo = "mistralai/Mistral-7B-Instruct-v0.3"
upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"

convert(repo, quantize=True, upload_repo=upload_repo)
```

This will generate a 4-bit quantized Mistral 7B and upload it to the repo `mlx-community/My-Mistral-7B-Instruct-v0.3-4bit`. It will also save the converted model in the path `mlx_model` by default.

To see a description of all the arguments you can do:

```
>>> help(convert)
```

#### Streaming

For streaming generation, use the `stream_generate` function. This yields a generation response object. For example:

```python
from mlx_lm import load, stream_generate

repo = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
model, tokenizer = load(repo)

prompt = "Write a story about Einstein"

messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

for response in stream_generate(model, tokenizer, prompt, max_tokens=512):
    print(response.text, end="", flush=True)
print()
```

### Command Line

You can also use `mlx-lm` from the command line with:

```
mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.3 --prompt "hello"
```

This will download a Mistral 7B model from the Hugging Face Hub and generate text using the given prompt.

For a full list of options run:

```
mlx_lm.generate --help
```

To quantize a model from the command line run:

```
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 -q
```

For more options run:

```
mlx_lm.convert --help
```

You can upload new models to Hugging Face by specifying `--upload-repo` to `convert`. For example, to upload a quantized Mistral-7B model to the [MLX Hugging Face community](https://huggingface.co/mlx-community) you can do:

```
mlx_lm.convert \
    --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --upload-repo mlx-community/my-4bit-mistral
```

Models can also be converted and quantized directly in the [mlx-my-repo](https://huggingface.co/spaces/mlx-community/mlx-my-repo) Hugging Face Space.

### Long Prompts and Generations

`mlx-lm` has some tools to scale efficiently to long prompts and generations:

- A rotating fixed-size key-value cache
- Prompt caching

To use the rotating key-value cache pass the argument `--max-kv-size n` where `n` is the maximum number of tokens to keep in the cache. Smaller values like `512` use very little RAM but can reduce quality on long inputs. Larger values like `4096` or higher use more RAM but retain more context.

Caching prompts can substantially speed up reusing the same long context with different queries. To cache a prompt use `mlx_lm.cache_prompt`. For example:

```bash
cat prompt.txt | mlx_lm.cache_prompt \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --prompt - \
    --prompt-cache-file mistral_prompt.safetensors
```

Then use the cached prompt with `mlx_lm.generate`:

```
mlx_lm.generate \
    --prompt-cache-file mistral_prompt.safetensors \
    --prompt "\nSummarize the above text."
```

The cached prompt is treated as a prefix to the supplied prompt. Note that when using a cached prompt, the model is read from the cache and need not be supplied explicitly.

Prompt caching can also be used in the Python API to avoid recomputing the prompt. This is useful in multi-turn dialogues or across requests that use the same context. See the [example](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/examples/chat.py) for more usage details.
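A minimal sketch of that pattern, assuming the `make_prompt_cache` helper from `mlx_lm.models.cache` and the `prompt_cache` argument to `generate`, as used in the linked example (check `help(generate)` in your installed version):

```python
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache  # assumed location; may vary by version

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# One cache object holds the key-value state across generate() calls.
prompt_cache = make_prompt_cache(model)

# The first call processes the full prompt and fills the cache.
messages = [{"role": "user", "content": "Write a story about Einstein"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
generate(model, tokenizer, prompt=prompt, prompt_cache=prompt_cache, verbose=True)

# A follow-up turn reuses the cached context, so only the new
# tokens need to be processed.
messages = [{"role": "user", "content": "Now make it shorter."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
generate(model, tokenizer, prompt=prompt, prompt_cache=prompt_cache, verbose=True)
```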
### Supported Models

`mlx-lm` supports thousands of Hugging Face format LLMs. If the model you want to run is not supported, file an [issue](https://github.com/ml-explore/mlx-examples/issues/new) or better yet, submit a pull request.

Here are a few examples of Hugging Face models that work with `mlx-lm`:

- [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- [deepseek-ai/deepseek-coder-6.7b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct)
- [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat)
- [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)
- [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
- [Qwen/Qwen-7B](https://huggingface.co/Qwen/Qwen-7B)
- [pfnet/plamo-13b](https://huggingface.co/pfnet/plamo-13b)
- [pfnet/plamo-13b-instruct](https://huggingface.co/pfnet/plamo-13b-instruct)
- [stabilityai/stablelm-2-zephyr-1_6b](https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b)
- [internlm/internlm2-7b](https://huggingface.co/internlm/internlm2-7b)
- [tiiuae/falcon-mamba-7b-instruct](https://huggingface.co/tiiuae/falcon-mamba-7b-instruct)

Most [Mistral](https://huggingface.co/models?library=transformers,safetensors&other=mistral&sort=trending), [Llama](https://huggingface.co/models?library=transformers,safetensors&other=llama&sort=trending), [Phi-2](https://huggingface.co/models?library=transformers,safetensors&other=phi&sort=trending), and [Mixtral](https://huggingface.co/models?library=transformers,safetensors&other=mixtral&sort=trending) style models should work out of the box.

For some models (such as `Qwen` and `plamo`) the tokenizer requires you to enable the `trust_remote_code` option. You can do this by passing `--trust-remote-code` on the command line. If you don't pass the flag explicitly, you will be prompted to trust remote code in the terminal when running the model.

For `Qwen` models you must also specify the `eos_token`. You can do this by passing `--eos-token "<|endoftext|>"` on the command line.

These options can also be set in the Python API. For example:

```python
model, tokenizer = load(
    "qwen/Qwen-7B",
    tokenizer_config={"eos_token": "<|endoftext|>", "trust_remote_code": True},
)
```

### Large Models

> [!NOTE]
> This requires macOS 15.0 or higher to work.

Models that are large relative to the total RAM available on the machine can be slow. `mlx-lm` will attempt to make them faster by wiring the memory occupied by the model and cache. If you see the following warning message:

> [WARNING] Generating with a model that requires ...

then the model will likely be slow on the given machine. If the model fits in RAM, it can often be sped up by increasing the system wired memory limit. To increase the limit, set the following `sysctl`:

```bash
sudo sysctl iogpu.wired_limit_mb=N
```

The value `N` should be larger than the size of the model in megabytes but smaller than the memory size of the machine.
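For example, with a model that occupies roughly 18 GB on a 32 GB machine (hypothetical numbers, purely for illustration), a limit of 24576 MB is comfortably larger than the model while leaving headroom for the rest of the system:

```bash
# Hypothetical: ~18 GB model on a 32 GB machine.
# 24576 MB > model size, but well under total RAM.
sudo sysctl iogpu.wired_limit_mb=24576
```

Like other `sysctl` settings, this resets on reboot.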