mlx-examples/llms
Nada Amin 3cc58e17fb
Tweaks to run dspy-produced calls to the server, with gemma template. (#810)
* Tweaks to run dspy-produced calls to the server, with gemma template.

Following this comment: https://github.com/stanfordnlp/dspy/issues/385#issuecomment-1998939936

can try it out with:
```sh
python -m server --model mlx-community/gemma-1.1-7b-it-4bit --port 1143
```
modulo patching the relative imports in server.py
```
-from .tokenizer_utils import TokenizerWrapper
-from .utils import generate_step, load
+from mlx_lm.tokenizer_utils import TokenizerWrapper
+from mlx_lm.utils import generate_step, load
```

and then, on the dspy side:
```python
import dspy
lm = dspy.OpenAI(model_type="chat", api_base="http://localhost:11434/v1/", api_key="not_needed", max_tokens=250)
lm("hello")
```

* simpler way to validate float or int

* remove logic that works around incompatible templates; it was too Gemma-specific

* tweak messages for common denominator

* use generate.py workaround for DBXR

* put behind flag

* oops

* Solution to chat template issue: pass in a custom template!

The template should likely adhere to the OpenAI chat message format.
Here is such a template for Gemma.

--chat-template "{{ bos_token }}{% set extra_system = '' %}{% for message in messages %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{% if role == 'system' %}{% set extra_system = extra_system + message['content'] %}{% else %}{% if role == 'user' and extra_system %}{% set message_system = 'System: ' + extra_system %}{% else %}{% set message_system = '' %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message_system + message['content'] | trim + '<end_of_turn>\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"
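For example, the template can be passed when starting the server (a sketch reusing the invocation above; substitute the full template above for the ellipsis):

```sh
python -m server --model mlx-community/gemma-1.1-7b-it-4bit --port 1143 \
    --chat-template "{{ bos_token }}{% for message in messages %}...{% endfor %}"
```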

* remove convoluted solution

* Tweak for when None is provided explicitly; it must be set to [] too.

For example, the outlines library provides None explicitly.
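A minimal sketch of the idea (hypothetical names; the actual server code may differ):

```python
def as_list(value):
    # Clients such as outlines may pass None explicitly; treat it like an
    # omitted value and fall back to an empty list.
    return [] if value is None else value

stop_words = as_list(request_body.get("stop"))  # hypothetical request field
```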

* style

---------

Co-authored-by: Awni Hannun <awni@apple.com>
2024-06-12 07:17:06 -07:00
gguf_llm fixed the requirements (#803) 2024-05-29 06:14:19 -07:00
llama Quantize embedding / Update quantize API (#680) 2024-04-18 18:16:10 -07:00
mistral Quantize embedding / Update quantize API (#680) 2024-04-18 18:16:10 -07:00
mixtral Quantize embedding / Update quantize API (#680) 2024-04-18 18:16:10 -07:00
mlx_lm Tweaks to run dspy-produced calls to the server, with gemma template. (#810) 2024-06-12 07:17:06 -07:00
speculative_decoding Fix incorrect type annotation (#720) 2024-04-24 15:52:43 -07:00
tests GPT2 Support (#798) 2024-06-02 16:33:20 -07:00
CONTRIBUTING.md Enable unit testing in Circle and start some MLX LM tests (#545) 2024-03-07 09:31:57 -08:00
MANIFEST.in Mlx llm package (#301) 2024-01-12 10:25:56 -08:00
README.md mlx_lm: Add Streaming Capability to Generate Function (#807) 2024-06-03 09:04:39 -07:00
setup.py Add model management functionality for local caches (#736) 2024-05-03 12:20:13 -07:00

Generate Text with LLMs and MLX

The easiest way to get started is to install the mlx-lm package:

With pip:

pip install mlx-lm

With conda:

conda install -c conda-forge mlx-lm

The mlx-lm package also has support for LoRA and QLoRA fine-tuning, model merging, and an HTTP model server; see the documentation in the mlx_lm directory.

Python API

You can use mlx-lm as a module:

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

response = generate(model, tokenizer, prompt="hello", verbose=True)

To see a description of all the arguments you can do:

>>> help(generate)

The mlx-lm package also comes with functionality to quantize and optionally upload models to the Hugging Face Hub.

You can convert models in the Python API with:

from mlx_lm import convert

repo = "mistralai/Mistral-7B-Instruct-v0.3"
upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"

convert(repo, quantize=True, upload_repo=upload_repo)

This will generate a 4-bit quantized Mistral 7B and upload it to the repo mlx-community/My-Mistral-7B-Instruct-v0.3-4bit. It will also save the converted model in the path mlx_model by default.
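You can then load the converted model from the local output path (a minimal sketch, assuming the default mlx_model directory):

from mlx_lm import load, generate

model, tokenizer = load("mlx_model")
response = generate(model, tokenizer, prompt="hello", verbose=True)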

To see a description of all the arguments you can do:

>>> help(convert)

Streaming

For streaming generation, use the stream_generate function. This returns a generator object which streams the output text. For example,

from mlx_lm import load, stream_generate

repo = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
model, tokenizer = load(repo)

prompt = "Write a story about Einstein"

for t in stream_generate(model, tokenizer, prompt, max_tokens=512):
    print(t, end="", flush=True)
print()

Command Line

You can also use mlx-lm from the command line with:

mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.3 --prompt "hello"

This will download a Mistral 7B model from the Hugging Face Hub and generate text using the given prompt.

For a full list of options run:

mlx_lm.generate --help

To quantize a model from the command line run:

mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 -q

For more options run:

mlx_lm.convert --help

You can upload new models to Hugging Face by specifying --upload-repo to convert. For example, to upload a quantized Mistral-7B model to the MLX Hugging Face community you can do:

mlx_lm.convert \
    --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --upload-repo mlx-community/my-4bit-mistral

Supported Models

The example supports Hugging Face format Mistral, Llama, and Phi-2 style models. If the model you want to run is not supported, file an issue or better yet, submit a pull request.

Most Mistral, Llama, Phi-2, and Mixtral style models from the Hugging Face Hub should work out of the box; the mlx-community organization hosts many pre-converted examples.

For some models (such as Qwen and plamo) the tokenizer requires you to enable the trust_remote_code option. You can do this by passing --trust-remote-code in the command line. If you don't specify the flag explicitly, you will be prompted to trust remote code in the terminal when running the model.

For Qwen models you must also specify the eos_token. You can do this by passing --eos-token "<|endoftext|>" in the command line.
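For example, a sketch combining both flags on the command line:

mlx_lm.generate \
    --model qwen/Qwen-7B \
    --prompt "hello" \
    --trust-remote-code \
    --eos-token "<|endoftext|>"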

These options can also be set in the Python API. For example:

model, tokenizer = load(
    "qwen/Qwen-7B",
    tokenizer_config={"eos_token": "<|endoftext|>", "trust_remote_code": True},
)