# Copyright © 2023-2024 Apple Inc.
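
"""Interactive chat REPL: load a model, keep a prompt cache across turns, and stream replies.

Typical invocation (module path assumed from this file's package-relative imports):

    python -m mlx_lm.chat --max-tokens 512
"""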

import argparse
import json

import mlx.core as mx

from .models.cache import make_prompt_cache
from .sample_utils import make_sampler
from .utils import load, stream_generate

DEFAULT_TEMP = 0.0
DEFAULT_TOP_P = 1.0
DEFAULT_SEED = None
DEFAULT_MAX_TOKENS = 256
DEFAULT_MODEL = "mlx-community/Llama-3.2-3B-Instruct-4bit"


def setup_arg_parser():
    """Set up and return the argument parser."""
    parser = argparse.ArgumentParser(description="Chat with an LLM")
    parser.add_argument(
        "--model",
        type=str,
        help="The path to the local model directory or Hugging Face repo.",
        default=DEFAULT_MODEL,
    )
    parser.add_argument(
        "--adapter-path",
        type=str,
        help="Optional path for the trained adapter weights and config.",
    )
    parser.add_argument(
        "--temp", type=float, default=DEFAULT_TEMP, help="Sampling temperature"
    )
    parser.add_argument(
        "--top-p", type=float, default=DEFAULT_TOP_P, help="Sampling top-p"
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=DEFAULT_SEED,
        help="PRNG seed",
    )
    parser.add_argument(
        "--max-kv-size",
        type=int,
        help="Set the maximum key-value cache size",
        default=None,
    )
    parser.add_argument(
        "--max-tokens",
        "-m",
        type=int,
        default=DEFAULT_MAX_TOKENS,
        help="Maximum number of tokens to generate",
    )
    return parser


def main():
    parser = setup_arg_parser()
    args = parser.parse_args()

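    # Seed MLX's PRNG only when a seed is explicitly provided.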
    if args.seed is not None:
        mx.random.seed(args.seed)

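    # Load the model and tokenizer from a local path or a Hugging Face repo,
    # optionally with trained adapter weights applied.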
    model, tokenizer = load(
        args.model,
        adapter_path=args.adapter_path,
        tokenizer_config={"trust_remote_code": True},
    )

    def print_help():
        print("The command list:")
        print("- 'q' to exit")
        print("- 'r' to reset the chat")
        print("- 'h' to display these commands")

    print(f"[INFO] Starting chat session with {args.model}.")
    print_help()

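    # A single prompt cache carries the conversation's KV state across turns;
    # entering 'r' rebuilds it to start a fresh chat.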
    prompt_cache = make_prompt_cache(model, args.max_kv_size)

    while True:
        query = input(">> ")
        if query == "q":
            break
        if query == "r":
            prompt_cache = make_prompt_cache(model, args.max_kv_size)
            continue
        if query == "h":
            print_help()
            continue

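        # Only the new user turn is templated; earlier turns already live in the prompt cache.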
        messages = [{"role": "user", "content": query}]
        prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

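        # Stream the response, printing each piece of text as it arrives.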
        for response in stream_generate(
            model,
            tokenizer,
            prompt,
            max_tokens=args.max_tokens,
            sampler=make_sampler(args.temp, args.top_p),
            prompt_cache=prompt_cache,
        ):
            print(response.text, flush=True, end="")
        print()


if __name__ == "__main__":
    main()