* Add support for setting MLX cache limit in GB
* Add support for setting MLX cache limit in GB in mlx_lm.server
* format
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* fix the chinese character generation as same as PR #321
* reuse the generate logic to utils.py
* format
* verbose defualt
* fix conflicst with colorize and character check
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Add colorized output option to generate script
Two new functions were added to the script that allow output to be colorized based on the T[0] probability. Changes were made to the `generate_step` function in utils.py to permit colorization. Additionally, an argument for colorization was introduced to the command-line parser.
* Rename 'colorize' parameter with 'return_probability' in generate_step
* add an option to apply the tokenizer chat template
* fix the option to apply the tokenizer chat template
* better error messages for chat template issues
* apply the chat template by default when possible
* nit in comment'
* rebase
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* refactor(qwen): moving qwen into mlx-lm
* chore: update doc
* chore: fix type hint
* add qwen model support in convert
* chore: fix doc
* chore: only load model in quantize_model
* chore: make the convert script only copy tokenizer files instead of load it and save
* chore: update docstring
* chore: remove unnecessary try catch
* chore: clean up for tokenizer and update transformers 4.37
* nits in README
---------
Co-authored-by: Awni Hannun <awni@apple.com>