mlx-examples

mirror of https://github.com/ml-explore/mlx-examples.git synced 2025-10-23 05:58:07 +08:00

Author	SHA1	Message	Date
Alex Cheema	cd8efc7fbc	Add support for Llama-3.1 (#907 ) * add dynamicNTK scaling rope * remove unused var * fix rope base * llama3.1 fixes * TODO for rope eval * vectorise llama3 base freq calculation * removed the arbitrary 2.0 rope_scale default case * fix slow llama3.1 generation by evaluating stateless part of DynamicNTKScalingRoPE in init * nits + format * use mx.pi * fix tests and add test for 3.1 --------- Co-authored-by: Prince Canuma <prince.gdt@gmail.com> Co-authored-by: Awni Hannun <awni@apple.com>	2024-07-23 13:21:32 -07:00
Prince Canuma	3f337e0f0a	Add Mistral NeMo (fix) (#895 ) * fix head_dim * Update llms/mlx_lm/models/llama.py * fix kv error * formatting * Delete test.py --------- Co-authored-by: Awni Hannun <awni.hannun@gmail.com>	2024-07-22 06:09:24 -07:00
Prince Canuma	3d365b612a	Add support for InternLM-2.5 (#871 ) * fix internlm-2 * formatting * add dynamic ntk rope * formatting * move dynamic scaling rope to internlm2.py * add default max_position_embeddings	2024-07-17 16:38:22 -07:00
Anchen	561dcf5643	Add support for deepseek coder v2 lite (#882 ) * feat: add support for deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct * fix softmax + some cleanup * more nits * fix rope * fix original_max_position_embeddings in rope * fix original_max_position_embeddings in rope config * add group greedy --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-07-17 07:23:28 -07:00
Awni Hannun	f0c6c6e226	keep the server in a valid state (#889 )	2024-07-15 18:35:36 -07:00
JosefAlbers	bfc1f2763b	longrope (#886 )	2024-07-12 07:19:11 -07:00
Chime Ogbuji	8bf397e450	Pass use_dora parameter to linear_to_lora_layers (#885 )	2024-07-11 14:34:34 -07:00
nicolov	fbe3247772	Add GPT-neox model (#863 )	2024-07-11 06:13:17 -07:00
Alex Wozniakowski	63800c8feb	Example of response generation with optional arguments (#853 ) * Generate response with optional arguments * Reference response generation example * Include transformers and sentencepiece * Update example to run Mistral-7B-Instruct-v0.3 * Link to generation example * Style changes from pre-commit	2024-07-09 06:49:59 -07:00
Awni Hannun	68e88d42fb	Fix server for `openai` package (#877 ) * fix * fixes for 9b	2024-07-08 12:34:31 -07:00
Awni Hannun	20e221f7f7	Add recurrent gemma (#856 ) * add recurrent gemma * fix window cache	2024-07-07 12:10:04 -07:00
n8programs	1e05aef344	Add logit soft capping to gemma, and fix precision issues (#857 ) * Add logit soft capping to gemma, and fix precision issues Gemma was babbling nonsense - so I figured out it was due to not having logit softcapping and precision issues causing NaNs (so I implemented the softcapping and added more float32 inference). gemma-27b-it-4bit now works flawlessly (or near-flawlessly, no sliding-window attention). * get rid of comments * get rid of last comments (sry lol) * nits --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-07-02 07:52:39 -07:00
Angelos Katharopoulos	f212b770d8	Server loads the model on demand from the request (#851 )	2024-06-27 11:37:57 -07:00
Awni Hannun	538339b599	gemma2 (#855 )	2024-06-27 10:06:28 -07:00
Awni Hannun	9f10728145	fix yi (#852 )	2024-06-27 06:38:19 -07:00
Chime Ogbuji	df6bc09d74	Configuration-based use of HF hub-hosted datasets for training (#701 ) * Add hf_dataset configuration for using HF hub-hosted datasets for (Q)LoRA training * Pre-commit formatting * Fix YAML config example * Print DS info * Include name * Add hf_dataset parameter default * Remove TextHFDataset and CompletionsHFDataset and use Dataset and CompletionsDataset instead, adding a text_key constructor argument to the former (and changing it to work with a provided data structure instead of just from a JSON file), and prompt_key and completion_key arguments to the latter with defaults for backwards compatibility. * nits * update docs --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-06-26 10:20:50 -07:00
Chime Ogbuji	1d701a1831	Logprobs info to completion API (#806 ) * Initial implementation * Fix handling of return_step_logits in return * Fixed OpenAI parameter expectations and logprob structure and datatypes * pre-commit black formatting * Remove unused parameter * fix log probs * fix colorize * nits in server * nits in server * Fix top_logprobs structure (a dict) and include tokens in logprobs response * nits * fix types --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-06-23 10:35:13 -07:00
Yi Wang	a7598e9456	Fix mypy errors with models/{qwen2,qwen2_moe,starcoder2}.py (#835 ) * Fix starcoder.py * Fix qwen2 * Remove unnecessary assert not None	2024-06-14 09:44:50 -07:00
Awni Hannun	d8b073e3a7	Add eos token to lora fine-tunes (#818 ) * add eos token to lora fine-tunes * Comment	2024-06-12 07:44:21 -07:00
Nada Amin	3cc58e17fb	Tweaks to run dspy-produced calls to the server, with gemma template. (#810 ) * Tweaks to run dspy-produced calls to the server, with gemma template. following comment https://github.com/stanfordnlp/dspy/issues/385#issuecomment-1998939936 can try it out with: ```sh python -m server --model mlx-community/gemma-1.1-7b-it-4bit --port 1143 ``` modulo patching the relative imports in server.py ``` -from .tokenizer_utils import TokenizerWrapper -from .utils import generate_step, load +from mlx_lm.tokenizer_utils import TokenizerWrapper +from mlx_lm.utils import generate_step, load ``` and then, on the dspy side: ```python import dspy lm = dspy.OpenAI(model_type="chat", api_base="http://localhost:11434/v1/", api_key="not_needed", max_tokens=250) lm("hello") ``` * simpler way to validate float or int * remove logic that works around incompatible templates, too gemma specific * tweak messages for common denominator * use generate.py workaround for DBXR * put behind flag * oops * Solution to chat template issue: pass in a custom template! The template should likely adhere to the OpenAI chat model. Here is such a template for Gemma. --chat-template "{{ bos_token }}{% set extra_system = '' %}{% for message in messages %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{% if role == 'system' %}{% set extra_system = extra_system + message['content'] %}{% else %}{% if role == 'user' and extra_system %}{% set message_system = 'System: ' + extra_system %}{% else %}{% set message_system = '' %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message_system + message['content'] | trim + '<end_of_turn>\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}" * remove convoluted solution * Tweak for when None is provided explicitly, and must be set to [] too. For example, the outlines library provides None explicitly. * style --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-06-12 07:17:06 -07:00
Yi Wang	6da07fb1b0	make models/phi3.py and models/phi3small.py compatible with mypy (#833 )	2024-06-12 06:53:55 -07:00
JosefAlbers	fda41545a6	Su-RoPE(Rotary Position Embedding) for Phi-3 (#813 ) * Su-RoPE * nits * Update su_rope.py * Update su_rope.py Per GPT4: "The error TypeError: 'type' object is not subscriptable is caused by using the type hint list[float] in a version of Python that does not support it. This syntax is only available in Python 3.9 and later." * Ran isort --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-06-11 06:20:04 -07:00
Yi Wang	a54dfd698e	Correct the type annotation of cache in llama.py (#828 ) * Update * Fix isort	2024-06-10 15:18:34 -07:00
Yi Wang	bb8227f181	Correct type annotation of llama.ModelArgs.num_key_value_heads (#827 )	2024-06-10 14:47:31 -07:00
Robin Glauser	4872727f14	Fixing "NameError: name 'resume_adapter_file' is not defined" (#817 ) args. is missing from resume_adapter_file so the name is not defined.	2024-06-05 10:07:31 -07:00
Michał Kurc	43d6deb3c1	mlx_lm: Add Streaming Capability to Generate Function (#807 ) * Add streaming feature to text generation function * separate stream and regular functions --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-06-03 09:04:39 -07:00
Derek Lewis	89b0b75250	GPT2 Support (#798 ) * GPT-2 model support * Add test for gpt2 model * Fix weight sanitizing for quantization * use approx gelu --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-06-02 16:33:20 -07:00
madroid	c457a3f88b	LoRA: Extract small function (#614 ) * LoRA: Extract pre_processing_model function * LoRA: Extract small functions(train_model,evaluate_model) * move test case to test_tuner_utils.py * nits * nits * remove extra param, validate at it 0 * version * fix test --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-06-02 06:38:42 -07:00
Awni Hannun	81318ad4a8	Port of phi3small (#794 ) * start port of phi3small * fix phi3 * use block sparsity * compile activation * nits in readme / mlx lm version	2024-05-31 12:54:14 -07:00
Awni Hannun	09aaeac72c	fix moe conversion (#802 )	2024-05-31 12:36:05 -07:00
Behnam Moh	f49c5f2829	fixed the requirements (#803 )	2024-05-29 06:14:19 -07:00
Chen Xin	aac98ca6f4	support internlm2 (#797 ) * support internlm2 * only attention projections --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-05-27 06:22:21 -07:00
Awni Hannun	ca7ce60c91	Rename block sparse to gather (#793 ) * rename block sparse to gather * pin mlx version	2024-05-23 19:47:35 -07:00
Prince Canuma	69700d8431	Add support for Phi-3 Medium (#790 ) * update to support phi-3 medium * fuse qkv split	2024-05-22 16:47:06 -07:00
Prince Canuma	b044ce2acf	Add support for ibm granite (#758 ) * add support for granite 3-8B config * add gpt_bigcode * add positional embedding condition. * add support for granite 3-8B config * add gpt_bigcode * add positional embedding condition. * remove unused function * rebase fix * move position embedding to mask creation * add to tuner and format * add support for granite 3-8B config * add gpt_bigcode * add positional embedding condition. * add support for granite 3-8B config * add gpt_bigcode * add positional embedding condition. * rebase fix * move position embedding to mask creation * add to tuner and format * refactor mask * remove dropout layers	2024-05-21 20:16:31 -07:00
Awni Hannun	9fc6efbd90	version bump + some fixes (#792 )	2024-05-21 20:09:35 -07:00
Angelos Katharopoulos	9f671228cd	Block sparse MM MoEs (#782 ) - Adds SwitchLinear - Adds QuantizedSwitchLinear	2024-05-21 15:58:08 -07:00
AtakanTekparmak	199df9e110	fix: Added dedicated error handling to load and get_model_path (#775 ) * fix: Added dedicated error handling to load and get_model_path Added proper error handling to load and get_model_path by adding a dedicated exception class, because when the local path is not right, it still throws the huggingface RepositoryNotFoundError * fix: Changed error message and resolved lack of import * fix: Removed redundant try-catch block * nits in message * nits in message --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-05-20 06:39:05 -07:00
alexC-nonsense4k	42458914c8	support dora finetune in mlx-examples/llms/mlx_lm (#779 ) * support dora finetune * solve problems in lora.py and tuner.utils.py * add use_dora (bool) in functions of load adapters * delete all unsupported quantization code and fix all the calculate problems in mlx_lm/tuner/dora.py * Using stop_gradient to prevent gradients from flowing through ‘norm’ during backpropagation * set DEFAULT_USE_DORA in mlx_lm/generate.py * add annotation for all the use_dora * mlx_lm/fuse.py support fuse dora layers and fix a bug of to_linear() in mlx_lm/tuner/dora.py * simplify code of judging type of a fused layer in mlx_lm/fuse.py * add use_dora in mlx_lm/fuse.py when apply_lora_layers() * style + nits * style + nits * more updates --------- Co-authored-by: chenyifei08 <chenyifei08@baidu.com> Co-authored-by: Awni Hannun <awni@apple.com>	2024-05-16 08:21:26 -07:00
Awni Hannun	69181e0058	Support non incremental kv cache growth (#766 )	2024-05-15 12:56:24 -07:00
JosefAlbers	10853b57d9	Add `model_config` parameter to `load()` and `load_model()` (#770 ) * Add `model_config` parameter to `load()` and `load_model()` For easy editing of the loaded model configuration (e.g., for changing RoPE theta or scaling of Phi-3 model) Example: ```python from mlx_lm import load, generate model, tokenizer = load("mlx-community/Phi-3-mini-4k-instruct-4bit-no-q-embed", model_config={"rope_theta":50000.0}) response = generate(model, tokenizer, prompt, max_tokens=MAX_TOKENS) ``` * Possible bug (default_loss) * Revert "Possible bug (default_loss)" This reverts commit `70a55ace18`. * Fix default_loss for lora * 1. move load_model's new optional `model_config` arg to the end (fetch_from_hub()'s `model = load_model(model_path, lazy)`) 2. fix indentations (`black` hook)	2024-05-10 10:13:34 -07:00
Awni Hannun	6f0a69e682	fix lora for openelm (#773 )	2024-05-10 09:51:41 -07:00
Awni Hannun	fad9598372	Fix llama cache check (#763 ) * fix llama cache check * add test	2024-05-08 08:35:54 -07:00
Awni Hannun	ee60e2a9d5	Kv cache (#643 ) * in place kv_cache * fix * fix kv cache size * partially fix kv cache dtype * step kv cache * multiple of step size * more tests + kv cache * more kv cache * update all models to use kv cache	2024-05-08 08:18:13 -07:00
Kevin Wang	c0019c4908	Pad mask with zeros for non-square attention matrices (#715 ) * Pad mask with zeros for non-square attention matrices The current implementation of the mask assumes the attention matrix is square, which is true if there is no cache. However, if one wishes to produce multiple tokens at a time, such as in speculative decoding implementations, a rectangular mask is necessary. This change pads the bottom of the mask with zeros so multi-token decoding with a cache works correctly. * Directly create mask instead of padding * Update llama.py	2024-05-04 16:32:25 -07:00
Anchen	f30413b63c	chore(mlx-lm): fix the number of validation batches configuration. (#752 ) * chore: fix number of validation batches * clean up * address comment	2024-05-04 06:52:42 -07:00
Awni Hannun	2bf11c4633	Use stable url for MNIST (#749 ) * use stable url * remove deprecated flag	2024-05-03 17:13:05 -07:00
Konstantin Kerekovski	d1c35fa684	Add MLX Cache Limit setting for mlx_lm.generate and mlx_lm.server CLI (#744 ) * Add support for setting MLX cache limit in GB * Add support for setting MLX cache limit in GB in mlx_lm.server * format --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-05-03 12:42:48 -07:00
Ivan Fioravanti	b468091f7f	Add model management functionality for local caches (#736 ) * Add model management functionality for local caches This commit introduces a set of command-line utilities for managing MLX models downloaded and saved locally in Hugging Face cache. The functionalities include scanning existing models, retrieving detailed information about a specific model, and deleting a model by its name. * Added mlx_lm.model to setup.py * nits --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-05-03 12:20:13 -07:00
Awni Hannun	92430df0a0	Fix lora for qwen moe (#743 ) * fix lora for qwen moe * use max seq length in test as well	2024-05-02 21:55:09 -07:00
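
Commit #715 (c0019c4908) builds the attention mask directly for non-square attention matrices, so several new tokens can be decoded at once against an existing KV cache. A minimal pure-Python sketch of such an additive causal mask, assuming query i sits at absolute position offset + i and may attend to every key at or before that position; the name `additive_causal_mask` is hypothetical and this is not the mlx_lm implementation:

```python
import math


def additive_causal_mask(num_queries: int, offset: int) -> list[list[float]]:
    """Return a (num_queries, offset + num_queries) additive causal mask.

    `offset` is the number of keys already in the cache. Query q sits at
    absolute position offset + q, so key k is visible (0.0) when
    k <= offset + q and hidden (-inf) otherwise. With offset == 0 this
    reduces to the usual square lower-triangular mask.
    """
    num_keys = offset + num_queries
    return [
        [0.0 if k <= offset + q else -math.inf for k in range(num_keys)]
        for q in range(num_queries)
    ]


# Two new tokens decoded against three cached keys.
for row in additive_causal_mask(2, 3):
    print(row)
# [0.0, 0.0, 0.0, 0.0, -inf]
# [0.0, 0.0, 0.0, 0.0, 0.0]
```

The mask is added to the attention scores before the softmax, so -inf entries receive zero weight. Building it at the rectangular shape up front avoids padding a square mask afterwards.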

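Commit #857 (1e05aef344) adds logit soft capping to Gemma to avoid the NaNs it describes. The sketch below assumes the common form of soft capping, cap * tanh(x / cap). This form leaves small logits nearly unchanged and bounds large ones to at most cap in magnitude. It works on plain Python floats. The name `soft_cap` and the cap of 30.0 are illustrative assumptions, not values taken from the commit:

```python
import math


def soft_cap(logits: list[float], cap: float) -> list[float]:
    """Smoothly limit each logit to at most `cap` in magnitude.

    cap * tanh(x / cap) is close to the identity when |x| is much smaller
    than cap and saturates toward +/-cap as |x| grows, so extreme logits
    cannot blow up downstream computations.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    return [cap * math.tanh(x / cap) for x in logits]


# Illustrative cap only; a real model would read it from its config.
capped = soft_cap([0.0, 1.0, 1e4, -1e4], cap=30.0)
print(capped)
```

Unlike hard clipping, tanh keeps the mapping monotonic and differentiable everywhere, so the ordering of logits is preserved.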

395 Commits