mlx-examples

mirror of https://github.com/ml-explore/mlx-examples.git synced 2025-12-16 02:08:55 +08:00

Author	SHA1	Message	Date
Awni Hannun	f530f56df2	don't use internal exception (#990 )	2024-09-17 16:22:48 -07:00
Awni Hannun	6c2369e4b9	Fix bug in upload + docs nit (#981 ) * fix bug in upload + docs nit * nit	2024-09-07 14:46:57 -07:00
Awni Hannun	c3e3411756	Update LLM generation docs to use chat template (#973 ) * fix docs * add template to model cards as well * revert * version	2024-09-07 06:06:15 -07:00
Angelos Katharopoulos	324184d670	Fix the cache_prompt (#979 )	2024-09-06 20:19:27 -07:00
madroid	bd29aec299	Support HuggingFace model tree (#957 ) * Hub: Update quantization configuration fields * Hub: add base_model metadata * Hub: add quantization_config for model tree Quantized type * Hub: update quantization_config value * Hub: remove config print	2024-09-04 06:19:32 -07:00
Chime Ogbuji	83a209e200	Add prompt piping (#962 ) * Initial commit of --prompt-only and prompt from STDIN feature * Switch to using --verbose instead of --prompt-only * Fix capitalization typo * Fix reference to changed option name * Update exception text	2024-09-03 13:29:10 -07:00
James Zhao	bf921afcbe	Make sure to import the correct "version" module when installing mlx_whisper and mlx_lm from local source code. (#969 ) * Make sure to import the correct "version" module when installing the mlx_whisper package from local source code. * Make sure to import the correct "version" module when installing the mlx_lm package from local source code * fix --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-09-03 13:16:21 -07:00
Awni Hannun	3c6e8b11af	fix (#965 )	2024-08-30 05:56:27 -07:00
L	fc93c55723	feat(mlx_lm): Nemotron (#949 ) * feat: Nemotron https://huggingface.co/nvidia/Minitron-4B-Base This is basically Llama with partial RoPE and LayerNorm instead of BatchNorm. Also they add 1 to the LayerNorm weight for some reason. * fixup! feat: Nemotron * nits --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-08-29 21:08:57 -07:00
Awni Hannun	b1186e2a81	Docs on prompt scaling (#963 ) * docs on prompt scaling * remove unused var * nits	2024-08-29 15:05:17 -07:00
Angelos Katharopoulos	1003a8b2dd	Add the ability to load the KV cache from a file (#956 )	2024-08-28 22:11:45 -07:00
Angelos Katharopoulos	7f8c961287	Fix setattr for the TokenizerWrapper (#961 )	2024-08-28 14:47:33 -07:00
Prince Canuma	b5e18ef1e3	Add Phi-3.5-MoE (#946 ) * add phimoe * add phimoe to tunner * add switch_mlp * fix SuScaled args * nits --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-08-24 06:52:33 -07:00
Awni Hannun	6731254e76	Use fast rope (#945 ) * use fast rope * fix llama * use fast rope for llama3.1 * requires unreleased mlx * fix su * fix deepseek v2 * only one of base or freqs * nit * fix * hard code freqs	2024-08-23 13:18:51 -07:00
Awni Hannun	58591a1b41	fine tune deepseek (#932 )	2024-08-22 10:41:21 -07:00
L	0164d2058b	feat: DeepSeek MoE v1 (#942 ) * feat: deepseek v1 DeepSeek is still releasing models on the DeepSeek V1 architecture. ```sh mlx_lm.convert --hf-path deepseek-ai/DeepSeek-Prover-V1.5-RL --mlx-path DeepSeek-Prover-V1.5-RL-8bit --q-bits 8 -q mlx_lm.generate --model DeepSeek-Prover-V1.5-RL-8bit --ignore-chat-template --max-tokens 512 --prompt 'import Mathlib import Aesop set_option maxHeartbeats 0 open BigOperators Real Nat Topology Rat /-- The second and fourth terms of a geometric sequence are $2$ and $6$. Which of the following is a possible first term? Show that it is $\frac{2\sqrt{3}}{3}$.-/ theorem amc12b_2003_p6 (a r : ℝ) (u : ℕ → ℝ) (h₀ : ∀ k, u k = a * r ^ k) (h₁ : u 1 = 2) (h₂ : u 3 = 6) : u 0 = 2 / Real.sqrt 3 ∨ u 0 = -(2 / Real.sqrt 3) := by' ``` * nits * nits * nits --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-08-17 07:18:09 -07:00
Awni Hannun	7be292c0c9	Handle longer prompt/generation (#931 ) * rebase * nits * nit * fix rotating cache with step prefill * update version	2024-08-16 15:28:39 -07:00
Zai Thottakath	4e01700816	Allow the entire model to be targed for LoRA and DoRA fine tuning: LoRA and DoRA embeddings with small DoRALinear bug fix (#914 ) * feature: LoRA adapter for Embeddings * feature: wire in LoRAEmbedding into the tuner. Allow the embedding and non model.layers Linear layers to be targeted for fine tuning * feature: DoRA adapter for Embeddings * feature: wire in DoRAEmbedding * bugfix: ensure self.m is recalculated when the linear layer is changed in DoRALinear.from_linear * refactor: prefer from_base over from_linear or from_embedding. prefer fuse over to_linear or to_embedding * cleanup: remove unused imports in test_dora.py * refactor: remove unnecessary non_layer_modules * cleanup: remove wrong comments for lora embedding dropout. remove uncessary parens in dora embedding dropout * nits --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-08-16 07:38:36 -07:00
Chime Ogbuji	c50971e860	Min P implementation (#926 ) * Min P implementation * Change default to 0 (no min_p) * nits * nits --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-08-15 15:45:02 -07:00
Awni Hannun	9b83004631	Faster sampling with `mx.compile` (#937 ) * faster sampling with compile * fix test	2024-08-15 11:29:09 -07:00
Awni Hannun	95840f32e2	Fix whipser conversion for safetensors models (#935 ) * fix whipser conversion for safetensor only. error in mlx lm for existing paths * fix tests	2024-08-14 10:22:04 -07:00
Awni Hannun	33905447f9	Whisper updates to allow HF models (#923 ) * simplify conversion and update convert for HF models * use npz for compat * fixes * fixes * fix gguf * allow user supplied path	2024-08-09 11:11:58 -07:00
tidely	df744c98e6	Predict stop sequence matches during streaming (#541 ) * Predict stop sequence matches during streaming Check for overlap of stop sequences and the tokens array for potential sequence matches after more tokens get generated. Generate tokens until we can confirm that the stop sequence is not met. * fix typo * Change sequence_overlap logic * range isn't inclusive, add 1 to max_overlap * Add test_server.py Added a test for the sequence_overlap method * nits * eos sequence * finalize --------- Co-authored-by: Y4hL <43219534+Y4hL@users.noreply.github.com> Co-authored-by: Awni Hannun <awni@apple.com>	2024-08-06 15:24:15 -07:00
Khush Gupta	8fa12b0058	Adapters loading (#902 ) * Added functionality to load in adapters through post-requests so you do not need to restart the server * ran pre-commit * nits * fix test --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-08-01 16:18:18 -07:00
madroid	85dc76f6e0	Server: support stream_options (#913 ) * Server: support stream_options see https://x.com/OpenAIDevs/status/1787573348496773423 * Server: support stream_options * Server: check None type	2024-07-26 08:58:52 -07:00
otriscon	46da74fea2	Unify attention mask in LLMs (#911 ) * Unify attention mask creation in LLMs. Currently, each model implementation in `mlx-examples/llms/models` has ad-hoc code to create a mask for the attention mechanism. This usually takes the form: ``` mask = None if h.shape[1] > 1: mask = nn.MultiHeadAttention.create_additive_causal_mask(h.shape[1]) mask = mask.astype(h.dtype) ``` This correctly creates a mask only if the input consists of more than one token. But this code assumes the multi-token input is at the beginning of inference. If, for example, we are evaluating multiple tokens because of speculative decoding or prompt cache reuse, this mask will not have the correct shape and and will cause the raising of an exception in the attention computation. Some of the models correctly implement the mask creation with code like this: ``` mask = None if h.shape[1] > 1: mask = create_additive_causal_mask( h.shape[1], cache[0].offset if cache is not None else 0 ) mask = mask.astype(h.dtype) ``` This commit unifies the attention mask creation for all models with a new function `create_attention_mask`, reducing code duplication and helping all models support inference performance enhancements like those mentioned above. * Allow batches in LLM key-value cache The current implementation of the LLM key-value cache assumes that the input batch is of size 1. Input batching (evaluating multiple alterative inputs at the same time) can be a valuable tool for speculative sampling and other techniques. This change removes the hard-coded batch size from the code that resizes the key-value cache. * Simplify causal mask creation Use the same codepath regardless of whether there's an offset or not. Addresses [this comment](https://github.com/ml-explore/mlx-examples/pull/911#discussion_r1691459717). * Use old-style type annotation to avoid linter error	2024-07-25 16:45:22 -07:00
Anchen	7a3ab1620a	support load model by custom get_model_classes (#899 ) * feature(mlx_lm): support load model by custom get classes * rename the param	2024-07-25 11:01:17 -07:00
Alex Cheema	cd8efc7fbc	Add support for Llama-3.1 (#907 ) * add dynamicNTK scaling rope * remove unused var * fix rope base * llama3.1 fixes * TODO for rope eval * vectorise llama3 base freq calculation * removed the arbitrary 2.0 rope_scale default case * fix slow llama3.1 generation by evaluating stateless part of DynamicNTKScalingRoPE in init * nits + format * use mx.pi * fix tests and add test for 3.1 --------- Co-authored-by: Prince Canuma <prince.gdt@gmail.com> Co-authored-by: Awni Hannun <awni@apple.com>	2024-07-23 13:21:32 -07:00
Prince Canuma	3f337e0f0a	Add Mistral NeMo (fix) (#895 ) * fix head_dim * Update llms/mlx_lm/models/llama.py * fix kv error * formatting * Delete test.py --------- Co-authored-by: Awni Hannun <awni.hannun@gmail.com>	2024-07-22 06:09:24 -07:00
Prince Canuma	3d365b612a	Add support for InternLM-2.5 (#871 ) * fix internlm-2 * formatting * add dynamic ntk rope * formatting * move dynamic scaling rope to intermlm2.py * add default max_position_embeddings	2024-07-17 16:38:22 -07:00
Anchen	561dcf5643	Add support for deepseek coder v2 lite (#882 ) * feat: add support for deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct * fix softmax + some cleanup * more nits * fix rope * fix original_max_position_embeddings in rope * fix original_max_position_embeddings in rope config * add group greedy --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-07-17 07:23:28 -07:00
Awni Hannun	f0c6c6e226	keep the server in a valid state (#889 )	2024-07-15 18:35:36 -07:00
JosefAlbers	bfc1f2763b	longrope (#886 )	2024-07-12 07:19:11 -07:00
Chime Ogbuji	8bf397e450	Pass use_dora parameter to linear_to_lora_layers (#885 )	2024-07-11 14:34:34 -07:00
nicolov	fbe3247772	Add GPT-neox model (#863 )	2024-07-11 06:13:17 -07:00
Alex Wozniakowski	63800c8feb	Example of response generation with optional arguments (#853 ) * Generate response with optional arguments * Reference response generation example * Include transformers and sentencepiece * Update example to run Mistral-7B-Instruct-v0.3 * Link to generation example * Style changes from pre-commit	2024-07-09 06:49:59 -07:00
Awni Hannun	68e88d42fb	Fix server for `openai` package (#877 ) * fix * fixes for 9b	2024-07-08 12:34:31 -07:00
Awni Hannun	20e221f7f7	Add recurrent gemma (#856 ) * add recurrent gemma * fix window cache	2024-07-07 12:10:04 -07:00
n8programs	1e05aef344	Add logit soft capping to gemma, and fix precision issues (#857 ) * Add logit soft capping to gemma, and fix precision issues Gemma was babbling nonsense - so I figured out it was due to not having logit softcapping and precision issues causing NaNs (so I implemented the softcapping and added more float32 inference). gemma-27b-it-4bit now works flawlessly (or near-flawlessly, no sliding-window attention). * get rid of comments * get rid of last comments (sry lol) * nits --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-07-02 07:52:39 -07:00
Angelos Katharopoulos	f212b770d8	Server loads the model on demand from the request (#851 )	2024-06-27 11:37:57 -07:00
Awni Hannun	538339b599	gemma2 (#855 )	2024-06-27 10:06:28 -07:00
Awni Hannun	9f10728145	fix yi (#852 )	2024-06-27 06:38:19 -07:00
Chime Ogbuji	df6bc09d74	Configuration-based use of HF hub-hosted datasets for training (#701 ) * Add hf_dataset configuration for using HF hub-hosted datasets for (Q)LoRA training * Pre-commit formatting * Fix YAML config example * Print DS info * Include name * Add hf_dataset parameter default * Remove TextHFDataset and CompletionsHFDataset and use Dataset and CompletionsDataset instead, adding a text_key constructor argument to the former (and changing it to work with a provided data structure instead of just from a JSON file), and prompt_key and completion_key arguments to the latter with defaults for backwards compatibility. * nits * update docs --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-06-26 10:20:50 -07:00
Chime Ogbuji	1d701a1831	Logprobs info to completion API (#806 ) * Initial implementation * Fix handling of return_step_logits in return * Fixed OpenAI parameter expectations and logprob structure and datatypes * pre-commit black formatting * Remove unused parameter * fix log probs * fix colorize * nits in server * nits in server * Fix top_logprobs structure (a dict) and include tokens in logprobs response * nits * fix types --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-06-23 10:35:13 -07:00
Yi Wang	a7598e9456	Fix mypy errors with models/{qwen2,qwen2_moe,startcoder2}.py (#835 ) * Fix starcoder.py * Fix qwen2 * Remvoe unnecessary assert not None	2024-06-14 09:44:50 -07:00
Awni Hannun	d8b073e3a7	Add eos token to lora fine-tunes (#818 ) * add eos token to lora fine-tunes * Comment	2024-06-12 07:44:21 -07:00
Nada Amin	3cc58e17fb	Tweaks to run dspy-produced calls to the server, with gemma template. (#810 ) * Tweaks to run dspy-produced calls to the server, with gemma template. following comment https://github.com/stanfordnlp/dspy/issues/385#issuecomment-1998939936 can try it out with: ```sh python -m server --model mlx-community/gemma-1.1-7b-it-4bit --port 1143 ``` modulo patching the relative imports in server.py ``` -from .tokenizer_utils import TokenizerWrapper -from .utils import generate_step, load +from mlx_lm.tokenizer_utils import TokenizerWrapper +from mlx_lm.utils import generate_step, load ``` and then, ont the dspy side: ```python import dspy lm = dspy.OpenAI(model_type="chat", api_base="http://localhost:11434/v1/", api_key="not_needed", max_tokens=250) lm("hello") ``` * simpler way to validate float or int * remove logic that works around incompatible templates, too gemma specific * tweak messages for common denominator * use generate.py workaround for DBXR * put behind flag * oops * Solution to chat template issue: pass in a custom template! The template should likely adhere to the OpenAI chat model. Here is such a template for Gemma. --chat-template "{{ bos_token }}{% set extra_system = '' %}{% for message in messages %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{% if role == 'system' %}{% set extra_system = extra_system + message['content'] %}{% else %}{% if role == 'user' and extra_system %}{% set message_system = 'System: ' + extra_system %}{% else %}{% set message_system = '' %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message_system + message['content'] | trim + '<end_of_turn>\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}" * remove convoluted solution * Tweak for when None is provided explicitly, and must be set to [] too. For example, the outlines library provides None explicitly. * style --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-06-12 07:17:06 -07:00
Yi Wang	6da07fb1b0	make models/phi3.py and models/phi3small.py compatible with mypy (#833 )	2024-06-12 06:53:55 -07:00
JosefAlbers	fda41545a6	Su-RoPE(Rotary Position Embedding) for Phi-3 (#813 ) * Su-RoPE * nits * Update su_rope.py * Update su_rope.py Per GPT4: "The error TypeError: 'type' object is not subscriptable is caused by using the type hint list[float] in a version of Python that does not support it. This syntax is only available in Python 3.9 and later." * Ran isort --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-06-11 06:20:04 -07:00
Yi Wang	a54dfd698e	Correct the type annotation of cache in llama.py (#828 ) * Update * Fix isort	2024-06-10 15:18:34 -07:00

272 Commits