Commit Graph

61 Commits

Author SHA1 Message Date
Jagrit Digani
3290bfa690
Add new sdpa function overload (#2035)
* Add new sdpa function overload

* Address comments

* Remove std::variant from cpp sdpa function
2025-04-03 11:58:28 -07:00
Awni Hannun
05d7118561
causal vector sdpa (#2018)
* causal vector sdpa

* get rid of memory threshold
2025-03-28 12:36:13 -07:00
Awni Hannun
bc62932984
sdpa specialization for head dim 256 (#2007) 2025-03-27 19:31:25 -07:00
Awni Hannun
a84cc0123f
promote mask when needed (#1998) 2025-03-23 19:58:28 -07:00
Awni Hannun
005e7efa64
fix mask in sdpa (#1980)
* fix mask in sdpa

* fix attention mask

* Re-enable routing for array mask

---------

Co-authored-by: Jagrit Digani <digani@apple.com>
2025-03-20 14:53:12 -07:00
Jagrit Digani
b42d13ec84
Update attention tests to show diff, disable array masks (#1978) 2025-03-20 14:25:38 -07:00
Jagrit Digani
9adcd1a650
Support fused masking in Attention (#1924)
* Update API to allow mask='causal' in fast::sdpa

* Add fallback

* Update steel::AttnParams

* Fix typo

* WIP, basic causal

* Update tests

* Update benchmarking

* Update masking loop limits

* Add bool masking and update tests

* Update additive mask

* Update benchmarks

* Update benchmarks

* Update tests

* Update for bfloat error

* Update early exit

* Add random seed to tests
2025-03-20 11:01:32 -07:00
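
The change in #1924 lets the string "causal" be passed as the mask to fast scaled dot product attention. A minimal sketch of how that option might be used from Python, assuming the `mx.fast.scaled_dot_product_attention` binding; shapes and scale are illustrative, not taken from the commit:

```python
import mlx.core as mx

B, H, L, D = 1, 8, 128, 64
q = mx.random.normal((B, H, L, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))

# mask="causal" selects the fused causal-masking path instead of
# materializing an explicit (L, L) mask array.
out = mx.fast.scaled_dot_product_attention(
    q, k, v, scale=1.0 / D**0.5, mask="causal"
)
```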
Alex Barron
fd0d63ba5b
Affine quant always in fp32 (#1925)
* do affine quant in fp32

* static cast
2025-03-04 17:50:19 -08:00
Awni Hannun
e613d0eaf0
SDPA support for small batch (over sequence) queries (#1922)
* batch query sdpa

* batch sdpa for query
2025-03-04 10:59:04 -08:00
Angelos Katharopoulos
5e6c130d93
RMS norm without scaling (#1915) 2025-02-28 20:26:57 -08:00
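
Based on the commit title, #1915 lets the RMS norm run without a learned scale. A small sketch, assuming `mx.fast.rms_norm` accepts `None` for the weight argument after this change:

```python
import mlx.core as mx

x = mx.random.normal((2, 16, 512))

# Passing None for the weight skips the elementwise scaling step.
y = mx.fast.rms_norm(x, None, 1e-5)
```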
Jagrit Digani
89d327075f
Enabling fused attention for head dim 128 (#1899)
* Share KV smem

* Fix bfloat error

* Unroll O = S @ V loop

* Perf upgrade

* Remove commented out function

* Add -Wno-c++17-extensions flag to metal flags

* Add -Wno-c++17-extensions flag to metal extension flags
2025-02-26 10:02:06 -08:00
Angelos Katharopoulos
78ba24c37d
Raise an exception in the rope op if input is integer (#1884) 2025-02-19 14:43:39 -08:00
Angelos Katharopoulos
f5cc1eea72
Allow different value dimensions in sdpa_vector (#1811) 2025-01-31 20:58:59 -08:00
Awni Hannun
121d9a0702
Fix rope fallback to not upcast (#1797)
* fix rope fallback to not upcast

* Update mlx/fast.cpp

Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>

---------

Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
2025-01-26 19:07:21 -08:00
Awni Hannun
d1766f2c70
Add boolean mask support in vector SDPA (#1757) 2025-01-07 20:24:53 -08:00
Awni Hannun
d5ec172c95
Allow boolean mask in sdpa (#1753)
* allow boolean mask in sdpa

* more permissive donation in ternary
2025-01-06 16:57:07 -08:00
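
#1753 extends the mask argument of fast SDPA to accept boolean arrays. A small sketch, assuming `True` marks positions that are kept:

```python
import mlx.core as mx

B, H, L, D = 1, 4, 64, 64
q = mx.random.normal((B, H, L, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))

# Lower-triangular boolean mask; True entries attend, False are masked out.
mask = mx.tril(mx.ones((L, L), dtype=mx.bool_))

out = mx.fast.scaled_dot_product_attention(
    q, k, v, scale=1.0 / D**0.5, mask=mask
)
```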
Awni Hannun
4ba0c24a8f
Export / import functions to / from a file (#1642)
* export and import functions

* refactor + works for few primitives

* nit

* allow primitives with state

* nit

* nit

* simplify serialize / deserialize

* fix for constants

* python bindings

* maybe fix serialize failure case

* add example

* more primitives, training kind of works

* same result for python and c++

* some fixes

* fix export

* template it up

* some simplification

* rebase

* allow kwargs and multiple functions

* exporter

* more primitives for exporting

* deal with endianness

* handle invalid stream

* add docstring
2024-12-24 11:19:13 -08:00
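
#1642 adds function export and import. A rough usage sketch; the file name and function below are placeholders, and the imported callable is assumed to return its outputs as a list:

```python
import mlx.core as mx

def fn(x, y):
    return mx.abs(x + y)

x = mx.array(1.0)
y = mx.array(-2.0)

# Trace fn with example inputs and write the graph to disk.
mx.export_function("fn.mlxfn", fn, x, y)

# Load it back; outputs come back as a list of arrays.
imported = mx.import_function("fn.mlxfn")
(out,) = imported(x, y)
```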
Awni Hannun
0308e9af71
Allow offset to be an mx.array for mx.fast.rope (#1724)
* allow offset for rope

* comment
2024-12-19 15:51:44 -08:00
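
#1724 allows the RoPE offset to be an `mx.array` rather than a plain int, which is convenient when the position already lives on-device (for example, a KV-cache length). A sketch with illustrative shapes:

```python
import mlx.core as mx

x = mx.random.normal((1, 8, 1, 64))   # (batch, heads, new tokens, dims)
offset = mx.array(16)                 # position offset as an array

y = mx.fast.rope(
    x, 64, traditional=False, base=10000.0, scale=1.0, offset=offset
)
```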
Awni Hannun
e03f0372b1
More shape type (#1705)
* more shape type

* fix
2024-12-19 08:08:20 -08:00
Alex Barron
f17536af9c
More lenient mask type check in SDPA (#1723)
* check mask type

* require promotion
2024-12-18 19:41:38 -08:00
Alex Barron
49c34c4161
check mask type (#1721) 2024-12-18 14:25:18 -08:00
Awni Hannun
50f3535693
Use expand_dims / unflatten / etc in more places (#1696)
* use expand_dims / unflatten in a couple more places

* few more

* few more

* fix
2024-12-12 17:00:44 -08:00
Awni Hannun
4e1e9520e1
Flatten and unflatten (#1692)
* flatten and unflatten

* fix grad

* fix shape infer

* use squeeze + unsqueeze in get_item
2024-12-11 21:51:37 -08:00
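
#1692 adds the unflatten op alongside flatten. A quick sketch of the round trip:

```python
import mlx.core as mx

x = mx.zeros((2, 3, 4))

# Collapse axes 1 and 2 into one axis ...
y = mx.flatten(x, start_axis=1, end_axis=2)   # shape (2, 12)

# ... and split that axis back apart.
z = mx.unflatten(y, 1, (3, 4))                # shape (2, 3, 4)
```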
Awni Hannun
40c62c1321
Use int64 stride everywhere (#1671)
* use int64 stride everywhere

* fix ext

* fix ext

* more shape + cleanup

* one more

* few more
2024-12-09 11:09:02 -08:00
Cheng
d92ea094f1
Use && instead of and (#1663)
* Use && instead of and

* Remove "and" in ops.cpp
2024-12-07 18:26:39 -08:00
Awni Hannun
d0b6cb0425
More primitives for compiling with shapeless (#1653)
* more shapeless and more Shape

* more shape

* fix

* fix
2024-12-06 11:29:18 -08:00
Awni Hannun
c5b0928c1f
fix fallback (#1646) 2024-12-05 11:59:53 -08:00
Jagrit Digani
02bec0bb6d
Matrix Attention kernel (#1610)
* Rough INIT

* [WIP]: Loading and Matmuls added

* [WIP]: Reductions and minimal working aligned kernel at headdim = 64

* [WIP] Added headdim 80 for testing

* [WIP] Update dispatch params for testing

* [WIP] Add support for unaligned seq lengths - still looks messy

* Update sdpa_benchmarks

* Update sdpa_benchmarks

* Update sdpa_benchmarks

* Enable gqa support

* Update benchmark and switch off 128 headdim

* Update headdim 128 tuning

* Remove older fast attention code. Write out O strided

* Disable hd=128 until further optimizations

* Enable bf16

* Fix data size bug

* Enable attn build outside of jit
2024-11-22 10:34:05 -08:00
Alex Barron
c79f6a4a8c
3 and 6 bit quantization (#1613)
* Support 3 and 6 bit quantization
2024-11-22 10:22:13 -08:00
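
#1613 adds 3- and 6-bit widths to affine quantization. A sketch of quantize / dequantize with one of the new widths; the group size is chosen for illustration:

```python
import mlx.core as mx

w = mx.random.normal((512, 512))

# Quantize to 3 bits per weight, then reconstruct an approximation.
w_q, scales, biases = mx.quantize(w, group_size=64, bits=3)
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=3)
```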
Awni Hannun
dfa0b9aab4
Cpu fast quantize (#1578)
* cpu quantize

* fix
2024-11-08 20:10:39 -08:00
Awni Hannun
91c0277356
fix per-example mask + docs in sdpa (#1574) 2024-11-08 11:51:15 -08:00
Awni Hannun
76f275b4df
error in rms for wrong size (#1562) 2024-11-04 13:24:02 -08:00
Alex Barron
d15fa13daf
Batched Quantized Matmul + Fast Small QMV (#1503)
* add fast qmv for small dims

* fix test

* batched cpu

* add batched template param

* refactor metal quantized.cpp
2024-10-21 16:23:17 -07:00
Angelos Katharopoulos
50d8bed468
Fused attention for single query (#1497) 2024-10-18 00:58:52 -07:00
Awni Hannun
020f048cd0
A few updates for CPU (#1482)
* some updates

* format

* fix

* nit
2024-10-14 12:45:49 -07:00
Awni Hannun
0b4a58699e
Some overhead reductions in mx.fast.metal_kernel (#1437)
* some overhead reductions

* fix

* use +=

* use more +=
2024-09-25 17:25:21 -07:00
Awni Hannun
3ae6aabe9f
throw for certain cases of non captured inputs in compile (#1401) 2024-09-09 14:54:31 -07:00
Awni Hannun
ba3e913c7a
Simplifications for MLX C (#1396)
* simplifications for MLX C

* use vectors instead of map

* update examples
2024-09-06 19:16:50 -07:00
Alex Barron
1d94ac3f90
Add optional headers to `mx.fast.metal_kernel` (#1358) 2024-08-26 21:45:45 -07:00
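
#1358 adds a header argument so Metal includes or helper definitions can be prepended to the generated kernel. A small sketch; the kernel name and body are placeholders:

```python
import mlx.core as mx

kernel = mx.fast.metal_kernel(
    name="scaled_exp",
    input_names=["inp"],
    output_names=["out"],
    header="#include <metal_math>",
    source="""
        uint i = thread_position_in_grid.x;
        out[i] = metal::exp(inp[i]);
    """,
)
```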
Alex Barron
b96e105244
Add grid_sample example to metal_kernel docs (#1352)
* Add `zero_outputs` and `atomic_outputs` options to `metal_kernel`

* add grid sample to docs

* zero_outputs -> init_value

* add missing header for linux
2024-08-23 18:24:16 -07:00
Alex Barron
da8deb2b62
fix bug with multiple attributes (#1348)
Co-authored-by: Alex Barron <abarron22@apple.com>
2024-08-23 10:06:15 -07:00
Alex Barron
0fd2a1f4b0
Custom Metal Kernels from Python (#1325)
* start

* simple kernels working

* restructure

* inverse example working

* docs + fixes

* missing file

* fix imports

* address comments

* add docs + fix test

* Review comments + refactor to a single function

* update docs

* remove hashing

* fix contig bug in test

* back to a class

* trailing whitespace

* fix tests

* match c++ and python apis

* add link + make args kw_only
2024-08-22 13:46:29 -07:00
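
#1325 exposes custom Metal kernels through `mx.fast.metal_kernel`: you supply only the kernel body, and MLX generates the surrounding signature from the input and output names. A sketch modeled on an elementwise exp; names and launch dimensions are illustrative:

```python
import mlx.core as mx

source = """
    uint elem = thread_position_in_grid.x;
    T tmp = inp[elem];
    out[elem] = metal::exp(tmp);
"""

kernel = mx.fast.metal_kernel(
    name="myexp",
    input_names=["inp"],
    output_names=["out"],
    source=source,
)

a = mx.random.normal((4096,))
outputs = kernel(
    inputs=[a],
    template=[("T", mx.float32)],      # fills in the template type T
    grid=(a.size, 1, 1),
    threadgroup=(256, 1, 1),
    output_shapes=[a.shape],
    output_dtypes=[a.dtype],
)
out = outputs[0]
```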
Awni Hannun
d40e76809f
Fix rope (#1340)
* add test

* fix rope

* fix test
2024-08-20 17:37:52 -07:00
Awni Hannun
bb1b76d9dc
RoPE with frequencies as optional input (#1337)
* start rope with freq input

* rope with frequencies

* nits

* fix bug

* fix bug + test

* cleanup

* optional base
2024-08-19 18:30:50 -07:00
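
#1337 lets callers pass explicit per-dimension frequencies to RoPE, with the base becoming optional. A sketch assuming `freqs` has shape `(dims // 2,)` and `base` is set to `None` when frequencies are supplied:

```python
import mlx.core as mx

x = mx.random.normal((1, 8, 32, 64))
dims = 64

# Custom frequencies replacing the default base ** (-2i / dims) schedule.
freqs = mx.power(500000.0, -mx.arange(0.0, dims // 2) * (2.0 / dims))

y = mx.fast.rope(
    x, dims, traditional=False, base=None, scale=1.0, offset=0, freqs=freqs
)
```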
Brian Keene
19fb69e2ed
Add memory_efficient_threshold kwarg to sdpa kernel (#1319)
Allows opt-in to the memory-efficient GPU shader at the prescribed sequence
length. Otherwise, utilizes aggregate MLX primitives for best latency.
2024-08-12 12:57:09 -07:00
Alex Barron
c52d1600f0
Fused Affine Quantize/Dequantize ops (#1282)
* Add fast affine dequantize

* add full quantize kernel

* fused kernel with scale/bias computation

* fix docstring

* fix no jit error

* fix test

* test fix

* reduce fast api to only affine_quantize
2024-07-29 15:11:38 -07:00
Angelos Katharopoulos
5c1fa64fb0
Custom transforms (#1246) 2024-07-10 18:00:01 -07:00
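
#1246 introduces `mx.custom_function`, which lets a function override its own vjp (and similarly jvp / vmap). A sketch of overriding the vjp of a two-argument function, assuming the override receives a tuple of primals and returns one cotangent per input:

```python
import mlx.core as mx

@mx.custom_function
def scaled_sin(x, y):
    return mx.sin(x) * y

# Hand-written vjp used by mx.grad / mx.vjp in place of the traced one.
@scaled_sin.vjp
def scaled_sin_vjp(primals, cotangent, output):
    x, y = primals
    return cotangent * mx.cos(x) * y, cotangent * mx.sin(x)

g = mx.grad(lambda a, b: scaled_sin(a, b).sum())(mx.array(0.3), mx.array(2.0))
```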
Brian Keene
1865299a30
Metal shaders for memory efficient self attention on large sequences (#964)
* Metal shaders for efficient self attention on large sequences

Updated fast attention: GEMM-ified with Steel primitives
Uses flash attention 1 for scale correction

* more compiler silencing

* Address rebase issues

* Templatize kernel instantiation, revise cpu bindings

* Safer writes to output

* Permit batch size > 1

* Numerical fixes for sdpa self attention

* Re-enable test, remove unused variable

* add benchmarking script

* Disable sdpa prior to perf tuning, and simplify tests for per-patch CI
2024-06-03 09:16:19 -07:00
Awni Hannun
e142aaf8a1
Option for precise softmax (#953)
* precise softmax

* Add an equivalency check

* Make the threadgroup memory definition fixed

* precise cpu softmax

* precise option on cpu

* remove print

---------

Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
2024-04-04 08:32:35 -07:00
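
#953 adds a precise option to softmax that accumulates in higher precision for low-precision inputs. A small sketch:

```python
import mlx.core as mx

x = mx.random.normal((8, 1024)).astype(mx.float16)

# precise=True computes the reduction in higher precision even though
# the input and output stay float16.
y = mx.softmax(x, axis=-1, precise=True)
```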
Angelos Katharopoulos
29221fa238
Implement vjps for some primitives in the fast namespace (#883)
* Implement rope vjp in terms of rope
* RMSNormVJP primitive and kernel
* Add LayerNormVJP primitive and kernel
2024-03-26 16:35:34 -07:00