zhangyiss/mlx - mlx - Gitea for Geophysics

mirror of https://github.com/ml-explore/mlx.git synced 2025-12-16 01:49:05 +08:00

Author	SHA1	Message	Date
Angelos Katharopoulos	c4189a38e4	Add float mask to sdpa vector (#2068 )	2025-04-11 17:29:40 -07:00
Awni Hannun	00794c42bc	Fix causal mask sdpa vec (#2053 ) * fix sdpa vector causal mask * test	2025-04-08 09:11:23 -07:00
Awni Hannun	05d7118561	causal vector sdpa (#2018 ) * causal vector sdpa * get rid of memory threshold	2025-03-28 12:36:13 -07:00
Awni Hannun	7b7e2352cd	fix malloc or wait deadlock (#1976 )	2025-03-20 16:48:43 -07:00
Awni Hannun	005e7efa64	fix mask in sdpa (#1980 ) * fix mask in sdpa * fix attention mask * Re-enable routing for array mask --------- Co-authored-by: Jagrit Digani <digani@apple.com>	2025-03-20 14:53:12 -07:00
Jagrit Digani	9adcd1a650	Support fused masking in Attention (#1924 ) * Update API to allow mask='causal' in fast::sdpa * Add fallback * Update steel::AttnParams * Fix typo * WIP, basic causal * Update tests * Update benchmarking * Update masking loop limits * Add bool masking and update tests * Update additive mask * Update benchmarks * Update benchmarks * Update tests * Update for bfloat error * Update early exit * Add random seed to tests	2025-03-20 11:01:32 -07:00
Awni Hannun	3c3e558c60	Support transposed head/seq for kv (#1950 ) * support transposed head/seq for kv * fix flaky test * nit	2025-03-10 10:53:45 -07:00
Awni Hannun	c4230747a1	redesign for faster cpu/gpu synch (#1869 ) * redesign for faster cpu/gpu synch * load + more async CPU * use command encoder API and move more ops to use it * make fence back-end generic + CPU only fence * faster build * fix async eval * fixes + handle temporaries * fix / improve cpu conv * remove unused status, fix siblings * fix extensions * fix * fix no cpu build * format * comments * fix perf regression, remove unecessary abort * fix events, task limit cpu * fix waiting * fix donation / temporaries in normalization	2025-03-06 19:23:38 -08:00
Awni Hannun	e613d0eaf0	SDPA support for small batch (over sequence) queries (#1922 ) * batch query sdpa * batch sdpa for query	2025-03-04 10:59:04 -08:00
Angelos Katharopoulos	f5cc1eea72	Allow different value dimensions in sdpa_vector (#1811 )	2025-01-31 20:58:59 -08:00
Awni Hannun	d1766f2c70	Add boolean mask support in vector SDPA (#1757 )	2025-01-07 20:24:53 -08:00
Awni Hannun	40c62c1321	Use int64 stride everywhere (#1671 ) * use int64 stride everywhere * fix ext * fix ext * more shape + cleanup * one more * few more	2024-12-09 11:09:02 -08:00
Jagrit Digani	9d40e521d7	Stop matrix copies with new attention kernel (#1639 )	2024-12-02 14:12:38 -08:00
Jagrit Digani	02bec0bb6d	Matrix Attention kernel (#1610 ) * Rough INIT * [WIP]: Loading and Matmuls added * [WIP]: Reductions and min working aligned kernel at headdim = 64 * [WIP] Added headdim 80 for testing * [WIP] Update dispatch params for testing * [WIP] Add support for unaligned seq lengths - still looks messy * Update sdpa_benchmarks * Update sdpa_benchmarks * Update sdpa_benchmarks * Enable gqa support * Update benchmark and switch off 128 headdim * Update headdim 128 tuning * Remove older fast attention code. Write out O strided * Disable hd=128 until further optimizations * Enable bf16 * Fix data size bug * Enable attn build outside of jit	2024-11-22 10:34:05 -08:00
Angelos Katharopoulos	073076ac7d	2-Pass Sdpa Inference Kernel (#1597 )	2024-11-18 17:31:53 -08:00
Awni Hannun	b35f1e3c9c	fix donation in sdpa (#1587 )	2024-11-13 17:21:13 -08:00
Awni Hannun	9f0d5c12fc	Fully wrap the command encoder (#1572 ) * fully wrap the command encoder * use consistent style + fix extensions	2024-11-08 11:50:21 -08:00
Awni Hannun	9a3842a2d9	fix (#1566 )	2024-11-06 17:10:33 -08:00
Angelos Katharopoulos	62f297b51d	Sdpa fix (#1558 )	2024-11-02 21:25:46 -07:00
Awni Hannun	c26208f67d	Remove Hazard tracking with Fences (#1509 ) * remove hazard tracking * with fence map * no hazard tracking with fences * nits * fix fence retain * cleanup * fix quantized rebase	2024-10-21 19:33:32 -07:00
Angelos Katharopoulos	50d8bed468	Fused attention for single query (#1497 )	2024-10-18 00:58:52 -07:00
Brian Keene	1865299a30	Metal shaders for memory efficient self attention on large sequences (#964 ) * Metal shaders for efficient self attention on large sequences Updated fast attention: GEMM-ified with Steel primitives Uses flash attention 1 for scale correction * more compiler silencing * Address rebase issues * Templatize kernel instantiation, revise cpu bindings * Safer writes to output * Permit batch size > 1 * Numerical fixes for sdpa self attention * Re-enable test, remove unused variable * add benchmarking script * Disable sdpa prior to perf tuning, and simplify tests for per-patch CI	2024-06-03 09:16:19 -07:00
Awni Hannun	06375e6605	Split encoders in non-concurrent context with a max ops per encoder (#1085 ) * split encoders * fix race	2024-05-09 16:21:02 -07:00
Awni Hannun	12d4507ee3	Explicit barriers with concurrent dispatch (#977 )	2024-04-10 21:45:31 -07:00
Daniel Strobusch	479051ce1c	add numeric type hierarchy and issubdtype as well as a set_dtype meth… (#427 ) * add numeric type hierarchy and issubdtype as well as a set_dtype method to nn.Module with predicate numeric type hierarchy and issubtype is compatible to the [numpy hierarchy](`220f0ab2c5/numpy/_core/numerictypes.py (L42)`). Closes #285. * nits in docs * unify type category checking * nits in docs * nits in docs * more docs nits * fix callable type --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-03-25 12:32:59 -07:00
Awni Hannun	7c441600fe	Compile stride bug (#812 ) * fix compile stride bug * revert sdpa fix * fix cpu * fix bug with simplifying outputs	2024-03-11 06:31:31 -07:00
Jagrit Digani	ec8a4864fa	Fix SDPA kernel bug on Mac OS 13.3 SDK (#805 ) * Move sdpa kernel to allocate tgp mem statically and allow macOS 13.3 SDK builds * Style	2024-03-07 10:18:09 -08:00
Brian Keene	0787724c44	Fast Inference SDPA op (#735 ) * Fast Inference SDPA op Implements metal shaders for: o = mx.fast_inference_sdpa(queries, keys, values, scale, mask) Supports fp16, fp32 dtypes; assumes d_k = 128. Generic op support / prompt encoding supported via mlx primitives. Metal implementation is for the inference use case only. Majority of performance benefits appears to results from GQA & reduced bandwidth requirements; there is approximate performance parity for the MHA use case (from some measurements on M3 Max). * Flush shared memory to zero before unprotected reads for (scores @ values) * Move to fast:: namespace, address reviewer comments ... also attempt to revert formatter auto-change for files not relevant to this change * Shared memory flush to top of kernel * Resolve compiler warnings * Update python/src/fast.cpp Co-authored-by: Awni Hannun <awni.hannun@gmail.com> * Update python/src/fast.cpp Co-authored-by: Awni Hannun <awni.hannun@gmail.com> * Update python/src/fast.cpp Co-authored-by: Awni Hannun <awni.hannun@gmail.com> * Update python/src/fast.cpp Co-authored-by: Awni Hannun <awni.hannun@gmail.com> * Update docstring per PR feedback * Softmax in higher precision, ... * route to fallback for more use cases - batch size > 1, head_dim other than 128, etc. * Address linux build failure * Address other reviewer comments * Remove extraneous eval_cpu function per review --------- Co-authored-by: Atila Orhon <64497909+atiorh@users.noreply.github.com> Co-authored-by: Awni Hannun <awni.hannun@gmail.com> Co-authored-by: atila <atiorh@icloud.com>	2024-03-04 21:06:11 -08:00

28 Commits