zhangyiss/mlx - mlx - Gitea for Geophysics

mirror of https://github.com/ml-explore/mlx.git synced 2025-12-16 01:49:05 +08:00

Author	SHA1	Message	Date
Awni Hannun	0dbc7e5bee	Centralize NAX condition (#2811 )	2025-11-21 13:28:15 -08:00
Jagrit Digani	54f1cc6e3e	Add Neural Accelerator Support (#2772 )	2025-11-19 15:06:00 -08:00
CCYeh	b3825ac149	Add Masked Scatter (#2663 ) Co-authored-by: Awni Hannun <awni@apple.com> Co-authored-by: Angelos Katharopoulos <katharas@gmail.com> Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>	2025-11-19 14:53:32 -08:00
Awni Hannun	26ceb507eb	only build for macos 14 and up (#2731 ) * only build for macos 14 and up * bump metal cpp	2025-11-04 09:44:15 -08:00
Awni Hannun	ec72b44417	Add quantize/dequantize for mxfp8 and nvfp4 (#2688 ) * Add quantize/dequantize slow path for mxfp8 and nvfp4 * fast cuda kernel for mx/nv quantization * fallback for cuda < 12.8 (#2697) * format (#2700) * fix (#2701) * metal kernels * docs * fix jit * add default bits and group sizes * improve quant docs * fix output type of mxfp4 matmuls	2025-10-28 16:23:12 -07:00
Awni Hannun	111f1e71af	Faster contiguous gather for indices in the first axis (#2552 ) * faster contiguous gather for indices in the first axis * work per thread > 1 * angelos suggestion for scales / biases	2025-08-28 21:26:30 -07:00
Awni Hannun	827003d568	fix METAL quantization in JIT (#2553 )	2025-08-28 18:26:25 -07:00
Angelos Katharopoulos	4a9b29a875	MoE backward improvements (#2335 )	2025-07-07 17:59:53 -07:00
Awni Hannun	f1606486d2	Generalize gpu backend (#2138 ) * generalize gpu backend * fix no_gpu build * fix no_gpu build * generalize gpu backend	2025-04-30 09:08:17 -07:00
Angelos Katharopoulos	99eefd2ec0	Gather mm new kernel and small refactoring (#2040 )	2025-04-14 16:37:36 -07:00
Awni Hannun	de5f38fd48	Custom logsumexp (#2028 ) * initial custom logsumexp * more tests * comments + fix	2025-03-31 07:36:55 -07:00
Jesper Stemann Andersen	2d8e667400	MinGW support (#1806 ) * Changed /bin/bash to bash for generating compiling preamble * Fix wrt jit_compiler mingw like msvc wrt. WEXITSTATUS * Solved ambiguity wrt. bernoulli test shape * Disabled distributed/ring on Windows * Fixed jit_compiler command wrt. MinGW * Extended jit_compiler patch wrt. WEXITSTATUS to FreeBSD	2025-02-01 12:40:06 -08:00
Awni Hannun	b7c9f1d38f	scatter axis + gather axis primitives (#1813 ) * scatter axis + gather axis primitives * add transforms * comment	2025-01-31 20:48:08 -08:00
Awni Hannun	a4667da1eb	Faster synchronization `Fence` primitive (#1773 ) * try faster synchronization move event fixes update bench fix fix * non-functioning kernel * try alternative fence * cleanup barrier * get rid of event_fence * update benchmarks * doc string in metal fence	2025-01-17 18:42:19 -08:00
Awni Hannun	9d7fa6b8e6	Use osx deployment target to pick Metal version (#1595 ) * choose metal based on deployment target rather than system version * nit * unused compile def	2024-11-18 19:16:49 -08:00
Awni Hannun	610af352d4	Dispatch bf16 at run time when using the JIT (#1584 ) * Dispatch bf16 at run time when using the JIT * fix extension * fix extension build * fix extension build * Update utils.h	2024-11-15 16:54:36 -08:00
Awni Hannun	4f72c66911	improvements to scatter / gather (#1541 )	2024-10-30 19:30:54 -07:00
Awni Hannun	0eb56d5be0	Wired (#1510 ) * expose residency sets as wire/unwire * returns wired size * fix * runtime support check * fix os check * fix test * fix no metal build * docs * nit * nits in docs * nits	2024-10-25 09:35:33 -07:00
Nripesh Niketan	669c27140d	Chore: add pre-commit hook for cmake (#1362 ) * reset and lint * format --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-09-16 12:53:01 -07:00
Alex Barron	28be4de7c2	Fix JIT reductions (#1373 )	2024-08-28 16:39:11 -07:00
Awni Hannun	5f7d19d1f5	MPI ops in GPU stream for faster comms (#1356 )	2024-08-26 15:12:50 -07:00
Alex Barron	0fd2a1f4b0	Custom Metal Kernels from Python (#1325 ) * start * simple kernels working * restructure * inverse example working * docs + fixes * missing file * fix imports * address comments * add docs + fix test * Review comments + refactor to a single function * update docs * remove hashing * fix contig bug in test * back to a class * trailing whitespace * fix tests * match c++ and python apis * add link + make args kw_only	2024-08-22 13:46:29 -07:00
Awni Hannun	30bbea2f08	Add gemv masked to JIT plus some fixes (#1310 ) * add gemv masked to JIT plus some fixes * some cleanup * add utils * fix * fix 2 * more cleaning * fix * remove unused mps matmul support * one more nit * revert	2024-08-07 13:38:07 -07:00
Alex Barron	a3c287354f	Fast Hadamard Transform (#1249 ) * Working hadamard for powers of 2 * working for m2^k add scale and check contiguity * add size check * clean up * fix test * add grads + vmap * gpu only * skip on linux * test typo * add cpu impl * remove gpu only tests * fix linux build + add is_equivalent	2024-07-09 20:39:01 -07:00
Awni Hannun	56c8a33439	Get metal version from xcode (#1228 ) * get metal version from xcode * typo * fix	2024-06-26 07:02:11 -07:00
Alex Barron	dd7d8e5e29	Add Quantized Ops to the JIT (#1204 ) * JIT for quantized ops * remove unused imports * address comments * fix imports * second attempt to fix imports --------- Co-authored-by: Alex Barron <abarron22@apple.com>	2024-06-12 09:47:12 -07:00
Alex Barron	27d70c7d9d	Feature complete Metal FFT (#1102 ) * feature complete metal fft * fix contiguity bug * jit fft * simplify rader/bluestein constant computation * remove kernel/utils.h dep * remove bf16.h dep * format --------- Co-authored-by: Alex Barron <abarron22@apple.com>	2024-06-06 12:57:25 -07:00
Alex Barron	375a8bbdcc	Add some internal GPU apis (#1177 ) * Add unary/binary/ternay/slice/concat internal GPU ops * add pad internal op * formatting + no_cpu fix	2024-06-04 09:24:26 -07:00
Awni Hannun	7e26fd8032	Option to JIT steel gemm / conv (#1139 )	2024-05-23 18:07:34 -07:00
Awni Hannun	0189ab6ab6	More jitting (#1132 ) * docs + circle min size build * jit scan, arange, softmax * add sort * jit reductions * remove print * fix deps * clean includes / nits	2024-05-23 16:23:44 -07:00
Awni Hannun	226748b3e7	JIT compile option for binary minimization (#1091 ) * try cpp 20 for compile * unary, binary, ternary in jit * nits * fix gather/scatter * fix rebase * reorg compile * add ternary to compile * jit copy * jit compile flag * fix build * use linked function for ternary * some nits * docs + circle min size build * docs + circle min size build * fix extension * fix no cpu build * improve includes	2024-05-22 12:57:13 -07:00
Awni Hannun	1873ffda01	Detect metal version and propagate correctly for JIT (#1109 ) * detect metal version and propagate correctly for JIT * remove softmax * fix versions	2024-05-15 17:42:09 -07:00
Awni Hannun	8a0677d56d	Shared events for synchronization + async eval (#998 ) * more async eval * fix rebase * try correct async eval * fix async * more tests for async eval * use shared events for synchronization * comment + cleanup * with autorelease pool * fix no metal build * fix compile * fix patch * don't eval if asyn evale'd * don't use is_evaled * comments * more multi stream tests * try and cleanup use of is_evaled * use a status flag	2024-04-17 06:16:02 -07:00
Angelos Katharopoulos	2225374060	Adds mx.fast.layer_norm (#870 )	2024-03-21 13:55:51 -07:00
Awni Hannun	a54f06b16f	Fast RMS Norm (#862 ) * fast rmsnorm * no rms gpu * kernel * fix shared mem * looped rms and donation in softmax * Make the squaring in float32 to avoid underflow * Fix the default StreamOrDevice for rope and rms_norm in fast * nits --------- Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>	2024-03-21 07:20:54 -07:00
Brian Keene	0787724c44	Fast Inference SDPA op (#735 ) * Fast Inference SDPA op Implements metal shaders for: o = mx.fast_inference_sdpa(queries, keys, values, scale, mask) Supports fp16, fp32 dtypes; assumes d_k = 128. Generic op support / prompt encoding supported via mlx primitives. Metal implementation is for the inference use case only. Majority of performance benefits appears to results from GQA & reduced bandwidth requirements; there is approximate performance parity for the MHA use case (from some measurements on M3 Max). * Flush shared memory to zero before unprotected reads for (scores @ values) * Move to fast:: namespace, address reviewer comments ... also attempt to revert formatter auto-change for files not relevant to this change * Shared memory flush to top of kernel * Resolve compiler warnings * Update python/src/fast.cpp Co-authored-by: Awni Hannun <awni.hannun@gmail.com> * Update python/src/fast.cpp Co-authored-by: Awni Hannun <awni.hannun@gmail.com> * Update python/src/fast.cpp Co-authored-by: Awni Hannun <awni.hannun@gmail.com> * Update python/src/fast.cpp Co-authored-by: Awni Hannun <awni.hannun@gmail.com> * Update docstring per PR feedback * Softmax in higher precision, ... * route to fallback for more use cases - batch size > 1, head_dim other than 128, etc. * Address linux build failure * Address other reviewer comments * Remove extraneous eval_cpu function per review --------- Co-authored-by: Atila Orhon <64497909+atiorh@users.noreply.github.com> Co-authored-by: Awni Hannun <awni.hannun@gmail.com> Co-authored-by: atila <atiorh@icloud.com>	2024-03-04 21:06:11 -08:00
Awni Hannun	ac02cf33bd	Fix some issues using MLX in C++ (#739 ) * fix preamble build * fix some issues with using MLX as a dep in C++	2024-02-24 22:20:57 -08:00
Awni Hannun	ccf1645995	Custom primitive + RoPE fat op (#676 ) * extensions start * rope custom op * fix build * docs + rope benchmark * fix test * Add a Metal kernel for RoPE * Fix position of traditional * transform tests * Move rope computation to float and fix tests * Fix the test and a typo * change to fast * fix no metal build --------- Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>	2024-02-14 14:04:25 -08:00
Angelos Katharopoulos	28eac18571	Kernel generation (#614 ) Generate reusable element-wise kernels given a computation graph.	2024-02-07 13:15:59 -08:00
Awni Hannun	d75ae52ecd	Compile primitive (#571 ) * Compiled primitive with basic binary, unary graph-level fusion	2024-02-05 06:51:22 -08:00
Angelos Katharopoulos	dfa9f4bc58	An initial quantized matmul implementation (#205 ) * Add quantized matvec * Add quantized matrix matrix with 2nd matrix transposed * Add quantized matmul tests * Add a slow cpu quantized matmul * Add a slightly faster vectorized cpu version	2023-12-18 23:18:57 -08:00
Awni Hannun	8ca7f9e8e9	awni's commit files	2023-11-29 10:30:41 -08:00

42 Commits