zhangyiss/mlx - mlx - Gitea for Geophysics

Mirror of https://github.com/ml-explore/mlx.git, synced 2025-12-16 01:49:05 +08:00

Author	SHA1	Message	Date
Awni Hannun	b61a65e313	fix copies in sdpa (#2563 )	2025-09-02 11:00:36 -07:00
Awni Hannun	8ce49cd39e	fix quantized vjp for mxfp4 (#2555 )	2025-08-29 10:06:15 -07:00
Awni Hannun	70560b6bd5	Add mode parameter for quantization (#2499 ) * add mode parameter for quantization * mxfp4 quantize/dequantize + start of optional biases * mxfp4 works * speedup * cpu mxfp4 * fix * fix test tol * fix * refactor * add quant mode enum	2025-08-28 06:45:26 -07:00
Awni Hannun	7ef8a6f2d5	[CUDA] fix sort (#2550 ) * [CUDA] fix sort * fix test	2025-08-27 19:48:43 -07:00
Awni Hannun	5458d43247	add load with path tests (#2543 )	2025-08-26 14:24:47 -07:00
Awni Hannun	3dcb286baf	Remove stream from average grads so it uses default (#2532 ) * Remove stream from average grads so it uses default * comment	2025-08-25 15:56:29 -07:00
Cheng	4822c3dbe9	[CUDA] Implement DynamicSlice/DynamicSliceUpdate (#2533 ) * Move DynamicSlice to gpu/primitives * Implement compute_dynamic_offset in CUDA	2025-08-26 07:31:39 +09:00
Anastasiia Filippova	9392fc3f88	NCCL backend (#2476 )	2025-08-21 11:56:15 -07:00
Awni Hannun	e843c4d8d5	fix power (#2523 )	2025-08-21 06:46:01 -07:00
Angelos Katharopoulos	e397177f6e	Custom cuda kernel (#2517 )	2025-08-20 17:20:22 -07:00
Cheng	f4c8888cbe	[CUDA] Fix stride of singleton dims before passing to cuDNN (#2521 )	2025-08-21 08:55:26 +09:00
Angelos Katharopoulos	25c1e03205	Fix overflow in large filter small channels (#2520 )	2025-08-20 08:03:29 -07:00
Cheng	ac85ddfdb7	[CUDA] Add GEMM-based fallback convolution kernels (#2511 ) * Add gemm_conv * Add gemm_grouped_conv	2025-08-20 10:06:22 +09:00
Awni Hannun	e7c6e1db82	no segfault with uninitialized array.at (#2514 )	2025-08-18 08:33:38 -07:00
Awni Hannun	c5fcd5b61b	fix custom kernel test (#2510 )	2025-08-18 06:45:59 -07:00
Cheng	1ba18ff7d9	[CUDA] Fix conv grads with groups (#2495 ) * Put reshape utils in one file * [CUDA] Fix conv grads with groups * Put the reshape utils in gpu/copy.h	2025-08-16 10:09:18 +09:00
Luca Vivona	728d4db582	Support destination arg in tree flatten/unflatten (#2450 )	2025-08-06 15:34:59 -07:00
Awni Hannun	fa89f0b150	faster gather qmm sorted test (#2463 )	2025-08-05 06:27:40 -07:00
Awni Hannun	0b807893a7	fix wraps compile (#2461 )	2025-08-04 16:14:18 -07:00
Cheng	86c6a15571	[CUDA] Backward convolution (#2431 )	2025-08-01 09:54:05 +09:00
junpeiz	8b25ce62d5	Add tests for export including control flow models and quantized models (#2430 ) * Add tests for export, including control flow export and quantized model export. * Skip quantization related test for CUDA backend.	2025-07-31 11:06:26 -07:00
Awni Hannun	d32519c8ee	fix gemv regression (#2445 )	2025-07-30 14:23:01 -07:00
Awni Hannun	b405591249	fix circular reference (#2443 )	2025-07-30 09:37:44 -07:00
Awni Hannun	ef631d63af	faster rms norm (#2433 )	2025-07-29 13:12:00 -07:00
Awni Hannun	5597fa089c	Fix qvm splitk (#2415 )	2025-07-25 11:50:24 -07:00
Cheng	6f5874a2f2	[CUDA] Initial implementation of Convolution with cuDNN (#2385 ) * Link with cuDNN * Initial implementation * Remove backend apis * Fix recording cudnn conv * More unused backend apis * Fix C++ conv tests * include cudnn as python dep * Install libcudnn9-dev-cuda-12 in CI * cudnn only accepts contiguous inputs * Switch to backend apis * Plan needs to be kept alive * Turn off tf32 * Add cache * Test the native cuda graph api * Set cudnn stream before execution * Make LRUCache more like a normal container * Do error check for cublas handle * Zero-initilizing array * Use tf32 for conv * Skip TestConv.test_torch_conv_2D test --------- Co-authored-by: Awni Hannun <awni@apple.com>	2025-07-25 08:12:10 +09:00
Awni Hannun	e1840853ce	full row mask in sdpa consistently gives nan (#2406 )	2025-07-23 16:37:03 -07:00
Gökdeniz Gülmez	deee214a95	Adding support for the Muon Optimizer (#1914 ) * initial commit with workong optmimizer * update ACKNOWLEDGMENTS.md * nits and adding it to test * nits * G.astype(mx.bfloat16) to G.astype(G.dtype) * G.ndim >= 2 to assert G.ndim == 2 * remove coments * replace with mx.addmm * remove comments * format * nits * match muon * fix addmm --------- Co-authored-by: Awni Hannun <awni@apple.com>	2025-07-18 12:25:28 -07:00
Awni Hannun	f409b229a4	fix ring distributed test (#2380 )	2025-07-16 11:25:24 -07:00
Awni Hannun	d7734edd9f	fix complex reduce + nan propagation in min and max (#2377 )	2025-07-15 18:19:47 -07:00
Awni Hannun	49114f28ab	fix flaky test (#2371 )	2025-07-14 17:16:18 -07:00
Awni Hannun	e7d2ebadd2	[CUDA] Affine quantize (#2354 ) * affine quantize and dequantize kernels * format * fix * format	2025-07-14 15:45:44 -07:00
Angelos Katharopoulos	5201df5030	Fix imag() vjp (#2367 )	2025-07-14 13:11:16 -07:00
Cheng	8347575ba1	[CUDA] Implement Scan kernel (#2347 ) * Contiguous scan * Strided scan * Enable tests * Fix failing logaddexp test * Use cexpf in Metal	2025-07-10 16:54:12 -07:00
Angelos Katharopoulos	0eb035b4b1	Fix type promotion in Adam with bias correction (#2350 )	2025-07-10 11:14:42 -07:00
jhavukainen	8c7bc30ce4	Align mlx::core::min op nan propagation with NumPy (#2346 )	2025-07-10 06:20:43 -07:00
jhavukainen	8b9a3f3cea	Align mlx::core::max op nan propagation with NumPy (#2339 ) * Make max op NaN propagation rules align with numpy * Adding benchmarks and testing for max op nanpropagation * Pre-commit formatting * Fix max complex64 nan propagation and add test * Improve the cpp unittest * Only check nans on non-integral types in simd_reduce_impl. * Cleanup using namespace alias * Add cpu Max nanpropagation. Fix a small fib in cpu max dispatch data types for int8/int16. * Make the max nanpropagation test more meaningful for integer types * Remove tuple unpacking syntax to comply with earlier python versions. Add cuda skip to nanpropagation tests, fix cuda implementation in a separate PR.	2025-07-09 11:26:27 -07:00
Angelos Katharopoulos	4a9b29a875	MoE backward improvements (#2335 )	2025-07-07 17:59:53 -07:00
Awni Hannun	ec0d5db67b	[CUDA] Switch to CUDA graphs (#2317 ) * cuda graph prototype fix signal bug + start to add dependencies capture more capture more ops remaining ops fix reduce and rope deps add concurrent context try update, but not working cosistent topology order use node api use node api directly to reduce overhead fix bug use kernels in unary cache graph format fix synchronization format * comment	2025-07-02 15:59:13 -07:00
Awni Hannun	cfb6a244ea	allow parameters to be deleted (#2325 )	2025-07-01 21:27:23 -07:00
Awni Hannun	dd4f53db63	use fp32 for testing, add more complex ops (#2322 )	2025-07-01 07:30:00 -07:00
Awni Hannun	33bf1a244b	Fix module update in strict mode (#2321 ) * fix module update in strict mode * allow GELU to be pickled	2025-06-29 11:12:29 -07:00
Angelos Katharopoulos	772f471ff2	[CUDA] Fix reductions (#2314 )	2025-06-27 12:59:20 -07:00
Angelos Katharopoulos	2c11d10f8d	Split broadcast so it is always fused in compile (#2318 )	2025-06-26 22:08:18 -07:00
Awni Hannun	81bb9a2a9e	Compile float64 functions on CPU (#2311 )	2025-06-24 10:18:52 -07:00
Angelos Katharopoulos	5adf185f86	Fix `update_modules()` when providing a subset (#2308 )	2025-06-20 17:19:46 -07:00
Awni Hannun	cad5c0241c	[CUDA] synch properly waits for all tasks to finish and clear (#2303 ) * cuda synch properly waits for all tasks to finish and clear * fix copy	2025-06-17 12:03:25 -07:00
Awni Hannun	b8022c578a	divmod, partition, sort fixes (#2302 )	2025-06-16 18:49:32 -07:00
Awni Hannun	bc53f8293f	Cuda bug fixes 2 (#2298 ) * more bug fixes * more bug fixes * format	2025-06-16 13:14:46 -07:00
Awni Hannun	c552ff2451	[CUDA] Fix back-end bugs and enable corresponding tests (#2296 ) * Fix some cuda back-end bugs and enable corresponding tests * more fixes * enable more tests * format	2025-06-16 08:45:40 -07:00

549 Commits