Awni Hannun 
							
						 
					 
					
						
						
							
						
						63f663d9c6 
					 
					
						
						
							
							fix cuda manylinux version to match others ( #2388 )  
						
						
						
						
							
						
					 
					
						2025-07-18 21:02:16 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						84b4d96efa 
					 
					
						
						
							
							fix release build + patch bump ( #2387 )  
						
						
						
						
							
 
						
					 
					
						2025-07-18 14:47:37 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						aec67f2fa6 
					 
					
						
						
							
							patch bump ( #2386 )  
						
						
						
						
							
						
					 
					
						2025-07-18 12:25:48 -07:00 
						 
				 
			
				
					
						
							
							
								Gökdeniz Gülmez 
							
						 
					 
					
						
						
							
						
						deee214a95 
					 
					
						
						
							
							Adding support for the Muon Optimizer ( #1914 )  
						
						
						
						
						* initial commit with working optimizer
* update ACKNOWLEDGMENTS.md
* nits and adding it to test
* nits
* G.astype(mx.bfloat16) to G.astype(G.dtype)
* G.ndim >= 2 to assert G.ndim == 2
* remove comments
* replace with mx.addmm
* remove comments
* format
* nits
* match muon
* fix addmm
---------
Co-authored-by: Awni Hannun <awni@apple.com>
						
						
							
						
					 
					
						2025-07-18 12:25:28 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						45adec102c 
					 
					
						
						
							
							Add contiguous_copy_gpu util for copying array ( #2379 )  
						
						
						
						
							
						
					 
					
						2025-07-18 06:44:25 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						31fc530c76 
					 
					
						
						
							
							[CUDA] Add more ways finding CCCL headers in JIT ( #2382 )  
						
						
						
						
							
						
					 
					
						2025-07-17 15:25:34 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						fbb3f65a1a 
					 
					
						
						
							
							fix resource leaks in matmul and graph ( #2383 )  
						
						
						
						
							
						
					 
					
						2025-07-17 06:50:15 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						6b1b8ea91b 
					 
					
						
						
							
							[CUDA] Add work per thread to compile ( #2368 )  
						
						
						
						
							
						
					 
					
						2025-07-17 06:47:52 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						b2273733ea 
					 
					
						
						
							
							Test with CUDA 12.2 ( #2375 )  
						
						
						
						
						* Test with CUDA 12.0
* try older image
* fix cpu sort 
						
						
							
						
					 
					
						2025-07-16 13:00:37 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						f409b229a4 
					 
					
						
						
							
							fix ring distributed test ( #2380 )  
						
						
						
						
							
						
					 
					
						2025-07-16 11:25:24 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						30571e2326 
					 
					
						
						
							
							Rename the copy util in cpu/copy.h to copy_cpu ( #2378 )  
						
						
						
						
							
						
					 
					
						2025-07-16 07:34:24 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						d7734edd9f 
					 
					
						
						
							
							fix complex reduce + nan propagation in min and max ( #2377 )  
						
						
						
						
							
						
					 
					
						2025-07-15 18:19:47 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						2ba69bc8fa 
					 
					
						
						
							
							lower memory uniform sampling ( #2361 )  
						
						
						
						
						* lower memory uniform
* use fp32
* fix 
						
						
							
						
					 
					
						2025-07-15 14:22:07 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						cb349a291c 
					 
					
						
						
							
							[CUDA] Use cuda::std::complex in place of cuComplex ( #2372 )  
						
						
						
						
							
						
					 
					
						2025-07-15 00:36:13 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						f0a0b077a0 
					 
					
						
						
							
							Install linux with mlx[cuda] and mlx[cpu] ( #2356 )  
						
						
						
						
						* install linux with mlx[cuda] and mlx[cpu]
* temp for testing
* cleanup circle, fix cuda repair
* update circle
* update circle
* decouple python bindings from core libraries 
						
						
							
						
					 
					
						2025-07-14 17:17:33 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						49114f28ab 
					 
					
						
						
							
							fix flaky test ( #2371 )  
						
						
						
						
							
						
					 
					
						2025-07-14 17:16:18 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						e7d2ebadd2 
					 
					
						
						
							
							[CUDA] Affine quantize ( #2354 )  
						
						
						
						
						* affine quantize and dequantize kernels
* format
* fix
* format 
						
						
							
						
					 
					
						2025-07-14 15:45:44 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						e569803d7c 
					 
					
						
						
							
							update linux build ( #2370 )  
						
						
						
						
							
						
					 
					
						2025-07-14 15:13:56 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						d34f887abc 
					 
					
						
						
							
							Add Primitive::name and remove Primitive::print ( #2365 )  
						
						
						
						
							
						
					 
					
						2025-07-14 14:06:35 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						5201df5030 
					 
					
						
						
							
							Fix imag() vjp ( #2367 )  
						
						
						
						
							
						
					 
					
						2025-07-14 13:11:16 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						2d3c26c565 
					 
					
						
						
							
							[CUDA] Do not put kernels in anonymous namespace ( #2362 )  
						
						
						
						
							
						
					 
					
						2025-07-12 14:24:45 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						6325f60d52 
					 
					
						
						
							
							[CUDA] Bundle CCCL for JIT compilation ( #2357 )  
						
						
						
						
						* Ship CCCL for JIT compilation
* Remove cexpf 
						
						
							
						
					 
					
						2025-07-11 18:45:37 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						42cc9cfbc7 
					 
					
						
						
							
							fix copy dispatch ( #2360 )  
						
						
						
						
							
						
					 
					
						2025-07-11 10:59:35 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						8347575ba1 
					 
					
						
						
							
							[CUDA] Implement Scan kernel ( #2347 )  
						
						
						
						
						* Contiguous scan
* Strided scan
* Enable tests
* Fix failing logaddexp test
* Use cexpf in Metal 
						
						
							
						
					 
					
						2025-07-10 16:54:12 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						b6eec20260 
					 
					
						
						
							
							Fix edge check in qmm_n QuantizedLoader ( #2355 )  
						
						
						
						
							
						
					 
					
						2025-07-10 16:28:50 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						0eb035b4b1 
					 
					
						
						
							
							Fix type promotion in Adam with bias correction ( #2350 )  
						
						
						
						
							
						
					 
					
						2025-07-10 11:14:42 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						afb9817599 
					 
					
						
						
							
							[CUDA] Put version in ptx cache dir path ( #2352 )  
						
						
						
						
							
						
					 
					
						2025-07-10 07:24:21 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						8fb3e7a26c 
					 
					
						
						
							
							[CUDA] Set current device before cudaGraphLaunch ( #2351 )  
						
						
						
						
							
						
					 
					
						2025-07-10 07:24:02 -07:00 
						 
				 
			
				
					
						
							
							
								jhavukainen 
							
						 
					 
					
						
						
							
						
						8c7bc30ce4 
					 
					
						
						
							
							Align mlx::core::min op nan propagation with NumPy ( #2346 )  
						
						
						
						
							
						
					 
					
						2025-07-10 06:20:43 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						85873cb162 
					 
					
						
						
							
							[CUDA] Do vectorized store/load in contiguous elementwise ops ( #2342 )  
						
						
						
						
						* Do vectorized store/load in unary ops
* Do vectorized store/load in binary_two ops
* Do vectorized store/load in copy ops
* Do vectorized store/load in ternary ops
* Use int32_t for IdxT
* binary => binary_two in binary_two.cu
* Fix tests on large arrays
* Use uint as index type
* Contig uses uint as index and non-contig uses int 
						
						
							
						
					 
					
						2025-07-09 18:48:43 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						e14ee12491 
					 
					
						
						
							
							add zero for argsort vjp ( #2345 )  
						
						
						
						
							
						
					 
					
						2025-07-09 14:37:14 -07:00 
						 
				 
			
				
					
						
							
							
								jhavukainen 
							
						 
					 
					
						
						
							
						
						8b9a3f3cea 
					 
					
						
						
							
							Align mlx::core::max op nan propagation with NumPy ( #2339 )  
						
						
						
						
						* Make max op NaN propagation rules align with numpy
* Adding benchmarks and testing for max op nanpropagation
* Pre-commit formatting
* Fix max complex64 nan propagation and add test
* Improve the cpp unittest
* Only check nans on non-integral types in simd_reduce_impl.
* Cleanup using namespace alias
* Add cpu Max nanpropagation. Fix a small fib in cpu max dispatch data types for int8/int16.
* Make the max nanpropagation test more meaningful for integer types
* Remove tuple unpacking syntax to comply with earlier python versions. Add cuda skip to nanpropagation tests, fix cuda implementation in a separate PR. 
						
						
							
						
					 
					
						2025-07-09 11:26:27 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						fb4e8b896b 
					 
					
						
						
							
							patch bump ( #2343 )  
						
						
						
						
							
 
						
					 
					
						2025-07-08 14:26:07 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						2ca533b279 
					 
					
						
						
							
							Fix compilation with CUDA 11 ( #2331 )  
						
						
						
						
							
						
					 
					
						2025-07-07 20:00:43 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						4a9b29a875 
					 
					
						
						
							
							MoE backward improvements ( #2335 )  
						
						
						
						
							
						
					 
					
						2025-07-07 17:59:53 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						a4fcc893cd 
					 
					
						
						
							
							auto build linux release ( #2341 )  
						
						
						
						
							
						
					 
					
						2025-07-07 09:29:23 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						9d10239af7 
					 
					
						
						
							
							[CUDA] Do vectorized store/load in binary ops ( #2330 )  
						
						
						
						
							
						
					 
					
						2025-07-07 08:44:14 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						19facd4b20 
					 
					
						
						
							
							Build with all cpu cores by default ( #2336 )  
						
						
						
						
							
						
					 
					
						2025-07-07 06:06:45 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						f5299f72cd 
					 
					
						
						
							
							Fix layernorm race condition ( #2340 )  
						
						
						
						
							
						
					 
					
						2025-07-07 06:06:01 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						0e0d9ac522 
					 
					
						
						
							
							[CUDA] Add MLX_CUDA_GRAPH_CACHE_SIZE env for setting graph cache size ( #2329 )  
						
						
						
						
							
						
					 
					
						2025-07-05 08:33:29 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						8917022deb 
					 
					
						
						
							
							fix graphs for older cuda ( #2328 )  
						
						
						
						
							
						
					 
					
						2025-07-02 19:37:58 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						ec0d5db67b 
					 
					
						
						
							
							[CUDA] Switch to CUDA graphs ( #2317 )  
						
						
						
						
						* cuda graph prototype
fix signal bug + start to add dependencies
capture more
capture more ops
remaining ops
fix reduce and rope deps
add concurrent context
try update, but not working
consistent topology order
use node api
use node api directly to reduce overhead
fix bug
use kernels in unary
cache graph
format
fix synchronization
format
* comment 
						
						
							
						
					 
					
						2025-07-02 15:59:13 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						e76e9b87f0 
					 
					
						
						
							
							Fix compilation error from integral_constant ( #2326 )  
						
						
						
						
							
						
					 
					
						2025-07-02 06:04:38 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						cfb6a244ea 
					 
					
						
						
							
							allow parameters to be deleted ( #2325 )  
						
						
						
						
							
						
					 
					
						2025-07-01 21:27:23 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						58f3860306 
					 
					
						
						
							
							patch bump ( #2324 )  
						
						
						
						
							
 
						
					 
					
						2025-07-01 12:12:16 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						dd4f53db63 
					 
					
						
						
							
							use fp32 for testing, add more complex ops ( #2322 )  
						
						
						
						
							
						
					 
					
						2025-07-01 07:30:00 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						3d5e17e507 
					 
					
						
						
							
							MLX_SWITCH macros to templates ( #2320 )  
						
						
						
						
							
						
					 
					
						2025-07-01 01:33:44 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						33bf1a244b 
					 
					
						
						
							
							Fix module update in strict mode ( #2321 )  
						
						
						
						
						* fix module update in strict mode
* allow GELU to be pickled 
						
						
							
						
					 
					
						2025-06-29 11:12:29 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						772f471ff2 
					 
					
						
						
							
							[CUDA] Fix reductions ( #2314 )  
						
						
						
						
							
						
					 
					
						2025-06-27 12:59:20 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						2c11d10f8d 
					 
					
						
						
							
							Split broadcast so it is always fused in compile ( #2318 )  
						
						
						
						
							
						
					 
					
						2025-06-26 22:08:18 -07:00