Awni Hannun 
							
						 
					 
					
						
						
							
						
						3adba92ebe 
					 
					
						
						
							
							Cuda faster softmax ( #2435 )  
						
						
						
						
						* faster softmax and logsumexp
* format 
						
						
							
						
					 
					
						2025-07-29 17:18:12 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						ef631d63af 
					 
					
						
						
							
							faster rms norm ( #2433 )  
						
						
						
						
							
						
					 
					
						2025-07-29 13:12:00 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						970dbe8e25 
					 
					
						
						
							
							Use ccache in CI ( #2414 )  
						
						
						
						
						* Detect ccache
* Use ccache in CI
* Separate cache for different images
* Test both 12.2 and 12.9 for PRs 
						
						
							
						
					 
					
						2025-07-29 08:43:22 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						641be9463b 
					 
					
						
						
							
							Add more CUDA architectures for PyPI package ( #2427 )  
						
						
						
						
						* add cuda sm 90
* add more archs 
						
						
							
						
					 
					
						2025-07-28 12:35:15 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						ab0e608862 
					 
					
						
						
							
							[CUDA] More sizes for gemv ( #2429 )  
						
						
						
						
						* route more to gemv
* route more sizes to custom gemv 
						
						
							
						
					 
					
						2025-07-28 12:35:01 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						1588659062 
					 
					
						
						
							
							no occupancy query for launch params ( #2426 )  
						
						
						
						
							
						
					 
					
						2025-07-28 09:09:41 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						b9e88fb976 
					 
					
						
						
							
							[CUDA] Fix segfault on exit ( #2424 )  
						
						
						
						
						* fix cuda segfault on exit
* comment 
						
						
							
						
					 
					
						2025-07-27 08:08:13 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						4ad53414dd 
					 
					
						
						
							
							fix cuda pypi package ( #2423 )  
						
						
						
						
						* fix cuda pypi package
* patch bump 
						
						
							
 
						
					 
					
						2025-07-25 15:20:29 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						d1165b215e 
					 
					
						
						
							
							version ( #2420 )  
						
						
						
						
							
						
					 
					
						2025-07-25 13:29:28 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						dcb8319f3d 
					 
					
						
						
							
							update install docs and requirements ( #2419 )  
						
						
						
						
							
						
					 
					
						2025-07-25 12:13:19 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						5597fa089c 
					 
					
						
						
							
							Fix qvm splitk ( #2415 )  
						
						
						
						
							
						
					 
					
						2025-07-25 11:50:24 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						9acec364c2 
					 
					
						
						
							
							[CUDA] Always use batched matmul ( #2404 )  
						
						
						
						
						* cuda batched mm
* addmm as well
* comment 
						
						
							
						
					 
					
						2025-07-24 20:46:02 -07:00 
						 
				 
			
				
					
						
							
							
								Skonor 
							
						 
					 
					
						
						
							
						
						7d9d6ef456 
					 
					
						
						
							
							docs: fix adam and adamw eps placement ( #2416 )  
						
						
						
						
						Co-authored-by: Mikhail Gorbunov <m_gorbunov@apple.com>
						
						
							
						
					 
					
						2025-07-24 16:40:45 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						6f5874a2f2 
					 
					
						
						
							
							[CUDA] Initial implementation of Convolution with cuDNN ( #2385 )  
						
						
						
						
						* Link with cuDNN
* Initial implementation
* Remove backend apis
* Fix recording cudnn conv
* More unused backend apis
* Fix C++ conv tests
* include cudnn as python dep
* Install libcudnn9-dev-cuda-12 in CI
* cudnn only accepts contiguous inputs
* Switch to backend apis
* Plan needs to be kept alive
* Turn off tf32
* Add cache
* Test the native cuda graph api
* Set cudnn stream before execution
* Make LRUCache more like a normal container
* Do error check for cublas handle
* Zero-initializing array
* Use tf32 for conv
* Skip TestConv.test_torch_conv_2D test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
						
						
							
						
					 
					
						2025-07-25 08:12:10 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						70dc336785 
					 
					
						
						
							
							Test on cuda 12.2 and 12.9 ( #2413 )  
						
						
						
						
							
						
					 
					
						2025-07-24 06:06:15 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						4e504039f5 
					 
					
						
						
							
							[Metal] Release metal events ( #2412 )  
						
						
						
						
						* release metal events
* fix
* fix 
						
						
							
						
					 
					
						2025-07-23 19:53:42 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						d1f4d291e8 
					 
					
						
						
							
							Fix uv install and add dev release ( #2411 )  
						
						
						
						
						* fix uv install and add dev release
* fix docstring
* pin cuda deps
* cuda release on cpu-only machine 
						
						
							
						
					 
					
						2025-07-23 16:54:19 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						e1840853ce 
					 
					
						
						
							
							full row mask in sdpa consistently gives nan ( #2406 )  
						
						
						
						
							
						
					 
					
						2025-07-23 16:37:03 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						0f5ce173da 
					 
					
						
						
							
							[CUDA] --compress-mode requires CUDA 12.8 ( #2407 )  
						
						
						
						
							
						
					 
					
						2025-07-23 06:11:11 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						588854195f 
					 
					
						
						
							
							Remove unused code in Convolution::vjp ( #2408 )  
						
						
						
						
							
						
					 
					
						2025-07-23 06:11:00 -07:00 
						 
				 
			
				
					
						
							
							
								Fangjun Kuang 
							
						 
					 
					
						
						
							
						
						28d068bce6 
					 
					
						
						
							
							Fix an error in the comment for mx.dequantize ( #2409 )  
						
						
						
						
							
						
					 
					
						2025-07-23 06:10:50 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						d107d8d495 
					 
					
						
						
							
							add cuda gemv ( #2400 )  
						
						
						
						
							
						
					 
					
						2025-07-22 08:24:13 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						1e496ddb82 
					 
					
						
						
							
							[CUDA] Simplify allocator ( #2392 )  
						
						
						
						
						* simplify allocator and fix race with small pool
* Don't use shared event in worker
* use cuda buffer in small pool
* comment
* comment 
						
						
							
						
					 
					
						2025-07-22 08:24:01 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						74eccbf3fa 
					 
					
						
						
							
							use size option in binary ( #2399 )  
						
						
						
						
							
						
					 
					
						2025-07-22 07:00:53 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						08638223ca 
					 
					
						
						
							
							Fix including stubs in wheel ( #2398 )  
						
						
						
						
						* fix including stubs in wheel
* fix bool_ 
						
						
							
						
					 
					
						2025-07-22 06:30:17 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						56cc858af9 
					 
					
						
						
							
							Add contiguous_copy_cpu util for copying array ( #2397 )  
						
						
						
						
							
						
					 
					
						2025-07-21 07:30:35 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						f55c4ed1d6 
					 
					
						
						
							
							Remove thrust iterators ( #2396 )  
						
						
						
						
							
						
					 
					
						2025-07-21 07:30:27 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						93d70419e7 
					 
					
						
						
							
							[CUDA] speedup handling scalars ( #2389 )  
						
						
						
						
						* speedup scalars in cuda
* comment 
						
						
							
						
					 
					
						2025-07-18 21:47:31 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						63f663d9c6 
					 
					
						
						
							
							fix cuda manylinux version to match others ( #2388 )  
						
						
						
						
							
						
					 
					
						2025-07-18 21:02:16 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						84b4d96efa 
					 
					
						
						
							
							fix release build + patch bump ( #2387 )  
						
						
						
						
							
 
						
					 
					
						2025-07-18 14:47:37 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						aec67f2fa6 
					 
					
						
						
							
							patch bump ( #2386 )  
						
						
						
						
							
						
					 
					
						2025-07-18 12:25:48 -07:00 
						 
				 
			
				
					
						
							
							
								Gökdeniz Gülmez 
							
						 
					 
					
						
						
							
						
						deee214a95 
					 
					
						
						
							
							Adding support for the Muon Optimizer ( #1914 )  
						
						
						
						
						* initial commit with working optimizer
* update ACKNOWLEDGMENTS.md
* nits and adding it to test
* nits
* G.astype(mx.bfloat16) to G.astype(G.dtype)
* G.ndim >= 2 to assert G.ndim == 2
* remove comments
* replace with mx.addmm
* remove comments
* format
* nits
* match muon
* fix addmm
---------
Co-authored-by: Awni Hannun <awni@apple.com>
						
						
							
						
					 
					
						2025-07-18 12:25:28 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						45adec102c 
					 
					
						
						
							
							Add contiguous_copy_gpu util for copying array ( #2379 )  
						
						
						
						
							
						
					 
					
						2025-07-18 06:44:25 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						31fc530c76 
					 
					
						
						
							
							[CUDA] Add more ways finding CCCL headers in JIT ( #2382 )  
						
						
						
						
							
						
					 
					
						2025-07-17 15:25:34 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						fbb3f65a1a 
					 
					
						
						
							
							fix resource leaks in matmul and graph ( #2383 )  
						
						
						
						
							
						
					 
					
						2025-07-17 06:50:15 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						6b1b8ea91b 
					 
					
						
						
							
							[CUDA] Add work per thread to compile ( #2368 )  
						
						
						
						
							
						
					 
					
						2025-07-17 06:47:52 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						b2273733ea 
					 
					
						
						
							
							Test with CUDA 12.2 ( #2375 )  
						
						
						
						
						* Test with CUDA 12.0
* try older image
* fix cpu sort 
						
						
							
						
					 
					
						2025-07-16 13:00:37 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						f409b229a4 
					 
					
						
						
							
							fix ring distributed test ( #2380 )  
						
						
						
						
							
						
					 
					
						2025-07-16 11:25:24 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						30571e2326 
					 
					
						
						
							
							Rename the copy util in cpu/copy.h to copy_cpu ( #2378 )  
						
						
						
						
							
						
					 
					
						2025-07-16 07:34:24 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						d7734edd9f 
					 
					
						
						
							
							fix complex reduce + nan propagation in min and max ( #2377 )  
						
						
						
						
							
						
					 
					
						2025-07-15 18:19:47 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						2ba69bc8fa 
					 
					
						
						
							
							lower memory uniform sampling ( #2361 )  
						
						
						
						
						* lower memory uniform
* use fp32
* fix 
						
						
							
						
					 
					
						2025-07-15 14:22:07 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						cb349a291c 
					 
					
						
						
							
							[CUDA] Use cuda::std::complex in place of cuComplex ( #2372 )  
						
						
						
						
							
						
					 
					
						2025-07-15 00:36:13 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						f0a0b077a0 
					 
					
						
						
							
							Install linux with mlx[cuda] and mlx[cpu] ( #2356 )  
						
						
						
						
						* install linux with mlx[cuda] and mlx[cpu]
* temp for testing
* cleanup circle, fix cuda repair
* update circle
* update circle
* decouple python bindings from core libraries 
						
						
							
						
					 
					
						2025-07-14 17:17:33 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						49114f28ab 
					 
					
						
						
							
							fix flaky test ( #2371 )  
						
						
						
						
							
						
					 
					
						2025-07-14 17:16:18 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						e7d2ebadd2 
					 
					
						
						
							
							[CUDA] Affine quantize ( #2354 )  
						
						
						
						
						* affine quantize and dequantize kernels
* format
* fix
* format 
						
						
							
						
					 
					
						2025-07-14 15:45:44 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						e569803d7c 
					 
					
						
						
							
							update linux build ( #2370 )  
						
						
						
						
							
						
					 
					
						2025-07-14 15:13:56 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						d34f887abc 
					 
					
						
						
							
							Add Primitive::name and remove Primitive::print ( #2365 )  
						
						
						
						
							
						
					 
					
						2025-07-14 14:06:35 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						5201df5030 
					 
					
						
						
							
							Fix imag() vjp ( #2367 )  
						
						
						
						
							
						
					 
					
						2025-07-14 13:11:16 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						2d3c26c565 
					 
					
						
						
							
							[CUDA] Do not put kernels in anonymous namespace ( #2362 )  
						
						
						
						
							
						
					 
					
						2025-07-12 14:24:45 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						6325f60d52 
					 
					
						
						
							
							[CUDA] Bundle CCCL for JIT compilation ( #2357 )  
						
						
						
						
						* Ship CCCL for JIT compilation
* Remove cexpf 
						
						
							
						
					 
					
						2025-07-11 18:45:37 -07:00