Jagrit Digani 
							
						 
					 
					
						
						
							
						
						7f8ba2a003 
					 
					
						
						
							
							[WIP] 2 pass sdpav  
						
						
						
						
							
						
					 
					
						2025-08-06 09:56:39 -07:00 
						 
				 
			
				
					
						
							
							
								Jagrit Digani 
							
						 
					 
					
						
						
							
						
						c28249b81a 
					 
					
						
						
							
							Add more nvtx range for debug  
						
						
						
						
							
						
					 
					
						2025-08-06 09:56:39 -07:00 
						 
				 
			
				
					
						
							
							
								Jagrit Digani 
							
						 
					 
					
						
						
							
						
						e74bcdc5e3 
					 
					
						
						
							
							Add sdpa file  
						
						
						
						
							
						
					 
					
						2025-08-06 09:56:39 -07:00 
						 
				 
			
				
					
						
							
							
								Jagrit Digani 
							
						 
					 
					
						
						
							
						
						d8ed6c1aa3 
					 
					
						
						
							
							Add base cudnn attention support  
						
						
						
						
							
						
					 
					
						2025-08-06 09:56:39 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						db5c7efcf6 
					 
					
						
						
							
							revert default cuda install (#2465)
						
						
						
						
						* revert default cuda install
* revert default cuda install 
						
						
							
						
					 
					
						2025-08-06 06:19:12 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						7bb96e4249 
					 
					
						
						
							
							fix cublas on h100 (#2466)
						
						
						
						
							
						
					 
					
						2025-08-06 06:18:58 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						fa89f0b150 
					 
					
						
						
							
							faster gather qmm sorted test (#2463)
						
						
						
						
							
						
					 
					
						2025-08-05 06:27:40 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						ca973d1e83 
					 
					
						
						
							
							fix install tags (#2464)
						
						
						
						
							
						
					 
					
						2025-08-04 20:01:23 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						828c5f1137 
					 
					
						
						
							
							Use SmallVector for shapes and strides (#2454)
						
						
						
						
						* Use SmallVector for shapes and strides
* Convert SmallVector to tuple 
						
						
							
						
					 
					
						2025-08-05 09:41:03 +09:00 
						 
				 
			
				
					
						
							
							
								Gaétan Lepage 
							
						 
					 
					
						
						
							
						
						7d86a5c108 
					 
					
						
						
							
							Feat: add USE_SYSTEM_FMT CMake option (#2219)
						
						
						
						
							
						
					 
					
						2025-08-04 16:36:11 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						0b807893a7 
					 
					
						
						
							
							fix wraps compile (#2461)
						
						
						
						
							
						
					 
					
						2025-08-04 16:14:18 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						6ad0889c8a 
					 
					
						
						
							
							default install cuda on linux (#2462)
						
						
						
						
							
						
					 
					
						2025-08-04 15:33:05 -07:00 
						 
				 
			
				
					
						
							
							
								Zamderax 
							
						 
					 
					
						
						
							
						
						737dd6d1ac 
					 
					
						
						
							
							Add missing <algorithm> header to jit_compiler.cpp (#2460)
						
						
						
						
						Fixes compilation error on Linux where std::find_if is used on line 121
but the <algorithm> header was not included. While this might work on
some platforms due to transitive includes, it's not guaranteed by the
C++ standard.
Resolves issue #2459  
						
						
							
						
					 
					
						2025-08-04 14:00:46 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						aaf78f4c6b 
					 
					
						
						
							
							Use LRU cache for cuda graph (#2448)
						
						
						
						
						* Use LRU cache for cuda graph
* Remove unused destructor 
						
						
							
						
					 
					
						2025-08-02 21:28:57 +09:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						8831064493 
					 
					
						
						
							
							Fix arctan2 grads (#2453)
						
						
						
						
							
						
					 
					
						2025-08-01 21:06:04 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						be9bc96da4 
					 
					
						
						
							
							[CUDA] Matmul utils initial commit (#2441)
						
						
						
						
							
						
					 
					
						2025-08-01 14:22:25 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						86258f292f 
					 
					
						
						
							
							[CUDA] Vectorize generated kernels (#2444)
						
						
						
						
							
						
					 
					
						2025-07-31 18:18:57 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						b26d88591c 
					 
					
						
						
							
							[CUDA] Save primitive inputs faster (#2449)
						
						
						
						
						* Add more nvtx loggings
* [CUDA] Saving primitive inputs faster
* Remove unneeded check 
						
						
							
						
					 
					
						2025-08-01 10:16:06 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						86c6a15571 
					 
					
						
						
							
							[CUDA] Backward convolution (#2431)
						
						
						
						
							
						
					 
					
						2025-08-01 09:54:05 +09:00 
						 
				 
			
				
					
						
							
							
								junpeiz 
							
						 
					 
					
						
						
							
						
						8b25ce62d5 
					 
					
						
						
							
							Add tests for export including control flow models and quantized models (#2430)
						
						
						
						
						* Add tests for export, including control flow export and quantized model export.
* Skip quantization related test for CUDA backend. 
						
						
							
						
					 
					
						2025-07-31 11:06:26 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						da5912e4f2 
					 
					
						
						
							
							fix custom metal extension (#2446)
						
						
						
						
							
						
					 
					
						2025-07-31 06:25:36 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						daafee676f 
					 
					
						
						
							
							Fix wrong graph key when using concurrent context (#2447)
						
						
						
						
							
						
					 
					
						2025-07-31 06:01:05 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						d32519c8ee 
					 
					
						
						
							
							fix gemv regression (#2445)
						
						
						
						
							
						
					 
					
						2025-07-30 14:23:01 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						b405591249 
					 
					
						
						
							
							fix circular reference (#2443)
						
						
						
						
							
						
					 
					
						2025-07-30 09:37:44 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						3bf81ed1bd 
					 
					
						
						
							
							[CUDA] Quantized refactoring (#2442)
						
						
						
						
							
						
					 
					
						2025-07-30 08:27:20 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						2204182bba 
					 
					
						
						
							
							Make CI faster (#2440)
						
						
						
						
							
						
					 
					
						2025-07-30 02:26:36 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						3628e5d497 
					 
					
						
						
							
							Use load_vector in arg_reduce (#2439)
						
						
						
						
							
						
					 
					
						2025-07-30 17:40:26 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						a0ae49d397 
					 
					
						
						
							
							Move arange to its own file (#2438)
						
						
						
						
							
						
					 
					
						2025-07-30 13:05:51 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						254476718b 
					 
					
						
						
							
							Remove the kernel arg from get_launch_args (#2437)
						
						
						
						
							
						
					 
					
						2025-07-30 11:43:02 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						3adba92ebe 
					 
					
						
						
							
							Cuda faster softmax (#2435)
						
						
						
						
						* faster softmax and logsumexp
* faster softmax and logsumexp
* format 
						
						
							
						
					 
					
						2025-07-29 17:18:12 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						ef631d63af 
					 
					
						
						
							
							faster rms norm (#2433)
						
						
						
						
							
						
					 
					
						2025-07-29 13:12:00 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						970dbe8e25 
					 
					
						
						
							
							Use ccache in CI (#2414)
						
						
						
						
						* Detect ccache
* Use ccache in CI
* Separate cache for different images
* Test both 12.2 and 12.9 for PRs 
						
						
							
						
					 
					
						2025-07-29 08:43:22 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						641be9463b 
					 
					
						
						
							
							Add more CUDA architectures for PyPi package (#2427)
						
						
						
						
						* add cuda sm 90
* add more archs 
						
						
							
						
					 
					
						2025-07-28 12:35:15 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						ab0e608862 
					 
					
						
						
							
							[CUDA] More sizes for gemv (#2429)
						
						
						
						
						* route more to gemv
* route more sizes to custom gemv 
						
						
							
						
					 
					
						2025-07-28 12:35:01 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						1588659062 
					 
					
						
						
							
							no occupancy query for launch params (#2426)
						
						
						
						
							
						
					 
					
						2025-07-28 09:09:41 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						b9e88fb976 
					 
					
						
						
							
							[CUDA] Fix segfault on exit (#2424)
						
						
						
						
						* fix cuda segfault on exit
* comment 
						
						
							
						
					 
					
						2025-07-27 08:08:13 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						4ad53414dd 
					 
					
						
						
							
							fix cuda pypi package (#2423)
						
						
						
						
						* fix cuda pypi package
* patch bump 
						
						
							
 
						
					 
					
						2025-07-25 15:20:29 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						d1165b215e 
					 
					
						
						
							
							version (#2420)
						
						
						
						
							
						
					 
					
						2025-07-25 13:29:28 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						dcb8319f3d 
					 
					
						
						
							
							update install docs and requirements (#2419)
						
						
						
						
							
						
					 
					
						2025-07-25 12:13:19 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						5597fa089c 
					 
					
						
						
							
							Fix qvm splitk (#2415)
						
						
						
						
							
						
					 
					
						2025-07-25 11:50:24 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						9acec364c2 
					 
					
						
						
							
							[CUDA] Always use batched matmul (#2404)
						
						
						
						
						* cuda batched mm
* addmm as well
* comment 
						
						
							
						
					 
					
						2025-07-24 20:46:02 -07:00 
						 
				 
			
				
					
						
							
							
								Skonor 
							
						 
					 
					
						
						
							
						
						7d9d6ef456 
					 
					
						
						
							
							docs: fix adam and adamw eps placement (#2416)
						
						
						
						
						Co-authored-by: Mikhail Gorbunov <m_gorbunov@apple.com>
						
						
							
						
					 
					
						2025-07-24 16:40:45 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						6f5874a2f2 
					 
					
						
						
							
							[CUDA] Initial implementation of Convolution with cuDNN (#2385)
						
						
						
						
						* Link with cuDNN
* Initial implementation
* Remove backend apis
* Fix recording cudnn conv
* More unused backend apis
* Fix C++ conv tests
* include cudnn as python dep
* Install libcudnn9-dev-cuda-12 in CI
* cudnn only accepts contiguous inputs
* Switch to backend apis
* Plan needs to be kept alive
* Turn off tf32
* Add cache
* Test the native cuda graph api
* Set cudnn stream before execution
* Make LRUCache more like a normal container
* Do error check for cublas handle
* Zero-initializing array
* Use tf32 for conv
* Skip TestConv.test_torch_conv_2D test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
						
						
							
						
					 
					
						2025-07-25 08:12:10 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						70dc336785 
					 
					
						
						
							
							Test on cuda 12.2 and 12.9 (#2413)
						
						
						
						
							
						
					 
					
						2025-07-24 06:06:15 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						4e504039f5 
					 
					
						
						
							
							[Metal] Release metal events (#2412)
						
						
						
						
						* release metal events
* fix
* fix 
						
						
							
						
					 
					
						2025-07-23 19:53:42 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						d1f4d291e8 
					 
					
						
						
							
							Fix uv install and add dev release (#2411)
						
						
						
						
						* fix uv install and add dev release
* fix docstring
* pin cuda deps
* cuda release on cpu-only machine 
						
						
							
						
					 
					
						2025-07-23 16:54:19 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						e1840853ce 
					 
					
						
						
							
							full row mask in sdpa consistently gives nan (#2406)
						
						
						
						
							
						
					 
					
						2025-07-23 16:37:03 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						0f5ce173da 
					 
					
						
						
							
							[CUDA] --compress-mode requires CUDA 12.8 (#2407)
						
						
						
						
							
						
					 
					
						2025-07-23 06:11:11 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						588854195f 
					 
					
						
						
							
							Remove unused code in Convolution::vjp (#2408)
						
						
						
						
							
						
					 
					
						2025-07-23 06:11:00 -07:00 
						 
				 
			
				
					
						
							
							
								Fangjun Kuang 
							
						 
					 
					
						
						
							
						
						28d068bce6 
					 
					
						
						
							
							Fix an error in the comment for mx.dequantize (#2409)
						
						
						
						
							
						
					 
					
						2025-07-23 06:10:50 -07:00