russellizadi 
							
						 
					 
					
						
						
							
						
						512281781c 
					 
					
						
						
							
							Remove state return from function example in compile documentation ( #2518 )  
						
						
						
						
							
						
					 
					
						2025-08-20 00:45:05 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						ac85ddfdb7 
					 
					
						
						
							
							[CUDA] Add GEMM-based fallback convolution kernels ( #2511 )  
						
						
						
						
						* Add gemm_conv
* Add gemm_grouped_conv 
						
						
							
						
					 
					
						2025-08-20 10:06:22 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						65d0d40232 
					 
					
						
						
							
							Split cuDNN helpers into a separate header ( #2491 )  
						
						
						
						
						* Add RAII managed CudaGraph class
* Implement forward rms_norm with cuDNN
* Revert back to old rms norm kernel 
						
						
							
						
					 
					
						2025-08-20 09:29:28 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						cea9369610 
					 
					
						
						
							
							fix lapack svd ( #2515 )  
						
						
						
						
							
						
					 
					
						2025-08-18 15:07:59 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						e7c6e1db82 
					 
					
						
						
							
							no segfault with uninitialized array.at ( #2514 )  
						
						
						
						
							
						
					 
					
						2025-08-18 08:33:38 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						c5fcd5b61b 
					 
					
						
						
							
							fix custom kernel test ( #2510 )  
						
						
						
						
							
						
					 
					
						2025-08-18 06:45:59 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						1df9887998 
					 
					
						
						
							
							Ensure no oob read in gemv_masked ( #2508 )  
						
						
						
						
							
						
					 
					
						2025-08-17 08:42:33 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						73f22d6226 
					 
					
						
						
							
							Ensure small sort doesn't use indices if not argsort ( #2506 )  
						
						
						
						
							
						
					 
					
						2025-08-17 08:42:20 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						c422050ca7 
					 
					
						
						
							
							Update cuDNN Frontend to v1.14 ( #2505 )  
						
						
						
						
							
						
					 
					
						2025-08-17 19:13:01 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						1ba18ff7d9 
					 
					
						
						
							
							[CUDA] Fix conv grads with groups ( #2495 )  
						
						
						
						
						* Put reshape utils in one file
* [CUDA] Fix conv grads with groups
* Put the reshape utils in gpu/copy.h 
						
						
							
						
					 
					
						2025-08-16 10:09:18 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						37b440faa8 
					 
					
						
						
							
							Clean up code handling both std::vector and SmallVector ( #2493 )  
						
						
						
						
							
						
					 
					
						2025-08-16 09:01:10 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						888b13ed63 
					 
					
						
						
							
							Remove the hack around SmallVector in cpu compile ( #2494 )  
						
						
						
						
							
						
					 
					
						2025-08-16 08:17:24 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						4abb218d21 
					 
					
						
						
							
							The naive_conv_2d is no longer used ( #2496 )  
						
						
						
						
							
						
					 
					
						2025-08-16 07:57:30 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						6441c21a94 
					 
					
						
						
							
							Faster general unary op ( #2472 )  
						
						
						
						
						* faster general unary op
* faster general ops + reorg
* fix + comment
* binary two
* copy general 
						
						
							
						
					 
					
						2025-08-15 15:04:12 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						dfb5022eab 
					 
					
						
						
							
							Rename cu::Matmul to CublasGemm ( #2488 )  
						
						
						
						
							
						
					 
					
						2025-08-13 09:37:40 +09:00 
						 
				 
			
				
					
						
							
							
								Daniel Yeh 
							
						 
					 
					
						
						
							
						
						ac207ce7aa 
					 
					
						
						
							
							make code blocks copyable ( #2480 )  
						
						
						
						
						Co-authored-by: Chen-Chen Yeh <ge96noj@mytum.de>
						
						
							
						
					 
					
						2025-08-12 12:29:02 -07:00 
						 
				 
			
				
					
						
							
							
								Abe Leininger 
							
						 
					 
					
						
						
							
						
						fce53b61d6 
					 
					
						
						
							
							Fix reduce sum/prod overflow ( #2477 )  
						
						
						
						
							
						
					 
					
						2025-08-12 00:05:33 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						8ae4a76308 
					 
					
						
						
							
							Use CMake <4.1 to avoid the nvpl error ( #2489 )  
						
						
						
						
							
						
					 
					
						2025-08-12 00:03:42 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						7fde1b6a1e 
					 
					
						
						
							
							Fix logsumexp/softmax not fused for some cases ( #2474 )  
						
						
						
						
							
						
					 
					
						2025-08-08 14:07:17 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						aa7b47481a 
					 
					
						
						
							
							[CUDA] Optimize set_mm_device_pointers for small ndim ( #2473 )  
						
						
						
						
							
						
					 
					
						2025-08-08 15:23:30 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						56be773610 
					 
					
						
						
							
							version ( #2470 )  
						
						
						
						
							
 
						
					 
					
						2025-08-07 00:36:04 -07:00 
						 
				 
			
				
					
						
							
							
								Jagrit Digani 
							
						 
					 
					
						
						
							
						
						a9bdd67baa 
					 
					
						
						
							
							Add CUDA sdpa vector ( #2468 )  
						
						
						
						
							
						
					 
					
						2025-08-06 21:40:26 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						f2adb5638d 
					 
					
						
						
							
							Fix typo in metal command encoder ( #2471 )  
						
						
						
						
							
						
					 
					
						2025-08-06 16:58:23 -07:00 
						 
				 
			
				
					
						
							
							
								Luca Vivona 
							
						 
					 
					
						
						
							
						
						728d4db582 
					 
					
						
						
							
							Support destination arg in tree flatten/unflatten ( #2450 )  
						
						
						
						
							
						
					 
					
						2025-08-06 15:34:59 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						db5c7efcf6 
					 
					
						
						
							
							revert default cuda install ( #2465 )  
						
						
						
						
						* revert default cuda install
* revert default cuda install 
						
						
							
						
					 
					
						2025-08-06 06:19:12 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						7bb96e4249 
					 
					
						
						
							
							fix cublas on h100 ( #2466 )  
						
						
						
						
							
						
					 
					
						2025-08-06 06:18:58 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						fa89f0b150 
					 
					
						
						
							
							faster gather qmm sorted test ( #2463 )  
						
						
						
						
							
						
					 
					
						2025-08-05 06:27:40 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						ca973d1e83 
					 
					
						
						
							
							fix install tags ( #2464 )  
						
						
						
						
							
						
					 
					
						2025-08-04 20:01:23 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						828c5f1137 
					 
					
						
						
							
							Use SmallVector for shapes and strides ( #2454 )  
						
						
						
						
						* Use SmallVector for shapes and strides
* Convert SmallVector to tuple 
						
						
							
						
					 
					
						2025-08-05 09:41:03 +09:00 
						 
				 
			
				
					
						
							
							
								Gaétan Lepage 
							
						 
					 
					
						
						
							
						
						7d86a5c108 
					 
					
						
						
							
							Feat: add USE_SYSTEM_FMT CMake option ( #2219 )  
						
						
						
						
							
						
					 
					
						2025-08-04 16:36:11 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						0b807893a7 
					 
					
						
						
							
							fix wraps compile ( #2461 )  
						
						
						
						
							
						
					 
					
						2025-08-04 16:14:18 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						6ad0889c8a 
					 
					
						
						
							
							default install cuda on linux ( #2462 )  
						
						
						
						
							
						
					 
					
						2025-08-04 15:33:05 -07:00 
						 
				 
			
				
					
						
							
							
								Zamderax 
							
						 
					 
					
						
						
							
						
						737dd6d1ac 
					 
					
						
						
							
							Add missing <algorithm> header to jit_compiler.cpp ( #2460 )  
						
						
						
						
						Fixes compilation error on Linux where std::find_if is used on line 121
but the <algorithm> header was not included. While this might work on
some platforms due to transitive includes, it's not guaranteed by the
C++ standard.
Resolves issue #2459  
						
						
							
						
					 
					
						2025-08-04 14:00:46 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						aaf78f4c6b 
					 
					
						
						
							
							Use LRU cache for cuda graph ( #2448 )  
						
						
						
						
						* Use LRU cache for cuda graph
* Remove unused destructor 
						
						
							
						
					 
					
						2025-08-02 21:28:57 +09:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						8831064493 
					 
					
						
						
							
							Fix arctan2 grads ( #2453 )  
						
						
						
						
							
						
					 
					
						2025-08-01 21:06:04 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						be9bc96da4 
					 
					
						
						
							
							[CUDA] Matmul utils initial commit ( #2441 )  
						
						
						
						
							
						
					 
					
						2025-08-01 14:22:25 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						86258f292f 
					 
					
						
						
							
							[CUDA] Vectorize generated kernels ( #2444 )  
						
						
						
						
							
						
					 
					
						2025-07-31 18:18:57 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						b26d88591c 
					 
					
						
						
							
							[CUDA] Save primitive inputs faster ( #2449 )  
						
						
						
						
						* Add more nvtx loggings
* [CUDA] Saving primitive inputs faster
* Remove unneeded check 
						
						
							
						
					 
					
						2025-08-01 10:16:06 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						86c6a15571 
					 
					
						
						
							
							[CUDA] Backward convolution ( #2431 )  
						
						
						
						
							
						
					 
					
						2025-08-01 09:54:05 +09:00 
						 
				 
			
				
					
						
							
							
								junpeiz 
							
						 
					 
					
						
						
							
						
						8b25ce62d5 
					 
					
						
						
							
							Add tests for export including control flow models and quantized models ( #2430 )  
						
						
						
						
						* Add tests for export, including control flow export and quantized model export.
* Skip quantization related test for CUDA backend. 
						
						
							
						
					 
					
						2025-07-31 11:06:26 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						da5912e4f2 
					 
					
						
						
							
							fix custom metal extension ( #2446 )  
						
						
						
						
							
						
					 
					
						2025-07-31 06:25:36 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						daafee676f 
					 
					
						
						
							
							Fix wrong graph key when using concurrent context ( #2447 )  
						
						
						
						
							
						
					 
					
						2025-07-31 06:01:05 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						d32519c8ee 
					 
					
						
						
							
							fix gemv regression ( #2445 )  
						
						
						
						
							
						
					 
					
						2025-07-30 14:23:01 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						b405591249 
					 
					
						
						
							
							fix circular reference ( #2443 )  
						
						
						
						
							
						
					 
					
						2025-07-30 09:37:44 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						3bf81ed1bd 
					 
					
						
						
							
							[CUDA] Quantized refactoring ( #2442 )  
						
						
						
						
							
						
					 
					
						2025-07-30 08:27:20 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						2204182bba 
					 
					
						
						
							
							Make CI faster ( #2440 )  
						
						
						
						
							
						
					 
					
						2025-07-30 02:26:36 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						3628e5d497 
					 
					
						
						
							
							Use load_vector in arg_reduce ( #2439 )  
						
						
						
						
							
						
					 
					
						2025-07-30 17:40:26 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						a0ae49d397 
					 
					
						
						
							
							Move arange to its own file ( #2438 )  
						
						
						
						
							
						
					 
					
						2025-07-30 13:05:51 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						254476718b 
					 
					
						
						
							
							Remove the kernel arg from get_launch_args ( #2437 )  
						
						
						
						
							
						
					 
					
						2025-07-30 11:43:02 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						3adba92ebe 
					 
					
						
						
							
							Cuda faster softmax ( #2435 )  
						
						
						
						
						* faster softmax and logsumexp
* faster softmax and logsumexp
* format 
						
						
							
						
					 
					
						2025-07-29 17:18:12 -07:00