Cheng 
							
						 
					 
					
						
						
							
						
						31c6f6e33f 
					 
					
						
						
							
							[CUDA] Use ConcurrentContext in concatenate_gpu ( #2549 )  
						
						
						
						
							
						
					 
					
						2025-08-28 09:30:08 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						584d48458e 
					 
					
						
						
							
							link with nccl ( #2546 )  
						
						
						
						
							
						
					 
					
						2025-08-27 10:01:07 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						5cf984ca87 
					 
					
						
						
							
							Separate cpu compilation cache by versions ( #2548 )  
						
						
						
						
							
						
					 
					
						2025-08-27 11:25:15 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						a9bac3d9e5 
					 
					
						
						
							
							Run CPP tests for CUDA build in CI ( #2544 )  
						
						
						
						
							
						
					 
					
						2025-08-27 08:06:46 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						5458d43247 
					 
					
						
						
							
							add load with path tests ( #2543 )  
						
						
						
						
							
						
					 
					
						2025-08-26 14:24:47 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						a4dba65220 
					 
					
						
						
							
							Enable cuda graph toggle ( #2545 )  
						
						
						
						
						* enable cuda graph toggle
* increase cache size 
						
						
							
						
					 
					
						2025-08-26 12:50:38 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						3dcb286baf 
					 
					
						
						
							
							Remove stream from average grads so it uses default ( #2532 )  
						
						
						
						
						* Remove stream from average grads so it uses default
* comment 
						
						
							
						
					 
					
						2025-08-25 15:56:29 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						4822c3dbe9 
					 
					
						
						
							
							[CUDA] Implement DynamicSlice/DynamicSliceUpdate ( #2533 )  
						
						
						
						
						* Move DynamicSlice to gpu/primitives
* Implement compute_dynamic_offset in CUDA 
						
						
							
						
					 
					
						2025-08-26 07:31:39 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						2ca75bb529 
					 
					
						
						
							
							Remove nccl install in release ( #2542 )  
						
						
						
						
							
						
					 
					
						2025-08-25 15:20:18 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						db14e29a0b 
					 
					
						
						
							
							allow pathlib.Path to save/load functions ( #2541 )  
						
						
						
						
							
						
					 
					
						2025-08-25 14:58:49 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						d2f540f4e0 
					 
					
						
						
							
							Use nccl header only when nccl is not present ( #2539 )  
						
						
						
						
						* use nccl header only when nccl is not present
* larger machine for cuda build 
						
						
							
						
					 
					
						2025-08-25 14:17:25 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						333ffea273 
					 
					
						
						
							
							[CUDA] Remove thrust in arange ( #2535 )  
						
						
						
						
							
						
					 
					
						2025-08-24 16:22:36 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						f55b6f1f2f 
					 
					
						
						
							
							Enable COMPILE_WARNING_AS_ERROR for linux builds in CI ( #2534 )  
						
						
						
						
							
						
					 
					
						2025-08-24 15:33:08 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						30561229c7 
					 
					
						
						
							
							Fix allocation bug in NCCL ( #2530 )  
						
						
						
						
							
						
					 
					
						2025-08-22 14:39:43 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						068a4612e9 
					 
					
						
						
							
							nccl default for backend=any ( #2528 )  
						
						
						
						
						* nccl default for backend=any
* check num gpus + ensure row contiguous for all reduce
* comment 
						
						
							
						
					 
					
						2025-08-22 12:24:27 -07:00 
						 
				 
			
				
					
						
							
							
								Andrey Portnoy 
							
						 
					 
					
						
						
							
						
						5722c147de 
					 
					
						
						
							
							[CUDA] Update calls to cudaMemAdvise and cudaGraphAddDependencies for CUDA 13 ( #2525 )  
						
						
						
						
						* [CUDA] Update cudaMemAdvise and cudaGraphAddDependencies for CUDA 13
  These functions' signatures changed in CUDA 13, so we differentiate
  between CUDA 13 and preceding releases at compile time.
* Mention NVIDIA in ACKNOWLEDGMENTS.md 
						
						
							
						
					 
					
						2025-08-21 19:57:20 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						f6819a1f26 
					 
					
						
						
							
							Fix warning 186-D from nvcc ( #2527 )  
						
						
						
						
							
						
					 
					
						2025-08-22 10:29:55 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						f93f87c802 
					 
					
						
						
							
							nccl dep + default for cuda ( #2526 )  
						
						
						
						
							
						
					 
					
						2025-08-21 17:57:49 -07:00 
						 
				 
			
				
					
						
							
							
								Anastasiia Filippova 
							
						 
					 
					
						
						
							
						
						9392fc3f88 
					 
					
						
						
							
							NCCL backend ( #2476 )  
						
						
						
						
							
						
					 
					
						2025-08-21 11:56:15 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						e843c4d8d5 
					 
					
						
						
							
							fix power ( #2523 )  
						
						
						
						
							
						
					 
					
						2025-08-21 06:46:01 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						0c5fc63a36 
					 
					
						
						
							
							Fix docs omission ( #2524 )  
						
						
						
						
							
						
					 
					
						2025-08-20 17:56:06 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						e397177f6e 
					 
					
						
						
							
							Custom cuda kernel ( #2517 )  
						
						
						
						
							
						
					 
					
						2025-08-20 17:20:22 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						f4c8888cbe 
					 
					
						
						
							
							[CUDA] Fix stride of singleton dims before passing to cuDNN ( #2521 )  
						
						
						
						
							
						
					 
					
						2025-08-21 08:55:26 +09:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						25c1e03205 
					 
					
						
						
							
							Fix overflow in large filter small channels ( #2520 )  
						
						
						
						
							
						
					 
					
						2025-08-20 08:03:29 -07:00 
						 
				 
			
				
					
						
							
							
								russellizadi 
							
						 
					 
					
						
						
							
						
						512281781c 
					 
					
						
						
							
							Remove state return from function example in compile documentation ( #2518 )  
						
						
						
						
							
						
					 
					
						2025-08-20 00:45:05 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						ac85ddfdb7 
					 
					
						
						
							
							[CUDA] Add GEMM-based fallback convolution kernels ( #2511 )  
						
						
						
						
						* Add gemm_conv
* Add gemm_grouped_conv 
						
						
							
						
					 
					
						2025-08-20 10:06:22 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						65d0d40232 
					 
					
						
						
							
							Split cuDNN helpers into a separate header ( #2491 )  
						
						
						
						
						* Add RAII managed CudaGraph class
* Implement forward rms_norm with cuDNN
* Revert back to old rms norm kernel 
						
						
							
						
					 
					
						2025-08-20 09:29:28 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						cea9369610 
					 
					
						
						
							
							fix lapack svd ( #2515 )  
						
						
						
						
							
						
					 
					
						2025-08-18 15:07:59 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						e7c6e1db82 
					 
					
						
						
							
							no segfault with uninitialized array.at ( #2514 )  
						
						
						
						
							
						
					 
					
						2025-08-18 08:33:38 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						c5fcd5b61b 
					 
					
						
						
							
							fix custom kernel test ( #2510 )  
						
						
						
						
							
						
					 
					
						2025-08-18 06:45:59 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						1df9887998 
					 
					
						
						
							
							Ensure no oob read in gemv_masked ( #2508 )  
						
						
						
						
							
						
					 
					
						2025-08-17 08:42:33 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						73f22d6226 
					 
					
						
						
							
							Ensure small sort doesn't use indices if not argsort ( #2506 )  
						
						
						
						
							
						
					 
					
						2025-08-17 08:42:20 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						c422050ca7 
					 
					
						
						
							
							Update cuDNN Frontend to v1.14 ( #2505 )  
						
						
						
						
							
						
					 
					
						2025-08-17 19:13:01 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						1ba18ff7d9 
					 
					
						
						
							
							[CUDA] Fix conv grads with groups ( #2495 )  
						
						
						
						
						* Put reshape utils in one file
* [CUDA] Fix conv grads with groups
* Put the reshape utils in gpu/copy.h 
						
						
							
						
					 
					
						2025-08-16 10:09:18 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						37b440faa8 
					 
					
						
						
							
							Clean up code handling both std::vector and SmallVector ( #2493 )  
						
						
						
						
							
						
					 
					
						2025-08-16 09:01:10 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						888b13ed63 
					 
					
						
						
							
							Remove the hack around SmallVector in cpu compile ( #2494 )  
						
						
						
						
							
						
					 
					
						2025-08-16 08:17:24 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						4abb218d21 
					 
					
						
						
							
							The naive_conv_2d is no longer used ( #2496 )  
						
						
						
						
							
						
					 
					
						2025-08-16 07:57:30 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						6441c21a94 
					 
					
						
						
							
							Faster general unary op ( #2472 )  
						
						
						
						
						* faster general unary op
* faster general ops + reorg
* fix + comment
* binary two
* copy general 
						
						
							
						
					 
					
						2025-08-15 15:04:12 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						dfb5022eab 
					 
					
						
						
							
							Rename cu::Matmul to CublasGemm ( #2488 )  
						
						
						
						
							
						
					 
					
						2025-08-13 09:37:40 +09:00 
						 
				 
			
				
					
						
							
							
								Daniel Yeh 
							
						 
					 
					
						
						
							
						
						ac207ce7aa 
					 
					
						
						
							
							make code blocks copyable ( #2480 )  
						
						
						
						
						Co-authored-by: Chen-Chen Yeh <ge96noj@mytum.de> 
						
						
							
						
					 
					
						2025-08-12 12:29:02 -07:00 
						 
				 
			
				
					
						
							
							
								Abe Leininger 
							
						 
					 
					
						
						
							
						
						fce53b61d6 
					 
					
						
						
							
							Fix reduce sum/prod overflow ( #2477 )  
						
						
						
						
							
						
					 
					
						2025-08-12 00:05:33 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						8ae4a76308 
					 
					
						
						
							
							Use CMake <4.1 to avoid the nvpl error ( #2489 )  
						
						
						
						
							
						
					 
					
						2025-08-12 00:03:42 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						7fde1b6a1e 
					 
					
						
						
							
							Fix logsumexp/softmax not fused for some cases ( #2474 )  
						
						
						
						
							
						
					 
					
						2025-08-08 14:07:17 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						aa7b47481a 
					 
					
						
						
							
							[CUDA] Optimize set_mm_device_pointers for small ndim ( #2473 )  
						
						
						
						
							
						
					 
					
						2025-08-08 15:23:30 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						56be773610 
					 
					
						
						
							
							version ( #2470 )  
						
						
						
						
							
 
						
					 
					
						2025-08-07 00:36:04 -07:00 
						 
				 
			
				
					
						
							
							
								Jagrit Digani 
							
						 
					 
					
						
						
							
						
						a9bdd67baa 
					 
					
						
						
							
							Add CUDA sdpa vector ( #2468 )  
						
						
						
						
							
						
					 
					
						2025-08-06 21:40:26 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						f2adb5638d 
					 
					
						
						
							
							Fix typo in metal command encoder ( #2471 )  
						
						
						
						
							
						
					 
					
						2025-08-06 16:58:23 -07:00 
						 
				 
			
				
					
						
							
							
								Luca Vivona 
							
						 
					 
					
						
						
							
						
						728d4db582 
					 
					
						
						
							
							Support destination arg in tree flatten/unflatten ( #2450 )  
						
						
						
						
							
						
					 
					
						2025-08-06 15:34:59 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						db5c7efcf6 
					 
					
						
						
							
							revert default cuda install ( #2465 )  
						
						
						
						
						* revert default cuda install
* revert default cuda install 
						
						
							
						
					 
					
						2025-08-06 06:19:12 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						7bb96e4249 
					 
					
						
						
							
							fix cublas on h100 ( #2466 )  
						
						
						
						
							
						
					 
					
						2025-08-06 06:18:58 -07:00