Awni Hannun 
							
						 
					 
					
						
						
							
						
						9c68b50853 
					 
					
						
						
							
							version bump ( #2554 )  
						
						
						
						
					 
					
						2025-08-29 06:54:17 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						111f1e71af 
					 
					
						
						
							
							Faster contiguous gather for indices in the first axis ( #2552 )  
						
						
						
						
						* faster contiguous gather for indices in the first axis
* work per thread > 1
* angelos suggestion for scales / biases 
						
						
					 
					
						2025-08-28 21:26:30 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						827003d568 
					 
					
						
						
							
							fix METAL quantization in JIT ( #2553 )  
						
						
						
						
					 
					
						2025-08-28 18:26:25 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						d363a76aa4 
					 
					
						
						
							
							Bump xcode in circle ( #2551 )  
						
						
						
						
						* bump xcode in circle
* bump xcode in circle
* bump xcode in circle 
						
						
					 
					
						2025-08-28 13:13:34 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						70560b6bd5 
					 
					
						
						
							
							Add mode parameter for quantization ( #2499 )  
						
						
						
						
						* add mode parameter for quantization
* mxfp4 quantize/dequantize + start of optional biases
* mxfp4 works
* speedup
* cpu mxfp4
* fix
* fix test tol
* fix
* refactor
* add quant mode enum 
						
						
					 
					
						2025-08-28 06:45:26 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						7ef8a6f2d5 
					 
					
						
						
							
							[CUDA] fix sort ( #2550 )  
						
						
						
						
						* [CUDA] fix sort
* fix test 
						
						
					 
					
						2025-08-27 19:48:43 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						31c6f6e33f 
					 
					
						
						
							
							[CUDA] Use ConcurrentContext in concatenate_gpu ( #2549 )  
						
						
						
						
					 
					
						2025-08-28 09:30:08 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						584d48458e 
					 
					
						
						
							
							link with nccl ( #2546 )  
						
						
						
						
					 
					
						2025-08-27 10:01:07 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						5cf984ca87 
					 
					
						
						
							
							Separate cpu compilation cache by versions ( #2548 )  
						
						
						
						
					 
					
						2025-08-27 11:25:15 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						a9bac3d9e5 
					 
					
						
						
							
							Run CPP tests for CUDA build in CI ( #2544 )  
						
						
						
						
					 
					
						2025-08-27 08:06:46 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						5458d43247 
					 
					
						
						
							
							add load with path tests ( #2543 )  
						
						
						
						
					 
					
						2025-08-26 14:24:47 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						a4dba65220 
					 
					
						
						
							
							Enable cuda graph toggle ( #2545 )  
						
						
						
						
						* enable cuda graph toggle
* increase cache size 
						
						
					 
					
						2025-08-26 12:50:38 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						3dcb286baf 
					 
					
						
						
							
							Remove stream from average grads so it uses default ( #2532 )  
						
						
						
						
						* Remove stream from average grads so it uses default
* comment 
						
						
					 
					
						2025-08-25 15:56:29 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						4822c3dbe9 
					 
					
						
						
							
							[CUDA] Implement DynamicSlice/DynamicSliceUpdate ( #2533 )  
						
						
						
						
						* Move DynamicSlice to gpu/primitives
* Implement compute_dynamic_offset in CUDA 
						
						
					 
					
						2025-08-26 07:31:39 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						2ca75bb529 
					 
					
						
						
							
							Remove nccl install in release ( #2542 )  
						
						
						
						
					 
					
						2025-08-25 15:20:18 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						db14e29a0b 
					 
					
						
						
							
							allow pathlib.Path to save/load functions ( #2541 )  
						
						
						
						
					 
					
						2025-08-25 14:58:49 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						d2f540f4e0 
					 
					
						
						
							
							Use nccl header only when nccl is not present ( #2539 )  
						
						
						
						
						* use nccl header only when nccl is not present
* larger machine for cuda build 
						
						
					 
					
						2025-08-25 14:17:25 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						333ffea273 
					 
					
						
						
							
							[CUDA] Remove thrust in arange ( #2535 )  
						
						
						
						
					 
					
						2025-08-24 16:22:36 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						f55b6f1f2f 
					 
					
						
						
							
							Enable COMPILE_WARNING_AS_ERROR for linux builds in CI ( #2534 )  
						
						
						
						
					 
					
						2025-08-24 15:33:08 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						30561229c7 
					 
					
						
						
							
							Fix allocation bug in NCCL ( #2530 )  
						
						
						
						
					 
					
						2025-08-22 14:39:43 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						068a4612e9 
					 
					
						
						
							
							nccl default for backend=any ( #2528 )  
						
						
						
						
						* nccl default for backend=any
* check num gpus + ensure row contiguous for all reduce
* comment 
						
						
					 
					
						2025-08-22 12:24:27 -07:00 
						 
				 
			
				
					
						
							
							
								Andrey Portnoy 
							
						 
					 
					
						
						
							
						
						5722c147de 
					 
					
						
						
							
							[CUDA] Update calls to cudaMemAdvise and cudaGraphAddDependencies for CUDA 13 ( #2525 )
						
						
						
						
						* [CUDA] Update cudaMemAdvise and cudaGraphAddDependencies for CUDA 13
  These functions' signatures changed in CUDA 13, so we differentiate
  between CUDA 13 and preceding releases at compile time.
* Mention NVIDIA in ACKNOWLEDGMENTS.md
						
						
					 
					
						2025-08-21 19:57:20 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						f6819a1f26 
					 
					
						
						
							
							Fix warning 186-D from nvcc ( #2527 )  
						
						
						
						
					 
					
						2025-08-22 10:29:55 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						f93f87c802 
					 
					
						
						
							
							nccl dep + default for cuda ( #2526 )  
						
						
						
						
					 
					
						2025-08-21 17:57:49 -07:00 
						 
				 
			
				
					
						
							
							
								Anastasiia Filippova 
							
						 
					 
					
						
						
							
						
						9392fc3f88 
					 
					
						
						
							
							NCCL backend ( #2476 )  
						
						
						
						
					 
					
						2025-08-21 11:56:15 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						e843c4d8d5 
					 
					
						
						
							
							fix power ( #2523 )  
						
						
						
						
					 
					
						2025-08-21 06:46:01 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						0c5fc63a36 
					 
					
						
						
							
							Fix docs omission ( #2524 )  
						
						
						
						
					 
					
						2025-08-20 17:56:06 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						e397177f6e 
					 
					
						
						
							
							Custom cuda kernel ( #2517 )  
						
						
						
						
					 
					
						2025-08-20 17:20:22 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						f4c8888cbe 
					 
					
						
						
							
							[CUDA] Fix stride of singleton dims before passing to cuDNN ( #2521 )  
						
						
						
						
					 
					
						2025-08-21 08:55:26 +09:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						25c1e03205 
					 
					
						
						
							
							Fix overflow in large filter small channels ( #2520 )  
						
						
						
						
					 
					
						2025-08-20 08:03:29 -07:00 
						 
				 
			
				
					
						
							
							
								russellizadi 
							
						 
					 
					
						
						
							
						
						512281781c 
					 
					
						
						
							
							Remove state return from function example in compile documentation ( #2518 )  
						
						
						
						
					 
					
						2025-08-20 00:45:05 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						ac85ddfdb7 
					 
					
						
						
							
							[CUDA] Add GEMM-based fallback convolution kernels ( #2511 )  
						
						
						
						
						* Add gemm_conv
* Add gemm_grouped_conv 
						
						
					 
					
						2025-08-20 10:06:22 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						65d0d40232 
					 
					
						
						
							
							Split cuDNN helpers into a separate header ( #2491 )  
						
						
						
						
						* Add RAII managed CudaGraph class
* Implement forward rms_norm with cuDNN
* Revert back to old rms norm kernel 
						
						
					 
					
						2025-08-20 09:29:28 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						cea9369610 
					 
					
						
						
							
							fix lapack svd ( #2515 )  
						
						
						
						
					 
					
						2025-08-18 15:07:59 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						e7c6e1db82 
					 
					
						
						
							
							no segfault with uninitialized array.at ( #2514 )  
						
						
						
						
					 
					
						2025-08-18 08:33:38 -07:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						c5fcd5b61b 
					 
					
						
						
							
							fix custom kernel test ( #2510 )  
						
						
						
						
					 
					
						2025-08-18 06:45:59 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						1df9887998 
					 
					
						
						
							
							Ensure no oob read in gemv_masked ( #2508 )  
						
						
						
						
					 
					
						2025-08-17 08:42:33 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						73f22d6226 
					 
					
						
						
							
							Ensure small sort doesn't use indices if not argsort ( #2506 )  
						
						
						
						
					 
					
						2025-08-17 08:42:20 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						c422050ca7 
					 
					
						
						
							
							Update cuDNN Frontend to v1.14 ( #2505 )  
						
						
						
						
					 
					
						2025-08-17 19:13:01 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						1ba18ff7d9 
					 
					
						
						
							
							[CUDA] Fix conv grads with groups ( #2495 )  
						
						
						
						
						* Put reshape utils in one file
* [CUDA] Fix conv grads with groups
* Put the reshape utils in gpu/copy.h 
						
						
					 
					
						2025-08-16 10:09:18 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						37b440faa8 
					 
					
						
						
							
							Clean up code handling both std::vector and SmallVector ( #2493 )  
						
						
						
						
					 
					
						2025-08-16 09:01:10 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						888b13ed63 
					 
					
						
						
							
							Remove the hack around SmallVector in cpu compile ( #2494 )  
						
						
						
						
					 
					
						2025-08-16 08:17:24 +09:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						4abb218d21 
					 
					
						
						
							
							The naive_conv_2d is no longer used ( #2496 )  
						
						
						
						
					 
					
						2025-08-16 07:57:30 +09:00 
						 
				 
			
				
					
						
							
							
								Awni Hannun 
							
						 
					 
					
						
						
							
						
						6441c21a94 
					 
					
						
						
							
							Faster general unary op ( #2472 )  
						
						
						
						
						* faster general unary op
* faster general ops + reorg
* fix + comment
* binary two
* copy general 
						
						
					 
					
						2025-08-15 15:04:12 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						dfb5022eab 
					 
					
						
						
							
							Rename cu::Matmul to CublasGemm ( #2488 )  
						
						
						
						
					 
					
						2025-08-13 09:37:40 +09:00 
						 
				 
			
				
					
						
							
							
								Daniel Yeh 
							
						 
					 
					
						
						
							
						
						ac207ce7aa 
					 
					
						
						
							
							make code blocks copyable ( #2480 )  
						
						
						
						
						Co-authored-by: Chen-Chen Yeh <ge96noj@mytum.de>
						
						
					 
					
						2025-08-12 12:29:02 -07:00 
						 
				 
			
				
					
						
							
							
								Abe Leininger 
							
						 
					 
					
						
						
							
						
						fce53b61d6 
					 
					
						
						
							
							Fix reduce sum/prod overflow ( #2477 )  
						
						
						
						
					 
					
						2025-08-12 00:05:33 -07:00 
						 
				 
			
				
					
						
							
							
								Angelos Katharopoulos 
							
						 
					 
					
						
						
							
						
						8ae4a76308 
					 
					
						
						
							
							Use CMake <4.1 to avoid the nvpl error ( #2489 )  
						
						
						
						
					 
					
						2025-08-12 00:03:42 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						7fde1b6a1e 
					 
					
						
						
							
							Fix logsumexp/softmax not fused for some cases ( #2474 )  
						
						
						
						
					 
					
						2025-08-08 14:07:17 -07:00 
						 
				 
			
				
					
						
							
							
								Cheng 
							
						 
					 
					
						
						
							
						
						aa7b47481a 
					 
					
						
						
							
							[CUDA] Optimize set_mm_device_pointers for small ndim ( #2473 )  
						
						
						
						
					 
					
						2025-08-08 15:23:30 +09:00