Awni Hannun
6441c21a94
Faster general unary op (#2472)
* faster general unary op
* faster general ops + reorg
* fix + comment
* binary two
* copy general
2025-08-15 15:04:12 -07:00
Cheng
dfb5022eab
Rename cu::Matmul to CublasGemm (#2488)
2025-08-13 09:37:40 +09:00
Jagrit Digani
a9bdd67baa
Add CUDA sdpa vector (#2468)
2025-08-06 21:40:26 -07:00
Angelos Katharopoulos
3bf81ed1bd
[CUDA] Quantized refactoring (#2442)
2025-07-30 08:27:20 -07:00
Cheng
a0ae49d397
Move arange to its own file (#2438)
2025-07-30 13:05:51 +09:00
Awni Hannun
641be9463b
Add more CUDA architectures for PyPi package (#2427)
* add cuda sm 90
* add more archs
2025-07-28 12:35:15 -07:00
Awni Hannun
9acec364c2
[CUDA] Always use batched matmul (#2404)
* cuda batched mm
* addmm as well
* comment
2025-07-24 20:46:02 -07:00
Cheng
6f5874a2f2
[CUDA] Initial implementation of Convolution with cuDNN (#2385)
* Link with cuDNN
* Initial implementation
* Remove backend apis
* Fix recording cudnn conv
* More unused backend apis
* Fix C++ conv tests
* include cudnn as python dep
* Install libcudnn9-dev-cuda-12 in CI
* cudnn only accepts contiguous inputs
* Switch to backend apis
* Plan needs to be kept alive
* Turn off tf32
* Add cache
* Test the native cuda graph api
* Set cudnn stream before execution
* Make LRUCache more like a normal container
* Do error check for cublas handle
* Zero-initializing array
* Use tf32 for conv
* Skip TestConv.test_torch_conv_2D test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2025-07-25 08:12:10 +09:00
Cheng
0f5ce173da
[CUDA] --compress-mode requires CUDA 12.8 (#2407)
2025-07-23 06:11:11 -07:00
Awni Hannun
d107d8d495
add cuda gemv (#2400)
2025-07-22 08:24:13 -07:00
Awni Hannun
74eccbf3fa
use size option in binary (#2399)
2025-07-22 07:00:53 -07:00
Awni Hannun
e7d2ebadd2
[CUDA] Affine quantize (#2354)
* affine quantize and dequantize kernels
* format
* fix
* format
2025-07-14 15:45:44 -07:00
Cheng
6325f60d52
[CUDA] Bundle CCCL for JIT compilation (#2357)
* Ship CCCL for JIT compilation
* Remove cexpf
2025-07-11 18:45:37 -07:00
Cheng
8347575ba1
[CUDA] Implement Scan kernel (#2347)
* Contiguous scan
* Strided scan
* Enable tests
* Fix failing logaddexp test
* Use cexpf in Metal
2025-07-10 16:54:12 -07:00
Angelos Katharopoulos
772f471ff2
[CUDA] Fix reductions (#2314)
2025-06-27 12:59:20 -07:00
Awni Hannun
b8022c578a
divmod, partition, sort fixes (#2302)
2025-06-16 18:49:32 -07:00
Angelos Katharopoulos
580776559b
RoPE for CUDA (#2293)
* First working CUDA rope
* Fix random
2025-06-15 06:08:07 -07:00
Cheng
c8b4787e4e
CUDA backend: indexing ops (#2277)
2025-06-12 21:44:19 -07:00
Awni Hannun
2188199ff8
[CUDA] ternary with select op (#2283)
* cuda ternary with select op
* comment + fix
* fix
2025-06-12 20:24:43 -07:00
Awni Hannun
aa07429bad
Fix cuda build (#2284)
2025-06-12 17:48:05 -07:00
Awni Hannun
918761a25a
[CUDA] RMSNorm and VJP (#2280)
* rms norm start
* nit
2025-06-12 17:09:49 -07:00
Cheng
a4fc671d3e
CUDA backend: compile (#2276)
* CUDA backend: compile
* Rename kernels/ to device/
2025-06-12 17:08:39 -07:00
Cheng
c2dd81a8aa
Fix warnings from latest CUDA toolkit (#2275)
2025-06-12 06:03:01 -07:00
Cheng
d7e680ffe4
CUDA backend: layernorm (#2271)
2025-06-11 15:48:32 -07:00
Cheng
c371baf53a
CUDA backend: softmax (#2272)
2025-06-11 13:55:22 -07:00
Cheng
ccf78f566c
CUDA backend: argreduce (#2270)
2025-06-11 13:26:17 -07:00
Cheng
c9fa68664a
CUDA backend: reduce (#2269)
2025-06-11 11:22:25 -07:00
Awni Hannun
c35f4d089a
start cuda circle config (#2256)
* rebase
* fix metal kernel linking issue on cuda
* start cuda circle config
2025-06-10 21:19:47 -07:00
Cheng
99c33d011d
rebase + nit (#2260)
Co-authored-by: Awni Hannun <awni@apple.com>
2025-06-10 10:51:51 -07:00
Cheng
7c4eb5d03e
CUDA backend: random (#2261)
2025-06-10 08:59:56 -07:00
Cheng
bae9a6b404
CUDA backend: sort (#2262)
Co-authored-by: Awni Hannun <awni@apple.com>
2025-06-10 08:59:47 -07:00
Cheng
7ebb2e0193
CUDA backend: binary ops (#2259)
2025-06-10 06:37:40 -07:00
Cheng
f8bad60609
CUDA backend: unary ops (#2158)
2025-06-09 06:45:08 -07:00
Cheng
24f89173d1
CUDA backend: matmul (#2241)
2025-06-06 12:24:04 -07:00
Cheng
52dc8c8cd5
Add profiler annotations in common primitives for CUDA backend (#2244)
2025-06-04 19:55:12 -07:00
Cheng
85a8beb5e4
Avoid atomic updates across CPU/GPU in CUDA event (#2231)
2025-06-03 16:49:06 -07:00
Cheng
f76ee1ffd2
Move some dims utils to common (#2223)
2025-05-29 06:48:30 -07:00
Cheng
54a71f270a
Remove unused defines (#2217)
2025-05-23 06:14:58 -07:00
Cheng
35c87741cf
Build for compute capability 70 instead of 75 (#2209)
2025-05-20 19:42:48 -07:00
Cheng
237f9e58a8
Fix BEFORE keyword in target_include_directories (#2204)
2025-05-19 06:10:44 -07:00
Cheng
0cae0bdac8
CUDA backend: backbone (#2075)
2025-05-06 21:26:46 -07:00