Cheng
c9f4dc851f
Merge build-cuda and build-linux actions ( #2783 )
2025-11-25 20:06:42 +09:00
Cheng
6f35017d1b
[CUDA] cuDNN backward attention ( #2762 )
2025-11-19 08:13:50 +09:00
Cheng
940f4c7818
Fix building with CUDA < 12.8 ( #2782 )
2025-11-18 12:55:19 +09:00
Awni Hannun
1bf605d56d
use arch specific targets when possible ( #2771 )
2025-11-14 20:04:18 -08:00
Cheng
3b2ffcefc3
[CUDA] cuDNN forward attention ( #2743 )
...
* Separate sdpa kernels in another file
* Initial support for cuDNN SDPA
* Disable a few corner cases
* Remove scaled_dot_product_attention.h
* Use cuDNN attention for prefilling
* cuDNN SDPA requires Ampere and later
* Address reviews
* Do contiguous copy of inputs
2025-11-14 09:23:56 +09:00
Awni Hannun
df58b4133a
[CUDA] Reduce use of managed memory ( #2725 )
...
* Use async cuda malloc managed with cuda 13
* add pool threshold
* refactor for regular cuda malloc
* load eval gpu for cuda
* remove use of cuda pool, use cuda free async
* fix
* fix
* fix
* fix
* fix + comment
2025-11-05 16:05:23 -08:00
Awni Hannun
ec72b44417
Add quantize/dequantize for mxfp8 and nvfp4 ( #2688 )
...
* Add quantize/dequantize slow path for mxfp8 and nvfp4
* fast cuda kernel for mx/nv quantization
* fallback for cuda < 12.8 (#2697)
* format (#2700)
* fix (#2701)
* metal kernels
* docs
* fix jit
* add default bits and group sizes
* improve quant docs
* fix output type of mxfp4 matmuls
2025-10-28 16:23:12 -07:00
Awni Hannun
969924cc69
Fp8 conversion ( #2686 )
...
* add fp8 e4m3 converters
* add cuda
* default saturate to min/max
* fix for older OS
* fix no gpu/cpu
* fix saturate
* fix compile
2025-10-27 16:35:50 -07:00
Awni Hannun
4bce5f9b2d
suppress gcc 10.1 warnings ( #2679 )
...
* suppress gcc 10.1 warnings
* suppress gcc 10.1 warnings
2025-10-17 12:09:21 -07:00
Anastasiia Filippova
9392fc3f88
NCCL backend ( #2476 )
2025-08-21 11:56:15 -07:00
Angelos Katharopoulos
e397177f6e
Custom cuda kernel ( #2517 )
2025-08-20 17:20:22 -07:00
Cheng
ac85ddfdb7
[CUDA] Add GEMM-based fallback convolution kernels ( #2511 )
...
* Add gemm_conv
* Add gemm_grouped_conv
2025-08-20 10:06:22 +09:00
Cheng
65d0d40232
Split cuDNN helpers into a separate header ( #2491 )
...
* Add RAII managed CudaGraph class
* Implement forward rms_norm with cuDNN
* Revert back to old rms norm kernel
2025-08-20 09:29:28 +09:00
Cheng
c422050ca7
Update cuDNN Frontend to v1.14 ( #2505 )
2025-08-17 19:13:01 +09:00
Awni Hannun
6441c21a94
Faster general unary op ( #2472 )
...
* faster general unary op
* faster general ops + reorg
* fix + comment
* binary two
* copy general
2025-08-15 15:04:12 -07:00
Cheng
dfb5022eab
Rename cu::Matmul to CublasGemm ( #2488 )
2025-08-13 09:37:40 +09:00
Jagrit Digani
a9bdd67baa
Add CUDA sdpa vector ( #2468 )
2025-08-06 21:40:26 -07:00
Angelos Katharopoulos
3bf81ed1bd
[CUDA] Quantized refactoring ( #2442 )
2025-07-30 08:27:20 -07:00
Cheng
a0ae49d397
Move arange to its own file ( #2438 )
2025-07-30 13:05:51 +09:00
Awni Hannun
641be9463b
Add more CUDA architectures for PyPi package ( #2427 )
...
* add cuda sm 90
* add more archs
2025-07-28 12:35:15 -07:00
Awni Hannun
9acec364c2
[CUDA] Always use batched matmul ( #2404 )
...
* cuda batched mm
* addmm as well
* comment
2025-07-24 20:46:02 -07:00
Cheng
6f5874a2f2
[CUDA] Initial implementation of Convolution with cuDNN ( #2385 )
...
* Link with cuDNN
* Initial implementation
* Remove backend apis
* Fix recording cudnn conv
* More unused backend apis
* Fix C++ conv tests
* include cudnn as python dep
* Install libcudnn9-dev-cuda-12 in CI
* cudnn only accepts contiguous inputs
* Switch to backend apis
* Plan needs to be kept alive
* Turn off tf32
* Add cache
* Test the native cuda graph api
* Set cudnn stream before execution
* Make LRUCache more like a normal container
* Do error check for cublas handle
* Zero-initializing array
* Use tf32 for conv
* Skip TestConv.test_torch_conv_2D test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2025-07-25 08:12:10 +09:00
Cheng
0f5ce173da
[CUDA] --compress-mode requires CUDA 12.8 ( #2407 )
2025-07-23 06:11:11 -07:00
Awni Hannun
d107d8d495
add cuda gemv ( #2400 )
2025-07-22 08:24:13 -07:00
Awni Hannun
74eccbf3fa
use size option in binary ( #2399 )
2025-07-22 07:00:53 -07:00
Awni Hannun
e7d2ebadd2
[CUDA] Affine quantize ( #2354 )
...
* affine quantize and dequantize kernels
* format
* fix
* format
2025-07-14 15:45:44 -07:00
Cheng
6325f60d52
[CUDA] Bundle CCCL for JIT compilation ( #2357 )
...
* Ship CCCL for JIT compilation
* Remove cexpf
2025-07-11 18:45:37 -07:00
Cheng
8347575ba1
[CUDA] Implement Scan kernel ( #2347 )
...
* Contiguous scan
* Strided scan
* Enable tests
* Fix failing logaddexp test
* Use cexpf in Metal
2025-07-10 16:54:12 -07:00
Angelos Katharopoulos
772f471ff2
[CUDA] Fix reductions ( #2314 )
2025-06-27 12:59:20 -07:00
Awni Hannun
b8022c578a
divmod, partition, sort fixes ( #2302 )
2025-06-16 18:49:32 -07:00
Angelos Katharopoulos
580776559b
RoPE for CUDA ( #2293 )
...
* First working CUDA rope
* Fix random
2025-06-15 06:08:07 -07:00
Cheng
c8b4787e4e
CUDA backend: indexing ops ( #2277 )
2025-06-12 21:44:19 -07:00
Awni Hannun
2188199ff8
[CUDA] ternary with select op ( #2283 )
...
* cuda ternary with select op
* comment + fix
* fix
2025-06-12 20:24:43 -07:00
Awni Hannun
aa07429bad
Fix cuda build ( #2284 )
2025-06-12 17:48:05 -07:00
Awni Hannun
918761a25a
[CUDA] RMSNorm and VJP ( #2280 )
...
* rms norm start
* nit
2025-06-12 17:09:49 -07:00
Cheng
a4fc671d3e
CUDA backend: compile ( #2276 )
...
* CUDA backend: compile
* Rename kernels/ to device/
2025-06-12 17:08:39 -07:00
Cheng
c2dd81a8aa
Fix warnings from latest CUDA toolkit ( #2275 )
2025-06-12 06:03:01 -07:00
Cheng
d7e680ffe4
CUDA backend: layernorm ( #2271 )
2025-06-11 15:48:32 -07:00
Cheng
c371baf53a
CUDA backend: softmax ( #2272 )
2025-06-11 13:55:22 -07:00
Cheng
ccf78f566c
CUDA backend: argreduce ( #2270 )
2025-06-11 13:26:17 -07:00
Cheng
c9fa68664a
CUDA backend: reduce ( #2269 )
2025-06-11 11:22:25 -07:00
Awni Hannun
c35f4d089a
start cuda circle config ( #2256 )
...
* rebase
* fix metal kernel linking issue on cuda
* start cuda circle config
2025-06-10 21:19:47 -07:00
Cheng
99c33d011d
rebase + nit ( #2260 )
...
Co-authored-by: Awni Hannun <awni@apple.com>
2025-06-10 10:51:51 -07:00
Cheng
7c4eb5d03e
CUDA backend: random ( #2261 )
2025-06-10 08:59:56 -07:00
Cheng
bae9a6b404
CUDA backend: sort ( #2262 )
...
Co-authored-by: Awni Hannun <awni@apple.com>
2025-06-10 08:59:47 -07:00
Cheng
7ebb2e0193
CUDA backend: binary ops ( #2259 )
2025-06-10 06:37:40 -07:00
Cheng
f8bad60609
CUDA backend: unary ops ( #2158 )
2025-06-09 06:45:08 -07:00
Cheng
24f89173d1
CUDA backend: matmul ( #2241 )
2025-06-06 12:24:04 -07:00
Cheng
52dc8c8cd5
Add profiler annotations in common primitives for CUDA backend ( #2244 )
2025-06-04 19:55:12 -07:00
Cheng
85a8beb5e4
Avoid atomic updates across CPU/GPU in CUDA event ( #2231 )
2025-06-03 16:49:06 -07:00