Angelos Katharopoulos
|
b3d7b85376
|
Make ptx cache settable by environment variable (#2304)
|
2025-06-17 23:55:56 -07:00 |
|
Awni Hannun
|
cad5c0241c
|
[CUDA] synch properly waits for all tasks to finish and clear (#2303)
* cuda synch properly waits for all tasks to finish and clear
* fix copy
|
2025-06-17 12:03:25 -07:00 |
|
Awni Hannun
|
b8022c578a
|
divmod, partition, sort fixes (#2302)
|
2025-06-16 18:49:32 -07:00 |
|
Awni Hannun
|
bc53f8293f
|
Cuda bug fixes 2 (#2298)
* more bug fixes
* more bug fixes
* format
|
2025-06-16 13:14:46 -07:00 |
|
Awni Hannun
|
c552ff2451
|
[CUDA] Fix back-end bugs and enable corresponding tests (#2296)
* Fix some cuda back-end bugs and enable corresponding tests
* more fixes
* enable more tests
* format
|
2025-06-16 08:45:40 -07:00 |
|
Angelos Katharopoulos
|
580776559b
|
RoPE for CUDA (#2293)
* First working CUDA rope
* Fix random
|
2025-06-15 06:08:07 -07:00 |
|
Awni Hannun
|
a14aaa7c9d
|
Fix cuda arg reduce (#2291)
|
2025-06-14 17:54:00 -07:00 |
|
Awni Hannun
|
a6d780154f
|
fix cuda gemm for bf16 (#2288)
|
2025-06-13 22:10:46 -07:00 |
|
Awni Hannun
|
6871e2eeb7
|
fix cuda jit (#2287)
|
2025-06-13 19:21:46 -07:00 |
|
Awni Hannun
|
8402a2acf4
|
Fix complex power and print (#2286)
* fix complex power and print
* fix complex matmul shape
|
2025-06-13 11:13:00 -07:00 |
|
Cheng
|
c8b4787e4e
|
CUDA backend: indexing ops (#2277)
|
2025-06-12 21:44:19 -07:00 |
|
Awni Hannun
|
2188199ff8
|
[CUDA] ternary with select op (#2283)
* cuda ternary with select op
* comment + fix
* fix
|
2025-06-12 20:24:43 -07:00 |
|
Awni Hannun
|
aa07429bad
|
Fix cuda build (#2284)
|
2025-06-12 17:48:05 -07:00 |
|
Awni Hannun
|
918761a25a
|
[CUDA] RMSNorm and VJP (#2280)
* rms norm start
* nit
|
2025-06-12 17:09:49 -07:00 |
|
Cheng
|
a4fc671d3e
|
CUDA backend: compile (#2276)
* CUDA backend: compile
* Rename kernels/ to device/
|
2025-06-12 17:08:39 -07:00 |
|
Awni Hannun
|
f5f65ef48c
|
Make sliceUpdate general (#2282)
* Make sliceUpdate general
* fix
|
2025-06-12 16:48:54 -07:00 |
|
Cheng
|
c2dd81a8aa
|
Fix warnings from latest CUDA toolkit (#2275)
|
2025-06-12 06:03:01 -07:00 |
|
Cheng
|
d7e680ffe4
|
CUDA backend: layernorm (#2271)
|
2025-06-11 15:48:32 -07:00 |
|
Cheng
|
c371baf53a
|
CUDA backend: softmax (#2272)
|
2025-06-11 13:55:22 -07:00 |
|
Cheng
|
ccf78f566c
|
CUDA backend: argreduce (#2270)
|
2025-06-11 13:26:17 -07:00 |
|
Cheng
|
c9fa68664a
|
CUDA backend: reduce (#2269)
|
2025-06-11 11:22:25 -07:00 |
|
Awni Hannun
|
c35f4d089a
|
start cuda circle config (#2256)
* rebase
* fix metal kernel linking issue on cuda
* start cuda circle config
|
2025-06-10 21:19:47 -07:00 |
|
Cheng
|
99c33d011d
|
rebase + nit (#2260)
Co-authored-by: Awni Hannun <awni@apple.com>
|
2025-06-10 10:51:51 -07:00 |
|
Cheng
|
7c4eb5d03e
|
CUDA backend: random (#2261)
|
2025-06-10 08:59:56 -07:00 |
|
Cheng
|
bae9a6b404
|
CUDA backend: sort (#2262)
Co-authored-by: Awni Hannun <awni@apple.com>
|
2025-06-10 08:59:47 -07:00 |
|
Cheng
|
7ebb2e0193
|
CUDA backend: binary ops (#2259)
|
2025-06-10 06:37:40 -07:00 |
|
Cheng
|
f8bad60609
|
CUDA backend: unary ops (#2158)
|
2025-06-09 06:45:08 -07:00 |
|
Cheng
|
24f89173d1
|
CUDA backend: matmul (#2241)
|
2025-06-06 12:24:04 -07:00 |
|
Cheng
|
52dc8c8cd5
|
Add profiler annotations in common primitives for CUDA backend (#2244)
|
2025-06-04 19:55:12 -07:00 |
|
Cheng
|
85a8beb5e4
|
Avoid atomic updates across CPU/GPU in CUDA event (#2231)
|
2025-06-03 16:49:06 -07:00 |
|
Cheng
|
5685ceb3c7
|
Avoid invoking allocator::malloc when creating CUDA event (#2232)
|
2025-06-03 16:48:40 -07:00 |
|
Cheng
|
1b021f6984
|
Fast primitives decide when to use the fallback (#2216)
|
2025-06-02 13:26:37 -07:00 |
|
Cheng
|
db5a7c6192
|
Add memory cache to CUDA backend (#2221)
* Move BufferCache out of allocator
* Add memory cache to cuda backend allocator
* Simplify BufferCache assuming buf can not be null
|
2025-05-30 12:12:54 -07:00 |
|
Cheng
|
f76ee1ffd2
|
Move some dims utils to common (#2223)
|
2025-05-29 06:48:30 -07:00 |
|
Cheng
|
54a71f270a
|
Remove unused defines (#2217)
|
2025-05-23 06:14:58 -07:00 |
|
Cheng
|
35c87741cf
|
Build for compute capability 70 instead of 75 (#2209)
|
2025-05-20 19:42:48 -07:00 |
|
Cheng
|
237f9e58a8
|
Fix BEFORE keyword in target_include_directories (#2204)
|
2025-05-19 06:10:44 -07:00 |
|
Cheng
|
7d4b378952
|
Include cuda_bf16.h for bfloat16 overloads (#2192)
* Include cuda_bf16.h for bfloat16 overloads
* Add NO_GPU_MULTI(Eig) in cuda backend
|
2025-05-16 06:44:42 -07:00 |
|
Cheng
|
0cae0bdac8
|
CUDA backend: backbone (#2075)
|
2025-05-06 21:26:46 -07:00 |
|