Commit Graph

9 Commits

Author SHA1 Message Date
Awni Hannun
ef631d63af
faster rms norm (#2433) 2025-07-29 13:12:00 -07:00
Cheng
f55c4ed1d6
Remove thrust iterators (#2396) 2025-07-21 07:30:27 -07:00
Cheng
45adec102c
Add contiguous_copy_gpu util for copying array (#2379) 2025-07-18 06:44:25 -07:00
Awni Hannun
ec0d5db67b
[CUDA] Switch to CUDA graphs (#2317)
* cuda graph prototype

fix signal bug + start to add dependencies

capture more

capture more ops

remaining ops

fix reduce and rope deps

add concurrent context

try update, but not working

cosistent topology order

use node api

use node api directly to reduce overhead

fix bug

use kernels in unary

cache graph

format

fix synchronization

format

* comment
2025-07-02 15:59:13 -07:00
Cheng
e76e9b87f0
Fix compilation error from integral_constant (#2326) 2025-07-02 06:04:38 -07:00
Awni Hannun
dd4f53db63
use fp32 for testing, add more complex ops (#2322) 2025-07-01 07:30:00 -07:00
Angelos Katharopoulos
3d5e17e507
MLX_SWITCH macros to templates (#2320) 2025-07-01 01:33:44 -07:00
Awni Hannun
918761a25a
[CUDA] RMSNorm and VJP (#2280)
* rms norm start

* nit
2025-06-12 17:09:49 -07:00
Cheng
d7e680ffe4
CUDA backend: layernorm (#2271) 2025-06-11 15:48:32 -07:00