Cheng
9d10239af7
[CUDA] Do vectorized store/load in binary ops ( #2330 )
2025-07-07 08:44:14 -07:00
Cheng
19facd4b20
Build with all cpu cores by default ( #2336 )
2025-07-07 06:06:45 -07:00
Angelos Katharopoulos
f5299f72cd
Fix layernorm race condition ( #2340 )
2025-07-07 06:06:01 -07:00
Cheng
0e0d9ac522
[CUDA] Add MLX_CUDA_GRAPH_CACHE_SIZE env for setting graph cache size ( #2329 )
2025-07-05 08:33:29 -07:00
Awni Hannun
8917022deb
fix graphs for older cuda ( #2328 )
2025-07-02 19:37:58 -07:00
Awni Hannun
ec0d5db67b
[CUDA] Switch to CUDA graphs ( #2317 )
...
* cuda graph prototype
fix signal bug + start to add dependencies
capture more
capture more ops
remaining ops
fix reduce and rope deps
add concurrent context
try update, but not working
consistent topology order
use node api
use node api directly to reduce overhead
fix bug
use kernels in unary
cache graph
format
fix synchronization
format
* comment
2025-07-02 15:59:13 -07:00
Cheng
e76e9b87f0
Fix compilation error from integral_constant ( #2326 )
2025-07-02 06:04:38 -07:00
Awni Hannun
cfb6a244ea
allow parameters to be deleted ( #2325 )
2025-07-01 21:27:23 -07:00
Awni Hannun
58f3860306
patch bump ( #2324 )
2025-07-01 12:12:16 -07:00
Awni Hannun
dd4f53db63
use fp32 for testing, add more complex ops ( #2322 )
2025-07-01 07:30:00 -07:00
Angelos Katharopoulos
3d5e17e507
MLX_SWITCH macros to templates ( #2320 )
2025-07-01 01:33:44 -07:00
Awni Hannun
33bf1a244b
Fix module update in strict mode ( #2321 )
...
* fix module update in strict mode
* allow GELU to be pickled
2025-06-29 11:12:29 -07:00
Angelos Katharopoulos
772f471ff2
[CUDA] Fix reductions ( #2314 )
2025-06-27 12:59:20 -07:00
Angelos Katharopoulos
2c11d10f8d
Split broadcast so it is always fused in compile ( #2318 )
2025-06-26 22:08:18 -07:00
Angelos Katharopoulos
656ed7f780
Fix get 2d grid dims ( #2316 )
2025-06-25 13:03:09 -07:00
Awni Hannun
81bb9a2a9e
Compile float64 functions on CPU ( #2311 )
2025-06-24 10:18:52 -07:00
Angelos Katharopoulos
5adf185f86
Fix update_modules() when providing a subset ( #2308 )
2025-06-20 17:19:46 -07:00
Awni Hannun
c9a9180584
Cuda perf tuning ( #2307 )
...
* perf tuning
* fix adding inputs arrays in matmul / sort
* format
* fix
2025-06-20 14:50:57 -07:00
Awni Hannun
76831ed83d
Build CUDA release in Circle ( #2306 )
...
* cuda release
* add license
2025-06-19 15:26:36 -07:00
Angelos Katharopoulos
b3d7b85376
Make ptx cache settable by environment variable ( #2304 )
2025-06-17 23:55:56 -07:00
Awni Hannun
cad5c0241c
[CUDA] synch properly waits for all tasks to finish and clear ( #2303 )
...
* cuda synch properly waits for all tasks to finish and clear
* fix copy
2025-06-17 12:03:25 -07:00
Awni Hannun
b8022c578a
divmod, partition, sort fixes ( #2302 )
2025-06-16 18:49:32 -07:00
Awni Hannun
bc53f8293f
Cuda bug fixes 2 ( #2298 )
...
* more bug fixes
* more bug fixes
* format
2025-06-16 13:14:46 -07:00
Awni Hannun
c552ff2451
[CUDA] Fix back-end bugs and enable corresponding tests ( #2296 )
...
* Fix some cuda back-end bugs and enable corresponding tests
* more fixes
* enable more tests
* format
2025-06-16 08:45:40 -07:00
Awni Hannun
4fda5fbdf9
add python testing for cuda with ability to skip list of tests ( #2295 )
2025-06-15 10:56:48 -07:00
Angelos Katharopoulos
580776559b
RoPE for CUDA ( #2293 )
...
* First working CUDA rope
* Fix random
2025-06-15 06:08:07 -07:00
Awni Hannun
a14aaa7c9d
Fix cuda arg reduce ( #2291 )
2025-06-14 17:54:00 -07:00
Awni Hannun
a6d780154f
fix cuda gemm for bf16 ( #2288 )
2025-06-13 22:10:46 -07:00
Awni Hannun
6871e2eeb7
fix cuda jit ( #2287 )
2025-06-13 19:21:46 -07:00
Awni Hannun
8402a2acf4
Fix complex power and print ( #2286 )
...
* fix complex power and print
* fix complex matmul shape
2025-06-13 11:13:00 -07:00
Jagrit Digani
fddb6933e1
Collection of refactors ( #2274 )
...
* Refactor gemv into a function
* Refactor splitk step 1
* Refactor split k axpby
* Rearrange steel_gemm_regular
* Redirect steel_gemm_regular
* Add axpby routing to steel_matmul_regular
* Refactor AddMM step 1
* Redirect steel_gemm
* Update addmm
* Comments and format
* Some cleanup
* Add architecture gen to device
* Update no copy condition in normalization to account for axis size 1
2025-06-13 10:44:56 -07:00
Cheng
c8b4787e4e
CUDA backend: indexing ops ( #2277 )
2025-06-12 21:44:19 -07:00
Awni Hannun
2188199ff8
[CUDA] ternary with select op ( #2283 )
...
* cuda ternary with select op
* comment + fix
* fix
2025-06-12 20:24:43 -07:00
Awni Hannun
aa07429bad
Fix cuda build ( #2284 )
2025-06-12 17:48:05 -07:00
Awni Hannun
918761a25a
[CUDA] RMSNorm and VJP ( #2280 )
...
* rms norm start
* nit
2025-06-12 17:09:49 -07:00
Cheng
a4fc671d3e
CUDA backend: compile ( #2276 )
...
* CUDA backend: compile
* Rename kernels/ to device/
2025-06-12 17:08:39 -07:00
Awni Hannun
f5f65ef48c
Make sliceUpdate general ( #2282 )
...
* Make sliceUpdate general
* fix
2025-06-12 16:48:54 -07:00
Cheng
c2dd81a8aa
Fix warnings from latest CUDA toolkit ( #2275 )
2025-06-12 06:03:01 -07:00
Cheng
d7e680ffe4
CUDA backend: layernorm ( #2271 )
2025-06-11 15:48:32 -07:00
Cheng
c371baf53a
CUDA backend: softmax ( #2272 )
2025-06-11 13:55:22 -07:00
Cheng
ccf78f566c
CUDA backend: argreduce ( #2270 )
2025-06-11 13:26:17 -07:00
Cheng
c9fa68664a
CUDA backend: reduce ( #2269 )
2025-06-11 11:22:25 -07:00
Awni Hannun
c35f4d089a
start cuda circle config ( #2256 )
...
* rebase
* fix metal kernel linking issue on cuda
* start cuda circle config
2025-06-10 21:19:47 -07:00
Angelos Katharopoulos
8590c0941e
Add load_safe to the general conv loaders ( #2258 )
2025-06-10 20:58:16 -07:00
Cheng
095163b8d1
Fix building cpp benchmarks on Linux ( #2268 )
2025-06-10 17:10:24 -07:00
Cheng
99c33d011d
rebase + nit ( #2260 )
...
Co-authored-by: Awni Hannun <awni@apple.com>
2025-06-10 10:51:51 -07:00
Awni Hannun
62fecf3e13
fix conv export ( #2265 )
2025-06-10 09:34:01 -07:00
Cheng
7c4eb5d03e
CUDA backend: random ( #2261 )
2025-06-10 08:59:56 -07:00
Cheng
bae9a6b404
CUDA backend: sort ( #2262 )
...
Co-authored-by: Awni Hannun <awni@apple.com>
2025-06-10 08:59:47 -07:00
Christopher Fleetwood
004c1d8ef2
Report number of missing parameters ( #2264 )
...
* chore: inform
* chore: format
---------
Co-authored-by: FL33TW00D <FL33TW00D@users.noreply.github.com>
2025-06-10 06:37:50 -07:00