Jagrit Digani
a9bdd67baa
Add CUDA sdpa vector ( #2468 )
2025-08-06 21:40:26 -07:00
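A "vector" SDPA kernel targets decode, where the query is a single vector rather than a matrix. A minimal host-side C++ sketch of the computation such a kernel parallelizes (layout and names are illustrative, not the PR's actual kernel):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // One-query scaled dot-product attention: out = softmax(q.K^T / sqrt(d)) . V,
    // with K and V stored row-major as [n_kv, d]. Illustrative reference only.
    std::vector<float> sdpa_vector(const std::vector<float>& q,
                                   const std::vector<float>& K,
                                   const std::vector<float>& V,
                                   int n_kv, int d) {
      std::vector<float> s(n_kv);
      const float scale = 1.0f / std::sqrt(static_cast<float>(d));
      float m = -INFINITY;
      for (int i = 0; i < n_kv; ++i) {
        float dot = 0.0f;
        for (int j = 0; j < d; ++j) dot += q[j] * K[i * d + j];
        s[i] = dot * scale;
        m = std::max(m, s[i]);
      }
      float denom = 0.0f;  // numerically stable softmax over the scores
      for (float& v : s) denom += (v = std::exp(v - m));
      std::vector<float> out(d, 0.0f);
      for (int i = 0; i < n_kv; ++i)
        for (int j = 0; j < d; ++j) out[j] += (s[i] / denom) * V[i * d + j];
      return out;
    }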
Angelos Katharopoulos
f2adb5638d
Fix typo in metal command encoder ( #2471 )
2025-08-06 16:58:23 -07:00
Awni Hannun
7bb96e4249
fix cublas on h100 ( #2466 )
2025-08-06 06:18:58 -07:00
Cheng
828c5f1137
Use SmallVector for shapes and strides ( #2454 )
* Use SmallVector for shapes and strides
* Convert SmallVector to tuple
2025-08-05 09:41:03 +09:00
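The motivation is small-buffer optimization: shapes and strides are typically only a handful of elements, so storing them inline avoids a heap allocation per array. A minimal sketch of the idea, not MLX's actual SmallVector:

    #include <algorithm>
    #include <cstddef>
    #include <memory>

    // Minimal small-buffer vector: the first N elements live inline, so a
    // typical 1-4 dimensional shape never touches the heap.
    template <typename T, std::size_t N>
    class SmallVector {
      T inline_[N];
      T* data_ = inline_;
      std::size_t size_ = 0, capacity_ = N;
      std::unique_ptr<T[]> heap_;

     public:
      void push_back(T v) {
        if (size_ == capacity_) {  // spill to the heap only when storage fills
          auto bigger = std::make_unique<T[]>(capacity_ *= 2);
          std::copy(data_, data_ + size_, bigger.get());
          heap_ = std::move(bigger);  // frees any previous heap block
          data_ = heap_.get();
        }
        data_[size_++] = v;
      }
      T& operator[](std::size_t i) { return data_[i]; }
      std::size_t size() const { return size_; }
    };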
Zamderax
737dd6d1ac
Add missing <algorithm> header to jit_compiler.cpp ( #2460 )
Fixes a compilation error on Linux where std::find_if is used on line 121
but the <algorithm> header was not included. While this might work on
some platforms due to transitive includes, it's not guaranteed by the
C++ standard.
Resolves issue #2459
2025-08-04 14:00:46 -07:00
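The fix itself is one include. Illustrative of the pattern (not the actual jit_compiler.cpp code): any use of std::find_if needs <algorithm> explicitly, since transitive availability varies by standard library:

    #include <algorithm>  // required for std::find_if; don't rely on transitive includes
    #include <string>
    #include <vector>

    bool has_flag(const std::vector<std::string>& args, const std::string& flag) {
      return std::find_if(args.begin(), args.end(),
                          [&](const std::string& a) { return a == flag; }) != args.end();
    }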
Cheng
aaf78f4c6b
Use LRU cache for cuda graph ( #2448 )
* Use LRU cache for cuda graph
* Remove unused destructor
2025-08-02 21:28:57 +09:00
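Each distinct graph topology gets its own CUDA graph, so an unbounded cache can grow indefinitely; LRU eviction caps it. A generic sketch of the LRU technique (the key and value types stand in for whatever the CUDA backend actually caches):

    #include <cstddef>
    #include <list>
    #include <unordered_map>
    #include <utility>

    // Classic LRU cache: a list keeps entries in recency order (front = most
    // recently used) and a map gives O(1) lookup into the list.
    template <typename K, typename V>
    class LRUCache {
      std::size_t capacity_;
      std::list<std::pair<K, V>> items_;
      std::unordered_map<K, typename std::list<std::pair<K, V>>::iterator> map_;

     public:
      explicit LRUCache(std::size_t capacity) : capacity_(capacity) {}

      V* get(const K& key) {
        auto it = map_.find(key);
        if (it == map_.end()) return nullptr;
        items_.splice(items_.begin(), items_, it->second);  // mark as recent
        return &it->second->second;
      }

      void put(const K& key, V value) {
        if (auto* v = get(key)) { *v = std::move(value); return; }
        items_.emplace_front(key, std::move(value));
        map_[key] = items_.begin();
        if (items_.size() > capacity_) {  // evict the least recently used entry
          map_.erase(items_.back().first);
          items_.pop_back();
        }
      }
    };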
Angelos Katharopoulos
be9bc96da4
[CUDA] Matmul utils initial commit ( #2441 )
2025-08-01 14:22:25 -07:00
Angelos Katharopoulos
86258f292f
[CUDA] Vectorize generated kernels ( #2444 )
2025-07-31 18:18:57 -07:00
Cheng
b26d88591c
[CUDA] Save primitive inputs faster ( #2449 )
* Add more nvtx logging
* [CUDA] Saving primitive inputs faster
* Remove unneeded check
2025-08-01 10:16:06 +09:00
Cheng
86c6a15571
[CUDA] Backward convolution ( #2431 )
2025-08-01 09:54:05 +09:00
Cheng
daafee676f
Fix wrong graph key when using concurrent context ( #2447 )
2025-07-31 06:01:05 -07:00
Awni Hannun
d32519c8ee
fix gemv regression ( #2445 )
2025-07-30 14:23:01 -07:00
Angelos Katharopoulos
3bf81ed1bd
[CUDA] Quantized refactoring ( #2442 )
2025-07-30 08:27:20 -07:00
Cheng
3628e5d497
Use load_vector in arg_reduce ( #2439 )
2025-07-30 17:40:26 +09:00
Cheng
a0ae49d397
Move arange to its own file ( #2438 )
2025-07-30 13:05:51 +09:00
Cheng
254476718b
Remove the kernel arg from get_launch_args ( #2437 )
2025-07-30 11:43:02 +09:00
Awni Hannun
3adba92ebe
CUDA faster softmax ( #2435 )
* faster softmax and logsumexp
* faster softmax and logsumexp
* format
2025-07-29 17:18:12 -07:00
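Both kernels reduce to the numerically stable formulation logsumexp(x) = m + log(sum_i exp(x_i - m)) with m = max(x); subtracting the max keeps every exponent non-positive. A scalar C++ reference of the math the fused kernels compute:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // logsumexp(x) = m + log(sum_i exp(x_i - m)), m = max(x).
    // Subtracting the max keeps every exp() argument <= 0, avoiding overflow.
    float logsumexp(const std::vector<float>& x) {
      float m = *std::max_element(x.begin(), x.end());
      float sum = 0.0f;
      for (float v : x) sum += std::exp(v - m);
      return m + std::log(sum);
    }

    // softmax shares the same max/sum passes, then normalizes.
    std::vector<float> softmax(const std::vector<float>& x) {
      float m = *std::max_element(x.begin(), x.end());
      std::vector<float> out(x.size());
      float sum = 0.0f;
      for (std::size_t i = 0; i < x.size(); ++i) sum += out[i] = std::exp(x[i] - m);
      for (float& v : out) v /= sum;
      return out;
    }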
Awni Hannun
ef631d63af
faster rms norm ( #2433 )
2025-07-29 13:12:00 -07:00
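RMS norm skips layer norm's mean subtraction and divides by the root mean square: y_i = g_i * x_i / sqrt(mean(x^2) + eps). A reference C++ version of what the fast kernel fuses into one pass:

    #include <cmath>
    #include <vector>

    // y_i = g_i * x_i / sqrt(mean(x^2) + eps); no centering, unlike layernorm.
    std::vector<float> rms_norm(const std::vector<float>& x,
                                const std::vector<float>& g, float eps = 1e-5f) {
      float ms = 0.0f;
      for (float v : x) ms += v * v;
      ms /= static_cast<float>(x.size());
      const float inv = 1.0f / std::sqrt(ms + eps);
      std::vector<float> y(x.size());
      for (std::size_t i = 0; i < x.size(); ++i) y[i] = g[i] * x[i] * inv;
      return y;
    }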
Awni Hannun
641be9463b
Add more CUDA architectures for PyPi package ( #2427 )
* add cuda sm 90
* add more archs
2025-07-28 12:35:15 -07:00
Awni Hannun
ab0e608862
[CUDA] More sizes for gemv ( #2429 )
* route more to gemv
* route more sizes to custom gemv
2025-07-28 12:35:01 -07:00
Awni Hannun
1588659062
no occupancy query for launch params ( #2426 )
2025-07-28 09:09:41 -07:00
Awni Hannun
b9e88fb976
[CUDA] Fix segfault on exit ( #2424 )
* fix cuda segfault on exit
* comment
2025-07-27 08:08:13 -07:00
Awni Hannun
5597fa089c
Fix qvm splitk ( #2415 )
2025-07-25 11:50:24 -07:00
Awni Hannun
9acec364c2
[CUDA] Always use batched matmul ( #2404 )
* cuda batched mm
* addmm as well
* comment
2025-07-24 20:46:02 -07:00
Cheng
6f5874a2f2
[CUDA] Initial implementation of Convolution with cuDNN ( #2385 )
* Link with cuDNN
* Initial implementation
* Remove backend apis
* Fix recording cudnn conv
* More unused backend apis
* Fix C++ conv tests
* include cudnn as python dep
* Install libcudnn9-dev-cuda-12 in CI
* cudnn only accepts contiguous inputs
* Switch to backend apis
* Plan needs to be kept alive
* Turn off tf32
* Add cache
* Test the native cuda graph api
* Set cudnn stream before execution
* Make LRUCache more like a normal container
* Do error check for cublas handle
* Zero-initializing array
* Use tf32 for conv
* Skip TestConv.test_torch_conv_2D test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2025-07-25 08:12:10 +09:00
Awni Hannun
4e504039f5
[Metal] Release metal events ( #2412 )
* release metal events
* fix
* fix
2025-07-23 19:53:42 -07:00
Cheng
0f5ce173da
[CUDA] --compress-mode requires CUDA 12.8 ( #2407 )
2025-07-23 06:11:11 -07:00
Awni Hannun
d107d8d495
add cuda gemv ( #2400 )
2025-07-22 08:24:13 -07:00
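gemv is the matrix-vector case, y = A x, that general matmul kernels handle poorly for skinny shapes. The serial reference below; on the GPU each row's dot product reduces independently, which is what makes a dedicated kernel pay off:

    #include <vector>

    // y = A x, with A row-major [m, n]. Each row is an independent dot
    // product, so a GPU kernel can assign rows to threadblocks.
    std::vector<float> gemv(const std::vector<float>& A,
                            const std::vector<float>& x, int m, int n) {
      std::vector<float> y(m, 0.0f);
      for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) y[i] += A[i * n + j] * x[j];
      return y;
    }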
Awni Hannun
1e496ddb82
[CUDA] Simplify allocator ( #2392 )
* simplify allocator and fix race with small pool
* Don't use shared event in worker
* use cuda buffer in small pool
* comment
* comment
2025-07-22 08:24:01 -07:00
Awni Hannun
74eccbf3fa
use size option in binary ( #2399 )
2025-07-22 07:00:53 -07:00
Cheng
56cc858af9
Add contiguous_copy_cpu util for copying array ( #2397 )
2025-07-21 07:30:35 -07:00
Cheng
f55c4ed1d6
Remove thrust iterators ( #2396 )
2025-07-21 07:30:27 -07:00
Awni Hannun
93d70419e7
[CUDA] speedup handling scalars ( #2389 )
* speedup scalars in cuda
* comment
2025-07-18 21:47:31 -07:00
Gökdeniz Gülmez
deee214a95
Adding support for the Muon Optimizer ( #1914 )
* initial commit with working optimizer
* update ACKNOWLEDGMENTS.md
* nits and adding it to test
* nits
* G.astype(mx.bfloat16) to G.astype(G.dtype)
* G.ndim >= 2 to assert G.ndim == 2
* remove comments
* replace with mx.addmm
* remove comments
* format
* nits
* match muon
* fix addmm
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2025-07-18 12:25:28 -07:00
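Muon orthogonalizes the momentum-accumulated 2-D gradient with a few Newton-Schulz iterations instead of an SVD (the mx.addmm swap above fuses the multiply-accumulates). A dense C++ sketch of that iteration; the quintic coefficients and 5-step count are the ones commonly published for Muon, not taken from this PR:

    #include <cmath>
    #include <vector>

    using Mat = std::vector<float>;  // row-major

    // C (rows x cols) = A (rows x inner) * B (inner x cols)
    static Mat matmul(const Mat& A, const Mat& B, int rows, int inner, int cols) {
      Mat C(rows * cols, 0.0f);
      for (int i = 0; i < rows; ++i)
        for (int k = 0; k < inner; ++k)
          for (int j = 0; j < cols; ++j)
            C[i * cols + j] += A[i * inner + k] * B[k * cols + j];
      return C;
    }

    // Approximately orthogonalize an n x m matrix (n <= m) in place via
    // Newton-Schulz: X <- a*X + (b*A + c*A*A)*X with A = X*X^T.
    void newton_schulz(Mat& X, int n, int m, int steps = 5) {
      const float a = 3.4445f, b = -4.7750f, c = 2.0315f;
      float fro = 0.0f;
      for (float v : X) fro += v * v;
      fro = std::sqrt(fro) + 1e-7f;
      for (float& v : X) v /= fro;  // bound the spectral norm before iterating
      for (int s = 0; s < steps; ++s) {
        Mat Xt(m * n);
        for (int i = 0; i < n; ++i)
          for (int j = 0; j < m; ++j) Xt[j * n + i] = X[i * m + j];
        Mat A = matmul(X, Xt, n, m, n);  // n x n Gram matrix
        Mat B = matmul(A, A, n, n, n);
        for (int i = 0; i < n * n; ++i) B[i] = b * A[i] + c * B[i];
        Mat BX = matmul(B, X, n, n, m);
        for (int i = 0; i < n * m; ++i) X[i] = a * X[i] + BX[i];
      }
    }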
Cheng
45adec102c
Add contiguous_copy_gpu util for copying array ( #2379 )
2025-07-18 06:44:25 -07:00
Cheng
31fc530c76
[CUDA] Add more ways finding CCCL headers in JIT ( #2382 )
2025-07-17 15:25:34 -07:00
Awni Hannun
fbb3f65a1a
fix resource leaks in matmul and graph ( #2383 )
2025-07-17 06:50:15 -07:00
Angelos Katharopoulos
6b1b8ea91b
[CUDA] Add work per thread to compile ( #2368 )
2025-07-17 06:47:52 -07:00
Awni Hannun
b2273733ea
Test with CUDA 12.2 ( #2375 )
* Test with CUDA 12.0
* try older image
* fix cpu sort
2025-07-16 13:00:37 -07:00
Cheng
30571e2326
Rename the copy util in cpu/copy.h to copy_cpu ( #2378 )
2025-07-16 07:34:24 -07:00
Awni Hannun
d7734edd9f
fix complex reduce + nan propagation in min and max ( #2377 )
2025-07-15 18:19:47 -07:00
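The NaN half of the fix matters because every comparison with NaN is false, so a comparison-based max silently drops NaNs; propagation needs an explicit check, as in this sketch:

    #include <cmath>

    // std::max(a, NaN) returns a because NaN comparisons are false;
    // a NaN-propagating reduction must test explicitly.
    inline float nan_propagating_max(float a, float b) {
      if (std::isnan(a) || std::isnan(b)) return NAN;
      return a > b ? a : b;
    }

    inline float nan_propagating_min(float a, float b) {
      if (std::isnan(a) || std::isnan(b)) return NAN;
      return a < b ? a : b;
    }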
Cheng
cb349a291c
[CUDA] Use cuda::std::complex in place of cuComplex ( #2372 )
2025-07-15 00:36:13 -07:00
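cuda::std::complex, from the CUDA toolkit's libcu++, is a std::complex work-alike usable in device code; it replaces cuComplex's C-style helper functions with ordinary operators. A sketch of the difference (compiles with nvcc; assumes <cuda/std/complex> is on the include path):

    #include <cuComplex.h>
    #include <cuda/std/complex>

    // Before: cuComplex's C-style helpers, no operator overloads.
    cuFloatComplex mul_old(cuFloatComplex a, cuFloatComplex b) {
      return cuCmulf(a, b);
    }

    // After: behaves like std::complex, and the same expression also
    // compiles in __device__ code.
    cuda::std::complex<float> mul_new(cuda::std::complex<float> a,
                                      cuda::std::complex<float> b) {
      return a * b;
    }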
Awni Hannun
e7d2ebadd2
[CUDA] Affine quantize ( #2354 )
* affine quantize and dequantize kernels
* format
* fix
* format
2025-07-14 15:45:44 -07:00
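Affine quantization stores each group of weights as low-bit integers q plus a per-group scale s and bias b, with w ~= s*q + b. A scalar C++ sketch of one group's round trip (the 4-bit width and min/max calibration here are illustrative choices, not necessarily the kernels' exact scheme):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Quantize one group to 4 bits: w ~= scale * q + bias, q in [0, 15].
    void affine_quantize(const std::vector<float>& w, std::vector<uint8_t>& q,
                         float& scale, float& bias) {
      auto [mn, mx] = std::minmax_element(w.begin(), w.end());
      bias = *mn;
      scale = (*mx - *mn) / 15.0f;  // 15 = 2^4 - 1 quantization levels
      q.resize(w.size());
      for (std::size_t i = 0; i < w.size(); ++i) {
        float t = (w[i] - bias) / (scale == 0.0f ? 1.0f : scale);
        q[i] = static_cast<uint8_t>(std::clamp(std::lround(t), 0L, 15L));
      }
    }

    float affine_dequantize(uint8_t q, float scale, float bias) {
      return scale * static_cast<float>(q) + bias;
    }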
Cheng
d34f887abc
Add Primitive::name and remove Primitive::print ( #2365 )
2025-07-14 14:06:35 -07:00
Cheng
2d3c26c565
[CUDA] Do not put kernels in anonymous namespace ( #2362 )
2025-07-12 14:24:45 -07:00
Cheng
6325f60d52
[CUDA] Bundle CCCL for JIT compilation ( #2357 )
* Ship CCCL for JIT compilation
* Remove cexpf
2025-07-11 18:45:37 -07:00
Awni Hannun
42cc9cfbc7
fix copy dispatch ( #2360 )
2025-07-11 10:59:35 -07:00
Cheng
8347575ba1
[CUDA] Implement Scan kernel ( #2347 )
* Contiguous scan
* Strided scan
* Enable tests
* Fix failing logaddexp test
* Use cexpf in Metal
2025-07-10 16:54:12 -07:00
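A scan carries a running accumulator along the reduction axis; "contiguous" means scanning the innermost axis, "strided" an outer one. The serial semantics the kernel implements in parallel:

    #include <vector>

    // Inclusive scan along the last axis of a [rows, n] row-major array.
    // The strided variant walks the same recurrence with a non-unit step.
    std::vector<float> inclusive_scan_rows(const std::vector<float>& x,
                                           int rows, int n) {
      std::vector<float> out(x.size());
      for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int i = 0; i < n; ++i) {
          acc += x[r * n + i];
          out[r * n + i] = acc;
        }
      }
      return out;
    }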
Angelos Katharopoulos
b6eec20260
Fix edge check in qmm_n QuantizedLoader ( #2355 )
2025-07-10 16:28:50 -07:00
Cheng
afb9817599
[CUDA] Put version in ptx cache dir path ( #2352 )
2025-07-10 07:24:21 -07:00