Cheng
48e796bb91
Do error check for cublas handle
2025-07-24 00:32:25 +00:00
Cheng
4c0dc7745f
Make LRUCache more like a normal container
2025-07-24 00:32:25 +00:00
Cheng
3d16cb5071
Set cudnn stream before execution
2025-07-24 00:32:25 +00:00
Cheng
67a5f7b2a8
Test the native cuda graph api
2025-07-24 00:32:25 +00:00
Cheng
85510dae78
Add cache
2025-07-24 00:32:25 +00:00
Cheng
0430a6a74a
Turn off tf32
2025-07-24 00:32:25 +00:00
Cheng
6444b29651
Plan needs to be kept alive
2025-07-24 00:32:25 +00:00
Cheng
c6076fc77b
Switch to backend apis
2025-07-24 00:32:25 +00:00
Cheng
bb6a75bc4a
cudnn only accepts contiguous inputs
2025-07-24 00:32:25 +00:00
Cheng
fecc67509d
Install libcudnn9-dev-cuda-12 in CI
2025-07-24 00:32:24 +00:00
Awni Hannun
75bcb46069
include cudnn as python dep
2025-07-24 00:31:23 +00:00
Cheng
180ec0d3a5
Fix C++ conv tests
2025-07-24 00:30:38 +00:00
Cheng
cea3af6622
More unused backend apis
2025-07-24 00:30:38 +00:00
Cheng
ae9dbb1a9b
Fix recording cudnn conv
2025-07-24 00:30:38 +00:00
Cheng
6571df6ad7
Remove backend apis
2025-07-24 00:30:38 +00:00
Cheng
ad44c4bcd9
Initial implementation
2025-07-24 00:30:38 +00:00
Cheng
04bd515370
Link with cuDNN
2025-07-24 00:30:38 +00:00
Awni Hannun
d1f4d291e8
Fix uv install and add dev release ( #2411 )
...
* fix uv install and add dev release
* fix docstring
* pin cuda deps
* cuda release on cpu-only machine
2025-07-23 16:54:19 -07:00
Awni Hannun
e1840853ce
full row mask in sdpa consistently gives nan ( #2406 )
2025-07-23 16:37:03 -07:00
Cheng
0f5ce173da
[CUDA] --compress-mode requires CUDA 12.8 ( #2407 )
2025-07-23 06:11:11 -07:00
Cheng
588854195f
Remove unused code in Convolution::vjp ( #2408 )
2025-07-23 06:11:00 -07:00
Fangjun Kuang
28d068bce6
Fix an error in the comment for mx.dequantize ( #2409 )
2025-07-23 06:10:50 -07:00
Awni Hannun
d107d8d495
add cuda gemv ( #2400 )
2025-07-22 08:24:13 -07:00
Awni Hannun
1e496ddb82
[CUDA] Simplify allocator ( #2392 )
...
* simplify allocator and fix race with small pool
* Don't use shared event in worker
* use cuda buffer in small pool
* comment
* comment
2025-07-22 08:24:01 -07:00
Awni Hannun
74eccbf3fa
use size option in binary ( #2399 )
2025-07-22 07:00:53 -07:00
Awni Hannun
08638223ca
Fix including stubs in wheel ( #2398 )
...
* fix including stubs in wheel
* fix bool_
2025-07-22 06:30:17 -07:00
Cheng
56cc858af9
Add contiguous_copy_cpu util for copying array ( #2397 )
2025-07-21 07:30:35 -07:00
Cheng
f55c4ed1d6
Remove thrust iterators ( #2396 )
2025-07-21 07:30:27 -07:00
Awni Hannun
93d70419e7
[CUDA] speedup handling scalars ( #2389 )
...
* speedup scalars in cuda
* comment
2025-07-18 21:47:31 -07:00
Awni Hannun
63f663d9c6
fix cuda manylinux version to match others ( #2388 )
2025-07-18 21:02:16 -07:00
Awni Hannun
84b4d96efa
fix release build + patch bump ( #2387 )
v0.26.5
2025-07-18 14:47:37 -07:00
Awni Hannun
aec67f2fa6
patch bump ( #2386 )
2025-07-18 12:25:48 -07:00
Gökdeniz Gülmez
deee214a95
Adding support for the Muon Optimizer ( #1914 )
...
* initial commit with working optimizer
* update ACKNOWLEDGMENTS.md
* nits and adding it to test
* nits
* G.astype(mx.bfloat16) to G.astype(G.dtype)
* G.ndim >= 2 to assert G.ndim == 2
* remove comments
* replace with mx.addmm
* remove comments
* format
* nits
* match muon
* fix addmm
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2025-07-18 12:25:28 -07:00
Cheng
45adec102c
Add contiguous_copy_gpu util for copying array ( #2379 )
2025-07-18 06:44:25 -07:00
Cheng
31fc530c76
[CUDA] Add more ways finding CCCL headers in JIT ( #2382 )
2025-07-17 15:25:34 -07:00
Awni Hannun
fbb3f65a1a
fix resource leaks in matmul and graph ( #2383 )
2025-07-17 06:50:15 -07:00
Angelos Katharopoulos
6b1b8ea91b
[CUDA] Add work per thread to compile ( #2368 )
2025-07-17 06:47:52 -07:00
Awni Hannun
b2273733ea
Test with CUDA 12.2 ( #2375 )
...
* Test with CUDA 12.0
* try older image
* fix cpu sort
2025-07-16 13:00:37 -07:00
Awni Hannun
f409b229a4
fix ring distributed test ( #2380 )
2025-07-16 11:25:24 -07:00
Cheng
30571e2326
Rename the copy util in cpu/copy.h to copy_cpu ( #2378 )
2025-07-16 07:34:24 -07:00
Awni Hannun
d7734edd9f
fix complex reduce + nan propagation in min and max ( #2377 )
2025-07-15 18:19:47 -07:00
Awni Hannun
2ba69bc8fa
lower memory uniform sampling ( #2361 )
...
* lower memory uniform
* use fp32
* fix
2025-07-15 14:22:07 -07:00
Cheng
cb349a291c
[CUDA] Use cuda::std::complex in place of cuComplex ( #2372 )
2025-07-15 00:36:13 -07:00
Awni Hannun
f0a0b077a0
Install linux with mlx[cuda] and mlx[cpu] ( #2356 )
...
* install linux with mlx[cuda] and mlx[cpu]
* temp for testing
* cleanup circle, fix cuda repair
* update circle
* update circle
* decouple python bindings from core libraries
2025-07-14 17:17:33 -07:00
Awni Hannun
49114f28ab
fix flaky test ( #2371 )
2025-07-14 17:16:18 -07:00
Awni Hannun
e7d2ebadd2
[CUDA] Affine quantize ( #2354 )
...
* affine quantize and dequantize kernels
* format
* fix
* format
2025-07-14 15:45:44 -07:00
Awni Hannun
e569803d7c
update linux build ( #2370 )
2025-07-14 15:13:56 -07:00
Cheng
d34f887abc
Add Primitive::name and remove Primitive::print ( #2365 )
2025-07-14 14:06:35 -07:00
Angelos Katharopoulos
5201df5030
Fix imag() vjp ( #2367 )
2025-07-14 13:11:16 -07:00
Cheng
2d3c26c565
[CUDA] Do not put kernels in anonymous namespace ( #2362 )
2025-07-12 14:24:45 -07:00