Awni Hannun
d9d0777c2e
docs up
2025-07-25 22:53:08 +00:00
Awni Hannun
d67cd9230c
docs up
2025-07-25 22:53:07 +00:00
Awni Hannun
d03b91923e
docs update
2025-07-25 22:53:07 +00:00
Awni Hannun
8bea0a4eb8
docs
2025-07-25 22:53:07 +00:00
Awni Hannun
0250e203f6
docs
2025-07-25 22:53:07 +00:00
Awni Hannun
f75712551d
update docs
2025-07-25 22:53:07 +00:00
Awni Hannun
af2c3689fe
docs
2025-07-25 22:53:07 +00:00
Awni Hannun
5ac2eec7b3
docs
2025-07-25 22:53:07 +00:00
Awni Hannun
f89de9c25d
docs
2025-07-25 22:53:07 +00:00
Awni Hannun
9ad2650c9d
docs
2025-07-25 22:53:07 +00:00
Awni Hannun
ea288788f8
docs
2025-07-25 22:53:07 +00:00
Awni Hannun
efe5c824af
docs
2025-07-25 22:53:07 +00:00
Awni Hannun
e6ffce1a9b
docs
2025-07-25 22:53:07 +00:00
Awni Hannun
a847d1dbd0
docs
2025-07-25 22:53:07 +00:00
Awni Hannun
dca1d17eb9
docs
2025-07-25 22:53:07 +00:00
Awni Hannun
4ad53414dd
fix cuda pypi package ( #2423 )
* fix cuda pypi package
* patch bump
2025-07-25 15:20:29 -07:00
Awni Hannun
d1165b215e
version ( #2420 )
2025-07-25 13:29:28 -07:00
Awni Hannun
dcb8319f3d
update install docs and requirements ( #2419 )
2025-07-25 12:13:19 -07:00
Awni Hannun
5597fa089c
Fix qvm splitk ( #2415 )
2025-07-25 11:50:24 -07:00
Awni Hannun
9acec364c2
[CUDA] Always use batched matmul ( #2404 )
* cuda batched mm
* addmm as well
* comment
2025-07-24 20:46:02 -07:00
Skonor
7d9d6ef456
docs: fix adam and adamw eps placement ( #2416 )
Co-authored-by: Mikhail Gorbunov <m_gorbunov@apple.com>
2025-07-24 16:40:45 -07:00
Cheng
6f5874a2f2
[CUDA] Initial implementation of Convolution with cuDNN ( #2385 )
* Link with cuDNN
* Initial implementation
* Remove backend apis
* Fix recording cudnn conv
* More unused backend apis
* Fix C++ conv tests
* include cudnn as python dep
* Install libcudnn9-dev-cuda-12 in CI
* cudnn only accepts contiguous inputs
* Switch to backend apis
* Plan needs to be kept alive
* Turn off tf32
* Add cache
* Test the native cuda graph api
* Set cudnn stream before execution
* Make LRUCache more like a normal container
* Do error check for cublas handle
* Zero-initializing array
* Use tf32 for conv
* Skip TestConv.test_torch_conv_2D test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2025-07-25 08:12:10 +09:00
Awni Hannun
70dc336785
Test on cuda 12.2 and 12.9 ( #2413 )
2025-07-24 06:06:15 -07:00
Awni Hannun
4e504039f5
[Metal] Release metal events ( #2412 )
* release metal events
* fix
* fix
2025-07-23 19:53:42 -07:00
Awni Hannun
d1f4d291e8
Fix uv install and add dev release ( #2411 )
* fix uv install and add dev release
* fix docstring
* pin cuda deps
* cuda release on cpu-only machine
2025-07-23 16:54:19 -07:00
Awni Hannun
e1840853ce
full row mask in sdpa consistently gives nan ( #2406 )
2025-07-23 16:37:03 -07:00
Cheng
0f5ce173da
[CUDA] --compress-mode requires CUDA 12.8 ( #2407 )
2025-07-23 06:11:11 -07:00
Cheng
588854195f
Remove unused code in Convolution::vjp ( #2408 )
2025-07-23 06:11:00 -07:00
Fangjun Kuang
28d068bce6
Fix an error in the comment for mx.dequantize ( #2409 )
2025-07-23 06:10:50 -07:00
Awni Hannun
d107d8d495
add cuda gemv ( #2400 )
2025-07-22 08:24:13 -07:00
Awni Hannun
1e496ddb82
[CUDA] Simplify allocator ( #2392 )
* simplify allocator and fix race with small pool
* Don't use shared event in worker
* use cuda buffer in small pool
* comment
* comment
2025-07-22 08:24:01 -07:00
Awni Hannun
74eccbf3fa
use size option in binary ( #2399 )
2025-07-22 07:00:53 -07:00
Awni Hannun
08638223ca
Fix including stubs in wheel ( #2398 )
* fix including stubs in wheel
* fix bool_
2025-07-22 06:30:17 -07:00
Cheng
56cc858af9
Add contiguous_copy_cpu util for copying array ( #2397 )
2025-07-21 07:30:35 -07:00
Cheng
f55c4ed1d6
Remove thrust iterators ( #2396 )
2025-07-21 07:30:27 -07:00
Awni Hannun
93d70419e7
[CUDA] speedup handling scalars ( #2389 )
* speedup scalars in cuda
* comment
2025-07-18 21:47:31 -07:00
Awni Hannun
63f663d9c6
fix cuda manylinux version to match others ( #2388 )
2025-07-18 21:02:16 -07:00
Awni Hannun
84b4d96efa
fix release build + patch bump ( #2387 )
2025-07-18 14:47:37 -07:00
Awni Hannun
aec67f2fa6
patch bump ( #2386 )
2025-07-18 12:25:48 -07:00
Gökdeniz Gülmez
deee214a95
Adding support for the Muon Optimizer ( #1914 )
* initial commit with working optimizer
* update ACKNOWLEDGMENTS.md
* nits and adding it to test
* nits
* G.astype(mx.bfloat16) to G.astype(G.dtype)
* G.ndim >= 2 to assert G.ndim == 2
* remove comments
* replace with mx.addmm
* remove comments
* format
* nits
* match muon
* fix addmm
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2025-07-18 12:25:28 -07:00
Cheng
45adec102c
Add contiguous_copy_gpu util for copying array ( #2379 )
2025-07-18 06:44:25 -07:00
Cheng
31fc530c76
[CUDA] Add more ways finding CCCL headers in JIT ( #2382 )
2025-07-17 15:25:34 -07:00
Awni Hannun
fbb3f65a1a
fix resource leaks in matmul and graph ( #2383 )
2025-07-17 06:50:15 -07:00
Angelos Katharopoulos
6b1b8ea91b
[CUDA] Add work per thread to compile ( #2368 )
2025-07-17 06:47:52 -07:00
Awni Hannun
b2273733ea
Test with CUDA 12.2 ( #2375 )
* Test with CUDA 12.0
* try older image
* fix cpu sort
2025-07-16 13:00:37 -07:00
Awni Hannun
f409b229a4
fix ring distributed test ( #2380 )
2025-07-16 11:25:24 -07:00
Cheng
30571e2326
Rename the copy util in cpu/copy.h to copy_cpu ( #2378 )
2025-07-16 07:34:24 -07:00
Awni Hannun
d7734edd9f
fix complex reduce + nan propagation in min and max ( #2377 )
2025-07-15 18:19:47 -07:00
Awni Hannun
2ba69bc8fa
lower memory uniform sampling ( #2361 )
* lower memory uniform
* use fp32
* fix
2025-07-15 14:22:07 -07:00
Cheng
cb349a291c
[CUDA] Use cuda::std::complex in place of cuComplex ( #2372 )
2025-07-15 00:36:13 -07:00