wrmsr
04cbb4191c
Fix dequantize python sig ( #2562 )
2025-09-01 11:50:20 -07:00
Artur Antonov
c5460762e7
Fix AdamW weight_decay default value in docstring ( #2557 )
2025-08-31 21:29:30 -07:00
Awni Hannun
8ce49cd39e
fix quantized vjp for mxfp4 ( #2555 )
2025-08-29 10:06:15 -07:00
Awni Hannun
111f1e71af
Faster contiguous gather for indices in the first axis ( #2552 )
...
* faster contiguous gather for indices in the first axis
* work per thread > 1
* Angelos' suggestion for scales / biases
2025-08-28 21:26:30 -07:00
Awni Hannun
70560b6bd5
Add mode parameter for quantization ( #2499 )
...
* add mode parameter for quantization
* mxfp4 quantize/dequantize + start of optional biases
* mxfp4 works
* speedup
* cpu mxfp4
* fix
* fix test tol
* fix
* refactor
* add quant mode enum
2025-08-28 06:45:26 -07:00
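For context on this change: a minimal sketch of the quantize/dequantize round trip. The default affine path below follows the documented mx.quantize API; the mode keyword and the "mxfp4" value are assumptions inferred from the commit messages, not a confirmed signature.

    import mlx.core as mx

    w = mx.random.normal((128, 256))

    # Default affine quantization: packed weights plus per-group scales/biases.
    w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)
    w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=4)

    # Hypothetical use of the new mode parameter; the keyword and its
    # accepted values are inferred from this commit, not a confirmed API:
    # w_q, scales = mx.quantize(w, group_size=32, bits=4, mode="mxfp4")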
Awni Hannun
7ef8a6f2d5
[CUDA] fix sort ( #2550 )
...
* [CUDA] fix sort
* fix test
2025-08-27 19:48:43 -07:00
Awni Hannun
5458d43247
add load with path tests ( #2543 )
2025-08-26 14:24:47 -07:00
Awni Hannun
3dcb286baf
Remove stream from average grads so it uses the default ( #2532 )
...
* Remove stream from average grads so it uses the default
* comment
2025-08-25 15:56:29 -07:00
Cheng
4822c3dbe9
[CUDA] Implement DynamicSlice/DynamicSliceUpdate ( #2533 )
...
* Move DynamicSlice to gpu/primitives
* Implement compute_dynamic_offset in CUDA
2025-08-26 07:31:39 +09:00
Awni Hannun
db14e29a0b
allow pathlib.Path to save/load functions ( #2541 )
2025-08-25 14:58:49 -07:00
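To illustrate the change: mx.save and mx.load took string paths; per this commit a pathlib.Path works as well. A minimal sketch with an illustrative file name:

    from pathlib import Path
    import mlx.core as mx

    arr = mx.arange(10)
    path = Path("example.npy")  # illustrative name

    mx.save(path, arr)     # previously required a str path
    loaded = mx.load(path)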
Awni Hannun
068a4612e9
nccl default for backend=any ( #2528 )
...
* nccl default for backend=any
* check num gpus + ensure row contiguous for all reduce
* comment
2025-08-22 12:24:27 -07:00
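For context: a minimal sketch of how backend selection looks from Python, assuming the standard mx.distributed API. Per this commit, backend="any" now prefers NCCL when CUDA GPUs are present; launcher setup (rank/size environment, mlx.launch) is omitted here.

    import mlx.core as mx

    # backend="any" selects an available backend; per this commit NCCL
    # is now the default pick on machines with CUDA GPUs.
    world = mx.distributed.init(backend="any")

    # The commit body notes inputs are made row-contiguous for all-reduce.
    x = mx.ones((4, 4))
    y = mx.distributed.all_sum(x)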
Awni Hannun
f93f87c802
nccl dep + default for cuda ( #2526 )
2025-08-21 17:57:49 -07:00
Anastasiia Filippova
9392fc3f88
NCCL backend ( #2476 )
2025-08-21 11:56:15 -07:00
Awni Hannun
e843c4d8d5
fix power ( #2523 )
2025-08-21 06:46:01 -07:00
Angelos Katharopoulos
e397177f6e
Custom cuda kernel ( #2517 )
2025-08-20 17:20:22 -07:00
Cheng
f4c8888cbe
[CUDA] Fix stride of singleton dims before passing to cuDNN ( #2521 )
2025-08-21 08:55:26 +09:00
Angelos Katharopoulos
25c1e03205
Fix overflow with large filters and small channels ( #2520 )
2025-08-20 08:03:29 -07:00
Cheng
ac85ddfdb7
[CUDA] Add GEMM-based fallback convolution kernels ( #2511 )
...
* Add gemm_conv
* Add gemm_grouped_conv
2025-08-20 10:06:22 +09:00
Awni Hannun
e7c6e1db82
no segfault with uninitialized array.at ( #2514 )
2025-08-18 08:33:38 -07:00
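For readers unfamiliar with array.at, which this fix hardens: it provides functional, out-of-place index updates. A minimal sketch of normal usage:

    import mlx.core as mx

    a = mx.zeros((4,))
    # .at updates return a new array; with repeated indices the updates
    # accumulate, unlike the in-place a[idx] += v syntax.
    a = a.at[1].add(5.0)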
Awni Hannun
c5fcd5b61b
fix custom kernel test ( #2510 )
2025-08-18 06:45:59 -07:00
Cheng
1ba18ff7d9
[CUDA] Fix conv grads with groups ( #2495 )
...
* Put reshape utils in one file
* [CUDA] Fix conv grads with groups
* Put the reshape utils in gpu/copy.h
2025-08-16 10:09:18 +09:00
Luca Vivona
728d4db582
Support destination arg in tree flatten/unflatten ( #2450 )
2025-08-06 15:34:59 -07:00
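A short sketch of the tree utilities touched here. The basic flatten/unflatten calls follow the documented mlx.utils API; the destination argument is left commented out because its exact semantics are inferred from the commit title only.

    from mlx.utils import tree_flatten, tree_unflatten

    params = {"layer": {"w": 1, "b": 2}, "scale": 3}

    flat = tree_flatten(params)  # [("layer.w", 1), ("layer.b", 2), ("scale", 3)]
    rebuilt = tree_unflatten(flat)

    # Hypothetical use of the new destination argument:
    # tree_flatten(params, destination=...)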
Awni Hannun
fa89f0b150
faster gather qmm sorted test ( #2463 )
2025-08-05 06:27:40 -07:00
Cheng
828c5f1137
Use SmallVector for shapes and strides ( #2454 )
...
* Use SmallVector for shapes and strides
* Convert SmallVector to tuple
2025-08-05 09:41:03 +09:00
Awni Hannun
0b807893a7
fix wraps compile ( #2461 )
2025-08-04 16:14:18 -07:00
Cheng
86c6a15571
[CUDA] Backward convolution ( #2431 )
2025-08-01 09:54:05 +09:00
junpeiz
8b25ce62d5
Add tests for export, including control flow models and quantized models ( #2430 )
...
* Add tests for export, including control flow export and quantized model export.
* Skip quantization related test for CUDA backend.
2025-07-31 11:06:26 -07:00
Awni Hannun
d32519c8ee
fix gemv regression ( #2445 )
2025-07-30 14:23:01 -07:00
Awni Hannun
b405591249
fix circular reference ( #2443 )
2025-07-30 09:37:44 -07:00
Awni Hannun
ef631d63af
faster rms norm ( #2433 )
2025-07-29 13:12:00 -07:00
Awni Hannun
4ad53414dd
fix cuda pypi package ( #2423 )
...
* fix cuda pypi package
* patch bump
2025-07-25 15:20:29 -07:00
Awni Hannun
dcb8319f3d
update install docs and requirements ( #2419 )
2025-07-25 12:13:19 -07:00
Awni Hannun
5597fa089c
Fix qvm splitk ( #2415 )
2025-07-25 11:50:24 -07:00
Skonor
7d9d6ef456
docs: fix adam and adamw eps placement ( #2416 )
...
Co-authored-by: Mikhail Gorbunov <m_gorbunov@apple.com>
2025-07-24 16:40:45 -07:00
Cheng
6f5874a2f2
[CUDA] Initial implementation of Convolution with cuDNN ( #2385 )
...
* Link with cuDNN
* Initial implementation
* Remove backend apis
* Fix recording cudnn conv
* More unused backend apis
* Fix C++ conv tests
* include cudnn as python dep
* Install libcudnn9-dev-cuda-12 in CI
* cudnn only accepts contiguous inputs
* Switch to backend apis
* Plan needs to be kept alive
* Turn off tf32
* Add cache
* Test the native cuda graph api
* Set cudnn stream before execution
* Make LRUCache more like a normal container
* Do error check for cublas handle
* Zero-initializing array
* Use tf32 for conv
* Skip TestConv.test_torch_conv_2D test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2025-07-25 08:12:10 +09:00
Awni Hannun
d1f4d291e8
Fix uv install and add dev release ( #2411 )
...
* fix uv install and add dev release
* fix docstring
* pin cuda deps
* cuda release on cpu-only machine
2025-07-23 16:54:19 -07:00
Awni Hannun
e1840853ce
full row mask in sdpa consistently gives nan ( #2406 )
2025-07-23 16:37:03 -07:00
Fangjun Kuang
28d068bce6
Fix an error in the comment for mx.dequantize ( #2409 )
2025-07-23 06:10:50 -07:00
Awni Hannun
63f663d9c6
fix cuda manylinux version to match others ( #2388 )
2025-07-18 21:02:16 -07:00
Gökdeniz Gülmez
deee214a95
Adding support for the Muon Optimizer ( #1914 )
...
* initial commit with working optimizer
* update ACKNOWLEDGMENTS.md
* nits and adding it to test
* nits
* G.astype(mx.bfloat16) to G.astype(G.dtype)
* G.ndim >= 2 to assert G.ndim == 2
* remove comments
* replace with mx.addmm
* remove comments
* format
* nits
* match muon
* fix addmm
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2025-07-18 12:25:28 -07:00
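A minimal sketch of using the new optimizer in a standard training step; the learning rate is illustrative and the full constructor signature should be checked against the docstring.

    import mlx.core as mx
    import mlx.nn as nn
    import mlx.optimizers as optim

    model = nn.Linear(4, 4)

    def loss_fn(model, x, y):
        return nn.losses.mse_loss(model(x), y)

    x = mx.random.normal((8, 4))
    y = mx.random.normal((8, 4))
    loss, grads = nn.value_and_grad(model, loss_fn)(model, x, y)

    opt = optim.Muon(learning_rate=0.02)  # illustrative hyperparameter
    opt.update(model, grads)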
Awni Hannun
f409b229a4
fix ring distributed test ( #2380 )
2025-07-16 11:25:24 -07:00
Awni Hannun
d7734edd9f
fix complex reduce + nan propagation in min and max ( #2377 )
2025-07-15 18:19:47 -07:00
Awni Hannun
f0a0b077a0
Install linux with mlx[cuda] and mlx[cpu] ( #2356 )
...
* install linux with mlx[cuda] and mlx[cpu]
* temp for testing
* cleanup circle, fix cuda repair
* update circle
* update circle
* decouple python bindings from core libraries
2025-07-14 17:17:33 -07:00
Awni Hannun
49114f28ab
fix flaky test ( #2371 )
2025-07-14 17:16:18 -07:00
Awni Hannun
e7d2ebadd2
[CUDA] Affine quantize ( #2354 )
...
* affine quantize and dequantize kernels
* format
* fix
* format
2025-07-14 15:45:44 -07:00
Cheng
d34f887abc
Add Primitive::name and remove Primitive::print ( #2365 )
2025-07-14 14:06:35 -07:00
Angelos Katharopoulos
5201df5030
Fix imag() vjp ( #2367 )
2025-07-14 13:11:16 -07:00
Cheng
8347575ba1
[CUDA] Implement Scan kernel ( #2347 )
...
* Contiguous scan
* Strided scan
* Enable tests
* Fix failing logaddexp test
* Use cexpf in Metal
2025-07-10 16:54:12 -07:00
Angelos Katharopoulos
0eb035b4b1
Fix type promotion in Adam with bias correction ( #2350 )
2025-07-10 11:14:42 -07:00
jhavukainen
8c7bc30ce4
Align mlx::core::min op nan propagation with NumPy ( #2346 )
2025-07-10 06:20:43 -07:00
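To make the behavior change concrete: after this fix, mx.min propagates NaN the way numpy.min does.

    import mlx.core as mx

    x = mx.array([1.0, float("nan"), 3.0])
    print(mx.min(x))  # nan, matching NumPy's propagation semantics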