Awni Hannun
8435c047e1
fix addmm
2025-07-18 09:49:39 -07:00
Awni Hannun
508bd25e29
match muon
2025-07-18 06:43:11 -07:00
Awni Hannun
0a8bb904d7
nits
2025-07-17 11:58:41 -07:00
Gökdeniz Gülmez
c535d8c1b5
Merge branch 'ml-explore:main' into adding-Muon-optimizer
2025-07-17 20:10:02 +02:00
Goekdeniz-Guelmez
4b3d7634cd
format
2025-07-17 20:03:19 +02:00
Goekdeniz-Guelmez
516d172ba5
remove comments
2025-07-17 20:02:27 +02:00
Goekdeniz-Guelmez
698daee214
replace with mx.addmm
2025-07-17 19:57:18 +02:00
Goekdeniz-Guelmez
4c0f7c713b
remove comments
2025-07-17 19:53:56 +02:00
Goekdeniz-Guelmez
3889c805da
G.ndim >= 2 to assert G.ndim == 2
2025-07-17 19:52:00 +02:00
Goekdeniz-Guelmez
060404d862
G.astype(mx.bfloat16) to G.astype(G.dtype)
2025-07-17 19:49:26 +02:00
Awni Hannun
fbb3f65a1a
fix resource leaks in matmul and graph ( #2383 )
2025-07-17 06:50:15 -07:00
Angelos Katharopoulos
6b1b8ea91b
[CUDA] Add work per thread to compile ( #2368 )
2025-07-17 06:47:52 -07:00
Awni Hannun
7f39e9c299
nits
2025-07-17 06:26:43 -07:00
Gökdeniz Gülmez
baad6e392b
Merge branch 'ml-explore:main' into adding-Muon-optimizer
2025-07-17 13:07:54 +02:00
Awni Hannun
b2273733ea
Test with CUDA 12.2 ( #2375 )
...
* Test with CUDA 12.0
* try older image
* fix cpu sort
2025-07-16 13:00:37 -07:00
Gökdeniz Gülmez
784e0716fe
Merge branch 'ml-explore:main' into adding-Muon-optimizer
2025-07-16 21:58:17 +02:00
Awni Hannun
f409b229a4
fix ring distributed test ( #2380 )
2025-07-16 11:25:24 -07:00
Goekdeniz-Guelmez
df6d9e972f
nits and adding it to test
2025-07-16 19:13:40 +02:00
Cheng
30571e2326
Rename the copy util in cpu/copy.h to copy_cpu ( #2378 )
2025-07-16 07:34:24 -07:00
Gökdeniz Gülmez
650c956fe6
Merge branch 'ml-explore:main' into adding-Muon-optimizer
2025-07-16 16:29:10 +02:00
Awni Hannun
d7734edd9f
fix complex reduce + nan propagation in min and max ( #2377 )
2025-07-15 18:19:47 -07:00
Awni Hannun
2ba69bc8fa
lower memory uniform sampling ( #2361 )
...
* lower memory uniform
* use fp32
* fix
2025-07-15 14:22:07 -07:00
Cheng
cb349a291c
[CUDA] Use cuda::std::complex in place of cuComplex ( #2372 )
2025-07-15 00:36:13 -07:00
Awni Hannun
f0a0b077a0
Install linux with mlx[cuda] and mlx[cpu] ( #2356 )
...
* install linux with mlx[cuda] and mlx[cpu]
* temp for testing
* cleanup circle, fix cuda repair
* update circle
* update circle
* decouple python bindings from core libraries
2025-07-14 17:17:33 -07:00
Awni Hannun
49114f28ab
fix flaky test ( #2371 )
2025-07-14 17:16:18 -07:00
Awni Hannun
e7d2ebadd2
[CUDA] Affine quantize ( #2354 )
...
* affine quantize and dequantize kernels
* format
* fix
* format
2025-07-14 15:45:44 -07:00
Awni Hannun
e569803d7c
update linux build ( #2370 )
2025-07-14 15:13:56 -07:00
Cheng
d34f887abc
Add Primitive::name and remove Primitive::print ( #2365 )
2025-07-14 14:06:35 -07:00
Angelos Katharopoulos
5201df5030
Fix imag() vjp ( #2367 )
2025-07-14 13:11:16 -07:00
Cheng
2d3c26c565
[CUDA] Do not put kernels in anonymous namespace ( #2362 )
2025-07-12 14:24:45 -07:00
Cheng
6325f60d52
[CUDA] Bundle CCCL for JIT compilation ( #2357 )
...
* Ship CCCL for JIT compilation
* Remove cexpf
2025-07-11 18:45:37 -07:00
Awni Hannun
42cc9cfbc7
fix copy dispatch ( #2360 )
2025-07-11 10:59:35 -07:00
Cheng
8347575ba1
[CUDA] Implement Scan kernel ( #2347 )
...
* Contiguous scan
* Strided scan
* Enable tests
* Fix failing logaddexp test
* Use cexpf in Metal
2025-07-10 16:54:12 -07:00
Angelos Katharopoulos
b6eec20260
Fix edge check in qmm_n QuantizedLoader ( #2355 )
2025-07-10 16:28:50 -07:00
Angelos Katharopoulos
0eb035b4b1
Fix type promotion in Adam with bias correction ( #2350 )
2025-07-10 11:14:42 -07:00
Cheng
afb9817599
[CUDA] Put version in ptx cache dir path ( #2352 )
2025-07-10 07:24:21 -07:00
Cheng
8fb3e7a26c
[CUDA] Set current device before cudaGraphLaunch ( #2351 )
2025-07-10 07:24:02 -07:00
jhavukainen
8c7bc30ce4
Align mlx::core::min op nan propagation with NumPy ( #2346 )
2025-07-10 06:20:43 -07:00
Cheng
85873cb162
[CUDA] Do vectorized store/load in contiguous elementwise ops ( #2342 )
...
* Do vectorized store/load in unary ops
* Do vectorized store/load in binary_two ops
* Do vectorized store/load in copy ops
* Do vectorized store/load in ternary ops
* Use int32_t for IdxT
* binary => binary_two in binary_two.cu
* Fix tests on large arrays
* Use uint as index type
* Contig uses uint as index and non-contig uses int
2025-07-09 18:48:43 -07:00
Awni Hannun
e14ee12491
add zero for argsort vjp ( #2345 )
2025-07-09 14:37:14 -07:00
jhavukainen
8b9a3f3cea
Align mlx::core::max op nan propagation with NumPy ( #2339 )
...
* Make max op NaN propagation rules align with numpy
* Adding benchmarks and testing for max op nan propagation
* Pre-commit formatting
* Fix max complex64 nan propagation and add test
* Improve the cpp unittest
* Only check nans on non-integral types in simd_reduce_impl.
* Cleanup using namespace alias
* Add cpu Max nan propagation. Fix a small bug in cpu max dispatch data types for int8/int16.
* Make the max nan propagation test more meaningful for integer types
* Remove tuple unpacking syntax to comply with earlier python versions. Add cuda skip to nan propagation tests; fix cuda implementation in a separate PR.
2025-07-09 11:26:27 -07:00
Awni Hannun
fb4e8b896b
patch bump ( #2343 )
v0.26.3
2025-07-08 14:26:07 -07:00
Cheng
2ca533b279
Fix compilation with CUDA 11 ( #2331 )
2025-07-07 20:00:43 -07:00
Angelos Katharopoulos
4a9b29a875
MoE backward improvements ( #2335 )
2025-07-07 17:59:53 -07:00
Awni Hannun
a4fcc893cd
auto build linux release ( #2341 )
2025-07-07 09:29:23 -07:00
Cheng
9d10239af7
[CUDA] Do vectorized store/load in binary ops ( #2330 )
2025-07-07 08:44:14 -07:00
Cheng
19facd4b20
Build with all cpu cores by default ( #2336 )
2025-07-07 06:06:45 -07:00
Angelos Katharopoulos
f5299f72cd
Fix layernorm race condition ( #2340 )
2025-07-07 06:06:01 -07:00
Cheng
0e0d9ac522
[CUDA] Add MLX_CUDA_GRAPH_CACHE_SIZE env for setting graph cache size ( #2329 )
2025-07-05 08:33:29 -07:00
Awni Hannun
8917022deb
fix graphs for older cuda ( #2328 )
2025-07-02 19:37:58 -07:00