Daniel Yeh
22a5da76c8
Faster complex matmul ( #2571 )
2025-10-02 23:33:15 -07:00
Awni Hannun
b0cc71ae71
Faster triu, tril, where with scalar ( #2644 )
2025-10-02 12:21:27 -07:00
AN Long
e76a8dd5c5
Fix incorrect path and typos ( #2630 )
2025-09-28 06:03:04 -07:00
Jagrit Digani
7c7e48dbd1
New tuning for small K gemv ( #2620 )
2025-09-23 12:28:35 -07:00
Awni Hannun
711a645807
avoid producing NaN in attention ( #2608 )
2025-09-22 13:10:43 -07:00
Awni Hannun
ec2ab42888
Lower sorted QMM gather threshold ( #2609 )
2025-09-19 18:22:55 -07:00
Oleksandr Bilous
e8b604a6a3
fix: library loading for swift dynamic frameworks ( #2568 )
2025-09-18 13:54:59 -07:00
Awni Hannun
caecbe876a
no copy batch rope ( #2595 )
2025-09-15 14:23:48 -07:00
Awni Hannun
6ccfa603cd
fix metal scan ( #2591 )
2025-09-15 11:01:57 -07:00
Awni Hannun
d6977f2a57
Add sdpa with sinks ( #2558 )
* add sdpa with sinks
* fix 2 pass
* fix matrix sdpa
* fix perf regression
* add to cuda (#2580 )
2025-09-10 14:53:00 -07:00
Awni Hannun
17310d91a6
Add batch offsets for mx.fast.rope ( #2564 )
* implement batch rope for Metal
* cuda rope (#2576 )
2025-09-08 17:35:07 -07:00
Awni Hannun
e5a33f2223
faster depthwise 1D conv ( #2567 )
2025-09-08 11:37:23 -07:00
Awni Hannun
b61a65e313
fix copies in sdpa ( #2563 )
2025-09-02 11:00:36 -07:00
Awni Hannun
111f1e71af
Faster contiguous gather for indices in the first axis ( #2552 )
* faster contiguous gather for indices in the first axis
* work per thread > 1
* angelos suggestion for scales / biases
2025-08-28 21:26:30 -07:00
Awni Hannun
827003d568
fix METAL quantization in JIT ( #2553 )
2025-08-28 18:26:25 -07:00
Awni Hannun
70560b6bd5
Add mode parameter for quantization ( #2499 )
* add mode parameter for quantization
* mxfp4 quantize/dequantize + start of optional biases
* mxfp4 works
* speedup
* cpu mxfp4
* fix
* fix test tol
* fix
* refactor
* add quant mode enum
2025-08-28 06:45:26 -07:00
Cheng
4822c3dbe9
[CUDA] Implement DynamicSlice/DynamicSliceUpdate ( #2533 )
* Move DynamicSlice to gpu/primitives
* Implement compute_dynamic_offset in CUDA
2025-08-26 07:31:39 +09:00
Awni Hannun
e843c4d8d5
fix power ( #2523 )
2025-08-21 06:46:01 -07:00
Angelos Katharopoulos
e397177f6e
Custom cuda kernel ( #2517 )
2025-08-20 17:20:22 -07:00
Angelos Katharopoulos
25c1e03205
Fix overflow in large filter small channels ( #2520 )
2025-08-20 08:03:29 -07:00
Angelos Katharopoulos
1df9887998
Ensure no oob read in gemv_masked ( #2508 )
2025-08-17 08:42:33 -07:00
Angelos Katharopoulos
73f22d6226
Ensure small sort doesn't use indices if not argsort ( #2506 )
2025-08-17 08:42:20 -07:00
Cheng
37b440faa8
Clean up code handling both std::vector and SmallVector ( #2493 )
2025-08-16 09:01:10 +09:00
Cheng
4abb218d21
The naive_conv_2d is no longer used ( #2496 )
2025-08-16 07:57:30 +09:00
Abe Leininger
fce53b61d6
Fix reduce sum/prod overflow ( #2477 )
2025-08-12 00:05:33 -07:00
Angelos Katharopoulos
f2adb5638d
Fix typo in metal command encoder ( #2471 )
2025-08-06 16:58:23 -07:00
Cheng
828c5f1137
Use SmallVector for shapes and strides ( #2454 )
* Use SmallVector for shapes and strides
* Convert SmallVector to tuple
2025-08-05 09:41:03 +09:00
Awni Hannun
5597fa089c
Fix qvm splitk ( #2415 )
2025-07-25 11:50:24 -07:00
Awni Hannun
4e504039f5
[Metal] Release metal events ( #2412 )
* release metal events
* fix
* fix
2025-07-23 19:53:42 -07:00
Awni Hannun
1e496ddb82
[CUDA] Simplify allocator ( #2392 )
* simplify allocator and fix race with small pool
* Don't use shared event in worker
* use cuda buffer in small pool
* comment
* comment
2025-07-22 08:24:01 -07:00
Cheng
45adec102c
Add contiguous_copy_gpu util for copying array ( #2379 )
2025-07-18 06:44:25 -07:00
Cheng
d34f887abc
Add Primitive::name and remove Primitive::print ( #2365 )
2025-07-14 14:06:35 -07:00
Cheng
6325f60d52
[CUDA] Bundle CCCL for JIT compilation ( #2357 )
* Ship CCCL for JIT compilation
* Remove cexpf
2025-07-11 18:45:37 -07:00
Awni Hannun
42cc9cfbc7
fix copy dispatch ( #2360 )
2025-07-11 10:59:35 -07:00
Cheng
8347575ba1
[CUDA] Implement Scan kernel ( #2347 )
* Contiguous scan
* Strided scan
* Enable tests
* Fix failing logaddexp test
* Use cexpf in Metal
2025-07-10 16:54:12 -07:00
Angelos Katharopoulos
b6eec20260
Fix edge check in qmm_n QuantizedLoader ( #2355 )
2025-07-10 16:28:50 -07:00
jhavukainen
8c7bc30ce4
Align mlx::core::min op nan propagation with NumPy ( #2346 )
2025-07-10 06:20:43 -07:00
jhavukainen
8b9a3f3cea
Align mlx::core::max op nan propagation with NumPy ( #2339 )
* Make max op NaN propagation rules align with numpy
* Adding benchmarks and testing for max op NaN propagation
* Pre-commit formatting
* Fix max complex64 nan propagation and add test
* Improve the cpp unittest
* Only check nans on non-integral types in simd_reduce_impl.
* Cleanup using namespace alias
* Add cpu Max NaN propagation. Fix a small fib in cpu max dispatch data types for int8/int16.
* Make the max NaN propagation test more meaningful for integer types
* Remove tuple unpacking syntax to comply with earlier python versions. Add cuda skip to NaN propagation tests, fix cuda implementation in a separate PR.
2025-07-09 11:26:27 -07:00
Angelos Katharopoulos
4a9b29a875
MoE backward improvements ( #2335 )
2025-07-07 17:59:53 -07:00
Angelos Katharopoulos
f5299f72cd
Fix layernorm race condition ( #2340 )
2025-07-07 06:06:01 -07:00
Awni Hannun
8402a2acf4
Fix complex power and print ( #2286 )
* fix complex power and print
* fix complex matmul shape
2025-06-13 11:13:00 -07:00
Jagrit Digani
fddb6933e1
Collection of refactors ( #2274 )
* Refactor gemv into a function
* Refactor splitk step 1
* Refactor split k axpby
* Rearrange steel_gemm_regular
* Redirect steel_gemm_regular
* Add axpby routing to steel_matmul_regular
* Refactor AddMM step 1
* Redirect steel_gemm
* Update addmm
* Comments and format
* Some cleanup
* Add architecture gen to device
* Update no copy condition in normalization to account for axis size 1
2025-06-13 10:44:56 -07:00
Awni Hannun
f5f65ef48c
Make sliceUpdate general ( #2282 )
* Make sliceUpdate general
* fix
2025-06-12 16:48:54 -07:00
Awni Hannun
c35f4d089a
start cuda circle config ( #2256 )
* rebase
* fix metal kernel linking issue on cuda
* start cuda circle config
2025-06-10 21:19:47 -07:00
Angelos Katharopoulos
8590c0941e
Add load_safe to the general conv loaders ( #2258 )
2025-06-10 20:58:16 -07:00
Cheng
f8bad60609
CUDA backend: unary ops ( #2158 )
2025-06-09 06:45:08 -07:00
Awni Hannun
1ca616844b
Fix unintuitive metal kernel caching ( #2242 )
* Fix unintuitive metal kernel caching
* alternative solution
2025-06-06 20:08:15 -07:00
Angelos Katharopoulos
2e8cf0b450
Change layernorms to two pass algorithm ( #2246 )
2025-06-06 13:34:56 -07:00
Cheng
24f89173d1
CUDA backend: matmul ( #2241 )
2025-06-06 12:24:04 -07:00
Awni Hannun
c6a20b427a
Improve metal elementwise kernels ( #2247 )
* improve metal elementwise kernels
* compile and copy
* fix jit
2025-06-06 11:37:40 -07:00