Awni Hannun
c5fcd5b61b
fix custom kernel test ( #2510 )
2025-08-18 06:45:59 -07:00
Awni Hannun
4fda5fbdf9
add python testing for cuda with ability to skip list of tests ( #2295 )
2025-06-15 10:56:48 -07:00
Awni Hannun
1ca616844b
Fix unintuitive metal kernel caching ( #2242 )
...
* Fix unintuitive metal kernel caching
* alternative solution
2025-06-06 20:08:15 -07:00
Angelos Katharopoulos
5e6c130d93
RMS norm without scaling ( #1915 )
2025-02-28 20:26:57 -08:00
Angelos Katharopoulos
78ba24c37d
Raise an exception in the rope op if input is integer ( #1884 )
2025-02-19 14:43:39 -08:00
Awni Hannun
121d9a0702
Fix rope fallback to not upcast ( #1797 )
...
* fix rope fallback to not upcast
* Update mlx/fast.cpp
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
2025-01-26 19:07:21 -08:00
Awni Hannun
0308e9af71
Allow offset to be an mx.array for mx.fast.rope ( #1724 )
...
* allow offset for rope
* comment
2024-12-19 15:51:44 -08:00
Alex Barron
c79f6a4a8c
3 and 6 bit quantization ( #1613 )
...
* Support 3 and 6 bit quantization
2024-11-22 10:22:13 -08:00
Awni Hannun
76f275b4df
error in rms for wrong size ( #1562 )
2024-11-04 13:24:02 -08:00
Awni Hannun
3ae6aabe9f
throw for certain cases of non captured inputs in compile ( #1401 )
2024-09-09 14:54:31 -07:00
Awni Hannun
ba3e913c7a
Simplifications for MLX C ( #1396 )
...
* simplifications for MLX C
* use vectors instead of map
* update examples
2024-09-06 19:16:50 -07:00
Awni Hannun
7cca1727af
Fix slice data size ( #1394 )
...
* fix slice data size and add tests
* fix contiguous flag
* simplify stride and perform copy for non-contiguous arrays
* fix cpu
* comment
2024-09-04 19:10:43 -07:00
Alex Barron
1d94ac3f90
Add optional headers to `mx.fast.metal_kernel` ( #1358 )
2024-08-26 21:45:45 -07:00
Alex Barron
da8deb2b62
fix bug with multiple attributes ( #1348 )
...
Co-authored-by: Alex Barron <abarron22@apple.com>
2024-08-23 10:06:15 -07:00
Alex Barron
0fd2a1f4b0
Custom Metal Kernels from Python ( #1325 )
...
* start
* simple kernels working
* restructure
* inverse example working
* docs + fixes
* missing file
* fix imports
* address comments
* add docs + fix test
* Review comments + refactor to a single function
* update docs
* remove hashing
* fix contig bug in test
* back to a class
* trailing whitespace
* fix tests
* match c++ and python apis
* add link + make args kw_only
2024-08-22 13:46:29 -07:00
Awni Hannun
d40e76809f
Fix rope ( #1340 )
...
* add test
* fix rope
* fix test
2024-08-20 17:37:52 -07:00
Awni Hannun
bb1b76d9dc
RoPE with frequencies as optional input ( #1337 )
...
* start rope with freq input
* rope with frequencies
* nits
* fix bug
* fix bug + test
* cleanup
* optional base
2024-08-19 18:30:50 -07:00
Angelos Katharopoulos
780c197f95
Fix test tolerance and patch bump ( #1315 )
2024-08-08 14:51:09 -07:00
Alex Barron
c52d1600f0
Fused Affine Quantize/Dequantize ops ( #1282 )
...
* Add fast affine dequantize
* add full quantize kernel
* fused kernel with scale/bias computation
* fix docstring
* fix no jit error
* fix test
* test fix
* reduce fast api to only affine_quantize
2024-07-29 15:11:38 -07:00
Awni Hannun
ae18326533
No copy command encoder ( #986 )
...
* no copy command encoder
* up layer norm test tolerances
2024-04-11 21:15:36 -07:00
Awni Hannun
d88d2124b5
segfault layer norm grad ( #955 )
2024-04-04 10:59:15 -07:00
Angelos Katharopoulos
110d9b149d
Layer norm grad fix donation bug ( #941 )
...
* add layer norm grad test
* Fix donation bug in layernorm vjp
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2024-04-01 06:15:50 -07:00
Angelos Katharopoulos
29221fa238
Implement vjps for some primitives in the fast namespace ( #883 )
...
* Implement rope vjp in terms of rope
* RMSNormVJP primitive and kernel
* Add LayerNormVJP primitive and kernel
2024-03-26 16:35:34 -07:00
Angelos Katharopoulos
fcda3a0e66
Increase test tolerance for fast.layer_norm ( #880 )
2024-03-22 12:10:27 -07:00
Angelos Katharopoulos
2225374060
Adds mx.fast.layer_norm ( #870 )
2024-03-21 13:55:51 -07:00
Awni Hannun
a54f06b16f
Fast RMS Norm ( #862 )
...
* fast rmsnorm
* no rms gpu
* kernel
* fix shared mem
* looped rms and donation in softmax
* Make the squaring in float32 to avoid underflow
* Fix the default StreamOrDevice for rope and rms_norm in fast
* nits
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
2024-03-21 07:20:54 -07:00
Awni Hannun
ccf1645995
Custom primitive + RoPE fat op ( #676 )
...
* extensions start
* rope custom op
* fix build
* docs + rope benchmark
* fix test
* Add a Metal kernel for RoPE
* Fix position of traditional
* transform tests
* Move rope computation to float and fix tests
* Fix the test and a typo
* change to fast
* fix no metal build
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
2024-02-14 14:04:25 -08:00