* Fast Inference SDPA op
Implements metal shaders for:
o = mx.fast_inference_sdpa(queries, keys, values, scale, mask)
Supports fp16, fp32 dtypes; assumes d_k = 128.
Generic op support / prompt encoding supported via mlx primitives.
Metal implementation is for the inference use case only.
Majority of performance benefits appears to results from GQA & reduced
bandwidth requirements; there is approximate performance parity for the
MHA use case (from some measurements on M3 Max).
* Flush shared memory to zero before unprotected reads for (scores @ values)
* Move to fast:: namespace, address reviewer comments
... also attempt to revert formatter auto-change for files not relevant
to this change
* Shared memory flush to top of kernel
* Resolve compiler warnings
* Update python/src/fast.cpp
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Update python/src/fast.cpp
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Update python/src/fast.cpp
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Update python/src/fast.cpp
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Update docstring per PR feedback
* Softmax in higher precision, ...
* route to fallback for more use cases - batch size > 1, head_dim other
than 128, etc.
* Address linux build failure
* Address other reviewer comments
* Remove extraneous eval_cpu function per review
---------
Co-authored-by: Atila Orhon <64497909+atiorh@users.noreply.github.com>
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
Co-authored-by: atila <atiorh@icloud.com>
* try to fix build for ios
* skip cpu compile
* fix namespace
* fix namespace
* Use CMake for platform specific cpu compile
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* Split reduction files to reduce compile times
* Add small and medium axis size specializations for row reductions
* Add non-row-reduction options for small and med kernels
* refactor tree utils
* fix compile + tree code refactor
* Add an extra test
* add a few missing activations to docs
* hash structure
* Encode the full argument structure
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* Add linear warmup to schedules for use with existing schedules
* Changed parameters for simplicity of most common case (0 initial value)
* Added ScheduleJoiner and updated documentation
* ScheduleJoiner -> join_schedules (ala optax #)
* black compliance
* Different evaluation of schedules
* nits
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* Fix case for step=inf in arange and add inf check for start/stop
* Add test cases for arange
* Update ops.cpp to include climits header
* Fix arange
* Fix formatting
* Refactor
* Add missing include
* Faster scatter.
Add specialization for 1-d index tensors.
* Address review comments.
- Check for row contiguity of index, update tensors
instead of checking strides.
- Add support for 1d specialization with col contiguous update
tensor, along with a test.
* Nit1
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Nit2
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
---------
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* shapeless compilation for some graphs
* update compile benchmark
* default compile a few activations
* buffer donation
* bugfix
* shapeless fix
* update tests to work for cpu and gpu fusion
* test kwargs
* add kwargs to compile
* Recompile when python arguments change
* no compile for tanh
* some constant tests
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* refactor cpu preamble
* fix include order
* fix some issues'
* fixes for linux
* try to fix includes
* add back warning suppression
* more linux fixes
* Add a few LR schedulers
* Move parents's constructor call to the top
* Fix docstring
* refactor optimizers into two files
* add docs
* nit
* Fix Callable type annotation for python 3.8
---------
Co-authored-by: Awni Hannun <awni@apple.com>
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* extensions start
* rope custom op
* fix build
* docs + rope benchmark
* fix test
* Add a Metal kernel for RoPE
* Fix position of traditional
* transform tests
* Move rope computation to float and fix tests
* Fix the test and a typo
* change to fast
* fix no metal build
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* Replace argument encoder usage for gather and scatter
* Use constant address space for shapes and strides
* Split gather and scatter to improve compile times
* Enable the GPU tests
* Update the CI config
* Fix scatter dispatch for scalar indices
* Remove arg encoder utils
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>