* Fast Inference SDPA op
Implements metal shaders for:
o = mx.fast_inference_sdpa(queries, keys, values, scale, mask)
Supports fp16, fp32 dtypes; assumes d_k = 128.
Generic op support / prompt encoding supported via mlx primitives.
Metal implementation is for the inference use case only.
Majority of performance benefits appears to results from GQA & reduced
bandwidth requirements; there is approximate performance parity for the
MHA use case (from some measurements on M3 Max).
* Flush shared memory to zero before unprotected reads for (scores @ values)
* Move to fast:: namespace, address reviewer comments
... also attempt to revert formatter auto-change for files not relevant
to this change
* Shared memory flush to top of kernel
* Resolve compiler warnings
* Update python/src/fast.cpp
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Update python/src/fast.cpp
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Update python/src/fast.cpp
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Update python/src/fast.cpp
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Update docstring per PR feedback
* Softmax in higher precision, ...
* route to fallback for more use cases - batch size > 1, head_dim other
than 128, etc.
* Address linux build failure
* Address other reviewer comments
* Remove extraneous eval_cpu function per review
---------
Co-authored-by: Atila Orhon <64497909+atiorh@users.noreply.github.com>
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
Co-authored-by: atila <atiorh@icloud.com>
* try to fix build for ios
* skip cpu compile
* fix namespace
* fix namespace
* Use CMake for platform specific cpu compile
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* Split reduction files to reduce compile times
* Add small and medium axis size specializations for row reductions
* Add non-row-reduction options for small and med kernels
* Fix case for step=inf in arange and add inf check for start/stop
* Add test cases for arange
* Update ops.cpp to include climits header
* Fix arange
* Fix formatting
* Refactor
* Add missing include
* Faster scatter.
Add specialization for 1-d index tensors.
* Address review comments.
- Check for row contiguity of index, update tensors
instead of checking strides.
- Add support for 1d specialization with col contiguous update
tensor, along with a test.
* Nit1
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Nit2
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
---------
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* shapeless compilation for some graphs
* update compile benchmark
* default compile a few activations
* buffer donation
* bugfix
* shapeless fix
* update tests to work for cpu and gpu fusion
* test kwargs
* add kwargs to compile
* Recompile when python arguments change
* no compile for tanh
* some constant tests
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* refactor cpu preamble
* fix include order
* fix some issues'
* fixes for linux
* try to fix includes
* add back warning suppression
* more linux fixes
* extensions start
* rope custom op
* fix build
* docs + rope benchmark
* fix test
* Add a Metal kernel for RoPE
* Fix position of traditional
* transform tests
* Move rope computation to float and fix tests
* Fix the test and a typo
* change to fast
* fix no metal build
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* Replace argument encoder usage for gather and scatter
* Use constant address space for shapes and strides
* Split gather and scatter to improve compile times
* Enable the GPU tests
* Update the CI config
* Fix scatter dispatch for scalar indices
* Remove arg encoder utils
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
Reduce unnecessary integer ops, especially since
there kernels are integer bound.
Increase number of iterations for benchmarks for
better smoothing.
Github Issue #506
Co-authored-by: Vijay Krishnamoorthy <vijay_krish@apple.com>
Launch 2D grid to eliminate divide and mod in device code,
since 64b integer division is very expensive.
Github Issue #506
Co-authored-by: Vijay Krishnamoorthy <vijay_krish@apple.com>
* Simple kernel generation
* Remove the generate kernel from graph_utils
* fix multi-output with compile
* fuse with stopgrad
* v1 input, output capture in compile
* cleanup tree update with visitor update
* nit
* remove todo
* state for model, optional explicit init and more pure optimizer steps
* move learning rate to state
* add lr to opt state, some fixes in capture
* fix optim
* update tuple of containers as well
* fix stream for compiled output
* rng state for compile
* nit
* updates and comments
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* Implement custom_vjp and checkpointing
* Add a dependency management primitive
* Change the eval order to deep branches first
* Add graph depth tracking to the array
* Add option to specialize metal functions on function constants
* Update Compute Pipeline Creation API
* Add options to make libraries from source and stitching
* Update function specialization name options
* Implement diagonal operator
This implements mx.diagonal in operator level, inspired by
@ManishAradwad.
* added `mx.diag` with tests
* corrected few things
* nits in bindings
* updates to diag
---------
Co-authored-by: ManishAradwad <manisharadwad@gmail.com>
Co-authored-by: Awni Hannun <awni@apple.com>
* propagate nans in binary ops
* handle empty matmul
* cpu minimum/maximum propagate nan
* benchmark maximum
* add min as well
* throw on negative indices with full
* verbose on linux
* fix matmul for zero K
* buffer donation
* fix to move shared pointer
* format
* gpu in place for copy and binary
* revert ops test
* cpu in place
* a little cleanup
* remove useless bench
* fix tests for linux
* make a move on compile
* basic compile scaffold works
* compile binding
* clean
* fix
* fix grad, more tests
* basic python tests
* fix segfault on python exit
* compile works with python closures
* fix test
* fix python globals bug, and erase
* simplify
* more cpp tests
* bug fix with move function and compile at exit
* simplify inputs also
* enable and disable compiler
* remove simplify
* simplify tests use compile now
* fix multi-output with compile
* clear output tree from cache when function goes out of scope
* ../python/src/transforms.cpp
* remove closure capture
* comments
* support disable metal buffer cache, due to large unused memory buffered when llm generated long context tokens
* Run format and add "cache_enabled" feature tests
* Organize and collect metal subroutine templates and elements in `metal/kernels/steel/`
* Update gemm elements for better performance
* Add split-K specialization for gemm
* Add `addmm` primitive, op and bindings for fused matmul and bias addition
* Update tests and benchmarks as needed
* resolved extension build issues and added test to ci
* missing gguflib
* rebased
* force mlx install from fix branch
* linux build issue
* point to git install and comment out ci tests
* Allow arbitrary first dim on qmm_t and qmv
* Allow arbitrary first dim on qmm and qvm
* Specialized aligned vs unaligned case
* Add more checks for valid quantizations
* feat: add logicalAnd and logicalOR
* run pre-commit
* Refactor logical_and and logical_or functions
* Add acknowledgement
* Add logical AND and logical OR operators
* Refactor logical_and and logical_or functions
* Add support for logical operators on bool arrays
* Update mlx/ops.cpp
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Update mlx/ops.cpp
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Add logical AND and OR operators for arrays and scalars
* Refactor vjp and jvp methods in primitives.cpp
* Add overloaded operators for logical AND and OR
* format
---------
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
Co-authored-by: Awni Hannun <awni@apple.com>
* inner / outer impl
* python tests
* ops list and ack
* updated descriptions
* use test helper
* removed dtype check and flatten outer to 1-D
* updated docs
* just use the reshape to flatten
* support python mlx.array creation from list of mlx.array's
* include bfloat16 in UT
* refactor so that sub array made of all python primitive types gets initialized by fill_vector
* address PR comment: arr.shape().size() -> arr.ndim()
* address PR comment: get back Dtype constness and let stack to handle type promotions automatically
* move all ObjC (via metal-cpp) interaction until post static initializers
- metal-cpp relies on static initializers to cache class and selector pointers
- code in mlx was using metal-cpp to set up NSAutoreleasePools during its own static init time
- but this code was silently failing as the class and selector pointers from metal-cpp were still nil
- defer the creation of NSAutoreleasePools until after static init time
- ensure that we have coverage where autorelease pools are needed
* Update device.cpp
remove commented code
* Update device.cpp
remove commented out code
* Update scheduler.h
update comment
* per discussion use the pool inside the task() -- this will be metal only, not needed for cpu
* Update allocator.cpp
move pool to release/alloc area
* try alternative gc
* try no cache
* add forced swap
* remove cache for now
* add cache back
* change fit crtieria
* remove unused function
* nit in comment
* tune / fix allocation
* increase block limit to original
* add repeat function
* fix styling
* optimizing repeat
* fixed minor issues
* not sure why that folder is there xD
* fixed now for sure
* test repeat not repeat test
* Fixed
---------
Co-authored-by: Bahaa Eddin tabbakha <bahaa@Bahaas-MacBook-Pro.local>
* implemented vector_norm in cpp
added linalg to mlx
* implemented vector_norm python binding
* renamed vector_norm to norm, implemented norm without provided ord
* completed the implementation of the norm
* added tests
* removed unused import in linalg.cpp
* updated python bindings
* added some tests for python bindings
* handling inf, -inf as numpy does, more extensive tests of compatibility with numpy
* added better docs and examples
* refactored mlx.linalg.norm bindings
* reused existing util for implementation of linalg.norm
* more tests
* fixed a bug with no ord and axis provided
* removed unused imports
* some style and API consistency updates to linalg norm
* remove unused includes
* fix python tests
* fixed a bug with frobenius norm of a complex-valued matrix
* complex for vector too
---------
Co-authored-by: Awni Hannun <awni@apple.com>
* added tri / tril / triu
* fixed tests
* ctest tests
* tri overload and simplified tests
* changes from comment
* more tests for m
* ensure assert if not 2-D
* remove broadcast_to
* minor tweaks
---------
Co-authored-by: Awni Hannun <awni@apple.com>
- The exponent and base were swapped because accelerate is using
exponent-base instead of base-exponent
- Fix also the test for binary ops as it was testing op(x, x) which
couldn't catch ordering errors like that