When composing transforms, lots of temporary arrays are created and passed to
the next primitive; by making ops accept their args by value we can avoid many
copies of these temporary arrays.
Without the && the args would be copied and perfect forwarding wouldn't work.
Also add template utils to make sure the function only forwards array
and not vector<array>.
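A minimal, self-contained sketch of the idea, using a stand-in Array type and a hypothetical enable_for_array_t alias rather than MLX's actual templates:

```cpp
#include <type_traits>
#include <utility>
#include <vector>

struct Array {};

// Only enabled when T decays to Array, so std::vector<Array> is rejected.
template <typename T>
using enable_for_array_t =
    std::enable_if_t<std::is_same_v<std::decay_t<T>, Array>>;

// Forwarding reference: an rvalue temporary produced by the previous
// primitive is moved into the op instead of being copied.
template <typename T, typename = enable_for_array_t<T>>
Array some_op(T&& in) {
  Array out = std::forward<T>(in);
  return out;
}

int main() {
  Array a;
  Array b = some_op(a);        // lvalue: copied
  Array c = some_op(Array{});  // rvalue: moved, no copy
  (void)b; (void)c;
}
```

With the constraint in place, some_op(Array{}) moves the temporary while some_op(std::vector<Array>{}) fails to compile, so a vector of arrays is never forwarded by mistake.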
* add numeric type hierarchy and issubdtype as well as a set_dtype method to nn.Module with a predicate
The numeric type hierarchy and issubdtype are compatible with the [numpy hierarchy](220f0ab2c5/numpy/_core/numerictypes.py (L42)); a sketch of the category check appears after this list.
Closes #285.
* nits in docs
* unify type category checking
* nits in docs
* nits in docs
* more docs nits
* fix callable type
---------
Co-authored-by: Awni Hannun <awni@apple.com>
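For the issubdtype entry above, an illustrative, self-contained sketch of a numpy-style category check; MLX's actual categories and their relationships may differ:

```cpp
#include <cassert>

// Illustrative categories, loosely mirroring numpy's abstract types.
enum class Category {
  generic,
  number,          // -> generic
  integer,         // -> number
  signedinteger,   // -> integer
  unsignedinteger, // -> integer
  inexact,         // -> number
  floating,        // -> inexact
  complexfloating, // -> inexact
};

// Parent of each category in the hierarchy; generic is the root.
constexpr Category parent(Category c) {
  switch (c) {
    case Category::number:          return Category::generic;
    case Category::integer:         return Category::number;
    case Category::signedinteger:   return Category::integer;
    case Category::unsignedinteger: return Category::integer;
    case Category::inexact:         return Category::number;
    case Category::floating:        return Category::inexact;
    case Category::complexfloating: return Category::inexact;
    default:                        return Category::generic;
  }
}

// True if `a` equals `b` or lies below `b` in the hierarchy.
bool issubdtype(Category a, Category b) {
  while (true) {
    if (a == b) return true;
    if (a == Category::generic) return false;
    a = parent(a);
  }
}

int main() {
  assert(issubdtype(Category::floating, Category::inexact));
  assert(issubdtype(Category::signedinteger, Category::number));
  assert(!issubdtype(Category::floating, Category::integer));
}
```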
* some small overhead improvements
* use result_type in rms_norm
* remove release force
* fix + use non-vector version
* revert compile change
* fix ops
* a little more overhead
* a little more cleanup and overhead
1. Move shapes into outputs instead of copying them.
2. Pass primitive by const ref as it is always copied into outputs, which
removes a copy when calling make_array.
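A hedged, self-contained sketch of both points with stand-in types (not the real make_array signature):

```cpp
#include <memory>
#include <utility>
#include <vector>

struct Primitive {};

struct Output {
  std::vector<int> shape;
  std::shared_ptr<Primitive> prim;
  Output(std::vector<int> s, std::shared_ptr<Primitive> p)
      : shape(std::move(s)), prim(std::move(p)) {}
};

// The shape is taken by value and moved into the output rather than copied,
// and the primitive is taken by const reference so the only shared_ptr copy
// is the one actually stored in the output.
Output make_output(std::vector<int> shape,
                   const std::shared_ptr<Primitive>& prim) {
  return Output(std::move(shape), prim);
}

int main() {
  auto prim = std::make_shared<Primitive>();
  auto out = make_output({2, 3}, prim);
  (void)out;
}
```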
* fast rmsnorm
* no rms gpu
* kernel
* fix shared mem
* looped rms and donation in softmax
* Make the squaring in float32 to avoid underflow
* Fix the default StreamOrDevice for rope and rms_norm in fast
* nits
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
Without the && the args would be copied and perfect forwarding wouldn't work.
To avoid eval calling itself recursively, the vector version of eval is
changed to take its argument by value instead, which saves a copy of the
arrays when an rvalue is passed.
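A minimal, self-contained sketch of that overload arrangement, with a stand-in Array type rather than MLX's real eval signatures:

```cpp
#include <utility>
#include <vector>

struct Array {};

// By-value overload: an rvalue argument is moved in, an lvalue pays one copy.
void eval(std::vector<Array> outputs) {
  // ... evaluate the graph rooted at `outputs`
  (void)outputs;
}

// Variadic convenience overload packs individual arrays into a temporary
// vector; because that temporary is an rvalue, the overload above moves it.
template <typename... Arrays>
void eval(Arrays&&... outputs) {
  eval(std::vector<Array>{std::forward<Arrays>(outputs)...});
}

int main() {
  Array a, b;
  eval(a, b);                      // goes through the variadic overload
  eval(std::vector<Array>{a, b});  // rvalue vector: moved, not copied
}
```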
* Update mlx_set_item to handle regular slices without expanding
* Refactor ellipsis handling
* Route mlx_set_item to slice_update where possible
* Update mlx_scatter_args_slice
* Don't route to gather if no array indices
* Enable copy to work with int64 strides
* Fix uniform buffer indices or copy kernel arguments
* Update utils.h
* Remove manual unrolling of elem to loc loop
* GPU copy updated to handle negative strides
* Add slice update primitive
Since the shape and inputs are always stored as copies in ArrayDesc, we can
unify array's constructors to just take the arguments by value.
There are 2 cases:
1. When shape is an lvalue, it is copied into array's constructor and
then moved into ArrayDesc's member, so only 1 copy happens.
2. When shape is an rvalue, it is moved into array's constructor and
then moved into ArrayDesc's member, so no copy happens.
So having 1 constructor that takes by value is equivalent to having 2
constructors that take a const reference and an rvalue reference separately.
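A self-contained sketch of the pattern; the real array and ArrayDesc carry more state, so the names here are only illustrative:

```cpp
#include <memory>
#include <utility>
#include <vector>

struct ArrayDesc {
  std::vector<int> shape;
  explicit ArrayDesc(std::vector<int> shape) : shape(std::move(shape)) {}
};

struct array {
  // Single by-value constructor:
  //   lvalue argument -> one copy into `shape`, then a move into ArrayDesc;
  //   rvalue argument -> a move into `shape`, then a move into ArrayDesc.
  explicit array(std::vector<int> shape)
      : desc_(std::make_shared<ArrayDesc>(std::move(shape))) {}
  std::shared_ptr<ArrayDesc> desc_;
};

int main() {
  std::vector<int> s = {2, 3};
  array a(s);       // lvalue: exactly one copy
  array b({4, 5});  // rvalue: no copy, only moves
  (void)a; (void)b;
}
```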
* mostly builds
* most tests pass
* fix circle build
* add back buffer protocol
* includes
* fix for py38
* limit to cpu device
* include
* fix stubs
* move signatures for docs
* stubgen + docs fix
* doc for compiled function, comments
1. An anonymous namespace gives internal linkage, so the static keyword is not needed.
2. The default constructor of std::shared_ptr initializes the pointer to
nullptr, so you don't need to set it explicitly.
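Both points in one short, self-contained illustration:

```cpp
#include <memory>

namespace {                  // anonymous namespace: internal linkage already,
int call_count = 0;          // so no `static` is needed on these definitions
std::shared_ptr<int> cache;  // default-constructed shared_ptr is already nullptr
}  // namespace

int main() {
  return cache == nullptr ? call_count : 1;  // cache starts out as nullptr
}
```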
* Enable collapsing batch dims in gemm
* Update gemm to only make copies when neither of the last 2 axes are contiguous
* Update addmm to support gemv shapes
* Update addmm to support irregular batch strides
* Update tests
* Fast Inference SDPA op
Implements metal shaders for:
o = mx.fast_inference_sdpa(queries, keys, values, scale, mask)
Supports fp16, fp32 dtypes; assumes d_k = 128.
Generic op support / prompt encoding supported via mlx primitives.
Metal implementation is for the inference use case only.
The majority of the performance benefit appears to result from GQA and reduced
bandwidth requirements; there is approximate performance parity for the
MHA use case (from some measurements on an M3 Max).
* Flush shared memory to zero before unprotected reads for (scores @ values)
* Move to fast:: namespace, address reviewer comments
... also attempt to revert formatter auto-change for files not relevant
to this change
* Shared memory flush to top of kernel
* Resolve compiler warnings
* Update python/src/fast.cpp
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Update python/src/fast.cpp
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Update python/src/fast.cpp
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Update python/src/fast.cpp
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Update docstring per PR feedback
* Softmax in higher precision, ...
* route to fallback for more use cases - batch size > 1, head_dim other
than 128, etc.
* Address linux build failure
* Address other reviewer comments
* Remove extraneous eval_cpu function per review
---------
Co-authored-by: Atila Orhon <64497909+atiorh@users.noreply.github.com>
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
Co-authored-by: atila <atiorh@icloud.com>
* try to fix build for ios
* skip cpu compile
* fix namespace
* fix namespace
* Use CMake for platform specific cpu compile
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* Split reduction files to reduce compile times
* Add small and medium axis size specializations for row reductions
* Add non-row-reduction options for small and med kernels
* Fix case for step=inf in arange and add inf check for start/stop
* Add test cases for arange
* Update ops.cpp to include climits header
* Fix arange
* Fix formatting
* Refactor
* Add missing include
* Faster scatter.
Add specialization for 1-d index tensors.
* Address review comments.
- Check for row contiguity of the index and update tensors
instead of checking strides.
- Add support for the 1d specialization with a col-contiguous update
tensor, along with a test.
* Nit1
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* Nit2
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
---------
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* shapeless compilation for some graphs
* update compile benchmark
* default compile a few activations
* buffer donation
* bugfix
* shapeless fix
* update tests to work for cpu and gpu fusion
* test kwargs
* add kwargs to compile
* Recompile when python arguments change
* no compile for tanh
* some constant tests
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* refactor cpu preamble
* fix include order
* fix some issues
* fixes for linux
* try to fix includes
* add back warning suppression
* more linux fixes
* extensions start
* rope custom op
* fix build
* docs + rope benchmark
* fix test
* Add a Metal kernel for RoPE
* Fix position of traditional
* transform tests
* Move rope computation to float and fix tests
* Fix the test and a typo
* change to fast
* fix no metal build
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* Replace argument encoder usage for gather and scatter
* Use constant address space for shapes and strides
* Split gather and scatter to improve compile times
* Enable the GPU tests
* Update the CI config
* Fix scatter dispatch for scalar indices
* Remove arg encoder utils
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
Reduce unnecessary integer ops, especially since
these kernels are integer bound.
Increase the number of iterations for benchmarks for
better smoothing.
Github Issue #506
Co-authored-by: Vijay Krishnamoorthy <vijay_krish@apple.com>
Launch a 2D grid to eliminate divide and mod in device code,
since 64-bit integer division is very expensive.
Github Issue #506
Co-authored-by: Vijay Krishnamoorthy <vijay_krish@apple.com>
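A hedged, plain-C++ illustration of why the 2D launch helps (not the actual Metal kernel):

```cpp
#include <cstddef>

// Stand-in for the per-element work the kernel does.
inline void body(std::size_t row, std::size_t col) { (void)row; (void)col; }

// 1D launch: every thread recovers (row, col) with a divide and a modulo,
// which is expensive when the linear index is 64-bit.
void element_1d(std::size_t tid, std::size_t width) {
  body(tid / width, tid % width);
}

// 2D launch: the grid supplies (row, col) directly, so the divide and modulo
// disappear from the device code.
void element_2d(std::size_t row, std::size_t col) {
  body(row, col);
}

int main() {
  element_1d(7, 3);  // same element as...
  element_2d(2, 1);  // ...this 2D position
}
```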
* Simple kernel generation
* Remove the generate kernel from graph_utils
* fix multi-output with compile
* fuse with stopgrad
* v1 input, output capture in compile
* cleanup tree update with visitor update
* nit
* remove todo
* state for model, optional explicit init and more pure optimizer steps
* move learning rate to state
* add lr to opt state, some fixes in capture
* fix optim
* update tuple of containers as well
* fix stream for compiled output
* rng state for compile
* nit
* updates and comments
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* Implement custom_vjp and checkpointing
* Add a dependency management primitive
* Change the eval order to deep branches first
* Add graph depth tracking to the array
* Add option to specialize metal functions on function constants
* Update Compute Pipeline Creation API
* Add options to make libraries from source and stitching
* Update function specialization name options
* Implement diagonal operator
This implements mx.diagonal at the operator level, inspired by
@ManishAradwad.
* added `mx.diag` with tests
* corrected a few things
* nits in bindings
* updates to diag
---------
Co-authored-by: ManishAradwad <manisharadwad@gmail.com>
Co-authored-by: Awni Hannun <awni@apple.com>
* propagate nans in binary ops
* handle empty matmul
* cpu minimum/maximum propagate nan
* benchmark maximum
* add min as well
* throw on negative indices with full
* verbose on linux
* fix matmul for zero K
* buffer donation
* fix to move shared pointer
* format
* gpu in place for copy and binary
* revert ops test
* cpu in place
* a little cleanup
* remove useless bench
* fix tests for linux
* make a move on compile
* basic compile scaffold works
* compile binding
* clean
* fix
* fix grad, more tests
* basic python tests
* fix segfault on python exit
* compile works with python closures
* fix test
* fix python globals bug, and erase
* simplify
* more cpp tests
* bug fix with move function and compile at exit
* simplify inputs also
* enable and disable compiler
* remove simplify
* simplify tests use compile now
* fix multi-output with compile
* clear output tree from cache when function goes out of scope
* ../python/src/transforms.cpp
* remove closure capture
* comments
* support disabling the Metal buffer cache, due to the large amount of unused memory that stays buffered when an LLM generates long-context tokens
* Run format and add "cache_enabled" feature tests
* Organize and collect metal subroutine templates and elements in `metal/kernels/steel/`
* Update gemm elements for better performance
* Add split-K specialization for gemm
* Add `addmm` primitive, op and bindings for fused matmul and bias addition
* Update tests and benchmarks as needed
* resolved extension build issues and added test to ci
* missing gguflib
* rebased
* force mlx install from fix branch
* linux build issue
* point to git install and comment out ci tests