Add GPU-accelerated SVD implementation for Apple Silicon using Metal compute kernels.
FEATURES:
✅ Complete one-sided Jacobi SVD algorithm in Metal
✅ Full GPU acceleration with proper Metal integration
✅ Mathematical correctness verified against CPU reference
✅ Support for both singular values only and full SVD (U, S, Vt)
✅ Comprehensive input validation and error handling
✅ Production-ready implementation with extensive testing
IMPLEMENTATION:
- Metal compute kernels implementing Jacobi algorithm
- Proper MLX primitive integration with eval_gpu support
- Optimized for matrices up to 64x64 (shared memory limitation)
- Float32 precision (Metal hardware limitation)
- Batched operations support
TESTING:
- Comprehensive test suite with 10 test cases
- Mathematical correctness validation
- Shape and type verification
- Edge case handling
- Performance characteristics testing
This transforms MLX from 'Metal GPU SVD not yet implemented' to a
complete, working GPU-accelerated SVD solution.
* export and import functions
* refactor + works for few primitives
* nit
* allow primitives with state
* nit
* nit
* simplify serialize / deserialize
* fix for constants
* python bindings
* maybe fix serialize failure case
* add example
* more primitives, training kind of works
* same result for python and c++
* some fixes
* fix export
* template it up
* some simplificatoin
* rebase
* allow kwargs and multiple functions
* exporter
* more primitives for exporting
* deal with endianness
* handle invalid stream
* add docstring
* einsum initial
* fix comma break
* sum axis was wrong
* small cleanups
* python binding
* changed bindings to resemble numpy
* remove todo comment
* comment changes
* add count of operands/inputs
* fail fast if operands list is empty
* ignore comma if no output
* einsum path matching numpy
* getting somewhere with path
* remove print
* it passes the first test
* moved einsum tests to seperate file
* seperated einsum path
* moved einsum naive
* remove space from equation
* fast fail if no operands passed
* update tests and remove printf
* small cleanup
* some more cleanups
* removed python helper file
* ack
* utilize std for finding min in vector
* duplicate def
* remove the tuple as it was unreadable
* moved einsum_naive back to ops
* remaining isn't needed
* avoid creating another set
* cleanup
* greedy path, start of naive einsum
* more einsum
* fix some bugs
* some more fixes, tests pass
* benchmark
* some simplify
* fix einsum and test
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* add a bunch more tests and fix a bunch more bugs
* some docs nits
---------
Co-authored-by: dc-dc-dc <dgcruz983@gmail.com>
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
* Implement custom_vjp and checkpointing
* Add a dependency management primitive
* Change the eval order to deep branches first
* Add graph depth tracking to the array
* fix tests for linux
* make a move on compile
* basic compile scaffold works
* compile binding
* clean
* fix
* fix grad, more tests
* basic python tests
* fix segfault on python exit
* compile works with python closures
* fix test
* fix python globals bug, and erase
* simplify
* more cpp tests
* bug fix with move function and compile at exit
* simplify inputs also
* enable and disable compiler
* remove simplify
* simplify tests use compile now
* fix multi-output with compile
* clear output tree from cache when function goes out of scope
* ../python/src/transforms.cpp
* remove closure capture
* comments
* implemented vector_norm in cpp
added linalg to mlx
* implemented vector_norm python binding
* renamed vector_norm to norm, implemented norm without provided ord
* completed the implementation of the norm
* added tests
* removed unused import in linalg.cpp
* updated python bindings
* added some tests for python bindings
* handling inf, -inf as numpy does, more extensive tests of compatibility with numpy
* added better docs and examples
* refactored mlx.linalg.norm bindings
* reused existing util for implementation of linalg.norm
* more tests
* fixed a bug with no ord and axis provided
* removed unused imports
* some style and API consistency updates to linalg norm
* remove unused includes
* fix python tests
* fixed a bug with frobenius norm of a complex-valued matrix
* complex for vector too
---------
Co-authored-by: Awni Hannun <awni@apple.com>