Arkar Min Aung
6d01528e90
feat: Add benchmarking and documentation updates for Metal SVD
...
- Add comprehensive SVD benchmark script (benchmarks/python/svd_benchmark.py):
* Performance comparison between CPU and GPU implementations
* Batch processing benchmarks
* Correctness verification tests
* Detailed timing and speedup analysis
- Update linalg documentation to mention Metal GPU acceleration
- Add implementation summary document for development reference
This addresses CONTRIBUTING.md requirements:
- Benchmarks for efficiency impact measurement (point 3)
- Documentation updates for API changes (point 4)
- Comprehensive testing coverage (point 2)
2025-06-14 17:28:19 +10:00
Arkar Min Aung
5875252f87
feat: Add comprehensive testing and documentation for Metal SVD
...
- Add comprehensive test suite (test_metal_svd.cpp):
* Basic functionality tests
* Input validation tests
* Various matrix sizes and batch processing
* Reconstruction accuracy verification
* Orthogonality property checks
* Special matrices (identity, zero, diagonal)
* Performance characteristic tests
- Add detailed implementation documentation:
* Algorithm description and complexity analysis
* Usage examples and API documentation
* Performance benchmarks and characteristics
* Implementation details and file structure
* Error handling and limitations
* Contributing guidelines
- Enhance error handling and robustness:
* Improved input validation with detailed error messages
* Memory allocation error handling
* NaN/Inf input detection
* Performance logging for large matrices
- Integrate tests into CMake build system
This completes the Metal SVD implementation with production-ready
testing and documentation.
2025-06-14 17:05:10 +10:00
Arkar Min Aung
c09f1faf9a
feat: Add convergence checking and algorithm improvements
...
- Add svd_check_convergence kernel to monitor off-diagonal norm
- Implement proper convergence checking in Jacobi iterations
- Add algorithm selection heuristics based on matrix properties
- Improve singular vector computation with proper rotation application
- Add adaptive parameter selection (tolerance, max_iterations)
- Enhance error handling and workspace management
Key improvements:
* Convergence checking every 5 iterations to reduce overhead
* Matrix-size-dependent parameter tuning
* Better memory management with convergence tracking
* More accurate singular vector computation
This significantly improves the robustness and efficiency of the
Metal SVD implementation.
2025-06-14 17:05:10 +10:00
Arkar Min Aung
7ec92466df
feat: Implement basic one-sided Jacobi SVD algorithm in Metal
...
- Add complete Metal kernel implementations for SVD computation:
* svd_preprocess: Computes A^T * A matrix
* svd_jacobi_iteration: Performs Jacobi rotations to diagonalize
* svd_extract_singular_values: Extracts singular values from diagonal
* svd_compute_vectors: Computes singular vectors (basic implementation)
- Update host-side implementation to orchestrate kernel execution:
* Allocate workspace for A^T * A and rotation storage
* Execute preprocessing, iteration, and extraction phases
* Handle both singular values only and full SVD modes
- Add proper template instantiations for float and double precision
This provides a working Metal SVD implementation using the Jacobi method.
Performance optimizations and convergence checking will follow.
2025-06-14 17:05:10 +10:00
Arkar Min Aung
c67eea520e
Merge branch 'ml-explore:main' into feature/metal-svd-base
2025-06-14 16:53:43 +10:00
Awni Hannun
8402a2acf4
Fix complex power and print ( #2286 )
...
* fix complex power and print
* fix complex matmul shape
2025-06-13 11:13:00 -07:00
Jagrit Digani
fddb6933e1
Collection of refactors ( #2274 )
...
* Refactor gemv into a function
* Refactor splitk step 1
* Refactor split k axpby
* Rearrange steel_gemm_regular
* Redirect steel_gemm_regular
* Add axpby routing to steel_matmul_regular
* Refactor AddMM step 1
* Redirect steel_gemm
* Update addmm
* Comments and format
* Some cleanup
* Add architecture gen to device
* Update no copy condition in normalization to account for axis size 1
2025-06-13 10:44:56 -07:00
Arkar Min Aung
a71a9e0ddd
feat: Add Metal SVD infrastructure and parameter structures
...
- Add SVDParams, JacobiRotation, and SVDConvergenceInfo structures
- Create placeholder Metal kernel declarations for SVD operations
- Add SVD kernel compilation to CMake build system
- Update SVD::eval_gpu to dispatch to Metal implementation
- Add basic input validation and error handling
- Include placeholder kernel implementation for compilation
This establishes the foundation for Metal SVD implementation.
Actual algorithm implementation will follow in subsequent commits.
2025-06-13 23:28:52 +10:00
Awni Hannun
f5f65ef48c
Make sliceUpdate general ( #2282 )
...
* Make sliceUpdate general
* fix
2025-06-12 16:48:54 -07:00
Awni Hannun
c35f4d089a
start cuda circle config ( #2256 )
...
* rebase
* fix metal kernel linking issue on cuda
* start cuda circle config
2025-06-10 21:19:47 -07:00
Angelos Katharopoulos
8590c0941e
Add load_safe to the general conv loaders ( #2258 )
2025-06-10 20:58:16 -07:00
Cheng
f8bad60609
CUDA backend: unary ops ( #2158 )
2025-06-09 06:45:08 -07:00
Awni Hannun
1ca616844b
Fix unintuitive metal kernel caching ( #2242 )
...
* Fix unintuitive metal kernel caching
* alternative solution
2025-06-06 20:08:15 -07:00
Angelos Katharopoulos
2e8cf0b450
Change layernorms to two pass algorithm ( #2246 )
2025-06-06 13:34:56 -07:00
Cheng
24f89173d1
CUDA backend: matmul ( #2241 )
2025-06-06 12:24:04 -07:00
Awni Hannun
c6a20b427a
Improve metal elementwise kernels ( #2247 )
...
* improve metal elementwise kernels
* compile and copy
* fix jit
2025-06-06 11:37:40 -07:00
Cheng
0bb89e9e5f
Share more common code in Compiled ( #2240 )
...
* Share more common code in Compiled
* Remove build_lib_name
2025-06-03 16:48:50 -07:00
Cheng
1b021f6984
Fast primitives decide when to use the fallback ( #2216 )
2025-06-02 13:26:37 -07:00
Cheng
db5a7c6192
Add memory cache to CUDA backend ( #2221 )
...
* Move BufferCache out of allocator
* Add memory cache to cuda backend allocator
* Simplify BufferCache assuming buf can not be null
2025-05-30 12:12:54 -07:00
Awni Hannun
6ef2f67e7f
5bit quants ( #2226 )
...
* 5bit quants
* 5bit quants
2025-05-30 12:12:10 -07:00
Cheng
f76ee1ffd2
Move some dims utils to common ( #2223 )
2025-05-29 06:48:30 -07:00
Cheng
79071bfba4
Fix out-of-bounds default value in logsumexp/softmax ( #2213 )
2025-05-21 07:25:16 -07:00
Cheng
7774b87cbd
Remove redundant simd_sum in logsumexp ( #2210 )
2025-05-21 07:25:03 -07:00
Awni Hannun
eebe73001a
fix large arg reduce ( #2206 )
2025-05-19 13:10:44 -07:00
Awni Hannun
8576e6fe36
fix conv2d bug + faster conv 1d ( #2195 )
...
* fix conv2d bug + faster conv 1d
* revert sort + flaky test
2025-05-18 06:05:11 -07:00
Jack Wind
7ff5c41e06
Add set_threadgroup_memory_length to CommandEncoder ( #2183 )
2025-05-16 00:28:03 -07:00
Awni Hannun
c1eb9d05d9
non-symmetric eig and eigh ( #2188 )
2025-05-15 13:01:44 -07:00
Cheng
0751263dec
Fix typo in row_reduce_small ( #2179 )
2025-05-13 20:19:54 -07:00
Cheng
eca2f3eb97
Add remove_index utility ( #2173 )
2025-05-13 17:09:56 -07:00
Awni Hannun
8f3d208dce
Close a couple edge case bugs: hadamard and addmm on empty inputs ( #2177 )
...
* handle hadamard and addmm on empty inputs
* fix
2025-05-12 10:48:57 -07:00
Awni Hannun
6661387066
Fix fft for integer overflow ( #2161 )
2025-05-09 14:25:12 -07:00
ATurker
a7fae8a176
fix: conv_general differences between gpu, cpu ( #2070 )
...
* fix general_conv padding
* fix bugs
* add test
---------
Co-authored-by: Awni Hannun <awni@apple.com>
2025-05-09 10:26:52 -07:00
Awni Hannun
5a1a5d5ed1
fix input coherent kernel launch ( #2153 )
2025-05-05 17:30:50 -07:00
Cheng
1683975acf
Move common gpu primitives to backend/gpu ( #2145 )
2025-05-05 13:45:29 -07:00
Awni Hannun
af705590ac
fix batched vector sdpa ( #2152 )
2025-05-05 13:13:03 -07:00
Awni Hannun
825124af8f
fix bw for elementwise ops ( #2151 )
...
* fix bw for elementwise ops
* add compile
* fix
* fix
* fix
* fix
2025-05-05 06:15:04 -07:00
Angelos Katharopoulos
481349495b
GPU Hadamard for large N ( #1879 )
2025-05-01 17:19:17 -07:00
Awni Hannun
e496c5a4b4
fix integer overflow in qmm ( #2143 )
2025-04-30 09:28:56 -07:00
Awni Hannun
f1606486d2
Generalize gpu backend ( #2138 )
...
* generalize gpu backend
* fix no_gpu build
* fix no_gpu build
* generalize gpu backend
2025-04-30 09:08:17 -07:00
Alex Chi Z.
b36dd472bb
return library if it is successfully loaded ( #2131 )
2025-04-29 07:30:36 -07:00
hdeng-apple
167b759a38
Fix typos ( #2136 )
2025-04-29 07:26:05 -07:00
Angelos Katharopoulos
f0e70afff0
Fix swift pm load ( #2117 )
2025-04-24 10:58:29 -07:00
hdeng-apple
38c1e720c2
Search mlx.metallib in macOS framework "Resources" dir ( #2061 )
...
---------
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
2025-04-23 09:53:13 -07:00
Yury Popov
1d2c9d6a07
Complex scan ( #2094 )
2025-04-22 18:56:28 -07:00
Awni Hannun
fdadc4f22c
Add more complex unary ops ( #2101 )
2025-04-21 13:04:54 -07:00
Angelos Katharopoulos
3cde719eb7
Route to gather qmm only for many tokens per expert ( #2082 )
2025-04-17 14:53:08 -07:00
Angelos Katharopoulos
5de6d94a90
Gather qmm batched kernel and refactoring of quantized ( #2078 )
2025-04-17 13:53:11 -07:00
Angelos Katharopoulos
99eefd2ec0
Gather mm new kernel and small refactoring ( #2040 )
2025-04-14 16:37:36 -07:00
Yury Popov
e9e268336b
LogCumSumExp ( #2069 )
2025-04-13 01:27:29 -07:00
Angelos Katharopoulos
c4189a38e4
Add float mask to sdpa vector ( #2068 )
2025-04-11 17:29:40 -07:00