Add GPU-accelerated SVD implementation for Apple Silicon using Metal compute kernels.
FEATURES:
✅ Complete one-sided Jacobi SVD algorithm in Metal
✅ Full GPU acceleration with proper Metal integration
✅ Mathematical correctness verified against CPU reference
✅ Support for both singular values only and full SVD (U, S, Vt)
✅ Comprehensive input validation and error handling
✅ Production-ready implementation with extensive testing
IMPLEMENTATION:
- Metal compute kernels implementing Jacobi algorithm
- Proper MLX primitive integration with eval_gpu support
- Optimized for matrices up to 64x64 (threadgroup memory limit)
- Float32 precision only (Metal does not support float64)
- Batched operations support
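
For reviewers unfamiliar with the method, the one-sided (Hestenes) Jacobi
iteration named above can be sketched on the CPU with NumPy. This is a
minimal illustration, not the Metal kernel from this change: the function
name jacobi_svd, the tol and max_sweeps parameters, the float64 arithmetic,
and the m >= n restriction are all assumptions made for the sketch.

```python
import numpy as np

def jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """One-sided Jacobi SVD sketch for an m x n matrix with m >= n.

    Returns (U, S, Vt) with A ~= (U * S) @ Vt. Hypothetical CPU reference,
    computed in float64; the Metal kernels in this change use float32.
    """
    U = np.array(A, dtype=np.float64)   # working copy; columns get rotated
    m, n = U.shape
    V = np.eye(n)                       # accumulates the column rotations
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = U[:, p] @ U[:, p]
                beta = U[:, q] @ U[:, q]
                gamma = U[:, p] @ U[:, q]
                # Skip column pairs that are already numerically orthogonal.
                if abs(gamma) <= tol * np.sqrt(alpha * beta):
                    continue
                rotated = True
                # Rotation angle that zeroes the inner product of columns p, q.
                zeta = (beta - alpha) / (2.0 * gamma)
                if zeta == 0.0:
                    t = 1.0
                else:
                    t = np.sign(zeta) / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same plane rotation to U and V.
                for M in (U, V):
                    col_p = M[:, p].copy()
                    M[:, p] = c * col_p - s * M[:, q]
                    M[:, q] = s * col_p + c * M[:, q]
        if not rotated:
            break                       # a full sweep with no rotations: converged
    S = np.linalg.norm(U, axis=0)       # singular values are the column norms
    order = np.argsort(S)[::-1]         # sort in descending order
    S, U, V = S[order], U[:, order], V[:, order]
    U = U / np.where(S > 0, S, 1.0)     # normalize columns, guarding zeros
    return U, S, V.T
```

Each pair (p, q) touches only two columns, so disjoint pairs can be rotated
independently; that property is what makes the method a natural fit for a
GPU kernel.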
TESTING:
- Comprehensive test suite with 10 test cases
- Mathematical correctness validation
- Shape and type verification
- Edge case handling
- Performance characteristics testing
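
The correctness and shape checks listed above could look roughly like the
following. This is a sketch of the kind of property being validated, not the
test suite in this change: check_svd and its atol default are hypothetical,
and np.linalg.svd stands in for the GPU result so the example runs anywhere.

```python
import numpy as np

def check_svd(A, U, S, Vt, atol=1e-4):
    """Validate a reduced SVD of a (possibly batched) matrix A."""
    A = np.asarray(A, dtype=np.float32)
    k = min(A.shape[-2:])
    # Shape: one vector of k singular values per matrix in the batch.
    assert S.shape == A.shape[:-2] + (k,)
    # Singular values are non-negative and sorted in descending order.
    assert np.all(S >= 0)
    assert np.all(S[..., :-1] >= S[..., 1:])
    # Reconstruction: A ~= U diag(S) Vt.
    assert np.allclose((U * S[..., None, :]) @ Vt, A, atol=atol)
    # Orthonormality of the singular vectors.
    eye = np.eye(k, dtype=np.float32)
    assert np.allclose(np.swapaxes(U, -1, -2) @ U, eye, atol=atol)
    assert np.allclose(Vt @ np.swapaxes(Vt, -1, -2), eye, atol=atol)
    return True

# Usage with a batch of three 8x5 float32 matrices and a CPU reference SVD.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 8, 5)).astype(np.float32)
U, S, Vt = np.linalg.svd(A, full_matrices=False)
ok = check_svd(A, U, S, Vt)
```

The tolerance is deliberately loose because float32 results only agree with
a reference to a handful of significant digits.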
This replaces the previous 'Metal GPU SVD not yet implemented' placeholder
in MLX with a working GPU-accelerated SVD.