zhangyiss/mlx - mlx - Gitea for Geophysics

mirror of https://github.com/ml-explore/mlx.git synced 2025-06-24 09:21:16 +08:00

Author	SHA1	Message	Date
Arkar Min Aung	a478e79047	Merge `cb4dc59a9e` into `c552ff2451`	2025-06-16 10:37:56 -07:00
Awni Hannun	c552ff2451	[CUDA] Fix back-end bugs and enable corresponding tests (#2296 ) * Fix some cuda back-end bugs and enable corresponding tests * more fixes * enable more tests * format	2025-06-16 08:45:40 -07:00
Awni Hannun	4fda5fbdf9	add python testing for cuda with ability to skip list of tests (#2295 )	2025-06-15 10:56:48 -07:00
Angelos Katharopoulos	580776559b	RoPE for CUDA (#2293 ) * First working CUDA rope * Fix random	2025-06-15 06:08:07 -07:00
Arkar Min Aung	cb4dc59a9e	feat(benchmarks): add comprehensive SVD performance benchmarks Add benchmarks for Metal SVD implementation as required by CONTRIBUTING.md: - Square matrix benchmarks (64x64 to 512x512) - Rectangular matrix benchmarks - Batched matrix benchmarks - CPU vs GPU performance comparison - Special matrices (identity, diagonal, zero) Benchmarks validate performance improvements from GPU acceleration and help identify performance regressions in future changes. Usage: python benchmarks/python/svd_bench.py --gpu python benchmarks/python/svd_bench.py --compare python benchmarks/python/svd_bench.py --all	2025-06-15 18:09:11 +10:00
Arkar Min Aung	e5c8773371	feat(metal): implement complete Metal SVD with Jacobi algorithm Add GPU-accelerated SVD implementation for Apple Silicon using Metal compute kernels. FEATURES: ✅ Complete one-sided Jacobi SVD algorithm in Metal ✅ Full GPU acceleration with proper Metal integration ✅ Mathematical correctness verified against CPU reference ✅ Support for both singular values only and full SVD (U, S, Vt) ✅ Comprehensive input validation and error handling ✅ Production-ready implementation with extensive testing IMPLEMENTATION: - Metal compute kernels implementing Jacobi algorithm - Proper MLX primitive integration with eval_gpu support - Optimized for matrices up to 64x64 (shared memory limitation) - Float32 precision (Metal hardware limitation) - Batched operations support TESTING: - Comprehensive test suite with 10 test cases - Mathematical correctness validation - Shape and type verification - Edge case handling - Performance characteristics testing This transforms MLX from 'Metal GPU SVD not yet implemented' to a complete, working GPU-accelerated SVD solution.	2025-06-15 17:44:38 +10:00
Awni Hannun	a14aaa7c9d	Fix cuda arg reduce (#2291 )	2025-06-14 17:54:00 -07:00
Awni Hannun	a6d780154f	fix cuda gemm for bf16 (#2288 )	2025-06-13 22:10:46 -07:00
Awni Hannun	6871e2eeb7	fix cuda jit (#2287 )	2025-06-13 19:21:46 -07:00
Awni Hannun	8402a2acf4	Fix complex power and print (#2286 ) * fix complex power and print * fix complex matmul shape	2025-06-13 11:13:00 -07:00
Jagrit Digani	fddb6933e1	Collection of refactors (#2274 ) * Refactor gemv into a function * Refactor splitk step 1 * Refactor split k axpby * Rearrange steel_gemm_regular * Redirect steel_gemm_regular * Add axpby routing to steel_matmul_regular * Refactor AddMM step 1 * Redirect steel_gemm * Update addmm * Comments and format * Some cleanup * Add architecture gen to device * Update no copy condition in normalization to account for axis size 1	2025-06-13 10:44:56 -07:00
Cheng	c8b4787e4e	CUDA backend: indexing ops (#2277 )	2025-06-12 21:44:19 -07:00
Awni Hannun	2188199ff8	[CUDA] ternary with select op (#2283 ) * cuda ternary with select op * comment + fix * fix	2025-06-12 20:24:43 -07:00
Awni Hannun	aa07429bad	Fix cuda build (#2284 )	2025-06-12 17:48:05 -07:00
Awni Hannun	918761a25a	[CUDA] RMSNorm and VJP (#2280 ) * rms norm start * nit	2025-06-12 17:09:49 -07:00
Cheng	a4fc671d3e	CUDA backend: compile (#2276 ) * CUDA backend: compile * Rename kernels/ to device/	2025-06-12 17:08:39 -07:00
Awni Hannun	f5f65ef48c	Make sliceUpdate general (#2282 ) * Make sliceUpdate general * fix	2025-06-12 16:48:54 -07:00
Cheng	c2dd81a8aa	Fix warnings from latest CUDA toolkit (#2275 )	2025-06-12 06:03:01 -07:00
Cheng	d7e680ffe4	CUDA backend: layernorm (#2271 )	2025-06-11 15:48:32 -07:00
Cheng	c371baf53a	CUDA backend: softmax (#2272 )	2025-06-11 13:55:22 -07:00
Cheng	ccf78f566c	CUDA backend: argreduce (#2270 )	2025-06-11 13:26:17 -07:00
Cheng	c9fa68664a	CUDA backend: reduce (#2269 )	2025-06-11 11:22:25 -07:00
Awni Hannun	c35f4d089a	start cuda circle config (#2256 ) * rebase * fix metal kernel linking issue on cuda * start cuda circle config	2025-06-10 21:19:47 -07:00
Angelos Katharopoulos	8590c0941e	Add load_safe to the general conv loaders (#2258 )	2025-06-10 20:58:16 -07:00
Cheng	095163b8d1	Fix building cpp benchmarks on Linux (#2268 )	2025-06-10 17:10:24 -07:00
Cheng	99c33d011d	rebase + nit (#2260 ) Co-authored-by: Awni Hannun <awni@apple.com>	2025-06-10 10:51:51 -07:00
Awni Hannun	62fecf3e13	fix conv export (#2265 )	2025-06-10 09:34:01 -07:00
Cheng	7c4eb5d03e	CUDA backend: random (#2261 )	2025-06-10 08:59:56 -07:00
Cheng	bae9a6b404	CUDA backend: sort (#2262 ) Co-authored-by: Awni Hannun <awni@apple.com>	2025-06-10 08:59:47 -07:00
Christopher Fleetwood	004c1d8ef2	Report number of missing parameters (#2264 ) * chore: inform * chore: format --------- Co-authored-by: FL33TW00D <FL33TW00D@users.noreply.github.com>	2025-06-10 06:37:50 -07:00
Cheng	7ebb2e0193	CUDA backend: binary ops (#2259 )	2025-06-10 06:37:40 -07:00
Awni Hannun	9ce77798b1	fix export to work with gather/scatter axis (#2263 )	2025-06-09 20:37:27 -07:00
Cheng	f8bad60609	CUDA backend: unary ops (#2158 )	2025-06-09 06:45:08 -07:00
Emmanuel Ferdman	5866b3857b	Refactor the lu test (#2250 ) Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>	2025-06-07 06:12:08 -07:00
Awni Hannun	1ca616844b	Fix unintuitive metal kernel caching (#2242 ) * Fix unintuitive metal kernel caching * alternative solution	2025-06-06 20:08:15 -07:00
Angelos Katharopoulos	2e8cf0b450	Change layernorms to two pass algorithm (#2246 )	2025-06-06 13:34:56 -07:00
Cheng	24f89173d1	CUDA backend: matmul (#2241 )	2025-06-06 12:24:04 -07:00
Awni Hannun	c6a20b427a	Improve metal elementwise kernels (#2247 ) * improve metal elementwise kernels * compile and copy * fix jit	2025-06-06 11:37:40 -07:00
Awni Hannun	a5ac9244c4	fix linux linking error (#2248 )	2025-06-06 10:41:51 -07:00
Awni Hannun	c763fe1be0	default strict mode for module update and update_modules (#2239 )	2025-06-05 15:27:02 -07:00
Cheng	52dc8c8cd5	Add profiler annotations in common primitives for CUDA backend (#2244 )	2025-06-04 19:55:12 -07:00
Angelos Katharopoulos	aede70e81d	Perf regression fix (#2243 )	2025-06-03 17:55:12 -07:00
Cheng	85a8beb5e4	Avoid atomic updates across CPU/GPU in CUDA event (#2231 )	2025-06-03 16:49:06 -07:00
Cheng	0bb89e9e5f	Share more common code in Compiled (#2240 ) * Share more common code in Compiled * Remove build_lib_name	2025-06-03 16:48:50 -07:00
Cheng	5685ceb3c7	Avoid invoking allocator::malloc when creating CUDA event (#2232 )	2025-06-03 16:48:40 -07:00
Suryash Malviya	0408ba0a76	Optimizing Complex Matrix Multiplication using Karatsuba’s Algorithm (#2220 ) * Implementing Complex Matmul using Karatsuba Algorithm * Implemented Karatsuba's Algorithm for complex matmul and pre-commit them * fix --------- Co-authored-by: Awni Hannun <awni@apple.com>	2025-06-02 15:58:46 -07:00
Awni Hannun	cbad6c3093	version (#2237 )	2025-06-02 15:58:33 -07:00
Cheng	1b021f6984	Fast primitives decide when to use the fallback (#2216 )	2025-06-02 13:26:37 -07:00
Cheng	95b7551d65	Do not check event.is_signaled() in eval_impl (#2230 )	2025-06-02 13:23:34 -07:00
Cheng	db5a7c6192	Add memory cache to CUDA backend (#2221 ) * Move BufferCache out of allocator * Add memory cache to cuda backend allocator * Simplify BufferCache assuming buf can not be null	2025-05-30 12:12:54 -07:00

1 2 3 4 5 ...

1212 Commits