zhangyiss/mlx - mlx - Gitea for Geophysics

mirror of https://github.com/ml-explore/mlx.git synced 2025-12-16 01:49:05 +08:00

Author	SHA1	Message	Date
Jagrit Digani	7f8ba2a003	[WIP] 2 pass sdpav	2025-08-06 09:56:39 -07:00
Jagrit Digani	c28249b81a	Add more nvtx range for debug	2025-08-06 09:56:39 -07:00
Jagrit Digani	e74bcdc5e3	Add sdpa file	2025-08-06 09:56:39 -07:00
Jagrit Digani	d8ed6c1aa3	Add base cudnn attention support	2025-08-06 09:56:39 -07:00
Awni Hannun	db5c7efcf6	revert default cuda install (#2465 ) * revert default cuda install * revert default cuda install	2025-08-06 06:19:12 -07:00
Awni Hannun	7bb96e4249	fix cublas on h100 (#2466 )	2025-08-06 06:18:58 -07:00
Awni Hannun	fa89f0b150	faster gather qmm sorted test (#2463 )	2025-08-05 06:27:40 -07:00
Awni Hannun	ca973d1e83	fix install tags (#2464 )	2025-08-04 20:01:23 -07:00
Cheng	828c5f1137	Use SmallVector for shapes and strides (#2454 ) * Use SmallVector for shapes and strides * Convert SmallVector to tuple	2025-08-05 09:41:03 +09:00
Gaétan Lepage	7d86a5c108	Feat: add USE_SYSTEM_FMT CMake option (#2219 )	2025-08-04 16:36:11 -07:00
Awni Hannun	0b807893a7	fix wraps compile (#2461 )	2025-08-04 16:14:18 -07:00
Awni Hannun	6ad0889c8a	default install cuda on linux (#2462 )	2025-08-04 15:33:05 -07:00
Zamderax	737dd6d1ac	Add missing <algorithm> header to jit_compiler.cpp (#2460 ) Fixes compilation error on Linux where std::find_if is used on line 121 but the <algorithm> header was not included. While this might work on some platforms due to transitive includes, it's not guaranteed by the C++ standard. Resolves issue #2459	2025-08-04 14:00:46 -07:00
Cheng	aaf78f4c6b	Use LRU cache for cuda graph (#2448 ) * Use LRU cache for cuda graph * Remove unused destructor	2025-08-02 21:28:57 +09:00
Angelos Katharopoulos	8831064493	Fix arctan2 grads (#2453 )	2025-08-01 21:06:04 -07:00
Angelos Katharopoulos	be9bc96da4	[CUDA] Matmul utils initial commit (#2441 )	2025-08-01 14:22:25 -07:00
Angelos Katharopoulos	86258f292f	[CUDA] Vectorize generated kernels (#2444 )	2025-07-31 18:18:57 -07:00
Cheng	b26d88591c	[CUDA] Save primitive inputs faster (#2449 ) * Add more nvtx loggings * [CUDA] Saving primitive inputs faster * Remove unneeded check	2025-08-01 10:16:06 +09:00
Cheng	86c6a15571	[CUDA] Backward convolution (#2431 )	2025-08-01 09:54:05 +09:00
junpeiz	8b25ce62d5	Add tests for export including control flow models and quantized models (#2430 ) * Add tests for export, including control flow export and quantized model export. * Skip quantization related test for CUDA backend.	2025-07-31 11:06:26 -07:00
Awni Hannun	da5912e4f2	fix custom metal extension (#2446 )	2025-07-31 06:25:36 -07:00
Cheng	daafee676f	Fix wrong graph key when using concurrent context (#2447 )	2025-07-31 06:01:05 -07:00
Awni Hannun	d32519c8ee	fix gemv regression (#2445 )	2025-07-30 14:23:01 -07:00
Awni Hannun	b405591249	fix circular reference (#2443 )	2025-07-30 09:37:44 -07:00
Angelos Katharopoulos	3bf81ed1bd	[CUDA] Quantized refactoring (#2442 )	2025-07-30 08:27:20 -07:00
Cheng	2204182bba	Make CI faster (#2440 )	2025-07-30 02:26:36 -07:00
Cheng	3628e5d497	Use load_vector in arg_reduce (#2439 )	2025-07-30 17:40:26 +09:00
Cheng	a0ae49d397	Move arange to its own file (#2438 )	2025-07-30 13:05:51 +09:00
Cheng	254476718b	Remove the kernel arg from get_launch_args (#2437 )	2025-07-30 11:43:02 +09:00
Awni Hannun	3adba92ebe	Cuda faster softmax (#2435 ) * faster softmax and logsumexp * faster softmax and logsumexp * format	2025-07-29 17:18:12 -07:00
Awni Hannun	ef631d63af	faster rms norm (#2433 )	2025-07-29 13:12:00 -07:00
Cheng	970dbe8e25	Use ccache in CI (#2414 ) * Detect ccache * Use ccache in CI * Separate cache for different images * Test both 12.2 and 12.9 for PRs	2025-07-29 08:43:22 +09:00
Awni Hannun	641be9463b	Add more CUDA architectures for PyPi package (#2427 ) * add cuda sm 90 * add more archs	2025-07-28 12:35:15 -07:00
Awni Hannun	ab0e608862	[CUDA] More sizes for gemv (#2429 ) * route more to gemv * route more sizes to custom gemv	2025-07-28 12:35:01 -07:00
Awni Hannun	1588659062	no occupancy query for launch params (#2426 )	2025-07-28 09:09:41 -07:00
Awni Hannun	b9e88fb976	[CUDA] Fix segfault on exit (#2424 ) * fix cuda segfault on exit * comment	2025-07-27 08:08:13 -07:00
Awni Hannun	4ad53414dd	fix cuda pypi package (#2423 ) * fix cuda pypi package * patch bump v0.27.1	2025-07-25 15:20:29 -07:00
Awni Hannun	d1165b215e	version (#2420 )	2025-07-25 13:29:28 -07:00
Awni Hannun	dcb8319f3d	update install docs and requirements (#2419 )	2025-07-25 12:13:19 -07:00
Awni Hannun	5597fa089c	Fix qvm splitk (#2415 )	2025-07-25 11:50:24 -07:00
Awni Hannun	9acec364c2	[CUDA] Always use batched matmul (#2404 ) * cuda batched mm * addmm as well * comment	2025-07-24 20:46:02 -07:00
Skonor	7d9d6ef456	docs: fix adam and adamw eps placement (#2416 ) Co-authored-by: Mikhail Gorbunov <m_gorbunov@apple.com>	2025-07-24 16:40:45 -07:00
Cheng	6f5874a2f2	[CUDA] Initial implementation of Convolution with cuDNN (#2385 ) * Link with cuDNN * Initial implementation * Remove backend apis * Fix recording cudnn conv * More unused backend apis * Fix C++ conv tests * include cudnn as python dep * Install libcudnn9-dev-cuda-12 in CI * cudnn only accepts contiguous inputs * Switch to backend apis * Plan needs to be kept alive * Turn off tf32 * Add cache * Test the native cuda graph api * Set cudnn stream before execution * Make LRUCache more like a normal container * Do error check for cublas handle * Zero-initilizing array * Use tf32 for conv * Skip TestConv.test_torch_conv_2D test --------- Co-authored-by: Awni Hannun <awni@apple.com>	2025-07-25 08:12:10 +09:00
Awni Hannun	70dc336785	Test on cuda 12.2 and 12.9 (#2413 )	2025-07-24 06:06:15 -07:00
Awni Hannun	4e504039f5	[Metal] Release metal events (#2412 ) * release metal events * fix * fix	2025-07-23 19:53:42 -07:00
Awni Hannun	d1f4d291e8	Fix uv install and add dev release (#2411 ) * fix uv install and add dev release * fix docstring * pin cuda deps * cuda release on cpu-only machine	2025-07-23 16:54:19 -07:00
Awni Hannun	e1840853ce	full row mask in sdpa consistently gives nan (#2406 )	2025-07-23 16:37:03 -07:00
Cheng	0f5ce173da	[CUDA] --compress-mode requires CUDA 12.8 (#2407 )	2025-07-23 06:11:11 -07:00
Cheng	588854195f	Remove unused code in Convolution::vjp (#2408 )	2025-07-23 06:11:00 -07:00
Fangjun Kuang	28d068bce6	Fix an error in the comment for mx.dequantize (#2409 )	2025-07-23 06:10:50 -07:00


1325 Commits