Commit Graph

  • 728d4db582 Support destination arg in tree flatten/unflatten (#2450) Luca Vivona 2025-08-06 18:34:59 -04:00
  • 99d8de8445 Fix cudnn routing Jagrit Digani 2025-08-06 15:05:58 -07:00
  • c66b76a8c8 Update routing Jagrit Digani 2025-08-06 15:01:15 -07:00
  • f81edd184f Complete 2 pass sdpav Jagrit Digani 2025-08-06 13:57:40 -07:00
  • 7f8ba2a003 [WIP] 2 pass sdpav Jagrit Digani 2025-08-06 09:54:41 -07:00
  • c28249b81a Add more nvtx range for debug Jagrit Digani 2025-08-01 12:49:24 -07:00
  • e74bcdc5e3 Add sdpa file Jagrit Digani 2025-07-25 12:30:50 -07:00
  • d8ed6c1aa3 Add base cudnn attention support Jagrit Digani 2025-07-25 12:30:22 -07:00
  • db5c7efcf6 revert default cuda install (#2465) Awni Hannun 2025-08-06 06:19:12 -07:00
  • 7bb96e4249 fix cublas on h100 (#2466) Awni Hannun 2025-08-06 06:18:58 -07:00
  • fa89f0b150 faster gather qmm sorted test (#2463) Awni Hannun 2025-08-05 06:27:40 -07:00
  • ca973d1e83 fix install tags (#2464) Awni Hannun 2025-08-04 20:01:23 -07:00
  • 828c5f1137 Use SmallVector for shapes and strides (#2454) Cheng 2025-08-05 09:41:03 +09:00
  • 7d86a5c108 Feat: add USE_SYSTEM_FMT CMake option (#2219) Gaétan Lepage 2025-08-05 01:36:11 +02:00
  • 0b807893a7 fix wraps compile (#2461) Awni Hannun 2025-08-04 16:14:18 -07:00
  • 6ad0889c8a default install cuda on linux (#2462) Awni Hannun 2025-08-04 15:33:05 -07:00
  • 737dd6d1ac Add missing <algorithm> header to jit_compiler.cpp (#2460) Zamderax 2025-08-04 14:00:46 -07:00
  • aaf78f4c6b Use LRU cache for cuda graph (#2448) Cheng 2025-08-02 21:28:57 +09:00
  • 8831064493 Fix arctan2 grads (#2453) Angelos Katharopoulos 2025-08-01 21:06:04 -07:00
  • be9bc96da4 [CUDA] Matmul utils initial commit (#2441) Angelos Katharopoulos 2025-08-01 14:22:25 -07:00
  • 86258f292f [CUDA] Vectorize generated kernels (#2444) Angelos Katharopoulos 2025-07-31 18:18:57 -07:00
  • b26d88591c [CUDA] Save primitive inputs faster (#2449) Cheng 2025-08-01 10:16:06 +09:00
  • 86c6a15571 [CUDA] Backward convolution (#2431) Cheng 2025-08-01 09:54:05 +09:00
  • 8b25ce62d5 Add tests for export including control flow models and quantized models (#2430) junpeiz 2025-07-31 11:06:26 -07:00
  • da5912e4f2 fix custom metal extension (#2446) Awni Hannun 2025-07-31 06:25:36 -07:00
  • daafee676f Fix wrong graph key when using concurrent context (#2447) Cheng 2025-07-31 22:01:05 +09:00
  • d32519c8ee fix gemv regression (#2445) Awni Hannun 2025-07-30 14:23:01 -07:00
  • b405591249 fix circular reference (#2443) Awni Hannun 2025-07-30 09:37:44 -07:00
  • 3bf81ed1bd [CUDA] Quantized refactoring (#2442) Angelos Katharopoulos 2025-07-30 08:27:20 -07:00
  • 2204182bba Make CI faster (#2440) Cheng 2025-07-30 18:26:36 +09:00
  • 3628e5d497 Use load_vector in arg_reduce (#2439) Cheng 2025-07-30 17:40:26 +09:00
  • a0ae49d397 Move arange to its own file (#2438) Cheng 2025-07-30 13:05:51 +09:00
  • 254476718b Remove the kernel arg from get_launch_args (#2437) Cheng 2025-07-30 11:43:02 +09:00
  • 3adba92ebe Cuda faster softmax (#2435) Awni Hannun 2025-07-29 17:18:12 -07:00
  • ef631d63af faster rms norm (#2433) Awni Hannun 2025-07-29 13:12:00 -07:00
  • 970dbe8e25 Use ccache in CI (#2414) Cheng 2025-07-29 08:43:22 +09:00
  • 641be9463b Add more CUDA architectures for PyPi package (#2427) Awni Hannun 2025-07-28 12:35:15 -07:00
  • ab0e608862 [CUDA] More sizes for gemv (#2429) Awni Hannun 2025-07-28 12:35:01 -07:00
  • 1588659062 no occupancy query for launch params (#2426) Awni Hannun 2025-07-28 09:09:41 -07:00
  • b9e88fb976 [CUDA] Fix segfault on exit (#2424) Awni Hannun 2025-07-27 08:08:13 -07:00
  • 4ad53414dd fix cuda pypi package (#2423) (tag: v0.27.1) Awni Hannun 2025-07-25 15:20:29 -07:00
  • d1165b215e version (#2420) Awni Hannun 2025-07-25 13:29:28 -07:00
  • dcb8319f3d update install docs and requirements (#2419) Awni Hannun 2025-07-25 12:13:19 -07:00
  • 5597fa089c Fix qvm splitk (#2415) Awni Hannun 2025-07-25 11:50:24 -07:00
  • 9acec364c2 [CUDA] Always use batched matmul (#2404) Awni Hannun 2025-07-24 20:46:02 -07:00
  • 7d9d6ef456 docs: fix adam and adamw eps placement (#2416) Skonor 2025-07-24 16:40:45 -07:00
  • 6f5874a2f2 [CUDA] Initial implementation of Convolution with cuDNN (#2385) Cheng 2025-07-25 08:12:10 +09:00
  • 70dc336785 Test on cuda 12.2 and 12.9 (#2413) Awni Hannun 2025-07-24 06:06:15 -07:00
  • 4e504039f5 [Metal] Release metal events (#2412) Awni Hannun 2025-07-23 19:53:42 -07:00
  • d1f4d291e8 Fix uv install and add dev release (#2411) Awni Hannun 2025-07-23 16:54:19 -07:00
  • e1840853ce full row mask in sdpa consistently gives nan (#2406) Awni Hannun 2025-07-23 16:37:03 -07:00
  • 0f5ce173da [CUDA] --compress-mode requires CUDA 12.8 (#2407) Cheng 2025-07-23 22:11:11 +09:00
  • 588854195f Remove unused code in Convolution::vjp (#2408) Cheng 2025-07-23 22:11:00 +09:00
  • 28d068bce6 Fix an error in the comment for mx.dequantize (#2409) Fangjun Kuang 2025-07-23 21:10:50 +08:00
  • 8269c9d02d Support unaligned M qmm Angelos Katharopoulos 2025-07-23 00:40:27 -07:00
  • 903b40627c Add dynamic shared memory and improve qmm Angelos Katharopoulos 2025-07-22 23:36:53 -07:00
  • d107d8d495 add cuda gemv (#2400) Awni Hannun 2025-07-22 08:24:13 -07:00
  • 1e496ddb82 [CUDA] Simplify allocator (#2392) Awni Hannun 2025-07-22 08:24:01 -07:00
  • 74eccbf3fa use size option in binary (#2399) Awni Hannun 2025-07-22 07:00:53 -07:00
  • 08638223ca Fix including stubs in wheel (#2398) Awni Hannun 2025-07-22 06:30:17 -07:00
  • 700f7dcf01 Refactor the matmul a bit Angelos Katharopoulos 2025-07-21 23:38:21 -07:00
  • 56cc858af9 Add contiguous_copy_cpu util for copying array (#2397) Cheng 2025-07-21 23:30:35 +09:00
  • f55c4ed1d6 Remove thrust iterators (#2396) Cheng 2025-07-21 23:30:27 +09:00
  • 6c60bd1cbf Fixed mma and working dequant Angelos Katharopoulos 2025-07-21 04:39:27 -07:00
  • a64cc02a0c Somewhat working matmul primitives Angelos Katharopoulos 2025-07-21 02:22:25 -07:00
  • 346ae5fdb5 Refactor quantized Angelos Katharopoulos 2025-07-16 16:22:25 -07:00
  • 93d70419e7 [CUDA] speedup handling scalars (#2389) Awni Hannun 2025-07-18 21:47:31 -07:00
  • 63f663d9c6 fix cuda manylinux version to match others (#2388) Awni Hannun 2025-07-18 21:02:16 -07:00
  • 84b4d96efa fix release build + patch bump (#2387) (tag: v0.26.5) Awni Hannun 2025-07-18 14:47:37 -07:00
  • aec67f2fa6 patch bump (#2386) Awni Hannun 2025-07-18 12:25:48 -07:00
  • deee214a95 Adding support for the Muon Optimizer (#1914) Gökdeniz Gülmez 2025-07-18 21:25:28 +02:00
  • 45adec102c Add contiguous_copy_gpu util for copying array (#2379) Cheng 2025-07-18 22:44:25 +09:00
  • 31fc530c76 [CUDA] Add more ways finding CCCL headers in JIT (#2382) Cheng 2025-07-18 07:25:34 +09:00
  • fbb3f65a1a fix resource leaks in matmul and graph (#2383) Awni Hannun 2025-07-17 06:50:15 -07:00
  • 6b1b8ea91b [CUDA] Add work per thread to compile (#2368) Angelos Katharopoulos 2025-07-17 06:47:52 -07:00
  • b2273733ea Test with CUDA 12.2 (#2375) Awni Hannun 2025-07-16 13:00:37 -07:00
  • f409b229a4 fix ring distributed test (#2380) Awni Hannun 2025-07-16 11:25:24 -07:00
  • 30571e2326 Rename the copy util in cpu/copy.h to copy_cpu (#2378) Cheng 2025-07-16 23:34:24 +09:00
  • d7734edd9f fix complex reduce + nan propagation in min and max (#2377) Awni Hannun 2025-07-15 18:19:47 -07:00
  • 2ba69bc8fa lower memory uniform sampling (#2361) Awni Hannun 2025-07-15 14:22:07 -07:00
  • cb349a291c [CUDA] Use cuda::std::complex in place of cuComplex (#2372) Cheng 2025-07-15 16:36:13 +09:00
  • f0a0b077a0 Install linux with mlx[cuda] and mlx[cpu] (#2356) Awni Hannun 2025-07-14 17:17:33 -07:00
  • 49114f28ab fix flaky test (#2371) Awni Hannun 2025-07-14 17:16:18 -07:00
  • e7d2ebadd2 [CUDA] Affine quantize (#2354) Awni Hannun 2025-07-14 15:45:44 -07:00
  • e569803d7c update linux build (#2370) Awni Hannun 2025-07-14 15:13:56 -07:00
  • d34f887abc Add Primitive::name and remove Primitive::print (#2365) Cheng 2025-07-15 06:06:35 +09:00
  • 5201df5030 Fix imag() vjp (#2367) Angelos Katharopoulos 2025-07-14 13:11:16 -07:00
  • 2d3c26c565 [CUDA] Do not put kernels in annoymous namespace (#2362) Cheng 2025-07-13 06:24:45 +09:00
  • 6325f60d52 [CUDA] Bundle CCCL for JIT compilation (#2357) Cheng 2025-07-12 10:45:37 +09:00
  • a9c720e8cd Improve the ring backend initialization (ring-init) Angelos Katharopoulos 2025-07-11 15:31:28 -07:00
  • 42cc9cfbc7 fix copy dispatch (#2360) Awni Hannun 2025-07-11 10:59:35 -07:00
  • 8347575ba1 [CUDA] Implement Scan kernel (#2347) Cheng 2025-07-11 08:54:12 +09:00
  • b6eec20260 Fix edge check in qmm_n QuantizedLoader (#2355) Angelos Katharopoulos 2025-07-10 16:28:50 -07:00
  • 0eb035b4b1 Fix type promotion in Adam with bias correction (#2350) Angelos Katharopoulos 2025-07-10 11:14:42 -07:00
  • afb9817599 [CUDA] Put version in ptx cache dir path (#2352) Cheng 2025-07-10 23:24:21 +09:00
  • 8fb3e7a26c [CUDA] Set current device before cudaGraphLaunch (#2351) Cheng 2025-07-10 23:24:02 +09:00
  • 8c7bc30ce4 Align mlx::core::min op nan propagation with NumPy (#2346) jhavukainen 2025-07-10 06:20:43 -07:00
  • 85873cb162 [CUDA] Do vectorized store/load in contiguous elementwise ops (#2342) Cheng 2025-07-10 10:48:43 +09:00
  • e14ee12491 add zero for argsort vjp (#2345) Awni Hannun 2025-07-09 14:37:14 -07:00
  • 8b9a3f3cea Align mlx::core::max op nan propagation with NumPy (#2339) jhavukainen 2025-07-09 11:26:27 -07:00
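A listing in this shape (abbreviated hash, subject, author, ISO date, newest first) can be reproduced with plain `git log` format placeholders. The snippet below is a self-contained sketch: it builds a throwaway repository with one commit so it runs anywhere; on a real clone of the project you would run only the final `git log` line. The repository path and author identity here are hypothetical.

```shell
# Self-contained demo repo so the final command has something to show.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name='Jane Doe' -c user.email='jane@example.com' \
    commit -q --allow-empty -m 'Add sdpa file'

# One commit per line: abbreviated hash, subject, author name, ISO-8601 date.
git log --pretty=format:'%h %s %an %ad' --date=iso
```

Adding `--decorate` would also print tag and branch names (e.g. `(tag: v0.27.1)`) next to the commits that carry them.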