Awni Hannun
df58b4133a
[CUDA] Reduce use of managed memory ( #2725 )
...
* Use async cuda malloc managed with cuda 13
* add pool threshold
* refactor for regular cuda malloc
* load eval gpu for cuda
* remove use of cuda pool, use cuda free async
* fix
* fix
* fix
* fix
* fix + comment
2025-11-05 16:05:23 -08:00
Anastasiia Filippova
27778156dc
Nccl reduce scatter, all gather ( #2727 )
...
* Added reduce scatter and all gather for nccl
* fix unused import, delete unused file
* small fix
* deleted useless condition
* fixed comments
* fix bug in eval_gpu, renamed to sum_scatter, fix docs
* final fix docs
* remove and
* Update mlx/distributed/mpi/mpi.cpp
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
* fix broken set input output
* fixes set output
* typo
* fix typo
* no cpu, no gpu for reduce scatter
---------
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
2025-11-05 08:21:11 -08:00
Awni Hannun
68c5fa1c95
fix memory count bug ( #2717 )
2025-10-30 14:27:15 -07:00
Awni Hannun
ec72b44417
Add quantize/dequantize for mxfp8 and nvfp4 ( #2688 )
...
* Add quantize/dequantize slow path for mxfp8 and nvfp4
* fast cuda kernel for mx/nv quantization
* fallback for cuda < 12.8 (#2697)
* format (#2700)
* fix (#2701)
* metal kernels
* docs
* fix jit
* add default bits and group sizes
* improve quant docs
* fix output type of mxfp4 matmuls
2025-10-28 16:23:12 -07:00
Awni Hannun
969924cc69
Fp8 conversion ( #2686 )
...
* add fp8 e4m3 converters
* add cuda
* default saturate to min/max
* fix for older OS
* fix no gpu/cpu
* fix saturate
* fix compile
2025-10-27 16:35:50 -07:00
Awni Hannun
4bce5f9b2d
suppress gcc 10.1 warnings ( #2679 )
...
* suppress gcc 10.1 warnings
* suppress gcc 10.1 warnings
2025-10-17 12:09:21 -07:00
Awni Hannun
36ca62dba8
remove unused unary file ( #2672 )
2025-10-13 19:36:26 -07:00
Awni Hannun
25e2356316
speed up scalars ( #2669 )
2025-10-13 12:10:15 -07:00
Awni Hannun
226a1d24e0
Debug cuda conv ( #2662 )
...
* use t4
* use t4
2025-10-10 16:12:47 -07:00
Awni Hannun
630350ad3e
Precise sigmoid ( #2659 )
...
* bump patch
* Sigmoid matches PyTorch and is more precise on tails
2025-10-10 10:05:23 -07:00
Awni Hannun
e89e8b4272
Export with callback ( #2612 )
...
* export with callback
* export with callback
* Add types, fix kwarg ordering bug + test
* cleanup, test, fix
* typos
2025-10-08 19:24:33 -07:00
Angelos Katharopoulos
0073096dd1
Split name into directories for cuda jit ( #2656 )
2025-10-07 01:52:58 -07:00
Angelos Katharopoulos
e3d004fed9
Fix and refactor row-reduce ( #2650 )
2025-10-07 01:51:08 -07:00
Daniel Yeh
22a5da76c8
Faster complex matmul ( #2571 )
2025-10-02 23:33:15 -07:00
Angelos Katharopoulos
c2c3e0b0a2
[CUDA] Add a small column specialization to reduce ( #2642 )
2025-10-02 14:41:05 -07:00
Awni Hannun
b0cc71ae71
Faster triu, tril, where with scalar ( #2644 )
2025-10-02 12:21:27 -07:00
Awni Hannun
bbf1423953
wait for tasks in cuda ( #2636 )
2025-09-30 16:08:46 -07:00
Awni Hannun
dc371ae7a5
fix for max block dim ( #2631 )
2025-09-29 08:59:25 -07:00
Cheng
b466dea982
[CUDA] Make CudaEvent work with multi-device ( #2614 )
...
* Set current device when creating cuda event
* Separate cuda events by device
* Avoid race condition in pool
2025-09-27 11:27:17 +09:00
Daniel Yeh
bf01ad9367
fix ( #2613 )
...
Co-authored-by: Chen-Chen Yeh <ge96noj@mytum.de>
2025-09-22 20:12:04 -07:00
Cheng
ae438d05fa
[CUDA] Recycle CUDA events ( #2604 )
...
* Make CudaEvent a CudaHandle
* Add caching for CudaEvent
* Make sure cuda events are destroyed at last
* Fix headers
* SharedEvent => AtomicEvent
* RawCudaEvent => CudaEventHandle, CudaEventWrapper => CopyableCudaEvent
* Remove unneeded asserts
2025-09-23 10:42:03 +09:00
Awni Hannun
711a645807
avoid producing NaN in attention ( #2608 )
2025-09-22 13:10:43 -07:00
Cheng
787c0d90cd
Detect cache thrashing in LRUCache ( #2600 )
...
* Detect cache thrashing in LRUCache
* Do not check cache thrashing in tests
2025-09-19 09:12:14 +09:00
Cheng
6a3acf2301
[CUDA] Set bias as input when using bias epilogue ( #2584 )
2025-09-11 15:31:09 +09:00
Awni Hannun
d6977f2a57
Add sdpa with sinks ( #2558 )
...
* add sdpa with sinks
* fix 2 pass
* fix matrix sdpa
* fix perf regression
* add to cuda (#2580)
2025-09-10 14:53:00 -07:00
Cheng
44cc5da4bc
[CUDA] Fix alpha not respected when using bias epilogue ( #2578 )
2025-09-10 09:08:01 +09:00
Cheng
dde3682b69
[CUDA] Use GEMM with epilogue instead of AddMM ( #2569 )
2025-09-09 13:18:49 +09:00
Awni Hannun
17310d91a6
Add batch offsets for mx.fast.rope ( #2564 )
...
* implement batch rope for Metal
* cuda rope (#2576)
2025-09-08 17:35:07 -07:00
Awni Hannun
70560b6bd5
Add mode parameter for quantization ( #2499 )
...
* add mode parameter for quantization
* mxfp4 quantize/dequantize + start of optional biases
* mxfp4 works
* speedup
* cpu mxfp4
* fix
* fix test tol
* fix
* refactor
* add quant mode enum
2025-08-28 06:45:26 -07:00
Awni Hannun
7ef8a6f2d5
[CUDA] fix sort ( #2550 )
...
* [CUDA] fix sort
* fix test
2025-08-27 19:48:43 -07:00
Cheng
31c6f6e33f
[CUDA] Use ConcurrentContext in concatenate_gpu ( #2549 )
2025-08-28 09:30:08 +09:00
Cheng
a9bac3d9e5
Run CPP tests for CUDA build in CI ( #2544 )
2025-08-27 08:06:46 +09:00
Awni Hannun
a4dba65220
Enable cuda graph toggle ( #2545 )
...
* enable cuda graph toggle
* increase cache size
2025-08-26 12:50:38 -07:00
Cheng
4822c3dbe9
[CUDA] Implement DynamicSlice/DynamicSliceUpdate ( #2533 )
...
* Move DynamicSlice to gpu/primitives
* Implement compute_dynamic_offset in CUDA
2025-08-26 07:31:39 +09:00
Cheng
333ffea273
[CUDA] Remove thrust in arange ( #2535 )
2025-08-24 16:22:36 +09:00
Awni Hannun
30561229c7
Fix allocation bug in NCCL ( #2530 )
2025-08-22 14:39:43 -07:00
Awni Hannun
068a4612e9
nccl default for backend=any ( #2528 )
...
* nccl default for backend=any
* check num gpus + ensure row contiguous for all reduce
* comment
2025-08-22 12:24:27 -07:00
Andrey Portnoy
5722c147de
[CUDA] Update calls to cudaMemAdvise and cudaGraphAddDependencies for CUDA 13 ( #2525 )
...
* [CUDA] Update cudaMemAdvise and cudaGraphAddDependencies for CUDA 13
These functions' signatures changed in CUDA 13, so we differentiate
between CUDA 13 and preceding releases at compile time.
* Mention NVIDIA in ACKNOWLEDGMENTS.md
2025-08-21 19:57:20 -07:00
Cheng
f6819a1f26
Fix warning 186-D from nvcc ( #2527 )
2025-08-22 10:29:55 +09:00
Anastasiia Filippova
9392fc3f88
NCCL backend ( #2476 )
2025-08-21 11:56:15 -07:00
Awni Hannun
e843c4d8d5
fix power ( #2523 )
2025-08-21 06:46:01 -07:00
Angelos Katharopoulos
e397177f6e
Custom cuda kernel ( #2517 )
2025-08-20 17:20:22 -07:00
Cheng
f4c8888cbe
[CUDA] Fix stride of singleton dims before passing to cuDNN ( #2521 )
2025-08-21 08:55:26 +09:00
Cheng
ac85ddfdb7
[CUDA] Add GEMM-based fallback convolution kernels ( #2511 )
...
* Add gemm_conv
* Add gemm_grouped_conv
2025-08-20 10:06:22 +09:00
Cheng
65d0d40232
Split cuDNN helpers into a separate header ( #2491 )
...
* Add RAII managed CudaGraph class
* Implement forward rms_norm with cuDNN
* Revert back to old rms norm kernel
2025-08-20 09:29:28 +09:00
Cheng
c422050ca7
Update cuDNN Frontend to v1.14 ( #2505 )
2025-08-17 19:13:01 +09:00
Cheng
1ba18ff7d9
[CUDA] Fix conv grads with groups ( #2495 )
...
* Put reshape utils in one file
* [CUDA] Fix conv grads with groups
* Put the reshape utils in gpu/copy.h
2025-08-16 10:09:18 +09:00
Awni Hannun
6441c21a94
Faster general unary op ( #2472 )
...
* faster general unary op
* faster general ops + reorg
* fix + comment
* binary two
* copy general
2025-08-15 15:04:12 -07:00
Cheng
dfb5022eab
Rename cu::Matmul to CublasGemm ( #2488 )
2025-08-13 09:37:40 +09:00
Cheng
aa7b47481a
[CUDA] Optimize set_mm_device_pointers for small ndim ( #2473 )
2025-08-08 15:23:30 +09:00