Awni Hannun
dc371ae7a5
fix for max block dim ( #2631 )
2025-09-29 08:59:25 -07:00
AN Long
e76a8dd5c5
Fix incorrect path and typos ( #2630 )
2025-09-28 06:03:04 -07:00
Cheng
b466dea982
[CUDA] Make CudaEvent work with multi-device ( #2614 )
* Set current device when creating cuda event
* Separate cuda events by device
* Avoid race condition in pool
2025-09-27 11:27:17 +09:00
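The bullets above describe keeping pooled events separated by device and avoiding a race condition in the pool. A minimal Python sketch of that general pattern, with hypothetical names (not MLX's actual C++ implementation):

```python
import threading
from collections import defaultdict

class PerDevicePool:
    """Recycle objects per device; a lock guards the pool against races."""

    def __init__(self, factory):
        self._factory = factory          # creates a fresh object for a device
        self._pools = defaultdict(list)  # device id -> free objects
        self._lock = threading.Lock()

    def acquire(self, device):
        with self._lock:                 # pop under the lock to avoid races
            if self._pools[device]:
                return self._pools[device].pop()
        return self._factory(device)     # slow path: create a new object

    def release(self, device, obj):
        with self._lock:                 # objects go back to their own device's pool
            self._pools[device].append(obj)
```

The per-device split matters because a CUDA event is tied to the device that was current when it was created, so recycled events must never cross devices.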
Angelos Katharopoulos
7a6adda1e6
Bump the version ( #2627 )
2025-09-26 15:15:28 -07:00
Angelos Katharopoulos
1a9f820af6
Compiled should not end in broadcast ( #2622 )
2025-09-26 13:36:09 -07:00
Jagrit Digani
7c7e48dbd1
New tuning for small K gemv ( #2620 )
* New tuning for small K gemv
2025-09-23 12:28:35 -07:00
Daniel Yeh
bf01ad9367
fix ( #2613 )
Co-authored-by: Chen-Chen Yeh <ge96noj@mytum.de>
2025-09-22 20:12:04 -07:00
Cheng
ae438d05fa
[CUDA] Recycle CUDA events ( #2604 )
* Make CudaEvent a CudaHandle
* Add caching for CudaEvent
* Make sure cuda events are destroyed last
* Fix headers
* SharedEvent => AtomicEvent
* RawCudaEvent => CudaEventHandle, CudaEventWrapper => CopyableCudaEvent
* Remove unneeded asserts
2025-09-23 10:42:03 +09:00
Awni Hannun
711a645807
avoid producing NaN in attention ( #2608 )
2025-09-22 13:10:43 -07:00
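A common way NaN appears in attention is a softmax over a fully masked row, where `exp(-inf)` terms sum to zero and the division yields 0/0. This hedged NumPy sketch illustrates that failure mode and one standard guard; it is an illustration of the general issue, not necessarily the specific fix in #2608:

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis that stays finite for fully masked rows."""
    neg = np.where(mask, scores, -np.inf)
    m = np.max(neg, axis=-1, keepdims=True)
    m = np.where(np.isfinite(m), m, 0.0)   # fully masked row: max is -inf
    e = np.exp(neg - m) * mask             # exp(-inf) = 0, no NaN so far
    s = np.sum(e, axis=-1, keepdims=True)
    return e / np.maximum(s, 1e-9)         # guard the 0/0 for masked rows
```

Fully masked rows come out as all zeros instead of NaN, which keeps downstream matmuls finite.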
Josh Bleecher Snyder
aa9d44b3d4
implement Convolution::output_shape ( #2601 )
- pull conv_out_shape out for re-use
- add Conv::output_shape
- add e2e python tests confirming shapeless=True support and correctness
Updates #2599
2025-09-22 10:09:45 -07:00
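The output shape of a convolution along one spatial dimension follows the standard formula; a small sketch of the arithmetic that a `Conv::output_shape` helper computes (names here are illustrative, not the repository's API):

```python
def conv_out_shape(in_size, kernel, stride=1, padding=0, dilation=1):
    """Output size of a convolution along one spatial dimension."""
    effective_k = dilation * (kernel - 1) + 1       # dilated kernel extent
    return (in_size + 2 * padding - effective_k) // stride + 1
```

Knowing the output shape without running the kernel is what makes `shapeless=True` tracing possible for convolutions.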
Awni Hannun
ec2ab42888
Lower sorted QMM gather threshold ( #2609 )
2025-09-19 18:22:55 -07:00
Cheng
787c0d90cd
Detect cache thrashing in LRUCache ( #2600 )
* Detect cache thrashing in LRUCache
* Do not check cache thrashing in tests
2025-09-19 09:12:14 +09:00
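Cache thrashing means the working set exceeds the cache capacity, so entries are evicted and then immediately missed again. A minimal Python sketch of an LRU cache that flags this condition (a generic illustration with hypothetical names, not MLX's `LRUCache`):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache that flags misses on recently evicted keys as thrashing."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()
        self._evicted = set()   # keys evicted since construction
        self.thrashing = False

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)   # mark as most recently used
            return self._data[key]
        if key in self._evicted:          # miss on a key we just evicted
            self.thrashing = True
        return None

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            old, _ = self._data.popitem(last=False)  # evict LRU entry
            self._evicted.add(old)
```

Per the second bullet above, such a check is something one would disable in tests, which deliberately use tiny caches.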
Oleksandr Bilous
e8b604a6a3
fix: library loading for swift dynamic frameworks ( #2568 )
2025-09-18 13:54:59 -07:00
Awni Hannun
caecbe876a
no copy batch rope ( #2595 )
2025-09-15 14:23:48 -07:00
Awni Hannun
6ccfa603cd
fix metal scan ( #2591 )
2025-09-15 11:01:57 -07:00
Awni Hannun
ee18e1cbf0
patch bump ( #2588 )
2025-09-11 17:10:09 -07:00
Awni Hannun
af120c2bc0
set nccl ABI version ( #2587 )
2025-09-11 16:55:53 -07:00
Cheng
6a3acf2301
[CUDA] Set bias as input when using bias epilogue ( #2584 )
2025-09-11 15:31:09 +09:00
Awni Hannun
d6977f2a57
Add sdpa with sinks ( #2558 )
* add sdpa with sinks
* fix 2 pass
* fix matrix sdpa
* fix perf regression
* add to cuda (#2580 )
2025-09-10 14:53:00 -07:00
Cheng
44cc5da4bc
[CUDA] Fix alpha not respected when using bias epilogue ( #2578 )
2025-09-10 09:08:01 +09:00
Cheng
dde3682b69
[CUDA] Use GEMM with epilogue instead of AddMM ( #2569 )
2025-09-09 13:18:49 +09:00
Awni Hannun
17310d91a6
Add batch offsets for mx.fast.rope ( #2564 )
* implement batch rope for Metal
* cuda rope (#2576 )
2025-09-08 17:35:07 -07:00
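Batch offsets for RoPE let each batch element start at its own position, which is useful when sequences in a batch are at different decoding offsets. A hedged NumPy sketch of the idea (the half-split pairing convention below is one common variant; it is not claimed to match `mx.fast.rope` exactly):

```python
import numpy as np

def rope(x, offsets, base=10000.0):
    """Rotary embedding with a per-batch position offset.
    x: [batch, seq, dim] with dim even; offsets: [batch]."""
    b, n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)           # [half]
    pos = offsets[:, None] + np.arange(n)[None, :]      # [b, n] absolute positions
    angles = pos[:, :, None] * freqs[None, None, :]     # [b, n, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

The defining property: rotating a token with offset `k` is identical to rotating it at absolute position `k` with no offset, so cached prefixes stay consistent.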
Cheng
a44b27f5f8
Fix a few ccache cache misses ( #2573 )
* Fix ccache cache misses
* Do not define _VERSION_ in python bindings
2025-09-09 07:41:05 +09:00
Awni Hannun
e5a33f2223
faster depthwise 1D conv ( #2567 )
2025-09-08 11:37:23 -07:00
Awni Hannun
b61a65e313
fix copies in sdpa ( #2563 )
2025-09-02 11:00:36 -07:00
Awni Hannun
8ce49cd39e
fix quantized vjp for mxfp4 ( #2555 )
2025-08-29 10:06:15 -07:00
Awni Hannun
9c68b50853
version bump ( #2554 )
2025-08-29 06:54:17 -07:00
Awni Hannun
111f1e71af
Faster contiguous gather for indices in the first axis ( #2552 )
* faster contiguous gather for indices in the first axis
* work per thread > 1
* angelos suggestion for scales / biases
2025-08-28 21:26:30 -07:00
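Gathering along the first axis of a C-contiguous array is a special case: each index selects one contiguous row block, so the gather becomes a sequence of block copies instead of a per-element scatter of reads. A NumPy sketch of why this layout is the fast path (illustrative only):

```python
import numpy as np

def gather_axis0(src, indices):
    """Gather along axis 0 of a C-contiguous array: each index picks one
    contiguous row block, so every copy is a single memcpy-like slice."""
    out = np.empty((len(indices),) + src.shape[1:], dtype=src.dtype)
    for i, idx in enumerate(indices):
        out[i] = src[idx]   # contiguous block copy of one row
    return out
```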
Awni Hannun
827003d568
fix METAL quantization in JIT ( #2553 )
2025-08-28 18:26:25 -07:00
Awni Hannun
70560b6bd5
Add mode parameter for quantization ( #2499 )
* add mode parameter for quantization
* mxfp4 quantize/dequantize + start of optional biases
* mxfp4 works
* speedup
* cpu mxfp4
* fix
* fix test tol
* fix
* refactor
* add quant mode enum
2025-08-28 06:45:26 -07:00
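The bullets above mention scales and optional biases alongside the new mxfp4 mode. As background, a hedged sketch of groupwise affine quantization, the scale-and-bias scheme in the spirit of the existing mode (mxfp4 itself is a block floating-point format with shared exponents and works differently):

```python
import numpy as np

def quantize(w, bits=4, group_size=32):
    """Groupwise affine quantization: each group stores a scale and a bias
    so that q = round((w - bias) / scale) fits in `bits` bits."""
    levels = 2 ** bits - 1
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    q = np.clip(np.round((g - lo) / scale), 0, levels)
    return q.astype(np.uint8), scale, lo

def dequantize(q, scale, bias):
    return q * scale + bias
```

Rounding error per element is bounded by half a scale step of its group, which is what the per-group scale buys over a single global scale.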
Awni Hannun
7ef8a6f2d5
[CUDA] fix sort ( #2550 )
* [CUDA] fix sort
* fix test
2025-08-27 19:48:43 -07:00
Cheng
31c6f6e33f
[CUDA] Use ConcurrentContext in concatenate_gpu ( #2549 )
2025-08-28 09:30:08 +09:00
Awni Hannun
584d48458e
link with nccl ( #2546 )
2025-08-27 10:01:07 -07:00
Cheng
5cf984ca87
Separate cpu compilation cache by versions ( #2548 )
2025-08-27 11:25:15 +09:00
Cheng
a9bac3d9e5
Run CPP tests for CUDA build in CI ( #2544 )
2025-08-27 08:06:46 +09:00
Awni Hannun
a4dba65220
Enable cuda graph toggle ( #2545 )
* enable cuda graph toggle
* increase cache size
2025-08-26 12:50:38 -07:00
Cheng
4822c3dbe9
[CUDA] Implement DynamicSlice/DynamicSliceUpdate ( #2533 )
* Move DynamicSlice to gpu/primitives
* Implement compute_dynamic_offset in CUDA
2025-08-26 07:31:39 +09:00
Awni Hannun
d2f540f4e0
Use nccl header only when nccl is not present ( #2539 )
* use nccl header only when nccl is not present
* larger machine for cuda build
2025-08-25 14:17:25 -07:00
Cheng
333ffea273
[CUDA] Remove thrust in arange ( #2535 )
2025-08-24 16:22:36 +09:00
Cheng
f55b6f1f2f
Enable COMPILE_WARNING_AS_ERROR for linux builds in CI ( #2534 )
2025-08-24 15:33:08 +09:00
Awni Hannun
30561229c7
Fix allocation bug in NCCL ( #2530 )
2025-08-22 14:39:43 -07:00
Awni Hannun
068a4612e9
nccl default for backend=any ( #2528 )
* nccl default for backend=any
* check num gpus + ensure row contiguous for all reduce
* comment
2025-08-22 12:24:27 -07:00
Andrey Portnoy
5722c147de
[CUDA] Update calls to cudaMemAdvise and cudaGraphAddDependencies for CUDA 13 ( #2525 )
* [CUDA] Update cudaMemAdvise and cudaGraphAddDependencies for CUDA 13
These functions' signatures changed in CUDA 13, so we differentiate
between CUDA 13 and preceding releases at compile time.
* Mention NVIDIA in ACKNOWLEDGMENTS.md
2025-08-21 19:57:20 -07:00
Cheng
f6819a1f26
Fix warning 186-D from nvcc ( #2527 )
2025-08-22 10:29:55 +09:00
Anastasiia Filippova
9392fc3f88
NCCL backend ( #2476 )
2025-08-21 11:56:15 -07:00
Awni Hannun
e843c4d8d5
fix power ( #2523 )
2025-08-21 06:46:01 -07:00
Angelos Katharopoulos
e397177f6e
Custom cuda kernel ( #2517 )
2025-08-20 17:20:22 -07:00
Cheng
f4c8888cbe
[CUDA] Fix stride of singleton dims before passing to cuDNN ( #2521 )
2025-08-21 08:55:26 +09:00
Angelos Katharopoulos
25c1e03205
Fix overflow in large filter small channels ( #2520 )
2025-08-20 08:03:29 -07:00
Cheng
ac85ddfdb7
[CUDA] Add GEMM-based fallback convolution kernels ( #2511 )
* Add gemm_conv
* Add gemm_grouped_conv
2025-08-20 10:06:22 +09:00
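A GEMM-based fallback convolution is the classic im2col approach: unfold input patches into a matrix, then one matrix multiply performs all the dot products. A minimal NumPy sketch (stride 1, no padding, single image; illustrative rather than the CUDA kernels added in #2511):

```python
import numpy as np

def gemm_conv(x, w):
    """Convolution as im2col + GEMM.
    x: [H, W, C_in], w: [kH, kW, C_in, C_out], stride 1, no padding."""
    H, W, C = x.shape
    kH, kW, _, O = w.shape
    oH, oW = H - kH + 1, W - kW + 1
    cols = np.empty((oH * oW, kH * kW * C))
    for i in range(oH):
        for j in range(oW):
            # Each output location becomes one row of flattened input patch.
            cols[i * oW + j] = x[i:i + kH, j:j + kW].ravel()
    # One GEMM computes every output channel at every location.
    return (cols @ w.reshape(-1, O)).reshape(oH, oW, O)
```

The appeal as a fallback is that it reduces any convolution shape to a matrix multiply, for which highly tuned kernels already exist.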