24 Commits

Author SHA1 Message Date
Awni Hannun
e843c4d8d5
fix power (#2523) 2025-08-21 06:46:01 -07:00
Awni Hannun
6441c21a94
Faster general unary op (#2472)
* faster general unary op

* faster general ops + reorg

* fix + comment

* binary two

* copy general
2025-08-15 15:04:12 -07:00
Awni Hannun
d32519c8ee
fix gemv regression (#2445) 2025-07-30 14:23:01 -07:00
Cheng
3628e5d497
Use load_vector in arg_reduce (#2439) 2025-07-30 17:40:26 +09:00
Cheng
a0ae49d397
Move arange to its own file (#2438) 2025-07-30 13:05:51 +09:00
Awni Hannun
ef631d63af
faster rms norm (#2433) 2025-07-29 13:12:00 -07:00
Awni Hannun
641be9463b
Add more CUDA architectures for PyPi package (#2427)
* add cuda sm 90

* add more archs
2025-07-28 12:35:15 -07:00
Awni Hannun
d107d8d495
add cuda gemv (#2400) 2025-07-22 08:24:13 -07:00
Cheng
f55c4ed1d6
Remove thrust iterators (#2396) 2025-07-21 07:30:27 -07:00
Awni Hannun
d7734edd9f
fix complex reduce + nan propagation in min and max (#2377) 2025-07-15 18:19:47 -07:00
Cheng
cb349a291c
[CUDA] Use cuda::std::complex in place of cuComplex (#2372) 2025-07-15 00:36:13 -07:00
Cheng
6325f60d52
[CUDA] Bundle CCCL for JIT compilation (#2357)
* Ship CCCL for JIT compilation

* Remove cexpf
2025-07-11 18:45:37 -07:00
Cheng
8347575ba1
[CUDA] Implement Scan kernel (#2347)
* Contiguous scan

* Strided scan

* Enable tests

* Fix failing logaddexp test

* Use cexpf in Metal
2025-07-10 16:54:12 -07:00
Cheng
2ca533b279
Fix compilation with CUDA 11 (#2331) 2025-07-07 20:00:43 -07:00
Cheng
9d10239af7
[CUDA] Do vectorized store/load in binary ops (#2330) 2025-07-07 08:44:14 -07:00
Awni Hannun
dd4f53db63
use fp32 for testing, add more complex ops (#2322) 2025-07-01 07:30:00 -07:00
Awni Hannun
c9a9180584
Cuda perf tuning (#2307)
* perf tuning

* fix adding inputs arrays in matmul / srot

* format

* fix
2025-06-20 14:50:57 -07:00
Awni Hannun
b8022c578a
divmod, partition, sort fixes (#2302) 2025-06-16 18:49:32 -07:00
Awni Hannun
bc53f8293f
Cuda bug fixes 2 (#2298)
* more bug fixes

* more bug fixes

* format
2025-06-16 13:14:46 -07:00
Awni Hannun
c552ff2451
[CUDA] Fix back-end bugs and enable corresponding tests (#2296)
* Fix some cuda back-end bugs and enable corresponding tests

* more fixes

* enable more tests

* format
2025-06-16 08:45:40 -07:00
Awni Hannun
8402a2acf4
Fix complex power and print (#2286)
* fix complex power and print

* fix complex matmul shape
2025-06-13 11:13:00 -07:00
Cheng
c8b4787e4e
CUDA backend: indexing ops (#2277) 2025-06-12 21:44:19 -07:00
Awni Hannun
2188199ff8
[CUDA] ternary with select op (#2283)
* cuda ternary with select op

* comment + fix

* fix
2025-06-12 20:24:43 -07:00
Cheng
a4fc671d3e
CUDA backend: compile (#2276)
* CUDA backend: compile

* Rename kernels/ to device/
2025-06-12 17:08:39 -07:00