Jagrit Digani
|
2e49b57ea5
|
Redirect steel_gemm
|
2025-06-11 09:58:55 -07:00 |
|
Jagrit Digani
|
3ad2574d1a
|
Refactor AddMM step 1
|
2025-06-11 09:58:55 -07:00 |
|
Jagrit Digani
|
dd1b6fa629
|
Add axpby routing to steel_matmul_regular
|
2025-06-11 09:58:55 -07:00 |
|
Jagrit Digani
|
a733dae4ba
|
Redirect steel_gemm_regular
|
2025-06-11 09:58:55 -07:00 |
|
Jagrit Digani
|
13585ba4ee
|
Rearrange steel_gemm_regular
|
2025-06-11 09:58:55 -07:00 |
|
Jagrit Digani
|
7e9ac08a61
|
Refactor split k axpby
|
2025-06-11 09:58:55 -07:00 |
|
Jagrit Digani
|
d192587cdf
|
Refactor splitk step 1
|
2025-06-11 09:58:55 -07:00 |
|
Jagrit Digani
|
f828c5b5ae
|
Refactor gemv into a function
|
2025-06-11 09:58:55 -07:00 |
|
Awni Hannun
|
c35f4d089a
|
start cuda circle config (#2256)
* rebase
* fix metal kernel linking issue on cuda
* start cuda circle config
|
2025-06-10 21:19:47 -07:00 |
|
Angelos Katharopoulos
|
8590c0941e
|
Add load_safe to the general conv loaders (#2258)
|
2025-06-10 20:58:16 -07:00 |
|
Cheng
|
99c33d011d
|
rebase + nit (#2260)
Co-authored-by: Awni Hannun <awni@apple.com>
|
2025-06-10 10:51:51 -07:00 |
|
Cheng
|
7c4eb5d03e
|
CUDA backend: random (#2261)
|
2025-06-10 08:59:56 -07:00 |
|
Cheng
|
bae9a6b404
|
CUDA backend: sort (#2262)
Co-authored-by: Awni Hannun <awni@apple.com>
|
2025-06-10 08:59:47 -07:00 |
|
Cheng
|
7ebb2e0193
|
CUDA backend: binary ops (#2259)
|
2025-06-10 06:37:40 -07:00 |
|
Cheng
|
f8bad60609
|
CUDA backend: unary ops (#2158)
|
2025-06-09 06:45:08 -07:00 |
|
Awni Hannun
|
1ca616844b
|
Fix unintuitive metal kernel caching (#2242)
* Fix unintuitive metal kernel caching
* alternative solution
|
2025-06-06 20:08:15 -07:00 |
|
Angelos Katharopoulos
|
2e8cf0b450
|
Change layernorms to two pass algorithm (#2246)
|
2025-06-06 13:34:56 -07:00 |
|
Cheng
|
24f89173d1
|
CUDA backend: matmul (#2241)
|
2025-06-06 12:24:04 -07:00 |
|
Awni Hannun
|
c6a20b427a
|
Improve metal elementwise kernels (#2247)
* improve metal elementwise kernels
* compile and copy
* fix jit
|
2025-06-06 11:37:40 -07:00 |
|
Cheng
|
52dc8c8cd5
|
Add profiler annotations in common primitives for CUDA backend (#2244)
|
2025-06-04 19:55:12 -07:00 |
|
Cheng
|
85a8beb5e4
|
Avoid atomic updates across CPU/GPU in CUDA event (#2231)
|
2025-06-03 16:49:06 -07:00 |
|
Cheng
|
0bb89e9e5f
|
Share more common code in Compiled (#2240)
* Share more common code in Compiled
* Remove build_lib_name
|
2025-06-03 16:48:50 -07:00 |
|
Cheng
|
5685ceb3c7
|
Avoid invoking allocator::malloc when creating CUDA event (#2232)
|
2025-06-03 16:48:40 -07:00 |
|
Cheng
|
1b021f6984
|
Fast primitives decide when to use the fallback (#2216)
|
2025-06-02 13:26:37 -07:00 |
|
Cheng
|
db5a7c6192
|
Add memory cache to CUDA backend (#2221)
* Move BufferCache out of allocator
* Add memory cache to cuda backend allocator
* Simplify BufferCache assuming buf can not be null
|
2025-05-30 12:12:54 -07:00 |
|
Awni Hannun
|
6ef2f67e7f
|
5bit quants (#2226)
* 5bit quants
|
2025-05-30 12:12:10 -07:00 |
|
Cheng
|
f76ee1ffd2
|
Move some dims utils to common (#2223)
|
2025-05-29 06:48:30 -07:00 |
|
Cheng
|
54a71f270a
|
Remove unused defines (#2217)
|
2025-05-23 06:14:58 -07:00 |
|
Cheng
|
79071bfba4
|
Fix out-of-bounds default value in logsumexp/softmax (#2213)
|
2025-05-21 07:25:16 -07:00 |
|
Cheng
|
7774b87cbd
|
Remove redundant simd_sum in logsumexp (#2210)
|
2025-05-21 07:25:03 -07:00 |
|
Cheng
|
35c87741cf
|
Build for compute capability 70 instead of 75 (#2209)
|
2025-05-20 19:42:48 -07:00 |
|
Awni Hannun
|
eebe73001a
|
fix large arg reduce (#2206)
|
2025-05-19 13:10:44 -07:00 |
|
Cheng
|
237f9e58a8
|
Fix BEFORE keyword in target_include_directories (#2204)
|
2025-05-19 06:10:44 -07:00 |
|
Awni Hannun
|
8576e6fe36
|
fix conv2d bug + faster conv 1d (#2195)
* fix conv2d bug + faster conv 1d
* revert sort + flaky test
|
2025-05-18 06:05:11 -07:00 |
|
Angelos Katharopoulos
|
0654543dcc
|
Add complex eigh (#2191)
|
2025-05-18 00:18:43 -07:00 |
|
Cheng
|
7d4b378952
|
Include cuda_bf16.h for bfloat16 overloads (#2192)
* Include cuda_bf16.h for bfloat16 overloads
* Add NO_GPU_MULTI(Eig) in cuda backend
|
2025-05-16 06:44:42 -07:00 |
|
Jack Wind
|
7ff5c41e06
|
Add set_threadgroup_memory_length to CommandEncoder (#2183)
|
2025-05-16 00:28:03 -07:00 |
|
Awni Hannun
|
c1eb9d05d9
|
non-symmetric eig and eigh (#2188)
|
2025-05-15 13:01:44 -07:00 |
|
Cheng
|
0751263dec
|
Fix typo in row_reduce_small (#2179)
|
2025-05-13 20:19:54 -07:00 |
|
Cheng
|
eca2f3eb97
|
Add remove_index utility (#2173)
|
2025-05-13 17:09:56 -07:00 |
|
Awni Hannun
|
8f3d208dce
|
Close a couple edge case bugs: hadamard and addmm on empty inputs (#2177)
* handle hadamard and addmm on empty inputs
* fix
|
2025-05-12 10:48:57 -07:00 |
|
Awni Hannun
|
6661387066
|
Fix fft for integer overflow (#2161)
|
2025-05-09 14:25:12 -07:00 |
|
ATurker
|
a7fae8a176
|
fix: conv_general differences between gpu, cpu (#2070)
* fix general_conv padding
* fix bugs
* add test
Co-authored-by: Awni Hannun <awni@apple.com>
|
2025-05-09 10:26:52 -07:00 |
|
Cheng
|
0cae0bdac8
|
CUDA backend: backbone (#2075)
|
2025-05-06 21:26:46 -07:00 |
|
Awni Hannun
|
5a1a5d5ed1
|
fix input coherent kernel launch (#2153)
|
2025-05-05 17:30:50 -07:00 |
|
Cheng
|
1683975acf
|
Move common gpu primitives to backend/gpu (#2145)
|
2025-05-05 13:45:29 -07:00 |
|
Awni Hannun
|
af705590ac
|
fix batched vector sdpa (#2152)
|
2025-05-05 13:13:03 -07:00 |
|
Awni Hannun
|
825124af8f
|
fix bw for elementwise ops (#2151)
* fix bw for elementwise ops
* add compile
* fix
|
2025-05-05 06:15:04 -07:00 |
|
Angelos Katharopoulos
|
481349495b
|
GPU Hadamard for large N (#1879)
|
2025-05-01 17:19:17 -07:00 |
|
Awni Hannun
|
e496c5a4b4
|
fix integer overflow in qmm (#2143)
|
2025-04-30 09:28:56 -07:00 |
|