copy | [CUDA] Do vectorized store/load in contiguous elementwise ops (#2342) | 2025-07-09 18:48:43 -07:00
device | Fix compilation with CUDA 11 (#2331) | 2025-07-07 20:00:43 -07:00
iterators | CUDA backend: argreduce (#2270) | 2025-06-11 13:26:17 -07:00
reduce | Fix compilation with CUDA 11 (#2331) | 2025-07-07 20:00:43 -07:00
allocator.cpp | Cuda perf tuning (#2307) | 2025-06-20 14:50:57 -07:00
allocator.h | Avoid invoking allocator::malloc when creating CUDA event (#2232) | 2025-06-03 16:48:40 -07:00
arg_reduce.cu | Fix compilation with CUDA 11 (#2331) | 2025-07-07 20:00:43 -07:00
bin2h.cmake | CUDA backend: compile (#2276) | 2025-06-12 17:08:39 -07:00
binary_two.cu | [CUDA] Do vectorized store/load in contiguous elementwise ops (#2342) | 2025-07-09 18:48:43 -07:00
binary.cu | [CUDA] Do vectorized store/load in contiguous elementwise ops (#2342) | 2025-07-09 18:48:43 -07:00
CMakeLists.txt | [CUDA] Fix reductions (#2314) | 2025-06-27 12:59:20 -07:00
compiled.cpp | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00
copy.cu | Cuda perf tuning (#2307) | 2025-06-20 14:50:57 -07:00
cuda.cpp | start cuda circle config (#2256) | 2025-06-10 21:19:47 -07:00
cuda.h | start cuda circle config (#2256) | 2025-06-10 21:19:47 -07:00
device.cpp | [CUDA] Set current device before cudaGraphLaunch (#2351) | 2025-07-10 07:24:02 -07:00
device.h | [CUDA] Set current device before cudaGraphLaunch (#2351) | 2025-07-10 07:24:02 -07:00
eval.cpp | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00
event.cu | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00
event.h | CUDA backend: backbone (#2075) | 2025-05-06 21:26:46 -07:00
fence.cpp | Avoid atomic updates across CPU/GPU in CUDA event (#2231) | 2025-06-03 16:49:06 -07:00
indexing.cpp | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00
jit_module.cpp | [CUDA] Put version in ptx cache dir path (#2352) | 2025-07-10 07:24:21 -07:00
jit_module.h | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00
kernel_utils.cu | RoPE for CUDA (#2293) | 2025-06-15 06:08:07 -07:00
kernel_utils.cuh | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00
layer_norm.cu | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00
logsumexp.cu | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00
matmul.cpp | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00
no_cuda.cpp | start cuda circle config (#2256) | 2025-06-10 21:19:47 -07:00
primitives.cu | MoE backward improvements (#2335) | 2025-07-07 17:59:53 -07:00
random.cu | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00
reduce.cu | [CUDA] Fix reductions (#2314) | 2025-06-27 12:59:20 -07:00
rms_norm.cu | Fix compilation with CUDA 11 (#2331) | 2025-07-07 20:00:43 -07:00
rope.cu | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00
slicing.cpp | rebase + nit (#2260) | 2025-06-10 10:51:51 -07:00
softmax.cu | Fix compilation with CUDA 11 (#2331) | 2025-07-07 20:00:43 -07:00
sort.cu | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00
ternary.cu | [CUDA] Do vectorized store/load in contiguous elementwise ops (#2342) | 2025-07-09 18:48:43 -07:00
unary.cu | [CUDA] Do vectorized store/load in contiguous elementwise ops (#2342) | 2025-07-09 18:48:43 -07:00
utils.cpp | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00
utils.h | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00
worker.cpp | [CUDA] synch properly waits for all tasks to finish and clear (#2303) | 2025-06-17 12:03:25 -07:00
worker.h | CUDA backend: backbone (#2075) | 2025-05-06 21:26:46 -07:00