copy | Remove the kernel arg from get_launch_args (#2437) | 2025-07-30 11:43:02 +09:00 |
device | Move arange to its own file (#2438) | 2025-07-30 13:05:51 +09:00 |
gemms | faster rms norm (#2433) | 2025-07-29 13:12:00 -07:00 |
reduce | fix complex reduce + nan propagation in min and max (#2377) | 2025-07-15 18:19:47 -07:00 |
allocator.cpp | [CUDA] Simplify allocator (#2392) | 2025-07-22 08:24:01 -07:00 |
allocator.h | [CUDA] Simplify allocator (#2392) | 2025-07-22 08:24:01 -07:00 |
arange.cu | Move arange to its own file (#2438) | 2025-07-30 13:05:51 +09:00 |
arg_reduce.cu | Remove thrust iterators (#2396) | 2025-07-21 07:30:27 -07:00 |
bin2h.cmake | CUDA backend: compile (#2276) | 2025-06-12 17:08:39 -07:00 |
binary_two.cu | Remove the kernel arg from get_launch_args (#2437) | 2025-07-30 11:43:02 +09:00 |
binary.cu | Remove the kernel arg from get_launch_args (#2437) | 2025-07-30 11:43:02 +09:00 |
CMakeLists.txt | Move arange to its own file (#2438) | 2025-07-30 13:05:51 +09:00 |
compiled.cpp | Remove the kernel arg from get_launch_args (#2437) | 2025-07-30 11:43:02 +09:00 |
conv.cpp | [CUDA] Initial implementation of Convolution with cuDNN (#2385) | 2025-07-25 08:12:10 +09:00 |
copy.cu | Cuda perf tuning (#2307) | 2025-06-20 14:50:57 -07:00 |
cuda.cpp | start cuda circle config (#2256) | 2025-06-10 21:19:47 -07:00 |
cuda.h | start cuda circle config (#2256) | 2025-06-10 21:19:47 -07:00 |
device.cpp | Add more CUDA architectures for PyPi package (#2427) | 2025-07-28 12:35:15 -07:00 |
device.h | [CUDA] Initial implementation of Convolution with cuDNN (#2385) | 2025-07-25 08:12:10 +09:00 |
eval.cpp | [CUDA] Simplify allocator (#2392) | 2025-07-22 08:24:01 -07:00 |
event.cu | [CUDA] Simplify allocator (#2392) | 2025-07-22 08:24:01 -07:00 |
event.h | [CUDA] Simplify allocator (#2392) | 2025-07-22 08:24:01 -07:00 |
fence.cpp | Avoid atomic updates across CPU/GPU in CUDA event (#2231) | 2025-06-03 16:49:06 -07:00 |
indexing.cpp | Remove the kernel arg from get_launch_args (#2437) | 2025-07-30 11:43:02 +09:00 |
jit_module.cpp | [CUDA] Fix segfault on exit (#2424) | 2025-07-27 08:08:13 -07:00 |
jit_module.h | [CUDA] Fix segfault on exit (#2424) | 2025-07-27 08:08:13 -07:00 |
kernel_utils.cu | Remove the kernel arg from get_launch_args (#2437) | 2025-07-30 11:43:02 +09:00 |
kernel_utils.cuh | Remove the kernel arg from get_launch_args (#2437) | 2025-07-30 11:43:02 +09:00 |
layer_norm.cu | faster rms norm (#2433) | 2025-07-29 13:12:00 -07:00 |
logsumexp.cu | Cuda faster softmax (#2435) | 2025-07-29 17:18:12 -07:00 |
lru_cache.h | [CUDA] Initial implementation of Convolution with cuDNN (#2385) | 2025-07-25 08:12:10 +09:00 |
matmul.cpp | [CUDA] Always use batched matmul (#2404) | 2025-07-24 20:46:02 -07:00 |
no_cuda.cpp | start cuda circle config (#2256) | 2025-06-10 21:19:47 -07:00 |
primitives.cpp | Move arange to its own file (#2438) | 2025-07-30 13:05:51 +09:00 |
quantized.cu | Remove the kernel arg from get_launch_args (#2437) | 2025-07-30 11:43:02 +09:00 |
random.cu | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00 |
reduce.cu | faster rms norm (#2433) | 2025-07-29 13:12:00 -07:00 |
rms_norm.cu | faster rms norm (#2433) | 2025-07-29 13:12:00 -07:00 |
rope.cu | [CUDA] Switch to CUDA graphs (#2317) | 2025-07-02 15:59:13 -07:00 |
scan.cu | Add contiguous_copy_gpu util for copying array (#2379) | 2025-07-18 06:44:25 -07:00 |
slicing.cpp | rebase + nit (#2260) | 2025-06-10 10:51:51 -07:00 |
softmax.cu | Cuda faster softmax (#2435) | 2025-07-29 17:18:12 -07:00 |
sort.cu | Add contiguous_copy_gpu util for copying array (#2379) | 2025-07-18 06:44:25 -07:00 |
ternary.cu | Remove the kernel arg from get_launch_args (#2437) | 2025-07-30 11:43:02 +09:00 |
unary.cu | Remove the kernel arg from get_launch_args (#2437) | 2025-07-30 11:43:02 +09:00 |
utils.cpp | [CUDA] Initial implementation of Convolution with cuDNN (#2385) | 2025-07-25 08:12:10 +09:00 |
utils.h | [CUDA] Initial implementation of Convolution with cuDNN (#2385) | 2025-07-25 08:12:10 +09:00 |
worker.cpp | [CUDA] Simplify allocator (#2392) | 2025-07-22 08:24:01 -07:00 |
worker.h | [CUDA] Simplify allocator (#2392) | 2025-07-22 08:24:01 -07:00 |