mlx/kernels at fce53b61d6a93f3a86e74b0a8a3bc86547228c11 - mlx

Abe Leininger fce53b61d6 Fix reduce sum/prod overflow (#2477)	2025-08-12 00:05:33 -07:00
fft	Fix fft for integer overflow (#2161)	2025-05-09 14:25:12 -07:00
jit	Dispatch bf16 at run time when using the JIT (#1584)	2024-11-15 16:54:36 -08:00
metal_3_0	Dispatch bf16 at run time when using the JIT (#1584)	2024-11-15 16:54:36 -08:00
metal_3_1	Dispatch bf16 at run time when using the JIT (#1584)	2024-11-15 16:54:36 -08:00
reduction	Align mlx::core::min op nan propagation with NumPy (#2346)	2025-07-10 06:20:43 -07:00
steel	MoE backward improvements (#2335)	2025-07-07 17:59:53 -07:00
arange.h	More jitting (#1132)	2024-05-23 16:23:44 -07:00
arange.metal	Custom logsumexp (#2028)	2025-03-31 07:36:55 -07:00
arg_reduce.metal	fix large arg reduce (#2206)	2025-05-19 13:10:44 -07:00
atomic.h	Refactor reductions and fix scatter atomics for large sizes (#1300)	2024-08-22 16:03:31 -07:00
bf16_math.h	Dispatch bf16 at run time when using the JIT (#1584)	2024-11-15 16:54:36 -08:00
binary_ops.h	Fix complex power and print (#2286)	2025-06-13 11:13:00 -07:00
binary_two.h	Improve metal elementwise kernels (#2247)	2025-06-06 11:37:40 -07:00
binary_two.metal	Improve metal elementwise kernels (#2247)	2025-06-06 11:37:40 -07:00
binary.h	Improve metal elementwise kernels (#2247)	2025-06-06 11:37:40 -07:00
binary.metal	Improve metal elementwise kernels (#2247)	2025-06-06 11:37:40 -07:00
cexpf.h	[CUDA] Implement Scan kernel (#2347)	2025-07-10 16:54:12 -07:00
CMakeLists.txt	MoE backward improvements (#2335)	2025-07-07 17:59:53 -07:00
complex.h	Add more complex unary ops (#2101)	2025-04-21 13:04:54 -07:00
conv.metal	Depthwise Conv2D optimization (#2036)	2025-04-03 09:42:04 -07:00
copy.h	Improve metal elementwise kernels (#2247)	2025-06-06 11:37:40 -07:00
copy.metal	Improve metal elementwise kernels (#2247)	2025-06-06 11:37:40 -07:00
defines.h	Refactor reductions and fix scatter atomics for large sizes (#1300)	2024-08-22 16:03:31 -07:00
erf.h	JIT compile option for binary minimization (#1091)	2024-05-22 12:57:13 -07:00
expm1f.h	Fix overflow / underflow handling for expm1f (#1278)	2024-07-23 07:29:06 -07:00
fence.metal	Faster synchronization `Fence` primitive (#1773)	2025-01-17 18:42:19 -08:00
fft.h	fix fft bug (#2062)	2025-04-10 19:41:27 -07:00
fft.metal	Add Quantized Ops to the JIT (#1204)	2024-06-12 09:47:12 -07:00
gather_axis.h	scatter axis + gather axis primitives (#1813)	2025-01-31 20:48:08 -08:00
gather.h	Use int64 stride everywhere (#1671)	2024-12-09 11:09:02 -08:00
gemv_masked.h	Use same accumulation precision in gemv as gemm (#1962)	2025-03-16 07:13:24 -07:00
gemv_masked.metal	Use int64 stride everywhere (#1671)	2024-12-09 11:09:02 -08:00
gemv.metal	enable complex gemm (#2017)	2025-03-28 10:45:13 -07:00
hadamard.h	GPU Hadamard for large N (#1879)	2025-05-01 17:19:17 -07:00
indexing.h	Use int64 stride everywhere (#1671)	2024-12-09 11:09:02 -08:00
layer_norm.metal	Fix layernorm race condition (#2340)	2025-07-07 06:06:01 -07:00
logsumexp.h	Fix out-of-bounds default value in logsumexp/softmax (#2213)	2025-05-21 07:25:16 -07:00
logsumexp.metal	Custom logsumexp (#2028)	2025-03-31 07:36:55 -07:00
quantized.h	Fix edge check in qmm_n QuantizedLoader (#2355)	2025-07-10 16:28:50 -07:00
quantized.metal	5bit quants (#2226)	2025-05-30 12:12:10 -07:00
random.metal	Use int64 stride everywhere (#1671)	2024-12-09 11:09:02 -08:00
reduce_utils.h	More jitting (#1132)	2024-05-23 16:23:44 -07:00
reduce.h	Fix JIT reductions (#1373)	2024-08-28 16:39:11 -07:00
reduce.metal	Fix reduce sum/prod overflow (#2477)	2025-08-12 00:05:33 -07:00
rms_norm.metal	Custom logsumexp (#2028)	2025-03-31 07:36:55 -07:00
rope.metal	Dispatch bf16 at run time when using the JIT (#1584)	2024-11-15 16:54:36 -08:00
scaled_dot_product_attention.metal	Add float mask to sdpa vector (#2068)	2025-04-11 17:29:40 -07:00
scan.h	LogCumSumExp (#2069)	2025-04-13 01:27:29 -07:00
scan.metal	Complex scan (#2094)	2025-04-22 18:56:28 -07:00
scatter_axis.h	scatter axis + gather axis primitives (#1813)	2025-01-31 20:48:08 -08:00
scatter.h	Use int64 stride everywhere (#1671)	2024-12-09 11:09:02 -08:00
sdpa_vector.h	fix batched vector sdpa (#2152)	2025-05-05 13:13:03 -07:00
softmax.h	Fix out-of-bounds default value in logsumexp/softmax (#2213)	2025-05-21 07:25:16 -07:00
softmax.metal	Custom logsumexp (#2028)	2025-03-31 07:36:55 -07:00
sort.h	faster sort (#1831)	2025-02-05 06:10:22 -08:00
sort.metal	faster sort (#1831)	2025-02-05 06:10:22 -08:00
ternary_ops.h	JIT compile option for binary minimization (#1091)	2024-05-22 12:57:13 -07:00
ternary.h	Improve metal elementwise kernels (#2247)	2025-06-06 11:37:40 -07:00
ternary.metal	Improve metal elementwise kernels (#2247)	2025-06-06 11:37:40 -07:00
unary_ops.h	[CUDA] Implement Scan kernel (#2347)	2025-07-10 16:54:12 -07:00
unary.h	Improve metal elementwise kernels (#2247)	2025-06-06 11:37:40 -07:00
unary.metal	Improve metal elementwise kernels (#2247)	2025-06-06 11:37:40 -07:00
utils.h	fix bw for elementwise ops (#2151)	2025-05-05 06:15:04 -07:00
