mlx/mlx/fast.h

// Copyright © 2023-2024 Apple Inc.

#pragma once

#include <functional>
#include <optional>
#include <string>
#include <tuple>
#include <utility>
#include <variant>
#include <vector>

#include "mlx/utils.h"

namespace mlx::core::fast {

array rms_norm(
    const array& x,
    const std::optional<array>& weight,
    float eps,
    StreamOrDevice s = {});

array layer_norm(
    const array& x,
    const std::optional<array>& weight,
    const std::optional<array>& bias,
    float eps,
    StreamOrDevice s = {});

array rope(
    const array& x,
    int dims,
    bool traditional,
    std::optional<float> base,
    float scale,
    int offset,
    const std::optional<array>& freqs = std::nullopt,
    StreamOrDevice s = {});

array rope(
    const array& x,
    int dims,
    bool traditional,
    std::optional<float> base,
    float scale,
    const array& offset,
    const std::optional<array>& freqs = std::nullopt,
    StreamOrDevice s = {});

/** Computes: O = softmax(scale * (Q @ K.T)) @ V, with optional masking. **/
array scaled_dot_product_attention(
    const array& queries,
    const array& keys,
    const array& values,
    const float scale,
    const std::string& mask_mode = "",
    const std::vector<array>& mask_arrs = {},
    StreamOrDevice s = {});

std::tuple<array, array, array> affine_quantize(
    const array& w,
    int group_size = 64,
    int bits = 4,
    StreamOrDevice s = {});

array affine_dequantize(
    const array& w,
    const array& scales,
    const array& biases,
    int group_size = 64,
    int bits = 4,
    StreamOrDevice s = {});

typedef std::variant<int, bool, Dtype> TemplateArg;

typedef std::function<std::vector<array>(
    const std::vector<array>&, // inputs
    const std::vector<Shape>&, // output shapes
    const std::vector<Dtype>&, // output dtypes
    std::tuple<int, int, int>, // grid
    std::tuple<int, int, int>, // threadgroup
    std::vector<std::pair<std::string, TemplateArg>>, // template arguments
    std::optional<float>, // initial value for outputs
    bool, // verbose
    StreamOrDevice)>
    MetalKernelFunction;

MetalKernelFunction metal_kernel(
    const std::string& name,
    const std::vector<std::string>& input_names,
    const std::vector<std::string>& output_names,
    const std::string& source,
    const std::string& header = "",
    bool ensure_row_contiguous = true,
    bool atomic_outputs = false);

} // namespace mlx::core::fast