version bump (#1477 )

Add mx.random.permutation (#1471 )
* random permutation * comment
2025-09-07 17:44:38 +08:00 · 2024-10-10 13:05:17 -07:00 · 2024-10-08 19:42:19 -07:00 · 2024-10-08 09:54:02 -07:00 · 2024-10-07 19:13:50 -07:00 · 2024-10-07 17:21:42 -07:00
248 changed files with 13616 additions and 7293 deletions
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -31,19 +31,24 @@ jobs:
          name: Install dependencies
          command: |
            pip install --upgrade cmake
-            pip install git+https://github.com/wjakob/nanobind.git@2f04eac452a6d9142dedb957701bdb20125561e4
+            pip install nanobind==2.2.0
            pip install numpy
            sudo apt-get update
            sudo apt-get install libblas-dev liblapack-dev liblapacke-dev
      - run:
          name: Install Python package
          command: |
-            CMAKE_ARGS="-DMLX_BUILD_METAL=OFF" CMAKE_BUILD_PARALLEL_LEVEL="" python3 setup.py build_ext --inplace
-            CMAKE_ARGS="-DMLX_BUILD_METAL=OFF" CMAKE_BUILD_PARALLEL_LEVEL="" python3 setup.py develop
+            CMAKE_ARGS="-DMLX_BUILD_METAL=OFF" \
+              CMAKE_BUILD_PARALLEL_LEVEL=`nproc` \
+              python3 setup.py build_ext --inplace
+            CMAKE_ARGS="-DMLX_BUILD_METAL=OFF" \
+              CMAKE_BUILD_PARALLEL_LEVEL=`nproc` \
+              python3 setup.py develop
      - run:
          name: Generate package stubs
          command: |
            echo "stubs"
+            pip install typing_extensions
            python setup.py generate_stubs 
      - run:
          name: Run Python tests
@@ -52,7 +57,9 @@ jobs:
      - run:
          name: Build CPP only
          command: |
-            mkdir -p build && cd build && cmake .. -DMLX_BUILD_METAL=OFF && make -j
+            mkdir -p build && cd build 
+            cmake .. -DMLX_BUILD_METAL=OFF -DCMAKE_BUILD_TYPE=DEBUG
+            make -j `nproc`
      - run:
          name: Run CPP tests
          command: ./build/tests/tests
@@ -76,7 +83,7 @@ jobs:
            source env/bin/activate
            pip install --upgrade pip
            pip install --upgrade cmake
-            pip install git+https://github.com/wjakob/nanobind.git@2f04eac452a6d9142dedb957701bdb20125561e4
+            pip install nanobind==2.2.0
            pip install numpy
            pip install torch
            pip install tensorflow
@@ -85,11 +92,12 @@ jobs:
          name: Install Python package
          command: |
            source env/bin/activate
-            CMAKE_BUILD_PARALLEL_LEVEL="" pip install -e . -v
+            DEBUG=1 CMAKE_BUILD_PARALLEL_LEVEL=`sysctl -n hw.ncpu` pip install -e . -v
      - run:
          name: Generate package stubs
          command: |
            source env/bin/activate
+            pip install typing_extensions
            python setup.py generate_stubs 
      - run:
          name: Run Python tests
@@ -97,7 +105,7 @@ jobs:
            source env/bin/activate
            LOW_MEMORY=1 DEVICE=cpu python -m xmlrunner discover -v python/tests -o test-results/cpu
            LOW_MEMORY=1 DEVICE=gpu METAL_DEVICE_WRAPPER_TYPE=1 METAL_DEBUG_ERROR_MODE=0 python -m xmlrunner discover -v python/tests -o test-results/gpu
-            mpirun -host localhost:8 -np 8 -x DYLD_LIBRARY_PATH=/opt/homebrew/lib/ python python/tests/mpi_test_distributed.py
+            mpirun --bind-to none -host localhost:8 -np 8 -x DYLD_LIBRARY_PATH=/opt/homebrew/lib/ python python/tests/mpi_test_distributed.py
      - run:
          name: Build example extension
          command: |
@@ -111,7 +119,7 @@ jobs:
          name: Build CPP only
          command: |
            source env/bin/activate
-            mkdir -p build && cd build && cmake .. && make -j
+            mkdir -p build && cd build && cmake .. && make -j `sysctl -n hw.ncpu`
      - run:
          name: Run CPP tests
          command: |
@@ -121,8 +129,23 @@ jobs:
          command: |
            source env/bin/activate
            cd build/
-            cmake .. -DCMAKE_BUILD_TYPE=MinSizeRel -DBUILD_SHARED_LIBS=ON -DMLX_BUILD_CPU=OFF -DMLX_BUILD_SAFETENSORS=OFF -DMLX_BUILD_GGUF=OFF -DMLX_METAL_JIT=ON
-            make -j
+            cmake .. -DCMAKE_BUILD_TYPE=MinSizeRel \
+              -DBUILD_SHARED_LIBS=ON \
+              -DMLX_BUILD_CPU=OFF \
+              -DMLX_BUILD_SAFETENSORS=OFF \
+              -DMLX_BUILD_GGUF=OFF \
+              -DMLX_METAL_JIT=ON
+            make -j `sysctl -n hw.ncpu`
+      - run:
+          name: Run Python tests with JIT
+          command: |
+            source env/bin/activate
+            CMAKE_BUILD_PARALLEL_LEVEL=`sysctl -n hw.ncpu` \
+              CMAKE_ARGS="-DMLX_METAL_JIT=ON" \
+              pip install -e . -v
+            LOW_MEMORY=1 DEVICE=gpu METAL_DEVICE_WRAPPER_TYPE=1 \
+              METAL_DEBUG_ERROR_MODE=0 \
+              python -m xmlrunner discover -v python/tests -o test-results/gpu_jit

  build_release:
    parameters:
@@ -149,7 +172,7 @@ jobs:
            source env/bin/activate
            pip install --upgrade pip
            pip install --upgrade cmake
-            pip install git+https://github.com/wjakob/nanobind.git@2f04eac452a6d9142dedb957701bdb20125561e4
+            pip install nanobind==2.2.0
            pip install --upgrade setuptools
            pip install numpy
            pip install twine
@@ -159,19 +182,20 @@ jobs:
          command: |
            source env/bin/activate
            DEV_RELEASE=1 \
-              CMAKE_BUILD_PARALLEL_LEVEL="" \
+              CMAKE_BUILD_PARALLEL_LEVEL=`sysctl -n hw.ncpu` \
              pip install . -v
      - run:
          name: Generate package stubs
          command: |
            source env/bin/activate
+            pip install typing_extensions
            python setup.py generate_stubs 
      - run:
          name: Build Python package
          command: |
            source env/bin/activate
            << parameters.build_env >> \
-              CMAKE_BUILD_PARALLEL_LEVEL="" \
+              CMAKE_BUILD_PARALLEL_LEVEL=`sysctl -n hw.ncpu` \
              python -m build -w
      - when:
          condition: << parameters.build_env >>
@@ -213,18 +237,19 @@ jobs:
            source env/bin/activate
            pip install --upgrade pip
            pip install --upgrade cmake
-            pip install git+https://github.com/wjakob/nanobind.git@2f04eac452a6d9142dedb957701bdb20125561e4
+            pip install nanobind==2.2.0
            pip install --upgrade setuptools
            pip install numpy
            pip install auditwheel
            pip install patchelf
            pip install build
            << parameters.extra_env >> \
-              CMAKE_BUILD_PARALLEL_LEVEL="" \
+              CMAKE_BUILD_PARALLEL_LEVEL=`nproc` \
              pip install . -v
+            pip install typing_extensions
            python setup.py generate_stubs 
            << parameters.extra_env >> \
-              CMAKE_BUILD_PARALLEL_LEVEL="" \
+              CMAKE_BUILD_PARALLEL_LEVEL=`nproc` \
              python -m build --wheel
            auditwheel show dist/*
            auditwheel repair dist/* --plat manylinux_2_31_x86_64
@@ -245,7 +270,7 @@ workflows:
      - mac_build_and_test:
          matrix:
            parameters:
-              xcode_version: ["15.0.0", "15.2.0"]
+              xcode_version: ["15.0.0", "15.2.0", "16.0.0"]
      - linux_build_and_test

  build_pypi_release:
@@ -280,7 +305,7 @@ workflows:
          requires: [ hold ]
          matrix:
            parameters:
-              xcode_version: ["15.0.0", "15.2.0"]
+              xcode_version: ["15.0.0", "15.2.0", "16.0.0"]
      - linux_build_and_test:
          requires: [ hold ]
  nightly_build:
@@ -304,7 +329,7 @@ workflows:
          matrix:
            parameters:
              python_version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
-              xcode_version: ["15.0.0", "15.2.0"]
+              xcode_version: ["15.0.0", "15.2.0", "16.0.0"]
              build_env: ["DEV_RELEASE=1"]
  linux_test_release:
    when:
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -1,11 +1,11 @@
 repos:
 -   repo: https://github.com/pre-commit/mirrors-clang-format
-    rev: v18.1.4
+    rev: v18.1.8
    hooks:
    -   id: clang-format
 # Using this mirror lets us use mypyc-compiled black, which is about 2x faster
 -   repo: https://github.com/psf/black-pre-commit-mirror
-    rev: 24.4.2
+    rev: 24.8.0
    hooks:
    -   id: black
 -   repo: https://github.com/pycqa/isort
@@ -14,3 +14,7 @@ repos:
    -   id: isort
        args:
            - --profile=black
+- repo: https://github.com/cheshirekow/cmake-format-precommit
+  rev: v0.6.13
+  hooks:
+    - id: cmake-format
--- a/ACKNOWLEDGMENTS.md
+++ b/ACKNOWLEDGMENTS.md
@@ -7,7 +7,7 @@ with a short description of your contribution(s) below. For example:

 MLX was developed with contributions from the following individuals:

- Nripesh Niketan: Added `softsign`, `softmax`, `hardswish`, `logsoftmax` activation functions. Added `dropout3d` ops. Added `LogicalAnd` and `LogicalOR` ops. Added `clip_grad_norm` along with `tree_reduce`.
+- Nripesh Niketan: Added `softsign`, `softmax`, `hardswish`, `logsoftmax` activation functions. Added `dropout3d` ops. Added `LogicalAnd` and `LogicalOR` ops. Added `clip_grad_norm` along with `tree_reduce`. Added `cross`.
 - Juarez Bochi: Fixed bug in cross attention.
 - Justin Deschenaux: Sine, Cosine, arange, randint, truncated normal, bernoulli, lion optimizer, Dropout2d, linear and logistic regression python example.
 - Diogo Da Cruz: Added `tri`, `tril`, `triu`, `tensordot`, `inner`, `outer`, `tile`, `StreamContext`, `stream`, safetensors support, `einsum`, and `einsum_path`.
@@ -18,6 +18,7 @@ MLX was developed with contributions from the following individuals:
 - AmirHossein Razlighi: Added chaining support for some of the ops in `nn.Module`. Comparison works for non array objects in `mlx.core.array`. Exception handling for invalid operations in `mlx.core.array`.
 - Gleb Pobudzey: Added the `where` primitive, and groups in 1D and 2D convolutions.
 - Paul Paczuski: Improved stability of BCE loss calculation
+- Max-Heinrich Laves: Added `conv_transpose1d`, `conv_transpose2d`, and `conv_transpose3d` ops.

 <a href="https://github.com/ml-explore/mlx/graphs/contributors">
  <img class="dark-light" src="https://contrib.rocks/image?repo=ml-explore/mlx&anon=0&columns=20&max=100&r=true" />
--- a/CITATION.cff
+++ b/CITATION.cff
@@ -0,0 +1,24 @@
+cff-version: 1.2.0
+title: mlx
+message: >-
+  If you use this software, please cite it using the
+  metadata from this file.
+type: software
+authors:
+  - given-names: Awni
+    family-names: Hannun
+    affiliation: Apple
+  - given-names: Jagrit
+    family-names: Digani
+    affiliation: Apple
+  - given-names: Angelos
+    family-names: Katharopoulos
+    affiliation: Apple
+  - given-names: Ronan
+    family-names: Collobert
+    affiliation: Apple
+repository-code: 'https://github.com/ml-explore'
+abstract: >-
+  MLX: efficient and flexible machine learning on Apple
+  silicon
+license: MIT
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -24,35 +24,43 @@ option(MLX_METAL_JIT "Use JIT compilation for Metal kernels" OFF)
 option(BUILD_SHARED_LIBS "Build mlx as a shared library" OFF)

 if(NOT MLX_VERSION)
-  set(MLX_VERSION 0.16.1)
+  set(MLX_VERSION 0.18.1)
 endif()

 # --------------------- Processor tests -------------------------

-message(STATUS "Building MLX for ${CMAKE_SYSTEM_PROCESSOR} processor on ${CMAKE_SYSTEM_NAME}")
+message(
+  STATUS
+    "Building MLX for ${CMAKE_SYSTEM_PROCESSOR} processor on ${CMAKE_SYSTEM_NAME}"
+)

 set(MLX_BUILD_ARM OFF)

-if (${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
+if(${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
  if(${CMAKE_SYSTEM_PROCESSOR} MATCHES "x86_64")
    if(NOT MLX_ENABLE_X64_MAC)
-      message(FATAL_ERROR
-        "Building for x86_64 on macOS is not supported."
-        " If you are on an Apple silicon system, check the build"
-        " documentation for possible fixes: "
-        "https://ml-explore.github.io/mlx/build/html/install.html#build-from-source")
+      message(
+        FATAL_ERROR
+          "Building for x86_64 on macOS is not supported."
+          " If you are on an Apple silicon system, check the build"
+          " documentation for possible fixes: "
+          "https://ml-explore.github.io/mlx/build/html/install.html#build-from-source"
+      )
    else()
+      set(MLX_BUILD_METAL OFF)
      message(WARNING "Building for x86_64 arch is not officially supported.")
    endif()
-    set(MLX_BUILD_METAL OFF)
-  elseif(${CMAKE_SYSTEM_PROCESSOR} MATCHES "arm64")
-    set(MLX_BUILD_ARM ON)
  endif()

 else()
+  set(MLX_BUILD_METAL OFF)
  message(WARNING "MLX is prioritised for Apple silicon systems using macOS.")
 endif()

+if(${CMAKE_SYSTEM_PROCESSOR} MATCHES "arm64")
+  set(MLX_BUILD_ARM ON)
+endif()
+
 # ----------------------------- Lib -----------------------------

 include(FetchContent)
@@ -61,63 +69,59 @@ cmake_policy(SET CMP0135 NEW)

 add_library(mlx)

-if (MLX_BUILD_METAL)
-  find_library(METAL_LIB Metal)
-  find_library(FOUNDATION_LIB Foundation)
-  find_library(QUARTZ_LIB QuartzCore)
+if(MLX_BUILD_METAL)
+  set(METAL_LIB "-framework Metal")
+  set(FOUNDATION_LIB "-framework Foundation")
+  set(QUARTZ_LIB "-framework QuartzCore")
 endif()

-if (MLX_BUILD_METAL AND NOT METAL_LIB)
+if(MLX_BUILD_METAL AND NOT METAL_LIB)
  message(STATUS "Metal not found. Unable to build GPU")
  set(MLX_BUILD_METAL OFF)
  set(MLX_METAL_DEBUG OFF)
-elseif (MLX_BUILD_METAL)
+elseif(MLX_BUILD_METAL)
  message(STATUS "Building METAL sources")

-  if (MLX_METAL_DEBUG)
+  if(MLX_METAL_DEBUG)
    add_compile_definitions(MLX_METAL_DEBUG)
  endif()

  # Throw an error if xcrun not found
-  execute_process(COMMAND zsh "-c" "/usr/bin/xcrun -sdk macosx --show-sdk-version"
-                  OUTPUT_VARIABLE MACOS_VERSION
-                  COMMAND_ERROR_IS_FATAL ANY)
+  execute_process(
+    COMMAND zsh "-c" "/usr/bin/xcrun -sdk macosx --show-sdk-version"
+    OUTPUT_VARIABLE MACOS_VERSION COMMAND_ERROR_IS_FATAL ANY)

-  if (${MACOS_VERSION} LESS 14.0)
-    message(FATAL_ERROR "MLX requires macOS SDK >= 14.0 to be built with MLX_BUILD_METAL=ON" )
+  if(${MACOS_VERSION} LESS 14.0)
+    message(
+      FATAL_ERROR
+        "MLX requires macOS SDK >= 14.0 to be built with MLX_BUILD_METAL=ON")
  endif()
  message(STATUS "Building with SDK for macOS version ${MACOS_VERSION}")

-  set(METAL_CPP_URL https://developer.apple.com/metal/cpp/files/metal-cpp_macOS15_iOS18-beta.zip)
+  set(METAL_CPP_URL
+      https://developer.apple.com/metal/cpp/files/metal-cpp_macOS15_iOS18-beta.zip
+  )
  # Get the metal version
  execute_process(
-    COMMAND zsh "-c" "echo \"__METAL_VERSION__\" | xcrun -sdk macosx metal -E -x metal -P - | tail -1 | tr -d '\n'"
-    OUTPUT_VARIABLE MLX_METAL_VERSION
-    COMMAND_ERROR_IS_FATAL ANY)
+    COMMAND
+      zsh "-c"
+      "echo \"__METAL_VERSION__\" | xcrun -sdk macosx metal -E -x metal -P - | tail -1 | tr -d '\n'"
+    OUTPUT_VARIABLE MLX_METAL_VERSION COMMAND_ERROR_IS_FATAL ANY)

-  FetchContent_Declare(
-    metal_cpp
-    URL ${METAL_CPP_URL}
-  )
+  FetchContent_Declare(metal_cpp URL ${METAL_CPP_URL})

  FetchContent_MakeAvailable(metal_cpp)
  target_include_directories(
-    mlx PUBLIC
-    $<BUILD_INTERFACE:${metal_cpp_SOURCE_DIR}>
-    $<INSTALL_INTERFACE:include/metal_cpp>
-  )
-  target_link_libraries(
-    mlx PUBLIC
-    ${METAL_LIB}
-    ${FOUNDATION_LIB}
-    ${QUARTZ_LIB})
+    mlx PUBLIC $<BUILD_INTERFACE:${metal_cpp_SOURCE_DIR}>
+               $<INSTALL_INTERFACE:include/metal_cpp>)
+  target_link_libraries(mlx PUBLIC ${METAL_LIB} ${FOUNDATION_LIB} ${QUARTZ_LIB})

  add_compile_definitions("MLX_METAL_VERSION=${MLX_METAL_VERSION}")
 endif()

-if (MLX_BUILD_CPU)
+if(MLX_BUILD_CPU)
  find_library(ACCELERATE_LIBRARY Accelerate)
-  if (MLX_BUILD_ARM AND ACCELERATE_LIBRARY)
+  if(MLX_BUILD_ARM AND ACCELERATE_LIBRARY)
    message(STATUS "Accelerate found ${ACCELERATE_LIBRARY}")
    set(MLX_BUILD_ACCELERATE ON)
    target_link_libraries(mlx PUBLIC ${ACCELERATE_LIBRARY})
@@ -129,32 +133,29 @@ if (MLX_BUILD_CPU)
      # The blas shipped in macOS SDK is not supported, search homebrew for
      # openblas instead.
      set(BLA_VENDOR OpenBLAS)
-      set(LAPACK_ROOT "${LAPACK_ROOT};$ENV{LAPACK_ROOT};/usr/local/opt/openblas")
+      set(LAPACK_ROOT
+          "${LAPACK_ROOT};$ENV{LAPACK_ROOT};/usr/local/opt/openblas")
    endif()
    # Search and link with lapack.
    find_package(LAPACK REQUIRED)
-    if (NOT LAPACK_FOUND)
+    if(NOT LAPACK_FOUND)
      message(FATAL_ERROR "Must have LAPACK installed")
    endif()
-    find_path(LAPACK_INCLUDE_DIRS lapacke.h
-      /usr/include
-      /usr/local/include
-      /usr/local/opt/openblas/include)
+    find_path(LAPACK_INCLUDE_DIRS lapacke.h /usr/include /usr/local/include
+              /usr/local/opt/openblas/include)
    message(STATUS "Lapack lib " ${LAPACK_LIBRARIES})
    message(STATUS "Lapack include " ${LAPACK_INCLUDE_DIRS})
    target_include_directories(mlx PRIVATE ${LAPACK_INCLUDE_DIRS})
    target_link_libraries(mlx PUBLIC ${LAPACK_LIBRARIES})
-    # List blas after lapack otherwise we may accidentally incldue an old version
-    # of lapack.h from the include dirs of blas.
+    # List blas after lapack otherwise we may accidentally incldue an old
+    # version of lapack.h from the include dirs of blas.
    find_package(BLAS REQUIRED)
-    if (NOT BLAS_FOUND)
+    if(NOT BLAS_FOUND)
      message(FATAL_ERROR "Must have BLAS installed")
    endif()
    # TODO find a cleaner way to do this
-    find_path(BLAS_INCLUDE_DIRS cblas.h
-      /usr/include
-      /usr/local/include
-      $ENV{BLAS_HOME}/include)
+    find_path(BLAS_INCLUDE_DIRS cblas.h /usr/include /usr/local/include
+              $ENV{BLAS_HOME}/include)
    message(STATUS "Blas lib " ${BLAS_LIBRARIES})
    message(STATUS "Blas include " ${BLAS_INCLUDE_DIRS})
    target_include_directories(mlx PRIVATE ${BLAS_INCLUDE_DIRS})
@@ -165,103 +166,95 @@ else()
 endif()

 find_package(MPI)
-if (MPI_FOUND)
+if(MPI_FOUND)
  execute_process(
    COMMAND zsh "-c" "mpirun --version"
    OUTPUT_VARIABLE MPI_VERSION
-    ERROR_QUIET
-  )
-  if (${MPI_VERSION} MATCHES ".*Open MPI.*")
+    ERROR_QUIET)
+  if(${MPI_VERSION} MATCHES ".*Open MPI.*")
    target_include_directories(mlx PRIVATE ${MPI_INCLUDE_PATH})
-  elseif (MPI_VERSION STREQUAL "")
+  elseif(MPI_VERSION STREQUAL "")
    set(MPI_FOUND FALSE)
    message(
-      WARNING
-      "MPI found but mpirun is not available. Building without MPI."
-    )
+      WARNING "MPI found but mpirun is not available. Building without MPI.")
  else()
    set(MPI_FOUND FALSE)
-    message(
-      WARNING
-      "MPI which is not OpenMPI found. Building without MPI."
-    )
-  endif() 
+    message(WARNING "MPI which is not OpenMPI found. Building without MPI.")
+  endif()
 endif()

 add_subdirectory(${CMAKE_CURRENT_LIST_DIR}/mlx)

 target_include_directories(
-  mlx
-  PUBLIC
-  $<BUILD_INTERFACE:${CMAKE_CURRENT_LIST_DIR}>
-  $<INSTALL_INTERFACE:include>
-)
+  mlx PUBLIC $<BUILD_INTERFACE:${CMAKE_CURRENT_LIST_DIR}>
+             $<INSTALL_INTERFACE:include>)

-FetchContent_Declare(fmt
+FetchContent_Declare(
+  fmt
  GIT_REPOSITORY https://github.com/fmtlib/fmt.git
-  GIT_TAG 10.2.1 
-  EXCLUDE_FROM_ALL
-)
+  GIT_TAG 10.2.1
+  EXCLUDE_FROM_ALL)
 FetchContent_MakeAvailable(fmt)
-target_link_libraries(mlx PRIVATE fmt::fmt-header-only)
+target_link_libraries(mlx PRIVATE $<BUILD_INTERFACE:fmt::fmt-header-only>)

-if (MLX_BUILD_PYTHON_BINDINGS)
+if(MLX_BUILD_PYTHON_BINDINGS)
  message(STATUS "Building Python bindings.")
-  find_package(Python 3.8 COMPONENTS Interpreter Development.Module REQUIRED)
+  find_package(
+    Python 3.8
+    COMPONENTS Interpreter Development.Module
+    REQUIRED)
  execute_process(
    COMMAND "${Python_EXECUTABLE}" -m nanobind --cmake_dir
-    OUTPUT_STRIP_TRAILING_WHITESPACE OUTPUT_VARIABLE NB_DIR)
+    OUTPUT_STRIP_TRAILING_WHITESPACE
+    OUTPUT_VARIABLE NB_DIR)
  list(APPEND CMAKE_PREFIX_PATH "${NB_DIR}")
  find_package(nanobind CONFIG REQUIRED)
  add_subdirectory(${CMAKE_CURRENT_LIST_DIR}/python/src)
 endif()

-if (MLX_BUILD_TESTS)
+if(MLX_BUILD_TESTS)
  include(CTest)
  add_subdirectory(${CMAKE_CURRENT_LIST_DIR}/tests)
 endif()

-if (MLX_BUILD_EXAMPLES)
+if(MLX_BUILD_EXAMPLES)
  add_subdirectory(${CMAKE_CURRENT_LIST_DIR}/examples/cpp)
 endif()

-if (MLX_BUILD_BENCHMARKS)
+if(MLX_BUILD_BENCHMARKS)
  add_subdirectory(${CMAKE_CURRENT_LIST_DIR}/benchmarks/cpp)
 endif()

-
-
 # ----------------------------- Installation -----------------------------
 include(GNUInstallDirs)

 # Install library
 install(
-    TARGETS mlx
-    EXPORT MLXTargets
-    LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
-    ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}
-    RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR}
-    INCLUDES DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
-)
-
+  TARGETS mlx
+  EXPORT MLXTargets
+  LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
+  ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}
+  RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR}
+  INCLUDES
+  DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})

 # Install headers
 install(
-    DIRECTORY ${CMAKE_CURRENT_LIST_DIR}/mlx
-    DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
-    COMPONENT headers
-    FILES_MATCHING PATTERN "*.h"
-)
+  DIRECTORY ${CMAKE_CURRENT_LIST_DIR}/mlx
+  DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
+  COMPONENT headers
+  FILES_MATCHING
+  PATTERN "*.h"
+  PATTERN "backend/metal/kernels.h" EXCLUDE)

 # Install metal dependencies
-if (MLX_BUILD_METAL)
+if(MLX_BUILD_METAL)

  # Install metal cpp
  install(
-      DIRECTORY ${metal_cpp_SOURCE_DIR}/
-      DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}/metal_cpp
-      COMPONENT metal_cpp_source
-  )
+    DIRECTORY ${metal_cpp_SOURCE_DIR}/
+    DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}/metal_cpp
+    COMPONENT metal_cpp_source)

 endif()

@@ -273,31 +266,24 @@ set(MLX_CMAKE_INSTALL_MODULE_DIR share/cmake/MLX)
 install(
  EXPORT MLXTargets
  FILE MLXTargets.cmake
-  DESTINATION ${MLX_CMAKE_INSTALL_MODULE_DIR}
-)
+  DESTINATION ${MLX_CMAKE_INSTALL_MODULE_DIR})

 include(CMakePackageConfigHelpers)

 write_basic_package_version_file(
  ${MLX_CMAKE_BUILD_VERSION_CONFIG}
  COMPATIBILITY SameMajorVersion
-  VERSION ${MLX_VERSION}
-)
+  VERSION ${MLX_VERSION})

 configure_package_config_file(
-  ${CMAKE_CURRENT_LIST_DIR}/mlx.pc.in
-  ${MLX_CMAKE_BUILD_CONFIG}
+  ${CMAKE_CURRENT_LIST_DIR}/mlx.pc.in ${MLX_CMAKE_BUILD_CONFIG}
  INSTALL_DESTINATION ${MLX_CMAKE_INSTALL_MODULE_DIR}
  NO_CHECK_REQUIRED_COMPONENTS_MACRO
-  PATH_VARS CMAKE_INSTALL_LIBDIR CMAKE_INSTALL_INCLUDEDIR MLX_CMAKE_INSTALL_MODULE_DIR
-)
+  PATH_VARS CMAKE_INSTALL_LIBDIR CMAKE_INSTALL_INCLUDEDIR
+            MLX_CMAKE_INSTALL_MODULE_DIR)

-install(
-  FILES ${MLX_CMAKE_BUILD_CONFIG} ${MLX_CMAKE_BUILD_VERSION_CONFIG}
-  DESTINATION ${MLX_CMAKE_INSTALL_MODULE_DIR}
-)
+install(FILES ${MLX_CMAKE_BUILD_CONFIG} ${MLX_CMAKE_BUILD_VERSION_CONFIG}
+        DESTINATION ${MLX_CMAKE_INSTALL_MODULE_DIR})

-install(
-  DIRECTORY ${CMAKE_MODULE_PATH}/
-  DESTINATION ${MLX_CMAKE_INSTALL_MODULE_DIR}
-)
+install(DIRECTORY ${CMAKE_MODULE_PATH}/
+        DESTINATION ${MLX_CMAKE_INSTALL_MODULE_DIR})
--- a/benchmarks/python/conv2d_bench_cpu.py
+++ b/benchmarks/python/conv2d_bench_cpu.py
@@ -0,0 +1,127 @@
+import argparse
+import math
+import time
+
+import mlx.core as mx
+import numpy as np
+import torch
+
+N_warmup = 1
+N_iter_bench = 10
+N_iter_func = 5
+mx.set_default_device(mx.cpu)
+
+
+def bench(f, a, b):
+    for i in range(N_warmup):
+        f(a, b)
+
+    s = time.perf_counter_ns()
+    for i in range(N_iter_bench):
+        f(a, b)
+    e = time.perf_counter_ns()
+    return (e - s) * 1e-9
+
+
+def make_mx_conv_2D(strides=(1, 1), padding=(0, 0), groups=1):
+    def mx_conv_2D(a, b):
+        ys = []
+        for i in range(N_iter_func):
+            y = mx.conv2d(a, b, stride=strides, padding=padding, groups=groups)
+            ys.append(y)
+        mx.eval(ys)
+        return ys
+
+    return mx_conv_2D
+
+
+def make_pt_conv_2D(strides=(1, 1), padding=(0, 0), groups=1):
+    @torch.no_grad()
+    def pt_conv_2D(a, b):
+        ys = []
+        for i in range(N_iter_func):
+            y = torch.conv2d(a, b, stride=strides, padding=padding, groups=groups)
+            ys.append(y)
+        return ys
+
+    return pt_conv_2D
+
+
+def bench_shape(N, H, W, C, kH, kW, O, strides, padding, groups, np_dtype):
+    scale = 1.0 / math.sqrt(kH * kH * C)
+    a_np = np.random.uniform(0, 0.5, (N, H, W, C)).astype(np_dtype)
+    b_np = np.random.uniform(-scale, scale, (O, kH, kW, int(C / groups))).astype(
+        np_dtype
+    )
+
+    a_mx = mx.array(a_np)
+    b_mx = mx.array(b_np)
+
+    a_pt = torch.from_numpy(a_np.transpose((0, 3, 1, 2))).to("cpu")
+    b_pt = torch.from_numpy(b_np.transpose((0, 3, 1, 2))).to("cpu")
+
+    f_mx = make_mx_conv_2D(strides, padding, groups)
+    f_pt = make_pt_conv_2D(strides, padding, groups)
+
+    time_torch = bench(f_pt, a_pt, b_pt)
+    time_mlx = bench(f_mx, a_mx, b_mx)
+
+    out_mx = mx.conv2d(a_mx, b_mx, stride=strides, padding=padding, groups=groups)
+    out_pt = torch.conv2d(
+        a_pt.to("cpu"), b_pt.to("cpu"), stride=strides, padding=padding, groups=groups
+    )
+    out_pt = torch.permute(out_pt, (0, 2, 3, 1))
+    out_pt = out_pt.numpy(force=True)
+
+    atol = 2e-5 if np_dtype == np.float32 else 1e-4
+
+    if not np.allclose(out_pt, out_mx, atol=atol):
+        print(
+            f"Failed at {(N, H, W, C)}, {(O, kH, kW, C)} [strides = {strides}, padding = {padding}, groups = {groups}] with max(|a - b|) = {np.max(np.abs(out_pt - out_mx))}"
+        )
+
+    return time_mlx, time_torch
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Run conv benchmarks")
+
+    dtypes = ("float32",)
+    shapes = (
+        (4, 32, 32, 32, 5, 5, 32, (1, 1), (2, 2), 1),
+        (4, 32, 32, 64, 5, 5, 64, (1, 1), (2, 2), 1),
+        (4, 32, 32, 128, 5, 5, 128, (1, 1), (2, 2), 1),
+        (4, 32, 32, 256, 5, 5, 256, (1, 1), (2, 2), 1),
+        (4, 32, 32, 512, 5, 5, 512, (1, 1), (2, 2), 1),
+        (4, 64, 64, 32, 5, 5, 32, (1, 1), (2, 2), 1),
+        (4, 64, 64, 64, 5, 5, 64, (1, 1), (2, 2), 1),
+        (4, 64, 64, 128, 5, 5, 128, (1, 1), (2, 2), 1),
+        (4, 64, 64, 256, 5, 5, 256, (1, 1), (2, 2), 1),
+        # (4, 64, 64, 256, 5, 5, 256, (1, 1), (2, 2), 2),
+        # (4, 64, 64, 256, 5, 5, 256, (1, 1), (2, 2), 16),
+        # (4, 64, 64, 256, 5, 5, 256, (1, 1), (2, 2), 64),
+        (4, 128, 128, 32, 5, 5, 32, (1, 1), (2, 2), 1),
+        (4, 128, 128, 64, 5, 5, 64, (1, 1), (2, 2), 1),
+        (4, 128, 128, 128, 5, 5, 128, (1, 1), (2, 2), 1),
+        (4, 256, 256, 32, 5, 5, 3, (1, 1), (2, 2), 1),
+        (4, 256, 256, 3, 5, 5, 32, (1, 1), (2, 2), 1),
+        (4, 128, 128, 64, 5, 5, 3, (1, 1), (2, 2), 1),
+        (4, 128, 128, 3, 5, 5, 64, (1, 1), (2, 2), 1),
+    )
+
+    for dtype in dtypes:
+        print(
+            "(N,   H,   W,   C), (  O, kH, kW,   C),   dtype, stride,   pads,  groups, diff%"
+        )
+        for N, H, W, C, kH, kW, O, strides, padding, groups in shapes:
+            np_dtype = getattr(np, dtype)
+            time_mlx, time_torch = bench_shape(
+                N, H, W, C, kH, kW, O, strides, padding, groups, np_dtype
+            )
+            diff = time_torch / time_mlx - 1.0
+
+            print(
+                f"({N}, {H:3d}, {W:3d}, {C:3d}), ({O:3d}, {kH:2d}, {kW:2d}, {C:3d}), {dtype}, {strides}, {padding}, {groups:7d}, {100. * diff:+5.2f}%"
+            )
+            if time_mlx >= 2.0 * time_torch:
+                print("ATTENTION ^^^^^^^")
--- a/benchmarks/python/conv2d_train_bench_cpu.py
+++ b/benchmarks/python/conv2d_train_bench_cpu.py
@@ -0,0 +1,143 @@
+import time
+
+import mlx.core as mx
+import mlx.nn
+import mlx.optimizers as opt
+import torch
+
+
+def bench_mlx(steps: int = 20) -> float:
+    mx.set_default_device(mx.cpu)
+
+    class BenchNetMLX(mlx.nn.Module):
+        # simple encoder-decoder net
+
+        def __init__(self, in_channels, hidden_channels=32):
+            super().__init__()
+
+            self.net = mlx.nn.Sequential(
+                mlx.nn.Conv2d(in_channels, hidden_channels, kernel_size=3, padding=1),
+                mlx.nn.ReLU(),
+                mlx.nn.Conv2d(
+                    hidden_channels, 2 * hidden_channels, kernel_size=3, padding=1
+                ),
+                mlx.nn.ReLU(),
+                mlx.nn.ConvTranspose2d(
+                    2 * hidden_channels, hidden_channels, kernel_size=3, padding=1
+                ),
+                mlx.nn.ReLU(),
+                mlx.nn.ConvTranspose2d(
+                    hidden_channels, in_channels, kernel_size=3, padding=1
+                ),
+            )
+
+        def __call__(self, input):
+            return self.net(input)
+
+    benchNet = BenchNetMLX(3)
+    mx.eval(benchNet.parameters())
+    optim = opt.Adam(learning_rate=1e-3)
+
+    inputs = mx.random.normal([10, 256, 256, 3])
+
+    params = benchNet.parameters()
+    optim.init(params)
+
+    state = [benchNet.state, optim.state]
+
+    def loss_fn(params, image):
+        benchNet.update(params)
+        pred_image = benchNet(image)
+        return (pred_image - image).abs().mean()
+
+    def step(params, image):
+        loss, grads = mx.value_and_grad(loss_fn)(params, image)
+        optim.update(benchNet, grads)
+        return loss
+
+    total_time = 0.0
+    print("MLX:")
+    for i in range(steps):
+        start_time = time.perf_counter()
+
+        step(benchNet.parameters(), inputs)
+        mx.eval(state)
+        end_time = time.perf_counter()
+
+        print(f"{i:3d}, time={(end_time-start_time) * 1000:7.2f} ms")
+        total_time += (end_time - start_time) * 1000
+
+    return total_time
+
+
+def bench_torch(steps: int = 20) -> float:
+    device = torch.device("cpu")
+
+    class BenchNetTorch(torch.nn.Module):
+        # simple encoder-decoder net
+
+        def __init__(self, in_channels, hidden_channels=32):
+            super().__init__()
+
+            self.net = torch.nn.Sequential(
+                torch.nn.Conv2d(in_channels, hidden_channels, kernel_size=3, padding=1),
+                torch.nn.ReLU(),
+                torch.nn.Conv2d(
+                    hidden_channels, 2 * hidden_channels, kernel_size=3, padding=1
+                ),
+                torch.nn.ReLU(),
+                torch.nn.ConvTranspose2d(
+                    2 * hidden_channels, hidden_channels, kernel_size=3, padding=1
+                ),
+                torch.nn.ReLU(),
+                torch.nn.ConvTranspose2d(
+                    hidden_channels, in_channels, kernel_size=3, padding=1
+                ),
+            )
+
+        def forward(self, input):
+            return self.net(input)
+
+    benchNet = BenchNetTorch(3).to(device)
+    optim = torch.optim.Adam(benchNet.parameters(), lr=1e-3)
+
+    inputs = torch.randn(10, 3, 256, 256, device=device)
+
+    def loss_fn(pred_image, image):
+        return (pred_image - image).abs().mean()
+
+    total_time = 0.0
+    print("PyTorch:")
+    for i in range(steps):
+        start_time = time.perf_counter()
+
+        optim.zero_grad()
+        pred_image = benchNet(inputs)
+        loss = loss_fn(pred_image, inputs)
+        loss.backward()
+        optim.step()
+
+        end_time = time.perf_counter()
+
+        print(f"{i:3d}, time={(end_time-start_time) * 1000:7.2f} ms")
+        total_time += (end_time - start_time) * 1000
+
+    return total_time
+
+
+def main():
+    steps = 20
+    time_mlx = bench_mlx(steps)
+    time_torch = bench_torch(steps)
+
+    print(f"average time of MLX:     {time_mlx/steps:9.2f} ms")
+    print(f"total time of MLX:       {time_mlx:9.2f} ms")
+    print(f"average time of PyTorch: {time_torch/steps:9.2f} ms")
+    print(f"total time of PyTorch:   {time_torch:9.2f} ms")
+
+    diff = time_torch / time_mlx - 1.0
+    print(f"torch/mlx diff: {100. * diff:+5.2f}%")
+
+
+if __name__ == "__main__":
+    main()
--- a/benchmarks/python/conv2d_transpose_bench_cpu.py
+++ b/benchmarks/python/conv2d_transpose_bench_cpu.py
@@ -0,0 +1,129 @@
+import argparse
+import math
+import time
+
+import mlx.core as mx
+import numpy as np
+import torch
+
+N_warmup = 1
+N_iter_bench = 10
+N_iter_func = 5
+
+
+def bench(f, a, b):
+    for i in range(N_warmup):
+        f(a, b)
+
+    s = time.perf_counter_ns()
+    for i in range(N_iter_bench):
+        f(a, b)
+    e = time.perf_counter_ns()
+    return (e - s) * 1e-9
+
+
+def make_mx_conv_transpose_2D(strides=(1, 1), padding=(0, 0), groups=1):
+    def mx_conv_transpose_2D(a, b):
+        ys = []
+        for i in range(N_iter_func):
+            y = mx.conv_transpose2d(
+                a, b, stride=strides, padding=padding, groups=groups, stream=mx.cpu
+            )
+            ys.append(y)
+        mx.eval(ys)
+        return ys
+
+    return mx_conv_transpose_2D
+
+
+def make_pt_conv_transpose_2D(strides=(1, 1), padding=(0, 0), groups=1):
+    @torch.no_grad()
+    def pt_conv_transpose_2D(a, b):
+        ys = []
+        for i in range(N_iter_func):
+            y = torch.conv_transpose2d(
+                a, b, stride=strides, padding=padding, groups=groups
+            )
+            ys.append(y)
+        return ys
+
+    return pt_conv_transpose_2D
+
+
+def bench_shape(N, H, W, C, kH, kW, O, strides, padding, groups, np_dtype):
+    scale = 1.0 / math.sqrt(kH * kH * C)
+    a_np = np.random.uniform(0, 0.5, (N, H, W, C)).astype(np_dtype)
+    b_np = np.random.uniform(-scale, scale, (int(O / groups), kH, kW, C)).astype(
+        np_dtype
+    )
+
+    a_mx = mx.array(a_np)
+    b_mx = mx.array(b_np)
+
+    a_pt = torch.from_numpy(a_np.transpose((0, 3, 1, 2))).to("cpu")
+    b_pt = torch.from_numpy(b_np.transpose((3, 0, 1, 2))).to("cpu")
+
+    f_mx = make_mx_conv_transpose_2D(strides, padding, groups)
+    f_pt = make_pt_conv_transpose_2D(strides, padding, groups)
+
+    time_torch = bench(f_pt, a_pt, b_pt)
+    time_mlx = bench(f_mx, a_mx, b_mx)
+
+    out_mx = mx.conv_transpose2d(
+        a_mx, b_mx, stride=strides, padding=padding, groups=groups, stream=mx.cpu
+    )
+    out_pt = torch.conv_transpose2d(
+        a_pt.to("cpu"), b_pt.to("cpu"), stride=strides, padding=padding, groups=groups
+    )
+    out_pt = torch.permute(out_pt, (0, 2, 3, 1))
+    out_pt = out_pt.numpy(force=True)
+
+    atol = 2e-5 if np_dtype == np.float32 else 1e-4
+
+    if not np.allclose(out_pt, out_mx, atol=atol):
+        print(
+            f"Failed at {(N, H, W, C)}, {(O, kH, kW, C)} [strides = {strides}, padding = {padding}, groups = {groups}] with max(|a - b|) = {np.max(np.abs(out_pt - out_mx))}"
+        )
+
+    return time_mlx, time_torch
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Run conv benchmarks")
+
+    dtypes = ("float32",)
+    shapes = (
+        (4, 32, 32, 32, 5, 5, 32, (1, 1), (2, 2), 1),
+        (4, 32, 32, 64, 5, 5, 64, (1, 1), (2, 2), 1),
+        (4, 32, 32, 128, 5, 5, 128, (1, 1), (2, 2), 1),
+        (4, 32, 32, 256, 5, 5, 256, (1, 1), (2, 2), 1),
+        (4, 32, 32, 512, 5, 5, 512, (1, 1), (2, 2), 1),
+        (4, 64, 64, 32, 5, 5, 32, (1, 1), (2, 2), 1),
+        (4, 64, 64, 64, 5, 5, 64, (1, 1), (2, 2), 1),
+        (4, 64, 64, 128, 5, 5, 128, (1, 1), (2, 2), 1),
+        (4, 64, 64, 256, 5, 5, 256, (1, 1), (2, 2), 1),
+        (4, 128, 128, 32, 5, 5, 32, (1, 1), (2, 2), 1),
+        (4, 128, 128, 64, 5, 5, 64, (1, 1), (2, 2), 1),
+        (4, 128, 128, 128, 5, 5, 128, (1, 1), (2, 2), 1),
+        (4, 256, 256, 32, 5, 5, 3, (1, 1), (2, 2), 1),
+        (4, 256, 256, 3, 5, 5, 32, (1, 1), (2, 2), 1),
+        (4, 128, 128, 64, 5, 5, 3, (1, 1), (2, 2), 1),
+        (4, 128, 128, 3, 5, 5, 64, (1, 1), (2, 2), 1),
+    )
+
+    for dtype in dtypes:
+        print(
+            "(N,   H,   W,   C), (  O, kH, kW,   C),   dtype, stride,   pads,  groups, diff%"
+        )
+        for N, H, W, C, kH, kW, O, strides, padding, groups in shapes:
+            np_dtype = getattr(np, dtype)
+            time_mlx, time_torch = bench_shape(
+                N, H, W, C, kH, kW, O, strides, padding, groups, np_dtype
+            )
+            diff = time_torch / time_mlx - 1.0
+
+            print(
+                f"({N}, {H:3d}, {W:3d}, {C:3d}), ({O:3d}, {kH:2d}, {kW:2d}, {C:3d}), {dtype}, {strides}, {padding}, {groups:7d}, {100. * diff:+5.2f}%"
+            )
+            if time_mlx >= 2.0 * time_torch:
+                print("ATTENTION ^^^^^^^")
--- a/benchmarks/python/conv3d_bench_cpu.py
+++ b/benchmarks/python/conv3d_bench_cpu.py
@@ -0,0 +1,110 @@
+import argparse
+import math
+import time
+
+import mlx.core as mx
+import numpy as np
+import torch
+
+N_warmup = 1
+N_iter_bench = 10
+N_iter_func = 5
+mx.set_default_device(mx.cpu)
+
+
+def bench(f, a, b):
+    for i in range(N_warmup):
+        f(a, b)
+
+    s = time.perf_counter_ns()
+    for i in range(N_iter_bench):
+        f(a, b)
+    e = time.perf_counter_ns()
+    return (e - s) * 1e-9
+
+
+def make_mx_conv_3D(strides=(1, 1), padding=(0, 0), groups=1):
+    def mx_conv_3D(a, b):
+        ys = []
+        for i in range(N_iter_func):
+            y = mx.conv3d(a, b, stride=strides, padding=padding, groups=groups)
+            ys.append(y)
+        mx.eval(ys)
+        return ys
+
+    return mx_conv_3D
+
+
+def make_pt_conv_3D(strides=(1, 1), padding=(0, 0), groups=1):
+    @torch.no_grad()
+    def pt_conv_3D(a, b):
+        ys = []
+        for i in range(N_iter_func):
+            y = torch.conv3d(a, b, stride=strides, padding=padding, groups=groups)
+            ys.append(y)
+        return ys
+
+    return pt_conv_3D
+
+
+def bench_shape(N, D, H, W, C, kD, kH, kW, O, strides, padding, groups, np_dtype):
+    scale = 1.0 / math.sqrt(kD * kH * kW * C)
+    a_np = np.random.uniform(0, 0.5, (N, D, H, W, C)).astype(np_dtype)
+    b_np = np.random.uniform(-scale, scale, (O, kD, kH, kW, int(C / groups))).astype(
+        np_dtype
+    )
+
+    a_mx = mx.array(a_np)
+    b_mx = mx.array(b_np)
+
+    a_pt = torch.from_numpy(a_np.transpose((0, 4, 1, 2, 3))).to("cpu")
+    b_pt = torch.from_numpy(b_np.transpose((0, 4, 1, 2, 3))).to("cpu")
+
+    f_mx = make_mx_conv_3D(strides, padding, groups)
+    f_pt = make_pt_conv_3D(strides, padding, groups)
+
+    time_torch = bench(f_pt, a_pt, b_pt)
+    time_mlx = bench(f_mx, a_mx, b_mx)
+
+    out_mx = mx.conv3d(a_mx, b_mx, stride=strides, padding=padding, groups=groups)
+    out_pt = torch.conv3d(
+        a_pt.to("cpu"), b_pt.to("cpu"), stride=strides, padding=padding, groups=groups
+    )
+    out_pt = torch.permute(out_pt, (0, 2, 3, 4, 1))
+    out_pt = out_pt.numpy(force=True)
+
+    atol = 2e-5 if np_dtype == np.float32 else 1e-4
+
+    if not np.allclose(out_pt, out_mx, atol=atol):
+        print(
+            f"Failed at {(N, D, H, W, C)}, {(O, kD, kH, kW, C)} [strides = {strides}, padding = {padding}, groups = {groups}] with max(|a - b|) = {np.max(np.abs(out_pt - out_mx))}"
+        )
+
+    return time_mlx, time_torch
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Run conv benchmarks")
+
+    dtypes = ("float32",)
+    shapes = (
+        (4, 16, 16, 16, 16, 5, 5, 5, 16, (1, 1, 1), (2, 2, 2), 1),
+        (4, 16, 16, 16, 32, 5, 5, 5, 32, (1, 1, 1), (2, 2, 2), 1),
+    )
+
+    for dtype in dtypes:
+        print(
+            "(N,   D,   H,   W,   C), (  O, kD, kH, kW,   C),   dtype,    stride,      pads,  groups, diff%"
+        )
+        for N, D, H, W, C, kD, kH, kW, O, strides, padding, groups in shapes:
+            np_dtype = getattr(np, dtype)
+            time_mlx, time_torch = bench_shape(
+                N, D, H, W, C, kD, kH, kW, O, strides, padding, groups, np_dtype
+            )
+            diff = time_torch / time_mlx - 1.0
+
+            print(
+                f"({N}, {D:3d}, {H:3d}, {W:3d}, {C:3d}), ({O:3d}, {kD:2d}, {kH:2d}, {kW:2d}, {C:3d}), {dtype}, {strides}, {padding}, {groups:7d}, {100. * diff:+5.2f}%"
+            )
+            if time_mlx >= 2.0 * time_torch:
+                print("ATTENTION ^^^^^^^")
--- a/benchmarks/python/conv3d_train_bench_cpu.py
+++ b/benchmarks/python/conv3d_train_bench_cpu.py
@@ -0,0 +1,143 @@
+import time
+
+import mlx.core as mx
+import mlx.nn
+import mlx.optimizers as opt
+import torch
+
+
+def bench_mlx(steps: int = 20, shape=(10, 32, 32, 32, 3)) -> float:
+    mx.set_default_device(mx.cpu)
+
+    class BenchNetMLX(mlx.nn.Module):
+        # simple encoder-decoder net
+
+        def __init__(self, in_channels, hidden_channels=16):
+            super().__init__()
+
+            self.net = mlx.nn.Sequential(
+                mlx.nn.Conv3d(in_channels, hidden_channels, kernel_size=3, padding=1),
+                mlx.nn.ReLU(),
+                mlx.nn.Conv3d(
+                    hidden_channels, 2 * hidden_channels, kernel_size=3, padding=1
+                ),
+                mlx.nn.ReLU(),
+                mlx.nn.ConvTranspose3d(
+                    2 * hidden_channels, hidden_channels, kernel_size=3, padding=1
+                ),
+                mlx.nn.ReLU(),
+                mlx.nn.ConvTranspose3d(
+                    hidden_channels, in_channels, kernel_size=3, padding=1
+                ),
+            )
+
+        def __call__(self, input):
+            return self.net(input)
+
+    benchNet = BenchNetMLX(3)
+    mx.eval(benchNet.parameters())
+    optim = opt.Adam(learning_rate=1e-3)
+
+    inputs = mx.random.normal(shape)
+
+    params = benchNet.parameters()
+    optim.init(params)
+
+    state = [benchNet.state, optim.state]
+
+    def loss_fn(params, image):
+        benchNet.update(params)
+        pred_image = benchNet(image)
+        return (pred_image - image).abs().mean()
+
+    def step(params, image):
+        loss, grads = mx.value_and_grad(loss_fn)(params, image)
+        optim.update(benchNet, grads)
+        return loss
+
+    total_time = 0.0
+    print("MLX:")
+    for i in range(steps):
+        start_time = time.perf_counter()
+
+        step(benchNet.parameters(), inputs)
+        mx.eval(state)
+        end_time = time.perf_counter()
+
+        print(f"{i:3d}, time={(end_time-start_time) * 1000:7.2f} ms")
+        total_time += (end_time - start_time) * 1000
+
+    return total_time
+
+
+def bench_torch(steps: int = 20, shape=(10, 3, 32, 32, 32)) -> float:
+    device = torch.device("cpu")
+
+    class BenchNetTorch(torch.nn.Module):
+        # simple encoder-decoder net
+
+        def __init__(self, in_channels, hidden_channels=16):
+            super().__init__()
+
+            self.net = torch.nn.Sequential(
+                torch.nn.Conv3d(in_channels, hidden_channels, kernel_size=3, padding=1),
+                torch.nn.ReLU(),
+                torch.nn.Conv3d(
+                    hidden_channels, 2 * hidden_channels, kernel_size=3, padding=1
+                ),
+                torch.nn.ReLU(),
+                torch.nn.ConvTranspose3d(
+                    2 * hidden_channels, hidden_channels, kernel_size=3, padding=1
+                ),
+                torch.nn.ReLU(),
+                torch.nn.ConvTranspose3d(
+                    hidden_channels, in_channels, kernel_size=3, padding=1
+                ),
+            )
+
+        def forward(self, input):
+            return self.net(input)
+
+    benchNet = BenchNetTorch(3).to(device)
+    optim = torch.optim.Adam(benchNet.parameters(), lr=1e-3)
+
+    inputs = torch.randn(*shape, device=device)
+
+    def loss_fn(pred_image, image):
+        return (pred_image - image).abs().mean()
+
+    total_time = 0.0
+    print("PyTorch:")
+    for i in range(steps):
+        start_time = time.perf_counter()
+
+        optim.zero_grad()
+        pred_image = benchNet(inputs)
+        loss = loss_fn(pred_image, inputs)
+        loss.backward()
+        optim.step()
+
+        end_time = time.perf_counter()
+
+        print(f"{i:3d}, time={(end_time-start_time) * 1000:7.2f} ms")
+        total_time += (end_time - start_time) * 1000
+
+    return total_time
+
+
+def main():
+    steps = 10
+    time_mlx = bench_mlx(steps)
+    time_torch = bench_torch(steps)
+
+    print(f"average time of MLX:     {time_mlx/steps:9.2f} ms")
+    print(f"total time of MLX:       {time_mlx:9.2f} ms")
+    print(f"average time of PyTorch: {time_torch/steps:9.2f} ms")
+    print(f"total time of PyTorch:   {time_torch:9.2f} ms")
+
+    diff = time_torch / time_mlx - 1.0
+    print(f"torch/mlx diff: {100. * diff:+5.2f}%")
+
+
+if __name__ == "__main__":
+    main()
--- a/benchmarks/python/conv3d_transpose_bench_cpu.py
+++ b/benchmarks/python/conv3d_transpose_bench_cpu.py
@@ -0,0 +1,116 @@
+import argparse
+import math
+import time
+
+import mlx.core as mx
+import numpy as np
+import torch
+
+N_warmup = 1
+N_iter_bench = 10
+N_iter_func = 5
+mx.set_default_device(mx.cpu)
+
+
+def bench(f, a, b):
+    for i in range(N_warmup):
+        f(a, b)
+
+    s = time.perf_counter_ns()
+    for i in range(N_iter_bench):
+        f(a, b)
+    e = time.perf_counter_ns()
+    return (e - s) * 1e-9
+
+
+def make_mx_conv_3D(strides=(1, 1, 1), padding=(0, 0, 0), groups=1):
+    def mx_conv_3D(a, b):
+        ys = []
+        for i in range(N_iter_func):
+            y = mx.conv_transpose3d(
+                a, b, stride=strides, padding=padding, groups=groups
+            )
+            ys.append(y)
+        mx.eval(ys)
+        return ys
+
+    return mx_conv_3D
+
+
+def make_pt_conv_3D(strides=(1, 1, 1), padding=(0, 0, 0), groups=1):
+    @torch.no_grad()
+    def pt_conv_3D(a, b):
+        ys = []
+        for i in range(N_iter_func):
+            y = torch.conv_transpose3d(
+                a, b, stride=strides, padding=padding, groups=groups
+            )
+            ys.append(y)
+        return ys
+
+    return pt_conv_3D
+
+
+def bench_shape(N, D, H, W, C, kD, kH, kW, O, strides, padding, groups, np_dtype):
+    scale = 1.0 / math.sqrt(kD * kH * kW * C)
+    a_np = np.random.uniform(0, 0.5, (N, D, H, W, C)).astype(np_dtype)
+    b_np = np.random.uniform(-scale, scale, (O, kD, kH, kW, int(C / groups))).astype(
+        np_dtype
+    )
+
+    a_mx = mx.array(a_np)
+    b_mx = mx.array(b_np)
+
+    a_pt = torch.from_numpy(a_np.transpose((0, 4, 1, 2, 3))).to("cpu")
+    b_pt = torch.from_numpy(b_np.transpose((4, 0, 1, 2, 3))).to("cpu")
+
+    f_mx = make_mx_conv_3D(strides, padding, groups)
+    f_pt = make_pt_conv_3D(strides, padding, groups)
+
+    time_torch = bench(f_pt, a_pt, b_pt)
+    time_mlx = bench(f_mx, a_mx, b_mx)
+
+    out_mx = mx.conv_transpose3d(
+        a_mx, b_mx, stride=strides, padding=padding, groups=groups
+    )
+    out_pt = torch.conv_transpose3d(
+        a_pt.to("cpu"), b_pt.to("cpu"), stride=strides, padding=padding, groups=groups
+    )
+    out_pt = torch.permute(out_pt, (0, 2, 3, 4, 1))
+    out_pt = out_pt.numpy(force=True)
+
+    atol = 2e-5 if np_dtype == np.float32 else 1e-4
+
+    if not np.allclose(out_pt, out_mx, atol=atol):
+        print(
+            f"Failed at {(N, D, H, W, C)}, {(O, kD, kH, kW, C)} [strides = {strides}, padding = {padding}, groups = {groups}] with max(|a - b|) = {np.max(np.abs(out_pt - out_mx))}"
+        )
+
+    return time_mlx, time_torch
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Run conv benchmarks")
+
+    dtypes = ("float32",)
+    shapes = (
+        (4, 16, 16, 16, 16, 5, 5, 5, 16, (1, 1, 1), (2, 2, 2), 1),
+        (4, 16, 16, 16, 32, 5, 5, 5, 32, (1, 1, 1), (2, 2, 2), 1),
+    )
+
+    for dtype in dtypes:
+        print(
+            "(N,   D,   H,   W,   C), (  O, kD, kH, kW,   C),   dtype,    stride,      pads,  groups, diff%"
+        )
+        for N, D, H, W, C, kD, kH, kW, O, strides, padding, groups in shapes:
+            np_dtype = getattr(np, dtype)
+            time_mlx, time_torch = bench_shape(
+                N, D, H, W, C, kD, kH, kW, O, strides, padding, groups, np_dtype
+            )
+            diff = time_torch / time_mlx - 1.0
+
+            print(
+                f"({N}, {D:3d}, {H:3d}, {W:3d}, {C:3d}), ({O:3d}, {kD:2d}, {kH:2d}, {kW:2d}, {C:3d}), {dtype}, {strides}, {padding}, {groups:7d}, {100. * diff:+5.2f}%"
+            )
+            if time_mlx >= 2.0 * time_torch:
+                print("ATTENTION ^^^^^^^")
--- a/benchmarks/python/conv_transpose_bench.py
+++ b/benchmarks/python/conv_transpose_bench.py
@@ -0,0 +1,135 @@
+import argparse
+import math
+import os
+import subprocess
+import time
+
+import mlx.core as mx
+import numpy as np
+import torch
+
+N_warmup = 10
+N_iter_bench = 100
+N_iter_func = 5
+
+
+def bench(f, a, b):
+    for i in range(N_warmup):
+        f(a, b)
+    torch.mps.synchronize()
+
+    s = time.perf_counter_ns()
+    for i in range(N_iter_bench):
+        f(a, b)
+    e = time.perf_counter_ns()
+    return (e - s) * 1e-9
+
+
+def make_mx_conv_transpose_2D(strides=(1, 1), padding=(0, 0), groups=1):
+    def mx_conv_transpose_2D(a, b):
+        ys = []
+        for i in range(N_iter_func):
+            y = mx.conv_transpose2d(
+                a, b, stride=strides, padding=padding, groups=groups
+            )
+            ys.append(y)
+        mx.eval(ys)
+        return ys
+
+    return mx_conv_transpose_2D
+
+
+def make_pt_conv_transpose_2D(strides=(1, 1), padding=(0, 0), groups=1):
+    @torch.no_grad()
+    def pt_conv_transpose_2D(a, b):
+        ys = []
+        for i in range(N_iter_func):
+            y = torch.conv_transpose2d(
+                a, b, stride=strides, padding=padding, groups=groups
+            )
+            ys.append(y)
+        torch.mps.synchronize()
+        return ys
+
+    return pt_conv_transpose_2D
+
+
+def bench_shape(N, H, W, C, kH, kW, O, strides, padding, groups, np_dtype):
+    scale = 1.0 / math.sqrt(kH * kH * C)
+    a_np = np.random.uniform(0, 0.5, (N, H, W, C)).astype(np_dtype)
+    b_np = np.random.uniform(-scale, scale, (O, kH, kW, int(C / groups))).astype(
+        np_dtype
+    )
+
+    a_mx = mx.array(a_np)
+    b_mx = mx.array(b_np)
+
+    a_pt = torch.from_numpy(a_np.transpose((0, 3, 1, 2))).to("mps")
+    b_pt = torch.from_numpy(b_np.transpose((3, 0, 1, 2))).to("mps")
+
+    torch.mps.synchronize()
+
+    f_mx = make_mx_conv_transpose_2D(strides, padding, groups)
+    f_pt = make_pt_conv_transpose_2D(strides, padding, groups)
+
+    time_torch = bench(f_pt, a_pt, b_pt)
+    time_mlx = bench(f_mx, a_mx, b_mx)
+
+    out_mx = mx.conv_transpose2d(
+        a_mx, b_mx, stride=strides, padding=padding, groups=groups
+    )
+    out_pt = torch.conv_transpose2d(
+        a_pt.to("cpu"), b_pt.to("cpu"), stride=strides, padding=padding, groups=groups
+    )
+    out_pt = torch.permute(out_pt, (0, 2, 3, 1))
+    out_pt = out_pt.numpy(force=True)
+
+    atol = 2e-5 if np_dtype == np.float32 else 1e-4
+
+    if not np.allclose(out_pt, out_mx, atol=atol):
+        print(
+            f"Failed at {(N, H, W, C)}, {(O, kH, kW, C)} [strides = {strides}, padding = {padding}, groups = {groups}] with max(|a - b|) = {np.max(np.abs(out_pt - out_mx))}"
+        )
+
+    return time_mlx, time_torch
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Run conv benchmarks")
+
+    dtypes = ("float32",)
+    shapes = (
+        (4, 32, 32, 32, 5, 5, 32, (1, 1), (2, 2), 1),
+        (4, 32, 32, 64, 5, 5, 64, (1, 1), (2, 2), 1),
+        (4, 32, 32, 128, 5, 5, 128, (1, 1), (2, 2), 1),
+        (4, 32, 32, 256, 5, 5, 256, (1, 1), (2, 2), 1),
+        (4, 32, 32, 512, 5, 5, 512, (1, 1), (2, 2), 1),
+        (4, 64, 64, 32, 5, 5, 32, (1, 1), (2, 2), 1),
+        (4, 64, 64, 64, 5, 5, 64, (1, 1), (2, 2), 1),
+        (4, 64, 64, 128, 5, 5, 128, (1, 1), (2, 2), 1),
+        (4, 64, 64, 256, 5, 5, 256, (1, 1), (2, 2), 1),
+        (4, 128, 128, 32, 5, 5, 32, (1, 1), (2, 2), 1),
+        (4, 128, 128, 64, 5, 5, 64, (1, 1), (2, 2), 1),
+        (4, 128, 128, 128, 5, 5, 128, (1, 1), (2, 2), 1),
+        (4, 256, 256, 32, 5, 5, 3, (1, 1), (2, 2), 1),
+        (4, 256, 256, 3, 5, 5, 32, (1, 1), (2, 2), 1),
+        (4, 128, 128, 64, 5, 5, 3, (1, 1), (2, 2), 1),
+        (4, 128, 128, 3, 5, 5, 64, (1, 1), (2, 2), 1),
+    )
+
+    for dtype in dtypes:
+        print(
+            "(N,   H,   W,   C), (  O, kH, kW,   C),   dtype, stride,   pads,  groups, diff%"
+        )
+        for N, H, W, C, kH, kW, O, strides, padding, groups in shapes:
+            np_dtype = getattr(np, dtype)
+            time_mlx, time_torch = bench_shape(
+                N, H, W, C, kH, kW, O, strides, padding, groups, np_dtype
+            )
+            diff = time_torch / time_mlx - 1.0
+
+            print(
+                f"({N}, {H:3d}, {W:3d}, {C:3d}), ({O:3d}, {kH:2d}, {kW:2d}, {C:3d}), {dtype}, {strides}, {padding}, {groups:7d}, {100. * diff:+5.2f}%"
+            )
+            if time_mlx >= 2.0 * time_torch:
+                print("ATTENTION ^^^^^^^")
--- a/benchmarks/python/distributed_bench.py
+++ b/benchmarks/python/distributed_bench.py
@@ -0,0 +1,66 @@
+# Copyright © 2024 Apple Inc.
+
+"""
+Run with:
+    mpirun -n 2 python /path/to/distributed_bench.py
+"""
+
+import time
+
+import mlx.core as mx
+
+
+def time_fn(fn, *args, **kwargs):
+    msg = kwargs.pop("msg", None)
+    world = mx.distributed.init()
+    if world.rank() == 0:
+        if msg:
+            print(f"Timing {msg} ...", end=" ")
+        else:
+            print(f"Timing {fn.__name__} ...", end=" ")
+
+    # warmup
+    for _ in range(5):
+        mx.eval(fn(*args, **kwargs))
+
+    num_iters = 100
+    tic = time.perf_counter()
+    for _ in range(num_iters):
+        x = mx.eval(fn(*args, **kwargs))
+    toc = time.perf_counter()
+
+    msec = 1e3 * (toc - tic) / num_iters
+    if world.rank() == 0:
+        print(f"{msec:.5f} msec")
+
+
+def time_all_sum():
+    shape = (4096,)
+    x = mx.random.uniform(shape=shape)
+    mx.eval(x)
+
+    def sine(x):
+        for _ in range(20):
+            x = mx.sin(x)
+        return x
+
+    time_fn(sine, x)
+
+    def all_sum_plain(x):
+        for _ in range(20):
+            x = mx.distributed.all_sum(x)
+        return x
+
+    time_fn(all_sum_plain, x)
+
+    def all_sum_with_sine(x):
+        for _ in range(20):
+            x = mx.sin(x)
+            x = mx.distributed.all_sum(x)
+        return x
+
+    time_fn(all_sum_with_sine, x)
+
+
+if __name__ == "__main__":
+    time_all_sum()
--- a/cmake/extension.cmake
+++ b/cmake/extension.cmake
@@ -1,56 +1,41 @@
 include(CMakeParseArguments)

-###############################################################################
+# ##############################################################################
 # Build metal library
 #
 # Adds a custom target ${TARGET} to build ${OUTPUT_DIRECTORY}/{TITLE}.metallib
 # from list ${SOURCES}, including list ${INCLUDE_DIRS}, depends on list ${DEPS}
 #
-# Args:
-#     TARGET: Custom target to be added for the metal library 
-#     TITLE: Name of the .metallib
-#     OUTPUT_DIRECTORY: Where to place ${TITLE}.metallib
-#     SOURCES: List of source files
-#     INCLUDE_DIRS: List of include dirs
-#     DEPS: List of dependency files (like headers)
+# Args: TARGET: Custom target to be added for the metal library TITLE: Name of
+# the .metallib OUTPUT_DIRECTORY: Where to place ${TITLE}.metallib SOURCES: List
+# of source files INCLUDE_DIRS: List of include dirs DEPS: List of dependency
+# files (like headers)
 #
 macro(mlx_build_metallib)
  # Parse args
  set(oneValueArgs TARGET TITLE OUTPUT_DIRECTORY)
  set(multiValueArgs SOURCES INCLUDE_DIRS DEPS)
-  cmake_parse_arguments(
-      MTLLIB 
-      ""
-      "${oneValueArgs}"
-      "${multiValueArgs}" 
-      ${ARGN}
-  )
+  cmake_parse_arguments(MTLLIB "" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})

  # Set output
  set(MTLLIB_BUILD_TARGET "${MTLLIB_OUTPUT_DIRECTORY}/${MTLLIB_TITLE}.metallib")

-  # Collect compile options 
+  # Collect compile options
  set(MTLLIB_COMPILE_OPTIONS -Wall -Wextra -fno-fast-math)

  # Prepare metallib build command
  add_custom_command(
    OUTPUT ${MTLLIB_BUILD_TARGET}
-    COMMAND xcrun -sdk macosx metal 
-                  "$<LIST:TRANSFORM,${MTLLIB_INCLUDE_DIRS},PREPEND,-I>"
-                  ${MTLLIB_COMPILE_OPTIONS}
-                  ${MTLLIB_SOURCES}
-                  -o ${MTLLIB_BUILD_TARGET}
+    COMMAND
+      xcrun -sdk macosx metal
+      "$<LIST:TRANSFORM,${MTLLIB_INCLUDE_DIRS},PREPEND,-I>"
+      ${MTLLIB_COMPILE_OPTIONS} ${MTLLIB_SOURCES} -o ${MTLLIB_BUILD_TARGET}
    DEPENDS ${MTLLIB_DEPS} ${MTLLIB_SOURCES}
    COMMAND_EXPAND_LISTS
    COMMENT "Building ${MTLLIB_TITLE}.metallib"
-    VERBATIM
-  )
+    VERBATIM)

  # Add metallib custom target
-  add_custom_target(
-    ${MTLLIB_TARGET}
-    DEPENDS
-    ${MTLLIB_BUILD_TARGET}
-  )
+  add_custom_target(${MTLLIB_TARGET} DEPENDS ${MTLLIB_BUILD_TARGET})

-endmacro(mlx_build_metallib)
+endmacro(mlx_build_metallib)
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -1,3 +1,4 @@
 sphinx
 breathe
 sphinx-book-theme
+mlx
--- a/docs/src/conf.py
+++ b/docs/src/conf.py
@@ -83,3 +83,15 @@ def setup(app):
 # -- Options for LaTeX output ------------------------------------------------

 latex_documents = [(main_doc, "MLX.tex", "MLX Documentation", author, "manual")]
+latex_elements = {
+    "preamble": r"""
+    \usepackage{enumitem}
+    \setlistdepth{5}
+    \setlist[itemize,1]{label=$\bullet$}
+    \setlist[itemize,2]{label=$\bullet$}
+    \setlist[itemize,3]{label=$\bullet$}
+    \setlist[itemize,4]{label=$\bullet$}
+    \setlist[itemize,5]{label=$\bullet$}
+    \renewlist{itemize}{itemize}{5}
+""",
+}
--- a/docs/src/dev/custom_metal_kernels.rst
+++ b/docs/src/dev/custom_metal_kernels.rst
@@ -0,0 +1,421 @@
+Custom Metal Kernels
+====================
+
+MLX supports writing custom Metal kernels through the Python and C++ APIs.
+
+Simple Example
+--------------
+
+Let's write a custom kernel that computes ``exp`` elementwise:
+
+.. code-block:: python
+
+  def exp_elementwise(a: mx.array):
+      source = """
+          uint elem = thread_position_in_grid.x;
+          T tmp = inp[elem];
+          out[elem] = metal::exp(tmp);
+      """
+
+      kernel = mx.fast.metal_kernel(
+          name="myexp",
+          input_names=["inp"],
+          output_names=["out"],
+          source=source,
+      )
+      outputs = kernel(
+          inputs=[a],
+          template=[("T", mx.float32)],
+          grid=(a.size, 1, 1),
+          threadgroup=(256, 1, 1),
+          output_shapes=[a.shape],
+          output_dtypes=[a.dtype],
+      )
+      return outputs[0]
+
+  a = mx.random.normal(shape=(4, 16)).astype(mx.float16)
+  b = exp_elementwise(a)
+  assert mx.allclose(b, mx.exp(a))
+
+.. note::
+    We are only required to pass the body of the Metal kernel in ``source``.
+
+The full function signature will be generated using:
+
+* The shapes/dtypes of ``inputs``
+    In the above, ``a`` is an ``mx.array`` of type ``mx.float16`` and we pass it with the key ``inp``
+    so we will add ``const device float16_t* inp`` to the signature.
+    ``inp_shape``, ``inp_strides`` and ``inp_ndim`` are also added for convenience if they are present
+    in ``source``.
+* The list of ``output_dtypes``
+    In the above, ``out`` is an ``mx.array`` of type ``mx.float16``
+    so we add ``device float16_t* out``.
+* Template parameters passed using ``template``
+    In the above, ``template=[("T", mx.float32)]`` adds a template of ``template <typename T>`` to the function
+    and instantiates the template with ``custom_kernel_myexp_float<float>``.
+    Template parameters can be ``mx.core.Dtype``, ``int`` or ``bool``.
+* Metal attributes used in ``source`` such as ``[[thread_position_in_grid]]``
+    These will be added as function arguments.
+    All the attributes defined in Table 5.8 of the `Metal Shading Language Specification <https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf>`_ are supported.
+
+Putting this all together, the generated function signature for ``myexp`` is as follows:
+
+.. code-block:: cpp
+
+  template <typename T>
+  [[kernel]] void custom_kernel_myexp_float(
+    const device float16_t* inp [[buffer(0)]],
+    device float16_t* out [[buffer(1)]],
+    uint3 thread_position_in_grid [[thread_position_in_grid]]) {
+
+          uint elem = thread_position_in_grid.x;
+          T tmp = inp[elem];
+          out[elem] = metal::exp(tmp);
+
+  }
+
+  template [[host_name("custom_kernel_myexp_float")]] [[kernel]] decltype(custom_kernel_myexp_float<float>) custom_kernel_myexp_float<float>;
+
+Passing ``verbose=True`` to ``mx.fast.metal_kernel.__call__`` will print the generated code for debugging purposes.
+
+Using Shape/Strides
+-------------------
+
+``mx.fast.metal_kernel`` supports an argument ``ensure_row_contiguous`` which is ``True`` by default.
+This will copy the ``mx.array`` inputs if needed before the kernel is launched to ensure that the memory layout is row contiguous.
+Generally this makes writing the kernel easier, since we don't have to worry about gaps or the ordering of the dims
+when indexing.
+
+If we want to avoid this copy, ``metal_kernel`` automatically passes ``a_shape``, ``a_strides`` and ``a_ndim`` for each
+input array ``a`` if any are present in ``source``.
+We can then use MLX's built in indexing utils to fetch the right elements for each thread.
+
+Let's convert ``myexp`` above to support arbitrarily strided arrays without relying on a copy from ``ensure_row_contiguous``:
+
+.. code-block:: python
+
+  def exp_elementwise(a: mx.array):
+      source = """
+          uint elem = thread_position_in_grid.x;
+          // Utils from `mlx/backend/metal/kernels/utils.h` are automatically included
+          uint loc = elem_to_loc(elem, inp_shape, inp_strides, inp_ndim);
+          T tmp = inp[loc];
+          // Output arrays are always row contiguous
+          out[elem] = metal::exp(tmp);
+      """
+
+      kernel = mx.fast.metal_kernel(
+          name="myexp_strided",
+          input_names=["inp"],
+          output_names=["out"],
+          source=source
+      )
+      outputs = kernel(
+          inputs=[a],
+          template=[("T", mx.float32)],
+          grid=(a.size, 1, 1),
+          threadgroup=(256, 1, 1),
+          output_shapes=[a.shape],
+          output_dtypes=[a.dtype],
+          ensure_row_contiguous=False,
+      )
+      return outputs[0]
+
+  a = mx.random.normal(shape=(4, 16)).astype(mx.float16)
+  # make non-contiguous
+  a = a[::2]
+  b = exp_elementwise(a)
+  assert mx.allclose(b, mx.exp(a))
+
+Complex Example
+-----------------------------
+
+Let's implement a more complex example: ``grid_sample`` in ``"bilinear"`` mode.
+
+We'll start with the following MLX implementation using standard ops:
+
+.. code-block:: python
+
+    def grid_sample_ref(x, grid):
+        N, H_in, W_in, _ = x.shape
+        ix = ((grid[..., 0] + 1) * W_in - 1) / 2
+        iy = ((grid[..., 1] + 1) * H_in - 1) / 2
+
+        ix_nw = mx.floor(ix).astype(mx.int32)
+        iy_nw = mx.floor(iy).astype(mx.int32)
+
+        ix_ne = ix_nw + 1
+        iy_ne = iy_nw
+
+        ix_sw = ix_nw
+        iy_sw = iy_nw + 1
+
+        ix_se = ix_nw + 1
+        iy_se = iy_nw + 1
+
+        nw = (ix_se - ix)    * (iy_se - iy)
+        ne = (ix    - ix_sw) * (iy_sw - iy)
+        sw = (ix_ne - ix)    * (iy    - iy_ne)
+        se = (ix    - ix_nw) * (iy    - iy_nw)
+
+        I_nw = x[mx.arange(N)[:, None, None], iy_nw, ix_nw, :]
+        I_ne = x[mx.arange(N)[:, None, None], iy_ne, ix_ne, :]
+        I_sw = x[mx.arange(N)[:, None, None], iy_sw, ix_sw, :]
+        I_se = x[mx.arange(N)[:, None, None], iy_se, ix_se, :]
+
+        mask_nw = (iy_nw >= 0) & (iy_nw <= H_in - 1) & (ix_nw >= 0) & (ix_nw <= W_in - 1)
+        mask_ne = (iy_ne >= 0) & (iy_ne <= H_in - 1) & (ix_ne >= 0) & (ix_ne <= W_in - 1)
+        mask_sw = (iy_sw >= 0) & (iy_sw <= H_in - 1) & (ix_sw >= 0) & (ix_sw <= W_in - 1)
+        mask_se = (iy_se >= 0) & (iy_se <= H_in - 1) & (ix_se >= 0) & (ix_se <= W_in - 1)
+
+        I_nw *= mask_nw[..., None]
+        I_ne *= mask_ne[..., None]
+        I_sw *= mask_sw[..., None]
+        I_se *= mask_se[..., None]
+
+        output = nw[..., None] * I_nw + ne[..., None] * I_ne + sw[..., None] * I_sw + se[..., None] * I_se
+
+        return output
+
+Now let's use ``mx.custom_function`` together with ``mx.fast.metal_kernel``
+to write a fast GPU kernel for both the forward and backward passes.
+
+First we'll implement the forward pass as a fused kernel:
+
+.. code-block:: python
+
+    @mx.custom_function
+    def grid_sample(x, grid):
+
+        assert x.ndim == 4, "`x` must be 4D."
+        assert grid.ndim == 4, "`grid` must be 4D."
+
+        B, _, _, C = x.shape
+        _, gN, gM, D = grid.shape
+        out_shape = (B, gN, gM, C)
+
+        assert D == 2, "Last dim of `grid` must be size 2."
+
+        source = """
+            uint elem = thread_position_in_grid.x;
+            int H = x_shape[1];
+            int W = x_shape[2];
+            int C = x_shape[3];
+            int gH = grid_shape[1];
+            int gW = grid_shape[2];
+
+            int w_stride = C;
+            int h_stride = W * w_stride;
+            int b_stride = H * h_stride;
+
+            uint grid_idx = elem / C * 2;
+            float ix = ((grid[grid_idx] + 1) * W - 1) / 2;
+            float iy = ((grid[grid_idx + 1] + 1) * H - 1) / 2;
+
+            int ix_nw = floor(ix);
+            int iy_nw = floor(iy);
+
+            int ix_ne = ix_nw + 1;
+            int iy_ne = iy_nw;
+
+            int ix_sw = ix_nw;
+            int iy_sw = iy_nw + 1;
+
+            int ix_se = ix_nw + 1;
+            int iy_se = iy_nw + 1;
+
+            T nw = (ix_se - ix)    * (iy_se - iy);
+            T ne = (ix    - ix_sw) * (iy_sw - iy);
+            T sw = (ix_ne - ix)    * (iy    - iy_ne);
+            T se = (ix    - ix_nw) * (iy    - iy_nw);
+
+            int batch_idx = elem / C / gH / gW * b_stride;
+            int channel_idx = elem % C;
+            int base_idx = batch_idx + channel_idx;
+
+            T I_nw = x[base_idx + iy_nw * h_stride + ix_nw * w_stride];
+            T I_ne = x[base_idx + iy_ne * h_stride + ix_ne * w_stride];
+            T I_sw = x[base_idx + iy_sw * h_stride + ix_sw * w_stride];
+            T I_se = x[base_idx + iy_se * h_stride + ix_se * w_stride];
+
+            I_nw = iy_nw >= 0 && iy_nw <= H - 1 && ix_nw >= 0 && ix_nw <= W - 1 ? I_nw : 0;
+            I_ne = iy_ne >= 0 && iy_ne <= H - 1 && ix_ne >= 0 && ix_ne <= W - 1 ? I_ne : 0;
+            I_sw = iy_sw >= 0 && iy_sw <= H - 1 && ix_sw >= 0 && ix_sw <= W - 1 ? I_sw : 0;
+            I_se = iy_se >= 0 && iy_se <= H - 1 && ix_se >= 0 && ix_se <= W - 1 ? I_se : 0;
+
+            out[elem] = nw * I_nw + ne * I_ne + sw * I_sw + se * I_se;
+        """
+        kernel = mx.fast.metal_kernel(
+            name="grid_sample",
+            input_names=["x", "grid"],
+            output_names=["out"],
+            source=source,
+        )
+        outputs = kernel(
+            inputs=[x, grid],
+            template=[("T", x.dtype)],
+            output_shapes=[out_shape],
+            output_dtypes=[x.dtype],
+            grid=(np.prod(out_shape), 1, 1),
+            threadgroup=(256, 1, 1),
+        )
+        return outputs[0]
+
+For a reasonably sized input such as:
+
+.. code-block:: python
+
+    x.shape = (8, 1024, 1024, 64)
+    grid.shape = (8, 256, 256, 2)
+
+On an M1 Max, we see a big performance improvement:
+
+``55.7ms -> 6.7ms => 8x speed up``
+
+Grid Sample VJP
+---------------
+
+Since we decorated ``grid_sample`` with ``mx.custom_function``, we can now define
+its custom vjp transform so MLX can differentiate it.
+
+The backwards pass requires atomically updating ``x_grad``/``grid_grad`` and so
+requires a few extra ``mx.fast.metal_kernel`` features:
+
+* ``init_value=0``
+    Initialize all of the kernel's outputs to this value before it runs. This allows us to update only part of the output arrays with the kernel.
+
+* ``atomic_outputs=True``
+    Designate all of the kernel outputs as ``atomic`` in the function signature. 
+    This means we can use Metal's ``atomic`` features to simultaneously update the ``x_grad`` and ``grid_grad`` arrays from multiple threadgroups. 
+    See section 6.15 of the `Metal Shading Language Specification <https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf>`_ for more details.
+
+We can then implement the backwards pass as follows:
+
+.. code-block:: python
+
+    @grid_sample.vjp
+    def grid_sample_vjp(primals, cotangent, _):
+        x, grid = primals
+        B, _, _, C = x.shape
+        _, gN, gM, D = grid.shape
+
+        assert D == 2, "Last dim of `grid` must be size 2."
+
+        source = """
+            uint elem = thread_position_in_grid.x;
+            int H = x_shape[1];
+            int W = x_shape[2];
+            int C = x_shape[3];
+            // Pad C to the nearest larger simdgroup size multiple
+            int C_padded = ceildiv(C, threads_per_simdgroup) * threads_per_simdgroup;
+
+            int gH = grid_shape[1];
+            int gW = grid_shape[2];
+
+            int w_stride = C;
+            int h_stride = W * w_stride;
+            int b_stride = H * h_stride;
+
+            uint grid_idx = elem / C_padded * 2;
+            float ix = ((grid[grid_idx] + 1) * W - 1) / 2;
+            float iy = ((grid[grid_idx + 1] + 1) * H - 1) / 2;
+
+            int ix_nw = floor(ix);
+            int iy_nw = floor(iy);
+
+            int ix_ne = ix_nw + 1;
+            int iy_ne = iy_nw;
+
+            int ix_sw = ix_nw;
+            int iy_sw = iy_nw + 1;
+
+            int ix_se = ix_nw + 1;
+            int iy_se = iy_nw + 1;
+
+            T nw = (ix_se - ix)    * (iy_se - iy);
+            T ne = (ix    - ix_sw) * (iy_sw - iy);
+            T sw = (ix_ne - ix)    * (iy    - iy_ne);
+            T se = (ix    - ix_nw) * (iy    - iy_nw);
+
+            int batch_idx = elem / C_padded / gH / gW * b_stride;
+            int channel_idx = elem % C_padded;
+            int base_idx = batch_idx + channel_idx;
+
+            T gix = T(0);
+            T giy = T(0);
+            if (channel_idx < C) {
+                int cot_index = elem / C_padded * C + channel_idx;
+                T cot = cotangent[cot_index];
+                if (iy_nw >= 0 && iy_nw <= H - 1 && ix_nw >= 0 && ix_nw <= W - 1) {
+                    int offset = base_idx + iy_nw * h_stride + ix_nw * w_stride;
+                    atomic_fetch_add_explicit(&x_grad[offset], nw * cot, memory_order_relaxed);
+
+                    T I_nw = x[offset];
+                    gix -= I_nw * (iy_se - iy) * cot;
+                    giy -= I_nw * (ix_se - ix) * cot;
+                }
+                if (iy_ne >= 0 && iy_ne <= H - 1 && ix_ne >= 0 && ix_ne <= W - 1) {
+                    int offset = base_idx + iy_ne * h_stride + ix_ne * w_stride;
+                    atomic_fetch_add_explicit(&x_grad[offset], ne * cot, memory_order_relaxed);
+
+                    T I_ne = x[offset];
+                    gix += I_ne * (iy_sw - iy) * cot;
+                    giy -= I_ne * (ix - ix_sw) * cot;
+                }
+                if (iy_sw >= 0 && iy_sw <= H - 1 && ix_sw >= 0 && ix_sw <= W - 1) {
+                    int offset = base_idx + iy_sw * h_stride + ix_sw * w_stride;
+                    atomic_fetch_add_explicit(&x_grad[offset], sw * cot, memory_order_relaxed);
+
+                    T I_sw = x[offset];
+                    gix -= I_sw * (iy - iy_ne) * cot;
+                    giy += I_sw * (ix_ne - ix) * cot;
+                }
+                if (iy_se >= 0 && iy_se <= H - 1 && ix_se >= 0 && ix_se <= W - 1) {
+                    int offset = base_idx + iy_se * h_stride + ix_se * w_stride;
+                    atomic_fetch_add_explicit(&x_grad[offset], se * cot, memory_order_relaxed);
+
+                    T I_se = x[offset];
+                    gix += I_se * (iy - iy_nw) * cot;
+                    giy += I_se * (ix - ix_nw) * cot;
+                }
+            }
+
+            T gix_mult = W / 2;
+            T giy_mult = H / 2;
+
+            // Reduce across each simdgroup first.
+            // This is much faster than relying purely on atomics.
+            gix = simd_sum(gix);
+            giy = simd_sum(giy);
+
+            if (thread_index_in_simdgroup == 0) {
+                atomic_fetch_add_explicit(&grid_grad[grid_idx], gix * gix_mult, memory_order_relaxed);
+                atomic_fetch_add_explicit(&grid_grad[grid_idx + 1], giy * giy_mult, memory_order_relaxed);
+            }
+        """
+        kernel = mx.fast.metal_kernel(
+            name="grid_sample_grad",
+            input_names=["x", "grid", "cotangent"],
+            output_names=["x_grad", "grid_grad"],
+            source=source,
+            atomic_outputs=True,
+        )
+        # pad the output channels to simd group size
+        # so that our `simd_sum`s don't overlap.
+        simdgroup_size = 32
+        C_padded = (C + simdgroup_size - 1) // simdgroup_size * simdgroup_size
+        grid_size = B * gN * gM * C_padded
+        outputs = kernel(
+            inputs=[x, grid, cotangent],
+            template=[("T", x.dtype)],
+            output_shapes=[x.shape, grid.shape],
+            output_dtypes=[x.dtype, x.dtype],
+            grid=(grid_size, 1, 1),
+            threadgroup=(256, 1, 1),
+            init_value=0,
+        )
+        return outputs[0], outputs[1]
+
+There's an even larger speed up for the vjp:
+
+``676.4ms -> 16.7ms => 40x speed up``
--- a/docs/src/dev/extensions.rst
+++ b/docs/src/dev/extensions.rst
@@ -486,9 +486,8 @@ below.
        std::ostringstream kname;
        kname << "axpby_" << "general_" << type_to_name(out);

-        // Make sure the metal library is available and look for it
-        // in the same folder as this executable if needed
-        d.register_library("mlx_ext", metal::get_colocated_mtllib_path);
+        // Make sure the metal library is available
+        d.register_library("mlx_ext");

        // Make a kernel from this metal library
        auto kernel = d.get_kernel(kname.str(), "mlx_ext");
--- a/docs/src/examples/llama-inference.rst
+++ b/docs/src/examples/llama-inference.rst
@@ -15,7 +15,7 @@ module to concisely define the model architecture.
 Attention layer
 ^^^^^^^^^^^^^^^^

-We will start with the llama attention layer which notably uses the RoPE
+We will start with the Llama attention layer which notably uses the RoPE
 positional encoding. [1]_ In addition, our attention layer will optionally use a
 key/value cache that will be concatenated with the provided keys and values to
 support efficient inference.
--- a/docs/src/examples/mlp.rst
+++ b/docs/src/examples/mlp.rst
@@ -64,7 +64,7 @@ set:
 Next, setup the problem parameters and load the data. To load the data, you need our
 `mnist data loader
 <https://github.com/ml-explore/mlx-examples/blob/main/mnist/mnist.py>`_, which
-we will import as `mnist`.
+we will import as ``mnist``.

 .. code-block:: python

--- a/docs/src/index.rst
+++ b/docs/src/index.rst
@@ -85,3 +85,4 @@ are the CPU and GPU.

   dev/extensions
   dev/metal_debugger
+   dev/custom_metal_kernels
--- a/docs/src/install.rst
+++ b/docs/src/install.rst
@@ -70,36 +70,36 @@ To build and install the MLX python library from source, first, clone MLX from

   git clone git@github.com:ml-explore/mlx.git mlx && cd mlx

-Install `nanobind <https://nanobind.readthedocs.io/en/latest/>`_ with:
-
-.. code-block:: shell
-
-    pip install git+https://github.com/wjakob/nanobind.git@2f04eac452a6d9142dedb957701bdb20125561e4
-
 Then simply build and install MLX using pip:

 .. code-block:: shell

-   env CMAKE_BUILD_PARALLEL_LEVEL="" pip install .
+  CMAKE_BUILD_PARALLEL_LEVEL=8 pip install .

-For developing use an editable install:
+For developing, install the package with development dependencies, and use an
+editable install:

 .. code-block:: shell

-  env CMAKE_BUILD_PARALLEL_LEVEL="" pip install -e .
+  CMAKE_BUILD_PARALLEL_LEVEL=8 pip install -e ".[dev]"

-To make sure the install is working run the tests with:
+Once the development dependencies are installed, you can build faster with:
+
+.. code-block:: shell
+
+ CMAKE_BUILD_PARALLEL_LEVEL=8 python setup.py build_ext --inplace
+
+Run the tests with:

 .. code-block:: shell

-  pip install ".[testing]"
  python -m unittest discover python/tests

-Optional: Install stubs to enable auto completions and type checking from your IDE:
+Optional: Install stubs to enable auto completions and type checking from your
+IDE:

 .. code-block:: shell

-  pip install ".[dev]"
  python setup.py generate_stubs

 C++ API
--- a/docs/src/python/array.rst
+++ b/docs/src/python/array.rst
@@ -53,8 +53,9 @@ Array
    array.sqrt
    array.square
    array.squeeze
-    array.swapaxes
+    array.std
    array.sum
+    array.swapaxes
    array.transpose
    array.T
    array.var
--- a/docs/src/python/distributed.rst
+++ b/docs/src/python/distributed.rst
@@ -17,3 +17,6 @@ made available.
    init
    all_sum
    all_gather
+    send
+    recv
+    recv_like
--- a/docs/src/python/fast.rst
+++ b/docs/src/python/fast.rst
@@ -12,3 +12,5 @@ Fast
  layer_norm
  rope
  scaled_dot_product_attention
+  affine_quantize
+  metal_kernel
--- a/docs/src/python/linalg.rst
+++ b/docs/src/python/linalg.rst
@@ -9,7 +9,10 @@ Linear Algebra
   :toctree: _autosummary 

    inv
+    tri_inv
    norm
    cholesky
+    cholesky_inv
+    cross
    qr
    svd
--- a/docs/src/python/nn/functions.rst
+++ b/docs/src/python/nn/functions.rst
@@ -13,6 +13,7 @@ simple functions.
   :template: nn-module-template.rst

   elu
+   celu
   gelu
   gelu_approx
   gelu_fast_approx
--- a/docs/src/python/nn/layers.rst
+++ b/docs/src/python/nn/layers.rst
@@ -13,13 +13,18 @@ Layers
   AvgPool1d
   AvgPool2d
   BatchNorm
+   CELU
   Conv1d
   Conv2d
   Conv3d
+   ConvTranspose1d
+   ConvTranspose2d
+   ConvTranspose3d
   Dropout
   Dropout2d
   Dropout3d
   Embedding
+   ELU
   GELU
   GLU
   GroupNorm
@@ -31,6 +36,8 @@ Layers
   LayerNorm
   LeakyReLU
   Linear
+   LogSigmoid
+   LogSoftmax
   LSTM
   MaxPool1d
   MaxPool2d
@@ -46,6 +53,7 @@ Layers
   RoPE
   SELU
   Sequential
+   Sigmoid
   SiLU
   SinusoidalPositionalEncoding
   Softmin
--- a/docs/src/python/ops.rst
+++ b/docs/src/python/ops.rst
@@ -44,6 +44,10 @@ Operations
   convolve
   conv1d
   conv2d
+   conv3d
+   conv_transpose1d
+   conv_transpose2d
+   conv_transpose3d
   conv_general
   cos
   cosh
@@ -77,6 +81,7 @@ Operations
   hadamard_transform
   identity
   inner
+   isfinite
   isclose
   isinf
   isnan
@@ -116,6 +121,7 @@ Operations
   pad
   power
   prod
+   put_along_axis
   quantize
   quantized_matmul
   radians
@@ -124,6 +130,7 @@ Operations
   repeat
   reshape
   right_shift
+   roll
   round
   rsqrt
   save
--- a/docs/src/python/random.rst
+++ b/docs/src/python/random.rst
@@ -45,3 +45,4 @@ we use a splittable version of Threefry, which is a counter-based PRNG.
   truncated_normal
   uniform
   laplace
+   permutation
--- a/examples/extensions/CMakeLists.txt
+++ b/examples/extensions/CMakeLists.txt
@@ -11,10 +11,14 @@ option(BUILD_SHARED_LIBS "Build extensions as a shared library" ON)

 # ----------------------------- Dependencies -----------------------------
 find_package(MLX CONFIG REQUIRED)
-find_package(Python 3.8 COMPONENTS Interpreter Development.Module REQUIRED)
+find_package(
+  Python 3.8
+  COMPONENTS Interpreter Development.Module
+  REQUIRED)
 execute_process(
  COMMAND "${Python_EXECUTABLE}" -m nanobind --cmake_dir
-  OUTPUT_STRIP_TRAILING_WHITESPACE OUTPUT_VARIABLE NB_DIR)
+  OUTPUT_STRIP_TRAILING_WHITESPACE
+  OUTPUT_VARIABLE NB_DIR)
 list(APPEND CMAKE_PREFIX_PATH "${NB_DIR}")
 find_package(nanobind CONFIG REQUIRED)

@@ -24,16 +28,10 @@ find_package(nanobind CONFIG REQUIRED)
 add_library(mlx_ext)

 # Add sources
-target_sources(
-  mlx_ext
-  PUBLIC
-  ${CMAKE_CURRENT_LIST_DIR}/axpby/axpby.cpp
-)
+target_sources(mlx_ext PUBLIC ${CMAKE_CURRENT_LIST_DIR}/axpby/axpby.cpp)

 # Add include headers
-target_include_directories(
-  mlx_ext PUBLIC ${CMAKE_CURRENT_LIST_DIR}
-)
+target_include_directories(mlx_ext PUBLIC ${CMAKE_CURRENT_LIST_DIR})

 # Link to mlx
 target_link_libraries(mlx_ext PUBLIC mlx)
@@ -43,27 +41,32 @@ target_link_libraries(mlx_ext PUBLIC mlx)
 # Build metallib
 if(MLX_BUILD_METAL)
  mlx_build_metallib(
-    TARGET mlx_ext_metallib
-    TITLE mlx_ext
-    SOURCES ${CMAKE_CURRENT_LIST_DIR}/axpby/axpby.metal
-    INCLUDE_DIRS ${PROJECT_SOURCE_DIR} ${MLX_INCLUDE_DIRS}
-    OUTPUT_DIRECTORY ${CMAKE_LIBRARY_OUTPUT_DIRECTORY}
-  )
-
-  add_dependencies(
-    mlx_ext
+    TARGET
    mlx_ext_metallib
-  )
+    TITLE
+    mlx_ext
+    SOURCES
+    ${CMAKE_CURRENT_LIST_DIR}/axpby/axpby.metal
+    INCLUDE_DIRS
+    ${PROJECT_SOURCE_DIR}
+    ${MLX_INCLUDE_DIRS}
+    OUTPUT_DIRECTORY
+    ${CMAKE_LIBRARY_OUTPUT_DIRECTORY})
+
+  add_dependencies(mlx_ext mlx_ext_metallib)

 endif()

 # ----------------------------- Python Bindings -----------------------------
 nanobind_add_module(
  _ext
-  NB_STATIC STABLE_ABI LTO NOMINSIZE
-  NB_DOMAIN mlx 
-  ${CMAKE_CURRENT_LIST_DIR}/bindings.cpp
-)
+  NB_STATIC
+  STABLE_ABI
+  LTO
+  NOMINSIZE
+  NB_DOMAIN
+  mlx
+  ${CMAKE_CURRENT_LIST_DIR}/bindings.cpp)
 target_link_libraries(_ext PRIVATE mlx_ext)

 if(BUILD_SHARED_LIBS)
--- a/examples/extensions/axpby/axpby.cpp
+++ b/examples/extensions/axpby/axpby.cpp
@@ -249,9 +249,8 @@ void Axpby::eval_gpu(
  kname << (contiguous_kernel ? "contiguous_" : "general_");
  kname << type_to_name(out);

-  // Make sure the metal library is available and look for it
-  // in the same folder as this executable if needed
-  d.register_library("mlx_ext", metal::get_colocated_mtllib_path);
+  // Make sure the metal library is available
+  d.register_library("mlx_ext");

  // Make a kernel from this metal library
  auto kernel = d.get_kernel(kname.str(), "mlx_ext");
--- a/examples/extensions/pyproject.toml
+++ b/examples/extensions/pyproject.toml
@@ -2,7 +2,7 @@
 requires = [
  "setuptools>=42",
  "cmake>=3.24",
-  "mlx>=0.9.0",
-  "nanobind@git+https://github.com/wjakob/nanobind.git@2f04eac452a6d9142dedb957701bdb20125561e4",
+  "mlx>=0.18.0",
+  "nanobind==2.2.0",
 ]
 build-backend = "setuptools.build_meta"
--- a/examples/extensions/requirements.txt
+++ b/examples/extensions/requirements.txt
@@ -1,4 +1,4 @@
 setuptools>=42
 cmake>=3.24
-mlx>=0.9.0
-nanobind@git+https://github.com/wjakob/nanobind.git@2f04eac452a6d9142dedb957701bdb20125561e4
+mlx>=0.18.1
+nanobind==2.2.0
--- a/examples/extensions/setup.py
+++ b/examples/extensions/setup.py
@@ -13,7 +13,6 @@ if __name__ == "__main__":
        cmdclass={"build_ext": extension.CMakeBuild},
        packages=["mlx_sample_extensions"],
        package_data={"mlx_sample_extensions": ["*.so", "*.dylib", "*.metallib"]},
-        extras_require={"dev": []},
        zip_safe=False,
        python_requires=">=3.8",
    )
--- a/mlx/CMakeLists.txt
+++ b/mlx/CMakeLists.txt
@@ -1,26 +1,24 @@
 target_sources(
  mlx
-  PRIVATE
-  ${CMAKE_CURRENT_SOURCE_DIR}/allocator.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/array.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/compile.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/device.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/dtype.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/einsum.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/fast.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/fft.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/ops.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/graph_utils.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/primitives.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/random.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/scheduler.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/transforms.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/utils.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/linalg.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/backend/metal/metal.h
-)
+  PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/allocator.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/array.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/compile.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/device.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/dtype.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/einsum.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/fast.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/fft.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/ops.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/graph_utils.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/primitives.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/random.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/scheduler.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/transforms.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/utils.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/linalg.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/backend/metal/metal.h)

-if (MLX_BUILD_CPU)
+if(MLX_BUILD_CPU)
  add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/backend/common)
 else()
  add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/backend/no_cpu)
@@ -28,17 +26,15 @@ endif()

 add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/distributed)
 add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/io)
-if (MLX_BUILD_ACCELERATE)
+if(MLX_BUILD_ACCELERATE)
  add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/backend/accelerate)
 elseif(MLX_BUILD_CPU)
  target_sources(
    mlx
-    PRIVATE
-    ${CMAKE_CURRENT_SOURCE_DIR}/backend/common/default_primitives.cpp
-  )
+    PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/backend/common/default_primitives.cpp)
 endif()

-if (MLX_BUILD_METAL)
+if(MLX_BUILD_METAL)
  add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/backend/metal)
 else()
  add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/backend/no_metal)
--- a/mlx/allocator.cpp
+++ b/mlx/allocator.cpp
@@ -23,11 +23,22 @@ void free(Buffer buffer) {
 }

 Buffer CommonAllocator::malloc(size_t size, bool) {
-  return Buffer{std::malloc(size)};
+  void* ptr = std::malloc(size + sizeof(size_t));
+  if (ptr != nullptr) {
+    *static_cast<size_t*>(ptr) = size;
+  }
+  return Buffer{ptr};
 }

 void CommonAllocator::free(Buffer buffer) {
-  std::free(buffer.raw_ptr());
+  std::free(buffer.ptr());
+}
+
+size_t CommonAllocator::size(Buffer buffer) const {
+  if (buffer.ptr() == nullptr) {
+    return 0;
+  }
+  return *static_cast<size_t*>(buffer.ptr());
 }

 Buffer malloc_or_wait(size_t size) {
--- a/mlx/allocator.h
+++ b/mlx/allocator.h
@@ -41,6 +41,7 @@ class Allocator {
 public:
  virtual Buffer malloc(size_t size, bool allow_swap = false) = 0;
  virtual void free(Buffer buffer) = 0;
+  virtual size_t size(Buffer buffer) const = 0;

  Allocator() = default;
  Allocator(const Allocator& other) = delete;
@@ -57,6 +58,7 @@ class CommonAllocator : public Allocator {
 public:
  virtual Buffer malloc(size_t size, bool allow_swap = false) override;
  virtual void free(Buffer buffer) override;
+  virtual size_t size(Buffer buffer) const override;

 private:
  CommonAllocator() = default;
--- a/mlx/array.cpp
+++ b/mlx/array.cpp
@@ -95,13 +95,29 @@ void array::detach() {
  array_desc_->primitive = nullptr;
 }

-void array::eval() {
-  // Ensure the array is ready to be read
-  if (status() == Status::scheduled) {
+bool array::is_available() const {
+  if (status() == Status::available) {
+    return true;
+  } else if (status() == Status::evaluated && event().is_signaled()) {
+    set_status(Status::available);
+    return true;
+  }
+  return false;
+}
+
+void array::wait() {
+  if (!is_available()) {
    event().wait();
    set_status(Status::available);
-  } else if (status() == Status::unscheduled) {
+  }
+}
+
+void array::eval() {
+  // Ensure the array is ready to be read
+  if (status() == Status::unscheduled) {
    mlx::core::eval({*this});
+  } else {
+    wait();
  }
 }

@@ -242,25 +258,35 @@ array::ArrayDesc::~ArrayDesc() {
  // This calls recursively the destructor and can result in stack overflow, we
  // instead put them in a vector and destroy them one at a time resulting in a
  // max stack depth of 2.
+  if (inputs.empty()) {
+    return;
+  }
+
  std::vector<std::shared_ptr<ArrayDesc>> for_deletion;

-  for (array& a : inputs) {
-    if (a.array_desc_.use_count() == 1) {
-      for_deletion.push_back(std::move(a.array_desc_));
+  auto append_deletable_inputs = [&for_deletion](ArrayDesc& ad) {
+    std::unordered_map<std::uintptr_t, array> input_map;
+    for (array& a : ad.inputs) {
+      if (a.array_desc_) {
+        input_map.insert({a.id(), a});
+      }
    }
-  }
+    ad.inputs.clear();
+    for (auto& [_, a] : input_map) {
+      if (a.array_desc_.use_count() <= a.siblings().size() + 1) {
+        for_deletion.push_back(std::move(a.array_desc_));
+      }
+    }
+  };
+
+  append_deletable_inputs(*this);

  while (!for_deletion.empty()) {
    // top is going to be deleted at the end of the block *after* the arrays
    // with inputs have been moved into the vector
    auto top = std::move(for_deletion.back());
    for_deletion.pop_back();
-
-    for (array& a : top->inputs) {
-      if (a.array_desc_.use_count() == 1) {
-        for_deletion.push_back(std::move(a.array_desc_));
-      }
-    }
+    append_deletable_inputs(*top);
  }
 }

--- a/mlx/array.h
+++ b/mlx/array.h
@@ -219,11 +219,23 @@ class array {
  };

  struct Flags {
-    // True if there are no gaps in the underlying data. Each item
+    // True iff there are no gaps in the underlying data. Each item
    // in the underlying data buffer belongs to at least one index.
+    //
+    // True iff:
+    // prod(shape[i] for i in range(ndim) if strides[i] > 0) == data_size()
    bool contiguous : 1;

+    // True iff:
+    // strides[-1] == 1 and
+    // all(strides[i] == (shape[i+1]*strides[i+1]) or shape[i] == 1 for i in
+    // range(ndim - 1))
    bool row_contiguous : 1;
+
+    // True iff:
+    // strides[0] == 1 and
+    // all(strides[i] == (shape[i-1]*strides[i-1]) or shape[i] == 1 for i in
+    // range(1, ndim))
    bool col_contiguous : 1;
  };

@@ -291,7 +303,16 @@ class array {
    return array_desc_->flags;
  }

-  /** The size (in elements) of the underlying buffer the array points to. */
+  /** The size (in elements) of the underlying buffer the array points to.
+   *
+   * This can be different than the actual size of the array if the array has
+   * been broadcast or irregularly strided.  If ``first`` is the offset into
+   * the data buffer of the first element of the array (i.e. the offset
+   * corresponding to ``arr[0, 0, ...]``) and last is the offset into the
+   * data buffer of the last element of the array (i.e. the offset
+   * corresponding to ``arr[-1, -1, ...]``) then ``data_size = last - first``.
+   * Note, ``data_size`` is in units of ``item_size`` (not bytes).
+   **/
  size_t data_size() const {
    return array_desc_->data_size;
  }
@@ -303,6 +324,10 @@ class array {
    return array_desc_->data->buffer;
  }

+  size_t buffer_size() const {
+    return allocator::allocator().size(buffer());
+  }
+
  // Return a copy of the shared pointer
  // to the array::Data struct
  std::shared_ptr<Data> data_shared_ptr() const {
@@ -319,11 +344,33 @@ class array {
    return static_cast<T*>(array_desc_->data_ptr);
  }

-  enum Status { unscheduled, scheduled, available };
+  enum Status {
+    // The ouptut of a computation which has not been scheduled.
+    // For example, the status of `x` in `auto x = a + b`.
+    unscheduled,

-  bool is_available() const {
-    return status() == Status::available;
-  }
+    // The ouptut of a computation which has been scheduled but `eval_*` has
+    // not yet been called on the array's primitive. A possible
+    // status of `x` in `auto x = a + b; eval(x);`
+    scheduled,
+
+    // The array's `eval_*` function has been run, but the computation is not
+    // necessarily complete. The array will have memory allocated and if it is
+    // not a tracer then it will be detached from the graph.
+    evaluated,
+
+    // If the array is the output of a computation then the computation
+    // is complete. Constant arrays are always available (e.g. `array({1, 2,
+    // 3})`)
+    available
+  };
+
+  // Check if the array is safe to read.
+  bool is_available() const;
+
+  // Wait on the array to be available. After this `is_available` returns
+  // `true`.
+  void wait();

  Status status() const {
    return array_desc_->status;
@@ -412,8 +459,6 @@ class array {
    void* data_ptr{nullptr};

    // The size in elements of the data buffer the array accesses
-    // This can be different than the actual size of the array if it
-    // has been broadcast or irregularly strided.
    size_t data_size;

    // Contains useful meta data about the array
--- a/mlx/backend/accelerate/CMakeLists.txt
+++ b/mlx/backend/accelerate/CMakeLists.txt
@@ -1,10 +1,8 @@
 target_sources(
  mlx
-  PRIVATE
-  ${CMAKE_CURRENT_SOURCE_DIR}/conv.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/matmul.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/primitives.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/quantized.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/reduce.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/softmax.cpp
-)
+  PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/conv.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/matmul.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/primitives.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/quantized.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/reduce.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/softmax.cpp)
--- a/mlx/backend/accelerate/softmax.cpp
+++ b/mlx/backend/accelerate/softmax.cpp
@@ -33,8 +33,8 @@ namespace {
 * Note: The implementation below is a general fast exp. There could be faster
 *       implementations for numbers strictly < 0.
 */
-inline simd_float16 simd_fast_exp(simd_float16 x) {
-  x *= 1.442695; // multiply with log_2(e)
+inline simd_float16 simd_fast_exp(simd_float16 x_init) {
+  auto x = x_init * 1.442695; // multiply with log_2(e)
  simd_float16 ipart, fpart;
  simd_int16 epart;
  x = simd_clamp(x, -80, 80);
@@ -53,7 +53,9 @@ inline simd_float16 simd_fast_exp(simd_float16 x) {
  // bitshifting
  epart = (simd_int(ipart) + 127) << 23;

-  return (*(simd_float16*)&epart) * x;
+  // Avoid supressing NaNs
+  simd_int16 eq = (x_init == x_init);
+  return simd_bitselect(x_init, (*(simd_float16*)&epart) * x, eq);
 }

 #if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC
@@ -70,7 +72,6 @@ inline float16x8_t neon_fast_exp(float16x8_t x) {

  x = vdupq_n_f16(float16_t(1.535336188319500e-4f));
  x = vfmaq_f16(vdupq_n_f16(float16_t(1.339887440266574e-3f)), x, fpart);
-  x = vfmaq_f16(vdupq_n_f16(float16_t(1.339887440266574e-3f)), x, fpart);
  x = vfmaq_f16(vdupq_n_f16(float16_t(9.618437357674640e-3f)), x, fpart);
  x = vfmaq_f16(vdupq_n_f16(float16_t(5.550332471162809e-2f)), x, fpart);
  x = vfmaq_f16(vdupq_n_f16(float16_t(2.402264791363012e-1f)), x, fpart);
--- a/mlx/backend/common/CMakeLists.txt
+++ b/mlx/backend/common/CMakeLists.txt
@@ -1,5 +1,4 @@
-
-if (${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
+if(${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
  set(COMPILER ${CMAKE_C_COMPILER})
  set(CLANG TRUE)
 else()
@@ -7,72 +6,56 @@ else()
 endif()

 add_custom_command(
-    OUTPUT  compiled_preamble.cpp
-    COMMAND /bin/bash
-              ${CMAKE_CURRENT_SOURCE_DIR}/make_compiled_preamble.sh
-              ${CMAKE_CURRENT_BINARY_DIR}/compiled_preamble.cpp
-              ${COMPILER}
-              ${PROJECT_SOURCE_DIR}
-              ${CLANG}
+  OUTPUT compiled_preamble.cpp
+  COMMAND
+    /bin/bash ${CMAKE_CURRENT_SOURCE_DIR}/make_compiled_preamble.sh
+    ${CMAKE_CURRENT_BINARY_DIR}/compiled_preamble.cpp ${COMPILER}
+    ${PROJECT_SOURCE_DIR} ${CLANG}
+  DEPENDS make_compiled_preamble.sh
+          compiled_preamble.h
+          ${PROJECT_SOURCE_DIR}/mlx/types/half_types.h
+          ${PROJECT_SOURCE_DIR}/mlx/types/fp16.h
+          ${PROJECT_SOURCE_DIR}/mlx/types/bf16.h
+          ${PROJECT_SOURCE_DIR}/mlx/types/complex.h
+          ops.h)

-    DEPENDS make_compiled_preamble.sh
-            compiled_preamble.h
-            ${PROJECT_SOURCE_DIR}/mlx/types/half_types.h
-            ${PROJECT_SOURCE_DIR}/mlx/types/fp16.h
-            ${PROJECT_SOURCE_DIR}/mlx/types/bf16.h
-            ${PROJECT_SOURCE_DIR}/mlx/types/complex.h
-            ops.h
-)
-
-add_custom_target(
-  cpu_compiled_preamble
-  DEPENDS compiled_preamble.cpp
-)
+add_custom_target(cpu_compiled_preamble DEPENDS compiled_preamble.cpp)

 add_dependencies(mlx cpu_compiled_preamble)

 target_sources(
  mlx
-  PRIVATE
-  ${CMAKE_CURRENT_SOURCE_DIR}/arg_reduce.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/binary.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/compiled.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/common.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/conv.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/copy.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/erf.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/fft.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/hadamard.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/masked_mm.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/primitives.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/quantized.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/reduce.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/reduce_utils.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/scan.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/select.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/slicing.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/softmax.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/sort.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/threefry.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/indexing.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/load.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/qrf.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/svd.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/inverse.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/cholesky.cpp
-  ${CMAKE_CURRENT_BINARY_DIR}/compiled_preamble.cpp
-)
+  PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/arg_reduce.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/binary.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/compiled.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/common.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/conv.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/copy.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/erf.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/fft.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/hadamard.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/masked_mm.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/primitives.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/quantized.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/reduce.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/reduce_utils.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/scan.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/select.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/slicing.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/softmax.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/sort.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/threefry.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/indexing.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/load.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/qrf.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/svd.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/inverse.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/cholesky.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/utils.cpp
+          ${CMAKE_CURRENT_BINARY_DIR}/compiled_preamble.cpp)

-if (IOS)
-  target_sources(
-    mlx
-    PRIVATE
-    ${CMAKE_CURRENT_SOURCE_DIR}/compiled_nocpu.cpp
-  )
+if(IOS)
+  target_sources(mlx PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/compiled_nocpu.cpp)
 else()
-  target_sources(
-    mlx
-    PRIVATE
-    ${CMAKE_CURRENT_SOURCE_DIR}/compiled_cpu.cpp
-  )
+  target_sources(mlx PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/compiled_cpu.cpp)
 endif()
--- a/mlx/backend/common/binary.h
+++ b/mlx/backend/common/binary.h
@@ -43,13 +43,15 @@ void set_binary_op_output_data(
    array& out,
    BinaryOpType bopt,
    bool donate_with_move = false) {
+  bool b_donatable = is_donatable(b, out);
+  bool a_donatable = is_donatable(a, out);
  switch (bopt) {
    case BinaryOpType::ScalarScalar:
      out.set_data(
          allocator::malloc_or_wait(out.itemsize()), 1, a.strides(), a.flags());
      break;
    case BinaryOpType::ScalarVector:
-      if (b.is_donatable() && b.itemsize() == out.itemsize()) {
+      if (b_donatable) {
        if (donate_with_move) {
          out.move_shared_buffer(b);
        } else {
@@ -64,7 +66,7 @@ void set_binary_op_output_data(
      }
      break;
    case BinaryOpType::VectorScalar:
-      if (a.is_donatable() && a.itemsize() == out.itemsize()) {
+      if (a_donatable) {
        if (donate_with_move) {
          out.move_shared_buffer(a);
        } else {
@@ -79,13 +81,13 @@ void set_binary_op_output_data(
      }
      break;
    case BinaryOpType::VectorVector:
-      if (a.is_donatable() && a.itemsize() == out.itemsize()) {
+      if (a_donatable) {
        if (donate_with_move) {
          out.move_shared_buffer(a);
        } else {
          out.copy_shared_buffer(a);
        }
-      } else if (b.is_donatable() && b.itemsize() == out.itemsize()) {
+      } else if (b_donatable) {
        if (donate_with_move) {
          out.move_shared_buffer(b);
        } else {
@@ -100,16 +102,14 @@ void set_binary_op_output_data(
      }
      break;
    case BinaryOpType::General:
-      if (a.is_donatable() && a.flags().row_contiguous &&
-          a.itemsize() == out.itemsize() && a.size() == out.size()) {
+      if (a_donatable && a.flags().row_contiguous && a.size() == out.size()) {
        if (donate_with_move) {
          out.move_shared_buffer(a);
        } else {
          out.copy_shared_buffer(a);
        }
      } else if (
-          b.is_donatable() && b.flags().row_contiguous &&
-          b.itemsize() == out.itemsize() && b.size() == out.size()) {
+          b_donatable && b.flags().row_contiguous && b.size() == out.size()) {
        if (donate_with_move) {
          out.move_shared_buffer(b);
        } else {
@@ -122,19 +122,7 @@ void set_binary_op_output_data(
  }
 }

-struct UseDefaultBinaryOp {
-  template <typename T, typename U>
-  void operator()(const T* a, const T* b, U* dst, int size) {
-    // Should we throw? This should normally never be called.
-    assert(false);
-  }
-
-  template <typename T, typename U>
-  void operator()(const T* a, const T* b, U* dst_a, U* dst_b, int size) {
-    // Should we throw? This should normally never be called.
-    assert(false);
-  }
-};
+struct UseDefaultBinaryOp {};

 template <typename T, typename U, typename Op>
 struct DefaultVectorScalar {
@@ -150,18 +138,6 @@ struct DefaultVectorScalar {
      a++;
    }
  }
-
-  void operator()(const T* a, const T* b, U* dst_a, U* dst_b, int size) {
-    T scalar = *b;
-    while (size-- > 0) {
-      auto dst = op(*a, scalar);
-      *dst_a = dst.first;
-      *dst_b = dst.second;
-      dst_a++;
-      dst_b++;
-      a++;
-    }
-  }
 };

 template <typename T, typename U, typename Op>
@@ -178,18 +154,6 @@ struct DefaultScalarVector {
      b++;
    }
  }
-
-  void operator()(const T* a, const T* b, U* dst_a, U* dst_b, int size) {
-    T scalar = *a;
-    while (size-- > 0) {
-      auto dst = op(scalar, *b);
-      *dst_a = dst.first;
-      *dst_b = dst.second;
-      dst_a++;
-      dst_b++;
-      b++;
-    }
-  }
 };

 template <typename T, typename U, typename Op>
@@ -206,204 +170,110 @@ struct DefaultVectorVector {
      b++;
    }
  }
-
-  void operator()(const T* a, const T* b, U* dst_a, U* dst_b, int size) {
-    while (size-- > 0) {
-      auto dst = op(*a, *b);
-      *dst_a = dst.first;
-      *dst_b = dst.second;
-      dst_a++;
-      dst_b++;
-      a++;
-      b++;
-    }
-  }
 };

-template <typename T, typename U, typename Op>
-void binary_op_dims1(const array& a, const array& b, array& out, Op op) {
-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst = out.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  for (size_t i = 0; i < out.size(); ++i) {
-    dst[i] = op(a_ptr[a_idx], b_ptr[b_idx]);
-    a_idx += a.strides()[0];
-    b_idx += b.strides()[0];
-  }
-}
-
-template <typename T, typename U, typename Op>
-void binary_op_dims1(
-    const array& a,
-    const array& b,
-    array& out,
+template <typename T, typename U, typename Op, int D, bool Strided>
+void binary_op_dims(
+    const T* a,
+    const T* b,
+    U* out,
    Op op,
-    int stride) {
-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst = out.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  for (size_t i = 0; i < a.shape()[0]; i++) {
-    op(a_ptr + a_idx, b_ptr + b_idx, dst, stride);
-    a_idx += a.strides()[0];
-    b_idx += b.strides()[0];
-    dst += stride;
-  }
-}
+    const std::vector<int>& shape,
+    const std::vector<size_t>& a_strides,
+    const std::vector<size_t>& b_strides,
+    const std::vector<size_t>& out_strides,
+    int axis) {
+  auto stride_a = a_strides[axis];
+  auto stride_b = b_strides[axis];
+  auto stride_out = out_strides[axis];
+  auto N = shape[axis];

-template <typename T, typename U, typename Op>
-void binary_op_dims2(const array& a, const array& b, array& out, Op op) {
-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst = out.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  size_t out_idx = 0;
-  for (size_t i = 0; i < a.shape()[0]; ++i) {
-    for (size_t j = 0; j < a.shape()[1]; ++j) {
-      dst[out_idx++] = op(a_ptr[a_idx], b_ptr[b_idx]);
-      a_idx += a.strides()[1];
-      b_idx += b.strides()[1];
-    }
-    a_idx += a.strides()[0] - a.strides()[1] * a.shape()[1];
-    b_idx += b.strides()[0] - b.strides()[1] * b.shape()[1];
-  }
-}
-
-template <typename T, typename U, typename Op>
-void binary_op_dims2(
-    const array& a,
-    const array& b,
-    array& out,
-    Op op,
-    int stride) {
-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst = out.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  for (size_t i = 0; i < a.shape()[0]; ++i) {
-    for (size_t j = 0; j < a.shape()[1]; ++j) {
-      op(a_ptr + a_idx, b_ptr + b_idx, dst, stride);
-      a_idx += a.strides()[1];
-      b_idx += b.strides()[1];
-      dst += stride;
-    }
-    a_idx += a.strides()[0] - a.strides()[1] * a.shape()[1];
-    b_idx += b.strides()[0] - b.strides()[1] * b.shape()[1];
-  }
-}
-
-template <typename T, typename U, typename Op>
-void binary_op_dims3(const array& a, const array& b, array& out, Op op) {
-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst = out.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  size_t out_idx = 0;
-  for (size_t i = 0; i < a.shape()[0]; ++i) {
-    for (size_t j = 0; j < a.shape()[1]; ++j) {
-      for (size_t k = 0; k < a.shape()[2]; ++k) {
-        dst[out_idx++] = op(a_ptr[a_idx], b_ptr[b_idx]);
-        a_idx += a.strides()[2];
-        b_idx += b.strides()[2];
+  for (int i = 0; i < N; i++) {
+    if constexpr (D > 1) {
+      binary_op_dims<T, U, Op, D - 1, Strided>(
+          a, b, out, op, shape, a_strides, b_strides, out_strides, axis + 1);
+    } else {
+      if constexpr (Strided) {
+        op(a, b, out, stride_out);
+      } else {
+        *out = op(*a, *b);
      }
-      a_idx += a.strides()[1] - a.strides()[2] * a.shape()[2];
-      b_idx += b.strides()[1] - b.strides()[2] * b.shape()[2];
    }
-    a_idx += a.strides()[0] - a.strides()[1] * a.shape()[1];
-    b_idx += b.strides()[0] - b.strides()[1] * b.shape()[1];
+    out += stride_out;
+    a += stride_a;
+    b += stride_b;
  }
 }

-template <typename T, typename U, typename Op>
-void binary_op_dims4(const array& a, const array& b, array& out, Op op) {
-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst = out.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  size_t out_idx = 0;
-  for (size_t i = 0; i < a.shape()[0]; ++i) {
-    for (size_t j = 0; j < a.shape()[1]; ++j) {
-      for (size_t k = 0; k < a.shape()[2]; ++k) {
-        for (size_t ii = 0; ii < a.shape()[3]; ++ii) {
-          dst[out_idx++] = op(a_ptr[a_idx], b_ptr[b_idx]);
-          a_idx += a.strides()[3];
-          b_idx += b.strides()[3];
-        }
-        a_idx += a.strides()[2] - a.strides()[3] * a.shape()[3];
-        b_idx += b.strides()[2] - b.strides()[3] * b.shape()[3];
-      }
-      a_idx += a.strides()[1] - a.strides()[2] * a.shape()[2];
-      b_idx += b.strides()[1] - b.strides()[2] * b.shape()[2];
-    }
-    a_idx += a.strides()[0] - a.strides()[1] * a.shape()[1];
-    b_idx += b.strides()[0] - b.strides()[1] * b.shape()[1];
-  }
-}
-
-template <typename T, typename U, typename Op>
-void binary_op_dispatch_dims(
-    const array& a,
-    const array& b,
-    array& out,
-    Op op) {
-  switch (out.ndim()) {
-    case 1:
-      binary_op_dims1<T, U, Op>(a, b, out, op);
-      return;
-    case 2:
-      binary_op_dims2<T, U, Op>(a, b, out, op);
-      return;
-    case 3:
-      binary_op_dims3<T, U, Op>(a, b, out, op);
-      return;
-    case 4:
-      binary_op_dims4<T, U, Op>(a, b, out, op);
-      return;
-  }
-
-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst = out.data<U>();
-  for (size_t i = 0; i < out.size(); i++) {
-    int a_idx = elem_to_loc(i, a.shape(), a.strides());
-    int b_idx = elem_to_loc(i, b.shape(), b.strides());
-    dst[i] = op(a_ptr[a_idx], b_ptr[b_idx]);
-  }
-}
-
-template <typename T, typename U, typename Op>
+template <typename T, typename U, bool Strided, typename Op>
 void binary_op_dispatch_dims(
    const array& a,
    const array& b,
    array& out,
    Op op,
    int dim,
-    int stride) {
-  // Number of dimensions to loop over for vectorized ops
+    const std::vector<int>& shape,
+    const std::vector<size_t>& a_strides,
+    const std::vector<size_t>& b_strides,
+    const std::vector<size_t>& out_strides) {
+  const T* a_ptr = a.data<T>();
+  const T* b_ptr = b.data<T>();
+  U* out_ptr = out.data<U>();
  switch (dim) {
    case 1:
-      binary_op_dims1<T, U, Op>(a, b, out, op, stride);
+      binary_op_dims<T, U, Op, 1, Strided>(
+          a_ptr,
+          b_ptr,
+          out_ptr,
+          op,
+          shape,
+          a_strides,
+          b_strides,
+          out_strides,
+          0);
      return;
    case 2:
-      binary_op_dims2<T, U, Op>(a, b, out, op, stride);
+      binary_op_dims<T, U, Op, 2, Strided>(
+          a_ptr,
+          b_ptr,
+          out_ptr,
+          op,
+          shape,
+          a_strides,
+          b_strides,
+          out_strides,
+          0);
+      return;
+    case 3:
+      binary_op_dims<T, U, Op, 3, Strided>(
+          a_ptr,
+          b_ptr,
+          out_ptr,
+          op,
+          shape,
+          a_strides,
+          b_strides,
+          out_strides,
+          0);
      return;
  }

-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst = out.data<U>();
-  for (size_t i = 0; i < out.size(); i += stride) {
-    int a_idx = elem_to_loc(i, a.shape(), a.strides());
-    int b_idx = elem_to_loc(i, b.shape(), b.strides());
-    op(a_ptr + a_idx, b_ptr + b_idx, dst, stride);
-    dst += stride;
+  ContiguousIterator<size_t> a_it(shape, a_strides, dim - 3);
+  ContiguousIterator<size_t> b_it(shape, b_strides, dim - 3);
+  size_t stride = out_strides[dim - 4];
+  for (size_t elem = 0; elem < a.size(); elem += stride) {
+    binary_op_dims<T, U, Op, 3, Strided>(
+        a_ptr + a_it.loc,
+        b_ptr + b_it.loc,
+        out_ptr + elem,
+        op,
+        shape,
+        a_strides,
+        b_strides,
+        out_strides,
+        dim - 3);
+    a_it.step();
+    b_it.step();
  }
 }

@@ -450,29 +320,33 @@ void binary_op(
  }

  // General computation so let's try to optimize
+  auto [new_shape, new_strides] = collapse_contiguous_dims(
+      a.shape(), {a.strides(), b.strides(), out.strides()});
+  const auto& a_strides = new_strides[0];
+  const auto& b_strides = new_strides[1];
+  const auto& strides = new_strides[2];

  // Get the left-most dim such that the array is row contiguous after
-  auto& strides = out.strides();
-  auto leftmost_rc_dim = [&strides](const array& arr) {
-    int d = arr.ndim() - 1;
-    for (; d >= 0 && arr.strides()[d] == strides[d]; d--) {
+  auto leftmost_rc_dim = [&strides](const std::vector<size_t>& arr_strides) {
+    int d = arr_strides.size() - 1;
+    for (; d >= 0 && arr_strides[d] == strides[d]; d--) {
    }
    return d + 1;
  };
-  auto a_rc_dim = leftmost_rc_dim(a);
-  auto b_rc_dim = leftmost_rc_dim(b);
+  auto a_rc_dim = leftmost_rc_dim(a_strides);
+  auto b_rc_dim = leftmost_rc_dim(b_strides);

  // Get the left-most dim such that the array is a broadcasted "scalar" after
-  auto leftmost_s_dim = [](const array& arr) {
-    int d = arr.ndim() - 1;
-    for (; d >= 0 && arr.strides()[d] == 0; d--) {
+  auto leftmost_s_dim = [](const std::vector<size_t>& arr_strides) {
+    int d = arr_strides.size() - 1;
+    for (; d >= 0 && arr_strides[d] == 0; d--) {
    }
    return d + 1;
  };
-  auto a_s_dim = leftmost_s_dim(a);
-  auto b_s_dim = leftmost_s_dim(b);
+  auto a_s_dim = leftmost_s_dim(a_strides);
+  auto b_s_dim = leftmost_s_dim(b_strides);

-  auto ndim = out.ndim();
+  auto ndim = new_shape.size();

  // Case 1: LxM and FxM where L and F are broadcastable and M is row contiguous
  int dim = ndim;
@@ -494,27 +368,27 @@ void binary_op(
  // Can be sure dim > 0 since otherwise we would have used one of the fully
  // contiguous methods above. Except for the case that the flags do not
  // correspond to the underlying contiguity.
-  size_t stride;
  if (dim == 0 || strides[dim - 1] < 16) {
-    stride = 1;
    bopt = BinaryOpType::General;
    dim = ndim;
-  } else {
-    stride = strides[dim - 1];
  }

  switch (bopt) {
    case BinaryOpType::VectorVector:
-      binary_op_dispatch_dims<T, U>(a, b, out, opvv, dim, stride);
+      binary_op_dispatch_dims<T, U, true>(
+          a, b, out, opvv, dim, new_shape, a_strides, b_strides, strides);
      break;
    case BinaryOpType::VectorScalar:
-      binary_op_dispatch_dims<T, U>(a, b, out, opvs, dim, stride);
+      binary_op_dispatch_dims<T, U, true>(
+          a, b, out, opvs, dim, new_shape, a_strides, b_strides, strides);
      break;
    case BinaryOpType::ScalarVector:
-      binary_op_dispatch_dims<T, U>(a, b, out, opsv, dim, stride);
+      binary_op_dispatch_dims<T, U, true>(
+          a, b, out, opsv, dim, new_shape, a_strides, b_strides, strides);
      break;
    default:
-      binary_op_dispatch_dims<T, U>(a, b, out, op);
+      binary_op_dispatch_dims<T, U, false>(
+          a, b, out, op, dim, new_shape, a_strides, b_strides, strides);
      break;
  }
 }
@@ -531,9 +405,9 @@ void binary_op(
  // TODO: The following mess of constexpr evaluations can probably be achieved
  //       with template specializations and overloading. Would it be simpler?

-  if (std::is_same<decltype(opsv), UseDefaultBinaryOp>::value) {
-    if (std::is_same<decltype(opvs), UseDefaultBinaryOp>::value) {
-      if (std::is_same<decltype(opvv), UseDefaultBinaryOp>::value) {
+  if constexpr (std::is_same<decltype(opsv), UseDefaultBinaryOp>::value) {
+    if constexpr (std::is_same<decltype(opvs), UseDefaultBinaryOp>::value) {
+      if constexpr (std::is_same<decltype(opvv), UseDefaultBinaryOp>::value) {
        // All ops are UseDefaultBinaryOp (why oh why would someone call that?)
        binary_op<T, T>(
            a,
@@ -554,7 +428,8 @@ void binary_op(
            DefaultVectorScalar<T, T, Op>(op),
            opvv);
      }
-    } else if (std::is_same<decltype(opvv), UseDefaultBinaryOp>::value) {
+    } else if constexpr (std::is_same<decltype(opvv), UseDefaultBinaryOp>::
+                             value) {
      // opsv and opvv were UseDefaultBinaryOp
      binary_op<T, T>(
          a,
@@ -569,7 +444,8 @@ void binary_op(
      binary_op<T, T>(
          a, b, out, op, DefaultScalarVector<T, T, Op>(op), opvs, opvv);
    }
-  } else if (std::is_same<decltype(opvs), UseDefaultBinaryOp>::value) {
+  } else if constexpr (std::is_same<decltype(opvs), UseDefaultBinaryOp>::
+                           value) {
    if (std::is_same<decltype(opvv), UseDefaultBinaryOp>::value) {
      // opvs and opvv were UseDefaultBinaryOp
      binary_op<T, T>(
@@ -585,7 +461,8 @@ void binary_op(
      binary_op<T, T>(
          a, b, out, op, opsv, DefaultVectorScalar<T, T, Op>(op), opvv);
    }
-  } else if (std::is_same<decltype(opvv), UseDefaultBinaryOp>::value) {
+  } else if constexpr (std::is_same<decltype(opvv), UseDefaultBinaryOp>::
+                           value) {
    // opvv was UseDefaultBinaryOp
    binary_op<T, T>(
        a, b, out, op, opsv, opvs, DefaultVectorVector<T, T, Op>(op));
--- a/mlx/backend/common/binary_two.h
+++ b/mlx/backend/common/binary_two.h
@@ -9,168 +9,43 @@ namespace mlx::core {

 namespace {

-template <typename T, typename U, typename Op>
-void binary_op_dims1(
-    const array& a,
-    const array& b,
-    array& out_a,
-    array& out_b,
-    Op op) {
-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst_a = out_a.data<U>();
-  U* dst_b = out_b.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  for (size_t i = 0; i < out_a.size(); ++i) {
-    auto dst = op(a_ptr[a_idx], b_ptr[b_idx]);
-    dst_a[i] = dst.first;
-    dst_b[i] = dst.second;
-    a_idx += a.strides()[0];
-    b_idx += b.strides()[0];
-  }
-}
-
-template <typename T, typename U, typename Op>
-void binary_op_dims1(
-    const array& a,
-    const array& b,
-    array& out_a,
-    array& out_b,
+template <typename T, typename U, typename Op, int D>
+void binary_op_dims(
+    const T* a,
+    const T* b,
+    U* out_a,
+    U* out_b,
    Op op,
-    int stride) {
-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst_a = out_a.data<U>();
-  U* dst_b = out_b.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  for (size_t i = 0; i < a.shape()[0]; i++) {
-    op(a_ptr + a_idx, b_ptr + b_idx, dst_a, dst_b, stride);
-    a_idx += a.strides()[0];
-    b_idx += b.strides()[0];
-    dst_a += stride;
-    dst_b += stride;
-  }
-}
+    const std::vector<int>& shape,
+    const std::vector<size_t>& a_strides,
+    const std::vector<size_t>& b_strides,
+    const std::vector<size_t>& out_strides,
+    int axis) {
+  auto stride_a = a_strides[axis];
+  auto stride_b = b_strides[axis];
+  auto stride_out = out_strides[axis];
+  auto N = shape[axis];

-template <typename T, typename U, typename Op>
-void binary_op_dims2(
-    const array& a,
-    const array& b,
-    array& out_a,
-    array& out_b,
-    Op op) {
-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst_a = out_a.data<U>();
-  U* dst_b = out_b.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  size_t out_idx = 0;
-  for (size_t i = 0; i < a.shape()[0]; ++i) {
-    for (size_t j = 0; j < a.shape()[1]; ++j) {
-      auto dst = op(a_ptr[a_idx], b_ptr[b_idx]);
-      dst_a[out_idx] = dst.first;
-      dst_b[out_idx++] = dst.second;
-      a_idx += a.strides()[1];
-      b_idx += b.strides()[1];
+  for (int i = 0; i < N; i++) {
+    if constexpr (D > 1) {
+      binary_op_dims<T, U, Op, D - 1>(
+          a,
+          b,
+          out_a,
+          out_b,
+          op,
+          shape,
+          a_strides,
+          b_strides,
+          out_strides,
+          axis + 1);
+    } else {
+      std::tie(*out_a, *out_b) = op(*a, *b);
    }
-    a_idx += a.strides()[0] - a.strides()[1] * a.shape()[1];
-    b_idx += b.strides()[0] - b.strides()[1] * b.shape()[1];
-  }
-}
-
-template <typename T, typename U, typename Op>
-void binary_op_dims2(
-    const array& a,
-    const array& b,
-    array& out_a,
-    array& out_b,
-    Op op,
-    int stride) {
-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst_a = out_a.data<U>();
-  U* dst_b = out_b.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  for (size_t i = 0; i < a.shape()[0]; ++i) {
-    for (size_t j = 0; j < a.shape()[1]; ++j) {
-      op(a_ptr + a_idx, b_ptr + b_idx, dst_a, dst_b, stride);
-      a_idx += a.strides()[1];
-      b_idx += b.strides()[1];
-      dst_a += stride;
-      dst_b += stride;
-    }
-    a_idx += a.strides()[0] - a.strides()[1] * a.shape()[1];
-    b_idx += b.strides()[0] - b.strides()[1] * b.shape()[1];
-  }
-}
-
-template <typename T, typename U, typename Op>
-void binary_op_dims3(
-    const array& a,
-    const array& b,
-    array& out_a,
-    array& out_b,
-    Op op) {
-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst_a = out_a.data<U>();
-  U* dst_b = out_b.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  size_t out_idx = 0;
-  for (size_t i = 0; i < a.shape()[0]; ++i) {
-    for (size_t j = 0; j < a.shape()[1]; ++j) {
-      for (size_t k = 0; k < a.shape()[2]; ++k) {
-        auto dst = op(a_ptr[a_idx], b_ptr[b_idx]);
-        dst_a[out_idx] = dst.first;
-        dst_b[out_idx++] = dst.second;
-        a_idx += a.strides()[2];
-        b_idx += b.strides()[2];
-      }
-      a_idx += a.strides()[1] - a.strides()[2] * a.shape()[2];
-      b_idx += b.strides()[1] - b.strides()[2] * b.shape()[2];
-    }
-    a_idx += a.strides()[0] - a.strides()[1] * a.shape()[1];
-    b_idx += b.strides()[0] - b.strides()[1] * b.shape()[1];
-  }
-}
-
-template <typename T, typename U, typename Op>
-void binary_op_dims4(
-    const array& a,
-    const array& b,
-    array& out_a,
-    array& out_b,
-    Op op) {
-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst_a = out_a.data<U>();
-  U* dst_b = out_b.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  size_t out_idx = 0;
-  for (size_t i = 0; i < a.shape()[0]; ++i) {
-    for (size_t j = 0; j < a.shape()[1]; ++j) {
-      for (size_t k = 0; k < a.shape()[2]; ++k) {
-        for (size_t ii = 0; ii < a.shape()[3]; ++ii) {
-          auto dst = op(a_ptr[a_idx], b_ptr[b_idx]);
-          dst_a[out_idx] = dst.first;
-          dst_b[out_idx++] = dst.second;
-          a_idx += a.strides()[3];
-          b_idx += b.strides()[3];
-        }
-        a_idx += a.strides()[2] - a.strides()[3] * a.shape()[3];
-        b_idx += b.strides()[2] - b.strides()[3] * b.shape()[3];
-      }
-      a_idx += a.strides()[1] - a.strides()[2] * a.shape()[2];
-      b_idx += b.strides()[1] - b.strides()[2] * b.shape()[2];
-    }
-    a_idx += a.strides()[0] - a.strides()[1] * a.shape()[1];
-    b_idx += b.strides()[0] - b.strides()[1] * b.shape()[1];
+    a += stride_a;
+    b += stride_b;
+    out_a += stride_out;
+    out_b += stride_out;
  }
 }

@@ -181,352 +56,160 @@ void binary_op_dispatch_dims(
    array& out_a,
    array& out_b,
    Op op) {
-  switch (out_a.ndim()) {
+  auto [shape, strides] = collapse_contiguous_dims(
+      a.shape(), {a.strides(), b.strides(), out_a.strides()});
+  const auto& a_strides = strides[0];
+  const auto& b_strides = strides[1];
+  const auto& out_strides = strides[2];
+  const T* a_ptr = a.data<T>();
+  const T* b_ptr = b.data<T>();
+  U* out_a_ptr = out_a.data<U>();
+  U* out_b_ptr = out_b.data<U>();
+
+  int ndim = shape.size();
+  switch (ndim) {
    case 1:
-      binary_op_dims1<T, U, Op>(a, b, out_a, out_b, op);
+      binary_op_dims<T, U, Op, 1>(
+          a_ptr,
+          b_ptr,
+          out_a_ptr,
+          out_b_ptr,
+          op,
+          shape,
+          a_strides,
+          b_strides,
+          out_strides,
+          0);
      return;
    case 2:
-      binary_op_dims2<T, U, Op>(a, b, out_a, out_b, op);
-      return;
-    case 3:
-      binary_op_dims3<T, U, Op>(a, b, out_a, out_b, op);
-      return;
-    case 4:
-      binary_op_dims4<T, U, Op>(a, b, out_a, out_b, op);
+      binary_op_dims<T, U, Op, 2>(
+          a_ptr,
+          b_ptr,
+          out_a_ptr,
+          out_b_ptr,
+          op,
+          shape,
+          a_strides,
+          b_strides,
+          out_strides,
+          0);
      return;
  }

-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst_a = out_a.data<U>();
-  U* dst_b = out_b.data<U>();
-  for (size_t i = 0; i < out_a.size(); i++) {
-    int a_idx = elem_to_loc(i, a.shape(), a.strides());
-    int b_idx = elem_to_loc(i, b.shape(), b.strides());
-    std::tie(dst_a[i], dst_b[i]) = op(a_ptr[a_idx], b_ptr[b_idx]);
+  ContiguousIterator<size_t> a_it(shape, a_strides, ndim - 2);
+  ContiguousIterator<size_t> b_it(shape, b_strides, ndim - 2);
+  size_t stride = out_strides[ndim - 3];
+  for (size_t elem = 0; elem < a.size(); elem += stride) {
+    binary_op_dims<T, U, Op, 2>(
+        a_ptr + a_it.loc,
+        b_ptr + b_it.loc,
+        out_a_ptr + elem,
+        out_b_ptr + elem,
+        op,
+        shape,
+        a_strides,
+        b_strides,
+        out_strides,
+        ndim - 2);
+    a_it.step();
+    b_it.step();
  }
 }

-template <typename T, typename U, typename Op>
-void binary_op_dispatch_dims(
-    const array& a,
-    const array& b,
-    array& out_a,
-    array& out_b,
-    Op op,
-    int dim,
-    int stride) {
-  // Number of dimensions to loop over for vectorized ops
-  switch (dim) {
-    case 1:
-      binary_op_dims1<T, U, Op>(a, b, out_a, out_b, op, stride);
-      return;
-    case 2:
-      binary_op_dims2<T, U, Op>(a, b, out_a, out_b, op, stride);
-      return;
-  }
-
-  const T* a_ptr = a.data<T>();
-  const T* b_ptr = b.data<T>();
-  U* dst_a = out_a.data<U>();
-  U* dst_b = out_b.data<U>();
-  for (size_t i = 0; i < out_a.size(); i += stride) {
-    int a_idx = elem_to_loc(i, a.shape(), a.strides());
-    int b_idx = elem_to_loc(i, b.shape(), b.strides());
-    op(a_ptr + a_idx, b_ptr + b_idx, dst_a, dst_b, stride);
-    dst_a += stride;
-    dst_b += stride;
-  }
-}
-
-template <
-    typename T,
-    typename U,
-    typename Op,
-    typename OpSV,
-    typename OpVS,
-    typename OpVV>
+template <typename T, typename U = T, typename Op>
 void binary_op(
    const array& a,
    const array& b,
-    array& out_a,
-    array& out_b,
-    Op op,
-    OpSV opsv,
-    OpVS opvs,
-    OpVV opvv) {
+    std::vector<array>& outputs,
+    Op op) {
  auto bopt = get_binary_op_type(a, b);
+  auto& out_a = outputs[0];
+  auto& out_b = outputs[1];
  set_binary_op_output_data(a, b, out_a, bopt);
  set_binary_op_output_data(a, b, out_b, bopt);

  // The full computation is scalar scalar so call the base op once
+  if (bopt == BinaryOpType::General) {
+    binary_op_dispatch_dims<T, U, Op>(a, b, out_a, out_b, op);
+    return;
+  }
+
+  auto a_ptr = a.data<T>();
+  auto b_ptr = b.data<T>();
+  auto out_a_ptr = out_a.data<U>();
+  auto out_b_ptr = out_b.data<U>();
  if (bopt == BinaryOpType::ScalarScalar) {
-    std::tie(*(out_a.data<U>()), *(out_b.data<U>())) =
-        op(*a.data<T>(), *b.data<T>());
-    return;
-  }
-
-  // The full computation is scalar vector so delegate to the op
-  if (bopt == BinaryOpType::ScalarVector) {
-    opsv(
-        a.data<T>(),
-        b.data<T>(),
-        out_a.data<U>(),
-        out_b.data<U>(),
-        b.data_size());
-    return;
-  }
-
-  // The full computation is vector scalar so delegate to the op
-  if (bopt == BinaryOpType::VectorScalar) {
-    opvs(
-        a.data<T>(),
-        b.data<T>(),
-        out_a.data<U>(),
-        out_b.data<U>(),
-        a.data_size());
-    return;
-  }
-
-  // The full computation is vector vector so delegate to the op
-  if (bopt == BinaryOpType::VectorVector) {
-    opvv(
-        a.data<T>(),
-        b.data<T>(),
-        out_a.data<U>(),
-        out_b.data<U>(),
-        out_a.size());
-    return;
-  }
-
-  // General computation so let's try to optimize
-
-  // Get the left-most dim such that the array is row contiguous after
-  auto& strides = out_a.strides();
-  auto leftmost_rc_dim = [&strides](const array& arr) {
-    int d = arr.ndim() - 1;
-    for (; d >= 0 && arr.strides()[d] == strides[d]; d--) {
+    std::tie(*out_a_ptr, *out_b_ptr) = op(*a_ptr, *b_ptr);
+  } else if (bopt == BinaryOpType::ScalarVector) {
+    for (size_t i = 0; i < b.size(); ++i) {
+      std::tie(*out_a_ptr, *out_b_ptr) = op(*a_ptr, *b_ptr);
+      out_a_ptr++;
+      out_b_ptr++;
+      b_ptr++;
    }
-    return d + 1;
-  };
-  auto a_rc_dim = leftmost_rc_dim(a);
-  auto b_rc_dim = leftmost_rc_dim(b);
-
-  // Get the left-most dim such that the array is a broadcasted "scalar" after
-  auto leftmost_s_dim = [](const array& arr) {
-    int d = arr.ndim() - 1;
-    for (; d >= 0 && arr.strides()[d] == 0; d--) {
+  } else if (bopt == BinaryOpType::VectorScalar) {
+    for (size_t i = 0; i < a.size(); ++i) {
+      std::tie(*out_a_ptr, *out_b_ptr) = op(*a_ptr, *b_ptr);
+      out_a_ptr++;
+      out_b_ptr++;
+      a_ptr++;
+    }
+  } else { // VectorVector
+    for (size_t i = 0; i < a.size(); ++i) {
+      std::tie(*out_a_ptr, *out_b_ptr) = op(*a_ptr, *b_ptr);
+      out_a_ptr++;
+      out_b_ptr++;
+      a_ptr++;
+      b_ptr++;
    }
-    return d + 1;
-  };
-  auto a_s_dim = leftmost_s_dim(a);
-  auto b_s_dim = leftmost_s_dim(b);
-
-  auto ndim = out_a.ndim();
-
-  // Case 1: LxM and FxM where L and F are broadcastable and M is row contiguous
-  int dim = ndim;
-  if (int d = std::max(a_rc_dim, b_rc_dim); d < ndim) {
-    bopt = BinaryOpType::VectorVector;
-    dim = d;
-    // Case 2: LxM and Fx1 where L and F are broadcastable and M is row
-    // contiguous
-  } else if (int d = std::max(a_rc_dim, b_s_dim); d < ndim) {
-    bopt = BinaryOpType::VectorScalar;
-    dim = d;
-    // Case 3: Lx1 and FxM where L and F are broadcastable and M is row
-    // contiguous
-  } else if (int d = std::max(a_s_dim, b_rc_dim); d < ndim) {
-    bopt = BinaryOpType::ScalarVector;
-    dim = d;
-  }
-
-  // Can be sure dim > 0 since otherwise we would have used one of the fully
-  // contiguous methods above. Except for the case that the flags do not
-  // correspond to the underlying contiguity.
-  size_t stride;
-  if (dim == 0 || strides[dim - 1] < 16) {
-    stride = 1;
-    bopt = BinaryOpType::General;
-    dim = ndim;
-  } else {
-    stride = strides[dim - 1];
-  }
-
-  switch (bopt) {
-    case BinaryOpType::VectorVector:
-      binary_op_dispatch_dims<T, U>(a, b, out_a, out_b, opvv, dim, stride);
-      break;
-    case BinaryOpType::VectorScalar:
-      binary_op_dispatch_dims<T, U>(a, b, out_a, out_b, opvs, dim, stride);
-      break;
-    case BinaryOpType::ScalarVector:
-      binary_op_dispatch_dims<T, U>(a, b, out_a, out_b, opsv, dim, stride);
-      break;
-    default:
-      binary_op_dispatch_dims<T, U>(a, b, out_a, out_b, op);
-      break;
  }
 }

-template <typename T, typename Op, typename OpSV, typename OpVS, typename OpVV>
-void binary_op(
-    const array& a,
-    const array& b,
-    std::vector<array>& outputs,
-    Op op,
-    OpSV opsv,
-    OpVS opvs,
-    OpVV opvv) {
-  // TODO: The following mess of constexpr evaluations can probably be achieved
-  //       with template specializations and overloading. Would it be simpler?
-
-  if (std::is_same<decltype(opsv), UseDefaultBinaryOp>::value) {
-    if (std::is_same<decltype(opvs), UseDefaultBinaryOp>::value) {
-      if (std::is_same<decltype(opvv), UseDefaultBinaryOp>::value) {
-        // All ops are UseDefaultBinaryOp (why oh why would someone call that?)
-        binary_op<T, T>(
-            a,
-            b,
-            outputs[0],
-            outputs[1],
-            op,
-            DefaultScalarVector<T, T, Op>(op),
-            DefaultVectorScalar<T, T, Op>(op),
-            DefaultVectorVector<T, T, Op>(op));
-      } else {
-        // opsv and opvs were UseDefaultBinaryOp
-        binary_op<T, T>(
-            a,
-            b,
-            outputs[0],
-            outputs[1],
-            op,
-            DefaultScalarVector<T, T, Op>(op),
-            DefaultVectorScalar<T, T, Op>(op),
-            opvv);
-      }
-    } else if (std::is_same<decltype(opvv), UseDefaultBinaryOp>::value) {
-      // opsv and opvv were UseDefaultBinaryOp
-      binary_op<T, T>(
-          a,
-          b,
-          outputs[0],
-          outputs[1],
-          op,
-          DefaultScalarVector<T, T, Op>(op),
-          opvs,
-          DefaultVectorVector<T, T, Op>(op));
-    } else {
-      // opsv was UseDefaultBinaryOp
-      binary_op<T, T>(
-          a,
-          b,
-          outputs[0],
-          outputs[1],
-          op,
-          DefaultScalarVector<T, T, Op>(op),
-          opvs,
-          opvv);
-    }
-  } else if (std::is_same<decltype(opvs), UseDefaultBinaryOp>::value) {
-    if (std::is_same<decltype(opvv), UseDefaultBinaryOp>::value) {
-      // opvs and opvv were UseDefaultBinaryOp
-      binary_op<T, T>(
-          a,
-          b,
-          outputs[0],
-          outputs[1],
-          op,
-          opsv,
-          DefaultVectorScalar<T, T, Op>(op),
-          DefaultVectorVector<T, T, Op>(op));
-    } else {
-      // opvs was UseDefaultBinaryOp
-      binary_op<T, T>(
-          a,
-          b,
-          outputs[0],
-          outputs[1],
-          op,
-          opsv,
-          DefaultVectorScalar<T, T, Op>(op),
-          opvv);
-    }
-  } else if (std::is_same<decltype(opvv), UseDefaultBinaryOp>::value) {
-    // opvv was UseDefaultBinaryOp
-    binary_op<T, T>(
-        a,
-        b,
-        outputs[0],
-        outputs[1],
-        op,
-        opsv,
-        opvs,
-        DefaultVectorVector<T, T, Op>(op));
-  } else {
-    // All ops provided
-    binary_op<T, T>(a, b, outputs[0], outputs[1], op, opsv, opvs, opvv);
-  }
-}
-
-template <typename T, typename Op>
-void binary_op(
-    const array& a,
-    const array& b,
-    std::vector<array>& outputs,
-    Op op) {
-  DefaultScalarVector<T, T, Op> opsv(op);
-  DefaultVectorScalar<T, T, Op> opvs(op);
-  DefaultVectorVector<T, T, Op> opvv(op);
-  binary_op<T, T>(a, b, outputs[0], outputs[1], op, opsv, opvs, opvv);
-}
-
-template <typename... Ops>
+template <typename Op>
 void binary(
    const array& a,
    const array& b,
    std::vector<array>& outputs,
-    Ops... ops) {
+    Op op) {
  switch (outputs[0].dtype()) {
    case bool_:
-      binary_op<bool>(a, b, outputs, ops...);
+      binary_op<bool>(a, b, outputs, op);
      break;
    case uint8:
-      binary_op<uint8_t>(a, b, outputs, ops...);
+      binary_op<uint8_t>(a, b, outputs, op);
      break;
    case uint16:
-      binary_op<uint16_t>(a, b, outputs, ops...);
+      binary_op<uint16_t>(a, b, outputs, op);
      break;
    case uint32:
-      binary_op<uint32_t>(a, b, outputs, ops...);
+      binary_op<uint32_t>(a, b, outputs, op);
      break;
    case uint64:
-      binary_op<uint64_t>(a, b, outputs, ops...);
+      binary_op<uint64_t>(a, b, outputs, op);
      break;
    case int8:
-      binary_op<int8_t>(a, b, outputs, ops...);
+      binary_op<int8_t>(a, b, outputs, op);
      break;
    case int16:
-      binary_op<int16_t>(a, b, outputs, ops...);
+      binary_op<int16_t>(a, b, outputs, op);
      break;
    case int32:
-      binary_op<int32_t>(a, b, outputs, ops...);
+      binary_op<int32_t>(a, b, outputs, op);
      break;
    case int64:
-      binary_op<int64_t>(a, b, outputs, ops...);
+      binary_op<int64_t>(a, b, outputs, op);
      break;
    case float16:
-      binary_op<float16_t>(a, b, outputs, ops...);
+      binary_op<float16_t>(a, b, outputs, op);
      break;
    case float32:
-      binary_op<float>(a, b, outputs, ops...);
+      binary_op<float>(a, b, outputs, op);
      break;
    case bfloat16:
-      binary_op<bfloat16_t>(a, b, outputs, ops...);
+      binary_op<bfloat16_t>(a, b, outputs, op);
      break;
    case complex64:
-      binary_op<complex64_t>(a, b, outputs, ops...);
+      binary_op<complex64_t>(a, b, outputs, op);
      break;
  }
 }
--- a/mlx/backend/common/common.cpp
+++ b/mlx/backend/common/common.cpp
@@ -156,8 +156,7 @@ std::pair<bool, std::vector<size_t>> Reshape::prepare_reshape(
  }

  // Firstly let's collapse all the contiguous dimensions of the input
-  auto [shape, _strides] = collapse_contiguous_dims(in);
-  auto& strides = _strides[0];
+  auto [shape, strides] = collapse_contiguous_dims(in);

  // If shapes fit exactly in the contiguous dims then no copy is necessary so
  // let's check.
--- a/mlx/backend/common/compiled.cpp
+++ b/mlx/backend/common/compiled.cpp
@@ -18,7 +18,8 @@ void print_constant(std::ostream& os, const array& x) {
    case complex64:
      return print_complex_constant<complex64_t>(os, x);
    case int8:
-      return print_int_constant<int8_t>(os, x);
+      os << static_cast<int32_t>(x.item<int8_t>());
+      return;
    case int16:
      return print_int_constant<int16_t>(os, x);
    case int32:
@@ -26,7 +27,8 @@ void print_constant(std::ostream& os, const array& x) {
    case int64:
      return print_int_constant<int64_t>(os, x);
    case uint8:
-      return print_int_constant<uint8_t>(os, x);
+      os << static_cast<uint32_t>(x.item<uint8_t>());
+      return;
    case uint16:
      return print_int_constant<uint16_t>(os, x);
    case uint32:
--- a/mlx/backend/common/compiled_cpu.cpp
+++ b/mlx/backend/common/compiled_cpu.cpp
@@ -2,6 +2,7 @@

 #include <dlfcn.h>
 #include <filesystem>
+#include <fstream>
 #include <list>

 #include "mlx/backend/common/compiled.h"
--- a/mlx/backend/common/conv.cpp
+++ b/mlx/backend/common/conv.cpp
@@ -684,6 +684,32 @@ void dispatch_slow_conv_3D(
 // Explicit gemm conv
 ///////////////////////////////////////////////////////////////////////////////

+template <typename T>
+void flip_spatial_dims_inplace(array& wt) {
+  T* x = wt.data<T>();
+  size_t out_channels = wt.shape(0);
+  size_t in_channels = wt.shape(-1);
+
+  // Calculate the total size of the spatial dimensions
+  int spatial_size = 1;
+  for (int d = 1; d < wt.ndim() - 1; ++d) {
+    spatial_size *= wt.shape(d);
+  }
+
+  for (size_t i = 0; i < out_channels; i++) {
+    T* top = x + i * spatial_size * in_channels;
+    T* bottom =
+        x + i * spatial_size * in_channels + (spatial_size - 1) * in_channels;
+    for (size_t j = 0; j < spatial_size / 2; j++) {
+      for (size_t k = 0; k < in_channels; k++) {
+        std::swap(top[k], bottom[k]);
+      }
+      top += in_channels;
+      bottom -= in_channels;
+    }
+  }
+}
+
 void explicit_gemm_conv_1D_cpu(
    const array& in,
    const array& wt,
@@ -910,7 +936,8 @@ void explicit_gemm_conv_ND_cpu(
    array out,
    const std::vector<int>& padding,
    const std::vector<int>& wt_strides,
-    const std::vector<int>& wt_dilation) {
+    const std::vector<int>& wt_dilation,
+    const bool flip) {
  const int N = in.shape(0); // Batch size, should be the same as out.shape(0)
  const auto iDim = std::vector<int>(
      in.shape().begin() + 1, in.shape().end() - 1); // Input spatial dim
@@ -1000,6 +1027,14 @@ void explicit_gemm_conv_ND_cpu(
    copy(wt, gemm_wt, ctype);
  }

+  if (flip) {
+    auto gemm_wt_ = array(gemm_wt.shape(), float32, nullptr, {});
+    copy(gemm_wt, gemm_wt_, CopyType::Vector);
+
+    flip_spatial_dims_inplace<float>(gemm_wt_);
+    gemm_wt = gemm_wt_;
+  }
+
  if (out.dtype() != float32) {
    gemm_out = array(out.shape(), float32, nullptr, {});
    gemm_out.set_data(allocator::malloc_or_wait(gemm_out.nbytes()));
@@ -1042,10 +1077,15 @@ void conv_1D_cpu(
    const std::vector<int>& wt_dilation,
    const std::vector<int>& in_dilation,
    bool flip) {
+  const int groups = in.shape().back() / wt.shape().back();
  if (wt_dilation[0] == 1 && in_dilation[0] == 1 && !flip) {
    return explicit_gemm_conv_1D_cpu(
        in, wt, out, padding, wt_strides, wt_dilation);
  }
+  if (wt_dilation[0] == 1 && in_dilation[0] == 1 && groups == 1) {
+    return explicit_gemm_conv_ND_cpu(
+        in, wt, out, padding, wt_strides, wt_dilation, flip);
+  }

  return dispatch_slow_conv_1D(
      in, wt, out, padding, wt_strides, wt_dilation, in_dilation, flip);
@@ -1060,6 +1100,13 @@ void conv_2D_cpu(
    const std::vector<int>& wt_dilation,
    const std::vector<int>& in_dilation,
    bool flip) {
+  const int groups = in.shape().back() / wt.shape().back();
+  if (wt_dilation[0] == 1 && wt_dilation[1] == 1 && in_dilation[0] == 1 &&
+      in_dilation[1] == 1 && groups == 1) {
+    return explicit_gemm_conv_ND_cpu(
+        in, wt, out, padding, wt_strides, wt_dilation, flip);
+  }
+
  return dispatch_slow_conv_2D(
      in, wt, out, padding, wt_strides, wt_dilation, in_dilation, flip);
 }
@@ -1073,6 +1120,14 @@ void conv_3D_cpu(
    const std::vector<int>& wt_dilation,
    const std::vector<int>& in_dilation,
    bool flip) {
+  const int groups = in.shape().back() / wt.shape().back();
+  if (wt_dilation[0] == 1 && wt_dilation[1] == 1 && wt_dilation[2] == 1 &&
+      in_dilation[0] == 1 && in_dilation[1] == 1 && in_dilation[2] == 1 &&
+      groups == 1) {
+    return explicit_gemm_conv_ND_cpu(
+        in, wt, out, padding, wt_strides, wt_dilation, flip);
+  }
+
  return dispatch_slow_conv_3D(
      in, wt, out, padding, wt_strides, wt_dilation, in_dilation, flip);
 }
@@ -1125,7 +1180,7 @@ void Convolution::eval(const std::vector<array>& inputs, array& out) {
  else {
    std::ostringstream msg;
    msg << "[Convolution::eval] Convolution currently only supports"
-        << " 1D and 2D convolutions. Got inputs with " << in.ndim() - 2
+        << " 1D, 2D and 3D convolutions. Got inputs with " << in.ndim() - 2
        << " spatial dimensions";
    throw std::invalid_argument(msg.str());
  }
--- a/mlx/backend/common/copy.cpp
+++ b/mlx/backend/common/copy.cpp
@@ -26,292 +26,117 @@ void copy_vector(const array& src, array& dst) {
  std::copy(src_ptr, src_ptr + src.data_size(), dst_ptr);
 }

-template <typename SrcT, typename DstT, typename stride_t>
-void copy_general_dim1(
-    const array& src,
-    array& dst,
-    const std::vector<int>& data_shape,
-    const std::vector<stride_t>& i_strides,
-    int64_t i_offset) {
-  const SrcT* src_ptr = src.data<SrcT>();
-  DstT* dst_ptr = dst.data<DstT>();
-  stride_t src_idx = i_offset;
-  stride_t dst_idx = 0;
-  for (int i = 0; i < data_shape[0]; ++i) {
-    dst_ptr[dst_idx++] = static_cast<DstT>(src_ptr[src_idx]);
-    src_idx += i_strides[0];
-  }
-}
+template <typename SrcT, typename DstT, typename StrideT, int D>
+inline void copy_dims(
+    const SrcT* src,
+    DstT* dst,
+    const std::vector<int>& shape,
+    const std::vector<StrideT>& i_strides,
+    const std::vector<StrideT>& o_strides,
+    int axis) {
+  auto stride_src = i_strides[axis];
+  auto stride_dst = o_strides[axis];
+  auto N = shape[axis];

-template <typename SrcT, typename DstT>
-inline void copy_general_dim1(const array& src, array& dst) {
-  return copy_general_dim1<SrcT, DstT, size_t>(
-      src, dst, src.shape(), src.strides(), 0);
-}
-
-template <typename SrcT, typename DstT, typename stride_t>
-void copy_general_dim2(
-    const array& src,
-    array& dst,
-    const std::vector<int>& data_shape,
-    const std::vector<stride_t>& i_strides,
-    int64_t i_offset) {
-  const SrcT* src_ptr = src.data<SrcT>();
-  DstT* dst_ptr = dst.data<DstT>();
-  stride_t src_idx = i_offset;
-  stride_t dst_idx = 0;
-  for (int i = 0; i < data_shape[0]; ++i) {
-    for (int j = 0; j < data_shape[1]; ++j) {
-      dst_ptr[dst_idx++] = static_cast<DstT>(src_ptr[src_idx]);
-      src_idx += i_strides[1];
+  for (int i = 0; i < N; i++) {
+    if constexpr (D > 1) {
+      copy_dims<SrcT, DstT, StrideT, D - 1>(
+          src, dst, shape, i_strides, o_strides, axis + 1);
+    } else {
+      *dst = static_cast<DstT>(*src);
    }
-    src_idx += i_strides[0] - i_strides[1] * data_shape[1];
+    src += stride_src;
+    dst += stride_dst;
  }
 }

-template <typename SrcT, typename DstT>
-inline void copy_general_dim2(const array& src, array& dst) {
-  return copy_general_dim2<SrcT, DstT, size_t>(
-      src, dst, src.shape(), src.strides(), 0);
-}
-
-template <typename SrcT, typename DstT, typename stride_t>
-void copy_general_dim3(
-    const array& src,
-    array& dst,
-    const std::vector<int>& data_shape,
-    const std::vector<stride_t>& i_strides,
-    int64_t i_offset) {
-  const SrcT* src_ptr = src.data<SrcT>();
-  DstT* dst_ptr = dst.data<DstT>();
-  stride_t src_idx = i_offset;
-  stride_t dst_idx = 0;
-  for (int i = 0; i < data_shape[0]; ++i) {
-    for (int j = 0; j < data_shape[1]; ++j) {
-      for (int k = 0; k < data_shape[2]; ++k) {
-        dst_ptr[dst_idx++] = static_cast<DstT>(src_ptr[src_idx]);
-        src_idx += i_strides[2];
-      }
-      src_idx += i_strides[1] - i_strides[2] * data_shape[2];
-    }
-    src_idx += i_strides[0] - i_strides[1] * data_shape[1];
-  }
-}
-
-template <typename SrcT, typename DstT>
-inline void copy_general_dim3(const array& src, array& dst) {
-  return copy_general_dim3<SrcT, DstT, size_t>(
-      src, dst, src.shape(), src.strides(), 0);
-}
-
-template <typename SrcT, typename DstT, typename stride_t>
-void copy_general_dim4(
-    const array& src,
-    array& dst,
-    const std::vector<int>& data_shape,
-    const std::vector<stride_t>& i_strides,
-    int64_t i_offset) {
-  const SrcT* src_ptr = src.data<SrcT>();
-  DstT* dst_ptr = dst.data<DstT>();
-  stride_t src_idx = i_offset;
-  stride_t dst_idx = 0;
-  for (int i = 0; i < data_shape[0]; ++i) {
-    for (int j = 0; j < data_shape[1]; ++j) {
-      for (int k = 0; k < data_shape[2]; ++k) {
-        for (int ii = 0; ii < data_shape[3]; ++ii) {
-          dst_ptr[dst_idx++] = static_cast<DstT>(src_ptr[src_idx]);
-          src_idx += i_strides[3];
-        }
-        src_idx += i_strides[2] - i_strides[3] * data_shape[3];
-      }
-      src_idx += i_strides[1] - i_strides[2] * data_shape[2];
-    }
-    src_idx += i_strides[0] - i_strides[1] * data_shape[1];
-  }
-}
-
-template <typename SrcT, typename DstT>
-inline void copy_general_dim4(const array& src, array& dst) {
-  return copy_general_dim4<SrcT, DstT, size_t>(
-      src, dst, src.shape(), src.strides(), 0);
-}
-
-template <typename SrcT, typename DstT, typename stride_t>
-void copy_general(
-    const array& src,
-    array& dst,
-    const std::vector<int>& data_shape,
-    const std::vector<stride_t>& i_strides,
-    int64_t i_offset) {
-  auto [new_shape, new_strides] = collapse_contiguous_dims(
-      data_shape, std::vector<std::vector<stride_t>>{i_strides});
-  switch (new_shape.size()) {
-    case 1:
-      copy_general_dim1<SrcT, DstT, stride_t>(
-          src, dst, new_shape, new_strides[0], i_offset);
-      return;
-    case 2:
-      copy_general_dim2<SrcT, DstT, stride_t>(
-          src, dst, new_shape, new_strides[0], i_offset);
-      return;
-    case 3:
-      copy_general_dim3<SrcT, DstT, stride_t>(
-          src, dst, new_shape, new_strides[0], i_offset);
-      return;
-    case 4:
-      copy_general_dim4<SrcT, DstT, stride_t>(
-          src, dst, new_shape, new_strides[0], i_offset);
-      return;
-  }
-
-  auto src_ptr = src.data<SrcT>() + i_offset;
-  auto dst_ptr = dst.data<DstT>();
-  for (size_t i = 0; i < dst.size(); ++i) {
-    stride_t src_elem = elem_to_loc(i, new_shape, new_strides[0]);
-    dst_ptr[i] = static_cast<DstT>(src_ptr[src_elem]);
-  }
-}
-
-template <typename SrcT, typename DstT>
-inline void copy_general(const array& src, array& dst) {
-  return copy_general<SrcT, DstT, size_t>(
-      src, dst, src.shape(), src.strides(), 0);
-}
-
-template <typename SrcT, typename DstT, typename stride_t>
-inline void copy_general(
-    const array& src,
-    array& dst,
-    const std::vector<int>& data_shape,
-    const std::vector<stride_t>& i_strides,
-    const std::vector<stride_t>& o_strides,
-    int64_t i_offset,
-    int64_t o_offset) {
-  return copy_general<SrcT, DstT, stride_t>(
-      src, dst, data_shape, i_strides, i_offset);
-}
-
-template <typename SrcT, typename DstT, typename stride_t, int D>
-inline void copy_general_general_dims(
-    const array& src,
-    array& dst,
-    const std::vector<int>& data_shape,
-    const std::vector<stride_t>& i_strides,
-    const std::vector<stride_t>& o_strides,
-    int64_t i_offset,
-    int64_t o_offset) {
-  if constexpr (D > 1) {
-    int axis = data_shape.size() - D;
-    auto stride_src = i_strides[axis];
-    auto stride_dst = o_strides[axis];
-    auto N = data_shape[axis];
-    for (int i = 0; i < N; i++) {
-      copy_general_general_dims<SrcT, DstT, stride_t, D - 1>(
-          src, dst, data_shape, i_strides, o_strides, i_offset, o_offset);
-      i_offset += stride_src;
-      o_offset += stride_dst;
-    }
-  } else {
-    int axis = data_shape.size() - 1;
-    auto stride_src = i_strides[axis];
-    auto stride_dst = o_strides[axis];
-    auto N = data_shape[axis];
-    const SrcT* src_ptr = src.data<SrcT>() + i_offset;
-    DstT* dst_ptr = dst.data<DstT>() + o_offset;
-    for (int i = 0; i < N; i++) {
-      *dst_ptr = static_cast<DstT>(*src_ptr);
-      src_ptr += stride_src;
-      dst_ptr += stride_dst;
-    }
-  }
-}
-
-template <typename SrcT, typename DstT, typename stride_t>
+template <typename SrcT, typename DstT, typename StrideT>
 void copy_general_general(
    const array& src,
    array& dst,
    const std::vector<int>& data_shape,
-    const std::vector<stride_t>& i_strides,
-    const std::vector<stride_t>& o_strides,
+    const std::vector<StrideT>& i_strides,
+    const std::vector<StrideT>& o_strides,
    int64_t i_offset,
    int64_t o_offset) {
-  auto [new_shape, new_strides] = collapse_contiguous_dims(
-      data_shape, std::vector<std::vector<stride_t>>{i_strides, o_strides});
-  switch (new_shape.size()) {
-    case 1:
-      copy_general_general_dims<SrcT, DstT, stride_t, 1>(
-          src,
-          dst,
-          new_shape,
-          new_strides[0],
-          new_strides[1],
-          i_offset,
-          o_offset);
-      return;
-    case 2:
-      copy_general_general_dims<SrcT, DstT, stride_t, 2>(
-          src,
-          dst,
-          new_shape,
-          new_strides[0],
-          new_strides[1],
-          i_offset,
-          o_offset);
-      return;
-    case 3:
-      copy_general_general_dims<SrcT, DstT, stride_t, 3>(
-          src,
-          dst,
-          new_shape,
-          new_strides[0],
-          new_strides[1],
-          i_offset,
-          o_offset);
-      return;
-    case 4:
-      copy_general_general_dims<SrcT, DstT, stride_t, 4>(
-          src,
-          dst,
-          new_shape,
-          new_strides[0],
-          new_strides[1],
-          i_offset,
-          o_offset);
-      return;
-    case 5:
-      copy_general_general_dims<SrcT, DstT, stride_t, 5>(
-          src,
-          dst,
-          new_shape,
-          new_strides[0],
-          new_strides[1],
-          i_offset,
-          o_offset);
-      return;
+  if (data_shape.empty()) {
+    auto val = static_cast<DstT>(*(src.data<SrcT>() + i_offset));
+    auto dst_ptr = dst.data<DstT>() + o_offset;
+    *dst_ptr = val;
+    return;
  }
-
-  int size = std::accumulate(
-      new_shape.end() - 5, new_shape.end(), 1, std::multiplies<int>());
-  for (int i = 0; i < src.size(); i += size) {
-    stride_t src_offset = i_offset + elem_to_loc(i, new_shape, new_strides[0]);
-    stride_t dst_offset = o_offset + elem_to_loc(i, new_shape, new_strides[1]);
-    copy_general_general_dims<SrcT, DstT, stride_t, 5>(
-        src,
-        dst,
-        new_shape,
-        new_strides[0],
-        new_strides[1],
-        src_offset,
-        dst_offset);
+  auto [shape, strides] = collapse_contiguous_dims(
+      data_shape, std::vector<std::vector<StrideT>>{i_strides, o_strides});
+  auto src_ptr = src.data<SrcT>() + i_offset;
+  auto dst_ptr = dst.data<DstT>() + o_offset;
+  int ndim = shape.size();
+  if (ndim == 1) {
+    copy_dims<SrcT, DstT, StrideT, 1>(
+        src_ptr, dst_ptr, shape, strides[0], strides[1], 0);
+    return;
+  } else if (ndim == 2) {
+    copy_dims<SrcT, DstT, StrideT, 2>(
+        src_ptr, dst_ptr, shape, strides[0], strides[1], 0);
+    return;
+  } else if (ndim == 3) {
+    copy_dims<SrcT, DstT, StrideT, 3>(
+        src_ptr, dst_ptr, shape, strides[0], strides[1], 0);
+    return;
+  }
+  ContiguousIterator<StrideT> in(shape, strides[0], ndim - 3);
+  ContiguousIterator<StrideT> out(shape, strides[1], ndim - 3);
+  StrideT stride = std::accumulate(
+      shape.end() - 3, shape.end(), 1, std::multiplies<StrideT>());
+  for (StrideT elem = 0; elem < src.size(); elem += stride) {
+    copy_dims<SrcT, DstT, StrideT, 3>(
+        src_ptr + in.loc,
+        dst_ptr + out.loc,
+        shape,
+        strides[0],
+        strides[1],
+        ndim - 3);
+    in.step();
+    out.step();
  }
 }

 template <typename SrcT, typename DstT>
 inline void copy_general_general(const array& src, array& dst) {
-  return copy_general_general<SrcT, DstT, size_t>(
+  copy_general_general<SrcT, DstT, size_t>(
      src, dst, src.shape(), src.strides(), dst.strides(), 0, 0);
 }

+template <typename SrcT, typename DstT, typename StrideT>
+void copy_general(
+    const array& src,
+    array& dst,
+    const std::vector<int>& data_shape,
+    const std::vector<StrideT>& i_strides,
+    const std::vector<StrideT>&,
+    int64_t i_offset,
+    int64_t o_offset) {
+  copy_general_general<SrcT, DstT, StrideT>(
+      src,
+      dst,
+      data_shape,
+      i_strides,
+      make_contiguous_strides<StrideT>(data_shape),
+      i_offset,
+      o_offset);
+}
+
+template <typename SrcT, typename DstT>
+inline void copy_general(const array& src, array& dst) {
+  copy_general_general<SrcT, DstT, size_t>(
+      src,
+      dst,
+      src.shape(),
+      src.strides(),
+      make_contiguous_strides<size_t>(src.shape()),
+      0,
+      0);
+}
+
 template <typename SrcT, typename DstT, typename... Args>
 void copy(const array& src, array& dst, CopyType ctype, Args&&... args) {
  switch (ctype) {
@@ -326,6 +151,7 @@ void copy(const array& src, array& dst, CopyType ctype, Args&&... args) {
      return;
    case CopyType::GeneralGeneral:
      copy_general_general<SrcT, DstT>(src, dst, std::forward<Args>(args)...);
+      return;
  }
 }

@@ -426,7 +252,7 @@ inline void copy_inplace_dispatch(
 } // namespace

 void copy_inplace(const array& src, array& dst, CopyType ctype) {
-  return copy_inplace_dispatch(src, dst, ctype);
+  copy_inplace_dispatch(src, dst, ctype);
 }

 void copy(const array& src, array& dst, CopyType ctype) {
@@ -456,20 +282,20 @@ void copy(const array& src, array& dst, CopyType ctype) {
  copy_inplace(src, dst, ctype);
 }

-template <typename stride_t>
+template <typename StrideT>
 void copy_inplace(
    const array& src,
    array& dst,
    const std::vector<int>& data_shape,
-    const std::vector<stride_t>& i_strides,
-    const std::vector<stride_t>& o_strides,
+    const std::vector<StrideT>& i_strides,
+    const std::vector<StrideT>& o_strides,
    int64_t i_offset,
    int64_t o_offset,
    CopyType ctype) {
  switch (ctype) {
    case CopyType::General:
    case CopyType::GeneralGeneral:
-      return copy_inplace_dispatch(
+      copy_inplace_dispatch(
          src,
          dst,
          ctype,
@@ -478,10 +304,10 @@ void copy_inplace(
          o_strides,
          i_offset,
          o_offset);
-
+      break;
    case CopyType::Scalar:
    case CopyType::Vector:
-      return copy_inplace_dispatch(src, dst, ctype);
+      copy_inplace_dispatch(src, dst, ctype);
  }
 }

--- a/mlx/backend/common/indexing.cpp
+++ b/mlx/backend/common/indexing.cpp
@@ -1,5 +1,4 @@
 // Copyright © 2023 Apple Inc.
-
 #include <algorithm>
 #include <cassert>
 #include <cmath>
@@ -81,11 +80,18 @@ void gather(
  T* dst_ptr = out.data<T>();
  size_t out_idx = 0;

+  std::vector<ContiguousIterator<size_t>> its(inds.begin(), inds.end());
+  ContiguousIterator<size_t> src_it;
+  if (!can_copy && src.ndim() > 0) {
+    src_it = std::move(
+        ContiguousIterator<size_t>(slice_sizes, src.strides(), src.ndim()));
+  }
  for (int idx = 0; idx < ind_size; idx++) {
    size_t src_idx = 0;
    for (int ii = 0; ii < inds.size(); ++ii) {
      auto ax = axes[ii];
-      auto idx_loc = elem_to_loc(idx, inds[ii]);
+      auto idx_loc = its[ii].loc;
+      its[ii].step();
      auto idx_val =
          offset_neg_idx(inds[ii].data<IdxT>()[idx_loc], src.shape(ax));
      src_idx += (idx_val * src.strides()[ax]);
@@ -99,9 +105,10 @@ void gather(
      out_idx += slice_size;
    } else {
      for (int jj = 0; jj < slice_size; jj++) {
-        auto src_offset = elem_to_loc(jj, slice_sizes, src.strides());
-        dst_ptr[out_idx++] = src_ptr[src_idx + src_offset];
+        dst_ptr[out_idx++] = src_ptr[src_idx + src_it.loc];
+        src_it.step();
      }
+      src_it.reset();
    }
  }
 }
@@ -223,21 +230,29 @@ void scatter(
    update_size *= us;
  }

+  std::vector<ContiguousIterator<size_t>> its(inds.begin(), inds.end());
+  ContiguousIterator<size_t> update_it(updates);
+  ContiguousIterator<size_t> out_it(update_shape, out.strides(), out.ndim());
+
  for (int i = 0; i < n_updates; ++i) {
    size_t out_offset = 0;
    for (int j = 0; j < nind; ++j) {
      auto ax = axes[j];
-      auto idx_loc = elem_to_loc(i, inds[j]);
+      auto idx_loc = its[j].loc;
+      its[j].step();
      auto idx_val =
          offset_neg_idx(inds[j].data<IdxT>()[idx_loc], out.shape(ax));
      out_offset += (idx_val * out.strides()[ax]);
    }
+    update_it.seek(i * update_size);
    for (int j = 0; j < update_size; ++j) {
-      auto update_loc = elem_to_loc(i * update_size + j, updates);
-      auto out_loc = elem_to_loc(j, update_shape, out.strides());
-      op(updates.data<InT>()[update_loc],
-         out.data<InT>() + out_offset + out_loc);
+      op(updates.data<InT>()[update_it.loc],
+         out.data<InT>() + out_offset + out_it.loc);
+      update_it.step();
+      out_it.step();
    }
+    out_it.reset();
+    update_it.reset();
  }
 }

--- a/mlx/backend/common/inverse.cpp
+++ b/mlx/backend/common/inverse.cpp
@@ -10,9 +10,106 @@
 #include <lapack.h>
 #endif

+// Wrapper to account for differences in
+// LAPACK implementations (basically how to pass the 'uplo' string to fortran).
+int strtri_wrapper(char uplo, char diag, float* matrix, int N) {
+  int info;
+
+#ifdef LAPACK_FORTRAN_STRLEN_END
+  strtri_(
+      /* uplo = */ &uplo,
+      /* diag = */ &diag,
+      /* N = */ &N,
+      /* a = */ matrix,
+      /* lda = */ &N,
+      /* info = */ &info,
+      /* uplo_len = */ static_cast<size_t>(1),
+      /* diag_len = */ static_cast<size_t>(1));
+#else
+  strtri_(
+      /* uplo = */ &uplo,
+      /* diag = */ &diag,
+      /* N = */ &N,
+      /* a = */ matrix,
+      /* lda = */ &N,
+      /* info = */ &info);
+#endif
+
+  return info;
+}
+
 namespace mlx::core {

-void inverse_impl(const array& a, array& inv) {
+void general_inv(array& inv, int N, int i) {
+  int info;
+  auto ipiv = array::Data{allocator::malloc_or_wait(sizeof(int) * N)};
+  // Compute LU factorization.
+  sgetrf_(
+      /* m = */ &N,
+      /* n = */ &N,
+      /* a = */ inv.data<float>() + N * N * i,
+      /* lda = */ &N,
+      /* ipiv = */ static_cast<int*>(ipiv.buffer.raw_ptr()),
+      /* info = */ &info);
+
+  if (info != 0) {
+    std::stringstream ss;
+    ss << "inverse_impl: LU factorization failed with error code " << info;
+    throw std::runtime_error(ss.str());
+  }
+
+  static const int lwork_query = -1;
+  float workspace_size = 0;
+
+  // Compute workspace size.
+  sgetri_(
+      /* m = */ &N,
+      /* a = */ nullptr,
+      /* lda = */ &N,
+      /* ipiv = */ nullptr,
+      /* work = */ &workspace_size,
+      /* lwork = */ &lwork_query,
+      /* info = */ &info);
+
+  if (info != 0) {
+    std::stringstream ss;
+    ss << "inverse_impl: LU workspace calculation failed with error code "
+       << info;
+    throw std::runtime_error(ss.str());
+  }
+
+  const int lwork = workspace_size;
+  auto scratch = array::Data{allocator::malloc_or_wait(sizeof(float) * lwork)};
+
+  // Compute inverse.
+  sgetri_(
+      /* m = */ &N,
+      /* a = */ inv.data<float>() + N * N * i,
+      /* lda = */ &N,
+      /* ipiv = */ static_cast<int*>(ipiv.buffer.raw_ptr()),
+      /* work = */ static_cast<float*>(scratch.buffer.raw_ptr()),
+      /* lwork = */ &lwork,
+      /* info = */ &info);
+
+  if (info != 0) {
+    std::stringstream ss;
+    ss << "inverse_impl: inversion failed with error code " << info;
+    throw std::runtime_error(ss.str());
+  }
+}
+
+void tri_inv(array& inv, int N, int i, bool upper) {
+  const char uplo = upper ? 'L' : 'U';
+  const char diag = 'N';
+  int info = strtri_wrapper(uplo, diag, inv.data<float>() + N * N * i, N);
+  if (info != 0) {
+    std::stringstream ss;
+    ss << "inverse_impl: triangular inversion failed with error code " << info;
+    throw std::runtime_error(ss.str());
+  }
+}
+
+void inverse_impl(const array& a, array& inv, bool tri, bool upper) {
  // Lapack uses the column-major convention. We take advantage of the following
  // identity to avoid transposing (see
  // https://math.stackexchange.com/a/340234):
@@ -24,63 +121,11 @@ void inverse_impl(const array& a, array& inv) {
  const int N = a.shape(-1);
  const size_t num_matrices = a.size() / (N * N);

-  int info;
-  auto ipiv = array::Data{allocator::malloc_or_wait(sizeof(int) * N)};
-
  for (int i = 0; i < num_matrices; i++) {
-    // Compute LU factorization.
-    sgetrf_(
-        /* m = */ &N,
-        /* n = */ &N,
-        /* a = */ inv.data<float>() + N * N * i,
-        /* lda = */ &N,
-        /* ipiv = */ static_cast<int*>(ipiv.buffer.raw_ptr()),
-        /* info = */ &info);
-
-    if (info != 0) {
-      std::stringstream ss;
-      ss << "inverse_impl: LU factorization failed with error code " << info;
-      throw std::runtime_error(ss.str());
-    }
-
-    static const int lwork_query = -1;
-    float workspace_size = 0;
-
-    // Compute workspace size.
-    sgetri_(
-        /* m = */ &N,
-        /* a = */ nullptr,
-        /* lda = */ &N,
-        /* ipiv = */ nullptr,
-        /* work = */ &workspace_size,
-        /* lwork = */ &lwork_query,
-        /* info = */ &info);
-
-    if (info != 0) {
-      std::stringstream ss;
-      ss << "inverse_impl: LU workspace calculation failed with error code "
-         << info;
-      throw std::runtime_error(ss.str());
-    }
-
-    const int lwork = workspace_size;
-    auto scratch =
-        array::Data{allocator::malloc_or_wait(sizeof(float) * lwork)};
-
-    // Compute inverse.
-    sgetri_(
-        /* m = */ &N,
-        /* a = */ inv.data<float>() + N * N * i,
-        /* lda = */ &N,
-        /* ipiv = */ static_cast<int*>(ipiv.buffer.raw_ptr()),
-        /* work = */ static_cast<float*>(scratch.buffer.raw_ptr()),
-        /* lwork = */ &lwork,
-        /* info = */ &info);
-
-    if (info != 0) {
-      std::stringstream ss;
-      ss << "inverse_impl: inversion failed with error code " << info;
-      throw std::runtime_error(ss.str());
+    if (tri) {
+      tri_inv(inv, N, i, upper);
+    } else {
+      general_inv(inv, N, i);
    }
  }
 }
@@ -89,7 +134,7 @@ void Inverse::eval(const std::vector<array>& inputs, array& output) {
  if (inputs[0].dtype() != float32) {
    throw std::runtime_error("[Inverse::eval] only supports float32.");
  }
-  inverse_impl(inputs[0], output);
+  inverse_impl(inputs[0], output, tri_, upper_);
 }

 } // namespace mlx::core
--- a/mlx/backend/common/load.cpp
+++ b/mlx/backend/common/load.cpp
@@ -5,11 +5,9 @@
 #include <utility>

 #include "mlx/allocator.h"
-#include "mlx/io/load.h"
+#include "mlx/backend/common/load.h"
 #include "mlx/primitives.h"

-namespace mlx::core {
-
 namespace {

 template <const uint8_t scalar_size>
@@ -29,12 +27,14 @@ void swap_endianness(uint8_t* data_bytes, size_t N) {

 } // namespace

-void Load::eval(const std::vector<array>& inputs, array& out) {
-  assert(inputs.size() == 0);
-  out.set_data(allocator::malloc_or_wait(out.nbytes()));
+namespace mlx::core {

-  reader_->seek(offset_, std::ios_base::beg);
-  reader_->read(out.data<char>(), out.nbytes());
+void load(
+    array& out,
+    size_t offset,
+    const std::shared_ptr<io::Reader>& reader,
+    bool swap_endianness_) {
+  reader->read(out.data<char>(), out.nbytes(), offset);

  if (swap_endianness_) {
    switch (out.itemsize()) {
@@ -51,4 +51,11 @@ void Load::eval(const std::vector<array>& inputs, array& out) {
  }
 }

+void Load::eval(const std::vector<array>& inputs, array& out) {
+  assert(inputs.size() == 0);
+  out.set_data(allocator::malloc_or_wait(out.nbytes()));
+
+  load(out, offset_, reader_, swap_endianness_);
+}
+
 } // namespace mlx::core
--- a/mlx/backend/common/load.h
+++ b/mlx/backend/common/load.h
@@ -0,0 +1,14 @@
+// Copyright © 2024 Apple Inc.
+
+#include "mlx/array.h"
+#include "mlx/io/load.h"
+
+namespace mlx::core {
+
+void load(
+    array& out,
+    size_t offset,
+    const std::shared_ptr<io::Reader>& reader,
+    bool swap_endianess);
+
+} // namespace mlx::core
--- a/mlx/backend/common/make_compiled_preamble.sh
+++ b/mlx/backend/common/make_compiled_preamble.sh
@@ -21,7 +21,7 @@ EOM

 fi

-CONTENT=$($GCC -I $SRCDIR -E $SRCDIR/mlx/backend/common/compiled_preamble.h 2>/dev/null)
+CONTENT=$($GCC -I "$SRCDIR" -E "$SRCDIR/mlx/backend/common/compiled_preamble.h" 2>/dev/null)

 cat << EOF > "$OUTPUT_FILE"
 const char* get_kernel_preamble() {
--- a/mlx/backend/common/ops.h
+++ b/mlx/backend/common/ops.h
@@ -373,6 +373,10 @@ struct Sign {
  uint64_t operator()(uint64_t x) {
    return x != 0;
  }
+
+  complex64_t operator()(complex64_t x) {
+    return x == complex64_t(0) ? x : x / std::abs(x);
+  }
 };

 struct Sin {
--- a/mlx/backend/common/primitives.cpp
+++ b/mlx/backend/common/primitives.cpp
@@ -406,16 +406,7 @@ void Reshape::eval(const std::vector<array>& inputs, array& out) {

  if (copy_necessary) {
    out.set_data(allocator::malloc_or_wait(out.nbytes()));
-    auto out_strides = make_contiguous_strides<size_t>(in.shape());
-    copy_inplace<size_t>(
-        in,
-        out,
-        in.shape(),
-        in.strides(),
-        out_strides,
-        0,
-        0,
-        CopyType::General);
+    copy_inplace(in, out, CopyType::General);
  } else {
    shared_buffer_reshape(in, out_strides, out);
  }
@@ -505,8 +496,16 @@ void Slice::eval(const std::vector<array>& inputs, array& out) {
        /* int64_t o_offset = */ 0,
        /* CopyType ctype = */ CopyType::General);
  } else {
+    size_t data_end = 1;
+    for (int i = 0; i < end_indices_.size(); ++i) {
+      if (in.shape()[i] > 1) {
+        auto end_idx = start_indices_[i] + out.shape()[i] * strides_[i] - 1;
+        data_end += end_idx * in.strides()[i];
+      }
+    }
+    size_t data_size = data_end - data_offset;
    std::vector<size_t> ostrides{inp_strides.begin(), inp_strides.end()};
-    shared_buffer_slice(in, ostrides, data_offset, out);
+    shared_buffer_slice(in, ostrides, data_offset, data_size, out);
  }
 }

@@ -604,11 +603,18 @@ void View::eval_cpu(const std::vector<array>& inputs, array& out) {
      strides[i] /= obytes;
    }
    out.copy_shared_buffer(
-        in, strides, in.flags(), in.data_size() * obytes / ibytes);
+        in, strides, in.flags(), in.data_size() * ibytes / obytes);
  } else {
-    auto tmp = array(in.shape(), in.dtype(), nullptr, {});
+    auto tmp = array(
+        in.shape(), in.dtype() == bool_ ? uint8 : in.dtype(), nullptr, {});
    tmp.set_data(allocator::malloc_or_wait(tmp.nbytes()));
-    copy_inplace(in, tmp, CopyType::General);
+    if (in.dtype() == bool_) {
+      auto in_tmp = array(in.shape(), uint8, nullptr, {});
+      in_tmp.copy_shared_buffer(in);
+      copy_inplace(in_tmp, tmp, CopyType::General);
+    } else {
+      copy_inplace(in, tmp, CopyType::General);
+    }

    auto flags = out.flags();
    flags.contiguous = true;
--- a/mlx/backend/common/reduce.cpp
+++ b/mlx/backend/common/reduce.cpp
@@ -87,6 +87,38 @@ struct OrReduce {
  }
 };

+struct MaxReduce {
+  template <typename T>
+  std::enable_if_t<std::is_integral_v<T>> operator()(T* y, T x) {
+    (*y) = (*y > x) ? *y : x;
+  };
+
+  template <typename T>
+  std::enable_if_t<!std::is_integral_v<T>> operator()(T* y, T x) {
+    if (std::isnan(x)) {
+      *y = x;
+    } else {
+      (*y) = (*y > x) ? *y : x;
+    }
+  };
+};
+
+struct MinReduce {
+  template <typename T>
+  std::enable_if_t<std::is_integral_v<T>> operator()(T* y, T x) {
+    (*y) = (*y < x) ? *y : x;
+  };
+
+  template <typename T>
+  std::enable_if_t<!std::is_integral_v<T>> operator()(T* y, T x) {
+    if (std::isnan(x)) {
+      *y = x;
+    } else {
+      (*y) = (*y < x) ? *y : x;
+    }
+  };
+};
+
 template <typename InT>
 void reduce_dispatch_out(
    const array& in,
@@ -118,15 +150,13 @@ void reduce_dispatch_out(
      break;
    }
    case Reduce::Max: {
-      auto op = [](auto y, auto x) { (*y) = (*y > x) ? *y : x; };
      auto init = Limits<InT>::min;
-      reduction_op<InT, InT>(in, out, axes, init, op);
+      reduction_op<InT, InT>(in, out, axes, init, MaxReduce());
      break;
    }
    case Reduce::Min: {
-      auto op = [](auto y, auto x) { (*y) = (*y < x) ? *y : x; };
      auto init = Limits<InT>::max;
-      reduction_op<InT, InT>(in, out, axes, init, op);
+      reduction_op<InT, InT>(in, out, axes, init, MinReduce());
      break;
    }
  }
--- a/mlx/backend/common/reduce.h
+++ b/mlx/backend/common/reduce.h
@@ -49,7 +49,7 @@ struct ReductionPlan {
  ReductionPlan(ReductionOpType type_) : type(type_) {}
 };

-ReductionPlan get_reduction_plan(const array& x, const std::vector<int> axes);
+ReductionPlan get_reduction_plan(const array& x, const std::vector<int>& axes);

 // Helper for the ndimensional strided loop
 // Should this be in utils?
--- a/mlx/backend/common/reduce_utils.cpp
+++ b/mlx/backend/common/reduce_utils.cpp
@@ -19,7 +19,7 @@ std::pair<std::vector<int>, std::vector<size_t>> shapes_without_reduction_axes(
  return std::make_pair(shape, strides);
 }

-ReductionPlan get_reduction_plan(const array& x, const std::vector<int> axes) {
+ReductionPlan get_reduction_plan(const array& x, const std::vector<int>& axes) {
  // The data is all there and we are reducing over everything
  if (x.size() == x.data_size() && axes.size() == x.ndim() &&
      x.flags().contiguous) {
@@ -32,7 +32,7 @@ ReductionPlan get_reduction_plan(const array& x, const std::vector<int> axes) {
    std::vector<int> shape = {x.shape(axes[0])};
    std::vector<size_t> strides = {x.strides()[axes[0]]};
    for (int i = 1; i < axes.size(); i++) {
-      if (axes[i] - 1 == axes[i - 1]) {
+      if (axes[i] - 1 == axes[i - 1] && x.shape(axes[i]) > 1) {
        shape.back() *= x.shape(axes[i]);
        strides.back() = x.strides()[axes[i]];
      } else {
@@ -41,6 +41,14 @@ ReductionPlan get_reduction_plan(const array& x, const std::vector<int> axes) {
      }
    }

+    // Remove singleton axes from the plan
+    for (int i = shape.size() - 1; i >= 0; i--) {
+      if (shape[i] == 1) {
+        shape.erase(shape.begin() + i);
+        strides.erase(strides.begin() + i);
+      }
+    }
+
    if (strides.back() == 1) {
      return ReductionPlan(ContiguousReduce, shape, strides);
    } else if (strides.back() > 1) {
@@ -63,10 +71,14 @@ ReductionPlan get_reduction_plan(const array& x, const std::vector<int> axes) {
  // have a contiguous reduction.
  std::vector<std::pair<int, size_t>> reductions;
  for (auto a : axes) {
-    reductions.push_back(std::make_pair(x.shape(a), x.strides()[a]));
+    if (x.shape(a) > 1) {
+      reductions.push_back(std::make_pair(x.shape(a), x.strides()[a]));
+    }
  }
  std::sort(reductions.begin(), reductions.end(), [](auto a, auto b) {
-    return a.second > b.second;
+    bool a_is_zero = a.second == 0;
+    bool b_is_zero = b.second == 0;
+    return (a_is_zero != b_is_zero) ? a.second < b.second : a.second > b.second;
  });
  // Extract the two smallest and try to merge them in case the contiguous
  // reduction can be bigger than just the last axis.
@@ -98,16 +110,33 @@ ReductionPlan get_reduction_plan(const array& x, const std::vector<int> axes) {
  // strides.back() are contiguous.
  if (strides.back() > 1) {
    int size = 1;
+    bool have_expand = false;
    for (int i = x.ndim() - 1; i >= 0; i--) {
      if (axes.back() == i) {
        continue;
      }
-      if (x.strides()[i] != size) {
+
+      size_t stride_i = x.strides()[i];
+      int shape_i = x.shape(i);
+      if (stride_i == 0) {
+        if (shape_i == 1) {
+          continue;
+        }
+
+        have_expand = true;
        break;
      }
-      size *= x.shape(i);
+
+      if (stride_i != size && shape_i != 1) {
+        break;
+      }
+      size *= shape_i;
    }
-    if (size >= strides.back()) {
+    // In the case of an expanded dimension we are being conservative and
+    // require the smallest reduction stride to be smaller than the maximum row
+    // contiguous size. The reason is that we can't easily know if the reduced
+    // axis is before or after an expanded dimension.
+    if (size > strides.back() || (size == strides.back() && !have_expand)) {
      return ReductionPlan(GeneralStridedReduce, shape, strides);
    }
  }
--- a/mlx/backend/common/slicing.cpp
+++ b/mlx/backend/common/slicing.cpp
@@ -6,18 +6,16 @@ namespace mlx::core {

 std::tuple<bool, int64_t, std::vector<int64_t>> prepare_slice(
    const array& in,
-    std::vector<int>& start_indices,
-    std::vector<int>& strides) {
+    const std::vector<int>& start_indices,
+    const std::vector<int>& strides) {
  int64_t data_offset = 0;
  bool copy_needed = false;
  std::vector<int64_t> inp_strides(in.ndim(), 0);
  for (int i = 0; i < in.ndim(); ++i) {
    data_offset += start_indices[i] * in.strides()[i];
    inp_strides[i] = in.strides()[i] * strides[i];
-
    copy_needed |= strides[i] < 0;
  }
-
  return std::make_tuple(copy_needed, data_offset, inp_strides);
 }

@@ -25,26 +23,16 @@ void shared_buffer_slice(
    const array& in,
    const std::vector<size_t>& out_strides,
    size_t data_offset,
+    size_t data_size,
    array& out) {
  // Compute row/col contiguity
-  auto [data_size, is_row_contiguous, is_col_contiguous] =
+  auto [no_bsx_size, is_row_contiguous, is_col_contiguous] =
      check_contiguity(out.shape(), out_strides);

  auto flags = in.flags();
  flags.row_contiguous = is_row_contiguous;
  flags.col_contiguous = is_col_contiguous;
-
-  if (data_size == 1) {
-    // Broadcasted scalar array is contiguous.
-    flags.contiguous = true;
-  } else if (data_size == in.data_size()) {
-    // Means we sliced a broadcasted dimension so leave the "no holes" flag
-    // alone.
-  } else {
-    // We sliced something. So either we are row or col contiguous or we
-    // punched a hole.
-    flags.contiguous &= flags.row_contiguous || flags.col_contiguous;
-  }
+  flags.contiguous = (no_bsx_size == data_size);

  out.copy_shared_buffer(in, out_strides, flags, data_size, data_offset);
 }
--- a/mlx/backend/common/slicing.h
+++ b/mlx/backend/common/slicing.h
@@ -8,13 +8,14 @@ namespace mlx::core {

 std::tuple<bool, int64_t, std::vector<int64_t>> prepare_slice(
    const array& in,
-    std::vector<int>& start_indices,
-    std::vector<int>& strides);
+    const std::vector<int>& start_indices,
+    const std::vector<int>& strides);

 void shared_buffer_slice(
    const array& in,
    const std::vector<size_t>& out_strides,
    size_t data_offset,
+    size_t data_size,
    array& out);

 } // namespace mlx::core
--- a/mlx/backend/common/sort.cpp
+++ b/mlx/backend/common/sort.cpp
@@ -111,7 +111,8 @@ void sort(const array& in, array& out, int axis) {

  // Get axis, shape and stride info
  axis = axis < 0 ? axis + in.ndim() : axis;
-  size_t n_rows = in.size() / in.shape(axis);
+  size_t in_size = in.flags().contiguous ? in.data_size() : in.size();
+  size_t n_rows = in_size / in.shape(axis);

  auto remaining_shape = out.shape();
  remaining_shape.erase(remaining_shape.begin() + axis);
@@ -123,14 +124,16 @@ void sort(const array& in, array& out, int axis) {
  int axis_size = out.shape(axis);

  // Perform sorting in place
+  ContiguousIterator<size_t> src_it(
+      remaining_shape, remaining_strides, remaining_shape.size());
  for (int i = 0; i < n_rows; i++) {
-    size_t loc = elem_to_loc(i, remaining_shape, remaining_strides);
-    T* data_ptr = out.data<T>() + loc;
+    T* data_ptr = out.data<T>() + src_it.loc;

    StridedIterator st(data_ptr, axis_stride, 0);
    StridedIterator ed(data_ptr, axis_stride, axis_size);

    std::stable_sort(st, ed);
+    src_it.step();
  }
 }

@@ -160,11 +163,15 @@ void argsort(const array& in, array& out, int axis) {
  int axis_size = in.shape(axis);

  // Perform sorting
+  ContiguousIterator<size_t> in_it(
+      in_remaining_shape, in_remaining_strides, in_remaining_shape.size());
+  ContiguousIterator<size_t> out_it(
+      out_remaining_shape, out_remaining_strides, out_remaining_shape.size());
  for (int i = 0; i < n_rows; i++) {
-    size_t in_loc = elem_to_loc(i, in_remaining_shape, in_remaining_strides);
-    size_t out_loc = elem_to_loc(i, out_remaining_shape, out_remaining_strides);
-    const T* data_ptr = in.data<T>() + in_loc;
-    IdxT* idx_ptr = out.data<IdxT>() + out_loc;
+    const T* data_ptr = in.data<T>() + in_it.loc;
+    IdxT* idx_ptr = out.data<IdxT>() + out_it.loc;
+    in_it.step();
+    out_it.step();

    StridedIterator st_(idx_ptr, out_stride, 0);
    StridedIterator ed_(idx_ptr, out_stride, axis_size);
@@ -192,7 +199,8 @@ void partition(const array& in, array& out, int axis, int kth) {

  // Get axis, shape and stride info
  axis = axis < 0 ? axis + in.ndim() : axis;
-  size_t n_rows = in.size() / in.shape(axis);
+  size_t in_size = in.flags().contiguous ? in.data_size() : in.size();
+  size_t n_rows = in_size / in.shape(axis);

  auto remaining_shape = in.shape();
  remaining_shape.erase(remaining_shape.begin() + axis);
@@ -206,9 +214,11 @@ void partition(const array& in, array& out, int axis, int kth) {
  kth = kth < 0 ? kth + axis_size : kth;

  // Perform partition in place
+  ContiguousIterator<size_t> src_it(
+      remaining_shape, remaining_strides, remaining_shape.size());
  for (int i = 0; i < n_rows; i++) {
-    size_t loc = elem_to_loc(i, remaining_shape, remaining_strides);
-    T* data_ptr = out.data<T>() + loc;
+    T* data_ptr = out.data<T>() + src_it.loc;
+    src_it.step();

    StridedIterator st(data_ptr, axis_stride, 0);
    StridedIterator md(data_ptr, axis_stride, kth);
@@ -227,37 +237,49 @@ void argpartition(const array& in, array& out, int axis, int kth) {
  axis = axis < 0 ? axis + in.ndim() : axis;
  size_t n_rows = in.size() / in.shape(axis);

-  auto remaining_shape = in.shape();
-  remaining_shape.erase(remaining_shape.begin() + axis);
+  auto in_remaining_shape = in.shape();
+  in_remaining_shape.erase(in_remaining_shape.begin() + axis);

-  auto remaining_strides = in.strides();
-  remaining_strides.erase(remaining_strides.begin() + axis);
+  auto in_remaining_strides = in.strides();
+  in_remaining_strides.erase(in_remaining_strides.begin() + axis);

-  size_t axis_stride = in.strides()[axis];
+  auto out_remaining_shape = out.shape();
+  out_remaining_shape.erase(out_remaining_shape.begin() + axis);
+
+  auto out_remaining_strides = out.strides();
+  out_remaining_strides.erase(out_remaining_strides.begin() + axis);
+
+  size_t in_stride = in.strides()[axis];
+  size_t out_stride = out.strides()[axis];
  int axis_size = in.shape(axis);

  kth = kth < 0 ? kth + axis_size : kth;

  // Perform partition
+  ContiguousIterator<size_t> in_it(
+      in_remaining_shape, in_remaining_strides, in_remaining_shape.size());
+  ContiguousIterator<size_t> out_it(
+      out_remaining_shape, out_remaining_strides, out_remaining_shape.size());
  for (int i = 0; i < n_rows; i++) {
-    size_t loc = elem_to_loc(i, remaining_shape, remaining_strides);
-    const T* data_ptr = in.data<T>() + loc;
-    IdxT* idx_ptr = out.data<IdxT>() + loc;
+    const T* data_ptr = in.data<T>() + in_it.loc;
+    IdxT* idx_ptr = out.data<IdxT>() + out_it.loc;
+    in_it.step();
+    out_it.step();

-    StridedIterator st_(idx_ptr, axis_stride, 0);
-    StridedIterator ed_(idx_ptr, axis_stride, axis_size);
+    StridedIterator st_(idx_ptr, out_stride, 0);
+    StridedIterator ed_(idx_ptr, out_stride, axis_size);

    // Initialize with iota
    std::iota(st_, ed_, IdxT(0));

    // Sort according to vals
-    StridedIterator st(idx_ptr, axis_stride, 0);
-    StridedIterator md(idx_ptr, axis_stride, kth);
-    StridedIterator ed(idx_ptr, axis_stride, axis_size);
+    StridedIterator st(idx_ptr, out_stride, 0);
+    StridedIterator md(idx_ptr, out_stride, kth);
+    StridedIterator ed(idx_ptr, out_stride, axis_size);

-    std::nth_element(st, md, ed, [data_ptr, axis_stride](IdxT a, IdxT b) {
-      auto v1 = data_ptr[a * axis_stride];
-      auto v2 = data_ptr[b * axis_stride];
+    std::nth_element(st, md, ed, [data_ptr, in_stride](IdxT a, IdxT b) {
+      auto v1 = data_ptr[a * in_stride];
+      auto v2 = data_ptr[b * in_stride];
      return v1 < v2 || (v1 == v2 && a < b);
    });
  }
--- a/mlx/backend/common/ternary.h
+++ b/mlx/backend/common/ternary.h
@@ -12,6 +12,7 @@ namespace {
 // TODO: Add support for more combinations of input types.
 enum class TernaryOpType {
  ScalarScalarScalar,
+  VectorVectorVector,
  General,
 };

@@ -20,6 +21,12 @@ get_ternary_op_type(const array& a, const array& b, const array& c) {
  TernaryOpType topt;
  if (a.data_size() == 1 && b.data_size() == 1 && c.data_size() == 1) {
    topt = TernaryOpType::ScalarScalarScalar;
+  } else if (
+      (a.flags().row_contiguous && b.flags().row_contiguous &&
+       c.flags().row_contiguous) ||
+      (a.flags().col_contiguous && b.flags().col_contiguous &&
+       c.flags().col_contiguous)) {
+    topt = TernaryOpType::VectorVectorVector;
  } else {
    topt = TernaryOpType::General;
  }
@@ -33,138 +40,77 @@ void set_ternary_op_output_data(
    array& out,
    TernaryOpType topt,
    bool donate_with_move = false) {
+  auto maybe_donate = [&out, donate_with_move](const array& x) {
+    if (is_donatable(x, out)) {
+      if (donate_with_move) {
+        out.move_shared_buffer(x);
+      } else {
+        out.copy_shared_buffer(x);
+      }
+      return true;
+    }
+    return false;
+  };
+
  switch (topt) {
    case TernaryOpType::ScalarScalarScalar:
      out.set_data(
          allocator::malloc_or_wait(out.itemsize()), 1, b.strides(), b.flags());
      break;
+    case TernaryOpType::VectorVectorVector:
+      if (!(maybe_donate(a) || maybe_donate(b) || maybe_donate(c))) {
+        out.set_data(
+            allocator::malloc_or_wait(out.itemsize() * b.data_size()),
+            b.data_size(),
+            b.strides(),
+            b.flags());
+      }
+      break;
    case TernaryOpType::General:
      out.set_data(allocator::malloc_or_wait(out.nbytes()));
      break;
  }
 }
+template <typename T1, typename T2, typename T3, typename U, typename Op, int D>
+void ternary_op_dims(
+    const T1* a,
+    const T2* b,
+    const T3* c,
+    U* out,
+    Op op,
+    const std::vector<int>& shape,
+    const std::vector<size_t>& a_strides,
+    const std::vector<size_t>& b_strides,
+    const std::vector<size_t>& c_strides,
+    const std::vector<size_t>& out_strides,
+    int axis) {
+  auto stride_a = a_strides[axis];
+  auto stride_b = b_strides[axis];
+  auto stride_c = c_strides[axis];
+  auto stride_out = out_strides[axis];
+  auto N = shape[axis];

-template <typename T1, typename T2, typename T3, typename U, typename Op>
-void ternary_op_dims1(
-    const array& a,
-    const array& b,
-    const array& c,
-    array& out,
-    Op op) {
-  const T1* a_ptr = a.data<T1>();
-  const T2* b_ptr = b.data<T2>();
-  const T3* c_ptr = c.data<T3>();
-
-  U* dst = out.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  size_t c_idx = 0;
-  for (size_t i = 0; i < out.size(); ++i) {
-    dst[i] = op(a_ptr[a_idx], b_ptr[b_idx], c_ptr[c_idx]);
-    a_idx += a.strides()[0];
-    b_idx += b.strides()[0];
-    c_idx += c.strides()[0];
-  }
-}
-
-template <typename T1, typename T2, typename T3, typename U, typename Op>
-void ternary_op_dims2(
-    const array& a,
-    const array& b,
-    const array& c,
-    array& out,
-    Op op) {
-  const T1* a_ptr = a.data<T1>();
-  const T2* b_ptr = b.data<T2>();
-  const T3* c_ptr = c.data<T3>();
-
-  U* dst = out.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  size_t c_idx = 0;
-  size_t out_idx = 0;
-  for (size_t i = 0; i < a.shape()[0]; ++i) {
-    for (size_t j = 0; j < a.shape()[1]; ++j) {
-      dst[out_idx++] = op(a_ptr[a_idx], b_ptr[b_idx], c_ptr[c_idx]);
-      a_idx += a.strides()[1];
-      b_idx += b.strides()[1];
-      c_idx += c.strides()[1];
+  for (int i = 0; i < N; i++) {
+    if constexpr (D > 1) {
+      ternary_op_dims<T1, T2, T3, U, Op, D - 1>(
+          a,
+          b,
+          c,
+          out,
+          op,
+          shape,
+          a_strides,
+          b_strides,
+          c_strides,
+          out_strides,
+          axis + 1);
+    } else {
+      *out = op(*a, *b, *c);
    }
-    a_idx += a.strides()[0] - a.strides()[1] * a.shape()[1];
-    b_idx += b.strides()[0] - b.strides()[1] * b.shape()[1];
-    c_idx += c.strides()[0] - c.strides()[1] * c.shape()[1];
-  }
-}
-
-template <typename T1, typename T2, typename T3, typename U, typename Op>
-void ternary_op_dims3(
-    const array& a,
-    const array& b,
-    const array& c,
-    array& out,
-    Op op) {
-  const T1* a_ptr = a.data<T1>();
-  const T2* b_ptr = b.data<T2>();
-  const T3* c_ptr = c.data<T3>();
-  U* dst = out.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  size_t c_idx = 0;
-  size_t out_idx = 0;
-  for (size_t i = 0; i < a.shape()[0]; ++i) {
-    for (size_t j = 0; j < a.shape()[1]; ++j) {
-      for (size_t k = 0; k < a.shape()[2]; ++k) {
-        dst[out_idx++] = op(a_ptr[a_idx], b_ptr[b_idx], c_ptr[c_idx]);
-        a_idx += a.strides()[2];
-        b_idx += b.strides()[2];
-        c_idx += c.strides()[2];
-      }
-      a_idx += a.strides()[1] - a.strides()[2] * a.shape()[2];
-      b_idx += b.strides()[1] - b.strides()[2] * b.shape()[2];
-      c_idx += c.strides()[1] - c.strides()[2] * c.shape()[2];
-    }
-    a_idx += a.strides()[0] - a.strides()[1] * a.shape()[1];
-    b_idx += b.strides()[0] - b.strides()[1] * b.shape()[1];
-    c_idx += c.strides()[0] - c.strides()[1] * c.shape()[1];
-  }
-}
-
-template <typename T1, typename T2, typename T3, typename U, typename Op>
-void ternary_op_dims4(
-    const array& a,
-    const array& b,
-    const array& c,
-    array& out,
-    Op op) {
-  const T1* a_ptr = a.data<T1>();
-  const T2* b_ptr = b.data<T2>();
-  const T3* c_ptr = c.data<T3>();
-
-  U* dst = out.data<U>();
-  size_t a_idx = 0;
-  size_t b_idx = 0;
-  size_t c_idx = 0;
-  size_t out_idx = 0;
-  for (size_t i = 0; i < a.shape()[0]; ++i) {
-    for (size_t j = 0; j < a.shape()[1]; ++j) {
-      for (size_t k = 0; k < a.shape()[2]; ++k) {
-        for (size_t ii = 0; ii < a.shape()[3]; ++ii) {
-          dst[out_idx++] = op(a_ptr[a_idx], b_ptr[b_idx], c_ptr[c_idx]);
-          a_idx += a.strides()[3];
-          b_idx += b.strides()[3];
-          c_idx += c.strides()[3];
-        }
-        a_idx += a.strides()[2] - a.strides()[3] * a.shape()[3];
-        b_idx += b.strides()[2] - b.strides()[3] * b.shape()[3];
-        c_idx += c.strides()[2] - c.strides()[3] * c.shape()[3];
-      }
-      a_idx += a.strides()[1] - a.strides()[2] * a.shape()[2];
-      b_idx += b.strides()[1] - b.strides()[2] * b.shape()[2];
-      c_idx += c.strides()[1] - c.strides()[2] * c.shape()[2];
-    }
-    a_idx += a.strides()[0] - a.strides()[1] * a.shape()[1];
-    b_idx += b.strides()[0] - b.strides()[1] * b.shape()[1];
-    c_idx += c.strides()[0] - c.strides()[1] * c.shape()[1];
+    a += stride_a;
+    b += stride_b;
+    c += stride_c;
+    out += stride_out;
  }
 }

@@ -175,30 +121,69 @@ void ternary_op_dispatch_dims(
    const array& c,
    array& out,
    Op op) {
-  switch (out.ndim()) {
-    case 1:
-      ternary_op_dims1<T1, T2, T3, U, Op>(a, b, c, out, op);
-      return;
-    case 2:
-      ternary_op_dims2<T1, T2, T3, U, Op>(a, b, c, out, op);
-      return;
-    case 3:
-      ternary_op_dims3<T1, T2, T3, U, Op>(a, b, c, out, op);
-      return;
-    case 4:
-      ternary_op_dims4<T1, T2, T3, U, Op>(a, b, c, out, op);
-      return;
-  }
+  auto [shape, strides] = collapse_contiguous_dims(
+      a.shape(), {a.strides(), b.strides(), c.strides(), out.strides()});
+  const auto& a_strides = strides[0];
+  const auto& b_strides = strides[1];
+  const auto& c_strides = strides[2];
+  const auto& out_strides = strides[3];

  const T1* a_ptr = a.data<T1>();
  const T2* b_ptr = b.data<T2>();
  const T3* c_ptr = c.data<T3>();
-  U* dst = out.data<U>();
-  for (size_t i = 0; i < out.size(); i++) {
-    int a_idx = elem_to_loc(i, a.shape(), a.strides());
-    int b_idx = elem_to_loc(i, b.shape(), b.strides());
-    int c_idx = elem_to_loc(i, c.shape(), c.strides());
-    dst[i] = op(a_ptr[a_idx], b_ptr[b_idx], c_ptr[c_idx]);
+  U* out_ptr = out.data<T3>();
+  int ndim = shape.size();
+  switch (ndim) {
+    case 1:
+      ternary_op_dims<T1, T2, T3, U, Op, 1>(
+          a_ptr,
+          b_ptr,
+          c_ptr,
+          out_ptr,
+          op,
+          shape,
+          a_strides,
+          b_strides,
+          c_strides,
+          out_strides,
+          0);
+      return;
+    case 2:
+      ternary_op_dims<T1, T2, T3, U, Op, 2>(
+          a_ptr,
+          b_ptr,
+          c_ptr,
+          out_ptr,
+          op,
+          shape,
+          a_strides,
+          b_strides,
+          c_strides,
+          out_strides,
+          0);
+      return;
+  }
+
+  ContiguousIterator<size_t> a_it(shape, a_strides, ndim - 2);
+  ContiguousIterator<size_t> b_it(shape, b_strides, ndim - 2);
+  ContiguousIterator<size_t> c_it(shape, c_strides, ndim - 2);
+  size_t stride = out_strides[ndim - 3];
+  for (size_t elem = 0; elem < a.size(); elem += stride) {
+    ternary_op_dims<T1, T2, T3, U, Op, 2>(
+        a_ptr + a_it.loc,
+        b_ptr + b_it.loc,
+        c_ptr + c_it.loc,
+        out_ptr + elem,
+        op,
+        shape,
+        a_strides,
+        b_strides,
+        c_strides,
+        out_strides,
+        ndim - 2);
+    a_it.step();
+    b_it.step();
+    c_it.step();
  }
 }

@@ -215,10 +200,21 @@ void ternary_op(
  // The full computation is scalar-scalar-scalar so we call the base op once.
  if (topt == TernaryOpType::ScalarScalarScalar) {
    *(out.data<U>()) = op(*a.data<T1>(), *b.data<T2>(), *c.data<T3>());
-    return;
+  } else if (topt == TernaryOpType::VectorVectorVector) {
+    const T1* a_ptr = a.data<T1>();
+    const T2* b_ptr = b.data<T2>();
+    const T3* c_ptr = c.data<T3>();
+    U* out_ptr = out.data<U>();
+    for (size_t i = 0; i < out.size(); ++i) {
+      *out_ptr = op(*a_ptr, *b_ptr, *c_ptr);
+      a_ptr++;
+      b_ptr++;
+      c_ptr++;
+      out_ptr++;
+    }
+  } else {
+    ternary_op_dispatch_dims<T1, T2, T3, U>(a, b, c, out, op);
  }
-
-  ternary_op_dispatch_dims<T1, T2, T3, U>(a, b, c, out, op);
 }

 } // namespace
--- a/mlx/backend/common/unary.h
+++ b/mlx/backend/common/unary.h
@@ -12,7 +12,7 @@ namespace mlx::core {
 namespace {

 void set_unary_output_data(const array& in, array& out) {
-  if (in.is_donatable() && in.itemsize() == out.itemsize()) {
+  if (is_donatable(in, out)) {
    out.copy_shared_buffer(in);
  } else {
    auto size = in.data_size();
@@ -24,6 +24,14 @@ void set_unary_output_data(const array& in, array& out) {
  }
 }

+template <typename T, typename Op>
+void unary_op(const T* a, T* out, Op op, size_t shape, size_t stride) {
+  for (size_t i = 0; i < shape; i += 1) {
+    out[i] = op(*a);
+    a += stride;
+  }
+}
+
 template <typename T, typename Op>
 void unary_op(const array& a, array& out, Op op) {
  const T* a_ptr = a.data<T>();
@@ -36,10 +44,16 @@ void unary_op(const array& a, array& out, Op op) {
  } else {
    out.set_data(allocator::malloc_or_wait(out.nbytes()));
    T* dst = out.data<T>();
-    for (size_t i = 0; i < out.size(); ++i) {
-      // TODO this is super inefficient, need to fix.
-      int a_idx = elem_to_loc(i, a.shape(), a.strides());
-      dst[i] = op(a_ptr[a_idx]);
+    size_t shape = a.ndim() > 0 ? a.shape(-1) : 1;
+    size_t stride = a.ndim() > 0 ? a.strides(-1) : 1;
+    if (a.ndim() <= 1) {
+      unary_op(a_ptr, dst, op, shape, stride);
+      return;
+    }
+    ContiguousIterator it(a.shape(), a.strides(), a.ndim() - 1);
+    for (size_t elem = 0; elem < a.size(); elem += shape) {
+      unary_op(a_ptr + it.loc, dst + elem, op, shape, stride);
+      it.step();
    }
  }
 }
--- a/mlx/backend/common/utils.cpp
+++ b/mlx/backend/common/utils.cpp
@@ -0,0 +1,138 @@
+// Copyright © 2023-2024 Apple Inc.
+
+#include "mlx/backend/common/utils.h"
+
+namespace mlx::core {
+
+template <typename StrideT>
+std::tuple<std::vector<int>, std::vector<std::vector<StrideT>>>
+collapse_contiguous_dims_impl(
+    const std::vector<int>& shape,
+    const std::vector<std::vector<StrideT>>& strides,
+    StrideT size_cap) {
+  // Make a vector that has axes separated with -1. Collapse all axes between
+  // -1.
+  std::vector<int> to_collapse;
+  if (shape.size() > 0) {
+    if (shape[0] != 1) {
+      to_collapse.push_back(0);
+    }
+    size_t size = shape[0];
+    for (int i = 1; i < shape.size(); i++) {
+      bool contiguous = true;
+      size *= shape[i];
+      for (const std::vector<StrideT>& st : strides) {
+        if (st[i] * shape[i] != st[i - 1] || size > size_cap) {
+          contiguous = false;
+          size = shape[i];
+          break;
+        }
+      }
+      if (!contiguous) {
+        to_collapse.push_back(-1);
+      }
+      if (shape[i] != 1) {
+        to_collapse.push_back(i);
+      }
+    }
+    to_collapse.push_back(-1);
+  }
+
+  std::vector<int> out_shape;
+  std::vector<std::vector<StrideT>> out_strides(strides.size());
+  for (int i = 0;;) {
+    while (i < to_collapse.size() && to_collapse[i] == -1) {
+      ++i;
+    };
+    if (i == to_collapse.size()) {
+      break;
+    }
+    int current_shape = shape[to_collapse[i]];
+    int k = i;
+    while (to_collapse[++k] != -1) {
+      current_shape *= shape[to_collapse[k]];
+    }
+    out_shape.push_back(current_shape);
+    for (int j = 0; j < strides.size(); j++) {
+      const std::vector<StrideT>& st = strides[j];
+      out_strides[j].push_back(st[to_collapse[k - 1]]);
+    }
+    i = k + 1;
+  }
+
+  if (!shape.empty() && out_shape.empty()) {
+    out_shape.push_back(1);
+    for (auto& out_stride : out_strides) {
+      out_stride.push_back(0);
+    }
+  }
+  return std::make_tuple(out_shape, out_strides);
+}
+
+std::tuple<std::vector<int>, std::vector<std::vector<int64_t>>>
+collapse_contiguous_dims(
+    const std::vector<int>& shape,
+    const std::vector<std::vector<int64_t>>& strides,
+    int64_t size_cap /* = std::numeric_limits<int32_t>::max() */) {
+  return collapse_contiguous_dims_impl(shape, strides, size_cap);
+}
+
+std::tuple<std::vector<int>, std::vector<std::vector<size_t>>>
+collapse_contiguous_dims(
+    const std::vector<int>& shape,
+    const std::vector<std::vector<size_t>>& strides,
+    size_t size_cap /* = std::numeric_limits<int32>::max() */) {
+  return collapse_contiguous_dims_impl(shape, strides, size_cap);
+}
+
+template <typename StrideT>
+std::pair<std::vector<int>, std::vector<StrideT>> collapse_contiguous_dims_impl(
+    const std::vector<int>& shape,
+    const std::vector<StrideT>& strides,
+    StrideT size_cap) {
+  std::vector<int> collapsed_shape;
+  std::vector<StrideT> collapsed_strides;
+
+  if (shape.size() > 0) {
+    collapsed_shape.push_back(shape[0]);
+    collapsed_strides.push_back(strides[0]);
+    for (int i = 1; i < shape.size(); i++) {
+      if (shape[i] == 1) {
+        continue;
+      } else if (
+          strides[i] * shape[i] != collapsed_strides.back() ||
+          collapsed_shape.back() * static_cast<StrideT>(shape[i]) > size_cap) {
+        collapsed_shape.push_back(shape[i]);
+        collapsed_strides.push_back(strides[i]);
+      } else {
+        collapsed_shape.back() *= shape[i];
+        collapsed_strides.back() = strides[i];
+      }
+    }
+  }
+
+  return std::make_pair(collapsed_shape, collapsed_strides);
+}
+
+std::pair<std::vector<int>, std::vector<int64_t>> collapse_contiguous_dims(
+    const std::vector<int>& shape,
+    const std::vector<int64_t>& strides,
+    int64_t size_cap /* = std::numeric_limits<int32_t>::max() */) {
+  return collapse_contiguous_dims_impl<int64_t>(shape, strides, size_cap);
+}
+
+std::pair<std::vector<int>, std::vector<size_t>> collapse_contiguous_dims(
+    const std::vector<int>& shape,
+    const std::vector<size_t>& strides,
+    size_t size_cap /* = std::numeric_limits<int32_t>::max() */) {
+  return collapse_contiguous_dims_impl<size_t>(shape, strides, size_cap);
+}
+
+std::pair<std::vector<int>, std::vector<size_t>> collapse_contiguous_dims(
+    const array& a,
+    size_t size_cap /* = std::numeric_limits<int32_t>::max()*/) {
+  return collapse_contiguous_dims_impl<size_t>(
+      a.shape(), a.strides(), size_cap);
+}
+
+} // namespace mlx::core
--- a/mlx/backend/common/utils.h
+++ b/mlx/backend/common/utils.h
@@ -8,12 +8,12 @@

 namespace mlx::core {

-template <typename stride_t>
-inline stride_t elem_to_loc(
+template <typename StrideT>
+inline StrideT elem_to_loc(
    int elem,
    const std::vector<int>& shape,
-    const std::vector<stride_t>& strides) {
-  stride_t loc = 0;
+    const std::vector<StrideT>& strides) {
+  StrideT loc = 0;
  for (int i = shape.size() - 1; i >= 0; --i) {
    auto q_and_r = ldiv(elem, shape[i]);
    loc += q_and_r.rem * strides[i];
@@ -29,9 +29,9 @@ inline size_t elem_to_loc(int elem, const array& a) {
  return elem_to_loc(elem, a.shape(), a.strides());
 }

-template <typename stride_t>
-std::vector<stride_t> make_contiguous_strides(const std::vector<int>& shape) {
-  std::vector<stride_t> strides(shape.size(), 1);
+template <typename StrideT>
+std::vector<StrideT> make_contiguous_strides(const std::vector<int>& shape) {
+  std::vector<StrideT> strides(shape.size(), 1);
  for (int i = shape.size() - 1; i > 0; i--) {
    strides[i - 1] = strides[i] * shape[i];
  }
@@ -44,58 +44,26 @@ std::vector<stride_t> make_contiguous_strides(const std::vector<int>& shape) {
 //
 // When multiple arrays are passed they should all have the same shape. The
 // collapsed axes are also the same so one shape is returned.
-template <typename stride_t>
-inline std::tuple<std::vector<int>, std::vector<std::vector<stride_t>>>
+std::tuple<std::vector<int>, std::vector<std::vector<int64_t>>>
 collapse_contiguous_dims(
    const std::vector<int>& shape,
-    const std::vector<std::vector<stride_t>> strides) {
-  // Make a vector that has axes separated with -1. Collapse all axes between
-  // -1.
-  std::vector<int> to_collapse;
-  if (shape.size() > 0) {
-    to_collapse.push_back(0);
-    for (int i = 1; i < shape.size(); i++) {
-      bool contiguous = true;
-      for (const std::vector<stride_t>& st : strides) {
-        if (st[i] * shape[i] != st[i - 1]) {
-          contiguous = false;
-        }
-        if (!contiguous) {
-          break;
-        }
-      }
-      if (!contiguous) {
-        to_collapse.push_back(-1);
-      }
-      to_collapse.push_back(i);
-    }
-    to_collapse.push_back(-1);
-  }
-
-  std::vector<int> out_shape;
-  std::vector<std::vector<stride_t>> out_strides(strides.size());
-  for (int i = 0; i < to_collapse.size(); i++) {
-    int current_shape = shape[to_collapse[i]];
-    while (to_collapse[++i] != -1) {
-      current_shape *= shape[to_collapse[i]];
-    }
-    out_shape.push_back(current_shape);
-    for (int j = 0; j < strides.size(); j++) {
-      const std::vector<stride_t>& st = strides[j];
-      out_strides[j].push_back(st[to_collapse[i - 1]]);
-    }
-  }
-
-  return std::make_tuple(out_shape, out_strides);
-}
+    const std::vector<std::vector<int64_t>>& strides,
+    int64_t size_cap = std::numeric_limits<int32_t>::max());
+std::tuple<std::vector<int>, std::vector<std::vector<size_t>>>
+collapse_contiguous_dims(
+    const std::vector<int>& shape,
+    const std::vector<std::vector<size_t>>& strides,
+    size_t size_cap = std::numeric_limits<int32_t>::max());

 inline std::tuple<std::vector<int>, std::vector<std::vector<size_t>>>
-collapse_contiguous_dims(const std::vector<array>& xs) {
+collapse_contiguous_dims(
+    const std::vector<array>& xs,
+    size_t size_cap = std::numeric_limits<int32_t>::max()) {
  std::vector<std::vector<size_t>> strides;
  for (auto& x : xs) {
    strides.emplace_back(x.strides());
  }
-  return collapse_contiguous_dims(xs[0].shape(), strides);
+  return collapse_contiguous_dims(xs[0].shape(), strides, size_cap);
 }

 template <typename... Arrays, typename = enable_for_arrays_t<Arrays...>>
@@ -104,27 +72,110 @@ inline auto collapse_contiguous_dims(Arrays&&... xs) {
      std::vector<array>{std::forward<Arrays>(xs)...});
 }

-template <typename stride_t>
+// The single array version of the above.
+std::pair<std::vector<int>, std::vector<int64_t>> collapse_contiguous_dims(
+    const std::vector<int>& shape,
+    const std::vector<int64_t>& strides,
+    int64_t size_cap = std::numeric_limits<int32_t>::max());
+std::pair<std::vector<int>, std::vector<size_t>> collapse_contiguous_dims(
+    const std::vector<int>& shape,
+    const std::vector<size_t>& strides,
+    size_t size_cap = std::numeric_limits<int32_t>::max());
+std::pair<std::vector<int>, std::vector<size_t>> collapse_contiguous_dims(
+    const array& a,
+    size_t size_cap = std::numeric_limits<int32_t>::max());
+
+template <typename StrideT>
+struct ContiguousIterator {
+  inline void step() {
+    int dims = shape_.size();
+    if (dims == 0) {
+      return;
+    }
+    int i = dims - 1;
+    while (pos_[i] == (shape_[i] - 1) && i > 0) {
+      pos_[i] = 0;
+      loc -= (shape_[i] - 1) * strides_[i];
+      i--;
+    }
+    pos_[i]++;
+    loc += strides_[i];
+  }
+
+  void seek(StrideT n) {
+    loc = 0;
+    for (int i = shape_.size() - 1; i >= 0; --i) {
+      auto q_and_r = ldiv(n, shape_[i]);
+      loc += q_and_r.rem * strides_[i];
+      pos_[i] = q_and_r.rem;
+      n = q_and_r.quot;
+    }
+  }
+
+  void reset() {
+    loc = 0;
+    std::fill(pos_.begin(), pos_.end(), 0);
+  }
+
+  ContiguousIterator() {};
+
+  explicit ContiguousIterator(const array& a)
+      : shape_(a.shape()), strides_(a.strides()) {
+    if (!shape_.empty()) {
+      std::tie(shape_, strides_) = collapse_contiguous_dims(shape_, strides_);
+      pos_ = std::vector<int>(shape_.size(), 0);
+    }
+  }
+
+  explicit ContiguousIterator(
+      const std::vector<int>& shape,
+      const std::vector<StrideT>& strides,
+      int dims)
+      : shape_(shape.begin(), shape.begin() + dims),
+        strides_(strides.begin(), strides.begin() + dims) {
+    if (!shape_.empty()) {
+      std::tie(shape_, strides_) = collapse_contiguous_dims(shape_, strides_);
+      pos_ = std::vector<int>(shape_.size(), 0);
+    }
+  }
+
+  StrideT loc{0};
+
+ private:
+  std::vector<int> shape_;
+  std::vector<StrideT> strides_;
+  std::vector<int> pos_;
+};
+
+template <typename StrideT>
 inline auto check_contiguity(
    const std::vector<int>& shape,
-    const std::vector<stride_t>& strides) {
-  size_t data_size = 1;
+    const std::vector<StrideT>& strides) {
+  size_t no_broadcast_data_size = 1;
  size_t f_stride = 1;
  size_t b_stride = 1;
  bool is_row_contiguous = true;
  bool is_col_contiguous = true;

  for (int i = 0, ri = shape.size() - 1; ri >= 0; i++, ri--) {
-    is_row_contiguous &= strides[i] == f_stride || shape[i] == 1;
-    is_col_contiguous &= strides[ri] == b_stride || shape[ri] == 1;
+    is_col_contiguous &= strides[i] == f_stride || shape[i] == 1;
+    is_row_contiguous &= strides[ri] == b_stride || shape[ri] == 1;
    f_stride *= shape[i];
    b_stride *= shape[ri];
    if (strides[i] > 0) {
-      data_size *= shape[i];
+      no_broadcast_data_size *= shape[i];
    }
  }

-  return std::make_tuple(data_size, is_row_contiguous, is_col_contiguous);
+  return std::make_tuple(
+      no_broadcast_data_size, is_row_contiguous, is_col_contiguous);
+}
+
+inline bool is_donatable(const array& in, const array& out) {
+  constexpr size_t donation_extra = 16384;
+
+  return in.is_donatable() && in.itemsize() == out.itemsize() &&
+      in.buffer_size() <= out.nbytes() + donation_extra;
 }

 } // namespace mlx::core
--- a/mlx/backend/metal/CMakeLists.txt
+++ b/mlx/backend/metal/CMakeLists.txt
@@ -1,98 +1,56 @@
 function(make_jit_source SRC_FILE)
-  # This function takes a metal header file,
-  # runs the C preprocessesor on it, and makes
-  # the processed contents available as a string in a C++ function
+  # This function takes a metal header file, runs the C preprocessesor on it,
+  # and makes the processed contents available as a string in a C++ function
  # mlx::core::metal::${SRC_NAME}()
  #
-  # To use the function, declare it in jit/includes.h and
-  # include jit/includes.h.
+  # To use the function, declare it in jit/includes.h and include
+  # jit/includes.h.
  #
-  # Additional arguments to this function are treated as dependencies
-  # in the Cmake build system.
+  # Additional arguments to this function are treated as dependencies in the
+  # Cmake build system.
  get_filename_component(SRC_NAME ${SRC_FILE} NAME)
  add_custom_command(
-    OUTPUT  jit/${SRC_NAME}.cpp
-    COMMAND /bin/bash
-              ${CMAKE_CURRENT_SOURCE_DIR}/make_compiled_preamble.sh
-              ${CMAKE_CURRENT_BINARY_DIR}/jit
-              ${CMAKE_C_COMPILER}
-              ${PROJECT_SOURCE_DIR}
-              ${SRC_FILE}
-              "-DMLX_METAL_VERSION=${MLX_METAL_VERSION}"
-    DEPENDS make_compiled_preamble.sh
-            kernels/${SRC_FILE}.h
-            ${ARGN}
-  )
+    OUTPUT jit/${SRC_NAME}.cpp
+    COMMAND
+      /bin/bash ${CMAKE_CURRENT_SOURCE_DIR}/make_compiled_preamble.sh
+      ${CMAKE_CURRENT_BINARY_DIR}/jit ${CMAKE_C_COMPILER} ${PROJECT_SOURCE_DIR}
+      ${SRC_FILE} "-DMLX_METAL_VERSION=${MLX_METAL_VERSION}"
+    DEPENDS make_compiled_preamble.sh kernels/${SRC_FILE}.h ${ARGN})
  add_custom_target(${SRC_NAME} DEPENDS jit/${SRC_NAME}.cpp)
  add_dependencies(mlx ${SRC_NAME})
-  target_sources(
-    mlx
-    PRIVATE
-    ${CMAKE_CURRENT_BINARY_DIR}/jit/${SRC_NAME}.cpp
-  )
+  target_sources(mlx PRIVATE ${CMAKE_CURRENT_BINARY_DIR}/jit/${SRC_NAME}.cpp)
 endfunction(make_jit_source)

-make_jit_source(
-  utils
-  kernels/bf16.h
-  kernels/complex.h
-  kernels/defines.h
-)
-make_jit_source(
-  unary_ops
-  kernels/erf.h
-  kernels/expm1f.h
-)
+make_jit_source(utils kernels/bf16.h kernels/complex.h kernels/defines.h)
+make_jit_source(unary_ops kernels/erf.h kernels/expm1f.h)
 make_jit_source(binary_ops)
 make_jit_source(ternary_ops)
-make_jit_source(
-  reduce_utils
-  kernels/atomic.h
-  kernels/reduction/ops.h
-)
+make_jit_source(reduce_utils kernels/atomic.h kernels/reduction/ops.h)
 make_jit_source(scatter)
 make_jit_source(gather)
 make_jit_source(hadamard)

-if (MLX_METAL_JIT) 
-  target_sources(
-    mlx
-    PRIVATE
-    ${CMAKE_CURRENT_SOURCE_DIR}/jit_kernels.cpp
-  )
+if(MLX_METAL_JIT)
+  target_sources(mlx PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/jit_kernels.cpp)
  make_jit_source(arange)
  make_jit_source(copy)
  make_jit_source(unary)
  make_jit_source(binary)
  make_jit_source(binary_two)
-  make_jit_source(
-    fft
-    kernels/fft/radix.h
-    kernels/fft/readwrite.h
-  )
+  make_jit_source(fft kernels/fft/radix.h kernels/fft/readwrite.h)
  make_jit_source(ternary)
  make_jit_source(softmax)
  make_jit_source(scan)
  make_jit_source(sort)
  make_jit_source(
-    reduce
-    kernels/reduction/reduce_all.h
-    kernels/reduction/reduce_col.h
-    kernels/reduction/reduce_row.h
-  )
+    reduce kernels/reduction/reduce_all.h kernels/reduction/reduce_col.h
+    kernels/reduction/reduce_row.h kernels/reduction/reduce_init.h)
  make_jit_source(
-    steel/gemm/gemm
-    kernels/steel/utils.h
-    kernels/steel/gemm/loader.h
-    kernels/steel/gemm/mma.h
-    kernels/steel/gemm/params.h
-    kernels/steel/gemm/transforms.h
-  )
+    steel/gemm/gemm kernels/steel/utils.h kernels/steel/gemm/loader.h
+    kernels/steel/gemm/mma.h kernels/steel/gemm/params.h
+    kernels/steel/gemm/transforms.h)
  make_jit_source(steel/gemm/kernels/steel_gemm_fused)
-  make_jit_source(
-    steel/gemm/kernels/steel_gemm_masked
-    kernels/steel/defines.h
-  )
+  make_jit_source(steel/gemm/kernels/steel_gemm_masked kernels/steel/defines.h)
  make_jit_source(steel/gemm/kernels/steel_gemm_splitk)
  make_jit_source(
    steel/conv/conv
@@ -103,59 +61,51 @@ if (MLX_METAL_JIT)
    kernels/steel/conv/params.h
    kernels/steel/conv/loader.h
    kernels/steel/conv/loaders/loader_channel_l.h
-    kernels/steel/conv/loaders/loader_channel_n.h
-  )
-  make_jit_source(
-    steel/conv/kernels/steel_conv
-  )
-  make_jit_source(
-    steel/conv/kernels/steel_conv_general
-    kernels/steel/defines.h
-    kernels/steel/conv/loaders/loader_general.h
-  )
+    kernels/steel/conv/loaders/loader_channel_n.h)
+  make_jit_source(steel/conv/kernels/steel_conv)
+  make_jit_source(steel/conv/kernels/steel_conv_general kernels/steel/defines.h
+                  kernels/steel/conv/loaders/loader_general.h)
  make_jit_source(quantized)
+  make_jit_source(gemv_masked)
 else()
-  target_sources(
-    mlx
-    PRIVATE
-    ${CMAKE_CURRENT_SOURCE_DIR}/nojit_kernels.cpp
-  )
+  target_sources(mlx PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/nojit_kernels.cpp)
 endif()

 target_sources(
  mlx
-  PRIVATE
-  ${CMAKE_CURRENT_SOURCE_DIR}/allocator.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/binary.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/compiled.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/conv.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/copy.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/device.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/event.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/fft.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/hadamard.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/indexing.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/matmul.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/scaled_dot_product_attention.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/metal.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/primitives.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/quantized.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/normalization.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/rope.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/scan.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/slicing.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/softmax.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/sort.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/reduce.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/ternary.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/unary.cpp
-)
+  PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/allocator.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/binary.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/compiled.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/conv.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/copy.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/custom_kernel.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/distributed.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/device.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/event.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/fft.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/hadamard.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/indexing.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/matmul.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/scaled_dot_product_attention.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/metal.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/primitives.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/quantized.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/normalization.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/rope.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/scan.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/slicing.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/softmax.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/sort.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/reduce.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/ternary.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/unary.cpp
+          ${CMAKE_CURRENT_SOURCE_DIR}/utils.cpp)

-if (NOT MLX_METAL_PATH)
+if(NOT MLX_METAL_PATH)
  set(MLX_METAL_PATH ${CMAKE_CURRENT_BINARY_DIR}/kernels/)
 endif()

 add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/kernels)

-target_compile_definitions(
-  mlx PRIVATE METAL_PATH="${MLX_METAL_PATH}/mlx.metallib")
+target_compile_definitions(mlx
+                           PRIVATE METAL_PATH="${MLX_METAL_PATH}/mlx.metallib")
--- a/mlx/backend/metal/allocator.cpp
+++ b/mlx/backend/metal/allocator.cpp
@@ -241,6 +241,10 @@ void MetalAllocator::free(Buffer buffer) {
  }
 }

+size_t MetalAllocator::size(Buffer buffer) const {
+  return static_cast<MTL::Buffer*>(buffer.ptr())->length();
+}
+
 MetalAllocator& allocator() {
  // By creating the |allocator_| on heap, the destructor of MetalAllocator will
  // not be called on exit and all the buffers will be leaked. This is necessary
--- a/mlx/backend/metal/allocator.h
+++ b/mlx/backend/metal/allocator.h
@@ -56,6 +56,7 @@ class MetalAllocator : public allocator::Allocator {
 public:
  virtual Buffer malloc(size_t size, bool allow_swap = false) override;
  virtual void free(Buffer buffer) override;
+  virtual size_t size(Buffer buffer) const override;
  size_t get_active_memory() {
    return active_memory_;
  };
--- a/mlx/backend/metal/binary.cpp
+++ b/mlx/backend/metal/binary.cpp
@@ -19,12 +19,47 @@

 namespace mlx::core {

-constexpr int MAX_BINARY_SPECIALIZED_DIMS = 5;
+std::string get_kernel_name(
+    BinaryOpType bopt,
+    const std::string& op,
+    const array& a,
+    bool use_2d,
+    int ndim,
+    int work_per_thread) {
+  std::ostringstream kname;
+  switch (bopt) {
+    case BinaryOpType::ScalarScalar:
+      kname << "ss";
+      break;
+    case BinaryOpType::ScalarVector:
+      kname << (use_2d ? "sv2" : "sv");
+      break;
+    case BinaryOpType::VectorScalar:
+      kname << (use_2d ? "vs2" : "vs");
+      break;
+    case BinaryOpType::VectorVector:
+      kname << (use_2d ? "vv2" : "vv");
+      break;
+    case BinaryOpType::General:
+      kname << "g";
+      if (ndim <= 3) {
+        kname << ndim;
+      } else {
+        kname << "n";
+        if (work_per_thread > 1) {
+          kname << work_per_thread;
+        }
+      }
+      break;
+  }
+  kname << "_" << op << type_to_name(a);
+  return kname.str();
+}

 void binary_op_gpu_inplace(
    const std::vector<array>& inputs,
    std::vector<array>& outputs,
-    const std::string op,
+    const std::string& op,
    const Stream& s) {
  auto& a = inputs[0];
  auto& b = inputs[1];
@@ -36,79 +71,68 @@ void binary_op_gpu_inplace(
  }

  // Try to collapse contiguous dims
-  auto [shape, strides] = collapse_contiguous_dims(a, b, out);
-  auto& strides_a = strides[0];
-  auto& strides_b = strides[1];
-  auto& strides_out = strides[2];
-
-  std::string kernel_name;
-  {
-    std::ostringstream kname;
-    switch (bopt) {
-      case BinaryOpType::ScalarScalar:
-        kname << "ss";
-        break;
-      case BinaryOpType::ScalarVector:
-        kname << "sv";
-        break;
-      case BinaryOpType::VectorScalar:
-        kname << "vs";
-        break;
-      case BinaryOpType::VectorVector:
-        kname << "vv";
-        break;
-      case BinaryOpType::General:
-        kname << "g";
-        if (shape.size() <= MAX_BINARY_SPECIALIZED_DIMS) {
-          kname << shape.size();
-        } else {
-          kname << "n";
-        }
-        break;
+  auto maybe_collapse = [bopt, &a, &b, &out]() {
+    if (bopt == BinaryOpType::General) {
+      auto [shape, strides] = collapse_contiguous_dims(a, b, out);
+      return std::make_tuple(shape, strides[0], strides[1], strides[2]);
+    } else {
+      std::vector<size_t> e;
+      return std::make_tuple(std::vector<int>{}, e, e, e);
    }
-    kname << op << type_to_name(a);
-    kernel_name = kname.str();
-  }
+  };
+  auto [shape, strides_a, strides_b, strides_out] = maybe_collapse();

+  bool use_2d = out.data_size() > UINT32_MAX;
+  auto ndim = shape.size();
+  int work_per_thread =
+      (bopt == BinaryOpType::General && shape[ndim - 1] > 4) ? 4 : 1;
+  std::string kernel_name =
+      get_kernel_name(bopt, op, a, use_2d, shape.size(), work_per_thread);
  auto& d = metal::device(s.device);

-  auto kernel =
-      get_binary_two_kernel(d, kernel_name, a.dtype(), outputs[0].dtype(), op);
-
+  auto kernel = outputs.size() == 2
+      ? get_binary_two_kernel(d, kernel_name, a.dtype(), out.dtype(), op)
+      : get_binary_kernel(d, kernel_name, a.dtype(), out.dtype(), op);
  auto& compute_encoder = d.get_command_encoder(s.index);
  compute_encoder->setComputePipelineState(kernel);

  // - If a is donated it goes to the first output
  // - If b is donated it goes to the first output if a was not donated
-  //   otherwise it goes to the second output
+  //   otherwise it goes to the second output.
+  // - If there is only one output only one of a and b will be donated.
  bool donate_a = a.data_shared_ptr() == nullptr;
  bool donate_b = b.data_shared_ptr() == nullptr;
-  compute_encoder.set_input_array(donate_a ? outputs[0] : a, 0);
+  int arg_idx = 0;
+  compute_encoder.set_input_array(donate_a ? outputs[0] : a, arg_idx++);
  compute_encoder.set_input_array(
-      donate_b ? (donate_a ? outputs[1] : outputs[0]) : b, 1);
-  compute_encoder.set_output_array(outputs[0], 2);
-  compute_encoder.set_output_array(outputs[1], 3);
+      donate_b ? (donate_a ? outputs[1] : outputs[0]) : b, arg_idx++);
+  compute_encoder.set_output_array(outputs[0], arg_idx++);
+  if (outputs.size() == 2) {
+    compute_encoder.set_output_array(outputs[1], arg_idx++);
+  }

  if (bopt == BinaryOpType::General) {
-    auto ndim = shape.size();
-    if (ndim > 3) {
-      compute_encoder->setBytes(shape.data(), ndim * sizeof(int), 4);
-      compute_encoder->setBytes(strides_a.data(), ndim * sizeof(size_t), 5);
-      compute_encoder->setBytes(strides_b.data(), ndim * sizeof(size_t), 6);
-    } else {
-      // The shape is implicit in the grid for <= 3D
-      compute_encoder->setBytes(strides_a.data(), ndim * sizeof(size_t), 4);
-      compute_encoder->setBytes(strides_b.data(), ndim * sizeof(size_t), 5);
-    }
-
-    if (ndim > MAX_BINARY_SPECIALIZED_DIMS) {
-      compute_encoder->setBytes(&ndim, sizeof(int), 7);
-    }
-
    // Launch up to 3D grid of threads
    size_t dim0 = ndim > 0 ? shape[ndim - 1] : 1;
    size_t dim1 = ndim > 1 ? shape[ndim - 2] : 1;
    size_t rest = out.size() / (dim0 * dim1);
+
+    if (ndim > 3) {
+      compute_encoder->setBytes(shape.data(), ndim * sizeof(int), arg_idx++);
+      compute_encoder->setBytes(
+          strides_a.data(), ndim * sizeof(size_t), arg_idx++);
+      compute_encoder->setBytes(
+          strides_b.data(), ndim * sizeof(size_t), arg_idx++);
+      compute_encoder->setBytes(&ndim, sizeof(int), arg_idx++);
+      dim0 = (dim0 + work_per_thread - 1) / work_per_thread;
+    } else {
+      // The shape is implicit in the grid for <= 3D
+      compute_encoder->setBytes(
+          strides_a.data(), ndim * sizeof(size_t), arg_idx++);
+      compute_encoder->setBytes(
+          strides_b.data(), ndim * sizeof(size_t), arg_idx++);
+    }
+
    NS::UInteger thread_group_size = kernel->maxTotalThreadsPerThreadgroup();
    if (thread_group_size != 1024) {
      throw std::runtime_error("[Metal::binary] Must use 1024 sized block");
@@ -117,9 +141,10 @@ void binary_op_gpu_inplace(
    MTL::Size grid_dims = MTL::Size(dim0, dim1, rest);
    compute_encoder.dispatchThreads(grid_dims, group_dims);
  } else {
-    // Launch a 1D grid of threads
+    // Launch a 1D or 2D grid of threads
    size_t nthreads = out.data_size();
-    MTL::Size grid_dims = MTL::Size(nthreads, 1, 1);
+    MTL::Size grid_dims = use_2d ? get_2d_grid_dims(out.shape(), out.strides())
+                                 : MTL::Size(nthreads, 1, 1);
    NS::UInteger thread_group_size = kernel->maxTotalThreadsPerThreadgroup();
    if (thread_group_size > nthreads) {
      thread_group_size = nthreads;
@@ -132,7 +157,7 @@ void binary_op_gpu_inplace(
 void binary_op_gpu(
    const std::vector<array>& inputs,
    std::vector<array>& outputs,
-    const std::string op,
+    const std::string& op,
    const Stream& s) {
  assert(inputs.size() == 2);
  auto& a = inputs[0];
@@ -146,7 +171,7 @@ void binary_op_gpu(
 void binary_op_gpu(
    const std::vector<array>& inputs,
    std::vector<array>& outputs,
-    const std::string op) {
+    const std::string& op) {
  auto& s = outputs[0].primitive().stream();
  binary_op_gpu(inputs, outputs, op, s);
 }
@@ -154,106 +179,16 @@ void binary_op_gpu(
 void binary_op_gpu_inplace(
    const std::vector<array>& inputs,
    array& out,
-    const std::string op,
+    const std::string& op,
    const Stream& s) {
-  auto& a = inputs[0];
-  auto& b = inputs[1];
-  auto bopt = get_binary_op_type(a, b);
-  if (out.size() == 0) {
-    return;
-  }
-
-  // Try to collapse contiguous dims
-  auto [shape, strides] = collapse_contiguous_dims(a, b, out);
-  auto& strides_a = strides[0];
-  auto& strides_b = strides[1];
-  auto& strides_out = strides[2];
-
-  std::string kernel_name;
-  {
-    std::ostringstream kname;
-    switch (bopt) {
-      case BinaryOpType::ScalarScalar:
-        kname << "ss";
-        break;
-      case BinaryOpType::ScalarVector:
-        kname << "sv";
-        break;
-      case BinaryOpType::VectorScalar:
-        kname << "vs";
-        break;
-      case BinaryOpType::VectorVector:
-        kname << "vv";
-        break;
-      case BinaryOpType::General:
-        kname << "g";
-        if (shape.size() <= MAX_BINARY_SPECIALIZED_DIMS) {
-          kname << shape.size();
-        } else {
-          kname << "n";
-        }
-        break;
-    }
-    kname << op << type_to_name(a);
-    kernel_name = kname.str();
-  }
-
-  auto& d = metal::device(s.device);
-
-  auto kernel = get_binary_kernel(d, kernel_name, a.dtype(), out.dtype(), op);
-  auto& compute_encoder = d.get_command_encoder(s.index);
-  compute_encoder->setComputePipelineState(kernel);
-  bool donate_a = a.data_shared_ptr() == nullptr;
-  bool donate_b = b.data_shared_ptr() == nullptr;
-  compute_encoder.set_input_array(donate_a ? out : a, 0);
-  compute_encoder.set_input_array(donate_b ? out : b, 1);
-  compute_encoder.set_output_array(out, 2);
-
-  if (bopt == BinaryOpType::General) {
-    auto ndim = shape.size();
-    if (ndim > 3) {
-      compute_encoder->setBytes(shape.data(), ndim * sizeof(int), 3);
-      compute_encoder->setBytes(strides_a.data(), ndim * sizeof(size_t), 4);
-      compute_encoder->setBytes(strides_b.data(), ndim * sizeof(size_t), 5);
-    } else {
-      // The shape is implicit in the grid for <= 3D
-      compute_encoder->setBytes(strides_a.data(), ndim * sizeof(size_t), 3);
-      compute_encoder->setBytes(strides_b.data(), ndim * sizeof(size_t), 4);
-    }
-
-    if (ndim > MAX_BINARY_SPECIALIZED_DIMS) {
-      compute_encoder->setBytes(&ndim, sizeof(int), 6);
-    }
-
-    // Launch up to 3D grid of threads
-    size_t dim0 = ndim > 0 ? shape[ndim - 1] : 1;
-    size_t dim1 = ndim > 1 ? shape[ndim - 2] : 1;
-    size_t rest = out.size() / (dim0 * dim1);
-    NS::UInteger thread_group_size = kernel->maxTotalThreadsPerThreadgroup();
-    if (thread_group_size != 1024) {
-      throw std::runtime_error("[Metal::binary] Must use 1024 sized block");
-    }
-    auto group_dims = get_block_dims(dim0, dim1, rest);
-    MTL::Size grid_dims = MTL::Size(dim0, dim1, rest);
-    compute_encoder.dispatchThreads(grid_dims, group_dims);
-  } else {
-    // Launch a 1D grid of threads
-    size_t nthreads =
-        bopt == BinaryOpType::General ? out.size() : out.data_size();
-    MTL::Size grid_dims = MTL::Size(nthreads, 1, 1);
-    NS::UInteger thread_group_size = kernel->maxTotalThreadsPerThreadgroup();
-    if (thread_group_size > nthreads) {
-      thread_group_size = nthreads;
-    }
-    MTL::Size group_dims = MTL::Size(thread_group_size, 1, 1);
-    compute_encoder.dispatchThreads(grid_dims, group_dims);
-  }
+  std::vector<array> outputs = {out};
+  binary_op_gpu_inplace(inputs, outputs, op, s);
 }

 void binary_op_gpu(
    const std::vector<array>& inputs,
    array& out,
-    const std::string op,
+    const std::string& op,
    const Stream& s) {
  assert(inputs.size() == 2);
  auto& a = inputs[0];
@@ -266,7 +201,7 @@ void binary_op_gpu(
 void binary_op_gpu(
    const std::vector<array>& inputs,
    array& out,
-    const std::string op) {
+    const std::string& op) {
  auto& s = out.primitive().stream();
  binary_op_gpu(inputs, out, op, s);
 }
--- a/mlx/backend/metal/binary.h
+++ b/mlx/backend/metal/binary.h
@@ -9,25 +9,25 @@ namespace mlx::core {
 void binary_op_gpu(
    const std::vector<array>& inputs,
    std::vector<array>& outputs,
-    const std::string op,
+    const std::string& op,
    const Stream& s);

 void binary_op_gpu(
    const std::vector<array>& inputs,
    array& out,
-    const std::string op,
+    const std::string& op,
    const Stream& s);

 void binary_op_gpu_inplace(
    const std::vector<array>& inputs,
    std::vector<array>& outputs,
-    const std::string op,
+    const std::string& op,
    const Stream& s);

 void binary_op_gpu_inplace(
    const std::vector<array>& inputs,
    array& out,
-    const std::string op,
+    const std::string& op,
    const Stream& s);

 } // namespace mlx::core
--- a/mlx/backend/metal/compiled.cpp
+++ b/mlx/backend/metal/compiled.cpp
@@ -22,7 +22,8 @@ inline void build_kernel(
    const std::unordered_set<uintptr_t>& constant_ids,
    bool contiguous,
    int ndim,
-    bool dynamic_dims) {
+    bool dynamic_dims,
+    bool use_big_index = false) {
  // All outputs should have the exact same shape and will be row contiguous
  auto output_shape = outputs[0].shape();
  auto output_strides = outputs[0].strides();
@@ -84,9 +85,15 @@ inline void build_kernel(

  // The thread index in the whole grid
  os << "    uint3 pos [[thread_position_in_grid]]," << std::endl
-     << "    uint3 grid [[threads_per_grid]]) {" << std::endl
-     << "  uint index = pos.x + grid.x * (pos.y + grid.y * pos.z);"
-     << std::endl;
+     << "    uint3 grid [[threads_per_grid]]) {" << std::endl;
+  if (use_big_index) {
+    // This is only used for contiguous kernels which don't have
+    // a third grid dimension
+    os << "  size_t index = pos.x + grid.x * size_t(pos.y);";
+  } else {
+    os << "  uint index = pos.x + grid.x * (pos.y + grid.y * pos.z);";
+  }
+  os << std::endl;

  // Extract the indices per axis to individual uints if we have arrays that
  // are broadcasted or transposed
@@ -212,6 +219,17 @@ void Compiled::eval_gpu(
        /* contiguous = */ true,
        /* ndim = */ 0,
        /* dynamic_dims = */ false);
+    build_kernel(
+        kernel,
+        kernel_lib_ + "_contiguous_big",
+        inputs_,
+        outputs_,
+        tape_,
+        constant_ids_,
+        /* contiguous = */ true,
+        /* ndim = */ 0,
+        /* dynamic_dims = */ false,
+        /* use_big_index = */ true);
    for (int i = 1; i < 8; i++) {
      build_kernel(
          kernel,
@@ -285,7 +303,16 @@ void Compiled::eval_gpu(
      initial_strides.push_back(std::move(xstrides));
    }
    std::tie(shape, strides) =
-        collapse_contiguous_dims(output_shape, initial_strides);
+        collapse_contiguous_dims(output_shape, initial_strides, INT32_MAX);
+  }
+
+  bool use_2d = false;
+  if (contiguous) {
+    size_t max_size = 0;
+    for (auto& in : inputs) {
+      max_size = std::max(max_size, in.data_size());
+    }
+    use_2d = (max_size > UINT32_MAX);
  }

  // Get the kernel from the lib
@@ -298,6 +325,8 @@ void Compiled::eval_gpu(
    } else {
      kernel_name += std::to_string(shape.size());
    }
+  } else if (use_2d) {
+    kernel_name += "_big";
  }
  auto kernel = d.get_kernel(kernel_name, lib);
  auto& compute_encoder = d.get_command_encoder(s.index);
@@ -348,8 +377,10 @@ void Compiled::eval_gpu(

  // Launch the kernel
  if (contiguous) {
-    size_t nthreads = outputs[0].size();
-    MTL::Size grid_dims(nthreads, 1, 1);
+    size_t nthreads = outputs[0].data_size();
+    MTL::Size grid_dims = use_2d
+        ? get_2d_grid_dims(outputs[0].shape(), outputs[0].strides())
+        : MTL::Size(nthreads, 1, 1);
    MTL::Size group_dims(
        std::min(nthreads, kernel->maxTotalThreadsPerThreadgroup()), 1, 1);
    compute_encoder.dispatchThreads(grid_dims, group_dims);
--- a/mlx/backend/metal/conv.cpp
+++ b/mlx/backend/metal/conv.cpp
@@ -72,7 +72,7 @@ void explicit_gemm_conv_ND_gpu(
  wt_reshaped.copy_shared_buffer(wt, wt_restride, wt_flags, wt.data_size());

  // Perform gemm
-  std::vector<array> copies = {in_unfolded, wt_reshaped};
+  std::vector<array> copies = {in_unfolded};
  return steel_matmul(
      s,
      d,
@@ -155,22 +155,27 @@ void explicit_gemm_conv_group_ND_gpu(
  copy_gpu(wt_view, wt_transpose, CopyType::General, s);

  // Perform gemm
-  std::vector<array> copies = {in_unfolded, wt_view, wt_transpose};
-  return steel_matmul_conv_groups(
+  std::vector<array> copies = {in_unfolded, wt_transpose};
+  return steel_matmul_regular(
      s,
      d,
-      /*a = */ in_unfolded,
-      /*b = */ wt_transpose,
-      /*c = */ out,
-      /*M = */ implicit_M,
-      /*N = */ implicit_N,
-      /*K = */ implicit_K,
-      /*a_cols = */ implicit_K * groups,
-      /*b_cols = */ implicit_K,
-      /*out_cols = */ implicit_N * groups,
-      /*a_transposed = */ false,
-      /*b_transposed = */ true,
-      /* groups = */ groups,
+      /* a = */ in_unfolded,
+      /* b = */ wt_transpose,
+      /* c = */ out,
+      /* M = */ implicit_M,
+      /* N = */ implicit_N,
+      /* K = */ implicit_K,
+      /* batch_size_out = */ groups,
+      /* a_cols = */ implicit_K * groups,
+      /* b_cols = */ implicit_K,
+      /* out_cols = */ implicit_N * groups,
+      /* a_transposed = */ false,
+      /* b_transposed = */ true,
+      /* batch_shape = */ {1},
+      /* batch_strides = */ {0},
+      /* A_batch_strides = */ size_t(implicit_K),
+      /* B_batch_strides = */ size_t(implicit_N) * implicit_K,
+      /* matrix_stride_out = */ size_t(implicit_N),
      /*copies = */ copies);
 }

@@ -552,7 +557,7 @@ void winograd_conv_2D_gpu(

  // Fill with zeros
  array zero_arr = array(0, in.dtype());
-  copy_gpu(zero_arr, in_padded, CopyType::Scalar, s);
+  fill_gpu(zero_arr, in_padded, s);
  copies_w.push_back(zero_arr);

  // Pick input slice from padded
@@ -571,7 +576,6 @@ void winograd_conv_2D_gpu(

  copies_w.push_back(in_padded_slice);
  copies_w.push_back(in_padded);
-  copies_w.push_back(zero_arr);

  MLXConvParams<2> conv_params_updated{
      /* const int  N = */ in_padded.shape(0),
@@ -911,14 +915,16 @@ void Convolution::eval_gpu(const std::vector<array>& inputs, array& out) {
  // Throw error
  else {
    throw std::invalid_argument(
-        "[Convolution::eval_gpu] Only supports 1D or 2D convolutions.");
+        "[Convolution::eval_gpu] Only supports 1D, 2D or 3D convolutions.");
  }

  // Clear copies
-  if (copies.size() > 0) {
+  if (!copies.empty()) {
    auto command_buffer = d.get_command_buffer(s.index);
    command_buffer->addCompletedHandler(
-        [copies](MTL::CommandBuffer*) mutable { copies.clear(); });
+        [copies = std::move(copies)](MTL::CommandBuffer*) mutable {
+          copies.clear();
+        });
  }
 }

--- a/mlx/backend/metal/copy.cpp
+++ b/mlx/backend/metal/copy.cpp
@@ -10,7 +10,7 @@

 namespace mlx::core {

-constexpr int MAX_COPY_SPECIALIZED_DIMS = 5;
+constexpr int MAX_COPY_SPECIALIZED_DIMS = 3;

 void copy_gpu(const array& in, array& out, CopyType ctype, const Stream& s) {
  if (ctype == CopyType::Vector) {
@@ -59,21 +59,34 @@ void copy_gpu_inplace(
  }

  // Try to collapse contiguous dims
-  auto [shape, strides] = collapse_contiguous_dims(
-      data_shape, std::vector{strides_in_pre, strides_out_pre});
-  auto& strides_in_ = strides[0];
-  auto& strides_out_ = strides[1];
+  auto maybe_collapse =
+      [ctype, &data_shape, &strides_in_pre, &strides_out_pre]() {
+        if (ctype == CopyType::General || ctype == CopyType::GeneralGeneral) {
+          auto [shape, strides] = collapse_contiguous_dims(
+              data_shape,
+              std::vector{strides_in_pre, strides_out_pre},
+              /* size_cap = */ INT32_MAX);
+          return std::make_tuple(shape, strides[0], strides[1]);
+        } else {
+          std::vector<stride_t> e;
+          return std::make_tuple(std::vector<int>{}, e, e);
+        }
+      };
+  auto [shape, strides_in_, strides_out_] = maybe_collapse();
+  int ndim = shape.size();

+  bool use_2d = out.data_size() > UINT32_MAX;
  auto& d = metal::device(s.device);
+  int work_per_thread = 1;
  std::string kernel_name;
  {
    std::ostringstream kname;
    switch (ctype) {
      case CopyType::Scalar:
-        kname << "s";
+        kname << (use_2d ? "s2" : "s");
        break;
      case CopyType::Vector:
-        kname << "v";
+        kname << (use_2d ? "v2" : "v");
        break;
      case CopyType::General:
        kname << "g";
@@ -82,9 +95,13 @@ void copy_gpu_inplace(
        kname << "gg";
        break;
    }
-    if ((ctype == CopyType::General || ctype == CopyType::GeneralGeneral) &&
-        shape.size() <= MAX_COPY_SPECIALIZED_DIMS) {
-      kname << shape.size();
+    if (ctype == CopyType::General || ctype == CopyType::GeneralGeneral) {
+      if (shape.size() <= MAX_COPY_SPECIALIZED_DIMS) {
+        kname << shape.size();
+      } else if (shape[ndim - 1] >= 4) {
+        work_per_thread = 4;
+        kname << "n4";
+      }
    }
    kname << "_copy";
    kname << type_to_name(in) << type_to_name(out);
@@ -104,10 +121,8 @@ void copy_gpu_inplace(
  compute_encoder.set_output_array(out, 1, out_offset);

  if (ctype == CopyType::General || ctype == CopyType::GeneralGeneral) {
-    int ndim = shape.size();
    std::vector<int64_t> strides_in{strides_in_.begin(), strides_in_.end()};
    std::vector<int64_t> strides_out{strides_out_.begin(), strides_out_.end()};
-
    if (ndim > 3) {
      set_vector_bytes(compute_encoder, shape, ndim, 2);
    }
@@ -116,10 +131,6 @@ void copy_gpu_inplace(
      set_vector_bytes(compute_encoder, strides_out, ndim, 4);
    }

-    if (ndim > MAX_COPY_SPECIALIZED_DIMS) {
-      compute_encoder->setBytes(&ndim, sizeof(int), 5);
-    }
-
    int dim0 = ndim > 0 ? shape[ndim - 1] : 1;
    int dim1 = ndim > 1 ? shape[ndim - 2] : 1;

@@ -128,6 +139,11 @@ void copy_gpu_inplace(
      data_size *= s;
    int rest = data_size / (dim0 * dim1);

+    if (ndim > MAX_COPY_SPECIALIZED_DIMS) {
+      compute_encoder->setBytes(&ndim, sizeof(int), 5);
+      dim0 = (dim0 + work_per_thread - 1) / work_per_thread;
+    }
+
    // NB assuming thread_group_size is a power of 2 larger than 32 x 32
    NS::UInteger thread_group_size = kernel->maxTotalThreadsPerThreadgroup();
    if (thread_group_size != 1024) {
@@ -139,7 +155,8 @@ void copy_gpu_inplace(
    compute_encoder.dispatchThreads(grid_dims, group_dims);
  } else {
    size_t nthreads = out.data_size();
-    MTL::Size grid_dims = MTL::Size(nthreads, 1, 1);
+    MTL::Size grid_dims = use_2d ? get_2d_grid_dims(out.shape(), out.strides())
+                                 : MTL::Size(nthreads, 1, 1);
    NS::UInteger thread_group_size = kernel->maxTotalThreadsPerThreadgroup();
    if (thread_group_size > nthreads) {
      thread_group_size = nthreads;
@@ -154,6 +171,7 @@ void copy_gpu_inplace(
    array& out,
    CopyType ctype,
    const Stream& s) {
+  assert(in.shape() == out.shape());
  return copy_gpu_inplace(
      in, out, in.shape(), in.strides(), out.strides(), 0, 0, ctype, s);
 }
@@ -165,9 +183,37 @@ void copy_gpu_inplace(
    int64_t ioffset,
    CopyType ctype,
    const Stream& s) {
+  assert(in.shape() == out.shape());
  std::vector<int64_t> ostrides{out.strides().begin(), out.strides().end()};
  return copy_gpu_inplace(
      in, out, in.shape(), istride, ostrides, ioffset, 0, ctype, s);
 }

+void fill_gpu(const array& val, array& out, const Stream& s) {
+  if (out.size() == 0) {
+    return;
+  }
+  out.set_data(allocator::malloc_or_wait(out.nbytes()));
+  bool use_2d = out.data_size() > UINT32_MAX;
+  auto& d = metal::device(s.device);
+  std::string kernel_name = std::string(use_2d ? "s2" : "s") + "_copy" +
+      type_to_name(val) + type_to_name(out);
+  auto kernel = get_copy_kernel(d, kernel_name, val, out);
+  auto& compute_encoder = d.get_command_encoder(s.index);
+  compute_encoder->setComputePipelineState(kernel);
+
+  compute_encoder.set_input_array(val, 0);
+  compute_encoder.set_output_array(out, 1);
+
+  size_t nthreads = out.data_size();
+  MTL::Size grid_dims = use_2d ? get_2d_grid_dims(out.shape(), out.strides())
+                               : MTL::Size(nthreads, 1, 1);
+  NS::UInteger thread_group_size = kernel->maxTotalThreadsPerThreadgroup();
+  if (thread_group_size > nthreads) {
+    thread_group_size = nthreads;
+  }
+  MTL::Size group_dims = MTL::Size(thread_group_size, 1, 1);
+  compute_encoder.dispatchThreads(grid_dims, group_dims);
+}
+
 } // namespace mlx::core
--- a/mlx/backend/metal/copy.h
+++ b/mlx/backend/metal/copy.h
@@ -37,4 +37,7 @@ void copy_gpu_inplace(
    CopyType ctype,
    const Stream& s);

+// Fill the output with the scalar val
+void fill_gpu(const array& val, array& out, const Stream& s);
+
 } // namespace mlx::core
--- a/mlx/backend/metal/custom_kernel.cpp
+++ b/mlx/backend/metal/custom_kernel.cpp
@@ -0,0 +1,90 @@
+// Copyright © 2024 Apple Inc.
+
+#include "mlx/backend/metal/copy.h"
+#include "mlx/backend/metal/jit/includes.h"
+#include "mlx/backend/metal/utils.h"
+#include "mlx/fast_primitives.h"
+
+namespace mlx::core::fast {
+
+void CustomKernel::eval_gpu(
+    const std::vector<array>& inputs,
+    std::vector<array>& outputs) {
+  auto& s = stream();
+
+  std::vector<array> copies;
+
+  for (auto& out : outputs) {
+    out.set_data(allocator::malloc_or_wait(out.nbytes()));
+    if (init_value_) {
+      copies.emplace_back(init_value_.value(), out.dtype());
+      fill_gpu(copies.back(), out, s);
+    }
+  }
+
+  auto check_input = [&copies, &s, this](const array& x) -> const array {
+    bool no_copy = x.flags().row_contiguous;
+    if (!ensure_row_contiguous_ || no_copy) {
+      return x;
+    } else {
+      copies.push_back(array(x.shape(), x.dtype(), nullptr, {}));
+      copy_gpu(x, copies.back(), CopyType::General, s);
+      return copies.back();
+    }
+  };
+  std::vector<const array> checked_inputs;
+  for (const array& in : inputs) {
+    checked_inputs.push_back(check_input(in));
+  }
+
+  auto& d = metal::device(s.device);
+  const auto& lib_name = name_;
+  auto lib = d.get_library(lib_name);
+  if (lib == nullptr) {
+    lib = d.get_library(lib_name, metal::utils() + source_);
+  }
+  auto kernel = d.get_kernel(name_, lib);
+  auto& compute_encoder = d.get_command_encoder(s.index);
+  compute_encoder->setComputePipelineState(kernel);
+  int index = 0;
+  for (int i = 0; i < checked_inputs.size(); i++) {
+    const array& in = checked_inputs[i];
+    auto& shape_info = shape_infos_[i];
+    compute_encoder.set_input_array(in, index);
+    index++;
+    if (in.ndim() > 0) {
+      int ndim = in.ndim();
+      if (shape_info.shape) {
+        set_vector_bytes(compute_encoder, in.shape(), ndim, index);
+        index++;
+      }
+      if (shape_info.strides) {
+        set_vector_bytes(compute_encoder, in.strides(), ndim, index);
+        index++;
+      }
+      if (shape_info.ndim) {
+        compute_encoder->setBytes(&ndim, sizeof(int), index);
+        index++;
+      }
+    }
+  }
+  for (auto& out : outputs) {
+    compute_encoder.set_output_array(out, index);
+    index++;
+  }
+
+  const auto [tx, ty, tz] = threadgroup_;
+  MTL::Size group_dims = MTL::Size(tx, ty, tz);
+  const auto [gx, gy, gz] = grid_;
+  MTL::Size grid_dims = MTL::Size(gx, gy, gz);
+  compute_encoder->dispatchThreads(grid_dims, group_dims);
+
+  if (!copies.empty()) {
+    d.get_command_buffer(s.index)->addCompletedHandler(
+        [copies = std::move(copies)](MTL::CommandBuffer*) mutable {
+          copies.clear();
+        });
+  }
+}
+
+} // namespace mlx::core::fast
--- a/mlx/backend/metal/device.cpp
+++ b/mlx/backend/metal/device.cpp
@@ -1,8 +1,6 @@
 // Copyright © 2023-2024 Apple Inc.

-#include <dlfcn.h>
 #include <cstdlib>
-#include <filesystem>
 #include <sstream>

 #include <sys/sysctl.h>
@@ -14,11 +12,8 @@
 #include "mlx/backend/metal/device.h"
 #include "mlx/backend/metal/metal.h"
 #include "mlx/backend/metal/metal_impl.h"
-#include "mlx/backend/metal/mps/gemm.h"
 #include "mlx/backend/metal/utils.h"

-namespace fs = std::filesystem;
-
 namespace mlx::core::metal {

 namespace {
@@ -126,6 +121,49 @@ MTL::Library* load_library(

 } // namespace

+CommandEncoder::CommandEncoder(MTL::CommandBuffer* cbuf) : cbuf(cbuf) {
+  enc = cbuf->computeCommandEncoder(MTL::DispatchTypeConcurrent);
+  enc->retain();
+}
+
+CommandEncoder::~CommandEncoder() {
+  enc->endEncoding();
+  enc->release();
+}
+
+void CommandEncoder::set_input_array(
+    const array& a,
+    int idx,
+    int64_t offset /* = 0 */) {
+  auto r_buf = static_cast<MTL::Resource*>(const_cast<void*>(a.buffer().ptr()));
+  if (auto it = outputs.find(r_buf); it != outputs.end()) {
+    // Insert a barrier
+    enc->memoryBarrier(&r_buf, 1);
+
+    // Remove the output
+    outputs.erase(it);
+  }
+  auto a_buf = static_cast<const MTL::Buffer*>(a.buffer().ptr());
+  auto base_offset = a.data<char>() -
+      static_cast<char*>(const_cast<MTL::Buffer*>(a_buf)->contents());
+  base_offset += offset;
+  enc->setBuffer(a_buf, base_offset, idx);
+}
+
+void CommandEncoder::set_output_array(
+    array& a,
+    int idx,
+    int64_t offset /* = 0 */) {
+  // Add barriers before adding the output to the output set
+  set_input_array(a, idx, offset);
+  auto buf = static_cast<MTL::Resource*>(a.buffer().ptr());
+  if (concurrent) {
+    concurrent_outputs.insert(buf);
+  } else {
+    outputs.insert(buf);
+  }
+}
+
 void CommandEncoder::dispatchThreadgroups(
    MTL::Size grid_dims,
    MTL::Size group_dims) {
@@ -255,23 +293,13 @@ void Device::register_library(
  }
 }

-void Device::register_library(
-    const std::string& lib_name,
-    const std::function<std::string(const std::string&)>& lib_path_func) {
-  if (auto it = library_map_.find(lib_name); it == library_map_.end()) {
-    std::string new_lib_path = lib_path_func(lib_name);
-    auto new_lib = load_library(device_, lib_name, new_lib_path.c_str());
-    library_map_.insert({lib_name, new_lib});
-  }
-}
-
 MTL::Library* Device::get_library_cache_(const std::string& lib_name) {
  // Search for cached metal lib
  MTL::Library* mtl_lib;
  if (auto it = library_map_.find(lib_name); it != library_map_.end()) {
    mtl_lib = it->second;
  } else { // Look for metallib alongside library
-    register_library(lib_name);
+    register_library(lib_name, get_colocated_mtllib_path(lib_name));
    mtl_lib = library_map_[lib_name];
  }

--- a/mlx/backend/metal/device.h
+++ b/mlx/backend/metal/device.h
@@ -3,15 +3,14 @@
 #pragma once

 #include <Metal/Metal.hpp>
+#include <dlfcn.h>
+#include <filesystem>
 #include <functional>
 #include <mutex>
 #include <string>
 #include <unordered_map>
 #include <unordered_set>

-#include <dlfcn.h>
-#include <filesystem>
-
 #include "mlx/array.h"
 #include "mlx/device.h"

@@ -19,6 +18,8 @@ namespace fs = std::filesystem;

 namespace mlx::core::metal {

+// Note, this function must be left inline in a header so that it is not
+// dynamically linked.
 inline std::string get_colocated_mtllib_path(const std::string& lib_name) {
  Dl_info info;
  std::string mtllib_path;
@@ -37,10 +38,7 @@ using MTLFCList =
    std::vector<std::tuple<const void*, MTL::DataType, NS::UInteger>>;

 struct CommandEncoder {
-  CommandEncoder(MTL::CommandBuffer* cbuf) : cbuf(cbuf) {
-    enc = cbuf->computeCommandEncoder(MTL::DispatchTypeConcurrent);
-    enc->retain();
-  };
+  CommandEncoder(MTL::CommandBuffer* cbuf);
  CommandEncoder(const CommandEncoder&) = delete;
  CommandEncoder& operator=(const CommandEncoder&) = delete;

@@ -63,34 +61,8 @@ struct CommandEncoder {
    return enc;
  }

-  void set_input_array(const array& a, int idx, int64_t offset = 0) {
-    auto r_buf =
-        static_cast<MTL::Resource*>(const_cast<void*>(a.buffer().ptr()));
-    if (auto it = outputs.find(r_buf); it != outputs.end()) {
-      // Insert a barrier
-      enc->memoryBarrier(&r_buf, 1);
-
-      // Remove the output
-      outputs.erase(it);
-    }
-    auto a_buf = static_cast<const MTL::Buffer*>(a.buffer().ptr());
-    auto base_offset = a.data<char>() -
-        static_cast<char*>(const_cast<MTL::Buffer*>(a_buf)->contents());
-    base_offset += offset;
-    enc->setBuffer(a_buf, base_offset, idx);
-  }
-
-  void set_output_array(array& a, int idx, int64_t offset = 0) {
-    // Add barriers before adding the output to the output set
-    set_input_array(a, idx, offset);
-    auto buf = static_cast<MTL::Resource*>(a.buffer().ptr());
-    if (concurrent) {
-      concurrent_outputs.insert(buf);
-    } else {
-      outputs.insert(buf);
-    }
-  }
-
+  void set_input_array(const array& a, int idx, int64_t offset = 0);
+  void set_output_array(array& a, int idx, int64_t offset = 0);
  void dispatchThreadgroups(MTL::Size grid_dims, MTL::Size group_dims);
  void dispatchThreads(MTL::Size grid_dims, MTL::Size group_dims);

@@ -98,10 +70,7 @@ struct CommandEncoder {
    return ConcurrentContext(*this);
  }

-  ~CommandEncoder() {
-    enc->endEncoding();
-    enc->release();
-  }
+  ~CommandEncoder();

 private:
  void maybe_split();
@@ -136,10 +105,14 @@ class Device {
  void register_library(
      const std::string& lib_name,
      const std::string& lib_path);
-  void register_library(
-      const std::string& lib_name,
-      const std::function<std::string(const std::string&)>& lib_path_func =
-          get_colocated_mtllib_path);
+
+  // Note, this should remain in the header so that it is not dynamically
+  // linked
+  void register_library(const std::string& lib_name) {
+    if (auto it = library_map_.find(lib_name); it == library_map_.end()) {
+      register_library(lib_name, get_colocated_mtllib_path(lib_name));
+    }
+  }

  MTL::Library* get_library(const std::string& name);

--- a/mlx/backend/metal/distributed.cpp
+++ b/mlx/backend/metal/distributed.cpp
@@ -0,0 +1,142 @@
+// Copyright © 2024 Apple Inc.
+
+#include <cassert>
+
+#include "mlx/allocator.h"
+#include "mlx/backend/metal/device.h"
+#include "mlx/distributed/ops.h"
+#include "mlx/distributed/primitives.h"
+#include "mlx/scheduler.h"
+
+namespace mlx::core::distributed {
+
+void signal_and_wait(const array& in, const array& out, const Stream& s) {
+  auto& d = metal::device(s.device);
+  d.end_encoding(s.index);
+  auto command_buffer = d.get_command_buffer(s.index);
+  if (in.event().valid()) {
+    command_buffer->encodeSignalEvent(
+        static_cast<MTL::Event*>(in.event().raw_event().get()),
+        in.event().value());
+  }
+  command_buffer->encodeWait(
+      static_cast<MTL::Event*>(out.event().raw_event().get()),
+      out.event().value());
+}
+
+void AllReduce::eval_gpu(
+    const std::vector<array>& inputs,
+    std::vector<array>& outputs) {
+  assert(inputs.size() == 1);
+  assert(outputs.size() == 1);
+
+  auto& in = inputs[0];
+  auto& out = outputs[0];
+  if (in.is_donatable()) {
+    out.move_shared_buffer(in);
+  } else {
+    out.set_data(allocator::malloc_or_wait(out.nbytes()));
+  }
+
+  auto task = [in = in,
+               out = out,
+               reduce_type = reduce_type_,
+               group = group()]() mutable {
+    if (in.event().valid()) {
+      in.event().wait();
+    }
+    switch (reduce_type) {
+      case Sum:
+        distributed::detail::all_sum(
+            group, in.data_shared_ptr() == nullptr ? out : in, out);
+        break;
+      default:
+        throw std::runtime_error("Only all reduce sum is supported for now");
+    }
+    out.event().signal();
+  };
+  scheduler::enqueue(detail::communication_stream(), std::move(task));
+
+  signal_and_wait(in, out, stream());
+}
+
+void AllGather::eval_gpu(
+    const std::vector<array>& inputs,
+    std::vector<array>& outputs) {
+  assert(inputs.size() == 1);
+  assert(outputs.size() == 1);
+  auto& in = inputs[0];
+  auto& out = outputs[0];
+
+  out.set_data(allocator::malloc_or_wait(out.nbytes()));
+
+  auto task = [in = in, out = out, group = group()]() mutable {
+    if (in.event().valid()) {
+      in.event().wait();
+    }
+    distributed::detail::all_gather(group, in, out);
+    out.event().signal();
+  };
+  scheduler::enqueue(detail::communication_stream(), std::move(task));
+  signal_and_wait(in, out, stream());
+}
+
+void Send::eval_gpu(
+    const std::vector<array>& inputs,
+    std::vector<array>& outputs) {
+  assert(inputs.size() == 1);
+  assert(outputs.size() == 1);
+
+  auto& in = inputs[0];
+  auto& out = outputs[0];
+
+  // Schedule an async send on the comm stream
+  auto task = [in = in, out = out, group = group(), dst = dst_]() mutable {
+    if (in.event().valid()) {
+      in.event().wait();
+    }
+    distributed::detail::send(group, in, dst);
+    out.event().signal();
+  };
+  scheduler::enqueue(detail::communication_stream(), std::move(task));
+
+  // Encode a signal event for the input but not a wait since we don't need to
+  // wait on the output.
+  auto& s = stream();
+  auto& d = metal::device(s.device);
+  d.end_encoding(s.index);
+  auto command_buffer = d.get_command_buffer(s.index);
+  if (in.event().valid()) {
+    command_buffer->encodeSignalEvent(
+        static_cast<MTL::Event*>(in.event().raw_event().get()),
+        in.event().value());
+  }
+}
+
+void Recv::eval_gpu(
+    const std::vector<array>& inputs,
+    std::vector<array>& outputs) {
+  assert(inputs.size() == 0);
+  assert(outputs.size() == 1);
+
+  auto& out = outputs[0];
+
+  out.set_data(allocator::malloc_or_wait(out.nbytes()));
+
+  // Schedule an async recv on the comm stream
+  auto task = [out = out, group = group(), src = src_]() mutable {
+    distributed::detail::recv(group, out, src);
+    out.event().signal();
+  };
+  scheduler::enqueue(detail::communication_stream(), std::move(task));
+
+  // Encode a wait event as there is no input for the recv to encode a signal.
+  auto& s = stream();
+  auto& d = metal::device(s.device);
+  auto command_buffer = d.get_command_buffer(s.index);
+  command_buffer->encodeWait(
+      static_cast<MTL::Event*>(out.event().raw_event().get()),
+      out.event().value());
+}
+
+} // namespace mlx::core::distributed
--- a/mlx/backend/metal/event.cpp
+++ b/mlx/backend/metal/event.cpp
@@ -27,4 +27,9 @@ void Event::signal() {
  static_cast<MTL::SharedEvent*>(raw_event().get())->setSignaledValue(value());
 }

+bool Event::is_signaled() const {
+  return static_cast<MTL::SharedEvent*>(raw_event().get())->signaledValue() >=
+      value();
+}
+
 } // namespace mlx::core
--- a/mlx/backend/metal/fft.cpp
+++ b/mlx/backend/metal/fft.cpp
@@ -1,4 +1,4 @@
-// Copyright © 2023 Apple Inc.
+// Copyright © 2024 Apple Inc.
 #include <cassert>
 #include <complex>
 #include <map>
@@ -12,8 +12,6 @@
 #include "mlx/backend/metal/slicing.h"
 #include "mlx/backend/metal/unary.h"
 #include "mlx/backend/metal/utils.h"
-#include "mlx/mlx.h"
-#include "mlx/primitives.h"
 #include "mlx/utils.h"

 namespace mlx::core {
@@ -548,8 +546,8 @@ void fft_op(
      auto [data_size, is_row_contiguous, is_col_contiguous] =
          check_contiguity(x.shape(), strides);

-      flags.col_contiguous = is_row_contiguous;
-      flags.row_contiguous = is_col_contiguous;
+      flags.col_contiguous = is_col_contiguous;
+      flags.row_contiguous = is_row_contiguous;
      flags.contiguous = data_size == x_copy.size();

      x_copy.set_data(
@@ -578,7 +576,9 @@ void fft_op(
  if (plan.four_step) {
    four_step_fft(in, out, axis, inverse, real, plan, copies, s);
    d.get_command_buffer(s.index)->addCompletedHandler(
-        [copies](MTL::CommandBuffer*) mutable { copies.clear(); });
+        [copies = std::move(copies)](MTL::CommandBuffer*) mutable {
+          copies.clear();
+        });
    return;
  }

@@ -743,8 +743,13 @@ void fft_op(
        MTL::Size(batch_size, threadgroup_batch_size, threads_per_fft);
    compute_encoder->dispatchThreads(grid_dims, group_dims);
  }
-  d.get_command_buffer(s.index)->addCompletedHandler(
-      [copies](MTL::CommandBuffer*) mutable { copies.clear(); });
+
+  if (!copies.empty()) {
+    d.get_command_buffer(s.index)->addCompletedHandler(
+        [copies = std::move(copies)](MTL::CommandBuffer*) mutable {
+          copies.clear();
+        });
+  }
 }

 void fft_op(
@@ -786,10 +791,9 @@ void nd_fft_op(
    fft_op(in_arr, out_arr, axis, inverse, step_real, inplace, s);
  }

-  std::vector<array> copies = {temp1, temp2};
  auto& d = metal::device(s.device);
  d.get_command_buffer(s.index)->addCompletedHandler(
-      [copies](MTL::CommandBuffer*) mutable { copies.clear(); });
+      [temp_arrs](MTL::CommandBuffer*) mutable { temp_arrs.clear(); });
 }

 void FFT::eval_gpu(const std::vector<array>& inputs, array& out) {
--- a/mlx/backend/metal/hadamard.cpp
+++ b/mlx/backend/metal/hadamard.cpp
@@ -196,8 +196,12 @@ void Hadamard::eval_gpu(const std::vector<array>& inputs, array& out) {
        s);
  }

-  d.get_command_buffer(s.index)->addCompletedHandler(
-      [copies](MTL::CommandBuffer*) mutable { copies.clear(); });
+  if (!copies.empty()) {
+    d.get_command_buffer(s.index)->addCompletedHandler(
+        [copies = std::move(copies)](MTL::CommandBuffer*) mutable {
+          copies.clear();
+        });
+  }
 }

 } // namespace mlx::core
--- a/mlx/backend/metal/indexing.cpp
+++ b/mlx/backend/metal/indexing.cpp
@@ -95,11 +95,21 @@ void Gather::eval_gpu(const std::vector<array>& inputs, array& out) {
    slice_size *= s;
  }

-  // Launch 2D grid of threads: indices x slice
-  size_t dim0 = out.size() / slice_size;
-  size_t dim1 = slice_size;
-  auto group_dims = get_block_dims(dim0, dim1, 1);
-  MTL::Size grid_dims = MTL::Size(dim0, dim1, 1);
+  // Launch 3D grid of threads
+  // First two dimensions for the indices, the last one for the slice
+  size_t dim0 = 1;
+  size_t dim1 = 1;
+  if (nidx) {
+    if (inputs[1].ndim() >= 1) {
+      dim0 = inputs[1].shape(0);
+    }
+    if (inputs[1].ndim() >= 2) {
+      dim1 = inputs[1].size() / dim0;
+    }
+  }
+  size_t dim2 = slice_size;
+  auto group_dims = get_block_dims(dim0, dim1, dim2);
+  MTL::Size grid_dims = MTL::Size(dim0, dim1, dim2);

  // Collect all idx shapes and strides into one place
  std::vector<int> idx_shapes;
--- a/mlx/backend/metal/jit/copy.h
+++ b/mlx/backend/metal/jit/copy.h
@@ -1,100 +0,0 @@
-// Copyright © 2024 Apple Inc.
-
-constexpr std::string_view copy_kernels = R"(
-template [[host_name("s_{0}")]] [[kernel]] void copy_s<{1}, {2}>(
-    device const {1}* src [[buffer(0)]],
-    device {2}* dst [[buffer(1)]],
-    uint index [[thread_position_in_grid]]);
-template [[host_name("v_{0}")]] [[kernel]] void copy_v<{1}, {2}>(
-    device const {1}* src [[buffer(0)]],
-    device {2}* dst [[buffer(1)]],
-    uint index [[thread_position_in_grid]]);
-
-template [[host_name("g4_{0}")]] [[kernel]] void
-copy_g_nd<{1}, {2}, 4>(
-    device const {1}* src [[buffer(0)]],
-    device {2}* dst [[buffer(1)]],
-    constant const int* src_shape [[buffer(2)]],
-    constant const int64_t* src_strides [[buffer(3)]],
-    uint3 index [[thread_position_in_grid]],
-    uint3 grid_dim [[threads_per_grid]]);
-template [[host_name("gg4_{0}")]] [[kernel]] void
-copy_gg_nd<{1}, {2}, 4>(
-    device const {1}* src [[buffer(0)]],
-    device {2}* dst [[buffer(1)]],
-    constant const int* src_shape [[buffer(2)]],
-    constant const int64_t* src_strides [[buffer(3)]],
-    constant const int64_t* dst_strides [[buffer(4)]],
-    uint3 index [[thread_position_in_grid]]);
-template [[host_name("g5_{0}")]] [[kernel]] void
-copy_g_nd<{1}, {2}, 5>(
-    device const {1}* src [[buffer(0)]],
-    device {2}* dst [[buffer(1)]],
-    constant const int* src_shape [[buffer(2)]],
-    constant const int64_t* src_strides [[buffer(3)]],
-    uint3 index [[thread_position_in_grid]],
-    uint3 grid_dim [[threads_per_grid]]);
-template [[host_name("gg5_{0}")]] [[kernel]] void
-copy_gg_nd<{1}, {2}, 5>(
-    device const {1}* src [[buffer(0)]],
-    device {2}* dst [[buffer(1)]],
-    constant const int* src_shape [[buffer(2)]],
-    constant const int64_t* src_strides [[buffer(3)]],
-    constant const int64_t* dst_strides [[buffer(4)]],
-    uint3 index [[thread_position_in_grid]]);
-template [[host_name("g1_{0}")]] [[kernel]] void copy_g_nd1<{1}, {2}>(
-    device const {1}* src [[buffer(0)]],
-    device {2}* dst [[buffer(1)]],
-    constant const int64_t& src_stride [[buffer(3)]],
-    uint index [[thread_position_in_grid]]);
-template [[host_name("g2_{0}")]] [[kernel]] void copy_g_nd2<{1}, {2}>(
-    device const {1}* src [[buffer(0)]],
-    device {2}* dst [[buffer(1)]],
-    constant const int64_t* src_strides [[buffer(3)]],
-    uint2 index [[thread_position_in_grid]],
-    uint2 grid_dim [[threads_per_grid]]);
-template [[host_name("g3_{0}")]] [[kernel]] void copy_g_nd3<{1}, {2}>(
-    device const {1}* src [[buffer(0)]],
-    device {2}* dst [[buffer(1)]],
-    constant const int64_t* src_strides [[buffer(3)]],
-    uint3 index [[thread_position_in_grid]],
-    uint3 grid_dim [[threads_per_grid]]);
-template [[host_name("gg1_{0}")]] [[kernel]] void
-copy_gg_nd1<{1}, {2}>(
-    device const {1}* src [[buffer(0)]],
-    device {2}* dst [[buffer(1)]],
-    constant const int64_t& src_stride [[buffer(3)]],
-    constant const int64_t& dst_stride [[buffer(4)]],
-    uint index [[thread_position_in_grid]]);
-template [[host_name("gg2_{0}")]] [[kernel]] void
-copy_gg_nd2<{1}, {2}>(
-    device const {1}* src [[buffer(0)]],
-    device {2}* dst [[buffer(1)]],
-    constant const int64_t* src_strides [[buffer(3)]],
-    constant const int64_t* dst_strides [[buffer(4)]],
-    uint2 index [[thread_position_in_grid]]);
-template [[host_name("gg3_{0}")]] [[kernel]] void
-copy_gg_nd3<{1}, {2}>(
-    device const {1}* src [[buffer(0)]],
-    device {2}* dst [[buffer(1)]],
-    constant const int64_t* src_strides [[buffer(3)]],
-    constant const int64_t* dst_strides [[buffer(4)]],
-    uint3 index [[thread_position_in_grid]]);
-
-template [[host_name("g_{0}")]] [[kernel]] void copy_g<{1}, {2}>(
-    device const {1}* src [[buffer(0)]],
-    device {2}* dst [[buffer(1)]],
-    constant const int* src_shape [[buffer(2)]],
-    constant const int64_t* src_strides [[buffer(3)]],
-    constant const int& ndim [[buffer(5)]],
-    uint3 index [[thread_position_in_grid]],
-    uint3 grid_dim [[threads_per_grid]]);
-template [[host_name("gg_{0}")]] [[kernel]] void copy_gg<{1}, {2}>(
-    device const {1}* src [[buffer(0)]],
-    device {2}* dst [[buffer(1)]],
-    constant const int* src_shape [[buffer(2)]],
-    constant const int64_t* src_strides [[buffer(3)]],
-    constant const int64_t* dst_strides [[buffer(4)]],
-    constant const int& ndim [[buffer(5)]],
-    uint3 index [[thread_position_in_grid]]);
-)";
--- a/mlx/backend/metal/jit/gemv_masked.h
+++ b/mlx/backend/metal/jit/gemv_masked.h
@@ -0,0 +1,25 @@
+// Copyright © 2024 Apple Inc.
+
+constexpr std::string_view gemv_masked_kernel = R"(
+template [[host_name("{name}")]] [[kernel]] void
+gemv_{trans}masked<{itype}, {outm_t}, {opm_t}, {bm}, {bn}, {sm}, {sn}, {tm}, {tn}, {nc}>(
+    const device {itype}* mat [[buffer(0)]],
+    const device {itype}* in_vec [[buffer(1)]],
+    device {itype}* out_vec [[buffer(3)]],
+    const constant int& in_vec_size [[buffer(4)]],
+    const constant int& out_vec_size [[buffer(5)]],
+    const constant int& marix_ld [[buffer(6)]],
+    const constant int& batch_ndim [[buffer(9)]],
+    const constant int* batch_shape [[buffer(10)]],
+    const constant size_t* vector_batch_stride [[buffer(11)]],
+    const constant size_t* matrix_batch_stride [[buffer(12)]],
+    const device {outm_t}* out_mask [[buffer(20)]],
+    const device {opm_t}* mat_mask [[buffer(21)]],
+    const device {opm_t}* vec_mask [[buffer(22)]],
+    const constant int* mask_strides [[buffer(23)]],
+    const constant size_t* mask_batch_strides [[buffer(24)]],
+    uint3 tid [[threadgroup_position_in_grid]],
+    uint3 lid [[thread_position_in_threadgroup]],
+    uint simd_gid [[simdgroup_index_in_threadgroup]],
+    uint simd_lid [[thread_index_in_simdgroup]]);
+)";
--- a/mlx/backend/metal/jit/includes.h
+++ b/mlx/backend/metal/jit/includes.h
@@ -33,5 +33,6 @@ const char* steel_gemm_splitk();
 const char* conv();
 const char* steel_conv();
 const char* steel_conv_general();
+const char* gemv_masked();

 } // namespace mlx::core::metal
--- a/mlx/backend/metal/jit/indexing.h
+++ b/mlx/backend/metal/jit/indexing.h
@@ -13,8 +13,8 @@ constexpr std::string_view gather_kernels = R"(
    const constant size_t* idx_strides [[buffer(8)]],
    const constant int& idx_ndim [[buffer(9)]],
    {4}
-    uint2 index [[thread_position_in_grid]],
-    uint2 grid_dim [[threads_per_grid]]) {{
+    uint3 index [[thread_position_in_grid]],
+    uint3 grid_dim [[threads_per_grid]]) {{
  Indices<{2}, {3}> idxs{{
    {{ {5} }}, idx_shapes, idx_strides, idx_ndim}};

--- a/mlx/backend/metal/jit/reduce.h
+++ b/mlx/backend/metal/jit/reduce.h
@@ -1,168 +0,0 @@
-// Copyright © 2024 Apple Inc.
-
-constexpr std::string_view reduce_init_kernels = R"(
-[[kernel]] void {0}(
-    device {1}* out [[buffer(0)]],
-    uint tid [[thread_position_in_grid]]) {{
-  out[tid] = {2}<{1}>::init;
-}}
-)";
-
-constexpr std::string_view reduce_kernels = R"(
-template [[host_name("all_{0}")]] [[kernel]] void
-all_reduce<{1}, {2}, {3}<{2}>>(
-    const device {1}* in [[buffer(0)]],
-    device mlx_atomic<{2}>* out [[buffer(1)]],
-    const device size_t& in_size [[buffer(2)]],
-    uint gid [[thread_position_in_grid]],
-    uint lid [[thread_position_in_threadgroup]],
-    uint grid_size [[threads_per_grid]],
-    uint simd_per_group [[simdgroups_per_threadgroup]],
-    uint simd_lane_id [[thread_index_in_simdgroup]],
-    uint simd_group_id [[simdgroup_index_in_threadgroup]]);
-template [[host_name("colGeneral_{0}")]] [[kernel]] void
-col_reduce_general<{1}, {2}, {3}<{2}>>(
-    const device {1}* in [[buffer(0)]],
-    device mlx_atomic<{2}>* out [[buffer(1)]],
-    const constant size_t& reduction_size [[buffer(2)]],
-    const constant size_t& reduction_stride [[buffer(3)]],
-    const constant size_t& out_size [[buffer(4)]],
-    const constant int* shape [[buffer(5)]],
-    const constant size_t* strides [[buffer(6)]],
-    const constant int& ndim [[buffer(7)]],
-    threadgroup {2}* local_data [[threadgroup(0)]],
-    uint3 tid [[threadgroup_position_in_grid]],
-    uint3 lid [[thread_position_in_threadgroup]],
-    uint3 lsize [[threads_per_threadgroup]]);
-template [[host_name("colSmall_{0}")]] [[kernel]] void
-col_reduce_small<{1}, {2}, {3}<{2}>>(
-    const device {1}* in [[buffer(0)]],
-    device {2}* out [[buffer(1)]],
-    const constant size_t& reduction_size [[buffer(2)]],
-    const constant size_t& reduction_stride [[buffer(3)]],
-    const constant size_t& out_size [[buffer(4)]],
-    const constant int* shape [[buffer(5)]],
-    const constant size_t* strides [[buffer(6)]],
-    const constant int& ndim [[buffer(7)]],
-    const constant size_t& non_col_reductions [[buffer(8)]],
-    const constant int* non_col_shapes [[buffer(9)]],
-    const constant size_t* non_col_strides [[buffer(10)]],
-    const constant int& non_col_ndim [[buffer(11)]],
-    uint tid [[thread_position_in_grid]]);
-template [[host_name("rowGeneralSmall_{0}")]] [[kernel]] void
-row_reduce_general_small<{1}, {2}, {3}<{2}>>(
-    const device {1}* in [[buffer(0)]],
-    device {2}* out [[buffer(1)]],
-    const constant size_t& reduction_size [[buffer(2)]],
-    const constant size_t& out_size [[buffer(3)]],
-    const constant size_t& non_row_reductions [[buffer(4)]],
-    const constant int* shape [[buffer(5)]],
-    const constant size_t* strides [[buffer(6)]],
-    const constant int& ndim [[buffer(7)]],
-    uint lid [[thread_position_in_grid]]);
-template [[host_name("rowGeneralMed_{0}")]] [[kernel]] void
-row_reduce_general_med<{1}, {2}, {3}<{2}>>(
-    const device {1}* in [[buffer(0)]],
-    device {2}* out [[buffer(1)]],
-    const constant size_t& reduction_size [[buffer(2)]],
-    const constant size_t& out_size [[buffer(3)]],
-    const constant size_t& non_row_reductions [[buffer(4)]],
-    const constant int* shape [[buffer(5)]],
-    const constant size_t* strides [[buffer(6)]],
-    const constant int& ndim [[buffer(7)]],
-    uint tid [[threadgroup_position_in_grid]],
-    uint simd_lane_id [[thread_index_in_simdgroup]],
-    uint simd_per_group [[dispatch_simdgroups_per_threadgroup]],
-    uint simd_group_id [[simdgroup_index_in_threadgroup]]);
-template [[host_name("rowGeneral_{0}")]] [[kernel]] void
-row_reduce_general<{1}, {2}, {3}<{2}>>(
-    const device {1}* in [[buffer(0)]],
-    device mlx_atomic<{2}>* out [[buffer(1)]],
-    const constant size_t& reduction_size [[buffer(2)]],
-    const constant size_t& out_size [[buffer(3)]],
-    const constant size_t& non_row_reductions [[buffer(4)]],
-    const constant int* shape [[buffer(5)]],
-    const constant size_t* strides [[buffer(6)]],
-    const constant int& ndim [[buffer(7)]],
-    uint3 lid [[thread_position_in_threadgroup]],
-    uint3 lsize [[threads_per_threadgroup]],
-    uint3 tid [[threadgroup_position_in_grid]],
-    uint simd_lane_id [[thread_index_in_simdgroup]],
-    uint simd_per_group [[simdgroups_per_threadgroup]],
-    uint simd_group_id [[simdgroup_index_in_threadgroup]]);
-)";
-
-constexpr std::string_view reduce_non_atomic_kernels = R"(
-template [[host_name("allNoAtomics_{0}")]] [[kernel]] void
-all_reduce_no_atomics<{1}, {2}, {3}<{2}>>(
-    const device {1}* in [[buffer(0)]],
-    device {2}* out [[buffer(1)]],
-    const device size_t& in_size [[buffer(2)]],
-    uint gid [[thread_position_in_grid]],
-    uint lid [[thread_position_in_threadgroup]],
-    uint grid_size [[threads_per_grid]],
-    uint simd_per_group [[simdgroups_per_threadgroup]],
-    uint simd_lane_id [[thread_index_in_simdgroup]],
-    uint simd_group_id [[simdgroup_index_in_threadgroup]],
-    uint thread_group_id [[threadgroup_position_in_grid]]);
-
-template [[host_name("colGeneralNoAtomics_{0}")]] [[kernel]] void
-  col_reduce_general_no_atomics<{1}, {2}, {3}<{2}>>(
-      const device {1}* in [[buffer(0)]],
-      device {2}* out [[buffer(1)]],
-      const constant size_t& reduction_size [[buffer(2)]],
-      const constant size_t& reduction_stride [[buffer(3)]],
-      const constant size_t& out_size [[buffer(4)]],
-      const constant int* shape [[buffer(5)]],
-      const constant size_t* strides [[buffer(6)]],
-      const constant int& ndim [[buffer(7)]],
-      threadgroup {2}* local_data [[threadgroup(0)]],
-      uint3 tid [[threadgroup_position_in_grid]],
-      uint3 lid [[thread_position_in_threadgroup]],
-      uint3 gid [[thread_position_in_grid]],
-      uint3 lsize [[threads_per_threadgroup]],
-      uint3 gsize [[threads_per_grid]]);
-template [[host_name("colSmall_{0}")]] [[kernel]] void
-col_reduce_small<{1}, {2}, {3}<{2}>>(
-    const device {1}* in [[buffer(0)]],
-    device {2}* out [[buffer(1)]],
-    const constant size_t& reduction_size [[buffer(2)]],
-    const constant size_t& reduction_stride [[buffer(3)]],
-    const constant size_t& out_size [[buffer(4)]],
-    const constant int* shape [[buffer(5)]],
-    const constant size_t* strides [[buffer(6)]],
-    const constant int& ndim [[buffer(7)]],
-    const constant size_t& non_col_reductions [[buffer(8)]],
-    const constant int* non_col_shapes [[buffer(9)]],
-    const constant size_t* non_col_strides [[buffer(10)]],
-    const constant int& non_col_ndim [[buffer(11)]],
-    uint tid [[thread_position_in_grid]]);
-template [[host_name("rowGeneralSmall_{0}")]] [[kernel]] void
-row_reduce_general_small<{1}, {2}, {3}<{2}>>(
-    const device {1}* in [[buffer(0)]],
-    device {2}* out [[buffer(1)]],
-    const constant size_t& reduction_size [[buffer(2)]],
-    const constant size_t& out_size [[buffer(3)]],
-    const constant size_t& non_row_reductions [[buffer(4)]],
-    const constant int* shape [[buffer(5)]],
-    const constant size_t* strides [[buffer(6)]],
-    const constant int& ndim [[buffer(7)]],
-    uint lid [[thread_position_in_grid]]);
-template [[host_name("rowGeneralNoAtomics_{0}")]] [[kernel]] void
-row_reduce_general_no_atomics<{1}, {2}, {3}<{2}>>(
-    const device {1}* in [[buffer(0)]],
-    device {2}* out [[buffer(1)]],
-    const constant size_t& reduction_size [[buffer(2)]],
-    const constant size_t& out_size [[buffer(3)]],
-    const constant size_t& non_row_reductions [[buffer(4)]],
-    const constant int* shape [[buffer(5)]],
-    const constant size_t* strides [[buffer(6)]],
-    const constant int& ndim [[buffer(7)]],
-    uint3 lid [[thread_position_in_threadgroup]],
-    uint3 lsize [[threads_per_threadgroup]],
-    uint3 gsize [[threads_per_grid]],
-    uint3 tid [[threadgroup_position_in_grid]],
-    uint simd_lane_id [[thread_index_in_simdgroup]],
-    uint simd_per_group [[simdgroups_per_threadgroup]],
-    uint simd_group_id [[simdgroup_index_in_threadgroup]]);
-)";
--- a/mlx/backend/metal/jit_kernels.cpp
+++ b/mlx/backend/metal/jit_kernels.cpp
@@ -1,11 +1,9 @@
 // Copyright © 2024 Apple Inc.
-#include <map>

 #include "mlx/backend/common/compiled.h"
 #include "mlx/backend/metal/jit/arange.h"
-#include "mlx/backend/metal/jit/copy.h"
+#include "mlx/backend/metal/jit/gemv_masked.h"
 #include "mlx/backend/metal/jit/includes.h"
-#include "mlx/backend/metal/jit/reduce.h"
 #include "mlx/backend/metal/jit/scan.h"
 #include "mlx/backend/metal/jit/softmax.h"
 #include "mlx/backend/metal/jit/steel_conv.h"
@@ -44,16 +42,19 @@ MTL::ComputePipelineState* get_unary_kernel(
    const std::string& kernel_name,
    Dtype out_type,
    const std::string op) {
-  std::string lib_name = kernel_name.substr(1);
+  std::string lib_name = kernel_name.substr(kernel_name.find("_") + 1);
  auto lib = d.get_library(lib_name);
  if (lib == nullptr) {
    std::ostringstream kernel_source;
-    auto u_def = get_template_definition(
-        "v" + lib_name, "unary_v", get_type_string(out_type), op);
-    auto g_def = get_template_definition(
-        "g" + lib_name, "unary_g", get_type_string(out_type), op);
-    kernel_source << metal::utils() << metal::unary_ops() << metal::unary()
-                  << u_def << g_def;
+    kernel_source << metal::utils() << metal::unary_ops() << metal::unary();
+    kernel_source << get_template_definition(
+        "v_" + lib_name, "unary_v", get_type_string(out_type), op);
+    kernel_source << get_template_definition(
+        "v2_" + lib_name, "unary_v2", get_type_string(out_type), op);
+    kernel_source << get_template_definition(
+        "g_" + lib_name, "unary_g", get_type_string(out_type), op);
+    kernel_source << get_template_definition(
+        "gn4_" + lib_name, "unary_g", get_type_string(out_type), op, 4);
    lib = d.get_library(lib_name, kernel_source.str());
  }
  return d.get_kernel(kernel_name, lib);
@@ -65,39 +66,36 @@ void add_binary_kernels(
    Dtype out_type,
    const std::string op,
    std::ostringstream& kernel_source) {
-  const std::map<std::string, std::string> kernel_types = {
+  const std::array<std::pair<std::string, std::string>, 11> kernel_types = {{
      {"ss", "binary_ss"},
      {"vs", "binary_vs"},
      {"sv", "binary_sv"},
      {"vv", "binary_vv"},
+      {"vs2", "binary_vs2"},
+      {"sv2", "binary_sv2"},
+      {"vv2", "binary_vv2"},
      {"g1", "binary_g_nd1"},
      {"g2", "binary_g_nd2"},
      {"g3", "binary_g_nd3"},
-      {"g4", "binary_g_nd"},
-      {"g5", "binary_g_nd"},
      {"gn", "binary_g"},
-  };
-  for (auto [name, func] : kernel_types) {
+  }};
+  for (auto& [name, func] : kernel_types) {
    std::string template_def;
-    if (name == "g4" || name == "g5") {
-      int dim = std::stoi(name.substr(1));
-      template_def = get_template_definition(
-          name + lib_name,
-          func,
-          get_type_string(in_type),
-          get_type_string(out_type),
-          op,
-          dim);
-    } else {
-      template_def = get_template_definition(
-          name + lib_name,
-          func,
-          get_type_string(in_type),
-          get_type_string(out_type),
-          op);
-    }
+    template_def = get_template_definition(
+        name + "_" + lib_name,
+        func,
+        get_type_string(in_type),
+        get_type_string(out_type),
+        op);
    kernel_source << template_def;
  }
+  kernel_source << get_template_definition(
+      "gn4_" + lib_name,
+      "binary_g",
+      get_type_string(in_type),
+      get_type_string(out_type),
+      op,
+      4);
 }

 MTL::ComputePipelineState* get_binary_kernel(
@@ -106,7 +104,7 @@ MTL::ComputePipelineState* get_binary_kernel(
    Dtype in_type,
    Dtype out_type,
    const std::string op) {
-  std::string lib_name = kernel_name.substr(2);
+  std::string lib_name = kernel_name.substr(kernel_name.find("_") + 1);
  auto lib = d.get_library(lib_name);
  if (lib == nullptr) {
    std::ostringstream kernel_source;
@@ -123,7 +121,7 @@ MTL::ComputePipelineState* get_binary_two_kernel(
    Dtype in_type,
    Dtype out_type,
    const std::string op) {
-  std::string lib_name = kernel_name.substr(2);
+  std::string lib_name = kernel_name.substr(kernel_name.find("_") + 1);
  auto lib = d.get_library(lib_name);
  if (lib == nullptr) {
    std::ostringstream kernel_source;
@@ -144,28 +142,23 @@ MTL::ComputePipelineState* get_ternary_kernel(
  auto lib = d.get_library(lib_name);
  if (lib == nullptr) {
    std::ostringstream kernel_source;
-    const std::map<std::string, std::string> kernel_types = {
+    const std::array<std::pair<std::string, std::string>, 6> kernel_types = {{
        {"v", "ternary_v"},
+        {"v2", "ternary_v2"},
        {"g", "ternary_g"},
        {"g1", "ternary_g_nd1"},
        {"g2", "ternary_g_nd2"},
        {"g3", "ternary_g_nd3"},
-        {"g4", "ternary_g_nd"},
-        {"g5", "ternary_g_nd"},
-    };
+    }};
    kernel_source << metal::utils() << metal::ternary_ops() << metal::ternary();
-    for (auto [name, func] : kernel_types) {
+    for (auto& [name, func] : kernel_types) {
      std::string template_def;
-      if (name == "g4" || name == "g5") {
-        int dim = std::stoi(name.substr(1));
-        template_def = get_template_definition(
-            name + "_" + lib_name, func, get_type_string(type), op, dim);
-      } else {
-        template_def = get_template_definition(
-            name + "_" + lib_name, func, get_type_string(type), op);
-      }
+      template_def = get_template_definition(
+          name + "_" + lib_name, func, get_type_string(type), op);
      kernel_source << template_def;
    }
+    kernel_source << get_template_definition(
+        "gn4_" + lib_name, "ternary_g", get_type_string(type), op, 4);
    lib = d.get_library(lib_name, kernel_source.str());
  }
  return d.get_kernel(kernel_name, lib);
@@ -180,12 +173,31 @@ MTL::ComputePipelineState* get_copy_kernel(
  auto lib = d.get_library(lib_name);
  if (lib == nullptr) {
    std::ostringstream kernel_source;
-    kernel_source << metal::utils() << metal::copy()
-                  << fmt::format(
-                         copy_kernels,
-                         lib_name,
-                         get_type_string(in.dtype()),
-                         get_type_string(out.dtype()));
+    auto in_type = get_type_string(in.dtype());
+    auto out_type = get_type_string(out.dtype());
+    kernel_source
+        << metal::utils() << metal::copy()
+        << get_template_definition("s_" + lib_name, "copy_s", in_type, out_type)
+        << get_template_definition("v_" + lib_name, "copy_v", in_type, out_type)
+        << get_template_definition(
+               "g1_" + lib_name, "copy_g_nd1", in_type, out_type)
+        << get_template_definition(
+               "g2_" + lib_name, "copy_g_nd2", in_type, out_type)
+        << get_template_definition(
+               "g3_" + lib_name, "copy_g_nd3", in_type, out_type)
+        << get_template_definition("g_" + lib_name, "copy_g", in_type, out_type)
+        << get_template_definition(
+               "gn4_" + lib_name, "copy_g", in_type, out_type, 4)
+        << get_template_definition(
+               "gg1_" + lib_name, "copy_gg_nd1", in_type, out_type)
+        << get_template_definition(
+               "gg2_" + lib_name, "copy_gg_nd2", in_type, out_type)
+        << get_template_definition(
+               "gg3_" + lib_name, "copy_gg_nd3", in_type, out_type)
+        << get_template_definition(
+               "gg_" + lib_name, "copy_gg", in_type, out_type)
+        << get_template_definition(
+               "ggn4_" + lib_name, "copy_gg", in_type, out_type, 4);
    lib = d.get_library(lib_name, kernel_source.str());
  }
  return d.get_kernel(kernel_name, lib);
@@ -290,11 +302,11 @@ MTL::ComputePipelineState* get_mb_sort_kernel(
  if (lib == nullptr) {
    std::ostringstream kernel_source;
    kernel_source << metal::utils() << metal::sort();
-    std::vector<std::pair<std::string, std::string>> kernel_types = {
-        {"sort_", "mb_block_sort"},
-        {"partition_", "mb_block_partition"},
-        {"merge_", "mb_block_merge"}};
-    for (auto [name, func] : kernel_types) {
+    std::array<std::pair<std::string, std::string>, 3> kernel_types = {
+        {{"sort_", "mb_block_sort"},
+         {"partition_", "mb_block_partition"},
+         {"merge_", "mb_block_merge"}}};
+    for (auto& [name, func] : kernel_types) {
      kernel_source << get_template_definition(
          name + lib_name,
          func,
@@ -316,12 +328,13 @@ MTL::ComputePipelineState* get_reduce_init_kernel(
  auto lib = d.get_library(kernel_name);
  if (lib == nullptr) {
    std::ostringstream kernel_source;
-    kernel_source << metal::utils() << metal::reduce_utils()
-                  << fmt::format(
-                         reduce_init_kernels,
-                         kernel_name,
-                         get_type_string(out.dtype()),
-                         op_name(out));
+    std::string op_type = op_name(out);
+    op_type[0] = std::toupper(op_name(out)[0]);
+    auto out_type = get_type_string(out.dtype());
+    std::string op = op_type + "<" + out_type + ">";
+    kernel_source << metal::utils() << metal::reduce_utils() << metal::reduce();
+    kernel_source << get_template_definition(
+        kernel_name, "init_reduce", out_type, op);
    lib = d.get_library(kernel_name, kernel_source.str());
  }
  return d.get_kernel(kernel_name, lib);
@@ -330,27 +343,36 @@ MTL::ComputePipelineState* get_reduce_init_kernel(
 MTL::ComputePipelineState* get_reduce_kernel(
    metal::Device& d,
    const std::string& kernel_name,
+    const std::string& func_name,
    const std::string& op_name,
    const array& in,
-    const array& out) {
-  std::string lib_name = kernel_name.substr(kernel_name.find("_") + 1);
-  auto lib = d.get_library(lib_name);
+    const array& out,
+    int ndim /* = -1 */,
+    int bm /* = -1 */,
+    int bn /* = -1 */) {
+  auto lib = d.get_library(kernel_name);
  if (lib == nullptr) {
    std::string op_type = op_name;
    op_type[0] = std::toupper(op_name[0]);
-    bool non_atomic = out.dtype() == int64 || out.dtype() == uint64;
    std::ostringstream kernel_source;
-    kernel_source << metal::utils() << metal::reduce_utils() << metal::reduce()
-                  << fmt::format(
-                         non_atomic ? reduce_non_atomic_kernels
-                                    : reduce_kernels,
-                         lib_name,
-                         get_type_string(in.dtype()),
-                         get_type_string(out.dtype()),
-                         op_type);
-    lib = d.get_library(lib_name, kernel_source.str());
+    auto in_type = get_type_string(in.dtype());
+    auto out_type = get_type_string(out.dtype());
+    std::string op = op_type + "<" + out_type + ">";
+    kernel_source << metal::utils() << metal::reduce_utils() << metal::reduce();
+    if (bm >= 0) {
+      kernel_source << get_template_definition(
+          kernel_name, func_name, in_type, out_type, op, ndim, bm, bn);
+    } else if (ndim >= 0) {
+      kernel_source << get_template_definition(
+          kernel_name, func_name, in_type, out_type, op, ndim);
+    } else {
+      kernel_source << get_template_definition(
+          kernel_name, func_name, in_type, out_type, op);
+    }
+    lib = d.get_library(kernel_name, kernel_source.str());
  }
-  return d.get_kernel(kernel_name, lib);
+  auto st = d.get_kernel(kernel_name, lib);
+  return st;
 }

 MTL::ComputePipelineState* get_steel_gemm_fused_kernel(
@@ -496,6 +518,49 @@ MTL::ComputePipelineState* get_steel_gemm_masked_kernel(
  return d.get_kernel(kernel_name, lib);
 }

+MTL::ComputePipelineState* get_gemv_masked_kernel(
+    metal::Device& d,
+    const std::string& kernel_name,
+    const array& out,
+    const std::optional<array>& mask_out,
+    const std::optional<array>& mask_op,
+    bool transpose_mat,
+    int bm,
+    int bn,
+    int sm,
+    int sn,
+    int tm,
+    int tn,
+    bool contiguous) {
+  const auto& lib_name = kernel_name;
+  auto lib = d.get_library(lib_name);
+  if (lib == nullptr) {
+    std::ostringstream kernel_source;
+    auto out_mask_type = mask_out.has_value()
+        ? get_type_string((*mask_out).dtype())
+        : "nomask_t";
+    auto op_mask_type =
+        mask_op.has_value() ? get_type_string((*mask_op).dtype()) : "nomask_t";
+    kernel_source << metal::utils() << metal::gemv_masked()
+                  << fmt::format(
+                         gemv_masked_kernel,
+                         "name"_a = lib_name,
+                         "itype"_a = get_type_string(out.dtype()),
+                         "outm_t"_a = out_mask_type,
+                         "opm_t"_a = op_mask_type,
+                         "bm"_a = bm,
+                         "bn"_a = bn,
+                         "sm"_a = sm,
+                         "sn"_a = sn,
+                         "tm"_a = tm,
+                         "tn"_a = tn,
+                         "trans"_a = transpose_mat ? "t_" : "",
+                         "nc"_a = contiguous ? "0" : "1");
+    lib = d.get_library(lib_name, kernel_source.str());
+  }
+  return d.get_kernel(kernel_name, lib);
+}
+
 MTL::ComputePipelineState* get_steel_conv_kernel(
    metal::Device& d,
    const std::string& kernel_name,
--- a/mlx/backend/metal/kernels.h
+++ b/mlx/backend/metal/kernels.h
@@ -83,9 +83,13 @@ MTL::ComputePipelineState* get_reduce_init_kernel(
 MTL::ComputePipelineState* get_reduce_kernel(
    metal::Device& d,
    const std::string& kernel_name,
+    const std::string& func_name,
    const std::string& op_name,
    const array& in,
-    const array& out);
+    const array& out,
+    int ndim = -1,
+    int bm = -1,
+    int bn = -1);

 MTL::ComputePipelineState* get_steel_gemm_fused_kernel(
    metal::Device& d,
@@ -151,6 +155,21 @@ MTL::ComputePipelineState* get_steel_conv_kernel(
    int n_channel_specialization,
    bool small_filter);

+MTL::ComputePipelineState* get_gemv_masked_kernel(
+    metal::Device& d,
+    const std::string& kernel_name,
+    const array& out,
+    const std::optional<array>& mask_out,
+    const std::optional<array>& mask_op,
+    bool transpose_mat,
+    int bm,
+    int bn,
+    int sm,
+    int sn,
+    int tm,
+    int tn,
+    bool contiguous);
+
 MTL::ComputePipelineState* get_steel_conv_general_kernel(
    metal::Device& d,
    const std::string& kernel_name,
--- a/mlx/backend/metal/kernels/CMakeLists.txt
+++ b/mlx/backend/metal/kernels/CMakeLists.txt
@@ -1,147 +1,95 @@
-set(
-  BASE_HEADERS
-  bf16.h
-  bf16_math.h
-  complex.h
-  defines.h
-  expm1f.h
-  utils.h
-)
+set(BASE_HEADERS bf16.h bf16_math.h complex.h defines.h expm1f.h utils.h)

 function(build_kernel_base TARGET SRCFILE DEPS)
  set(METAL_FLAGS -Wall -Wextra -fno-fast-math)
  if(MLX_METAL_DEBUG)
-    set(METAL_FLAGS ${METAL_FLAGS}
-        -gline-tables-only
-        -frecord-sources)
+    set(METAL_FLAGS ${METAL_FLAGS} -gline-tables-only -frecord-sources)
  endif()
  add_custom_command(
-    COMMAND xcrun -sdk macosx metal
-                  ${METAL_FLAGS}
-                  -c ${SRCFILE}
-                  -I${PROJECT_SOURCE_DIR}
-                  -o ${TARGET}.air
+    COMMAND xcrun -sdk macosx metal ${METAL_FLAGS} -c ${SRCFILE}
+            -I${PROJECT_SOURCE_DIR} -o ${TARGET}.air
    DEPENDS ${SRCFILE} ${DEPS} ${BASE_HEADERS}
    OUTPUT ${TARGET}.air
    COMMENT "Building ${TARGET}.air"
-    VERBATIM
-  )
+    VERBATIM)
 endfunction(build_kernel_base)

 function(build_kernel KERNEL)
  set(SRCFILE ${CMAKE_CURRENT_SOURCE_DIR}/${KERNEL}.metal)
  cmake_path(GET KERNEL STEM TARGET)
  build_kernel_base(${TARGET} ${SRCFILE} "${ARGN}")
-  set(KERNEL_AIR ${TARGET}.air ${KERNEL_AIR} PARENT_SCOPE)
+  set(KERNEL_AIR
+      ${TARGET}.air ${KERNEL_AIR}
+      PARENT_SCOPE)
 endfunction(build_kernel)

 build_kernel(arg_reduce)
 build_kernel(conv steel/conv/params.h)
 build_kernel(gemv steel/utils.h)
-build_kernel(gemv_masked steel/utils.h)
 build_kernel(layer_norm)
 build_kernel(random)
 build_kernel(rms_norm)
 build_kernel(rope)
-build_kernel(
-  scaled_dot_product_attention
-  scaled_dot_product_attention_params.h
-  steel/defines.h
-  steel/gemm/transforms.h
-  steel/utils.h
-)
+build_kernel(scaled_dot_product_attention scaled_dot_product_attention_params.h
+             steel/defines.h steel/gemm/transforms.h steel/utils.h)

-set(
-  STEEL_HEADERS 
-  steel/defines.h
-  steel/utils.h
-  steel/conv/conv.h
-  steel/conv/loader.h
-  steel/conv/loaders/loader_channel_l.h
-  steel/conv/loaders/loader_channel_n.h
-  steel/conv/loaders/loader_general.h
-  steel/conv/kernels/steel_conv.h
-  steel/conv/kernels/steel_conv_general.h
-  steel/gemm/gemm.h
-  steel/gemm/mma.h
-  steel/gemm/loader.h
-  steel/gemm/transforms.h
-  steel/gemm/kernels/steel_gemm_fused.h
-  steel/gemm/kernels/steel_gemm_masked.h
-  steel/gemm/kernels/steel_gemm_splitk.h
-)
+set(STEEL_HEADERS
+    steel/defines.h
+    steel/utils.h
+    steel/conv/conv.h
+    steel/conv/loader.h
+    steel/conv/loaders/loader_channel_l.h
+    steel/conv/loaders/loader_channel_n.h
+    steel/conv/loaders/loader_general.h
+    steel/conv/kernels/steel_conv.h
+    steel/conv/kernels/steel_conv_general.h
+    steel/gemm/gemm.h
+    steel/gemm/mma.h
+    steel/gemm/loader.h
+    steel/gemm/transforms.h
+    steel/gemm/kernels/steel_gemm_fused.h
+    steel/gemm/kernels/steel_gemm_masked.h
+    steel/gemm/kernels/steel_gemm_splitk.h)

-if (NOT MLX_METAL_JIT)
-build_kernel(arange arange.h)
-build_kernel(binary binary.h binary_ops.h)
-build_kernel(binary_two binary_two.h)
-build_kernel(copy copy.h)
-build_kernel(
-  fft
-  fft.h
-  fft/radix.h
-  fft/readwrite.h
-)
-build_kernel(
-  reduce
-  atomic.h
-  reduction/ops.h
-  reduction/reduce_init.h
-  reduction/reduce_all.h
-  reduction/reduce_col.h
-  reduction/reduce_row.h
-)
-build_kernel(
-  quantized
-  quantized.h
-  ${STEEL_HEADERS}
-)
-build_kernel(scan scan.h)
-build_kernel(softmax softmax.h)
-build_kernel(sort sort.h)
-build_kernel(ternary ternary.h ternary_ops.h)
-build_kernel(unary unary.h unary_ops.h)
-build_kernel(
-  steel/conv/kernels/steel_conv
-  ${STEEL_HEADERS}
-)
-build_kernel(
-  steel/conv/kernels/steel_conv_general
-  ${STEEL_HEADERS}
-)
-build_kernel(
-  steel/gemm/kernels/steel_gemm_fused
-  ${STEEL_HEADERS}
-)
-build_kernel(
-  steel/gemm/kernels/steel_gemm_masked
-  ${STEEL_HEADERS}
-)
-build_kernel(
-  steel/gemm/kernels/steel_gemm_splitk
-  ${STEEL_HEADERS}
-)
+if(NOT MLX_METAL_JIT)
+  build_kernel(arange arange.h)
+  build_kernel(binary binary.h binary_ops.h)
+  build_kernel(binary_two binary_two.h)
+  build_kernel(copy copy.h)
+  build_kernel(fft fft.h fft/radix.h fft/readwrite.h)
+  build_kernel(
+    reduce
+    atomic.h
+    reduction/ops.h
+    reduction/reduce_init.h
+    reduction/reduce_all.h
+    reduction/reduce_col.h
+    reduction/reduce_row.h)
+  build_kernel(quantized quantized.h ${STEEL_HEADERS})
+  build_kernel(scan scan.h)
+  build_kernel(softmax softmax.h)
+  build_kernel(sort sort.h)
+  build_kernel(ternary ternary.h ternary_ops.h)
+  build_kernel(unary unary.h unary_ops.h)
+  build_kernel(steel/conv/kernels/steel_conv ${STEEL_HEADERS})
+  build_kernel(steel/conv/kernels/steel_conv_general ${STEEL_HEADERS})
+  build_kernel(steel/gemm/kernels/steel_gemm_fused ${STEEL_HEADERS})
+  build_kernel(steel/gemm/kernels/steel_gemm_masked ${STEEL_HEADERS})
+  build_kernel(steel/gemm/kernels/steel_gemm_splitk ${STEEL_HEADERS})
+  build_kernel(gemv_masked steel/utils.h)
 endif()

-
 add_custom_command(
  OUTPUT ${MLX_METAL_PATH}/mlx.metallib
-  COMMAND xcrun -sdk macosx metallib ${KERNEL_AIR} -o ${MLX_METAL_PATH}/mlx.metallib
+  COMMAND xcrun -sdk macosx metallib ${KERNEL_AIR} -o
+          ${MLX_METAL_PATH}/mlx.metallib
  DEPENDS ${KERNEL_AIR}
  COMMENT "Building mlx.metallib"
-  VERBATIM
-)
+  VERBATIM)

-add_custom_target(
-  mlx-metallib
-  DEPENDS
-  ${MLX_METAL_PATH}/mlx.metallib
-)
+add_custom_target(mlx-metallib DEPENDS ${MLX_METAL_PATH}/mlx.metallib)

-add_dependencies(
-  mlx
-  mlx-metallib
-)
+add_dependencies(mlx mlx-metallib)

 # Install metallib
 include(GNUInstallDirs)
@@ -149,5 +97,4 @@ include(GNUInstallDirs)
 install(
  FILES ${MLX_METAL_PATH}/mlx.metallib
  DESTINATION ${CMAKE_INSTALL_LIBDIR}
-  COMPONENT metallib
-)
+  COMPONENT metallib)
--- a/mlx/backend/metal/kernels/arg_reduce.metal
+++ b/mlx/backend/metal/kernels/arg_reduce.metal
@@ -70,16 +70,16 @@ IndexValPair<U> simd_shuffle_down(IndexValPair<U> data, uint16_t delta) {
      simd_shuffle_down(data.index, delta), simd_shuffle_down(data.val, delta)};
 }

-template <typename T, typename Op, int N_READS>
+template <typename T, typename Op, int N_READS = 4>
 [[kernel]] void arg_reduce_general(
    const device T* in [[buffer(0)]],
    device uint32_t* out [[buffer(1)]],
-    const device int* shape [[buffer(2)]],
-    const device size_t* in_strides [[buffer(3)]],
-    const device size_t* out_strides [[buffer(4)]],
-    const device size_t& ndim [[buffer(5)]],
-    const device size_t& axis_stride [[buffer(6)]],
-    const device size_t& axis_size [[buffer(7)]],
+    const constant int* shape [[buffer(2)]],
+    const constant size_t* in_strides [[buffer(3)]],
+    const constant size_t* out_strides [[buffer(4)]],
+    const constant size_t& ndim [[buffer(5)]],
+    const constant size_t& axis_stride [[buffer(6)]],
+    const constant size_t& axis_size [[buffer(7)]],
    uint gid [[thread_position_in_grid]],
    uint lid [[thread_position_in_threadgroup]],
    uint lsize [[threads_per_threadgroup]],
@@ -159,28 +159,12 @@ template <typename T, typename Op, int N_READS>
  }
 }

-#define instantiate_arg_reduce_helper(name, itype, op) \
-  template [[host_name(name)]] [[kernel]] void         \
-  arg_reduce_general<itype, op<itype>, 4>(             \
-      const device itype* in [[buffer(0)]],            \
-      device uint32_t* out [[buffer(1)]],              \
-      const device int* shape [[buffer(2)]],           \
-      const device size_t* in_strides [[buffer(3)]],   \
-      const device size_t* out_strides [[buffer(4)]],  \
-      const device size_t& ndim [[buffer(5)]],         \
-      const device size_t& axis_stride [[buffer(6)]],  \
-      const device size_t& axis_size [[buffer(7)]],    \
-      uint gid [[thread_position_in_grid]],            \
-      uint lid [[thread_position_in_threadgroup]],     \
-      uint lsize [[threads_per_threadgroup]],          \
-      uint simd_size [[threads_per_simdgroup]],        \
-      uint simd_lane_id [[thread_index_in_simdgroup]], \
-      uint simd_group_id [[simdgroup_index_in_threadgroup]]);
-
 // clang-format off
 #define instantiate_arg_reduce(name, itype)                      \
-  instantiate_arg_reduce_helper("argmin_" #name , itype, ArgMin) \
-  instantiate_arg_reduce_helper("argmax_" #name , itype, ArgMax)
+  instantiate_kernel(                                            \
+      "argmin_" #name, arg_reduce_general, itype, ArgMin<itype>) \
+  instantiate_kernel(                                            \
+      "argmax_" #name, arg_reduce_general, itype, ArgMax<itype>)

 instantiate_arg_reduce(bool_, bool)
 instantiate_arg_reduce(uint8, uint8_t)
--- a/mlx/backend/metal/kernels/atomic.h
+++ b/mlx/backend/metal/kernels/atomic.h
@@ -37,13 +37,13 @@ struct mlx_atomic<T, enable_if_t<is_metal_atomic<T>>> {

 template <typename T, enable_if_t<is_metal_atomic<T>, bool> = true>
 METAL_FUNC T
-mlx_atomic_load_explicit(device mlx_atomic<T>* object, uint offset) {
+mlx_atomic_load_explicit(device mlx_atomic<T>* object, size_t offset) {
  return atomic_load_explicit(&(object[offset].val), memory_order_relaxed);
 }

 template <typename T, enable_if_t<is_metal_atomic<T>, bool> = true>
 METAL_FUNC void
-mlx_atomic_store_explicit(device mlx_atomic<T>* object, T val, uint offset) {
+mlx_atomic_store_explicit(device mlx_atomic<T>* object, T val, size_t offset) {
  atomic_store_explicit(&(object[offset].val), val, memory_order_relaxed);
 }

@@ -51,13 +51,15 @@ template <typename T, enable_if_t<is_metal_atomic<T>, bool> = true>
 METAL_FUNC void mlx_atomic_fetch_and_explicit(
    device mlx_atomic<T>* object,
    T val,
-    uint offset) {
+    size_t offset) {
  atomic_fetch_and_explicit(&(object[offset].val), val, memory_order_relaxed);
 }

 template <typename T, enable_if_t<is_metal_atomic<T>, bool> = true>
-METAL_FUNC void
-mlx_atomic_fetch_or_explicit(device mlx_atomic<T>* object, T val, uint offset) {
+METAL_FUNC void mlx_atomic_fetch_or_explicit(
+    device mlx_atomic<T>* object,
+    T val,
+    size_t offset) {
  atomic_fetch_or_explicit(&(object[offset].val), val, memory_order_relaxed);
 }

@@ -65,7 +67,7 @@ template <typename T, enable_if_t<is_metal_atomic<T>, bool> = true>
 METAL_FUNC void mlx_atomic_fetch_min_explicit(
    device mlx_atomic<T>* object,
    T val,
-    uint offset) {
+    size_t offset) {
  atomic_fetch_min_explicit(&(object[offset].val), val, memory_order_relaxed);
 }

@@ -73,7 +75,7 @@ template <typename T, enable_if_t<is_metal_atomic<T>, bool> = true>
 METAL_FUNC void mlx_atomic_fetch_max_explicit(
    device mlx_atomic<T>* object,
    T val,
-    uint offset) {
+    size_t offset) {
  atomic_fetch_max_explicit(&(object[offset].val), val, memory_order_relaxed);
 }

@@ -81,7 +83,7 @@ template <typename T, enable_if_t<is_metal_atomic<T>, bool> = true>
 METAL_FUNC void mlx_atomic_fetch_add_explicit(
    device mlx_atomic<T>* object,
    T val,
-    uint offset) {
+    size_t offset) {
  atomic_fetch_add_explicit(&(object[offset].val), val, memory_order_relaxed);
 }

@@ -89,7 +91,7 @@ template <typename T, enable_if_t<is_metal_atomic<T>, bool> = true>
 METAL_FUNC void mlx_atomic_fetch_mul_explicit(
    device mlx_atomic<T>* object,
    T val,
-    uint offset) {
+    size_t offset) {
  T expected = mlx_atomic_load_explicit(object, offset);
  while (!mlx_atomic_compare_exchange_weak_explicit(
      object, &expected, val * expected, offset)) {
@@ -101,7 +103,7 @@ METAL_FUNC bool mlx_atomic_compare_exchange_weak_explicit(
    device mlx_atomic<T>* object,
    thread T* expected,
    T val,
-    uint offset) {
+    size_t offset) {
  return atomic_compare_exchange_weak_explicit(
      &(object[offset].val),
      expected,
@@ -115,7 +117,7 @@ template <>
 METAL_FUNC void mlx_atomic_fetch_min_explicit<float>(
    device mlx_atomic<float>* object,
    float val,
-    uint offset) {
+    size_t offset) {
  float expected = mlx_atomic_load_explicit(object, offset);
  while (val < expected) {
    if (mlx_atomic_compare_exchange_weak_explicit(
@@ -130,7 +132,7 @@ template <>
 METAL_FUNC void mlx_atomic_fetch_max_explicit<float>(
    device mlx_atomic<float>* object,
    float val,
-    uint offset) {
+    size_t offset) {
  float expected = mlx_atomic_load_explicit(object, offset);
  while (val > expected) {
    if (mlx_atomic_compare_exchange_weak_explicit(
@@ -157,7 +159,7 @@ union uint_or_packed {

 template <typename T, typename Op>
 struct mlx_atomic_update_helper {
-  uint operator()(uint_or_packed<T> init, T update, uint elem_offset) {
+  uint operator()(uint_or_packed<T> init, T update, size_t elem_offset) {
    Op op;
    init.val[elem_offset] = op(update, init.val[elem_offset]);
    return init.bits;
@@ -168,9 +170,9 @@ template <typename T, typename Op>
 METAL_FUNC void mlx_atomic_update_and_store(
    device mlx_atomic<T>* object,
    T update,
-    uint offset) {
-  uint pack_offset = offset / packing_size<T>;
-  uint elem_offset = offset % packing_size<T>;
+    size_t offset) {
+  size_t pack_offset = offset / packing_size<T>;
+  size_t elem_offset = offset % packing_size<T>;

  mlx_atomic_update_helper<T, Op> helper;
  uint_or_packed<T> expected;
@@ -251,9 +253,9 @@ struct __Min {

 template <typename T, enable_if_t<!is_metal_atomic<T>, bool> = true>
 METAL_FUNC T
-mlx_atomic_load_explicit(device mlx_atomic<T>* object, uint offset) {
-  uint pack_offset = offset / sizeof(T);
-  uint elem_offset = offset % sizeof(T);
+mlx_atomic_load_explicit(device mlx_atomic<T>* object, size_t offset) {
+  size_t pack_offset = offset / sizeof(T);
+  size_t elem_offset = offset % sizeof(T);
  uint_or_packed<T> packed_val;
  packed_val.bits =
      atomic_load_explicit(&(object[pack_offset].val), memory_order_relaxed);
@@ -262,7 +264,7 @@ mlx_atomic_load_explicit(device mlx_atomic<T>* object, uint offset) {

 template <typename T, enable_if_t<!is_metal_atomic<T>, bool> = true>
 METAL_FUNC void
-mlx_atomic_store_explicit(device mlx_atomic<T>* object, T val, uint offset) {
+mlx_atomic_store_explicit(device mlx_atomic<T>* object, T val, size_t offset) {
  mlx_atomic_update_and_store<T, __None<T>>(object, val, offset);
 }

@@ -270,9 +272,9 @@ template <typename T, enable_if_t<!is_metal_atomic<T>, bool> = true>
 METAL_FUNC void mlx_atomic_fetch_and_explicit(
    device mlx_atomic<T>* object,
    T val,
-    uint offset) {
-  uint pack_offset = offset / packing_size<T>;
-  uint elem_offset = offset % packing_size<T>;
+    size_t offset) {
+  size_t pack_offset = offset / packing_size<T>;
+  size_t elem_offset = offset % packing_size<T>;
  uint_or_packed<T> identity;
  identity.bits = __UINT32_MAX__;
  identity.val[elem_offset] = val;
@@ -282,10 +284,12 @@ METAL_FUNC void mlx_atomic_fetch_and_explicit(
 }

 template <typename T, enable_if_t<!is_metal_atomic<T>, bool> = true>
-METAL_FUNC void
-mlx_atomic_fetch_or_explicit(device mlx_atomic<T>* object, T val, uint offset) {
-  uint pack_offset = offset / packing_size<T>;
-  uint elem_offset = offset % packing_size<T>;
+METAL_FUNC void mlx_atomic_fetch_or_explicit(
+    device mlx_atomic<T>* object,
+    T val,
+    size_t offset) {
+  size_t pack_offset = offset / packing_size<T>;
+  size_t elem_offset = offset % packing_size<T>;
  uint_or_packed<T> identity;
  identity.bits = 0;
  identity.val[elem_offset] = val;
@@ -298,7 +302,7 @@ template <typename T, enable_if_t<!is_metal_atomic<T>, bool> = true>
 METAL_FUNC void mlx_atomic_fetch_min_explicit(
    device mlx_atomic<T>* object,
    T val,
-    uint offset) {
+    size_t offset) {
  mlx_atomic_update_and_store<T, __Min<T>>(object, val, offset);
 }

@@ -306,7 +310,7 @@ template <typename T, enable_if_t<!is_metal_atomic<T>, bool> = true>
 METAL_FUNC void mlx_atomic_fetch_max_explicit(
    device mlx_atomic<T>* object,
    T val,
-    uint offset) {
+    size_t offset) {
  mlx_atomic_update_and_store<T, __Max<T>>(object, val, offset);
 }

@@ -314,7 +318,7 @@ template <typename T, enable_if_t<!is_metal_atomic<T>, bool> = true>
 METAL_FUNC void mlx_atomic_fetch_add_explicit(
    device mlx_atomic<T>* object,
    T val,
-    uint offset) {
+    size_t offset) {
  mlx_atomic_update_and_store<T, __Add<T>>(object, val, offset);
 }

@@ -322,7 +326,7 @@ template <typename T, enable_if_t<!is_metal_atomic<T>, bool> = true>
 METAL_FUNC void mlx_atomic_fetch_mul_explicit(
    device mlx_atomic<T>* object,
    T val,
-    uint offset) {
+    size_t offset) {
  mlx_atomic_update_and_store<T, __Mul<T>>(object, val, offset);
 }

@@ -331,7 +335,7 @@ METAL_FUNC bool mlx_atomic_compare_exchange_weak_explicit(
    device mlx_atomic<T>* object,
    thread uint* expected,
    uint val,
-    uint offset) {
+    size_t offset) {
  return atomic_compare_exchange_weak_explicit(
      &(object[offset].val),
      expected,
--- a/mlx/backend/metal/kernels/binary.h
+++ b/mlx/backend/metal/kernels/binary.h
@@ -36,6 +36,39 @@ template <typename T, typename U, typename Op>
  c[index] = Op()(a[index], b[index]);
 }

+template <typename T, typename U, typename Op>
+[[kernel]] void binary_sv2(
+    device const T* a,
+    device const T* b,
+    device U* c,
+    uint2 index [[thread_position_in_grid]],
+    uint2 grid_dim [[threads_per_grid]]) {
+  size_t offset = index.x + grid_dim.x * size_t(index.y);
+  c[offset] = Op()(a[0], b[offset]);
+}
+
+template <typename T, typename U, typename Op>
+[[kernel]] void binary_vs2(
+    device const T* a,
+    device const T* b,
+    device U* c,
+    uint2 index [[thread_position_in_grid]],
+    uint2 grid_dim [[threads_per_grid]]) {
+  size_t offset = index.x + grid_dim.x * size_t(index.y);
+  c[offset] = Op()(a[offset], b[0]);
+}
+
+template <typename T, typename U, typename Op>
+[[kernel]] void binary_vv2(
+    device const T* a,
+    device const T* b,
+    device U* c,
+    uint2 index [[thread_position_in_grid]],
+    uint2 grid_dim [[threads_per_grid]]) {
+  size_t offset = index.x + grid_dim.x * size_t(index.y);
+  c[offset] = Op()(a[offset], b[offset]);
+}
+
 template <typename T, typename U, typename Op>
 [[kernel]] void binary_g_nd1(
    device const T* a,
@@ -60,7 +93,7 @@ template <typename T, typename U, typename Op>
    uint2 grid_dim [[threads_per_grid]]) {
  auto a_idx = elem_to_loc_2(index, a_strides);
  auto b_idx = elem_to_loc_2(index, b_strides);
-  size_t out_idx = index.x + (size_t)grid_dim.x * index.y;
+  size_t out_idx = index.x + size_t(grid_dim.x) * index.y;
  c[out_idx] = Op()(a[a_idx], b[b_idx]);
 }

@@ -76,27 +109,11 @@ template <typename T, typename U, typename Op>
  auto a_idx = elem_to_loc_3(index, a_strides);
  auto b_idx = elem_to_loc_3(index, b_strides);
  size_t out_idx =
-      index.x + (size_t)grid_dim.x * (index.y + (size_t)grid_dim.y * index.z);
+      index.x + grid_dim.x * (index.y + size_t(grid_dim.y) * index.z);
  c[out_idx] = Op()(a[a_idx], b[b_idx]);
 }

-template <typename T, typename U, typename Op, int DIM>
-[[kernel]] void binary_g_nd(
-    device const T* a,
-    device const T* b,
-    device U* c,
-    constant const int shape[DIM],
-    constant const size_t a_strides[DIM],
-    constant const size_t b_strides[DIM],
-    uint3 index [[thread_position_in_grid]],
-    uint3 grid_dim [[threads_per_grid]]) {
-  auto idx = elem_to_loc_2_nd<DIM>(index, shape, a_strides, b_strides);
-  size_t out_idx =
-      index.x + (size_t)grid_dim.x * (index.y + (size_t)grid_dim.y * index.z);
-  c[out_idx] = Op()(a[idx.x], b[idx.y]);
-}
-
-template <typename T, typename U, typename Op>
+template <typename T, typename U, typename Op, int N = 1>
 [[kernel]] void binary_g(
    device const T* a,
    device const T* b,
@@ -107,7 +124,16 @@ template <typename T, typename U, typename Op>
    constant const int& ndim,
    uint3 index [[thread_position_in_grid]],
    uint3 grid_dim [[threads_per_grid]]) {
-  auto idx = elem_to_loc_2_nd(index, shape, a_strides, b_strides, ndim);
-  size_t out_idx = index.x + grid_dim.x * (index.y + grid_dim.y * index.z);
-  c[out_idx] = Op()(a[idx.x], b[idx.y]);
+  auto idx = elem_to_loc_2_nd(
+      {N * index.x, index.y, index.z}, shape, a_strides, b_strides, ndim);
+  auto xshape = shape[ndim - 1];
+  size_t out_idx =
+      N * index.x + xshape * (index.y + size_t(grid_dim.y) * index.z);
+  auto a_xstride = a_strides[ndim - 1];
+  auto b_xstride = b_strides[ndim - 1];
+  for (int i = 0; i < N && (int(N * index.x) + i) < xshape; ++i) {
+    c[out_idx++] = Op()(a[idx.x], b[idx.y]);
+    idx.x += a_xstride;
+    idx.y += b_xstride;
+  }
 }
--- a/mlx/backend/metal/kernels/binary.metal
+++ b/mlx/backend/metal/kernels/binary.metal
@@ -9,17 +9,19 @@
 #include "mlx/backend/metal/kernels/binary_ops.h"
 #include "mlx/backend/metal/kernels/binary.h"

-#define instantiate_binary_all(op, tname, itype, otype)                 \
-  instantiate_kernel("ss" #op #tname, binary_ss, itype, otype, op)      \
-  instantiate_kernel("sv" #op #tname, binary_sv, itype, otype, op)      \
-  instantiate_kernel("vs" #op #tname, binary_vs, itype, otype, op)      \
-  instantiate_kernel("vv" #op #tname, binary_vv, itype, otype, op)      \
-  instantiate_kernel("gn" #op #tname, binary_g, itype, otype, op)       \
-  instantiate_kernel("g1" #op #tname, binary_g_nd1, itype, otype, op)   \
-  instantiate_kernel("g2" #op #tname, binary_g_nd2, itype, otype, op)   \
-  instantiate_kernel("g3" #op #tname, binary_g_nd3, itype, otype, op)   \
-  instantiate_kernel("g4" #op #tname, binary_g_nd, itype, otype, op, 4) \
-  instantiate_kernel("g5" #op #tname, binary_g_nd,  itype, otype, op, 5)
+#define instantiate_binary_all(op, tname, itype, otype)                  \
+  instantiate_kernel("ss_" #op #tname, binary_ss, itype, otype, op)      \
+  instantiate_kernel("sv_" #op #tname, binary_sv, itype, otype, op)      \
+  instantiate_kernel("vs_" #op #tname, binary_vs, itype, otype, op)      \
+  instantiate_kernel("vv_" #op #tname, binary_vv, itype, otype, op)      \
+  instantiate_kernel("sv2_" #op #tname, binary_sv2, itype, otype, op)    \
+  instantiate_kernel("vs2_" #op #tname, binary_vs2, itype, otype, op)    \
+  instantiate_kernel("vv2_" #op #tname, binary_vv2, itype, otype, op)    \
+  instantiate_kernel("gn_" #op #tname, binary_g, itype, otype, op)       \
+  instantiate_kernel("gn4_" #op #tname, binary_g, itype, otype, op, 4)   \
+  instantiate_kernel("g1_" #op #tname, binary_g_nd1, itype, otype, op)   \
+  instantiate_kernel("g2_" #op #tname, binary_g_nd2, itype, otype, op)   \
+  instantiate_kernel("g3_" #op #tname, binary_g_nd3, itype, otype, op)   \

 #define instantiate_binary_integer(op)                   \
  instantiate_binary_all(op, uint8, uint8_t, uint8_t)    \
--- a/mlx/backend/metal/kernels/binary_ops.h
+++ b/mlx/backend/metal/kernels/binary_ops.h
@@ -217,7 +217,7 @@ struct Power {

  template <>
  complex64_t operator()(complex64_t x, complex64_t y) {
-    auto x_theta = metal::atan(x.imag / x.real);
+    auto x_theta = metal::atan2(x.imag, x.real);
    auto x_ln_r = 0.5 * metal::log(x.real * x.real + x.imag * x.imag);
    auto mag = metal::exp(y.real * x_ln_r - y.imag * x_theta);
    auto phase = y.imag * x_ln_r + y.real * x_theta;
--- a/mlx/backend/metal/kernels/binary_two.h
+++ b/mlx/backend/metal/kernels/binary_two.h
@@ -48,6 +48,48 @@ template <typename T, typename U, typename Op>
  d[index] = out[1];
 }

+template <typename T, typename U, typename Op>
+[[kernel]] void binary_sv2(
+    device const T* a,
+    device const T* b,
+    device U* c,
+    device U* d,
+    uint2 index [[thread_position_in_grid]],
+    uint2 grid_dim [[threads_per_grid]]) {
+  size_t offset = index.x + grid_dim.x * size_t(index.y);
+  auto out = Op()(a[0], b[offset]);
+  c[offset] = out[0];
+  d[offset] = out[1];
+}
+
+template <typename T, typename U, typename Op>
+[[kernel]] void binary_vs2(
+    device const T* a,
+    device const T* b,
+    device U* c,
+    device U* d,
+    uint2 index [[thread_position_in_grid]],
+    uint2 grid_dim [[threads_per_grid]]) {
+  size_t offset = index.x + grid_dim.x * size_t(index.y);
+  auto out = Op()(a[offset], b[0]);
+  c[offset] = out[0];
+  d[offset] = out[1];
+}
+
+template <typename T, typename U, typename Op>
+[[kernel]] void binary_vv2(
+    device const T* a,
+    device const T* b,
+    device U* c,
+    device U* d,
+    uint2 index [[thread_position_in_grid]],
+    uint2 grid_dim [[threads_per_grid]]) {
+  size_t offset = index.x + grid_dim.x * size_t(index.y);
+  auto out = Op()(a[offset], b[offset]);
+  c[offset] = out[0];
+  d[offset] = out[1];
+}
+
 template <typename T, typename U, typename Op>
 [[kernel]] void binary_g_nd1(
    device const T* a,
@@ -76,7 +118,7 @@ template <typename T, typename U, typename Op>
    uint2 grid_dim [[threads_per_grid]]) {
  auto a_idx = elem_to_loc_2(index, a_strides);
  auto b_idx = elem_to_loc_2(index, b_strides);
-  size_t out_idx = index.x + (size_t)grid_dim.x * index.y;
+  size_t out_idx = index.x + size_t(grid_dim.x) * index.y;
  auto out = Op()(a[a_idx], b[b_idx]);
  c[out_idx] = out[0];
  d[out_idx] = out[1];
@@ -95,32 +137,13 @@ template <typename T, typename U, typename Op>
  auto a_idx = elem_to_loc_3(index, a_strides);
  auto b_idx = elem_to_loc_3(index, b_strides);
  size_t out_idx =
-      index.x + (size_t)grid_dim.x * (index.y + (size_t)grid_dim.y * index.z);
+      index.x + grid_dim.x * (index.y + size_t(grid_dim.y) * index.z);
  auto out = Op()(a[a_idx], b[b_idx]);
  c[out_idx] = out[0];
  d[out_idx] = out[1];
 }

-template <typename T, typename U, typename Op, int DIM>
-[[kernel]] void binary_g_nd(
-    device const T* a,
-    device const T* b,
-    device U* c,
-    device U* d,
-    constant const int shape[DIM],
-    constant const size_t a_strides[DIM],
-    constant const size_t b_strides[DIM],
-    uint3 index [[thread_position_in_grid]],
-    uint3 grid_dim [[threads_per_grid]]) {
-  auto idx = elem_to_loc_2_nd<DIM>(index, shape, a_strides, b_strides);
-  size_t out_idx =
-      index.x + (size_t)grid_dim.x * (index.y + (size_t)grid_dim.y * index.z);
-  auto out = Op()(a[idx.x], b[idx.y]);
-  c[out_idx] = out[0];
-  d[out_idx] = out[1];
-}
-
-template <typename T, typename U, typename Op>
+template <typename T, typename U, typename Op, int N = 1>
 [[kernel]] void binary_g(
    device const T* a,
    device const T* b,
@@ -132,9 +155,18 @@ template <typename T, typename U, typename Op>
    constant const int& ndim,
    uint3 index [[thread_position_in_grid]],
    uint3 grid_dim [[threads_per_grid]]) {
-  auto idx = elem_to_loc_2_nd(index, shape, a_strides, b_strides, ndim);
-  size_t out_idx = index.x + grid_dim.x * (index.y + grid_dim.y * index.z);
-  auto out = Op()(a[idx.x], b[idx.y]);
-  c[out_idx] = out[0];
-  d[out_idx] = out[1];
+  auto idx = elem_to_loc_2_nd(
+      {N * index.x, index.y, index.z}, shape, a_strides, b_strides, ndim);
+  auto xshape = shape[ndim - 1];
+  size_t out_idx =
+      N * index.x + xshape * (index.y + size_t(grid_dim.y) * index.z);
+  auto a_xstride = a_strides[ndim - 1];
+  auto b_xstride = b_strides[ndim - 1];
+  for (int i = 0; i < N && (int(N * index.x) + i) < xshape; ++i) {
+    auto out = Op()(a[idx.x], b[idx.y]);
+    c[out_idx] = out[0];
+    d[out_idx++] = out[1];
+    idx.x += a_xstride;
+    idx.y += b_xstride;
+  }
 }
--- a/mlx/backend/metal/kernels/binary_two.metal
+++ b/mlx/backend/metal/kernels/binary_two.metal
@@ -7,17 +7,19 @@
 #include "mlx/backend/metal/kernels/binary_ops.h"
 #include "mlx/backend/metal/kernels/binary_two.h"

-#define instantiate_binary_all(op, tname, itype, otype)                 \
-  instantiate_kernel("ss" #op #tname, binary_ss, itype, otype, op)      \
-  instantiate_kernel("sv" #op #tname, binary_sv, itype, otype, op)      \
-  instantiate_kernel("vs" #op #tname, binary_vs, itype, otype, op)      \
-  instantiate_kernel("vv" #op #tname, binary_vv, itype, otype, op)      \
-  instantiate_kernel("gn" #op #tname, binary_g, itype, otype, op)       \
-  instantiate_kernel("g1" #op #tname, binary_g_nd1, itype, otype, op)   \
-  instantiate_kernel("g2" #op #tname, binary_g_nd2, itype, otype, op)   \
-  instantiate_kernel("g3" #op #tname, binary_g_nd3, itype, otype, op)   \
-  instantiate_kernel("g4" #op #tname, binary_g_nd, itype, otype, op, 4) \
-  instantiate_kernel("g5" #op #tname, binary_g_nd,  itype, otype, op, 5)
+#define instantiate_binary_all(op, tname, itype, otype)                  \
+  instantiate_kernel("ss_" #op #tname, binary_ss, itype, otype, op)      \
+  instantiate_kernel("sv_" #op #tname, binary_sv, itype, otype, op)      \
+  instantiate_kernel("vs_" #op #tname, binary_vs, itype, otype, op)      \
+  instantiate_kernel("vv_" #op #tname, binary_vv, itype, otype, op)      \
+  instantiate_kernel("sv2_" #op #tname, binary_sv2, itype, otype, op)    \
+  instantiate_kernel("vs2_" #op #tname, binary_vs2, itype, otype, op)    \
+  instantiate_kernel("vv2_" #op #tname, binary_vv2, itype, otype, op)    \
+  instantiate_kernel("gn_" #op #tname, binary_g, itype, otype, op)       \
+  instantiate_kernel("gn4_" #op #tname, binary_g, itype, otype, op, 4)   \
+  instantiate_kernel("g1_" #op #tname, binary_g_nd1, itype, otype, op)   \
+  instantiate_kernel("g2_" #op #tname, binary_g_nd2, itype, otype, op)   \
+  instantiate_kernel("g3_" #op #tname, binary_g_nd3, itype, otype, op)   \

 #define instantiate_binary_float(op)                \
  instantiate_binary_all(op, float16, half, half)   \
--- a/mlx/backend/metal/kernels/complex.h
+++ b/mlx/backend/metal/kernels/complex.h
@@ -23,6 +23,8 @@ struct complex64_t {

  // Constructors
  constexpr complex64_t(float real, float imag) : real(real), imag(imag) {};
+  constexpr complex64_t() : real(0), imag(0) {};
+  constexpr complex64_t() threadgroup : real(0), imag(0) {};

  // Conversions to complex64_t
  template <
--- a/Show More
+++ b/Show More
Author	SHA1	Message	Date
Awni Hannun	c21331d47f	version bump (#1477 )	2024-10-10 13:05:17 -07:00
Awni Hannun	e1c9600da3	Add `mx.random.permutation` (#1471 ) * random permutation * comment	2024-10-08 19:42:19 -07:00
Awni Hannun	1fa0d20a30	consistently handle all -inf in softmax (#1470 )	2024-10-08 09:54:02 -07:00
Awni Hannun	3274c6a087	Fix array is_available race cases (#1468 )	2024-10-07 19:13:50 -07:00
Angelos Katharopoulos	9b12093739	Add the roll op (#1455 )	2024-10-07 17:21:42 -07:00
Awni Hannun	f374b6ca4d	Bump nanobind to 2.2 (#1461 ) * bump nanobind * extension version for tests	2024-10-07 16:52:40 -07:00
Awni Hannun	0070e1db40	Fix deep recursion with siblings (#1462 ) * fix recursion with siblings * fix * add test * increase tol	2024-10-07 06:15:33 -07:00
Awni Hannun	95d04805b3	Fix complex power on Metal (#1460 )	2024-10-06 19:58:30 -07:00
Awni Hannun	e4534dac17	Conv grad with groups + bugfix (#1449 ) * fix bug in flipped conv with groups, start of grad for groups * fix * fix * fix + test	2024-10-06 07:08:53 -07:00
Angelos Katharopoulos	fef3c4ec1d	Fix mpi test in CI (#1456 ) * Fix mpi test in CI * Set bind to none	2024-10-06 06:09:17 -07:00
Awni Hannun	1bdc038bf9	fix argpartition + faster {arg} sorts / partitions (#1453 )	2024-10-03 14:21:25 -07:00
Awni Hannun	5523d9c426	faster cpu indexing (#1450 )	2024-10-03 13:53:47 -07:00
Angelos Katharopoulos	d878015228	Fix normalization check_input (#1452 )	2024-10-03 13:26:56 -07:00
Cheng	5900e3249f	Fix building on Linux (#1446 )	2024-09-30 07:00:39 -07:00
Angelos Katharopoulos	bacced53d3	Fix row reduce with very few rows (#1447 )	2024-09-29 20:00:35 -07:00
Lucas Newman	4a64d4bff1	Add support for grouped 1D convolutions to the nn API (#1444 ) * Fix the weight shape for grouped convolutions from the nn API. * Add tests. * Pre-commit formatting. * Add input validation. * Use integer division instead of casting. * docs * nit --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-09-28 06:41:07 -07:00
Awni Hannun	b1e2b53c2d	bump (#1445 )	2024-09-27 13:53:02 -07:00
Awni Hannun	11354d5bff	Avoid io timeout for large arrays (#1442 )	2024-09-27 13:32:14 -07:00
Awni Hannun	718aea3f1d	allow take to work with integer index (#1440 )	2024-09-26 15:58:03 -07:00
Awni Hannun	5b6f38df2b	Faster cpu ops (#1434 ) * faster binary and cleaner copy * use recursive template for other ops * more cleanup * fix from cleanup * more clean * fix binary * use contiguous iterator * add 3d * nits * fix * fix? * fix * fix rebase	2024-09-26 09:19:13 -07:00
Awni Hannun	0b4a58699e	Some overhead reductions in mx.fast.metal_kernel (#1437 ) * some overhead reductions * fix * use += * use more +=	2024-09-25 17:25:21 -07:00
Awni Hannun	4f9f9ebb6f	Faster Metal unary and binary for general case (#1431 ) * faster unary and binary for general case * update ternary + jit fix * fix jit * unary work per thread	2024-09-25 12:07:43 -07:00
Awni Hannun	afc9c0ec1b	dtype is copy assignable (#1436 )	2024-09-25 12:07:13 -07:00
Awni Hannun	195b429d99	Put along axis + fixe for partition grad (#1430 ) * put along axis, fixes for partition grad * zeros for arg reduce	2024-09-23 10:03:38 -07:00
Luke Carlson	2b878e9dd7	Create CITATION.cff (#1425 )	2024-09-20 11:39:46 -07:00
Awni Hannun	67b6bf530d	Optimization for general ND copies (#1421 )	2024-09-17 17:59:51 -07:00
Nripesh Niketan	6af5ca35b2	feat: add cross_product (#1252 ) * feat: add cross_product * lint * python binding * refactor: Improve error message for cross_product function * refactor: more close to numpy cross product * refactor: improve error message for cross_product function * finish * fix acks * allow old numpy * doc --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-09-17 13:12:43 -07:00
Awni Hannun	4f46e9c997	More fixes for arrays with large sizes (#1405 ) * compile works for big arrays when contiguous * style * nits in docs * a bunch more stuff * update jit * update jit * use constant for shapes and strides and remove elem_to_loc overload * use kernel instantiation * docs nits * update binary and ternary * comments	2024-09-17 12:46:31 -07:00
Awni Hannun	c6739ba7f3	Faster RNN layers (#1419 ) * faster rnn * use admm	2024-09-17 06:04:19 -07:00
Angelos Katharopoulos	914409fef9	Data parallel helper (#1407 )	2024-09-16 18:17:21 -07:00
jjuang-apple	8d68a3e805	remove fmt dependencies from MLX install (#1417 )	2024-09-16 13:32:28 -07:00
jjuang-apple	6bbcc453ef	avoid using find_library to make install truly portable (#1416 )	2024-09-16 13:21:32 -07:00
Awni Hannun	d5ed4d7a71	override class function (#1418 )	2024-09-16 13:21:04 -07:00
Nripesh Niketan	669c27140d	Chore: add pre-commit hook for cmake (#1362 ) * reset and lint * format --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-09-16 12:53:01 -07:00
Max-Heinrich Laves	adcc88e208	Conv cpu improvements (#1410 )	2024-09-15 18:45:10 -07:00
Awni Hannun	d6492b0163	fix clip (#1415 )	2024-09-14 16:09:09 -07:00
Awni Hannun	b3f52c9fbe	ensure io/comm streams are active before eval (#1412 )	2024-09-14 06:17:36 -07:00
c0g	bd8396fad8	Fix typo in transformer docs (#1414 )	2024-09-14 06:05:15 -07:00
Angelos Katharopoulos	d0c58841d1	Patch bump (#1408 )	2024-09-12 16:44:23 -07:00
Angelos Katharopoulos	881f09b2e2	Allow querying the allocator for the buffer size (#1404 )	2024-09-11 21:02:16 -07:00
Awni Hannun	8b30acd7eb	fix module attribute set, reset, set (#1403 )	2024-09-11 16:30:42 -07:00
Awni Hannun	02efb310ca	Xcode 160 (#1384 ) * xcode 16.0 with debug tests * limit nproc for builds * vmap bug * assert bug * run python tests in debug mode * fix view, bool copies preserve bits' * actual view fix	2024-09-10 15:15:17 -07:00
Awni Hannun	e7e59c6f05	Fix copying scalars by adding fill_gpu (#1402 ) * fix copying scalars by adding fill_gpu * Another copy scalar changed to fill --------- Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>	2024-09-09 15:54:08 -07:00
Awni Hannun	3ae6aabe9f	throw for certain cases of non captured inputs in compile (#1401 )	2024-09-09 14:54:31 -07:00
xnorai	dc627dcb5e	Replace the use of `result_of_t` with `invoke_result_t` (#1397 ) * Fix C++20 incompatibility * Fix C++20 incompatibility	2024-09-06 19:52:57 -07:00
Max-Heinrich Laves	efeb9c0f02	Transposed Convolution (#1245 ) * initial implementation for conv_transpose ran pre-commit implemented conv_transpose updated conv_general docstring updated conv_general docstring updated code comments removed commented run_conv_checks updated acknowledgments added missing entry to ops.rst added op to nn.layers resolved merge conflicts * removed ConvolutionTranspose primitive as suggested by reviewer removed ConvolutionTranspose primitive as suggested by reviewer * remove transpose flag, add another test --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-09-06 19:52:38 -07:00
Awni Hannun	ba3e913c7a	Simplifications for MLX C (#1396 ) * simplifications for MLX C * use vectors instead of map * update examples	2024-09-06 19:16:50 -07:00
Awni Hannun	7cca1727af	Fix slice data size (#1394 ) * fix slice data size and add tests * fix contiguous flag * simplify stride and perform copy for non-contiguous arrays * fix cpu * comment	2024-09-04 19:10:43 -07:00
Bhargav Yagnik	11371fe251	Test to prevent bugs like #1386 (#1391 ) * updated test_array for missing ops * formatting changes	2024-09-04 17:24:30 -07:00
Awni Hannun	41c603d48a	fix jit reduce (#1395 )	2024-09-04 14:03:10 -07:00
Angelos Katharopoulos	969337345f	Fix reduce edge case (#1389 )	2024-09-01 21:37:51 -07:00
Awni Hannun	9592766939	add std as method (#1387 ) * add std as method * add std as method	2024-09-01 19:49:16 -07:00
Angelos Katharopoulos	58dca7d846	Fix copy in the sort primitive (#1383 )	2024-08-31 08:32:14 -07:00
Awni Hannun	0d302cd25b	Fix compiel with byte sized constants (#1381 )	2024-08-30 17:24:35 -07:00
Alex Barron	da691257ec	Fix overflow in quantize/dequantize (#1379 ) * add 2d indices to prevent overflow * use nthreads not out size	2024-08-30 13:32:41 -07:00
Angelos Katharopoulos	1600092e92	Patch bump (#1376 )	2024-08-29 16:54:30 -07:00
Awni Hannun	dba2bd1105	Even Even Faster IO (#1374 ) * even more faster io * make reader pool static * make python reader thread safe * one more optimization	2024-08-29 16:05:40 -07:00
Alex Barron	28be4de7c2	Fix JIT reductions (#1373 )	2024-08-28 16:39:11 -07:00
Awni Hannun	a6c3b38fba	Async load (#1372 ) * async load * async load	2024-08-28 14:21:55 -07:00
Awni Hannun	fcb65a3897	Even Faster I/O (#1369 ) * try multithreading for faster IO * smaller batch size * Account for pread returning less than size * nit --------- Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>	2024-08-28 11:49:07 -07:00
Saanidhya	4e22a1dffe	In continuation to PR1243 to solve issue #1240 (#1365 ) * Solves issue #1240 * Correction * Update python/mlx/utils.py * Update python/mlx/utils.py --------- Co-authored-by: Awni Hannun <awni@apple.com> Co-authored-by: Awni Hannun <awni.hannun@gmail.com>	2024-08-28 11:40:41 -07:00
Awni Hannun	291cf40aca	Some fixes to typing (#1371 ) * some fixes to typing * fix module reference * comment	2024-08-28 11:16:19 -07:00
Jeethu Rao	bd47e1f066	Fix neon_fast_exp and add more softmax tests (#1367 )	2024-08-27 23:42:42 -07:00
Aditya Dhulipala	e6b223df5f	Pinv (#875 )	2024-08-27 23:06:12 -07:00
Angelos Katharopoulos	e64349bbdd	Make eval just wait if all arrays are scheduled (#1368 )	2024-08-27 17:01:22 -07:00
Angelos Katharopoulos	cdb59faea6	Adds send/recv ops in distributed (#1366 )	2024-08-26 23:01:37 -07:00
Alex Barron	1d94ac3f90	Add optional headers to ``mx.fast.metal_kernel`` (#1358 )	2024-08-26 21:45:45 -07:00
Awni Hannun	5f7d19d1f5	MPI ops in GPU stream for faster comms (#1356 )	2024-08-26 15:12:50 -07:00
Awni Hannun	2fdf9eb535	Fix ternary for large arrays (#1359 ) * fix ternary for large arrays * fix	2024-08-26 11:22:27 -07:00
Awni Hannun	860d3a50d7	fix extension metal library finding (#1361 )	2024-08-26 09:18:50 -07:00
Alex Barron	d1183821a7	int() and float() for mx.array (#1360 )	2024-08-25 20:41:44 -07:00
Angelos Katharopoulos	8081df79be	Fix boolean all reduce bug (#1355 )	2024-08-24 10:09:32 -07:00
Nripesh Niketan	64bec4fad7	Chore: update pre-commit hooks (#1353 ) * Chore: update pre-commit refs * run pre-commit	2024-08-24 06:46:36 -07:00
Alex Barron	b96e105244	Add `grid_sample` example to `metal_kernel` docs (#1352 ) * Add `zero_outputs` and `atomic_outputs` options to `metal_kernel` * add grid sample to docs * zero_outputs -> init_value * add missing header for linux	2024-08-23 18:24:16 -07:00
Awni Hannun	3b4d5484c7	Bump extension MLX version (#1350 ) * Bump extension MLX version * fix some docs nits	2024-08-23 12:38:34 -07:00
Alex Barron	684e11c664	patch (#1347 )	2024-08-23 10:42:02 -07:00
Angelos Katharopoulos	b57a52813b	Further reduction tuning (#1349 ) * More reduction tuning * Forgotten pdb * Small column long row specialization	2024-08-23 10:35:25 -07:00
Alex Barron	da8deb2b62	fix bug with multiple attributes (#1348 ) Co-authored-by: Alex Barron <abarron22@apple.com>	2024-08-23 10:06:15 -07:00
Awni Hannun	98b6ce3460	Refactor reductions and fix scatter atomics for large sizes (#1300 ) Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>	2024-08-22 16:03:31 -07:00
Awni Hannun	f9e00efe31	fix nanobind and stub gen in circle (#1346 )	2024-08-22 14:07:27 -07:00
Alex Barron	0fd2a1f4b0	Custom Metal Kernels from Python (#1325 ) * start * simple kernels working * restructure * inverse example working * docs + fixes * missing file * fix imports * address comments * add docs + fix test * Review comments + refactor to a single function * update docs * remove hashing * fix contig bug in test * back to a class * trailing whitespace * fix tests * match c++ and python apis * add link + make args kw_only	2024-08-22 13:46:29 -07:00
Awni Hannun	df3233454d	2d gather specialization (#1339 )	2024-08-22 10:48:24 -07:00
Awni Hannun	82db84b899	bump nanobind + fix extension (#1344 )	2024-08-21 16:05:07 -07:00
Awni Hannun	8ae751d3da	fix io (#1343 ) * fix io * fix io * comment	2024-08-21 13:14:46 -07:00
Awni Hannun	d40e76809f	Fix rope (#1340 ) * add test * fix rope * fix test	2024-08-20 17:37:52 -07:00
Awni Hannun	bb1b76d9dc	RoPE with frequencies as optional input (#1337 ) * start rope with freq input * rope with frequencies * nits * fix bug * fix bug + test * cleanup * optional base	2024-08-19 18:30:50 -07:00
Angelos Katharopoulos	9d26441224	Fix contiguity check (#1336 ) Co-authored-by: Alex Barron <abarron22@apple.com>	2024-08-19 16:05:06 -07:00
Awni Hannun	f12f24a77c	fix compiling with space in paths (#1332 )	2024-08-15 16:39:24 -07:00
Awni Hannun	ae5b5cabfd	Fix optimizer reloading from checkpoint (#1329 ) * fix optimizer reloading from checkpoint * comment	2024-08-15 07:33:23 -07:00
Awni Hannun	d0630ffe8c	Read arrays from files faster (#1330 ) * read faster * faster write as well * set default permission for linux * comment	2024-08-14 20:09:56 -07:00
Alex Barron	99bb7d3a58	GPU mx.sign for complex64 (#1326 )	2024-08-14 07:54:53 -07:00
Awni Hannun	63ae767232	fix transformer (#1327 )	2024-08-13 16:04:26 -07:00
Awni Hannun	eaaea02010	Add `isfinite` (#1318 ) * isfinite * remove reduce test since fix is not complete	2024-08-13 14:49:28 -07:00
Bhargav Yagnik	a098bc92e0	Fix: Preserve input dtype in Dropout layer output (#1323 ) * Fix: Preserve input dtype in Dropout layer output - Modified Dropout implementation to ensure that the output dtype matches the input dtype. - This resolves the issue #1321 * Update test cases in test_nn.py - Revised test cases to align with updated dropout code - Fixed assertion method: replaced self.assertTrue with self.assertEqual for accurate comparisons in test_nn.py -> test_rope, test_alibi and test_dropout, * updated dropout.py	2024-08-13 11:54:21 -07:00
Awni Hannun	1086dc4db0	patch (#1320 )	2024-08-12 16:13:33 -07:00
Brian Keene	19fb69e2ed	Add memory_efficient_threshold kwarg to sdpa kernel (#1319 ) Allows opt-in to memory efficient GPU shader at proscribed sequence length. Otherwise, utilizes aggregate MLX primitives for best latency.	2024-08-12 12:57:09 -07:00
Awni Hannun	9231617eb3	Move to nanobind v2 (#1316 )	2024-08-08 17:17:46 -07:00
Alex Barron	32668a7317	CPU mx.linalg.cholesky_inverse and mx.linalg.tri_inv (#1307 ) * add cholesky inv + tri inv * always run tri_inv on cpu * consistent naming	2024-08-08 15:18:02 -07:00
Angelos Katharopoulos	780c197f95	Fix test tolerance and patch bump (#1315 )	2024-08-08 14:51:09 -07:00
Angelos Katharopoulos	eb8819e91e	Revert variance to be numerically stable (#1314 )	2024-08-08 13:35:02 -07:00
Awni Hannun	30bbea2f08	Add gemv masked to JIT plus some fixes (#1310 ) * add gemv masked to JIT plus some fixes * some cleanup * add utils * fix * fix 2 * more cleaning * fix * remove unused mps matmul support * one more nit * revert	2024-08-07 13:38:07 -07:00
Alex Barron	635ccd9e25	Add "edge" mode to mx.pad (#1309 ) * Add edge padding mode * fix pad in pooling * string arg instead of enum	2024-08-06 11:23:10 -07:00
nicolov	8c9f0278b9	Add vmap to scatter (#1200 ) * Add vmap to scatter * updates * vmap updates + a few more tests * bug fix --------- Co-authored-by: Awni Hannun <awni@apple.com>	2024-08-05 20:12:27 -07:00
Awni Hannun	58d0e199e1	add bfloat conv for windograd (#1306 ) * add bfloat conv for windograd * accumulate in fp32 * accumulate in fp32 * accumulate in bf16	2024-08-05 15:51:13 -07:00
Awni Hannun	10b5835501	fix creating array from bf16 tensors in jax / torch (#1305 )	2024-08-01 16:20:51 -07:00
Awni Hannun	6c8dd307eb	faster group norm (#1304 )	2024-08-01 12:49:23 -07:00
Awni Hannun	43ffdab172	fix rope and random (#1301 ) * fix rope and random * comment	2024-07-31 16:18:25 -07:00
Awni Hannun	40b6d67333	Fixes for large arrays with a few ops (#1299 ) * fixes for large arrays with a few ops * fix bug * fix all of copy	2024-07-30 17:18:39 -07:00
Alex Barron	c52d1600f0	Fused Affine Quantize/Dequantize ops (#1282 ) * Add fast affine dequantize * add full quantize kernel * fused kernel with scale/bias computation * fix docstring * fix no jit error * fix test * test fix * reduce fast api to only affine_quantize	2024-07-29 15:11:38 -07:00
Awni Hannun	aa1d6cadad	Fix docs latex build and nits (#1297 ) * fix docs latex build and nits * fix stub gen and try to clean up building	2024-07-29 11:44:06 -07:00
Atakan Tekparmak	6e06e3a904	feat: Added "tanh" option to GELU approximation (#1268 )	2024-07-28 09:07:56 +02:00
Yaroslav	8cfb9fc0b8	Update requirements.txt (#1291 )	2024-07-26 12:59:52 -07:00
Awni Hannun	7b456fd2c0	Array api (#1289 ) * some updates for numpy 2.0 and array api * some updates for numpy 2.0 and array api * fix array api doc	2024-07-26 10:40:49 -07:00