Developer Documentation
=======================

MLX provides an open and flexible backend to which users may add operations
and specialized implementations without much hassle. While the library supplies
efficient operations that can be used and composed for any number of
applications, there may arise cases where new functionalities or highly
optimized implementations are needed. For such cases, you may design and
implement your own operations that link to and build on top of :mod:`mlx.core`.
We will introduce the inner workings of MLX and go over a simple example to
learn the steps involved in adding new operations to MLX with your own CPU
and GPU implementations.

Introducing the Example
-----------------------

Let's say that you would like an operation that takes in two arrays,
``x`` and ``y``, scales them both by some coefficients ``alpha`` and ``beta``
respectively, and then adds them together to get the result
``z = alpha * x + beta * y``. Well, you can very easily do that by just
writing out a function as follows:

.. code-block:: python

   import mlx.core as mx

   def simple_axpby(x: mx.array, y: mx.array, alpha: float, beta: float) -> mx.array:
       return alpha * x + beta * y

This function performs that operation while leaving the implementations and
differentiation to MLX.
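Before wiring anything into MLX, the arithmetic itself is worth pinning down. A minimal sketch, assuming plain Python lists as a stand-in for ``mx.array`` (element-wise only, no broadcasting):

```python
# Hypothetical stand-in for simple_axpby: plain Python lists instead of
# mx.array, element-wise only, just to illustrate z = alpha * x + beta * y.
def simple_axpby(x, y, alpha, beta):
    assert len(x) == len(y), "x and y must have the same length"
    return [alpha * xi + beta * yi for xi, yi in zip(x, y)]

z = simple_axpby([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], alpha=2.0, beta=0.5)
print(z)  # [4.0, 6.5, 9.0]
```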

However, you work with vector math libraries often and realize that the
``axpby`` routine defines the same operation ``Y = (alpha * X) + (beta * Y)``.
You would really like the part of your applications that does this operation
on the CPU to be very fast - so you decide that you want it to rely on the
``axpby`` routine provided by the Accelerate_ framework. Continuing to impose
our assumptions onto you, let's also assume that you want to learn how to add
your own implementation for the gradients of your new operation while going
over the ins-and-outs of the MLX framework.

Well, what a coincidence! You are in the right place. Over the course of this
example, we will learn:

* The structure of the MLX library from the frontend API to the backend
  implementations.
* How to implement your own CPU backend that redirects to Accelerate_ when
  appropriate (and a fallback if needed).
* How to implement your own GPU implementation using Metal.
* How to add your own ``vjp`` and ``jvp``.
* How to build your implementations, link them to MLX, and bind them to Python.

.. _Accelerate: https://developer.apple.com/documentation/accelerate/blas?language=objc
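For contrast with the MLX op we are about to build (which returns a new array), the BLAS-style routine updates ``Y`` in place. A hedged plain-Python sketch of those semantics; this is not Accelerate's actual C signature:

```python
# Hypothetical in-place axpby in the BLAS style: Y <- alpha * X + beta * Y.
# A sketch of the semantics only, not the Accelerate API.
def axpby_inplace(x, y, alpha, beta):
    for i in range(len(y)):
        y[i] = alpha * x[i] + beta * y[i]

y = [1.0, 1.0]
axpby_inplace([2.0, 3.0], y, alpha=10.0, beta=1.0)
print(y)  # [21.0, 31.0]
```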

Operations and Primitives
-------------------------

In one sentence, operations in MLX build the computation graph, and primitives
provide the rules for evaluation and transformations of said graph. Let's start
by discussing operations in more detail.
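That one-sentence split can be pictured with a toy sketch; every name below is hypothetical, not MLX's actual classes. An operation only records a node (its inputs plus a primitive), and evaluation later walks the graph applying each primitive's rule:

```python
# Toy sketch of "operations build the graph, primitives provide the rules".
# All names here are hypothetical illustrations, not MLX's actual classes.
class Node:
    def __init__(self, primitive, inputs):
        self.primitive, self.inputs = primitive, inputs

def add_op(a, b):
    # The "operation": it only records a node in the graph.
    return Node(lambda x, y: x + y, [a, b])

def evaluate(node):
    # "Evaluation": recursively apply each primitive's rule to its inputs.
    vals = [evaluate(i) if isinstance(i, Node) else i for i in node.inputs]
    return node.primitive(*vals)

g = add_op(add_op(1.0, 2.0), 3.0)
print(evaluate(g))  # 6.0
```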

Operations
^^^^^^^^^^

Operations are the frontend functions that operate on arrays. They are defined
in the C++ API (:ref:`cpp_ops`), and then we provide bindings to these
operations in the Python API (:ref:`ops`).

We would like an operation, :meth:`axpby`, that takes in two arrays, ``x`` and ``y``,
and two scalars, ``alpha`` and ``beta``. This is how we would define it in the
C++ API:

.. code-block:: C++

   /**
   * Scale and sum two vectors element-wise
   * z = alpha * x + beta * y
   *
   * Follows numpy-style broadcasting between x and y.
   * Inputs are upcast to floats if needed.
   **/
   array axpby(
       const array& x, // Input array x
       const array& y, // Input array y
       const float alpha, // Scaling factor for x
       const float beta, // Scaling factor for y
       StreamOrDevice s = {} // Stream on which to schedule the operation
   );

This operation itself can call other operations within it if needed. So, the
simplest way to implement this operation would be to do so in terms of
existing operations.

.. code-block:: C++

   array axpby(
       const array& x, // Input array x
       const array& y, // Input array y
       const float alpha, // Scaling factor for x
       const float beta, // Scaling factor for y
       StreamOrDevice s /* = {} */ // Stream on which to schedule the operation
   ) {
       // Scale x and y on the provided stream
       auto ax = multiply(array(alpha), x, s);
       auto by = multiply(array(beta), y, s);

       // Add and return
       return add(ax, by, s);
   }
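The ``axpby`` declaration above promises numpy-style broadcasting between ``x`` and ``y``. As a reminder of what that buys, a hedged illustration using numpy as a stand-in for ``mlx.core``:

```python
import numpy as np

# Numpy-style broadcasting: a (3, 1) column and a (2,) row broadcast to (3, 2).
# Same math as axpby with alpha=2.0, beta=0.5, numpy standing in for mlx.core.
x = np.ones((3, 1))
y = np.arange(2.0)          # shape (2,)
z = 2.0 * x + 0.5 * y
print(z.shape)  # (3, 2)
```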

However, as we discussed earlier, this is not our goal. The operations themselves
do not contain the implementations that act on the data, nor do they contain the
rules of transformations. Rather, they are an easy-to-use interface that builds
on top of the building blocks we call :class:`Primitive`.

Primitives
^^^^^^^^^^

A :class:`Primitive` is part of the computation graph of an :class:`array`. It
defines how to create an output given a set of input :class:`array` objects. Further,
a :class:`Primitive` is a class that contains rules on how it is evaluated
on the CPU or GPU, and how it acts under transformations such as ``vjp`` and
``jvp``. These words on their own can be a bit abstract, so let's take a step
back and go to our example to give ourselves a more concrete image.

.. code-block:: C++

   class Axpby : public Primitive {
    public:
     explicit Axpby(Stream stream, float alpha, float beta)
         : Primitive(stream), alpha_(alpha), beta_(beta){};

     /**
      * A primitive must know how to evaluate itself on the CPU/GPU
      * for the given inputs and populate the output array.
      *
      * To avoid unnecessary allocations, the evaluation function
      * is responsible for allocating space for the array.
      */
     void eval_cpu(const std::vector<array>& inputs, array& out) override;
     void eval_gpu(const std::vector<array>& inputs, array& out) override;

     /** The Jacobian-vector product. */
     array jvp(
         const std::vector<array>& primals,
         const std::vector<array>& tangents,
         const std::vector<int>& argnums) override;

     /** The vector-Jacobian product. */
     std::vector<array> vjp(
         const std::vector<array>& primals,
         const array& cotan,
         const std::vector<int>& argnums) override;

     /**
      * The primitive must know how to vectorize itself across
      * the given axes. The output is a pair containing the array
      * representing the vectorized computation and the axis which
      * corresponds to the output vectorized dimension.
      */
     std::pair<array, int> vmap(
         const std::vector<array>& inputs,
         const std::vector<int>& axes) override;

     /** Print the primitive. */
     void print(std::ostream& os) override {
       os << "Axpby";
     }

     /** Equivalence check **/
     bool is_equivalent(const Primitive& other) const override;

    private:
     float alpha_;
     float beta_;

     /** Fall back implementation for evaluation on CPU */
     void eval(const std::vector<array>& inputs, array& out);
   };

The :class:`Axpby` class derives from the base :class:`Primitive` class and
follows the above demonstrated interface. :class:`Axpby` treats ``alpha`` and
``beta`` as parameters. It then provides implementations of how the array ``out``
is produced given ``inputs`` through :meth:`Axpby::eval_cpu` and
:meth:`Axpby::eval_gpu`. Further, it provides rules of transformations in
:meth:`Axpby::jvp`, :meth:`Axpby::vjp`, and :meth:`Axpby::vmap`.
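Since ``z = alpha * x + beta * y`` is linear in its inputs, the transformation rules we will need are easy to state: the ``jvp`` is ``alpha * dx + beta * dy``, and the ``vjp`` routes the cotangent back as ``alpha * cotan`` for ``x`` and ``beta * cotan`` for ``y``. A plain-Python sanity check of the ``jvp`` rule against a finite difference (hypothetical helper names, no MLX involved):

```python
# Hypothetical helpers, plain Python lists: check the jvp rule for axpby
# against a one-sided finite difference. Exact up to float error (linear map).
def axpby(x, y, alpha, beta):
    return [alpha * xi + beta * yi for xi, yi in zip(x, y)]

def axpby_jvp(dx, dy, alpha, beta):
    # Linearity: the directional derivative reuses the same formula.
    return [alpha * di + beta * dj for di, dj in zip(dx, dy)]

x, y = [1.0, 2.0], [3.0, 4.0]
dx, dy = [1.0, 0.0], [0.0, 1.0]
alpha, beta = 2.0, 0.5
eps = 1e-6
fd = [(a - b) / eps for a, b in zip(
    axpby([xi + eps * di for xi, di in zip(x, dx)],
          [yi + eps * dj for yi, dj in zip(y, dy)], alpha, beta),
    axpby(x, y, alpha, beta))]
jvp = axpby_jvp(dx, dy, alpha, beta)
assert all(abs(a - b) < 1e-4 for a, b in zip(fd, jvp))
```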

Using the Primitives
^^^^^^^^^^^^^^^^^^^^

Operations can use this :class:`Primitive` to add a new :class:`array` to
the computation graph. An :class:`array` can be constructed by providing its
data type, shape, the :class:`Primitive` that computes it, and the
:class:`array` inputs that are passed to the primitive.

Let's re-implement our operation now in terms of our :class:`Axpby` primitive.

.. code-block:: C++

   array axpby(
<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubh8)<29><>}<7D>(h<05>:class:`array`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16>array<61><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-class<73>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>class<73><73> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>array<61>uh+h7hh,hK<>hjubh<16>) inputs that are passed to the primitive.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hK<>hjhhubh.)<29><>}<7D>(h<05>NLet's re-implement our operation now in terms of our :class:`Axpby` primitive.<2E>h]<5D>(h<16>7Lets re-implement our operation now in terms of our <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:class:`Axpby`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16>Axpby<62><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-class<73>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>class<73><73> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>Axpby<62>uh+h7hh,hK<>hj<>ubh<16> primitive.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hK<>hjhhubh<62>)<29><>}<7D>(hX8array axpby(
const array& x, // Input array x
const array& y, // Input array y
const float alpha, // Scaling factor for x
const float beta, // Scaling factor for y
StreamOrDevice s /* = {} */ // Stream on which to schedule the operation
) {
// Promote dtypes between x and y as needed
auto promoted_dtype = promote_types(x.dtype(), y.dtype());
// Upcast to float32 for non-floating point inputs x and y
auto out_dtype = is_floating_point(promoted_dtype)
? promoted_dtype
: promote_types(promoted_dtype, float32);
// Cast x and y up to the determined dtype (on the same stream s)
auto x_casted = astype(x, out_dtype, s);
auto y_casted = astype(y, out_dtype, s);
// Broadcast the shapes of x and y (on the same stream s)
auto broadcasted_inputs = broadcast_arrays({x_casted, y_casted}, s);
auto out_shape = broadcasted_inputs[0].shape();
// Construct the array as the output of the Axpby primitive
// with the broadcasted and upcasted arrays as inputs
return array(
/* const std::vector<int>& shape = */ out_shape,
/* Dtype dtype = */ out_dtype,
/* std::unique_ptr<Primitive> primitive = */
std::make_unique<Axpby>(to_stream(s), alpha, beta),
/* const std::vector<array>& inputs = */ broadcasted_inputs);
}<7D>h]<5D>hX8array axpby(
const array& x, // Input array x
const array& y, // Input array y
const float alpha, // Scaling factor for x
const float beta, // Scaling factor for y
StreamOrDevice s /* = {} */ // Stream on which to schedule the operation
) {
// Promote dtypes between x and y as needed
auto promoted_dtype = promote_types(x.dtype(), y.dtype());
// Upcast to float32 for non-floating point inputs x and y
auto out_dtype = is_floating_point(promoted_dtype)
? promoted_dtype
: promote_types(promoted_dtype, float32);
// Cast x and y up to the determined dtype (on the same stream s)
auto x_casted = astype(x, out_dtype, s);
auto y_casted = astype(y, out_dtype, s);
// Broadcast the shapes of x and y (on the same stream s)
auto broadcasted_inputs = broadcast_arrays({x_casted, y_casted}, s);
auto out_shape = broadcasted_inputs[0].shape();
// Construct the array as the output of the Axpby primitive
// with the broadcasted and upcasted arrays as inputs
return array(
/* const std::vector<int>& shape = */ out_shape,
/* Dtype dtype = */ out_dtype,
/* std::unique_ptr<Primitive> primitive = */
std::make_unique<Axpby>(to_stream(s), alpha, beta),
/* const std::vector<array>& inputs = */ broadcasted_inputs);
}<7D><><EFBFBD><EFBFBD><EFBFBD>}<7D>hjsbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>C++<2B>h<EFBFBD>}<7D>uh+h<>hh,hK<>hjhhubh.)<29><>}<7D>(h<05>)This operation now handles the following:<3A>h]<5D>h<16>)This operation now handles the following:<3A><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hK<>hjhhubh <09>enumerated_list<73><74><EFBFBD>)<29><>}<7D>(hhh]<5D>(jy)<29><>}<7D>(h<05>/Upcast inputs and resolve the output data type.<2E>h]<5D>h.)<29><>}<7D>(hj+h]<5D>h<16>/Upcast inputs and resolve the output data type.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj-hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hK<>hj)ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhj&hhhh,hNubjy)<29><>}<7D>(h<05>2Broadcast the inputs and resolve the output shape.<2E>h]<5D>h.)<29><>}<7D>(hjBh]<5D>h<16>2Broadcast the inputs and resolve the output shape.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjDhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hK<>hj@ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhj&hhhh,hNubjy)<29><>}<7D>(h<05>WConstruct the primitive :class:`Axpby` using the given stream, ``alpha``, and ``beta``.<2E>h]<5D>h.)<29><>}<7D>(hjYh]<5D>(h<16>Construct the primitive <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj[hhhNhNubh8)<29><>}<7D>(h<05>:class:`Axpby`<60>h]<5D>h>)<29><>}<7D>(hjdh]<5D>h<16>Axpby<62><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjfhhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-class<73>eh%]<5D>h']<5D>h)]<5D>uh+h=hjbubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>jp<00>reftype<70><65>class<73><73> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>Axpby<62>uh+h7hh,hK<>hj[ubh<16> using the given stream, <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj[hhhNhNubh>)<29><>}<7D>(h<05> ``alpha``<60>h]<5D>h<16>alpha<68><61><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj[ubh<16>, and 
<20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj[hhhNhNubh>)<29><>}<7D>(h<05>``beta``<60>h]<5D>h<16>beta<74><61><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj[ubh<16>.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj[hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hK<>hjWubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhj&hhhh,hNubjy)<29><>}<7D>(h<05>HConstruct the output :class:`array` using the primitive and the inputs.
<EFBFBD>h]<5D>h.)<29><>}<7D>(h<05>GConstruct the output :class:`array` using the primitive and the inputs.<2E>h]<5D>(h<16>Construct the output <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:class:`array`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16>array<61><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-class<73>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>class<73><73> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>array<61>uh+h7hh,hK<>hj<>ubh<16>$ using the primitive and the inputs.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hK<>hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhj&hhhh,hNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>enumtype<70><65>arabic<69><63>prefix<69>h<06>suffix<69><78>.<2E>uh+j$hjhhhh,hK<>ubeh}<7D>(h!]<5D><>using-the-primitives<65>ah#]<5D>h%]<5D><>using the primitives<65>ah']<5D>h)]<5D>uh+h
hj6hhhh,hK<>ubeh}<7D>(h!]<5D><>operations-and-primitives<65>ah#]<5D>h%]<5D><>operations and primitives<65>ah']<5D>h)]<5D>uh+h
hh hhhh,hK4ubh )<29><>}<7D>(hhh]<5D>(h)<29><>}<7D>(h<05>Implementing the Primitive<76>h]<5D>h<16>Implementing the Primitive<76><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hhj hhhh,hK<>ubh.)<29><>}<7D>(hX@No computation happens when we call the operation alone. In effect, the
operation only builds the computation graph. When we evaluate the output
array, MLX schedules the execution of the computation graph, and calls
:meth:`Axpby::eval_cpu` or :meth:`Axpby::eval_gpu` depending on the
stream/device specified by the user.<2E>h]<5D>(h<16><>No computation happens when we call the operation alone. In effect, the
operation only builds the computation graph. When we evaluate the output
array, MLX schedules the execution of the computation graph, and calls
<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubh8)<29><>}<7D>(h<05>:meth:`Axpby::eval_cpu`<60>h]<5D>h>)<29><>}<7D>(hj(h]<5D>h<16>Axpby::eval_cpu()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj*hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj&ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j4<00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>Axpby::eval_cpu<70>uh+h7hh,hK<>hjubh<16> or <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubh8)<29><>}<7D>(h<05>:meth:`Axpby::eval_gpu`<60>h]<5D>h>)<29><>}<7D>(hjLh]<5D>h<16>Axpby::eval_gpu()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjNhhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hjJubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>jX<00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>Axpby::eval_gpu<70>uh+h7hh,hK<>hjubh<16>6 depending on the
stream/device specified by the user.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hK<>hj hhubh <09>warning<6E><67><EFBFBD>)<29><>}<7D>(h<05><>When :meth:`Primitive::eval_cpu` or :meth:`Primitive::eval_gpu` are called,
no memory has been allocated for the output array. It falls on the implementation
of these functions to allocate memory as needed<65>h]<5D>h.)<29><>}<7D>(h<05><>When :meth:`Primitive::eval_cpu` or :meth:`Primitive::eval_gpu` are called,
no memory has been allocated for the output array. It falls on the implementation
of these functions to allocate memory as needed<65>h]<5D>(h<16>When <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjzhhhNhNubh8)<29><>}<7D>(h<05>:meth:`Primitive::eval_cpu`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16>Primitive::eval_cpu()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>Primitive::eval_cpu<70>uh+h7hh,hK<>hjzubh<16> or <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjzhhhNhNubh8)<29><>}<7D>(h<05>:meth:`Primitive::eval_gpu`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16>Primitive::eval_gpu()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>Primitive::eval_gpu<70>uh+h7hh,hK<>hjzubh<16><> are called,
no memory has been allocated for the output array. It falls on the implementation
of these functions to allocate memory as needed<65><64><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjzhhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hK<>hjvubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jthj hhhh,hNubh )<29><>}<7D>(hhh]<5D>(h)<29><>}<7D>(h<05>Implementing the CPU Backend<6E>h]<5D>h<16>Implementing the CPU Backend<6E><64><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hhj<>hhhh,hK<>ubh.)<29><>}<7D>(h<05><>Let's start by trying to implement a naive and generic version of
:meth:`Axpby::eval_cpu`. We declared this as a private member function of
:class:`Axpby` earlier called :meth:`Axpby::eval`.<2E>h]<5D>(h<16>DLets start by trying to implement a naive and generic version of
<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:meth:`Axpby::eval_cpu`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16>Axpby::eval_cpu()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>Axpby::eval_cpu<70>uh+h7hh,hK<>hj<>ubh<16>3. We declared this as a private member function of
<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:class:`Axpby`<60>h]<5D>h>)<29><>}<7D>(hj h]<5D>h<16>Axpby<62><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-class<73>eh%]<5D>h']<5D>h)]<5D>uh+h=hj ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j! <00>reftype<70><65>class<73><73> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>Axpby<62>uh+h7hh,hK<>hj<>ubh<16> earlier called <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:meth:`Axpby::eval`<60>h]<5D>h>)<29><>}<7D>(hj9 h]<5D>h<16> Axpby::eval()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj; hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj7 ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>jE <00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E> Axpby::eval<61>uh+h7hh,hK<>hj<>ubh<16>.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hK<>hj<>hhubh.)<29><>}<7D>(h<05><>Our naive method will go over each element of the output array, find the
corresponding input elements of ``x`` and ``y`` and perform the operation
pointwise. This is captured in the templated function :meth:`axpby_impl`.<2E>h]<5D>(h<16>iOur naive method will go over each element of the output array, find the
corresponding input elements of <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hja hhhNhNubh>)<29><>}<7D>(h<05>``x``<60>h]<5D>h<16>x<><78><EFBFBD><EFBFBD><EFBFBD>}<7D>(hji hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hja ubh<16> and <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hja hhhNhNubh>)<29><>}<7D>(h<05>``y``<60>h]<5D>h<16>y<><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj{ hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hja ubh<16>Q and perform the operation
pointwise. This is captured in the templated function <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hja hhhNhNubh8)<29><>}<7D>(h<05>:meth:`axpby_impl`<60>h]<5D>h>)<29><>}<7D>(hj<> h]<5D>h<16> axpby_impl()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<> ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD> <00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>
axpby_impl<EFBFBD>uh+h7hh,hMhja ubh<16>.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hja hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hMhj<>hhubh<62>)<29><>}<7D>(hX<>template <typename T>
void axpby_impl(
const array& x,
const array& y,
array& out,
float alpha_,
float beta_) {
// We only allocate memory when we are ready to fill the output
// malloc_or_wait synchronously allocates available memory
// There may be a wait executed here if the allocation is requested
// under memory-pressured conditions
out.set_data(allocator::malloc_or_wait(out.nbytes()));
// Collect input and output data pointers
const T* x_ptr = x.data<T>();
const T* y_ptr = y.data<T>();
T* out_ptr = out.data<T>();
// Cast alpha and beta to the relevant types
T alpha = static_cast<T>(alpha_);
T beta = static_cast<T>(beta_);
// Do the element-wise operation for each output
for (size_t out_idx = 0; out_idx < out.size(); out_idx++) {
// Map linear indices to offsets in x and y
auto x_offset = elem_to_loc(out_idx, x.shape(), x.strides());
auto y_offset = elem_to_loc(out_idx, y.shape(), y.strides());
// We allocate the output to be contiguous and regularly strided
// (defaults to row major) and hence it doesn't need additional mapping
out_ptr[out_idx] = alpha * x_ptr[x_offset] + beta * y_ptr[y_offset];
}
}<7D>h]<5D>hX<>template <typename T>
void axpby_impl(
const array& x,
const array& y,
array& out,
float alpha_,
float beta_) {
// We only allocate memory when we are ready to fill the output
// malloc_or_wait synchronously allocates available memory
// There may be a wait executed here if the allocation is requested
// under memory-pressured conditions
out.set_data(allocator::malloc_or_wait(out.nbytes()));
// Collect input and output data pointers
const T* x_ptr = x.data<T>();
const T* y_ptr = y.data<T>();
T* out_ptr = out.data<T>();
// Cast alpha and beta to the relevant types
T alpha = static_cast<T>(alpha_);
T beta = static_cast<T>(beta_);
// Do the element-wise operation for each output
for (size_t out_idx = 0; out_idx < out.size(); out_idx++) {
// Map linear indices to offsets in x and y
auto x_offset = elem_to_loc(out_idx, x.shape(), x.strides());
auto y_offset = elem_to_loc(out_idx, y.shape(), y.strides());
// We allocate the output to be contiguous and regularly strided
// (defaults to row major) and hence it doesn't need additional mapping
out_ptr[out_idx] = alpha * x_ptr[x_offset] + beta * y_ptr[y_offset];
}
}<7D><><EFBFBD><EFBFBD><EFBFBD>}<7D>hj<> sbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>C++<2B>h<EFBFBD>}<7D>uh+h<>hh,hMhj<>hhubh.)<29><>}<7D>(hX
Now, we would like our implementation to be able to do this pointwise operation
for all incoming floating point arrays. Accordingly, we add dispatches for
``float32``, ``float16``, ``bfloat16`` and ``complex64``. We throw an error
if we encounter an unexpected type.<2E>h]<5D>(h<16><>Now, we would like our implementation to be able to do this pointwise operation
for all incoming floating point arrays. Accordingly, we add dispatches for
<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubh>)<29><>}<7D>(h<05> ``float32``<60>h]<5D>h<16>float32<33><32><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<> ubh<16>, <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubh>)<29><>}<7D>(h<05> ``float16``<60>h]<5D>h<16>float16<31><36><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<> ubh<16>, <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>hj<> sbh>)<29><>}<7D>(h<05> ``bfloat16``<60>h]<5D>h<16>bfloat16<31><36><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<> ubh<16> and <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubh>)<29><>}<7D>(h<05> ``complex64``<60>h]<5D>h<16> complex64<36><34><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj
hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<> ubh<16>7. We throw an error
if we encounter an unexpected type.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM+hj<>hhubh<62>)<29><>}<7D>(hXV/** Fall back implementation for evaluation on CPU */
void Axpby::eval(const std::vector<array>& inputs, array& out) {
// Check the inputs (registered in the op while constructing the out array)
assert(inputs.size() == 2);
auto& x = inputs[0];
auto& y = inputs[1];
// Dispatch to the correct dtype
if (out.dtype() == float32) {
return axpby_impl<float>(x, y, out, alpha_, beta_);
} else if (out.dtype() == float16) {
return axpby_impl<float16_t>(x, y, out, alpha_, beta_);
} else if (out.dtype() == bfloat16) {
return axpby_impl<bfloat16_t>(x, y, out, alpha_, beta_);
} else if (out.dtype() == complex64) {
return axpby_impl<complex64_t>(x, y, out, alpha_, beta_);
} else {
throw std::runtime_error(
"Axpby is only supported for floating point types.");
}
}<7D>h]<5D>hXV/** Fall back implementation for evaluation on CPU */
void Axpby::eval(const std::vector<array>& inputs, array& out) {
// Check the inputs (registered in the op while constructing the out array)
assert(inputs.size() == 2);
auto& x = inputs[0];
auto& y = inputs[1];
// Dispatch to the correct dtype
if (out.dtype() == float32) {
return axpby_impl<float>(x, y, out, alpha_, beta_);
} else if (out.dtype() == float16) {
return axpby_impl<float16_t>(x, y, out, alpha_, beta_);
} else if (out.dtype() == bfloat16) {
return axpby_impl<bfloat16_t>(x, y, out, alpha_, beta_);
} else if (out.dtype() == complex64) {
return axpby_impl<complex64_t>(x, y, out, alpha_, beta_);
} else {
throw std::runtime_error(
"Axpby is only supported for floating point types.");
}
}<7D><><EFBFBD><EFBFBD><EFBFBD>}<7D>hj
sbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>C++<2B>h<EFBFBD>}<7D>uh+h<>hh,hM0hj<>hhubh.)<29><>}<7D>(h<05><>We have a fallback implementation! Now, to do what we are really here to do.
Remember we wanted to use the ``axpby`` routine provided by the Accelerate_
framework? Well, there are 3 complications to keep in mind:<3A>h]<5D>(h<16>kWe have a fallback implementation! Now, to do what we are really here to do.
Remember we wanted to use the <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj-
hhhNhNubh>)<29><>}<7D>(h<05> ``axpby``<60>h]<5D>h<16>axpby<62><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj5
hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj-
ubh<16> routine provided by the <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj-
hhhNhNubjG)<29><>}<7D>(h<05> Accelerate_<65>h]<5D>h<16>
Accelerate<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjG
hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>name<6D><65>
Accelerate<EFBFBD>jXjYuh+jFhj-
jZKubh<16><
framework? Well, there are 3 complications to keep in mind:<3A><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj-
hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hMHhj<>hhubj%)<29><>}<7D>(hhh]<5D>(jy)<29><>}<7D>(h<05><>Accelerate does not provide implementations of ``axpby`` for half precision
floats. We can only direct to it for ``float32`` types<65>h]<5D>h.)<29><>}<7D>(h<05><>Accelerate does not provide implementations of ``axpby`` for half precision
floats. We can only direct to it for ``float32`` types<65>h]<5D>(h<16>/Accelerate does not provide implementations of <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjh
hhhNhNubh>)<29><>}<7D>(h<05> ``axpby``<60>h]<5D>h<16>axpby<62><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjp
hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hjh
ubh<16>9 for half precision
floats. We can only direct to it for <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjh
hhhNhNubh>)<29><>}<7D>(h<05> ``float32``<60>h]<5D>h<16>float32<33><32><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>
hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hjh
ubh<16> types<65><73><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjh
hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hMLhjd
ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhja
hhhh,hNubjy)<29><>}<7D>(hX6Accelerate assumes the inputs ``x`` and ``y`` are contiguous and all elements
have fixed strides between them. Possibly due to broadcasts and transposes,
we aren't guaranteed that the inputs fit this requirement. We can
only direct to Accelerate if both ``x`` and ``y`` are row contiguous or
column contiguous.<2E>h]<5D>h.)<29><>}<7D>(hX6Accelerate assumes the inputs ``x`` and ``y`` are contiguous and all elements
have fixed strides between them. Possibly due to broadcasts and transposes,
we aren't guaranteed that the inputs fit this requirement. We can
only direct to Accelerate if both ``x`` and ``y`` are row contiguous or
column contiguous.<2E>h]<5D>(h<16>Accelerate assumes the inputs <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>
hhhNhNubh>)<29><>}<7D>(h<05>``x``<60>h]<5D>h<16>x<><78><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>
hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<>
ubh<16> and <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>
hhhNhNubh>)<29><>}<7D>(h<05>``y``<60>h]<5D>h<16>y<><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>
hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<>
ubh<16><> are contiguous and all elements
have fixed strides between them. Possibly due to broadcasts and transposes,
we arent guaranteed that the inputs fit this requirement. We can
only direct to Accelerate if both <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>
hhhNhNubh>)<29><>}<7D>(h<05>``x``<60>h]<5D>h<16>x<><78><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>
hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<>
ubh<16> and <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>hj<>
sbh>)<29><>}<7D>(h<05>``y``<60>h]<5D>h<16>y<><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>
hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<>
ubh<16>) are row contiguous or
column contiguous.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>
hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hMNhj<>
ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhja
hhhh,hNubjy)<29><>}<7D>(h<05><>Accelerate performs the routine ``Y = (alpha * X) + (beta * Y)`` inplace.
MLX expects to write out the answer to a new array. We must copy the elements
of ``y`` into the output array and use that as an input to ``axpby``
<EFBFBD>h]<5D>h.)<29><>}<7D>(h<05><>Accelerate performs the routine ``Y = (alpha * X) + (beta * Y)`` inplace.
MLX expects to write out the answer to a new array. We must copy the elements
of ``y`` into the output array and use that as an input to ``axpby``<60>h]<5D>(h<16> Accelerate performs the routine <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj hhhNhNubh>)<29><>}<7D>(h<05> ``Y = (alpha * X) + (beta * Y)``<60>h]<5D>h<16>Y = (alpha * X) + (beta * Y)<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj ubh<16>[ inplace.
MLX expects to write out the answer to a new array. We must copy the elements
of <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj hhhNhNubh>)<29><>}<7D>(h<05>``y``<60>h]<5D>h<16>y<><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj ubh<16>3 into the output array and use that as an input to <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj hhhNhNubh>)<29><>}<7D>(h<05> ``axpby``<60>h]<5D>h<16>axpby<62><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj0 hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj ubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hMShj ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhja
hhhh,hNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>j<EFBFBD>j<>j<>hj<>j<>uh+j$hj<>hhhh,hMLubh.)<29><>}<7D>(h<05><>Let's write out an implementation that uses Accelerate in the right conditions.
It must simply allocate data for the output, copy elements of ``y`` into it,
and then call the :meth:`catlas_saxpby` from accelerate.<2E>h]<5D>(h<16><>Lets write out an implementation that uses Accelerate in the right conditions.
It must simply allocate data for the output, copy elements of <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjP hhhNhNubh>)<29><>}<7D>(h<05>``y``<60>h]<5D>h<16>y<><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjX hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hjP ubh<16> into it,
and then call the <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjP hhhNhNubh8)<29><>}<7D>(h<05>:meth:`catlas_saxpby`<60>h]<5D>h>)<29><>}<7D>(hjl h]<5D>h<16>catlas_saxpby()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjn hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hjj ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>jx <00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E> catlas_saxpby<62>uh+h7hh,hMWhjP ubh<16> from accelerate.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjP hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hMWhj<>hhubh<62>)<29><>}<7D>(hXltemplate <typename T>
void axpby_impl_accelerate(
const array& x,
const array& y,
array& out,
float alpha_,
float beta_) {
// Accelerate library provides catlas_saxpby which does
// Y = (alpha * X) + (beta * Y) in place
// To use it, we first copy the data in y over to the output array
// This specialization requires both x and y be contiguous in the same mode
// i.e: corresponding linear indices in both point to corresponding elements
// The data in the output array is allocated to match the strides in y
// such that x, y, and out are contiguous in the same mode and
// no transposition is needed
out.set_data(
allocator::malloc_or_wait(y.data_size() * out.itemsize()),
y.data_size(),
y.strides(),
y.flags());
// We then copy over the elements using the contiguous vector specialization
copy_inplace(y, out, CopyType::Vector);
// Get x and y pointers for catlas_saxpby
const T* x_ptr = x.data<T>();
T* y_ptr = out.data<T>();
T alpha = static_cast<T>(alpha_);
T beta = static_cast<T>(beta_);
// Call the inplace accelerate operator
catlas_saxpby(
/* N = */ out.size(),
/* ALPHA = */ alpha,
/* X = */ x_ptr,
/* INCX = */ 1,
/* BETA = */ beta,
/* Y = */ y_ptr,
/* INCY = */ 1);
}<7D>h]<5D>hXltemplate <typename T>
void axpby_impl_accelerate(
const array& x,
const array& y,
array& out,
float alpha_,
float beta_) {
// Accelerate library provides catlas_saxpby which does
// Y = (alpha * X) + (beta * Y) in place
// To use it, we first copy the data in y over to the output array
// This specialization requires both x and y be contiguous in the same mode
// i.e: corresponding linear indices in both point to corresponding elements
// The data in the output array is allocated to match the strides in y
// such that x, y, and out are contiguous in the same mode and
// no transposition is needed
out.set_data(
allocator::malloc_or_wait(y.data_size() * out.itemsize()),
y.data_size(),
y.strides(),
y.flags());
// We then copy over the elements using the contiguous vector specialization
copy_inplace(y, out, CopyType::Vector);
// Get x and y pointers for catlas_saxpby
const T* x_ptr = x.data<T>();
T* y_ptr = out.data<T>();
T alpha = static_cast<T>(alpha_);
T beta = static_cast<T>(beta_);
// Call the inplace accelerate operator
catlas_saxpby(
/* N = */ out.size(),
/* ALPHA = */ alpha,
/* X = */ x_ptr,
/* INCX = */ 1,
/* BETA = */ beta,
/* Y = */ y_ptr,
/* INCY = */ 1);
}<7D><><EFBFBD><EFBFBD><EFBFBD>}<7D>hj<> sbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>C++<2B>h<EFBFBD>}<7D>uh+h<>hh,hM[hj<>hhubh.)<29><>}<7D>(h<05><>Great! But what about the inputs that do not fit the criteria for accelerate?
Luckily, we can always just direct back to :meth:`Axpby::eval`.<2E>h]<5D>(h<16>yGreat! But what about the inputs that do not fit the criteria for accelerate?
Luckily, we can always just direct back to <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubh8)<29><>}<7D>(h<05>:meth:`Axpby::eval`<60>h]<5D>h>)<29><>}<7D>(hj<> h]<5D>h<16> Axpby::eval()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<> ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD> <00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E> Axpby::eval<61>uh+h7hh,hM<>hj<> ubh<16>.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<>hhubh.)<29><>}<7D>(h<05>FWith this in mind, lets finally implement our :meth:`Axpby::eval_cpu`.<2E>h]<5D>(h<16>.With this in mind, lets finally implement our <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubh8)<29><>}<7D>(h<05>:meth:`Axpby::eval_cpu`<60>h]<5D>h>)<29><>}<7D>(hj<> h]<5D>h<16>Axpby::eval_cpu()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<> ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD> <00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>Axpby::eval_cpu<70>uh+h7hh,hM<>hj<> ubh<16>.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<>hhubh<62>)<29><>}<7D>(hX<>/** Evaluate primitive on CPU using accelerate specializations */
void Axpby::eval_cpu(const std::vector<array>& inputs, array& out) {
assert(inputs.size() == 2);
auto& x = inputs[0];
auto& y = inputs[1];
// Accelerate specialization for contiguous single precision float arrays
if (out.dtype() == float32 &&
((x.flags().row_contiguous && y.flags().row_contiguous) ||
(x.flags().col_contiguous && y.flags().col_contiguous))) {
axpby_impl_accelerate<float>(x, y, out, alpha_, beta_);
return;
}
// Fall back to common backend if specializations are not available
eval(inputs, out);
}<7D>h]<5D>hX<>/** Evaluate primitive on CPU using accelerate specializations */
void Axpby::eval_cpu(const std::vector<array>& inputs, array& out) {
assert(inputs.size() == 2);
auto& x = inputs[0];
auto& y = inputs[1];
// Accelerate specialization for contiguous single precision float arrays
if (out.dtype() == float32 &&
((x.flags().row_contiguous && y.flags().row_contiguous) ||
(x.flags().col_contiguous && y.flags().col_contiguous))) {
axpby_impl_accelerate<float>(x, y, out, alpha_, beta_);
return;
}
// Fall back to common backend if specializations are not available
eval(inputs, out);
}<7D><><EFBFBD><EFBFBD><EFBFBD>}<7D>hj sbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>C++<2B>h<EFBFBD>}<7D>uh+h<>hh,hM<>hj<>hhubh.)<29><>}<7D>(h<05>iWe have now hit a milestone! Just this much is enough to run the operation
:meth:`axpby` on a CPU stream!<21>h]<5D>(h<16>KWe have now hit a milestone! Just this much is enough to run the operation
<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj hhhNhNubh8)<29><>}<7D>(h<05> :meth:`axpby`<60>h]<5D>h>)<29><>}<7D>(hj" h]<5D>h<16>axpby()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj$ hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j. <00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>axpby<62>uh+h7hh,hM<>hj ubh<16> on a CPU stream!<21><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<>hhubh.)<29><>}<7D>(h<05><>If you do not plan on running the operation on the GPU or using transforms on
computation graphs that contain :class:`Axpby`, you can stop implementing the
primitive here and enjoy the speed-ups you get from the Accelerate library.<2E>h]<5D>(h<16>nIf you do not plan on running the operation on the GPU or using transforms on
computation graphs that contain <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjJ hhhNhNubh8)<29><>}<7D>(h<05>:class:`Axpby`<60>h]<5D>h>)<29><>}<7D>(hjT h]<5D>h<16>Axpby<62><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjV hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-class<73>eh%]<5D>h']<5D>h)]<5D>uh+h=hjR ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j` <00>reftype<70><65>class<73><73> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>Axpby<62>uh+h7hh,hM<>hjJ ubh<16>k, you can stop implementing the
primitive here and enjoy the speed-ups you get from the Accelerate library.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjJ hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<>hhubeh}<7D>(h!]<5D><>implementing-the-cpu-backend<6E>ah#]<5D>h%]<5D><>implementing the cpu backend<6E>ah']<5D>h)]<5D>uh+h
hj hhhh,hK<>ubh )<29><>}<7D>(hhh]<5D>(h)<29><>}<7D>(h<05>Implementing the GPU Backend<6E>h]<5D>h<16>Implementing the GPU Backend<6E><64><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hhj<> hhhh,hM<>ubh.)<29><>}<7D>(h<05>Apple silicon devices address their GPUs using the Metal_ shading language, and
all GPU kernels in MLX are written using metal.<2E>h]<5D>(h<16>3Apple silicon devices address their GPUs using the <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubjG)<29><>}<7D>(h<05>Metal_<6C>h]<5D>h<16>Metal<61><6C><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>name<6D><65>Metal<61>jX<00>=https://developer.apple.com/documentation/metal?language=objc<6A>uh+jFhj<> jZKubh<16>F shading language, and
all GPU kernels in MLX are written using metal.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<> hhubh <09>note<74><65><EFBFBD>)<29><>}<7D>(h<05><>Here are some helpful resources if you are new to metal!
* A walkthrough of the metal compute pipeline: `Metal Example`_
* Documentation for metal shading language: `Metal Specification`_
* Using metal from C++: `Metal-cpp`_<>h]<5D>(h.)<29><>}<7D>(h<05>8Here are some helpful resources if you are new to metal!<21>h]<5D>h<16>8Here are some helpful resources if you are new to metal!<21><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<> ubjt)<29><>}<7D>(hhh]<5D>(jy)<29><>}<7D>(h<05>=A walkthrough of the metal compute pipeline: `Metal Example`_<>h]<5D>h.)<29><>}<7D>(hj<> h]<5D>(h<16>-A walkthrough of the metal compute pipeline: <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubjG)<29><>}<7D>(h<05>`Metal Example`_<>h]<5D>h<16> Metal Example<6C><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>name<6D><65> Metal Example<6C>jX<00>^https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu?language=objc<6A>uh+jFhj<> jZKubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<> ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhj<> ubjy)<29><>}<7D>(h<05>@Documentation for metal shading language: `Metal Specification`_<>h]<5D>h.)<29><>}<7D>(hj<> h]<5D>(h<16>*Documentation for metal shading language: <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubjG)<29><>}<7D>(h<05>`Metal Specification`_<>h]<5D>h<16>Metal Specification<6F><6E><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>name<6D><65>Metal Specification<6F>jX<00>Jhttps://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf<64>uh+jFhj<> jZKubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<> ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhj<> ubjy)<29><>}<7D>(h<05>"Using metal from C++: `Metal-cpp`_<>h]<5D>h.)<29><>}<7D>(hj! 
h]<5D>(h<16>Using metal from C++: <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj# hhhNhNubjG)<29><>}<7D>(h<05> `Metal-cpp`_<>h]<5D>h<16> Metal-cpp<70><70><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj* hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>name<6D><65> Metal-cpp<70>jX<00>&https://developer.apple.com/metal/cpp/<2F>uh+jFhj# jZKubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhj<> ubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>j,j-uh+jshh,hM<>hj<> ubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+j<> hj<> hhhNhNubh.)<29><>}<7D>(hXLet's keep the GPU algorithm simple. We will launch exactly as many threads
as there are elements in the output. Each thread will pick the element it needs
from ``x`` and ``y``, do the pointwise operation, and then update its assigned
element in the output.<2E>h]<5D>(h<16><>Lets keep the GPU algorithm simple. We will launch exactly as many threads
as there are elements in the output. Each thread will pick the element it needs
from <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjS hhhNhNubh>)<29><>}<7D>(h<05>``x``<60>h]<5D>h<16>x<><78><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj[ hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hjS ubh<16> and <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjS hhhNhNubh>)<29><>}<7D>(h<05>``y``<60>h]<5D>h<16>y<><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjm hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hjS ubh<16>Q, do the pointwise operation, and then update its assigned
element in the output.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjS hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<> hhubh<62>)<29><>}<7D>(hXYtemplate <typename T>
[[kernel]] void axpby_general(
device const T* x [[buffer(0)]],
device const T* y [[buffer(1)]],
device T* out [[buffer(2)]],
constant const float& alpha [[buffer(3)]],
constant const float& beta [[buffer(4)]],
constant const int* shape [[buffer(5)]],
constant const size_t* x_strides [[buffer(6)]],
constant const size_t* y_strides [[buffer(7)]],
constant const int& ndim [[buffer(8)]],
uint index [[thread_position_in_grid]]) {
// Convert linear indices to offsets in array
auto x_offset = elem_to_loc(index, shape, x_strides, ndim);
auto y_offset = elem_to_loc(index, shape, y_strides, ndim);
// Do the operation and update the output
out[index] =
static_cast<T>(alpha) * x[x_offset] + static_cast<T>(beta) * y[y_offset];
}<7D>h]<5D>hXYtemplate <typename T>
[[kernel]] void axpby_general(
device const T* x [[buffer(0)]],
device const T* y [[buffer(1)]],
device T* out [[buffer(2)]],
constant const float& alpha [[buffer(3)]],
constant const float& beta [[buffer(4)]],
constant const int* shape [[buffer(5)]],
constant const size_t* x_strides [[buffer(6)]],
constant const size_t* y_strides [[buffer(7)]],
constant const int& ndim [[buffer(8)]],
uint index [[thread_position_in_grid]]) {
// Convert linear indices to offsets in array
auto x_offset = elem_to_loc(index, shape, x_strides, ndim);
auto y_offset = elem_to_loc(index, shape, y_strides, ndim);
// Do the operation and update the output
out[index] =
static_cast<T>(alpha) * x[x_offset] + static_cast<T>(beta) * y[y_offset];
}<7D><><EFBFBD><EFBFBD><EFBFBD>}<7D>hj<> sbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>C++<2B>h<EFBFBD>}<7D>uh+h<>hh,hM<>hj<> hhubh.)<29><>}<7D>(h<05><>We then need to instantiate this template for all floating point types and give
each instantiation a unique host name so we can identify the right kernel for
each data type.<2E>h]<5D>h<16><>We then need to instantiate this template for all floating point types and give
each instantiation a unique host name so we can identify the right kernel for
each data type.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<> hhubh<62>)<29><>}<7D>(hX<>#define instantiate_axpby(type_name, type) \
template [[host_name("axpby_general_" #type_name)]] \
[[kernel]] void axpby_general<type>( \
device const type* x [[buffer(0)]], \
device const type* y [[buffer(1)]], \
device type* out [[buffer(2)]], \
constant const float& alpha [[buffer(3)]], \
constant const float& beta [[buffer(4)]], \
constant const int* shape [[buffer(5)]], \
constant const size_t* x_strides [[buffer(6)]], \
constant const size_t* y_strides [[buffer(7)]], \
constant const int& ndim [[buffer(8)]], \
uint index [[thread_position_in_grid]]);
instantiate_axpby(float32, float);
instantiate_axpby(float16, half);
instantiate_axpby(bfloat16, bfloat16_t);
instantiate_axpby(complex64, complex64_t);<3B>h]<5D>hX<>#define instantiate_axpby(type_name, type) \
template [[host_name("axpby_general_" #type_name)]] \
[[kernel]] void axpby_general<type>( \
device const type* x [[buffer(0)]], \
device const type* y [[buffer(1)]], \
device type* out [[buffer(2)]], \
constant const float& alpha [[buffer(3)]], \
constant const float& beta [[buffer(4)]], \
constant const int* shape [[buffer(5)]], \
constant const size_t* x_strides [[buffer(6)]], \
constant const size_t* y_strides [[buffer(7)]], \
constant const int& ndim [[buffer(8)]], \
uint index [[thread_position_in_grid]]);
instantiate_axpby(float32, float);
instantiate_axpby(float16, half);
instantiate_axpby(bfloat16, bfloat16_t);
instantiate_axpby(complex64, complex64_t);<3B><><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>hj<> sbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>C++<2B>h<EFBFBD>}<7D>uh+h<>hh,hM<>hj<> hhubh.)<29><>}<7D>(hX<>This kernel will be compiled into a metal library ``mlx_ext.metallib`` as we
will see later in :ref:`Building with CMake`. In the following example, we
assume that the library ``mlx_ext.metallib`` will always be co-located with
the executable or shared library calling the :meth:`register_library` function.
The :meth:`register_library` function takes the library's name and potential
path (or in this case, a function that can produce the path of the metal
library) and tries to load that library if it hasn't already been registered
by the relevant static :class:`mlx::core::metal::Device` object. This is why
it is important to package your C++ library with the metal library. We will
go over this process in more detail later.<2E>h]<5D>(h<16>2This kernel will be compiled into a metal library <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubh>)<29><>}<7D>(h<05>``mlx_ext.metallib``<60>h]<5D>h<16>mlx_ext.metallib<69><62><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<> ubh<16> as we
will see later in <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubh8)<29><>}<7D>(h<05>:ref:`Building with CMake`<60>h]<5D>js)<29><>}<7D>(hj<> h]<5D>h<16>Building with CMake<6B><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>std<74><64>std-ref<65>eh%]<5D>h']<5D>h)]<5D>uh+jrhj<> ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD> <00>reftype<70><65>ref<65><66> refexplicit<69><74><EFBFBD>refwarn<72><6E>h^<5E>building with cmake<6B>uh+h7hh,hM<>hj<> ubh<16>7. In the following example, we
assume that the library <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubh>)<29><>}<7D>(h<05>``mlx_ext.metallib``<60>h]<5D>h<16>mlx_ext.metallib<69><62><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<> ubh<16>K will always be co-located with
the executable/ shared-library calling the <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubh8)<29><>}<7D>(h<05>:meth:`register_library`<60>h]<5D>h>)<29><>}<7D>(hjh]<5D>h<16>register_library()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hjubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>register_library<72>uh+h7hh,hM<>hj<> ubh<16> function.
The <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubh8)<29><>}<7D>(h<05>:meth:`register_library`<60>h]<5D>h>)<29><>}<7D>(hj)h]<5D>h<16>register_library()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj+hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj'ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j5<00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>register_library<72>uh+h7hh,hM<>hj<> ubh<16><> function takes the librarys name and potential
path (or in this case, a function that can produce the path of the metal
library) and tries to load that library if it hasnt already been registered
by the relevant static <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubh8)<29><>}<7D>(h<05>!:class:`mlx::core::metal::Device`<60>h]<5D>h>)<29><>}<7D>(hjMh]<5D>h<16>mlx::core::metal::Device<63><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjOhhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-class<73>eh%]<5D>h']<5D>h)]<5D>uh+h=hjKubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>jY<00>reftype<70><65>class<73><73> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>mlx::core::metal::Device<63>uh+h7hh,hM<>hj<> ubh<16><> object. This is why,
it is important to package your C++ library with the metal library. We will
go over this process in more detail later.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<> hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<> hhubh.)<29><>}<7D>(h<05><>The logic to determine the kernel, set the inputs, resolve the grid dimensions
and dispatch it to the GPU is contained in :meth:`Axpby::eval_gpu` as shown
below.<2E>h]<5D>(h<16>{The logic to determine the kernel, set the inputs, resolve the grid dimensions
and dispatch it to the GPU are contained in <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjuhhhNhNubh8)<29><>}<7D>(h<05>:meth:`Axpby::eval_gpu`<60>h]<5D>h>)<29><>}<7D>(hjh]<5D>h<16>Axpby::eval_gpu()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj}ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>Axpby::eval_gpu<70>uh+h7hh,hM<>hjuubh<16> as shown
below.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjuhhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<> hhubh<62>)<29><>}<7D>(hX4
/** Evaluate primitive on GPU */
void Axpby::eval_gpu(const std::vector<array>& inputs, array& out) {
// Prepare inputs
assert(inputs.size() == 2);
auto& x = inputs[0];
auto& y = inputs[1];
// Each primitive carries the stream it should execute on
// and each stream carries its device identifiers
auto& s = stream();
// We get the needed metal device using the stream
auto& d = metal::device(s.device);
// Allocate output memory
out.set_data(allocator::malloc_or_wait(out.nbytes()));
// Resolve name of kernel (corresponds to axpby.metal)
std::ostringstream kname;
kname << "axpby_" << "general_" << type_to_name(out);
// Make sure the metal library is available and look for it
// in the same folder as this executable if needed
d.register_library("mlx_ext", metal::get_colocated_mtllib_path);
// Make a kernel from this metal library
auto kernel = d.get_kernel(kname.str(), "mlx_ext");
// Prepare to encode kernel
auto compute_encoder = d.get_command_encoder(s.index);
compute_encoder->setComputePipelineState(kernel);
// Kernel parameters are registered with buffer indices corresponding to
// those in the kernel declaration at axpby.metal
int ndim = out.ndim();
size_t nelem = out.size();
// Encode input arrays to kernel
set_array_buffer(compute_encoder, x, 0);
set_array_buffer(compute_encoder, y, 1);
// Encode output arrays to kernel
set_array_buffer(compute_encoder, out, 2);
// Encode alpha and beta
compute_encoder->setBytes(&alpha_, sizeof(float), 3);
compute_encoder->setBytes(&beta_, sizeof(float), 4);
// Encode shape, strides and ndim
compute_encoder->setBytes(x.shape().data(), ndim * sizeof(int), 5);
compute_encoder->setBytes(x.strides().data(), ndim * sizeof(size_t), 6);
compute_encoder->setBytes(y.strides().data(), ndim * sizeof(size_t), 7);
compute_encoder->setBytes(&ndim, sizeof(int), 8);
// We launch 1 thread for each output element and make sure that the number of
// threads in any given threadgroup is not higher than the max allowed
size_t tgp_size = std::min(nelem, kernel->maxTotalThreadsPerThreadgroup());
// Fix the 3D size of each threadgroup (in terms of threads)
MTL::Size group_dims = MTL::Size(tgp_size, 1, 1);
// Fix the 3D size of the launch grid (in terms of threads)
MTL::Size grid_dims = MTL::Size(nelem, 1, 1);
// Launch the grid with the given number of threads divided among
// the given threadgroups
compute_encoder->dispatchThreads(grid_dims, group_dims);
}<7D>h]<5D>hX4
/** Evaluate primitive on GPU */
void Axpby::eval_gpu(const std::vector<array>& inputs, array& out) {
// Prepare inputs
assert(inputs.size() == 2);
auto& x = inputs[0];
auto& y = inputs[1];
// Each primitive carries the stream it should execute on
// and each stream carries its device identifiers
auto& s = stream();
// We get the needed metal device using the stream
auto& d = metal::device(s.device);
// Allocate output memory
out.set_data(allocator::malloc_or_wait(out.nbytes()));
// Resolve name of kernel (corresponds to axpby.metal)
std::ostringstream kname;
kname << "axpby_" << "general_" << type_to_name(out);
// Make sure the metal library is available and look for it
// in the same folder as this executable if needed
d.register_library("mlx_ext", metal::get_colocated_mtllib_path);
// Make a kernel from this metal library
auto kernel = d.get_kernel(kname.str(), "mlx_ext");
// Prepare to encode kernel
auto compute_encoder = d.get_command_encoder(s.index);
compute_encoder->setComputePipelineState(kernel);
// Kernel parameters are registered with buffer indices corresponding to
// those in the kernel declaration at axpby.metal
int ndim = out.ndim();
size_t nelem = out.size();
// Encode input arrays to kernel
set_array_buffer(compute_encoder, x, 0);
set_array_buffer(compute_encoder, y, 1);
// Encode output arrays to kernel
set_array_buffer(compute_encoder, out, 2);
// Encode alpha and beta
compute_encoder->setBytes(&alpha_, sizeof(float), 3);
compute_encoder->setBytes(&beta_, sizeof(float), 4);
// Encode shape, strides and ndim
compute_encoder->setBytes(x.shape().data(), ndim * sizeof(int), 5);
compute_encoder->setBytes(x.strides().data(), ndim * sizeof(size_t), 6);
compute_encoder->setBytes(y.strides().data(), ndim * sizeof(size_t), 7);
compute_encoder->setBytes(&ndim, sizeof(int), 8);
// We launch 1 thread for each output element and make sure that the number of
// threads in any given threadgroup is not higher than the max allowed
size_t tgp_size = std::min(nelem, kernel->maxTotalThreadsPerThreadgroup());
// Fix the 3D size of each threadgroup (in terms of threads)
MTL::Size group_dims = MTL::Size(tgp_size, 1, 1);
// Fix the 3D size of the launch grid (in terms of threads)
MTL::Size grid_dims = MTL::Size(nelem, 1, 1);
// Launch the grid with the given number of threads divided among
// the given threadgroups
compute_encoder->dispatchThreads(grid_dims, group_dims);
}<7D><><EFBFBD><EFBFBD><EFBFBD>}<7D>hj<>sbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>C++<2B>h<EFBFBD>}<7D>uh+h<>hh,hM<>hj<> hhubh.)<29><>}<7D>(h<05>HWe can now call the :meth:`axpby` operation on both the CPU and the GPU!<21>h]<5D>(h<16>We can now call the <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05> :meth:`axpby`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16>axpby()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>axpby<62>uh+h7hh,hM@hj<>ubh<16>' operation on both the CPU and the GPU!<21><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM@hj<> hhubh.)<29><>}<7D>(hX<>A few things to note about MLX and metal before moving on. MLX keeps track
of the active ``compute_encoder``. We rely on :meth:`d.get_command_encoder`
to give us the active metal compute command encoder instead of building a
new one and calling :meth:`compute_encoder->end_encoding` at the end.
MLX keeps adding kernels (compute pipelines) to the active command encoder
until some specified limit is hit or the compute encoder needs to be flushed
for synchronization. MLX also handles enqueuing and committing the associated
command buffers as needed. We suggest taking a deeper dive into
:class:`metal::Device` if you would like to study this routine further.<2E>h]<5D>(h<16>YA few things to note about MLX and metal before moving on. MLX keeps track
of the active <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh>)<29><>}<7D>(h<05>``compute_encoder``<60>h]<5D>h<16>compute_encoder<65><72><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubh<16> . We rely on <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:meth:`d.get_command_encoder`<60>h]<5D>h>)<29><>}<7D>(hjh]<5D>h<16>d.get_command_encoder()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hjubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>d.get_command_encoder<65>uh+h7hh,hMBhj<>ubh<16>_
to give us the active metal compute command encoder instead of building a
new one and calling <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>%:meth:`compute_encoder->end_encoding`<60>h]<5D>h>)<29><>}<7D>(hj)h]<5D>h<16>compute_encoder->end_encoding()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj+hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj'ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j5<00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>compute_encoder->end_encoding<6E>uh+h7hh,hMBhj<>ubhX3 at the end.
MLX keeps adding kernels (compute pipelines) to the active command encoder
until some specified limit is hit or the compute encoder needs to be flushed
for synchronization. MLX also handles enqueuing and committing the associated
command buffers as needed. We suggest taking a deeper dive into
<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:class:`metal::Device`<60>h]<5D>h>)<29><>}<7D>(hjMh]<5D>h<16> metal::Device<63><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjOhhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-class<73>eh%]<5D>h']<5D>h)]<5D>uh+h=hjKubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>jY<00>reftype<70><65>class<73><73> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E> metal::Device<63>uh+h7hh,hMBhj<>ubh<16>1 if you would like to study this routine further.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hMBhj<> hhubeh}<7D>(h!]<5D><>implementing-the-gpu-backend<6E>ah#]<5D>h%]<5D><>implementing the gpu backend<6E>ah']<5D>h)]<5D>uh+h
hj hhhh,hM<>ubh )<29><>}<7D>(hhh]<5D>(h)<29><>}<7D>(h<05>Primitive Transforms<6D>h]<5D>h<16>Primitive Transforms<6D><73><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hhj}hhhh,hMMubh.)<29><>}<7D>(hX8Now that we have come this far, let's also learn how to add implementations to
transformations in a :class:`Primitive`. These transformations can be built on
top of our operations, including the one we just defined. This gives
us the following :meth:`Axpby::jvp` and :meth:`Axpby::vjp` implementations.<2E>h]<5D>(h<16>fNow that we have come this far, lets also learn how to add implementations to
transformations in a <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:class:`Primitive`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16> Primitive<76><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-class<73>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>class<73><73> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E> Primitive<76>uh+h7hh,hMOhj<>ubh<16><>. These transformations can be built on
top of our operations, including the one we just defined. This gives
us the following <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:meth:`Axpby::jvp`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16> Axpby::jvp()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>
Axpby::jvp<76>uh+h7hh,hMOhj<>ubh<16> and <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:meth:`Axpby::vjp`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16> Axpby::vjp()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>
Axpby::vjp<6A>uh+h7hh,hMOhj<>ubh<16> implementations.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hMOhj}hhubh<62>)<29><>}<7D>(hX<>/** The Jacobian-vector product. */
array Axpby::jvp(
const std::vector<array>& primals,
const std::vector<array>& tangents,
const std::vector<int>& argnums) {
// Forward mode diff that pushes along the tangents
// The jvp transform on the primitive can be built with ops
// that are scheduled on the same stream as the primitive
// If argnums = {0}, we only push along x in which case the
// jvp is just the tangent scaled by alpha
// Similarly, if argnums = {1}, the jvp is just the tangent
// scaled by beta
if (argnums.size() == 1) {
auto scale = argnums[0] == 0 ? alpha_ : beta_;
auto scale_arr = array(scale, tangents[0].dtype());
return multiply(scale_arr, tangents[0], stream());
}
// If argnums = {0, 1}, we take contributions from both
// which gives us jvp = tangent_x * alpha + tangent_y * beta
else {
return axpby(tangents[0], tangents[1], alpha_, beta_, stream());
}
}<7D>h]<5D>hX<>/** The Jacobian-vector product. */
array Axpby::jvp(
const std::vector<array>& primals,
const std::vector<array>& tangents,
const std::vector<int>& argnums) {
// Forward mode diff that pushes along the tangents
// The jvp transform on the primitive can be built with ops
// that are scheduled on the same stream as the primitive
// If argnums = {0}, we only push along x in which case the
// jvp is just the tangent scaled by alpha
// Similarly, if argnums = {1}, the jvp is just the tangent
// scaled by beta
if (argnums.size() == 1) {
auto scale = argnums[0] == 0 ? alpha_ : beta_;
auto scale_arr = array(scale, tangents[0].dtype());
return multiply(scale_arr, tangents[0], stream());
}
// If argnums = {0, 1}, we take contributions from both
// which gives us jvp = tangent_x * alpha + tangent_y * beta
else {
return axpby(tangents[0], tangents[1], alpha_, beta_, stream());
}
}<7D><><EFBFBD><EFBFBD><EFBFBD>}<7D>hjsbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>C++<2B>h<EFBFBD>}<7D>uh+h<>hh,hMThj}hhubh<62>)<29><>}<7D>(hX<>/** The vector-Jacobian product. */
std::vector<array> Axpby::vjp(
const std::vector<array>& primals,
const array& cotan,
const std::vector<int>& argnums) {
// Reverse mode diff
std::vector<array> vjps;
for (auto arg : argnums) {
auto scale = arg == 0 ? alpha_ : beta_;
auto scale_arr = array(scale, cotan.dtype());
vjps.push_back(multiply(scale_arr, cotan, stream()));
}
return vjps;
}<7D>h]<5D>hX<>/** The vector-Jacobian product. */
std::vector<array> Axpby::vjp(
const std::vector<array>& primals,
const array& cotan,
const std::vector<int>& argnums) {
// Reverse mode diff
std::vector<array> vjps;
for (auto arg : argnums) {
auto scale = arg == 0 ? alpha_ : beta_;
auto scale_arr = array(scale, cotan.dtype());
vjps.push_back(multiply(scale_arr, cotan, stream()));
}
return vjps;
}<7D><><EFBFBD><EFBFBD><EFBFBD>}<7D>hjsbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>C++<2B>h<EFBFBD>}<7D>uh+h<>hh,hMohj}hhubh.)<29><>}<7D>(h<05>eFinally, you need not have a transformation fully defined to start using your
own :class:`Primitive`.<2E>h]<5D>(h<16>RFinally, you need not have a transformation fully defined to start using your
own <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj(hhhNhNubh8)<29><>}<7D>(h<05>:class:`Primitive`<60>h]<5D>h>)<29><>}<7D>(hj2h]<5D>h<16> Primitive<76><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj4hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-class<73>eh%]<5D>h']<5D>h)]<5D>uh+h=hj0ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j><00>reftype<70><65>class<73><73> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E> Primitive<76>uh+h7hh,hM<>hj(ubh<16>.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj(hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj}hhubh<62>)<29><>}<7D>(h<05><>/** Vectorize primitive along given axis */
std::pair<array, int> Axpby::vmap(
const std::vector<array>& inputs,
const std::vector<int>& axes) {
throw std::runtime_error("Axpby has no vmap implementation.");
}<7D>h]<5D>h<16><>/** Vectorize primitive along given axis */
std::pair<array, int> Axpby::vmap(
const std::vector<array>& inputs,
const std::vector<int>& axes) {
throw std::runtime_error("Axpby has no vmap implementation.");
}<7D><><EFBFBD><EFBFBD><EFBFBD>}<7D>hjZsbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>C++<2B>h<EFBFBD>}<7D>uh+h<>hh,hM<>hj}hhubeh}<7D>(h!]<5D><>primitive-transforms<6D>ah#]<5D>h%]<5D><>primitive transforms<6D>ah']<5D>h)]<5D>uh+h
hj hhhh,hMMubeh}<7D>(h!]<5D><>implementing-the-primitive<76>ah#]<5D>h%]<5D><>implementing the primitive<76>ah']<5D>h)]<5D>uh+h
hh hhhh,hK<>ubh )<29><>}<7D>(hhh]<5D>(h)<29><>}<7D>(h<05>Building and Binding<6E>h]<5D>h<16>Building and Binding<6E><67><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj}hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hhjzhhhh,hM<>ubh.)<29><>}<7D>(h<05>4Let's look at the overall directory structure first.<2E>h]<5D>h<16>6Lets look at the overall directory structure first.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hjzhhubh <09>
line_block<EFBFBD><EFBFBD><EFBFBD>)<29><>}<7D>(hhh]<5D>(h h<1E><>)<29><>}<7D>(h<05>
extensions<EFBFBD>h]<5D>h<16>
extensions<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h<1E>indent<6E>Khj<>hhhh,hM<>ubj<62>)<29><>}<7D>(h<05>├── axpby<62>h]<5D>h<16>├── axpby<62><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hM<>ubj<62>)<29><>}<7D>(h<05>│ ├── axpby.cpp<70>h]<5D>h<16>│ ├── axpby.cpp<70><70><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hM<>ubj<62>)<29><>}<7D>(h<05>│ ├── axpby.h<>h]<5D>h<16>│ ├── axpby.h<><68><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hM<>ubj<62>)<29><>}<7D>(h<05>│ └── axpby.metal<61>h]<5D>h<16>│ └── axpby.metal<61><6C><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hM<>ubj<62>)<29><>}<7D>(h<05>├── mlx_sample_extensions<6E>h]<5D>h<16>├── mlx_sample_extensions<6E><73><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hM<>ubj<62>)<29><>}<7D>(h<05>│ └── __init__.py<70>h]<5D>h<16>│ └── __init__.py<70><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hM<>ubj<62>)<29><>}<7D>(h<05>├── bindings.cpp<70>h]<5D>h<16>├── bindings.cpp<70><70><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hM<>ubj<62>)<29><>}<7D>(h<05>├── CMakeLists.txt<78>h]<5D>h<16>├── CMakeLists.txt<78><74><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hM<>ubj<62>)<29><>}<7D>(h<05>└── setup.py<70>h]<5D>h<16>└── setup.py<70><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hM<>ubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+j<>hjzhhhh,hM<>ubjt)<29><>}<7D>(hhh]<5D>(jy)<29><>}<7D>(h<05>7``extensions/axpby/`` defines the C++ extension 
library<72>h]<5D>h.)<29><>}<7D>(hj7h]<5D>(h>)<29><>}<7D>(h<05>``extensions/axpby/``<60>h]<5D>h<16>extensions/axpby/<2F><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj9ubh<16>" defines the C++ extension library<72><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj9hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj5ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhj2hhhh,hNubjy)<29><>}<7D>(h<05>]``extensions/mlx_sample_extensions`` sets out the structure for the
associated python package<67>h]<5D>h.)<29><>}<7D>(h<05>]``extensions/mlx_sample_extensions`` sets out the structure for the
associated python package<67>h]<5D>(h>)<29><>}<7D>(h<05>$``extensions/mlx_sample_extensions``<60>h]<5D>h<16> extensions/mlx_sample_extensions<6E><73><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjbhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj^ubh<16>9 sets out the structure for the
associated python package<67><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj^hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hjZubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhj2hhhh,hNubjy)<29><>}<7D>(h<05>F``extensions/bindings.cpp`` provides python bindings for our operation<6F>h]<5D>h.)<29><>}<7D>(hj<>h]<5D>(h>)<29><>}<7D>(h<05>``extensions/bindings.cpp``<60>h]<5D>h<16>extensions/bindings.cpp<70><70><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubh<16>+ provides python bindings for our operation<6F><6E><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhj2hhhh,hNubjy)<29><>}<7D>(h<05>X``extensions/CMakeLists.txt`` holds CMake rules to build the library and
python bindings<67>h]<5D>h.)<29><>}<7D>(h<05>X``extensions/CMakeLists.txt`` holds CMake rules to build the library and
python bindings<67>h]<5D>(h>)<29><>}<7D>(h<05>``extensions/CMakeLists.txt``<60>h]<5D>h<16>extensions/CMakeLists.txt<78><74><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubh<16>; holds CMake rules to build the library and
python bindings<67><73><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhj2hhhh,hNubjy)<29><>}<7D>(h<05>_``extensions/setup.py`` holds the ``setuptools`` rules to build and install
the python package
<EFBFBD>h]<5D>h.)<29><>}<7D>(h<05>^``extensions/setup.py`` holds the ``setuptools`` rules to build and install
the python package<67>h]<5D>(h>)<29><>}<7D>(h<05>``extensions/setup.py``<60>h]<5D>h<16>extensions/setup.py<70><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubh<16> holds the <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh>)<29><>}<7D>(h<05>``setuptools``<60>h]<5D>h<16>
setuptools<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubh<16>. rules to build and install
the python package<67><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhj2hhhh,hNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>j,j-uh+jshh,hM<>hjzhhubh )<29><>}<7D>(hhh]<5D>(h)<29><>}<7D>(h<05>Binding to Python<6F>h]<5D>h<16>Binding to Python<6F><6E><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hhj hhhh,hM<>ubh.)<29><>}<7D>(h<05><>We use PyBind11_ to build a Python API for the C++ library. Since bindings
for all needed components such as `mlx.core.array`, `mlx.core.stream`, etc.
are already provided, adding our :meth:`axpby` becomes very simple!<21>h]<5D>(h<16>We use <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubjG)<29><>}<7D>(h<05> PyBind11_<31>h]<5D>h<16>PyBind11<31><31><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj"hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>name<6D><65>PyBind11<31>jX<00>*https://pybind11.readthedocs.io/en/stable/<2F>uh+jFhjjZKubh<16>] to build a Python API for the C++ library. Since bindings
for all needed components such as <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubh <09>title_reference<63><65><EFBFBD>)<29><>}<7D>(h<05>`mlx.core.array`<60>h]<5D>h<16>mlx.core.array<61><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj9hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+j7hjubh<16>, <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubj8)<29><>}<7D>(h<05>`mlx.core.stream`<60>h]<5D>h<16>mlx.core.stream<61><6D><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjKhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+j7hjubh<16>(, etc.
are already provided, adding our <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubh8)<29><>}<7D>(h<05> :meth:`axpby`<60>h]<5D>h>)<29><>}<7D>(hj_h]<5D>h<16>axpby()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjahhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj]ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>jk<00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>axpby<62>uh+h7hh,hM<>hjubh<16> becomes very simple!<21><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj hhubh<62>)<29><>}<7D>(hX9PYBIND11_MODULE(mlx_sample_extensions, m) {
m.doc() = "Sample C++ and metal extensions for MLX";
m.def(
"axpby",
&axpby,
"x"_a,
"y"_a,
py::pos_only(),
"alpha"_a,
"beta"_a,
py::kw_only(),
"stream"_a = py::none(),
R"pbdoc(
Scale and sum two vectors element-wise
``z = alpha * x + beta * y``
Follows numpy style broadcasting between ``x`` and ``y``
Inputs are upcasted to floats if needed
Args:
x (array): Input array.
y (array): Input array.
alpha (float): Scaling factor for ``x``.
beta (float): Scaling factor for ``y``.
Returns:
array: ``alpha * x + beta * y``
)pbdoc");
}<7D>h]<5D>hX9PYBIND11_MODULE(mlx_sample_extensions, m) {
m.doc() = "Sample C++ and metal extensions for MLX";
m.def(
"axpby",
&axpby,
"x"_a,
"y"_a,
py::pos_only(),
"alpha"_a,
"beta"_a,
py::kw_only(),
"stream"_a = py::none(),
R"pbdoc(
Scale and sum two vectors element-wise
``z = alpha * x + beta * y``
Follows numpy style broadcasting between ``x`` and ``y``
Inputs are upcasted to floats if needed
Args:
x (array): Input array.
y (array): Input array.
alpha (float): Scaling factor for ``x``.
beta (float): Scaling factor for ``y``.
Returns:
array: ``alpha * x + beta * y``
)pbdoc");
}<7D><><EFBFBD><EFBFBD><EFBFBD>}<7D>hj<>sbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>C++<2B>h<EFBFBD>}<7D>uh+h<>hh,hM<>hj hhubh.)<29><>}<7D>(h<05>Most of the complexity in the above example comes from additional bells and
whistles such as the literal names and doc-strings.<2E>h]<5D>h<16>Most of the complexity in the above example comes from additional bells and
whistles such as the literal names and doc-strings.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj hhubju)<29><>}<7D>(h<05><>:mod:`mlx.core` needs to be imported before importing
:mod:`mlx_sample_extensions` as defined by the pybind11 module above to
ensure that the casters for :mod:`mlx.core` components like
:class:`mlx.core.array` are available.<2E>h]<5D>h.)<29><>}<7D>(h<05><>:mod:`mlx.core` needs to be imported before importing
:mod:`mlx_sample_extensions` as defined by the pybind11 module above to
ensure that the casters for :mod:`mlx.core` components like
:class:`mlx.core.array` are available.<2E>h]<5D>(h8)<29><>}<7D>(h<05>:mod:`mlx.core`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16>mlx.core<72><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-mod<6F>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>mod<6F><64> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>mlx.core<72>uh+h7hh,hM<>hj<>ubh<16>' needs to be imported before importing
<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:mod:`mlx_sample_extensions`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16>mlx_sample_extensions<6E><73><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-mod<6F>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>mod<6F><64> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>mlx_sample_extensions<6E>uh+h7hh,hM<>hj<>ubh<16>H as defined by the pybind11 module above to
ensure that the casters for <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:mod:`mlx.core`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16>mlx.core<72><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-mod<6F>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<00>reftype<70><65>mod<6F><64> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>mlx.core<72>uh+h7hh,hM<>hj<>ubh<16> components like
<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:class:`mlx.core.array`<60>h]<5D>h>)<29><>}<7D>(hjh]<5D>h<16>mlx.core.array<61><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-class<73>eh%]<5D>h']<5D>h)]<5D>uh+h=hjubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j'<00>reftype<70><65>class<73><73> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>mlx.core.array<61>uh+h7hh,hM<>hj<>ubh<16> are available.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jthj hhhh,hNubh <09>target<65><74><EFBFBD>)<29><>}<7D>(h<05>.. _Building with CMake:<3A>h]<5D>h}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refid<69><64>building-with-cmake<6B>uh+jIhM<>hj hhhh,ubeh}<7D>(h!]<5D><>binding-to-python<6F>ah#]<5D>h%]<5D><>binding to python<6F>ah']<5D>h)]<5D>uh+h
hjzhhhh,hM<>ubh )<29><>}<7D>(hhh]<5D>(h)<29><>}<7D>(h<05>Building with CMake<6B>h]<5D>h<16>Building with CMake<6B><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjbhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hhj_hhhh,hM<>ubh.)<29><>}<7D>(h<05><>Building the C++ extension library itself is simple, it only requires that you
``find_package(MLX CONFIG)`` and then link it to your library.<2E>h]<5D>(h<16>OBuilding the C++ extension library itself is simple, it only requires that you
<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjphhhNhNubh>)<29><>}<7D>(h<05>``find_package(MLX CONFIG)``<60>h]<5D>h<16>find_package(MLX CONFIG)<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjxhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hjpubh<16>" and then link it to your library.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjphhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj_hhubh<62>)<29><>}<7D>(hX## Add library
add_library(mlx_ext)
# Add sources
target_sources(
mlx_ext
PUBLIC
${CMAKE_CURRENT_LIST_DIR}/axpby/axpby.cpp
)
# Add include headers
target_include_directories(
mlx_ext PUBLIC ${CMAKE_CURRENT_LIST_DIR}
)
# Link to mlx
target_link_libraries(mlx_ext PUBLIC mlx)<29>h]<5D>hX## Add library
add_library(mlx_ext)
# Add sources
target_sources(
mlx_ext
PUBLIC
${CMAKE_CURRENT_LIST_DIR}/axpby/axpby.cpp
)
# Add include headers
target_include_directories(
mlx_ext PUBLIC ${CMAKE_CURRENT_LIST_DIR}
)
# Link to mlx
target_link_libraries(mlx_ext PUBLIC mlx)<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>hj<>sbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>cmake<6B>h<EFBFBD>}<7D>uh+h<>hh,hM<>hj_hhubh.)<29><>}<7D>(hXWe also need to build the attached metal library. For convenience, we provide a
:meth:`mlx_build_metallib` function that builds a ``.metallib`` target given
sources, headers, destinations, etc. (defined in ``cmake/extension.cmake`` and
automatically imported with MLX package).<2E>h]<5D>(h<16>PWe also need to build the attached metal library. For convenience, we provide a
<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:meth:`mlx_build_metallib`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16>mlx_build_metallib()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>mlx_build_metallib<69>uh+h7hh,hM<>hj<>ubh<16> function that builds a <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh>)<29><>}<7D>(h<05> ``.metallib``<60>h]<5D>h<16> .metallib<69><62><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubh<16>? target given
sources, headers, destinations, etc. (defined in <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh>)<29><>}<7D>(h<05>``cmake/extension.cmake``<60>h]<5D>h<16>cmake/extension.cmake<6B><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubh<16>. and
automatically imported with MLX package).<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj_hhubh.)<29><>}<7D>(h<05>)Here is what that looks like in practice!<21>h]<5D>h<16>)Here is what that looks like in practice!<21><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj_hhubh<62>)<29><>}<7D>(hXR# Build metallib
if(MLX_BUILD_METAL)
mlx_build_metallib(
TARGET mlx_ext_metallib
TITLE mlx_ext
SOURCES ${CMAKE_CURRENT_LIST_DIR}/axpby/axpby.metal
INCLUDE_DIRS ${PROJECT_SOURCE_DIR} ${MLX_INCLUDE_DIRS}
OUTPUT_DIRECTORY ${CMAKE_LIBRARY_OUTPUT_DIRECTORY}
)
add_dependencies(
mlx_ext
mlx_ext_metallib
)
endif()<29>h]<5D>hXR# Build metallib
if(MLX_BUILD_METAL)
mlx_build_metallib(
TARGET mlx_ext_metallib
TITLE mlx_ext
SOURCES ${CMAKE_CURRENT_LIST_DIR}/axpby/axpby.metal
INCLUDE_DIRS ${PROJECT_SOURCE_DIR} ${MLX_INCLUDE_DIRS}
OUTPUT_DIRECTORY ${CMAKE_LIBRARY_OUTPUT_DIRECTORY}
)
add_dependencies(
mlx_ext
mlx_ext_metallib
)
endif()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>hjsbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>cmake<6B>h<EFBFBD>}<7D>uh+h<>hh,hM<>hj_hhubh.)<29><>}<7D>(h<05>(Finally, we build the Pybind11_ bindings<67>h]<5D>(h<16>Finally, we build the <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubjG)<29><>}<7D>(h<05> Pybind11_<31>h]<5D>h<16>Pybind11<31><31><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>name<6D><65>Pybind11<31>jXj2uh+jFhjjZKubh<16> bindings<67><73><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hMhj_hhubh<62>)<29><>}<7D>(hXpybind11_add_module(
mlx_sample_extensions
${CMAKE_CURRENT_LIST_DIR}/bindings.cpp
)
target_link_libraries(mlx_sample_extensions PRIVATE mlx_ext)
if(BUILD_SHARED_LIBS)
target_link_options(mlx_sample_extensions PRIVATE -Wl,-rpath,@loader_path)
endif()<29>h]<5D>hXpybind11_add_module(
mlx_sample_extensions
${CMAKE_CURRENT_LIST_DIR}/bindings.cpp
)
target_link_libraries(mlx_sample_extensions PRIVATE mlx_ext)
if(BUILD_SHARED_LIBS)
target_link_options(mlx_sample_extensions PRIVATE -Wl,-rpath,@loader_path)
endif()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>hj6sbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>cmake<6B>h<EFBFBD>}<7D>uh+h<>hh,hMhj_hhubeh}<7D>(h!]<5D>(jV<00>id1<64>eh#]<5D>h%]<5D><>building with cmake<6B>ah']<5D><>building with cmake<6B>ah)]<5D>uh+h
hjzhhhh,hM<><02>
referenced<EFBFBD>K<01>expect_referenced_by_name<6D>}<7D>jKjKs<>expect_referenced_by_id<69>}<7D>jVjKsubh )<29><>}<7D>(hhh]<5D>(h)<29><>}<7D>(h<05>Building with ``setuptools``<60>h]<5D>(h<16>Building with <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjWhhhNhNubh>)<29><>}<7D>(h<05>``setuptools``<60>h]<5D>h<16>
setuptools<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj_hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hjWubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hhjThhhh,hMubh.)<29><>}<7D>(h<05><>Once we have set out the CMake build rules as described above, we can use the
build utilities defined in :mod:`mlx.extension` for a simple build process.<2E>h]<5D>(h<16>iOnce we have set out the CMake build rules as described above, we can use the
build utilities defined in <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjshhhNhNubh8)<29><>}<7D>(h<05>:mod:`mlx.extension`<60>h]<5D>h>)<29><>}<7D>(hj}h]<5D>h<16> mlx.extension<6F><6E><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-mod<6F>eh%]<5D>h']<5D>h)]<5D>uh+h=hj{ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>mod<6F><64> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E> mlx.extension<6F>uh+h7hh,hMhjsubh<16> for a simple build process.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjshhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hMhjThhubh<62>)<29><>}<7D>(hX^from mlx import extension
from setuptools import setup
if __name__ == "__main__":
setup(
name="mlx_sample_extensions",
version="0.0.0",
description="Sample C++ and Metal extensions for MLX primitives.",
ext_modules=[extension.CMakeExtension("mlx_sample_extensions")],
cmdclass={"build_ext": extension.CMakeBuild},
packages = ["mlx_sample_extensions"],
package_dir = {"": "mlx_sample_extensions"},
package_data = {"mlx_sample_extensions" : ["*.so", "*.dylib", "*.metallib"]},
zip_safe=False,
python_requires=">=3.7",
)<29>h]<5D>hX^from mlx import extension
from setuptools import setup
if __name__ == "__main__":
setup(
name="mlx_sample_extensions",
version="0.0.0",
description="Sample C++ and Metal extensions for MLX primitives.",
ext_modules=[extension.CMakeExtension("mlx_sample_extensions")],
cmdclass={"build_ext": extension.CMakeBuild},
packages = ["mlx_sample_extensions"],
package_dir = {"": "mlx_sample_extensions"},
package_data = {"mlx_sample_extensions" : ["*.so", "*.dylib", "*.metallib"]},
zip_safe=False,
python_requires=">=3.7",
)<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>hj<>sbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>python<6F>h<EFBFBD>}<7D>uh+h<>hh,hM"hjThhubj<62> )<29><>}<7D>(hXoWe treat ``extensions/mlx_sample_extensions`` as the package directory
even though it only contains a ``__init__.py`` to ensure the following:
* :mod:`mlx.core` is always imported before importing :mod:`mlx_sample_extensions`
* The C++ extension library and the metal library are co-located with the python
bindings and copied together if the package is installed<65>h]<5D>(h.)<29><>}<7D>(h<05><>We treat ``extensions/mlx_sample_extensions`` as the package directory
even though it only contains a ``__init__.py`` to ensure the following:<3A>h]<5D>(h<16> We treat <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh>)<29><>}<7D>(h<05>$``extensions/mlx_sample_extensions``<60>h]<5D>h<16> extensions/mlx_sample_extensions<6E><73><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubh<16>9 as the package directory
even though it only contains a <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh>)<29><>}<7D>(h<05>``__init__.py``<60>h]<5D>h<16> __init__.py<70><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubh<16> to ensure the following:<3A><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM6hj<>ubjt)<29><>}<7D>(hhh]<5D>(jy)<29><>}<7D>(h<05>Q:mod:`mlx.core` is always imported before importing :mod:`mlx_sample_extensions`<60>h]<5D>h.)<29><>}<7D>(hj<>h]<5D>(h8)<29><>}<7D>(h<05>:mod:`mlx.core`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16>mlx.core<72><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-mod<6F>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<00>reftype<70><65>mod<6F><64> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>mlx.core<72>uh+h7hh,hM9hj<>ubh<16>& is always imported before importing <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:mod:`mlx_sample_extensions`<60>h]<5D>h>)<29><>}<7D>(hjh]<5D>h<16>mlx_sample_extensions<6E><73><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-mod<6F>eh%]<5D>h']<5D>h)]<5D>uh+h=hjubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j'<00>reftype<70><65>mod<6F><64> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>mlx_sample_extensions<6E>uh+h7hh,hM9hj<>ubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM9hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhj<>ubjy)<29><>}<7D>(h<05><>The C++ extension library and the metal library are co-located with the python
bindings and copied together if the package is installed<65>h]<5D>h.)<29><>}<7D>(h<05><>The C++ extension library and the metal library are co-located with the python
bindings and copied together if the package is installed<65>h]<5D>h<16><>The C++ extension library and the metal library are co-located with the python
bindings and copied together if the package is installed<65><64><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjIhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM:hjEubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+jxhj<>ubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>j,j-uh+jshh,hM9hj<>ubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+j<> hjThhhNhNubh.)<29><>}<7D>(h<05>lYou can build inplace for development using
``python setup.py build_ext -j8 --inplace`` (in ``extensions/``)<29>h]<5D>(h<16>,You can build inplace for development using
<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjihhhNhNubh>)<29><>}<7D>(h<05>+``python setup.py build_ext -j8 --inplace``<60>h]<5D>h<16>'python setup.py build_ext -j8 --inplace<63><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjqhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hjiubh<16> (in <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjihhhNhNubh>)<29><>}<7D>(h<05>``extensions/``<60>h]<5D>h<16> extensions/<2F><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hjiubh<16>)<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjihhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM=hjThhubh.)<29><>}<7D>(h<05>5This will result in a directory structure as follows:<3A>h]<5D>h<16>5This will result in a directory structure as follows:<3A><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM@hjThhubj<62>)<29><>}<7D>(hhh]<5D>(j<>)<29><>}<7D>(h<05>
extensions<EFBFBD>h]<5D>h<16>
extensions<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hMBubj<62>)<29><>}<7D>(h<05>├── mlx_sample_extensions<6E>h]<5D>h<16>├── mlx_sample_extensions<6E><73><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hMCubj<62>)<29><>}<7D>(h<05>│ ├── __init__.py<70>h]<5D>h<16>│ ├── __init__.py<70><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hMDubj<62>)<29><>}<7D>(h<05>8│ ├── libmlx_ext.dylib # C++ extension library<72>h]<5D>h<16>8│ ├── libmlx_ext.dylib # C++ extension library<72><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hMEubj<62>)<29><>}<7D>(h<05>0│ ├── mlx_ext.metallib # Metal library<72>h]<5D>h<16>0│ ├── mlx_ext.metallib # Metal library<72><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hMFubj<62>)<29><>}<7D>(h<05>K│ └── mlx_sample_extensions.cpython-3x-darwin.so # Python Binding<6E>h]<5D>h<16>K│ └── mlx_sample_extensions.cpython-3x-darwin.so # Python Binding<6E><67><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hMGubj<62>)<29><>}<7D>(h<05>...<2E>h]<5D>h<16><><E280A6><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hj<>Khj<>hhhh,hMHubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+j<>hjThhhh,hMBubh.)<29><>}<7D>(hX2When you try to install using the command ``python -m pip install .``
(in ``extensions/``), the package will be installed with the same structure as
``extensions/mlx_sample_extensions`` and the C++ and metal library will be
copied along with the python binding since they are specified as ``package_data``.<2E>h]<5D>(h<16>*When you try to install using the command <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubh>)<29><>}<7D>(h<05>``python -m pip install .``<60>h]<5D>h<16>python -m pip install .<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hjubh<16>
(in <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubh>)<29><>}<7D>(h<05>``extensions/``<60>h]<5D>h<16> extensions/<2F><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj.hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hjubh<16><), the package will be installed with the same structure as
<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubh>)<29><>}<7D>(h<05>$``extensions/mlx_sample_extensions``<60>h]<5D>h<16> extensions/mlx_sample_extensions<6E><73><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj@hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hjubh<16>h and the C++ and metal library will be
copied along with the python binding since they are specified as <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubh>)<29><>}<7D>(h<05>``package_data``<60>h]<5D>h<16> package_data<74><61><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjRhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hjubh<16>.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hMJhjThhubeh}<7D>(h!]<5D><>building-with-setuptools<6C>ah#]<5D>h%]<5D><>building with setuptools<6C>ah']<5D>h)]<5D>uh+h
hjzhhhh,hMubeh}<7D>(h!]<5D><>building-and-binding<6E>ah#]<5D>h%]<5D><>building and binding<6E>ah']<5D>h)]<5D>uh+h
hh hhhh,hM<>ubh )<29><>}<7D>(hhh]<5D>(h)<29><>}<7D>(h<05>Usage<67>h]<5D>h<16>Usage<67><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj}hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hhjzhhhh,hMPubh.)<29><>}<7D>(h<05><>After installing the extension as described above, you should be able to simply
import the python package and play with it as you would any other MLX operation!<21>h]<5D>h<16><>After installing the extension as described above, you should be able to simply
import the python package and play with it as you would any other MLX operation!<21><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hMRhjzhhubh.)<29><>}<7D>(h<05>0Let's looks at a simple script and it's results!<21>h]<5D>h<16>4Lets looks at a simple script and its results!<21><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hMUhjzhhubh<62>)<29><>}<7D>(h<05><>import mlx.core as mx
from mlx_sample_extensions import axpby
a = mx.ones((3, 4))
b = mx.ones((3, 4))
c = axpby(a, b, 4.0, 2.0, stream=mx.cpu)
print(f"c shape: {c.shape}")
print(f"c dtype: {c.dtype}")
print(f"c correctness: {mx.all(c == 6.0).item()}")<29>h]<5D>h<16><>import mlx.core as mx
from mlx_sample_extensions import axpby
a = mx.ones((3, 4))
b = mx.ones((3, 4))
c = axpby(a, b, 4.0, 2.0, stream=mx.cpu)
print(f"c shape: {c.shape}")
print(f"c dtype: {c.dtype}")
print(f"c correctness: {mx.all(c == 6.0).item()}")<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>hj<>sbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>python<6F>h<EFBFBD>}<7D>uh+h<>hh,hMWhjzhhubh.)<29><>}<7D>(h<05>Output:<3A>h]<5D>h<16>Output:<3A><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hMdhjzhhubh<62>)<29><>}<7D>(h<05>4c shape: [3, 4]
c dtype: float32
c correctness: True<75>h]<5D>h<16>4c shape: [3, 4]
c dtype: float32
c correctness: True<75><65><EFBFBD><EFBFBD><EFBFBD>}<7D>hj<>sbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>python<6F>h<EFBFBD>}<7D>uh+h<>hh,hMfhjzhhubh )<29><>}<7D>(hhh]<5D>(h)<29><>}<7D>(h<05>Results<74>h]<5D>h<16>Results<74><73><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hhj<>hhhh,hMmubh.)<29><>}<7D>(h<05><>Let's run a quick benchmark and see how our new ``axpby`` operation compares
with the naive :meth:`simple_axpby` we defined at first on the CPU.<2E>h]<5D>(h<16>2Lets run a quick benchmark and see how our new <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh>)<29><>}<7D>(h<05> ``axpby``<60>h]<5D>h<16>axpby<62><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubh<16># operation compares
with the naive <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubh8)<29><>}<7D>(h<05>:meth:`simple_axpby`<60>h]<5D>h>)<29><>}<7D>(hjh]<5D>h<16>simple_axpby()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hjubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E> simple_axpby<62>uh+h7hh,hMohj<>ubh<16> we defined at first on the CPU.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hMohj<>hhubh<62>)<29><>}<7D>(hX<>import mlx.core as mx
from mlx_sample_extensions import axpby
import time
mx.set_default_device(mx.cpu)
def simple_axpby(x: mx.array, y: mx.array, alpha: float, beta: float) -> mx.array:
return alpha * x + beta * y
M = 256
N = 512
x = mx.random.normal((M, N))
y = mx.random.normal((M, N))
alpha = 4.0
beta = 2.0
mx.eval((x, y))
def bench(f):
# Warm up
for i in range(100):
z = f(x, y, alpha, beta)
mx.eval(z)
# Timed run
s = time.time()
for i in range(5000):
z = f(x, y, alpha, beta)
mx.eval(z)
e = time.time()
return e - s
simple_time = bench(simple_axpby)
custom_time = bench(axpby)
print(f"Simple axpby: {simple_time:.3f} s | Custom axpby: {custom_time:.3f} s")<29>h]<5D>hX<>import mlx.core as mx
from mlx_sample_extensions import axpby
import time
mx.set_default_device(mx.cpu)
def simple_axpby(x: mx.array, y: mx.array, alpha: float, beta: float) -> mx.array:
return alpha * x + beta * y
M = 256
N = 512
x = mx.random.normal((M, N))
y = mx.random.normal((M, N))
alpha = 4.0
beta = 2.0
mx.eval((x, y))
def bench(f):
# Warm up
for i in range(100):
z = f(x, y, alpha, beta)
mx.eval(z)
# Timed run
s = time.time()
for i in range(5000):
z = f(x, y, alpha, beta)
mx.eval(z)
e = time.time()
return e - s
simple_time = bench(simple_axpby)
custom_time = bench(axpby)
print(f"Simple axpby: {simple_time:.3f} s | Custom axpby: {custom_time:.3f} s")<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>hj*sbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD><68>python<6F>h<EFBFBD>}<7D>uh+h<>hh,hMrhj<>hhubh.)<29><>}<7D>(h<05>Results:<3A>h]<5D>h<16>Results:<3A><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj:hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<>hhubh<62>)<29><>}<7D>(h<05>-Simple axpby: 0.114 s | Custom axpby: 0.109 s<>h]<5D>h<16>-Simple axpby: 0.114 s | Custom axpby: 0.109 s<><73><EFBFBD><EFBFBD><EFBFBD>}<7D>hjHsbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>h<EFBFBD><68>h<EFBFBD>j<EFBFBD>h<>}<7D>uh+h<>hh,hM<>hj<>hhubh.)<29><>}<7D>(h<05>+We see some modest improvements right away!<21>h]<5D>h<16>+We see some modest improvements right away!<21><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjWhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<>hhubh.)<29><>}<7D>(h<05><>This operation is now good to be used to build other operations,
in :class:`mlx.nn.Module` calls, and also as a part of graph
transformations such as :meth:`grad` and :meth:`simplify`!<21>h]<5D>(h<16>DThis operation is now good to be used to build other operations,
in <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjehhhNhNubh8)<29><>}<7D>(h<05>:class:`mlx.nn.Module`<60>h]<5D>h>)<29><>}<7D>(hjoh]<5D>h<16> mlx.nn.Module<6C><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjqhhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-class<73>eh%]<5D>h']<5D>h)]<5D>uh+h=hjmubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j{<00>reftype<70><65>class<73><73> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E> mlx.nn.Module<6C>uh+h7hh,hM<>hjeubh<16>< calls, and also as a part of graph
transformations such as <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjehhhNhNubh8)<29><>}<7D>(h<05> :meth:`grad`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16>grad()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>grad<61>uh+h7hh,hM<>hjeubh<16> and <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjehhhNhNubh8)<29><>}<7D>(h<05>:meth:`simplify`<60>h]<5D>h>)<29><>}<7D>(hj<>h]<5D>h<16>
simplify()<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>(hI<68>py<70><79>py-meth<74>eh%]<5D>h']<5D>h)]<5D>uh+h=hj<>ubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>refdoc<6F>hV<68> refdomain<69>j<EFBFBD><00>reftype<70><65>meth<74><68> refexplicit<69><74><EFBFBD>refwarn<72><6E>h\Nh]Nh^<5E>simplify<66>uh+h7hh,hM<>hjeubh<16>!<21><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjehhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hj<>hhubeh}<7D>(h!]<5D><>results<74>ah#]<5D>h%]<5D><>results<74>ah']<5D>h)]<5D>uh+h
hjzhhhh,hMmubeh}<7D>(h!]<5D><>usage<67>ah#]<5D>h%]<5D><>usage<67>ah']<5D>h)]<5D>uh+h
hh hhhh,hMPubh )<29><>}<7D>(hhh]<5D>(h)<29><>}<7D>(h<05>Scripts<74>h]<5D>h<16>Scripts<74><73><EFBFBD><EFBFBD><EFBFBD>}<7D>(hj<>hhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hhj<>hhhh,hM<>ubh <09>
admonition<EFBFBD><EFBFBD><EFBFBD>)<29><>}<7D>(h<05>=The full example code is available in `mlx-examples <code>`_.<2E>h]<5D>(h)<29><>}<7D>(h<05>Download the code<64>h]<5D>h<16>Download the code<64><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+hhh,hM<>hjubh.)<29><>}<7D>(hjh]<5D>(h<16>&The full example code is available in <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubjG)<29><>}<7D>(h<05>`mlx-examples <code>`_<>h]<5D>h<16> mlx-examples<65><73><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>name<6D><65> mlx-examples<65>jX<00>code<64>uh+jFhjubjJ)<29><>}<7D>(h<05> <code><3E>h]<5D>h}<7D>(h!]<5D><> mlx-examples<65>ah#]<5D>h%]<5D><> mlx-examples<65>ah']<5D>h)]<5D><>refuri<72>j+uh+jIjOKhjubh<16>.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjhhhNhNubeh}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hh,hM<>hjubeh}<7D>(h!]<5D>h#]<5D><>admonition-download-the-code<64>ah%]<5D>h']<5D>h)]<5D>uh+jhj<>hhhh,hNubh <09>comment<6E><74><EFBFBD>)<29><>}<7D>(h<05>code: `TODO_LINK/extensions`_<>h]<5D>h<16>code: `TODO_LINK/extensions`_<><5F><EFBFBD><EFBFBD><EFBFBD>}<7D>hjLsbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>h<EFBFBD>h<EFBFBD>uh+jJhj<>hhhh,hM<>ubjJ)<29><>}<7D>(h<05>W.. _Accelerate: https://developer.apple.com/documentation/accelerate/blas?language=objc<6A>h]<5D>h}<7D>(h!]<5D><>
accelerate<EFBFBD>ah#]<5D>h%]<5D><>
accelerate<EFBFBD>ah']<5D>h)]<5D>jXjYuh+jIhM<>hj<>hhhh,jOKubjJ)<29><>}<7D>(h<05>H.. _Metal: https://developer.apple.com/documentation/metal?language=objc<6A>h]<5D>h}<7D>(h!]<5D><>metal<61>ah#]<5D>h%]<5D><>metal<61>ah']<5D>h)]<5D>jXj<> uh+jIhM<>hj<>hhhh,jOKubjJ)<29><>}<7D>(h<05>5.. _Metal-cpp: https://developer.apple.com/metal/cpp/<2F>h]<5D>h}<7D>(h!]<5D><> metal-cpp<70>ah#]<5D>h%]<5D><> metal-cpp<70>ah']<5D>h)]<5D>jXj: uh+jIhM<>hj<>hhhh,jOKubjJ)<29><>}<7D>(h<05>e.. _`Metal Specification`: https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf<64>h]<5D>h}<7D>(h!]<5D><>metal-specification<6F>ah#]<5D>h%]<5D><>metal specification<6F>ah']<5D>h)]<5D>jXj uh+jIhM<>hj<>hhhh,jOKubjJ)<29><>}<7D>(h<05>s.. _`Metal Example`: https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu?language=objc<6A>h]<5D>h}<7D>(h!]<5D><> metal-example<6C>ah#]<5D>h%]<5D><> metal example<6C>ah']<5D>h)]<5D>jXj<> uh+jIhM<>hj<>hhhh,jOKubjJ)<29><>}<7D>(h<05>8.. _PyBind11: https://pybind11.readthedocs.io/en/stable/<2F>h]<5D>h}<7D>(h!]<5D><>pybind11<31>ah#]<5D>h%]<5D><>pybind11<31>ah']<5D>h)]<5D>jXj2uh+jIhM<>hj<>hhhh,jOKubeh}<7D>(h!]<5D><>scripts<74>ah#]<5D>h%]<5D><>scripts<74>ah']<5D>h)]<5D>uh+h
hh hhhh,hM<>ubeh}<7D>(h!]<5D><>developer-documentation<6F>ah#]<5D>h%]<5D><>developer documentation<6F>ah']<5D>h)]<5D>uh+h
hhhhhh,hKubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>source<63>h,uh+h<01>current_source<63>N<EFBFBD> current_line<6E>N<EFBFBD>settings<67><73>docutils.frontend<6E><64>Values<65><73><EFBFBD>)<29><>}<7D>(hN<> generator<6F>N<EFBFBD> datestamp<6D>N<EFBFBD> source_link<6E>N<EFBFBD>
source_url<EFBFBD>N<EFBFBD> toc_backlinks<6B><73>entry<72><79>footnote_backlinks<6B>K<01> sectnum_xform<72>K<01>strip_comments<74>N<EFBFBD>strip_elements_with_classes<65>N<EFBFBD> strip_classes<65>N<EFBFBD> report_level<65>K<02>
halt_level<EFBFBD>K<05>exit_status_level<65>K<05>debug<75>N<EFBFBD>warning_stream<61>N<EFBFBD> traceback<63><6B><EFBFBD>input_encoding<6E><67> utf-8-sig<69><67>input_encoding_error_handler<65><72>strict<63><74>output_encoding<6E><67>utf-8<><38>output_encoding_error_handler<65>j<EFBFBD><00>error_encoding<6E><67>utf-8<><38>error_encoding_error_handler<65><72>backslashreplace<63><65> language_code<64><65>en<65><6E>record_dependencies<65>N<EFBFBD>config<69>N<EFBFBD> id_prefix<69>h<06>auto_id_prefix<69><78>id<69><64> dump_settings<67>N<EFBFBD>dump_internals<6C>N<EFBFBD>dump_transforms<6D>N<EFBFBD>dump_pseudo_xml<6D>N<EFBFBD>expose_internals<6C>N<EFBFBD>strict_visitor<6F>N<EFBFBD>_disable_config<69>N<EFBFBD>_source<63>h,<2C> _destination<6F>N<EFBFBD> _config_files<65>]<5D><>file_insertion_enabled<65><64><EFBFBD> raw_enabled<65>K<01>line_length_limit<69>M'<27>pep_references<65>N<EFBFBD> pep_base_url<72><6C>https://peps.python.org/<2F><>pep_file_url_template<74><65>pep-%04d<34><64>rfc_references<65>N<EFBFBD> rfc_base_url<72><6C>&https://datatracker.ietf.org/doc/html/<2F><> tab_width<74>K<08>trim_footnote_reference_space<63><65><EFBFBD>syntax_highlight<68><74>long<6E><67> smart_quotes<65><73><EFBFBD>smartquotes_locales<65>]<5D><>character_level_inline_markup<75><70><EFBFBD>doctitle_xform<72><6D><EFBFBD> docinfo_xform<72>K<01>sectsubtitle_xform<72><6D><EFBFBD> image_loading<6E><67>link<6E><6B>embed_stylesheet<65><74><EFBFBD>cloak_email_addresses<65><73><EFBFBD>section_self_link<6E><6B><EFBFBD>env<6E>Nub<75>reporter<65>N<EFBFBD>indirect_targets<74>]<5D><>substitution_defs<66>}<7D><>substitution_names<65>}<7D><>refnames<65>}<7D>(<28>
accelerate<EFBFBD>]<5D>(jHj<>jG
e<>metal<61>]<5D>j<EFBFBD> a<> metal example<6C>]<5D>j<EFBFBD> a<>metal specification<6F>]<5D>j a<> metal-cpp<70>]<5D>j* a<>pybind11<31>]<5D>(j"jeu<65>refids<64>}<7D>jV]<5D>jKas<61>nameids<64>}<7D>(j<>j<>j3j0j
jj<>j<>j<>j<>jj<>jwjtj<> j~ jzjwjojljwjtj\jYjKjVjojlj<>j<>j<>j<>j<>j<>j5j2jcj`jojlj{jxj<>j<>j<>j<>j<>j<>u<> nametypes<65>}<7D>(j<><00>j3<00>j
<00>j<EFBFBD><00>j<EFBFBD><00>j<00>jw<00>j<EFBFBD> <00>jz<00>jo<00>jw<00>j\<00>jK<00>jo<00>j<EFBFBD><00>j<EFBFBD><00>j<EFBFBD><00>j5<00>jc<00>jo<00>j{<00>j<EFBFBD><00>j<EFBFBD><00>j<EFBFBD><00>uh!}<7D>(j<>h j0hjjj6j<>jUj<>j<>j<>jjtj j~ j<>jwj<> jlj}jtjzjYj jVj_jHj_jljTj<>jzj<>j<>j<>j<>j2j,j`jZjljfjxjrj<>j~j<>j<>j<>j<>u<> footnote_refs<66>}<7D><> citation_refs<66>}<7D><> autofootnotes<65>]<5D><>autofootnote_refs<66>]<5D><>symbol_footnotes<65>]<5D><>symbol_footnote_refs<66>]<5D><> footnotes<65>]<5D><> citations<6E>]<5D><>autofootnote_start<72>K<01>symbol_footnote_start<72>K<00>
id_counter<EFBFBD><EFBFBD> collections<6E><73>Counter<65><72><EFBFBD>}<7D>j<EFBFBD>Ks<><73>R<EFBFBD><52>parse_messages<65>]<5D>h <09>system_message<67><65><EFBFBD>)<29><>}<7D>(hhh]<5D>h.)<29><>}<7D>(h<05>6Duplicate implicit target name: "building with cmake".<2E>h]<5D>h<16>:Duplicate implicit target name: “building with cmake”.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hjHhhhNhNubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hjEubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>jHa<>level<65>K<01>type<70><65>INFO<46><4F>source<63>h,<2C>line<6E>M<EFBFBD>uh+jChj_hhhh,hM<>uba<62>transform_messages<65>]<5D>jD)<29><>}<7D>(hhh]<5D>h.)<29><>}<7D>(hhh]<5D>h<16>9Hyperlink target "building-with-cmake" is not referenced.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>hjfsbah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D>uh+h-hjcubah}<7D>(h!]<5D>h#]<5D>h%]<5D>h']<5D>h)]<5D><>level<65>K<01>type<70>j^<00>source<63>h,<2C>line<6E>M<EFBFBD>uh+jCuba<62> transformer<65>N<EFBFBD> include_log<6F>]<5D><>
decoration<EFBFBD>Nhhub.