mirror of https://github.com/ml-explore/mlx.git
synced 2025-06-25 01:41:17 +08:00
parent d11d77e581
commit cfc39d84b7
@@ -36,6 +36,7 @@ are the CPU and GPU.
    :maxdepth: 1
 
    quick_start
+   unified_memory
    using_streams
 
 .. toctree::
@@ -62,10 +62,3 @@ and :func:`jvp` for Jacobian-vector products.
 
 Use :func:`value_and_grad` to efficiently compute both a function's output and
 gradient with respect to the function's input.
-
-
-Devices and Streams
--------------------
-
-
-
78
docs/src/unified_memory.rst
Normal file
@@ -0,0 +1,78 @@
.. _unified_memory:

Unified Memory
==============

.. currentmodule:: mlx.core

Apple silicon has a unified memory architecture. The CPU and GPU have direct
access to the same memory pool. MLX is designed to take advantage of that.

Concretely, when you make an array in MLX you don't have to specify its location:

.. code-block:: python

  import mlx.core as mx

  a = mx.random.normal((100,))
  b = mx.random.normal((100,))

Both ``a`` and ``b`` live in unified memory.

In MLX, rather than moving arrays to devices, you specify the device when you
run the operation. Any device can perform any operation on ``a`` and ``b``
without needing to move them from one memory location to another. For example:

.. code-block:: python

  mx.add(a, b, stream=mx.cpu)
  mx.add(a, b, stream=mx.gpu)

In the above, both the CPU and the GPU will perform the same add
operation. The operations can (and likely will) be run in parallel since
there are no dependencies between them. See :ref:`using_streams` for more
information on the semantics of streams in MLX.

In the above ``add`` example, there are no dependencies between operations, so
there is no possibility of race conditions. If there are dependencies, the
MLX scheduler will automatically manage them. For example:

.. code-block:: python

  c = mx.add(a, b, stream=mx.cpu)
  d = mx.add(a, c, stream=mx.gpu)

In the above case, the second ``add`` runs on the GPU but it depends on the
output of the first ``add``, which runs on the CPU. MLX will
automatically insert a dependency between the two streams so that the second
``add`` only starts executing after the first is complete and ``c`` is
available.

A Simple Example
~~~~~~~~~~~~~~~~

Here is a more interesting (albeit slightly contrived) example of how unified
memory can be helpful. Suppose we have the following computation:

.. code-block:: python

  def fun(a, b, d1, d2):
      x = mx.matmul(a, b, stream=d1)
      for _ in range(500):
          b = mx.exp(b, stream=d2)
      return x, b

which we want to run with the following arguments:

.. code-block:: python

  a = mx.random.uniform(shape=(4096, 512))
  b = mx.random.uniform(shape=(512, 4))

The first ``matmul`` operation is a good fit for the GPU since it's more
compute dense. The second sequence of operations is a better fit for the CPU,
since the operations are very small and would probably be overhead bound on the GPU.

If we time the computation fully on the GPU, we get 2.8 milliseconds. But if we
run the computation with ``d1=mx.gpu`` and ``d2=mx.cpu``, then the time is only
about 1.4 milliseconds, about twice as fast. These times were measured on an M1
Max.
@@ -1,3 +1,5 @@
+.. _using_streams:
+
 Using Streams
 =============