Some docs on unified memory (#62)

* doc on unified memory
Awni Hannun 2023-12-07 19:42:24 -08:00 committed by GitHub
parent d11d77e581
commit cfc39d84b7
4 changed files with 81 additions and 7 deletions


@ -36,6 +36,7 @@ are the CPU and GPU.
   :maxdepth: 1

   quick_start
   unified_memory
   using_streams

.. toctree::


@ -62,10 +62,3 @@ and :func:`jvp` for Jacobian-vector products.
Use :func:`value_and_grad` to efficiently compute both a function's output and
gradient with respect to the function's input.
Devices and Streams
-------------------


@ -0,0 +1,78 @@
.. _unified_memory:
Unified Memory
==============
.. currentmodule:: mlx.core
Apple silicon has a unified memory architecture. The CPU and GPU have direct
access to the same memory pool. MLX is designed to take advantage of that.
Concretely, when you make an array in MLX you don't have to specify its location:
.. code-block:: python

  import mlx.core as mx

  a = mx.random.normal((100,))
  b = mx.random.normal((100,))
Both ``a`` and ``b`` live in unified memory.
In MLX, rather than moving arrays to devices, you specify the device when you
run the operation. Any device can perform any operation on ``a`` and ``b``
without needing to move them from one memory location to another. For example:
.. code-block:: python

  mx.add(a, b, stream=mx.cpu)
  mx.add(a, b, stream=mx.gpu)
In the above, both the CPU and the GPU will perform the same add
operation. The operations can (and likely will) be run in parallel since
there are no dependencies between them. See :ref:`using_streams` for more
information on the semantics of streams in MLX.
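You can also create a stream explicitly and pass it to an operation. A
minimal sketch using :func:`new_stream` (see :ref:`using_streams` for the
full streams API):

.. code-block:: python

  # Run the add on a dedicated GPU stream instead of the default one.
  s = mx.new_stream(mx.gpu)
  mx.add(a, b, stream=s)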
In the above ``add`` example, there are no dependencies between operations, so
there is no possibility for race conditions. If there are dependencies, the
MLX scheduler will automatically manage them. For example:
.. code-block:: python

  c = mx.add(a, b, stream=mx.cpu)
  d = mx.add(a, c, stream=mx.gpu)
In the above case, the second ``add`` runs on the GPU but it depends on the
output of the first ``add`` which is running on the CPU. MLX will
automatically insert a dependency between the two streams so that the second
``add`` only starts executing after the first is complete and ``c`` is
available.
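Since MLX is lazy, neither add actually runs until a result is needed. A
minimal sketch, assuming :func:`eval` to force the computation from the
snippet above:

.. code-block:: python

  # Forcing d runs both adds; the scheduler orders the CPU add before
  # the GPU add that consumes its output ``c``.
  mx.eval(d)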
A Simple Example
~~~~~~~~~~~~~~~~
Here is a more interesting (albeit slightly contrived) example of how unified
memory can be helpful. Suppose we have the following computation:
.. code-block:: python

  def fun(a, b, d1, d2):
      x = mx.matmul(a, b, stream=d1)
      for _ in range(500):
          b = mx.exp(b, stream=d2)
      return x, b
which we want to run with the following arguments:
.. code-block:: python

  a = mx.random.uniform(shape=(4096, 512))
  b = mx.random.uniform(shape=(512, 4))
The first ``matmul`` operation is a good fit for the GPU since it is more
compute-dense. The second sequence of operations is a better fit for the CPU,
since the operations are very small and would probably be overhead-bound on the GPU.
If we time the computation fully on the GPU, we get 2.8 milliseconds. But if we
run the computation with ``d1=mx.gpu`` and ``d2=mx.cpu``, then the time is only
about 1.4 milliseconds, roughly twice as fast. These times were measured on an
M1 Max.
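One way to reproduce such a measurement; a minimal sketch, assuming a simple
wall-clock timer and :func:`eval` to force MLX's lazy computation (the
``timeit`` helper below is hypothetical, not part of MLX):

.. code-block:: python

  import time

  def timeit(d1, d2, iters=10):
      # Warm up once so setup cost is not measured.
      mx.eval(*fun(a, b, d1, d2))
      tic = time.perf_counter()
      for _ in range(iters):
          # mx.eval forces the lazy graph to actually run.
          mx.eval(*fun(a, b, d1, d2))
      toc = time.perf_counter()
      return 1e3 * (toc - tic) / iters  # milliseconds per call

  print(timeit(mx.gpu, mx.gpu))  # everything on the GPU
  print(timeit(mx.gpu, mx.cpu))  # matmul on GPU, exp loop on CPU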


@ -1,3 +1,5 @@
.. _using_streams:
Using Streams
=============