.. _unified_memory:

Unified Memory
==============

.. currentmodule:: mlx.core

Apple silicon has a unified memory architecture. The CPU and GPU have direct
access to the same memory pool. MLX is designed to take advantage of that.

Concretely, when you make an array in MLX you don't have to specify its
location:

.. code-block:: python

  import mlx.core as mx

  a = mx.random.normal((100,))
  b = mx.random.normal((100,))

Both ``a`` and ``b`` live in unified memory.

In MLX, rather than moving arrays to devices, you specify the device when you
run the operation. Any device can perform any operation on ``a`` and ``b``
without needing to move them from one memory location to another. For example:

.. code-block:: python

  mx.add(a, b, stream=mx.cpu)
  mx.add(a, b, stream=mx.gpu)

In the above, both the CPU and the GPU will perform the same add operation.
The operations can (and likely will) be run in parallel since there are no
dependencies between them. See :ref:`using_streams` for more information on
the semantics of streams in MLX.

In the above ``add`` example, there are no dependencies between operations, so
there is no possibility for race conditions. If there are dependencies, the
MLX scheduler will automatically manage them. For example:

.. code-block:: python

  c = mx.add(a, b, stream=mx.cpu)
  d = mx.add(a, c, stream=mx.gpu)

In the above case, the second ``add`` runs on the GPU but it depends on the
output of the first ``add`` which is running on the CPU. MLX will
automatically insert a dependency between the two streams so that the second
``add`` only starts executing after the first is complete and ``c`` is
available.

A Simple Example
~~~~~~~~~~~~~~~~

Here is a more interesting (albeit slightly contrived) example of how unified
memory can be helpful. Suppose we have the following computation:

.. code-block:: python

  def fun(a, b, d1, d2):
      x = mx.matmul(a, b, stream=d1)
      for _ in range(500):
          b = mx.exp(b, stream=d2)
      return x, b

which we want to run with the following arguments:

.. code-block:: python

  a = mx.random.uniform(shape=(4096, 512))
  b = mx.random.uniform(shape=(512, 4))

The first ``matmul`` operation is a good fit for the GPU since it's more
compute dense. The second sequence of operations is a better fit for the CPU,
since the operations are very small and would probably be overhead bound on
the GPU.

If we time the computation fully on the GPU, it takes 2.8 milliseconds. But if
we run it with ``d1=mx.gpu`` and ``d2=mx.cpu``, the time drops to about 1.4
milliseconds, roughly twice as fast. These times were measured on an M1 Max.