mirror of
https://github.com/ml-explore/mlx.git
synced 2025-12-16 01:49:05 +08:00
Progress with the docs
This commit is contained in:
BIN
docs/src/_static/distributed/m3-ultra-mesh-broken.png
Normal file
BIN
docs/src/_static/distributed/m3-ultra-mesh-broken.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 16 KiB |
BIN
docs/src/_static/distributed/m3-ultra-mesh.png
Normal file
BIN
docs/src/_static/distributed/m3-ultra-mesh.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 22 KiB |
@@ -7,22 +7,29 @@ Distributed Communication
|
||||
|
||||
MLX supports distributed communication operations that allow the computational cost
|
||||
of training or inference to be shared across many physical machines. At the
|
||||
moment we support three different communication backends:
|
||||
moment we support several different communication backends introduced below.
|
||||
|
||||
.. list-table::
|
||||
:widths: 20 80
|
||||
:header-rows: 1
|
||||
|
||||
* - Backend
|
||||
- Description
|
||||
* - :ref:`MPI <mpi_section>`
|
||||
- A full featured and mature distributed communications library.
|
||||
* - :ref:`RING <ring_section>`
|
||||
- Ring all reduce and all gather over TCP sockets. Always available and
|
||||
usually faster than MPI.
|
||||
* - :ref:`JACCL <ring_section>`
|
||||
- Low latency communication with RDMA over thunderbolt. Necessary for
|
||||
things like tensor parallelism.
|
||||
* - :ref:`NCCL <nccl_section>`
|
||||
- The backend of choice for CUDA environments.
|
||||
|
||||
* `MPI <https://en.wikipedia.org/wiki/Message_Passing_Interface>`_ a
|
||||
full-featured and mature distributed communications library
|
||||
* A **ring** backend of our own that uses native TCP sockets. It should be
|
||||
faster for thunderbolt connections, but it also works over Ethernet.
|
||||
* `nccl <https://developer.nvidia.com/nccl>`_, for use in CUDA environments.
|
||||
|
||||
The list of all currently supported operations and their documentation can be
|
||||
seen in the :ref:`API docs<distributed>`.
|
||||
|
||||
.. note::
|
||||
Some operations may not be supported or not as fast as they should be.
|
||||
We are adding more and tuning the ones we have as we are figuring out the
|
||||
best way to do distributed computing on Macs using MLX.
|
||||
|
||||
Getting Started
|
||||
---------------
|
||||
|
||||
@@ -85,7 +92,7 @@ Selecting Backend
|
||||
^^^^^^^^^^^^^^^^^
|
||||
|
||||
You can select the backend you want to use when calling :func:`init` by passing
|
||||
one of ``{'any', 'ring', 'mpi', 'nccl'}``. When passing ``any``, MLX will try all
|
||||
one of ``{'any', 'ring', 'jaccl', 'mpi', 'nccl'}``. When passing ``any``, MLX will try all
|
||||
available backends. If they all fail then a singleton group is created.
|
||||
|
||||
.. note::
|
||||
@@ -192,16 +199,247 @@ almost identical to the example above:
|
||||
loss = step(model, x, y)
|
||||
mx.eval(loss, model.parameters())
|
||||
|
||||
.. _ring_section:
|
||||
|
||||
Getting Started with Ring
|
||||
-------------------------
|
||||
|
||||
The ring backend does not depend on any third party library so it is always
|
||||
available. It uses TCP sockets so the nodes need to be reachable via a network.
|
||||
As the name suggests the nodes are connected in a ring which means that rank 1
|
||||
can only communicate with rank 0 and rank 2, rank 2 only with rank 1 and rank 3
|
||||
and so on and so forth. As a result :func:`send` and :func:`recv` with
|
||||
arbitrary sender and receiver is not supported in the ring backend.
|
||||
|
||||
Defining a Ring
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
The easiest way to define and use a ring is via a JSON hostfile and the
|
||||
``mlx.launch`` :doc:`helper script <launching_distributed>`. For each node one
|
||||
defines a hostname to ssh into to run commands on this node and one or more IPs
|
||||
that this node will listen to for connections.
|
||||
|
||||
For example the hostfile below defines a 4 node ring. ``hostname1`` will be
|
||||
rank 0, ``hostname2`` rank 1 etc.
|
||||
|
||||
.. code:: json
|
||||
|
||||
[
|
||||
{"ssh": "hostname1", "ips": ["123.123.123.1"]},
|
||||
{"ssh": "hostname2", "ips": ["123.123.123.2"]},
|
||||
{"ssh": "hostname3", "ips": ["123.123.123.3"]},
|
||||
{"ssh": "hostname4", "ips": ["123.123.123.4"]}
|
||||
]
|
||||
|
||||
Running ``mlx.launch --hostfile ring-4.json my_script.py`` will ssh into each
|
||||
node, run the script which will listen for connections in each of the provided
|
||||
IPs. Specifically, ``hostname1`` will connect to ``123.123.123.2`` and accept a
|
||||
connection from ``123.123.123.4`` and so on and so forth.
|
||||
|
||||
Thunderbolt Ring
|
||||
^^^^^^^^^^^^^^^^
|
||||
|
||||
Although the ring backend can have benefits over MPI even for Ethernet, its
|
||||
main purpose is to use Thunderbolt rings for higher bandwidth communication.
|
||||
Setting up such thunderbolt rings can be done manually, but is a relatively
|
||||
tedious process. To simplify this, we provide the utility ``mlx.distributed_config``.
|
||||
|
||||
To use ``mlx.distributed_config`` your computers need to be accessible by ssh via
|
||||
Ethernet or Wi-Fi. Subsequently, connect them via thunderbolt cables and then call the
|
||||
utility as follows:
|
||||
|
||||
.. code:: shell
|
||||
|
||||
mlx.distributed_config --verbose --hosts host1,host2,host3,host4 --backend ring
|
||||
|
||||
By default the script will attempt to discover the thunderbolt ring and provide
|
||||
you with the commands to configure each node as well as the ``hostfile.json``
|
||||
to use with ``mlx.launch``. If password-less ``sudo`` is available on the nodes
|
||||
then ``--auto-setup`` can be used to configure them automatically.
|
||||
|
||||
If you want to go through the process manually, the steps are as follows:
|
||||
|
||||
* Disable the thunderbolt bridge interface
|
||||
* For the cable connecting rank ``i`` to rank ``i + 1`` find the interfaces
|
||||
corresponding to that cable in nodes ``i`` and ``i + 1``.
|
||||
* Set up a unique subnetwork connecting the two nodes for the corresponding
|
||||
interfaces. For instance if the cable corresponds to ``en2`` on node ``i``
|
||||
and ``en2`` also on node ``i + 1`` then we may assign IPs ``192.168.0.1`` and
|
||||
``192.168.0.2`` respectively to the two nodes. For more details you can see
|
||||
the commands prepared by the utility script.
|
||||
|
||||
.. _jaccl_section:
|
||||
|
||||
Getting Started with RDMA over Thunderbolt
|
||||
------------------------------------------
|
||||
|
||||
Starting from version 26.2 RDMA over thunderbolt is available in MacOS and
|
||||
enables low-latency communication between Macs with thunderbolt 5. MLX provides
|
||||
the JACCL backend that uses this functionality to achieve communication latency
|
||||
an order of magnitude lower than the ring backend.
|
||||
|
||||
.. note::
|
||||
|
||||
The name JACCL (pronounced Jackal) stands for *Jack and Angelos' Collective
|
||||
Communication Library* and it is an obvious pun to Nvidia's NCCL but also
|
||||
tribute to *Jack Beasley* who led the development of RDMA over Thunderbolt
|
||||
at Apple.
|
||||
|
||||
Enabling RDMA
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
Until the feature matures, enabling RDMA over thunderbolt is slightly more
|
||||
involved and **cannot** be done remotely even with sudo. In fact it has to be
|
||||
done in macOS recovery:
|
||||
|
||||
1. `Start your computer in recovery <https://support.apple.com/en-us/102518>`_.
|
||||
2. Open the Terminal by going to Utilities -> Terminal.
|
||||
3. Run ``rdma_ctl enable``.
|
||||
4. Reboot.
|
||||
|
||||
To verify that you have successfully enabled Thunderbolt RDMA you can run
|
||||
``ibv_devices`` which should produce something like the following for an M3 Ultra.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
~ % ibv_devices
|
||||
device node GUID
|
||||
------ ----------------
|
||||
rdma_en2 8096a9d9edbaac05
|
||||
rdma_en3 8196a9d9edbaac05
|
||||
rdma_en5 8396a9d9edbaac05
|
||||
rdma_en4 8296a9d9edbaac05
|
||||
rdma_en6 8496a9d9edbaac05
|
||||
rdma_en7 8596a9d9edbaac05
|
||||
|
||||
Defining a Mesh
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
The JACCL backend supports only fully connected topologies. Namely, there needs
|
||||
to be a thunderbolt cable connecting all pairs of Macs directly. For example in
|
||||
the following topology visualizations the left one is valid because there is a
|
||||
connection from any node to any other node, while for the one on the right M3
|
||||
Ultra 1 is not connected to M3 Ultra 2.
|
||||
|
||||
.. raw:: html
|
||||
|
||||
<div style="display: flex; text-align: center; align-items: end; font-size: 80%;">
|
||||
<div>
|
||||
<img src="/_static/distributed/m3-ultra-mesh.png" alt="M3 Ultra thunderbolt mesh" style="width: 55%">
|
||||
<p>Fully connected mesh of four M3 Ultra.</p>
|
||||
</div>
|
||||
<div>
|
||||
<img src="/_static/distributed/m3-ultra-mesh-broken.png" alt="M3 Ultra broken thunderbolt mesh" style="width: 55%">
|
||||
<p>Not a valid mesh (M3 Ultra 1 is not connected to M3 Ultra 2).</p>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
Similar to the ring backend, the easiest way to use JACCL with MLX is to write
|
||||
a JSON hostfile that will be used by ``mlx.launch``. The hostfile needs to contain
|
||||
|
||||
- Hostnames to use for launching scripts via ssh
|
||||
- An IP for rank 0 that is reachable by all nodes
|
||||
- A list of rdma devices that connect each node to each other node
|
||||
|
||||
The following JSON defines the valid 4-node mesh from the image above.
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
[
|
||||
{
|
||||
"ssh": "m3-ultra-1",
|
||||
"ips": ["123.123.123.1"],
|
||||
"rdma": [null, "rdma_en5", "rdma_en4", "rdma_en3"]
|
||||
},
|
||||
{
|
||||
"ssh": "m3-ultra-2",
|
||||
"ips": [],
|
||||
"rdma": ["rdma_en5", null, "rdma_en3", "rdma_en4"]
|
||||
},
|
||||
{
|
||||
"ssh": "m3-ultra-3",
|
||||
"ips": [],
|
||||
"rdma": ["rdma_en4", "rdma_en3", null, "rdma_en5"]
|
||||
},
|
||||
{
|
||||
"ssh": "m3-ultra-4",
|
||||
"ips": [],
|
||||
"rdma": ["rdma_en3", "rdma_en4", "rdma_en5", null]
|
||||
}
|
||||
]
|
||||
|
||||
Even though TCP/IP is not used when communicating with Thunderbolt RDMA,
|
||||
disabling the thunderbolt bridge is still required as well as setting up
|
||||
isolated local networks for each thunderbolt connection.
|
||||
|
||||
All of the above can be done instead via ``mlx.distributed_config``. The helper
|
||||
script will
|
||||
|
||||
- ssh into each node
|
||||
- extract the thunderbolt connectivity
|
||||
- check for a valid mesh
|
||||
- provide the commands to configure each node (or run them if sudo is available)
|
||||
- generate the hostfile to be used with ``mlx.launch``
|
||||
|
||||
Putting it All Together
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
For example to launch a distributed MLX script that uses JACCL is fairly simple
|
||||
if the nodes are reachable via ssh and have password-less sudo.
|
||||
|
||||
First, connect all the thunderbolt cables. Then we can verify the connections
|
||||
by using the ``mlx.distributed_config`` script to visualize them.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
mlx.distributed_config --verbose \
|
||||
--hosts m3-ultra-1,m3-ultra-2,m3-ultra-3,m3-ultra-4 \
|
||||
--over thunderbolt --dot | dot -Tpng | open -f -a Preview
|
||||
|
||||
After making sure that everything looks right we can auto-configure the nodes
|
||||
and save the hostfile to ``m3-ultra-jaccl.json`` by running:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
mlx.distributed_config --verbose \
|
||||
--hosts m3-ultra-1,m3-ultra-2,m3-ultra-3,m3-ultra-4 \
|
||||
--over thunderbolt --backend jaccl \
|
||||
--auto-setup --output m3-ultra-jaccl.json
|
||||
|
||||
And now we are ready to run a distributed MLX script such as distributed inference
|
||||
of a gigantic model using MLX-LM.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
mlx.launch --verbose --backend jaccl --hostfile m3-ultra-jaccl.json \
|
||||
--env MLX_METAL_FAST_SYNCH=1 -- \ # <--- important
|
||||
/path/to/remote/python -m mlx_lm chat --model mlx-community/DeepSeek-V3.2-8bit --shard
|
||||
|
||||
.. note::
|
||||
|
||||
Defining the environment variable ``MLX_METAL_FAST_SYNCH=1`` enables a
|
||||
different, faster way of synchronizing between the GPU and the CPU. It is
|
||||
not specific to the JACCL backend and can be used in all cases where the CPU
|
||||
and GPU need to collaborate for some computation and is pretty critical for
|
||||
low-latency communication since the communication is done by the CPU.
|
||||
|
||||
.. _nccl_section:
|
||||
|
||||
Getting Started with NCCL
|
||||
-------------------------
|
||||
|
||||
.. _mpi_section:
|
||||
|
||||
Getting Started with MPI
|
||||
------------------------
|
||||
|
||||
MLX already comes with the ability to "talk" to MPI if it is installed on the
|
||||
machine. Launching distributed MLX programs that use MPI can be done with
|
||||
``mpirun`` as expected. However, in the following examples we will be using
|
||||
``mlx.launch --backend mpi`` which takes care of some nuisances such as setting
|
||||
absolute paths for the ``mpirun`` executable and the ``libmpi.dyld`` shared
|
||||
library.
|
||||
MLX already comes with the ability to "talk" to `MPI
|
||||
<https://en.wikipedia.org/wiki/Message_Passing_Interface>`_ if it is installed
|
||||
on the machine. Launching distributed MLX programs that use MPI can be done
|
||||
with ``mpirun`` as expected. However, in the following examples we will be
|
||||
using ``mlx.launch --backend mpi`` which takes care of some nuisances such as
|
||||
setting absolute paths for the ``mpirun`` executable and the ``libmpi.dyld``
|
||||
shared library.
|
||||
|
||||
The simplest possible usage is the following which, assuming the minimal
|
||||
example in the beginning of this page, should result in:
|
||||
@@ -269,78 +507,9 @@ Force MPI to use the most performant network interface by setting ``--mca
|
||||
btl_tcp_if_include <iface>`` where ``<iface>`` should be the interface you want
|
||||
to use.
|
||||
|
||||
Getting Started with Ring
|
||||
Distributed Without ``mlx.launch``
|
||||
----------------------------------
|
||||
|
||||
|
||||
Using the helper scripts
|
||||
-------------------------
|
||||
|
||||
The ring backend does not depend on any third party library so it is always
|
||||
available. It uses TCP sockets so the nodes need to be reachable via a network.
|
||||
As the name suggests the nodes are connected in a ring which means that rank 1
|
||||
can only communicate with rank 0 and rank 2, rank 2 only with rank 1 and rank 3
|
||||
and so on and so forth. As a result :func:`send` and :func:`recv` with
|
||||
arbitrary sender and receiver is not supported in the ring backend.
|
||||
|
||||
Defining a Ring
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
The easiest way to define and use a ring is via a JSON hostfile and the
|
||||
``mlx.launch`` :doc:`helper script <launching_distributed>`. For each node one
|
||||
defines a hostname to ssh into to run commands on this node and one or more IPs
|
||||
that this node will listen to for connections.
|
||||
|
||||
For example the hostfile below defines a 4 node ring. ``hostname1`` will be
|
||||
rank 0, ``hostname2`` rank 1 etc.
|
||||
|
||||
.. code:: json
|
||||
|
||||
[
|
||||
{"ssh": "hostname1", "ips": ["123.123.123.1"]},
|
||||
{"ssh": "hostname2", "ips": ["123.123.123.2"]},
|
||||
{"ssh": "hostname3", "ips": ["123.123.123.3"]},
|
||||
{"ssh": "hostname4", "ips": ["123.123.123.4"]}
|
||||
]
|
||||
|
||||
Running ``mlx.launch --hostfile ring-4.json my_script.py`` will ssh into each
|
||||
node, run the script which will listen for connections in each of the provided
|
||||
IPs. Specifically, ``hostname1`` will connect to ``123.123.123.2`` and accept a
|
||||
connection from ``123.123.123.4`` and so on and so forth.
|
||||
|
||||
Thunderbolt Ring
|
||||
^^^^^^^^^^^^^^^^
|
||||
|
||||
Although the ring backend can have benefits over MPI even for Ethernet, its
|
||||
main purpose is to use Thunderbolt rings for higher bandwidth communication.
|
||||
Setting up such thunderbolt rings can be done manually, but is a relatively
|
||||
tedious process. To simplify this, we provide the utility ``mlx.distributed_config``.
|
||||
|
||||
To use ``mlx.distributed_config`` your computers need to be accessible by ssh via
|
||||
Ethernet or Wi-Fi. Subsequently, connect them via thunderbolt cables and then call the
|
||||
utility as follows:
|
||||
|
||||
.. code:: shell
|
||||
|
||||
mlx.distributed_config --verbose --hosts host1,host2,host3,host4
|
||||
|
||||
By default the script will attempt to discover the thunderbolt ring and provide
|
||||
you with the commands to configure each node as well as the ``hostfile.json``
|
||||
to use with ``mlx.launch``. If password-less ``sudo`` is available on the nodes
|
||||
then ``--auto-setup`` can be used to configure them automatically.
|
||||
|
||||
To validate your connection without configuring anything
|
||||
``mlx.distributed_config`` can also plot the ring using DOT format.
|
||||
|
||||
.. code:: shell
|
||||
|
||||
mlx.distributed_config --verbose --hosts host1,host2,host3,host4 --dot >ring.dot
|
||||
dot -Tpng ring.dot >ring.png
|
||||
open ring.png
|
||||
|
||||
If you want to go through the process manually, the steps are as follows:
|
||||
|
||||
* Disable the thunderbolt bridge interface
|
||||
* For the cable connecting rank ``i`` to rank ``i + 1`` find the interfaces
|
||||
corresponding to that cable in nodes ``i`` and ``i + 1``.
|
||||
* Set up a unique subnetwork connecting the two nodes for the corresponding
|
||||
interfaces. For instance if the cable corresponds to ``en2`` on node ``i``
|
||||
and ``en2`` also on node ``i + 1`` then we may assign IPs ``192.168.0.1`` and
|
||||
``192.168.0.2`` respectively to the two nodes. For more details you can see
|
||||
the commands prepared by the utility script.
|
||||
|
||||
Reference in New Issue
Block a user