Mirror of https://github.com/ml-explore/mlx.git, synced 2025-12-16 01:49:05 +08:00

Compare commits (9 commits)
ebda161a86 (ibv-backen)

Commits: d2bc340df4, fabc947df4, 5523087cfb, 2f939acefa, 3b416a2e36, 753c6a4d0f, d3a754c8aa, 595a4ad206, a4dc1fac6c
@@ -119,10 +119,6 @@ if(MLX_BUILD_METAL)
     COMMAND zsh "-c" "/usr/bin/xcrun -sdk macosx --show-sdk-version"
     OUTPUT_VARIABLE MACOS_SDK_VERSION
     OUTPUT_STRIP_TRAILING_WHITESPACE COMMAND_ERROR_IS_FATAL ANY)
-  execute_process(
-    COMMAND zsh "-c" "/usr/bin/xcrun -sdk macosx --show-sdk-path"
-    OUTPUT_VARIABLE CMAKE_OSX_SYSROOT
-    OUTPUT_STRIP_TRAILING_WHITESPACE COMMAND_ERROR_IS_FATAL ANY)
   if(${MACOS_SDK_VERSION} LESS 14.0)
     message(
BIN docs/src/_static/distributed/m3-ultra-mesh-broken.png (new file, 16 KiB, binary file not shown)
BIN docs/src/_static/distributed/m3-ultra-mesh.png (new file, 22 KiB, binary file not shown)
@@ -7,22 +7,29 @@ Distributed Communication

MLX supports distributed communication operations that allow the computational cost
of training or inference to be shared across many physical machines. At the
moment we support several different communication backends, introduced below.

.. list-table::
   :widths: 20 80
   :header-rows: 1

   * - Backend
     - Description
   * - :ref:`MPI <mpi_section>`
     - A full-featured and mature distributed communications library.
   * - :ref:`RING <ring_section>`
     - Ring all-reduce and all-gather over TCP sockets. Always available and
       usually faster than MPI.
   * - :ref:`JACCL <jaccl_section>`
     - Low-latency communication with RDMA over Thunderbolt. Necessary for
       things like tensor parallelism.
   * - :ref:`NCCL <nccl_section>`
     - The backend of choice for CUDA environments.

The list of all currently supported operations and their documentation can be
seen in the :ref:`API docs <distributed>`.

.. note::

   Some operations may not be supported or not as fast as they should be.
   We are adding more and tuning the ones we have as we figure out the best
   way to do distributed computing on Macs using MLX.

Getting Started
---------------
@@ -85,7 +92,7 @@ Selecting Backend
^^^^^^^^^^^^^^^^^

You can select the backend you want to use when calling :func:`init` by passing
one of ``{'any', 'ring', 'jaccl', 'mpi', 'nccl'}``. When passing ``any``, MLX
will try all available backends. If they all fail then a singleton group is
created.

.. note::
@@ -110,6 +117,8 @@ The following examples aim to clarify the backend initialization logic in MLX:

   world_ring = mx.distributed.init(backend="ring")
   world_any = mx.distributed.init()  # same as MPI because it was initialized first!

.. _training_example:

Training Example
----------------
@@ -192,16 +201,273 @@ almost identical to the example above:

   loss = step(model, x, y)
   mx.eval(loss, model.parameters())

.. _ring_section:

Getting Started with Ring
-------------------------

The ring backend does not depend on any third-party library so it is always
available. It uses TCP sockets, so the nodes need to be reachable over a
network. As the name suggests, the nodes are connected in a ring: rank 1 can
only communicate with ranks 0 and 2, rank 2 only with ranks 1 and 3, and so
on. As a result, :func:`send` and :func:`recv` with arbitrary senders and
receivers are not supported by the ring backend.
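The neighbor-only pattern described above can be sketched in plain Python. This is a toy, single-process simulation of a ring all-reduce (sum), not MLX code: each simulated rank only ever receives from its left neighbor and forwards to its right.

```python
def ring_all_reduce(values):
    """Simulate summing one scalar per rank using only neighbor traffic."""
    n = len(values)
    totals = list(values)  # what each rank has accumulated so far
    carry = list(values)   # the value each rank is currently forwarding
    for _ in range(n - 1):
        # every rank receives the carried value from its left neighbor
        carry = [carry[(i - 1) % n] for i in range(n)]
        totals = [t + c for t, c in zip(totals, carry)]
    return totals

print(ring_all_reduce([1.0, 2.0, 3.0, 4.0]))  # every rank ends with 10.0
```

After ``n - 1`` forwarding steps every rank has seen every other rank's value, which is why the ring never needs arbitrary point-to-point sends.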
Defining a Ring
^^^^^^^^^^^^^^^

The easiest way to define and use a ring is via a JSON hostfile and the
``mlx.launch`` :doc:`helper script <launching_distributed>`. For each node one
defines a hostname to ssh into to run commands on this node and one or more
IPs that this node will listen to for connections.

For example, the hostfile below defines a 4-node ring. ``hostname1`` will be
rank 0, ``hostname2`` rank 1 and so on.

.. code:: json

   [
       {"ssh": "hostname1", "ips": ["123.123.123.1"]},
       {"ssh": "hostname2", "ips": ["123.123.123.2"]},
       {"ssh": "hostname3", "ips": ["123.123.123.3"]},
       {"ssh": "hostname4", "ips": ["123.123.123.4"]}
   ]

Running ``mlx.launch --hostfile ring-4.json my_script.py`` will ssh into each
node and run the script, which will listen for connections on each of the
provided IPs. Specifically, ``hostname1`` will connect to ``123.123.123.2``
and accept a connection from ``123.123.123.4``, and so on around the ring.
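The connection pattern implied by such a hostfile can be computed mechanically. This short Python sketch (our own helper, not part of MLX) prints which IP each rank dials and which neighbor it accepts from:

```python
import json

hostfile = json.loads("""
[
    {"ssh": "hostname1", "ips": ["123.123.123.1"]},
    {"ssh": "hostname2", "ips": ["123.123.123.2"]},
    {"ssh": "hostname3", "ips": ["123.123.123.3"]},
    {"ssh": "hostname4", "ips": ["123.123.123.4"]}
]
""")

n = len(hostfile)
plan = {}
for rank, node in enumerate(hostfile):
    # each rank connects to its right neighbor and accepts from its left one
    right = hostfile[(rank + 1) % n]["ips"][0]
    left = hostfile[(rank - 1) % n]["ips"][0]
    plan[node["ssh"]] = {"connects_to": right, "accepts_from": left}
    print(f"rank {rank} ({node['ssh']}): connect -> {right}, accept <- {left}")
```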
Thunderbolt Ring
^^^^^^^^^^^^^^^^

Although the ring backend can have benefits over MPI even over Ethernet, its
main purpose is to use Thunderbolt rings for higher-bandwidth communication.
Setting up such Thunderbolt rings can be done manually, but it is a relatively
tedious process. To simplify it, we provide the utility
``mlx.distributed_config``.

To use ``mlx.distributed_config`` your computers need to be accessible by ssh
via Ethernet or Wi-Fi. Subsequently, connect them with Thunderbolt cables and
then call the utility as follows:

.. code:: shell

   mlx.distributed_config --verbose --hosts host1,host2,host3,host4 --backend ring

By default the script will attempt to discover the Thunderbolt ring and provide
you with the commands to configure each node as well as the ``hostfile.json``
to use with ``mlx.launch``. If password-less ``sudo`` is available on the nodes
then ``--auto-setup`` can be used to configure them automatically.

If you want to go through the process manually, the steps are as follows:

* Disable the Thunderbolt bridge interface.
* For the cable connecting rank ``i`` to rank ``i + 1``, find the interfaces
  corresponding to that cable on nodes ``i`` and ``i + 1``.
* Set up a unique subnetwork connecting the two nodes on the corresponding
  interfaces. For instance, if the cable corresponds to ``en2`` on node ``i``
  and also to ``en2`` on node ``i + 1``, then we may assign IPs
  ``192.168.0.1`` and ``192.168.0.2`` respectively to the two nodes. For more
  details you can see the commands prepared by the utility script.
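The manual steps above amount to assigning one small private subnet per cable. As a hedged illustration (the ``192.168.x.y`` scheme extends the example from the text, and ``plan_ring_subnets`` is a hypothetical helper, not an MLX function):

```python
def plan_ring_subnets(n):
    """Assign a distinct /24 to the cable between rank i and rank (i+1) % n."""
    plan = []
    for i in range(n):
        j = (i + 1) % n
        # nodes i and j get the .1 and .2 addresses of subnet 192.168.i.0/24
        plan.append((i, j, f"192.168.{i}.1", f"192.168.{i}.2"))
    return plan

for i, j, ip_i, ip_j in plan_ring_subnets(4):
    print(f"cable {i} <-> {j}: node {i} gets {ip_i}, node {j} gets {ip_j}")
```

Using a distinct subnet per cable is what keeps traffic for each link on the right interface once the Thunderbolt bridge is disabled.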
.. _jaccl_section:

Getting Started with RDMA over Thunderbolt
------------------------------------------

Starting from version 26.2, RDMA over Thunderbolt is available in macOS and
enables low-latency communication between Macs with Thunderbolt 5. MLX provides
the JACCL backend that uses this functionality to achieve communication latency
an order of magnitude lower than the ring backend.

.. note::

   The name JACCL (pronounced Jackal) stands for *Jack and Angelos' Collective
   Communication Library*. It is an obvious pun on NVIDIA's NCCL but also a
   tribute to *Jack Beasley*, who led the development of RDMA over Thunderbolt
   at Apple.

Enabling RDMA
^^^^^^^^^^^^^

Until the feature matures, enabling RDMA over Thunderbolt is slightly more
involved and **cannot** be done remotely, even with sudo. In fact, it has to
be done in macOS recovery:

1. `Start your computer in recovery <https://support.apple.com/en-us/102518>`_.
2. Open the Terminal by going to Utilities -> Terminal.
3. Run ``rdma_ctl enable``.
4. Reboot.

To verify that you have successfully enabled Thunderbolt RDMA you can run
``ibv_devices``, which should produce something like the following for an M3
Ultra.

.. code-block:: bash

   ~ % ibv_devices
       device                 node GUID
       ------              ----------------
       rdma_en2            8096a9d9edbaac05
       rdma_en3            8196a9d9edbaac05
       rdma_en5            8396a9d9edbaac05
       rdma_en4            8296a9d9edbaac05
       rdma_en6            8496a9d9edbaac05
       rdma_en7            8596a9d9edbaac05
Defining a Mesh
^^^^^^^^^^^^^^^

The JACCL backend supports only fully connected topologies. Namely, there needs
to be a Thunderbolt cable connecting every pair of Macs directly. For example,
in the following topology visualizations, the left one is valid because there
is a connection from any node to any other node, while in the one on the right
M3 Ultra 1 is not connected to M3 Ultra 2.

.. raw:: html

   <div style="display: flex; text-align: center; align-items: end; font-size: 80%;">
     <div>
       <img src="/_static/distributed/m3-ultra-mesh.png" alt="M3 Ultra thunderbolt mesh" style="width: 55%">
       <p>Fully connected mesh of four M3 Ultra.</p>
     </div>
     <div>
       <img src="/_static/distributed/m3-ultra-mesh-broken.png" alt="M3 Ultra broken thunderbolt mesh" style="width: 55%">
       <p>Not a valid mesh (M3 Ultra 1 is not connected to M3 Ultra 2).</p>
     </div>
   </div>

Similar to the ring backend, the easiest way to use JACCL with MLX is to write
a JSON hostfile that will be used by ``mlx.launch``. The hostfile needs to
contain

- Hostnames to use for launching scripts via ssh
- An IP for rank 0 that is reachable by all nodes
- A list of RDMA devices that connect each node to each other node

The following JSON defines the valid 4-node mesh from the image above.

.. code-block:: json

   [
       {
           "ssh": "m3-ultra-1",
           "ips": ["123.123.123.1"],
           "rdma": [null, "rdma_en5", "rdma_en4", "rdma_en3"]
       },
       {
           "ssh": "m3-ultra-2",
           "ips": [],
           "rdma": ["rdma_en5", null, "rdma_en3", "rdma_en4"]
       },
       {
           "ssh": "m3-ultra-3",
           "ips": [],
           "rdma": ["rdma_en4", "rdma_en3", null, "rdma_en5"]
       },
       {
           "ssh": "m3-ultra-4",
           "ips": [],
           "rdma": ["rdma_en3", "rdma_en4", "rdma_en5", null]
       }
   ]

Even though TCP/IP is not used when communicating with Thunderbolt RDMA,
disabling the Thunderbolt bridge is still required, as is setting up isolated
local networks for each Thunderbolt connection.
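The fully-connected requirement can be checked mechanically from the ``rdma`` matrices in such a hostfile. The following is a sketch of one such check (our own helper, not part of MLX): the diagonal must be null and every off-diagonal entry must name a device.

```python
def is_full_mesh(rdma):
    """True if every pair of distinct nodes shares an RDMA device entry."""
    n = len(rdma)
    for i, row in enumerate(rdma):
        if len(row) != n or row[i] is not None:
            return False
        if any(row[j] is None for j in range(n) if j != i):
            return False
    return True

valid = [
    [None, "rdma_en5", "rdma_en4", "rdma_en3"],
    ["rdma_en5", None, "rdma_en3", "rdma_en4"],
    ["rdma_en4", "rdma_en3", None, "rdma_en5"],
    ["rdma_en3", "rdma_en4", "rdma_en5", None],
]
broken = [row[:] for row in valid]
broken[0][1] = broken[1][0] = None  # unplug the cable between nodes 0 and 1

print(is_full_mesh(valid), is_full_mesh(broken))  # True False
```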
All of the above can be done instead via ``mlx.distributed_config``. This
helper script will

- ssh into each node
- extract the Thunderbolt connectivity
- check for a valid mesh
- provide the commands to configure each node (or run them if sudo is available)
- generate the hostfile to be used with ``mlx.launch``
Putting it All Together
^^^^^^^^^^^^^^^^^^^^^^^

Launching a distributed MLX script that uses JACCL is fairly simple if the
nodes are reachable via ssh and have password-less sudo.

First, connect all the Thunderbolt cables. Then we can verify the connections
by using the ``mlx.distributed_config`` script to visualize them.

.. code-block::

   mlx.distributed_config --verbose \
       --hosts m3-ultra-1,m3-ultra-2,m3-ultra-3,m3-ultra-4 \
       --over thunderbolt --dot | dot -Tpng | open -f -a Preview

After making sure that everything looks right, we can auto-configure the nodes
and save the hostfile to ``m3-ultra-jaccl.json`` by running:

.. code-block::

   mlx.distributed_config --verbose \
       --hosts m3-ultra-1,m3-ultra-2,m3-ultra-3,m3-ultra-4 \
       --over thunderbolt --backend jaccl \
       --auto-setup --output m3-ultra-jaccl.json

And now we are ready to run a distributed MLX script, such as distributed
inference of a gigantic model using MLX-LM.

.. code-block::

   mlx.launch --verbose --backend jaccl --hostfile m3-ultra-jaccl.json \
       --env MLX_METAL_FAST_SYNCH=1 -- \  # <--- important
       /path/to/remote/python -m mlx_lm chat --model mlx-community/DeepSeek-V3.2-8bit --shard

.. note::

   Defining the environment variable ``MLX_METAL_FAST_SYNCH=1`` enables a
   different, faster way of synchronizing between the GPU and the CPU. It is
   not specific to the JACCL backend and can be used whenever the CPU and GPU
   need to collaborate on a computation. It is critical for low-latency
   communication, since the communication is done by the CPU.
.. _nccl_section:

Getting Started with NCCL
-------------------------

MLX on CUDA environments ships with the ability to talk to `NCCL
<https://developer.nvidia.com/nccl>`_, a high-performance collective
communication library that supports both multi-GPU and multi-node setups.

For CUDA environments, NCCL is the default backend for ``mlx.launch`` and all
it takes to run a distributed job is

.. code-block::

   mlx.launch -n 8 test.py

   # perfect for interactive scripts
   mlx.launch -n 8 python -m mlx_lm chat --model my-model --shard

You can also use ``mlx.launch`` to ssh to a remote node and launch a script
with the same ease

.. code-block::

   mlx.launch --hosts my-cuda-node -n 8 test.py

In many cases you may not want to use ``mlx.launch`` with the NCCL backend
because the cluster scheduler will be the one launching the processes. You can
:ref:`see which environment variables need to be defined <no_mlx_launch>` in
order for the MLX NCCL backend to be initialized correctly.
.. _mpi_section:

Getting Started with MPI
------------------------

MLX already comes with the ability to "talk" to `MPI
<https://en.wikipedia.org/wiki/Message_Passing_Interface>`_ if it is installed
on the machine. Launching distributed MLX programs that use MPI can be done
with ``mpirun`` as expected. However, in the following examples we will be
using ``mlx.launch --backend mpi`` which takes care of some nuisances such as
setting absolute paths for the ``mpirun`` executable and the ``libmpi.dyld``
shared library.

The simplest possible usage is the following which, assuming the minimal
example in the beginning of this page, should result in:
@@ -269,78 +535,116 @@ Force MPI to use the most performant network interface by setting ``--mca

btl_tcp_if_include <iface>`` where ``<iface>`` should be the interface you want
to use.

.. _no_mlx_launch:

Distributed Without ``mlx.launch``
----------------------------------

None of the distributed backends require launching with ``mlx.launch``. The
script simply connects to each host, starts a process per rank and sets up the
necessary environment variables before delegating to your MLX script. See the
:doc:`dedicated documentation page <launching_distributed>` for more details.
For many use cases this will be the easiest way to perform distributed
computations in MLX. However, there may be reasons why you cannot or should
not use ``mlx.launch``. A common such case is the use of a scheduler that
starts all the processes for you on machines undetermined at the time of
scheduling the job.

Below we list the environment variables required to use each backend.

Ring
^^^^

**MLX_RANK** should contain a single 0-based integer that defines the rank of
the process.

**MLX_HOSTFILE** should contain the path to a JSON file that contains the IPs
and ports each rank listens to, something like the following:

.. code-block:: json

   [
       ["123.123.1.1:5000", "123.123.1.2:5000"],
       ["123.123.2.1:5000", "123.123.2.2:5000"],
       ["123.123.3.1:5000", "123.123.3.2:5000"],
       ["123.123.4.1:5000", "123.123.4.2:5000"]
   ]

**MLX_RING_VERBOSE** is optional; if set to 1 it enables some more logging
from the distributed backend.
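Put together, a scheduler (or a hand-rolled launcher) has to write the hostfile once and set ``MLX_RANK`` per process. A minimal sketch, assuming a local two-rank run and a hypothetical ``my_script.py`` (the spawn itself is left commented out):

```python
import json
import os
import tempfile

# IPs/ports for a local two-rank ring; real deployments list one entry per rank.
hosts = [["127.0.0.1:5000"], ["127.0.0.1:5001"]]

# write the hostfile that every rank will read via MLX_HOSTFILE
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(hosts, f)
    hostfile = f.name

envs = []
for rank in range(len(hosts)):
    env = dict(os.environ, MLX_RANK=str(rank), MLX_HOSTFILE=hostfile)
    envs.append(env)
    # subprocess.Popen(["python", "my_script.py"], env=env)  # one process per rank

print([e["MLX_RANK"] for e in envs])  # ['0', '1']
```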
JACCL
^^^^^

**MLX_RANK** should contain a single 0-based integer that defines the rank of
the process.

**MLX_JACCL_COORDINATOR** should contain the IP and port that rank 0 listens
on and that all the other ranks connect to in order to establish the RDMA
connections.

**MLX_IBV_DEVICES** should contain the path to a JSON file that contains the
ibverbs device names that connect each node to each other node, something like
the following:

.. code-block:: json

   [
       [null, "rdma_en5", "rdma_en4", "rdma_en3"],
       ["rdma_en5", null, "rdma_en3", "rdma_en4"],
       ["rdma_en4", "rdma_en3", null, "rdma_en5"],
       ["rdma_en3", "rdma_en4", "rdma_en5", null]
   ]

NCCL
^^^^

**MLX_RANK** should contain a single 0-based integer that defines the rank of
the process.

**MLX_WORLD_SIZE** should contain the total number of processes that will be
launched.

**NCCL_HOST_IP** and **NCCL_PORT** should contain the IP and port that all
hosts can connect to in order to establish the NCCL communication.

**CUDA_VISIBLE_DEVICES** should contain the local index of the GPU that
corresponds to this process.

Of course, any `other environment variable
<https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html>`_ that is
used by NCCL can be set.
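As a compact recap, a scheduler could build the per-process environment for the NCCL backend like this (a sketch of our own; the eight-GPUs-per-node default is an assumption, not something MLX mandates):

```python
def nccl_env(rank, world_size, host_ip, port, gpus_per_node=8):
    """Environment variables MLX's NCCL backend expects from a scheduler."""
    return {
        "MLX_RANK": str(rank),
        "MLX_WORLD_SIZE": str(world_size),
        "NCCL_HOST_IP": host_ip,
        "NCCL_PORT": str(port),
        # local GPU index for this process (assumes gpus_per_node GPUs per node)
        "CUDA_VISIBLE_DEVICES": str(rank % gpus_per_node),
    }

print(nccl_env(11, 16, "10.0.0.1", 12345))
```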
.. _tips_and_tricks:

Tips and Tricks
---------------

This is a small collection of tips to help you better utilize the distributed
communication capabilities of MLX.

- *Test locally first.*

  You can use the pattern ``mlx.launch -n2 -- my_script.py`` to run a
  small-scale test on a single node first.

- *Batch your communication.*

  As described in the :ref:`training example <training_example>`, performing
  many small communications can hurt performance. Copy the approach of
  :func:`mlx.nn.average_gradients` to gather many small communications into a
  single large one.

- *Visualize the connectivity.*

  Use ``mlx.distributed_config --hosts h1,h2,h3 --over thunderbolt --dot`` to
  visualize the connections and make sure that the cables are connected
  correctly. See the :ref:`JACCL section <jaccl_section>` for examples.

- *Use the debugger.*

  ``mlx.launch`` is meant for interactive use. It broadcasts stdin to all
  processes and gathers stdout from all processes. This makes using ``pdb`` a
  breeze.
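The batching tip can be illustrated with a small sketch of the flatten-reduce-split pattern that :func:`mlx.nn.average_gradients` follows (a plain-Python stand-in on lists, not the actual MLX implementation):

```python
def batched_all_reduce(arrays, all_reduce):
    """Run one all_reduce over many small arrays by flattening them first."""
    sizes = [len(a) for a in arrays]
    flat = [v for a in arrays for v in a]   # one big buffer
    flat = all_reduce(flat)                 # a single communication call
    out, i = [], 0
    for s in sizes:                         # split back into original shapes
        out.append(flat[i:i + s])
        i += s
    return out

# stand-in all_reduce: a 2-rank sum where the other rank holds equal values
result = batched_all_reduce([[1, 2], [3]], lambda x: [2 * v for v in x])
print(result)  # [[2, 4], [6]]
```

One call with a large buffer amortizes the per-message latency that dominates when reducing many tiny gradient arrays separately.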
@@ -7,13 +7,106 @@ Launching Distributed Programs

.. currentmodule:: mlx.core.distributed

Installing the MLX python package provides two utilities that help you
configure your Macs for distributed computation and launch distributed
programs on multiple nodes or with many processes on a single node. These
utilities are aptly named

- ``mlx.launch``
- ``mlx.distributed_config``

See the :doc:`distributed docs <distributed>` for an introduction and
getting-started guides to the various backends.

``mlx.distributed_config``
--------------------------

Unless you are launching distributed jobs locally for development or in
multi-GPU CUDA environments, you have several Macs that you need to configure
for distributed communication with MLX.

``mlx.distributed_config`` aims to automate the process of configuring the
network interfaces (especially for communication over Thunderbolt) and of
creating the hostfile to be used with ``mlx.launch``.

We will analyze three cases of using ``mlx.distributed_config``:

1. RDMA over Thunderbolt using JACCL
2. TCP/IP over Thunderbolt using the ring backend
3. TCP/IP over Ethernet using the ring backend
JACCL
^^^^^

After following :ref:`the steps to enable RDMA <jaccl_section>`, you can run
the following command to configure the nodes and create the hostfile.

.. code-block::

   mlx.distributed_config --verbose --backend jaccl \
       --hosts m3-ultra-1,m3-ultra-2,m3-ultra-3,m3-ultra-4 --over thunderbolt \
       --auto-setup --output m3-ultra-jaccl.json

Let's walk through the steps that the script takes to configure the nodes.

1. Ssh into all nodes to verify that they are reachable.
2. Extract the Thunderbolt connectivity, namely run commands on each node to
   determine which node is connected to which other node.
3. Verify that we have a valid fully connected mesh.
4. Check that RDMA is enabled.
5. Extract the Ethernet IP from interface ``en0``.
6. Disable the Thunderbolt bridge and set up peer-to-peer networks for each
   Thunderbolt cable.
7. Write the hostfile.

Knowing the above steps allows you to configure the nodes manually and also to
debug any configuration issue. For instance, changing the Ethernet IP to a
different interface directly in the config is possible (as long as it is
reachable from all nodes).

The ``--auto-setup`` argument requires password-less sudo on each node. If it
isn't available, the configuration script will print the commands to be run on
each node.
Ring over Thunderbolt
^^^^^^^^^^^^^^^^^^^^^

Setting up a ring backend over Thunderbolt only requires changing the
``--backend`` from ``jaccl`` to ``ring``.

The steps are very similar, with the main difference being that instead of
verifying that the nodes are fully connected, the script attempts to identify
a ring topology (or multiple rings).

Ring over Ethernet
^^^^^^^^^^^^^^^^^^

Configuring the ring backend over Ethernet doesn't require setting up network
interfaces, so the script simply extracts the ``en0`` IP from each node and
writes the hostfile.

Debugging cable connections
^^^^^^^^^^^^^^^^^^^^^^^^^^^

``mlx.distributed_config`` can help you debug the connectivity of your nodes
over Thunderbolt by exporting a graph of the connections.

Running

.. code-block::

   mlx.distributed_config --verbose \
       --hosts host1,host2,host3,host4 \
       --over thunderbolt --dot

will export a `GraphViz <https://graphviz.org>`_ representation of the
connections between the nodes, which makes it very easy to figure out which
cable is not connected correctly.

See :ref:`the JACCL section <jaccl_section>` for an example.
|
||||||
|
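One way to act on the exported graph is to check it for missing edges programmatically. The sketch below is illustrative only: it assumes a simple DOT output with `"a" -- "b";` edge lines, which may not match the exact format that ``mlx.distributed_config`` emits.

```python
# Illustrative only: assumes simple `"a" -- "b";` edge lines in the DOT text.
import re


def missing_links(dot_text, hosts):
    """Return host pairs with no Thunderbolt edge between them."""
    edges = set()
    for a, b in re.findall(r'"([^"]+)"\s*--\s*"([^"]+)"', dot_text):
        edges.add(frozenset((a, b)))
    return [
        (a, b)
        for i, a in enumerate(hosts)
        for b in hosts[i + 1 :]
        if frozenset((a, b)) not in edges
    ]


dot = 'graph tb { "host1" -- "host2"; "host2" -- "host3"; }'
print(missing_links(dot, ["host1", "host2", "host3"]))
# A fully connected mesh would print []; here the host1-host3 link is missing.
```

Each pair reported corresponds to a cable that is either missing or plugged into the wrong port.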
``mlx.launch``
--------------

The minimal usage example of ``mlx.launch`` is simply
@@ -33,6 +126,10 @@ the rest if one of them fails unexpectedly or if ``mlx.launch`` is terminated.
 It also takes care of forwarding the output of each remote process to stdout
 and stderr respectively.
 
+Importantly, it also broadcasts stdin to each process, which enables
+interactive programs to work in distributed mode as well as debugging with
+the interactive debugger.
+
 Providing Hosts
 ^^^^^^^^^^^^^^^^
@@ -63,10 +160,62 @@ host and on the same path. A good checklist to debug errors is the following:
   ``mlx.launch --print-python`` to see what that path is.
 * the script you want to run is available on all hosts at the same path
+
+If you are launching from a node with a completely different setup than the
+nodes that the program will run on, you can specify ``--no-verify-script`` so
+that ``mlx.launch`` does not attempt to verify that the executable and script
+exist locally before launching the distributed job.
+
+.. _ring_specifics:
+
+Ring Specifics
+^^^^^^^^^^^^^^
+
+The :ref:`ring <ring_section>` backend, which is also the default backend,
+can be explicitly selected with the argument ``--backend ring``. The ring
+backend has some specific requirements and arguments that differ from the
+other backends:
+
+* The argument ``--hosts`` only accepts IPs and not hostnames. If we need to
+  ssh to a hostname that does not correspond to the IP we want to bind to, we
+  have to provide a hostfile.
+* ``--starting-port`` defines the port to bind to on the remote hosts.
+  Specifically, rank 0 on the first IP will use this port and each subsequent
+  IP or rank will add 1 to this port.
+* ``--connections-per-ip`` allows us to increase the number of connections
+  between neighboring nodes. This corresponds to ``--mca btl_tcp_links 2``
+  for ``mpirun``.
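The ``--starting-port`` rule amounts to a simple offset per rank. This is a sketch of the rule as described in the bullet, not ``mlx.launch``'s actual implementation (the function name is hypothetical):

```python
def ring_ports(ips, starting_port=5000):
    # Rank 0 on the first IP binds to starting_port; every subsequent
    # IP/rank binds to the next port up.
    return {rank: starting_port + rank for rank in range(len(ips))}


print(ring_ports(["10.0.0.1", "10.0.0.2", "10.0.0.3"]))
# {0: 5000, 1: 5001, 2: 5002}
```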
+
+.. _jaccl_specifics:
+
+JACCL Specifics
+^^^^^^^^^^^^^^^
+
+The :ref:`JACCL <jaccl_section>` backend can be selected with the argument
+``--backend jaccl``. A hostfile is necessary to launch with this backend
+because it needs to contain the RDMA devices connecting each node to every
+other node.
+
+NCCL Specifics
+^^^^^^^^^^^^^^
+
+The :ref:`NCCL <nccl_section>` backend is the default backend in CUDA
+environments. When launching from a Mac to a Linux machine with CUDA, the
+backend should be selected explicitly with ``--backend nccl``.
+
+The ``--repeat-hosts`` (``-n``) argument should be used to launch multi-node,
+multi-GPU jobs. For instance,
+
+.. code-block::
+
+    mlx.launch --backend nccl --hosts linux-1,linux-2 -n 8 --no-verify-script -- ./my-job.sh
+
+will attempt to launch 16 processes, 8 on each node, all running
+``my-job.sh``.
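The arithmetic of ``--repeat-hosts`` is simply hosts × repeats. A sketch of the expansion (illustrative, not the launcher's actual code):

```python
hosts = ["linux-1", "linux-2"]
n = 8  # --repeat-hosts / -n

# One process per (host, repeat) pair: 2 hosts x 8 repeats = 16 ranks total.
world = [host for host in hosts for _ in range(n)]
print(len(world))  # 16
```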
 
 .. _mpi_specifics:
 
 MPI Specifics
--------------
+^^^^^^^^^^^^^
 
 One can use MPI by passing ``--backend mpi`` to ``mlx.launch``. In that case,
 ``mlx.launch`` is a thin wrapper over ``mpirun``. Moreover,
@@ -83,23 +232,3 @@ to choose a specific interface for the byte-transfer-layer of MPI we can call
 .. code:: shell
 
     mlx.launch --backend mpi --mpi-arg '--mca btl_tcp_if_include en0' --hostfile hosts.json my_script.py
-
-
-.. _ring_specifics:
-
-Ring Specifics
---------------
-
-The ring backend, which is also the default backend, can be explicitly selected
-with the argument ``--backend ring``. The ring backend has some specific
-requirements and arguments that are different to MPI:
-
-* The argument ``--hosts`` only accepts IPs and not hostnames. If we need to
-  ssh to a hostname that does not correspond to the IP we want to bind to we
-  have to provide a hostfile.
-* ``--starting-port`` defines the port to bind to on the remote hosts.
-  Specifically rank 0 for the first IP will use this port and each subsequent
-  IP or rank will add 1 to this port.
-* ``--connections-per-ip`` allows us to increase the number of connections
-  between neighboring nodes. This corresponds to ``--mca btl_tcp_links 2`` for
-  ``mpirun``.
@@ -106,6 +106,8 @@ class IPConfigurator:
         for src_port, p in enumerate(h.ports):
             if not p.connected_to:
                 continue
+            if p.connected_to not in uuid_reverse_index:
+                continue
             if (src_node, src_port) in assigned:
                 continue
@@ -241,7 +243,7 @@ def make_connectivity_matrix(tb_hosts, uuid_reverse_index):
     for i, h in enumerate(tb_hosts):
         c = [0] * len(tb_hosts)
         for p in h.ports:
-            if p.connected_to is not None:
+            if p.connected_to in uuid_reverse_index:
                 j, _ = uuid_reverse_index[p.connected_to]
                 c[j] += 1
         connectivity.append(c)
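The hunk above changes the guard from ``is not None`` to a membership test. A standalone sketch shows why: a port can be cabled to a machine outside the configured host set, whose UUID is then absent from ``uuid_reverse_index``, and the old guard would let the lookup raise a ``KeyError``. Class and field names here are simplified stand-ins for the real ones.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Port:
    connected_to: Optional[str] = None  # UUID of the peer port, if cabled


@dataclass
class Host:
    ports: list = field(default_factory=list)


def make_connectivity_matrix(tb_hosts, uuid_reverse_index):
    connectivity = []
    for i, h in enumerate(tb_hosts):
        c = [0] * len(tb_hosts)
        for p in h.ports:
            # The fixed guard: skip peers whose UUID we don't know about.
            if p.connected_to in uuid_reverse_index:
                j, _ = uuid_reverse_index[p.connected_to]
                c[j] += 1
        connectivity.append(c)
    return connectivity


hosts = [Host([Port("uuid-b"), Port("uuid-external")]), Host([Port("uuid-a")])]
index = {"uuid-a": (0, 0), "uuid-b": (1, 0)}
print(make_connectivity_matrix(hosts, index))
# The external UUID is ignored instead of crashing: [[0, 1], [1, 0]]
```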
@@ -49,7 +49,7 @@ class RemoteProcess(CommandProcess):
         is_local = host == "127.0.0.1"
         cmd = RemoteProcess.make_launch_script(rank, cwd, files, env, command)
         if not is_local:
-            cmd = f"ssh {host} {shlex.quote(cmd)}"
+            cmd = f"ssh -tt -o LogLevel=QUIET {host} {shlex.quote(cmd)}"
 
         self._host = host
         self._pidfile = None
@@ -57,6 +57,7 @@ class RemoteProcess(CommandProcess):
         self._process = Popen(
             cmd,
             shell=True,
+            executable="/bin/bash",
             stdin=PIPE,
             stdout=PIPE,
             stderr=PIPE,
@@ -91,7 +92,14 @@ class RemoteProcess(CommandProcess):
         cmd = RemoteProcess.make_kill_script(self._pidfile)
         if not self._is_local:
             cmd = f"ssh {self._host} {shlex.quote(cmd)}"
-        c = run(cmd, check=True, shell=True, capture_output=True, text=True)
+        c = run(
+            cmd,
+            check=True,
+            shell=True,
+            executable="/bin/bash",
+            capture_output=True,
+            text=True,
+        )
 
         self._killed = c.stdout.strip() == "1"
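The ``executable="/bin/bash"`` argument added in these hunks pins the shell that ``shell=True`` uses; by default Python runs ``/bin/sh``, which on some distributions is not bash and lacks features the generated scripts rely on. A small check (assumes ``/bin/bash`` exists on the machine):

```python
from subprocess import run

# ${BASH_VERSION:+bash} expands to "bash" only when running under bash.
c = run('echo "${BASH_VERSION:+bash}"', shell=True,
        executable="/bin/bash", capture_output=True, text=True)
print(c.stdout.strip())  # bash
```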
||||||
@@ -99,10 +107,13 @@ class RemoteProcess(CommandProcess):
|
|||||||
def make_launch_script(rank, cwd, files, env, command):
|
def make_launch_script(rank, cwd, files, env, command):
|
||||||
script = ""
|
script = ""
|
||||||
|
|
||||||
|
# Disable echo
|
||||||
|
script = "stty -echo; "
|
||||||
|
|
||||||
# Write the PID to a file so we can kill the process if needed
|
# Write the PID to a file so we can kill the process if needed
|
||||||
script += "pidfile=$(mktemp); "
|
script += "pidfile=$(mktemp); "
|
||||||
script += "echo $$ > $pidfile; "
|
script += "echo $$ > $pidfile; "
|
||||||
script += "echo $pidfile; "
|
script += 'printf "%s\\n" $pidfile; '
|
||||||
|
|
||||||
# Change the working directory if one was requested. Otherwise attempt to
|
# Change the working directory if one was requested. Otherwise attempt to
|
||||||
# change to the current one but don't fail if it wasn't possible.
|
# change to the current one but don't fail if it wasn't possible.
|
||||||
|
|||||||
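The launch script built above implements a small pidfile handshake: the remote shell writes its PID to a temp file and prints the file's path as its first line of output, so the launcher can later locate and kill it. A local sketch of the same handshake (simplified; the real kill script also reports whether the process was killed):

```python
import os
import signal
import subprocess

# Same preamble the launch script builds: store the PID, print the pidfile path.
script = 'pidfile=$(mktemp); echo $$ > $pidfile; printf "%s\\n" $pidfile; sleep 5'
p = subprocess.Popen(script, shell=True, executable="/bin/bash",
                     stdout=subprocess.PIPE, text=True)
pidfile = p.stdout.readline().strip()  # first output line: the pidfile path
with open(pidfile) as f:
    pid = int(f.read())
os.kill(pid, signal.SIGTERM)           # kill the shell via the stored PID
p.wait()
os.remove(pidfile)                     # tidy up the temp file
```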
@@ -67,7 +67,7 @@ void init_distributed(nb::module_& parent_module) {
 
       Args:
         backend (str, optional): The name of the backend to check for availability.
-          It takes the same values as ``init()``. Default: ``any``.
+          It takes the same values as :func:`init()`. Default: ``"any"``.
 
       Returns:
         bool: Whether the distributed backend is available.