.. Mirror of https://github.com/ml-explore/mlx.git (synced 2025-12-16).
   Commit: Update the mlx.launch and mlx.distributed_config docs.
Launching Distributed Programs
==============================

.. currentmodule:: mlx.core.distributed

Installing the MLX python package provides two utilities to help you configure
your Macs for distributed computation and also launch distributed programs on
multiple nodes or with many processes on a single node. These utilities are
aptly named

- ``mlx.launch``
- ``mlx.distributed_config``

See the :doc:`distributed docs <distributed>` for an introduction and
getting-started guides to the various backends.

``mlx.distributed_config``
--------------------------

Unless you are launching distributed jobs locally for development or in
multi-GPU CUDA environments, you have several Macs that you need to configure
for distributed communication with MLX.

``mlx.distributed_config`` aims to automate the process of configuring the
network interfaces (especially for communication over thunderbolt) and of
creating the hostfile to be used with ``mlx.launch``.

We will analyse three cases of using ``mlx.distributed_config``:

1. RDMA over thunderbolt using JACCL
2. TCP/IP over thunderbolt using the ring backend
3. TCP/IP over ethernet using the ring backend

JACCL
^^^^^

After following :ref:`the steps to enable RDMA <jaccl_section>` you can run the
following command to configure the nodes and create the hostfile:

.. code-block::

    mlx.distributed_config --verbose --backend jaccl \
        --hosts m3-ultra-1,m3-ultra-2,m3-ultra-3,m3-ultra-4 --over thunderbolt \
        --auto-setup --output m3-ultra-jaccl.json

Let's walk through the steps that the script takes to configure the nodes.

1. SSH to all nodes to verify that they are reachable.
2. Extract the thunderbolt connectivity, namely run commands on each node to
   calculate which node is connected to which other node.
3. Verify that we have a valid fully connected mesh.
4. Check that RDMA is enabled.
5. Extract the ethernet IP from interface ``en0``.
6. Disable the thunderbolt bridge and set up peer-to-peer networks for each
   thunderbolt cable.
7. Write the hostfile.

Knowing the above steps allows you to configure the nodes manually but also to
debug any configuration issue. For instance, changing the Ethernet IP to a
different interface directly in the config is possible (as long as it is
reachable from all nodes).
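The mesh check in step 3 can be sketched in a few lines of Python. This is a
hedged illustration only: the ``connectivity`` mapping, the function name, and
the host names are assumptions for the example, not the script's actual
internals.

```python
# Illustrative sketch of step 3: verifying a fully connected mesh.
# ``connectivity`` maps each node to the set of peers it sees over
# thunderbolt (an assumed structure, not mlx.distributed_config's).

def is_fully_connected_mesh(connectivity):
    """Return True if every node reports a link to every other node."""
    nodes = set(connectivity)
    return all(set(peers) >= nodes - {node}
               for node, peers in connectivity.items())

mesh = {
    "m3-ultra-1": {"m3-ultra-2", "m3-ultra-3", "m3-ultra-4"},
    "m3-ultra-2": {"m3-ultra-1", "m3-ultra-3", "m3-ultra-4"},
    "m3-ultra-3": {"m3-ultra-1", "m3-ultra-2", "m3-ultra-4"},
    "m3-ultra-4": {"m3-ultra-1", "m3-ultra-2", "m3-ultra-3"},
}
assert is_fully_connected_mesh(mesh)

# Unplugging one cable breaks the mesh check.
broken = {node: set(peers) for node, peers in mesh.items()}
broken["m3-ultra-1"].discard("m3-ultra-4")
broken["m3-ultra-4"].discard("m3-ultra-1")
assert not is_fully_connected_mesh(broken)
```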

The ``--auto-setup`` argument requires password-less sudo on each node. If it
isn't available, the configuration script will print the commands to be run on
each node.

Ring over thunderbolt
^^^^^^^^^^^^^^^^^^^^^

Setting up a ring backend over thunderbolt only requires changing the
``--backend`` from ``jaccl`` to ``ring``.

The steps are very similar, with the main difference being that instead of
verifying that the nodes are fully connected, the script attempts to identify a
ring topology (or multiple rings).
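Identifying a single ring from the extracted connectivity can be sketched as
below. This is a hypothetical illustration of the idea; the real algorithm,
which also handles multiple rings, is not shown here.

```python
# Hypothetical sketch: order nodes into a single ring, or return None.

def find_ring(connectivity):
    """Walk neighbor-to-neighbor; a ring visits every node before closing."""
    if any(len(peers) != 2 for peers in connectivity.values()):
        return None
    start = next(iter(connectivity))
    order, prev, current = [start], None, start
    while True:
        a, b = connectivity[current]   # the two thunderbolt neighbors
        nxt = b if a == prev else a    # keep walking away from where we came
        if nxt == start:
            break
        order.append(nxt)
        prev, current = current, nxt
    # A single ring must visit every node before closing the loop.
    return order if len(order) == len(connectivity) else None

ring = {
    "host1": {"host2", "host4"},
    "host2": {"host1", "host3"},
    "host3": {"host2", "host4"},
    "host4": {"host3", "host1"},
}
order = find_ring(ring)
assert order is not None and len(order) == 4

# A node with only one cable cannot be part of a ring.
assert find_ring({"host1": {"host2"}, "host2": {"host1"}}) is None
```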

Ring over Ethernet
^^^^^^^^^^^^^^^^^^

Configuring the ring backend over ethernet doesn't require setting up network
interfaces, so the script simply extracts the ``en0`` IP from each node and
writes the hostfile.

Debugging cable connections
^^^^^^^^^^^^^^^^^^^^^^^^^^^

``mlx.distributed_config`` can help you debug the connectivity of your nodes
over thunderbolt by exporting a graph of the connections. Running

.. code-block::

    mlx.distributed_config --verbose \
        --hosts host1,host2,host3,host4 \
        --over thunderbolt --dot

will export a `GraphViz <https://graphviz.org>`_ representation of the
connections between the nodes, which makes it easy to figure out which cable is
not connected correctly.

See :ref:`the JACCL section <jaccl_section>` for an example.
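Turning discovered connectivity into GraphViz DOT is straightforward; the
following is a rough sketch of the idea, and the exact output format emitted by
``--dot`` may differ.

```python
# Rough sketch of a DOT export: one undirected edge per thunderbolt cable.

def connections_to_dot(connectivity):
    """Emit an undirected DOT graph from a node -> peers mapping."""
    lines = ["graph thunderbolt {"]
    seen = set()
    for node, peers in sorted(connectivity.items()):
        for peer in sorted(peers):
            edge = tuple(sorted((node, peer)))
            if edge not in seen:   # emit each cable once, not once per endpoint
                seen.add(edge)
                lines.append('  "{}" -- "{}";'.format(*edge))
    lines.append("}")
    return "\n".join(lines)

dot = connections_to_dot({
    "host1": {"host2", "host3"},
    "host2": {"host1"},
    "host3": {"host1"},
})
assert '"host1" -- "host2";' in dot
assert dot.count("--") == 2   # two cables, two edges
```

A missing cable shows up as a missing edge when the graph is rendered with
``dot -Tpng``.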
``mlx.launch``
--------------

The minimal usage example of ``mlx.launch`` is simply

...

the rest if one of them fails unexpectedly or if ``mlx.launch`` is terminated.
It also takes care of forwarding the output of each remote process to stdout
and stderr respectively.

Importantly, it also broadcasts stdin to each process, which enables
interactive programs to work in distributed mode as well as debugging with the
interactive debugger.

Providing Hosts
^^^^^^^^^^^^^^^

...host and on the same path. A good checklist to debug errors is the following:

* ... ``mlx.launch --print-python`` to see what that path is.
* the script you want to run is available on all hosts at the same path

If you are launching from a node with a completely different setup than the
nodes that the program will run on, you can specify ``--no-verify-script`` so
that ``mlx.launch`` does not attempt to verify that the executable and script
exist locally before launching the distributed job.

.. _ring_specifics:

Ring Specifics
^^^^^^^^^^^^^^

The :ref:`ring <ring_section>` backend, which is also the default backend, can
be explicitly selected with the argument ``--backend ring``. The ring backend
has some specific requirements and arguments that are different from other
backends:

* The argument ``--hosts`` only accepts IPs and not hostnames. If we need to
  ssh to a hostname that does not correspond to the IP we want to bind to, we
  have to provide a hostfile.
* ``--starting-port`` defines the port to bind to on the remote hosts.
  Specifically, rank 0 on the first IP will use this port and each subsequent
  IP or rank will add 1 to this port.
* ``--connections-per-ip`` allows us to increase the number of connections
  between neighboring nodes. This corresponds to ``--mca btl_tcp_links 2`` for
  ``mpirun``.
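The ``--starting-port`` rule above is simple enough to state as a one-line
sketch (the helper name here is hypothetical, for illustration only):

```python
# Sketch of the ``--starting-port`` rule: rank 0 binds to the starting
# port and each subsequent IP or rank adds 1 to it.

def ring_ports(ips, starting_port):
    return {rank: starting_port + rank for rank in range(len(ips))}

ports = ring_ports(["10.0.0.1", "10.0.0.2", "10.0.0.3"], 5000)
assert ports == {0: 5000, 1: 5001, 2: 5002}
```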

.. _jaccl_specifics:

JACCL Specifics
^^^^^^^^^^^^^^^

The :ref:`JACCL <jaccl_section>` backend can be selected with the argument
``--backend jaccl``. A hostfile is necessary to launch with this backend
because it needs to contain the RDMA devices connecting each node to every
other node.

NCCL Specifics
^^^^^^^^^^^^^^

The :ref:`NCCL <nccl_section>` backend is the default backend for CUDA
environments. When launching from a Mac to a Linux machine with CUDA, the
backend should be selected explicitly with ``--backend nccl``.

The ``--repeat-hosts, -n`` argument should be used to launch multi-node,
multi-GPU jobs. For instance

.. code-block::

    mlx.launch --backend nccl --hosts linux-1,linux-2 -n 8 --no-verify-script -- ./my-job.sh

will attempt to launch 16 processes, 8 on each node, all running
``my-job.sh``.
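The ``--repeat-hosts`` semantics described above, where each host receives
``n`` consecutive ranks, can be sketched as follows (a hypothetical helper,
not ``mlx.launch``'s internals):

```python
# Sketch of ``--repeat-hosts, -n``: 2 hosts with -n 8 yields 16 ranks,
# eight consecutive ranks per host.

def expand_hosts(hosts, repeat):
    return [host for host in hosts for _ in range(repeat)]

ranks = expand_hosts(["linux-1", "linux-2"], 8)
assert len(ranks) == 16
assert ranks[0] == "linux-1" and ranks[7] == "linux-1"
assert ranks[8] == "linux-2"
```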

.. _mpi_specifics:

MPI Specifics
^^^^^^^^^^^^^

One can use MPI by passing ``--backend mpi`` to ``mlx.launch``. In that case,
``mlx.launch`` is a thin wrapper over ``mpirun``. Moreover,

...to choose a specific interface for the byte-transfer-layer of MPI we can call

.. code:: shell

    mlx.launch --backend mpi --mpi-arg '--mca btl_tcp_if_include en0' --hostfile hosts.json my_script.py
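Since ``mlx.launch --backend mpi`` is described as a thin wrapper over
``mpirun``, the command assembly can be sketched roughly as below. The helper
name and the exact flag handling are assumptions for illustration, not
``mlx.launch``'s real implementation.

```python
import shlex

# Rough sketch of how a thin wrapper might assemble an mpirun invocation.
# Hypothetical helper; mlx.launch's actual argument handling is more involved.

def build_mpirun_command(script, mpi_args=(), hostfile=None):
    cmd = ["mpirun"]
    for arg in mpi_args:
        cmd.extend(shlex.split(arg))  # '--mca btl_tcp_if_include en0' -> 3 tokens
    if hostfile is not None:
        cmd.extend(["--hostfile", hostfile])
    cmd.append(script)
    return cmd

cmd = build_mpirun_command(
    "my_script.py",
    mpi_args=["--mca btl_tcp_if_include en0"],
    hostfile="hosts.json",
)
assert cmd == ["mpirun", "--mca", "btl_tcp_if_include", "en0",
               "--hostfile", "hosts.json", "my_script.py"]
```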