.. Mirror of https://github.com/ml-explore/mlx.git (synced 2025-12-16).
   Commit: Update the mlx.launch and mlx.distributed_config docs.
Launching Distributed Programs
==============================

.. currentmodule:: mlx.core.distributed

Installing the MLX python package provides two utilities to help you configure
your Macs for distributed computation and also launch distributed programs on
multiple nodes or with many processes on a single node. These utilities are
aptly named

- ``mlx.launch``
- ``mlx.distributed_config``

See the :doc:`distributed docs <distributed>` for an introduction and
getting-started guides to the various backends.

``mlx.distributed_config``
--------------------------

Unless you are launching distributed jobs locally for development or in
multi-GPU CUDA environments, you have several Macs that you need to configure
for distributed communication with MLX.

``mlx.distributed_config`` aims to automate the process of configuring the
network interfaces (especially for communication over thunderbolt) and of
creating the hostfile to be used with ``mlx.launch``.

We will analyse three cases of using ``mlx.distributed_config``:

1. RDMA over thunderbolt using JACCL
2. TCP/IP over thunderbolt using the ring backend
3. TCP/IP over ethernet using the ring backend

JACCL
^^^^^

After following :ref:`the steps to enable RDMA <jaccl_section>` you can run the
following command to configure the nodes and create the hostfile:

.. code-block::

    mlx.distributed_config --verbose --backend jaccl \
        --hosts m3-ultra-1,m3-ultra-2,m3-ultra-3,m3-ultra-4 --over thunderbolt \
        --auto-setup --output m3-ultra-jaccl.json

Let's walk through the steps that the script takes to configure the nodes.

1. SSH to all nodes to verify that they are reachable.
2. Extract the thunderbolt connectivity, namely run commands on each node to
   calculate which node is connected to which other node.
3. Verify that we have a valid fully connected mesh.
4. Check that RDMA is enabled.
5. Extract the ethernet IP from interface ``en0``.
6. Disable the thunderbolt bridge and set up peer-to-peer networks for each
   thunderbolt cable.
7. Write the hostfile.

Knowing the above steps allows you to configure the nodes manually but also to
debug any configuration issue. For instance, changing the Ethernet IP to a
different interface directly in the config is possible (as long as it is
reachable from all nodes).
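The mesh check in step 3 can be sketched in a few lines of Python. This is a
hedged illustration only: the ``connectivity`` mapping, the function name, and
the host names are assumptions for the example, not the script's actual
internals.

```python
# Illustrative sketch of step 3: verifying a fully connected mesh.
# ``connectivity`` maps each node to the set of peers it sees over
# thunderbolt (an assumed structure, not mlx.distributed_config's).

def is_fully_connected_mesh(connectivity):
    """Return True if every node reports a link to every other node."""
    nodes = set(connectivity)
    return all(set(peers) >= nodes - {node}
               for node, peers in connectivity.items())

mesh = {
    "m3-ultra-1": {"m3-ultra-2", "m3-ultra-3", "m3-ultra-4"},
    "m3-ultra-2": {"m3-ultra-1", "m3-ultra-3", "m3-ultra-4"},
    "m3-ultra-3": {"m3-ultra-1", "m3-ultra-2", "m3-ultra-4"},
    "m3-ultra-4": {"m3-ultra-1", "m3-ultra-2", "m3-ultra-3"},
}
assert is_fully_connected_mesh(mesh)

# Unplugging one cable breaks the mesh check.
broken = {node: set(peers) for node, peers in mesh.items()}
broken["m3-ultra-1"].discard("m3-ultra-4")
broken["m3-ultra-4"].discard("m3-ultra-1")
assert not is_fully_connected_mesh(broken)
```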

The ``--auto-setup`` argument requires password-less sudo on each node. If it
isn't available, the configuration script will print the commands to be run on
each node.

Ring over thunderbolt
^^^^^^^^^^^^^^^^^^^^^

Setting up a ring backend over thunderbolt only requires changing the
``--backend`` from ``jaccl`` to ``ring``.

The steps are very similar, with the main difference being that instead of
verifying that the nodes are fully connected, the script attempts to identify a
ring topology (or multiple rings).
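Identifying a single ring from the extracted connectivity can be sketched as
below. This is a hypothetical illustration of the idea; the real algorithm,
which also handles multiple rings, is not shown here.

```python
# Hypothetical sketch: order nodes into a single ring, or return None.

def find_ring(connectivity):
    """Walk neighbor-to-neighbor; a ring visits every node before closing."""
    if any(len(peers) != 2 for peers in connectivity.values()):
        return None
    start = next(iter(connectivity))
    order, prev, current = [start], None, start
    while True:
        a, b = connectivity[current]   # the two thunderbolt neighbors
        nxt = b if a == prev else a    # keep walking away from where we came
        if nxt == start:
            break
        order.append(nxt)
        prev, current = current, nxt
    # A single ring must visit every node before closing the loop.
    return order if len(order) == len(connectivity) else None

ring = {
    "host1": {"host2", "host4"},
    "host2": {"host1", "host3"},
    "host3": {"host2", "host4"},
    "host4": {"host3", "host1"},
}
order = find_ring(ring)
assert order is not None and len(order) == 4

# A node with only one cable cannot be part of a ring.
assert find_ring({"host1": {"host2"}, "host2": {"host1"}}) is None
```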

Ring over Ethernet
^^^^^^^^^^^^^^^^^^

Configuring the ring backend over ethernet doesn't require setting up network
interfaces, so the script simply extracts the ``en0`` IP from each node and
writes the hostfile.

Debugging cable connections
^^^^^^^^^^^^^^^^^^^^^^^^^^^

``mlx.distributed_config`` can help you debug the connectivity of your nodes
over thunderbolt by exporting a graph of the connections. Running

.. code-block::

    mlx.distributed_config --verbose \
        --hosts host1,host2,host3,host4 \
        --over thunderbolt --dot

will export a `GraphViz <https://graphviz.org>`_ representation of the
connections between the nodes, which makes it easy to figure out which cable is
not connected correctly.

See :ref:`the JACCL section <jaccl_section>` for an example.
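Turning discovered connectivity into GraphViz DOT is straightforward; the
following is a rough sketch of the idea, and the exact output format emitted by
``--dot`` may differ.

```python
# Rough sketch of a DOT export: one undirected edge per thunderbolt cable.

def connections_to_dot(connectivity):
    """Emit an undirected DOT graph from a node -> peers mapping."""
    lines = ["graph thunderbolt {"]
    seen = set()
    for node, peers in sorted(connectivity.items()):
        for peer in sorted(peers):
            edge = tuple(sorted((node, peer)))
            if edge not in seen:   # emit each cable once, not once per endpoint
                seen.add(edge)
                lines.append('  "{}" -- "{}";'.format(*edge))
    lines.append("}")
    return "\n".join(lines)

dot = connections_to_dot({
    "host1": {"host2", "host3"},
    "host2": {"host1"},
    "host3": {"host1"},
})
assert '"host1" -- "host2";' in dot
assert dot.count("--") == 2   # two cables, two edges
```

A missing cable shows up as a missing edge when the graph is rendered with
``dot -Tpng``.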
``mlx.launch``
--------------

The minimal usage example of ``mlx.launch`` is simply

...

the rest if one of them fails unexpectedly or if ``mlx.launch`` is terminated.
It also takes care of forwarding the output of each remote process to stdout
and stderr respectively.

Importantly, it also broadcasts stdin to each process, which enables
interactive programs to work in distributed mode as well as debugging with the
interactive debugger.

Providing Hosts
^^^^^^^^^^^^^^^

...host and on the same path. A good checklist to debug errors is the following:

* ... ``mlx.launch --print-python`` to see what that path is.
* the script you want to run is available on all hosts at the same path

If you are launching from a node with a completely different setup than the
nodes that the program will run on, you can specify ``--no-verify-script`` so
that ``mlx.launch`` does not attempt to verify that the executable and script
exist locally before launching the distributed job.

.. _ring_specifics:

Ring Specifics
^^^^^^^^^^^^^^

The :ref:`ring <ring_section>` backend, which is also the default backend, can
be explicitly selected with the argument ``--backend ring``. The ring backend
has some specific requirements and arguments that are different from other
backends:

* The argument ``--hosts`` only accepts IPs and not hostnames. If we need to
  ssh to a hostname that does not correspond to the IP we want to bind to, we
  have to provide a hostfile.
* ``--starting-port`` defines the port to bind to on the remote hosts.
  Specifically, rank 0 on the first IP will use this port and each subsequent
  IP or rank will add 1 to this port.
* ``--connections-per-ip`` allows us to increase the number of connections
  between neighboring nodes. This corresponds to ``--mca btl_tcp_links 2`` for
  ``mpirun``.
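The ``--starting-port`` rule above is simple enough to state as a one-line
sketch (the helper name here is hypothetical, for illustration only):

```python
# Sketch of the ``--starting-port`` rule: rank 0 binds to the starting
# port and each subsequent IP or rank adds 1 to it.

def ring_ports(ips, starting_port):
    return {rank: starting_port + rank for rank in range(len(ips))}

ports = ring_ports(["10.0.0.1", "10.0.0.2", "10.0.0.3"], 5000)
assert ports == {0: 5000, 1: 5001, 2: 5002}
```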

.. _jaccl_specifics:

JACCL Specifics
^^^^^^^^^^^^^^^

The :ref:`JACCL <jaccl_section>` backend can be selected with the argument
``--backend jaccl``. A hostfile is necessary to launch with this backend
because it needs to contain the RDMA devices connecting each node to every
other node.

NCCL Specifics
^^^^^^^^^^^^^^

The :ref:`NCCL <nccl_section>` backend is the default backend for CUDA
environments. When launching from a Mac to a Linux machine with CUDA, the
backend should be selected explicitly with ``--backend nccl``.

The ``--repeat-hosts, -n`` argument should be used to launch multi-node,
multi-GPU jobs. For instance

.. code-block::

    mlx.launch --backend nccl --hosts linux-1,linux-2 -n 8 --no-verify-script -- ./my-job.sh

will attempt to launch 16 processes, 8 on each node, all running
``my-job.sh``.
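The ``--repeat-hosts`` semantics described above, where each host receives
``n`` consecutive ranks, can be sketched as follows (a hypothetical helper,
not ``mlx.launch``'s internals):

```python
# Sketch of ``--repeat-hosts, -n``: 2 hosts with -n 8 yields 16 ranks,
# eight consecutive ranks per host.

def expand_hosts(hosts, repeat):
    return [host for host in hosts for _ in range(repeat)]

ranks = expand_hosts(["linux-1", "linux-2"], 8)
assert len(ranks) == 16
assert ranks[0] == "linux-1" and ranks[7] == "linux-1"
assert ranks[8] == "linux-2"
```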

.. _mpi_specifics:

MPI Specifics
^^^^^^^^^^^^^

One can use MPI by passing ``--backend mpi`` to ``mlx.launch``. In that case,
``mlx.launch`` is a thin wrapper over ``mpirun``. Moreover,

...to choose a specific interface for the byte-transfer-layer of MPI we can call

.. code:: shell

    mlx.launch --backend mpi --mpi-arg '--mca btl_tcp_if_include en0' --hostfile hosts.json my_script.py
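Since ``mlx.launch --backend mpi`` is described as a thin wrapper over
``mpirun``, the command assembly can be sketched roughly as below. The helper
name and the exact flag handling are assumptions for illustration, not
``mlx.launch``'s real implementation.

```python
import shlex

# Rough sketch of how a thin wrapper might assemble an mpirun invocation.
# Hypothetical helper; mlx.launch's actual argument handling is more involved.

def build_mpirun_command(script, mpi_args=(), hostfile=None):
    cmd = ["mpirun"]
    for arg in mpi_args:
        cmd.extend(shlex.split(arg))  # '--mca btl_tcp_if_include en0' -> 3 tokens
    if hostfile is not None:
        cmd.extend(["--hostfile", hostfile])
    cmd.append(script)
    return cmd

cmd = build_mpirun_command(
    "my_script.py",
    mpi_args=["--mca btl_tcp_if_include en0"],
    hostfile="hosts.json",
)
assert cmd == ["mpirun", "--mca", "btl_tcp_if_include", "en0",
               "--hostfile", "hosts.json", "my_script.py"]
```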