Mirror of https://github.com/ml-explore/mlx.git, synced 2025-12-16 01:49:05 +08:00

Compare commits (9 commits)
ebda161a86 (ibv-backen)

Commits: d2bc340df4, fabc947df4, 5523087cfb, 2f939acefa, 3b416a2e36, 753c6a4d0f, d3a754c8aa, 595a4ad206, a4dc1fac6c
@@ -119,10 +119,6 @@ if(MLX_BUILD_METAL)
     COMMAND zsh "-c" "/usr/bin/xcrun -sdk macosx --show-sdk-version"
     OUTPUT_VARIABLE MACOS_SDK_VERSION
     OUTPUT_STRIP_TRAILING_WHITESPACE COMMAND_ERROR_IS_FATAL ANY)
-  execute_process(
-    COMMAND zsh "-c" "/usr/bin/xcrun -sdk macosx --show-sdk-path"
-    OUTPUT_VARIABLE CMAKE_OSX_SYSROOT
-    OUTPUT_STRIP_TRAILING_WHITESPACE COMMAND_ERROR_IS_FATAL ANY)
   if(${MACOS_SDK_VERSION} LESS 14.0)
     message(
BIN docs/src/_static/distributed/m3-ultra-mesh-broken.png (new file, 16 KiB, binary file not shown)
BIN docs/src/_static/distributed/m3-ultra-mesh.png (new file, 22 KiB, binary file not shown)
@@ -7,22 +7,29 @@ Distributed Communication

MLX supports distributed communication operations that allow the computational cost
of training or inference to be shared across many physical machines. At the
moment we support several different communication backends, introduced below.

.. list-table::
   :widths: 20 80
   :header-rows: 1

   * - Backend
     - Description
   * - :ref:`MPI <mpi_section>`
     - A full-featured and mature distributed communications library.
   * - :ref:`RING <ring_section>`
     - Ring all-reduce and all-gather over TCP sockets. Always available and
       usually faster than MPI.
   * - :ref:`JACCL <jaccl_section>`
     - Low-latency communication with RDMA over Thunderbolt. Necessary for
       things like tensor parallelism.
   * - :ref:`NCCL <nccl_section>`
     - The backend of choice for CUDA environments.

The list of all currently supported operations and their documentation can be
seen in the :ref:`API docs <distributed>`.

.. note::

   Some operations may not be supported or not as fast as they should be.
   We are adding more and tuning the ones we have as we figure out the best
   way to do distributed computing on Macs using MLX.

Getting Started
---------------
@@ -85,7 +92,7 @@ Selecting Backend
^^^^^^^^^^^^^^^^^

You can select the backend you want to use when calling :func:`init` by passing
one of ``{'any', 'ring', 'jaccl', 'mpi', 'nccl'}``. When passing ``any``, MLX
will try all available backends. If they all fail then a singleton group is
created.

.. note::
@@ -110,6 +117,8 @@ The following examples aim to clarify the backend initialization logic in MLX:

   world_ring = mx.distributed.init(backend="ring")
   world_any = mx.distributed.init()  # same as MPI because it was initialized first!

.. _training_example:

Training Example
----------------
@@ -192,16 +201,273 @@ almost identical to the example above:

   loss = step(model, x, y)
   mx.eval(loss, model.parameters())

.. _ring_section:

Getting Started with Ring
-------------------------

The ring backend does not depend on any third-party library so it is always
available. It uses TCP sockets, so the nodes need to be reachable over a
network. As the name suggests, the nodes are connected in a ring: rank 1 can
only communicate with ranks 0 and 2, rank 2 only with ranks 1 and 3, and so
on. As a result, :func:`send` and :func:`recv` with arbitrary senders and
receivers are not supported by the ring backend.
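The neighbor-only pattern described above can be sketched in plain Python. This is a toy, single-process simulation of a ring all-reduce (sum), not MLX code: each simulated rank only ever receives from its left neighbor and forwards to its right.

```python
def ring_all_reduce(values):
    """Simulate summing one scalar per rank using only neighbor traffic."""
    n = len(values)
    totals = list(values)  # what each rank has accumulated so far
    carry = list(values)   # the value each rank is currently forwarding
    for _ in range(n - 1):
        # every rank receives the carried value from its left neighbor
        carry = [carry[(i - 1) % n] for i in range(n)]
        totals = [t + c for t, c in zip(totals, carry)]
    return totals

print(ring_all_reduce([1.0, 2.0, 3.0, 4.0]))  # every rank ends with 10.0
```

After ``n - 1`` forwarding steps every rank has seen every other rank's value, which is why the ring never needs arbitrary point-to-point sends.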
Defining a Ring
^^^^^^^^^^^^^^^

The easiest way to define and use a ring is via a JSON hostfile and the
``mlx.launch`` :doc:`helper script <launching_distributed>`. For each node one
defines a hostname to ssh into to run commands on this node and one or more
IPs that this node will listen to for connections.

For example, the hostfile below defines a 4-node ring. ``hostname1`` will be
rank 0, ``hostname2`` rank 1 and so on.

.. code:: json

   [
       {"ssh": "hostname1", "ips": ["123.123.123.1"]},
       {"ssh": "hostname2", "ips": ["123.123.123.2"]},
       {"ssh": "hostname3", "ips": ["123.123.123.3"]},
       {"ssh": "hostname4", "ips": ["123.123.123.4"]}
   ]

Running ``mlx.launch --hostfile ring-4.json my_script.py`` will ssh into each
node and run the script, which will listen for connections on each of the
provided IPs. Specifically, ``hostname1`` will connect to ``123.123.123.2``
and accept a connection from ``123.123.123.4``, and so on around the ring.
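The connection pattern implied by such a hostfile can be computed mechanically. This short Python sketch (our own helper, not part of MLX) prints which IP each rank dials and which neighbor it accepts from:

```python
import json

hostfile = json.loads("""
[
    {"ssh": "hostname1", "ips": ["123.123.123.1"]},
    {"ssh": "hostname2", "ips": ["123.123.123.2"]},
    {"ssh": "hostname3", "ips": ["123.123.123.3"]},
    {"ssh": "hostname4", "ips": ["123.123.123.4"]}
]
""")

n = len(hostfile)
plan = {}
for rank, node in enumerate(hostfile):
    # each rank connects to its right neighbor and accepts from its left one
    right = hostfile[(rank + 1) % n]["ips"][0]
    left = hostfile[(rank - 1) % n]["ips"][0]
    plan[node["ssh"]] = {"connects_to": right, "accepts_from": left}
    print(f"rank {rank} ({node['ssh']}): connect -> {right}, accept <- {left}")
```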
Thunderbolt Ring
^^^^^^^^^^^^^^^^

Although the ring backend can have benefits over MPI even over Ethernet, its
main purpose is to use Thunderbolt rings for higher-bandwidth communication.
Setting up such Thunderbolt rings can be done manually, but it is a relatively
tedious process. To simplify it, we provide the utility
``mlx.distributed_config``.

To use ``mlx.distributed_config`` your computers need to be accessible by ssh
via Ethernet or Wi-Fi. Subsequently, connect them with Thunderbolt cables and
then call the utility as follows:

.. code:: shell

   mlx.distributed_config --verbose --hosts host1,host2,host3,host4 --backend ring

By default the script will attempt to discover the Thunderbolt ring and provide
you with the commands to configure each node as well as the ``hostfile.json``
to use with ``mlx.launch``. If password-less ``sudo`` is available on the nodes
then ``--auto-setup`` can be used to configure them automatically.

If you want to go through the process manually, the steps are as follows:

* Disable the Thunderbolt bridge interface.
* For the cable connecting rank ``i`` to rank ``i + 1``, find the interfaces
  corresponding to that cable on nodes ``i`` and ``i + 1``.
* Set up a unique subnetwork connecting the two nodes on the corresponding
  interfaces. For instance, if the cable corresponds to ``en2`` on node ``i``
  and also to ``en2`` on node ``i + 1``, then we may assign IPs
  ``192.168.0.1`` and ``192.168.0.2`` respectively to the two nodes. For more
  details you can see the commands prepared by the utility script.
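The manual steps above amount to assigning one small private subnet per cable. As a hedged illustration (the ``192.168.x.y`` scheme extends the example from the text, and ``plan_ring_subnets`` is a hypothetical helper, not an MLX function):

```python
def plan_ring_subnets(n):
    """Assign a distinct /24 to the cable between rank i and rank (i+1) % n."""
    plan = []
    for i in range(n):
        j = (i + 1) % n
        # nodes i and j get the .1 and .2 addresses of subnet 192.168.i.0/24
        plan.append((i, j, f"192.168.{i}.1", f"192.168.{i}.2"))
    return plan

for i, j, ip_i, ip_j in plan_ring_subnets(4):
    print(f"cable {i} <-> {j}: node {i} gets {ip_i}, node {j} gets {ip_j}")
```

Using a distinct subnet per cable is what keeps traffic for each link on the right interface once the Thunderbolt bridge is disabled.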
.. _jaccl_section:

Getting Started with RDMA over Thunderbolt
------------------------------------------

Starting from version 26.2, RDMA over Thunderbolt is available in macOS and
enables low-latency communication between Macs with Thunderbolt 5. MLX provides
the JACCL backend that uses this functionality to achieve communication latency
an order of magnitude lower than the ring backend.

.. note::

   The name JACCL (pronounced Jackal) stands for *Jack and Angelos' Collective
   Communication Library*. It is an obvious pun on NVIDIA's NCCL but also a
   tribute to *Jack Beasley*, who led the development of RDMA over Thunderbolt
   at Apple.

Enabling RDMA
^^^^^^^^^^^^^

Until the feature matures, enabling RDMA over Thunderbolt is slightly more
involved and **cannot** be done remotely, even with sudo. In fact, it has to
be done in macOS recovery:

1. `Start your computer in recovery <https://support.apple.com/en-us/102518>`_.
2. Open the Terminal by going to Utilities -> Terminal.
3. Run ``rdma_ctl enable``.
4. Reboot.

To verify that you have successfully enabled Thunderbolt RDMA you can run
``ibv_devices``, which should produce something like the following for an M3
Ultra.

.. code-block:: bash

   ~ % ibv_devices
       device                 node GUID
       ------              ----------------
       rdma_en2            8096a9d9edbaac05
       rdma_en3            8196a9d9edbaac05
       rdma_en5            8396a9d9edbaac05
       rdma_en4            8296a9d9edbaac05
       rdma_en6            8496a9d9edbaac05
       rdma_en7            8596a9d9edbaac05
Defining a Mesh
^^^^^^^^^^^^^^^

The JACCL backend supports only fully connected topologies. Namely, there needs
to be a Thunderbolt cable connecting every pair of Macs directly. For example,
in the following topology visualizations, the left one is valid because there
is a connection from any node to any other node, while in the one on the right
M3 Ultra 1 is not connected to M3 Ultra 2.

.. raw:: html

   <div style="display: flex; text-align: center; align-items: end; font-size: 80%;">
     <div>
       <img src="/_static/distributed/m3-ultra-mesh.png" alt="M3 Ultra thunderbolt mesh" style="width: 55%">
       <p>Fully connected mesh of four M3 Ultra.</p>
     </div>
     <div>
       <img src="/_static/distributed/m3-ultra-mesh-broken.png" alt="M3 Ultra broken thunderbolt mesh" style="width: 55%">
       <p>Not a valid mesh (M3 Ultra 1 is not connected to M3 Ultra 2).</p>
     </div>
   </div>

Similar to the ring backend, the easiest way to use JACCL with MLX is to write
a JSON hostfile that will be used by ``mlx.launch``. The hostfile needs to
contain

- Hostnames to use for launching scripts via ssh
- An IP for rank 0 that is reachable by all nodes
- A list of RDMA devices that connect each node to each other node

The following JSON defines the valid 4-node mesh from the image above.

.. code-block:: json

   [
       {
           "ssh": "m3-ultra-1",
           "ips": ["123.123.123.1"],
           "rdma": [null, "rdma_en5", "rdma_en4", "rdma_en3"]
       },
       {
           "ssh": "m3-ultra-2",
           "ips": [],
           "rdma": ["rdma_en5", null, "rdma_en3", "rdma_en4"]
       },
       {
           "ssh": "m3-ultra-3",
           "ips": [],
           "rdma": ["rdma_en4", "rdma_en3", null, "rdma_en5"]
       },
       {
           "ssh": "m3-ultra-4",
           "ips": [],
           "rdma": ["rdma_en3", "rdma_en4", "rdma_en5", null]
       }
   ]

Even though TCP/IP is not used when communicating with Thunderbolt RDMA,
disabling the Thunderbolt bridge is still required, as is setting up isolated
local networks for each Thunderbolt connection.
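The fully-connected requirement can be checked mechanically from the ``rdma`` matrices in such a hostfile. The following is a sketch of one such check (our own helper, not part of MLX): the diagonal must be null and every off-diagonal entry must name a device.

```python
def is_full_mesh(rdma):
    """True if every pair of distinct nodes shares an RDMA device entry."""
    n = len(rdma)
    for i, row in enumerate(rdma):
        if len(row) != n or row[i] is not None:
            return False
        if any(row[j] is None for j in range(n) if j != i):
            return False
    return True

valid = [
    [None, "rdma_en5", "rdma_en4", "rdma_en3"],
    ["rdma_en5", None, "rdma_en3", "rdma_en4"],
    ["rdma_en4", "rdma_en3", None, "rdma_en5"],
    ["rdma_en3", "rdma_en4", "rdma_en5", None],
]
broken = [row[:] for row in valid]
broken[0][1] = broken[1][0] = None  # unplug the cable between nodes 0 and 1

print(is_full_mesh(valid), is_full_mesh(broken))  # True False
```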
All of the above can be done instead via ``mlx.distributed_config``. This
helper script will

- ssh into each node
- extract the Thunderbolt connectivity
- check for a valid mesh
- provide the commands to configure each node (or run them if sudo is available)
- generate the hostfile to be used with ``mlx.launch``
Putting it All Together
^^^^^^^^^^^^^^^^^^^^^^^

Launching a distributed MLX script that uses JACCL is fairly simple if the
nodes are reachable via ssh and have password-less sudo.

First, connect all the Thunderbolt cables. Then we can verify the connections
by using the ``mlx.distributed_config`` script to visualize them.

.. code-block::

   mlx.distributed_config --verbose \
       --hosts m3-ultra-1,m3-ultra-2,m3-ultra-3,m3-ultra-4 \
       --over thunderbolt --dot | dot -Tpng | open -f -a Preview

After making sure that everything looks right, we can auto-configure the nodes
and save the hostfile to ``m3-ultra-jaccl.json`` by running:

.. code-block::

   mlx.distributed_config --verbose \
       --hosts m3-ultra-1,m3-ultra-2,m3-ultra-3,m3-ultra-4 \
       --over thunderbolt --backend jaccl \
       --auto-setup --output m3-ultra-jaccl.json

And now we are ready to run a distributed MLX script, such as distributed
inference of a gigantic model using MLX-LM.

.. code-block::

   mlx.launch --verbose --backend jaccl --hostfile m3-ultra-jaccl.json \
       --env MLX_METAL_FAST_SYNCH=1 -- \  # <--- important
       /path/to/remote/python -m mlx_lm chat --model mlx-community/DeepSeek-V3.2-8bit --shard

.. note::

   Defining the environment variable ``MLX_METAL_FAST_SYNCH=1`` enables a
   different, faster way of synchronizing between the GPU and the CPU. It is
   not specific to the JACCL backend and can be used whenever the CPU and GPU
   need to collaborate on a computation. It is critical for low-latency
   communication, since the communication is done by the CPU.
.. _nccl_section:

Getting Started with NCCL
-------------------------

MLX on CUDA environments ships with the ability to talk to `NCCL
<https://developer.nvidia.com/nccl>`_, a high-performance collective
communication library that supports both multi-GPU and multi-node setups.

For CUDA environments, NCCL is the default backend for ``mlx.launch`` and all
it takes to run a distributed job is

.. code-block::

   mlx.launch -n 8 test.py

   # perfect for interactive scripts
   mlx.launch -n 8 python -m mlx_lm chat --model my-model --shard

You can also use ``mlx.launch`` to ssh to a remote node and launch a script
with the same ease

.. code-block::

   mlx.launch --hosts my-cuda-node -n 8 test.py

In many cases you may not want to use ``mlx.launch`` with the NCCL backend
because the cluster scheduler will be the one launching the processes. You can
:ref:`see which environment variables need to be defined <no_mlx_launch>` in
order for the MLX NCCL backend to be initialized correctly.
.. _mpi_section:

Getting Started with MPI
------------------------

MLX already comes with the ability to "talk" to `MPI
<https://en.wikipedia.org/wiki/Message_Passing_Interface>`_ if it is installed
on the machine. Launching distributed MLX programs that use MPI can be done
with ``mpirun`` as expected. However, in the following examples we will be
using ``mlx.launch --backend mpi`` which takes care of some nuisances such as
setting absolute paths for the ``mpirun`` executable and the ``libmpi.dyld``
shared library.

The simplest possible usage is the following which, assuming the minimal
example in the beginning of this page, should result in:
@@ -269,78 +535,116 @@ Force MPI to use the most performant network interface by setting ``--mca

btl_tcp_if_include <iface>`` where ``<iface>`` should be the interface you want
to use.

.. _no_mlx_launch:

Distributed Without ``mlx.launch``
----------------------------------

None of the distributed backends require launching with ``mlx.launch``. The
script simply connects to each host, starts a process per rank and sets up the
necessary environment variables before delegating to your MLX script. See the
:doc:`dedicated documentation page <launching_distributed>` for more details.
For many use cases this will be the easiest way to perform distributed
computations in MLX. However, there may be reasons why you cannot or should
not use ``mlx.launch``. A common such case is the use of a scheduler that
starts all the processes for you on machines undetermined at the time of
scheduling the job.

Below we list the environment variables required to use each backend.

Ring
^^^^

**MLX_RANK** should contain a single 0-based integer that defines the rank of
the process.

**MLX_HOSTFILE** should contain the path to a JSON file that contains the IPs
and ports each rank listens to, something like the following:

.. code-block:: json

   [
       ["123.123.1.1:5000", "123.123.1.2:5000"],
       ["123.123.2.1:5000", "123.123.2.2:5000"],
       ["123.123.3.1:5000", "123.123.3.2:5000"],
       ["123.123.4.1:5000", "123.123.4.2:5000"]
   ]

**MLX_RING_VERBOSE** is optional; if set to 1 it enables some more logging
from the distributed backend.
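Put together, a scheduler (or a hand-rolled launcher) has to write the hostfile once and set ``MLX_RANK`` per process. A minimal sketch, assuming a local two-rank run and a hypothetical ``my_script.py`` (the spawn itself is left commented out):

```python
import json
import os
import tempfile

# IPs/ports for a local two-rank ring; real deployments list one entry per rank.
hosts = [["127.0.0.1:5000"], ["127.0.0.1:5001"]]

# write the hostfile that every rank will read via MLX_HOSTFILE
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(hosts, f)
    hostfile = f.name

envs = []
for rank in range(len(hosts)):
    env = dict(os.environ, MLX_RANK=str(rank), MLX_HOSTFILE=hostfile)
    envs.append(env)
    # subprocess.Popen(["python", "my_script.py"], env=env)  # one process per rank

print([e["MLX_RANK"] for e in envs])  # ['0', '1']
```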
JACCL
^^^^^

**MLX_RANK** should contain a single 0-based integer that defines the rank of
the process.

**MLX_JACCL_COORDINATOR** should contain the IP and port that rank 0 listens
on and that all the other ranks connect to in order to establish the RDMA
connections.

**MLX_IBV_DEVICES** should contain the path to a JSON file that contains the
ibverbs device names that connect each node to each other node, something like
the following:

.. code-block:: json

   [
       [null, "rdma_en5", "rdma_en4", "rdma_en3"],
       ["rdma_en5", null, "rdma_en3", "rdma_en4"],
       ["rdma_en4", "rdma_en3", null, "rdma_en5"],
       ["rdma_en3", "rdma_en4", "rdma_en5", null]
   ]

NCCL
^^^^

**MLX_RANK** should contain a single 0-based integer that defines the rank of
the process.

**MLX_WORLD_SIZE** should contain the total number of processes that will be
launched.

**NCCL_HOST_IP** and **NCCL_PORT** should contain the IP and port that all
hosts can connect to in order to establish the NCCL communication.

**CUDA_VISIBLE_DEVICES** should contain the local index of the GPU that
corresponds to this process.

Of course, any `other environment variable
<https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html>`_ that is
used by NCCL can be set.
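As a compact recap, a scheduler could build the per-process environment for the NCCL backend like this (a sketch of our own; the eight-GPUs-per-node default is an assumption, not something MLX mandates):

```python
def nccl_env(rank, world_size, host_ip, port, gpus_per_node=8):
    """Environment variables MLX's NCCL backend expects from a scheduler."""
    return {
        "MLX_RANK": str(rank),
        "MLX_WORLD_SIZE": str(world_size),
        "NCCL_HOST_IP": host_ip,
        "NCCL_PORT": str(port),
        # local GPU index for this process (assumes gpus_per_node GPUs per node)
        "CUDA_VISIBLE_DEVICES": str(rank % gpus_per_node),
    }

print(nccl_env(11, 16, "10.0.0.1", 12345))
```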
.. _tips_and_tricks:

Tips and Tricks
---------------

This is a small collection of tips to help you better utilize the distributed
communication capabilities of MLX.

- *Test locally first.*

  You can use the pattern ``mlx.launch -n2 -- my_script.py`` to run a
  small-scale test on a single node first.

- *Batch your communication.*

  As described in the :ref:`training example <training_example>`, performing
  many small communications can hurt performance. Copy the approach of
  :func:`mlx.nn.average_gradients` to gather many small communications into a
  single large one.

- *Visualize the connectivity.*

  Use ``mlx.distributed_config --hosts h1,h2,h3 --over thunderbolt --dot`` to
  visualize the connections and make sure that the cables are connected
  correctly. See the :ref:`JACCL section <jaccl_section>` for examples.

- *Use the debugger.*

  ``mlx.launch`` is meant for interactive use. It broadcasts stdin to all
  processes and gathers stdout from all processes. This makes using ``pdb`` a
  breeze.
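The batching tip can be illustrated with a small sketch of the flatten-reduce-split pattern that :func:`mlx.nn.average_gradients` follows (a plain-Python stand-in on lists, not the actual MLX implementation):

```python
def batched_all_reduce(arrays, all_reduce):
    """Run one all_reduce over many small arrays by flattening them first."""
    sizes = [len(a) for a in arrays]
    flat = [v for a in arrays for v in a]   # one big buffer
    flat = all_reduce(flat)                 # a single communication call
    out, i = [], 0
    for s in sizes:                         # split back into original shapes
        out.append(flat[i:i + s])
        i += s
    return out

# stand-in all_reduce: a 2-rank sum where the other rank holds equal values
result = batched_all_reduce([[1, 2], [3]], lambda x: [2 * v for v in x])
print(result)  # [[2, 4], [6]]
```

One call with a large buffer amortizes the per-message latency that dominates when reducing many tiny gradient arrays separately.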
@@ -7,13 +7,106 @@ Launching Distributed Programs

.. currentmodule:: mlx.core.distributed

Installing the MLX python package provides two utilities that help you
configure your Macs for distributed computation and launch distributed
programs on multiple nodes or with many processes on a single node. These
utilities are aptly named

- ``mlx.launch``
- ``mlx.distributed_config``

See the :doc:`distributed docs <distributed>` for an introduction and
getting-started guides to the various backends.

``mlx.distributed_config``
--------------------------

Unless you are launching distributed jobs locally for development or in
multi-GPU CUDA environments, you have several Macs that you need to configure
for distributed communication with MLX.

``mlx.distributed_config`` aims to automate the process of configuring the
network interfaces (especially for communication over Thunderbolt) and of
creating the hostfile to be used with ``mlx.launch``.

We will analyze three cases of using ``mlx.distributed_config``:

1. RDMA over Thunderbolt using JACCL
2. TCP/IP over Thunderbolt using the ring backend
3. TCP/IP over Ethernet using the ring backend
JACCL
^^^^^

After following :ref:`the steps to enable RDMA <jaccl_section>`, you can run
the following command to configure the nodes and create the hostfile.

.. code-block::

   mlx.distributed_config --verbose --backend jaccl \
       --hosts m3-ultra-1,m3-ultra-2,m3-ultra-3,m3-ultra-4 --over thunderbolt \
       --auto-setup --output m3-ultra-jaccl.json

Let's walk through the steps that the script takes to configure the nodes.

1. Ssh into all nodes to verify that they are reachable.
2. Extract the Thunderbolt connectivity, namely run commands on each node to
   determine which node is connected to which other node.
3. Verify that we have a valid fully connected mesh.
4. Check that RDMA is enabled.
5. Extract the Ethernet IP from interface ``en0``.
6. Disable the Thunderbolt bridge and set up peer-to-peer networks for each
   Thunderbolt cable.
7. Write the hostfile.

Knowing the above steps allows you to configure the nodes manually and also to
debug any configuration issue. For instance, changing the Ethernet IP to a
different interface directly in the config is possible (as long as it is
reachable from all nodes).

The ``--auto-setup`` argument requires password-less sudo on each node. If it
isn't available, the configuration script will print the commands to be run on
each node.
Ring over Thunderbolt
^^^^^^^^^^^^^^^^^^^^^

Setting up a ring backend over Thunderbolt only requires changing the
``--backend`` from ``jaccl`` to ``ring``.

The steps are very similar, with the main difference being that instead of
verifying that the nodes are fully connected, the script attempts to identify
a ring topology (or multiple rings).

Ring over Ethernet
^^^^^^^^^^^^^^^^^^

Configuring the ring backend over Ethernet doesn't require setting up network
interfaces, so the script simply extracts the ``en0`` IP from each node and
writes the hostfile.

Debugging cable connections
^^^^^^^^^^^^^^^^^^^^^^^^^^^

``mlx.distributed_config`` can help you debug the connectivity of your nodes
over Thunderbolt by exporting a graph of the connections.

Running

.. code-block::

   mlx.distributed_config --verbose \
       --hosts host1,host2,host3,host4 \
       --over thunderbolt --dot

will export a `GraphViz <https://graphviz.org>`_ representation of the
connections between the nodes, which makes it very easy to figure out which
cable is not connected correctly.

See :ref:`the JACCL section <jaccl_section>` for an example.
|
||||||
|
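One way to act on the exported graph is to check it for missing edges programmatically. The sketch below is illustrative only: it assumes a simple DOT output with `"a" -- "b";` edge lines, which may not match the exact format that ``mlx.distributed_config`` emits.

```python
# Illustrative only: assumes simple `"a" -- "b";` edge lines in the DOT text.
import re


def missing_links(dot_text, hosts):
    """Return host pairs with no Thunderbolt edge between them."""
    edges = set()
    for a, b in re.findall(r'"([^"]+)"\s*--\s*"([^"]+)"', dot_text):
        edges.add(frozenset((a, b)))
    return [
        (a, b)
        for i, a in enumerate(hosts)
        for b in hosts[i + 1 :]
        if frozenset((a, b)) not in edges
    ]


dot = 'graph tb { "host1" -- "host2"; "host2" -- "host3"; }'
print(missing_links(dot, ["host1", "host2", "host3"]))
# A fully connected mesh would print []; here the host1-host3 link is missing.
```

Each pair reported corresponds to a cable that is either missing or plugged into the wrong port.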
``mlx.launch``
--------------

The minimal usage example of ``mlx.launch`` is simply
@@ -33,6 +126,10 @@ the rest if one of them fails unexpectedly or if ``mlx.launch`` is terminated.
 It also takes care of forwarding the output of each remote process to stdout
 and stderr respectively.
 
+Importantly, it also broadcasts stdin to each process, which enables
+interactive programs to work in distributed mode as well as debugging with
+the interactive debugger.
+
 Providing Hosts
 ^^^^^^^^^^^^^^^^
@@ -63,10 +160,62 @@ host and on the same path. A good checklist to debug errors is the following:
   ``mlx.launch --print-python`` to see what that path is.
 * the script you want to run is available on all hosts at the same path
+
+If you are launching from a node with a completely different setup than the
+nodes that the program will run on, you can specify ``--no-verify-script`` so
+that ``mlx.launch`` does not attempt to verify that the executable and script
+exist locally before launching the distributed job.
+
+.. _ring_specifics:
+
+Ring Specifics
+^^^^^^^^^^^^^^
+
+The :ref:`ring <ring_section>` backend, which is also the default backend,
+can be explicitly selected with the argument ``--backend ring``. The ring
+backend has some specific requirements and arguments that differ from the
+other backends:
+
+* The argument ``--hosts`` only accepts IPs and not hostnames. If we need to
+  ssh to a hostname that does not correspond to the IP we want to bind to, we
+  have to provide a hostfile.
+* ``--starting-port`` defines the port to bind to on the remote hosts.
+  Specifically, rank 0 on the first IP will use this port and each subsequent
+  IP or rank will add 1 to this port.
+* ``--connections-per-ip`` allows us to increase the number of connections
+  between neighboring nodes. This corresponds to ``--mca btl_tcp_links 2``
+  for ``mpirun``.
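The ``--starting-port`` rule amounts to a simple offset per rank. This is a sketch of the rule as described in the bullet, not ``mlx.launch``'s actual implementation (the function name is hypothetical):

```python
def ring_ports(ips, starting_port=5000):
    # Rank 0 on the first IP binds to starting_port; every subsequent
    # IP/rank binds to the next port up.
    return {rank: starting_port + rank for rank in range(len(ips))}


print(ring_ports(["10.0.0.1", "10.0.0.2", "10.0.0.3"]))
# {0: 5000, 1: 5001, 2: 5002}
```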
+
+.. _jaccl_specifics:
+
+JACCL Specifics
+^^^^^^^^^^^^^^^
+
+The :ref:`JACCL <jaccl_section>` backend can be selected with the argument
+``--backend jaccl``. A hostfile is necessary to launch with this backend
+because it needs to contain the RDMA devices connecting each node to every
+other node.
+
+NCCL Specifics
+^^^^^^^^^^^^^^
+
+The :ref:`NCCL <nccl_section>` backend is the default backend in CUDA
+environments. When launching from a Mac to a Linux machine with CUDA, the
+backend should be selected explicitly with ``--backend nccl``.
+
+The ``--repeat-hosts`` (``-n``) argument should be used to launch multi-node,
+multi-GPU jobs. For instance,
+
+.. code-block::
+
+    mlx.launch --backend nccl --hosts linux-1,linux-2 -n 8 --no-verify-script -- ./my-job.sh
+
+will attempt to launch 16 processes, 8 on each node, all running
+``my-job.sh``.
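The arithmetic of ``--repeat-hosts`` is simply hosts × repeats. A sketch of the expansion (illustrative, not the launcher's actual code):

```python
hosts = ["linux-1", "linux-2"]
n = 8  # --repeat-hosts / -n

# One process per (host, repeat) pair: 2 hosts x 8 repeats = 16 ranks total.
world = [host for host in hosts for _ in range(n)]
print(len(world))  # 16
```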
 
 .. _mpi_specifics:
 
 MPI Specifics
--------------
+^^^^^^^^^^^^^
 
 One can use MPI by passing ``--backend mpi`` to ``mlx.launch``. In that case,
 ``mlx.launch`` is a thin wrapper over ``mpirun``. Moreover,
@@ -83,23 +232,3 @@ to choose a specific interface for the byte-transfer-layer of MPI we can call
 .. code:: shell
 
     mlx.launch --backend mpi --mpi-arg '--mca btl_tcp_if_include en0' --hostfile hosts.json my_script.py
-
-
-.. _ring_specifics:
-
-Ring Specifics
---------------
-
-The ring backend, which is also the default backend, can be explicitly selected
-with the argument ``--backend ring``. The ring backend has some specific
-requirements and arguments that are different to MPI:
-
-* The argument ``--hosts`` only accepts IPs and not hostnames. If we need to
-  ssh to a hostname that does not correspond to the IP we want to bind to we
-  have to provide a hostfile.
-* ``--starting-port`` defines the port to bind to on the remote hosts.
-  Specifically rank 0 for the first IP will use this port and each subsequent
-  IP or rank will add 1 to this port.
-* ``--connections-per-ip`` allows us to increase the number of connections
-  between neighboring nodes. This corresponds to ``--mca btl_tcp_links 2`` for
-  ``mpirun``.
@@ -106,6 +106,8 @@ class IPConfigurator:
         for src_port, p in enumerate(h.ports):
             if not p.connected_to:
                 continue
+            if p.connected_to not in uuid_reverse_index:
+                continue
             if (src_node, src_port) in assigned:
                 continue
@@ -241,7 +243,7 @@ def make_connectivity_matrix(tb_hosts, uuid_reverse_index):
     for i, h in enumerate(tb_hosts):
         c = [0] * len(tb_hosts)
         for p in h.ports:
-            if p.connected_to is not None:
+            if p.connected_to in uuid_reverse_index:
                 j, _ = uuid_reverse_index[p.connected_to]
                 c[j] += 1
         connectivity.append(c)
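The hunk above changes the guard from ``is not None`` to a membership test. A standalone sketch shows why: a port can be cabled to a machine outside the configured host set, whose UUID is then absent from ``uuid_reverse_index``, and the old guard would let the lookup raise a ``KeyError``. Class and field names here are simplified stand-ins for the real ones.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Port:
    connected_to: Optional[str] = None  # UUID of the peer port, if cabled


@dataclass
class Host:
    ports: list = field(default_factory=list)


def make_connectivity_matrix(tb_hosts, uuid_reverse_index):
    connectivity = []
    for i, h in enumerate(tb_hosts):
        c = [0] * len(tb_hosts)
        for p in h.ports:
            # The fixed guard: skip peers whose UUID we don't know about.
            if p.connected_to in uuid_reverse_index:
                j, _ = uuid_reverse_index[p.connected_to]
                c[j] += 1
        connectivity.append(c)
    return connectivity


hosts = [Host([Port("uuid-b"), Port("uuid-external")]), Host([Port("uuid-a")])]
index = {"uuid-a": (0, 0), "uuid-b": (1, 0)}
print(make_connectivity_matrix(hosts, index))
# The external UUID is ignored instead of crashing: [[0, 1], [1, 0]]
```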
@@ -49,7 +49,7 @@ class RemoteProcess(CommandProcess):
         is_local = host == "127.0.0.1"
         cmd = RemoteProcess.make_launch_script(rank, cwd, files, env, command)
         if not is_local:
-            cmd = f"ssh {host} {shlex.quote(cmd)}"
+            cmd = f"ssh -tt -o LogLevel=QUIET {host} {shlex.quote(cmd)}"
 
         self._host = host
         self._pidfile = None
@@ -57,6 +57,7 @@ class RemoteProcess(CommandProcess):
         self._process = Popen(
             cmd,
             shell=True,
+            executable="/bin/bash",
             stdin=PIPE,
             stdout=PIPE,
             stderr=PIPE,
@@ -91,7 +92,14 @@ class RemoteProcess(CommandProcess):
         cmd = RemoteProcess.make_kill_script(self._pidfile)
         if not self._is_local:
             cmd = f"ssh {self._host} {shlex.quote(cmd)}"
-        c = run(cmd, check=True, shell=True, capture_output=True, text=True)
+        c = run(
+            cmd,
+            check=True,
+            shell=True,
+            executable="/bin/bash",
+            capture_output=True,
+            text=True,
+        )
 
         self._killed = c.stdout.strip() == "1"
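The ``executable="/bin/bash"`` argument added in these hunks pins the shell that ``shell=True`` uses; by default Python runs ``/bin/sh``, which on some distributions is not bash and lacks features the generated scripts rely on. A small check (assumes ``/bin/bash`` exists on the machine):

```python
from subprocess import run

# ${BASH_VERSION:+bash} expands to "bash" only when running under bash.
c = run('echo "${BASH_VERSION:+bash}"', shell=True,
        executable="/bin/bash", capture_output=True, text=True)
print(c.stdout.strip())  # bash
```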
||||||
@@ -99,10 +107,13 @@ class RemoteProcess(CommandProcess):
|
|||||||
def make_launch_script(rank, cwd, files, env, command):
|
def make_launch_script(rank, cwd, files, env, command):
|
||||||
script = ""
|
script = ""
|
||||||
|
|
||||||
|
# Disable echo
|
||||||
|
script = "stty -echo; "
|
||||||
|
|
||||||
# Write the PID to a file so we can kill the process if needed
|
# Write the PID to a file so we can kill the process if needed
|
||||||
script += "pidfile=$(mktemp); "
|
script += "pidfile=$(mktemp); "
|
||||||
script += "echo $$ > $pidfile; "
|
script += "echo $$ > $pidfile; "
|
||||||
script += "echo $pidfile; "
|
script += 'printf "%s\\n" $pidfile; '
|
||||||
|
|
||||||
# Change the working directory if one was requested. Otherwise attempt to
|
# Change the working directory if one was requested. Otherwise attempt to
|
||||||
# change to the current one but don't fail if it wasn't possible.
|
# change to the current one but don't fail if it wasn't possible.
|
||||||
|
|||||||
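The launch script built above implements a small pidfile handshake: the remote shell writes its PID to a temp file and prints the file's path as its first line of output, so the launcher can later locate and kill it. A local sketch of the same handshake (simplified; the real kill script also reports whether the process was killed):

```python
import os
import signal
import subprocess

# Same preamble the launch script builds: store the PID, print the pidfile path.
script = 'pidfile=$(mktemp); echo $$ > $pidfile; printf "%s\\n" $pidfile; sleep 5'
p = subprocess.Popen(script, shell=True, executable="/bin/bash",
                     stdout=subprocess.PIPE, text=True)
pidfile = p.stdout.readline().strip()  # first output line: the pidfile path
with open(pidfile) as f:
    pid = int(f.read())
os.kill(pid, signal.SIGTERM)           # kill the shell via the stored PID
p.wait()
os.remove(pidfile)                     # tidy up the temp file
```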
@@ -67,7 +67,7 @@ void init_distributed(nb::module_& parent_module) {
 
       Args:
         backend (str, optional): The name of the backend to check for availability.
-          It takes the same values as ``init()``. Default: ``any``.
+          It takes the same values as :func:`init()`. Default: ``"any"``.
 
       Returns:
         bool: Whether the distributed backend is available.