Compare commits


9 Commits

Author SHA1 Message Date
Angelos Katharopoulos
d2bc340df4 Update the mlx.launch and mlx.distributed_config docs 2025-12-12 17:03:38 -08:00
Angelos Katharopoulos
fabc947df4 Possible doc fix 2025-12-12 15:34:59 -08:00
Angelos Katharopoulos
5523087cfb Finish the distributed docs 2025-12-12 14:16:16 -08:00
Angelos Katharopoulos
2f939acefa Progress with the docs 2025-12-12 04:36:46 -08:00
Angelos Katharopoulos
3b416a2e36 Comments 2025-12-12 01:21:43 -08:00
Angelos Katharopoulos
753c6a4d0f Disable echo for interactive distributed 2025-12-11 17:47:42 -08:00
Angelos Katharopoulos
d3a754c8aa Fix config when more cables are connected 2025-12-10 02:14:29 -08:00
Angelos Katharopoulos
595a4ad206 Improve interactivity of mlx.launch 2025-12-10 00:54:27 -08:00
Angelos Katharopoulos
a4dc1fac6c Set the shell to bash explicitly 2025-12-09 15:17:09 -08:00
8 changed files with 553 additions and 111 deletions

View File

@@ -119,10 +119,6 @@ if(MLX_BUILD_METAL)
COMMAND zsh "-c" "/usr/bin/xcrun -sdk macosx --show-sdk-version"
OUTPUT_VARIABLE MACOS_SDK_VERSION
OUTPUT_STRIP_TRAILING_WHITESPACE COMMAND_ERROR_IS_FATAL ANY)
execute_process(
COMMAND zsh "-c" "/usr/bin/xcrun -sdk macosx --show-sdk-path"
OUTPUT_VARIABLE CMAKE_OSX_SYSROOT
OUTPUT_STRIP_TRAILING_WHITESPACE COMMAND_ERROR_IS_FATAL ANY)
if(${MACOS_SDK_VERSION} LESS 14.0)
message(

Two binary image files added (16 KiB and 22 KiB); not shown.

View File

@@ -7,22 +7,29 @@ Distributed Communication
MLX supports distributed communication operations that allow the computational cost
of training or inference to be shared across many physical machines. At the
moment we support three different communication backends:
moment we support several different communication backends introduced below.
.. list-table::
:widths: 20 80
:header-rows: 1
* - Backend
- Description
* - :ref:`MPI <mpi_section>`
- A full featured and mature distributed communications library.
* - :ref:`RING <ring_section>`
- Ring all reduce and all gather over TCP sockets. Always available and
usually faster than MPI.
* - :ref:`JACCL <jaccl_section>`
- Low latency communication with RDMA over thunderbolt. Necessary for
things like tensor parallelism.
* - :ref:`NCCL <nccl_section>`
- The backend of choice for CUDA environments.
* `MPI <https://en.wikipedia.org/wiki/Message_Passing_Interface>`_ a
full-featured and mature distributed communications library
* A **ring** backend of our own that uses native TCP sockets. It should be
faster for thunderbolt connections, but it also works over Ethernet.
* `nccl <https://developer.nvidia.com/nccl>`_, for use in CUDA environments.
The list of all currently supported operations and their documentation can be
seen in the :ref:`API docs<distributed>`.
.. note::
Some operations may not be supported or not as fast as they should be.
We are adding more and tuning the ones we have as we are figuring out the
best way to do distributed computing on Macs using MLX.
Getting Started
---------------
@@ -85,7 +92,7 @@ Selecting Backend
^^^^^^^^^^^^^^^^^
You can select the backend you want to use when calling :func:`init` by passing
one of ``{'any', 'ring', 'mpi', 'nccl'}``. When passing ``any``, MLX will try all
one of ``{'any', 'ring', 'jaccl', 'mpi', 'nccl'}``. When passing ``any``, MLX will try all
available backends. If they all fail then a singleton group is created.
.. note::
@@ -110,6 +117,8 @@ The following examples aim to clarify the backend initialization logic in MLX:
world_ring = mx.distributed.init(backend="ring")
world_any = mx.distributed.init() # same as MPI because it was initialized first!
.. _training_example:
Training Example
----------------
@@ -192,16 +201,273 @@ almost identical to the example above:
loss = step(model, x, y)
mx.eval(loss, model.parameters())
.. _ring_section:
Getting Started with Ring
-------------------------
The ring backend does not depend on any third party library so it is always
available. It uses TCP sockets so the nodes need to be reachable via a network.
As the name suggests, the nodes are connected in a ring, which means that rank 1
can only communicate with ranks 0 and 2, rank 2 only with ranks 1 and 3, and so
on. As a result, :func:`send` and :func:`recv` with arbitrary sender and
receiver are not supported in the ring backend.
Defining a Ring
^^^^^^^^^^^^^^^
The easiest way to define and use a ring is via a JSON hostfile and the
``mlx.launch`` :doc:`helper script <launching_distributed>`. For each node, one
defines a hostname to ssh into in order to run commands on that node, and one or
more IPs that the node will listen on for connections.
For example the hostfile below defines a 4 node ring. ``hostname1`` will be
rank 0, ``hostname2`` rank 1 etc.
.. code:: json
[
{"ssh": "hostname1", "ips": ["123.123.123.1"]},
{"ssh": "hostname2", "ips": ["123.123.123.2"]},
{"ssh": "hostname3", "ips": ["123.123.123.3"]},
{"ssh": "hostname4", "ips": ["123.123.123.4"]}
]
Running ``mlx.launch --hostfile ring-4.json my_script.py`` will ssh into each
node and run the script, which will listen for connections on each of the
provided IPs. Specifically, ``hostname1`` will connect to ``123.123.123.2`` and
accept a connection from ``123.123.123.4``, and so on.
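For illustration, a hypothetical ``my_script.py`` launched this way could look
like the sketch below. It only uses the standard :ref:`distributed API
<distributed>`; with the hostfile above, :func:`init` should pick the ring
backend automatically.
.. code-block:: python

    # my_script.py -- a minimal sketch of a script to launch with mlx.launch
    import mlx.core as mx

    # mlx.launch sets up the environment so that init() finds the ring backend
    world = mx.distributed.init()

    # Every rank contributes a one, so the sum equals the world size
    total = mx.distributed.all_sum(mx.ones(1))
    print(f"Rank {world.rank()} of {world.size()} computed {total.item()}", flush=True)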
Thunderbolt Ring
^^^^^^^^^^^^^^^^
Although the ring backend can have benefits over MPI even for Ethernet, its
main purpose is to use Thunderbolt rings for higher bandwidth communication.
Setting up such thunderbolt rings can be done manually, but is a relatively
tedious process. To simplify this, we provide the utility ``mlx.distributed_config``.
To use ``mlx.distributed_config`` your computers need to be accessible by ssh via
Ethernet or Wi-Fi. Subsequently, connect them via thunderbolt cables and then call the
utility as follows:
.. code:: shell
mlx.distributed_config --verbose --hosts host1,host2,host3,host4 --backend ring
By default the script will attempt to discover the thunderbolt ring and provide
you with the commands to configure each node as well as the ``hostfile.json``
to use with ``mlx.launch``. If password-less ``sudo`` is available on the nodes
then ``--auto-setup`` can be used to configure them automatically.
If you want to go through the process manually, the steps are as follows:
* Disable the thunderbolt bridge interface
* For the cable connecting rank ``i`` to rank ``i + 1`` find the interfaces
corresponding to that cable in nodes ``i`` and ``i + 1``.
* Set up a unique subnetwork connecting the two nodes for the corresponding
interfaces. For instance, if the cable corresponds to ``en2`` on node ``i``
and also to ``en2`` on node ``i + 1``, then we may assign IPs ``192.168.0.1`` and
``192.168.0.2`` respectively to the two nodes. For more details you can see
the commands prepared by the utility script; the sketch after this list
illustrates the idea.
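As a rough illustration of the last step (and not the actual output of
``mlx.distributed_config``), the sketch below prints one possible assignment of
an isolated subnet per cable for a 4-node ring. The interface names are
placeholders, and the exact ``ifconfig`` invocations are an assumption about how
one might set the IPs by hand.
.. code-block:: python

    # Hypothetical sketch: one isolated /24 subnet per thunderbolt cable in a
    # 4-node ring. cables[k] holds the interfaces that cable k uses on node k
    # and on node (k + 1) % 4; the names below are placeholders.
    cables = [("en2", "en3"), ("en2", "en3"), ("en2", "en3"), ("en2", "en3")]

    for k, (iface_a, iface_b) in enumerate(cables):
        a, b = k, (k + 1) % len(cables)
        print(f"node {a}: sudo ifconfig {iface_a} inet 192.168.{k}.1 netmask 255.255.255.0")
        print(f"node {b}: sudo ifconfig {iface_b} inet 192.168.{k}.2 netmask 255.255.255.0")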
.. _jaccl_section:
Getting Started with RDMA over Thunderbolt
------------------------------------------
Starting from version 26.2, RDMA over thunderbolt is available in macOS and
enables low-latency communication between Macs with thunderbolt 5. MLX provides
the JACCL backend that uses this functionality to achieve communication latency
an order of magnitude lower than that of the ring backend.
.. note::
The name JACCL (pronounced Jackal) stands for *Jack and Angelos' Collective
Communication Library*. It is an obvious pun on Nvidia's NCCL, but also a
tribute to *Jack Beasley*, who led the development of RDMA over Thunderbolt
at Apple.
Enabling RDMA
^^^^^^^^^^^^^
Until the feature matures, enabling RDMA over thunderbolt is slightly more
involved and **cannot** be done remotely even with sudo. In fact, it has to be
done in macOS recovery:
1. `Start your computer in recovery <https://support.apple.com/en-us/102518>`_.
2. Open the Terminal by going to Utilities -> Terminal.
3. Run ``rdma_ctl enable``.
4. Reboot.
To verify that you have successfully enabled Thunderbolt RDMA you can run
``ibv_devices`` which should produce something like the following for an M3 Ultra.
.. code-block:: bash
~ % ibv_devices
device node GUID
------ ----------------
rdma_en2 8096a9d9edbaac05
rdma_en3 8196a9d9edbaac05
rdma_en5 8396a9d9edbaac05
rdma_en4 8296a9d9edbaac05
rdma_en6 8496a9d9edbaac05
rdma_en7 8596a9d9edbaac05
Defining a Mesh
^^^^^^^^^^^^^^^
The JACCL backend supports only fully connected topologies. Namely, there needs
to be a thunderbolt cable connecting all pairs of Macs directly. For example, in
the following topology visualizations, the left one is valid because there is a
connection from any node to any other node, while for the one on the right M3
Ultra 1 is not connected to M3 Ultra 2.
.. raw:: html
<div style="display: flex; text-align: center; align-items: end; font-size: 80%;">
<div>
<img src="/_static/distributed/m3-ultra-mesh.png" alt="M3 Ultra thunderbolt mesh" style="width: 55%">
<p>Fully connected mesh of four M3 Ultra.</p>
</div>
<div>
<img src="/_static/distributed/m3-ultra-mesh-broken.png" alt="M3 Ultra broken thunderbolt mesh" style="width: 55%">
<p>Not a valid mesh (M3 Ultra 1 is not connected to M3 Ultra 2).</p>
</div>
</div>
Similar to the ring backend, the easiest way to use JACCL with MLX is to write
a JSON hostfile that will be used by ``mlx.launch``. The hostfile needs to contain:
- Hostnames to use for launching scripts via ssh
- An IP for rank 0 that is reachable by all nodes
- A list of rdma devices that connect each node to each other node
The following JSON defines the valid 4-node mesh from the image above.
.. code-block:: json
[
{
"ssh": "m3-ultra-1",
"ips": ["123.123.123.1"],
"rdma": [null, "rdma_en5", "rdma_en4", "rdma_en3"]
},
{
"ssh": "m3-ultra-2",
"ips": [],
"rdma": ["rdma_en5", null, "rdma_en3", "rdma_en4"]
},
{
"ssh": "m3-ultra-3",
"ips": [],
"rdma": ["rdma_en4", "rdma_en3", null, "rdma_en5"]
},
{
"ssh": "m3-ultra-4",
"ips": [],
"rdma": ["rdma_en3", "rdma_en4", "rdma_en5", null]
}
]
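The ``rdma`` lists across the hostfile form a square matrix: entry ``j`` of node
``i`` is the device on node ``i`` that reaches node ``j``, with ``null`` on the
diagonal. A quick sanity check of a hostfile could look like the following
hypothetical helper (not part of MLX):
.. code-block:: python

    import json

    # Hypothetical helper: verify that every node in a JACCL hostfile names an
    # RDMA device for every other node and null for itself.
    def check_jaccl_hostfile(path):
        hosts = json.load(open(path))
        n = len(hosts)
        for i, host in enumerate(hosts):
            rdma = host.get("rdma", [])
            assert len(rdma) == n, f"{host['ssh']}: expected {n} rdma entries"
            assert rdma[i] is None, f"{host['ssh']}: entry {i} should be null"
            missing = [hosts[j]["ssh"] for j in range(n) if j != i and rdma[j] is None]
            assert not missing, f"{host['ssh']} has no device for {missing}"

    check_jaccl_hostfile("m3-ultra-jaccl.json")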
Even though TCP/IP is not used when communicating over Thunderbolt RDMA,
disabling the thunderbolt bridge is still required, as is setting up isolated
local networks for each thunderbolt connection.
All of the above can instead be done via ``mlx.distributed_config``. This helper
script will:
- ssh into each node
- extract the thunderbolt connectivity
- check for a valid mesh
- provide the commands to configure each node (or run them if sudo is available)
- generate the hostfile to be used with ``mlx.launch``
Putting it All Together
^^^^^^^^^^^^^^^^^^^^^^^^
For example, launching a distributed MLX script that uses JACCL is fairly simple
if the nodes are reachable via ssh and have password-less sudo.
First, connect all the thunderbolt cables. Then we can verify the connections
by using the ``mlx.distributed_config`` script to visualize them.
.. code-block::
mlx.distributed_config --verbose \
--hosts m3-ultra-1,m3-ultra-2,m3-ultra-3,m3-ultra-4 \
--over thunderbolt --dot | dot -Tpng | open -f -a Preview
After making sure that everything looks right we can auto-configure the nodes
and save the hostfile to ``m3-ultra-jaccl.json`` by running:
.. code-block::
mlx.distributed_config --verbose \
--hosts m3-ultra-1,m3-ultra-2,m3-ultra-3,m3-ultra-4 \
--over thunderbolt --backend jaccl \
--auto-setup --output m3-ultra-jaccl.json
And now we are ready to run a distributed MLX script such as distributed inference
of a gigantic model using MLX-LM.
.. code-block::
# the MLX_METAL_FAST_SYNCH=1 below is important, see the note that follows
mlx.launch --verbose --backend jaccl --hostfile m3-ultra-jaccl.json \
--env MLX_METAL_FAST_SYNCH=1 -- \
/path/to/remote/python -m mlx_lm chat --model mlx-community/DeepSeek-V3.2-8bit --shard
.. note::
Defining the environment variable ``MLX_METAL_FAST_SYNCH=1`` enables a
different, faster way of synchronizing between the GPU and the CPU. It is not
specific to the JACCL backend and can be used whenever the CPU and GPU need to
collaborate on a computation. It is, however, critical for low-latency
communication since the communication is performed by the CPU.
.. _nccl_section:
Getting Started with NCCL
-------------------------
MLX on CUDA environments ships with the ability to talk to `NCCL
<https://developer.nvidia.com/nccl>`_ which is a high-performance collective
communication library that supports both multi-gpu and multi-node setups.
For CUDA environments, NCCL is the default backend for ``mlx.launch`` and all
it takes to run a distributed job is
.. code-block::
mlx.launch -n 8 test.py
# perfect for interactive scripts
mlx.launch -n 8 python -m mlx_lm chat --model my-model --shard
You can also use ``mlx.launch`` to ssh to a remote node and launch a script
with the same ease
.. code-block::
mlx.launch --hosts my-cuda-node -n 8 test.py
In many cases you may not want to use ``mlx.launch`` with the NCCL backend
because the cluster scheduler will be the one launching the processes. You can
:ref:`see which environment variables need to be defined <no_mlx_launch>` in
order for the MLX NCCL backend to be initialized correctly.
.. _mpi_section:
Getting Started with MPI
------------------------
MLX already comes with the ability to "talk" to MPI if it is installed on the
machine. Launching distributed MLX programs that use MPI can be done with
``mpirun`` as expected. However, in the following examples we will be using
``mlx.launch --backend mpi`` which takes care of some nuisances such as setting
absolute paths for the ``mpirun`` executable and the ``libmpi.dyld`` shared
library.
MLX already comes with the ability to "talk" to `MPI
<https://en.wikipedia.org/wiki/Message_Passing_Interface>`_ if it is installed
on the machine. Launching distributed MLX programs that use MPI can be done
with ``mpirun`` as expected. However, in the following examples we will be
using ``mlx.launch --backend mpi`` which takes care of some nuisances such as
setting absolute paths for the ``mpirun`` executable and the ``libmpi.dyld``
shared library.
The simplest possible usage is the following which, assuming the minimal
example in the beginning of this page, should result in:
@@ -269,78 +535,116 @@ Force MPI to use the most performant network interface by setting ``--mca
btl_tcp_if_include <iface>`` where ``<iface>`` should be the interface you want
to use.
Getting Started with Ring
-------------------------
.. _no_mlx_launch:
The ring backend does not depend on any third party library so it is always
available. It uses TCP sockets so the nodes need to be reachable via a network.
As the name suggests the nodes are connected in a ring which means that rank 1
can only communicate with rank 0 and rank 2, rank 2 only with rank 1 and rank 3
and so on and so forth. As a result :func:`send` and :func:`recv` with
arbitrary sender and receiver is not supported in the ring backend.
Distributed Without ``mlx.launch``
----------------------------------
Defining a Ring
^^^^^^^^^^^^^^^
None of the implementations of the distributed backends require launching with
``mlx.launch``. The launcher simply connects to each host, starts a process per
rank, and sets up the necessary environment variables before delegating to your
MLX script. See the :doc:`dedicated documentation page <launching_distributed>`
for more details.
The easiest way to define and use a ring is via a JSON hostfile and the
``mlx.launch`` :doc:`helper script <launching_distributed>`. For each node one
defines a hostname to ssh into to run commands on this node and one or more IPs
that this node will listen to for connections.
For many use-cases this will be the easiest way to perform distributed
computations in MLX. However, there may be reasons that you cannot or should
not use ``mlx.launch``. A common such case is the use of a scheduler that
starts all the processes for you on machines that are not known at the time the
job is scheduled.
For example the hostfile below defines a 4 node ring. ``hostname1`` will be
rank 0, ``hostname2`` rank 1 etc.
Below we list the environment variables required to use each backend.
.. code:: json
Ring
^^^^^^
[
{"ssh": "hostname1", "ips": ["123.123.123.1"]},
{"ssh": "hostname2", "ips": ["123.123.123.2"]},
{"ssh": "hostname3", "ips": ["123.123.123.3"]},
{"ssh": "hostname4", "ips": ["123.123.123.4"]}
]
**MLX_RANK** should contain a single 0-based integer that defines the rank of
the process.
Running ``mlx.launch --hostfile ring-4.json my_script.py`` will ssh into each
node, run the script which will listen for connections in each of the provided
IPs. Specifically, ``hostname1`` will connect to ``123.123.123.2`` and accept a
connection from ``123.123.123.4`` and so on and so forth.
**MLX_HOSTFILE** should contain the path to a json file that contains IPs and
ports for each rank to listen to, something like the following:
Thunderbolt Ring
^^^^^^^^^^^^^^^^
.. code-block:: json
Although the ring backend can have benefits over MPI even for Ethernet, its
main purpose is to use Thunderbolt rings for higher bandwidth communication.
Setting up such thunderbolt rings can be done manually, but is a relatively
tedious process. To simplify this, we provide the utility ``mlx.distributed_config``.
[
["123.123.1.1:5000", "123.123.1.2:5000"],
["123.123.2.1:5000", "123.123.2.2:5000"],
["123.123.3.1:5000", "123.123.3.2:5000"],
["123.123.4.1:5000", "123.123.4.2:5000"]
]
To use ``mlx.distributed_config`` your computers need to be accessible by ssh via
Ethernet or Wi-Fi. Subsequently, connect them via thunderbolt cables and then call the
utility as follows:
**MLX_RING_VERBOSE** is optional; if set to 1, it enables additional logging
from the distributed backend.
.. code:: shell
JACCL
^^^^^
mlx.distributed_config --verbose --hosts host1,host2,host3,host4
**MLX_RANK** should contain a single 0-based integer that defines the rank of
the process.
By default the script will attempt to discover the thunderbolt ring and provide
you with the commands to configure each node as well as the ``hostfile.json``
to use with ``mlx.launch``. If password-less ``sudo`` is available on the nodes
then ``--auto-setup`` can be used to configure them automatically.
**MLX_JACCL_COORDINATOR** should contain the IP and port that rank 0 listens on
and that all the other ranks connect to in order to establish the RDMA connections.
To validate your connection without configuring anything
``mlx.distributed_config`` can also plot the ring using DOT format.
**MLX_IBV_DEVICES** should contain the path to a json file that contains the
ibverbs device names that connect each node to each other node, something like
the following:
.. code:: shell
.. code-block:: json
mlx.distributed_config --verbose --hosts host1,host2,host3,host4 --dot >ring.dot
dot -Tpng ring.dot >ring.png
open ring.png
[
[null, "rdma_en5", "rdma_en4", "rdma_en3"],
["rdma_en5", null, "rdma_en3", "rdma_en4"],
["rdma_en4", "rdma_en3", null, "rdma_en5"],
["rdma_en3", "rdma_en4", "rdma_en5", null]
]
If you want to go through the process manually, the steps are as follows:
* Disable the thunderbolt bridge interface
* For the cable connecting rank ``i`` to rank ``i + 1`` find the interfaces
corresponding to that cable in nodes ``i`` and ``i + 1``.
* Set up a unique subnetwork connecting the two nodes for the corresponding
interfaces. For instance if the cable corresponds to ``en2`` on node ``i``
and ``en2`` also on node ``i + 1`` then we may assign IPs ``192.168.0.1`` and
``192.168.0.2`` respectively to the two nodes. For more details you can see
the commands prepared by the utility script.
NCCL
^^^^^
**MLX_RANK** should contain a single 0-based integer that defines the rank of
the process.
**MLX_WORLD_SIZE** should contain the total number of processes that will be
launched.
**NCCL_HOST_IP** and **NCCL_PORT** should contain the IP and port that all
hosts can connect to in order to establish the NCCL communication.
**CUDA_VISIBLE_DEVICES** should contain the local index of the GPU that
corresponds to this process.
Of course any `other environment variable
<https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html>`_ that is
used by NCCL can be set.
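To make the ring case above concrete, the following is a hedged sketch of what
one rank might do when started by a scheduler instead of ``mlx.launch``. It
assumes the variables are read when :func:`init` is called and that the
scheduler exposes the rank through a hypothetical ``SCHEDULER_RANK`` variable.
.. code-block:: python

    import json
    import os
    import tempfile

    # Hypothetical: the scheduler tells each process its rank somehow
    rank = int(os.environ.get("SCHEDULER_RANK", "0"))

    # The IPs and ports each rank listens on, in the hostfile format above
    hosts = [
        ["123.123.1.1:5000", "123.123.1.2:5000"],
        ["123.123.2.1:5000", "123.123.2.2:5000"],
        ["123.123.3.1:5000", "123.123.3.2:5000"],
        ["123.123.4.1:5000", "123.123.4.2:5000"],
    ]
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(hosts, f)

    # Set the variables described above, then initialize the ring backend
    os.environ["MLX_RANK"] = str(rank)
    os.environ["MLX_HOSTFILE"] = f.name

    import mlx.core as mx  # imported after setting the variables, to be safe

    world = mx.distributed.init(backend="ring")
    print(f"rank {world.rank()} of {world.size()}")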
.. _tips_and_tricks:
Tips and Tricks
----------------
This is a small collection of tips to help you make better use of the
distributed communication capabilities of MLX.
- *Test locally first.*
You can use the pattern ``mlx.launch -n2 -- my_script.py`` to run a
small-scale test on a single node.
- *Batch your communication.*
As described in the :ref:`training example <training_example>`, performing
many small communications can hurt performance. Copy the approach of
:func:`mlx.nn.average_gradients` to gather many small communications into a
single large one, as sketched after this list.
- *Visualize the connectivity.*
Use ``mlx.distributed_config --hosts h1,h2,h3 --over thunderbolt --dot`` to
visualize the connections and make sure that the cables are connected
correctly. See the :ref:`JACCL section <jaccl_section>` for examples.
- *Use the debugger.*
``mlx.launch`` is meant for interactive use. It broadcasts stdin to all
processes and gathers stdout from all processes. This makes using ``pdb`` a
breeze.
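For the batching tip above, here is a simplified, hedged sketch of the idea
behind :func:`mlx.nn.average_gradients` (not the actual implementation):
concatenate all gradients into one buffer, run a single :func:`all_sum`, and
split the result back into the original shapes.
.. code-block:: python

    import mlx.core as mx
    from mlx.utils import tree_flatten, tree_unflatten

    def average_gradients_batched(grads, group=None):
        # One big all_sum instead of one communication per parameter
        group = group if group is not None else mx.distributed.init()
        N = group.size()
        flat = tree_flatten(grads)  # list of (name, array) pairs
        sizes = [g.size for _, g in flat]
        shapes = [g.shape for _, g in flat]
        big = mx.concatenate([g.reshape(-1) for _, g in flat])
        big = mx.distributed.all_sum(big, group=group) / N
        # Split the averaged buffer back into the original arrays
        pieces = mx.split(big, [sum(sizes[:i]) for i in range(1, len(sizes))])
        return tree_unflatten(
            [(k, p.reshape(s)) for (k, _), p, s in zip(flat, pieces, shapes)]
        )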

View File

@@ -7,13 +7,106 @@ Launching Distributed Programs
.. currentmodule:: mlx.core.distributed
Installing the MLX python package provides a helper script ``mlx.launch`` that
can be used to run python scripts distributed on several nodes. It allows
launching using either the MPI backend or the ring backend. See the
:doc:`distributed docs <distributed>` for the different backends.
Installing the MLX python package provides two utilities to help you configure
your Macs for distributed computation and also launch distributed programs on
multiple nodes or with many processes in a single node. These utilities are aptly named
Usage
-----
- ``mlx.launch``
- ``mlx.distributed_config``
See the :doc:`distributed docs <distributed>` for an introduction and
getting-started guides to the various backends.
``mlx.distributed_config``
---------------------------
Unless you are launching distributed jobs locally for development or in
multi-GPU CUDA environments, you have several Macs that you need to configure
for distributed communication with MLX.
``mlx.distributed_config`` aims to automate the process of configuring the
network interfaces (especially for communication over thunderbolt) and also
creating the hostfile to be used with ``mlx.launch``.
We will analyse three cases of using ``mlx.distributed_config``:
1. RDMA over thunderbolt using JACCL
2. TCP/IP over thunderbolt using the ring backend
3. TCP/IP over ethernet using the ring backend
JACCL
^^^^^^^
After following :ref:`the steps to enable RDMA <jaccl_section>` you can run the
following command to configure the nodes and create the hostfile.
.. code-block::
mlx.distributed_config --verbose --backend jaccl \
--hosts m3-ultra-1,m3-ultra-2,m3-ultra-3,m3-ultra-4 --over thunderbolt \
--auto-setup --output m3-ultra-jaccl.json
Let's walk through the steps that the script takes to configure the nodes.
1. Ssh to all nodes to verify that they are reachable
2. Extract the thunderbolt connectivity. Namely run commands on each node to
calculate which node is connected to which other node.
3. Verify that we have a valid fully connected mesh
4. Check that RDMA is enabled
5. Extract the ethernet IP from interface en0
6. Disable the thunderbolt bridge and set up peer to peer networks for each
thunderbolt cable
7. Write the hostfile
Knowing the above steps allows you to configure the nodes manually but also to
debug any configuration issue. For instance, changing the Ethernet IP in the
config directly to that of a different interface is possible (as long as it is
reachable from all nodes).
The ``--auto-setup`` argument requires password-less sudo on each node. If it
isn't available then the configuration script will print commands to be run on
each node.
Ring over thunderbolt
^^^^^^^^^^^^^^^^^^^^^
Setting up a ring backend over thunderbolt only requires changing the
``--backend`` from ``jaccl`` to ``ring``.
The steps are very similar with the main difference being that instead of
verifying that the nodes are fully connected, the script attempts to identify a
ring topology (or multiple rings).
Ring over Ethernet
^^^^^^^^^^^^^^^^^^
Configuring the ring backend over Ethernet doesn't require setting up network
interfaces, so the script simply extracts the ``en0`` IP from each node and
writes the hostfile.
Debugging cable connections
^^^^^^^^^^^^^^^^^^^^^^^^^^^
``mlx.distributed_config`` can help you debug the connectivity of your nodes
over thunderbolt by exporting a graph of the connections.
Running
.. code-block::
mlx.distributed_config --verbose \
--hosts host1,host2,host3,host4 \
--over thunderbolt --dot
will export a `GraphViz <https://graphviz.org>`_ representation of the
connections between the nodes which makes it very easy to figure out which
cable is not connected correctly.
See :ref:`the JACCL section <jaccl_section>` for an example.
``mlx.launch``
--------------
The minimal usage example of ``mlx.launch`` is simply
@@ -33,6 +126,10 @@ the rest if one of them fails unexpectedly or if ``mlx.launch`` is terminated.
It also takes care of forwarding the output of each remote process to stdout
and stderr respectively.
Importantly, it also broadcasts stdin to each process, which enables interactive
programs to work in distributed mode as well as debugging with the interactive
debugger.
Providing Hosts
^^^^^^^^^^^^^^^^
@@ -63,10 +160,62 @@ host and on the same path. A good checklist to debug errors is the following:
``mlx.launch --print-python`` to see what that path is.
* the script you want to run is available on all hosts at the same path
If you are launching from a node with a completely different setup than the
nodes that the program will run on, you can specify ``--no-verify-script`` so
that ``mlx.launch`` does not attempt to verify that the executable and script
exist locally before launching the distributed job.
.. _ring_specifics:
Ring Specifics
^^^^^^^^^^^^^^
The :ref:`ring <ring_section>` backend, which is also the default
backend, can be explicitly selected with the argument ``--backend ring``. The
ring backend has some specific requirements and arguments that are different to
other backends:
* The argument ``--hosts`` only accepts IPs and not hostnames. If we need to
ssh to a hostname that does not correspond to the IP we want to bind to we
have to provide a hostfile.
* ``--starting-port`` defines the port to bind to on the remote hosts.
Specifically rank 0 for the first IP will use this port and each subsequent
IP or rank will add 1 to this port.
* ``--connections-per-ip`` allows us to increase the number of connections
between neighboring nodes. This corresponds to ``--mca btl_tcp_links 2`` for
``mpirun``.
.. _jaccl_specifics:
JACCL Specifics
^^^^^^^^^^^^^^^^
The :ref:`JACCL <jaccl_section>` backend can be selected with the argument
``--backend jaccl``. A hostfile is necessary to launch with this backend
because it needs to contain the RDMA devices connecting each node to each other
node.
NCCL Specifics
^^^^^^^^^^^^^^
The :ref:`NCCL <nccl_section>` backend is the default backend for CUDA
environments. When launching from a Mac to a Linux machine with CUDA, the
backend should be selected explicitly with ``--backend nccl``.
The ``--repeat-hosts, -n`` argument should be used to launch multi-node and
multi-gpu jobs. For instance
.. code-block::
mlx.launch --backend nccl --hosts linux-1,linux-2 -n 8 --no-verify-script -- ./my-job.sh
will attempt to launch 16 processes, 8 on each node, all running ``my-job.sh``.
.. _mpi_specifics:
MPI Specifics
-------------
^^^^^^^^^^^^^
One can use MPI by passing ``--backend mpi`` to ``mlx.launch``. In that case,
``mlx.launch`` is a thin wrapper over ``mpirun``. Moreover,
@@ -83,23 +232,3 @@ to choose a specific interface for the byte-transfer-layer of MPI we can call
.. code:: shell
mlx.launch --backend mpi --mpi-arg '--mca btl_tcp_if_include en0' --hostfile hosts.json my_script.py
.. _ring_specifics:
Ring Specifics
--------------
The ring backend, which is also the default backend, can be explicitly selected
with the argument ``--backend ring``. The ring backend has some specific
requirements and arguments that are different to MPI:
* The argument ``--hosts`` only accepts IPs and not hostnames. If we need to
ssh to a hostname that does not correspond to the IP we want to bind to we
have to provide a hostfile.
* ``--starting-port`` defines the port to bind to on the remote hosts.
Specifically rank 0 for the first IP will use this port and each subsequent
IP or rank will add 1 to this port.
* ``--connections-per-ip`` allows us to increase the number of connections
between neighboring nodes. This corresponds to ``--mca btl_tcp_links 2`` for
``mpirun``.

View File

@@ -106,6 +106,8 @@ class IPConfigurator:
for src_port, p in enumerate(h.ports):
if not p.connected_to:
continue
if p.connected_to not in uuid_reverse_index:
continue
if (src_node, src_port) in assigned:
continue
@@ -241,7 +243,7 @@ def make_connectivity_matrix(tb_hosts, uuid_reverse_index):
for i, h in enumerate(tb_hosts):
c = [0] * len(tb_hosts)
for p in h.ports:
if p.connected_to is not None:
if p.connected_to in uuid_reverse_index:
j, _ = uuid_reverse_index[p.connected_to]
c[j] += 1
connectivity.append(c)

View File

@@ -49,7 +49,7 @@ class RemoteProcess(CommandProcess):
is_local = host == "127.0.0.1"
cmd = RemoteProcess.make_launch_script(rank, cwd, files, env, command)
if not is_local:
cmd = f"ssh {host} {shlex.quote(cmd)}"
cmd = f"ssh -tt -o LogLevel=QUIET {host} {shlex.quote(cmd)}"
self._host = host
self._pidfile = None
@@ -57,6 +57,7 @@ class RemoteProcess(CommandProcess):
self._process = Popen(
cmd,
shell=True,
executable="/bin/bash",
stdin=PIPE,
stdout=PIPE,
stderr=PIPE,
@@ -91,7 +92,14 @@ class RemoteProcess(CommandProcess):
cmd = RemoteProcess.make_kill_script(self._pidfile)
if not self._is_local:
cmd = f"ssh {self._host} {shlex.quote(cmd)}"
c = run(cmd, check=True, shell=True, capture_output=True, text=True)
c = run(
cmd,
check=True,
shell=True,
executable="/bin/bash",
capture_output=True,
text=True,
)
self._killed = c.stdout.strip() == "1"
@@ -99,10 +107,13 @@ class RemoteProcess(CommandProcess):
def make_launch_script(rank, cwd, files, env, command):
script = ""
# Disable echo
script = "stty -echo; "
# Write the PID to a file so we can kill the process if needed
script += "pidfile=$(mktemp); "
script += "echo $$ > $pidfile; "
script += "echo $pidfile; "
script += 'printf "%s\\n" $pidfile; '
# Change the working directory if one was requested. Otherwise attempt to
# change to the current one but don't fail if it wasn't possible.

View File

@@ -67,7 +67,7 @@ void init_distributed(nb::module_& parent_module) {
Args:
backend (str, optional): The name of the backend to check for availability.
It takes the same values as ``init()``. Default: ``any``.
It takes the same values as :func:`init()`. Default: ``"any"``.
Returns:
bool: Whether the distributed backend is available.