diff --git a/docs/src/_static/distributed/m3-ultra-mesh-broken.png b/docs/src/_static/distributed/m3-ultra-mesh-broken.png
new file mode 100644
index 000000000..108ff58da
Binary files /dev/null and b/docs/src/_static/distributed/m3-ultra-mesh-broken.png differ
diff --git a/docs/src/_static/distributed/m3-ultra-mesh.png b/docs/src/_static/distributed/m3-ultra-mesh.png
new file mode 100644
index 000000000..e049c024c
Binary files /dev/null and b/docs/src/_static/distributed/m3-ultra-mesh.png differ
diff --git a/docs/src/usage/distributed.rst b/docs/src/usage/distributed.rst
index 0b83709e2..73c40f8b9 100644
--- a/docs/src/usage/distributed.rst
+++ b/docs/src/usage/distributed.rst
@@ -7,22 +7,29 @@ Distributed Communication
 
 MLX supports distributed communication operations that allow the computational
 cost of training or inference to be shared across many physical machines. At the
-moment we support three different communication backends:
+moment we support several different communication backends, introduced below.
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Backend
+     - Description
+   * - :ref:`MPI <mpi_section>`
+     - A full-featured and mature distributed communications library.
+   * - :ref:`RING <ring_section>`
+     - Ring all-reduce and all-gather over TCP sockets. Always available and
+       usually faster than MPI.
+   * - :ref:`JACCL <jaccl_section>`
+     - Low-latency communication with RDMA over Thunderbolt. Necessary for
+       things like tensor parallelism.
+   * - :ref:`NCCL <nccl_section>`
+     - The backend of choice for CUDA environments.
 
-* `MPI <https://en.wikipedia.org/wiki/Message_Passing_Interface>`_ a
-  full-featured and mature distributed communications library
-* A **ring** backend of our own that uses native TCP sockets. It should be
-  faster for thunderbolt connections, but it also works over Ethernet.
-* `nccl <https://developer.nvidia.com/nccl>`_, for use in CUDA environments.
 
 The list of all currently supported operations and their documentation can be
 seen in the :ref:`API docs <distributed>`.
 
-.. note::
-   Some operations may not be supported or not as fast as they should be.
-   We are adding more and tuning the ones we have as we are figuring out the
-   best way to do distributed computing on Macs using MLX.
-
 Getting Started
 ---------------
 
@@ -85,7 +92,7 @@ Selecting Backend
 ^^^^^^^^^^^^^^^^^
 
 You can select the backend you want to use when calling :func:`init` by passing
-one of ``{'any', 'ring', 'mpi', 'nccl'}``. When passing ``any``, MLX will try all
+one of ``{'any', 'ring', 'jaccl', 'mpi', 'nccl'}``. When passing ``any``, MLX will try all
 available backends. If they all fail then a singleton group is created.
 
 .. note::
@@ -192,16 +199,247 @@ almost identical to the example above:
 
     loss = step(model, x, y)
     mx.eval(loss, model.parameters())
 
+.. _ring_section:
+
+Getting Started with Ring
+-------------------------
+
+The ring backend does not depend on any third party library so it is always
+available. It uses TCP sockets, so the nodes need to be reachable over a
+network. As the name suggests, the nodes are connected in a ring: rank 1 can
+only communicate with ranks 0 and 2, rank 2 only with ranks 1 and 3, and so
+on. As a result, :func:`send` and :func:`recv` with arbitrary sender and
+receiver are not supported by the ring backend.
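+Collective operations such as :func:`all_sum` are implemented with ring
+algorithms and work as usual. As a minimal sketch (essentially the minimal
+example from the beginning of this page, but explicitly requesting the ring
+backend when calling :func:`init`):
+
+.. code:: python
+
+    import mlx.core as mx
+
+    # Explicitly request the ring backend instead of trying all of them.
+    world = mx.distributed.init(backend="ring")
+
+    # Every rank contributes an array of ones and receives the element-wise
+    # sum across all ranks.
+    x = mx.distributed.all_sum(mx.ones(10))
+    print(world.rank(), x)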
+
+Defining a Ring
+^^^^^^^^^^^^^^^
+
+The easiest way to define and use a ring is via a JSON hostfile and the
+``mlx.launch`` :doc:`helper script <launching_distributed>`. For each node, the
+hostfile defines a hostname to ssh into (in order to run commands on that node)
+and one or more IPs that the node will listen on for connections.
+
+For example, the hostfile below defines a 4-node ring. ``hostname1`` will be
+rank 0, ``hostname2`` rank 1, and so on.
+
+.. code:: json
+
+    [
+        {"ssh": "hostname1", "ips": ["123.123.123.1"]},
+        {"ssh": "hostname2", "ips": ["123.123.123.2"]},
+        {"ssh": "hostname3", "ips": ["123.123.123.3"]},
+        {"ssh": "hostname4", "ips": ["123.123.123.4"]}
+    ]
+
+Assuming the hostfile is saved as ``ring-4.json``, running ``mlx.launch
+--hostfile ring-4.json my_script.py`` will ssh into each node and run the
+script, which will listen for connections on each of the provided IPs.
+Specifically, ``hostname1`` will connect to ``123.123.123.2`` and accept a
+connection from ``123.123.123.4``, and so on around the ring.
+
+Thunderbolt Ring
+^^^^^^^^^^^^^^^^
+
+Although the ring backend can have benefits over MPI even over Ethernet, its
+main purpose is to use Thunderbolt rings for higher-bandwidth communication.
+Setting up such Thunderbolt rings can be done manually, but it is a relatively
+tedious process. To simplify it, we provide the utility
+``mlx.distributed_config``.
+
+To use ``mlx.distributed_config``, your computers need to be accessible by ssh
+via Ethernet or Wi-Fi. Subsequently, connect them with Thunderbolt cables and
+call the utility as follows:
+
+.. code:: shell
+
+    mlx.distributed_config --verbose --hosts host1,host2,host3,host4 --backend ring
+
+By default the script will attempt to discover the Thunderbolt ring and provide
+you with the commands to configure each node, as well as the ``hostfile.json``
+to use with ``mlx.launch``. If password-less ``sudo`` is available on the nodes
+then ``--auto-setup`` can be used to configure them automatically.
+
+If you want to go through the process manually, the steps are as follows:
+
+* Disable the Thunderbolt bridge interface.
+* For the cable connecting rank ``i`` to rank ``i + 1``, find the interfaces
+  corresponding to that cable on nodes ``i`` and ``i + 1``.
+* Set up a unique subnetwork connecting the two nodes on the corresponding
+  interfaces. For instance, if the cable corresponds to ``en2`` on node ``i``
+  and to ``en2`` on node ``i + 1`` as well, then we may assign the IPs
+  ``192.168.0.1`` and ``192.168.0.2`` respectively to the two nodes. For more
+  details you can see the commands prepared by the utility script.
+
+.. _jaccl_section:
+
+Getting Started with RDMA over Thunderbolt
+------------------------------------------
+
+Starting with macOS 26.2, RDMA over Thunderbolt is available and enables
+low-latency communication between Macs with Thunderbolt 5. MLX provides the
+JACCL backend, which uses this functionality to achieve communication latency
+an order of magnitude lower than the ring backend.
+
+.. note::
+
+   The name JACCL (pronounced Jackal) stands for *Jack and Angelos' Collective
+   Communication Library*. It is an obvious pun on NVIDIA's NCCL but also a
+   tribute to *Jack Beasley*, who led the development of RDMA over Thunderbolt
+   at Apple.
+
+Enabling RDMA
+^^^^^^^^^^^^^
+
+Until the feature matures, enabling RDMA over Thunderbolt is slightly more
+involved and **cannot** be done remotely, even with sudo. It has to be done
+from macOS recovery:
+
+1. Start your computer in recovery mode.
+2. Open the Terminal by going to Utilities -> Terminal.
+3. Run ``rdma_ctl enable``.
+4. Reboot.
+
+To verify that you have successfully enabled Thunderbolt RDMA, you can run
+``ibv_devices``, which should produce something like the following on an M3
+Ultra.
+
+.. code-block:: bash
+
+    ~ % ibv_devices
+        device                 node GUID
+        ------              ----------------
+        rdma_en2            8096a9d9edbaac05
+        rdma_en3            8196a9d9edbaac05
+        rdma_en5            8396a9d9edbaac05
+        rdma_en4            8296a9d9edbaac05
+        rdma_en6            8496a9d9edbaac05
+        rdma_en7            8596a9d9edbaac05
+
+Defining a Mesh
+^^^^^^^^^^^^^^^
+
+The JACCL backend supports only fully connected topologies. Namely, there needs
+to be a Thunderbolt cable directly connecting every pair of Macs. For example,
+of the two topology visualizations below the first one is valid, because there
+is a connection from any node to any other node, while the second one is not,
+because M3 Ultra 1 is not connected to M3 Ultra 2.
+
+.. figure:: ../_static/distributed/m3-ultra-mesh.png
+   :alt: M3 Ultra thunderbolt mesh
+
+   Fully connected mesh of four M3 Ultra.
+
+.. figure:: ../_static/distributed/m3-ultra-mesh-broken.png
+   :alt: M3 Ultra broken thunderbolt mesh
+
+   Not a valid mesh (M3 Ultra 1 is not connected to M3 Ultra 2).
+
+Similar to the ring backend, the easiest way to use JACCL with MLX is to write
+a JSON hostfile to be used with ``mlx.launch``. The hostfile needs to contain:
+
+- Hostnames to use for launching scripts via ssh
+- An IP for rank 0 that is reachable by all nodes
+- A list of RDMA devices that connect each node to every other node
+
+The ``rdma`` list of node ``i`` has one entry per rank: entry ``j`` names the
+RDMA device on node ``i`` that is cabled to node ``j``, and the entry for the
+node itself is ``null``. The following JSON defines the valid 4-node mesh from
+the first image above.
+
+.. code-block:: json
+
+    [
+        {
+            "ssh": "m3-ultra-1",
+            "ips": ["123.123.123.1"],
+            "rdma": [null, "rdma_en5", "rdma_en4", "rdma_en3"]
+        },
+        {
+            "ssh": "m3-ultra-2",
+            "ips": [],
+            "rdma": ["rdma_en5", null, "rdma_en3", "rdma_en4"]
+        },
+        {
+            "ssh": "m3-ultra-3",
+            "ips": [],
+            "rdma": ["rdma_en4", "rdma_en3", null, "rdma_en5"]
+        },
+        {
+            "ssh": "m3-ultra-4",
+            "ips": [],
+            "rdma": ["rdma_en3", "rdma_en4", "rdma_en5", null]
+        }
+    ]
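+
+As a quick sanity check before launching, a short script can verify that a
+hostfile describes a fully connected mesh, i.e. that every pair of distinct
+nodes is connected by an RDMA device. This is only an illustrative sketch and
+not part of MLX:
+
+.. code:: python
+
+    import json
+    import sys
+
+    # Load the hostfile given as the first argument, e.g. m3-ultra-jaccl.json.
+    with open(sys.argv[1]) as f:
+        nodes = json.load(f)
+
+    n = len(nodes)
+    for i, node in enumerate(nodes):
+        rdma = node["rdma"]
+        if len(rdma) != n:
+            sys.exit(f"node {i}: expected {n} rdma entries, got {len(rdma)}")
+        for j, dev in enumerate(rdma):
+            if i == j:
+                if dev is not None:
+                    sys.exit(f"node {i}: the entry for the node itself should be null")
+            elif dev is None:
+                sys.exit(f"nodes {i} and {j} are not connected by an RDMA device")
+    print(f"Valid fully connected mesh of {n} nodes")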
+
+Even though TCP/IP is not used when communicating over Thunderbolt RDMA,
+disabling the Thunderbolt bridge is still required, as is setting up isolated
+local networks for each Thunderbolt connection.
+
+All of the above can be done instead via ``mlx.distributed_config``. The helper
+script will
+
+- ssh into each node
+- extract the Thunderbolt connectivity
+- check for a valid mesh
+- provide the commands to configure each node (or run them if sudo is available)
+- generate the hostfile to be used with ``mlx.launch``
+
+Putting it All Together
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Launching a distributed MLX script that uses JACCL is fairly simple if the
+nodes are reachable via ssh and have password-less sudo.
+
+First, connect all the Thunderbolt cables. Then we can verify the connections
+by using the ``mlx.distributed_config`` script to visualize them.
+
+.. code-block:: bash
+
+    mlx.distributed_config --verbose \
+        --hosts m3-ultra-1,m3-ultra-2,m3-ultra-3,m3-ultra-4 \
+        --over thunderbolt --dot | dot -Tpng | open -f -a Preview
+
+After making sure that everything looks right, we can auto-configure the nodes
+and save the hostfile to ``m3-ultra-jaccl.json`` by running:
+
+.. code-block:: bash
+
+    mlx.distributed_config --verbose \
+        --hosts m3-ultra-1,m3-ultra-2,m3-ultra-3,m3-ultra-4 \
+        --over thunderbolt --backend jaccl \
+        --auto-setup --output m3-ultra-jaccl.json
+
+Now we are ready to run a distributed MLX script, such as distributed inference
+of a gigantic model using MLX-LM. Note the important ``MLX_METAL_FAST_SYNCH=1``
+environment variable passed via ``--env``.
+
+.. code-block:: bash
+
+    mlx.launch --verbose --backend jaccl --hostfile m3-ultra-jaccl.json \
+        --env MLX_METAL_FAST_SYNCH=1 -- \
+        /path/to/remote/python -m mlx_lm chat --model mlx-community/DeepSeek-V3.2-8bit --shard
+
+.. note::
+
+   Setting the environment variable ``MLX_METAL_FAST_SYNCH=1`` enables a
+   different, faster way of synchronizing between the GPU and the CPU. It is
+   not specific to the JACCL backend and can be used in all cases where the
+   CPU and GPU need to collaborate on some computation. It is pretty critical
+   for low-latency communication, since the communication is done by the CPU.
+
+.. _nccl_section:
+
+Getting Started with NCCL
+-------------------------
+
+The NCCL backend is the backend of choice for CUDA environments. It can be
+selected by passing ``backend='nccl'`` to :func:`init`.
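+
+As with the other backends, a minimal sketch of a script using NCCL on a CUDA
+machine (assuming MLX was built with NCCL support) looks as follows:
+
+.. code:: python
+
+    import mlx.core as mx
+
+    # Request the NCCL backend explicitly instead of trying all backends.
+    world = mx.distributed.init(backend="nccl")
+
+    x = mx.distributed.all_sum(mx.ones(10))
+    print(world.rank(), x)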
+
+.. _mpi_section:
 
 Getting Started with MPI
 ------------------------
 
-MLX already comes with the ability to "talk" to MPI if it is installed on the
-machine. Launching distributed MLX programs that use MPI can be done with
-``mpirun`` as expected. However, in the following examples we will be using
-``mlx.launch --backend mpi`` which takes care of some nuisances such as setting
-absolute paths for the ``mpirun`` executable and the ``libmpi.dyld`` shared
-library.
+MLX already comes with the ability to "talk" to `MPI
+<https://en.wikipedia.org/wiki/Message_Passing_Interface>`_ if it is installed
+on the machine. Launching distributed MLX programs that use MPI can be done
+with ``mpirun`` as expected. However, in the following examples we will be
+using ``mlx.launch --backend mpi``, which takes care of some nuisances such as
+setting absolute paths for the ``mpirun`` executable and the ``libmpi.dyld``
+shared library.
 
 The simplest possible usage is the following which, assuming the minimal
 example in the beginning of this page, should result in:
@@ -269,78 +507,9 @@ Force MPI to use the most performant network interface by setting ``--mca
 btl_tcp_if_include <iface>`` where ``<iface>`` should be the interface you want
 to use.
 
-Getting Started with Ring
+Distributed Without ``mlx.launch``
+----------------------------------
+
+
+Using the helper scripts
 -------------------------
-
-The ring backend does not depend on any third party library so it is always
-available. It uses TCP sockets so the nodes need to be reachable via a network.
-As the name suggests the nodes are connected in a ring which means that rank 1
-can only communicate with rank 0 and rank 2, rank 2 only with rank 1 and rank 3
-and so on and so forth. As a result :func:`send` and :func:`recv` with
-arbitrary sender and receiver is not supported in the ring backend.
-
-Defining a Ring
-^^^^^^^^^^^^^^^
-
-The easiest way to define and use a ring is via a JSON hostfile and the
-``mlx.launch`` :doc:`helper script <launching_distributed>`. For each node one
-defines a hostname to ssh into to run commands on this node and one or more IPs
-that this node will listen to for connections.
-
-For example the hostfile below defines a 4 node ring. ``hostname1`` will be
-rank 0, ``hostname2`` rank 1 etc.
-
-.. code:: json
-
-    [
-        {"ssh": "hostname1", "ips": ["123.123.123.1"]},
-        {"ssh": "hostname2", "ips": ["123.123.123.2"]},
-        {"ssh": "hostname3", "ips": ["123.123.123.3"]},
-        {"ssh": "hostname4", "ips": ["123.123.123.4"]}
-    ]
-
-Running ``mlx.launch --hostfile ring-4.json my_script.py`` will ssh into each
-node, run the script which will listen for connections in each of the provided
-IPs. Specifically, ``hostname1`` will connect to ``123.123.123.2`` and accept a
-connection from ``123.123.123.4`` and so on and so forth.
-
-Thunderbolt Ring
-^^^^^^^^^^^^^^^^
-
-Although the ring backend can have benefits over MPI even for Ethernet, its
-main purpose is to use Thunderbolt rings for higher bandwidth communication.
-Setting up such thunderbolt rings can be done manually, but is a relatively
-tedious process. To simplify this, we provide the utility ``mlx.distributed_config``.
-
-To use ``mlx.distributed_config`` your computers need to be accessible by ssh via
-Ethernet or Wi-Fi. Subsequently, connect them via thunderbolt cables and then call the
-utility as follows:
-
-.. code:: shell
-
-    mlx.distributed_config --verbose --hosts host1,host2,host3,host4
-
-By default the script will attempt to discover the thunderbolt ring and provide
-you with the commands to configure each node as well as the ``hostfile.json``
-to use with ``mlx.launch``. If password-less ``sudo`` is available on the nodes
-then ``--auto-setup`` can be used to configure them automatically.
-
-To validate your connection without configuring anything
-``mlx.distributed_config`` can also plot the ring using DOT format.
-
-.. code:: shell
-
-    mlx.distributed_config --verbose --hosts host1,host2,host3,host4 --dot >ring.dot
-    dot -Tpng ring.dot >ring.png
-    open ring.png
-
-If you want to go through the process manually, the steps are as follows:
-
-* Disable the thunderbolt bridge interface
-* For the cable connecting rank ``i`` to rank ``i + 1`` find the interfaces
-  corresponding to that cable in nodes ``i`` and ``i + 1``.
-* Set up a unique subnetwork connecting the two nodes for the corresponding
-  interfaces. For instance if the cable corresponds to ``en2`` on node ``i``
-  and ``en2`` also on node ``i + 1`` then we may assign IPs ``192.168.0.1`` and
-  ``192.168.0.2`` respectively to the two nodes. For more details you can see
-  the commands prepared by the utility script.