
:orphan:

.. _usage_launch_distributed:

Launching Distributed Programs
==============================

.. currentmodule:: mlx.core.distributed

Installing the MLX Python package provides a helper script, ``mlx.launch``, that
runs Python scripts distributed across several nodes. It can launch with either
the MPI backend or the ring backend. See the
:doc:`distributed docs <distributed>` for details on each backend.

Usage
-----

The minimal usage example of ``mlx.launch`` is:

.. code:: shell

    mlx.launch --hosts ip1,ip2 my_script.py

or, for testing on localhost:

.. code:: shell

    mlx.launch -n 2 my_script.py

The ``mlx.launch`` command connects to each of the provided hosts and launches
the input script there. It monitors the launched processes and terminates the
rest if one of them fails unexpectedly or if ``mlx.launch`` itself is
terminated. It also forwards the stdout and stderr of each remote process to
its own stdout and stderr, respectively.

Providing Hosts
^^^^^^^^^^^^^^^^

Hosts can be provided as command line arguments, as above, but a JSON hostfile
is the only way to fully define a list of hosts. The hostfile has a simple
schema: a list of objects, each defining a host by the hostname to ssh to and a
list of IPs to use for communication.

.. code:: json

    [
        {"ssh": "hostname1", "ips": ["123.123.1.1", "123.123.2.1"]},
        {"ssh": "hostname2", "ips": ["123.123.1.2", "123.123.2.2"]}
    ]

You can use ``mlx.distributed_config --over ethernet`` to create a hostfile
with IPs corresponding to the ``en0`` interface.
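
If you prefer to write the hostfile yourself, the schema above is easy to
produce from Python. The following is a minimal sketch, not part of MLX: the
helper name ``build_hostfile`` is hypothetical and the hostnames and IPs are
the placeholder values from the example above.

.. code:: python

    import json


    def build_hostfile(hosts):
        """Return hostfile entries for a list of (ssh_name, ips) pairs."""
        # Each entry follows the schema above: "ssh" is the hostname to
        # ssh to and "ips" is the list of IPs to use for communication.
        return [{"ssh": ssh_name, "ips": list(ips)} for ssh_name, ips in hosts]


    # Placeholder hosts, copied from the JSON example above.
    hosts = [
        ("hostname1", ["123.123.1.1", "123.123.2.1"]),
        ("hostname2", ["123.123.1.2", "123.123.2.2"]),
    ]
    hostfile_text = json.dumps(build_hostfile(hosts), indent=4)
    print(hostfile_text)

Saving the printed text to a file such as ``hosts.json`` gives a file that can
be passed to ``mlx.launch`` with ``--hostfile``.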

Setting up Remote Hosts
^^^^^^^^^^^^^^^^^^^^^^^^

To launch the script on each host, every host must be reachable via ssh.
Moreover, the input script and the Python binary must exist on every host at
the same path. The following checklist is useful for debugging errors:

* ``ssh hostname`` works without asking for a password or host confirmation.
* The Python binary is available on all hosts at the same path. You can use
  ``mlx.launch --print-python`` to see what that path is.
* The script you want to run is available on all hosts at the same path.
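
The first item of the checklist can be automated. The sketch below is one
possible approach, not something ``mlx.launch`` provides: the helper names
``ssh_check_command`` and ``can_ssh`` are hypothetical. It assumes an OpenSSH
client, whose ``BatchMode=yes`` option makes ``ssh`` fail rather than prompt
for input.

.. code:: python

    import subprocess


    def ssh_check_command(hostname, timeout_s=5):
        """Build an ssh command that fails instead of prompting."""
        return [
            "ssh",
            "-o", "BatchMode=yes",  # never prompt for a password or passphrase
            "-o", f"ConnectTimeout={timeout_s}",
            hostname,
            "true",  # trivial remote command; only the exit status matters
        ]


    def can_ssh(hostname):
        """Return True if a non-interactive ssh to hostname succeeds."""
        result = subprocess.run(ssh_check_command(hostname), capture_output=True)
        return result.returncode == 0

Calling ``can_ssh("hostname1")`` for every ``ssh`` entry of the hostfile shows
which hosts still need keys or host confirmation set up.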

.. _mpi_specifics:

MPI Specifics
-------------

One can use MPI by passing ``--backend mpi`` to ``mlx.launch``. In that case,
``mlx.launch`` is a thin wrapper over ``mpirun``. Moreover,

* The IPs in the hostfile are ignored
* The ssh connectivity requirement is stronger as every node needs to be able
  to connect to every other node
* ``mpirun`` needs to be available on every node at the same path

Finally, one can pass arguments to ``mpirun`` using ``--mpi-arg``. For instance,
to choose a specific interface for the byte-transfer layer (BTL) of MPI, we can
call ``mlx.launch`` as follows:

.. code:: shell

    mlx.launch --backend mpi --mpi-arg '--mca btl_tcp_if_include en0' --hostfile hosts.json my_script.py
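
Note that the value of ``--mpi-arg`` contains spaces, so it has to reach
``mlx.launch`` as a single argument. When assembling the command from Python,
for example to run it with ``subprocess``, keeping it as a list of arguments
avoids quoting mistakes. The sketch below uses only the flags shown above and
the standard library's ``shlex.join`` (Python 3.8 or later) to print the
equivalent shell command.

.. code:: python

    import shlex

    command = [
        "mlx.launch",
        "--backend", "mpi",
        # One list element, so mlx.launch receives it as one argument.
        "--mpi-arg", "--mca btl_tcp_if_include en0",
        "--hostfile", "hosts.json",
        "my_script.py",
    ]
    # shlex.join quotes the element containing spaces for the shell.
    print(shlex.join(command))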


.. _ring_specifics:

Ring Specifics
--------------

The ring backend, which is also the default backend, can be explicitly selected
with the argument ``--backend ring``. The ring backend has some specific
requirements and arguments that differ from MPI:

* The argument ``--hosts`` only accepts IPs and not hostnames. If we need to
  ssh to a hostname that does not correspond to the IP we want to bind to, we
  have to provide a hostfile.
* ``--starting-port`` defines the port to bind to on the remote hosts.
  Specifically, rank 0 uses this port for its first IP, and each subsequent
  IP or rank uses the previous port plus 1.
* ``--connections-per-ip`` allows us to increase the number of connections
  between neighboring nodes. For example, two connections per IP correspond to
  ``--mca btl_tcp_links 2`` for ``mpirun``.
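
To make the port numbering concrete, the sketch below implements one reading
of the ``--starting-port`` rule. It illustrates the description above and is
not the ``mlx.launch`` source: the function name ``ring_addresses`` is
hypothetical, the starting port 5000 is an arbitrary example value, and giving
each extra connection per IP its own port is an assumption.

.. code:: python

    def ring_addresses(hosts_ips, starting_port, connections_per_ip=1):
        """Assign one "ip:port" address per connection, counting ports upwards."""
        port = starting_port
        addresses = []
        for ips in hosts_ips:  # one entry per rank, in rank order
            rank_addresses = []
            for ip in ips:
                # Assumption: every connection on an IP takes the next port.
                for _ in range(connections_per_ip):
                    rank_addresses.append(f"{ip}:{port}")
                    port += 1
            addresses.append(rank_addresses)
        return addresses


    # IPs from the hostfile example; 5000 is an arbitrary starting port.
    example = ring_addresses(
        [["123.123.1.1", "123.123.2.1"], ["123.123.1.2", "123.123.2.2"]],
        starting_port=5000,
    )
    print(example)

Under these assumptions, rank 0 binds to ports 5000 and 5001 and rank 1 to
ports 5002 and 5003, so the chosen port range must be free on every host.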