15 Commits

Author                 SHA1        Message                                                    Date
Anastasiia Filippova   4ee0d0bb55  removed nproc-per-node                                     2025-08-20 15:49:32 +02:00
Anastasiia Filippova   984cefb14d  CUDA_VISIBLE_DEVICES to local rank                         2025-08-09 01:43:14 +02:00
Anastasiia Filippova   dadf8d9c93  repeat host -> proc per node                               2025-08-07 15:09:46 +02:00
Anastasiia Filippova   062aa80b84  minor changer to mlx.launch                                2025-08-07 13:20:55 +02:00
Anastasiia Filippova   f540b1d612  nccl backend                                               2025-08-07 13:11:56 +02:00
Angelos Katharopoulos  a3a632d567  Fix the launcher when ran locally (#2147)                  2025-05-01 12:56:09 -07:00
Awni Hannun            68d1b3256b  nit: fix exception handling (#2066)                        2025-04-11 14:12:08 -07:00
Angelos Katharopoulos  4eef8102c9  Distributed layers (#1270)                                 2025-03-21 13:52:17 -07:00
Angelos Katharopoulos  0792ff02ff  Only fail when 10 consecutive socket errors occur (#1928)  2025-03-05 13:16:19 -08:00
Angelos Katharopoulos  a0737273d3  Allow debugging in distributed mode (#1920)                2025-03-04 13:01:10 -08:00
Angelos Katharopoulos  5d68082881  Ring docs (#1829)                                          2025-02-28 11:34:21 -08:00
Angelos Katharopoulos  607181644f  Add mlx.distributed_config script (#1902)                  2025-02-28 11:16:39 -08:00
Awni Hannun            83a0340fa7  allow command (#1836)                                      2025-02-06 10:32:24 -08:00
Awni Hannun            ec7c7def40  no line buffer for mpi jobs (#1825)                        2025-02-03 12:02:15 -08:00
Angelos Katharopoulos  ded914f442  Small distributed launch helper (#1810)                    2025-01-29 17:55:04 -08:00