At a glance: what you need before starting Axolotl multi-node fine-tuning
At a Glance: Axolotl supports three multi-node launch paths: Accelerate, Ray Train, and torchrun. Prereqs: a multi-node cluster with matching Axolotl commits, identical model config files on every node, and TCP reachability from every worker to the main machine on the rendezvous port.
Axolotl supports three paths for multi-node fine-tuning: Accelerate (with FSDP or FSDP2), Ray Train, and torchrun. The official docs describe their scope plainly: "How to use Axolotl on multiple machines." Each path has a different rendezvous mechanism, different failure modes, and a different operational fit. For InfiniBand fabrics specifically, the docs recommend torchrun to utilize the full IB bandwidth. With any launcher, NCCL can route collective traffic through the wrong NIC if its environment variables are not set explicitly, and the extra Accelerate layer makes that harder to spot.
The three non-negotiable preconditions before any launcher fires:
- Every node runs the same Axolotl commit.
- Every node has an identical copy of the model config file.
- The head node (rank 0) is TCP-reachable from every worker node on the rendezvous port.
Miss any of these and the job either fails silently, hangs at rendezvous, or produces nondeterministic results that waste cluster budget.
Prerequisites, cluster setup, and version parity across nodes
As the Axolotl multi-node documentation states: "Each machine needs a copy of Axolotl, we suggest using the same commit to ensure compatibility. You will also need to have the same configuration file for your model on each machine." The same docs also make the reachability requirement explicit: "Make sure the main machine is reachable by other machines."
Install Axolotl from source on every node, pinning to the same Git SHA:
$ git clone https://github.com/axolotl-ai-cloud/axolotl.git
$ cd axolotl
$ git checkout <SHARED_COMMIT_SHA> # pin to the same SHA on all nodes
$ pip install --no-build-isolation -e ".[flash-attn]" # FSDP ships with PyTorch, so no separate extra is needed
After installation, distribute the model config. If your cluster has a shared filesystem (NFS, Lustre, GPFS), place the config there and reference it by the shared path on all nodes. Without a shared filesystem, use rsync or a provisioning tool to copy the config file to an identical absolute path on each node before launch.
Production Note: Version parity is not optional. Two nodes running different Axolotl commits can have subtly incompatible serialization formats, different FSDP wrapping logic, or mismatched tokenizer behavior, and any of these produces failures that look like network or NCCL issues during debugging. Pin the SHA in your provisioning script, not in a README. Config parity is equally strict: a single differing field (e.g., micro_batch_size or gradient_accumulation_steps) between nodes will either crash the run or silently corrupt gradient accumulation. Verify with md5sum your_config.yaml on all nodes before launch. The rendezvous port (default 29500) must also be open between nodes; confirm this with firewall rules and nc checks before submitting a job.
What must match on every node
Every node in the cluster must carry identical copies of:
- Axolotl commit: use git rev-parse HEAD on each node and compare
- Model configuration file: same absolute path, identical content
- Python and PyTorch versions: version skew causes silent ABI mismatches
- CUDA and driver versions: NCCL is sensitive to driver-level differences
- Dataset path or remote reference: preprocessing must produce the same token sequences
Production Note: Do not mix Axolotl commits across nodes, even for patch releases. The docs are explicit: "we suggest using the same commit to ensure compatibility." If you must upgrade mid-campaign, rebuild all nodes simultaneously. Config drift is a common cause of multi-node launch failure alongside network misconfiguration, so run diff or md5sum on the config across nodes as a preflight step.
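The parity checks above can be automated once each node reports a small fingerprint. The following is a minimal sketch, not part of Axolotl: the function names and the fingerprint fields (axolotl_commit, torch, config_sha256) are hypothetical, and it assumes you have already collected the values from each node, for example over SSH.

```python
import hashlib
from pathlib import Path


def file_sha256(path: str) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_drift(fingerprints: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Return {field: {node: value}} for every field that differs across nodes."""
    fields = sorted({key for values in fingerprints.values() for key in values})
    drift = {}
    for field in fields:
        per_node = {
            node: values.get(field, "<missing>")
            for node, values in fingerprints.items()
        }
        if len(set(per_node.values())) > 1:
            drift[field] = per_node
    return drift


if __name__ == "__main__":
    # Hypothetical fingerprints gathered from two nodes.
    nodes = {
        "node0": {"axolotl_commit": "abc123", "torch": "2.5.1", "config_sha256": "f00d"},
        "node1": {"axolotl_commit": "abc123", "torch": "2.4.0", "config_sha256": "f00d"},
    }
    for field, values in find_drift(nodes).items():
        print(f"DRIFT in {field}: {values}")
```

An empty result means every reported field matches on every node; any entry names the field and shows which node carries which value.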
Network and access checks before launch
Verify reachability before any launcher fires. A stalled job that hangs at rendezvous for 10 minutes is almost always a TCP connectivity problem, not a software bug.
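# From each worker node: verify the head node is reachable on the rendezvous port.
# "Connection refused" before launch means the host is reachable but nothing is listening yet;
# a timeout points to a firewall or routing problem.
$ nc -zv <HEAD_NODE_IP> 29500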
# Confirm all nodes can reach each other (replace with actual node IPs)
$ for node in 10.0.0.1 10.0.0.2 10.0.0.3; do
ssh $node "hostname && nvidia-smi -L | wc -l"
done
# Check which NIC is active on each node — needed for NCCL_SOCKET_IFNAME
$ ip link show | grep -E '^[0-9]+:' | awk '{print $2}'
# Quick IB device check on each node
$ ibstat | grep -E '(CA |State)'
As the PyTorch multinode training tutorial notes: "Sometimes you might need to explicitly set the network interface for the distributed backend (export NCCL_SOCKET_IFNAME=eth0)." Identify the correct NIC names during this preflight phase — you will need them for the NCCL environment variable block.
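When the nc check needs to run from a script, the same probe can be written with the standard library. This is a minimal sketch under the assumption that a plain TCP connect is an adequate reachability test; probe_port is a hypothetical helper name, not an Axolotl or PyTorch API.

```python
import socket


def probe_port(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer at all: commonly a firewall drop or a routing problem.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
```

Calling probe_port("<HEAD_NODE_IP>", 29500) from each worker separates the two cases that matter before launch: "refused" is expected until the head node's launcher is listening, while "timeout" suggests the port is filtered between the nodes.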
Run Axolotl with Accelerate and FSDP2 on multiple nodes
The Accelerate path works by writing a static rendezvous configuration to ~/.cache/huggingface/accelerate/default_config.yaml (or a path you specify) on each node, then calling axolotl train normally on every node. Accelerate does not discover the cluster topology; it reads the node count, this node's rank, and the head node's address from that file and uses them to join the process group. The FSDP2 backend handles parameter sharding across GPU ranks.
Accelerate rendezvous values that must be exact
Write a per-node Accelerate config. The rendezvous fields must be exact — a single wrong value prevents any rank from joining the group.
On node 0 (rank 0 / head node):
# accelerate_config_node0.yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
machine_rank: 0 # unique integer per node; 0 = head
main_process_ip: "10.0.0.1" # head node IP, reachable from all workers
main_process_port: 29500 # must be open in firewall on head node
num_machines: 2 # total node count
num_processes: 16 # total GPU count across all nodes (e.g., 2 nodes × 8 GPUs)
On node 1 (rank 1 / worker node): copy the file and change only machine_rank: 1. All other fields remain identical.
# accelerate_config_node1.yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
machine_rank: 1
main_process_ip: "10.0.0.1" # still points to head node — do not change
main_process_port: 29500
num_machines: 2
num_processes: 16
num_processes is the total process count for the whole job, not the per-node count, and must equal num_machines × GPUs_per_node. Getting this wrong is a common rendezvous mistake: if the value is too high, Accelerate waits indefinitely for a rank count that will never be satisfied.
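The rules in this section (unique machine_rank, identical head-node address and port, and a process total that matches the hardware) are easy to check mechanically before launch. The sketch below assumes each node's Accelerate settings have been loaded into a plain dict and that every node has the same GPU count; check_rendezvous is a hypothetical helper, not an Accelerate function.

```python
def check_rendezvous(configs: list[dict], gpus_per_node: int) -> list[str]:
    """Return a list of problems found in per-node Accelerate rendezvous settings."""
    problems = []
    num_nodes = len(configs)

    # Each node needs a unique machine_rank covering 0..num_nodes-1 exactly once.
    ranks = sorted(cfg["machine_rank"] for cfg in configs)
    if ranks != list(range(num_nodes)):
        problems.append(f"machine_rank values {ranks} should be {list(range(num_nodes))}")

    # These fields must be identical on every node.
    for key in ("main_process_ip", "main_process_port", "num_machines", "num_processes"):
        values = sorted({str(cfg[key]) for cfg in configs})
        if len(values) > 1:
            problems.append(f"{key} differs across nodes: {values}")

    # num_processes is the job-wide total (assumes the same GPU count per node).
    expected_total = num_nodes * gpus_per_node
    for cfg in configs:
        rank = cfg["machine_rank"]
        if cfg["num_machines"] != num_nodes:
            problems.append(f"rank {rank}: num_machines={cfg['num_machines']}, expected {num_nodes}")
        if cfg["num_processes"] != expected_total:
            problems.append(f"rank {rank}: num_processes={cfg['num_processes']}, expected {expected_total}")
    return problems


if __name__ == "__main__":
    shared = {"main_process_ip": "10.0.0.1", "main_process_port": 29500,
              "num_machines": 2, "num_processes": 16}
    node_configs = [dict(shared, machine_rank=0), dict(shared, machine_rank=1)]
    print(check_rendezvous(node_configs, gpus_per_node=8) or "rendezvous settings look consistent")
```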
FSDP2 config for memory-heavy fine-tunes
The FSDP2 settings live inside the Axolotl model config, not the Accelerate config. The Axolotl multi-node documentation provides a reference configuration that you must treat as the starting point for memory-heavy fine-tunes:
# model_config.yaml: the FSDP2 block (same file on every node)
fsdp_version: 2
fsdp_config:
  offload_params: true # offload sharded parameters to CPU; saves HBM at cost of throughput
  state_dict_type: FULL_STATE_DICT # gather full weights on rank 0 for checkpointing
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer # match to your model architecture
  reshard_after_forward: true # release gathered parameters after the forward pass; reduces peak HBM
A few operational notes on these fields:
- offload_params: true trades throughput for memory capacity.
- state_dict_type: FULL_STATE_DICT gathers full weights on rank 0 before saving. This requires the head node to have enough CPU RAM to hold the full model. SHARDED_STATE_DICT saves each rank's shards separately and requires a merge step before the checkpoint can be used as a single model.
- transformer_layer_cls_to_wrap must match the model's actual decoder class name. For Llama-family models the value is LlamaDecoderLayer; for Mistral use MistralDecoderLayer; for Mixtral use MixtralDecoderLayer; for Falcon use FalconDecoderLayer. A name that is not present in the model typically fails at wrap time, while a class that exists but is not the decoder block makes FSDP wrap at the wrong granularity, degrading efficiency without an error.
- reshard_after_forward: true is the FSDP2-native behavior (equivalent to FULL_SHARD in FSDP1). Setting it to false keeps parameters gathered between the forward and backward passes, which saves communication but raises peak memory, so use it only when you have confirmed the headroom.
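The FULL_STATE_DICT note above implies a simple capacity check: rank 0 must hold every weight at once, while during training each rank only owns its shard. The arithmetic below is a rough sketch that counts weights only; it ignores gradients, optimizer state, activations, and framework overhead, and the 7B parameter count and bf16 width are illustrative assumptions.

```python
def full_state_dict_bytes(num_params: int, bytes_per_param: int) -> int:
    """Bytes needed to hold every weight in one place (rank 0 with FULL_STATE_DICT)."""
    return num_params * bytes_per_param


def sharded_bytes_per_rank(num_params: int, bytes_per_param: int, world_size: int) -> int:
    """Bytes of weights each rank owns when parameters are evenly sharded."""
    params_per_rank = -(-num_params // world_size)  # ceiling division
    return params_per_rank * bytes_per_param


if __name__ == "__main__":
    params = 7_000_000_000  # illustrative 7B-parameter model
    bf16 = 2                # bytes per parameter in bfloat16
    world_size = 16         # 2 nodes x 8 GPUs, as in the examples in this guide
    print(f"full gather on rank 0: {full_state_dict_bytes(params, bf16) / 1e9:.1f} GB")
    print(f"weights per rank: {sharded_bytes_per_rank(params, bf16, world_size) / 1e9:.3f} GB")
```

Comparing the first number against the head node's free CPU RAM before the run is cheaper than discovering the shortfall at the first checkpoint.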
Run Axolotl with torchrun for InfiniBand bandwidth
The Axolotl docs are direct: "If you are using Infiniband, we recommend torchrun to utilize the full bandwidth." The practical risk is NCCL's transport and NIC selection: without explicit environment variables, NCCL can end up carrying collective traffic over TCP sockets on an Ethernet interface even when IB is present, which cuts allreduce bandwidth and makes the job appear slow.
Set these environment variables on every node before the launcher fires:
# NCCL InfiniBand settings — export on all nodes before torchrun
$ export NCCL_IB_DISABLE=0 # keep IB transport enabled (0 is NCCL's default; setting it guards against an inherited 1)
$ export NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond" # restrict NCCL socket traffic to interfaces with these name prefixes
$ export NCCL_BUFFSIZE=2097152 # 2 MiB buffer; value given in the Axolotl docs
$ export NCCL_DEBUG=INFO # enable verbose logs for the first run
Replace the NCCL_SOCKET_IFNAME value with the actual interface prefix on your cluster. Run ip link show on each node to confirm the NIC names before you fix this value.
When torchrun beats accelerate on InfiniBand
The Accelerate launcher wraps torchrun internally but introduces an additional configuration layer that can interfere with NCCL's NIC selection. On IB fabrics, you want NCCL to own the transport decision directly — torchrun gives you that control because you set NCCL variables in the shell environment before the process group initializes, and they take effect without additional abstraction.
Pro Tip: On InfiniBand clusters, use torchrun with explicit NCCL environment variables rather than Accelerate's launcher. If NCCL silently falls back to TCP sockets because NCCL_IB_DISABLE was inherited as 1 or NCCL_SOCKET_IFNAME selects the management NIC, your allreduce throughput collapses and the job appears slow without emitting an obvious error. Torchrun also makes NCCL log parsing easier because launcher output and training output share the same process tree, so you can pipe both to a single log file and correlate timestamps.
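Because these variables are read from the shell environment, a stale value inherited from a container image or job template is easy to miss. The following sketch audits an environment mapping before launch; audit_nccl_env is a hypothetical helper, and the checks reflect only the variables discussed in this section, with an unset NCCL_IB_DISABLE treated as NCCL's documented default of 0.

```python
import os


def audit_nccl_env(env: dict[str, str]) -> list[str]:
    """Return warnings about NCCL settings that commonly cause silent fallbacks."""
    findings = []
    # Any value other than 0 turns the InfiniBand transport off.
    if env.get("NCCL_IB_DISABLE", "0") != "0":
        findings.append("NCCL_IB_DISABLE is non-zero: InfiniBand transport is disabled")
    # Without an explicit list, NCCL chooses a socket interface on its own.
    if not env.get("NCCL_SOCKET_IFNAME"):
        findings.append("NCCL_SOCKET_IFNAME is unset: NCCL will pick a socket interface itself")
    # Verbose logs are what make transport selection visible on a first run.
    if "NCCL_DEBUG" not in env:
        findings.append("NCCL_DEBUG is unset: transport selection will not be logged")
    return findings


if __name__ == "__main__":
    for finding in audit_nccl_env(dict(os.environ)):
        print("WARNING:", finding)
```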
New Axolotl CLI launcher versus legacy direct torchrun
The Axolotl docs label the new CLI as "Option 1: New Axolotl CLI with launcher args (Recommended)" and the direct torchrun path as "Option 2: Direct torchrun (Legacy)". Prefer the new CLI for new deployments.
Option 1: New Axolotl CLI (Recommended). Select the launcher with --launcher torchrun; everything after the bare -- is handed to torchrun as launcher arguments. Run on each node, changing only --node_rank per machine:
# Run on head node (node rank 0)
$ NCCL_IB_DISABLE=0 \
NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond" \
NCCL_BUFFSIZE=2097152 \
axolotl train model_config.yaml \
--launcher torchrun -- \
--nnodes 2 \
--nproc_per_node 8 \
--node_rank 0 \
--master_addr 10.0.0.1 \
--master_port 29500
# Run on worker node (node rank 1): same config, only --node_rank changes
$ NCCL_IB_DISABLE=0 \
NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond" \
NCCL_BUFFSIZE=2097152 \
axolotl train model_config.yaml \
--launcher torchrun -- \
--nnodes 2 \
--nproc_per_node 8 \
--node_rank 1 \
--master_addr 10.0.0.1 \
--master_port 29500
Option 2: Legacy direct torchrun. Preserve this for older pipelines or debugging comparisons:
# Legacy: direct torchrun on head node
$ NCCL_IB_DISABLE=0 \
NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond" \
NCCL_BUFFSIZE=2097152 \
torchrun \
--nnodes 2 \
--node-rank 0 \
--master-addr 10.0.0.1 \
--master-port 29500 \
--nproc-per-node 8 \
-m axolotl.cli.train model_config.yaml
# Legacy: direct torchrun on worker node
$ NCCL_IB_DISABLE=0 \
NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond" \
NCCL_BUFFSIZE=2097152 \
torchrun \
--nnodes 2 \
--node-rank 1 \
--master-addr 10.0.0.1 \
--master-port 29500 \
--nproc-per-node 8 \
-m axolotl.cli.train model_config.yaml
Both options require identical NCCL variables and identical model configs on all nodes. The new CLI encapsulates argument handling and is the path Axolotl will maintain going forward.
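Since the per-node commands differ only in the node rank, generating them from one place avoids copy-and-paste drift. The sketch below builds the legacy torchrun argument list for each node; torchrun_command is a hypothetical helper, and it uses torchrun's underscore flag spellings, which recent PyTorch releases accept alongside the hyphenated ones.

```python
import shlex


def torchrun_command(
    node_rank: int,
    num_nodes: int,
    gpus_per_node: int,
    master_addr: str,
    master_port: int,
    config_path: str,
) -> list[str]:
    """Build the torchrun argv for one node; only node_rank varies between nodes."""
    return [
        "torchrun",
        "--nnodes", str(num_nodes),
        "--node_rank", str(node_rank),
        "--nproc_per_node", str(gpus_per_node),
        "--master_addr", master_addr,
        "--master_port", str(master_port),
        "-m", "axolotl.cli.train", config_path,
    ]


if __name__ == "__main__":
    # Print one ready-to-paste command per node for a 2-node, 8-GPU-per-node job.
    for rank in range(2):
        argv = torchrun_command(rank, 2, 8, "10.0.0.1", 29500, "model_config.yaml")
        print(f"# node {rank}")
        print(shlex.join(argv))
```

Keeping the argv as a list also makes it straightforward to record the exact invocation alongside the run's logs.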
How to verify the job is actually distributed
A common silent failure: the job appears to run, loss decreases, but only node 0's GPUs are active. This happens when a misconfigured launcher defaults to local-only execution. Confirm distributed execution before trusting any results.
# Set verbose NCCL logging before launch — captures rendezvous and transport init
$ export NCCL_DEBUG=INFO
$ export NCCL_DEBUG_SUBSYS=ALL # optional; captures per-subsystem detail
# On each node, watch GPU utilization in real time
$ watch -n 2 nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
# Check process groups on each node after launch
$ ps aux | grep axolotl
As the PyTorch multinode training tutorial recommends: "Set env variable NCCL_DEBUG to INFO ... to print verbose logs that can help diagnose the issue." With NCCL_DEBUG=INFO, each rank prints its transport negotiation during communicator initialization, near the start of its log. Look for NET/IB lines to confirm InfiniBand is in use.
Signals that rendezvous succeeded
# Enable verbose NCCL before training begins
$ export NCCL_DEBUG=INFO
# Grep for key startup tokens in the log
$ axolotl train model_config.yaml ... 2>&1 | tee run.log
$ grep -E "NCCL INFO (NET/|Using network|.*Init COMPLETE)" run.log
Healthy startup produces log lines such as NCCL INFO NET/IB and NCCL INFO Using network IB (for InfiniBand), or NCCL INFO NET/Socket and NCCL INFO Using network Socket (fallback to TCP); exact wording varies by NCCL version. You should see an Init COMPLETE line from every rank, and the nranks value on those lines must match num_processes from your Accelerate config or --nnodes × --nproc-per-node from torchrun. Absence of timeout or Connection refused messages confirms the rendezvous port is open and reachable.
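The log check can be scripted once the logs from all nodes are merged into one file. The sketch below is a hedged example: it assumes NCCL INFO lines of the form "Using network <name>" and "rank <r> nranks <n> ... Init COMPLETE", which is how recent NCCL versions commonly log, but the exact wording varies by version, so treat the patterns as assumptions to adjust. summarize_nccl_log is a hypothetical helper.

```python
import re

# Assumed line shapes; adjust the patterns if your NCCL version logs differently.
NETWORK_RE = re.compile(r"NCCL INFO Using network (\S+)")
INIT_RE = re.compile(r"NCCL INFO .*\brank (\d+) nranks (\d+)\b.*Init COMPLETE")


def summarize_nccl_log(text: str) -> dict:
    """Summarize which transports were chosen and which ranks finished initializing."""
    networks = sorted(set(NETWORK_RE.findall(text)))
    ranks = set()
    world_sizes = set()
    for rank, nranks in INIT_RE.findall(text):
        ranks.add(int(rank))
        world_sizes.add(int(nranks))
    return {
        "networks": networks,
        "ranks_seen": sorted(ranks),
        "declared_world_sizes": sorted(world_sizes),
    }


if __name__ == "__main__":
    # Illustrative sample, not captured from a real run.
    sample = "\n".join([
        "node0:1234:1300 [0] NCCL INFO Using network IB",
        "node0:1234:1300 [0] NCCL INFO comm 0x55d0 rank 0 nranks 2 cudaDev 0 busId 1000 - Init COMPLETE",
        "node1:4321:4400 [0] NCCL INFO Using network IB",
        "node1:4321:4400 [0] NCCL INFO comm 0x55d1 rank 1 nranks 2 cudaDev 0 busId 1000 - Init COMPLETE",
    ])
    print(summarize_nccl_log(sample))
```

A healthy run shows a single transport, one declared world size equal to the expected process total, and every rank from 0 up to that total minus one.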
Signals that all ranks joined the same run
Once training begins, compare step time and throughput against a single-node baseline. A run whose step speed and throughput are identical to a single-node run, or whose loss is erratic from the first steps, suggests that worker nodes are idle or are training on mismatched inputs.
Production Note: If one node sits idle after launch, re-check: (1) machine_rank is unique and correct on the idle node, (2) the idle node can TCP-connect to main_process_ip:main_process_port, (3) the idle node ran the launcher command with the correct --node-rank value, (4) no firewall rule drops traffic on the rendezvous port between the nodes. Use both the launcher's stderr output and NCCL's INFO logs together: the launcher logs show process startup, and NCCL logs show whether the distributed backend actually initialized with the expected rank count.
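A throughput comparison against a single-node baseline gives a quick numeric signal for the idle-worker case. This is a heuristic sketch only: the function names and the 15% tolerance are assumptions, and a slow interconnect can also produce poor scaling even when every rank is working.

```python
def scaling_efficiency(single_node_tps: float, multi_node_tps: float, num_nodes: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    if single_node_tps <= 0 or num_nodes < 1:
        raise ValueError("single_node_tps must be positive and num_nodes at least 1")
    return multi_node_tps / (single_node_tps * num_nodes)


def looks_like_single_node(single_node_tps: float, multi_node_tps: float,
                           tolerance: float = 0.15) -> bool:
    """True when multi-node throughput is within `tolerance` of the single-node baseline."""
    return multi_node_tps <= single_node_tps * (1.0 + tolerance)


if __name__ == "__main__":
    baseline = 1000.0   # tokens/sec measured on one node (illustrative)
    observed = 1050.0   # tokens/sec reported by the 2-node run (illustrative)
    print(f"efficiency: {scaling_efficiency(baseline, observed, 2):.2f}")
    if looks_like_single_node(baseline, observed):
        print("throughput matches one node: check whether the workers joined the run")
```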
Common failure modes and how to debug them fast
Most multi-node Axolotl failures fall into three root causes: NCCL transport misconfiguration, rendezvous parameter errors, and version or config drift between nodes. Debug in that order — transport first, rendezvous second, parity third — before touching training hyperparameters.
Watch Out: The three failure classes produce symptoms that look similar at the surface (job hangs, job crashes before step 1, or job runs but all activity is on one node). Do not start adjusting micro_batch_size or gradient_accumulation_steps before ruling out network and config causes. Enable NCCL_DEBUG=INFO, check rendezvous values, and run md5sum on the config across nodes before treating the problem as a training issue.
NCCL hangs, bandwidth drops, and NIC mismatches
NCCL hangs and bandwidth drops almost always trace to one of three conditions: IB transport is disabled, NCCL selected the wrong NIC, or the buffer size is mismatched to the cluster fabric.
Watch Out: NCCL_IB_DISABLE defaults to 0, but a value of 1 inherited from a container image, job template, or site profile makes NCCL use TCP/IP transport and ignore InfiniBand entirely. Axolotl's docs list export NCCL_IB_DISABLE=0 among the settings for IB runs, and setting it explicitly removes the ambiguity. Similarly, if NCCL_SOCKET_IFNAME is unset or points to the management NIC (e.g., eno1 instead of ib0 or the bonded data NIC), NCCL's out-of-band traffic uses the wrong interface, causing bandwidth to collapse without a clear error message. Set NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond" as documented, then adjust the prefix list to match your cluster's actual NIC naming scheme. If throughput is poor but the job is not hanging, verify NCCL_BUFFSIZE=2097152 is set identically on all nodes; inconsistent buffer sizes across nodes can degrade collective throughput.
Diagnostic sequence for hangs:
# Step 1: confirm IB devices are visible
$ ibstat | grep -E "(CA |State|Rate)"
# Step 2: run a point-to-point IB bandwidth test between two nodes
$ ib_send_bw -d mlx5_0 -i 1 -x 0 # on receiver
$ ib_send_bw -d mlx5_0 -i 1 -x 0 <RECEIVER_IP> # on sender
# Step 3: check NCCL selected IB in the run log
$ grep -E "NCCL INFO (NET/|Using network)" run.log
Rendezvous failures and machine-rank mistakes
A rendezvous failure produces a timeout or Connection refused error on the worker nodes. The head node either never starts listening or the port is blocked.
# Re-check rendezvous parameters before relaunch
# Verify head node is listening
$ ss -tlnp | grep 29500 # run on head node after launcher starts
# Verify port is reachable from each worker
$ nc -zv <HEAD_NODE_IP> 29500 # run on each worker node
# Verify machine_rank values in your Accelerate configs
$ grep machine_rank ~/.cache/huggingface/accelerate/default_config.yaml
# Verify num_machines and num_processes match the actual cluster size
$ grep -E "(num_machines|num_processes)" ~/.cache/huggingface/accelerate/default_config.yaml
The documented critical fields — machine_rank, main_process_ip, main_process_port, num_machines, and total_processes (as num_processes) — must all be exact. A duplicate machine_rank across two nodes causes one of them to time out waiting for a rank assignment that will never arrive. total_processes less than the actual GPU count causes the run to launch with fewer workers than expected, silently training on a subset of the cluster.
Config drift, commit drift, and dataset drift
Config and commit drift are insidious because they often produce errors that look like NCCL or memory errors rather than configuration errors.
Pro Tip: Treat your Axolotl commit SHA and model config file as immutable artifacts for the duration of a training run. Store them in a versioned location (Git tag, S3 path with SHA prefix) and provision all nodes from the same source at job start. The Axolotl docs state explicitly: "we suggest using the same commit to ensure compatibility" and "You will also need to have the same configuration file for your model on each machine." For reproducibility, also pin dataset preprocessing: if one node preprocesses tokens differently due to a different tokenizer cache state, the gradient contributions from that node are silently corrupted. Run a dataset hash check as part of your preflight script.
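The dataset hash check mentioned above can be as simple as a digest over the tokenized sequences, computed on every node and compared before launch. The sketch below is one way to do it, assuming the preprocessed data can be iterated as sequences of integer token IDs; dataset_fingerprint is a hypothetical helper, and hashing a large dataset this way is slow, so a fixed sample of rows may be more practical.

```python
import hashlib
from typing import Iterable, Sequence


def dataset_fingerprint(token_sequences: Iterable[Sequence[int]]) -> str:
    """Order-sensitive SHA-256 over token IDs, with sequence boundaries included."""
    digest = hashlib.sha256()
    for tokens in token_sequences:
        # Prefix each sequence with its length so [[1, 2], [3]] differs from [[1], [2, 3]].
        digest.update(len(tokens).to_bytes(8, "little"))
        for token in tokens:
            digest.update(int(token).to_bytes(8, "little", signed=True))
    return digest.hexdigest()


if __name__ == "__main__":
    # Illustrative token IDs; in practice, iterate the preprocessed dataset on each node.
    node0_rows = [[1, 15043, 3186], [1, 22172]]
    node1_rows = [[1, 15043, 3186], [1, 22172]]
    same = dataset_fingerprint(node0_rows) == dataset_fingerprint(node1_rows)
    print("dataset parity:", "OK" if same else "MISMATCH")
```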
Production checklist for serious multi-node runs
Before committing GPU-hours to a long fine-tuning job, lock down every variable that can cause a mid-run failure.
Production Note: Multi-node job failures that occur late in training are far more expensive than failures at step 0. Invest time in preflight validation on a short smoke-run (max_steps: 2) before submitting the full job. Two steps are too few to judge convergence, so confirm instead that a finite loss is logged, GPU utilization is nonzero on all ranks, and checkpoint write completes cleanly. Only then submit the full job with the production step count.
# Production preflight checklist — run before full job submission
# 1. Version parity
$ for node in node0 node1; do
ssh $node "cd /path/to/axolotl && git rev-parse HEAD"
done
# 2. Config parity
$ for node in node0 node1; do
ssh $node "md5sum /path/to/model_config.yaml"
done
# 3. Port reachability
$ nc -zv <HEAD_NODE_IP> 29500
# 4. IB device check on all nodes
$ for node in node0 node1; do
ssh $node "ibstat | grep State"
done
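# 5. Smoke run with verbose NCCL. model_config_smoke.yaml is an illustrative name for a copy
# of the config with max_steps: 2. Launch on every node, changing only --node_rank.
$ NCCL_DEBUG=INFO axolotl train model_config_smoke.yaml \
--launcher torchrun -- \
--nnodes 2 \
--nproc_per_node 8 \
--node_rank 0 \
--master_addr <HEAD_NODE_IP> \
--master_port 29500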
Launcher choice, scheduler fit, and supportability
Choose the launcher based on your cluster's scheduler and operational model, not on framework preference.
Production Note: Accelerate with static rendezvous fits small static clusters where you SSH into each node manually or use a simple scheduler like SLURM's srun. Ray Train, which Axolotl also supports as an alternative to Accelerate, centralizes orchestration and fits clusters already running a Ray cluster; it handles node failure and restart more gracefully than torchrun. Torchrun is the right choice for InfiniBand fabrics (per the Axolotl docs) and for teams who want direct control over NCCL's environment without abstraction. It is also the easiest path to debug because the process hierarchy is flat: one torchrun per node, all ranks visible in ps.
What to log for future debugging
# Capture the exact launcher invocation
$ history -w /path/to/logs/launcher_history_$(date +%Y%m%d_%H%M%S).txt
# Save NCCL environment variables exactly as set
$ env | grep NCCL > /path/to/logs/nccl_env_$(date +%Y%m%d_%H%M%S).txt
# Save Axolotl commit and config
$ git -C /path/to/axolotl rev-parse HEAD > /path/to/logs/axolotl_commit.txt
$ cp /path/to/model_config.yaml /path/to/logs/model_config_snapshot.yaml
# Capture accelerate config if used
$ cp ~/.cache/huggingface/accelerate/default_config.yaml \
/path/to/logs/accelerate_config_node$(hostname).yaml
# On job completion or failure, save the full NCCL debug log
$ cp run.log /path/to/logs/nccl_debug_$(date +%Y%m%d_%H%M%S).log
Preserve the commit SHA, config snapshot, exact NCCL environment, and the launcher command line verbatim. These four artifacts reproduce the failure without having to reconstruct the environment from memory.
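If the launch is driven from a script, the same four artifacts can be captured as one JSON manifest per run. This is a minimal sketch; build_manifest and its field names are hypothetical, and it assumes the commit SHA and config digest were computed earlier in the preflight.

```python
import json


def build_manifest(axolotl_commit: str, config_sha256: str,
                   launch_argv: list[str], env: dict[str, str]) -> str:
    """Serialize the commit, config digest, launcher command, and NCCL_* variables."""
    # Keep only NCCL settings so the manifest stays small and free of unrelated secrets.
    nccl_env = {key: value for key, value in env.items() if key.startswith("NCCL_")}
    manifest = {
        "axolotl_commit": axolotl_commit,
        "config_sha256": config_sha256,
        "launch_argv": launch_argv,
        "nccl_env": nccl_env,
    }
    return json.dumps(manifest, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Illustrative values; substitute the real SHA, digest, argv, and os.environ.
    print(build_manifest(
        "abc123",
        "f00d",
        ["torchrun", "--nnodes", "2", "-m", "axolotl.cli.train", "model_config.yaml"],
        {"NCCL_DEBUG": "INFO", "NCCL_IB_DISABLE": "0", "PATH": "/usr/bin"},
    ))
```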
FAQ
How do I run Axolotl on multiple machines?
Install the same Axolotl commit on every machine, place an identical model config on each, then launch with one of three methods: Accelerate (with FSDP/FSDP2 config), Ray Train, or torchrun. The Axolotl multi-node docs describe all three. For each, node 0 acts as the coordinator; all other nodes reference its IP and rendezvous port.
Should I use Accelerate or torchrun for Axolotl multi-node training?
It depends on your network fabric. On Ethernet clusters, Accelerate with FSDP2 is a reasonable default: rendezvous comes from a static config file on each node, and it integrates cleanly with HuggingFace tooling. On InfiniBand, follow the Axolotl documentation's explicit recommendation and use torchrun with NCCL_IB_DISABLE=0 and the other IB environment variables. Accelerate can work on IB if you set the NCCL variables correctly, but the torchrun path is simpler to debug and gives NCCL direct control over transport selection.
Pro Tip: On InfiniBand fabrics, start with torchrun. On Ethernet or mixed fabrics without IB, Accelerate's static rendezvous is easier to configure for teams without deep NCCL expertise. Ray Train is the right choice if your cluster already has a Ray runtime and you need fault tolerance or dynamic scaling.
How do I configure NCCL for InfiniBand in multi-node training?
Set three variables on all nodes before the launcher starts: NCCL_IB_DISABLE=0 (keeps IB transport enabled), NCCL_SOCKET_IFNAME set to your cluster's data NIC prefix (e.g., "eth0,en,eth,em,bond"), and NCCL_BUFFSIZE=2097152. Add NCCL_DEBUG=INFO during initial setup to confirm IB is selected in the startup logs.
Do all nodes need the same Axolotl version and config?
Yes, without exception. The Axolotl documentation states: "Each machine needs a copy of Axolotl, we suggest using the same commit to ensure compatibility. You will also need to have the same configuration file for your model on each machine." Different commits can produce incompatible code paths; different configs produce nondeterministic distributed behavior or immediate launch failures.
Sources & References
- Axolotl Multi-Node Documentation — Official runbook for Accelerate, Ray Train, and torchrun multi-node paths, including FSDP2 config and NCCL InfiniBand variables
- Axolotl Ray Integration Documentation — Details on using Ray Train as an alternative launcher for Axolotl
- PyTorch Multinode DDP Tutorial — Source for NCCL_DEBUG=INFO guidance and NIC selection recommendations for distributed backends


