How to run multi-node fine-tuning with Axolotl using FSDP2 or torchrun over InfiniBand

Axolotl’s multi-node path works either through Accelerate/FSDP2 config or torchrun rendezvous, and for InfiniBand the docs explicitly recommend torchrun with NCCL_IB_DISABLE=0 and tuned NCCL_SOCKET_IFNAME/NCCL_BUFFSIZE settings — but every node must share the same Axolotl commit and config, and the launcher choice changes how you debug NCCL and rendezvous failures.

At a glance: what you need before starting Axolotl multi-node fine-tuning

At a Glance: Axolotl supports three ways to train multi-node: Accelerate, Ray Train, and torchrun. Prereqs: a multi-node cluster with matching Axolotl commits, identical model config files on every node, and TCP reachability from the main machine to the workers on the rendezvous port.

Axolotl supports three paths for multi-node fine-tuning: Accelerate (with FSDP or FSDP2), Ray Train, and torchrun. The official multi-node guide describes itself as "How to use Axolotl on multiple machines." Each path has a different rendezvous mechanism, different failure modes, and a different operational fit. For InfiniBand fabrics specifically, the docs recommend torchrun to use the full IB bandwidth. The docs do not spell out why; a plausible reason is that NCCL environment variables exported in the shell are inherited directly by the training processes under torchrun, which makes NIC and transport selection easier to control and verify.

The three non-negotiable preconditions before any launcher fires:

  1. Every node runs the same Axolotl commit.
  2. Every node has an identical copy of the model config file.
  3. The head node (rank 0) is TCP-reachable from every worker node on the rendezvous port.

Miss any of these and the job either fails silently, hangs at rendezvous, or produces nondeterministic results that waste cluster budget.


Prerequisites, cluster setup, and version parity across nodes

As the Axolotl multi-node documentation states: "Each machine needs a copy of Axolotl, we suggest using the same commit to ensure compatibility. You will also need to have the same configuration file for your model on each machine." The same docs also make the reachability requirement explicit: "Make sure the main machine is reachable by other machines."

Install Axolotl from source on every node, pinning to the same Git SHA:

$ git clone https://github.com/axolotl-ai-cloud/axolotl.git
$ cd axolotl
$ git checkout <SHARED_COMMIT_SHA>   # pin to the same SHA on all nodes
$ pip install -e ".[flash-attn]"     # FSDP is part of PyTorch, so it needs no separate extra

After installation, distribute the model config. If your cluster has a shared filesystem (NFS, Lustre, GPFS), place the config there and reference it by the shared path on all nodes. Without a shared filesystem, use rsync or a provisioning tool to copy the config file to an identical absolute path on each node before launch.

Production Note: Version parity is not optional. Two nodes running different Axolotl commits can have subtly incompatible serialization formats, different FSDP wrapping logic, or mismatched tokenizer behavior, and any of these produces failures that look like network or NCCL issues during debugging. Pin the SHA in your provisioning script, not in a README. Config parity is equally strict: a single differing field (e.g., micro_batch_size or gradient_accumulation_steps) between nodes can crash the run or silently skew the effective batch size. Verify with md5sum your_config.yaml on all nodes before launch. The rendezvous port (default 29500) must also be open between nodes; confirm this with firewall rules and nc checks before submitting a job.

What must match on every node

Every node in the cluster must carry identical copies of:

  • Axolotl commit — use git rev-parse HEAD on each node and compare
  • Model configuration file — same absolute path, identical content
  • Python and PyTorch versions — version skew causes silent ABI mismatches
  • CUDA and driver versions — NCCL is sensitive to driver-level differences
  • Dataset path or remote reference — preprocessing must produce the same token sequences

Production Note: Do not mix Axolotl commits across nodes, even for patch releases. The docs are explicit: "we suggest using the same commit to ensure compatibility." If you must upgrade mid-campaign, rebuild all nodes simultaneously. Config drift is the second most common cause of multi-node launch failure after network misconfiguration — run diff or md5sum on the config across nodes as a preflight step.
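
The parity preflight can be scripted so that drift fails loudly instead of depending on someone comparing hashes by eye. The following is a minimal sketch, assuming you collect one "<node> <value>" pair per line; the function name check_parity and the input format are illustrative and not part of Axolotl.

```shell
# check_parity: read "<node> <value>" lines on stdin (value = commit SHA or
# config checksum) and succeed only if every node reports the same value.
# Hypothetical helper for illustration; not part of Axolotl.
check_parity() {
    awk 'NF >= 2 { seen[$2] = 1 }
         END { n = 0; for (v in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# Example wiring (commented out because it needs SSH access to real nodes):
# for node in node0 node1; do
#     printf '%s %s\n' "$node" "$(ssh "$node" 'git -C /path/to/axolotl rev-parse HEAD')"
# done | check_parity || echo "commit drift detected" >&2
```

The same function works for config checksums if the loop prints the first field of md5sum instead of the commit SHA. Empty input is treated as a failure, so an unreachable cluster does not pass by accident.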

Network and access checks before launch

Verify reachability before any launcher fires. A stalled job that hangs at rendezvous for 10 minutes is almost always a TCP connectivity problem, not a software bug.

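# From each worker node: probe the head node on the rendezvous port.
# Before launch nothing listens there, so "Connection refused" still proves the
# host is reachable; a timeout points to a firewall or routing problem.
$ nc -zv <HEAD_NODE_IP> 29500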

# Confirm all nodes can reach each other (replace with actual node IPs)
$ for node in 10.0.0.1 10.0.0.2 10.0.0.3; do
    ssh $node "hostname && nvidia-smi -L | wc -l"
  done

# Check which NIC is active on each node — needed for NCCL_SOCKET_IFNAME
$ ip link show | grep -E '^[0-9]+:' | awk '{print $2}'

# Quick IB device check on each node
$ ibstat | grep -E '(CA |State)'

As the PyTorch multinode training tutorial notes: "Sometimes you might need to explicitly set the network interface for the distributed backend (export NCCL_SOCKET_IFNAME=eth0)." Identify the correct NIC names during this preflight phase — you will need them for the NCCL environment variable block.


Run Axolotl with Accelerate and FSDP2 on multiple nodes

The Accelerate path works by writing a static rendezvous configuration to ~/.cache/huggingface/accelerate/default_config.yaml (or a path you specify) on each node, then starting the training launch on every node as you normally would. Accelerate reads the config on each machine, and the processes join a single process group through the head node's address and port; it does not start processes on remote machines for you. The FSDP2 backend handles parameter sharding across GPU ranks.

Accelerate rendezvous values that must be exact

Write a per-node Accelerate config. The rendezvous fields must be exact — a single wrong value prevents any rank from joining the group.

On node 0 (rank 0 / head node):

# accelerate_config_node0.yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
machine_rank: 0                      # unique integer per node; 0 = head
main_process_ip: "10.0.0.1"         # head node IP, reachable from all workers
main_process_port: 29500             # must be open in firewall on head node
num_machines: 2                      # total node count
num_processes: 16                    # total GPU count across all nodes (e.g., 2 nodes × 8 GPUs)

On node 1 (rank 1 / worker node): copy the file and change only machine_rank: 1. All other fields remain identical.

# accelerate_config_node1.yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
machine_rank: 1
main_process_ip: "10.0.0.1"         # still points to head node — do not change
main_process_port: 29500
num_machines: 2
num_processes: 16

num_processes is the total for the whole cluster, so it must equal num_machines × GPUs per node (2 × 8 = 16 in this example). Getting this wrong is a common rendezvous mistake: the launch can wait indefinitely for ranks that will never join.
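
The num_processes arithmetic is easy to check mechanically before launch. Below is a minimal sketch, assuming a flat Accelerate config with "key: value  # comment" lines like the ones shown above; check_num_processes is a hypothetical helper name, and the GPU count per node is passed in by hand.

```shell
# check_num_processes <accelerate_config.yaml> <gpus_per_node>
# Succeeds only if num_processes == num_machines * gpus_per_node.
# Hypothetical helper; assumes flat "key: value  # comment" YAML lines.
check_num_processes() {
    cfg=$1
    gpus_per_node=$2
    # Extract each value and strip trailing whitespace and any trailing comment.
    num_machines=$(awk -F': *' '$1 == "num_machines" { sub(/[ \t]*(#.*)?$/, "", $2); print $2 }' "$cfg")
    num_processes=$(awk -F': *' '$1 == "num_processes" { sub(/[ \t]*(#.*)?$/, "", $2); print $2 }' "$cfg")
    if [ -z "$num_machines" ] || [ -z "$num_processes" ]; then
        echo "num_machines or num_processes missing in $cfg" >&2
        return 1
    fi
    expected=$(( $num_machines * $gpus_per_node ))
    if [ "$num_processes" -ne "$expected" ]; then
        echo "num_processes=$num_processes, expected $expected" >&2
        return 1
    fi
}
```

A call such as check_num_processes accelerate_config_node0.yaml 8 would pass for the two-node, eight-GPU example and fail if either field is changed on only one node.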

FSDP2 config for memory-heavy fine-tunes

The FSDP2 settings live inside the Axolotl model config, not the Accelerate config. The Axolotl multi-node documentation provides a reference configuration that is a reasonable starting point for memory-heavy fine-tunes:

# model_config.yaml — the FSDP2 block (same file on every node)
fsdp_version: 2
fsdp_config:
  offload_params: true               # offload sharded parameters to CPU; saves HBM at cost of throughput
  state_dict_type: FULL_STATE_DICT   # gather full weights on rank 0 for checkpointing
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer   # match to your model architecture
  reshard_after_forward: true        # release parameter shards after forward pass; reduces peak HBM

A few operational notes on these fields:

  • offload_params: true trades throughput for memory capacity.
  • state_dict_type: FULL_STATE_DICT gathers full weights on rank 0 before saving. This requires the head node to have enough CPU RAM to hold the full model. SHARDED_STATE_DICT saves each rank's shards separately and requires a merge step before the checkpoint can be used as a single model.
  • transformer_layer_cls_to_wrap must match the model's actual decoder class name. For Llama-family models the value is LlamaDecoderLayer; for Mistral use MistralDecoderLayer; for Mixtral use MixtralDecoderLayer; for Falcon use FalconDecoderLayer. A name that does not match the model either fails at wrap time or leaves FSDP wrapping at the wrong granularity, which degrades memory efficiency.
  • reshard_after_forward: true frees the gathered parameters after each forward pass and gathers them again for backward (comparable to FULL_SHARD in FSDP1). Setting it to false keeps parameters gathered between forward and backward (comparable to SHARD_GRAD_OP), which saves communication but raises peak memory, so it rarely suits memory-heavy fine-tunes.

Run Axolotl with torchrun for InfiniBand bandwidth

The Axolotl docs are direct: "If you are using Infiniband, we recommend torchrun to utilize the full bandwidth." The practical risk on IB clusters is transport selection: if IB is disabled or NCCL cannot use the IB devices, it falls back to TCP sockets for collective traffic, which cuts allreduce bandwidth and makes the job slow without an obvious error.

Set these environment variables on every node before the launcher fires:

# NCCL InfiniBand settings — export on all nodes before torchrun
$ export NCCL_IB_DISABLE=0                            # keep IB transport enabled (0 is NCCL's default; this overrides an inherited 1)
$ export NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond"    # interface-name prefixes NCCL may use for its socket (bootstrap) traffic
$ export NCCL_BUFFSIZE=2097152                         # 2 MiB buffer; value taken from the Axolotl docs
$ export NCCL_DEBUG=INFO                               # enable verbose logs for the first run

Replace the NCCL_SOCKET_IFNAME value with the actual interface prefix on your cluster. Run ip link show on each node to confirm the NIC names before you settle on this value.
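
Because these variables must be present on every node, it can help to assert them just before the launcher runs. This is a minimal sketch, assuming the three values used in this guide are the ones you want; check_nccl_env is a hypothetical helper name, and the expected values should be adjusted if your cluster is tuned differently.

```shell
# check_nccl_env: verify the IB-related NCCL variables from this guide are set.
# Hypothetical helper; prints one line per problem to stderr and returns
# non-zero if any check fails.
check_nccl_env() {
    status=0
    if [ "${NCCL_IB_DISABLE:-unset}" != "0" ]; then
        echo "NCCL_IB_DISABLE=${NCCL_IB_DISABLE:-unset} (expected 0)" >&2
        status=1
    fi
    if [ -z "${NCCL_SOCKET_IFNAME:-}" ]; then
        echo "NCCL_SOCKET_IFNAME is unset or empty" >&2
        status=1
    fi
    if [ "${NCCL_BUFFSIZE:-unset}" != "2097152" ]; then
        echo "NCCL_BUFFSIZE=${NCCL_BUFFSIZE:-unset} (expected 2097152)" >&2
        status=1
    fi
    return "$status"
}
```

Running check_nccl_env || exit 1 at the top of a per-node launch script stops the node before it joins the rendezvous with a partial environment.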

When torchrun beats accelerate on InfiniBand

The Accelerate launcher wraps torchrun internally and adds its own configuration layer on top. On IB fabrics, you want as little as possible between your shell and NCCL: with torchrun, the NCCL variables you export are inherited directly by the worker processes before the process group initializes, so what you set is what NCCL sees.

Pro Tip: On InfiniBand clusters, use torchrun with explicit NCCL environment variables rather than Accelerate's launcher. If NCCL silently falls back to TCP sockets because NCCL_IB_DISABLE=1 was inherited from the environment or no usable IB device was found, your allreduce throughput collapses and the job appears slow without emitting an obvious error. Torchrun also makes NCCL log parsing easier because launcher output and training output share the same process tree, so you can pipe both to a single log file and correlate timestamps.

New Axolotl CLI launcher versus legacy direct torchrun

The Axolotl docs label the new CLI as "Option 1: New Axolotl CLI with launcher args (Recommended)" and the direct torchrun path as "Option 2: Direct torchrun (Legacy)". Prefer the new CLI for new deployments.

Option 1 — New Axolotl CLI (Recommended). Run on each node, adjusting --node-rank per machine:

# Run on head node (node rank 0). Arguments after the bare "--" go to torchrun.
$ NCCL_IB_DISABLE=0 \
  NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond" \
  NCCL_BUFFSIZE=2097152 \
  axolotl train model_config.yaml \
    --launcher torchrun -- \
    --nnodes 2 \
    --nproc-per-node 8 \
    --node-rank 0 \
    --master-addr 10.0.0.1 \
    --master-port 29500

# Run on worker node (node rank 1): same config, only --node-rank changes
$ NCCL_IB_DISABLE=0 \
  NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond" \
  NCCL_BUFFSIZE=2097152 \
  axolotl train model_config.yaml \
    --launcher torchrun -- \
    --nnodes 2 \
    --nproc-per-node 8 \
    --node-rank 1 \
    --master-addr 10.0.0.1 \
    --master-port 29500

Option 2 — Legacy direct torchrun. Preserve this for older pipelines or debugging comparisons:

# Legacy: direct torchrun on head node
$ NCCL_IB_DISABLE=0 \
  NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond" \
  NCCL_BUFFSIZE=2097152 \
  torchrun \
    --nnodes 2 \
    --node-rank 0 \
    --master-addr 10.0.0.1 \
    --master-port 29500 \
    --nproc-per-node 8 \
    -m axolotl.cli.train model_config.yaml

# Legacy: direct torchrun on worker node
$ NCCL_IB_DISABLE=0 \
  NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond" \
  NCCL_BUFFSIZE=2097152 \
  torchrun \
    --nnodes 2 \
    --node-rank 1 \
    --master-addr 10.0.0.1 \
    --master-port 29500 \
    --nproc-per-node 8 \
    -m axolotl.cli.train model_config.yaml

Both options require identical NCCL variables and identical model configs on all nodes. The new CLI is the form the Axolotl docs mark as recommended; the direct torchrun form is marked legacy.


How to verify the job is actually distributed

A common silent failure: the job appears to run, loss decreases, but only node 0's GPUs are active. This happens when a misconfigured launcher defaults to local-only execution. Confirm distributed execution before trusting any results.

# Set verbose NCCL logging before launch — captures rendezvous and transport init
$ export NCCL_DEBUG=INFO
$ export NCCL_DEBUG_SUBSYS=ALL    # optional; captures per-subsystem detail

# On each node, watch GPU utilization in real time
$ watch -n 2 nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv

# Check process groups on each node after launch
$ ps aux | grep axolotl

As the PyTorch multinode training tutorial recommends: "Set env variable NCCL_DEBUG to INFO ... to print verbose logs that can help diagnose the issue." With NCCL_DEBUG=INFO, the early part of each node's log contains NCCL's transport selection output; look for NET/IB lines to confirm InfiniBand is in use.

Signals that rendezvous succeeded

# Enable verbose NCCL before training begins
$ export NCCL_DEBUG=INFO

# Capture the log, then grep for NCCL's transport and init lines
$ axolotl train model_config.yaml ... 2>&1 | tee run.log
$ grep -E "NCCL INFO (NET/(IB|Socket)|Using network|.*Init COMPLETE)" run.log

Healthy startup produces log lines containing NCCL INFO NET/IB and Using network IB (for InfiniBand), or NET/Socket and Using network Socket (fallback to TCP); exact wording varies somewhat between NCCL versions. Across the logs of all nodes you should see an Init COMPLETE line from every rank, and the number of distinct ranks must match num_processes from your Accelerate config or --nnodes × --nproc-per-node from torchrun. Absence of timeout or Connection refused messages confirms the rendezvous port is open and reachable.
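
Counting ranks by hand gets tedious beyond a couple of nodes. The following is a minimal sketch that counts distinct ranks in a log, assuming NCCL's init lines contain the text "rank <N> nranks <M>" followed by "Init COMPLETE"; that line format is an assumption that can differ between NCCL versions, and count_initialized_ranks is a hypothetical helper name.

```shell
# count_initialized_ranks <logfile>
# Prints the number of distinct ranks that logged an NCCL "Init COMPLETE" line.
# Hypothetical helper; assumes lines like "... rank <N> nranks <M> ... Init COMPLETE".
count_initialized_ranks() {
    grep 'Init COMPLETE' "$1" \
        | sed -n 's/.* rank \([0-9][0-9]*\) nranks .*/\1/p' \
        | sort -un \
        | wc -l \
        | tr -d ' '
}
```

Each node's run.log only contains that node's ranks, so concatenate the per-node logs into one file before comparing the result with the expected world size. Ranks are de-duplicated because a run can create more than one communicator.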

Signals that all ranks joined the same run

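Once training begins, compare step time and throughput against a single-node baseline. Loss alone is not a reliable signal, because a job that silently runs on one node still trains. Step speed and throughput identical to a single-node run, or GPUs showing zero utilization on a worker, indicate that worker nodes are idle.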

Production Note: If one node sits idle after launch, re-check: (1) machine_rank is unique and correct on the idle node, (2) the idle node can TCP-connect to main_process_ip:main_process_port, (3) the idle node ran the launcher command with the correct --node-rank value, (4) no firewall rule drops traffic on the rendezvous port between the nodes. Use both the launcher's stderr output and NCCL's INFO logs together — the launcher logs show process startup, and NCCL logs show whether the distributed backend actually initialized with the expected rank count.


Common failure modes and how to debug them fast

Most multi-node Axolotl failures fall into three root causes: NCCL transport misconfiguration, rendezvous parameter errors, and version or config drift between nodes. Debug in that order — transport first, rendezvous second, parity third — before touching training hyperparameters.

Watch Out: The three failure classes produce symptoms that look similar at the surface (job hangs, job crashes before step 1, or job runs but all activity is on one node). Do not start adjusting micro_batch_size or gradient_accumulation_steps before ruling out network and config causes. Enable NCCL_DEBUG=INFO, check rendezvous values, and run md5sum on the config across nodes before treating the problem as a training issue.

NCCL hangs, bandwidth drops, and NIC mismatches

NCCL hangs and bandwidth drops almost always trace to one of three conditions: IB transport is disabled, NCCL selected the wrong NIC, or the buffer size is mismatched to the cluster fabric.

Watch Out: NCCL enables InfiniBand by default, but an inherited NCCL_IB_DISABLE=1 (from a base image, module file, or job template) makes it use TCP sockets and ignore InfiniBand entirely. Axolotl's docs list export NCCL_IB_DISABLE=0 among the IB settings, so set it explicitly. Similarly, if NCCL_SOCKET_IFNAME points to an interface that is down or not routable between nodes, NCCL's bootstrap can hang, and any socket-based traffic takes the slow path. Set NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond" as documented, adjusting the prefix list to match your cluster's actual NIC naming scheme. If throughput is poor but the job is not hanging, verify NCCL_BUFFSIZE=2097152 is set identically on all nodes.

Diagnostic sequence for hangs:

# Step 1: confirm IB devices are visible
$ ibstat | grep -E "(CA |State|Rate)"

# Step 2: run a point-to-point IB bandwidth test between two nodes
$ ib_send_bw -d mlx5_0 -i 1 -x 0   # on receiver
$ ib_send_bw -d mlx5_0 -i 1 -x 0 <RECEIVER_IP>  # on sender

# Step 3: check NCCL selected IB in the run log
$ grep -E "NCCL INFO (NET/|Using network)" run.log
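
Step 3 can be turned into a yes/no check for scripts. This is a minimal sketch, assuming NCCL reports its choice with a "Using network IB" or "Using network Socket" line when NCCL_DEBUG=INFO is set; the exact strings are an assumption about the log format, and nccl_transport is a hypothetical helper name.

```shell
# nccl_transport <logfile>
# Prints "ib", "socket", or "unknown" based on NCCL's "Using network" line.
# Hypothetical helper; the matched strings are assumptions about the log format.
nccl_transport() {
    if grep -q 'NCCL INFO Using network IB' "$1"; then
        echo ib
    elif grep -q 'NCCL INFO Using network Socket' "$1"; then
        echo socket
    else
        echo unknown
    fi
}
```

A result of socket on an IB cluster is the signal to re-check NCCL_IB_DISABLE and the ibstat output; unknown usually means the log was captured without NCCL_DEBUG=INFO.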

Rendezvous failures and machine-rank mistakes

A rendezvous failure produces a timeout or Connection refused error on the worker nodes. The head node either never starts listening or the port is blocked.

# Re-check rendezvous parameters before relaunch
# Verify head node is listening
$ ss -tlnp | grep 29500   # run on head node after launcher starts

# Verify port is reachable from each worker
$ nc -zv <HEAD_NODE_IP> 29500   # run on each worker node

# Verify machine_rank values in your Accelerate configs
$ grep machine_rank ~/.cache/huggingface/accelerate/default_config.yaml

# Verify num_machines and num_processes match the actual cluster size
$ grep -E "(num_machines|num_processes)" ~/.cache/huggingface/accelerate/default_config.yaml

The critical fields (machine_rank, main_process_ip, main_process_port, num_machines, and num_processes) must all be exact. A duplicate machine_rank across two nodes leaves the process group incomplete, so the launch times out waiting for a rank that never arrives. A num_processes value lower than the actual GPU count launches fewer workers than expected, silently training on a subset of the cluster.
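
Duplicate or missing machine_rank values are simple to detect once the values from all nodes are in one place. Below is a minimal sketch, assuming you feed it one machine_rank per line (for example, collected over SSH with the grep shown above); check_machine_ranks is a hypothetical helper name.

```shell
# check_machine_ranks <num_machines>
# Reads one machine_rank per line on stdin and succeeds only if the values are
# exactly 0..num_machines-1 with no duplicates or gaps. Hypothetical helper.
check_machine_ranks() {
    sort -n | awk -v n="$1" '
        $1 != NR - 1 { bad = 1 }
        END { exit ((bad || NR != n) ? 1 : 0) }'
}
```

After sorting, the value on line k must be k-1, so a duplicate, a gap, or a wrong node count all fail the same check.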

Config drift, commit drift, and dataset drift

Config and commit drift are insidious because they often produce errors that look like NCCL or memory errors rather than configuration errors.

Pro Tip: Treat your Axolotl commit SHA and model config file as immutable artifacts for the duration of a training run. Store them in a versioned location (Git tag, S3 path with SHA prefix) and provision all nodes from the same source at job start. The Axolotl docs state explicitly: "we suggest using the same commit to ensure compatibility" and "You will also need to have the same configuration file for your model on each machine." For reproducibility, also pin dataset preprocessing: if one node preprocesses tokens differently due to a different tokenizer cache state, the gradient contributions from that node are silently corrupted. Run a dataset hash check as part of your preflight script.


Production checklist for serious multi-node runs

Before committing GPU-hours to a long fine-tuning job, lock down every variable that can cause a mid-run failure.

Production Note: Multi-node job failures that occur late in training are far more expensive than failures at step 0. Invest time in preflight validation on a short smoke-run (max_steps: 2) before submitting the full job. Confirm loss decreases on all nodes, GPU utilization is nonzero on all ranks, and checkpoint write completes cleanly. Only then submit the full job with the production step count.

# Production preflight checklist — run before full job submission

# 1. Version parity
$ for node in node0 node1; do
    ssh $node "cd /path/to/axolotl && git rev-parse HEAD"
  done

# 2. Config parity
$ for node in node0 node1; do
    ssh $node "md5sum /path/to/model_config.yaml"
  done

# 3. Port reachability
$ nc -zv <HEAD_NODE_IP> 29500

# 4. IB device check on all nodes
$ for node in node0 node1; do
    ssh $node "ibstat | grep State"
  done

# 5. Smoke run with verbose NCCL (set max_steps: 2 in the config for this run)
$ NCCL_DEBUG=INFO axolotl train model_config.yaml \
    --launcher torchrun -- \
    --nnodes 2 \
    --nproc-per-node 8 \
    --node-rank 0 \
    --master-addr <HEAD_NODE_IP> \
    --master-port 29500

Launcher choice, scheduler fit, and supportability

Choose the launcher based on your cluster's scheduler and operational model, not on framework preference.

Production Note: Accelerate with static rendezvous fits small static clusters where you SSH into each node manually or use a simple scheduler like SLURM's srun. Ray Train, which Axolotl also supports as an alternative to Accelerate, centralizes orchestration and fits clusters already running a Ray cluster — it handles node failure and restart more gracefully than torchrun. Torchrun is the right choice for InfiniBand fabrics (per the Axolotl docs) and for teams who want direct control over NCCL's environment without abstraction. It is also the easiest path to debug because the process hierarchy is flat: one torchrun per node, all ranks visible in ps.

What to log for future debugging

# Capture the exact launcher invocation
$ history -w /path/to/logs/launcher_history_$(date +%Y%m%d_%H%M%S).txt

# Save NCCL environment variables exactly as set
$ env | grep NCCL > /path/to/logs/nccl_env_$(date +%Y%m%d_%H%M%S).txt

# Save Axolotl commit and config
$ git -C /path/to/axolotl rev-parse HEAD > /path/to/logs/axolotl_commit.txt
$ cp /path/to/model_config.yaml /path/to/logs/model_config_snapshot.yaml

# Capture accelerate config if used
$ cp ~/.cache/huggingface/accelerate/default_config.yaml \
     /path/to/logs/accelerate_config_node$(hostname).yaml

# On job completion or failure, save the full NCCL debug log
$ cp run.log /path/to/logs/nccl_debug_$(date +%Y%m%d_%H%M%S).log

Preserve the commit SHA, config snapshot, exact NCCL environment, and the launcher command line verbatim. These four artifacts reproduce the failure without having to reconstruct the environment from memory.


FAQ

How do I run Axolotl on multiple machines?

Install the same Axolotl commit on every machine, place an identical model config on each, then launch with one of three methods: Accelerate (with FSDP/FSDP2 config), Ray Train, or torchrun. The Axolotl multi-node docs describe all three. For each, node 0 acts as the coordinator; all other nodes reference its IP and rendezvous port.

Should I use Accelerate or torchrun for Axolotl multi-node training?

It depends on your network fabric. On Ethernet clusters, Accelerate with FSDP2 is a reasonable default: rendezvous is configured once in a YAML file per node, and it integrates cleanly with Hugging Face tooling. On InfiniBand, follow the Axolotl documentation's explicit recommendation and use torchrun with NCCL_IB_DISABLE=0 and the other IB environment variables. Accelerate can work on IB if you set the NCCL variables correctly, but the torchrun path is simpler to debug and gives NCCL direct control over transport selection.

Pro Tip: On InfiniBand fabrics, start with torchrun. On Ethernet or mixed fabrics without IB, Accelerate's static rendezvous is easier to configure for teams without deep NCCL expertise. Ray Train is the right choice if your cluster already has a Ray runtime and you need fault tolerance or dynamic scaling.

How do I configure NCCL for InfiniBand in multi-node training?

Set three variables on all nodes before the launcher starts: NCCL_IB_DISABLE=0 (keeps IB transport enabled and overrides any inherited value of 1), NCCL_SOCKET_IFNAME set to your cluster's interface-name prefixes (e.g., "eth0,en,eth,em,bond"), and NCCL_BUFFSIZE=2097152. Add NCCL_DEBUG=INFO during initial setup to confirm IB is selected in the startup logs.

Do all nodes need the same Axolotl version and config?

Yes, without exception. The Axolotl documentation states: "Each machine needs a copy of Axolotl, we suggest using the same commit to ensure compatibility. You will also need to have the same configuration file for your model on each machine." Different commits can produce incompatible code paths; different configs produce nondeterministic distributed behavior or immediate launch failures.

