
How Megatron-LM handles tensor, pipeline, and sequence parallelism for large transformer training

Megatron-LM’s design composes tensor parallelism, pipeline parallelism, data parallelism, expert parallelism, and context/sequence parallelism inside Megatron Core so large transformers can be partitioned across GPUs without changing the model’s mathematical behavior — but the trade-off is added communication, scheduling complexity, and a need to balance activation recomputation against throughput.


What Megatron-LM is solving at cluster scale

Training large transformers requires splitting model state, optimizer state, gradients, and activations across multiple GPUs so that memory and communication remain manageable. Megatron-LM exists to solve that distributed-training problem by composing multiple parallelism strategies so that no single GPU needs to hold the full model, the full optimizer state, or the full activation buffer.

The repository exposes two distinct components. Megatron Core is the modular substrate — a GPU-optimized library of composable building blocks that any training framework can consume. Megatron-LM sits on top of it, and the repo makes the split explicit: "Megatron-LM is a reference example that includes Megatron Core plus pre-configured training scripts."

The answer to whether Megatron-LM can use tensor parallelism and pipeline parallelism simultaneously is yes — and it is designed to. Megatron Core composes TP, PP, DP, expert parallelism (EP), and context parallelism (CP) as orthogonal process-group dimensions, meaning you can dial each one independently and the intersection forms the actual device mesh.

Bottom Line: Megatron-LM's value is not any single parallelism strategy — it is the ability to compose tensor parallelism, pipeline parallelism, data parallelism, expert parallelism, and sequence/context parallelism simultaneously inside one training job, with mathematical equivalence to single-GPU training preserved at every point. The cost is added communication overhead, scheduling complexity, and a steep configuration surface that punishes incorrect partitioning choices.


How Megatron Core layers TP, PP, DP, EP, and CP

Megatron Core treats each parallelism axis as a separate process-group dimension. When you launch a job with TP, PP, and DP values chosen for your cluster, the global process count is the product of those dimensions, and NCCL initializes separate communicators for each axis. Collectives within the TP group use the fastest available intra-node links where the topology allows it. Collectives within the PP group use point-to-point sends and receives between pipeline stages. DP gradient reductions follow the data-parallel topology and can be bucketed and overlapped with backward computation.

The official Megatron Core parallelism documentation adds EP for mixture-of-experts token routing and CP for long-context attention partitioning as first-class axes in the same composable mesh.

flowchart LR
    subgraph DP_Replica_0["DP Replica 0"]
        subgraph PP_Stage_0["PP Stage 0 (Layers 0–11)"]
            TP0["GPU 0\nTP rank 0"]
            TP1["GPU 1\nTP rank 1"]
            TP2["GPU 2\nTP rank 2"]
            TP3["GPU 3\nTP rank 3"]
            TP0 <-->|"all-reduce / reduce-scatter\n(NVLink)"| TP1
            TP1 <-->|NVLink| TP2
            TP2 <-->|NVLink| TP3
        end
        subgraph PP_Stage_1["PP Stage 1 (Layers 12–23)"]
            TP4["GPU 4\nTP rank 0"]
            TP5["GPU 5\nTP rank 1"]
            TP6["GPU 6\nTP rank 2"]
            TP7["GPU 7\nTP rank 3"]
            TP4 <-->|NVLink| TP5
            TP5 <-->|NVLink| TP6
            TP6 <-->|NVLink| TP7
        end
        PP_Stage_0 -->|"activation send\n(InfiniBand)"| PP_Stage_1
    end
    subgraph DP_Replica_1["DP Replica 1"]
        PP_Stage_2["PP Stage 0 (mirror)"]
        PP_Stage_3["PP Stage 1 (mirror)"]
    end
    DP_Replica_0 <-->|"gradient all-reduce\n(InfiniBand)"| DP_Replica_1

The diagram above represents TP=4, PP=2, DP=2 — 16 GPUs total. Sequence parallelism layers on top of the TP dimension without requiring additional process-group creation: SP reuses the TP communicator to shard the activations that TP leaves unsharded. Context parallelism, by contrast, adds a separate ring-attention communicator for the attention computation across long sequences.
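To make the mesh concrete, here is a minimal, framework-free Python sketch that maps each of the 16 global ranks in the diagram to its (DP, PP, TP) coordinates. It assumes TP is the innermost (fastest-varying) dimension and DP the outermost; the actual rank ordering Megatron Core uses is an internal, configurable detail, so treat this only as an illustration of the product-of-axes idea.

# Hypothetical rank-to-coordinates mapping for TP=4, PP=2, DP=2 (16 GPUs).
# Assumes TP varies fastest, then PP, then DP; an illustrative ordering,
# not necessarily the one Megatron Core builds its process groups with.
TP, PP, DP = 4, 2, 2
WORLD_SIZE = TP * PP * DP  # 16

def coords(rank: int) -> tuple[int, int, int]:
    tp_rank = rank % TP
    pp_rank = (rank // TP) % PP
    dp_rank = rank // (TP * PP)
    return dp_rank, pp_rank, tp_rank

for r in range(WORLD_SIZE):
    dp, pp, tp = coords(r)
    print(f"rank {r:2d} -> DP replica {dp}, PP stage {pp}, TP rank {tp}")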

Parallelism domains and what each one splits

Each parallelism axis splits a different tensor dimension:

Tensor parallelism (TP) splits individual weight matrices and the corresponding activation tensors within a single transformer layer across GPUs. A matrix multiply that would normally produce a full output on one GPU instead produces a partial result on each TP rank; a collective then reconstructs the full result. TP issues collectives on every layer's forward and backward pass, which is why it is best contained within one NVLink domain.

Pipeline parallelism (PP) assigns complete transformer layers to GPU stages. Stage 0 holds layers 0–L/P, stage 1 holds layers L/P–2L/P, and so on. Microbatches flow through the stages sequentially, with activation tensors transmitted between stages as ordinary NCCL point-to-point sends.

Data parallelism (DP) replicates the model across PP × TP device groups and synchronizes gradients after each backward pass. At large scale, DP becomes the axis where the distributed optimizer shards optimizer states.

Expert parallelism (EP) partitions MoE expert weights across EP ranks, with NCCL all-to-all collectives routing tokens to the correct experts.

Context parallelism (CP) splits the sequence dimension specifically for the attention computation, enabling ring attention across long sequences where each CP rank holds a contiguous sequence chunk.

The Megatron Bridge parallelism guide states: "Enable sequence parallelism when using tensor parallelism to reduce activation memory. Use context parallelism for long sequence training." The Hugging Face Megatron-LM integration guide adds, "As all-reduce = reduce-scatter + all-gather, this saves a ton of activation memory at no added communication cost."

Pro Tip: Keep TP degree ≤ the number of GPUs within a single NVLink domain (typically 8 on an H100 SXM node). Crossing NVLink boundaries with TP collectives increases latency by an order of magnitude and can erase any memory benefit the sharding provides.

Why Megatron Core is composable instead of monolithic

Megatron Core exposes transformer building blocks — attention, MLP, normalization, embedding — as individually instantiable modules that accept a TransformerConfig carrying the parallelism parameters. A custom training framework imports these modules directly without inheriting the full Megatron-LM training loop.

As the official repository states: "Megatron Core is a composable library with GPU-optimized building blocks for custom training frameworks." This is architecturally distinct from DeepSpeed or FSDP, which wrap an existing model rather than providing model building blocks.

Production Note: Megatron Core's composability is a double-edged property. It gives infrastructure teams explicit control over every parallelism decision, but it means you own the integration: data loading, checkpointing, logging, and learning-rate scheduling are not included. Teams migrating from an opinionated training script should budget significant engineering time for the integration layer before claiming Megatron Core's performance at scale.


Tensor parallelism inside a transformer block

Tensor parallelism preserves the transformer's mathematical behavior by splitting weight matrices in a way that makes partial results computable on each rank and then reconstructable via a single collective. In general linear algebra, a matrix multiply $Y = XW$ can be decomposed as a column-partitioned or row-partitioned operation without changing the final result.

For an MLP with weight matrices $W_1 \in \mathbb{R}^{d \times 4d}$ and $W_2 \in \mathbb{R}^{4d \times d}$ and TP degree $t$, the decomposition is:

$$Y = \text{GeLU}(X W_1)\, W_2 = \left[\,\text{GeLU}(X W_1^{(1)}) \;\Big|\; \cdots \;\Big|\; \text{GeLU}(X W_1^{(t)})\,\right] \begin{bmatrix} W_2^{(1)} \\ \vdots \\ W_2^{(t)} \end{bmatrix}$$

Each GPU $i$ holds $W_1^{(i)} \in \mathbb{R}^{d \times 4d/t}$ (column partition) and $W_2^{(i)} \in \mathbb{R}^{4d/t \times d}$ (row partition). GPU $i$ computes its partial output $Y^{(i)} = \text{GeLU}(X W_1^{(i)})\, W_2^{(i)} \in \mathbb{R}^{s \times d}$, and an all-reduce across the TP group sums the $t$ partial outputs to recover $Y$. The full transformer forward pass completes with mathematical equivalence to the single-GPU version.
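The decomposition is easy to verify numerically. The following NumPy sketch builds a tiny MLP, splits $W_1$ by columns and $W_2$ by rows across $t$ simulated ranks, and checks that summing the partial outputs (the role the all-reduce plays) reproduces the unpartitioned result. It is a sketch of the math only, not of Megatron's tensor-parallel layers.

# Numerical check that the column/row-partitioned MLP matches the
# unpartitioned forward pass. The explicit sum over ranks stands in for
# the TP all-reduce; shapes are deliberately tiny.
import numpy as np

rng = np.random.default_rng(0)
s, d, t = 3, 8, 4                      # sequence length, hidden size, TP degree
X = rng.normal(size=(s, d))
W1 = rng.normal(size=(d, 4 * d))
W2 = rng.normal(size=(4 * d, d))
gelu = lambda x: 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

Y_ref = gelu(X @ W1) @ W2              # single-GPU reference

W1_shards = np.split(W1, t, axis=1)    # column partition
W2_shards = np.split(W2, t, axis=0)    # row partition
Y_tp = sum(gelu(X @ W1_shards[i]) @ W2_shards[i] for i in range(t))

print(np.allclose(Y_ref, Y_tp))        # True: sharding leaves the math unchanged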

The Megatron Core tensor-parallel layers implement this via reduce_scatter_to_sequence_parallel_region and related collectives in megatron/core/tensor_parallel/layers.py.

Column-wise and row-wise sharding in attention and MLP

Column-wise sharding partitions a weight matrix along its output (column) dimension: each rank computes a subset of output features independently and holds the full input. Row-wise sharding partitions along the input (row) dimension: each rank holds a subset of input features and accumulates partial dot products that must be summed. Chaining column-wise → row-wise within a single layer means the input broadcast happens once at the column layer, the partial results flow through the activation function locally, and the all-reduce happens once after the row layer — two collectives per MLP block regardless of $t$.

For multi-head attention, the QKV projection is column-partitioned (each rank holds a subset of heads), and the output projection is row-partitioned. Attention computation per head is local to each rank. The Hugging Face Megatron-LM integration guide describes the sequence-parallel variant: "To put it simply, it shards the outputs of each transformer layer along sequence dimension."

Pro Tip: With sequence parallelism enabled, the all-reduce after the row-wise linear is replaced by a reduce-scatter (producing sequence-sharded output) and the all-reduce before the column-wise linear is replaced by an all-gather. The Hugging Face guide notes: "As all-reduce = reduce-scatter + all-gather, this saves a ton of activation memory at no added communication cost." The key insight is that the sequence-sharded activations between transformer sub-layers need not be reassembled until a collective is already required.

Communication cost: all-reduce, reduce-scatter, and overlap

Every TP forward pass issues two collective operations per transformer layer (or four in the SP variant — reduce-scatter and all-gather in each direction). NCCL provides the underlying implementations: "NCCL provides efficient implementations of all-reduce, all-gather, reduce-scatter, broadcast, and reduce" and selects algorithms based on hardware topology and message size automatically.

When TP is confined to a single NVLink node, NVSwitch bandwidth (up to 900 GB/s bidirectional on H100 SXM) absorbs these collectives cheaply. Crossing InfiniBand with TP is where latency compounds: each layer now waits on a cross-node fabric round trip per collective, and at large $t$ the per-layer communication can exceed the compute time for moderate hidden dimensions.
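A back-of-envelope estimate helps when deciding whether a given TP degree will be communication-bound. The sketch below uses the standard ring all-reduce cost of roughly 2(t-1)/t bytes moved per byte of activation and two collectives per layer; the sequence length, hidden size, and precision are assumptions for illustration, and real NCCL behavior additionally depends on algorithm choice and overlap.

# Rough per-layer TP communication volume, per microbatch. Illustrative
# arithmetic only; actual traffic depends on NCCL algorithm selection,
# SP settings, and how much communication overlaps with compute.
def tp_bytes_per_layer(seq_len, hidden, tp, bytes_per_elem=2, collectives=2):
    activation_bytes = seq_len * hidden * bytes_per_elem
    ring_factor = 2 * (tp - 1) / tp            # ring all-reduce cost
    return collectives * ring_factor * activation_bytes

for tp in (2, 4, 8):                           # 4,096 tokens, hidden 8,192, BF16
    gb = tp_bytes_per_layer(4096, 8192, tp) / 1e9
    print(f"TP={tp}: ~{gb:.2f} GB moved per layer per microbatch")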

Watch Out: Increasing TP degree beyond your NVLink domain boundary does not just add communication — it changes the communication latency from nanoseconds (NVSwitch) to microseconds (InfiniBand). At TP=16 across two nodes, each transformer layer's forward pass issues collectives that cross InfiniBand four times (with SP). Profile with NCCL_DEBUG=INFO before assuming TP=16 outperforms TP=8 with more DP replicas.


Pipeline parallelism and stage scheduling

Pipeline parallelism partitions the transformer's layers into stages across GPUs so that no single GPU holds the full model depth, and streams microbatches through those stages to keep them busy. Stage 0 executes layers 0 through $L/P - 1$, sends its output activation tensor to stage 1, which executes the next layer block, and so on. The activation transfer between stages is a peer-to-peer send/receive over the cluster fabric, which is significantly cheaper than a collective but introduces a sequential dependency that creates pipeline bubbles.
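The default contiguous layer-to-stage assignment can be written down in a few lines. The helper below reproduces the split used in the diagram earlier (24 layers over 2 stages); it is a minimal sketch of the uniform case only, whereas Megatron-LM's actual assignment logic also handles non-uniform splits and interleaved virtual stages.

# Uniform layer-to-stage partition: stage p owns layers [p*L/P, (p+1)*L/P).
# Sketch of the default behavior described above, not Megatron's own code.
def layers_for_stage(num_layers: int, pp_size: int, stage: int) -> range:
    assert num_layers % pp_size == 0, "uniform split assumes L divisible by P"
    per_stage = num_layers // pp_size
    return range(stage * per_stage, (stage + 1) * per_stage)

for stage in range(2):
    layers = layers_for_stage(24, 2, stage)
    print(f"PP stage {stage}: layers {layers.start}-{layers.stop - 1}")
# PP stage 0: layers 0-11
# PP stage 1: layers 12-23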

Bottom Line: Pipeline parallelism is the correct axis when the model does not fit on one TP group even at the maximum feasible TP degree. The Megatron Bridge parallelism guide is direct: "Consider pipeline parallelism for very large models that don't fit on a single GPU." The bubble overhead is real but manageable if you schedule enough microbatches relative to pipeline depth.

Microbatching, bubbles, and schedule choices

A pipeline bubble is idle time on a stage GPU while it waits for its upstream stage to produce activations. For a simple 1F1B (one-forward-one-backward) schedule with $P$ stages, the bubble fraction is approximately $(P-1)/(m + P - 1)$ where $m$ is the number of microbatches per global batch. With $P = 8$ stages and $m = 8$ microbatches, the bubble fraction is $\approx 47\%$. Increasing to $m = 64$ drops it to $\approx 10\%$.

Megatron-LM supports interleaved pipeline schedules (virtual stages) that split each physical stage into $v$ virtual stages, reducing the bubble fraction to approximately $(P-1)/(mv + P - 1)$. Interleaving trades reduced bubble time for increased activation memory because more microbatches must be live simultaneously.
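The two bubble formulas are simple enough to keep in a scratch calculator when sizing a job. This sketch reproduces the figures quoted above and shows how interleaving changes the picture at a fixed microbatch count.

# Bubble fraction for a 1F1B schedule, (P-1)/(m + P - 1), and its interleaved
# variant with v virtual stages, (P-1)/(m*v + P - 1).
def bubble_fraction(p: int, m: int, v: int = 1) -> float:
    return (p - 1) / (m * v + p - 1)

print(f"P=8, m=8:       {bubble_fraction(8, 8):.0%}")       # ~47%
print(f"P=8, m=64:      {bubble_fraction(8, 64):.0%}")      # ~10%
print(f"P=8, m=16, v=2: {bubble_fraction(8, 16, v=2):.0%}") # ~18%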

Watch Out: Choosing $m$ too small to hit the memory budget while assuming the bubble is acceptable is a common misconfiguration. At PP=8 with $m=4$, the bubble fraction is roughly 64%, so the pipeline runs at about 36% utilization before accounting for any other overhead. The right trade-off is a function of your global batch size requirement — if your batch size forces $m < P$, reconsider the PP depth.

When pipeline depth helps and when it hurts

Pipeline depth helps precisely when the model exceeds single-node memory even after applying maximum TP and optimizer sharding. A 530B parameter model at BF16 occupies roughly 1 TB of weight memory alone; no single 8-GPU H100 node (640 GB of aggregate HBM) can hold it at any TP setting, so PP is unavoidable.

Pipeline depth hurts when:

  • Stages are imbalanced (embedding layers are lightweight; later layers with heavy attention at long sequences are heavier), creating a bottleneck stage that stalls all others.

  • The number of microbatches is constrained by the global batch size, driving bubble overhead above 30%.

  • The activation tensors crossing stage boundaries are large (long sequences, large hidden dimensions), saturating the InfiniBand link that also carries DP gradient traffic.

Production Note: Virtual pipeline stages (interleaving) are Megatron-LM's primary tool for attacking bubble overhead without increasing global batch size. Setting --num-layers-per-virtual-pipeline-stage reduces the effective bubble but increases peak activation memory per pipeline rank. At scale on NVIDIA H100 clusters, the combination PP=8 with $v=2$ virtual stages and $m=16$ microbatches is a common starting configuration, but the right answer depends on model depth, sequence length, and network topology.


Sequence parallelism for activation memory reduction

Sequence parallelism addresses an activation-memory problem that tensor parallelism alone does not solve. TP shards weight matrices but, in the regions between sub-layers (after layer norm, before the next column-wise linear), the full-sequence activation tensors still exist unsharded on each TP rank. This is the memory pressure that SP is designed to reduce.

The Megatron Bridge guide states: "Sequence Parallelism distributes the sequence dimension across multiple GPUs, reducing activation memory." The arXiv paper 2405.07719 formally defines this as dividing the sequence dimension of input tensors across multiple computational devices.

How SP complements tensor parallelism

Sequence parallelism is not a standalone strategy — it is an augmentation of the TP process group. The same TP communicator handles both the weight-parallel collectives (reduce-scatter/all-gather for the linear layers) and the sequence-dimension transitions between sub-layers. There is no new process group, no new communication topology.

When SP is active, the activation between sub-layers is sequence-sharded: each TP rank holds contiguous sequence positions. Before a column-wise linear (which needs the full sequence to compute its partial output features), Megatron Core issues an all-gather on the TP group to reconstitute the full sequence. After the row-wise linear, it issues a reduce-scatter to both sum partial outputs and immediately re-shard along the sequence axis, leaving the next sub-layer with sequence-sharded activations again.
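The collective substitution is easier to see with plain arrays. The NumPy sketch below emulates the pattern: summing the ranks' partial outputs plays the role of the all-reduce, the reduce-scatter leaves each rank with only its sequence chunk of that sum, and a later all-gather rebuilds the full sequence before the next column-parallel linear. It is an emulation of the data movement, not of Megatron Core's communication code.

# Emulated SP collectives on one layer's worth of partial outputs.
import numpy as np

t, s, d = 4, 8, 6                               # TP degree, seq len, hidden size
rng = np.random.default_rng(1)
partials = [rng.normal(size=(s, d)) for _ in range(t)]   # one per TP rank

all_reduced = sum(partials)                     # what a plain all-reduce yields
rank_shards = np.split(all_reduced, t, axis=0)  # reduce-scatter: chunk i stays on rank i

regathered = np.concatenate(rank_shards, axis=0)  # all-gather before next column-linear
print(np.allclose(regathered, all_reduced))       # True: same result, less memory held in between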

As the Hugging Face guide confirms: "Enable sequence parallelism when using tensor parallelism to reduce activation memory."

Pro Tip: SP's memory benefit scales linearly with TP degree: at TP=8, the activations that TP alone leaves unsharded (the layer-norm and dropout regions between sub-layers) become 8× smaller per rank. The throughput cost is near-zero because reduce-scatter and all-gather together complete the same total data movement as a standalone all-reduce. Enabling SP whenever TP ≥ 4 is almost always the right call unless you have a specific reason to keep activations unsharded.

Sequence parallelism versus context parallelism

SP and CP are frequently confused because both operate on the sequence dimension, but they solve fundamentally different problems at different sequence-length regimes.

| Dimension | Sequence Parallelism (SP) | Context Parallelism (CP) |
| --- | --- | --- |
| What it shards | Activations between sub-layers | Attention computation across the sequence |
| Primary benefit | Activation memory reduction | Enables longer sequences per GPU |
| Works with | TP process group (no new communicator) | Separate CP ring communicator |
| Collective pattern | All-gather + reduce-scatter per linear | Ring attention send/recv |
| Recommended when | Always with TP ≥ 4 | Sequence length > NVLink-node capacity |
| NVIDIA guidance | Activation memory reduction | Long sequence training |

The Megatron Core developer docs and the Megatron Bridge guide are explicit about this split: "Use context parallelism for long sequence training." CP partitions the KV cache and attention scores across CP ranks using a ring pattern, which is architecturally distinct from SP's activation-sharding role. The arXiv paper 2405.07719 also notes that different SP approaches carry different communication costs and robustness characteristics, which is why NVIDIA chose a separate CP axis rather than extending SP to cover long-context attention.


Optimizer sharding, recomputation, and memory economics

A 70B parameter model in BF16 occupies ~140 GB of weight memory. With mixed-precision Adam, optimizer states (FP32 master weights, first moment, second moment) add another ~840 GB. Activations for a 4,096-token sequence at 80 layers without recomputation add tens of GB more. No combination of TP and PP alone eliminates this; optimizer sharding and activation recomputation are required as complementary tools.
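The figures above follow from simple byte accounting, reproduced in the sketch below for a 70B-parameter model with BF16 weights and mixed-precision Adam (an FP32 master copy plus two FP32 moments). The numbers are rough and ignore gradients, padding, and fragmentation.

# Byte accounting behind the ~140 GB / ~840 GB figures quoted above.
params = 70e9
weight_gb = params * 2 / 1e9                # BF16 weights: 2 bytes per parameter
optimizer_gb = params * (4 + 4 + 4) / 1e9   # FP32 master + Adam m + Adam v

print(f"BF16 weights:      ~{weight_gb:.0f} GB")    # ~140 GB
print(f"Adam state (FP32): ~{optimizer_gb:.0f} GB") # ~840 GB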

| Strategy | Memory saved | Compute cost | Communication added |
| --- | --- | --- | --- |
| TP=8 (weights only) | Weights / 8 per rank | None (local) | 2 collectives per layer |
| SP enabled | Non-TP activations / 8 per rank | None | Absorbed into TP collectives |
| PP=4 | Weights / 4 (across stages) | Bubble overhead | P2P activation transfers |
| Distributed optimizer (DP sharding) | Optimizer states / DP degree | Gather at update | Scatter/gather at step |
| Full activation recomputation | All activations (keep only inputs) | ~33% extra FLOPs | None |
| Selective recomputation | Expensive activations only | ~5–10% extra FLOPs | None |

Megatron Core's distributed data parallel implementation (megatron/core/distributed/distributed_data_parallel.py) shards optimizer states across the DP group, mirroring ZeRO-1/ZeRO-2 semantics. This is separate from and composable with TP and PP.

Activation recomputation versus throughput

Activation recomputation (checkpoint recomputation) discards intermediate activations during the forward pass and recomputes them from layer inputs during the backward pass. As confirmed by Megatron-LM benchmarking references: "To reduce GPU memory usage when training a large model, we support various forms of activation checkpointing and recomputation."

Full recomputation adds roughly one additional forward pass of compute per training step — approximately 33% overhead in a compute-balanced regime. Selective recomputation targets only the activations with the highest memory-to-recompute-cost ratio (typically attention softmax intermediates and dropout masks), reducing overhead to 5–10% at the cost of moderate activation memory.
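For readers who want to see the mechanism outside Megatron, the generic PyTorch utility below shows the recomputation idea: the wrapped block's intermediate activations are dropped in the forward pass and recomputed during backward. This is a plain torch.utils.checkpoint illustration of the concept, not Megatron-LM's own recompute configuration.

# Concept demo of activation recomputation with PyTorch's generic utility.
# Gradients match the non-checkpointed version; only peak memory and compute change.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
)
x = torch.randn(4, 512, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # activations recomputed in backward
y.sum().backward()
print(x.grad.shape)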

Watch Out: Full recomputation at TP=8 + PP=8 becomes counterproductive when compute utilization is already above 80% MFU (model FLOP utilization). You are paying 33% compute overhead to recover memory that could instead be freed by tuning SP degree or reducing microbatch size. Profile FLOP utilization with and without recomputation before committing to it as a default.

Where distributed optimizer sharding fits in

Megatron Core's distributed optimizer shards FP32 optimizer states across the DP group, so each DP rank holds optimizer states only for the parameters it owns within the DP partition. This is the primary tool for attacking the optimizer-state memory floor that neither TP nor PP addresses.

The interaction with TP and PP is indirect but important: TP shards weight tensors, so each TP rank holds a different weight slice; the DP group then shards optimizer states over copies of those slices. Increasing TP degree reduces weight memory per rank AND reduces optimizer-state memory per rank in the distributed optimizer, making TP × distributed optimizer the highest-leverage combination for parameter memory.
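The leverage of that combination is visible in per-rank byte counts. The sketch below divides weight memory by the TP degree and optimizer-state memory by TP × DP, as the distributed optimizer does for the states attached to each shard; it is illustrative arithmetic that ignores activations, gradients, and communication buffers.

# Per-rank parameter and optimizer-state memory under TP plus the distributed
# optimizer. Rough arithmetic only.
def per_rank_gb(params, tp, dp, weight_bytes=2, opt_bytes=12):
    weights = params * weight_bytes / tp / 1e9
    optimizer = params * opt_bytes / tp / dp / 1e9
    return weights, optimizer

w, o = per_rank_gb(70e9, tp=8, dp=16)
print(f"weights/rank: ~{w:.1f} GB, optimizer state/rank: ~{o:.1f} GB")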

Production Note: Distributed optimizer state sharding interacts with gradient bucket sizing and communication overlap scheduling. Setting bucket sizes too large delays optimizer updates; too small increases collective overhead. The default Megatron Core configuration balances these, but multi-node deployments with heterogeneous network fabric (e.g., mixed NVLink and HDR InfiniBand) may require explicit bucket tuning to maintain compute/communication overlap during optimizer steps.


Practical engineering constraints from the official repo

The official NVIDIA/Megatron-LM repository identifies two installation paths and one migration tool that practitioners need to know before touching a cluster.

Bottom Line: PyPI installation is faster for evaluation; source builds are required for custom CUDA extensions and Transformer Engine integration. Megatron Bridge is the official checkpoint conversion path when moving models between Hugging Face and Megatron format — mismatched parallelism layouts during conversion silently produce incorrect weights.

PyPI install versus source build

# PyPI install — fastest path, no custom kernel compilation
$ pip install megatron-core

# Source build — required for Transformer Engine, custom fused kernels
$ git clone https://github.com/NVIDIA/Megatron-LM.git
$ cd Megatron-LM

# Limit parallel compilation jobs to avoid OOM during build
$ MAX_JOBS=4 pip install -e .

The repo is explicit: "Building from source can be memory-intensive." On a machine with limited CPU RAM (under 64 GB), compilation of CUDA extensions with the default job count can exhaust system memory. Setting MAX_JOBS=4 or lower prevents OOM kills during pip install. The PyPI package (megatron-core) provides the core library without the reference training scripts and does not compile CUDA extensions by default, making it suitable for framework integration work but insufficient for production training runs that require fused kernels.

Checkpoint conversion with Megatron Bridge

Megatron Bridge is NVIDIA's bidirectional conversion tooling for checkpoints between Hugging Face format and Megatron's sharded format.

The following configuration example shows the required parallelism layout for conversion.

# Example Megatron Bridge conversion config (megatron_bridge_config.yaml)
# Direction: Hugging Face → Megatron Core sharded checkpoint
source_format: huggingface
target_format: megatron_core
model_type: llama          # architecture identifier
tensor_parallel_size: 8    # must match target training config
pipeline_parallel_size: 4  # must match target training config
sequence_parallel: true    # must match target SP setting
bf16: true                 # precision must be consistent

The critical constraint is that parallelism parameters in the conversion config must exactly match the target training configuration. A mismatch between tensor_parallel_size in the Bridge config and the value used at training launch does not raise an error during conversion — it produces weight tensors with wrong shard boundaries that cause silent numerical errors during training. The Bridge docs also carry configuration guidance for sequence_parallel and context parallelism settings, reinforcing that conversion is layout-aware, not just weight-format-aware.
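Because the failure is silent, a cheap pre-flight check is worth automating. The helper below is hypothetical (it is not part of Megatron Bridge): it reads the conversion YAML shown above and compares the parallelism keys against the values you plan to pass at training launch, failing loudly on any mismatch.

# Hypothetical pre-flight check, not a Megatron Bridge API: compare the
# conversion config's parallelism layout against the intended launch values.
import yaml  # requires PyYAML

def assert_layout_matches(bridge_config_path: str, launch_args: dict) -> None:
    with open(bridge_config_path) as f:
        cfg = yaml.safe_load(f)
    for key in ("tensor_parallel_size", "pipeline_parallel_size", "sequence_parallel"):
        if cfg.get(key) != launch_args.get(key):
            raise ValueError(
                f"{key}: conversion config has {cfg.get(key)!r}, "
                f"training launch uses {launch_args.get(key)!r}"
            )

assert_layout_matches(
    "megatron_bridge_config.yaml",
    {"tensor_parallel_size": 8, "pipeline_parallel_size": 4, "sequence_parallel": True},
)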


Where Megatron-LM fits against FSDP and DeepSpeed

| Dimension | Megatron-LM / Megatron Core | PyTorch FSDP2 | DeepSpeed ZeRO |
| --- | --- | --- | --- |
| Parallelism model | TP + PP + DP + EP + CP, composable | Pure sharding (ZeRO-3 style) | ZeRO-1/2/3 + optional TP/PP via Megatron-DeepSpeed |
| Memory strategy | Shard weights by axis; optimizer sharding per DP | Shard all: weights, grads, optimizer | Shard weights/grads/optimizer; optional offload |
| Communication pattern | Multi-collective (TP), P2P (PP), all-reduce (DP) | All-gather + reduce-scatter per param | All-gather + reduce-scatter (ZeRO-3), all-reduce (ZeRO-1/2) |
| Framework coupling | Megatron Core integrates; Megatron-LM is opinionated | PyTorch-native, low integration cost | Wraps existing models; engine API |
| Best regime | Large pre-training, MoE, long context | Fine-tuning, moderate scale | Fine-tuning to mid-scale pre-training |

The arXiv study 2311.03687 directly compares DeepSpeed and Megatron-LM training, evaluating ZeRO-2, ZeRO-3, offloading, activation recomputation, and FlashAttention across training configurations. The Megatron-DeepSpeed fork (github.com/deepspeedai/Megatron-DeepSpeed) combines both stacks: "DeepSpeed version of NVIDIA's Megatron-LM that adds additional support for several features such as MoE model training, Curriculum Learning, 3D Parallelism, and others." This is a distinct codebase from NVIDIA's Megatron-LM.

Best fit by workload, cluster, and team maturity

  • Choose Megatron Core when you are running pre-training at scale (>100B parameters), need explicit TP/PP/CP composition, run on an NVIDIA NVLink cluster, and your team can invest in the integration layer. The Megatron Core documentation confirms its design target is custom training frameworks requiring GPU-optimized building blocks.

  • Choose FSDP2 when you are fine-tuning an existing Hugging Face model at up to ~70B parameters, your team is PyTorch-native, and you want the lowest integration cost. FSDP2's all-gather/reduce-scatter pattern is simpler to reason about and debug than Megatron's multi-axis collective graph.

  • Choose DeepSpeed ZeRO when memory is the binding constraint and you need ZeRO-3 CPU offload to train a model that does not fit on GPU memory even with sharding. DeepSpeed's offload path has no Megatron Core equivalent.

  • Choose Megatron-DeepSpeed when you need MoE with curriculum learning or want 3D parallelism without building the Megatron Core integration from scratch.

  • Choose Megatron-LM (the reference scripts) when you want NVIDIA's exact pre-training configuration for GPT-style models and are willing to stay inside the script's assumptions.


Failure modes and scaling gotchas

Watch Out: Mathematical correctness and operational correctness are independent properties of a Megatron configuration. A configuration with correct TP/PP/DP sizes, valid layer partitioning, and matching SP flags can still run at 30% of theoretical MFU if NCCL collectives are not overlapped with compute, InfiniBand is contended, or microbatch count is too low for the pipeline depth. Preserving model math does not guarantee throughput.

NCCL collective imbalance is the most common failure mode at scale. At TP=8 + PP=8 across 512 GPUs, NCCL manages multiple distinct communicators: the TP all-reduce/reduce-scatter group, the PP point-to-point group, and the DP gradient-sync group. If any of these contend on the same InfiniBand rail without proper affinity pinning, step time inflates non-deterministically.

Diagnosing communication hot spots

Pro Tip: Set NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL to log every collective operation with size and algorithm selection. Pipe the output to per-process log files (NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log, using NCCL's hostname and PID substitutions) and grep for RING versus TREE algorithm selection — NCCL choosing TREE for large TP collectives over InfiniBand can be a signal to check NCCL_IB_HCA binding and NCCL_SOCKET_IFNAME for correct interface selection.

On multi-node deployments over InfiniBand, verify that each GPU maps to the InfiniBand HCA on the same PCIe root complex. Mismatched GPU-to-HCA affinity can halve effective bandwidth for TP collectives that must cross the InfiniBand fabric. Use nvidia-smi topo -m to confirm GPU–NIC topology before launching any multi-node Megatron job.

NCCL's algorithm selection is aware of node-local and NIC topology but not of cluster-wide fabric conditions: it cannot know that your InfiniBand fabric has a particular congestion bottleneck. At very large scales (1,000+ GPUs), collective hot spots typically appear only in production runs, not in single-node or small-scale tests. Monitoring step time variance across training steps is a more reliable indicator than any single profiling run.

When model math is unchanged but performance collapses

Three configuration mistakes consistently produce mathematically valid but operationally poor runs:

Stage imbalance in PP: Assigning the embedding layer and first two transformer layers to stage 0 while assigning 20 heavy layers to stage 1 creates a bottleneck stage. The pipeline stalls on stage 1's compute, and all other stages idle. Megatron-LM's layer assignment is uniform by default; non-uniform layer counts require explicit configuration.

TP width exceeding NVLink domain: At TP=16 on a cluster where each node has 8 GPUs, TP collectives cross InfiniBand. Collective latency for a 512 MB all-gather at InfiniBand speeds is measurable in milliseconds — long enough to stall the compute stream on a layer with moderate FLOPs. The model math remains correct; throughput collapses.

Over-recomputation at high MFU: Enabling full activation recomputation when the cluster is compute-bound rather than memory-bound adds ~33% backward-pass compute with no memory headroom gain (since memory was not the constraint). Selective recomputation targeting only attention softmax intermediates is the correct default for a GPU running above 50% MFU.

Watch Out: Always profile MFU (model FLOP utilization) before and after enabling recomputation, SP, or changing TP/PP configuration. A configuration change that improves memory headroom but drops MFU from 45% to 30% is a net loss for training cost per token.


FAQ on Megatron-LM parallelism

What is tensor parallelism in Megatron-LM?

Tensor parallelism shards individual weight matrices (QKV projections, MLP weights) across a group of GPUs within a single layer. Each GPU computes a partial result; a collective (all-reduce or reduce-scatter) assembles the full result. The transformer's mathematical output is identical to single-GPU computation.

How does pipeline parallelism work in Megatron-LM?

Pipeline parallelism assigns contiguous groups of transformer layers to distinct GPU stages. Microbatches stream through stages sequentially. Stage-to-stage activation tensors transfer via NCCL point-to-point sends. Bubble overhead is managed by increasing microbatch count or using interleaved (virtual) pipeline schedules.

What is sequence parallelism in Megatron-LM?

Sequence parallelism shards activation tensors along the sequence dimension between transformer sub-layers, using the same process group as tensor parallelism. It is enabled alongside TP to reduce activation memory. It does not partition the attention computation — that is context parallelism's role.

How does Megatron-LM differ from DeepSpeed?

Megatron-LM (and Megatron Core) use model parallelism — TP splits weight tensors, PP splits layers. DeepSpeed ZeRO uses pure weight/gradient/optimizer-state sharding across all data-parallel ranks without changing the forward-pass computation graph. Megatron Core is composable building blocks; DeepSpeed wraps existing models. They can be combined in the Megatron-DeepSpeed fork.

Can Megatron-LM use tensor and pipeline parallelism together?

Yes. Megatron Core treats TP and PP as orthogonal process-group dimensions. A job with TP, PP, and DP values chosen for the cluster uses separate NCCL communicators for each axis. SP layers on top of the TP group without new communicators, and CP adds its own ring communicator for long-sequence attention.

Bottom Line: Sequence parallelism does not replace tensor parallelism — it is an activation-memory augmentation that runs within the TP process group. Context parallelism is the separate axis for long-sequence attention scaling. Megatron Core's composability lets you combine all five axes (TP, PP, DP, EP, CP) simultaneously, but correct throughput requires balancing all five against your hardware fabric and batch-size constraints.


Sources and references

Production Note: The NVIDIA/Megatron-LM GitHub repository and NVIDIA's developer documentation are the authoritative sources for Megatron Core's current API and parallelism semantics. ArXiv papers cover the algorithmic foundations and comparative evaluations. Megatron Bridge documentation is the canonical reference for checkpoint conversion.


