FSDP vs DeepSpeed in Accelerate: how to choose sharding, offload, and checkpointing settings

Accelerate maps FSDP FULL_SHARD to DeepSpeed ZeRO stage 3, but the two stacks diverge on offload and checkpointing. FSDP offload is all-or-nothing, while DeepSpeed can split parameter and optimizer offload and can also target NVMe. FSDP can checkpoint sharded state directly, whereas ZeRO-3 often needs a consolidation or post-conversion step, which changes the operational cost of saving 70B fine-tunes.

By AxiomLogica Editorial

May 17, 2026 · 20 min read

Reviewed by Editorial

How we compared FSDP and DeepSpeed in Accelerate

What is the difference between FSDP and DeepSpeed in Accelerate? The Hugging Face Accelerate documentation provides a direct concept guide comparing the two stacks on six dimensions: sharding strategy, offload behavior, model loading, checkpointing, prefetching, and parameter synchronization. That guide is the primary source for this comparison, scoped to single-node, multi-GPU usage — which is the only regime the official guide covers directly.

The central mapping the guide establishes: "FSDP FULL_SHARD maps to DeepSpeed ZeRO stage 3". That equivalence is useful but incomplete — the two stacks diverge on every other dimension in the table below.

Dimension	PyTorch FSDP	DeepSpeed ZeRO-3
Sharding stage equivalent	`FULL_SHARD`	ZeRO stage 3
CPU offload granularity	All-or-nothing	Parameters and optimizer independently
NVMe offload	Not supported	Supported (ZeRO-Infinity family)
Recommended checkpoint format	`SHARDED_STATE_DICT`	Partitioned shards + `zero_to_fp32.py`
Config surface	Plugin args + `accelerate launch`	Plugin args + DeepSpeed JSON config file
RAM-efficient model loading	`fsdp_cpu_ram_efficient_loading` (Transformers models only)	`zero3_init_context_manager`

Bottom Line: For 70B fine-tunes on a single node, FSDP is the lower-friction path when your priority is sharded checkpointing and simpler config. DeepSpeed ZeRO-3 earns its place when you need to separate parameter and optimizer offload, target NVMe, or fit model state that exceeds host RAM. The two stacks are not interchangeable behind a flag flip: offload semantics, checkpoint formats, and config surfaces differ materially.

At-a-glance choice for 70B fine-tunes

PyTorch FSDP shards parameters, gradients, and optimizer states across data-parallel workers to reduce per-GPU memory at the cost of all-gather and reduce-scatter communication during the forward and backward passes. DeepSpeed ZeRO-3 applies the same partitioning logic but adds a richer offload stack and a separate JSON-driven config layer.

For a Llama 3.1 70B fine-tune, the practical choice hinges on two questions: how you want to checkpoint, and whether you need granular CPU/NVMe offload.

Criterion	Prefer FSDP	Prefer DeepSpeed ZeRO-3
Checkpoint simplicity	✅ Sharded state dicts, no conversion step	❌ Requires `zero_to_fp32.py` for consolidated weights
Offload flexibility	❌ CPU offload is all-or-nothing	✅ Separate param and optimizer offload
NVMe offload	❌ Not available	✅ Supported
Config surface	✅ Plugin args only	❌ Plugin args + DeepSpeed JSON
Transformers integration	✅ `fsdp_cpu_ram_efficient_loading` available	✅ `zero3_init_context_manager` available
Checkpoint reload speed (70B)	✅ Distributed load across ranks	⚠️ Single-rank consolidation can be slow

Decision matrix:
- Choose FSDP when GPU memory is sufficient with full CPU offload (or no offload), you want sharded checkpoints without a post-conversion step, and operational simplicity matters.
- Choose DeepSpeed ZeRO-3 when parameters and optimizer states need independent offload targets, when NVMe offload is the only way to fit the model, or when you need a consolidated 16-bit weight file for inference portability.
- No clear winner when both fit in GPU memory with ZeRO-3/FULL_SHARD sharding and you have no strong checkpoint or offload requirement. In that case, benchmark both for your specific batch size and sequence length.

When the default answer is FSDP

At a Glance: Time: minutes to configure · Prereqs: Transformers-based model, Accelerate FSDP plugin · Hardware: single-node multi-GPU with enough VRAM for FULL_SHARD · Cost: lowest operational overhead when checkpointing is sharded

FSDP is the right default for teams that want fewer moving parts and predictable checkpoint behavior. The Accelerate FSDP usage guide states: "The new recommended way of checkpointing when using FSDP models is to use SHARDED_STATE_DICT as StateDictType." Sharded state dicts distribute the save and load work across all ranks — no single rank needs to hold the full 70B parameter set in memory during a checkpoint operation.

FSDP does support CPU offload, but only as a binary switch: either all parameters and optimizer states move to CPU, or none do. You cannot offload just the optimizer and keep parameters on GPU. Whether that limitation bites depends on how close the run is to the memory ceiling: an 8×A100 80GB node can fit Llama 3.1 70B in bf16 with FULL_SHARD and no CPU offload only if optimizer-state precision, sequence length, and activation checkpointing keep the per-GPU footprint under 80 GB. If it does not fit, FSDP will move everything to CPU, which adds substantial PCIe transfer overhead on every step.

Pro Tip: Set fsdp_cpu_ram_efficient_loading: true in your Accelerate FSDP plugin config when loading a pretrained 70B checkpoint. This flag causes only the first process to load the checkpoint from disk while all other processes initialize with empty weights — Accelerate then shards and scatters. Without it, every rank loads the full checkpoint simultaneously, potentially exhausting host RAM. Note this option is only applicable for Transformers models.

When DeepSpeed earns its complexity budget

DeepSpeed ZeRO-3 is the right choice when you need to control offload at a finer granularity than FSDP allows. As the DeepSpeed ZeRO-Offload tutorial defines it: "ZeRO-Offload is a ZeRO optimization that offloads the optimizer memory and computation from the GPU to the host CPU." ZeRO-3 Offload extends this to parameters as well, and ZeRO-Infinity adds NVMe as a third tier.

Offload tier	DeepSpeed capability	FSDP equivalent
Optimizer only → CPU	Supported with `offload_optimizer`	Not possible without also offloading params
Parameters only → CPU	Supported with `offload_param`	Not possible as a separate choice
Both → CPU	Supported	Supported, but only as a single binary switch
CPU → NVMe	Supported through ZeRO-Infinity	Not supported
Fine-grained control	Independent toggles per component	All-or-nothing

The practical upside: you can keep parameters on GPU (fast access during forward/backward) while offloading only optimizer states to CPU. For 70B fine-tunes with AdamW, the two moment states alone account for roughly 4 bytes × 2 states × 70B parameters = ~560 GB in fp32. That is most of the 640 GB aggregate on an 8×80 GB node before parameters, gradients, and activations are counted. Offloading optimizer states to CPU while keeping sharded parameters on GPU is a viable middle ground that FSDP cannot express.
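
The optimizer-state arithmetic can be checked with a short sketch. The helper name `adam_state_gb` and the 8×80 GB node are illustrative assumptions, not figures taken from the Accelerate or DeepSpeed documentation.

```python
# Rough AdamW moment-state size, assuming fp32 states (4 bytes each)
# and two states per parameter (first and second moment).
BYTES_FP32 = 4
ADAM_STATES = 2


def adam_state_gb(num_params: float) -> float:
    """Return AdamW moment-state size in decimal gigabytes."""
    return num_params * BYTES_FP32 * ADAM_STATES / 1e9


state_gb = adam_state_gb(70e9)

# Hypothetical node: 8 GPUs with 80 GB each (an assumption for illustration).
node_gb = 8 * 80

print(f"AdamW moment states: {state_gb:.0f} GB")
print(f"Aggregate GPU memory: {node_gb} GB")
print(f"Share of node used by optimizer states alone: {state_gb / node_gb:.0%}")
```

The estimate ignores fp32 master weights, gradients, and activations, so the real footprint is higher.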

Watch Out: DeepSpeed's NVMe offload via ZeRO-Infinity requires careful configuration of the offload_param and offload_optimizer keys in the DeepSpeed JSON config, including nvme_path. NVMe bandwidth directly caps your training throughput when the optimizer step triggers NVMe reads/writes. Consumer NVMe (PCIe 3.0 ×4) at ~3 GB/s will noticeably throttle step time for 70B models. Enterprise NVMe in RAID or PCIe 5.0 configurations mitigate this but add infrastructure cost. Measure actual step time before committing to NVMe offload as a production configuration.
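
To see why drive bandwidth matters, here is a back-of-the-envelope floor on optimizer-step time. It is a sketch under strong assumptions: the whole optimizer state is read once and written once per step, all ranks share the stated bandwidth, and nothing overlaps with compute. The function name and the 12 GB/s figure are hypothetical.

```python
def nvme_step_floor_seconds(state_gb: float, bandwidth_gb_s: float) -> float:
    """Lower bound on optimizer-step time when state streams through NVMe.

    Assumes the full optimizer state is read once and written once per step,
    with no overlap against compute and no caching in host RAM.
    """
    if bandwidth_gb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return 2 * state_gb / bandwidth_gb_s


# 560 GB of fp32 AdamW moment states for a 70B model (4 bytes x 2 states x 70B).
state_gb = 560.0

for label, bandwidth in [("~3 GB/s drive", 3.0), ("hypothetical 12 GB/s array", 12.0)]:
    seconds = nvme_step_floor_seconds(state_gb, bandwidth)
    print(f"{label}: at least {seconds:.0f} s per optimizer step")
```

Real runs overlap I/O with compute and may keep part of the state in host RAM, so treat the result as a reason to measure rather than as a prediction.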

Sharding behavior and memory trade-offs

Both stacks partition model state across ranks to reduce per-GPU memory. The mapping is conceptually clean at the top level but diverges in semantics below the sharding stage number.

FSDP sharding strategy	Memory profile	DeepSpeed equivalent	Fit scenario
`NO_SHARD`	Full model replica per GPU	ZeRO stage 0	Small models or baselines only
`SHARD_GRAD_OP`	Shard gradients and optimizer states	ZeRO stage 2	Fits in GPU memory with lower sharding overhead
`FULL_SHARD`	Shard params, grads, and optimizer states	ZeRO stage 3	70B single-node fine-tunes and other memory-bound runs
`HYBRID_SHARD`	Full shard within node, replica across nodes	ZeRO++ stage 3 with `zero_hpz_partition_size` set to the GPUs per node	Multi-node topologies outside this guide

For 70B fine-tunes on a single node, FULL_SHARD / ZeRO stage 3 is the practical sharding mode because the others retain too much state per GPU. HYBRID_SHARD, which the Accelerate FSDP guide maps to ZeRO++ hierarchical partitioning, targets multi-node topologies outside the scope of the Accelerate single-node comparison.

Decision matrix for sharding stage selection:
- Choose FULL_SHARD / ZeRO-3 when the model does not fit with lower sharding stages on your GPU count.
- Choose SHARD_GRAD_OP / ZeRO-2 when the model fits in GPU memory but optimizer states are the bottleneck. Communication overhead is lower than with full sharding.
- Choose NO_SHARD / ZeRO-0 only for baselines or small models; irrelevant for 70B.
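
The stage choice can be made concrete with a per-GPU estimate of model-state memory. This is a sketch that assumes the mixed-precision Adam accounting from the ZeRO paper (16 bytes per parameter in total) and ignores activations and buffers; the `Strategy` class and `per_gpu_gb` helper are hypothetical names.

```python
from dataclasses import dataclass

# Bytes per parameter under mixed-precision Adam, following the ZeRO paper:
# 2 (16-bit params) + 2 (16-bit grads) + 12 (fp32 master weights, momentum,
# variance). Activations and temporary buffers are NOT included.
PARAM_BYTES = 2
GRAD_BYTES = 2
OPTIM_BYTES = 12


@dataclass(frozen=True)
class Strategy:
    fsdp_name: str
    zero_stage: int
    shard_params: bool
    shard_grads: bool
    shard_optim: bool


STRATEGIES = [
    Strategy("NO_SHARD", 0, False, False, False),
    Strategy("SHARD_GRAD_OP", 2, False, True, True),
    Strategy("FULL_SHARD", 3, True, True, True),
]


def per_gpu_gb(num_params: float, num_gpus: int, s: Strategy) -> float:
    """Model-state memory per GPU in decimal GB for a given strategy."""
    def part(nbytes: int, sharded: bool) -> float:
        return nbytes / num_gpus if sharded else nbytes

    bytes_per_param = (
        part(PARAM_BYTES, s.shard_params)
        + part(GRAD_BYTES, s.shard_grads)
        + part(OPTIM_BYTES, s.shard_optim)
    )
    return num_params * bytes_per_param / 1e9


for s in STRATEGIES:
    gb = per_gpu_gb(70e9, 8, s)
    print(f"{s.fsdp_name:<14} (ZeRO-{s.zero_stage}): {gb:,.0f} GB per GPU")
```

Under these assumptions the gap between stages is large, and the FULL_SHARD figure is worth comparing against your per-GPU VRAM before ruling out offload.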

FULL_SHARD and ZeRO stage 3 are similar, not identical

The Accelerate concept guide establishes: "In particular, FSDP FULL_SHARD maps to DeepSpeed ZeRO stage 3." Both shard parameters, gradients, and optimizer states across all data-parallel ranks. During the forward pass, each rank all-gathers the parameters it needs, computes, then discards the gathered tensors. Gradients are reduce-scattered after the backward pass.
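
The gather-compute-discard cycle described above can be illustrated with a toy simulation in plain Python. It models ranks as list entries rather than processes, uses a made-up gradient rule, and sums (rather than averages) in the reduce step; none of the function names come from FSDP or DeepSpeed.

```python
from typing import List


def shard(values: List[float], world_size: int) -> List[List[float]]:
    """Split a flat parameter list into equal contiguous shards, one per rank."""
    if len(values) % world_size != 0:
        raise ValueError("toy example requires evenly divisible shards")
    size = len(values) // world_size
    return [values[r * size:(r + 1) * size] for r in range(world_size)]


def all_gather(shards: List[List[float]]) -> List[float]:
    """Every rank reconstructs the full tensor by concatenating all shards."""
    return [v for s in shards for v in s]


def reduce_scatter(per_rank_grads: List[List[float]]) -> List[List[float]]:
    """Sum full-size gradients across ranks, then give each rank its slice."""
    world_size = len(per_rank_grads)
    summed = [sum(col) for col in zip(*per_rank_grads)]
    return shard(summed, world_size)


world_size = 4
params = [float(i) for i in range(8)]
param_shards = shard(params, world_size)  # each rank stores 2 of 8 values

per_rank_grads = []
for rank in range(world_size):
    full = all_gather(param_shards)                    # materialize full params
    local_grad = [(rank + 1) * 0.1 * p for p in full]  # stand-in for backward
    per_rank_grads.append(local_grad)
    del full                                           # discard gathered copy

grad_shards = reduce_scatter(per_rank_grads)  # each rank keeps 2 of 8 gradients
print(grad_shards)
```

Each rank ends up holding only its slice of the parameters and of the reduced gradients, which is the memory saving both stacks aim for.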

Behavior	FSDP FULL_SHARD	DeepSpeed ZeRO stage 3
Parameter prefetching	Configurable (`fsdp_forward_prefetch`, `fsdp_backward_prefetch`)	Configurable (`stage3_prefetch_bucket_size`)
Parameter synchronization	NCCL all-gather per FSDP unit	NCCL all-gather per layer partition
Offload control	Binary (on/off)	Per-component (params vs. optimizer)
Config location	Accelerate plugin args	Accelerate plugin args + DeepSpeed JSON

The prefetching and synchronization differences mean that even at equivalent sharding stages, raw training throughput can differ for the same model and batch size. Neither stack is universally faster — this is workload-dependent.

Offload semantics that do not translate cleanly

FSDP treats CPU offload as a single binary flag. Enabling it moves all parameters and optimizer states to CPU; disabling it keeps everything on GPU. There is no middle configuration.

DeepSpeed separates the two. The offload_param and offload_optimizer keys in the ZeRO config JSON are independent. You can offload optimizer states to CPU while parameters remain GPU-resident, or offload both. NVMe is available as a target for either component:

Offload scenario	FSDP	DeepSpeed ZeRO-3
No offload	✅	✅
Optimizer states → CPU	❌ (must offload params too)	✅
Params → CPU, optimizer → GPU	❌	✅
Both → CPU	✅	✅
Either → NVMe	❌	✅ (ZeRO-Infinity)

Watch Out: If you migrate a working FSDP config with CPU offload enabled to DeepSpeed, do not assume the memory profile will be identical. DeepSpeed's granular offload means you may need to explicitly configure both offload_param and offload_optimizer to reproduce the same memory reduction. Leaving one unset defaults to GPU-resident, which changes the effective memory footprint.
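
One way to guard against this during a migration is to derive the DeepSpeed offload blocks from the FSDP flag and to check for blocks left unset. The two helpers below are hypothetical and only manipulate plain dicts; the key names follow the zero_optimization layout described in this section.

```python
from typing import Any, Dict, List


def deepspeed_offload_from_fsdp(fsdp_offload_params: bool) -> Dict[str, Any]:
    """Build a zero_optimization fragment that mirrors FSDP's binary offload.

    FSDP's single flag covers parameters and optimizer states together, so the
    DeepSpeed equivalent has to set BOTH offload blocks explicitly.
    """
    zero: Dict[str, Any] = {"stage": 3}
    if fsdp_offload_params:
        zero["offload_param"] = {"device": "cpu"}
        zero["offload_optimizer"] = {"device": "cpu"}
    return {"zero_optimization": zero}


def missing_offload_blocks(ds_config: Dict[str, Any]) -> List[str]:
    """Return the offload blocks that are absent and therefore GPU-resident."""
    zero = ds_config.get("zero_optimization", {})
    return [k for k in ("offload_param", "offload_optimizer") if k not in zero]


migrated = deepspeed_offload_from_fsdp(True)
print(migrated)
print(missing_offload_blocks(migrated))  # nothing left on GPU by omission

partial = {"zero_optimization": {"stage": 3, "offload_optimizer": {"device": "cpu"}}}
print(missing_offload_blocks(partial))   # parameters would stay GPU-resident
```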

Checkpointing for large-model training runs

Checkpointing a 70B fine-tune under distributed training is an operationally significant decision — save time, reload time, storage layout, and inference portability all vary by approach. The Accelerate concept guide notes that for DeepSpeed ZeRO-3: "one could pass a --zero3_save_16bit_model true, which conveniently consolidates the model to a single rank and saves." FSDP's recommended path is SHARDED_STATE_DICT, which saves distributed shards without consolidation.

Checkpoint approach	Save cost	Reload cost	Inference-ready?	Stack
FSDP `SHARDED_STATE_DICT`	Low (distributed)	Low (distributed)	No — needs full-state conversion	FSDP
FSDP `FULL_STATE_DICT`	High (single-rank gather)	Medium	Yes	FSDP
DeepSpeed ZeRO-3 partitioned shards	Low (distributed)	Low (distributed)	No — needs `zero_to_fp32.py`	DeepSpeed
DeepSpeed `zero3_save_16bit_model`	High (single-rank consolidation)	Fast (standard load)	Yes	DeepSpeed

Decision matrix for checkpoint strategy:
- Use SHARDED_STATE_DICT when you want fast mid-training saves and can tolerate a conversion step before inference.
- Use FSDP FULL_STATE_DICT when you need an immediately loadable checkpoint and have sufficient CPU RAM for a single-rank gather.
- Use DeepSpeed partitioned shards + zero_to_fp32.py for routine mid-training saves when using DeepSpeed.
- Use zero3_save_16bit_model for end-of-training saves when inference portability is required immediately.

Why sharded state dicts are usually the faster path

For 70B fine-tunes, saving a full consolidated checkpoint requires one rank to assemble the entire model in CPU RAM — for a bf16 70B model, that is approximately 140 GB of host RAM on a single process. Few training nodes have that headroom available mid-training without OOM risk, especially if optimizer states are also in CPU RAM.

Sharded state dicts distribute the save across all ranks. Each rank writes its own shard, and the total wall-clock time for the save scales with shard size rather than total model size. As the Accelerate FSDP usage guide states: "The new recommended way of checkpointing when using FSDP models is to use SHARDED_STATE_DICT as StateDictType."

Metric	SHARDED_STATE_DICT	FULL_STATE_DICT
Per-rank RAM during save	~(model size / num_gpus)	Full model size on rank 0
Wall-clock save time	Fast (parallel writes)	Slow (sequential gather then write)
Storage layout	One shard file per rank	Single weight file
Resume from same GPU count	Direct	Direct
Resume from different GPU count	Requires resharding	Direct

When to use zero_to_fp32.py after ZeRO-3

DeepSpeed checkpoint directories contain partitioned weight and optimizer state files plus a generated zero_to_fp32.py script. As the DeepSpeed ZeRO tutorial documents: "The zero_to_fp32.py script gets created automatically when you save a checkpoint." Running it consolidates all partitioned shards into a single fp32 weight file suitable for inference or model conversion.

The operational cost is explicit in the same source: "Note: currently this script uses 2x memory (general RAM) of the size of the final checkpoint." The final checkpoint here is the consolidated fp32 file, which for a 70B model is roughly 280 GB (4 bytes per parameter), so the conversion needs on the order of 560 GB of host RAM. On most single-node training machines, this means either a dedicated post-training conversion job or a machine with large host RAM provisioned specifically for this step.
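
The host-RAM requirement is simple to estimate before scheduling the conversion. This sketch assumes the consolidated output is fp32 (4 bytes per parameter) and applies the 2x factor from the quoted note; the function name and defaults are illustrative.

```python
def consolidation_ram_gb(
    num_params: float,
    bytes_per_param: int = 4,
    overhead_factor: float = 2.0,
) -> float:
    """Estimate host RAM (decimal GB) for consolidating a partitioned checkpoint.

    bytes_per_param=4 assumes the consolidated output is fp32.
    overhead_factor=2.0 follows the "2x memory" note quoted above.
    """
    final_checkpoint_gb = num_params * bytes_per_param / 1e9
    return final_checkpoint_gb * overhead_factor


print(f"fp32 output: {consolidation_ram_gb(70e9):.0f} GB host RAM")
print(f"16-bit output, for comparison: {consolidation_ram_gb(70e9, bytes_per_param=2):.0f} GB")
```

The second line shows how much the estimate depends on the dtype of the consolidated file, which is the first thing to confirm for your DeepSpeed version.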

Scenario	Recommended approach	RAM behavior
Mid-training checkpoint (resume only)	Save partitioned shards, skip `zero_to_fp32.py`	Lowest RAM footprint
End-of-training (need inference weights)	Run `zero_to_fp32.py` on a high-RAM instance	About 2x final checkpoint size
End-of-training, RAM-constrained	Use `zero3_save_16bit_model` during training instead	Single-rank gather at save time
Need fp32 weights from a partitioned checkpoint	`zero_to_fp32.py` is the only path	Requires full consolidation RAM

Pro Tip: If you anticipate needing consolidated weights at training end, enable 16-bit consolidation from the start, either with zero3_save_16bit_model: true in the Accelerate DeepSpeed settings or with "stage3_gather_16bit_weights_on_model_save": true in the DeepSpeed JSON, rather than relying on zero_to_fp32.py post-hoc. The in-training consolidation path still requires single-rank gather overhead, but it avoids provisioning a separate high-RAM conversion machine after training completes.

Accelerate launch and config patterns that actually map cleanly

Both stacks launch via accelerate launch, but the config surfaces diverge significantly. FSDP configuration lives entirely in the Accelerate plugin args (via accelerate config YAML or inline --fsdp_* flags). DeepSpeed requires both Accelerate plugin args and a separate DeepSpeed JSON config file — any ZeRO optimization, offload, or checkpoint flag that DeepSpeed manages natively belongs in the JSON, not the Accelerate layer.

Config layer	FSDP	DeepSpeed ZeRO-3
Sharding strategy	`fsdp_sharding_strategy: FULL_SHARD`	`zero_stage: 3` in Accelerate plugin
CPU offload	`fsdp_offload_params: true`	`offload_optimizer` / `offload_param` in DS JSON
NVMe offload	Not applicable	`nvme_path` in DS JSON `offload_*` blocks
Checkpoint type	`fsdp_state_dict_type: SHARDED_STATE_DICT`	`stage3_gather_16bit_weights_on_model_save` in DS JSON, or `zero3_save_16bit_model` as the Accelerate arg
RAM-efficient loading	`fsdp_cpu_ram_efficient_loading: true`	`zero3_init_context_manager` in training script

Decision matrix for config management:
- Use FSDP when you want a single YAML file to describe the full training configuration with no external JSON dependency.
- Use DeepSpeed when your team already manages DeepSpeed JSON configs and has tooling around them, or when you need config options that Accelerate's plugin layer does not expose (e.g., stage3_gather_16bit_weights_on_model_save, fine-grained bucket sizes).
- Avoid mixing Accelerate-layer DeepSpeed args with the JSON config for the same setting. Precedence behavior is not always intuitive and can silently override intended values.
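
A small pre-launch check can flag settings that are defined in both layers with different values. The mapping below between Accelerate-layer argument names and DeepSpeed JSON key paths is an assumption to verify against your installed versions, and `find_conflicts` is a hypothetical helper, not part of either library. JSON values of "auto" are skipped because they are meant to be filled in by the integration layer.

```python
from typing import Any, Dict, List, Tuple

# Assumed correspondence between Accelerate-layer DeepSpeed args and
# DeepSpeed JSON key paths. Verify against your installed versions.
ARG_TO_JSON_PATH: Dict[str, Tuple[str, ...]] = {
    "zero_stage": ("zero_optimization", "stage"),
    "offload_optimizer_device": ("zero_optimization", "offload_optimizer", "device"),
    "offload_param_device": ("zero_optimization", "offload_param", "device"),
    "zero3_save_16bit_model": (
        "zero_optimization",
        "stage3_gather_16bit_weights_on_model_save",
    ),
}

_MISSING = object()


def lookup(config: Dict[str, Any], path: Tuple[str, ...]) -> Any:
    """Walk a nested dict along path; return _MISSING if any key is absent."""
    node: Any = config
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def find_conflicts(accelerate_args: Dict[str, Any], ds_json: Dict[str, Any]) -> List[str]:
    """List settings defined in both layers with different concrete values."""
    conflicts = []
    for arg, path in ARG_TO_JSON_PATH.items():
        if arg not in accelerate_args:
            continue
        json_value = lookup(ds_json, path)
        if json_value is _MISSING or json_value == "auto":
            continue
        if json_value != accelerate_args[arg]:
            conflicts.append(
                f"{arg}={accelerate_args[arg]!r} vs {'.'.join(path)}={json_value!r}"
            )
    return conflicts


example_args = {"zero_stage": 2, "zero3_save_16bit_model": True}
example_json = {
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": "auto",
    }
}
print(find_conflicts(example_args, example_json))
```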

FSDP plugin settings to check first

The minimum FSDP settings to verify before a 70B fine-tune run:

Setting	Key	Recommended value for 70B
Sharding strategy	`fsdp_sharding_strategy`	`FULL_SHARD`
CPU offload	`fsdp_offload_params`	`false` (unless GPU memory is exhausted)
State dict type	`fsdp_state_dict_type`	`SHARDED_STATE_DICT`
RAM-efficient loading	`fsdp_cpu_ram_efficient_loading`	`true` (Transformers models only)
Auto-wrap policy	`fsdp_auto_wrap_policy`	`TRANSFORMER_BASED_WRAP`
Min num params	`fsdp_min_num_params`	Only used with `SIZE_BASED_WRAP`; match to your model's layer size

The Accelerate FSDP usage guide states that when fsdp_cpu_ram_efficient_loading is true: "only the first process loads the pretrained model checkpoint while all other processes have empty weights." This dramatically reduces peak host RAM during model load for large checkpoints. The constraint: this option is only applicable for Transformers models — custom architectures must handle sharded loading manually.
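
A minimal sketch of the settings above, written as a Python dict and rendered in YAML form. The top-level keys `distributed_type` and `num_processes`, the value 8, and the `to_yaml` helper are assumptions for illustration; compare the result with what `accelerate config` generates on your installed version.

```python
def to_yaml(mapping: dict, indent: int = 0) -> str:
    """Render a nested dict of scalars as simple YAML (illustrative helper)."""
    lines = []
    pad = " " * indent
    for key, value in mapping.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 2))
        elif isinstance(value, bool):
            # Booleans are checked before other scalars so they print as true/false.
            lines.append(f"{pad}{key}: {'true' if value else 'false'}")
        else:
            lines.append(f"{pad}{key}: {value}")
    return "\n".join(lines)


fsdp_settings = {
    "distributed_type": "FSDP",  # assumed top-level key
    "num_processes": 8,          # assumed: one process per GPU on an 8-GPU node
    "fsdp_config": {
        "fsdp_sharding_strategy": "FULL_SHARD",
        "fsdp_offload_params": False,
        "fsdp_state_dict_type": "SHARDED_STATE_DICT",
        "fsdp_cpu_ram_efficient_loading": True,
        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
    },
}

print(to_yaml(fsdp_settings))
```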

DeepSpeed config file settings to check first

The DeepSpeed JSON config carries the settings that Accelerate's plugin layer cannot express. The DeepSpeed ZeRO tutorial confirms: "To enable ZeRO optimizations for a DeepSpeed model, we simply add the zero_optimization key to the DeepSpeed JSON configuration."

Setting	JSON key path	Purpose
ZeRO stage	`zero_optimization.stage`	Set to `3` for 70B
Optimizer offload target	`zero_optimization.offload_optimizer.device`	`"cpu"` or `"nvme"`
Param offload target	`zero_optimization.offload_param.device`	`"cpu"` or `"nvme"`
NVMe path	`zero_optimization.offload_*.nvme_path`	Local NVMe mount point
Save 16-bit model	`zero_optimization.stage3_gather_16bit_weights_on_model_save`	Enables `zero3_save_16bit_model`
Optimizer type (CPU offload)	`optimizer.type`	`"Adam"` or `"AdamW"`; DeepSpeed then uses its CPU Adam implementation (`DeepSpeedCPUAdam`) when optimizer offload is enabled

Watch Out: If you enable CPU offload for optimizer states in DeepSpeed, the optimizer step should run through DeepSpeedCPUAdam. Declaring Adam or AdamW in the JSON optimizer block lets DeepSpeed select it; a stock PyTorch Adam or AdamW instance created in the training script is not the intended path for CPU-resident optimizer states in ZeRO-Offload. The zero3_save_16bit_model flag (stage3_gather_16bit_weights_on_model_save: true) triggers a single-rank gather at save time, so account for the RAM overhead.
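
A hedged sketch of a DeepSpeed config for optimizer-only CPU offload, built as a Python dict and serialized to JSON. The "auto" placeholders, the bf16 block, and the optimizer block are assumptions about a typical Accelerate integration; the `offload_targets` helper is hypothetical and treats an absent offload block as GPU-resident.

```python
import json
from typing import Any, Dict

ds_config: Dict[str, Any] = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        # offload_param is omitted on purpose: parameters stay GPU-resident.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    # Assumption: optimizer declared in JSON so DeepSpeed can pick its CPU Adam.
    "optimizer": {"type": "AdamW", "params": {"lr": "auto"}},
    "bf16": {"enabled": True},
    # Assumption: "auto" values are filled in by the integration layer.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}


def offload_targets(config: Dict[str, Any]) -> Dict[str, str]:
    """Report where params and optimizer states live under this config."""
    zero = config.get("zero_optimization", {})
    return {
        "params": zero.get("offload_param", {}).get("device", "gpu"),
        "optimizer": zero.get("offload_optimizer", {}).get("device", "gpu"),
    }


print(json.dumps(ds_config, indent=2))
print(offload_targets(ds_config))
```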

Decision matrix for single-node, multi-GPU training

The core operational question for a 70B fine-tune on a single node: how much GPU memory headroom do you have, how often do you checkpoint, and do you need inference-ready weights at the end of training?

Scenario	Recommended stack	Key reason
GPU memory sufficient with `FULL_SHARD`, no offload needed	FSDP	Simpler config, fast sharded checkpoints
Need optimizer-only CPU offload, params on GPU	DeepSpeed ZeRO-3	FSDP cannot express this split
Need NVMe offload to fit the model at all	DeepSpeed ZeRO-3 + ZeRO-Infinity	Only available in DeepSpeed
Frequent mid-training checkpoints (every N steps)	FSDP `SHARDED_STATE_DICT`	Fastest distributed save, no conversion needed
Need inference-ready consolidated weights at end of training	DeepSpeed `zero3_save_16bit_model` or FSDP `FULL_STATE_DICT`	Depends on which stack you're already using
Team unfamiliar with DeepSpeed JSON config	FSDP	Fewer config surfaces, fewer failure modes

Bottom Line: Neither stack dominates for all 70B fine-tune configurations. FSDP wins on operational simplicity and checkpoint ergonomics. DeepSpeed ZeRO-3 wins when offload granularity or NVMe is required. The mapping between the two is real (FULL_SHARD ↔ ZeRO-3) but does not extend to offload or checkpointing — those require stack-specific solutions.
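
The scenario table can be encoded as a small helper for documenting a team's choice. This is a sketch of the decision order used in this guide, with hypothetical field and function names; it is not a substitute for benchmarking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    fits_without_offload: bool            # FULL_SHARD fits in GPU memory as-is
    optimizer_only_offload: bool = False  # params on GPU, optimizer on CPU
    nvme_offload: bool = False            # NVMe is needed to fit the model
    frequent_checkpoints: bool = False    # saves every N steps


def recommend_stack(req: Requirements) -> str:
    """Return a stack suggestion following the scenario table above."""
    # Offload needs that only DeepSpeed can express take priority.
    if req.nvme_offload:
        return "DeepSpeed ZeRO-3 + ZeRO-Infinity"
    if req.optimizer_only_offload:
        return "DeepSpeed ZeRO-3"
    if req.fits_without_offload or req.frequent_checkpoints:
        return "FSDP with SHARDED_STATE_DICT"
    # Nothing decisive: fall back to measuring both.
    return "benchmark both"


print(recommend_stack(Requirements(fits_without_offload=True)))
print(recommend_stack(Requirements(fits_without_offload=False, optimizer_only_offload=True)))
print(recommend_stack(Requirements(fits_without_offload=False, nvme_offload=True)))
```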

Choose FSDP when operational simplicity matters more than exotic offload

FSDP is the right default when your GPU fleet can run the fine-tune with FULL_SHARD and either no offload or full CPU offload. As the PyTorch FSDP documentation describes: "FSDP shards module parameters, gradients, and optimizer states across data parallel workers to save memory at the cost of communication."

Choose FSDP when:
- GPU count × GPU VRAM covers the model with FULL_SHARD (no offload), or full CPU offload is an acceptable fallback
- You want sharded checkpoints without a post-conversion workflow
- Your training script is built on Transformers and you want fsdp_cpu_ram_efficient_loading
- You want a single Accelerate YAML as the sole configuration artifact
- The team prioritizes debuggability over maximum memory efficiency

Choose DeepSpeed when separate offload and NVMe are worth the complexity

DeepSpeed ZeRO-3's core advantage is offload granularity. As the DeepSpeed ZeRO-3 Offload documentation states: "ZeRO-3 Offload can partition the parameters across GPUs, and offload them to CPU, supporting model sizes that are much larger than the memory on a single GPU." NVMe offload extends this further.

Choose DeepSpeed ZeRO-3 when:
- You need optimizer-only CPU offload while keeping parameters GPU-resident
- NVMe offload is the only path to fitting the model on your node
- Your team already maintains DeepSpeed JSON configs and has operational familiarity
- You need zero3_save_16bit_model for direct inference-weight export at training end
- Fine-grained ZeRO bucket sizes and prefetch tuning are worth the config surface

Watch Out: The complexity budget for DeepSpeed includes the JSON config, the DeepSpeedCPUAdam optimizer requirement for CPU offload, the zero_to_fp32.py conversion workflow, and the 2× RAM requirement for post-training weight consolidation. On a 70B model, that last point alone can require roughly 560 GB of host RAM for an fp32 consolidation (2 × ~280 GB). If your infrastructure does not have that headroom, plan the checkpoint strategy before training starts, not after.

FAQ

Which is better, FSDP or DeepSpeed? Neither is categorically better. FSDP is operationally simpler with cleaner sharded checkpointing. DeepSpeed ZeRO-3 offers more granular offload control and NVMe support. The right choice depends on your memory constraints and checkpoint requirements, not on a universal ranking.

Does FSDP support offloading? Yes, but only as a binary switch. Enabling FSDP CPU offload moves all parameters and optimizer states to CPU — you cannot offload only optimizer states while keeping parameters on GPU. DeepSpeed can split these independently.

How do you save a checkpoint in DeepSpeed ZeRO-3? Two paths: (1) save partitioned shards during training (default) and run zero_to_fp32.py post-training to consolidate — this requires approximately 2× the final checkpoint size in host RAM; (2) set stage3_gather_16bit_weights_on_model_save: true (the zero3_save_16bit_model flag) to consolidate to a single rank at save time, trading save latency for immediate portability.

What is the difference between FSDP and DeepSpeed in Accelerate? Both integrate into Hugging Face Accelerate and both support full parameter sharding. The differences are in config surface (FSDP uses plugin args only; DeepSpeed also requires a JSON config), offload semantics (FSDP is all-or-nothing; DeepSpeed splits param and optimizer offload), checkpoint format (FSDP prefers SHARDED_STATE_DICT; DeepSpeed produces partitioned shards requiring zero_to_fp32.py for consolidation), and NVMe support (DeepSpeed only).

Question	FSDP answer	DeepSpeed ZeRO-3 answer
Supports offloading?	Yes, all-or-nothing to CPU	Yes, params and optimizer independently, CPU or NVMe
How to save a 70B checkpoint?	`SHARDED_STATE_DICT` (recommended)	Partitioned shards + `zero_to_fp32.py`, or `zero3_save_16bit_model`
Better for operational simplicity?	Yes	No — requires JSON config + conversion tooling
Better for memory-constrained nodes?	No — binary offload limits flexibility	Yes — granular offload, NVMe support

Sources and references

Hugging Face Accelerate: FSDP and DeepSpeed concept guide — Primary source for the FSDP ↔ ZeRO stage mapping, offload semantics, and checkpoint comparison
Hugging Face Accelerate: FSDP usage guide — Source for SHARDED_STATE_DICT recommendation, fsdp_cpu_ram_efficient_loading behavior
PyTorch FSDP: fully_shard documentation — Source for FSDP parameter/gradient/optimizer-state sharding description
DeepSpeed ZeRO tutorial — Source for zero_to_fp32.py auto-generation and 2× RAM requirement
DeepSpeed ZeRO-Offload tutorial — Source for optimizer offload definition and CPU offload semantics
DeepSpeed ZeRO-3 Offload blog post — Source for ZeRO-3 parameter partitioning and CPU offload capability
DeepSpeed config JSON reference — Source for zero_optimization key structure, offload config fields, and optimizer requirements
