How we compared FSDP and DeepSpeed in Accelerate
What is the difference between FSDP and DeepSpeed in Accelerate? The Hugging Face Accelerate documentation provides a direct concept guide comparing the two stacks on six dimensions: sharding strategy, offload behavior, model loading, checkpointing, prefetching, and parameter synchronization. That guide is the primary source for this comparison, scoped to single-node, multi-GPU usage — which is the only regime the official guide covers directly.
The central mapping the guide establishes: "FSDP FULL_SHARD maps to DeepSpeed ZeRO stage 3". That equivalence is useful but incomplete — the two stacks diverge on every other dimension in the table below.
| Dimension | PyTorch FSDP | DeepSpeed ZeRO-3 |
|---|---|---|
| Sharding stage equivalent | FULL_SHARD | ZeRO stage 3 |
| CPU offload granularity | All-or-nothing | Parameters and optimizer independently |
| NVMe offload | Not supported | Supported (ZeRO-Infinity family) |
| Recommended checkpoint format | SHARDED_STATE_DICT | Partitioned shards + zero_to_fp32.py |
| Config surface | Plugin args + accelerate launch | Plugin args + DeepSpeed JSON config file |
| RAM-efficient model loading | fsdp_cpu_ram_efficient_loading (Transformers models only) | zero3_init_context_manager |
Bottom Line: For 70B fine-tunes on a single node, FSDP is the lower-friction path when your priority is sharded checkpointing and simpler config. DeepSpeed ZeRO-3 earns its place when you need to offload parameters and optimizer states independently or target NVMe. The two stacks are not interchangeable behind a flag flip: offload semantics, checkpoint formats, and config surfaces differ materially.
At-a-glance choice for 70B fine-tunes
PyTorch FSDP shards parameters, gradients, and optimizer states across data-parallel workers to reduce per-GPU memory at the cost of all-gather and reduce-scatter communication during the forward and backward passes. DeepSpeed ZeRO-3 applies the same partitioning logic but adds a richer offload stack and a separate JSON-driven config layer.
For a Llama 3.1 70B fine-tune, the practical choice hinges on two questions: how you want to checkpoint, and whether you need granular CPU/NVMe offload.
| Criterion | Prefer FSDP | Prefer DeepSpeed ZeRO-3 |
|---|---|---|
| Checkpoint simplicity | ✅ Sharded state dicts, no conversion step | ❌ Requires zero_to_fp32.py for consolidated weights |
| Offload flexibility | ❌ CPU offload is all-or-nothing | ✅ Separate param and optimizer offload |
| NVMe offload | ❌ Not available | ✅ Supported |
| Config surface | ✅ Plugin args only | ❌ Plugin args + DeepSpeed JSON |
| Transformers integration | ✅ fsdp_cpu_ram_efficient_loading available | ✅ zero3_init_context_manager available |
| Checkpoint reload speed (70B) | ✅ Distributed load across ranks | ⚠️ Single-rank consolidation can be slow |
Decision matrix for stack selection:
- Choose FSDP when GPU memory is sufficient with full CPU offload (or no offload), you want sharded checkpoints without a post-conversion step, and operational simplicity matters.
- Choose DeepSpeed ZeRO-3 when parameters and optimizer states need independent offload targets, when NVMe offload is the only way to fit the model, or when you need a consolidated 16-bit weight file for inference portability.
- No clear winner when both fit in GPU memory with ZeRO-3/FULL_SHARD sharding and you have no strong checkpoint or offload requirement: benchmark both for your specific batch size and sequence length.
When the default answer is FSDP
At a Glance: Time: minutes to configure · Prereqs: Transformers-based model, Accelerate FSDP plugin · Hardware: single-node multi-GPU with enough VRAM for FULL_SHARD · Cost: lowest operational overhead when checkpointing is sharded
FSDP is the right default for teams that want fewer moving parts and predictable checkpoint behavior. The Accelerate FSDP usage guide states: "The new recommended way of checkpointing when using FSDP models is to use SHARDED_STATE_DICT as StateDictType." Sharded state dicts distribute the save and load work across all ranks — no single rank needs to hold the full 70B parameter set in memory during a checkpoint operation.
FSDP does support CPU offload, but only as a binary switch: either all parameters and optimizer states move to CPU, or none do. You cannot offload just the optimizer and keep parameters on GPU. For 70B fine-tune setups where GPU memory is tight but not catastrophically so, that limitation may never trigger: an 8×A100 80GB node has 640 GB of aggregate VRAM, so whether Llama 3.1 70B in bf16 runs under FULL_SHARD without CPU offload depends mainly on how large the optimizer states are. If it does not fit, FSDP will move everything to CPU, which adds substantial PCIe transfer overhead on every step.
Pro Tip: Set fsdp_cpu_ram_efficient_loading: true in your Accelerate FSDP plugin config when loading a pretrained 70B checkpoint. With this flag, only the first process loads the checkpoint from disk while all other processes initialize with empty weights; Accelerate then shards and scatters. Without it, every rank loads the full checkpoint simultaneously, potentially exhausting host RAM. Note that this option is only applicable to Transformers models.
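The host-RAM effect of that flag can be put into rough numbers. The sketch below is only back-of-the-envelope arithmetic: it assumes each loading process holds exactly one full copy of the weights and ignores any framework overhead, and the function name is hypothetical.

```python
def peak_host_ram_gb(params_billion: float, bytes_per_param: int,
                     num_ranks: int, ram_efficient_loading: bool) -> float:
    """Rough peak host RAM while loading a pretrained checkpoint.

    Assumption: every process that loads from disk holds one full copy
    of the weights; all other overhead is ignored.
    """
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    checkpoint_gb = params_billion * bytes_per_param
    loaders = 1 if ram_efficient_loading else num_ranks
    return checkpoint_gb * loaders


if __name__ == "__main__":
    for flag in (True, False):
        gb = peak_host_ram_gb(70, 2, num_ranks=8, ram_efficient_loading=flag)
        print(f"fsdp_cpu_ram_efficient_loading={flag}: ~{gb:.0f} GB host RAM")
```

Under these assumptions a bf16 70B checkpoint on 8 ranks needs about 140 GB with the flag and about 1120 GB without it.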
When DeepSpeed earns its complexity budget
DeepSpeed ZeRO-3 is the right choice when you need to control offload at a finer granularity than FSDP allows. As the DeepSpeed ZeRO-Offload tutorial defines it: "ZeRO-Offload is a ZeRO optimization that offloads the optimizer memory and computation from the GPU to the host CPU." ZeRO-3 Offload extends this to parameters as well, and ZeRO-Infinity adds NVMe as a third tier.
| Offload tier | DeepSpeed capability | FSDP equivalent |
|---|---|---|
| Optimizer only → CPU | Supported with offload_optimizer | Not possible without also offloading params |
| Parameters only → CPU | Supported with offload_param | Not possible as a separate choice |
| Both → CPU | Supported | Supported, but only as a single binary switch |
| CPU → NVMe | Supported through ZeRO-Infinity | Not supported |
| Fine-grained control | Independent toggles per component | All-or-nothing |
The practical upside: you can keep parameters on GPU (fast access during forward/backward) while offloading only optimizer states to CPU. For 70B fine-tunes with AdamW, optimizer states alone account for roughly 4 bytes × 2 states × 70B parameters = ~560 GB in fp32. Sharded across an 8×80 GB node, that is about 70 GB per GPU before parameters, gradients, and activations are counted. Offloading optimizer states to CPU while keeping sharded parameters on GPU is a viable middle ground that FSDP simply cannot express.
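The optimizer-state arithmetic above is easy to rerun for other model sizes or GPU counts. This is a minimal sketch that assumes two fp32 AdamW states per parameter and an even split across ranks; the helper names are illustrative only.

```python
def adamw_state_gb(params_billion: float, bytes_per_state: int = 4,
                   num_states: int = 2) -> float:
    """Total AdamW optimizer-state size in GB (momentum + variance)."""
    return params_billion * bytes_per_state * num_states


def per_gpu_share_gb(total_gb: float, num_gpus: int) -> float:
    """Even share per GPU when the state is fully sharded."""
    return total_gb / num_gpus


if __name__ == "__main__":
    total = adamw_state_gb(70)          # 70B parameters
    share = per_gpu_share_gb(total, 8)  # 8 GPUs on one node
    print(f"optimizer states: ~{total:.0f} GB total, ~{share:.0f} GB per GPU")
    print("offloading only the optimizer frees that per-GPU share for "
          "parameters, gradients, and activations")
```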
Watch Out: DeepSpeed's NVMe offload via ZeRO-Infinity requires careful configuration of the offload_param and offload_optimizer keys in the DeepSpeed JSON config, including nvme_path. NVMe bandwidth directly caps your training throughput when the optimizer step triggers NVMe reads/writes. Consumer NVMe (PCIe 3.0 ×4) at ~3 GB/s will noticeably throttle step time for 70B models. Enterprise NVMe in RAID or PCIe 5.0 configurations mitigate this but add infrastructure cost. Measure actual step time before committing to NVMe offload as a production configuration.
Sharding behavior and memory trade-offs
Both stacks partition model state across ranks to reduce per-GPU memory. The mapping is conceptually clean at the top level but diverges in semantics below the sharding stage number.
| FSDP sharding strategy | Memory profile | DeepSpeed equivalent | Fit scenario |
|---|---|---|---|
| NO_SHARD | Full model replica per GPU | ZeRO stage 0 | Small models or baselines only |
| SHARD_GRAD_OP | Shard gradients and optimizer states | ZeRO stage 2 | Fits in GPU memory with lower sharding overhead |
| FULL_SHARD | Shard params, grads, and optimizer states | ZeRO stage 3 | 70B single-node fine-tunes and other memory-bound runs |
| HYBRID_SHARD | Full shard within node, replica across nodes | No direct equivalent | Multi-node topologies outside this guide |
For 70B fine-tunes on a single node, FULL_SHARD / ZeRO stage 3 is the only viable sharding mode — the others retain too much state per GPU. HYBRID_SHARD is FSDP-only and targets multi-node topologies outside the scope of the Accelerate single-node comparison.
Decision matrix for sharding stage selection:
- Choose FULL_SHARD / ZeRO-3 when the model does not fit with lower sharding stages on your GPU count.
- Choose SHARD_GRAD_OP / ZeRO-2 when the model fits in GPU memory but optimizer states are the bottleneck — lower communication overhead than full sharding.
- Choose NO_SHARD / ZeRO-0 only for baselines or small models; irrelevant for 70B.
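To see why the lower stages drop out for large models, the per-GPU model-state footprint of each strategy can be estimated from the table's memory profiles. The sketch below takes the three state sizes as inputs and ignores activations, buffers, and fragmentation; the example sizes (bf16 parameters and gradients, fp32 AdamW states) are an assumption, not a measurement.

```python
from typing import Dict, Optional

# Mapping taken from the sharding table above.
FSDP_TO_ZERO_STAGE: Dict[str, Optional[int]] = {
    "NO_SHARD": 0,
    "SHARD_GRAD_OP": 2,
    "FULL_SHARD": 3,
    "HYBRID_SHARD": None,  # no direct DeepSpeed equivalent
}


def per_gpu_state_gb(strategy: str, params_gb: float, grads_gb: float,
                     optim_gb: float, num_gpus: int) -> float:
    """Model-state memory per GPU (activations and buffers excluded)."""
    if strategy == "NO_SHARD":
        return params_gb + grads_gb + optim_gb
    if strategy == "SHARD_GRAD_OP":
        return params_gb + (grads_gb + optim_gb) / num_gpus
    if strategy == "FULL_SHARD":
        return (params_gb + grads_gb + optim_gb) / num_gpus
    raise ValueError(f"no single-node estimate for {strategy!r}")


if __name__ == "__main__":
    # Assumed 70B sizes: 140 GB params, 140 GB grads, 560 GB optimizer states.
    for name in ("NO_SHARD", "SHARD_GRAD_OP", "FULL_SHARD"):
        gb = per_gpu_state_gb(name, 140, 140, 560, num_gpus=8)
        print(f"{name} (ZeRO stage {FSDP_TO_ZERO_STAGE[name]}): ~{gb:.1f} GB per GPU")
```

Smaller optimizer states, fewer trainable parameters, or offload all change the inputs, so treat the output as a starting point for sizing rather than a fit guarantee.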
FULL_SHARD and ZeRO stage 3 are similar, not identical
The Accelerate concept guide establishes: "In particular, FSDP FULL_SHARD maps to DeepSpeed ZeRO stage 3." Both shard parameters, gradients, and optimizer states across all data-parallel ranks. During the forward pass, each rank all-gathers the parameters it needs, computes, then discards the gathered tensors. Gradients are reduce-scattered after the backward pass.
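The gather, compute, discard, reduce-scatter cycle described above can be illustrated with plain Python lists standing in for tensors and ranks. This is a toy model for intuition only: it is not how either library implements communication, it sums gradients where real implementations typically average, and every name in it is hypothetical.

```python
from typing import List


def all_gather(shards: List[List[float]]) -> List[float]:
    """Each rank ends up with the concatenation of every rank's shard."""
    return [value for shard in shards for value in shard]


def reduce_scatter(per_rank_grads: List[List[float]],
                   num_ranks: int) -> List[List[float]]:
    """Sum full-size gradients across ranks, then give rank r only slice r."""
    summed = [sum(values) for values in zip(*per_rank_grads)]
    shard_len = len(summed) // num_ranks
    return [summed[r * shard_len:(r + 1) * shard_len] for r in range(num_ranks)]


def toy_step(param_shards: List[List[float]],
             batch_scales: List[float]) -> List[List[float]]:
    """One forward/backward cycle: gather params, compute grads, scatter."""
    full_params = all_gather(param_shards)
    # Toy gradient: each rank scales the parameters by its own batch factor.
    local_grads = [[scale * p for p in full_params] for scale in batch_scales]
    # The gathered parameters are discarded here; only shards persist.
    return reduce_scatter(local_grads, num_ranks=len(param_shards))


if __name__ == "__main__":
    shards = [[1.0, 2.0], [3.0, 4.0]]  # two ranks, two parameters each
    print(toy_step(shards, batch_scales=[0.5, 1.5]))
```

Each rank finishes the step holding only the gradient slice for the parameters it owns, which is what keeps per-GPU memory proportional to the shard size.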
| Behavior | FSDP FULL_SHARD | DeepSpeed ZeRO stage 3 |
|---|---|---|
| Parameter prefetching | Configurable (limit_all_gathers) | Configurable (stage3_prefetch_bucket_size) |
| Parameter synchronization | NCCL all-gather per FSDP unit | NCCL all-gather per layer partition |
| Offload control | Binary (on/off) | Per-component (params vs. optimizer) |
| Config location | Accelerate plugin args | Accelerate plugin args + DeepSpeed JSON |
The prefetching and synchronization differences mean that even at equivalent sharding stages, raw training throughput can differ for the same model and batch size. Neither stack is universally faster — this is workload-dependent.
Offload semantics that do not translate cleanly
FSDP treats CPU offload as a single binary flag. Enabling it moves all parameters and optimizer states to CPU; disabling it keeps everything on GPU. There is no middle configuration.
DeepSpeed separates the two. The offload_param and offload_optimizer keys in the ZeRO config JSON are independent. You can offload optimizer states to CPU while parameters remain GPU-resident, or offload both. NVMe is available as a target for either component:
| Offload scenario | FSDP | DeepSpeed ZeRO-3 |
|---|---|---|
| No offload | ✅ | ✅ |
| Optimizer states → CPU | ❌ (must offload params too) | ✅ |
| Params → CPU, optimizer → GPU | ❌ | ✅ |
| Both → CPU | ✅ | ✅ |
| Either → NVMe | ❌ | ✅ (ZeRO-Infinity) |
Watch Out: If you migrate a working FSDP config with CPU offload enabled to DeepSpeed, do not assume the memory profile will be identical. DeepSpeed's granular offload means you may need to explicitly configure both offload_param and offload_optimizer to reproduce the same memory reduction. Leaving one unset defaults to GPU-resident, which changes the effective memory footprint.
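One way to make that migration explicit is to derive the DeepSpeed fragment from the FSDP switch instead of writing it by hand. The sketch below builds only the zero_optimization section using the offload_optimizer and offload_param keys named above; the helper functions are hypothetical and not part of Accelerate or DeepSpeed.

```python
import json


def zero3_offload_fragment(offload_optimizer: bool, offload_param: bool,
                           device: str = "cpu") -> dict:
    """Build a zero_optimization fragment with independent offload toggles."""
    zero = {"stage": 3}
    if offload_optimizer:
        zero["offload_optimizer"] = {"device": device}
    if offload_param:
        zero["offload_param"] = {"device": device}
    return {"zero_optimization": zero}


def from_fsdp_offload(fsdp_offload_params: bool) -> dict:
    """FSDP's single switch corresponds to toggling both components together."""
    return zero3_offload_fragment(fsdp_offload_params, fsdp_offload_params)


if __name__ == "__main__":
    # Equivalent of an FSDP config that had CPU offload turned on.
    print(json.dumps(from_fsdp_offload(True), indent=2))
    # The split FSDP cannot express: optimizer on CPU, parameters on GPU.
    print(json.dumps(zero3_offload_fragment(True, False), indent=2))
```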
Checkpointing for large-model training runs
Checkpointing a 70B fine-tune under distributed training is an operationally significant decision — save time, reload time, storage layout, and inference portability all vary by approach. The Accelerate concept guide notes that for DeepSpeed ZeRO-3: "one could pass a --zero3_save_16bit_model true, which conveniently consolidates the model to a single rank and saves." FSDP's recommended path is SHARDED_STATE_DICT, which saves distributed shards without consolidation.
| Checkpoint approach | Save cost | Reload cost | Inference-ready? | Stack |
|---|---|---|---|---|
| FSDP SHARDED_STATE_DICT | Low (distributed) | Low (distributed) | No (needs full-state conversion) | FSDP |
| FSDP FULL_STATE_DICT | High (single-rank gather) | Medium | Yes | FSDP |
| DeepSpeed ZeRO-3 partitioned shards | Low (distributed) | Low (distributed) | No (needs zero_to_fp32.py) | DeepSpeed |
| DeepSpeed zero3_save_16bit_model | High (single-rank consolidation) | Fast (standard load) | Yes | DeepSpeed |
Decision matrix for checkpoint strategy:
- Use SHARDED_STATE_DICT when you want fast mid-training saves and can tolerate a conversion step before inference.
- Use FSDP FULL_STATE_DICT when you need an immediately loadable checkpoint and have sufficient CPU RAM for a single-rank gather.
- Use DeepSpeed partitioned shards + zero_to_fp32.py for routine mid-training saves when using DeepSpeed.
- Use zero3_save_16bit_model for end-of-training saves when inference portability is required immediately.
Why sharded state dicts are usually the faster path
For 70B fine-tunes, saving a full consolidated checkpoint requires one rank to assemble the entire model in CPU RAM — for a bf16 70B model, that is approximately 140 GB of host RAM on a single process. Few training nodes have that headroom available mid-training without OOM risk, especially if optimizer states are also in CPU RAM.
Sharded state dicts distribute the save across all ranks. Each rank writes its own shard, and the total wall-clock time for the save scales with shard size rather than total model size. As the Accelerate FSDP usage guide states: "The new recommended way of checkpointing when using FSDP models is to use SHARDED_STATE_DICT as StateDictType."
| Metric | SHARDED_STATE_DICT | FULL_STATE_DICT |
|---|---|---|
| Per-rank RAM during save | ~(model size / num_gpus) | Full model size on rank 0 |
| Wall-clock save time | Fast (parallel writes) | Slow (sequential gather then write) |
| Storage layout | One shard file per rank | Single weight file |
| Resume from same GPU count | Direct | Direct |
| Resume from different GPU count | Requires resharding | Direct |
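The per-rank RAM row of the table can be checked against a node's free host memory before picking a state dict type. This is a rough sketch that assumes an even split across ranks and counts model weights only, not optimizer state; the function name is illustrative.

```python
def save_ram_per_rank_gb(model_gb: float, num_ranks: int,
                         state_dict_type: str) -> float:
    """Worst-case host RAM one rank needs to stage weights for a save."""
    if state_dict_type == "SHARDED_STATE_DICT":
        return model_gb / num_ranks  # each rank stages only its shard
    if state_dict_type == "FULL_STATE_DICT":
        return model_gb              # rank 0 assembles the whole model
    raise ValueError(f"unknown state dict type: {state_dict_type!r}")


if __name__ == "__main__":
    for kind in ("SHARDED_STATE_DICT", "FULL_STATE_DICT"):
        gb = save_ram_per_rank_gb(140, num_ranks=8, state_dict_type=kind)
        print(f"{kind}: ~{gb:.1f} GB on the busiest rank")
```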
When to use zero_to_fp32.py after ZeRO-3
DeepSpeed checkpoint directories contain partitioned weight and optimizer state files plus a generated zero_to_fp32.py script. As the DeepSpeed ZeRO tutorial documents: "The zero_to_fp32.py script gets created automatically when you save a checkpoint." Running it consolidates all partitioned shards into a single fp32 weight file suitable for inference or model conversion.
The operational cost is explicit in the same source: "Note: currently this script uses 2x memory (general RAM) of the size of the final checkpoint." The final checkpoint here is the consolidated fp32 file, not the bf16 training weights: for a 70B model that is roughly 280 GB (4 bytes × 70B parameters), so the conversion needs approximately 560 GB of host RAM. On most single-node training machines, this requires either a dedicated post-training conversion job or a machine with large host RAM provisioned specifically for this step.
| Scenario | Recommended approach | RAM behavior |
|---|---|---|
| Mid-training checkpoint (resume only) | Save partitioned shards, skip zero_to_fp32.py | Lowest RAM footprint |
| End-of-training (need inference weights) | Run zero_to_fp32.py on a high-RAM instance | About 2x final checkpoint size |
| End-of-training, RAM-constrained | Use zero3_save_16bit_model during training instead | Single-rank gather at save time |
| Need fp32 weights from a partitioned checkpoint | zero_to_fp32.py is the only path | Requires full consolidation RAM |
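The 2x rule quoted from the DeepSpeed tutorial scales with the size of the file the script writes, so the host-RAM estimate depends on the output precision you assume. The sketch below makes that explicit; the function is a hypothetical sizing helper, not part of DeepSpeed, and the overhead factor is simply the documented 2x.

```python
def consolidation_ram_gb(params_billion: float, output_bytes_per_param: int,
                         overhead_factor: float = 2.0) -> float:
    """Host RAM estimate: overhead factor times the final checkpoint size."""
    final_checkpoint_gb = params_billion * output_bytes_per_param
    return final_checkpoint_gb * overhead_factor


if __name__ == "__main__":
    # Sized against a 16-bit file versus the fp32 file the script name implies.
    for label, width in (("16-bit output", 2), ("fp32 output", 4)):
        gb = consolidation_ram_gb(70, width)
        print(f"70B, {label}: ~{gb:.0f} GB host RAM")
```

Checking the result against the conversion machine's free RAM before training ends is cheaper than discovering the shortfall afterwards.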
Pro Tip: If you anticipate needing consolidated weights at training end, enable zero3_save_16bit_model in your Accelerate config (or stage3_gather_16bit_weights_on_model_save: true in the DeepSpeed JSON) from the start rather than relying on zero_to_fp32.py post hoc. The in-training consolidation path still incurs single-rank gather overhead, but it avoids provisioning a separate high-RAM conversion machine after training completes.
Accelerate launch and config patterns that actually map cleanly
Both stacks launch via accelerate launch, but the config surfaces diverge significantly. FSDP configuration lives entirely in the Accelerate plugin args (via accelerate config YAML or inline --fsdp_* flags). DeepSpeed requires both Accelerate plugin args and a separate DeepSpeed JSON config file — any ZeRO optimization, offload, or checkpoint flag that DeepSpeed manages natively belongs in the JSON, not the Accelerate layer.
| Config layer | FSDP | DeepSpeed ZeRO-3 |
|---|---|---|
| Sharding strategy | fsdp_sharding_strategy: FULL_SHARD | zero_stage: 3 in Accelerate plugin |
| CPU offload | fsdp_offload_params: true | offload_optimizer / offload_param in DS JSON |
| NVMe offload | Not applicable | nvme_path in DS JSON offload_* blocks |
| Checkpoint type | fsdp_state_dict_type: SHARDED_STATE_DICT | zero3_save_16bit_model in DS JSON or Accelerate arg |
| RAM-efficient loading | fsdp_cpu_ram_efficient_loading: true | zero3_init_context_manager in training script |
Decision matrix for config management:
- Use FSDP when you want a single YAML file to describe the full training configuration with no external JSON dependency.
- Use DeepSpeed when your team already manages DeepSpeed JSON configs and has tooling around them, or when you need config options that Accelerate's plugin layer does not expose (e.g., stage3_gather_16bit_weights_on_model_save, fine-grained bucket sizes).
- Avoid mixing Accelerate-layer DeepSpeed args with the JSON config for the same setting — precedence behavior is not always intuitive and can silently override intended values.
FSDP plugin settings to check first
The minimum FSDP settings to verify before a 70B fine-tune run:
| Setting | Key | Recommended value for 70B |
|---|---|---|
| Sharding strategy | fsdp_sharding_strategy | FULL_SHARD |
| CPU offload | fsdp_offload_params | false (unless GPU memory is exhausted) |
| State dict type | fsdp_state_dict_type | SHARDED_STATE_DICT |
| RAM-efficient loading | fsdp_cpu_ram_efficient_loading | true (Transformers models only) |
| Auto-wrap policy | fsdp_auto_wrap_policy | TRANSFORMER_BASED_WRAP |
| Min num params | fsdp_min_num_params | Match to your model's layer size |
The Accelerate FSDP usage guide states that when fsdp_cpu_ram_efficient_loading is true: "only the first process loads the pretrained model checkpoint while all other processes have empty weights." This dramatically reduces peak host RAM during model load for large checkpoints. The constraint: this option is only applicable for Transformers models — custom architectures must handle sharded loading manually.
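A pre-flight check against the table above can catch a stray setting before a long run starts. The sketch below compares a plain dict (for example, the fsdp_config section of an Accelerate YAML file after you have parsed it yourself) with the recommended values; the checker is hypothetical and the recommended values are just this article's table, not defaults enforced by Accelerate.

```python
from typing import Dict, List

# Recommended 70B values from the settings table above.
RECOMMENDED_FSDP: Dict[str, object] = {
    "fsdp_sharding_strategy": "FULL_SHARD",
    "fsdp_offload_params": False,
    "fsdp_state_dict_type": "SHARDED_STATE_DICT",
    "fsdp_cpu_ram_efficient_loading": True,
    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
}


def fsdp_config_diffs(fsdp_config: Dict[str, object]) -> List[str]:
    """List every key whose value differs from the table's recommendation."""
    diffs = []
    for key, wanted in RECOMMENDED_FSDP.items():
        actual = fsdp_config.get(key, "<unset>")
        if actual != wanted:
            diffs.append(f"{key}: have {actual!r}, table suggests {wanted!r}")
    return diffs


if __name__ == "__main__":
    candidate = dict(RECOMMENDED_FSDP, fsdp_state_dict_type="FULL_STATE_DICT")
    for line in fsdp_config_diffs(candidate):
        print(line)
```

A difference is not necessarily a mistake (CPU offload may be deliberate), so the output is meant as a review list rather than a hard failure.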
DeepSpeed config file settings to check first
The DeepSpeed JSON config carries the settings that Accelerate's plugin layer cannot express. The DeepSpeed ZeRO tutorial confirms: "To enable ZeRO optimizations for a DeepSpeed model, we simply add the zero_optimization key to the DeepSpeed JSON configuration."
| Setting | JSON key path | Purpose |
|---|---|---|
| ZeRO stage | zero_optimization.stage | Set to 3 for 70B |
| Optimizer offload target | zero_optimization.offload_optimizer.device | "cpu" or "nvme" |
| Param offload target | zero_optimization.offload_param.device | "cpu" or "nvme" |
| NVMe path | zero_optimization.offload_*.nvme_path | Local NVMe mount point |
| Save 16-bit model | zero_optimization.stage3_gather_16bit_weights_on_model_save | Enables zero3_save_16bit_model |
| Optimizer type (CPU offload) | optimizer.type | "DeepSpeedCPUAdam" required for CPU offload |
Watch Out: If you enable CPU offload for optimizer states in DeepSpeed, you must also switch to DeepSpeedCPUAdam as the optimizer. Standard Adam or AdamW from PyTorch will not function correctly with CPU-resident optimizer states in DeepSpeed's ZeRO-Offload path. The zero3_save_16bit_model flag (stage3_gather_16bit_weights_on_model_save: true) triggers a single-rank gather at save time, so account for the RAM overhead.
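Generating the zero_optimization section from code, rather than editing JSON by hand, makes it harder to forget nvme_path when a device is switched to NVMe. The sketch below covers only the ZeRO keys from the table above and leaves the optimizer block out; the builder functions and the /local_nvme mount point are assumptions for illustration.

```python
import json
from typing import Optional


def offload_block(device: Optional[str], nvme_path: Optional[str]) -> Optional[dict]:
    """Return one offload_* block, or None to leave that component on GPU."""
    if device is None:
        return None
    if device not in ("cpu", "nvme"):
        raise ValueError(f"unsupported offload device: {device!r}")
    block = {"device": device}
    if device == "nvme":
        if not nvme_path:
            raise ValueError("nvme offload requires nvme_path")
        block["nvme_path"] = nvme_path
    return block


def build_zero3_config(optimizer_device: Optional[str] = None,
                       param_device: Optional[str] = None,
                       nvme_path: Optional[str] = None,
                       gather_16bit_on_save: bool = False) -> dict:
    """Assemble a stage-3 zero_optimization section."""
    zero = {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": gather_16bit_on_save,
    }
    targets = (("offload_optimizer", optimizer_device),
               ("offload_param", param_device))
    for key, device in targets:
        block = offload_block(device, nvme_path)
        if block is not None:
            zero[key] = block
    return {"zero_optimization": zero}


if __name__ == "__main__":
    # Hypothetical mount point; optimizer states on NVMe, parameters on GPU.
    cfg = build_zero3_config(optimizer_device="nvme", nvme_path="/local_nvme",
                             gather_16bit_on_save=True)
    print(json.dumps(cfg, indent=2))
```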
Decision matrix for single-node, multi-GPU training
The core operational question for a 70B fine-tune on a single node: how much GPU memory headroom do you have, how often do you checkpoint, and do you need inference-ready weights at the end of training?
| Scenario | Recommended stack | Key reason |
|---|---|---|
| GPU memory sufficient with FULL_SHARD, no offload needed | FSDP | Simpler config, fast sharded checkpoints |
| Need optimizer-only CPU offload, params on GPU | DeepSpeed ZeRO-3 | FSDP cannot express this split |
| Need NVMe offload to fit the model at all | DeepSpeed ZeRO-3 + ZeRO-Infinity | Only available in DeepSpeed |
| Frequent mid-training checkpoints (every N steps) | FSDP SHARDED_STATE_DICT | Fastest distributed save, no conversion needed |
| Need inference-ready consolidated weights at end of training | DeepSpeed zero3_save_16bit_model or FSDP FULL_STATE_DICT | Depends on which stack you're already using |
| Team unfamiliar with DeepSpeed JSON config | FSDP | Fewer config surfaces, fewer failure modes |
Bottom Line: Neither stack dominates for all 70B fine-tune configurations. FSDP wins on operational simplicity and checkpoint ergonomics. DeepSpeed ZeRO-3 wins when offload granularity or NVMe is required. The mapping between the two is real (FULL_SHARD ↔ ZeRO-3) but does not extend to offload or checkpointing; those require stack-specific solutions.
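The hard constraints in the matrix reduce to a short rule order: NVMe first, then split offload, then FSDP as the default. The sketch below encodes only those rows; it is an illustrative helper with hypothetical names, and softer factors such as team familiarity are left to the reader.

```python
def recommend_stack(needs_nvme_offload: bool,
                    needs_optimizer_only_offload: bool) -> str:
    """Apply the decision matrix's hard constraints in priority order."""
    if needs_nvme_offload:
        return "DeepSpeed ZeRO-3 + ZeRO-Infinity"
    if needs_optimizer_only_offload:
        return "DeepSpeed ZeRO-3"
    return "FSDP"


def consolidated_export_option(stack: str) -> str:
    """Which option yields inference-ready weights on the chosen stack."""
    if stack.startswith("DeepSpeed"):
        return "zero3_save_16bit_model"
    return "FULL_STATE_DICT"


if __name__ == "__main__":
    stack = recommend_stack(needs_nvme_offload=False,
                            needs_optimizer_only_offload=True)
    print(stack, "->", consolidated_export_option(stack))
```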
Choose FSDP when operational simplicity matters more than exotic offload
FSDP is the right default when your GPU fleet can run the fine-tune with FULL_SHARD and either no offload or full CPU offload. As the PyTorch FSDP documentation describes: "FSDP shards module parameters, gradients, and optimizer states across data parallel workers to save memory at the cost of communication."
Choose FSDP when:
- GPU count × GPU VRAM covers the model with FULL_SHARD (no offload), or full CPU offload is an acceptable fallback
- You want sharded checkpoints without a post-conversion workflow
- Your training script is built on Transformers and you want fsdp_cpu_ram_efficient_loading
- You want a single Accelerate YAML as the sole configuration artifact
- The team prioritizes debuggability over maximum memory efficiency
Choose DeepSpeed when separate offload and NVMe are worth the complexity
DeepSpeed ZeRO-3's core advantage is offload granularity. As the DeepSpeed ZeRO-3 Offload documentation states: "ZeRO-3 Offload can partition the parameters across GPUs, and offload them to CPU, supporting model sizes that are much larger than the memory on a single GPU." NVMe offload extends this further.
Choose DeepSpeed ZeRO-3 when:
- You need optimizer-only CPU offload while keeping parameters GPU-resident
- NVMe offload is the only path to fitting the model on your node
- Your team already maintains DeepSpeed JSON configs and has operational familiarity
- You need zero3_save_16bit_model for direct inference-weight export at training end
- Fine-grained ZeRO bucket sizes and prefetch tuning are worth the config surface
Watch Out: The complexity budget for DeepSpeed includes the JSON config, the DeepSpeedCPUAdam optimizer requirement for CPU offload, the zero_to_fp32.py conversion workflow, and the 2× RAM requirement for post-training weight consolidation. On a 70B model, that last point alone can require roughly 560 GB of host RAM for the conversion step (2 × the ~280 GB fp32 output). If your infrastructure does not have that headroom, plan the checkpoint strategy before training starts, not after.
FAQ
Which is better, FSDP or DeepSpeed? Neither is categorically better. FSDP is operationally simpler with cleaner sharded checkpointing. DeepSpeed ZeRO-3 offers more granular offload control and NVMe support. The right choice depends on your memory constraints and checkpoint requirements, not on a universal ranking.
Does FSDP support offloading? Yes, but only as a binary switch. Enabling FSDP CPU offload moves all parameters and optimizer states to CPU — you cannot offload only optimizer states while keeping parameters on GPU. DeepSpeed can split these independently.
How do you save a checkpoint in DeepSpeed ZeRO-3?
Two paths: (1) save partitioned shards during training (default) and run zero_to_fp32.py post-training to consolidate — this requires approximately 2× the final checkpoint size in host RAM; (2) set stage3_gather_16bit_weights_on_model_save: true (the zero3_save_16bit_model flag) to consolidate to a single rank at save time, trading save latency for immediate portability.
What is the difference between FSDP and DeepSpeed in Accelerate?
Both integrate into Hugging Face Accelerate and both support full parameter sharding. The differences are in config surface (FSDP uses plugin args only; DeepSpeed also requires a JSON config), offload semantics (FSDP is all-or-nothing; DeepSpeed splits param and optimizer offload), checkpoint format (FSDP prefers SHARDED_STATE_DICT; DeepSpeed produces partitioned shards requiring zero_to_fp32.py for consolidation), and NVMe support (DeepSpeed only).
| Question | FSDP answer | DeepSpeed ZeRO-3 answer |
|---|---|---|
| Supports offloading? | Yes, all-or-nothing to CPU | Yes, params and optimizer independently, CPU or NVMe |
| How to save a 70B checkpoint? | SHARDED_STATE_DICT (recommended) | Partitioned shards + zero_to_fp32.py, or zero3_save_16bit_model |
| Better for operational simplicity? | Yes | No — requires JSON config + conversion tooling |
| Better for memory-constrained nodes? | No — binary offload limits flexibility | Yes — granular offload, NVMe support |
Sources and references
- Hugging Face Accelerate: FSDP and DeepSpeed concept guide. Primary source for the FSDP ↔ ZeRO stage mapping, offload semantics, and checkpoint comparison
- Hugging Face Accelerate: FSDP usage guide. Source for the SHARDED_STATE_DICT recommendation and fsdp_cpu_ram_efficient_loading behavior
- PyTorch FSDP: fully_shard documentation. Source for the FSDP parameter/gradient/optimizer-state sharding description
- DeepSpeed ZeRO tutorial. Source for zero_to_fp32.py auto-generation and the 2× RAM requirement
- DeepSpeed ZeRO-Offload tutorial. Source for the optimizer offload definition and CPU offload semantics
- DeepSpeed ZeRO-3 Offload blog post. Source for ZeRO-3 parameter partitioning and CPU offload capability
- DeepSpeed config JSON reference. Source for the zero_optimization key structure, offload config fields, and optimizer requirements



