Implementing Machine Unlearning for NIST AI 100-2e Compliance

14 min read · Published Apr 25, 2026, 12:02 AM

The regulatory pressure on AI systems to respect data removal requests has moved from theoretical to operational. NIST AI 100-2e frames "Right to be Forgotten" not as a UI toggle but as a traceable, auditable technical guarantee—one that output suppression cannot satisfy. This article provides the exact implementation path: from gradient manipulation primitives to SISA shard architectures, from differential privacy budget accounting to onion-effect diagnostics.


The NIST AI 100-2e Mandate: Beyond Output Suppression

NIST AI 100-2e categorizes AI risks with an explicit requirement for data provenance and traceability. When a data subject invokes their right to erasure, the compliance obligation is not to hide the model's response—it is to eliminate the training signal itself. Output suppression (content filters, guardrails, RLHF-tuned refusals) operates entirely at inference time: it adds negligible latency and modifies zero parameters.

This is legally and technically insufficient. As the machine learning research community has clarified, "machine unlearning is now a subarea of machine learning that both develops methods for (1) the targeted removal of the effect of training data from the trained model and (2) the targeted suppression of content in a generative-AI model's outputs" (SpicyIP, 2025). The distinction matters: method (1) satisfies an audit; method (2) fails it.

An auditor examining NIST AI 100-2e compliance will ask for a data lineage artifact proving that a specific training record no longer influences model predictions. A guardrail produces no such artifact. It can also be bypassed via prompt injection, adversarial suffixes, or jailbreak techniques—leaving the latent training data influence fully intact in the weight space. NIST's accountability and traceability governance functions demand more.

Technical Warning: Output suppression is not a compliance substitute for parameter-level unlearning. Deploying guardrails as your primary "Right to be Forgotten" mechanism exposes your organization to audit failure and potential regulatory liability under frameworks referencing NIST AI 100-2e.


Architectural Foundation: Parameter-Level vs. Output-Level Control

The core architectural distinction is where the intervention occurs. Output-level control intercepts activations or logits at inference time. Parameter-level control modifies the weight tensors that encode training data influence—permanently, verifiably, before inference ever runs.

| Dimension | Output Suppression | Parameter-Level Unlearning |
| --- | --- | --- |
| Intervention point | Inference-time (post-forward pass) | Weight space (pre-deployment) |
| Computational overhead | Negligible (filter latency < 1ms) | High (partial to full retraining cycle) |
| Compliance risk | High (no audit artifact, bypassable) | Low (produces verifiable weight deltas) |
| Model drift risk | None | Moderate (requires D_r validation) |
| Data provenance | None | Full (tied to training subset D_f) |
| Bypass vector | Prompt injection, adversarial inputs | None (weights are modified) |

Parameter-level modification carries real risk of model drift; output suppression carries real risk of prompt-injection-based bypass. For NIST AI 100-2e compliance, only parameter-level modification creates an auditable chain from the erasure request to the model artifact.
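To make the inference-time nature of output suppression concrete, here is a minimal, hypothetical guardrail wrapper (the `GuardrailFilter` class and its blocklist are illustrative, not a real library API). Note what it never does: read or write a weight tensor.

```python
class GuardrailFilter:
    """Inference-time output suppression: intercepts generated text only.
    The underlying model parameters are never read or modified."""

    def __init__(self, generate_fn, blocklist):
        self.generate_fn = generate_fn   # any callable: prompt -> text
        self.blocklist = blocklist       # terms tied to an erasure request

    def __call__(self, prompt: str) -> str:
        text = self.generate_fn(prompt)
        if any(term.lower() in text.lower() for term in self.blocklist):
            return "[REDACTED: erasure-request filter]"
        return text

# The training signal survives: the same generate_fn, called without the
# wrapper (or reached via a successful jailbreak), still emits the data.
guarded = GuardrailFilter(lambda p: "Jane Doe lives at 12 Elm St", ["jane doe"])
```

Because the wrapper leaves `generate_fn` untouched, there is no weight delta to checksum—hence no audit artifact.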

Mapping Risks to NIST AI RMF

NIST AI RMF's GOVERN, MAP, MEASURE, and MANAGE functions each have direct hooks into machine unlearning requirements:

  • GOVERN 1.1 — Policies must explicitly define the unlearning SLA (e.g., removal within 30 days of request).
  • MAP 2.2 — Training data provenance must be catalogued at the record level, not just the dataset level. Shadow AI—models trained on undocumented data pipelines—represents an acute risk here because unlearning requires access to the original forget set (D_f).
  • MEASURE 2.5 — Post-unlearning model performance on the retain set (D_r) must be benchmarked against a baseline to detect regression.
  • MANAGE 2.4 — Unlearning events must produce immutable logs: timestamp, D_f hash, weight checksum before and after, and privacy budget consumed.

The audit requirements follow directly:

  1. Record-level training data inventory with cryptographic hashes
  2. Mapping of training record influence to model layer activations (gradient attribution)
  3. Pre/post weight checksum comparison for each unlearning event
  4. Differential privacy budget ledger per unlearning cycle
  5. Retain-set performance regression report against baseline
  6. Immutable unlearning event log stored outside the model artifact
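Items 1, 3, and 6 above can be sketched with the standard library alone (the `build_unlearn_artifact` helper is illustrative; weights are represented here as raw serialized bytes, whereas a real pipeline would hash the saved state_dict file):

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_unlearn_artifact(forget_records: list,
                           weights_before: bytes,
                           weights_after: bytes,
                           epsilon: float, delta: float) -> str:
    """Produce one JSON-serialized audit record for an unlearning event,
    suitable for appending to an immutable (WORM) event log."""
    artifact = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "d_f_hashes": [sha256_hex(r) for r in forget_records],  # record-level inventory
        "weight_checksum_before": sha256_hex(weights_before),
        "weight_checksum_after": sha256_hex(weights_after),
        "privacy_budget": {"epsilon": epsilon, "delta": delta},
    }
    return json.dumps(artifact, sort_keys=True)
```

A compliant event must show the weight checksum actually changed; identical before/after checksums are evidence that no parameter-level removal occurred.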

Implementing Gradient-Based Unlearning via PyTorch

Gradient ascent unlearning inverts the standard training objective. Standard training minimizes loss on D_f; gradient ascent maximizes it, driving the model parameters away from the decision boundary learned from the forget set. The mechanism is precise: compute the loss on D_f, negate it, and apply optimizer updates. Research suggests this approach can facilitate unlearning of up to 30% of a dataset without substantial catastrophic forgetting—but that threshold is architecture-dependent and must be validated empirically.

"Gradient ascent, which is used in unlearning methods, effectively removes the influence of specific data by adjusting the model parameters in the opposite direction of the gradient" (Medium: Machine Unlearning Tutorial 3, 2026).

The critical implementation detail is constraining the ascent so it does not corrupt D_r. A joint loss formulation—ascending on D_f while descending on a sampled D_r batch—provides that constraint.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from copy import deepcopy

def gradient_ascent_unlearn(
    model: nn.Module,
    forget_loader: DataLoader,      # D_f: the subset to be unlearned
    retain_loader: DataLoader,      # D_r: sampled retain set for stability
    optimizer: torch.optim.Optimizer,
    criterion: nn.Module,
    num_epochs: int = 5,
    retain_weight: float = 0.5,     # balances forget-ascent vs. retain-descent
    device: str = "cuda"
) -> tuple[nn.Module, nn.Module]:
    """
    Jointly ascends loss on D_f and descends loss on D_r.
    Returns (unlearned model, frozen pre-unlearning baseline); the baseline
    is kept for onion-effect diagnostics. Caller is responsible for checkpointing.
    """
    model = model.to(device)
    baseline = deepcopy(model)  # preserve for onion-effect diagnostics later
    baseline.eval()

    retain_iter = iter(retain_loader)

    for epoch in range(num_epochs):
        model.train()
        for forget_inputs, forget_labels in forget_loader:
            forget_inputs = forget_inputs.to(device)
            forget_labels = forget_labels.to(device)

            # --- Forget loss: ASCENT on D_f ---
            forget_outputs = model(forget_inputs)
            # Negate the loss to push parameters AWAY from D_f
            loss_forget = -criterion(forget_outputs, forget_labels)

            # --- Retain loss: DESCENT on D_r (prevents catastrophic forgetting) ---
            try:
                retain_inputs, retain_labels = next(retain_iter)
            except StopIteration:
                retain_iter = iter(retain_loader)
                retain_inputs, retain_labels = next(retain_iter)

            retain_inputs = retain_inputs.to(device)
            retain_labels = retain_labels.to(device)
            retain_outputs = model(retain_inputs)
            loss_retain = criterion(retain_outputs, retain_labels)

            # Combined loss: balance forget-ascent against retain-descent
            total_loss = loss_forget + retain_weight * loss_retain

            optimizer.zero_grad()
            total_loss.backward()

            # Clip gradients to prevent catastrophic unlearning on D_r
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

        print(f"Epoch {epoch+1}/{num_epochs} | "
              f"Forget Loss: {loss_forget.item():.4f} | "
              f"Retain Loss: {loss_retain.item():.4f}")

    return model, baseline

Pro-Tip: Use PyTorch 2.x torch.compile() on the training loop to reduce gradient computation overhead by ~20% on H100 hardware. Register Autograd hooks on targeted parameter groups to log per-layer gradient magnitudes for the NIST audit artifact.
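The hook-registration idea can be sketched as follows (the `attach_grad_norm_loggers` helper is illustrative; it uses `Tensor.register_hook`, which fires on each parameter's gradient during every backward pass):

```python
import torch
import torch.nn as nn

def attach_grad_norm_loggers(model: nn.Module, log: dict) -> None:
    """Register per-parameter backward hooks that record gradient L2 norms.
    The resulting log can be exported as part of the NIST audit artifact."""
    for name, param in model.named_parameters():
        def _hook(grad, name=name):  # bind name at definition time
            log.setdefault(name, []).append(grad.norm().item())
            return grad
        param.register_hook(_hook)
```

During a gradient ascent cycle, a sudden spike in a layer's logged norm is an early signal that ascent on D_f is spilling into shared weight subspaces.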

The SISA Framework for Efficient Removals

Full model retraining from scratch is the gold-standard proof of unlearning, but its cost is O(n) over the full dataset. SISA (Sharded, Isolated, Sliced, and Aggregated) reduces this to approximately O(n/k + s), where k is the number of shards and s is the slice size. Only the shard containing the forget-set record requires retraining—not the entire model.

The architecture distributes training data across k disjoint shards. Each shard trains an isolated sub-model. Inference aggregates predictions across all sub-models. When an erasure request arrives, the team identifies which shard contains D_f, retrains only that sub-model from the relevant slice checkpoint, and re-aggregates.

flowchart TD
    A[Full Training Dataset D] --> B[Shard Partitioner\nDisjoint Split into k shards]
    B --> S1[Shard 1\nSub-model M1]
    B --> S2[Shard 2\nSub-model M2]
    B --> SK[Shard k\nSub-model Mk]

    S1 --> SL1[Sliced Checkpoints\nC1_1, C1_2, ... C1_s]
    S2 --> SL2[Sliced Checkpoints\nC2_1, C2_2, ... C2_s]
    SK --> SLK[Sliced Checkpoints\nCk_1, Ck_2, ... Ck_s]

    SL1 --> AGG[Aggregator\nEnsemble / Voting Layer]
    SL2 --> AGG
    SLK --> AGG

    AGG --> OUT[Model Output]

    DR[Erasure Request\nD_f identified in Shard j] --> RETRAIN[Retrain Sub-model Mj\nfrom last clean slice checkpoint]
    RETRAIN --> AGG

    style DR fill:#ff6b6b,color:#fff
    style RETRAIN fill:#ffa94d,color:#fff

Technical Warning: SISA requires maintaining disjoint shards, which increases memory overhead during initial training and aggregate inference cycles. For large language models, shard boundaries must be established at training time—retrofitting SISA onto an already-trained model is not possible without full retraining.

The NIST traceability requirement maps cleanly to SISA: each shard checkpoint carries a cryptographic hash of its constituent training records. The erasure audit log references the specific shard ID, the slice checkpoint resumed from, and the post-retrain weight checksum.
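The shard-lookup side can be sketched in a few lines (hash-based assignment is one common choice; the function names are illustrative). Deterministic record-to-shard mapping means an erasure request resolves to the affected shards without scanning the data:

```python
import hashlib

def shard_for(record_id: str, k: int) -> int:
    """Deterministically map a training record to one of k disjoint shards."""
    digest = hashlib.sha256(record_id.encode()).hexdigest()
    return int(digest, 16) % k

def shards_to_retrain(forget_ids, k: int) -> list:
    """Given the D_f record IDs from an erasure request, return the minimal
    set of shard indices whose sub-models must be retrained."""
    return sorted({shard_for(rid, k) for rid in forget_ids})
```

Only the returned shards resume from their last clean slice checkpoint; every other sub-model—and its audit hash—remains untouched.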

Managing Differential Privacy (epsilon, delta) Targets

Differential privacy provides the mathematical foundation for quantifying what an adversary can infer about D_f after unlearning. The privacy guarantee is expressed as (ε, δ): ε bounds the worst-case information leakage; δ is the probability that the bound fails.

"Every DP operation consumes a finite privacy budget, measured through ε (epsilon) for worst-case leakage and δ (delta) for failure probability" (Provectus, 2026).

After an unlearning cycle, privacy leakage must be evaluated by comparing the output distributions of the unlearned model M_u and a reference model M_ref trained without D_f:

$$\Pr[M_u(x) \in S] \leq e^{\varepsilon} \cdot \Pr[M_{ref}(x) \in S] + \delta$$

For any measurable output set S and any input x, the distributions must satisfy this bound. In practice, you evaluate this empirically using membership inference attacks: train a shadow model on the same distribution and test whether D_f records can be distinguished from non-training records in M_u's outputs.

import torch
import numpy as np
from scipy.special import rel_entr

def estimate_privacy_leakage(
    unlearned_model: torch.nn.Module,
    reference_model: torch.nn.Module,  # trained WITHOUT D_f
    forget_loader: torch.utils.data.DataLoader,
    device: str = "cuda",
    delta: float = 1e-5
) -> dict:
    """
    Estimates epsilon by comparing output distributions of M_u vs M_ref.
    Uses KL divergence as a proxy for privacy leakage.
    """
    unlearned_model.eval()
    reference_model.eval()
    kl_divergences = []

    with torch.no_grad():
        for inputs, _ in forget_loader:
            inputs = inputs.to(device)

            # Softmax probabilities from each model
            p_unlearned = torch.softmax(unlearned_model(inputs), dim=-1).cpu().numpy()
            p_reference = torch.softmax(reference_model(inputs), dim=-1).cpu().numpy()

            # KL divergence: D_KL(P_unlearned || P_reference) per sample
            # Lower divergence = unlearned model approximates reference on D_f
            for p, q in zip(p_unlearned, p_reference):
                # Clip to avoid log(0); add small epsilon for numerical stability
                p = np.clip(p, 1e-10, 1.0)
                q = np.clip(q, 1e-10, 1.0)
                kl = np.sum(rel_entr(p, q))
                kl_divergences.append(kl)

    mean_kl = np.mean(kl_divergences)
    max_kl = np.max(kl_divergences)

    # Heuristic: treat the max per-sample KL divergence as a proxy for
    # worst-case leakage. This is NOT a formal (epsilon, delta)-DP bound;
    # formal accounting requires analyzing the noising mechanism itself.
    epsilon_approx = max_kl

    return {
        "epsilon_approx": epsilon_approx,
        "delta": delta,
        "mean_kl_divergence": mean_kl,
        "max_kl_divergence": max_kl,
        "compliant": epsilon_approx < 1.0  # adjust threshold per policy
    }

Pro-Tip: Floating-point arithmetic introduces non-zero leakage even in nominally DP-compliant models. Use deterministic CUDA operations (torch.use_deterministic_algorithms(True)) and audit gradient accumulation precision to prevent silent privacy budget overruns.
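A minimal determinism preamble for unlearning jobs, assuming PyTorch (the cuBLAS workspace variable must be set before the first CUDA call):

```python
import os
import torch

# Required by cuBLAS for deterministic GEMM kernels; must be set before
# the first CUDA context is created.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Raise an error if any op would fall back to a nondeterministic kernel.
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False
```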


Diagnosing the Onion Effect

The onion effect is the primary threat to retain-set integrity during unlearning. As Carlini et al. demonstrated, "unlearning specific data points can inadvertently expose information about others in the retained set due to the 'onion effect' of privacy" (Carlini et al., 2022; referenced in ResearchGate, 2026). The mechanism: removing a data point changes the model's decision boundaries, potentially moving other D_r records closer to the boundary and making them more vulnerable to membership inference attacks.

Aggressive gradient ascent on D_f does not stay contained. Depending on training data correlation structure, parameter updates propagate through shared weight subspaces, degrading performance on semantically adjacent D_r records. The diagnostic requirement is mandatory: after every unlearning cycle, validate model performance on a held-out D_r benchmark against a frozen baseline.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def diagnose_retain_set_integrity(
    unlearned_model: nn.Module,
    baseline_model: nn.Module,       # snapshot before unlearning (deepcopy)
    retain_loader: DataLoader,
    device: str = "cuda",
    accuracy_tolerance: float = 0.02  # max 2% degradation allowed
) -> dict:
    """
    Compares unlearned model vs. baseline on D_r.
    Flags integrity violation if accuracy drops beyond tolerance threshold.
    """
    def compute_accuracy(model, loader):
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for inputs, labels in loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                preds = outputs.argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        return correct / total if total > 0 else 0.0

    baseline_acc = compute_accuracy(baseline_model, retain_loader)
    unlearned_acc = compute_accuracy(unlearned_model, retain_loader)
    degradation = baseline_acc - unlearned_acc

    integrity_violation = degradation > accuracy_tolerance

    report = {
        "baseline_accuracy": baseline_acc,
        "unlearned_accuracy": unlearned_acc,
        "degradation": degradation,
        "tolerance": accuracy_tolerance,
        "integrity_violation": integrity_violation,
        "recommendation": (
            "HALT: Re-run with reduced ascent learning rate or fewer epochs."
            if integrity_violation else
            "PASS: Retain-set integrity within acceptable bounds."
        )
    }

    print(f"Baseline Acc: {baseline_acc:.4f} | Unlearned Acc: {unlearned_acc:.4f} | "
          f"Degradation: {degradation:.4f} | Violation: {integrity_violation}")

    return report

Run this diagnostic after each gradient ascent epoch, not just at cycle completion. Early detection prevents compounding degradation across iterative unlearning events.

Infrastructure Requirements for High-Performance Re-training

SISA shard retraining and gradient ascent cycles on large models require dedicated compute. Full retraining of a multi-billion parameter model on CPU or consumer GPU hardware is operationally infeasible within any reasonable compliance SLA.

Minimum hardware specification for production unlearning pipelines:

  • GPU: NVIDIA H100 80GB (preferred) or A100 80GB arrays, minimum 4× per unlearning job
  • Interconnect: NVLink 4.0 for intra-node; InfiniBand HDR (200Gb/s) for multi-node SISA shard distribution — cluster bandwidth is the primary bottleneck for distributed shard updates
  • Storage: NVMe SSD arrays with ≥10GB/s sequential read for checkpoint loading; D_f and D_r subsets must be retrievable within seconds, not minutes
  • Memory: SISA shard sub-models require proportional GPU VRAM; factor in 2× overhead for gradient accumulation buffers

Compliance pipeline operationalization checklist:

  • [ ] Training data inventory database with record-level SHA-256 hashes
  • [ ] Shard-to-record mapping table, immutable, cryptographically signed
  • [ ] Automated D_f identification from erasure request to shard lookup
  • [ ] Gradient ascent job scheduler with priority queue for SLA management
  • [ ] Post-unlearning checkpoint storage with before/after weight diff artifacts
  • [ ] Retain-set integrity diagnostic integrated as a pipeline gate (block deployment on violation)
  • [ ] Privacy leakage estimation job triggered automatically post-unlearning
  • [ ] Immutable event log export (JSON-L format) to SIEM or compliance data lake
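The job-scheduler item above can be sketched with a stdlib min-heap keyed on SLA deadline (the `UnlearnScheduler` class and its defaults are illustrative):

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass(order=True)
class UnlearnJob:
    sla_deadline: datetime                      # heap ordering key
    request_id: str = field(compare=False)
    shard_id: int = field(compare=False)

class UnlearnScheduler:
    """Min-heap priority queue: jobs closest to their SLA deadline run first."""

    def __init__(self):
        self._heap = []

    def submit(self, request_id: str, shard_id: int, received: datetime,
               sla: timedelta = timedelta(days=30)) -> None:
        heapq.heappush(self._heap, UnlearnJob(received + sla, request_id, shard_id))

    def next_job(self):
        return heapq.heappop(self._heap) if self._heap else None
```

Requests received earlier surface first, so a backlog never silently pushes an old erasure request past its 30-day SLA.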

Operationalizing Compliance: A CISO's Checklist

Technical implementation without operational governance fails NIST AI 100-2e in a different way. The framework's GOVERN and MANAGE functions require that unlearning processes be repeatable, auditable, and organizationally accountable—not just technically functional.

NIST AI 100-2e compliance integration checklist:

Governance Layer

  • [ ] Define unlearning SLA in data processing agreements (target: ≤30 days from verified erasure request)
  • [ ] Assign explicit ownership: who authorizes a D_f removal event, who executes it, who validates it
  • [ ] Establish model versioning policy: every unlearning event produces a new versioned artifact

Technical Audit Layer

  • [ ] Automate NIST audit artifact generation: D_f hash, pre-unlearn weight checksum, post-unlearn weight checksum, epsilon/delta report, retain-set integrity report
  • [ ] Route all artifacts to an append-only audit log (immutable storage: WORM-compliant S3 bucket or equivalent)
  • [ ] Integrate the audit log schema into SIEM for automated anomaly detection on unlearning frequency and privacy budget consumption

Validation Layer

  • [ ] Mandatory membership inference attack test on D_f post-unlearning (confirm records are no longer distinguishable as training members)
  • [ ] Retain-set accuracy regression report against versioned baseline (gate on ≤2% degradation)
  • [ ] Privacy budget ledger: track cumulative ε consumed across all unlearning events per model version

Reporting Layer

  • [ ] Automated compliance report generation per unlearning event, consumable by legal and privacy teams without raw ML tooling
  • [ ] Escalation path for integrity violations: an onion-effect violation triggers a halt, human review, and a potential full shard retrain
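The membership inference test in the Validation Layer can be sketched as a simple loss-threshold attack (illustrative only; production audits should use stronger attacks). An attack accuracy near 0.5 on D_f versus held-out non-members is the desired post-unlearning outcome:

```python
import numpy as np

def mia_threshold_attack(member_losses, nonmember_losses) -> float:
    """Loss-threshold membership inference: training members tend to have
    lower loss. Sweeps all thresholds and returns the best attack accuracy;
    a value near 0.5 means D_f records are indistinguishable from non-members."""
    losses = np.concatenate([member_losses, nonmember_losses])
    labels = np.concatenate([np.ones(len(member_losses)),
                             np.zeros(len(nonmember_losses))])
    best = 0.5
    for t in np.unique(losses):
        preds = (losses <= t).astype(float)   # predict "member" if loss <= t
        acc = float((preds == labels).mean())
        best = max(best, acc, 1.0 - acc)      # attacker may invert the rule
    return best
```

Gate the pipeline on this score: if the attack still separates D_f from non-members well above chance, the unlearning cycle failed regardless of what the loss curves show.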


Future-Proofing Your Machine Learning Lifecycle

The trajectory of AI accountability regulation in 2026 is unambiguous: voluntary frameworks like NIST AI 100-2e are the floor, not the ceiling. The EU AI Act's conformity assessment requirements for high-risk AI systems and emerging state-level privacy legislation in the US are already referencing technical standards that assume parameter-level unlearning capability.

The teams building this infrastructure now will have a compounding advantage. Each unlearning event logged, each privacy budget accounted for, each retain-set diagnostic validated, builds the evidentiary record that regulators will eventually require by default.

Automated machine unlearning pipelines—where erasure requests trigger a fully automated D_f identification, shard retrain, DP audit, and artifact export without human intervention—are the target architecture. Teams should be building toward sub-24-hour unlearning SLAs with zero-touch compliance artifact generation. The alternative is manual, error-prone responses to data subject requests that scale poorly against growing model sizes and increasingly granular privacy mandates.

Three near-term developments will reshape the implementation calculus: (1) hardware-native DP noise injection in next-generation accelerators, reducing the computational overhead of privacy-preserving unlearning; (2) standardized unlearning audit schemas that compliance tools can consume directly; and (3) federated unlearning protocols that extend parameter-level removal guarantees to models trained across distributed data silos.

The investment in parameter-level unlearning infrastructure today is not just a compliance cost—it is the foundation of a trustworthy, auditable ML lifecycle that survives regulatory evolution without requiring architectural rebuilds.


Keywords: SISA (Sharded, Isolated, Sliced, and Aggregated), Gradient Ascent, Differential Privacy (epsilon-delta), Catastrophic Forgetting, NIST AI Risk Management Framework, Model Retraining Cycle, PyTorch Gradient Manipulation, Catastrophic Unlearning, Right to be Forgotten, Machine Learning Auditability