Scaling Neural Operators for Industry-Scale 3D Surrogate Modeling

16 min read · Published Apr 22, 2026, 12:05 AM

Static mesh-based solvers are hitting a wall. Computational Fluid Dynamics workflows that once required days of HPC cluster time are now under pressure to deliver real-time feedback for digital twin environments, autonomous systems, and GNSS-denied navigation. Neural operators—learned mappings between function spaces—bridge that gap, but their industrial deployment requires solving three simultaneous problems: heterogeneous sensor fusion, catastrophic forgetting during continual training, and sub-millisecond inference at the edge. This article maps the full stack, from DINOv2 feature extraction through NVIDIA Triton orchestration.


Architecting High-Efficiency 3D Surrogate Modeling

Traditional CFD pipelines discretize physical domains into static meshes, then solve governing PDEs iteratively. The computational cost scales poorly with domain resolution, and the solver must restart completely when boundary conditions change. Neural operators—specifically Fourier Neural Operators (FNOs) and their 3D extensions—learn resolution-invariant mappings that amortize the simulation cost across a training distribution. Once trained, a neural operator evaluates in milliseconds against solver runtimes measured in hours.
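The resolution invariance comes from learning weights in frequency space rather than on the grid. A minimal sketch of a single 1D spectral convolution layer in the FNO style illustrates the mechanism (the class name and mode count here are illustrative, not taken from any specific library):

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """Single Fourier layer: FFT -> truncate to low modes -> learned complex
    multiply -> inverse FFT. Weights live in frequency space, so the same
    layer evaluates on any grid resolution."""

    def __init__(self, in_ch: int, out_ch: int, modes: int):
        super().__init__()
        self.modes = modes  # number of retained low-frequency modes
        scale = 1.0 / (in_ch * out_ch)
        self.weight = nn.Parameter(
            scale * torch.randn(in_ch, out_ch, modes, dtype=torch.cfloat)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, C, N] real-valued field sampled on a uniform grid
        x_ft = torch.fft.rfft(x)  # [B, C, N//2 + 1] complex spectrum
        out_ft = torch.zeros(
            x.size(0), self.weight.size(1), x_ft.size(-1),
            dtype=torch.cfloat, device=x.device,
        )
        # Multiply only the retained low modes; higher modes are zeroed
        out_ft[:, :, :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[:, :, :self.modes], self.weight
        )
        return torch.fft.irfft(out_ft, n=x.size(-1))  # back to [B, out_ch, N]

layer = SpectralConv1d(in_ch=3, out_ch=8, modes=16)
print(layer(torch.randn(2, 3, 128)).shape)  # torch.Size([2, 8, 128])
print(layer(torch.randn(2, 3, 256)).shape)  # torch.Size([2, 8, 256]) -- same weights
```

The second call demonstrates the amortization claim: one trained layer serves multiple discretizations without retraining.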

The transition from mesh-based solvers to neural operators introduces a new bottleneck: sensor input heterogeneity. Industrial deployments ingest LiDAR point clouds, thermal imagery, visible-spectrum video, and inertial measurements simultaneously. Each modality has distinct spatial resolution, noise characteristics, and temporal sampling rates. A shared backbone must project all of these into a common feature space without discarding modality-specific signal.

SCC-Loc demonstrates one production-grade answer to this problem. The framework achieves a 7.6-fold accuracy improvement compared to traditional localization baselines under a strict 5-meter error threshold, with a mean localization error of 9.37 meters in GNSS-denied test environments. The architectural requirement is a shared DINOv2 backbone for cross-modal feature extraction—a constraint that shapes every design decision downstream.

The data flow from raw sensor ingestion to SCC-Loc localization output follows this pipeline:

flowchart TD
    A[Raw Sensor Inputs\nLiDAR · Thermal · Visible · IMU] --> B[Modality-Specific Preprocessors\nVoxelization · Normalization · Temporal Alignment]
    B --> C[Shared DINOv2 ViT Backbone\nPatch Embedding + Transformer Blocks]
    C --> D[Global CLS Token\nCross-Modal Alignment Space]
    C --> E[Intermediate Spatial Feature Maps\nPatch-Level Tokens per Modality]
    D --> F[Semantic-Guided Viewport Alignment\nSGVA Module]
    E --> G[Cascaded Spatial-Adaptive\nTexture-Structure Filtering]
    F --> H[Consensus-Driven Reliability Weighting]
    G --> H
    H --> I[3D Surrogate Model\nNeural Operator]
    I --> J[Localization Output\nPose Estimate + Uncertainty]
    J --> K[Digital Twin State Update]

    style A fill:#1e3a5f,color:#fff
    style I fill:#1e3a5f,color:#fff
    style K fill:#2d6a4f,color:#fff
    style H fill:#6b3a2a,color:#fff

The CLS token from DINOv2 carries semantic global context; the patch-level tokens carry spatial structure. Both are required—global tokens drive cross-modal alignment, while patch tokens feed the spatial-adaptive filtering stage that ultimately constrains inference latency.


The SCC-Loc Framework: Bridging the Modality Gap

Fusing thermal and visible-spectrum imagery is not a registration problem—it is a semantic alignment problem. Thermal sensors respond to emissive surfaces; visible cameras respond to reflective ones. Identical geographic features produce structurally different representations across modalities. Naive feature concatenation amplifies this divergence rather than resolving it.

SCC-Loc addresses this through its Semantic-Guided Viewport Alignment (SGVA) module. As stated directly in the paper: "We design the Semantic-Guided Viewport Alignment (SGVA) module to adaptively optimize satellite crop regions, effectively correcting initial spatial deviations." SGVA treats viewport selection as a learnable optimization, iteratively refining the spatial crop of satellite imagery to maximize semantic agreement with the ground-level query image. This eliminates a class of feature ambiguity that plagued prior retrieval-based localization systems operating across large-scale geo-referenced databases.

The reliability weighting mechanism governing how SCC-Loc arbitrates between competing modality signals uses a consensus-driven formulation. For a set of candidate location hypotheses $\mathcal{H} = \{h_1, h_2, \ldots, h_N\}$ and their corresponding similarity scores $\{s_1, s_2, \ldots, s_N\}$ from $M$ modalities, the reliability weight $w_m$ for modality $m$ is:

$$w_m = \frac{\exp\left(\alpha \cdot \text{Corr}(\mathbf{s}_m, \bar{\mathbf{s}})\right)}{\sum_{k=1}^{M} \exp\left(\alpha \cdot \text{Corr}(\mathbf{s}_k, \bar{\mathbf{s}})\right)}$$

where $\bar{\mathbf{s}} = \frac{1}{M}\sum_{k=1}^{M} \mathbf{s}_k$ is the consensus score vector, $\text{Corr}(\cdot, \cdot)$ denotes Pearson correlation, and $\alpha$ is a temperature hyperparameter controlling weight sharpness. Modalities whose ranking distributions agree with the consensus receive higher weight; outlier modalities—typically those degraded by sensor noise or occlusion—are attenuated automatically. This is not learned gating; it is an inference-time adaptive mechanism that requires no additional parameters.

Technical Warning: The consensus weight calculation assumes at least three active modalities to produce a stable $\bar{\mathbf{s}}$. In sensor dropout scenarios (e.g., thermal camera failure), the framework must detect single-modality operation and bypass the weighting step, falling back to raw similarity scores. Build explicit sensor health monitoring into your ingestion pipeline.
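A minimal NumPy sketch of the weighting formula, including the degraded-sensor fallback described in the warning (the function name and the $\alpha$ default are illustrative assumptions):

```python
import numpy as np

def consensus_weights(scores: np.ndarray, alpha: float = 5.0) -> np.ndarray:
    """Consensus-driven reliability weights over M modalities.

    scores: [M, N] similarity scores across N candidate hypotheses.
    Returns [M] weights summing to 1. Falls back to uniform weighting
    when fewer than three modalities are active, per the warning above.
    """
    M = scores.shape[0]
    if M < 3:
        # Consensus is unstable with < 3 modalities: bypass weighting
        return np.full(M, 1.0 / M)
    consensus = scores.mean(axis=0)  # \bar{s}: mean score vector
    corr = np.array([np.corrcoef(s, consensus)[0, 1] for s in scores])
    logits = alpha * corr
    logits -= logits.max()           # numerical stability before exp
    w = np.exp(logits)
    return w / w.sum()

rng = np.random.default_rng(0)
base = rng.normal(size=50)
scores = np.stack([
    base + 0.1 * rng.normal(size=50),  # agrees with consensus
    base + 0.1 * rng.normal(size=50),  # agrees with consensus
    rng.normal(size=50),               # outlier (e.g. occluded sensor)
])
print(consensus_weights(scores))  # outlier modality gets the smallest weight
```

Note there are no learnable parameters anywhere in this computation, matching the inference-time-only characterization above.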

The framework relies on global CLS tokens from pre-trained DINOv2 models to bridge the thermal-visible modality gap. These tokens encode high-level semantic categories (buildings, vegetation, road surfaces) that are invariant to the specific imaging modality, making them ideal anchors for cross-modal retrieval.


DINOv2 Backbones for Spatial-Adaptive Extraction

DINOv2's ViT-L/14 backbone, trained via self-supervised distillation on a curated 142M-image dataset, produces patch tokens with strong spatial correspondence properties—without any task-specific fine-tuning. For neural operator surrogate modeling, the relevant features live in intermediate transformer blocks, not only the final layer. Deeper layers encode semantic abstraction; shallower layers encode texture and local structure. Spatial-adaptive filtering in SCC-Loc exploits this hierarchy.

Compute overhead scales linearly with ViT depth. For production deployments, register forward hooks at specific block indices rather than extracting all 24 layers of ViT-L.

import torch
import torch.nn as nn
from torchvision import transforms
from PIL import Image

# Load pre-trained DINOv2 ViT-L/14 from torch.hub
# Requires torch>=2.0; weights are downloaded from the hub on first call
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
model.eval().cuda()

# Define which transformer block indices to tap for spatial features.
# Blocks 12 and 20 (of 24) capture mid-level texture and high-level semantics respectively.
TARGET_BLOCKS = [12, 20]
intermediate_features: dict[int, torch.Tensor] = {}

def make_hook(block_idx: int):
    """
    Returns a forward hook that stores the patch token output
    (excluding CLS token at position 0) for the specified block.
    """
    def hook(module: nn.Module, input: tuple, output: torch.Tensor) -> None:
        # output shape: [B, N+1, D] where N = num_patches, D = embed_dim
        # Slice off CLS token; retain spatial patch tokens only
        intermediate_features[block_idx] = output[:, 1:, :]
    return hook

# Register hooks on the chosen transformer blocks
hook_handles = []
for block_idx in TARGET_BLOCKS:
    handle = model.blocks[block_idx].register_forward_hook(make_hook(block_idx))
    hook_handles.append(handle)

# Preprocessing: DINOv2 expects 224px or 518px inputs normalized to ImageNet stats
preprocess = transforms.Compose([
    transforms.Resize(518),
    transforms.CenterCrop(518),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Load a sample image (replace with your sensor frame)
img = Image.open("sensor_frame.png").convert("RGB")
x = preprocess(img).unsqueeze(0).cuda()  # [1, 3, 518, 518]

with torch.no_grad():
    _ = model(x)  # Forward pass populates intermediate_features via hooks

# Retrieve and reshape patch tokens into spatial grids
# For 518px input with patch size 14: num_patches = (518/14)^2 = 1369
patch_grid_size = 518 // 14  # 37

for block_idx, tokens in intermediate_features.items():
    B, N, D = tokens.shape
    # Reshape flat patch sequence back to 2D spatial map for convolutional processing
    spatial_map = tokens.reshape(B, patch_grid_size, patch_grid_size, D)
    spatial_map = spatial_map.permute(0, 3, 1, 2)  # [B, D, H, W]
    print(f"Block {block_idx} spatial feature map: {spatial_map.shape}")

# Always remove hooks after use to prevent memory leaks in long-running services
for handle in hook_handles:
    handle.remove()

The spatial map at block 12 feeds directly into SCC-Loc's Cascaded Spatial-Adaptive Texture-Structure Filtering stage. The block 20 output drives the SGVA module's semantic correlation. By tapping two blocks rather than the final output alone, the surrogate model receives both structural edge information and abstract semantic context without doubling backbone inference cost.


Building a Unified Pipeline without Catastrophic Forgetting

A digital twin that ingests data from multiple operational domains—urban navigation, subsurface infrastructure, wind farm monitoring—cannot afford to retrain from scratch when adding a new domain. Standard fine-tuning overwrites previously learned weight configurations. Elastic Weight Consolidation (EWC) constrains updates to parameters that are not critical to prior task performance, measured by the diagonal of the Fisher Information Matrix (FIM).

EWC adds a regularization term to the loss:

$$\mathcal{L}_\text{EWC} = \mathcal{L}_\text{task} + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_i^*)^2$$

where $F_i$ is the diagonal Fisher estimate for parameter $i$, $\theta_i^*$ is the optimal parameter from the previous task, and $\lambda$ controls regularization strength.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from copy import deepcopy

class EWCTrainer:
    """
    Implements Elastic Weight Consolidation for continual learning
    across heterogeneous surrogate modeling domains.
    """

    def __init__(self, model: nn.Module, ewc_lambda: float = 5000.0):
        self.model = model
        self.ewc_lambda = ewc_lambda
        # Stores (optimal_params, fisher_diagonal) per completed task
        self.task_memory: list[dict[str, dict[str, torch.Tensor]]] = []

    def compute_fisher(self, dataloader: DataLoader, num_samples: int = 512) -> dict[str, torch.Tensor]:
        """
        Estimate diagonal Fisher Information Matrix via empirical approximation.
        Uses squared gradients of log-likelihood averaged over data samples.
        """
        fisher: dict[str, torch.Tensor] = {
            name: torch.zeros_like(param)
            for name, param in self.model.named_parameters()
            if param.requires_grad
        }
        self.model.eval()
        samples_processed = 0

        for batch in dataloader:
            if samples_processed >= num_samples:
                break
            inputs, targets = batch[0].cuda(), batch[1].cuda()
            self.model.zero_grad()
            outputs = self.model(inputs)
            # Use log-softmax + NLL as proxy for log-likelihood
            log_probs = F.log_softmax(outputs, dim=-1)
            loss = F.nll_loss(log_probs, targets)
            loss.backward()

            for name, param in self.model.named_parameters():
                if param.requires_grad and param.grad is not None:
                    # Accumulate squared gradients as Fisher diagonal estimate
                    fisher[name] += param.grad.detach().pow(2)

            samples_processed += inputs.size(0)

        # Normalize by number of processed samples
        for name in fisher:
            fisher[name] /= samples_processed

        return fisher

    def consolidate(self, dataloader: DataLoader) -> None:
        """
        Call after completing training on a task to record
        optimal parameters and their Fisher importance scores.
        """
        fisher = self.compute_fisher(dataloader)
        # Deep-copy current parameters as the reference point for this task
        optimal_params = {
            name: param.detach().clone()
            for name, param in self.model.named_parameters()
            if param.requires_grad
        }
        self.task_memory.append({"fisher": fisher, "params": optimal_params})

    def ewc_penalty(self) -> torch.Tensor:
        """
        Compute cumulative EWC regularization loss across all consolidated tasks.
        Returns zero tensor on GPU if no tasks have been consolidated yet.
        """
        if not self.task_memory:
            return torch.tensor(0.0, device=next(self.model.parameters()).device)

        penalty = torch.tensor(0.0, device=next(self.model.parameters()).device)
        for memory in self.task_memory:
            for name, param in self.model.named_parameters():
                if param.requires_grad and name in memory["fisher"]:
                    fisher_w = memory["fisher"][name]
                    ref_param = memory["params"][name]
                    # Penalize deviation from previous optimal, weighted by Fisher importance
                    penalty += (fisher_w * (param - ref_param).pow(2)).sum()

        return (self.ewc_lambda / 2.0) * penalty

    def train_step(
        self,
        task_loss: torch.Tensor,
        optimizer: torch.optim.Optimizer
    ) -> torch.Tensor:
        """
        Single training step combining task loss with EWC penalty.
        """
        total_loss = task_loss + self.ewc_penalty()
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
        return total_loss

Pro-Tip: Fisher Information scales poorly with model width. For DINOv2 ViT-L (307M parameters), computing the full diagonal FIM requires storing one float32 value per parameter per task—approximately 1.2 GB per consolidated domain. Limit EWC to the final attention layers and task-specific heads; freeze the backbone entirely between domains to avoid FIM storage explosion.
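One way to enforce that restriction is to freeze everything outside a whitelist of parameter-name prefixes before computing Fisher estimates, so `compute_fisher` and `consolidate` above only ever see the continually-trained subset. A sketch, with illustrative module names:

```python
import torch
import torch.nn as nn

def freeze_backbone_for_ewc(
    model: nn.Module, trainable_prefixes: tuple[str, ...]
) -> int:
    """Freeze all parameters except those whose names start with one of the
    given prefixes. Fisher estimates (and their storage) then cover only the
    trainable subset. Returns the trainable parameter count."""
    trainable = 0
    for name, param in model.named_parameters():
        keep = name.startswith(trainable_prefixes)
        param.requires_grad_(keep)
        if keep:
            trainable += param.numel()
    return trainable

# Toy stand-in for a backbone + head; the prefix names are illustrative
model = nn.Sequential()
model.add_module("backbone", nn.Linear(64, 64))
model.add_module("head", nn.Linear(64, 10))
n = freeze_backbone_for_ewc(model, trainable_prefixes=("head",))
print(n)  # 64 * 10 weights + 10 biases = 650 trainable parameters
```

Because `compute_fisher` and `consolidate` already filter on `param.requires_grad`, this one call is sufficient to cap FIM storage at the head's footprint per task.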


Managing Heterogeneous Sensor Data at Scale

Coordinate system misalignment between LiDAR, thermal, and visible sensors is the most common silent failure mode in digital twin deployments. Each sensor has its own extrinsic calibration matrix relative to a vehicle or infrastructure reference frame. Temporal skew compounds spatial misalignment when sensors operate at different frequencies (e.g., 10 Hz LiDAR, 30 Hz camera, 400 Hz IMU).

The configuration below defines a cross-modal stream registration schema compatible with ROS2 and custom ingestion pipelines:

# cross_modal_registration.yaml
# Extrinsic calibration for multi-sensor digital twin ingestion
# All transforms expressed as 4x4 SE(3) matrices relative to LiDAR frame (base frame)

sensor_rig:
  base_frame: "lidar_0"
  timestamp_tolerance_ms: 5.0  # Maximum inter-sensor temporal skew before frame drop
  interpolation_strategy: "linear_slerp"  # SLERP for rotation, linear for translation

sensors:
  - id: "thermal_0"
    type: "thermal_camera"
    intrinsics:
      fx: 512.4
      fy: 512.4
      cx: 320.0
      cy: 240.0
      distortion_model: "plumb_bob"
      distortion_coeffs: [-0.012, 0.004, 0.0, 0.0, 0.0]
    extrinsics_to_base:
      # Row-major 4x4 SE(3) transform: thermal_0 -> lidar_0
      matrix: [
         0.9998,  0.0012, -0.0201,  0.132,
        -0.0011,  0.9999,  0.0087, -0.051,
         0.0201, -0.0087,  0.9998,  0.089,
         0.0000,  0.0000,  0.0000,  1.000
      ]
    frequency_hz: 30

  - id: "rgb_0"
    type: "visible_camera"
    intrinsics:
      fx: 1024.0
      fy: 1024.0
      cx: 640.0
      cy: 480.0
      distortion_model: "plumb_bob"
      distortion_coeffs: [-0.031, 0.018, 0.0, 0.0, 0.0]
    extrinsics_to_base:
      matrix: [
         0.9996,  0.0003, -0.0283,  0.208,
        -0.0002,  0.9999,  0.0091, -0.038,
         0.0283, -0.0091,  0.9995,  0.074,
         0.0000,  0.0000,  0.0000,  1.000
      ]
    frequency_hz: 30

  - id: "imu_0"
    type: "imu"
    extrinsics_to_base:
      matrix: [
         1.000,  0.000,  0.000,  0.011,
         0.000,  1.000,  0.000,  0.003,
         0.000,  0.000,  1.000, -0.021,
         0.000,  0.000,  0.000,  1.000
      ]
    frequency_hz: 400

pipeline:
  projection_target: "lidar_0"  # All modalities projected into LiDAR frame before feature extraction
  max_buffer_size: 64           # Frames buffered per sensor before backpressure applied
  drop_policy: "oldest_first"   # Under backpressure, drop oldest unprocessed frames

High-precision extrinsic calibration is non-negotiable. A 1° rotational error in thermal-to-LiDAR calibration produces a 17 cm positional error at 10 meters range—enough to push SCC-Loc's mean localization error above its 9.37 m benchmark.
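The arithmetic behind that figure is a chord-length calculation and is easy to verify:

```python
import numpy as np

def rotational_position_error(angle_deg: float, range_m: float) -> float:
    """Chord-length displacement of a point at the given range caused by a
    pure rotational calibration error of angle_deg degrees."""
    theta = np.deg2rad(angle_deg)
    return 2.0 * range_m * np.sin(theta / 2.0)

err = rotational_position_error(1.0, 10.0)
print(f"{err * 100:.1f} cm")  # ~17.5 cm at 10 m range
```

The error grows linearly with range, so the same 1° miscalibration costs roughly 1.7 m at 100 m, which is why long-range LiDAR returns are the most sensitive to extrinsic drift.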


Deployment and Orchestration on NVIDIA Triton

Packaging a DINOv2-based neural operator for concurrent CFD surrogate inference requires explicit configuration of NVIDIA Triton's execution backend. On Hopper-class GPUs, CUDA 12.x and TensorRT expose FP8 quantization, which reduces ViT-L memory bandwidth by 4x compared to FP32 while maintaining acceptable feature fidelity for localization tasks.

# model_config.pbtxt
# Triton Inference Server configuration for DINOv2-based neural operator
# Tested on: Triton 2.45+, CUDA 12.3, TensorRT 9.x

name: "dinov2_neural_operator"
backend: "tensorrt"
max_batch_size: 16

# Input: batched RGB sensor frames at 518x518 resolution
input [
  {
    name: "sensor_input"
    data_type: TYPE_FP16
    dims: [ 3, 518, 518 ]
  }
]

# Outputs: CLS token (global) and two spatial feature maps (block 12, block 20)
output [
  {
    name: "cls_token"
    data_type: TYPE_FP16
    dims: [ 1024 ]
  },
  {
    name: "spatial_feat_block12"
    data_type: TYPE_FP16
    dims: [ 1024, 37, 37 ]
  },
  {
    name: "spatial_feat_block20"
    data_type: TYPE_FP16
    dims: [ 1024, 37, 37 ]
  }
]

instance_group [
  {
    count: 2           # Two instances per GPU listed below (four total)
    kind: KIND_GPU
    gpus: [ 0, 1 ]    # Pin to GPUs 0 and 1 on NVLink-connected nodes
  }
]

dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 2000   # 2ms max batching wait; tune per latency SLA
  preserve_ordering: false             # Allow out-of-order completion for throughput
}

optimization {
  execution_accelerators {
    gpu_execution_accelerator [
      {
        name: "tensorrt"
        parameters {
          key: "precision_mode"
          value: "FP16"              # FP8 requires explicit calibration dataset
        }
        parameters {
          key: "max_workspace_size_bytes"
          value: "4294967296"        # 4 GB TensorRT workspace per instance
        }
        parameters {
          key: "trt_engine_cache_enable"
          value: "true"             # Cache compiled TRT engines to disk
        }
      }
    ]
  }
}

Technical Warning: preserve_ordering: false breaks request correlation if your downstream pipeline assumes response order matches submission order. CFD workflows with stateful simulation contexts must either use sequence batching or implement client-side request tracking via Triton's correlation IDs.


Optimizing Multi-GPU Training for Neural Operators

NCCL collective communication is the dominant bottleneck during gradient synchronization in large-scale neural operator training—not compute. Three hardware optimization parameters address this directly:

  1. NCCL_P2P_DISABLE=0 — Enables direct GPU-to-GPU peer-to-peer transfers over NVLink or PCIe without routing through host memory. Disabling this (value 1) forces all-reduce operations through system RAM, increasing gradient synchronization latency by 3–8x on NVLink clusters. This must be set explicitly; some container environments default to disabled.

  2. NCCL_IB_HCA=mlx5_0 — Specifies the InfiniBand Host Channel Adapter for RDMA-capable collective operations. Multi-rail clusters with multiple HCAs (e.g., mlx5_0,mlx5_1) require comma-separated enumeration. Binding to the wrong HCA routes traffic through a lower-bandwidth interface, capping effective inter-node bandwidth at a fraction of advertised network capacity.

  3. NCCL_IB_GID_INDEX=3 — Selects the RoCE v2 GID table index for InfiniBand communication. GID index 3 corresponds to IPv4-routable RoCEv2 addresses in standard Mellanox/ConnectX-6 configurations. Incorrect GID selection causes silent fallback to slower IB UD transport, which is detectable only through NCCL_DEBUG=INFO logging showing WARN entries about transport negotiation failures.


Real-Time Localization in GNSS-Denied Environments

The latency reduction SCC-Loc achieves over baseline methods is a product of architectural specificity, not general optimization. The baseline methods it outperforms rely on dense descriptor matching across full-resolution imagery—an $O(N^2)$ matching problem over candidate patches. SCC-Loc's SGVA module reduces the search space by predicting semantically valid viewport crops before entering the matching stage, converting the problem from exhaustive search to directed retrieval.

The practical consequence for edge deployment: a system that previously required 760ms per localization query now operates at approximately 100ms. For autonomous navigation in GNSS-denied environments—tunnels, urban canyons, underground facilities—this is the difference between a system that controls at 1 Hz and one that supports 10 Hz control loops.

The trade-off is resolution. SGVA's viewport cropping operates at a fixed crop scale relative to the satellite tile. Coarser crops process faster but lose sub-building discriminability. The paper's 9.37 m mean error is achieved at a crop resolution that balances these demands for urban outdoor environments. Indoor digital twin environments with tighter feature density require smaller crop windows, which increases the number of candidate hypotheses the system must evaluate and degrades throughput.

Practitioners must characterize this curve empirically for their operational domain. The relationship between crop resolution $r$ and inference throughput $T$ follows:

$$T(r) \propto \frac{1}{(r/r_\text{base})^2 \cdot N_\text{hyp}(r)}$$

where $N_\text{hyp}(r)$ grows as finer crops generate more spatially distinct candidates. Under compute-constrained edge deployments (e.g., NVIDIA Jetson AGX Orin), set $r$ to the coarsest value that keeps mean localization error below your operational threshold, then tune dynamic batching on Triton to absorb burst request loads.
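A toy sketch of that scaling relation follows. The hypothesis-count model `n_hyp` below is a placeholder assumption for illustration only (linear growth with resolution); as stated above, the real curve must be characterized empirically per domain:

```python
def relative_throughput(r: float, r_base: float, n_hyp) -> float:
    """T(r) up to a proportionality constant, per the relation above:
    throughput falls with the squared resolution ratio times the number
    of candidate hypotheses generated at that resolution."""
    return 1.0 / ((r / r_base) ** 2 * n_hyp(r))

# Placeholder assumption: candidate count grows linearly with resolution
n_hyp = lambda r: max(1.0, r / 64.0)

for r in (128, 256, 512):
    print(r, relative_throughput(r, r_base=128, n_hyp=n_hyp))
```

Even under this crude model the message is visible: doubling crop resolution costs far more than 2x in throughput, because the hypothesis count compounds the quadratic pixel cost.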


Future-Proofing Surrogate Modeling for 2026 and Beyond

Neural operator integration is maturing from research prototype to production infrastructure. The next wave—physics-informed neural operators, geometry-aware transformers, and foundation models pre-trained on simulation data—will demand evaluation frameworks that go beyond accuracy benchmarks on held-out test sets. Digital twin environments are non-stationary: new sensors, updated map databases, and domain shifts from seasonal variation will stress any fixed model.

Model Stability Evaluation Checklist for Scaling Surrogate Modeling Workloads:

  • [ ] Convergence Stability: Track loss variance across the final 10% of training iterations. Standard deviation exceeding 5% of mean loss signals optimizer hyperparameter sensitivity that will worsen under distribution shift.
  • [ ] Cross-Modal Alignment Drift: Measure cosine similarity between CLS token representations of matched thermal/visible pairs before and after continual training. Drift exceeding 0.05 cosine distance indicates backbone degradation requiring EWC lambda adjustment.
  • [ ] Throughput-Per-Watt Efficiency: Profile using nvidia-smi dmon at sustained batch load. Establish a throughput-per-watt baseline for your model on the target GPU; a sustained drop against that baseline indicates suboptimal kernel utilization, typically from excessive host-device transfers.
  • [ ] Catastrophic Forgetting Regression Tests: After each domain addition, re-evaluate on all prior domain test sets. A >3% accuracy regression on any prior domain triggers EWC lambda increase before production promotion.
  • [ ] Sensor Dropout Robustness: Run localization with each modality individually masked. Any single-modality failure that degrades mean error beyond 2x the full-modality baseline requires redundancy hardening in the consensus weight fallback logic.
  • [ ] FIM Storage Budget Compliance: Verify EWC Fisher storage does not exceed allocated GPU memory across all consolidated domains. Track with torch.cuda.memory_allocated() after each consolidate() call; prune oldest task memories when budget is exceeded.
  • [ ] Triton Queue Saturation Monitoring: Set Prometheus alerts on nv_inference_queue_duration_us exceeding 10ms average. Sustained queue saturation at target throughput requires additional model instances or request routing changes.
  • [ ] Quantization Regression Guard: After FP16 or FP8 export, validate spatial feature map L2 distance against FP32 reference. Distances exceeding 0.1 normalized units indicate quantization-sensitive layers requiring mixed-precision treatment.
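The last check reduces to a single metric. A sketch of the normalized L2 distance named in the checklist, using tensor shapes matching the Triton spatial outputs above (the FP16 cast here stands in for a real quantized export):

```python
import torch

def normalized_l2(ref: torch.Tensor, quant: torch.Tensor) -> float:
    """Normalized L2 distance between an FP32 reference feature map and its
    quantized counterpart: ||ref - quant|| / ||ref||."""
    diff = (ref.float() - quant.float()).norm()
    return (diff / ref.float().norm().clamp_min(1e-12)).item()

ref = torch.randn(1, 1024, 37, 37)       # FP32 reference spatial feature map
fp16 = ref.half()                         # stand-in for an FP16 export
dist = normalized_l2(ref, fp16)
print(dist < 0.1)  # True — an FP16 round-trip sits well under the 0.1 gate
```

In a real regression guard, `quant` would come from running the exported TensorRT engine on the same input batch as the FP32 reference model; layers that blow past the gate are candidates for mixed-precision exemption.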

The pairing of DINOv2 spatial feature extraction with neural operator surrogate modeling is not a theoretical exercise—SCC-Loc demonstrates deployable performance at sub-10m localization accuracy in real GNSS-denied test conditions. The open problem is scale: extending this approach to nationwide digital twin infrastructure, where domain heterogeneity is extreme and training data arrives continuously, requires exactly the EWC-based continual learning and Triton-based orchestration described here. Teams that instrument these pipelines now will have the operational data needed to tune the next generation of geometry-aware foundation models as they arrive in 2026 and beyond.


Keywords: DINOv2, Neural Operators, Computational Fluid Dynamics, Digital Twin, GNSS-denied localization, SCC-Loc, Spatial-adaptive feature extraction, Heterogeneous modality alignment, NVIDIA Triton Inference Server, CUDA 12.x, Catastrophic forgetting, Semantic-Guided Viewport Alignment