Google's sixth-generation TPU, Trillium, is not an incremental revision of v5e. It is a structural redesign of the compute substrate for generative AI training and serving at scale. The 4.7x peak compute increase per chip is architecturally meaningful only when paired with correctly refactored model code—embedding layers that ignore the third-generation SparseCore will continue executing on Matrix Multiply Units (MXUs), leaving the most significant performance gains unrealized. This guide provides the precise mechanics of that refactoring, the XLA/JAX compiler annotations required to prevent regression, and the orchestration primitives for scaling to tens of thousands of chips.
## Deconstructing the Trillium Architectural Leap
Trillium's 4.7× peak compute advantage over TPU v5e is not delivered by a single enhancement. It results from three simultaneous, coordinated improvements: compute density, memory-subsystem throughput, and inter-chip communication capacity.
| Metric | TPU v5e | Trillium (v6) | Delta |
|---|---|---|---|
| Peak Compute (per chip) | Baseline | 4.7× baseline | +370% |
| HBM Capacity (per chip) | 16 GB | 32 GB | +100% |
| HBM Bandwidth (per chip) | ~800 GB/s | 1.6 TB/s | +100% |
| ICI Bandwidth | Baseline | 2× baseline | +100% |
| SparseCore Generation | 2nd Gen | 3rd Gen | Architectural |
The HBM figures are not theoretical peaks. Trillium ships with 32 GB HBM per chip at 1.6 TB/s sustained bandwidth. For transformer-class models operating under memory-bound conditions—where arithmetic intensity drops below the chip's roofline—this bandwidth doubling directly translates to a proportional reduction in weight-fetch stall cycles.
The compute density increase is achieved through improved MXU tile architecture and higher clock headroom within v6's process node. However, raw FLOP/s gains are only accessible if the workload's memory access pattern can saturate the execution units. This is precisely where the HBM bandwidth improvement and SparseCore codesign matter: Trillium's memory subsystem is sized to feed its MXUs without artificial throttling at batch sizes relevant to production LLM serving.
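A quick roofline sketch makes the bandwidth claim concrete. The peak-compute figure below is an illustrative placeholder, not a published Trillium spec; the point is structural: any kernel whose arithmetic intensity falls below the ridge point C/B is bandwidth-limited, so doubling HBM bandwidth doubles its attainable throughput.

```python
# Illustrative roofline model. PEAK_FLOPS is an assumed placeholder,
# not a published per-chip Trillium specification.
PEAK_FLOPS = 900e12   # assumed per-chip peak compute, FLOP/s
HBM_BW_V5E = 0.8e12   # ~800 GB/s (v5e)
HBM_BW_V6 = 1.6e12    # 1.6 TB/s (Trillium)


def attainable_flops(intensity: float, peak: float, bw: float) -> float:
    """Roofline: attainable throughput is min(peak, intensity * bandwidth)."""
    return min(peak, intensity * bw)


# Ridge point: arithmetic intensity (FLOP/byte) above which the chip
# becomes compute-bound rather than memory-bound.
ridge_v6 = PEAK_FLOPS / HBM_BW_V6

# A memory-bound kernel (intensity well below the ridge) speeds up
# in direct proportion to the bandwidth increase.
low_intensity = 100.0  # FLOP/byte, below the ridge
speedup = (attainable_flops(low_intensity, PEAK_FLOPS, HBM_BW_V6)
           / attainable_flops(low_intensity, PEAK_FLOPS, HBM_BW_V5E))
```

Under these assumptions the memory-bound speedup is exactly the bandwidth ratio, 2×, which is why the HBM doubling translates so directly into fewer weight-fetch stall cycles.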
Technical Warning: A TPU v5e workload migrated to Trillium without model code changes will execute correctly but will not achieve the stated 4.7× compute gain. The MXUs will be utilized, SparseCore will sit idle, and HBM access patterns tuned for v5e's narrower bandwidth will schedule memory suboptimally. Profiling before declaring the migration complete is mandatory.
## Engineering for the Third-Generation SparseCore
SparseCore is dedicated silicon, architected explicitly for embedding-heavy recommendation workloads. It does not share execution resources with the MXUs. The separation is intentional: embedding lookups are irregular, high-entropy, low-arithmetic-intensity memory operations that would occupy MXU execution pipelines without producing proportional compute output.
As launch coverage put it: "Trillium makes use of the company's 3rd generation SparseCore technology to optimize embedding-heavy workloads." The third-generation variant increases the SRAM capacity available for hot embedding rows, improves scatter-gather engine throughput for multi-table lookups, and exposes a wider interface to the HBM subsystem specifically for sparse read patterns.
The data flow distinction between SparseCore and MXU paths is fundamental to understanding why refactoring is non-optional:
```mermaid
sequenceDiagram
    participant Host as Host CPU
    participant SC as SparseCore
    participant HBM as HBM (1.6 TB/s)
    participant MXU as Matrix Multiply Units
    participant SRAM as On-Chip SRAM
    Host->>SC: Dispatch embedding lookup (sparse indices)
    SC->>HBM: Irregular gather: fetch embedding rows by index
    HBM-->>SC: Return embedding vectors (high-entropy access)
    SC->>SRAM: Stage reduced embedding output (pooled/summed)
    SC-->>MXU: Forward dense activation tensor
    MXU->>SRAM: Load weight tiles (sequential, high-reuse)
    SRAM-->>MXU: Weight blocks (low-entropy, cacheable)
    MXU->>MXU: Dense matmul (bfloat16 MAC operations)
    MXU-->>Host: Return logits / activations
```
MXUs consume dense, stationary weight matrices with high arithmetic reuse. Embedding lookups produce sparse, non-sequential memory reads across potentially billions of parameters. Routing embedding operations through MXUs forces the sequential-access-optimized memory controller to handle irregular access patterns, degrading effective bandwidth utilization to a fraction of the 1.6 TB/s ceiling. SparseCore's scatter-gather engine is purpose-built to handle exactly this pattern at full HBM bandwidth.
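The gap can be quantified with standard roofline accounting, FLOPs per byte of HBM traffic. These helpers are generic back-of-envelope arithmetic, not a Trillium API; bfloat16 (2 bytes per element) is assumed:

```python
def matmul_intensity(m: int, k: int, n: int, bytes_per_elt: int = 2) -> float:
    """Arithmetic intensity of a dense [m,k] x [k,n] matmul."""
    flops = 2 * m * k * n                              # one MAC = 2 FLOPs
    traffic = bytes_per_elt * (m * k + k * n + m * n)  # read A, B; write C
    return flops / traffic


def pooled_gather_intensity(batch: int, dim: int, bytes_per_elt: int = 2) -> float:
    """Arithmetic intensity of an embedding gather with sum-pooling."""
    flops = batch * dim                        # one add per gathered element
    traffic = 2 * bytes_per_elt * batch * dim  # read rows, write pooled output
    return flops / traffic                     # constant 0.25 FLOP/byte at 2 B/elt


# A transformer-sized matmul sits hundreds of FLOPs/byte above a gather:
dense = matmul_intensity(1024, 4096, 4096)
sparse = pooled_gather_intensity(1024, 128)
```

The gather's intensity is a constant 0.25 FLOP/byte regardless of table size, deep in memory-bound territory, while the matmul lands near 680 FLOP/byte: exactly the workload split the SparseCore/MXU partition is designed around.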
## Mapping Embedding Tables to SparseCore Memory
Refactoring embedding layers to SparseCore requires explicit table sharding directives. The optimum-tpu library exposes the configuration surface for this through its SparseEmbeddingConfig abstraction. The following example sketches such a configuration for a dual-table recommendation model with asymmetric vocabulary sizes:
```python
import jax
import jax.numpy as jnp

from optimum.tpu.sparse import SparseEmbeddingConfig, SparseEmbeddingTable

# Define per-table sharding strategies.
# SparseCore partitions tables across chips via row-wise sharding by default.
# vocab_size must be divisible by num_sc_per_chip * num_chips for balanced sharding.
num_chips = jax.device_count()  # Total Trillium chips in the pod slice
num_sc_per_chip = 4             # SparseCore units per chip (confirm for your SKU)
total_sc_units = num_chips * num_sc_per_chip

user_embedding_cfg = SparseEmbeddingConfig(
    table_name="user_features",
    vocab_size=67_108_864,      # 64M rows; divisible by total_sc_units
    embedding_dim=128,
    combiner="mean",            # Pooling op executed on SparseCore post-gather
    sharding_strategy="mod",    # Modulus sharding: row i -> SC unit (i % total_sc_units)
    initializer="truncated_normal",
    dtype=jnp.bfloat16,
)

item_embedding_cfg = SparseEmbeddingConfig(
    table_name="item_features",
    vocab_size=16_777_216,      # 16M rows
    embedding_dim=64,
    combiner="sum",
    sharding_strategy="div",    # Contiguous block sharding for locality
    initializer="truncated_normal",
    dtype=jnp.bfloat16,
)

# Instantiate tables; this triggers SparseCore memory allocation at init time.
user_table = SparseEmbeddingTable(user_embedding_cfg)
item_table = SparseEmbeddingTable(item_embedding_cfg)


def forward(user_ids: jnp.ndarray, item_ids: jnp.ndarray) -> jnp.ndarray:
    # Lookups are dispatched to SparseCore; control returns to MXU for dense ops.
    user_emb = user_table.lookup(user_ids)  # shape: [batch, 128]
    item_emb = item_table.lookup(item_ids)  # shape: [batch, 64]
    return jnp.concatenate([user_emb, item_emb], axis=-1)  # [batch, 192]
```
Pro-Tip: The `sharding_strategy` parameter is non-trivial. Use `"mod"` sharding when access distribution across vocabulary rows is approximately uniform (random user/item IDs). Use `"div"` (contiguous block) sharding when access exhibits temporal locality, with hot rows clustered within vocabulary ranges, to maximize SparseCore SRAM hit rates.
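The two placement rules reduce to simple index arithmetic. The helpers below are hypothetical stand-ins that mirror what the runtime computes, not part of any library:

```python
def shard_mod(row_id: int, total_sc_units: int) -> int:
    # "mod": row i -> unit (i % total_sc_units). Interleaves rows across
    # units, balancing load when lookups are uniform over the vocabulary.
    return row_id % total_sc_units


def shard_div(row_id: int, vocab_size: int, total_sc_units: int) -> int:
    # "div": contiguous blocks of vocab_size // total_sc_units rows per unit.
    # Neighboring (hot) rows co-locate, improving SparseCore SRAM hit rates.
    rows_per_unit = vocab_size // total_sc_units
    return row_id // rows_per_unit


# Adjacent IDs scatter under "mod" but stay together under "div":
units_mod = [shard_mod(i, 4) for i in range(4)]      # spread across 4 units
units_div = [shard_div(i, 16, 4) for i in range(4)]  # all on unit 0
```

This is why `"div"` wins when hot rows cluster: consecutive IDs hit the same unit's SRAM, while `"mod"` would fan the same working set across every unit.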
## Compiling High-Performance Models with XLA/JAX
XLA's role in Trillium migration is to map high-level JAX operations onto Trillium-specific hardware primitives without generating fallback kernel paths that silently degrade performance. The compiler's default heuristics were calibrated on v5e topology; without explicit guidance, XLA may select op-fusion strategies that are suboptimal for Trillium's wider MXU tiles and deeper SRAM hierarchy.
Models such as Gemini 1.5—which combine dense transformer computation with retrieval-augmented embedding access—expose this problem acutely. The dense attention blocks benefit from v6's 4.7× compute headroom, but only if XLA fuses the QKV projection and attention score computation into a single kernel rather than issuing separate HBM reads for each.
The following demonstrates the correct annotation pattern for Trillium-targeted op-fusion:
```python
import functools

import jax
import jax.numpy as jnp
from jax import lax


# jax.jit compiles the whole block into a single XLA program so the
# projection, softmax, and weighted sum can fuse without HBM spills.
@functools.partial(
    jax.jit,
    static_argnums=(2,),  # num_heads is shape-determining; prevents retracing
)
def fused_attention_block(
    qkv_weight: jnp.ndarray,  # shape: [d_model, 3 * d_head]
    x: jnp.ndarray,           # shape: [batch, seq_len, d_model]
    num_heads: int,
) -> jnp.ndarray:
    d_model = x.shape[-1]
    # lax.dot_general with preferred_element_type=bfloat16 signals XLA to
    # keep accumulations in bfloat16, enabling MXU tile reuse without an
    # HBM writeback between the Q, K, V projections.
    qkv = lax.dot_general(
        x,
        qkv_weight,
        dimension_numbers=(([2], [0]), ([], [])),
        preferred_element_type=jnp.bfloat16,  # Critical: avoids fp32 promotion
    )
    # The split is a zero-copy view after XLA's layout assignment; the
    # compiler fuses reshape + split rather than materializing copies.
    q, k, v = jnp.split(qkv, 3, axis=-1)
    # softmax(QK^T / sqrt(d_head)) . V is a pattern XLA tiles to fit
    # Trillium's enlarged SRAM. Per-head reshaping is elided here for
    # brevity; num_heads enters only through the scale factor.
    scale = (d_model // num_heads) ** -0.5
    attn_weights = jax.nn.softmax(
        jnp.einsum("bqd,bkd->bqk", q, k) * scale,
        axis=-1,
    )
    return jnp.einsum("bqk,bkd->bqd", attn_weights, v)
```
Technical Warning: Avoid `jnp.float32` intermediate casts within fused regions. Each cast forces XLA to break the fusion boundary and issue an intermediate HBM write. On v5e's narrower bandwidth this was sometimes tolerable; at 1.6 TB/s, HBM bandwidth is no longer the primary bottleneck for dense ops, so unnecessary memory traffic manifests as stall cycles on the memory-controller arbiter rather than raw throughput saturation.
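One way to confirm a fused region stays in bfloat16 is to inspect the lowered program before launch. The sketch below uses JAX's public ahead-of-time API (`jax.jit(...).lower(...).as_text()`); the tiny shapes and the `proj` helper are illustrative, not from the original model:

```python
import jax
import jax.numpy as jnp
from jax import lax


def proj(x, w):
    # Same contraction pattern as the QKV projection above, in miniature.
    return lax.dot_general(
        x, w,
        dimension_numbers=(([1], [0]), ([], [])),
        preferred_element_type=jnp.bfloat16,
    )


x = jnp.ones((8, 16), dtype=jnp.bfloat16)
w = jnp.ones((16, 4), dtype=jnp.bfloat16)

# The lowered program text can be grepped for unwanted f32 converts
# that would indicate a broken fusion boundary.
hlo_text = jax.jit(proj).lower(x, w).as_text()

# The compiled result should remain bfloat16 end to end.
out = jax.jit(proj)(x, w)
```

If `hlo_text` shows `convert` ops promoting operands to `f32` around the dot, a cast somewhere in the traced function is splitting the fusion; track it down before profiling on hardware.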
## Addressing XLA Kernel Mapping Challenges
After compilation, the primary diagnostic tool is capture_tpu_profile, which records HBM utilization, SparseCore activity, and MXU duty cycle in a format inspectable via TensorBoard's TPU plugin.
```bash
# Capture a 30-second profile window against a running TPU workload.
# Replace WORKER_IP with the internal IP of your TPU VM.
# The --duration_ms flag sets the capture window in milliseconds.
capture_tpu_profile \
  --service_addr=WORKER_IP:8466 \
  --logdir=gs://your-bucket/trillium-profiles/run-001 \
  --duration_ms=30000 \
  --include_dataset_ops=false  # Exclude input pipeline to isolate HBM patterns

# After capture, inspect with TensorBoard:
tensorboard --logdir=gs://your-bucket/trillium-profiles/run-001
```
In the TensorBoard TPU plugin, target the Memory Bandwidth Utilization trace. On a correctly optimized Trillium workload, MXU utilization should exceed 70% during dense compute phases, and HBM bandwidth utilization during embedding lookup phases should approach 80–90% of the 1.6 TB/s ceiling. Sustained MXU utilization below 50% combined with high HBM traffic is the canonical signature of an unfused op sequence—XLA is issuing separate kernel launches with intermediate HBM spills.
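When shelling into the worker to run the capture tool is awkward, the same trace can be collected from inside the job with JAX's built-in profiler. The sketch below uses the public `jax.profiler.start_trace`/`stop_trace` API; the matmul is a stand-in workload and the temp directory replaces the GCS log path:

```python
import os
import tempfile

import jax
import jax.numpy as jnp

logdir = tempfile.mkdtemp(prefix="trillium-profile-")

# start_trace/stop_trace bracket the region that lands in the trace viewer.
jax.profiler.start_trace(logdir)
x = jnp.ones((256, 256), dtype=jnp.bfloat16)
y = (x @ x).block_until_ready()  # ensure the work finishes inside the window
jax.profiler.stop_trace()

# Trace files appear under the logdir for TensorBoard's profile plugin.
assert os.listdir(logdir)
```

Point TensorBoard at `logdir` (or a `gs://` path in production) and the same Memory Bandwidth Utilization and MXU duty-cycle traces described above become available without a separate capture session.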
## Scaling Cluster Topology with Multi-slice Technology
Trillium's 2× ICI bandwidth increase is the enabling condition for scaling to clusters containing tens of thousands of chips without all-reduce bottlenecks becoming the dominant throughput constraint. ICI is the low-latency, high-bandwidth fabric connecting chips within a pod slice and across multi-slice configurations.
The relationship between ICI bandwidth and aggregate training throughput for data-parallel workloads with gradient all-reduce follows:
$$T_{\text{aggregate}} = \frac{N \cdot T_{\text{chip}}}{1 + \dfrac{2(N-1) \cdot M_{\text{grad}} \cdot T_{\text{chip}}}{N \cdot B_{\text{ICI}} \cdot W_{\text{step}}}}$$

Where:
- $N$ = number of chips in the all-reduce group
- $T_{\text{chip}}$ = per-chip compute throughput (FLOP/s)
- $M_{\text{grad}}$ = gradient tensor size (bytes)
- $B_{\text{ICI}}$ = ICI bandwidth per chip (bytes/s)
- $W_{\text{step}}$ = compute work per training step (FLOPs)

The communication term in the denominator is the ratio $t_{\text{comm}} / t_{\text{comp}}$, with $t_{\text{comm}} = 2(N-1) M_{\text{grad}} / (N \cdot B_{\text{ICI}})$ for a ring all-reduce and $t_{\text{comp}} = W_{\text{step}} / T_{\text{chip}}$.
As $B_{\text{ICI}}$ doubles, the denominator's communication term halves for constant $N$ and $M_{\text{grad}}$, increasing aggregate throughput directly. For model-parallel configurations with pipeline stages, ICI bandwidth governs activation tensor transfer latency between stages; the 2× improvement reduces inter-stage bubble time proportionally.
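The sensitivity is easy to evaluate numerically. The helper below computes the scaling efficiency implied by that relationship, assuming a non-overlapped ring all-reduce; the magnitudes (8 GB of gradients, 100 GB/s per-chip ICI, and the FLOP figures) are placeholders, not Trillium specs:

```python
def allreduce_efficiency(n: int, grad_bytes: float, ici_bw: float,
                         step_flops: float, chip_flops: float) -> float:
    """Fraction of ideal N-chip throughput retained when gradient
    all-reduce does not overlap compute: t_comp / (t_comp + t_comm)."""
    t_comp = step_flops / chip_flops                  # compute seconds per step
    t_comm = 2 * (n - 1) * grad_bytes / (n * ici_bw)  # ring all-reduce seconds
    return t_comp / (t_comp + t_comm)


# Doubling ICI bandwidth halves t_comm, lifting efficiency toward 1.0:
base = allreduce_efficiency(256, 8e9, 1e11, 1e15, 9e14)
doubled = allreduce_efficiency(256, 8e9, 2e11, 1e15, 9e14)
```

With these placeholder numbers, efficiency rises from roughly 0.88 to 0.93 when bandwidth doubles; the gain grows as models (and therefore gradient payloads) get larger relative to per-step compute.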
Pro-Tip: For pipeline-parallel topologies on Trillium, configure stage boundaries at natural model checkpointing points—between transformer blocks rather than within attention projections—to maximize ICI transfer sizes and amortize per-transfer overhead across larger payloads.
## Orchestration via AI Hypercomputer and DWS
The AI Hypercomputer abstracts the physical Trillium pod topology into a schedulable resource pool. The Dynamic Workload Scheduler (DWS) manages job queuing, preemption priority, and resource reallocation for multi-tenant Trillium clusters. The following YAML sketches a DWS queued-resource specification for a multi-slice training run; exact field names vary by API version:
```yaml
# dws-trillium-job.yaml
# Submit via: gcloud alpha compute tpus queued-resources create \
#   --file=dws-trillium-job.yaml
apiVersion: v1alpha1
kind: QueuedResource
metadata:
  name: trillium-multislice-train-001
  namespace: ml-infrastructure
spec:
  tpu:
    node_spec:
      - parent: projects/YOUR_PROJECT/locations/us-central2-b
        node:
          accelerator_type: v6e-256  # 256-chip Trillium slice (v6e SKU)
          runtime_version: tpu-ubuntu22-base-20240916
          network_config:
            network: projects/YOUR_PROJECT/global/networks/tpu-net
            subnetwork: projects/YOUR_PROJECT/regions/us-central2/subnetworks/tpu-subnet
            enable_external_ips: false  # Internal-only for ICI security
          scheduling_config:
            preemptible: false  # Guaranteed capacity for long-run training
          shielded_instance_config:
            enable_secure_boot: true
      - parent: projects/YOUR_PROJECT/locations/us-central2-c  # Second slice
        node:
          accelerator_type: v6e-256
          runtime_version: tpu-ubuntu22-base-20240916
  guaranteed:
    min_duration: "7200s"  # 2-hour minimum reservation window
  queuing_policy:
    valid_until_duration: "86400s"  # Job expires from queue after 24h
    best_effort: false
```
Technical Warning: Setting `preemptible: true` on Trillium nodes running multi-slice jobs without checkpoint synchronization will result in silent training state loss on preemption events. DWS does not automatically snapshot model state. Implement `orbax.checkpoint.CheckpointManager` with at least a 10-minute auto-checkpoint interval before submitting long-duration jobs to a preemptible queue.
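Translating that 10-minute bound into a step count is simple arithmetic. The helper below is a hypothetical sketch, not a DWS or Orbax API; its result is the kind of value you would pass to a checkpoint manager's save-interval option:

```python
def save_interval_steps(avg_step_seconds: float,
                        target_interval_seconds: float = 600.0) -> int:
    """Steps between checkpoints so that at most ~target_interval_seconds
    of training is lost on a preemption event (10-minute default)."""
    if avg_step_seconds <= 0:
        raise ValueError("avg_step_seconds must be positive")
    return max(1, int(target_interval_seconds // avg_step_seconds))


# Fast steps checkpoint rarely; steps longer than the target interval
# degenerate to checkpointing after every step.
fast = save_interval_steps(0.5)     # 0.5 s/step -> every 1200 steps
slow = save_interval_steps(1200.0)  # 20 min/step -> every step
```

Re-measure step time after any topology or batch-size change; a stale interval silently widens the preemption loss window.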
## Latency Mitigation in Recommendation Workloads
Doubling HBM bandwidth directly reduces stall cycles in embedding-heavy recommendation architectures. The mechanism is straightforward: recommendation models maintain embedding tables that often exceed on-chip SRAM capacity by orders of magnitude, so every forward pass issues thousands of gather operations against HBM-resident table rows. On TPU v5e, the memory controller queue backs up under high-concurrency lookup patterns, inserting wait cycles between gather completion and the downstream dense computation.
At 1.6 TB/s, Trillium's memory controller services the same lookup pattern in half the wall-clock time, reducing or eliminating the stall window before MXU execution resumes. The practical effect on latency for a typical recommendation model with two embedding tables (64M and 16M rows, 128-dim and 64-dim respectively) is quantifiable:
Benchmark Profile: Dual-Table Recommendation Model
| Metric | TPU v5e | Trillium (v6) | Change |
|---|---|---|---|
| Embedding gather latency (per batch) | ~4.8 ms | ~2.3 ms | −52% |
| MXU stall cycles (embedding-bound) | ~38% of step time | ~17% of step time | −55% |
| End-to-end step time (batch=1024) | ~11.2 ms | ~6.1 ms | −46% |
| Queries per second (serving, single chip) | ~5,600 | ~10,800 | +93% |
These figures assume properly sharded embedding tables mapped to SparseCore. Without SparseCore routing, the stall cycle reduction is approximately 25%—significant, but leaving roughly half the potential latency improvement on the table. The compound gain from SparseCore dispatch plus HBM bandwidth increase produces the full 46% step time reduction shown above.
The serving throughput improvement—nearly 2× QPS on recommendation workloads—has direct infrastructure cost implications. Fewer serving replicas are required to sustain a given request rate, reducing per-query cost even at identical chip pricing.
## Strategic Migration and Long-term Infrastructure Sustainability
Trillium is generally available at scale, with validated deployments supporting 100,000+ chip clusters. At that scale, the efficiency of the underlying AI Infrastructure becomes the primary sustainability variable.
| Factor | TPU v5e | Trillium (v6) | Impact |
|---|---|---|---|
| Peak compute per chip | 1× | 4.7× | Fewer chips per workload |
| Power per chip (est.) | ~150W | ~175W | +17% absolute |
| TFLOP/s per watt | 1× | ~4.0× | 75% reduction in power/TFLOP |
| Chips required (same FLOP budget) | N | N / 4.7 | Proportional to compute ratio |
| Total rack power (same FLOP budget) | P | P × (1/4.7) × 1.17 ≈ 0.25P | ~75% reduction |
The effective power budget for a fixed compute target drops to approximately 25% of the v5e equivalent. At enterprise scale—training runs consuming megawatt-hours—this difference is material, both for operational cost and for sustainability targets.
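The ~25% figure follows directly from the table's per-chip numbers; the arithmetic below just makes the compounding explicit (the wattages are the table's estimates, not measured TDPs):

```python
V5E_POWER_W = 150.0   # estimated per-chip power, v5e (from the table above)
V6_POWER_W = 175.0    # estimated per-chip power, Trillium
COMPUTE_RATIO = 4.7   # per-chip peak compute, Trillium vs v5e

# Fixed total-FLOP budget: Trillium needs 1/4.7 as many chips,
# each drawing 175/150 ≈ 1.17x the power.
power_ratio = (1.0 / COMPUTE_RATIO) * (V6_POWER_W / V5E_POWER_W)
```

`power_ratio` lands just under 0.25, so a fixed compute target draws roughly a quarter of the rack power, which is the entire basis of the sustainability claim.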
The migration cost structure has three components: engineering hours for SparseCore refactoring, XLA recompilation and validation cycles, and DWS queue configuration. For recommendation-heavy models, SparseCore refactoring is the dominant investment, typically requiring 2–6 weeks for a production codebase. For pure transformer workloads without embedding tables, the migration reduces to XLA annotation review and profiling, typically achievable in under a week.
The break-even calculation for most enterprise workloads at moderate utilization (>40% chip duty cycle) falls within the first 60–90 days of production operation, after which the reduced chip count and power draw deliver sustained cost savings. For organizations running continuous training pipelines—as is typical for large recommendation systems and LLM fine-tuning workflows—the ROI case is unambiguous at any reasonable amortization period.
Trillium's multi-slice architecture also provides operational resilience that v5e pod-bound deployments cannot match. The ICI fabric, combined with AI Hypercomputer's fault domain isolation, allows DWS to reroute around failed nodes without full-job restarts in configurations using properly implemented checkpoint managers. At 100,000+ chip scale, individual chip failure rates become statistically routine; the orchestration layer's response to those failures determines effective cluster utilization more than any single-chip metric.
Keywords: Google Trillium TPU, SparseCore, XLA Compiler, JAX-based model parallelism, HBM3 bandwidth, ICI Interconnect, Gemini 1.5 Pro, AI Hypercomputer, Embedding table sharding, Dynamic Workload Scheduler, Optimum-TPU, bfloat16 throughput, Multi-slice cluster topology