Closing the Loop in Recommender Design: Layered Reward Systems for Multi-Objective Optimization

16 min read · Published Apr 12, 2026, 12:06 PM

Single-objective training is a solved problem. The moment you introduce a second KPI—say, session length alongside click-through rate—you enter a regime where most production systems regress. The standard engineering response is a weighted-sum loss function, and it consistently fails. This article documents the failure mode precisely, then builds a concrete implementation of the layered inner-and-outer reward architecture that replaces it.


The Convergence Problem: Beyond Weighted-Sum Loss Functions

Traditional weighted-sum loss functions in multi-objective recommenders exhibit catastrophic interference in 65% of high-cardinality environments, forcing constant hyperparameter re-tuning (arXiv:2603.26100). The mechanism is not subtle: gradient dominance by high-frequency signals—clicks, impressions—overwhelms the optimization signal from sparse, high-value events like purchases or long-form engagement.

The mathematical formulation exposes the ceiling immediately. Naive linear scalarization collapses multiple objectives into a single scalar:

$$L_{\text{naive}} = \sum_{i=1}^{N} w_i \cdot \mathcal{L}_i(\theta)$$

Where $w_i$ are static scalar weights and $\mathcal{L}_i$ is the per-objective loss. The Pareto front of this formulation is bounded by the convexity of the loss landscape. When objectives are non-convex relative to each other—which is the rule, not the exception, in recommender systems—linear scalarization cannot reach Pareto-optimal solutions on the concave segments of the front. You are structurally excluded from a class of solutions.
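A small numerical sketch makes the exclusion concrete (the three outcome points are illustrative, not from the paper): a Pareto-optimal point on a concave segment of the front is never selected by any convex weight vector.

```python
import numpy as np

# Three candidate outcomes (both objectives are maximized).
# A and B are extreme points; C is Pareto-optimal (neither A nor B dominates it)
# but lies on a concave segment, inside the convex hull of {A, B}.
points = np.array([
    [1.0, 0.0],   # A
    [0.0, 1.0],   # B
    [0.4, 0.4],   # C: Pareto-optimal, yet unreachable by linear scalarization
])

# Sweep every convex weight combination w = (t, 1 - t).
selected = set()
for t in np.linspace(0.0, 1.0, 1001):
    w = np.array([t, 1.0 - t])
    selected.add(int(np.argmax(points @ w)))

# C (index 2) never wins: its score is always 0.4, while
# max(t, 1 - t) >= 0.5 for every t in [0, 1].
print(selected)  # {0, 1}
```

No matter how finely you tune $\mathbf{w}$, the scalarized objective only ever recovers the hull vertices A and B; C is structurally excluded.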

The layered reward decomposition replaces this with objective-specific value functions:

$$R_{\text{total}} = R_{\text{inner}}(s_t, a_t) + \gamma \cdot R_{\text{outer}}(S_T, \pi)$$

Where $R_{\text{inner}}$ is the immediate, per-action reward evaluated locally, $R_{\text{outer}}$ is the delayed business-level reward conditioned on the full session trajectory $S_T$, and $\gamma$ is a temporal discount factor. This decomposition respects the causal structure of the funnel: local relevance drives immediate actions; global KPIs emerge from policy behavior over time.

Technical Warning: Linear scalarization is mathematically bounded by the Pareto front curvature. Any single weight vector $\mathbf{w}$ can only recover solutions on the convex hull of the Pareto front. Non-convex regions—where your best revenue outcomes often live—are inaccessible regardless of how carefully you tune weights.

As the source paper states directly: "When objectives are linearly scalarized, the gradient dominance of high-frequency signals inevitably suppresses the nuanced business utility of sparse events, leading to unstable policy convergence."


Architecting the Layered Reward Hierarchy

Dual-stage reward decomposition (Inner Agent / Outer Controller) achieves a 30–50% improvement in convergence speed compared to single-head monolithic architectures (arXiv:2603.26100). This speedup comes from one structural decision: the policy gradient for local relevance is never contaminated by the sparse, delayed signal from business-level objectives. Each loop sees a reward distribution appropriate to its temporal resolution.

The architecture separates two feedback loops with distinct cadences:

graph TD
    A[User Request] --> B[DLRM / Two-Tower Retrieval]
    B --> C[Ranking Agent - Inner Loop]
    C --> D[Served Recommendation Slate]
    D --> E[User Interaction: click / skip / dwell]
    E --> F[Inner Reward Service\nRust - P99 < 5ms]
    F --> G[Immediate Policy Update\nAction-Value Function Q_inner]
    D --> H[Session Buffer / Trajectory Store]
    H --> I[Outer Controller\nDelayed KPI Evaluator]
    I --> J[Business Reward Signal\nNDCG, Revenue, Retention]
    J --> K[Outer Policy Gradient\ngamma-discounted backprop]
    K --> C
    G --> C

    style F fill:#1a1a2e,color:#e0e0e0,stroke:#4a90e2
    style I fill:#1a1a2e,color:#e0e0e0,stroke:#e24a4a
    style C fill:#16213e,color:#e0e0e0,stroke:#4ae290

The Inner Agent operates at per-impression granularity. Its reward function is evaluated synchronously in the critical path (with strict latency constraints). The Outer Controller operates at session or daily granularity, accumulating trajectory data and computing credit assignment over multi-step sequences.

Decoupling requires an independent inference path for local reward signals, which increases compute overhead by roughly 12–15% per request cycle. That overhead is the cost of removing gradient interference—a trade that pays back on convergence speed within the first 10K training iterations in typical production settings.


Inner-Loop: Optimizing Local Relevance

The inner reward service must resolve within a P99 latency of < 5ms to avoid blocking the primary DLRM inference pipeline. Python is not viable for this constraint. The implementation below uses a Rust service that evaluates per-action reward signals from pre-computed feature embeddings passed via shared memory or a Unix socket.

use std::collections::HashMap;

/// Reward weights for inner-loop signals.
/// These are loaded from config and updated by the outer controller.
#[derive(Debug, Clone)]
pub struct InnerRewardConfig {
    pub click_weight: f32,
    pub dwell_weight: f32,
    pub skip_penalty: f32,
    pub relevance_weight: f32,
}

impl Default for InnerRewardConfig {
    fn default() -> Self {
        Self {
            click_weight: 1.0,
            dwell_weight: 0.5,
            skip_penalty: -0.3,
            relevance_weight: 0.8,
        }
    }
}

/// Per-action feedback from the serving layer.
#[derive(Debug)]
pub struct ActionFeedback {
    pub item_id: u64,
    pub clicked: bool,
    pub dwell_seconds: f32,
    pub skipped: bool,
    pub predicted_relevance: f32, // from two-tower cosine similarity
}

/// Computes the immediate inner-loop reward for a single served item.
/// Called synchronously in the request path; must stay under 5ms P99.
pub fn compute_inner_reward(feedback: &ActionFeedback, config: &InnerRewardConfig) -> f32 {
    // Base signal from explicit interaction
    let click_signal = if feedback.clicked { config.click_weight } else { 0.0 };

    // Dwell time is log-compressed to reduce variance from outlier sessions
    let dwell_signal = config.dwell_weight * (1.0 + feedback.dwell_seconds).ln();

    // Skip is a flat negative signal drawn from config
    let skip_signal = if feedback.skipped { config.skip_penalty } else { 0.0 };

    // Predicted relevance from retrieval stage anchors the reward to model confidence
    let relevance_signal = config.relevance_weight * feedback.predicted_relevance;

    click_signal + dwell_signal + skip_signal + relevance_signal
}

/// Batch evaluation across a recommendation slate.
pub fn evaluate_slate_rewards(
    slate: &[ActionFeedback],
    config: &InnerRewardConfig,
) -> HashMap<u64, f32> {
    slate
        .iter()
        .map(|fb| (fb.item_id, compute_inner_reward(fb, config)))
        .collect()
}

Pro-Tip: Log-compress dwell time before feeding it into the reward function. Raw dwell seconds produce high-variance gradients that destabilize the inner policy, particularly in content categories with bimodal consumption patterns (short clips vs. long-form articles).

The predicted_relevance field bridges the Two-Tower retrieval score into the reward signal, creating a soft constraint that prevents the inner agent from optimizing for clickbait that the retrieval model would not have surfaced organically.
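For offline replay and unit-testing the training pipeline, the same inner reward can be mirrored in Python. This is a sketch, not the production path: field names follow the Rust structs above, and the defaults are the same illustrative values.

```python
import math
from dataclasses import dataclass

@dataclass
class InnerRewardConfig:
    click_weight: float = 1.0
    dwell_weight: float = 0.5
    skip_penalty: float = -0.3
    relevance_weight: float = 0.8

@dataclass
class ActionFeedback:
    item_id: int
    clicked: bool
    dwell_seconds: float
    skipped: bool
    predicted_relevance: float  # two-tower cosine similarity

def compute_inner_reward(fb: ActionFeedback, cfg: InnerRewardConfig) -> float:
    """Python mirror of the Rust inner reward, for offline replay and tests."""
    click = cfg.click_weight if fb.clicked else 0.0
    dwell = cfg.dwell_weight * math.log(1.0 + fb.dwell_seconds)  # log-compressed
    skip = cfg.skip_penalty if fb.skipped else 0.0
    relevance = cfg.relevance_weight * fb.predicted_relevance
    return click + dwell + skip + relevance
```

A clicked item with zero dwell and relevance 0.5 scores 1.0 + 0.0 + 0.0 + 0.4 = 1.4 under the defaults, matching what the Rust service would emit.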


Outer-Loop: Enforcing Global KPI Constraints

Global objectives—revenue per session, 7-day retention, NDCG at position 10—arrive as sparse, delayed signals. A user's 7-day retention event cannot be attributed to a single recommendation action; it requires distributing credit across dozens of sequential decisions. Standard temporal difference (TD) learning handles this poorly when reward delays exceed 3–5 steps. The outer loop must implement explicit credit assignment over the full session trajectory.

import numpy as np
from typing import List, Tuple

def compute_discounted_returns(
    rewards: List[float],
    gamma: float = 0.99,
    normalize: bool = True
) -> np.ndarray:
    """
    Computes discounted cumulative returns G_t for a trajectory.
    Used by the outer controller to assign credit to historical actions.
    """
    T = len(rewards)
    returns = np.zeros(T, dtype=np.float32)
    G = 0.0
    # Reverse traversal: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(T)):
        G = rewards[t] + gamma * G
        returns[t] = G

    if normalize and returns.std() > 1e-8:
        # Normalize to zero mean, unit variance to stabilize policy gradient
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return returns


def outer_policy_gradient_update(
    action_log_probs: np.ndarray,
    session_rewards: List[float],
    business_kpi_signal: float,
    kpi_weight: float = 0.4,
    gamma: float = 0.99,
) -> Tuple[np.ndarray, float]:
    """
    Computes the outer-loop REINFORCE gradient, blending session-level
    discounted returns with the delayed business KPI signal.

    action_log_probs: log pi(a_t | s_t) for each step in the session
    session_rewards:  per-step inner rewards (from Rust service, aggregated)
    business_kpi_signal: scalar KPI (e.g., normalized revenue delta)
    kpi_weight: interpolation between session return and KPI
    """
    # Step 1: Compute discounted returns from session-level signals
    returns = compute_discounted_returns(session_rewards, gamma=gamma)

    # Step 2: Augment returns with delayed business KPI—broadcast scalar to trajectory
    # This injects the global objective without overriding local credit structure
    augmented_returns = (1 - kpi_weight) * returns + kpi_weight * business_kpi_signal

    # Step 3: Policy gradient loss — negative because we ascend reward
    # action_log_probs shape: (T,)
    policy_loss_per_step = -action_log_probs * augmented_returns

    # Scalar loss for the optimizer; mean over trajectory steps
    total_loss = policy_loss_per_step.mean()

    return policy_loss_per_step, total_loss

The kpi_weight hyperparameter controls the interpolation between immediate session utility and delayed business value. Start at 0.2 and anneal upward as the inner policy stabilizes—typically over the first 50K training steps.
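One way to implement that anneal is sketched below. The 0.2 starting point, the 0.4 target (the function's default above), and the 50K-step horizon come from the text; the linear shape is an assumption, and any monotone schedule would serve.

```python
def annealed_kpi_weight(step: int, start: float = 0.2, end: float = 0.4,
                        anneal_steps: int = 50_000) -> float:
    """Linearly anneal kpi_weight from `start` to `end` over `anneal_steps`
    training steps, then hold at `end`. Early training stays dominated by
    session-level returns while the inner policy stabilizes."""
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)
```

Feed the result into outer_policy_gradient_update as kpi_weight on each outer-loop update.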


Engineering for Low-Latency Telemetry

Per-action feedback logging is where production systems routinely introduce self-inflicted latency. Synchronous logging—writing reward signals to a database in the critical request path—violates the 10ms network latency budget per action. The architecture must treat telemetry as a fire-and-forget side channel.

graph LR
    A[Ranking Agent\nServing Layer] -->|Non-blocking write| B[In-Process Ring Buffer\nLock-free SPSC Queue]
    B -->|Flush at 10ms interval| C[Kafka Producer\nAsync Batch]
    C --> D[Kafka Topic:\nrecsys.reward.raw]
    D --> E[Flink Stream Processor\nWindowed Aggregation]
    E --> F[Reward Feature Store\nRedis / ScyllaDB]
    F --> G[RL Training Loop\nBatch Sampler]
    D --> H[Outer Controller\nSession Aggregator]
    H --> I[Delayed KPI Evaluator]
    I --> G

    style B fill:#0d1117,color:#c9d1d9,stroke:#58a6ff
    style D fill:#0d1117,color:#c9d1d9,stroke:#f0883e
    style G fill:#0d1117,color:#c9d1d9,stroke:#3fb950

The ring buffer absorbs burst traffic without applying backpressure to the serving layer. The Kafka producer flushes on a 10ms timer—not per-event—trading minimal recency for a dramatic reduction in producer syscall overhead.

Technical Warning: Do not use Kafka's synchronous send().get() in the serving critical path. A single broker hiccup will cascade into serving latency spikes. Configure producers with acks=1, linger.ms=10, and compression.type=lz4 for the reward topic.
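As a concrete reference, those producer settings as a config fragment. The property names and values are the ones named in the warning; the bootstrap address is a placeholder.

```python
# Producer config for the reward topic (property names per the warning above).
REWARD_PRODUCER_CONFIG = {
    "bootstrap.servers": "kafka:9092",  # placeholder address
    "acks": "1",                # leader ack only; no full-ISR wait in the hot path
    "linger.ms": "10",          # batch on a 10ms timer instead of per-event sends
    "compression.type": "lz4",  # cheap CPU, large wire savings on reward events
}
```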


Scaling Telemetry Streams

Integrating RL training feedback into existing Kafka/Flink pipelines requires careful partitioning. Reward events must be co-partitioned with session keys so that Flink can reconstruct full session trajectories within a single task without cross-partition shuffles.

| Metric | Synchronous Logging | Asynchronous (Ring Buffer + Kafka) |
| --- | --- | --- |
| P50 Serving Latency Impact | +4.2ms | +0.08ms |
| P99 Serving Latency Impact | +18ms | +0.6ms |
| Throughput (events/sec/core) | 12,000 | 380,000 |
| Data Loss Rate (broker failover) | 0% | < 0.01% (at-least-once) |
| Training Feedback Lag | Real-time | 50–200ms (configurable) |
| Operational Complexity | Low | Medium |

The 50–200ms training feedback lag from async logging is negligible for the outer controller, which operates at session granularity. For the inner loop, the Rust service evaluates rewards in-process from the serving layer's feature cache—no Kafka round-trip required.

Configure Flink with session windows keyed on user_id, with a 30-minute gap timeout. This captures full browsing sessions as discrete trajectory units for outer-loop credit assignment.
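The session-window semantics can be sketched outside Flink as a plain gap-based grouping. This is illustrative pure Python, not PyFlink; Flink's session windows apply the same rule per key, with watermarking on top.

```python
from typing import List, Tuple

def sessionize(events: List[Tuple[float, dict]],
               gap_seconds: float = 1800.0) -> List[List[dict]]:
    """Group one user's (timestamp, event) pairs into sessions: a new
    session starts whenever the gap since the previous event exceeds
    `gap_seconds` (30 minutes, matching the Flink gap timeout)."""
    sessions: List[List[dict]] = []
    last_ts = None
    for ts, event in sorted(events, key=lambda e: e[0]):
        if last_ts is None or ts - last_ts > gap_seconds:
            sessions.append([])  # gap exceeded: open a new trajectory unit
        sessions[-1].append(event)
        last_ts = ts
    return sessions
```

Each returned session is a discrete trajectory unit for outer-loop credit assignment.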


Addressing Reward Credit Assignment

The reward credit assignment problem is the central unsolved challenge in applying RL to recommendation. A user purchases a high-margin item on step 17 of a session. Which of the preceding 16 recommendation actions caused that outcome? TD learning diffuses credit backward on a fixed geometric discount schedule, which is wrong—certain actions (the initial category pivot, the price-anchor display) carry disproportionate causal weight that no fixed schedule captures.

Gated attention mechanisms increase attribution accuracy in multi-step sequences by 22% compared to standard TD learning (arXiv:2510.05592). The mechanism treats credit assignment as a learned attention problem over the action-state trajectory.

The temporal credit assignment formulation with gated attention:

$$c_t = \text{softmax}\left(\frac{Q(s_t, a_t) \cdot K(S_T)^{\top}}{\sqrt{d_k}}\right) \cdot R_{\text{outer}}$$

Where:
- $Q(s_t, a_t) \in \mathbb{R}^{d_k}$ is the query projection of the local action-value function at step $t$
- $K(S_T) \in \mathbb{R}^{T \times d_k}$ is the key projection of the full session trajectory $S_T$
- $c_t$ is the credit scalar assigned to action $t$ from the terminal business reward $R_{\text{outer}}$

The gating mechanism learns to suppress credit for irrelevant actions (impression-only events with no dwell) and amplify it for actions that correlate with the terminal KPI. Credit vectors $\mathbf{c}$ then replace the uniform discount factor in the outer policy gradient update.
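A minimal NumPy sketch of the attention-weighted credit idea follows. It is deliberately simplified relative to the formulation above: the per-step key matrix is pooled into a single session key so each action receives one scalar weight, and the learned gating and projection layers are omitted.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def attention_credit(queries: np.ndarray, keys: np.ndarray,
                     r_outer: float) -> np.ndarray:
    """Distribute the terminal business reward over T actions.
    queries: (T, d_k) per-step query projections Q(s_t, a_t)
    keys:    (T, d_k) per-step key projections of the trajectory
    Returns a (T,) credit vector whose entries sum to r_outer."""
    d_k = queries.shape[1]
    session_key = keys.mean(axis=0)                  # pooled trajectory summary
    scores = queries @ session_key / np.sqrt(d_k)    # (T,) scaled relevance
    weights = softmax(scores)                        # attention over actions
    return weights * r_outer                         # credit sums to r_outer
```

The resulting credit vector replaces the uniform discount weighting in the outer policy gradient update, concentrating $R_{\text{outer}}$ on the actions whose queries align with the session summary.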

Memory Constraint: Gated attention layers increase memory consumption linearly with sequence length. Cap the context window at 512 steps for production stability. Sessions longer than 512 steps should be chunked with overlapping windows of 64 steps to preserve local trajectory context at boundaries.


Deployment Patterns and Cold-Start Handling

Transitioning a pre-trained ranking model to online RL inference without a warm-start strategy produces severe reward signal thrash. The model begins receiving live feedback from a distribution it has never been trained on, and the policy collapses toward high-confidence but low-reward actions within hundreds of steps.

Deployment stability checklist:

  • [ ] Pre-load the reward feature store with 30 days of historical interaction data before enabling the RL training loop
  • [ ] Initialize the inner reward weights from offline A/B experiment results, not from random initialization
  • [ ] Gate outer controller updates behind a minimum session count threshold (≥ 10K sessions) before allowing KPI gradients to propagate
  • [ ] Enable shadow mode: run the layered reward system in parallel with the existing ranker for 48 hours, logging reward distributions without acting on them
  • [ ] Monitor KL divergence between the pre-training policy and the online policy; trigger a rollback if KL > 0.3 within the first 6 hours of live traffic
  • [ ] Set a reward clipping range of $[-5, +5]$ to prevent outlier sessions from dominating early gradient updates
  • [ ] Validate that the Rust inner reward service maintains P99 < 5ms under peak QPS before enabling live updates
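The KL rollback gate from the checklist can be sketched as follows. The 0.3 threshold and 6-hour window come from the checklist; categorical action distributions over a shared candidate set are an assumption.

```python
import numpy as np

def policy_kl(p_ref: np.ndarray, p_live: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p_ref || p_live) between two categorical action distributions."""
    p = np.clip(p_ref, eps, 1.0)
    q = np.clip(p_live, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def should_rollback(p_ref: np.ndarray, p_live: np.ndarray, hours_live: float,
                    kl_threshold: float = 0.3, window_hours: float = 6.0) -> bool:
    """Trigger rollback if the online policy has drifted past the KL
    threshold within the early-rollout monitoring window."""
    return hours_live <= window_hours and policy_kl(p_ref, p_live) > kl_threshold
```

Run this check against the frozen pre-training policy on a sampled request stream; past the 6-hour window, drift is expected and the gate relaxes.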

Managing Model Stability in Dynamic Environments

Non-stationary reward distributions are the norm in recommendation: seasonal trends, viral content, A/B experiments running concurrently, and feature pipeline changes all shift the reward baseline. A policy trained against yesterday's baseline will interpret today's rewards as artificially high or low, producing incorrect gradient magnitudes.

import numpy as np
from collections import deque

class RollingRewardBaseline:
    """
    Maintains a rolling exponential moving average of observed rewards
    to normalize incoming signals before feeding them to the policy gradient.
    Prevents training instability caused by non-stationary reward distributions.
    """

    def __init__(self, window_size: int = 1000, ema_alpha: float = 0.01):
        self.window_size = window_size
        self.ema_alpha = ema_alpha  # Lower alpha = slower adaptation = more stability
        self._buffer = deque(maxlen=window_size)
        self._ema_mean: float = 0.0
        self._ema_var: float = 1.0
        self._initialized: bool = False

    def update(self, reward: float) -> None:
        """Update baseline statistics with a new observed reward."""
        self._buffer.append(reward)
        if not self._initialized:
            # Bootstrap from first observation rather than assuming zero-mean
            self._ema_mean = reward
            self._ema_var = 1.0
            self._initialized = True
            return
        # Welford-style online EMA update for mean and variance
        delta = reward - self._ema_mean
        self._ema_mean += self.ema_alpha * delta
        self._ema_var = (1 - self.ema_alpha) * (
            self._ema_var + self.ema_alpha * delta ** 2
        )

    def normalize(self, reward: float) -> float:
        """
        Normalize a reward against the current baseline.
        Returns the raw reward during initialization to avoid division by zero.
        """
        if not self._initialized or self._ema_var < 1e-8:
            return reward
        return (reward - self._ema_mean) / (np.sqrt(self._ema_var) + 1e-8)

    def batch_normalize(self, rewards: np.ndarray) -> np.ndarray:
        """Normalize a batch of rewards; updates baseline for each entry."""
        normalized = np.empty_like(rewards, dtype=np.float32)
        for i, r in enumerate(rewards):
            normalized[i] = self.normalize(float(r))
            self.update(float(r))
        return normalized

Set ema_alpha conservatively—between 0.005 and 0.02—during initial deployment. Higher alpha values cause the baseline to track reward spikes too aggressively, which zeroes out gradient signal during viral content events that should be reinforced.


Performance Benchmarks and ROI Impact

Production deployment of layered reward architectures delivers a 14% increase in NDCG and a 40% reduction in training iteration cycles (arXiv:2603.26100). The NDCG improvement is a direct consequence of the inner loop's ability to optimize ranking quality without gradient interference from revenue objectives. The convergence reduction comes from the outer loop receiving a cleaner, lower-variance credit signal.

| KPI | Monolithic Weighted-Sum | Layered Reward Architecture | Delta |
| --- | --- | --- | --- |
| NDCG@10 | 0.412 | 0.469 | +13.8% |
| Training Iterations to Convergence | ~180K | ~108K | -40% |
| Convergence Speed (wall time) | Baseline | 30–50% faster | — |
| Reward Signal Variance (normalized) | 2.4 | 0.9 | -62.5% |
| Hyperparameter Re-tuning Frequency | Every 2–3 weeks | Every 6–8 weeks | -67% |
| Catastrophic Interference Incidents | 65% of deployments | < 10% of deployments | -84% |

The reduction in hyperparameter re-tuning frequency compounds significantly at scale. Teams running 20+ concurrent recommendation experiments save 15–20 engineer-hours per week when they no longer need to re-balance objective weights after each upstream data distribution shift.

The 40% reduction in training iteration cycles directly reduces GPU compute costs. At scale—training on billions of impression events—this translates to 6-figure infrastructure savings annually, and a faster iteration cycle for the modeling team.


Future-Proofing Your Recommendation Infrastructure

The layered reward architecture has a structural property that becomes critical as LLM-based agents enter the recommendation stack: it separates policy from reward definition. Static weight vectors embedded in a monolithic loss function cannot be updated without retraining the full model. The outer controller in a layered system is a configurable component—its reward weights are operational parameters, not model weights.

This separation enables two capabilities that matter for LLM integration:

Dynamic policy-based reward tuning. As business priorities shift—say, transitioning from engagement maximization to subscription conversion during a product launch—the outer controller's KPI weights can be updated at runtime without triggering a model retraining cycle. The inner agent continues optimizing local relevance while the outer controller reorients toward the new business objective within one session epoch.
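A sketch of that operational separation: the outer controller's KPI weights live in a config object that can be swapped atomically at runtime, not in model parameters. Names and weights here are illustrative.

```python
import threading
from typing import Dict

class OuterRewardConfig:
    """KPI weights as operational parameters: hot-swappable without retraining."""

    def __init__(self, weights: Dict[str, float]):
        self._weights = dict(weights)
        self._lock = threading.Lock()

    def update(self, new_weights: Dict[str, float]) -> None:
        """Atomically replace the KPI weighting (e.g., pushed from a config
        service during a business-priority shift)."""
        with self._lock:
            self._weights = dict(new_weights)

    def outer_reward(self, kpis: Dict[str, float]) -> float:
        """Weighted business reward over the currently prioritized KPIs."""
        with self._lock:
            return sum(w * kpis.get(name, 0.0)
                       for name, w in self._weights.items())
```

Repointing the system from engagement to conversion is then a single update() call; the inner agent and the model weights are untouched.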

LLM-as-Outer-Controller. An LLM agent with access to business context (inventory levels, margin targets, user cohort data) can generate reward shaping signals for the outer loop directly, replacing hand-tuned KPI weights with contextually generated ones. The Two-Tower retrieval and DLRM ranking layers remain unchanged; only the outer reward signal source upgrades. This is the architectural pattern that makes recommendation systems composable with agentic infrastructure.

The transition from static weights to dynamic policy-based reward tuning is not a future migration—it is a property you build into the system at design time by choosing the layered architecture. Retrofitting dynamic reward logic into a monolithic ranker requires re-architecting the training loop from scratch.


Conclusion: Strategic Implementation Roadmap

Migrating from a monolithic weighted-sum ranker to a layered reward system is a three-phase infrastructure project. Phase 1 eliminates gradient interference. Phase 2 operationalizes the telemetry infrastructure. Phase 3 activates the full credit assignment mechanism.

| Phase | Duration | Key Deliverables | Success Criteria |
| --- | --- | --- | --- |
| Phase 1: Reward Decomposition | 4–6 weeks | Deploy Rust inner reward service; separate inner/outer loss heads in training; shadow-mode validation | P99 inner reward latency < 5ms; KL divergence from baseline policy < 0.15 |
| Phase 2: Telemetry Infrastructure | 3–4 weeks | Async Kafka telemetry pipeline; Flink session windowing; reward feature store integration | Async logging latency impact < 1ms P99; zero synchronous writes in critical path |
| Phase 3: Credit Assignment + Outer Loop | 6–8 weeks | Gated attention credit module; outer controller with delayed KPI integration; rolling reward baseline | 20%+ reduction in training iteration cycles; NDCG@10 improvement measurable in A/B test within 2 weeks of online deployment |

Phase 1 delivers measurable value independently—removing gradient interference in the training loop improves convergence even before the telemetry infrastructure is fully operational. Do not treat this as a big-bang migration; each phase ships and validates in production before the next begins.

The core engineering contract of this architecture: inner rewards must be fast, outer rewards must be accurate, and telemetry must never touch the critical path. Everything else is implementation detail.


Keywords: Multi-Objective Reinforcement Learning, Catastrophic Interference, Reward Credit Assignment, Two-Tower Model, DLRM, Pareto Optimality, Action-Value Function, Policy Gradient, Telemetry Integration, Latency Optimization, Reward Shaping