Implementing Multi-Objective Reward Functions: Preference-Based RL for Urban Control Systems

15 min read · Published Apr 8, 2026, 12:04 AM

Urban traffic signal control is not a single-objective problem. Every intersection balances throughput, safety, emissions, pedestrian wait time, and emergency vehicle preemption simultaneously. The standard industry response—a hand-crafted scalar reward—fails structurally at this task. This article details the complete engineering path from broken scalar reward to a functioning RLAIF-based multi-objective policy, including gradient conflict resolution, preference annotation pipeline architecture, and production deployment constraints.


The Failure of Scalar Reward Engineering in Urban Systems

The canonical scalar reward for Urban Traffic Signal Control (UTSC) takes the form:

$$R(\mathbf{s}, \mathbf{a}) = \sum_{i=1}^{N} w_i \cdot r_i(\mathbf{s}, \mathbf{a})$$

where $w_i$ are manually assigned weights and $r_i$ are individual objective scores (throughput, emissions, queue length, etc.). This formulation contains a structural flaw: it assumes linear utility across all objectives. Real urban traffic environments exhibit non-convex Pareto frontiers, where no single weight vector can reach all optimal trade-off points.
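A toy numerical illustration makes the flaw concrete. The three points below are hypothetical normalized operating points, not real traffic data; the middle point C lies on a non-convex front, and no weight vector ever selects it under a weighted sum:

```python
# Three Pareto-optimal operating points for two objectives
# (throughput, emission compliance), normalized to [0, 1].
# The front is non-convex: C sits inside the A-B chord.
points = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.45, 0.45)}

def weighted_sum_argmax(w1: float) -> str:
    """Return the point maximizing w1*f1 + (1 - w1)*f2."""
    return max(points, key=lambda k: w1 * points[k][0] + (1 - w1) * points[k][1])

# Sweep the entire weight simplex: C is never selected, because
# max(w1, 1 - w1) >= 0.5 > 0.45 for every weight choice.
selected = {weighted_sum_argmax(w1 / 100) for w1 in range(101)}
print(selected)  # only "A" and "B" ever appear
```

However fine the weight sweep, the agent can only ever be steered to the extreme points; the balanced trade-off C is structurally unreachable.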

As documented in arXiv:2602.20728v1, "In such systems trade-offs among conflicting objectives are difficult to specify, and policies risk collapsing into optimizing for a dominant goal." The research confirms what practitioners observe in production: agents trained on scalar-weighted sums frequently exhibit policy collapse, over-optimizing for high-signal objectives like vehicle throughput at the expense of emission compliance or pedestrian safety.

The contrast between weighted-sum scalarization and true multi-objective optimization is architectural:

Approach | Representational Space | Pareto Coverage
Weighted-sum scalar | Single point on Pareto front | Poor—misses non-convex regions
Chebyshev scalarization | Single point, better coverage | Moderate—still requires weight tuning
MORL with preference oracle | Full Pareto front traversal | High—adaptive to operator preferences

Reward engineering for multi-objective systems cannot be reduced to weight-tuning. The solution requires a different architectural primitive: a preference model that captures operator intent across the full objective space, without requiring engineers to enumerate every trade-off manually.
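The Chebyshev row of the table can be demonstrated on the same kind of toy data (three hypothetical Pareto points; the ideal point $z^* = (1, 1)$ and equal weights are illustrative choices): minimizing the weighted Chebyshev distance to the ideal point reaches a non-convex point that no weighted sum can select.

```python
# Chebyshev scalarization: minimize the weighted max-distance to the
# ideal point z* = (1, 1). Unlike the weighted sum, it can select
# points in non-convex regions of the Pareto front.
points = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.45, 0.45)}
z_star = (1.0, 1.0)

def chebyshev_argmin(w1: float) -> str:
    w = (w1, 1.0 - w1)
    return min(
        points,
        key=lambda k: max(w[i] * (z_star[i] - points[k][i]) for i in range(2)),
    )

print(chebyshev_argmin(0.5))  # "C": the balanced point is now reachable
```

The catch, as the table notes, is that the weight vector still has to be tuned per operating point, which is exactly the manual enumeration the preference-oracle approach removes.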


Architecting the RLAIF Preference Annotation Pipeline

The RLAIF pipeline replaces human annotators with an LLM acting as a preference oracle. This oracle compares trajectory pairs sampled from the autonomous system's environment simulation and returns preference labels that encode multi-objective trade-off reasoning. The policy model then trains on these labels rather than a hand-crafted reward function.

flowchart TD
    A[Environment Simulator\nSUMO / Custom GridEnv] -->|Rollout Pairs τ_A, τ_B| B[Trajectory Buffer\nAsynchronous Queue]
    B -->|Batch of Trajectory Pairs| C[LLM Preference Oracle\nGPT-4o / DeepSeek-R1]
    C -->|Preference Labels P(τ_A ≻ τ_B)| D[Preference Dataset\nReplay Memory]
    D -->|Training Signal| E[Reward Model\nNeural Preference Estimator]
    E -->|Scalar Proxy per Objective| F[MORL Policy\nPPO + PCGrad]
    F -->|Updated Policy π| A
    G[Operator Preference Config\nJSON / YAML] -->|System Prompt Context| C
    F -->|State Snapshot| H[State Persistence Layer\nLangGraph Serializer]
    H -->|Restored State on Re-plan| F

RLAIF pipelines for urban control must manage trajectory sampling overhead through asynchronous buffering. Without it, synchronous LLM API calls introduce I/O blocking that degrades GPU utilization below 30% during annotation phases. The buffer decouples environment rollout from annotation latency, keeping the policy training loop GPU-bound.

Technical Warning: LLM inference latency (typically 200–800ms per request for GPT-4o) is incompatible with real-time control loops. The preference annotation pipeline operates offline or in a parallel background process. The trained reward model—not the LLM—runs at inference time.

Integrating Large Language Models as Preference Oracles

The LLM annotator receives structured trajectory summaries, not raw state tensors. Engineers must compress trajectory data into semantically meaningful metrics before constructing the prompt. Consistency in preference labels depends entirely on prompt determinism: temperature must be set to 0, and the system prompt must define the objective hierarchy explicitly. Note that the request below shows the logical structure of the user message; the Chat Completions API accepts only string content, so the nested trajectory object must be JSON-serialized (e.g., via json.dumps) before submission.

{
  "model": "gpt-4o",
  "temperature": 0,
  "messages": [
    {
      "role": "system",
      "content": "You are a traffic systems evaluator. Rank trajectory pairs by multi-objective performance. Priority order: (1) safety—zero collisions, (2) emission compliance—CO2 below threshold, (3) throughput—vehicles per minute. Return JSON only."
    },
    {
      "role": "user",
      "content": {
        "trajectory_A": {
          "vehicles_per_min": 42.3,
          "avg_co2_kg_per_km": 0.18,
          "collision_events": 0,
          "avg_pedestrian_wait_s": 28.1
        },
        "trajectory_B": {
          "vehicles_per_min": 51.7,
          "avg_co2_kg_per_km": 0.31,
          "collision_events": 0,
          "avg_pedestrian_wait_s": 19.4
        },
        "query": "Which trajectory better satisfies the objective hierarchy? Return: {\"preferred\": \"A\" or \"B\", \"confidence\": 0.0-1.0, \"reasoning\": \"<one sentence>\"}"
      }
    }
  ]
}

The confidence field is operationally significant: labels with confidence below 0.65 are discarded from the preference dataset. Low-confidence annotations introduce label noise that destabilizes reward model training. The reasoning field feeds a secondary audit log for debugging objective drift over long training runs.


Conflict-Resolution Logic in MORL

Conflicting gradients are the central failure mode in multi-objective reinforcement learning. When the gradient from the throughput objective $\nabla_\theta \mathcal{L}_{\text{throughput}}$ points in a direction that directly opposes $\nabla_\theta \mathcal{L}_{\text{emissions}}$, naive gradient averaging degrades both objectives simultaneously.

The industry-standard resolution is PCGrad (Yu et al., 2020), which projects each objective's gradient onto the normal plane of any conflicting gradient. If two gradients exhibit negative cosine similarity (i.e., $\cos\theta < 0$), the conflicting component is subtracted before the update step.
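Written out, the surgery applied when two objective gradients $\mathbf{g}_i$ and $\mathbf{g}_j$ conflict is:

$$\mathbf{g}_i \leftarrow \mathbf{g}_i - \frac{\mathbf{g}_i \cdot \mathbf{g}_j}{\lVert \mathbf{g}_j \rVert^2}\, \mathbf{g}_j \quad \text{if } \mathbf{g}_i \cdot \mathbf{g}_j < 0$$

The subtracted term is the component of $\mathbf{g}_i$ along $\mathbf{g}_j$, so after projection $\mathbf{g}_i$ is orthogonal to the conflicting direction rather than opposed to it.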

The computational cost is $O(N^2)$ in the number of objectives $N$, making GPU parallelization non-optional for systems with more than four simultaneous objectives. The following PyTorch 2.5 implementation handles gradient projection across an arbitrary number of policy objectives:

import torch
from typing import List
from torch import Tensor

class PCGrad:
    """
    Implements the PCGrad gradient surgery algorithm (Yu et al., 2020).
    Resolves conflicting gradients across multiple RL objectives in-place.
    """

    def __init__(self, optimizer: torch.optim.Optimizer):
        self.optimizer = optimizer

    def _retrieve_flat_grad(self, losses: List[Tensor], model: torch.nn.Module) -> List[Tensor]:
        grads = []
        for loss in losses:
            self.optimizer.zero_grad(set_to_none=True)
            loss.backward(retain_graph=True)
            # Flatten all parameter gradients into one vector per objective;
            # substitute zeros for parameters without gradients so every
            # objective vector matches the write-back layout in step()
            grad = torch.cat([
                (p.grad if p.grad is not None else torch.zeros_like(p)).reshape(-1)
                for p in model.parameters()
            ])
            grads.append(grad)
        return grads

    def _project(self, g_i: Tensor, g_j: Tensor) -> Tensor:
        # PCGrad surgery: if the gradients conflict (negative inner product),
        # remove the component of g_i along g_j:
        #   g_i <- g_i - (g_i . g_j / ||g_j||^2) * g_j
        inner = torch.dot(g_i, g_j)
        if inner < 0:
            g_i = g_i - (inner / (g_j.norm() ** 2 + 1e-12)) * g_j
        return g_i

    def step(self, losses: List[Tensor], model: torch.nn.Module):
        grads = self._retrieve_flat_grad(losses, model)
        projected = []
        for i, g_i in enumerate(grads):
            g_proj = g_i.clone()
            for j, g_j in enumerate(grads):
                if i != j:
                    g_proj = self._project(g_proj, g_j)
            projected.append(g_proj)

        # Sum projected gradients and write back to parameter .grad fields
        final_grad = torch.stack(projected).sum(dim=0)
        pointer = 0
        self.optimizer.zero_grad(set_to_none=True)
        for p in model.parameters():
            numel = p.numel()
            p.grad = final_grad[pointer:pointer + numel].view_as(p).clone()
            pointer += numel

        self.optimizer.step()

Pro-Tip: For $N \leq 4$ objectives, PCGrad on a single GPU is sufficient. Beyond four objectives, partition objective groups across GPU streams using torch.cuda.Stream to parallelize the $O(N^2)$ projection loop.

Dynamic Policy Switching vs. Trade-off Weighting

Two strategies govern how the MORL policy responds to shifting operational conditions: hard constraint enforcement (dynamic policy switching) and soft preference weighting. The choice depends on traffic load characteristics.

Condition | Strategy | Mechanism | Trade-off Impact
Peak load (>85% capacity) | Hard constraint | Throughput floor enforced; emissions clipped | Emission compliance degrades ~12–18%
Off-peak load (<40% capacity) | Soft weighting | LLM preference weights applied dynamically | Throughput drop ~5%; emissions optimize fully
Emergency preemption | Hard constraint override | Single-objective lock: response time only | All other objectives suspended
Construction zone active | Soft weighting + constraint | Safety margin hard floor; pedestrian weight +0.4 | Throughput reduction ~20–30% acceptable

Dynamic policy switching is appropriate when any single objective has a safety-critical lower bound that must not be violated regardless of LLM-assigned preference weights. Soft weighting is appropriate when all objectives are negotiable and the operator's preference ordering can shift between episodes.

The autonomous systems controlling physical infrastructure must implement both: hard constraints as non-negotiable floors, soft weighting for continuous optimization within feasible regions.
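A minimal sketch of the dispatch logic the table implies. The thresholds, objective names, and weight vectors here are illustrative assumptions, not values from a production controller:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class ControlMode:
    strategy: str                            # "hard_constraint" or "soft_weighting"
    weights: Dict[str, float]                # active preference weights (soft modes)
    locked_objective: Optional[str] = None   # single-objective lock, if any

def select_mode(load_ratio: float, emergency: bool, construction: bool) -> ControlMode:
    """Map operating conditions to a control strategy (thresholds hypothetical)."""
    if emergency:
        # Hard override: every other objective suspended
        return ControlMode("hard_constraint", {}, locked_objective="response_time")
    if construction:
        # Safety hard floor plus a +0.4 pedestrian weight boost over baseline
        return ControlMode("soft_weighting",
                           {"safety": 1.0, "pedestrian": 0.8, "throughput": 0.2})
    if load_ratio > 0.85:
        # Peak load: enforce a throughput floor, clip emissions optimization
        return ControlMode("hard_constraint", {"throughput": 1.0})
    # Off-peak: all objectives negotiable under LLM-assigned weights
    return ControlMode("soft_weighting",
                       {"throughput": 0.3, "emissions": 0.4, "pedestrian": 0.3})

print(select_mode(load_ratio=0.9, emergency=False, construction=False).strategy)
```

The key design property is that the hard-constraint branches are checked first, so no soft weight configuration can ever override a safety-critical floor.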


Hardware and Simulation Constraints

Running local policy simulation for UTSC requires a minimum of 16GB VRAM to maintain the policy network, trajectory buffers, and gradient projection tensors concurrently in GPU memory. Systems below this threshold will offload trajectory buffers to CPU RAM, introducing PCIe transfer latency that breaks the asynchronous buffering pipeline.

# Validate GPU memory and configure PyTorch DataLoader for parallel trajectory sampling
# Run before initiating any training loop

export CUDA_VISIBLE_DEVICES=0  # Pin to primary GPU; extend to "0,1" for DDP
export OMP_NUM_THREADS=4       # Prevent CPU thread oversubscription during rollouts

python -c "
import torch
assert torch.cuda.is_available(), 'CUDA not available'
vram_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
assert vram_gb >= 16, f'Insufficient VRAM: {vram_gb:.1f}GB (16GB required)'
print(f'VRAM OK: {vram_gb:.1f}GB on {torch.cuda.get_device_name(0)}')
print(f'PyTorch version: {torch.__version__}')
"

# Configure DataLoader workers for trajectory batch throughput
# num_workers=4 saturates most NVMe I/O without overcommitting CPU
torchrun --nproc_per_node=1 train_morl.py \
    --dataloader-workers 4 \
    --pin-memory true \
    --prefetch-factor 2 \
    --backend nccl

PyTorch 2.5's DistributedDataParallel with parallel prefetching is required to keep GPU utilization above 80% during trajectory sampling phases. Without prefetching, the annotator's LLM inference latency leaves the GPU idle while the CPU blocks on I/O.

Optimizing Computational Overhead During Trajectory Sampling

Trajectory sampling is the primary computational bottleneck in RLAIF for autonomous control systems. The reward engineering loop stalls when annotation requests queue faster than the LLM API can resolve them. The solution is an asynchronous trajectory buffer with batched annotation submission.

import asyncio
import aiohttp
import torch
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple
import json

@dataclass
class TrajectoryPair:
    tau_a: dict  # Compressed trajectory metrics for trajectory A
    tau_b: dict  # Compressed trajectory metrics for trajectory B
    pair_id: str

@dataclass
class AnnotatedPair:
    pair: TrajectoryPair
    preferred: str   # "A" or "B"
    confidence: float

class AsyncTrajectoryAnnotator:
    """
    Decouples trajectory collection from LLM annotation via async queue.
    Prevents GPU idle time during API round-trips.
    """

    def __init__(
        self,
        api_key: str,
        model: str = "gpt-4o",
        batch_size: int = 8,
        confidence_threshold: float = 0.65,
        buffer_maxlen: int = 512,
    ):
        self.api_key = api_key
        self.model = model
        self.batch_size = batch_size
        self.confidence_threshold = confidence_threshold
        # Bounded buffer prevents OOM on extended training runs
        self.pending: Deque[TrajectoryPair] = deque(maxlen=buffer_maxlen)
        self.annotated: List[AnnotatedPair] = []

    def enqueue(self, pair: TrajectoryPair):
        self.pending.append(pair)

    async def _annotate_single(
        self, session: aiohttp.ClientSession, pair: TrajectoryPair
    ) -> AnnotatedPair | None:
        payload = {
            "model": self.model,
            "temperature": 0,
            "messages": [
                {"role": "system", "content": "Return JSON: {\"preferred\": \"A\" or \"B\", \"confidence\": float}"},
                {"role": "user", "content": json.dumps({"trajectory_A": pair.tau_a, "trajectory_B": pair.tau_b})}
            ]
        }
        async with session.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json=payload,
            timeout=aiohttp.ClientTimeout(total=10)
        ) as resp:
            data = await resp.json()
            result = json.loads(data["choices"][0]["message"]["content"])
            confidence = float(result.get("confidence", 0.0))
            # Discard low-confidence labels—they degrade reward model training
            if confidence < self.confidence_threshold:
                return None
            return AnnotatedPair(pair=pair, preferred=result["preferred"], confidence=confidence)

    async def flush_batch(self):
        """Drain pending queue in batches; run annotation concurrently."""
        batch: List[TrajectoryPair] = []
        while self.pending and len(batch) < self.batch_size:
            batch.append(self.pending.popleft())

        async with aiohttp.ClientSession() as session:
            tasks = [self._annotate_single(session, p) for p in batch]
            results = await asyncio.gather(*tasks, return_exceptions=True)
            for r in results:
                if isinstance(r, AnnotatedPair):
                    self.annotated.append(r)

    def drain_annotations(self) -> List[AnnotatedPair]:
        """Called by training loop to consume completed annotations."""
        completed = self.annotated.copy()
        self.annotated.clear()
        return completed

The bounded deque(maxlen=512) is operationally critical: it prevents the buffer from growing unbounded during periods of high rollout throughput relative to annotation bandwidth.


Validating Policy Trade-offs in Urban Environments

Policy validation in UTSC requires running the trained policy against a simulated intersection grid across a range of LLM-assigned preference weight vectors. The goal is to confirm that the Pareto front is well-distributed—not collapsed to a single operating point.

Pareto Front Visualization Protocol: Generate a 2D scatter plot with normalized throughput (vehicles/min) on the x-axis and normalized emission compliance score (inverse CO₂/km) on the y-axis. Each point represents the average policy performance under a distinct preference weight configuration $\lambda \in \Lambda$, where $\Lambda$ is a uniform grid over the simplex spanning three objectives: throughput, emissions, and pedestrian wait.

As LLM preference weights shift from throughput-dominant ($\lambda_{\text{throughput}} = 0.8$) toward emission-dominant ($\lambda_{\text{emissions}} = 0.8$), the Pareto front should trace a smooth, well-spread trade-off curve from the lower-right region (high throughput, moderate compliance) toward the upper-left region (lower throughput, high compliance). Collapse of multiple points into a single region indicates either gradient interference (PCGrad misconfiguration) or reward model underfitting on the preference dataset.
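The weight grid $\Lambda$ can be enumerated as a uniform lattice over the 3-objective simplex. A minimal sketch, where the `resolution` parameter (lattice spacing of $1/\text{resolution}$) is an assumption:

```python
from itertools import product
from typing import List, Tuple

def simplex_grid(resolution: int = 10) -> List[Tuple[float, float, float]]:
    """Uniform lattice over the 3-objective weight simplex: every
    (i, j, k) with i + j + k = resolution, scaled to sum to 1."""
    r = resolution
    return [
        (i / r, j / r, (r - i - j) / r)
        for i, j in product(range(r + 1), repeat=2)
        if i + j <= r
    ]

# One policy evaluation run per weight vector drives the scatter plot
weights = simplex_grid(10)
print(len(weights))  # 66 weight vectors at resolution 10
```

Each resulting vector $(\lambda_{\text{throughput}}, \lambda_{\text{emissions}}, \lambda_{\text{pedestrian}})$ parameterizes one evaluation run, and the averaged metrics from that run become one point on the scatter plot.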

Technical Warning: A Pareto front that appears perfectly convex in simulation does not guarantee equivalent coverage in production. Urban traffic exhibits phase transitions at capacity thresholds (~85% road utilization) that simulation rarely captures at full fidelity.

Managing State Persistence During Re-planning

When the MORL policy receives updated preference weights mid-execution—triggered by operator intervention or time-of-day schedule changes—the agent must restore a consistent internal state without policy drift. Serialization latency must remain below 5ms to maintain real-time performance in signal controller loops.

import json
import time
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import Any, Dict
from langgraph.checkpoint.memory import MemorySaver

@dataclass
class AgentState:
    episode_step: int
    current_phase: str           # Signal phase ID active at snapshot time
    objective_weights: Dict[str, float]  # Current MORL weight vector
    trajectory_buffer_size: int
    last_preference_update_ts: float

class StateManager:
    """
    Manages LangGraph-compatible state persistence for mid-execution re-planning.
    Serialization target: < 5ms round-trip for real-time compatibility.
    """

    def __init__(self, checkpoint_dir: str = "/tmp/morl_state"):
        self.checkpoint_path = Path(checkpoint_dir)
        self.checkpoint_path.mkdir(parents=True, exist_ok=True)
        self.saver = MemorySaver()  # In-memory saver; swap for SqliteSaver in production

    def save(self, state: AgentState, run_id: str) -> float:
        t0 = time.perf_counter()
        payload = json.dumps(asdict(state))
        state_file = self.checkpoint_path / f"{run_id}.json"
        state_file.write_text(payload)
        latency_ms = (time.perf_counter() - t0) * 1000
        # Alert if serialization exceeds the real-time budget; warn instead
        # of raising so a slow write does not abort the control loop
        if latency_ms > 5.0:
            import warnings
            warnings.warn(f"State serialization exceeded 5ms budget: {latency_ms:.2f}ms")
        return latency_ms

    def restore(self, run_id: str) -> AgentState:
        state_file = self.checkpoint_path / f"{run_id}.json"
        if not state_file.exists():
            raise FileNotFoundError(f"No checkpoint found for run_id={run_id}")
        data: Dict[str, Any] = json.loads(state_file.read_text())
        return AgentState(**data)

    def update_weights(self, run_id: str, new_weights: Dict[str, float]) -> AgentState:
        """Atomically update preference weights on re-plan trigger."""
        state = self.restore(run_id)
        state.objective_weights = new_weights
        state.last_preference_update_ts = time.time()
        self.save(state, run_id)
        return state

The MemorySaver from LangGraph handles in-process checkpointing with negligible overhead. For production deployments where the controller process may restart, replace MemorySaver with SqliteSaver and co-locate the SQLite file on a RAM-backed filesystem (/dev/shm) to maintain the 5ms latency budget.
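The 5ms budget is hardware-dependent, so it is worth probing on the target filesystem before deployment. A stdlib-only sketch (the payload shape and trial count are illustrative; point the path at /dev/shm to test the RAM-backed configuration):

```python
import json
import tempfile
import time
from pathlib import Path

def measure_roundtrip(path: Path, payload: dict, trials: int = 100) -> float:
    """Return worst-case serialize + write + read + parse latency in ms."""
    worst = 0.0
    for _ in range(trials):
        t0 = time.perf_counter()
        path.write_text(json.dumps(payload))
        restored = json.loads(path.read_text())
        worst = max(worst, (time.perf_counter() - t0) * 1000)
        assert restored == payload  # round-trip must be lossless
    return worst

state = {"episode_step": 1042,
         "objective_weights": {"throughput": 0.3, "emissions": 0.4}}
with tempfile.TemporaryDirectory() as tmp:
    worst_ms = measure_roundtrip(Path(tmp) / "state_probe.json", state)
print(f"worst-case round-trip: {worst_ms:.3f} ms")
```

Measuring the worst case rather than the mean matters here: a real-time budget is violated by the tail, not the average.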


Future-Proofing Preference-Based MORL

The long-term reliability question for RLAIF in safety-critical deployments is not algorithmic—it is distributional. LLM preference oracles trained or prompted on today's traffic patterns will generate systematically biased labels when urban infrastructure changes: new road layouts, autonomous vehicle penetration rates, or policy interventions (congestion pricing, emissions zones) shift the underlying state distribution.

Reward engineering for production systems must therefore treat the LLM preference model as a living component, not a frozen artifact. This means maintaining a versioned preference dataset with timestamps, running periodic re-annotation of archived trajectory pairs when the base LLM is updated, and monitoring reward model calibration via held-out preference accuracy metrics.
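One way to monitor that calibration is to bucket held-out labels by oracle confidence and compare stated confidence against empirical accuracy. A sketch with synthetic data; the binning scheme and bucket count are assumptions:

```python
from typing import List, Tuple

def calibration_report(
    labels: List[Tuple[float, bool]], bins: int = 4
) -> List[Tuple[float, float, int]]:
    """Bucket (confidence, was_correct) pairs by confidence and report
    (mean confidence, empirical accuracy, count) per non-empty bucket.
    Large gaps between the two signal a miscalibrated oracle."""
    buckets: List[List[Tuple[float, bool]]] = [[] for _ in range(bins)]
    for conf, correct in labels:
        idx = min(int(conf * bins), bins - 1)
        buckets[idx].append((conf, correct))
    report = []
    for bucket in buckets:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        report.append((mean_conf, accuracy, len(bucket)))
    return report

# Synthetic held-out labels: (oracle confidence, agreed with ground truth?)
held_out = [(0.9, True), (0.95, True), (0.7, True), (0.72, False), (0.68, True)]
for mean_conf, acc, n in calibration_report(held_out):
    print(f"conf~{mean_conf:.2f}  acc={acc:.2f}  n={n}")
```

Tracking this report over time gives an early signal of objective drift: if empirical accuracy sags below stated confidence after a base-LLM update, the archived trajectory pairs are due for re-annotation.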

Reinforcement learning systems that achieve Pareto-optimal policies in 2026 infrastructure conditions will require active recalibration as conditions evolve—the preference annotation pipeline is not a one-time build; it is a continuous data engineering obligation.

Scaling Beyond Single-Intersection Traffic Control

Extending MORL from a single intersection to a city-wide grid requires transitioning to a multi-agent reinforcement learning (MARL) framework where each intersection agent maintains local objectives but coordinates on shared network-level constraints (emission caps per district, emergency corridor clearance).

The RLAIF architecture scales horizontally: each sub-agent submits trajectory pairs to the shared LLM annotator, which evaluates both local performance and network-level externalities (does maximizing throughput at intersection A create downstream gridlock at intersection B?). Prompt context expands to include a neighborhood summary—a compressed representation of adjacent intersection states—ensuring the preference oracle reasons over systemic rather than purely local trade-offs.

Practical implementations use a parameter-sharing policy architecture (all intersection agents share a base policy network, with intersection-specific adaptation layers) combined with a centralized reward model that receives decentralized trajectory pairs. This reduces the annotation volume by exploiting structural similarity across intersections while preserving local responsiveness.
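A minimal PyTorch sketch of that parameter-sharing pattern. The class name, layer sizes, and dimensions are illustrative assumptions; the structural point is one shared trunk with a lightweight head per intersection:

```python
import torch
import torch.nn as nn

class SharedPolicyWithAdapters(nn.Module):
    """Parameter-sharing MARL policy: one shared trunk encodes local
    observations; a small per-intersection head adapts the output."""

    def __init__(self, obs_dim: int, act_dim: int, n_intersections: int, hidden: int = 64):
        super().__init__()
        # Trunk parameters are shared by every intersection agent
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Intersection-specific adaptation layers
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, act_dim) for _ in range(n_intersections)]
        )

    def forward(self, obs: torch.Tensor, intersection_id: int) -> torch.Tensor:
        return self.heads[intersection_id](self.trunk(obs))

policy = SharedPolicyWithAdapters(obs_dim=12, act_dim=4, n_intersections=8)
logits = policy(torch.randn(32, 12), intersection_id=3)
print(logits.shape)  # torch.Size([32, 4])
```

Because most parameters live in the shared trunk, trajectory pairs from any intersection improve the representation used by all of them, which is what lets the centralized reward model get away with far fewer annotations per intersection.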


Finalizing the Deployment Workflow

Validate the complete deployment environment before initiating any training run. The following checklist covers all dependency requirements for reproducible implementation:

#!/bin/bash
# Deployment validation script — run before first training invocation
# Requires: Python 3.10+, PyTorch 2.5+, CUDA 12.1+

set -euo pipefail

echo "=== Python Version Check ==="
python_version=$(python3 --version | awk '{print $2}')
required="3.10"
python3 -c "import sys; assert sys.version_info >= (3,10), 'Python 3.10+ required'"
echo "OK: Python $python_version"

echo "=== PyTorch Version Check ==="
python3 -c "
import torch
major, minor = [int(x) for x in torch.__version__.split('.')[:2]]
assert (major, minor) >= (2, 5), f'PyTorch 2.5+ required, found {torch.__version__}'
print(f'OK: PyTorch {torch.__version__}')
"

echo "=== CUDA + VRAM Check ==="
python3 -c "
import torch
assert torch.cuda.is_available(), 'CUDA unavailable'
vram = torch.cuda.get_device_properties(0).total_memory / (1024**3)
assert vram >= 16, f'16GB VRAM required, found {vram:.1f}GB'
print(f'OK: {vram:.1f}GB VRAM on {torch.cuda.get_device_name(0)}')
"

echo "=== Dependency Check ==="
python3 -c "
import aiohttp       # Async HTTP for LLM annotation
import langgraph     # State persistence for re-planning
import openai        # LLM API client (GPT-4o)
print('OK: All runtime dependencies present')
"

echo "=== LLM API Connectivity ==="
python3 -c "
import openai, os
client = openai.OpenAI(api_key=os.environ['OPENAI_API_KEY'])
resp = client.models.list()
print('OK: LLM API reachable')
"

echo "=== Environment Validation Complete ==="

Pro-Tip: Pin all package versions in requirements.txt using pip freeze > requirements.txt after successful validation. LangGraph's API surface has changed across minor versions; unpinned dependencies will break the MemorySaver interface.


Summary of Findings

Scalar reward engineering reaches its structural limit in multi-objective urban control. The non-convex Pareto frontiers of real traffic systems cannot be captured by any static weight vector. The RLAIF architecture—LLM preference oracle feeding a trained reward model, with PCGrad handling gradient conflicts—solves this without requiring engineers to enumerate every possible trade-off manually.

The core reward engineering shift is from specification to annotation: instead of defining what the agent should optimize, operators describe their preferences over observed trajectories, and the LLM extracts consistent preference signals at scale. This approach yields policies that traverse the Pareto front in response to changing operator priorities, rather than collapsing to a single fixed operating point.

The computational costs are real: LLM annotation overhead during the trajectory sampling phase is the primary engineering constraint, not model capacity. Asynchronous buffering, batch annotation, and GPU-bound training loops are not optimizations—they are requirements for practical deployment. Systems that treat the preference pipeline as an offline preprocessing step will find it scales directly to multi-agent city-wide coordination with minimal architectural change.


Keywords: Multi-Objective Reinforcement Learning (MORL), Reinforcement Learning from AI Feedback (RLAIF), Pareto Optimal Policy, Scalar Reward Functions, Conflict Resolution Logic, Gradient Surgery (PCGrad), Trajectory Sampling, Preference Annotation Pipeline, PyTorch 2.5, Urban Traffic Signal Control (UTSC)