Domain-Specific Model Adaptation: Evaluating COBOL-Coder and Modern LLM Code Synthesis

15 min read · Published Apr 11, 2026, 6:04 PM

General-purpose LLMs are competent polyglots and brittle specialists. When the target dialect is COBOL—a language whose structural contracts were designed for batch mainframe execution in the 1960s—the polyglot advantage collapses. The COBOL-Coder paper (ArXiv 2604.03986v1) quantifies exactly how badly: GPT-4o achieves a 41.8% compilation success rate on COBOL synthesis tasks. That number is not a benchmark curiosity; it is an operational risk figure for any team running migration workloads against production mainframe systems.


The Crisis of LLM Code Synthesis in Legacy Mainframe Environments

COBOL's division-based structure—Identification, Environment, Data, and Procedure Divisions—is not syntactic decoration. Each division carries strict dependency semantics. The Data Division declares all working storage and file descriptors that the Procedure Division legally references. A model that generates a MOVE statement referencing an undeclared 01-level group item will produce code that is syntactically plausible to a human reader and immediately rejected by any conformant COBOL compiler. General-purpose LLMs, trained predominantly on C, Python, Java, and JavaScript, internalize a fundamentally different structural grammar. They hallucinate division boundaries, emit PERFORM targets that reference nonexistent paragraphs, and misplace FD descriptors outside the File Section.
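The division-ordering contract can be checked mechanically. As a minimal illustration (not part of the COBOL-Coder pipeline), a few lines of Python can verify that the divisions a model emits appear in canonical order, which is the kind of surface check a compiler gate performs long before any semantic analysis:

```python
import re

# Canonical COBOL division order; a conformant program never reorders these.
CANONICAL = ["IDENTIFICATION", "ENVIRONMENT", "DATA", "PROCEDURE"]

def division_order_ok(source: str) -> bool:
    """Return True if the divisions present appear in canonical order, once each."""
    found = [m.upper() for m in re.findall(
        r"^\s*(IDENTIFICATION|ENVIRONMENT|DATA|PROCEDURE)\s+DIVISION\b",
        source, re.MULTILINE | re.IGNORECASE)]
    indices = [CANONICAL.index(d) for d in found]
    return len(set(found)) == len(found) and indices == sorted(indices)

good = "IDENTIFICATION DIVISION.\nDATA DIVISION.\nPROCEDURE DIVISION.\n"
bad = "PROCEDURE DIVISION.\nDATA DIVISION.\n"  # hallucinated ordering
```

A check like this catches only the grossest structural hallucinations; the dependency semantics described above (undeclared data-names, missing PERFORM targets) still require a real compiler.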

The synthesis pipeline failure is not stochastic noise—it is systematic. The model generates, the compiler rejects, and without a feedback loop the failure is silent.

sequenceDiagram
    participant Engineer
    participant GeneralLLM as General-Purpose LLM
    participant Compiler as GnuCOBOL Compiler
    participant Output

    Engineer->>GeneralLLM: Prompt: "Generate COBOL for payroll calc"
    GeneralLLM->>GeneralLLM: Samples from learned distribution<br/>(Python/Java-dominant)
    GeneralLLM-->>Engineer: COBOL artifact (unvalidated)
    Engineer->>Compiler: Submit artifact
    Compiler-->>Engineer: ERROR: Undefined data-name SALARY-WS<br/>ERROR: PERFORM target CALC-NET not found
    Note over Engineer,Output: Pipeline halts. No behavioral signal<br/>returned to model. Engineer manually patches.
    Engineer->>Output: Manual correction (unscalable)

The 41.8% compilation success rate for GPT-4o means that for every ten COBOL programs requested, fewer than five compile at all—before any behavioral correctness check is applied. At scale, this makes general-purpose models unsuitable for automated code synthesis in mainframe migration programs without substantial scaffolding.
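To make the sampling cost concrete: if compilation attempts are modeled as independent Bernoulli trials (a simplifying assumption), the expected number of generations per compiling program follows directly from the success rate:

```python
def expected_samples(success_rate: float) -> float:
    """Expected number of generations until one compiles, under an
    independence assumption (geometric distribution)."""
    return 1.0 / success_rate

# Figures from the text: GPT-4o at 41.8%, COBOL-Coder at 73.95%
gpt4o_cost = expected_samples(0.418)         # roughly 2.4 generations per compiling program
cobol_coder_cost = expected_samples(0.7395)  # roughly 1.4 generations per compiling program
```

The independence assumption is generous to the general-purpose model: in practice, failures cluster on structurally hard programs, so retrying rarely behaves like a fresh independent draw.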


Engineering Behavioral Fidelity: Moving Beyond Static Analysis

Static analysis of generated COBOL—linting, AST inspection, pattern matching—catches surface errors but cannot verify that a payroll deduction loop produces the same output as the legacy mainframe program it replaces. Behavioral Fidelity is the property that a generated or translated program, when executed against identical inputs in a deterministic environment, produces byte-identical outputs to the reference implementation.

Static analysis fails here for a specific reason: COBOL programs frequently maintain state across file I/O operations, REDEFINES clauses share memory layouts that are pointer-equivalent, and COMP-3 packed-decimal arithmetic has platform-specific rounding behavior. None of these properties are inspectable without execution.
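COMP-3 is a concrete case. A packed-decimal field stores two digits per byte with a trailing sign nibble; nothing about its numeric value is visible to a linter that treats the field as opaque bytes. A minimal decoder (illustrative only, ignoring dialect-specific sign conventions) shows what execution-level inspection actually involves:

```python
def unpack_comp3(data: bytes) -> int:
    """Decode an IBM-style packed-decimal (COMP-3) field.
    Each nibble holds one digit; the final nibble is the sign
    (0xD = negative; anything else treated as positive here)."""
    nibbles = []
    for byte in data:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()  # last nibble is the sign, not a digit
    value = 0
    for digit in nibbles:
        value = value * 10 + digit
    return -value if sign == 0x0D else value

# 123 packs as bytes 0x12 0x3C; -123 as 0x12 0x3D
```

Two programs can declare identical PIC clauses and still diverge at this byte level, which is exactly why the fidelity definition above compares executed output rather than source structure.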

The implementation contract is a Docker-based execution harness that treats the Gold Standard mainframe environment as an oracle. The harness submits the generated program alongside a reference test corpus, captures both stdout and exit codes, and computes a fidelity score.

import subprocess
import hashlib
from pathlib import Path
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FidelityResult:
    program_id: str
    compiled: bool
    behavioral_match: bool
    output_hash_generated: Optional[str] = None
    output_hash_reference: Optional[str] = None
    error_log: List[str] = field(default_factory=list)

def compute_behavioral_fidelity(
    generated_cobol: str,
    reference_output_path: Path,
    container_name: str = "gnucobol-gold-standard",
    timeout_seconds: int = 30,
) -> FidelityResult:
    """
    Compiles and executes generated COBOL inside a locked Docker container,
    then compares the stdout hash against the pre-computed reference hash.
    The container image is version-pinned to prevent drift in compiler behavior.
    """
    program_id = hashlib.sha256(generated_cobol.encode()).hexdigest()[:12]
    result = FidelityResult(program_id=program_id, compiled=False, behavioral_match=False)

    # Write generated COBOL to a temp file accessible by the container
    source_path = Path(f"/tmp/cobol_eval/{program_id}.cbl")
    source_path.parent.mkdir(parents=True, exist_ok=True)
    source_path.write_text(generated_cobol)

    # Stage 1: Compilation inside the Gold Standard container
    compile_cmd = [
        "docker", "exec", container_name,
        "cobc", "-x", "-free",           # -free enables free-format COBOL source
        f"/workspace/{program_id}.cbl",
        "-o", f"/workspace/{program_id}",
    ]
    # Copy source into the container before compiling
    subprocess.run(
        ["docker", "cp", str(source_path), f"{container_name}:/workspace/{program_id}.cbl"],
        check=True, capture_output=True
    )
    compile_proc = subprocess.run(compile_cmd, capture_output=True, text=True, timeout=timeout_seconds)

    if compile_proc.returncode != 0:
        result.error_log = compile_proc.stderr.strip().splitlines()
        return result  # Compilation failure; no behavioral check possible

    result.compiled = True

    # Stage 2: Execute and capture output for behavioral comparison
    run_cmd = ["docker", "exec", container_name, f"/workspace/{program_id}"]
    try:
        run_proc = subprocess.run(run_cmd, capture_output=True, text=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # A hung or infinitely looping program must not crash the harness
        result.error_log.append(f"Execution timed out after {timeout_seconds}s")
        return result

    generated_hash = hashlib.sha256(run_proc.stdout.encode()).hexdigest()
    reference_hash = reference_output_path.read_text().strip()

    result.output_hash_generated = generated_hash
    result.output_hash_reference = reference_hash
    result.behavioral_match = (generated_hash == reference_hash)

    return result

Technical Warning: The Docker image used as the Gold Standard must be version-pinned by digest, not tag. A latest tag on your GnuCOBOL image introduces compiler version drift, which silently invalidates historical benchmark comparisons. Use docker pull gnucobol:3.2@sha256:<digest> and record the digest in your reproducibility manifest.

The Failure of LLM-as-a-Judge for COBOL Verification

LLM-as-a-judge pipelines—where a second model scores the output of the first—are a pragmatic shortcut for domains where ground truth is expensive to compute. COBOL is not that domain. A compiler is cheaper to invoke than a judge model and is deterministically correct.

The correlation between LLM judge scores and actual compiler execution results is poor for legacy code. The failure modes are distinct and non-overlapping:

| Error Category | LLM-as-a-Judge Detection | Compiler Detection | Example |
| --- | --- | --- | --- |
| Compilation Error | Inconsistent (30–60% detection) | 100% deterministic | Missing STOP RUN, malformed FD entry |
| Logic Mismatch | Near-zero (<5% detection) | Requires test oracle | Incorrect COMPUTE rounding, wrong PERFORM UNTIL boundary |
| Syntactic Hallucination | Partial (50–70% detection) | 100% deterministic | Invented reserved words (e.g., PROCESS-LOOP), wrong division ordering |
| Runtime State Error | 0% detection | Execution-time only | File not opened before READ, subscript out of range |

The fundamental problem: LLM judges evaluate plausibility, not correctness. In domains with high training data representation (Python, SQL), plausibility and correctness correlate reasonably. In COBOL, where the judge's own training exposure is thin, it will confidently score structurally broken code as acceptable because the code reads like COBOL. Heuristic evaluation cannot replace deterministic compiler-guided validation—this is not a limitation to engineer around; it is a category error in evaluation design.
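The plausibility/correctness gap is easy to demonstrate with a deliberately crude stand-in for a judge, one that scores surface "COBOL-ness" by keyword density (purely illustrative; real judge models are far more sophisticated, but they fail in the same direction):

```python
COBOL_KEYWORDS = {"IDENTIFICATION", "DIVISION", "MOVE", "PERFORM", "COMPUTE", "STOP", "RUN"}

def plausibility_score(source: str) -> float:
    """Fraction of tokens that are recognizable COBOL keywords."""
    tokens = source.replace(".", " ").split()
    if not tokens:
        return 0.0
    return sum(t.upper() in COBOL_KEYWORDS for t in tokens) / len(tokens)

# Structurally broken: PERFORM targets a paragraph that does not exist,
# and no DATA DIVISION declares the items being moved.
broken = "IDENTIFICATION DIVISION. PERFORM CALC-NET. MOVE X TO Y. STOP RUN."
score = plausibility_score(broken)  # high, despite being uncompilable
```

A surface scorer rewards code that reads like COBOL; a compiler rejects it in milliseconds. The gap between those two verdicts is the entire argument for deterministic validation.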


Architecting the COBOL-Coder Pipeline: Compiler-Guided Curation

The COBOL-Coder pipeline attacks the training data problem directly. Raw COBOL scraped from public repositories is low-quality by default: it includes incomplete programs, non-compiling fragments, dialect-specific extensions, and COBOL that compiles but encodes wrong business logic. Indiscriminate scraping produces a training set that teaches a model to produce statistically typical COBOL—which is frequently broken.

The multi-stage pipeline enforces quality at each stage:

  1. Compiler Gate (Stage 1): Every candidate program is submitted to GnuCOBOL. Non-compiling programs are rejected outright. This immediately eliminates the majority of scraped data.
  2. Similarity Deduplication (Stage 2): Near-duplicate programs (common in enterprise repositories where template programs proliferate) are clustered and reduced to a single canonical representative. This prevents the model from memorizing boilerplate at the expense of structural diversity.
  3. Behavioral Oracle Validation (Stage 3): Programs that compile are executed against a synthetic test harness. Programs that produce undefined behavior, abnormal termination, or null output are rejected.
  4. Difficulty-Stratified Retention (Stage 4): The remaining programs are ranked by structural complexity (division count, nested PERFORM depth, file operation count) and sampled to maintain difficulty distribution in the final training set.
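The four stages compose naturally as a filter chain. The sketch below is schematic: the `compiles` and `passes_oracle` predicates stand in for the real compiler gate and behavioral oracle, which are external processes, and Stage 2 uses exact-match deduplication where the real pipeline would cluster by similarity:

```python
import hashlib
from typing import Callable, List

def curate(corpus: List[str],
           compiles: Callable[[str], bool],
           passes_oracle: Callable[[str], bool]) -> List[str]:
    # Stage 1: compiler gate drops anything that does not compile
    survivors = [p for p in corpus if compiles(p)]
    # Stage 2: duplicate reduction (exact match on whitespace-normalized
    # text here; production pipelines cluster near-duplicates instead)
    seen, deduped = set(), []
    for p in survivors:
        key = hashlib.sha256(" ".join(p.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            deduped.append(p)
    # Stage 3: behavioral oracle rejects abnormal or null-output programs
    validated = [p for p in deduped if passes_oracle(p)]
    # Stage 4 (difficulty-stratified retention) would rank and sample here
    return validated
```

The ordering matters: the compiler gate is the cheapest filter and runs first, so the expensive behavioral oracle only ever sees code that is already structurally valid.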

The fine-tuning loss function is modified to impose asymmetric penalties. Let $\mathcal{L}_{CE}$ denote standard cross-entropy loss over token sequence $\mathbf{y} = (y_1, \ldots, y_T)$ given context $\mathbf{x}$. The compiler-guided loss introduces a per-sequence weight $w_c$ derived from compiler feedback:

$$\mathcal{L}_{COBOL} = \frac{1}{N} \sum_{i=1}^{N} w_{c}^{(i)} \cdot \mathcal{L}_{CE}(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})$$

where:

$$w_{c}^{(i)} = \begin{cases} \alpha & \text{if program } i \text{ compiles and passes behavioral oracle} \\ \beta & \text{if program } i \text{ compiles but fails behavioral oracle} \\ \gamma & \text{if program } i \text{ fails to compile} \end{cases}$$

with $\alpha > \beta > \gamma$, typically $\alpha = 1.5$, $\beta = 1.0$, $\gamma = 0.3$. The $\gamma$ weight does not zero out failing examples entirely—some structural signal exists even in non-compiling fragments—but it substantially reduces their gradient contribution. This feedback loop requires a GnuCOBOL compilation environment integrated into the training data pipeline, not just the evaluation pipeline.
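In framework-agnostic terms (a training stack would apply the same weights to per-sequence cross-entropy tensors), the weighting reduces to a few lines:

```python
ALPHA, BETA, GAMMA = 1.5, 1.0, 0.3  # weights from the text

def sequence_weight(compiled: bool, passed_oracle: bool) -> float:
    """Per-sequence weight w_c derived from compiler/oracle feedback."""
    if not compiled:
        return GAMMA
    return ALPHA if passed_oracle else BETA

def compiler_guided_loss(ce_losses, outcomes) -> float:
    """Mean of w_c * CE over a batch; `outcomes` is a list of
    (compiled, passed_oracle) flags, one per sequence."""
    weighted = [sequence_weight(c, p) * l for l, (c, p) in zip(ce_losses, outcomes)]
    return sum(weighted) / len(weighted)
```

A sequence that compiles and passes the oracle contributes five times the gradient signal of a non-compiling one (1.5 vs 0.3), which biases the model toward structurally valid output without discarding the failing data entirely.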

Setting Up the Gold Standard Mainframe Environment

The Gold Standard environment is the single source of truth for all behavioral verification. It must be reproducible, version-locked, and isolated from host system library drift. Crucially, the GnuCOBOL compiler (invoked as cobc) functions as the definitive validator within this environment, acting as the mandatory compiler gate through which all synthesized code must pass to ensure structural correctness.

#!/usr/bin/env bash
# provision_gold_standard.sh
# Provisions the GnuCOBOL Gold Standard container.
# The image digest MUST be recorded and committed to your reproducibility manifest.
# Do not substitute the version tag without a full benchmark re-run.
# Dependencies: the GnuCOBOL compiler suite is built from source inside the image.

set -euo pipefail

CONTAINER_NAME="gnucobol-gold-standard"
# The base image is declared in the Dockerfile heredoc below; pin it by
# digest (e.g., ubuntu:22.04@sha256:<digest>) before production use
WORKSPACE_DIR="$(pwd)/cobol_workspace"
GNUCOBOL_VERSION="3.2.0"

mkdir -p "${WORKSPACE_DIR}"

# Build a deterministic image with GnuCOBOL from source
docker build --no-cache -t "${CONTAINER_NAME}:${GNUCOBOL_VERSION}" - <<'DOCKERFILE'
FROM ubuntu:22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV GNUCOBOL_VERSION=3.2.0

# Install build dependencies for GnuCOBOL
RUN apt-get update && apt-get install -y \
    build-essential \
    libgmp-dev \
    libdb-dev \
    libncurses-dev \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Download and compile GnuCOBOL from source to guarantee exact version
RUN wget -q "https://ftp.gnu.org/gnu/gnucobol/gnucobol-${GNUCOBOL_VERSION}.tar.xz" \
    && tar xf "gnucobol-${GNUCOBOL_VERSION}.tar.xz" \
    && cd "gnucobol-${GNUCOBOL_VERSION}" \
    && ./configure --prefix=/usr/local \
    && make -j"$(nproc)" \
    && make install \
    && ldconfig \
    && cd / && rm -rf "gnucobol-${GNUCOBOL_VERSION}"*

# Verify installation and record compiler identity
RUN cobc --version > /etc/gnucobol-version.txt && cat /etc/gnucobol-version.txt

WORKDIR /workspace
DOCKERFILE

# Start the persistent container used as an execution oracle
docker rm -f "${CONTAINER_NAME}" 2>/dev/null || true
docker run -d \
    --name "${CONTAINER_NAME}" \
    --memory="2g" \
    --cpus="2" \
    -v "${WORKSPACE_DIR}:/workspace" \
    --restart=unless-stopped \
    "${CONTAINER_NAME}:${GNUCOBOL_VERSION}" \
    tail -f /dev/null   # Keep container alive for exec calls

# Record the image digest for the reproducibility manifest
IMAGE_DIGEST=$(docker inspect --format='{{index .RepoDigests 0}}' "${CONTAINER_NAME}:${GNUCOBOL_VERSION}" 2>/dev/null || echo "local-build-no-remote-digest")
echo "GOLD_STANDARD_DIGEST=${IMAGE_DIGEST}" >> .env.gold_standard
echo "GOLD_STANDARD_COBOL_VERSION=${GNUCOBOL_VERSION}" >> .env.gold_standard
echo "Gold Standard provisioned. Digest recorded in .env.gold_standard"

Pro-Tip: Run the Gold Standard container on dedicated hardware with locked CPU frequency scaling (cpupower frequency-set -g performance). Timing-sensitive COBOL programs with ACCEPT from system clock can produce non-deterministic outputs if the host scheduler introduces variance between reference and generated program execution runs.


Quantitative Results: COBOL-Coder vs. SOTA General LLMs

The performance delta between COBOL-Coder and general-purpose models is not marginal—it represents a different tier of operational viability. COBOL-Coder achieves a 73.95% compilation success rate versus GPT-4o's 41.8%, a 32-percentage-point gap. More telling is the Pass@1 metric on COBOLEval: COBOL-Coder scores 49.33 against GPT-4o's 16.4. Pass@1 measures whether a single sample solves the problem correctly—this is the metric that matters for automated pipelines, where you cannot retry indefinitely.
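Pass@1 figures of this kind are conventionally computed with the unbiased estimator introduced for HumanEval-style benchmarks (an assumption here; the paper's exact protocol may differ). For reference, it generalizes to pass@k as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn from n total of which c are correct, solves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k=1 the estimator reduces to the raw success fraction c/n, which is why pass@1 is the right metric for automated pipelines that cannot afford to resample indefinitely.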

xychart-beta
    title "COBOL-Coder vs GPT-4o Performance by Complexity"
    x-axis ["Simple", "Intermediate", "Complex"]
    y-axis "Success Rate (%)" 0 --> 100
    %% First series: COBOL-Coder; second series: GPT-4o
    bar [92.1, 75.4, 48.2]
    bar [65.2, 38.5, 15.1]

The visualization above highlights performance degradation across COBOLEval complexity tiers. As structural demands increase (nested logic, data dependencies), COBOL-Coder maintains significantly higher reliability than GPT-4o, which exhibits a sharp decline in synthesis quality as program depth increases.

Handling Edge-Case Logic in Legacy Business Rules

Legacy COBOL encodes decades of business rule accumulation. A single payroll program may contain pension deduction logic from four different regulatory eras, each conditioned on hire-date ranges that themselves reference lookup tables defined in copybooks. Translating this in a single LLM context window is not feasible for programs exceeding ~800 lines due to token window constraints. Long-running translation tasks require stateful orchestration.

LangGraph's state graph model maps directly to this problem: each node represents a translation substep, and the state object carries the partially translated program, the list of unresolved copybook dependencies, and the behavioral fidelity score accumulated so far.

from langgraph.graph import StateGraph, END
from typing import TypedDict, List, Optional
from pathlib import Path

# State schema persisted across translation steps.
# All fields must be JSON-serializable for checkpoint recovery.
class COBOLMigrationState(TypedDict):
    program_id: str
    source_cobol: str
    divisions_completed: List[str]       # Tracks which COBOL divisions have been translated
    translated_segments: dict            # Maps division name -> translated output
    unresolved_copybooks: List[str]      # COPY statements not yet resolved
    fidelity_scores: List[float]         # Per-segment behavioral fidelity scores
    current_division: Optional[str]
    error_log: List[str]
    complete: bool

def extract_division(state: COBOLMigrationState) -> COBOLMigrationState:
    """Identify the next untranslated COBOL division to process."""
    all_divisions = ["IDENTIFICATION", "ENVIRONMENT", "DATA", "PROCEDURE"]
    remaining = [d for d in all_divisions if d not in state["divisions_completed"]]
    if not remaining:
        return {**state, "complete": True}
    return {**state, "current_division": remaining[0]}

def translate_division(state: COBOLMigrationState) -> COBOLMigrationState:
    """
    Translate a single COBOL division using the domain-specific model.
    Isolating by division keeps each LLM call within token budget
    and makes partial failures recoverable without full restart.
    """
    division = state["current_division"]
    # In production: call COBOL-Coder API with division-scoped prompt
    # Placeholder illustrates the state mutation pattern
    translated = f"-- Translated {division} DIVISION content --"
    updated_segments = {**state["translated_segments"], division: translated}
    updated_completed = state["divisions_completed"] + [division]
    return {**state, "translated_segments": updated_segments, "divisions_completed": updated_completed}

def verify_segment(state: COBOLMigrationState) -> COBOLMigrationState:
    """
    Run behavioral fidelity check on the accumulated translated output.
    Score is appended to history; low scores trigger re-planning (not shown).
    """
    # Assemble translated program to date and run fidelity check
    current_program = "\n".join(state["translated_segments"].values())
    # Simplified: in production calls compute_behavioral_fidelity()
    score = 0.85  # Placeholder; replace with actual harness invocation
    return {**state, "fidelity_scores": state["fidelity_scores"] + [score]}

def should_continue(state: COBOLMigrationState) -> str:
    """Route to END only when all divisions are translated."""
    return END if state["complete"] else "extract_division"

# Build the state graph
graph = StateGraph(COBOLMigrationState)
graph.add_node("extract_division", extract_division)
graph.add_node("translate_division", translate_division)
graph.add_node("verify_segment", verify_segment)

graph.set_entry_point("extract_division")
# Route straight to END once extract_division marks the state complete;
# an unconditional edge here would translate the final division twice.
graph.add_conditional_edges(
    "extract_division",
    lambda s: END if s["complete"] else "translate_division",
    {"translate_division": "translate_division", END: END},
)
graph.add_edge("translate_division", "verify_segment")
graph.add_conditional_edges("verify_segment", should_continue, {
    "extract_division": "extract_division",
    END: END
})

migration_graph = graph.compile(checkpointer=None)  # Attach MemorySaver or SqliteSaver for persistence

# Initialize and invoke with a checkpoint config for resumability
initial_state = COBOLMigrationState(
    program_id="payroll_v4_2003",
    source_cobol=Path("payroll_v4_2003.cbl").read_text(),
    divisions_completed=[],
    translated_segments={},
    unresolved_copybooks=["PENSION-CALC", "TAX-TABLE-2019"],
    fidelity_scores=[],
    current_division=None,
    error_log=[],
    complete=False,
)
config = {"configurable": {"thread_id": initial_state["program_id"]}}
final_state = migration_graph.invoke(initial_state, config=config)

Pro-Tip: Serialize COBOLMigrationState to SQLite via LangGraph's SqliteSaver checkpointer. This provides crash-recovery for multi-hour translation jobs without re-invoking completed divisions, which typically account for 60–70% of total API cost in large program migrations.


Strategic Implications for CTOs and Engineering Leadership

The upfront cost of establishing a Gold Standard simulation environment is real: containerization setup, compiler sourcing, test corpus construction, and pipeline integration typically represent 4–8 weeks of senior engineering time. The operational economics justify this investment within the first migration program.

| Factor | Manual Refactoring | LLM (General-Purpose, No Simulator) | COBOL-Coder + Behavioral Fidelity Pipeline |
| --- | --- | --- | --- |
| Setup Cost | Low | Low-Medium | High (4–8 weeks) |
| Per-Program Translation Cost | $15,000–$40,000 (labor) | $500–$2,000 (review-heavy) | $200–$800 (automated validation) |
| Defect Escape Rate | 5–15% (expert-dependent) | 30–60% (no behavioral check) | <8% (oracle-verified) |
| Throughput (programs/month) | 10–30 | 50–150 (with rework) | 200–500 (with pipeline) |
| Mainframe Decommission Risk | Low | High | Low |
| Audit Trail Quality | Manual documentation | Incomplete | Full execution trace per program |

Automated simulation-backed migration reduces technical debt rework cycles directly: defects caught at the behavioral fidelity stage cost a fraction of defects discovered in user acceptance testing or production. The break-even point for the simulator investment, at scale, is typically 50–80 programs—a threshold most meaningful migration programs cross within the first quarter of execution.
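The break-even arithmetic is simple to sanity-check against the table's own figures. The per-program costs below are table midpoints; the senior-engineering rate and six-week duration are illustrative assumptions, not numbers from the text:

```python
def break_even_programs(setup_cost: float,
                        cost_alternative_per_program: float,
                        cost_pipeline_per_program: float) -> float:
    """Number of migrated programs at which pipeline savings repay setup."""
    savings_per_program = cost_alternative_per_program - cost_pipeline_per_program
    return setup_cost / savings_per_program

# Assumed: 6 weeks of senior engineering at $8,000/week for simulator setup.
# Comparing against the general-purpose-LLM alternative: midpoints of
# $1,250 (review-heavy LLM) vs $500 (pipeline) per program.
setup = 6 * 8_000
programs = break_even_programs(setup, 1_250, 500)  # lands in the 50-80 range
```

Under these assumptions the break-even falls at 64 programs, consistent with the 50–80 range quoted above; defect-escape costs avoided in UAT and production would pull it lower still.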

The critical strategic constraint is organizational: the Gold Standard environment must be treated as production infrastructure, not a development tool. It requires version governance, access controls, and change management equivalent to the mainframe it is replacing.


Future Trajectory of Domain-Specialized Code Agents

COBOL-Coder demonstrates a repeatable methodology, not a one-off solution. The compiler-guided curation pipeline is language-agnostic in its architecture; the language-specific components are the compiler binary, the dialect grammar, and the behavioral test corpus. This generalizes.

Roadmap for scaling compiler-guided curation to other legacy enterprise languages:

  1. PL/I (immediate priority): IBM's PL/I runtime remains entrenched in insurance and government mainframes. GnuPL/I and IBM's open compiler tooling provide the compiler gate. The structural challenge is PL/I's free-form block structure and pointer arithmetic—more complex than COBOL but tractable with the same pipeline architecture.

  2. RPG IV (AS/400 base): IBM iSeries installations number in the hundreds of thousands globally. RPG's fixed-format legacy and free-format modern dialect require dialect-aware tokenization before compiler gating. Priority for financial and manufacturing sectors.

  3. Natural/ADABAS: Software AG's Natural language powers a substantial fraction of German banking infrastructure. No open-source compiler exists, requiring either commercial licensing or emulator-based behavioral verification. Highest barrier to entry; highest enterprise value.

  4. Fortran 77 (scientific computing): NCAR, NASA, and DOE facilities maintain critical Fortran 77 simulation codebases. GFortran provides a free compiler gate. The behavioral fidelity challenge is floating-point reproducibility across architectures—requiring fixed-seed deterministic test inputs.

  5. Generalized Domain-Specific Agent Framework: The terminal evolution is a compiler-guided curation framework that accepts a compiler binary, a grammar specification, and a seed code corpus, and produces a fine-tuning dataset automatically. COBOL-Coder is the existence proof; the framework is the product.

The constraint for each step is identical: behavioral verification requires an execution oracle. Any legacy language without an accessible compiler or deterministic simulator will block the pipeline at Stage 1. For those languages, the investment must go into building the oracle before fine-tuning can begin. This is a hard dependency, not a configuration option.

Code synthesis for legacy systems is not a solved problem after COBOL-Coder—it is a demonstrated methodology. The 73.95% compilation rate is a floor to build from, not a ceiling to accept.


Keywords: COBOL-Coder, Behavioral Fidelity, Mainframe Simulation, Compiler-Guided Curation, COBOLEval, Legacy Code Migration, GnuCOBOL, Docker-based Validation, Token Window Constraints, Code Synthesis Benchmarks