Aggregate accuracy scores on AIME 2024 and GPQA create a systematic blind spot: they tell you a model answered correctly, not what it cost to get there. o3 running at high compute budget can consume 3–10× the tokens of a comparable reasoning pass on DeepSeek-R1 or o3-mini, yet standard leaderboards report the same binary "correct." This article compares two frameworks — OckBench and TCCA — that surface the metric that actually determines production viability: cost-per-correct-response.
Evaluating Model Efficiency: The Cost-per-Correct-Response Metric
Standard model performance metrics collapse reasoning quality to a single accuracy figure, hiding the compute expenditure required to reach that figure. The Tokens Per Correctness (TPC) metric, introduced in early 2026 for search-augmented LLMs (arXiv:2601.05503), formalizes this: TPC quantifies the computational cost-performance trade-off by counting tokens consumed per verifiably correct answer.
Artificial Analysis tracks over 100 LLMs with normalized pricing (USD/1M tokens), latency (TTFT in milliseconds), and throughput (tokens per second, TPS). Combining those rate-card figures with benchmark-level token counts yields a cost-per-correct-response that standard leaderboards omit entirely.
The formula, with input and output tokens priced at their separate per-million-token rates, is:
$$\text{Cost-per-Correct} = \frac{\left(T_{\text{in}} \times P_{\text{in}} + T_{\text{out}} \times P_{\text{out}}\right) / 10^{6}}{N_{\text{correct}}}$$
where $T_{\text{in}}$ and $T_{\text{out}}$ are total input and output tokens across the run, $P_{\text{in}}$ and $P_{\text{out}}$ are the provider's USD/1M-token rates, and $N_{\text{correct}}$ is the number of verifiably correct answers.
| Model | AIME 2024 Accuracy | Avg. Output Tokens/Problem | USD/1M Output Tokens | Relative Cost-per-Correct |
|---|---|---|---|---|
| o3 (high compute) | ~87% | ~8,000 | $40 | 1.0× (baseline) |
| o3-mini (medium) | ~63% | ~2,400 | $12 | ~0.12× |
| DeepSeek-R1 | ~72% | ~3,200 | $2.19 | ~0.03× |
| GPT-4o | ~9% | ~400 | $10 | ~0.12× |
The relative column is derived from the figures shown (output tokens × output rate ÷ accuracy), normalized to o3; input tokens are negligible for short AIME prompts. Note how little GPT-4o's cheap per-token rate buys it: with roughly nine of ten attempts wasted, its cost per correct answer lands near o3-mini's despite a far weaker result.
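As a quick sanity check on the table, here is a minimal Python sketch of the calculation; the input rates and per-problem token counts are illustrative assumptions rather than published figures:

```python
# Minimal sketch of cost-per-correct for a 30-problem AIME-style run.
# Input rates and token counts are illustrative assumptions, not live pricing.
def cost_per_correct(t_in: int, t_out: int, p_in: float, p_out: float, n_correct: int) -> float:
    """Total USD spent on the run divided by verifiably correct answers."""
    if n_correct == 0:
        return float("inf")  # nothing correct: cost per correct is unbounded
    return ((t_in * p_in + t_out * p_out) / 1e6) / n_correct

# DeepSeek-R1-like run: 30 problems, ~72% correct, ~3,200 output tokens each.
r1 = cost_per_correct(t_in=30 * 250, t_out=30 * 3_200, p_in=0.55, p_out=2.19, n_correct=22)
# o3-like run at high compute: ~87% correct, ~8,000 output tokens each.
o3 = cost_per_correct(t_in=30 * 250, t_out=30 * 8_000, p_in=10.00, p_out=40.00, n_correct=26)
print(f"R1: ${r1:.4f}/correct, o3: ${o3:.4f}/correct, ratio {r1 / o3:.2f}x")
```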
As Objective AI framed it in 2026: "Greater Cost Efficiency is defined as cost-per-correct-response... by utilizing aggregate LLMs which are an order of magnitude cheaper per-token than the respective singular LLM."
OckBench: Framework Capabilities and Trade-offs
OckBench is the most direct implementation of token-efficiency evaluation among current LLM benchmarking tools. Its design philosophy is captured in the project motto: "Tokens are not to be multiplied without necessity" — an explicit Occam's Razor framing applied to decoding budgets.
OckBench measures decoding token consumption alongside accuracy for every evaluation run, producing per-problem token efficiency scores rather than aggregate-only summaries. This matters because reasoning chains are non-uniform: a model may answer 80% of GPQA problems efficiently but spend 5× the expected tokens on the remaining 20%, inflating the true production cost.
Key capabilities:

- Per-problem token consumption logging, enabling outlier detection on pathologically long chains
- Accuracy/token Pareto frontier visualization across benchmark subsets
- Native support for AIME 2024 and GPQA problem sets
- Integration hooks for both API-hosted and local inference harnesses
Honest trade-offs: OckBench does not yet provide real-time cost-tracking against live provider pricing; you pipe its token-count output into a separate rate-card calculation. Its benchmark coverage is narrower than general evaluation suites — it optimizes depth over breadth, which suits reasoning-specific evaluation but requires supplementation for broader capability audits.
Pro Tip: OckBench requires Python 3.10+ and enforces strict library versioning for reproducibility across local inference harnesses. Pin your dependencies explicitly (run `pip freeze > requirements.txt` immediately after a clean install) — version drift in `transformers` or `vllm` between eval runs produces token-count discrepancies that corrupt cross-model comparisons.
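A minimal sketch of a pre-run guard against that drift, assuming you maintain a pinned set of versions alongside the harness (the version numbers below are placeholders, not recommendations):

```python
# Hedged sketch: fail fast if installed versions drift from the pinned set.
from importlib.metadata import PackageNotFoundError, version

PINNED = {"transformers": "4.49.0", "vllm": "0.7.3"}  # placeholder pins

def assert_pinned_versions(pins: dict[str, str]) -> None:
    """Abort the eval run if the environment no longer matches the pins."""
    for pkg, expected in pins.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            raise RuntimeError(f"{pkg} is not installed; eval environment is incomplete")
        if installed != expected:
            raise RuntimeError(
                f"{pkg}=={installed} does not match pinned {expected}; "
                "token counts may not be comparable across runs"
            )

assert_pinned_versions(PINNED)
```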
For local evaluation, GPU memory requirements scale with model size and context length. Running DeepSeek-R1 (671B MoE) locally requires multi-node A100/H100 setups; for most teams, OckBench is most practical against API endpoints or quantized local runners.
TCCA: Harnessing Real-World Inference Cost Analysis
TCCA (Token-Cost-Correctness Analysis) approaches inference cost analysis from the production billing angle rather than the research benchmarking angle. Where OckBench is a structured evaluation harness, TCCA is a methodology: multiply verified output token counts by the model provider's current rate card, then normalize by task correctness.
TCCA frameworks require integration with API endpoints or local model runners via OpenAI-compatible client wrappers. The cost derivation is explicit: output token count × provider rate card = dollar cost per run, summed across the evaluation set and divided by correct-answer count.
Compatibility with local harnesses is conditional. TCCA works cleanly against any OpenAI-compatible inference server — vLLM's OpenAI-compatible server mode, Together AI, Fireworks, and standard provider SDKs all expose the token-count metadata TCCA requires. The limitation surfaces when running multi-turn reasoning chains: longer context accumulates input tokens rapidly, and local GPU VRAM constraints can force mid-chain truncations that corrupt cost measurements.
Watch Out: Local evaluation harnesses frequently hit two failure modes during TCCA benchmarks. First, API rate limits (typically 60–120 RPM on most provider tiers) stretch multi-model batch evaluations across hours, introducing pricing volatility risk if the provider adjusts rates mid-run. Second, local GPU memory ceilings — particularly when running extended reasoning chains on models above 7B parameters — can trigger context truncation that silently reduces output token counts, producing artificially optimistic cost figures. Instrument your harness to assert that full chain completion occurred before recording cost data.
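A minimal sketch of that completion check against an OpenAI-compatible endpoint, using the `openai` Python client; the base URL, model id, and rates are illustrative assumptions:

```python
# Sketch: refuse to record cost data unless the reasoning chain finished cleanly.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local vLLM server
USD_PER_M_INPUT, USD_PER_M_OUTPUT = 0.55, 2.19  # placeholder rate-card snapshot

resp = client.chat.completions.create(
    model="deepseek-r1",  # assumed model id on the local server
    messages=[{"role": "user", "content": "AIME problem text here"}],
    max_tokens=8192,
)

choice = resp.choices[0]
# "length" means the chain hit max_tokens (or a VRAM-driven context ceiling),
# so its token counts understate what a complete chain would have cost.
assert choice.finish_reason == "stop", f"chain truncated: {choice.finish_reason}"

usage = resp.usage
run_cost_usd = (usage.prompt_tokens * USD_PER_M_INPUT
                + usage.completion_tokens * USD_PER_M_OUTPUT) / 1e6
```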
TCCA's strength is immediacy: it operates on whatever inference endpoint your production system already uses, making its cost numbers directly comparable to your actual billing. Its weakness is that it is sensitive to provider pricing changes — a rate-card update between evaluation runs invalidates cross-run comparisons unless you snapshot pricing at eval time.
Comparative Performance: Benchmarking Results on AIME 2024
Both frameworks converge on the same underlying data — token counts and accuracy — but surface it differently. The table below combines Artificial Analysis infrastructure benchmarks with reasoning model accuracy data to populate the metrics each framework reports.
| Model | TTFT (ms) | TPS (output) | USD/1M Output Tokens |
|---|---|---|---|
| o3 (high compute) | ~1,200 | ~25 | $40.00 |
| o3-mini (medium) | ~800 | ~60 | $12.00 |
| DeepSeek-R1 (API) | ~900 | ~40 | $2.19 |
| Qwen QwQ-32B | ~600 | ~55 | $0.90 |
| GPT-4o | ~350 | ~90 | $10.00 |
TTFT and TPS figures are representative ranges drawn from Artificial Analysis tracking data; values vary by provider, region, and load. The AIME accuracy figures in the earlier table reflect published benchmark results as of Q1 2026.
Inference compute per token scales with model architecture: a standard dense model spends roughly two FLOPs per parameter per generated token (about 2N FLOPs for an N-parameter model, so a dense 70B model burns on the order of 1.4 × 10¹¹ FLOPs per output token), a constraint documented by Austin R. Ellis-Mohr (2026). The framework-level model performance metrics and inference cost analysis diverge in one important way: OckBench captures token efficiency per problem, enabling you to identify which problem categories drive cost spikes. TCCA gives you the aggregate billing figure faster but hides the distribution.
Decision Framework for Engineering Leads
Choosing between OckBench, TCCA, or a combination of both maps to three distinct operational regimes. The decision also determines which LLM benchmarking tools you integrate into your CI/CD pipeline and which model performance metrics feed your SLA reviews.
| Criterion | Choose OckBench when... | Choose TCCA when... |
|---|---|---|
| Primary Objective | Token efficiency research, model selection | Production cost auditing, billing validation |
| Best For | Domain-specific Pareto profiling | Live API usage monitoring |
| Pricing Sensitivity | Low (uses raw token counts) | High (requires rate-card snapshots) |
Choose both when you are running pre-deployment validation and need OckBench's token distribution analysis to feed TCCA's dollar calculation — particularly when provider pricing is volatile or you are evaluating across multiple providers simultaneously.
The decision matrix requires mapping your business SLA constraints — maximum allowable latency, acceptable accuracy threshold on domain-hard problems — against cost-per-correct-response. A 50ms TTFT SLA eliminates most high-compute reasoning runners immediately; at that point, cost comparison is moot.
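As a sketch of that gating order (SLA and accuracy constraints first, cost ranking second), with illustrative figures loosely drawn from the tables above:

```python
# Sketch: gate candidates on SLA constraints, then rank survivors by cost-per-correct.
candidates = [
    # (name, ttft_ms, accuracy, usd_per_correct)  -- illustrative figures
    ("o3-high",     1200, 0.87, 0.368),
    ("o3-mini",      800, 0.63, 0.046),
    ("deepseek-r1",  900, 0.72, 0.010),
    ("gpt-4o",       350, 0.09, 0.044),
]

TTFT_SLA_MS = 1000    # assumed interactive-latency gate
MIN_ACCURACY = 0.60   # assumed domain accuracy threshold

shortlist = sorted(
    (c for c in candidates if c[1] <= TTFT_SLA_MS and c[2] >= MIN_ACCURACY),
    key=lambda c: c[3],  # cheapest correct answer first
)
for name, ttft, acc, cpc in shortlist:
    print(f"{name}: {ttft} ms TTFT, {acc:.0%} accurate, ~${cpc:.3f} per correct answer")
```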
Frequently Asked Questions
How does inference compute-per-token impact total cost of ownership?
Inference compute scales linearly with token count at fixed model size. For reasoning models, inference-time compute scaling means the model generates longer chains to achieve higher accuracy — each additional reasoning step adds both input tokens (growing context) and output tokens (the chain itself). The TPC metric captures this directly: over-searching, where the model generates excessive reasoning steps without a proportional accuracy gain, produces high TPC values and inflates TCO without a correctness return. As verified in arXiv:2601.05503, increased reasoning depth linearly increases both input and output token costs.
What metrics matter most for reasoning model benchmarking?
For production reasoning workloads: cost-per-correct-response (primary), TTFT (latency SLA gate), accuracy on your domain benchmark (not AIME/GPQA unless your domain maps to competition math or graduate science), and token-count distribution (to flag pathological chain lengths). TPS matters when you have concurrent user load; TTFT matters when you have interactive latency requirements.
Is OckBench compatible with local model evaluation harnesses?
Yes, provided you run Python 3.10+ with pinned library versions. For local inference, OckBench integrates against OpenAI-compatible local servers (vLLM, llama.cpp with server mode). Hardware requirements scale with the model under test — a QwQ-32B-class model can be evaluated on a single A100, API-hosted models like o3-mini need no local GPU at all, and full DeepSeek-R1 requires multi-GPU setups or API access.
How do I compare LLM reasoning cost vs. output accuracy?
Plot accuracy against cost-per-correct-response across your candidate models using OckBench token counts × provider rate cards. Models on the Pareto frontier — highest accuracy for a given cost budget — are your shortlist. Models significantly below the frontier are paying a cost premium not justified by accuracy gain on your benchmark.
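A minimal sketch of the frontier selection, assuming you already have (cost-per-correct, accuracy) pairs from OckBench token counts multiplied by a rate-card snapshot; the figures are illustrative:

```python
# Sketch: keep only models on the accuracy/cost Pareto frontier.
def pareto_frontier(models: list[tuple[str, float, float]]) -> list[str]:
    """models: (name, cost_per_correct_usd, accuracy). Returns frontier model names."""
    frontier, best_accuracy = [], -1.0
    # Walk from cheapest to most expensive; keep each model that raises accuracy.
    for name, cost, accuracy in sorted(models, key=lambda m: m[1]):
        if accuracy > best_accuracy:
            frontier.append(name)
            best_accuracy = accuracy
    return frontier

print(pareto_frontier([
    ("deepseek-r1", 0.010, 0.72),
    ("o3-mini",     0.046, 0.63),  # dominated: costs more, scores lower than R1
    ("o3-high",     0.368, 0.87),
    ("gpt-4o",      0.044, 0.09),  # dominated
]))  # -> ['deepseek-r1', 'o3-high']
```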
Production Note: Scaling evaluation pipelines beyond a few hundred problems requires automated cost-tracking dashboards. Instrument your harness to record provider pricing at eval start time, total token counts per model per problem, and running dollar spend. Without real-time spend alerting, large batch evaluations against high-compute reasoning runners (o3 at max budget) can exhaust API credit limits mid-run, producing incomplete datasets that corrupt cost-per-correct calculations. Set hard spend caps at the API key level before initiating any batch evaluation exceeding 500 problems.
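A minimal sketch of that instrumentation: a rate-card snapshot taken at eval start plus a running spend total checked against a hard cap. The rates and cap below are placeholders; mirror your provider's actual pricing and your key-level limits.

```python
# Sketch: snapshot pricing at eval start, accumulate spend, stop before the cap.
import time

class SpendTracker:
    def __init__(self, rate_card: dict[str, float], hard_cap_usd: float):
        # Freeze the rate card with a timestamp so later pricing changes
        # cannot silently invalidate cross-run comparisons.
        self.snapshot = {"taken_at": time.time(), "rates": dict(rate_card)}
        self.hard_cap_usd = hard_cap_usd
        self.spent_usd = 0.0

    def record(self, model: str, prompt_tokens: int, completion_tokens: int) -> None:
        rates = self.snapshot["rates"]
        cost = (prompt_tokens * rates[f"{model}/input"]
                + completion_tokens * rates[f"{model}/output"]) / 1e6
        self.spent_usd += cost
        if self.spent_usd > self.hard_cap_usd:
            raise RuntimeError(
                f"spend cap ${self.hard_cap_usd:.2f} exceeded at ${self.spent_usd:.2f}; "
                "aborting batch before the dataset is silently truncated"
            )

tracker = SpendTracker(rate_card={"o3/input": 10.0, "o3/output": 40.0},  # assumed snapshot
                       hard_cap_usd=250.0)
tracker.record("o3", prompt_tokens=450, completion_tokens=8_000)
print(f"${tracker.spent_usd:.4f} spent so far")
```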
Sources & References
- Artificial Analysis LLM Leaderboards — Unified tracking of intelligence index, USD/1M tokens, TTFT, and TPS for 100+ LLMs; primary source for rate-card and latency data used throughout this article
- arXiv:2511.05722v1 — OckBench — Original OckBench paper; defines token-efficiency evaluation methodology for LLM reasoning benchmarks
- arXiv:2601.05503 — TPC Metric and Over-Searching — Introduces Tokens Per Correctness for search-augmented LLMs; documents over-searching failure mode and linear TCO scaling with reasoning depth
Keywords: OckBench, TCCA, AIME 2024, GPQA, o3-mini, DeepSeek-R1, inference-compute-per-token, TTFT, TPS, Artificial Analysis