Hybrid Evaluation Pipeline – Marian E. Arenskrieger

§01 Context

The problem it solves

Evaluating AI systems – coding agents, analytical outputs, data pipelines – at dataset scale collides with three constraints at once. Correctness cannot be left to one model grading another: a judge reading only the model's prose will pass a confident, wrong answer, and one grading its own family inflates the score. Results have to be reproducible, or a rationale written today can't be re-derived and defended tomorrow. And grading tens of thousands of samples through a frontier API without caching is simply expensive. The pipeline answers all three at once: deterministic ground truth for correctness, pinned-and-versioned configuration for reproducibility, and aggressive caching plus local inference for cost.

Run the code, ground the judgment in facts, keep the reference fixed – and make it cheap enough to run often.

§02 Architecture

Three roles, a verification layer, an automated pipeline

The design keeps three things apart that are usually conflated – the model being judged, the model that judges, and the deterministic layer that supplies facts – and adds a headless track for high-volume, multi-format comparison. Nothing grades itself, and no semantic verdict is issued before the code has actually been run.

Deterministic facts anchor every judgment; the frozen rubric is a cached prefix; a headless track handles high-volume, multi-format runs.

Design layer · local

Interactive rubric design

Rubric authoring and single-case spot-checks in an editor-integrated local workflow (Continue + Ollama), plus a native local chat app for ad-hoc analysis. Fast iteration while the rubric is still unstable and caching is irrelevant.

Batch judge · GPU

Local high-throughput judging

Whole datasets scored on a single RTX 5090 (Blackwell) via vLLM with native FP4 weights, chunked prefill and prefix caching of the frozen rubric. Continuous batching keeps the card saturated.

Arbiter · cloud

Frontier arbiter for hard calls

The hardest semantic, safety and honesty judgments routed to a frontier cloud model through its batch API, with the rubric as a cached, reused prefix and the sandbox's facts supplied as context.

Verification · cross-cutting

Hardened execution sandbox

Untrusted, model-generated code runs only inside a disposable, network-less, non-root, read-only container – yielding the deterministic ground truth (does it run? do the tests pass?) that anchors every judgment.

Automated pipeline · headless

Multi-format A/B & vision evaluation

A separate, fully automated track built on a local OpenAI-compatible inference server and a CLI evaluation harness. It ingests several input formats – plain text, structured data, and rendered images (screenshots of charts, dashboards or UI among them) – for pairwise A/B and vision-capable comparison, then renders the results to a review matrix. No interactive UI in the loop, so a run is a single reproducible command rather than a session.

§03 Engineering

The decisions behind the three outcomes

VRAM efficiency on a single 32 GB card

Three settings carry most of the weight: flash attention on, a q8_0 KV-cache (which halves the 32B model's cache from roughly 4 GB to 2 GB at 16k context), and an explicit 16384-token context length. The last one matters more than it looks – without it the runtime silently falls back to a 4096-token default, so the client keeps sending 16k while the model only ever sees 4k, and parts of the code under evaluation drop out of the window unnoticed. On the vLLM batch judge the equivalents are --kv-cache-dtype fp8, chunked prefill and --gpu-memory-utilization 0.90, deliberately leaving room for the KV cache to grow before it preempts.

The result is a 32B judge and a 7B autocomplete model co-resident with headroom, and the two backends never contend for the same VRAM – the local judge releases the card before the batch engine claims it.

The design layer holds a 32B judge (FP4) and a 7B autocomplete model co-resident with a q8_0 KV cache and ~7 GB spare – no spill. The vLLM batch judge claims the whole card separately, never at the same time.

Reducing judge bias

The judge is never the system under test; a model scoring its own family reads its own habits as quality. Deterministic tool output – ruff, pyright and pytest, run inside the sandbox – is fed in as fact, before the semantic judgment, so the arbiter reasons from what the code actually did rather than anchoring on a prior model's verdict. Confidence is not a single-shot number: it comes from self-consistency across samples, with the variance recorded rather than hidden.

One precision rule underpins all of it: a model that is itself under evaluation runs at full weights – only a judge or an open-weight baseline may run compressed (FP4). Grade a compressed system-under-test and the result measures the shrunken copy, not the model.

Cutting token cost

The rubric sits at the front of every request as an identical, ≥1024-token cached prefix. A cache read costs about a tenth of the base input price; batch processing takes a further 50% off input and output and stacks with the cache – which is where the ≈95% saving on the repeated portion comes from. Because a 24-hour batch outlives even the one-hour cache TTL, within-batch reuse isn't guaranteed, so the run is costed on the guaranteed 50% batch floor, kept warm with a prime call plus an occasional heartbeat read, and cache hits are treated as upside rather than assumed.

For sustained volume the local FP4 judge takes over at roughly the cost of electricity, and the cloud arbiter is reserved for the contested and safety-critical cases – a two-stage routing that keeps the expensive path narrow.

The run is costed on the guaranteed 50% batch floor. Prefix-cache reuse takes the repeated portion down to ~5% (≈95% off), but because a 24 h batch outlives the cache TTL that reuse is treated as upside, not assumed.

Treating model code as untrusted

Because part of the work is deliberately pushing systems to fail, their output is assumed hostile by default. Every execution is disposable (--rm), has no network, a read-only filesystem with a small tmpfs scratch space, a non-root user, dropped Linux capabilities with the default seccomp profile intact, and hard CPU, memory, PID and wall-clock limits. Dependencies are baked into the image precisely so the run needs no network – a run with network access is not a safe run.

The engineering above is not theoretical. The panel below is measured telemetry from a single ~227-minute production run – the card is driven to its memory, compute and power limits at once, yet stays thermally comfortable and costs about the price of electricity to run.

Real HWiNFO telemetry, ~6,800 samples at a 2-second cadence. VRAM holds 31.1 GB median and 31.8 GB peak of 32; GPU load is bursty (24% median, 96% p95) as batches saturate then drain; board power peaks at 573 W against the 575 W cap. Thermals stay cool and the whole run draws ≈0.70 kWh.

§04 Methodology

Rubric lifecycle & guardrails

A rubric earns the freeze only after it clears four gates. It has to discriminate – if every sample passes or every sample fails, it measures nothing, which is fatal for capability elicitation. It has to survive adversarial hardening against the failure modes agentic systems actually show: reward-hacking, absent error-recovery, hallucinated tool results. It has to pass a contamination check – public tasks get rebuilt as private ones. And it has to be pinned: semver plus a content hash, written into the eval metadata.

Design→Calibrate→Discriminative check→Adversarial hardening→Contamination check→Freeze + version→Batch

A rubric earns the freeze only after clearing all four gates; any failure returns it to design. The freeze is a semver tag plus a content hash written into the eval metadata – the anchor for reproducibility.

After the freeze a single character change is off-limits – it breaks the cache and makes already-scored samples inconsistent with the new text. The rationale is produced in the same pass as the verdict, because here the rationale is the deliverable: it can't be back-filled by a weaker model that never did the reasoning.

§05 Stack

Tooling & environment

Hardware

RTX 5090 · Blackwell · 32 GBRyzen 9 X3D

Local inference

OllamavLLM · FP4LM Studio

Eval & tooling

Continuepromptfooruff · pyright · pytest

Cloud

Frontier batch API (arbiter)

Safety

Docker sandboxno-network · non-root · read-only · seccomp

Practice

Reproducible pinsRubric versioningPrefix caching

§06 Scenario

In practice: regression-testing a coding-model upgrade at dataset scale

SSituation

A provider ships a new version of the coding model behind an internal agentic tool. Leaderboard deltas look positive, but aggregate benchmarks hide whether it has regressed on the failure modes that actually cost the team – silent error-recovery failures, fabricated passing tests – and the proprietary codebase cannot be sent to a cloud API.

TTask

Deliver a reproducible, defensible verdict across thousands of samples, grounded in whether code actually runs and passes rather than in a model's prose – without a frontier-API bill that scales linearly with sample count.

AAction

Freeze the rubric (semver + content hash); execute every sample in the hardened sandbox for deterministic facts (ruff, pyright, pytest); score the bulk on the local FP4 32B judge; route only contested and safety-critical cases to the cloud arbiter via batch with the rubric as a cached prefix; take confidence from self-consistency across samples and record the variance.

RResult

A version-over-version comparison on identical frozen criteria that surfaces the silent drift aggregate scores miss. The bulk runs at roughly the cost of electricity, with the cloud path narrowed to hard cases on the guaranteed 50% batch floor; and every verdict is re-derivable months later from the pinned rubric plus the recorded facts – an audit property, not just a number. The upgrade ships, or doesn't, on evidence rather than on a leaderboard headline.

§07 In practice

Where it earns its keep

Coding & data science

Vetting a coding assistant or agentic tool before it touches a codebase – running its output in the hardened sandbox to measure whether it genuinely recovers from a failing test or quietly fabricates a passing one, instead of trusting its own report.
Dataset-QA and preference-data quality for RLHF – auditing the training signal itself against a fixed, versioned rubric, so consistency and coverage gaps surface against a stable reference rather than shifting human judgment.
Regression-testing a model upgrade – replaying the same frozen rubric across versions to surface the silent quality drift that aggregate benchmarks average away, with every delta re-derivable from the pinned criteria.
Choosing between candidate models for a task on deterministic correctness and calibrated, self-consistent confidence – reading the variance across samples rather than trusting a single lucky first impression.

Finance · Code → Capital

Executing model-written quantitative code – a backtest, a pricing script, a data transform – inside the hardened sandbox, so a result that merely looks plausible is caught deterministically when the tests fail, rather than discovered after the position is already on the book.
Checking AI-written research for fabricated figures – a hallucinated number costs far more when it drives a capital allocation than when it sits in a chat window, so the same rubric that grades code is turned on the claim and the sourcing behind it.
Reproducibility as an audit property – dated model pins, temperature zero and a content-hashed rubric mean a judgment can be re-derived and defended months later, which is what turns a one-off verdict into something a risk committee can actually stand behind.
Two-stage local/cloud routing as a cost-control architecture – the same capital-efficiency instinct applied to inference spend, keeping the bulk on the local judge and paying for the cloud only where a verdict genuinely turns on it.

This pipeline is the operational core behind Code → Capital: the discipline that grades a coding agent is the discipline that grades any AI system whose output touches money. In both, the failure that hurts is the confident wrong answer – and the defence is the same: run the code, ground the judgment in facts, keep the reference fixed and versioned, and make the whole thing cheap enough to run often.

Built and operated end-to-end under CTC AI Operations – from GPU configuration and sandbox hardening to rubric design, cloud batching and cost modelling.