Contamination-Resistant Code Evaluation

§01 Context

The problem it solves

Benchmark validity decays as test cases leak into training corpora. Public and synthetic suites are increasingly compromised by data contamination: their tasks sit, un-decontaminated, inside the models' training data, so a high leaderboard score can reflect memorisation rather than the capability a specific team actually needs. This pipeline is built to restore two properties: tasks that provably post-date and stay outside any training set, and a measurement that can be re-derived months later and defended.

The pilot target is Cerberus (pyeve/cerberus, ~26 source files), a Python library that validates data against declarative schemas: type rules, value coercion, and nested or conditional validation. That design produces a dense edge-case surface – None versus a missing key, empty-string coercion, recursively referenced sub-schemas – which is exactly what makes it a useful audit target. A model that fabricates a plausible-looking validator is caught by the very cases the library exists to handle, rather than by a synthetic trap.

A benchmark you can regenerate from real code is one that contamination can’t quietly rot.

§02 Method

Structural task synthesis from a live repository

Tasks are a pure function of the repository state and a fixed set of extraction rules, so re-running against the same commit reproduces the same evaluation. The engine scans the repo, filters build artefacts and __pycache__, parses each remaining file's abstract syntax tree, and extracts a bounded metadata set – the top function and class signatures, the import dependencies from the file head, and the body of the first functionally significant method. That metadata drives a synthesis step that emits up to one task per category per file.
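The extraction step described above can be sketched with Python's standard ast module. This is a minimal illustration, not the pipeline's actual extractor: the function name extract_metadata, the max_signatures cap and the _is_significant heuristic (any statement beyond a docstring or pass) are assumptions introduced here.

```python
import ast

FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def _is_significant(fn) -> bool:
    # Hypothetical heuristic: a function counts as significant if its body
    # holds anything other than a docstring or `pass`.
    for stmt in fn.body:
        if isinstance(stmt, ast.Pass):
            continue
        if (isinstance(stmt, ast.Expr)
                and isinstance(stmt.value, ast.Constant)
                and isinstance(stmt.value.value, str)):
            continue
        return True
    return False


def extract_metadata(source: str, max_signatures: int = 10) -> dict:
    """Return imports, top-level signatures and the first significant body."""
    tree = ast.parse(source)
    imports, signatures, first_body = [], [], None
    for node in tree.body:  # top-level statements only
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            imports.append(ast.unparse(node))
            continue
        if isinstance(node, FUNC_TYPES):
            signatures.append(f"def {node.name}({ast.unparse(node.args)})")
            candidates = [node]
        elif isinstance(node, ast.ClassDef):
            signatures.append(f"class {node.name}")
            candidates = [n for n in node.body if isinstance(n, FUNC_TYPES)]
        else:
            continue
        if first_body is None:
            for fn in candidates:
                if _is_significant(fn):
                    first_body = ast.get_source_segment(source, fn)
                    break
    return {
        "imports": imports,
        "signatures": signatures[:max_signatures],  # keeps the set bounded
        "first_body": first_body,
    }
```

Because the result depends only on the source text and these fixed rules, calling it twice on the same file content yields the same metadata, which is the property the synthesis step relies on. ast.unparse requires Python 3.9 or later.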

Tasks = f(repo state, extraction rules). Same commit, same rules, same evaluation.
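One way to make that functional dependency checkable is to derive each task's identifier from its inputs alone. The sketch below is an assumption about how this could be done, not a description of the pipeline: task_id, rules_version and the 16-character truncation are all hypothetical.

```python
import hashlib
import json


def task_id(commit: str, rules_version: str, path: str, category: str) -> str:
    """Derive a stable identifier from the inputs that define a task."""
    payload = json.dumps(
        {"commit": commit, "rules": rules_version,
         "path": path, "category": category},
        sort_keys=True,  # fixed key order keeps the digest stable
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Two runs against the same commit and rule set then produce identical identifiers, and any change to the commit or the rules produces new ones, so a stored result can be traced back to exactly what generated it.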

Category 01 · 20 tasks

code_generation

Write a function from a requirement plus the AST-extracted import context and stated safety constraints – measured against the signatures the repo actually defines.

Category 02 · 20 tasks

code_understanding

Explain and locate a deliberately injected bug inside an isolated validation snippet, testing whether the model reasons about behaviour rather than surface syntax.

Category 03 · 20 tasks

security_audit

Identify injection vectors – SQL, command, path traversal – and unfiltered eval/exec in real validation code, the failure class most costly to miss here.
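For the eval/exec part of the audit category, a structural reference answer can be computed directly from the code under audit. The following is a minimal sketch under stated assumptions: find_dynamic_exec is a hypothetical helper, and it only catches direct calls by name, not aliases or lookups through builtins.

```python
import ast

DANGEROUS_CALLS = {"eval", "exec"}


def find_dynamic_exec(source: str) -> list:
    """Return sorted (line, name) pairs for direct eval()/exec() calls."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in DANGEROUS_CALLS):
            hits.append((node.lineno, node.func.id))
    return sorted(hits)
```

A model's audit answer could then be compared against these locations; whether the pipeline scores audits this way is not stated in the source.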

Orchestration

YAML isolation → SQLite

When a single promptfoo config declares several targets, async timeouts during model boot can drop blocked providers, so that only the fastest is logged. One YAML per model forces sequential runs at the OS-process level and a deterministic write into the aggregate store.
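The one-config-per-process pattern can be sketched as a small driver. Everything named here is an assumption for illustration: run_sequentially, the runs table and its columns are hypothetical, and the command shape (promptfoo eval -c <config>) should be checked against the installed promptfoo version.

```python
import sqlite3
import subprocess
import time
from pathlib import Path


def promptfoo_command(config: Path) -> list:
    # Assumed CLI shape; verify against your promptfoo version.
    return ["promptfoo", "eval", "-c", str(config)]


def run_sequentially(configs, db_path, build_command=promptfoo_command):
    """Run one child process per config, strictly in order, and log each run."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS runs "
        "(config TEXT, exit_code INTEGER, seconds REAL)"
    )
    results = []
    for config in configs:
        started = time.monotonic()
        # subprocess.run blocks until the child exits, so runs never overlap.
        proc = subprocess.run(build_command(Path(config)))
        con.execute(
            "INSERT INTO runs VALUES (?, ?, ?)",
            (str(config), proc.returncode, time.monotonic() - started),
        )
        con.commit()  # one committed row per model, in submission order
        results.append((str(config), proc.returncode))
    con.close()
    return results
```

Passing a different build_command makes the driver testable without promptfoo installed; the write order into the table follows the order of the configs, which is the deterministic property the isolation is meant to guarantee.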

§03 Architecture

A ~3B-active MoE inside a 32 GB budget

Qwen3.6-35B-A3B is a sparse Mixture-of-Experts model. Per the published architecture it runs 40 layers as ten repetitions of three Gated-DeltaNet blocks followed by one Gated-Attention block, and every block feeds a 256-expert MoE where the router activates 8 experts plus 1 shared expert per token. Roughly 3 billion of the 35 billion parameters do arithmetic on any given token – but all 256 experts must stay resident, so the memory footprint is that of the full model while compute and bandwidth track the active path.

The router activates 8 of 256 routed experts plus 1 shared expert per token, so only ~3B of 35B parameters do arithmetic – but all 35B stay loaded. Compute and bandwidth shrink; the VRAM footprint does not. That asymmetry is why MoE fits a fixed card.

Why the 32 GB holds

A 35B model at FP16 needs ~70 GB for weights alone. Q4_K_M packs them into super-blocks of 256 weights (eight 32-weight sub-blocks) with 4-bit quants, 6-bit per-sub-block scales and mins – about 4.5 effective bits per weight – and promotes selected tensors to Q6_K. That mixed precision brings the footprint to ~24 GB while holding enough of the weight distribution for syntactic validity to survive at scale.
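The 4.5-bit figure can be reproduced from the block layout. This is a worked sketch, assuming the usual ggml Q4_K layout in which each super-block also stores one FP16 scale and one FP16 min; those two FP16 fields are not mentioned in the text above and are an assumption here.

```python
SUPER_BLOCK = 256   # weights per super-block
SUB_BLOCKS = 8      # 32-weight sub-blocks per super-block
QUANT_BITS = 4      # bits per quantised weight
SCALE_BITS = 6      # per-sub-block scale
MIN_BITS = 6        # per-sub-block min
FP16_BITS = 16      # assumed: super-block scale and min, one FP16 each


def q4_k_bits_per_weight() -> float:
    bits = (SUPER_BLOCK * QUANT_BITS
            + SUB_BLOCKS * (SCALE_BITS + MIN_BITS)
            + 2 * FP16_BITS)
    return bits / SUPER_BLOCK


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


fp16_gb = weight_gb(35e9, 16)                        # 70.0
q4_floor_gb = weight_gb(35e9, q4_k_bits_per_weight())  # about 19.7
```

At a uniform 4.5 bits the 35B weights would take about 19.7 GB, so that number is a floor rather than the footprint. The ~24 GB quoted above is higher, which is consistent with some tensors being held at higher precision; the exact mix is not given in the source.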

The remaining budget absorbs the KV cache. Only the ten Gated-Attention layers grow a cache – the thirty DeltaNet layers carry a fixed-size recurrent state – and those ten layers use grouped-query attention with two KV heads at head-dim 256, so the cache costs roughly 10 KB per token at q8_0. Even long contexts stay inside the ~5 GB reserved for the cache, without spilling.
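The per-token figure follows from the layer counts given above. The sketch assumes q8_0 stores 32 int8 values plus one FP16 scale per block, that is 8.5 bits per value, and it ignores the DeltaNet recurrent state and any compute buffers; the function names are illustrative.

```python
def kv_bytes_per_token(attn_layers: int = 10, kv_heads: int = 2,
                       head_dim: int = 256,
                       bits_per_value: float = 8.5) -> float:
    """Bytes of K and V cache added per generated or prompt token."""
    values = 2 * attn_layers * kv_heads * head_dim  # factor 2: K and V
    return values * bits_per_value / 8


def max_cached_tokens(budget_gb: float, bytes_per_token: float) -> int:
    """Tokens that fit in a cache budget given in decimal gigabytes."""
    return int(budget_gb * 1e9 // bytes_per_token)


per_token = kv_bytes_per_token()               # 10880.0 bytes
capacity = max_cached_tokens(5.0, per_token)   # 459558 tokens
```

That is about 10.9 KB per token, and a 5 GB budget holds several hundred thousand cached tokens for a single stream. With N parallel streams the same budget is divided by N, which is the pressure the single-stream setting avoids.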

Weights dominate the card; the ~5 GB residual is deliberately reserved so the KV cache can grow without ever forcing weights off the card.

Single-stream determinism – and the cliff it avoids

The backend is pinned with OLLAMA_NUM_PARALLEL=1 and OLLAMA_MAX_LOADED_MODELS=1. Concurrency is tempting, but the KV cache grows linearly with the number of parallel requests, and the first request that overruns VRAM forces weights into system RAM. Decoding is memory-bandwidth-bound – every token reads the active weights once over the memory bus – so the swap from GDDR7 at ~1.79 TB/s to DDR5 at ~60–80 GB/s is not a slowdown but a collapse.
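The size of that collapse can be estimated with a simple bandwidth ceiling: tokens per second cannot exceed memory bandwidth divided by the bytes read per token. This is a rough upper bound under stated assumptions (about 3B active parameters at 4.5 bits each, all weights served from one memory tier), not a prediction of measured throughput.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float,
                                active_params: float = 3e9,
                                bits_per_weight: float = 4.5) -> float:
    """Upper bound on decode speed when every token reads the active weights once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


gddr7 = decode_ceiling_tokens_per_s(1790)  # on-card ceiling
ddr5 = decode_ceiling_tokens_per_s(70)     # midpoint of the 60-80 GB/s range
ratio = gddr7 / ddr5                        # about 25.6
```

Whatever the real efficiency factor is, it applies to both tiers, so the ratio of roughly 22x to 30x across the quoted DDR5 range is the part of the estimate that carries over.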

The bar on the right is drawn to scale – its near-invisibility is the point. This is why the pipeline forbids the parallelism that would cause the spill.

RTX 5090 · 32 GB GDDR7 · ~1.79 TB/s | Ryzen 7 9850X3D | Ollama · Q4_K_M | FLASH_ATTENTION=1 | KV_CACHE_TYPE=q8_0 | temperature 0.0

§04 Results

What the run actually shows

Tasks: 60 / 60 completed
Duration: 4 min 57 s
Avg latency: ~5.0 s / task
Throughput: ~200 tokens/s
Tokens: 74,342 total, of which 12,682 prompt and 59,392 completion
Stability: no latency spikes, no OOM under NUM_PARALLEL=1

On the substance of the output: generated code was syntactically valid with consistent type hints (Optional[str]), preferred the standard library (re) over fragile third-party parsers, and handled None, empty-string and TypeError cases consistently across the generation set. The completion-to-prompt ratio (~4.7:1) reflects tasks that ask for whole functions and audit write-ups, not single-line answers.
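The derived figures can be re-computed from the raw counts reported for the run. One assumption is made here: throughput is taken as completion tokens over wall-clock duration, which reproduces the reported value but is not a definition stated in the source.

```python
duration_s = 4 * 60 + 57       # 4 min 57 s
tasks = 60
prompt_tokens = 12_682
completion_tokens = 59_392

avg_latency_s = duration_s / tasks                 # 4.95 s per task
throughput = completion_tokens / duration_s        # just under 200 tokens/s
completion_to_prompt = completion_tokens / prompt_tokens  # about 4.68
```

All three land on the rounded values quoted above, so the reported latency, throughput and ratio are mutually consistent with the duration and token counts.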

Left: the current task set, saturated. Right: the spread a discriminative set should show. The gap between them is the next iteration.

Reading the 100 %. By the discrimination criterion this lab applies to every rubric – a task set on which every sample passes measures nothing – a 60/60 is not a capability claim; it says the generated tasks sat below the model's ceiling. That is the pilot's central, honest finding. The measurement that is trustworthy here is the operational one – throughput, stability and token accounting under a pinned single stream – and it is that operational envelope, not the score, that the pipeline was built to establish first.
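The discrimination criterion can be expressed as a variance check on pass/fail outcomes. This is a minimal sketch of one way to encode it; pass_variance and is_saturated are hypothetical names, and a production rubric would likely look at per-task or per-stratum spread rather than a single aggregate.

```python
def pass_variance(outcomes: list) -> float:
    """Bernoulli variance p * (1 - p) of a list of pass/fail booleans."""
    p = sum(outcomes) / len(outcomes)
    return p * (1 - p)


def is_saturated(outcomes: list) -> bool:
    # Zero variance means every task passed or every task failed,
    # so the set cannot separate one model from another.
    return pass_variance(outcomes) == 0.0


pilot = [True] * 60
print(is_saturated(pilot))
```

For the pilot's 60 passes the variance is zero, which is the formal version of "a 60/60 is not a capability claim".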

§05 Scenario

In practice: vetting an open-weight model for an air-gapped validation library

Situation

A team maintains a security-critical Python input-validation library – the Cerberus problem class: schema coercion, nested and conditional rules. They want to adopt an open-weight coding model, but the codebase is proprietary and under a data-sovereignty constraint: nothing may leave the network. Public coding benchmarks are contaminated, so a leaderboard rank is no evidence for this codebase.

Task

Produce a defensible, on-premises measurement of whether the candidate model actually handles this repository's edge cases and injection-vector reasoning – reproducible from the exact commit, and fitting inside a single 32 GB GPU with no cloud dependency in the evaluation path.

Action

Point the AST extractor at the pinned commit and synthesise 60 tasks bound to the repo's real signatures and imports. Run Qwen3.6-35B-A3B at temperature 0 under NUM_PARALLEL=1; keep the run air-gapped (no network in the eval path); log latency, token counts and finish reasons per task into SQLite via one isolated promptfoo config.

Result

A complete 60-task evaluation in 4 min 57 s, entirely on-premises and re-derivable from (commit, extraction rules) – a measurement the team can defend in an audit months later. The uniform pass rate surfaced a concrete methodological gap: the task set doesn't discriminate at this model's level. That converted directly into the next action – stratify difficulty and add adversarial audit cases before any score is read as a capability signal. The deliverable is a trustworthy, sovereign measurement and a precise instruction for the next iteration, rather than a number taken on faith.

§06 Limitations & next steps

What this pilot is not, and where it goes

Single repository. All tasks derive from Cerberus; external validity needs diverse targets. The extractor already generalises – pointing it at FastAPI (async routing), Pydantic (validation metaclasses) or Django REST via the GitHub API is the immediate cross-domain step.
Single model, by design. A cross-model comparison (DeepSeek-R1 32B, Gemma 4-31B, Mistral NeMo 12B) follows only when every model runs under identical measured conditions – no simulated rows standing in for a run that didn't happen.
Non-discriminative task set. Difficulty must be stratified (trivial → adversarial) so the pass rate spreads, per §04. This is the precondition for any capability claim.
No commercial baseline yet. A frontier-API gold standard via batch API would anchor the local scores against a known reference point.
Hallucination detection. Cross-referencing generated imports and API calls against real package documentation would catch fabricated dependencies before they read as valid code – the highest-value automation to add next.
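A first approximation of the hallucination check in the last item can run entirely offline. Note the substitution: the sketch below checks generated imports against the packages importable in the local environment, not against package documentation as proposed above, and it covers top-level module names only. unresolved_imports is a hypothetical helper.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list:
    """Return top-level module names imported by `source` that cannot be found locally."""
    roots = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            roots.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports, which are package-internal
            roots.add(node.module.split(".")[0])
    return sorted(
        name for name in roots
        if importlib.util.find_spec(name) is None
    )
```

An empty result does not prove the generated code is sound, since a real module can still be called with a fabricated function; checking attribute and call names against the installed package would be the next layer.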

The contribution is the method and its discipline: tasks synthesised from live code to resist contamination, a VRAM regime that keeps the numbers honest, and results reported against the standard that a saturated score is a finding, not a win.

Built and operated end-to-end under CTC AI Operations – from GPU configuration and backend tuning to task-synthesis design, evaluation orchestration and result aggregation.