The problem it solves
Benchmark validity decays as test cases leak into training corpora. Public and synthetic suites are increasingly compromised by data contamination: their tasks sit, un-decontaminated, inside the models' training data, so a high leaderboard score can reflect memorisation rather than the capability a specific team actually needs. This pipeline is built to restore two properties that contamination erodes – tasks that provably post-date, and stay outside, any training set, and a measurement that can be re-derived and defended months later.
The pilot target is Cerberus (pyeve/cerberus, ~26 source files), a Python library that validates data against declarative schemas: type rules, value coercion, and nested or conditional validation. That design produces a dense edge-case surface – None versus a missing key, empty-string coercion, recursively referenced sub-schemas – which is exactly what makes it a useful audit target. A model that fabricates a plausible-looking validator is caught by the very cases the library exists to handle, rather than by a synthetic trap.
Structural task synthesis from a live repository
Tasks are a pure function of the repository state and a fixed set of extraction rules, so re-running against the same commit reproduces the same evaluation. The engine scans the repo, filters build artefacts and __pycache__, parses each remaining file's abstract syntax tree, and extracts a bounded metadata set – the top function and class signatures, the import dependencies from the file head, and the body of the first functionally significant method. That metadata drives a synthesis step that emits up to one task per category per file.
code_generation
Write a function from a requirement plus the AST-extracted import context and stated safety constraints – measured against the signatures the repo actually defines.
code_understanding
Locate and explain a deliberately injected bug inside an isolated validation snippet, testing whether the model reasons about behaviour rather than surface syntax.
security_audit
Identify injection vectors – SQL, command, path traversal – and unfiltered eval/exec in real validation code, the failure class most costly to miss here.
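The "up to one task per category per file" rule can be sketched as a deterministic loop over files and categories. The applicability predicates and the hashed task identifier below are hypothetical choices made for illustration; only the three category names come from the text above.

```python
import hashlib

CATEGORIES = ("code_generation", "code_understanding", "security_audit")


def _applicable(category, meta):
    """Assumed rule: generation needs signatures, the others need a snippet."""
    if category == "code_generation":
        return bool(meta.get("signatures"))
    return bool(meta.get("first_body"))


def synthesise_tasks(commit, files):
    """Emit at most one task per (file, category); files maps path -> metadata."""
    tasks = []
    for path in sorted(files):  # sorted, so output never depends on scan order
        for category in CATEGORIES:
            if not _applicable(category, files[path]):
                continue
            key = f"{commit}:{path}:{category}".encode("utf-8")
            tasks.append({
                "id": hashlib.sha256(key).hexdigest()[:12],
                "file": path,
                "category": category,
            })
    return tasks
```

Deriving the identifier from (commit, path, category) means a re-run against the same commit produces the same task list, while any change of commit changes every identifier.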
YAML isolation → SQLite
When several targets are declared in one promptfoo config, async timeouts during model boot can drop the blocked providers, so that only the fastest one is logged. One YAML file per model forces sequential runs at the OS-process level and a deterministic write into the aggregate store.
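A minimal sketch of the one-config-per-model discipline follows. It assumes a directory holding one YAML file per model and uses promptfoo's eval command with its -c and --output options; the directory layout and function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(config_dir, output_dir):
    """One promptfoo invocation per YAML file, in a stable sorted order."""
    commands = []
    for config in sorted(Path(config_dir).glob("*.yaml")):
        output = Path(output_dir) / f"{config.stem}.json"
        commands.append(
            ["promptfoo", "eval", "-c", str(config), "--output", str(output)]
        )
    return commands


def run_sequentially(commands):
    """Run each evaluation in its own OS process, strictly one after another."""
    exit_codes = []
    for argv in commands:
        # subprocess.run blocks until the process exits, so two models are
        # never being booted or queried at the same time.
        exit_codes.append(subprocess.run(argv).returncode)
    return exit_codes
```

Exit codes are collected rather than raised here, on the assumption that a non-zero code may indicate failed assertions rather than a crashed run; a stricter runner could stop at the first one.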
A ~3B-active MoE inside a 32 GB budget
Qwen3.6-35B-A3B is a sparse Mixture-of-Experts model. Per the published architecture it runs 40 layers as ten repetitions of three Gated-DeltaNet blocks followed by one Gated-Attention block, and every block feeds a 256-expert MoE where the router activates 8 experts plus 1 shared expert per token. Roughly 3 billion of the 35 billion parameters do arithmetic on any given token – but all 256 experts must stay resident, so the memory footprint is that of the full model while compute and bandwidth track the active path.
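The routing step can be illustrated with a generic top-k router. This is a sketch of the common technique, not the model's actual implementation: the softmax-over-selected-experts weighting is an assumption, and the shared expert is simply noted as always evaluated in addition to the routed ones.

```python
import math


def route(router_logits, top_k=8):
    """Pick the top_k experts by logit and return {expert_index: weight}.

    Weights are a softmax restricted to the selected experts (an assumed
    normalisation). The shared expert is not routed; it runs for every token.
    """
    order = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )
    chosen = order[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract for stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Per token and per MoE block: 8 routed experts plus 1 shared expert compute,
# while all 256 experts must remain loaded in memory.
EXPERTS_EVALUATED_PER_TOKEN = 8 + 1
```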
Why the 32 GB holds
A 35B model at FP16 needs ~70 GB for weights alone. Q4_K_M packs them into super-blocks of 256 weights (eight 32-weight sub-blocks) with 4-bit quants and 6-bit per-sub-block scales and mins – about 4.5 effective bits per weight – and promotes selected tensors to Q6_K. That mixed precision brings the footprint to ~24 GB while preserving enough precision that generated code stays syntactically valid.
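The bits-per-weight figure can be checked with a few lines of arithmetic. The sketch assumes the ggml Q4_K block layout, in which each super-block also stores one fp16 scale and one fp16 min; the 35e9 parameter count is the round figure used above.

```python
def q4_k_bits_per_weight():
    """Effective bits per weight for one 256-weight Q4_K super-block."""
    quant_bits = 256 * 4          # 4-bit quants for every weight
    scale_min_bits = 8 * (6 + 6)  # 6-bit scale + 6-bit min per sub-block
    superblock_bits = 2 * 16      # fp16 super-block scale and min (assumed layout)
    return (quant_bits + scale_min_bits + superblock_bits) / 256


def weights_gb(params, bits_per_weight):
    """Weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 35e9
fp16_gb = weights_gb(PARAMS, 16)                          # FP16 baseline
q4_k_gb = weights_gb(PARAMS, q4_k_bits_per_weight())      # pure Q4_K
```

A uniform Q4_K encoding comes out a few gigabytes below the ~24 GB quoted above; the difference would be accounted for by tensors held at higher precision, such as the Q6_K promotions, whose exact share is not stated here.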
The remaining budget absorbs the KV cache. Only the ten Gated-Attention layers grow a cache – the thirty DeltaNet layers carry a fixed-size recurrent state – and the attention layers use grouped-query attention with two KV heads at head-dim 256, so the cache costs roughly 10 KB per token at q8_0. Even long contexts stay inside the ~5 GB left after the weights, without spilling.
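The per-token cache cost follows directly from those dimensions. The sketch below assumes the q8_0 layout of 32 int8 values plus one fp16 scale per block (34 bytes per 32 elements); the 131072-token context in the usage line is an illustrative length, not a figure from this run.

```python
def kv_bytes_per_token(attn_layers=10, kv_heads=2, head_dim=256,
                       bytes_per_element=34 / 32):
    """Bytes of KV cache added per token across all attention layers."""
    elements = 2 * attn_layers * kv_heads * head_dim  # factor 2: keys and values
    return elements * bytes_per_element


def kv_cache_gb(context_tokens, **dims):
    """Total KV cache in decimal gigabytes for one request."""
    return context_tokens * kv_bytes_per_token(**dims) / 1e9


per_token = kv_bytes_per_token()     # 10880.0 bytes, i.e. roughly 10 KB
long_context = kv_cache_gb(131072)   # about 1.43 GB at the assumed length
```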
Single-stream determinism – and the cliff it avoids
The backend is pinned with OLLAMA_NUM_PARALLEL=1 and OLLAMA_MAX_LOADED_MODELS=1. Concurrency is tempting, but each parallel request adds its own KV cache, so the total grows linearly with concurrency, and the first request that overruns VRAM forces weights into system RAM. Decoding is memory-bandwidth-bound – every token reads the active weights once through the bus – so the swap from GDDR7 at ~1.79 TB/s to DDR5 at ~60–80 GB/s is not a slowdown but a collapse.
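The size of that cliff can be estimated with an idealised bandwidth bound. The sketch assumes decode speed is capped at bandwidth divided by the bytes of active weights read per token, and it ignores KV-cache reads, compute and transfer overheads, so real throughput sits below these ceilings. The 70 GB/s figure is simply the midpoint of the DDR5 range quoted above.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s, active_params=3e9,
                                bits_per_weight=4.5):
    """Upper bound on tokens/s when each token reads the active weights once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


def total_kv_gb(per_request_gb, num_parallel):
    """KV cache grows linearly with the number of parallel requests."""
    return per_request_gb * num_parallel


vram_ceiling = decode_ceiling_tokens_per_s(1790)  # GDDR7 at ~1.79 TB/s
ram_ceiling = decode_ceiling_tokens_per_s(70)     # DDR5, assumed midpoint
collapse_factor = vram_ceiling / ram_ceiling      # roughly 25x
```

Under these assumptions the ceiling drops by about a factor of 25 once the active path is served from system RAM, which is why the single-stream pin is treated as a correctness setting rather than a tuning choice.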
What the run actually shows
On the substance of the output: generated code was syntactically valid with consistent type hints (Optional[str]), preferred the standard library (re) over fragile third-party parsers, and handled None, empty-string and TypeError cases consistently across the generation set. The completion-to-prompt ratio (~4.7:1) reflects tasks that ask for whole functions and audit write-ups, not single-line answers.
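Two of these observations lend themselves to automatic checks. The sketch below shows one way they could be computed, assuming each result row carries prompt_tokens and completion_tokens fields; the field names and both helpers are hypothetical rather than the pipeline's own.

```python
import ast


def is_valid_python(code):
    """True if the text parses as Python; says nothing about correctness."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True


def completion_to_prompt_ratio(rows):
    """Aggregate completion tokens divided by aggregate prompt tokens."""
    prompt = sum(r["prompt_tokens"] for r in rows)
    completion = sum(r["completion_tokens"] for r in rows)
    return completion / prompt
```

Parsing only establishes syntactic validity; the None, empty-string and TypeError handling noted above would still need behavioural tests against the generated functions.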
In practice: vetting an open-weight model for an air-gapped validation library
A team maintains a security-critical Python input-validation library – the Cerberus problem class: schema coercion, nested and conditional rules. They want to adopt an open-weight coding model, but the codebase is proprietary and under a data-sovereignty constraint: nothing may leave the network. Public coding benchmarks are contaminated, so a leaderboard rank is no evidence for this codebase.
Produce a defensible, on-premises measurement of whether the candidate model actually handles this repository's edge cases and injection-vector reasoning – reproducible from the exact commit, and fitting inside a single 32 GB GPU with no cloud dependency in the evaluation path.
Point the AST extractor at the pinned commit and synthesise 60 tasks bound to the repo's real signatures and imports. Run Qwen3.6-35B-A3B at temperature 0 under OLLAMA_NUM_PARALLEL=1; keep the run air-gapped (no network in the eval path); log latency, token counts and finish reasons per task into SQLite via one isolated promptfoo config.
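The per-task logging can be sketched with the standard sqlite3 module. The table layout and column names below are hypothetical and are not promptfoo's own storage schema; the point is only that a primary key on (model, task_id) makes a repeated write idempotent.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    model             TEXT    NOT NULL,
    task_id           TEXT    NOT NULL,
    latency_ms        REAL    NOT NULL,
    prompt_tokens     INTEGER NOT NULL,
    completion_tokens INTEGER NOT NULL,
    finish_reason     TEXT    NOT NULL,
    PRIMARY KEY (model, task_id)
)
"""


def open_store(path=":memory:"):
    """Open (or create) the aggregate store and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record(conn, row):
    """Write one task result; re-recording the same task replaces the row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO results VALUES "
            "(:model, :task_id, :latency_ms, :prompt_tokens, "
            ":completion_tokens, :finish_reason)",
            row,
        )
```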
A complete 60-task evaluation in 4 min 57 s, entirely on-premises and re-derivable from (commit, extraction rules) – a measurement the team can defend in an audit months later. The uniform pass rate surfaced a concrete methodological gap: the task set doesn't discriminate at this model's level. That converted directly into the next action – stratify difficulty and add adversarial audit cases before any score is read as a capability signal. The deliverable is a trustworthy, sovereign measurement and a precise instruction for the next iteration, rather than a number taken on faith.
What this pilot is not, and where it goes
- Single repository. All tasks derive from Cerberus; external validity needs diverse targets. The extractor already generalises – pointing it at FastAPI (async routing), Pydantic (validation metaclasses) or Django REST via the GitHub API is the immediate cross-domain step.
- Single model, by design. A cross-model comparison (DeepSeek-R1 32B, Gemma 4-31B, Mistral NeMo 12B) follows only when every model runs under identical measured conditions – no simulated rows standing in for a run that didn't happen.
- Non-discriminative task set. Difficulty must be stratified (trivial → adversarial) so the pass rate spreads, per §04. This is the precondition for any capability claim.
- No commercial baseline yet. A frontier-API gold standard via batch API would anchor the local scores against a known reference point.
- Hallucination detection. Cross-referencing generated imports and API calls against real package documentation would catch fabricated dependencies before they read as valid code – the highest-value automation to add next.
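A first approximation of the import check in the last item can be built from the standard library alone. This sketch only tests whether each imported top-level module resolves in the local environment, which stands in for a real cross-reference against package documentation; it does not verify API calls, and both function names are hypothetical.

```python
import ast
import importlib.util


def imported_top_level_modules(code):
    """Collect the top-level module names that the code imports."""
    names = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names.add(node.module.split(".")[0])  # skip relative imports
    return names


def unresolved_imports(code):
    """Return imported modules that cannot be found in this environment."""
    return sorted(
        name
        for name in imported_top_level_modules(code)
        if importlib.util.find_spec(name) is None
    )
```

In an air-gapped setup the environment itself acts as the allow-list: any module a generated snippet imports that is not installed there is flagged for review before the snippet is scored.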
The contribution is the method and its discipline: tasks synthesised from live code to resist contamination, a VRAM regime that keeps the numbers honest, and results reported against the standard that a saturated score is a finding, not a win.
Built and operated end-to-end under CTC AI Operations – from GPU configuration and backend tuning to task-synthesis design, evaluation orchestration and result aggregation.