Research Program // CTC AI Operations

Research Program

One research agenda behind the projects: whether frontier and multi-agent AI can be evaluated trustworthily, reproducibly, and cheaply on sovereign commodity hardware – a single 32 GB GPU rather than a datacentre. Five workstreams, each stated as a falsifiable hypothesis with a defined metric, sequenced toward preprint publication and non-dilutive funding.

P‑H1 · Sovereign sufficiency
One 32 GB GPU is enough
A single 32 GB GPU supports trustworthy dataset-scale evaluation of frontier agentic systems that is non-inferior to an all-cloud baseline within a preset margin.
P‑H2 · Reproducibility
Frozen by construction
Content-hashed references yield verdicts reproducible across time and software updates, above a pre-registered threshold.
P‑H3 · Emergence gap
Safety does not compose
A measurable fraction of failure modes appear only under multi-agent interaction – invisible to single-model evaluation.
Research program & roadmap (PDF)
§01 Thesis

Evaluation is the bottleneck of trustworthy AI

Capability advances faster than our ability to measure whether a system is correct, honest, and safe. The dominant assumption is that credible evaluation requires frontier-scale cloud infrastructure. This program tests the opposite: that a disciplined, hypothesis-driven protocol on commodity hardware produces evaluation evidence that is reproducible (frozen, content-hashed references), grounded (deterministic execution facts, not opinion), and economical (local low-precision judging anchored by sampled cloud arbitration) – and that the same discipline extends from single-model correctness to multi-agent, fleet-level safety.

Trustworthy, reproducible, low-cost evaluation – on hardware anyone can own.
§02 Invariants

Four invariants across every workstream

Full-weight SUT

The system under test is never compressed

Quantization is a throughput lever for the judge only. Grade a compressed model and you measure the shrunken copy, not the model.

Frozen references

Rubrics and benchmarks are pinned and content-hashed

A verdict is reproducible across time and software updates; a later change to the reference is visible rather than silent.
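
A minimal sketch of how a pinned, content-hashed reference could make a later change visible. The canonical-JSON serialisation, the SHA-256 choice, and the rubric fields are assumptions made for illustration, not the program's actual format.

```python
import hashlib
import json


def content_hash(reference: dict) -> str:
    """Return a SHA-256 hex digest of a reference in canonical JSON form."""
    # Sorted keys and fixed separators make the serialisation independent
    # of dict insertion order and incidental whitespace.
    canonical = json.dumps(
        reference, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(reference: dict, pinned_digest: str) -> bool:
    """True only if the reference still matches the digest pinned earlier."""
    return content_hash(reference) == pinned_digest


# Hypothetical rubric, for illustration only.
rubric = {"id": "rubric-example", "criteria": ["compiles", "tests pass"], "scale": [0, 1]}
pinned = content_hash(rubric)

# A later edit to the reference no longer matches the pinned digest.
edited = dict(rubric, criteria=["compiles"])
print(verify(rubric, pinned))
print(verify(edited, pinned))
```

Storing the pinned digest next to each verdict is one way to tie a judgment to the exact reference it was made against.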

Hardened sandbox

Untrusted model code runs network-less, non-root, read-only

The same seccomp-profiled substrate produces the deterministic facts that ground judgment and hosts the multi-agent testbed.
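
As an illustration only, the three sandbox properties plus a seccomp profile map onto standard `docker run` flags. The source does not say which container runtime is used, so Docker, the UID 65534 (conventionally `nobody` on many Linux systems), the image name, and the profile path are all assumptions. The helper `sandbox_argv` is hypothetical and only builds the argument list; it does not launch anything.

```python
from typing import List, Sequence


def sandbox_argv(image: str, command: Sequence[str], seccomp_profile: str) -> List[str]:
    """Build a `docker run` argument list for a locked-down container."""
    return [
        "docker", "run", "--rm",
        "--network", "none",            # network-less
        "--user", "65534:65534",        # non-root (assumed unprivileged UID:GID)
        "--read-only",                  # read-only root filesystem
        "--cap-drop", "ALL",            # drop all Linux capabilities
        "--security-opt", f"seccomp={seccomp_profile}",  # custom seccomp profile
        image, *command,
    ]


# Hypothetical image, command, and profile path.
argv = sandbox_argv("eval-sandbox:latest", ["python", "/work/candidate.py"], "seccomp.json")
print(" ".join(argv))
```

The list can be passed to `subprocess.run(argv, ...)` on a host where Docker is installed.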

32 GB discipline

Everything fits and is measured within one card

Sovereign, commodity hardware – the constraint that makes the results independently reproducible.

§03 Workstreams

Five workstreams, five falsifiable hypotheses

Each project is a workstream with a core hypothesis and a primary metric. Workstreams marked Infrastructure measured have established engineering results; those marked Pilot, Agenda, or Design define an evaluation or system still to be run or built.

W1 · Infrastructure measured

Hybrid Evaluation Pipeline

Execution-grounded, frozen-rubric, local FP4 judging is trustworthy and roughly 10× cheaper than an all-cloud baseline. Metric: Cohen’s κ; cost per 1k judgments.
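
For reference, the two W1 metrics can be computed as below. This is a plain-Python sketch of the standard Cohen's κ formula, κ = (p_o − p_e) / (1 − p_e), plus a cost-per-1k helper. The labels and cost figures are hypothetical and are not results from the program.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    # Observed agreement: share of items where both raters agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_o - p_e) / (1 - p_e)


def cost_per_1k(total_cost: float, n_judgments: int) -> float:
    """Cost normalised to 1,000 judgments."""
    return 1000 * total_cost / n_judgments


# Hypothetical labels: human reference vs. local judge.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
print(cohens_kappa(human, judge))  # 0.75
```

Here the raters agree on 7 of 8 items (p_o = 0.875) and chance agreement is p_e = 0.5, giving κ = 0.75.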

W2 · Pilot

Contamination-Resistant Code Evaluation

Benchmarks regenerated from live repositories resist memorisation, shrinking the train–test gap versus static benchmarks. Metric: contamination gap (fresh − stale).
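
A small sketch of the contamination-gap metric as stated (fresh − stale), assuming each benchmark yields a per-task pass/fail list. The function names and result lists are hypothetical.

```python
from typing import Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of tasks passed."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(results) / len(results)


def contamination_gap(fresh: Sequence[bool], stale: Sequence[bool]) -> float:
    """Pass rate on freshly regenerated tasks minus pass rate on static tasks."""
    return pass_rate(fresh) - pass_rate(stale)


# Hypothetical per-task outcomes for one model.
fresh_results = [True, False, True, False]   # regenerated from live repositories
stale_results = [True, True, True, False]    # static benchmark
print(contamination_gap(fresh_results, stale_results))  # -0.25
```

Under this sign convention, a negative gap means the model scores lower on fresh tasks than on the static benchmark, which is the pattern memorisation would produce.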

W3 · Infrastructure measured

Local Three-Tier Agent Workstation

Single-residency time-multiplexing serves three model tiers in 32 GB without VRAM collision, at bounded swap cost. Metric: peak VRAM; swap cost.
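
The idea of single-residency time-multiplexing can be illustrated with a toy scheduler that keeps at most one tier resident and records the two W3 metrics. This is a simulation sketch, not the workstation's implementation. The tier names, VRAM footprints, and load times are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Tier:
    name: str
    vram_gb: float       # footprint when resident (hypothetical)
    load_seconds: float  # time to swap the tier in (hypothetical)


@dataclass
class SingleResidencyScheduler:
    budget_gb: float
    tiers: Dict[str, Tier]
    resident: Optional[str] = None
    peak_vram_gb: float = 0.0
    swap_seconds: float = 0.0
    swaps: int = 0

    def acquire(self, name: str) -> None:
        """Make `name` the only resident tier, swapping if needed."""
        tier = self.tiers[name]
        if tier.vram_gb > self.budget_gb:
            raise MemoryError(f"{name} needs {tier.vram_gb} GB, budget is {self.budget_gb} GB")
        if self.resident == name:
            return  # already loaded, no swap cost
        # Evict before loading so two tiers never occupy VRAM together.
        self.resident = None
        self.resident = name
        self.swaps += 1
        self.swap_seconds += tier.load_seconds
        self.peak_vram_gb = max(self.peak_vram_gb, tier.vram_gb)


tiers = {
    "small": Tier("small", 6.0, 2.0),
    "medium": Tier("medium", 14.0, 5.0),
    "large": Tier("large", 28.0, 12.0),
}
sched = SingleResidencyScheduler(budget_gb=32.0, tiers=tiers)
for request in ["small", "small", "large", "medium", "large"]:
    sched.acquire(request)
print(sched.swaps, sched.swap_seconds, sched.peak_vram_gb)  # 4 31.0 28.0
```

Because only one tier is resident at a time, peak VRAM equals the largest single footprint rather than the sum, and the price paid is the accumulated swap time.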

W4 · Agenda

Multi-Agent Safety Evaluation

Single-model safety evaluation misses fleet-level failure modes that emerge only under agent interaction. Metric: emergent-risk detection rate.
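
One possible way to quantify failure modes that appear only under agent interaction is a set difference over observed failure-mode labels. The source does not define the emergent-risk detection rate, so the formula below (emergent modes divided by all observed modes) and the labels are assumptions for illustration.

```python
from typing import Set


def emergent_modes(single: Set[str], multi: Set[str]) -> Set[str]:
    """Failure modes seen in multi-agent runs but not in single-model evaluation."""
    return multi - single


def emergent_fraction(single: Set[str], multi: Set[str]) -> float:
    """Share of all observed failure modes that are emergent (assumed definition)."""
    observed = single | multi
    if not observed:
        raise ValueError("no failure modes observed")
    return len(emergent_modes(single, multi)) / len(observed)


# Hypothetical failure-mode labels.
single_model = {"prompt_injection", "refusal_bypass"}
multi_agent = {"prompt_injection", "collusion", "cascading_error"}
print(sorted(emergent_modes(single_model, multi_agent)))  # ['cascading_error', 'collusion']
print(emergent_fraction(single_model, multi_agent))       # 0.5
```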

W5 · Design

Sovereign Personal AI Assistant

A local-plus-remote assistant preserves data sovereignty at usable latency, with zero egress of private data. Metric: latency; egress = 0.

§04 Roadmap

Four phases, from pilot to open protocol

The critical path is P1: it converts the infrastructure papers into empirical studies with human-labelled evidence, which both peer-reviewed venues and funders require. P2 is the funding-defining phase, aligning the multi-agent-safety workstream with dedicated research calls.

P0 · Pilot infra · Working papers v1 · complete
P1 · Validation · κ study, SSRN v2 · next
P2 · Scale & safety · Testbed, funding call · planned
P3 · Open protocol · Tooling, follow-on · planned

P0 – measured pilot infrastructure and cost models (this set of working papers).
P1 – human-labelled validation of judgment quality; SSRN preprints v2 with agreement statistics.
P2 – multi-agent taxonomy and instrumented testbed; alignment with a multi-agent-safety research call.
P3 – released protocols (rubric-gate spec, reproducibility packages) and a deployment study.

§05 Publications

Preprints & dissemination

Working papers are versioned and posted as preprints on SSRN, linked to a single ORCID identifier for authorship continuity, with [email protected] as the permanent corresponding-author address and a University of Pittsburgh affiliation on the record. Each follows the same template: abstract, motivation, related work, explicit falsifiable hypotheses, methodology and metrics, results (measured vs. predicted, clearly separated), roadmap, limitations, and references.

Integrity note. Version 1 establishes design and measurement plan; empirical agreement statistics are added in version 2 after the P1 study. Claims are labelled measured or predicted throughout – no result is reported that has not been measured.