Project // CTC AI Operations

Local Three-Tier Agent Workstation

One RTX 5090, three distinct workloads – everyday reasoning, software engineering, and unattended long-horizon jobs – each paired with the agent framework and open-weight model whose architecture fits it, under a strict rule that only one model is resident at a time.

Design intent · target setup, not a measured deployment
Structure · One GPU, three roles: Hermes for the cockpit, OpenCode for engineering, OpenClaw for the background.
Constraint · One model resident: 32 GB holds any single model with headroom, but no two of them at once.
Boundary · No cloud dependency: every tier runs locally; untrusted input is gated behind confirmation.
§01 Design intent

Match the tool to the workload, not the workload to the tool

A single model and a single agent can technically cover research, coding and background automation – but not well. A reasoning model spends tokens on internal deliberation that a bulk refactor does not need; a coding harness that reads the working directory is wasted on drafting an email; a long-horizon orchestrator's checkpointing is overhead in an interactive session. This setup instead assigns each workload to the agent and model built for it, and treats the 32 GB card as a shared resource that is time-multiplexed rather than partitioned.

What follows is a design, not a benchmark. The VRAM figures are model footprints and budgets; the tier assignments are architectural judgments. The single-stream discipline the setup relies on is the same one established in the evaluation work – here it is applied to an everyday operator context.

One model with room for its full context beats three models each starved of cache.
§02 The three tiers

Cockpit, engineering, background

TIER 1 · COGNITIVE COCKPIT
Hermes Agent · Nous Research
Everyday & research · human-in-the-loop [Y/n]
Models: DeepSeek-R1 32B – reasoning (autoregressive CoT); Gemma 4-31B · Qwen3.6-35B-A3B – writing

TIER 2 · DEV ENVIRONMENT
OpenCode · terminal agent
SWE tasks · repo-aware coding, reviews, tests · 256K context
Model: qwen3-coder:30b – MoE, ~3.3B active

TIER 3 · BACKGROUND FACTORY
OpenClaw · long-horizon agent
Autonomous multi-hour jobs · checkpointed, fault-tolerant
Models: Qwen3.6-35B-A3B – allrounder; Devstral 24B – agentic specialist

Three agent frameworks, each paired with the model whose architecture matches its workload – sharing one card in time, never at once.
Tier 1 · Cockpit

Hermes Agent – everyday & research

A self-hosted agent with persistent, SQLite-backed memory and full-text recall over past sessions, autonomous skill creation, and isolated subagents for parallel workstreams; an optional temporal-knowledge-graph memory is available via plugins. Reasoning-heavy research runs on DeepSeek-R1 32B – an RL-trained reasoning model that spends thinking tokens on an autoregressive chain-of-thought and self-corrects before answering (no tree search at inference) – while Gemma 4-31B or Qwen3.6-35B-A3B draft reports and correspondence.
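The SQLite-backed, full-text recall described above can be sketched with Python's standard `sqlite3` module and an FTS5 virtual table. This is a minimal illustration, not Hermes Agent's actual storage layer: the `notes` table, its columns and the helper names are hypothetical, and it assumes the local SQLite build includes FTS5.

```python
import sqlite3


def open_memory(path=":memory:"):
    """Open a session store with an FTS5 index (hypothetical schema)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS notes USING fts5(session, body)"
    )
    return db


def remember(db, session, body):
    """Persist one note; the connection context manager commits on success."""
    with db:
        db.execute(
            "INSERT INTO notes (session, body) VALUES (?, ?)", (session, body)
        )


def recall(db, query, limit=5):
    """Return (session, body) rows matching an FTS5 query, best match first."""
    rows = db.execute(
        "SELECT session, body FROM notes WHERE notes MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

Calling `recall(db, "cache")` after a few `remember` calls returns only the notes whose text contains that term, which is the kind of recall over past sessions the paragraph refers to.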

Because the cockpit ingests untrusted unstructured data – inbound email, web content – sensitive actions require an explicit [Y/n] confirmation and run under container isolation with dropped Linux capabilities, a common mitigation against indirect prompt injection.
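A confirmation gate of this kind might look like the sketch below. It assumes Docker as the container runtime, which the text does not specify; the function names are hypothetical. The flags used (`--cap-drop=ALL`, `--security-opt no-new-privileges`, `--network none`) are standard `docker run` options.

```python
def confirm(action, ask=input):
    """Ask for [Y/n] confirmation; empty input accepts, per the [Y/n] convention."""
    answer = ask(f"Run sensitive action '{action}'? [Y/n] ").strip().lower()
    return answer in ("", "y", "yes")


def sandboxed_command(image, argv):
    """Build a container invocation with all Linux capabilities dropped."""
    return [
        "docker", "run", "--rm",
        "--cap-drop=ALL",                       # drop every Linux capability
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--network", "none",                    # no network unless the action needs it
        image, *argv,
    ]


def run_gated(action, image, argv, ask=input):
    """Return the command to execute, or None if the operator declined."""
    if not confirm(action, ask):
        return None
    return sandboxed_command(image, argv)
```

The sketch follows the [Y/n] notation used here, where pressing Enter accepts. For destructive actions a gate could just as well default to "no".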

Tier 2 · Engineering

OpenCode – software engineering

A terminal-native agentic coding tool that reads the working directory and operates over the whole repository – a full coding agent, not a lightweight syntax checker. It runs qwen3-coder:30b, a Mixture-of-Experts model activating ~3.3B parameters per token: sparse routing gives it the capacity of a ~30B model at roughly the per-token compute of a ~3B one, and the 256K context holds a working set of the codebase so long reviews and automated test runs don't thrash the KV cache.
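To make the sparse-routing idea concrete, here is a toy top-k router in plain Python. It is a generic Mixture-of-Experts sketch with scalar "experts", not qwen3-coder's actual gating code; the expert count and `k` are arbitrary illustrative choices.

```python
import math


def route_top_k(gate_logits, k):
    """Pick the k highest-scoring experts and renormalise their softmax weights."""
    order = sorted(
        range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True
    )
    top = order[:k]
    peak = max(gate_logits[i] for i in top)
    # Numerically stable softmax restricted to the chosen experts.
    exp = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exp.values())
    return {i: e / total for i, e in exp.items()}


def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of only the selected experts' outputs; the rest never run."""
    weights = route_top_k(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())
```

Only the `k` selected experts are evaluated for a given token. That is why the active parameter count, and with it per-token compute, is a small fraction of the total.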

Tier 3 · Background

OpenClaw – autonomous long-horizon jobs

A persistent agent for multi-hour unattended work, with checkpointing so a failed subtask resumes rather than restarting from zero. It runs Qwen3.6-35B-A3B as the allrounder or Devstral 24B as an agentic-coding specialist, and handles the workloads where fault-tolerant orchestration matters more than latency – data ingestion, background scripts, document assembly.
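The resume-instead-of-restart behaviour can be illustrated with a small checkpointed runner. This is a generic sketch rather than OpenClaw's mechanism: the JSON checkpoint file and the `run_job` helper are hypothetical.

```python
import json
import os


def run_job(steps, checkpoint_path):
    """Run named steps in order, skipping any already recorded as done.

    `steps` is a list of (name, callable) pairs. Completed names are
    persisted after every step, so a rerun resumes at the failed one.
    """
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)
    for name, step in steps:
        if name in done:
            continue  # finished in an earlier run
        step()  # may raise; the checkpoint still holds prior progress
        done.append(name)
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(done, fh)
        os.replace(tmp, checkpoint_path)  # atomic swap, no half-written file
    return done
```

If the second of two steps raises, calling `run_job` again with the same checkpoint path skips the first step and retries only the second.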

§03 Fitting 32 GB

One model at a time – by design, not by compromise

VRAM FOOTPRINT · Q4_K_M · 32 GB ceiling
Devstral 24B · 15 GB
qwen3-coder · 18 GB
deepseek-r1 · 19 GB
gemma4 · 19 GB
qwen3.6 · 24 GB
Each model fits the card with headroom (GB shown); even the smallest two together (15 + 18 = 33 GB) exceed 32 GB, so no pair can coexist. Single-residency follows from the arithmetic.
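The arithmetic can be checked directly. The sketch below uses the footprint figures from the chart and counts weights only, so the real margins are tighter once the KV cache is added.

```python
from itertools import combinations

CARD_GB = 32

# Q4_K_M weight footprints in GB, as read from the chart above.
FOOTPRINT_GB = {
    "Devstral 24B": 15,
    "qwen3-coder": 18,
    "deepseek-r1": 19,
    "gemma4": 19,
    "qwen3.6": 24,
}


def headroom(model):
    """GB left on the card when this model alone is resident."""
    return CARD_GB - FOOTPRINT_GB[model]


def pairs_that_fit():
    """Every pair of models whose combined weights stay within the card."""
    return [
        (a, b)
        for a, b in combinations(FOOTPRINT_GB, 2)
        if FOOTPRINT_GB[a] + FOOTPRINT_GB[b] <= CARD_GB
    ]
```

With these figures `pairs_that_fit()` is empty, while `headroom` ranges from 8 GB for the largest model to 17 GB for the smallest.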
RESIDENT MODEL OVER TIME · OLLAMA_NUM_PARALLEL=1 · 32 GB
time → qwen3-coder · 18 GB ⇄ swap ⇄ deepseek-r1 · 19 GB ⇄ swap ⇄ qwen3.6 · 24 GB
Switching tiers swaps the single resident model; further requests queue rather than co-loading a second model. Time-multiplexing is the mechanism, not a workaround.

On a single 32 GB card only one tier's model is resident at a time. Ollama enforces this with OLLAMA_MAX_LOADED_MODELS=1, which caps the number of resident models at one, and OLLAMA_NUM_PARALLEL=1, which holds each model to a single request stream: incoming work queues instead of loading a second model, and switching tiers swaps the resident weights. Because any single model fits with headroom but no two fit together, single-residency is the design, not a limitation worked around – and it is the same discipline that keeps the evaluation pipeline's latency measurements clean.
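One way to apply these two settings is to launch the server from a small wrapper that sets them in the process environment. The sketch below assumes `ollama` is on the PATH and is started with `ollama serve`; the helper names are hypothetical. In Ollama, `OLLAMA_MAX_LOADED_MODELS` limits how many models are loaded at once and `OLLAMA_NUM_PARALLEL` limits concurrent requests per model.

```python
import os
import subprocess


def ollama_env(base=None):
    """Environment for a single-resident, single-stream Ollama server."""
    env = dict(os.environ if base is None else base)
    env["OLLAMA_MAX_LOADED_MODELS"] = "1"  # at most one model in memory
    env["OLLAMA_NUM_PARALLEL"] = "1"       # one request at a time per model
    return env


def serve_command():
    """Command line for the server process."""
    return ["ollama", "serve"]


def launch():
    """Start the server with the settings above (requires Ollama installed)."""
    return subprocess.Popen(serve_command(), env=ollama_env())
```

Nothing runs until `launch()` is called. Setting the same two variables in a systemd unit or shell profile would have the same effect.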

Swap cost

The swap has a cost – and how it's paid

Time-multiplexing is not free: switching tiers unloads the resident weights and loads the next model into VRAM. Cold, that is a multi-second read; warm, it is far less, because the previous model's file still sits in the OS page cache in system RAM – which the Ryzen 7 9850X3D and ample RAM make the common case. Ollama's keep_alive governs how long a model stays resident before eviction. The swap is paid once per tier switch, not per request, so occasional context changes amortize cleanly; what is avoided is paying it inside a working loop.
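A tier switch can be made explicit through Ollama's HTTP API, where a `/api/generate` request without a prompt loads a model and `keep_alive` sets how long it stays resident (`0` evicts it immediately). The sketch below assumes the default local endpoint; the 30-minute window and the helper names are illustrative choices, not values from this design. With one model allowed at a time Ollama evicts on its own, so the explicit unload only makes the swap point visible.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def unload_payload(model):
    """keep_alive=0 asks Ollama to evict the model right after the call."""
    return {"model": model, "keep_alive": 0, "stream": False}


def preload_payload(model, keep_alive="30m"):
    """A request without a prompt loads the model and sets its residency window."""
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def post(payload, url=OLLAMA_URL):
    """Send one payload to a running Ollama server."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def switch_tier(old_model, new_model, send=post):
    """Pay the swap once: evict the old model, then warm the new one."""
    return [send(unload_payload(old_model)), send(preload_payload(new_model))]
```

Calling `switch_tier("qwen3-coder:30b", "deepseek-r1:32b")` before starting a research session moves the load cost out of the first interactive request.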

Context budget

The context budget sets the real ceiling

Weights are only part of the footprint. The KV cache grows linearly with context length, and two tiers push on it hardest: the coding tier carries a 256K window, and background jobs accumulate long histories. Left unmanaged, that cache – not the weights – is what would breach 32 GB. The same discipline as the evaluation pipeline applies: a q8_0 KV cache roughly halves it relative to f16, and a bounded working set keeps it in the headroom. This is the concrete reason single-residency wins – one model with room for its full context beats three models each starved of cache.
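The growth of the cache can be estimated with the usual formula: two tensors (keys and values) per layer, each of size KV heads × head dimension × context length. The layer and head counts below are assumed for illustration and are not taken from any model card. The q8_0 size follows from its block layout of 32 one-byte values plus a two-byte scale.

```python
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values plus one fp16 scale per block
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, cache_type="f16"):
    """Bytes for keys and values across every layer, KV head and position."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * BYTES_PER_ELEMENT[cache_type]


def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# Illustrative dimensions only (assumed, not read from a model card).
DIMS = dict(n_layers=48, n_kv_heads=4, head_dim=128)
FULL_CONTEXT = 256 * 1024
```

Under these assumed dimensions a full 256K window costs 24 GiB at f16 and 12.75 GiB at q8_0, so even the quantized cache leaves little slack next to the weights. That is why the text pairs it with a bounded working set. In Ollama the cache type is selected with `OLLAMA_KV_CACHE_TYPE`, which requires flash attention to be enabled.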

Precision

Precision is a per-tier lever

The footprint chart assumes Q4_K_M (roughly 4.8 bpw), the baseline that fits every model with headroom. That baseline is not a global constant: the reasoning tier can trade some headroom for a Q5/Q6 quant when fidelity matters, while background bulk jobs stay at Q4 for throughput. Because only one model is resident, that headroom is spendable on the tier that needs it – another dividend of not co-loading.
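The trade between precision and headroom can be approximated as parameters × bits per weight / 8. The bits-per-weight figures below are assumed round values for illustration; real GGUF files vary by model and by which tensors are kept at higher precision.

```python
# Approximate bits per weight; assumed round figures, real files vary by model.
BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6}
CARD_GB = 32


def weights_gb(params_billion, quant):
    """Weight footprint in GB: parameters x bits per weight / 8."""
    return params_billion * BPW[quant] / 8


def cache_headroom_gb(params_billion, quant):
    """What is left on the card for KV cache and runtime buffers."""
    return CARD_GB - weights_gb(params_billion, quant)
```

For a 32B model this gives roughly 19 GB at Q4_K_M and 26 GB at Q6_K under the assumed figures. The higher quant still fits alone on the card, but it leaves much less room for context.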

RTX 5090 · 32 GB · Ryzen 7 9850X3D · Ollama · NUM_PARALLEL=1 · MAX_LOADED_MODELS=1
§04 Scenario

A day across the three tiers

S · Situation

A single operator moves across research, coding and long-running background jobs on one 32 GB workstation, with no cloud dependency permitted – proprietary material and sovereignty constraints keep everything local.

T · Task

Assign each workload to the agent and model whose architecture fits it, without ever exceeding the VRAM ceiling or leaving a second model half-loaded.

A · Action

Route everyday reasoning and writing to Hermes (DeepSeek-R1 / Gemma 4), engineering to OpenCode (qwen3-coder MoE), and unattended multi-hour jobs to OpenClaw (Qwen3.6 / Devstral); Ollama swaps the single resident model on each tier switch.

R · Intended result · design goal

Each task runs on the tool matched to it, at full local speed, with no VRAM collision and no data leaving the machine – one coherent single-GPU operator setup rather than three runtimes competing for the same card. This is the target the design is built to reach; the end-to-end ergonomics of daily tier-swapping remain to be measured.
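The routing in this scenario amounts to a lookup table plus a count of how often the resident model changes. The sketch below is a toy planner: the workload labels, short model names and `plan_day` helper are hypothetical, and it models no timing.

```python
# Workload kind -> (agent, model); labels and short names are illustrative.
ROUTES = {
    "research": ("Hermes", "deepseek-r1"),
    "writing": ("Hermes", "gemma4"),
    "coding": ("OpenCode", "qwen3-coder"),
    "background": ("OpenClaw", "qwen3.6"),
}


def plan_day(workloads):
    """Route each workload and count how often the resident model changes."""
    resident, swaps, schedule = None, 0, []
    for kind in workloads:
        agent, model = ROUTES[kind]
        if resident is not None and model != resident:
            swaps += 1  # a different model must replace the resident one
        resident = model
        schedule.append((kind, agent, model))
    return schedule, swaps
```

A day of two research tasks, three coding tasks and one background job yields two swaps. Batching work by tier keeps that count low, which matches the point that the swap is paid per tier switch rather than per request.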

§05 Status

What is designed, and what is still to prove

This is a design, not a benchmark. The VRAM figures are model footprints and budgets; the tier assignments are architectural judgments, not measured outcomes. The single-stream VRAM discipline the setup depends on is already established in the evaluation work; what remains open is the lived ergonomics of swapping across three tiers in daily use – the next thing worth measuring rather than asserting.

The value of the setup is coherence under a hard constraint: three workloads, three matched tools, one card, and an explicit rule – one model resident – that turns a memory limit into a clean operating discipline.

Designed under CTC AI Operations, on the same single-GPU discipline as the evaluation pipeline.