CTCarenskrieger.dev ENDE
Product design // CTC AI Operations

Sovereign Personal AI Assistant

A personal assistant with the polish of a modern chat app, whose inference and chat history stay on owned hardware – reached from a phone anywhere over an encrypted tunnel, with nothing routed through a vendor's servers. And a single architecture that scales from one GPU to a multi-GPU team host by adding hardware, not by re-architecting.

Design & build plan · buildable at the consumer tier today, enterprise-scalable – not a measured deployment
Sovereignty
Inference and history stay local
The model runs on the owner's machine; only a device-discovery list ever touches the vendor's backend.
Mobility
Personal rig, paired with a smartphone
Reach the home GPU over an end-to-end encrypted mesh – no open ports, no cloud relay.
Scale
Consumer → enterprise, one stack
Add a second card or an 80 GB host and the same setup serves a team a 120B-class model.
Whitepaper (PDF)
§01 The idea

A personal assistant that is owned, not rented

Cloud assistants are polished but put every prompt, response and history on someone else's server. Local assistants keep the data local but have been desktop-bound – tethered to the machine the model runs on, useless the moment one leaves the desk. This project closes that gap: a sovereign assistant with the comfort of a modern chat interface – chat history, projects, artifacts, tool use – reached from a phone, while the model and every conversation stay on owned hardware.

It is deliberately a consumer product first, which makes it different from the rest of this portfolio: not evaluation infrastructure and not a professional operator's workstation, but the everyday assistant a privacy-conscious person actually lives in. The design that makes that work at home is the same design that scales to a team – the only variable is hardware.

Remote access without surrendering sovereignty – the model runs on owned hardware, reachable from anywhere.
§02 Remote architecture

The whole point: the model on a phone, over an encrypted tunnel

The differentiator is remote access done without giving up sovereignty. The host runs the model behind an OpenAI-compatible endpoint; the phone is a thin client; the two are bridged by an end-to-end encrypted mesh (WireGuard over Tailscale) that opens no ports and exposes nothing to the public internet. Inference runs on the host, chat history stays on the devices, and the only thing that reaches the vendor's backend is the device-discovery list needed to pair the machines.

CLIENT phone · anywhere ENCRYPTED MESH · WireGuard / Tailscale no open ports · no cloud relay HOME HOST RTX 5090 model @ :1234 OpenAI-compatible inference + history stay here vendor backend · discovery list only
The phone loads and uses a model running on the home GPU as if it were local. Inference and chat history never leave the devices; only the discovery list that pairs them touches the vendor. (First-party mobile client is iPhone/iPad; both ends run the same app.)
Transport

How the tunnel actually works

The mesh is not a metaphor. Devices authenticate once, then discover each other through a coordination server that only ever exchanges public keys – the prompts and responses never pass through it; they flow peer-to-peer. The encryption is standard WireGuard: ChaCha20-Poly1305 for the data, Curve25519 for key exchange. Because the Tailscale layer is embedded in the app (via tsnet), there is no separate VPN to configure, and NAT traversal punches through CGNAT, double-NAT home routers and corporate firewalls without a single forwarded port.

Placement

Why the host does the inference, not the phone

The phone is the client, not the engine, for a concrete reason: a desktop GPU has far higher memory bandwidth than a phone, and decode is bandwidth-bound – every token reads the active weights once through the bus. So the model runs on the 5090 (GDDR7 at ~1.79 TB/s, many times what a phone's memory sustains) while the phone renders tokens as they arrive. Over the tunnel the honest cost is not throughput but time-to-first-token on a long context – which is exactly why phone sessions default to a shorter (~8k) context rather than pretending the round-trip is free.

§03 Honest hardware envelope

What fits 32 GB – and what the enterprise host adds

Sovereignty is only real if the model actually fits the card. On the consumer tier – a single RTX 5090 (32 GB) – the assistant runs a model that fits with headroom: gpt-oss-20b (~13 GB, MXFP4), Qwen3.6-35B-A3B (~24 GB), or Gemma 4-31B (~19 GB). A 120B-class model does not belong here: gpt-oss-120b is ~60 GB even in MXFP4 and needs an 80 GB card or a two-5090 host – trying to force it onto 32 GB means offloading half the weights to system RAM and paying the bandwidth cliff. So the 120B is not a consumer claim; it is exactly what the enterprise tier adds when the hardware arrives.

CONSUMER · 1× RTX 5090 · 32 GB 32 GB 13gpt-oss-20b 19gemma4-31b 24qwen3.6 1–2 users · fits with headroom ENTERPRISE · 2× 5090 (64 GB) / 80 GB host 64 GB 32 GB · single card 60gpt-oss-120b team via LM Link · 120B exceeds a single 32 GB card
Consumer: any of these run on one card with room for KV cache. Enterprise: adding a second 5090 or an 80 GB host lets the same stack serve a ~60 GB 120B model – which is precisely why it cannot sit on the single 32 GB card.
Budget

The consumer VRAM budget – and why enterprise is about concurrency

The consumer tier obeys the same discipline as the evaluation pipeline. Qwen3.6-35B-A3B at ~24 GB leaves ~6–7 GB for a q8_0 KV cache and activations on the 32 GB card – comfortable for personal single-stream use, no spill.

The enterprise tier changes the sizing question from model size to concurrency: serving a team means many simultaneous requests, and continuous batching gives each its own KV-cache slice, so the 64–80 GB host is sized for parallel streams, not merely for the larger 120B weights. That is the real reason a team needs more than one card – not the model alone, but everyone using it at once.

§04 The stack

A polished chat UX on a model-agnostic, tool-compatible base

Because the server is OpenAI-compatible, the assistant is not locked to one model or one client. The backend (Ollama, vLLM or LM Studio) serves any open-weight model on the standard endpoint; a modern chat-UX layer supplies chat history, projects, artifacts and tool use; and the remote layer wraps the whole thing for the phone. The same endpoint means existing tools – agentic CLIs like OpenCode – keep working unchanged, locally or remotely.

MODEL BACKENDOllama · vLLM · LM Studio – any open-weight model OPENAI-COMPATIBLE APIlocalhost:1234 – one endpoint, model-agnostic UX LAYERchat history · projects · artifacts · tool-use REMOTE LAYER Encrypted mesh phone ↔ host same endpoint existing agentic tools (e.g. OpenCode) target the same endpoint – no reconfiguration
One OpenAI-compatible endpoint underneath a polished chat UX, wrapped by the remote layer. Swap the model, keep the interface and every tool that already targets localhost:1234.
Boundaries

Threat model: what the tunnel does and doesn't protect

The tunnel earns the sovereignty claim, but not unconditionally. It protects data in transit (end-to-end encrypted) and keeps inference and history on owned hardware; what it does not remove is trust in the account system that pairs the devices. It is a privacy-and-convenience layer, not a full threat model – a home rig on a local network has a different exposure than one reachable from anywhere. Stated plainly, the property is that data stays on owned machines, not that it is unconditionally secure against every adversary – and being precise about that difference is part of doing it honestly.

§05 Practical use case

From a private phone assistant to a team's sovereign AI

SSituation

A privacy-conscious professional – and, later, their small firm – wants a capable assistant with the comfort of a modern chat interface, including access from a phone. But confidential material cannot go to a third-party cloud, and desktop-bound local setups are useless away from the desk.

TTask

Stand up a sovereign assistant that is genuinely usable day to day, reachable from a phone anywhere, and runs a model that actually fits the hardware – with a clean path to a team that doesn't require rebuilding it.

AAction

On the home RTX 5090, serve a 32 GB-fitting model (gpt-oss-20b or Qwen3.6-35B-A3B) through the OpenAI-compatible endpoint; wrap it in a polished chat UX (history, projects, tool use); reach it from the phone over the encrypted mesh – no open ports, inference and history on owned hardware.

RIntended result · practical impact

A personal assistant the owner uses from their phone with nothing leaving their control – and a one-step enterprise path: add a second 5090 (or an 80 GB host) and the same stack serves a 120B-class model to the whole team over the same encrypted mesh, still with no public exposure. Consumer build today; enterprise by adding hardware, not by re-architecting.

§06 Status

Buildable now at the consumer tier, honest about the rest

This is a build plan, not a measured deployment. The consumer tier is buildable today with real, fitting models; the enterprise tier is a hardware step, not a rewrite. The remote layer's current limits are stated plainly rather than glossed: the first-party mobile client is iPhone/iPad, both ends run the same app, pairing is account-gated, and phone sessions default to a shorter context. None of that changes the core property – inference and history stay on owned hardware – which is the whole reason to build locally.

The design turns a familiar tension into a single answer: the polish of a cloud assistant with the sovereignty of local inference – carried in a pocket, and able to grow from one person to a team by adding a card rather than surrendering the data.

Designed under CTC AI Operations, on the same local-inference discipline as the evaluation and workstation projects it sits beside.