A personal assistant that is owned, not rented
Cloud assistants are polished but put every prompt, response and history on someone else's server. Local assistants keep the data local but have been desktop-bound – tethered to the machine the model runs on, useless the moment one leaves the desk. This project closes that gap: a sovereign assistant with the comfort of a modern chat interface – chat history, projects, artifacts, tool use – reached from a phone, while the model and every conversation stay on owned hardware.
It is deliberately a consumer product first, which makes it different from the rest of this portfolio: not evaluation infrastructure and not a professional operator's workstation, but the everyday assistant a privacy-conscious person actually lives in. The design that makes that work at home is the same design that scales to a team – the only variable is hardware.
The whole point: the model on a phone, over an encrypted tunnel
The differentiator is remote access done without giving up sovereignty. The host runs the model behind an OpenAI-compatible endpoint; the phone is a thin client; the two are bridged by an end-to-end encrypted mesh (WireGuard over Tailscale) that opens no ports and exposes nothing to the public internet. Inference runs on the host, chat history stays on the devices, and the only thing that reaches the vendor's backend is the device-discovery list needed to pair the machines.
How the tunnel actually works
The mesh is not a metaphor. Devices authenticate once, then discover each other through a coordination server that only ever exchanges public keys – the prompts and responses never pass through it; they flow peer-to-peer. The encryption is standard WireGuard: ChaCha20-Poly1305 for the data, Curve25519 for key exchange. Because the Tailscale layer is embedded in the app (via tsnet), there is no separate VPN to configure, and NAT traversal punches through CGNAT, double-NAT home routers and corporate firewalls without a single forwarded port.
Why the host does the inference, not the phone
The phone is the client, not the engine, for a concrete reason: a desktop GPU has far higher memory bandwidth than a phone, and decode is bandwidth-bound – every token reads the active weights once through the bus. So the model runs on the 5090 (GDDR7 at ~1.79 TB/s, many times what a phone's memory sustains) while the phone renders tokens as they arrive. Over the tunnel the honest cost is not throughput but time-to-first-token on a long context – which is exactly why phone sessions default to a shorter (~8k) context rather than pretending the round-trip is free.
What fits 32 GB – and what the enterprise host adds
Sovereignty is only real if the model actually fits the card. On the consumer tier – a single RTX 5090 (32 GB) – the assistant runs a model that fits with headroom: gpt-oss-20b (~13 GB, MXFP4), Qwen3.6-35B-A3B (~24 GB), or Gemma 4-31B (~19 GB). A 120B-class model does not belong here: gpt-oss-120b is ~60 GB even in MXFP4 and needs an 80 GB card or a two-5090 host – trying to force it onto 32 GB means offloading half the weights to system RAM and paying the bandwidth cliff. So the 120B is not a consumer claim; it is exactly what the enterprise tier adds when the hardware arrives.
The consumer VRAM budget – and why enterprise is about concurrency
The consumer tier obeys the same discipline as the evaluation pipeline. Qwen3.6-35B-A3B at ~24 GB leaves ~6–7 GB for a q8_0 KV cache and activations on the 32 GB card – comfortable for personal single-stream use, no spill.
The enterprise tier changes the sizing question from model size to concurrency: serving a team means many simultaneous requests, and continuous batching gives each its own KV-cache slice, so the 64–80 GB host is sized for parallel streams, not merely for the larger 120B weights. That is the real reason a team needs more than one card – not the model alone, but everyone using it at once.
A polished chat UX on a model-agnostic, tool-compatible base
Because the server is OpenAI-compatible, the assistant is not locked to one model or one client. The backend (Ollama, vLLM or LM Studio) serves any open-weight model on the standard endpoint; a modern chat-UX layer supplies chat history, projects, artifacts and tool use; and the remote layer wraps the whole thing for the phone. The same endpoint means existing tools – agentic CLIs like OpenCode – keep working unchanged, locally or remotely.
localhost:1234.Threat model: what the tunnel does and doesn't protect
The tunnel earns the sovereignty claim, but not unconditionally. It protects data in transit (end-to-end encrypted) and keeps inference and history on owned hardware; what it does not remove is trust in the account system that pairs the devices. It is a privacy-and-convenience layer, not a full threat model – a home rig on a local network has a different exposure than one reachable from anywhere. Stated plainly, the property is that data stays on owned machines, not that it is unconditionally secure against every adversary – and being precise about that difference is part of doing it honestly.
From a private phone assistant to a team's sovereign AI
A privacy-conscious professional – and, later, their small firm – wants a capable assistant with the comfort of a modern chat interface, including access from a phone. But confidential material cannot go to a third-party cloud, and desktop-bound local setups are useless away from the desk.
Stand up a sovereign assistant that is genuinely usable day to day, reachable from a phone anywhere, and runs a model that actually fits the hardware – with a clean path to a team that doesn't require rebuilding it.
On the home RTX 5090, serve a 32 GB-fitting model (gpt-oss-20b or Qwen3.6-35B-A3B) through the OpenAI-compatible endpoint; wrap it in a polished chat UX (history, projects, tool use); reach it from the phone over the encrypted mesh – no open ports, inference and history on owned hardware.
A personal assistant the owner uses from their phone with nothing leaving their control – and a one-step enterprise path: add a second 5090 (or an 80 GB host) and the same stack serves a 120B-class model to the whole team over the same encrypted mesh, still with no public exposure. Consumer build today; enterprise by adding hardware, not by re-architecting.
Buildable now at the consumer tier, honest about the rest
The design turns a familiar tension into a single answer: the polish of a cloud assistant with the sovereignty of local inference – carried in a pocket, and able to grow from one person to a team by adding a card rather than surrendering the data.
Designed under CTC AI Operations, on the same local-inference discipline as the evaluation and workstation projects it sits beside.