Live · Self-initiated · production personal infrastructure · 2026

Self-managed VPS running a containerised, Tailscale-gated AI assistant (Donna) with multi-model routing, cron-driven briefings, and a documented cost-forensic and hardening cycle

Hermes is a self-managed Linux VPS (Hostinger KVM, Singapore) that runs "Donna", a persistent AI assistant, as a 24/7 Telegram agent. The stack uses Docker containers, a NousResearch Hermes Agent gateway, multi-model LLM routing via LiteLLM and OpenRouter, system cron for daily briefings and data-sync triggers, and Tailscale for zero-public-surface SSH access. A forensic cost audit, an architectural decision process (including an Opus-arbitrated head-to-head comparison against an alternative runtime), and a multi-phase hardening plan are all documented in the project vault.

Role

Solo AI Infrastructure Engineer — provisioning, containerisation, model routing, security hardening, cost forensics, cron orchestration, observability design

Stack

Hostinger KVM VPS (Singapore)
Docker / Docker Compose
NousResearch Hermes Agent (gateway + agent containers)
LiteLLM proxy (Postgres-backed, port 53165)
OpenRouter API (multi-provider model routing)
Tailscale (WireGuard-based private networking)
Cloudflare Tunnel (SSH exposure, no public API surface)
Traefik (reverse proxy)
System cron
Python 3 (prompt-builder scripts)
Bash (cron orchestration scripts)
DeepSeek V3/V4 Flash (auxiliary model)
Gemini 3.5 Flash (main conversation model)
Claude Haiku 4.5 (target post-migration model)
Supabase REST API (GLOS state reads)
Telegram Bot API (long-poll outbound)

By the numbers

~$5.09/day (~$159/mo) peak; target $0.30–0.50/day after hardening

LLM API spend at peak vs hardened target

~200 LLM API calls/day at peak burn

Daily LLM call volume (peak)

~$9/mo VPS hosting (Hostinger KVM)

VPS infrastructure cost

28 skill packages deployed

Hermes skill packages installed on container

5 cost incidents in 12 days (peak instability period)

Cost incidents before architectural review

~48% prompt cache hit rate (May 11 observed)

LiteLLM prompt cache hit rate at audit

Every 2 hours, 06:30–22:30 Manila time

Jefit sync cron cadence

The brief

I needed an always-on AI assistant — “Donna” — that responds to Telegram at any hour, fires morning briefings on a schedule, and reacts to external data events (workout logs, daily habit completions). Running this on my laptop meant silent failures the moment the lid closed. I provisioned a self-managed VPS and built the full runtime from scratch: container orchestration, model routing, observability, and security hardening.

What I built

A production Linux VPS (Hostinger KVM2, Singapore datacenter) running a multi-container Docker stack: a Hermes Agent gateway container, an agent container (the AI brain), an auth-proxy container, a LiteLLM proxy container with its own Postgres 16 sidecar, Traefik as the reverse proxy, and a Cloudflare Tunnel daemon for SSH exposure. The Telegram integration is outbound long-poll from the Hermes gateway — no inbound ports opened on the public interface.

A companion script pair (hermes-jefit-cron.sh + hermes-jefit-prompt.py) handles the workout-sync automation loop. A system cron entry fires every two hours during waking hours (06:30–22:30 Manila) and POSTs to the GLOS Vercel app with a Bearer token to pull new workout records. If a new workout is detected, the prompt script builds a prompt with exercise stats (volume, reps, PR count, XP delta) and sends it to the agent’s local loopback API, so the AI congratulates in its own voice without GLOS ever touching Telegram directly. The loop is idempotent by design: GLOS flags a workout as “new” only once, so a repeated run sends nothing.
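The prompt-building step can be sketched roughly as follows. This is a minimal illustration, not the actual hermes-jefit-prompt.py: the function name build_workout_prompt and the payload keys (is_new, volume_kg, total_reps, pr_count, xp_delta) are assumptions made for the example, since the real GLOS response shape is not shown here.

```python
from typing import Optional


def build_workout_prompt(sync_result: dict) -> Optional[str]:
    """Return a prompt for the agent, or None when there is nothing new.

    All keys read from sync_result are hypothetical placeholders.
    """
    # GLOS flags a workout as new only once, so a repeated cron run sees
    # is_new == False and produces no prompt. That keeps the loop idempotent.
    if not sync_result.get("is_new"):
        return None

    stats = (
        f"volume {sync_result['volume_kg']} kg, "
        f"{sync_result['total_reps']} reps, "
        f"{sync_result['pr_count']} PRs, "
        f"XP {sync_result['xp_delta']:+d}"
    )
    return (
        "A new workout was just logged. Congratulate the user in your own "
        f"voice, using these stats: {stats}."
    )
```

Returning None rather than an empty prompt lets the calling cron script skip the loopback call entirely, so a run with no new workout costs no LLM tokens.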

How it’s built

Networking and access hardening. The public IPv4 address is firewalled, and all SSH access is Tailscale-only (WireGuard-encrypted mesh). The Cloudflare Tunnel exposes SSH through the tunnel; it does not expose the agent API. The agent itself has no public inbound surface because it polls Telegram outbound. A web cockpit (Hermes Workspace, outsourc-e/hermes-workspace) is served via tailscale serve --http rather than Docker port-binding, which avoids conflicts between Docker's published ports and ufw rules and keeps the cockpit reachable only on the private Tailnet.

Container architecture. The full stack is defined in a Docker Compose file under /docker/hermes-agent-7buh/. The agent’s data volume is separate from the workspace eval volume, so the cockpit cannot write the agent’s config. The LiteLLM proxy runs as its own compose service (litellm-jtsq) with a dedicated Postgres instance for spend tracking and virtual key management.

Multi-model routing. LiteLLM acts as the internal control plane. Named model slots (cos-agent, aux-flash, aux-vision, broker-agent, content-agent) map to different providers and models. At peak configuration: main conversation on Gemini 3.5 Flash via OpenRouter, all auxiliary slots (session search, compression, title generation, approval, skills hub, web extract) on DeepSeek V3 via OpenRouter, with a Sonnet route reserved for sensitive data (Lane Holdings, client data, content drafting). An OpenRouter guardrail allowlist enforces model scope — any call to an unlisted model returns a 404, which stopped a class of runaway retry storms.
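The slot-plus-allowlist idea can be illustrated with a small sketch. It emulates the behaviour locally for clarity; in the real stack the slots live in LiteLLM config and the allowlist is enforced on the OpenRouter side. The slot names come from the text above, while the model identifiers and the route function are placeholders invented for the example.

```python
from typing import Dict, FrozenSet, Tuple

# Slot names are from the write-up; the model identifiers are placeholders,
# not real OpenRouter model strings.
MODEL_SLOTS: Dict[str, str] = {
    "cos-agent": "placeholder/main-conversation",
    "aux-flash": "placeholder/auxiliary",
    "aux-vision": "placeholder/auxiliary",
}

# Models the provider key is allowed to call (placeholder values).
ALLOWLIST: FrozenSet[str] = frozenset(MODEL_SLOTS.values())


def route(
    slot: str,
    slots: Dict[str, str] = MODEL_SLOTS,
    allowlist: FrozenSet[str] = ALLOWLIST,
) -> Tuple[int, str]:
    """Resolve a slot to a model and emulate the allowlist check.

    Returns (200, model) when allowed, or (404, reason) otherwise.
    """
    model = slots.get(slot)
    if model is None:
        return 404, f"unknown slot: {slot}"
    if model not in allowlist:
        # Fail closed: an unlisted model is rejected rather than retried.
        return 404, f"model not on allowlist: {model}"
    return 200, model
```

Because the allowlist is held separately from the slot table, a slot that drifts to an unlisted model is rejected even though the slot table itself still looks valid.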

Spend control and observability. A $1.00/day hard cap is enforced at the OpenRouter key level (not in prose config the agent can rewrite). A Python balance-check script runs daily via cron and emails when the DeepSeek prepaid wallet drops below a threshold. A bleed-detector script runs every 15 minutes, counts recent API calls in the agent log, and emails an alert if the rate exceeds a threshold — a pure-bash watchdog that costs zero LLM tokens. The architecture separates spend enforcement into three independent layers: per-key budget at the provider dashboard, LiteLLM virtual key limits, and the bleed-detector cron — so no single config change can disable all guardrails simultaneously.
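The bleed-detector logic is simple enough to sketch. The production watchdog is described as pure Bash; the Python below only illustrates the counting rule, and it assumes a log format (an ISO timestamp at the start of each line, with the marker text "LLM call") and a threshold that are hypothetical, not taken from the real agent log.

```python
from datetime import datetime, timedelta
from typing import Iterable


def count_recent_calls(
    log_lines: Iterable[str],
    now: datetime,
    window_minutes: int = 15,
    marker: str = "LLM call",  # hypothetical marker text
) -> int:
    """Count log lines containing the marker within the last window."""
    cutoff = now - timedelta(minutes=window_minutes)
    count = 0
    for line in log_lines:
        if marker not in line:
            continue
        try:
            # Assumed format: "2026-05-11T10:05:00 ..." at the line start.
            stamp = datetime.strptime(line[:19], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # skip lines without a parseable timestamp
        if cutoff <= stamp <= now:
            count += 1
    return count


def is_bleeding(
    log_lines: Iterable[str],
    now: datetime,
    threshold: int,
    window_minutes: int = 15,
) -> bool:
    """True when the recent call count exceeds the alert threshold."""
    return count_recent_calls(log_lines, now, window_minutes) > threshold
```

The check reads only the log file and a clock, so it keeps working (and keeps costing nothing) even if the model routing layer is misconfigured.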

The self-reconfiguration incident and fix. A critical failure occurred when the agent rewrote its own model config, switching to a model with thinking mode on by default. Token consumption ballooned to an estimated 18.5M tokens/day. The fix was architectural rather than a config tweak: agent config files were moved out of every path the agent can write to, and the rule “auto-self-improving config: permanently off” was codified in both the agent's SOUL.md and the project vault. Config changes are human-applied only.

Cost forensics. I conducted a full forensic audit of 37 aggregated OpenRouter API records, gateway logs, and the LiteLLM routing config. The audit found that 6 of 7 “confirmed leaks” from a prior session’s analysis were either historical, already fixed, or never existed — the real cost driver was architectural overhead: a 90-turn max_iterations ceiling, auto-compression firing at 30% context threshold (generating up to 9 LLM calls per compression event), and prompt-context history bloat (30–40K tokens per call, not from system prompt size but from conversation history accumulation). This informed a targeted fix plan rather than a speculative rebuild.
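A back-of-envelope calculation shows why history bloat dominates. It uses only figures quoted in this write-up (about 200 calls/day at peak, 30–40K tokens of history per call, up to 9 calls per compression event). It assumes every call carries the full history and ignores output tokens and prompt caching, so treat it as an order-of-magnitude sketch rather than a reconstruction of the audit.

```python
# Figures quoted in this write-up (approximate, and not necessarily
# measured on the same day).
CALLS_PER_DAY = 200
HISTORY_TOKENS_PER_CALL = (30_000, 40_000)  # low and high estimates
MAX_CALLS_PER_COMPRESSION = 9


def daily_prompt_tokens(calls_per_day: int, tokens_per_call: int) -> int:
    """Prompt tokens sent per day if every call carries the same history."""
    return calls_per_day * tokens_per_call


low = daily_prompt_tokens(CALLS_PER_DAY, HISTORY_TOKENS_PER_CALL[0])
high = daily_prompt_tokens(CALLS_PER_DAY, HISTORY_TOKENS_PER_CALL[1])
print(f"{low:,} to {high:,} prompt tokens/day")  # 6,000,000 to 8,000,000

# Assumed worst case for one compression event: each of the extra calls
# re-sends the full history.
compression_tokens = MAX_CALLS_PER_COMPRESSION * HISTORY_TOKENS_PER_CALL[1]
print(f"up to {compression_tokens:,} tokens per compression event")  # 360,000
```

Under these assumptions, trimming history per call or raising the compression threshold cuts spend far more than shrinking the system prompt would.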

Architectural decision process. When evaluating whether to migrate to an alternative runtime (ClaudeClaw/Claude Code headless on VPS), I produced a structured decision brief, had it reviewed by a stronger model (Opus 4.7), and logged the verdict and rationale in the vault. The decision process itself — brief, verdict, verification items, evaluation criteria with pass thresholds, revised execution sequence — is fully documented and reproducible. The revised plan ran a parallel shadow install (separate Telegram bot token, separate Docker directory, no cron migration until interactive chat proved stable) rather than a big-bang migration.

Vault sync for agent context. The agent reads a read-only mirror of my markdown vault (canonical operating context, handoff notes, kaizen audit trail) at /opt/data/vault/, synced via rclone from the Mac. The agent cannot write to this path. Write-back is scoped to specific allowed paths (logs, kaizen trail, heartbeat file) via rclone filter rules — so the agent can record observations without clobbering human-authored strategy documents.
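The write-back scoping rule amounts to a path allowlist. The real enforcement uses rclone filter rules; the sketch below only illustrates the same decision in Python. The directory and file names (logs, kaizen, heartbeat.md) and the function is_write_allowed are hypothetical stand-ins, since the actual vault layout is not shown here.

```python
from pathlib import PurePosixPath

# Hypothetical writable locations, relative to the vault root.
WRITABLE_DIRS = (PurePosixPath("logs"), PurePosixPath("kaizen"))
WRITABLE_FILES = (PurePosixPath("heartbeat.md"),)


def is_write_allowed(relative_path: str) -> bool:
    """True only for paths inside an allowed directory or an allowed file."""
    path = PurePosixPath(relative_path)
    # Reject absolute paths and any attempt to climb out with "..".
    if path.is_absolute() or ".." in path.parts:
        return False
    if path in WRITABLE_FILES:
        return True
    # A path is inside an allowed directory if that directory is a parent.
    return any(allowed in path.parents for allowed in WRITABLE_DIRS)
```

Anything not explicitly listed is denied by default, which is what keeps human-authored strategy documents out of reach.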

Daily cadence via system cron. Morning briefings (07:00 Manila), midday energy pings (12:30), evening reflections (18:00), and weekly review prompts all run via system cron entries that POST to the agent’s loopback API. The agent reads live GLOS state (Supabase get_full_state() RPC) before generating each briefing — no stale cached state.
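A cron-triggered call to the loopback API might be built along these lines. The URL, port, path, and JSON body shape are all placeholders invented for the example; the real agent endpoint is not documented here. The sketch only constructs the request and does not send it.

```python
import json
import urllib.request

# Hypothetical loopback endpoint; the real port and path are not shown
# in this write-up.
AGENT_URL = "http://127.0.0.1:8080/hypothetical/message"


def build_briefing_request(kind: str) -> urllib.request.Request:
    """Build (but do not send) a POST asking the agent for a briefing.

    The agent is expected to fetch live state itself, so the body carries
    only the kind of briefing requested.
    """
    body = json.dumps({"prompt": f"Run the {kind} briefing."}).encode("utf-8")
    return urllib.request.Request(
        AGENT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# A cron wrapper would pass the result to urllib.request.urlopen(...).
request = build_briefing_request("morning")
```

Keeping state out of the request body matches the design described above: the agent reads live GLOS state at generation time, so the cron job cannot hand it a stale snapshot.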

Why it matters

This is the infrastructure layer underneath a personal AI system that actually ships. Every architectural decision (Tailscale over a conventional VPN, guardrail allowlists over prose spend caps, read-only vault mounts, idempotent cron scripts, separated config authority) was made in response to a specific failure mode encountered in production. The forensic audit, the incident post-mortems, and the architectural decision records are evidence of how I think about AI infrastructure: not as a demo, but as a system that has to stay alive and stay within budget at 3 AM without anyone watching it.

AI Infrastructure · DevOps · LLM Ops · Cost Optimisation · Security Hardening · VPS
