How a personal AI operating system manages model costs, TOS compliance, and graceful degradation — without going broke or getting banned.
Running an autonomous AI agent (Ralph) that codes, researches, and operates 24/7 creates a real tension: the best models cost $15-75 per million tokens, but the tasks Ralph handles don't all need frontier intelligence. Most coding tasks succeed with a 7B parameter model. Only the hard ones need Claude Opus.
We needed a routing system that tries the cheapest model first, escalates only when necessary, respects budget caps, and stays compliant with provider terms of service — all automatically.
Every prompt Ralph generates flows through a cascade. Each tier gets a fixed number of attempts before the system escalates to the next one.
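Stripped to its core, the cascade is a nested loop: tiers in cost order, a fixed attempt budget per tier. A minimal sketch, where the tier names and attempt counts are illustrative stand-ins rather than the production config:

```python
# Tiers ordered cheapest-first; each gets a fixed attempt budget
# before the cascade escalates. Names/counts are illustrative.
CASCADE = [
    {"model": "ollama/qwen:7b",          "attempts": 2},  # free, local
    {"model": "openrouter/qwen-32b",     "attempts": 2},  # cheap cloud
    {"model": "anthropic/claude-sonnet", "attempts": 1},  # paid API
]

def run_with_cascade(prompt, call_model):
    """Try each tier its allotted number of times, then escalate."""
    for tier in CASCADE:
        for _ in range(tier["attempts"]):
            try:
                return call_model(tier["model"], prompt)
            except RuntimeError:
                continue  # failed attempt: retry this tier, then move on
    raise RuntimeError("all tiers exhausted")  # circuit-breaker territory
```

The escalation logic lives entirely in the loop structure; everything else in the system (timeouts, health caching, budget gates) decides how fast an attempt is allowed to fail.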
We stress-tested the cascade against our actual hardware — a 4-core AMD EPYC VPS with 15GB RAM, no GPU, no swap — and found three critical issues:
Qwen 7B runs at ~5.3 tokens/second on CPU-only inference. A typical coding prompt takes 73+ seconds to complete. Our original 30-second timeout guaranteed Ollama would fail on every real task, wasting 3 retries (90 seconds) before escalating. Fix: timeout increased to 180 seconds, retries reduced from 3 to 2.
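The timeout arithmetic is worth making explicit. Assuming a typical coding reply of roughly 390 output tokens (an estimate back-calculated from the 73-second figure, not a measured value):

```python
TOKENS_PER_SEC = 5.3        # measured Qwen 7B CPU-only throughput
TYPICAL_REPLY_TOKENS = 390  # assumption: rough size of a coding reply

est_seconds = TYPICAL_REPLY_TOKENS / TOKENS_PER_SEC  # ~73.6s

assert est_seconds > 30    # old 30s timeout: every real task times out
assert est_seconds < 180   # new 180s timeout: leaves headroom
```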
When Ollama crashes or LiteLLM goes down, the old system retried against the dead service multiple times before escalating. With 3 retries at 30 seconds each, that's 90 seconds of guaranteed waste. Fix: health caching — when a service fails, it's marked "down" for 5 minutes. Subsequent tasks skip it instantly (0 seconds wasted) and cascade directly to cloud models.
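A health cache like this needs only one timestamp per service. A minimal sketch, assuming an in-memory map (a production version might persist this to disk so it survives restarts):

```python
import time

HEALTH_TTL = 300   # seconds a failed service stays marked "down"
_down_until = {}   # service name -> unix time when it may be retried

def mark_down(service, now=None):
    """Record a failure: skip this service for the next HEALTH_TTL seconds."""
    _down_until[service] = (now or time.time()) + HEALTH_TTL

def is_down(service, now=None):
    """True while the service is inside its 5-minute penalty window."""
    return (now or time.time()) < _down_until.get(service, 0)
```

The cascade checks `is_down()` before each tier; a cached "down" verdict costs a dictionary lookup instead of 90 seconds of doomed retries.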
When all tiers fail, the old system paused Ralph and sent a generic "all models failed" alert. Unhelpful. Fix: auto-diagnosis — when the circuit breaker triggers, the system checks each component (Ollama process, LiteLLM health, internet connectivity, budget status, Claude Max gates) and reports exactly why everything failed in the Telegram alert.
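Auto-diagnosis is essentially a table of named probes run in sequence. A sketch, where the probe callables are stand-ins for the real checks (process liveness, HTTP health endpoints, budget counters, gate files):

```python
def diagnose(checks):
    """Run each named health probe; return a human-readable failure report.

    `checks` maps a component name (Ollama, LiteLLM, internet, budget,
    Claude Max gates) to a zero-arg callable returning True if healthy.
    """
    failures = []
    for name, probe in checks.items():
        try:
            ok = probe()
        except Exception as exc:
            ok, name = False, f"{name} ({exc})"  # probe itself blew up
        if not ok:
            failures.append(name)
    if failures:
        return "Circuit breaker tripped. Failing: " + ", ".join(failures)
    return "All components healthy; failure was transient."
```

The returned string is what lands in the Telegram alert, so a 3 a.m. page reads "Failing: Ollama, internet" instead of "all models failed."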
A task that previously burned 90 seconds on a dead Ollama before escalating now skips to cloud models in under 1 second. Cost of Ollama being down for a full day: ~$2-5 extra on OpenRouter. Annoying, not catastrophic.
The cascade above governs automated usage — Ralph running overnight, cron jobs, background processing. But Forge also has interactive features where a human is physically typing and clicking. These are a different category entirely.
If a human physically initiates the request — types a message, clicks a button, presses Enter — it's human-initiated usage. Claude Max is designed for this. No cascade needed, no rate gates, no guilt.
- Running `claude` from a terminal you opened is the most straightforward individual usage possible.
- Sending `/skills` or `/status` to the bot: you typed it, you sent it. Human in the loop by definition.

Interactive features don't go through the 7-tier cascade at all. They use a simpler path:
Human input goes straight to Claude Max (`claude -p`) or the Anthropic API. No retries needed — if Claude is down, the human sees the error and can retry manually. No budget gates needed — the human IS the gate. No compliance concern — this is exactly how individual subscriptions are meant to be used.
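The interactive lane can be sketched in a few lines. This assumes the Claude Code CLI's print mode (`claude -p`) and injects the process runner so the logic is testable without the CLI installed:

```python
import subprocess

def handle_interactive(prompt, run=subprocess.run):
    """Human-initiated request: call Claude directly. No cascade, no
    retries, no budget gate; errors surface to the human, who retries."""
    result = run(["claude", "-p", prompt], capture_output=True, text=True)
    if result.returncode != 0:
        return f"error: {result.stderr}"  # shown to the human as-is
    return result.stdout
```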
| Guard | Limit | Enforcement |
|---|---|---|
| Daily API spend | $10.00 | Hard stop — refuses API calls when hit |
| Per-task spend | $3.00 | Accumulates across retries within one task |
| Monthly API spend | $100.00 | Alert at 80%, hard stop at 100% |
| Anthropic API budget | $3.20-5.23/mo | Randomized monthly, spread across days |
| Claude Max daily calls (automated) | 15/day | Counter file in /tmp, resets at midnight UTC |
| Claude Max hourly calls (automated) | 4/hour | Counter file, resets each hour |
| Claude Max cooldown (automated) | 5 min between calls | Timestamp file, checked before each call |
Note: Budget guardrails apply only to automated (Ralph/cascade) usage. Interactive human-initiated usage through the dashboard or terminal is ungated — the human is the gate.
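The spend caps in the table reduce to a single pre-call check. A sketch with the limits copied from the table; state handling is simplified to an in-memory dict, where the real system uses counter files:

```python
# Spend limits from the guardrail table (USD).
LIMITS = {"daily": 10.00, "per_task": 3.00, "monthly": 100.00}

def check_budget(spent, call_cost):
    """Raise before any automated API call that would breach a cap.

    `spent` tracks accumulated spend: {"daily": .., "task": .., "monthly": ..}.
    The per-task counter accumulates across retries within one task.
    """
    if spent["daily"] + call_cost > LIMITS["daily"]:
        raise RuntimeError("daily cap hit: refusing API call")
    if spent["task"] + call_cost > LIMITS["per_task"]:
        raise RuntimeError("per-task cap hit across retries")
    if spent["monthly"] + call_cost > LIMITS["monthly"]:
        raise RuntimeError("monthly cap hit")
    if spent["monthly"] + call_cost > 0.8 * LIMITS["monthly"]:
        print("alert: 80% of monthly budget used")  # Telegram in production
```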
When sending prompts to cloud models via OpenRouter, we enforce two rules:
- `data_collection: "deny"` — OpenRouter providers cannot store or train on our data
- `allow_fallbacks: false` — If the primary provider is down, the request fails rather than routing to an unknown provider in an unknown jurisdiction

PII detection runs before any external model call. If Presidio detects names, emails, or other PII in the prompt, the request is blocked from leaving the local machine entirely.
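In OpenRouter's chat-completions API, both rules live in the request's `provider` object. A sketch of the outbound policy, with `detect_pii` standing in for the real Presidio analyzer call:

```python
def build_openrouter_request(prompt, model, detect_pii):
    """Build an OpenRouter payload, or refuse if the prompt contains PII."""
    if detect_pii(prompt):
        raise ValueError("PII detected: prompt blocked from leaving machine")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "provider": {
            "data_collection": "deny",  # no storage or training on our data
            "allow_fallbacks": False,   # fail rather than route elsewhere
        },
    }
```

The PII gate runs first by construction: a blocked prompt never produces a payload, so there is no code path where it reaches the network.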
Two usage lanes, clearly separated. Automated lane: cascade through cheap models first, Claude Max as gated last resort with strict rate limits. Interactive lane: human types, human sends, Claude Max responds directly. Both lanes are compliant. Neither pretends to be the other.
Here's what that architecture costs:
| Component | Monthly Cost | What It Provides |
|---|---|---|
| Claude Max subscription | $200 flat | Interactive features (chat, terminal, actions) + gated automated fallback |
| Local Ollama (Qwen 7B) | $0 | ~60% of automated task completions (when healthy) |
| OpenRouter (Qwen 32B + DeepSeek) | $2-8 | Cheap cloud fallback for complex automated tasks |
| Anthropic API (Sonnet) | $3-5 | Good-faith paid bot usage |
| Total | ~$210/mo | 24/7 autonomous agent + interactive AI dashboard |