Forge Model Router

How a personal AI operating system manages model costs, TOS compliance, and graceful degradation — without going broke or getting banned.

The Problem

Running an autonomous AI agent (Ralph) that codes, researches, and operates 24/7 creates a real tension: the best models cost $15-75 per million tokens, but the tasks Ralph handles don't all need frontier intelligence. Most coding tasks succeed with a 7B parameter model. Only the hard ones need Claude Opus.

We needed a routing system that tries the cheapest model first, escalates only when necessary, respects budget caps, and stays compliant with provider terms of service — all automatically.

The Cascade: 7 Tiers of Fallback

Every prompt Ralph generates flows through a cascade. Each tier gets a fixed number of attempts before the system escalates to the next one.

Tier 0: Qwen 7B (Local Ollama) FREE
qwen2.5-coder:7b — running on the VPS itself (CPU-only, 4-core AMD EPYC)
2 attempts • 180s timeout • No network, no cost, no privacy concerns
Handles boilerplate coding, file edits, config generation. CPU inference at ~5 tok/s means longer timeouts are essential. Health-cached: if Ollama is down, tier is skipped instantly for 5 minutes instead of burning retries.
↓ failed 2x or cached down
Tier 1: Qwen 32B (OpenRouter) ~$0.20/M
qwen/qwen-2.5-coder-32b-instruct — US/EU servers only
1 attempt • 60s timeout • Data collection denied, no fallback routing
Bigger context, better reasoning. Catches what 7B misses on complex logic.
↓ failed
Tier 2: DeepSeek V3 (OpenRouter) ~$0.27/M
deepseek/deepseek-chat-v3-0324 — reasoning mode enabled
2 attempts • 60s timeout • Different architecture catches different failure modes
Still cheap. Data collection denied, jurisdiction-safe routing enforced.
↓ failed 2x
Tier 2.5: Anthropic API (Sonnet) ~$3-5/MONTH
claude-sonnet-4-5 — paid API, good-faith usage
1 attempt • Randomized monthly budget ($3.20-$5.23/mo)
Legitimate paid bot usage via API key. Shows up in Anthropic's billing as organic automated use. Budget randomized to look natural.
↓ budget hit or failed
Tier 3: Claude Max CLI (Gated) GATED
claude -p via CLI — Max subscription ($200/mo flat)
1 attempt • Rate limited: 15 calls/day, 4 calls/hour, 5 min cooldown
Full Opus-level intelligence at flat cost. Gated because the Max plan assumes ordinary individual usage. These calls are an intermittent last resort, not a primary pipeline.
↓ gate limit hit
Tier 3.5: Human Escalation HUMAN-IN-LOOP
Telegram notification → tmux runner pane → Jason presses Enter
15-minute wait • Parks the task and moves on if no response
When gate limits are hit, the command is preloaded in a terminal pane. Jason gets a Telegram ping. If he hits Enter, the call is human-initiated — fully TOS compliant. If he doesn't respond in 15 minutes, the task is parked and Ralph moves to the next item in queue.
↓ timeout (15 min)
Tier 4: Poll & Wait
Check availability every 5 minutes, up to 24 hours
Task is deferred. Ralph polls for model availability (Ollama back up, Claude Max cooldown reset, budget refreshed). After 24 hours, the task is marked failed with a clear reason.
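The escalation logic above can be sketched as a simple loop over tier descriptors. This is an illustrative reduction, not the production code: the `Tier` dataclass and the `available` hook (standing in for the health cache and rate gates) are names invented here.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tier:
    name: str
    attempts: int
    call: Callable[[str], str]            # prompt -> completion; raises on failure
    available: Callable[[], bool] = lambda: True   # health cache / rate gate hook

def run_cascade(prompt: str, tiers: list[Tier]) -> str:
    for tier in tiers:
        if not tier.available():          # cached-down or gated tiers are skipped instantly
            continue
        for _ in range(tier.attempts):
            try:
                return tier.call(prompt)
            except Exception:
                pass                      # burn one attempt, then retry or escalate
    raise RuntimeError("all tiers failed")    # hand off to Tier 4: park, poll, wait
```

Each tier owns its own attempt count, so tuning (say, dropping Ollama from 3 retries to 2) is a one-field change rather than a control-flow rewrite.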

Stress-Hardening: What We Learned

We stress-tested the cascade against our actual hardware — a 4-core AMD EPYC VPS with 15GB RAM, no GPU, no swap — and found three critical issues:

1. Ollama on CPU Is Slow, Not Broken

Qwen 7B runs at ~5.3 tokens/second on CPU-only inference. A typical coding prompt takes 73+ seconds to complete. Our original 30-second timeout guaranteed Ollama would fail on every real task, wasting 3 retries (90 seconds) before escalating. Fix: timeout increased to 180 seconds, retries reduced from 3 to 2.
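The arithmetic behind that fix is worth making explicit. The "73+ seconds" figure at ~5.3 tok/s implies a typical completion of roughly 390 output tokens; a back-of-envelope helper (names and the 2x safety factor are assumptions, not the production config) shows why 30 s could never succeed:

```python
def min_timeout_s(expected_output_tokens: float, tokens_per_s: float,
                  safety_factor: float = 2.0) -> float:
    """Smallest sane timeout for a slow local model: raw generation time plus headroom."""
    return expected_output_tokens / tokens_per_s * safety_factor

# ~390 tokens at 5.3 tok/s is ~74 s of raw generation; with 2x headroom
# that lands near 147 s, which motivates the 180 s setting over the old 30 s.
```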

2. Dead Services Burn Time

When Ollama crashes or LiteLLM goes down, the old system retried against the dead service multiple times before escalating. With 3 retries at 30 seconds each, that's 90 seconds of guaranteed waste. Fix: health caching — when a service fails, it's marked "down" for 5 minutes. Subsequent tasks skip it instantly (0 seconds wasted) and cascade directly to cloud models.

3. Failures Need Diagnosis, Not Just Alerts

When all tiers fail, the old system paused Ralph and sent a generic "all models failed" alert. Unhelpful. Fix: auto-diagnosis — when the circuit breaker triggers, the system checks each component (Ollama process, LiteLLM health, internet connectivity, budget status, Claude Max gates) and reports exactly why everything failed in the Telegram alert.
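The diagnosis step is just "run every component check, never let one crash the report." A sketch, where the check names and their return shape `(ok, detail)` are assumptions about how the probes are wired:

```python
from typing import Callable

def diagnose(checks: dict[str, Callable[[], tuple[bool, str]]]) -> str:
    """Run every component check and build an alert-ready report."""
    lines = ["Circuit breaker tripped. Component status:"]
    for name, check in checks.items():
        try:
            ok, detail = check()
        except Exception as exc:          # a crashing check is itself a finding
            ok, detail = False, repr(exc)
        lines.append(f"  [{'ok' if ok else 'FAIL'}] {name}: {detail}")
    return "\n".join(lines)
```

The caller would register probes for the Ollama process, LiteLLM health endpoint, internet connectivity, budget counters, and Claude Max gate files; the Telegram alert then says exactly which link broke.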

Net effect of hardening

A task that previously burned 90 seconds on a dead Ollama before escalating now skips to cloud models in under 1 second. Cost of Ollama being down for a full day: ~$2-5 extra on OpenRouter. Annoying, not catastrophic.
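Working backwards from the article's own per-token prices, that "$2-5" figure is consistent with roughly 10-20M tokens per day of traffic the local tier would otherwise absorb (the daily volume here is an inferred assumption, not a measured number):

```python
def fallback_cost(tokens_millions: float, price_per_million: float) -> float:
    """Extra daily spend if local-tier traffic lands on a paid cloud tier."""
    return tokens_millions * price_per_million

low  = fallback_cost(10, 0.20)   # all traffic on Qwen 32B pricing
high = fallback_cost(20, 0.27)   # all traffic on DeepSeek pricing
```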

Claude Max for Interactive Features

The cascade above governs automated usage — Ralph running overnight, cron jobs, background processing. But Forge also has interactive features where a human is physically typing and clicking. These are a different category entirely.

The Rule

If a human physically initiates the request — types a message, clicks a button, presses Enter — it's human-initiated usage. Claude Max is designed for this. No cascade needed, no rate gates, no guilt.

What Can Use Claude Max Directly

Cowork Chat MAX OK
You type a message, you hit send. Human-initiated. This is the primary interactive interface — planning, architecture decisions, complex reasoning. Exactly what Max is for.
Terminal (ttyd) MAX OK
You're literally typing in a shell. Running claude from a terminal you opened is the most straightforward individual usage possible.
Dashboard Actions MAX OK
Clicking "approve merge," "queue task," or "generate plan" — each click is a human decision. The LLM call behind the button is human-initiated.
Telegram Commands MAX OK
Sending /skills or /status to the bot. You typed it, you sent it. Human in the loop by definition.

What Must Use the Cascade

Ralph (Autonomous Agent) CASCADE
Runs overnight without human presence. No one types, no one clicks. This is automated pipeline usage — the cascade exists specifically for this.
Cron Jobs & Timers CASCADE
Reconciliation, cascade evaluation, health checks. Scheduled automation with no human present at execution time.
Auto-Triggered Workflows CASCADE
Event-driven processing, webhook handlers, queue consumers. Even if a human action triggered the chain, the LLM call itself is automated.
Batch Processing CASCADE
Processing 50 tasks from a queue, generating reports, bulk operations. Volume automated usage — use cheap models first.
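The two tables above reduce to one routing decision: who initiated the call? A sketch, where the source labels are invented stand-ins for however requests are actually tagged:

```python
# Human-initiated sources go straight to Claude Max; everything else cascades.
HUMAN_SOURCES = {"cowork_chat", "terminal", "dashboard_action", "telegram_command"}
AUTOMATED_SOURCES = {"ralph", "cron", "webhook", "batch"}

def choose_lane(source: str) -> str:
    if source in HUMAN_SOURCES:
        return "claude_max_direct"
    if source in AUTOMATED_SOURCES:
        return "cascade"
    return "cascade"   # unknown callers default to the cheap, gated lane
```

Defaulting unknown sources to the cascade is the conservative choice: a mislabeled automated caller never gets ungated Max access.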

How the Interactive Path Works

Interactive features don't go through the 7-tier cascade at all. They use a simpler path:

  1. Human types/clicks in dashboard, chat, or terminal
  2. Request hits the VPS via the dashboard API proxy
  3. VPS calls Claude via Max CLI (claude -p) or Anthropic API
  4. Response streams back to the human in real-time

No retries needed — if Claude is down, the human sees the error and can retry manually. No budget gates needed — the human IS the gate. No compliance concern — this is exactly how individual subscriptions are meant to be used.

Budget Guardrails

| Guard | Limit | Enforcement |
| --- | --- | --- |
| Daily API spend | $10.00 | Hard stop — refuses API calls when hit |
| Per-task spend | $3.00 | Accumulates across retries within one task |
| Monthly API spend | $100.00 | Alert at 80%, hard stop at 100% |
| Anthropic API budget | $3.20-5.23/mo | Randomized monthly, spread across days |
| Claude Max daily calls (automated) | 15/day | Counter file in /tmp, resets at midnight UTC |
| Claude Max hourly calls (automated) | 4/hour | Counter file, resets each hour |
| Claude Max cooldown (automated) | 5 min between calls | Timestamp file, checked before each call |

Note: Budget guardrails apply only to automated (Ralph/cascade) usage. Interactive human-initiated usage through the dashboard or terminal is ungated — the human is the gate.

Jurisdiction Safety

When sending prompts to cloud models via OpenRouter, we enforce two rules:

Jurisdiction-safe routing comes first: requests are pinned to US/EU servers, data collection is denied, and OpenRouter's fallback routing is disabled, so a prompt can never silently land on an unvetted provider.

PII detection runs before any external model call. If Presidio detects names, emails, or other PII in the prompt, the request is blocked from leaving the local machine entirely.
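The shape of that check is "guard at the egress point, fail closed." A crude regex stand-in for illustration only (the real system runs Microsoft Presidio; these two patterns are nowhere near its coverage):

```python
import re

PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # email addresses
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),     # US-SSN-shaped numbers
]

class PIIBlocked(Exception):
    pass

def guard_egress(prompt: str) -> str:
    """Raise before the prompt can leave the local machine."""
    for pattern in PII_PATTERNS:
        if pattern.search(prompt):
            raise PIIBlocked("PII detected; request stays local")
    return prompt
```

Because the guard raises rather than redacts, a flagged task falls back to the local Ollama tier instead of leaking a sanitized-but-risky prompt.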

TOS Compliance Design

The Core Principle

Two usage lanes, clearly separated. Automated lane: cascade through cheap models first, Claude Max as gated last resort with strict rate limits. Interactive lane: human types, human sends, Claude Max responds directly. Both lanes are compliant. Neither pretends to be the other.

The architecture ensures the lanes never blur: automated calls never masquerade as human-initiated ones, and human-initiated calls are never throttled by automation gates.

What This Costs

| Component | Monthly Cost | What It Provides |
| --- | --- | --- |
| Claude Max subscription | $200 flat | Interactive features (chat, terminal, actions) + gated automated fallback |
| Local Ollama (Qwen 7B) | $0 | ~60% of automated task completions (when healthy) |
| OpenRouter (Qwen 32B + DeepSeek) | $2-8 | Cheap cloud fallback for complex automated tasks |
| Anthropic API (Sonnet) | $3-5 | Good-faith paid bot usage |
| **Total** | **~$210/mo** | 24/7 autonomous agent + interactive AI dashboard |
Published by Forge • ideas.asapai.net • March 2026
Updated March 3 with stress-hardening results and interactive feature guide
Built with the Forge personal AI operating system