How a personal AI operating system manages model costs, TOS compliance, and graceful degradation — without going broke or getting banned.
Running an autonomous AI agent (Ralph) that codes, researches, and operates 24/7 creates a real tension: the best models cost $15-75 per million tokens, but the tasks Ralph handles don't all need frontier intelligence. Most coding tasks succeed with a 7B parameter model. Only the hard ones need Claude Opus.
We needed a routing system that tries the cheapest model first, escalates only when necessary, respects budget caps, and stays compliant with provider terms of service — all automatically.
Every prompt Ralph generates flows through a cascade. Each tier gets a fixed number of attempts before the system escalates to the next one.
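Stripped to its core, the cascade is a nested loop: tiers in cost order, a fixed attempt budget per tier. A minimal sketch, where the tier names and attempt counts are illustrative stand-ins rather than the production config:

```python
# Tiers ordered cheapest-first; each gets a fixed attempt budget
# before the cascade escalates. Names/counts are illustrative.
CASCADE = [
    {"model": "ollama/qwen:7b",          "attempts": 2},  # free, local
    {"model": "openrouter/qwen-32b",     "attempts": 2},  # cheap cloud
    {"model": "anthropic/claude-sonnet", "attempts": 1},  # paid API
]

def run_with_cascade(prompt, call_model):
    """Try each tier its allotted number of times, then escalate."""
    for tier in CASCADE:
        for _ in range(tier["attempts"]):
            try:
                return call_model(tier["model"], prompt)
            except RuntimeError:
                continue  # failed attempt: retry this tier, then move on
    raise RuntimeError("all tiers exhausted")  # circuit-breaker territory
```

The escalation logic lives entirely in the loop structure; everything else in the system (timeouts, health caching, budget gates) decides how fast an attempt is allowed to fail.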
We stress-tested the cascade against our actual hardware — a 4-core AMD EPYC VPS with 15GB RAM, no GPU, no swap — and found three critical issues:
Qwen 7B runs at ~5.3 tokens/second on CPU-only inference. A typical coding prompt takes 73+ seconds to complete. Our original 30-second timeout guaranteed Ollama would fail on every real task, wasting 3 retries (90 seconds) before escalating. Fix: timeout increased to 180 seconds, retries reduced from 3 to 2.
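The timeout arithmetic is worth making explicit. Assuming a typical coding reply of roughly 390 output tokens (an estimate back-calculated from the 73-second figure, not a measured value):

```python
TOKENS_PER_SEC = 5.3        # measured Qwen 7B CPU-only throughput
TYPICAL_REPLY_TOKENS = 390  # assumption: rough size of a coding reply

est_seconds = TYPICAL_REPLY_TOKENS / TOKENS_PER_SEC  # ~73.6s

assert est_seconds > 30    # old 30s timeout: every real task times out
assert est_seconds < 180   # new 180s timeout: leaves headroom
```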
When Ollama crashes or LiteLLM goes down, the old system retried against the dead service multiple times before escalating. With 3 retries at 30 seconds each, that's 90 seconds of guaranteed waste. Fix: health caching — when a service fails, it's marked "down" for 5 minutes. Subsequent tasks skip it instantly (0 seconds wasted) and cascade directly to cloud models.
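A health cache like this needs only one timestamp per service. A minimal sketch, assuming an in-memory map (a production version might persist this to disk so it survives restarts):

```python
import time

HEALTH_TTL = 300   # seconds a failed service stays marked "down"
_down_until = {}   # service name -> unix time when it may be retried

def mark_down(service, now=None):
    """Record a failure: skip this service for the next HEALTH_TTL seconds."""
    _down_until[service] = (now or time.time()) + HEALTH_TTL

def is_down(service, now=None):
    """True while the service is inside its 5-minute penalty window."""
    return (now or time.time()) < _down_until.get(service, 0)
```

The cascade checks `is_down()` before each tier; a cached "down" verdict costs a dictionary lookup instead of 90 seconds of doomed retries.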
When all tiers fail, the old system paused Ralph and sent a generic "all models failed" alert. Unhelpful. Fix: auto-diagnosis — when the circuit breaker triggers, the system checks each component (Ollama process, LiteLLM health, internet connectivity, budget status, Claude Max gates) and reports exactly why everything failed in the Telegram alert.
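Auto-diagnosis is essentially a table of named probes run in sequence. A sketch, where the probe callables are stand-ins for the real checks (process liveness, HTTP health endpoints, budget counters, gate files):

```python
def diagnose(checks):
    """Run each named health probe; return a human-readable failure report.

    `checks` maps a component name (Ollama, LiteLLM, internet, budget,
    Claude Max gates) to a zero-arg callable returning True if healthy.
    """
    failures = []
    for name, probe in checks.items():
        try:
            ok = probe()
        except Exception as exc:
            ok, name = False, f"{name} ({exc})"  # probe itself blew up
        if not ok:
            failures.append(name)
    if failures:
        return "Circuit breaker tripped. Failing: " + ", ".join(failures)
    return "All components healthy; failure was transient."
```

The returned string is what lands in the Telegram alert, so a 3 a.m. page reads "Failing: Ollama, internet" instead of "all models failed."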
A task that previously burned 90 seconds on a dead Ollama before escalating now skips to cloud models in under 1 second. Cost of Ollama being down for a full day: ~$2-5 extra on OpenRouter. Annoying, not catastrophic.
The cascade above governs automated usage — Ralph running overnight, cron jobs, background processing. But Forge also has interactive features where a human is physically typing and clicking. These are a different category entirely.
If a human physically initiates the request — types a message, clicks a button, presses Enter — it's human-initiated usage. Claude Max is designed for this. No cascade needed, no rate gates, no guilt.
- Running `claude` from a terminal you opened is the most straightforward individual usage possible.
- Sending `/skills` or `/status` to the bot: you typed it, you sent it. Human in the loop by definition.

Interactive features don't go through the 7-tier cascade at all. They use a simpler path:
Human input goes straight to Claude Max (`claude -p`) or the Anthropic API. No retries needed — if Claude is down, the human sees the error and can retry manually. No budget gates needed — the human IS the gate. No compliance concern — this is exactly how individual subscriptions are meant to be used.
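The interactive lane can be sketched in a few lines. This assumes the Claude Code CLI's print mode (`claude -p`) and injects the process runner so the logic is testable without the CLI installed:

```python
import subprocess

def handle_interactive(prompt, run=subprocess.run):
    """Human-initiated request: call Claude directly. No cascade, no
    retries, no budget gate; errors surface to the human, who retries."""
    result = run(["claude", "-p", prompt], capture_output=True, text=True)
    if result.returncode != 0:
        return f"error: {result.stderr}"  # shown to the human as-is
    return result.stdout
```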
| Guard | Limit | Enforcement |
|---|---|---|
| Daily API spend | $10.00 | Hard stop — refuses API calls when hit |
| Per-task spend | $3.00 | Accumulates across retries within one task |
| Monthly API spend | $100.00 | Alert at 80%, hard stop at 100% |
| Anthropic API budget | $3.20-5.23/mo | Randomized monthly, spread across days |
| Claude Max daily calls (automated) | 15/day | Counter file in /tmp, resets at midnight UTC |
| Claude Max hourly calls (automated) | 4/hour | Counter file, resets each hour |
| Claude Max cooldown (automated) | 5 min between calls | Timestamp file, checked before each call |
Note: Budget guardrails apply only to automated (Ralph/cascade) usage. Interactive human-initiated usage through the dashboard or terminal is ungated — the human is the gate.
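The spend caps in the table reduce to a single pre-call check. A sketch with the limits copied from the table; state handling is simplified to an in-memory dict, where the real system uses counter files:

```python
# Spend limits from the guardrail table (USD).
LIMITS = {"daily": 10.00, "per_task": 3.00, "monthly": 100.00}

def check_budget(spent, call_cost):
    """Raise before any automated API call that would breach a cap.

    `spent` tracks accumulated spend: {"daily": .., "task": .., "monthly": ..}.
    The per-task counter accumulates across retries within one task.
    """
    if spent["daily"] + call_cost > LIMITS["daily"]:
        raise RuntimeError("daily cap hit: refusing API call")
    if spent["task"] + call_cost > LIMITS["per_task"]:
        raise RuntimeError("per-task cap hit across retries")
    if spent["monthly"] + call_cost > LIMITS["monthly"]:
        raise RuntimeError("monthly cap hit")
    if spent["monthly"] + call_cost > 0.8 * LIMITS["monthly"]:
        print("alert: 80% of monthly budget used")  # Telegram in production
```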
When sending prompts to cloud models via OpenRouter, we enforce two rules:
- `data_collection: "deny"` — OpenRouter providers cannot store or train on our data
- `allow_fallbacks: false` — If the primary provider is down, the request fails rather than routing to an unknown provider in an unknown jurisdiction

PII detection runs before any external model call. If Presidio detects names, emails, or other PII in the prompt, the request is blocked from leaving the local machine entirely.
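In OpenRouter's chat-completions API, both rules live in the request's `provider` object. A sketch of the outbound policy, with `detect_pii` standing in for the real Presidio analyzer call:

```python
def build_openrouter_request(prompt, model, detect_pii):
    """Build an OpenRouter payload, or refuse if the prompt contains PII."""
    if detect_pii(prompt):
        raise ValueError("PII detected: prompt blocked from leaving machine")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "provider": {
            "data_collection": "deny",  # no storage or training on our data
            "allow_fallbacks": False,   # fail rather than route elsewhere
        },
    }
```

The PII gate runs first by construction: a blocked prompt never produces a payload, so there is no code path where it reaches the network.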
Two usage lanes, clearly separated. Automated lane: cascade through cheap models first, Claude Max as gated last resort with strict rate limits. Interactive lane: human types, human sends, Claude Max responds directly. Both lanes are compliant. Neither pretends to be the other.
Here's what that architecture costs:
| Component | Monthly Cost | What It Provides |
|---|---|---|
| Claude Max subscription | $200 flat | Interactive features (chat, terminal, actions) + gated automated fallback |
| Local Ollama (Qwen 7B) | $0 | ~60% of automated task completions (when healthy) |
| OpenRouter (Qwen 32B + DeepSeek) | $2-8 | Cheap cloud fallback for complex automated tasks |
| Anthropic API (Sonnet) | $3-5 | Good-faith paid bot usage |
| Total | ~$210/mo | 24/7 autonomous agent + interactive AI dashboard |