OpenClaw Token Optimization: Cut AI Agent Costs by 97%

Based on Matt Ganzac's discovery sequence • 📺 Source Video
DEVELOPER-LEVEL 97% COST CUT PRODUCTION-TESTED

Default OpenClaw burns $50-150/month on architectural waste. Four fixes cut costs to $3/month while enabling overnight batch processing and production lead gen at $1/hour.

The System Overview

Default OpenClaw burns $50-150/month on architectural waste. Follow Matt Ganzac's exact discovery sequence — with AI doing the heavy lifting — to cut costs to $3/month while enabling overnight batch processing and production lead gen at $1/hour.

  • Daily idle cost: $3 → $0 (heartbeat + context waste eliminated)
  • Overnight job cost: $150 → $6 (multi-model routing + caching)
  • Monthly spend: $90 → $3 (97% reduction, same output)

Core Insight

"One guy went to bed and woke up having burned $500." — Matt Ganzac

  • Kill context bloat – 50KB→5KB per operation (90% reduction)
  • Dump session history – 111KB Slack tax eliminated
  • Heartbeats to Ollama – $0 forever for system checks
  • Multi-model routing – 85% of ops on cheapest viable model
⚠️ CRITICAL PREREQUISITES

Before you start, verify:

  • ✓ Developer-level skills (log analysis, config editing)
  • ✓ Dedicated machine for OpenClaw (NOT your personal laptop)
  • ✓ API access to Anthropic, Brave Search, Hunter.io
  • ✓ Understanding: Agent will attempt logins, purchases, actions
  • ✓ Willingness to break things and troubleshoot
Real Stakes: If you're not comfortable with the above, this playbook isn't for you. If you're just exploring OpenClaw, start with their docs first.

Step 0: Measure Your Current Waste

Est. time: ~20 min • Foundation step
FOUNDATION
Before/After Transformation
| ❌ BEFORE (Flying Blind) | ✅ AFTER (Data-Driven) |
| --- | --- |
| Don't know where tokens go | Exact token breakdown by category |
| Assume cost is just "normal" | Top 3 waste sources identified |
| No visibility into waste sources | Baseline for measuring improvement |
| Can't prioritize fixes | Clear prioritization roadmap |
Pattern Recognition: Matt only discovered 111KB session history bloat BECAUSE rate limits forced investigation. Measuring reveals hidden waste you'd never find otherwise.

The Problem

You can't optimize what you don't measure. Most OpenClaw users assume $50-150/month is "normal cost of AI." In reality, 80-95% is architectural waste you can eliminate in one afternoon.

Why This Matters

Matt's Discovery Story:

  • Loaded $25 to Anthropic API
  • Was on track to spend $20/day IDLE (doing nothing)
  • Only found the waste by running token audit after hitting rate limits
  • Without measurement, would have just kept burning money

Do Exactly This

  1. Access your OpenClaw logs directory (location varies by OS)
  2. Use AI prompt below to generate token audit script
  3. Run the script and save output to file
  4. Identify your top 3 waste sources:
    • Context bloat (likely 50KB+ per operation)
    • Session history (if using Slack/WhatsApp)
    • Heartbeat API calls (every 30 min)
  5. Calculate your daily idle cost (API calls × token cost)
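A minimal sketch of the kind of audit script step 2 asks the AI to generate. The JSONL log format and the "category"/"tokens" field names are invented for illustration; adapt them to whatever your OpenClaw logs actually contain.

```python
import json
from collections import defaultdict

# Invented log format: one JSON object per line. Real OpenClaw logs
# will use a different schema -- adjust the field names accordingly.
SAMPLE_LOGS = [
    '{"category": "heartbeat", "tokens": 52000}',
    '{"category": "heartbeat", "tokens": 51000}',
    '{"category": "task", "tokens": 8000}',
]

def audit(lines, price_per_million=3.00):  # Sonnet-class input rate, $/1M tokens
    """Aggregate token usage by category and rank the waste sources."""
    totals = defaultdict(int)
    for line in lines:
        entry = json.loads(line)
        totals[entry["category"]] += entry["tokens"]
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(cat, tok, tok / 1_000_000 * price_per_million) for cat, tok in ranked]

for category, tokens, cost in audit(SAMPLE_LOGS):
    print(f"{category:12s} {tokens:>8,d} tokens  ${cost:.3f}")
```

Point the same aggregation at your real log directory; the top line of the ranking is your first fix target.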

Your Baseline Metrics

MY BASELINE METRICS (from audit):

Context size per operation: _____ KB
Daily heartbeat cost: $_____
Session history size: _____ KB
Top waste source: ___________
Current daily idle cost: $_____
Current model usage: ___________

Matt's Numbers (Comparison)

MATT'S BASELINE:

Context size: 50KB+
Daily heartbeat cost: $2-3
Session history: 111KB (Slack)
Top waste: Context bloat
Daily idle cost: $3
Model: 100% Sonnet
Action Item: Don't move to Step 1 until you have your baseline numbers. These metrics are your proof that fixes work.
🤖 AI Prompt Library: Token Audit

Prompt 1: Generate Token Audit Script

I'm running OpenClaw and need to audit my token usage to find waste.

Generate a Python script that:

1. READS: OpenClaw log files from [SPECIFY YOUR LOG DIRECTORY]
2. EXTRACTS: Token usage data by category
3. CALCULATES:
   - Context size per operation type
   - Session history loading frequency and size
   - Heartbeat token usage
   - Model routing distribution
   - Daily/weekly cost projections

4. OUTPUTS:
   - Summary dashboard with key metrics
   - Top 5 waste sources ranked by cost
   - Before/after projection if waste eliminated
   - CSV export for tracking over time

5. HANDLES:
   - Different log formats
   - Missing data gracefully
   - Rate limit errors separately

Generate complete, documented script ready to run.
Include installation instructions for dependencies.

Prompt 2: Analyze Audit Results

I ran a token audit on my OpenClaw instance. Help me interpret the results and prioritize fixes.

MY AUDIT RESULTS:
[Paste your audit output here]

Analyze and provide:

1. TOP 3 WASTE SOURCES:
   - Rank by cost impact
   - Estimate savings % if eliminated
   - Difficulty to fix (easy/medium/hard)

2. QUICK WIN IDENTIFICATION:
   - What can I fix in <30 min for biggest impact?
   - Which fixes are prerequisites for others?

3. PRIORITIZED FIX SEQUENCE:
   - Step 1: [Fix + expected savings]
   - Step 2: [Fix + expected savings]
   - Step 3: [Fix + expected savings]

4. RED FLAGS:
   - Any unusual patterns?
   - Signs of misconfiguration?
   - Potential security issues?
Pro Tip: Run audit BEFORE and AFTER each optimization step. Seeing the numbers drop is incredibly motivating — and it's your proof.
Quality Gates: What's "Good Enough" vs "Exceptional"?

✅ GOOD ENOUGH (Proceed to Step 1)

  • Audit script runs successfully
  • You have baseline daily token usage number
  • You identified at least 1 waste source >20% of total
  • You saved the audit output for comparison later

🌟 EXCEPTIONAL (Nice to Have)

  • Automated daily audit reports
  • Historical trend tracking
  • Breakdown by sub-agent/task type
  • Alert thresholds configured
Jason's Rule: "Perfect measurement is procrastination. Get your top 3 waste sources and move. You can refine the audit later."

Step 1: Eliminate Context Bloat (80% Savings)

Est. time: ~30 min • Prerequisite: Step 0
HIGHEST IMPACT
Before/After Transformation
| ❌ BEFORE (Context Explosion) | ✅ AFTER (Selective Loading) |
| --- | --- |
| 50KB context loaded per message | ~5KB context per operation |
| 75KB after a few days, 100KB+ after a week | Context size stable (doesn't grow) |
| 2-3M tokens/day just from heartbeats | Zero tokens on heartbeats |
| $2-3/day sitting completely idle | $0/day idle cost |
Real Example: Matt went from $20/day burn rate (projected) to $0/day idle after this single fix. Same agent functionality, 90% fewer tokens.

The Problem

Every heartbeat (every 30 min), every message, every prompt loads ALL your context files. As your memory grows, this compounds: 50KB → 75KB → 100KB+. You're burning tokens just to keep the agent awake.

Why This Matters

Matt's Numbers:

  • Initial context: 50KB per operation
  • After fixes: 5KB per operation (90% reduction)
  • Savings: $3/day → $0/day idle
  • Time to implement: 15 minutes of config changes

This is the SINGLE highest-leverage fix. Do this first.

Do Exactly This

  1. Locate your config file:
    Mac:     ~/.openclaw/config.json
    Windows: %APPDATA%\OpenClaw\config.json
    Linux:   ~/.config/openclaw/config.json
  2. Backup your current config (copy to safe location)
  3. Use AI prompt below to generate config modifications
  4. Apply the changes (edit config file)
  5. Restart OpenClaw and test with simple task
  6. Re-run token audit to verify reduction
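Steps 2-4 can be scripted rather than hand-edited. The sketch below backs up the config before touching it; the two context-limiting keys are hypothetical placeholders, not confirmed OpenClaw parameters (discovering the real ones is exactly what the prompt below asks the AI to do).

```python
import json
import shutil
from pathlib import Path

def apply_context_limits(config_path):
    """Back up config.json, then write selective-context settings.

    WARNING: "heartbeat_context" and "context_max_kb" are illustrative
    placeholder keys, not confirmed OpenClaw parameters.
    """
    config_path = Path(config_path)
    shutil.copy(config_path, config_path.with_name(config_path.name + ".bak"))
    cfg = json.loads(config_path.read_text())
    cfg["heartbeat_context"] = "none"  # hypothetical: skip context on heartbeats
    cfg["context_max_kb"] = 10         # hypothetical: cap per-operation context
    config_path.write_text(json.dumps(cfg, indent=2))
```

The backup copy is your rollback path if the agent breaks after the change.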

Your Verification

BEFORE THIS FIX:
Context per operation: _____ KB
Daily idle cost: $_____

AFTER THIS FIX:
Context per operation: _____ KB
Daily idle cost: $_____

SAVINGS: _____%

Troubleshooting

Common Failures:

  • Agent broke / won't complete tasks: Restore backup, use "preserve task context" variant prompt
  • Still loading 30KB+ context: Check if session history is the culprit (Step 2)
  • Rate limit errors: Implement pacing logic (covered in Step 4)
Action Item: This fix alone gets most people to break-even economics. If you do nothing else, do this.
🤖 AI Prompt Library: Context Management

Prompt 1: Generate Context Management Config

I'm running OpenClaw and need to eliminate context bloat.

CURRENT SITUATION:
- Every heartbeat loads full context (~50KB+)
- Every message loads all context + history
- Context size growing: 50KB → 75KB → 100KB+

MY CURRENT CONFIG:
[Paste your config.json here OR say "using default config"]

GOAL:
Generate modified config that:
1. Prevents context loading on heartbeats
2. Implements selective context loading by task type
3. Reduces typical operation from 50KB → <10KB
4. Maintains necessary context for task completion
5. Preserves agent functionality

OUTPUT:
- Complete modified config.json
- Explanation of each changed parameter
- Test procedure to verify context reduction
- Rollback instructions if something breaks

CRITICAL: Don't break the agent. Preserve task completion capability.

Prompt 2: Verify Context Reduction

I just modified my OpenClaw config to reduce context bloat.

Run this verification checklist:

1. AUDIT COMPARISON:
   Before config: [Your Step 0 context size]
   After config: [Your new audit context size]
   Reduction: ____%

2. FUNCTIONALITY TEST:
   - Can agent complete simple task? (Y/N)
   - Can agent access memory when needed? (Y/N)
   - Any error messages? [List them]

3. NEXT STEPS:
   - If reduction <50%: [What to check/fix]
   - If agent broke: [Rollback procedure]
   - If reduction >70%: [Proceed to Step 2]

Analyze my results and tell me if I'm good to proceed.

MY RESULTS:
[Paste audit comparison here]
Pro Tip: Start conservative. Aim for 50-70% reduction first. You can tighten further after validating agent still works.
📊 Real Example: Matt's Implementation
MATT'S IMPLEMENTATION:

Discovery: Token audit showed 50KB context per heartbeat
Fix Applied: Selective context loading config
Test: Simple task completion verified
Result: 50KB → 5KB (90% reduction)
Savings: $3/day → $0/day idle
Time: 15 min config + 5 min test

Key Learning: Rate limits forced him to investigate. Most users never look at token breakdown and just accept the cost.

Quality Gates: What's "Good Enough" vs "Exceptional"?

✅ GOOD ENOUGH (Proceed to Step 2)

  • Context size reduced by 50%+ (audit shows this)
  • Agent still completes basic tasks correctly
  • Daily idle cost dropped significantly
  • You have backup config if rollback needed

🌟 EXCEPTIONAL (Optimization Round 2)

  • Context size <10KB consistently
  • Task-specific context rules configured
  • 90%+ reduction from baseline
  • Automated monitoring alerts if bloat returns
Jason's Rule: "50% reduction = ship it and move on. Don't optimize the optimization before fixing the other 3 waste sources."

Step 2: Kill Session History Tax (Messaging Platforms)

Est. time: ~20 min • Slack/WhatsApp only
PLATFORM-SPECIFIC
Before/After Transformation
| ❌ BEFORE (History Explosion) | ✅ AFTER (Clean Sessions) |
| --- | --- |
| Slack loads entire chat history (111KB!) | "new session" command dumps history |
| Every message sends full history to API | History saved to memory (accessible if needed) |
| 1M+ tokens per prompt when using Slack | Normal token usage restored |
| Rate limit errors constantly | No more rate limit errors |
Discovery Method: Matt only found this BECAUSE rate limits forced log investigation. Web interface didn't have this issue — it's Slack-specific bloat.

The Problem

If you use Slack or WhatsApp to communicate with OpenClaw, it's loading your ENTIRE conversation history on every API call. Matt found 111KB of session text being sent every time he prompted the agent.

✅ Slack (confirmed issue)

⚠️ WhatsApp (likely same issue)

Web-only users: If you only use the web interface, SKIP this step entirely. This issue doesn't affect you.

Do Exactly This

  1. Verify you have this issue (token audit shows large session size)
  2. Use AI prompt below to generate "new session" command
  3. Integrate the command into your OpenClaw setup
  4. Test it by typing "new session" in Slack
  5. Verify dump worked (check memory storage, re-run audit)
  6. Make it a habit — type "new session" before expensive operations
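What "new session" does can be modeled in a few lines: dump the buffer to memory, then clear it. The buffer class and the file-based memory store here are toys invented for illustration; the real command operates on OpenClaw's Slack session state.

```python
import json
import time
from pathlib import Path

class SessionBuffer:
    """Toy stand-in for the chat history Slack keeps resending."""
    def __init__(self):
        self.messages = []

    def append(self, text):
        self.messages.append(text)

    def size_kb(self):
        return sum(len(m) for m in self.messages) / 1024

def new_session(buffer, memory_dir):
    """Dump the session to memory, report savings, clear the buffer."""
    memory_dir = Path(memory_dir)
    memory_dir.mkdir(parents=True, exist_ok=True)
    dump = memory_dir / f"session-{int(time.time())}.json"
    dump.write_text(json.dumps(buffer.messages))  # still recallable later
    saved_kb = buffer.size_kb()
    buffer.messages.clear()  # next API call carries zero history
    return dump, saved_kb
```

The key design point: history is moved to memory, not deleted, so the agent can still recall it on demand without paying for it on every call.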

When to Dump Session

TYPE "NEW SESSION" BEFORE:
- Running overnight batch jobs
- Expensive research tasks
- Multi-agent orchestration
- Any operation you want cost-optimized

KEEP SESSION WHEN:
- Ongoing conversation needs context
- Quick back-and-forth exchanges
- Currently debugging something

Matt's Usage Pattern

MATT'S APPROACH:

Morning: "new session" (clean slate)
During work: Keep session active
Before bed: "new session" before overnight jobs
After big tasks: "new session" to prevent bloat

Result: Never hits rate limits anymore
Session bloat: 111KB → 0KB consistently
Action Item: If you use Slack/WhatsApp, this fix is MANDATORY. You cannot hit production economics without it.
🤖 AI Prompt Library: Session Cleanup

Prompt 1: Generate Session Cleanup Command

I'm running OpenClaw with Slack integration and it's loading my entire Slack history (111KB) on every message.

CREATE: A custom OpenClaw command called "new_session" that:

FUNCTIONALITY:
1. Dumps current Slack/WhatsApp session buffer
2. Saves dumped content to agent memory (accessible for recall)
3. Clears the session buffer completely
4. Confirms cleanup completed with token savings estimate

TECHNICAL REQUIREMENTS:
- Callable by typing "new session" in Slack
- Memory format allows selective recall if needed
- No loss of critical information
- Immediate effect on next API call
- Logs the action for audit trail

OUTPUT:
- Complete command code (ready to integrate)
- Installation/integration instructions for OpenClaw
- Usage examples and best practices
- Verification method (how to confirm it worked)
- Expected token savings per operation after cleanup

PLATFORM: [Slack / WhatsApp / Other]

Prompt 2: Session History Audit

Help me verify session history is my token waste culprit.

MY TOKEN AUDIT SHOWS:
[Paste relevant audit sections]

Analyze:
1. Is session history loading the issue?
   - Expected indicators: [What to look for in logs]
   - Platform comparison: Does web UI show same pattern?

2. Estimated impact:
   - Current waste from session history: $___/day
   - Potential savings if fixed: $___/day

3. Verification procedure:
   - How do I confirm session dump worked?
   - What should audit show after fix?

4. Alternative causes:
   - If it's NOT session history, what else could cause this pattern?
Pro Tip: Compare token usage between Slack and web interface for same task. If Slack burns 10x more, session history is the culprit.
📊 Real Example: Matt's Session Discovery
MATT'S SESSION HISTORY DISCOVERY:

Problem: Hitting rate limits (429 errors) constantly
Investigation: Compared web UI vs Slack token usage
Finding: Slack used 1M+ tokens, web UI used ~50K tokens for the SAME task
Root cause: 111KB session history blob in every Slack API call
Fix: Created "new session" command
Result: Rate limits eliminated, token usage normalized
Time: 20 min to build command, 2 min to use

Critical Insight: Would never have found this without rate limits forcing investigation. Silent killer for Slack users.

Quality Gates

✅ GOOD ENOUGH (Proceed to Step 3)

  • "new session" command works in Slack/WhatsApp
  • Token audit shows session size dropped to near-zero
  • No more rate limit errors
  • Memory recall still functional when needed
Jason's Rule: "If it works in Slack, it's good enough. Don't build the perfect session manager before you fix heartbeats."

Step 3: Heartbeats to Ollama (Zero API Cost)

Est. time: ~25 min • Independent step
ZERO MARGINAL COST
Before/After Transformation
| ❌ BEFORE (Paying to Stay Awake) | ✅ AFTER (Local Heartbeats) |
| --- | --- |
| Heartbeat every 30 minutes via API | Heartbeat runs on Ollama (local, free) |
| Using Opus: $5/day idle | $0/day heartbeat cost (any model) |
| Using Sonnet: $2-3/day idle | Infinite heartbeat frequency if wanted |
| Using Haiku: $0.50/day idle | Same functionality, zero cost |
Why This Works: Heartbeat is brainless — just checking memory and task queue. Doesn't need cloud AI, can run entirely locally.

The Problem

Heartbeats keep your agent alive and checking for active tasks. But using Opus/Sonnet/Haiku for this is like hiring a neurosurgeon to take your temperature. It's complete overkill.

Why This Matters

Heartbeat Economics:

  • Frequency: Every 30 minutes (48/day)
  • Task: Check memory, check task queue, report status
  • Complexity: Literally just "system okay?" level logic
  • Cost on Opus: ~$5/day for brainless pings
  • Cost on Ollama: $0 forever

This is FREE money.

Do Exactly This

  1. Install Ollama (local LLM runtime, open source, free)
    • Download from ollama.ai
    • Install latest version
    • Verify it runs: ollama --version
  2. Add Ollama to OpenClaw config using prompt below
  3. Update heartbeat routing to use Ollama
  4. Test heartbeat manually (trigger one, verify it works)
  5. Run overnight and verify zero API calls in audit
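The steps above can be tested by hand with a short script. The /api/generate endpoint on port 11434 is Ollama's actual default REST API; the heartbeat prompt, the model name, and how OpenClaw wires this in are assumptions for illustration.

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

# Hypothetical heartbeat prompt -- the real check depends on your setup.
HEARTBEAT_PROMPT = "Reply OK if you can read this. This is a liveness check."

def heartbeat_payload(model="llama3"):
    """Build a non-streaming request body for Ollama's generate API."""
    return {"model": model, "prompt": HEARTBEAT_PROMPT, "stream": False}

def run_heartbeat(model="llama3"):
    """POST the heartbeat to the local Ollama server: zero API spend."""
    data = json.dumps(heartbeat_payload(model)).encode()
    req = request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["response"]
```

Schedule run_heartbeat() every 30 minutes in place of the API-backed heartbeat; if Ollama is down, the connection error is your failure signal.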

Perfect for Ollama (Free)

  • ✅ Heartbeats (system health checks)
  • ✅ File organization (moving, renaming)
  • ✅ CSV compilation (merging data)
  • ✅ Folder structure (creating directories)
  • ✅ Basic text formatting
  • ✅ Log file parsing

= Brainless operations with zero reasoning required

Still Use API For

  • 🔵 Web research (Haiku)
  • 🔵 Email writing (Sonnet)
  • 🔵 Code generation (Sonnet)
  • 🟣 Strategic reasoning (Opus)
  • 🟣 Complex analysis (Opus)

= Anything requiring actual intelligence

Action Item: Ollama is your "infinite brainless operations" cheat code. Use it aggressively for anything that doesn't require reasoning.
🤖 AI Prompt Library: Ollama Integration

Prompt 1: Configure Ollama Integration

I want to add Ollama (local LLM runtime) to my OpenClaw setup to eliminate API costs for brainless operations.

CURRENT SETUP:
[Paste your current OpenClaw config.json]

GOAL:
Generate config modifications to:

1. ADD OLLAMA:
   - Model: [your locally installed model, e.g. llama3]
   - Use for: heartbeats, file_ops, csv_compile, folder_structure
   - Cost tier: FREE
   - Routing: Default for "brainless" task category

2. UPDATE HEARTBEAT:
   - Route all heartbeats to Ollama
   - Remove API calls entirely
   - Maintain check frequency (30 min)
   - Preserve functionality

3. TESTING:
   - How to verify Ollama is handling heartbeats
   - How to confirm zero API calls
   - Fallback if Ollama fails

OUTPUT:
- Complete modified config.json
- Ollama installation verification steps
- Integration test procedure
- Troubleshooting common issues

Make it copy-paste ready.

Prompt 2: Identify "Brainless" Operations

Help me identify which of my OpenClaw operations should run on Ollama (free) vs API models (paid).

MY TYPICAL TASKS:
[List your common agent tasks here]

For each task, classify as:

1. BRAINLESS (Ollama - Free):
   - No reasoning required
   - File/data manipulation only
   - Pattern matching at most

2. LOW-COMPLEXITY (Haiku - $0.25/1M):
   - Basic research/collection
   - Simple formatting with light intelligence

3. MEDIUM-COMPLEXITY (Sonnet - $3/1M):
   - Writing, email drafting
   - Code generation

4. HIGH-COMPLEXITY (Opus - $15/1M):
   - Strategic reasoning
   - Novel problem solving

Then estimate my cost savings routing everything optimally.
Pro Tip: If you're uncertain whether a task needs API vs Ollama, run it on Ollama first. It'll fail fast if it can't handle it, then escalate automatically.
📊 Real Example: Matt's Ollama Implementation
MATT'S OLLAMA IMPLEMENTATION:

Problem: Spending $2-3/day on heartbeats alone (Sonnet)
Solution: Installed Ollama, routed heartbeats to local
Test: Let it run overnight, checked logs
Result: Zero API calls for heartbeats confirmed
Savings: $2-3/day → $0/day on heartbeats

Extended Use Cases Matt Found:
- File organization during overnight jobs (14 sub-agents)
- CSV compilation from multiple sources
- Folder structure creation
- Log parsing and cleanup

Total Ollama Usage: ~15% of all operations
Total Savings: These operations would have cost $20-30/month
Actual Cost: $0 forever

Installation Time: 10 min
Config Time: 15 min
ROI: Infinite (free forever)
Quality Gates

✅ GOOD ENOUGH (Proceed to Step 4)

  • Ollama installed and running
  • Heartbeats routed to Ollama successfully
  • Token audit shows zero API calls for heartbeats
  • Agent still functions normally
Jason's Rule: "If Ollama can do it for free, Ollama MUST do it. Every API call is money you're lighting on fire unnecessarily."

Step 4: Route by Task Complexity (15% Additional Savings)

Est. time: ~45 min • Requires Steps 1+3
ADVANCED OPTIMIZATION
Before/After Transformation
| ❌ BEFORE (Single Model Waste) | ✅ AFTER (Smart Routing) |
| --- | --- |
| 100% operations on Sonnet ($3/1M tokens) | 15% Ollama (free) — brainless ops |
| OR 100% on Opus ($15/1M tokens) | 75% Haiku ($0.25/1M) — data collection |
| Brainless tasks using expensive AI | 10% Sonnet ($3/1M) — writing/code |
| No escalation logic | <1% Opus ($15/1M) — strategic only |
| Monthly: $90-150 | Monthly: $3-10 |
Real Example: Matt's overnight job would have cost $150 on Opus alone; with multi-model routing and 95% cached tokens it cost him $6.

The Problem

Using Opus for everything is like hiring a Fortune 500 CEO to organize your filing cabinet. The work gets done, but you're burning massive money on overkill.

Task Complexity Distribution:

  • 85% of operations: Brainless or low-complexity
  • 10% of operations: Medium complexity
  • 5% of operations: High complexity

But default setups use ONE model for everything.

Cost Comparison for 1M Operations

| Routing Strategy | Cost | Savings vs Opus |
| --- | --- | --- |
| All Opus | $15,000 | baseline |
| All Sonnet | $3,000 | 80% |
| All Haiku | $250 | 98% |
| Optimized routing | $300-500 | 96-98% |
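The table follows from blended per-token rates. Assuming 1M operations works out to roughly 1B tokens (about 1K tokens per operation, an assumption for illustration), the arithmetic is:

```python
# Per-model prices from the tier framework, $ per 1M tokens.
PRICES = {"ollama": 0.00, "haiku": 0.25, "sonnet": 3.00, "opus": 15.00}
TOTAL_TOKENS_M = 1_000  # ~1B tokens, expressed in millions

def strategy_cost(mix):
    """mix maps model -> fraction of tokens; returns total dollars."""
    blended_rate = sum(PRICES[m] * frac for m, frac in mix.items())
    return blended_rate * TOTAL_TOKENS_M

print(strategy_cost({"opus": 1.0}))    # 15000.0
print(strategy_cost({"sonnet": 1.0}))  # 3000.0
print(strategy_cost({"haiku": 1.0}))   # 250.0
# A Matt-style mix lands in the optimized band even before cache discounts:
print(round(strategy_cost({"ollama": 0.15, "haiku": 0.75, "sonnet": 0.10}), 2))  # 487.5
```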

Do Exactly This

  1. Map your task types to complexity tiers
  2. Use AI prompt to generate 4-tier routing config
  3. Implement escalation logic (Haiku → Sonnet → Opus)
  4. Test each tier with representative tasks
  5. Run overnight batch job and analyze routing distribution
  6. Adjust thresholds based on results

4-Tier Routing Framework

TIER 0 - OLLAMA (Free):
Heartbeats, file ops, CSV work,
folder management

TIER 1 - HAIKU ($0.25/1M):
Web scraping, data collection,
list building, basic formatting
→ Escalate to Sonnet if blocked

TIER 2 - SONNET ($3/1M):
Writing, email drafting, code
generation, research synthesis
→ Escalate to Opus if blocked

TIER 3 - OPUS ($15/1M):
Strategic reasoning, novel problems,
complex logic
→ Final tier (no escalation)
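The tiers above reduce to a routing table plus an escalation chain. The task names mirror the framework; the run_on_model callback and its "None means blocked" convention are invented here for illustration.

```python
# Cheapest viable tier per task type, following the 4-tier framework.
TIER_FOR_TASK = {
    "heartbeat": "ollama", "file_ops": "ollama", "csv_compile": "ollama",
    "web_scrape": "haiku", "data_collection": "haiku", "list_building": "haiku",
    "email_draft": "sonnet", "code_gen": "sonnet",
    "strategic_reasoning": "opus",
}
ESCALATION = {"haiku": "sonnet", "sonnet": "opus"}  # Ollama/Opus don't escalate

def route(task_type, run_on_model):
    """Run on the cheapest viable model, climbing the chain on failure.

    run_on_model(model, task_type) returns a result, or None when blocked.
    """
    model = TIER_FOR_TASK.get(task_type, "haiku")  # Haiku is the default tier
    while True:
        result = run_on_model(model, task_type)
        if result is not None:
            return model, result
        if model not in ESCALATION:
            raise RuntimeError(f"{task_type} failed on final tier {model}")
        model = ESCALATION[model]
```

After a week, review your logs for escalations; any task type that always escalates should be reclassified one tier up.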

Matt's Actual Distribution

MATT'S ACTUAL USAGE:

Ollama: 15% (brainless ops)
Haiku: 75% (data collection)
Sonnet: 10% (writing/email)
Opus: <1% (strategic only)

OVERNIGHT JOB BREAKDOWN:
- 14 sub-agents running
- 10 agents: Haiku (web scraping)
- 2 agents: Sonnet (writing emails)
- 2 agents: Ollama (file organization)
- 0 agents: Opus (not needed)

Cost: $6 for 6 hours = $1/hour
Action Item: This step has the longest setup time but enables production economics. You can't run profitable overnight jobs without multi-model routing.
🤖 AI Prompt Library: Multi-Model Routing

Prompt 1: Generate Multi-Model Routing Config

I want to set up 4-tier model routing in OpenClaw to automatically use cheapest viable model for each task.

CURRENT SETUP:
[Paste your config.json]

MY TASK BREAKDOWN:
[Describe your common operations and their complexity]

GOAL:
Create complete multi-model routing configuration:

TIER 0 - OLLAMA (Free):
- Tasks: heartbeats, file_ops, csv_compile, folder_structure
- Escalation: None (if fails, log error)

TIER 1 - HAIKU ($0.25/1M):
- Tasks: data_collection, web_scrape, list_building, basic_formatting
- Escalation: Sonnet (if blocked or error)
- Default tier: Use unless explicitly routed elsewhere

TIER 2 - SONNET ($3/1M):
- Tasks: writing, email_draft, code_gen, research_synthesis
- Escalation: Opus (if blocked or error)

TIER 3 - OPUS ($15/1M):
- Tasks: strategic_reasoning, complex_logic, novel_problems
- Escalation: None (final tier)

OUTPUT:
- Complete routing config (JSON)
- Task classification rules
- Escalation logic implementation
- Test procedure for each tier
- Expected cost distribution

Make it production-ready and well-documented.

Prompt 2: Sub-Agent Orchestration

I want to create a sub-agent orchestration system for complex overnight batch jobs.

USE CASE:
[Describe your overnight job - e.g., B2B lead gen]

REQUIREMENTS:
- Spin up multiple specialized sub-agents
- Each routed to appropriate model tier
- Parallel execution where possible
- Results aggregated and organized
- Target: <$2/hour for full operation

DESIGN:
1. Agent Specialization:
   - How many sub-agents needed?
   - What does each specialize in?
   - Which model tier per agent?

2. Coordination:
   - Master agent responsibilities
   - Sub-agent communication
   - Error handling and escalation

3. Cost Optimization:
   - Task distribution for minimal cost
   - Cached token maximization
   - Parallel vs sequential trade-offs

OUTPUT:
- Sub-agent architecture diagram
- Task distribution algorithm
- Complete implementation code
- Test procedure (100-item trial)
- Scaling guidelines (1,000+ items)
Pro Tip: Start conservative with routing. Let Haiku try most things with auto-escalation. After a week, analyze escalation logs to optimize classification rules.
📊 Real Example: Matt's 6-Hour Overnight Job

Task: 1,000 Qualified B2B Leads

Agents 1-10 (Haiku):
  Web scraping distressed business signals
  Reading blogs and finding contact info
  Using Brave Search API + Hunter.io
  4 hours runtime

Agents 11-12 (Sonnet):
  Writing personalized cold outreach emails
  Creating follow-up sequences
  1.5 hours runtime

Agents 13-14 (Ollama):
  Organizing files into folders
  Compiling CSVs with proper headers
  0.5 hours runtime

RESULTS:
Total Cost: $6 for 6 hours
Per-hour: $1
Per-lead: $0.006

If run on Opus only: $150
If run on Sonnet only: $30
Savings: 96% vs Opus, 80% vs Sonnet

Key Insight: 95% of tokens were CACHED (repeated operations), further reducing cost. Caching + multi-model routing = production economics.

Deliverable: 1,000 leads + emails + follow-up sequences + organized spreadsheet + ready to execute outreach. All while Matt slept.
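Caching compounds with routing. Assuming cache reads bill at roughly 10% of the base input rate (an assumption here; check your provider's current prompt-caching pricing), a 95% hit ratio cuts the effective per-token rate by about 7x:

```python
def effective_rate(base_rate, cached_fraction, cache_read_multiplier=0.10):
    """Blended $/1M input tokens for a given cache hit ratio.

    cache_read_multiplier is an assumed discount -- verify against your
    provider's prompt-caching price sheet.
    """
    uncached = base_rate * (1 - cached_fraction)
    cached = base_rate * cache_read_multiplier * cached_fraction
    return uncached + cached

# Haiku at $0.25/1M with 95% of tokens cached:
print(effective_rate(0.25, 0.95))  # ~0.036/1M, roughly 7x cheaper
```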
Quality Gates

✅ GOOD ENOUGH (Production Ready)

  • 4-tier routing configured and functional
  • Escalation logic tested and working
  • 70%+ operations on Haiku or Ollama
  • Successfully completed multi-hour batch job
  • Cost tracking shows expected distribution

🌟 EXCEPTIONAL (Optimized)

  • 85%+ on Haiku/Ollama
  • Sub-agent orchestration working
  • Cached token ratio >80%
  • Custom routing rules per use case
  • Automated cost alerts if distribution drifts
Jason's Rule: "If overnight jobs cost <$2/hour and complete successfully, you're production-ready. Don't over-optimize before running real client work."

Validate Your Optimization: Final Token Audit

Est. time: ~15 min
PROOF
FINAL CHECK. Run your token audit one last time and verify ALL optimizations are active. This is your proof of 97% savings.

Complete Optimization Verification

✅ COMPLETE OPTIMIZATION VERIFICATION

Run final token audit and verify:

☐ Daily idle cost = $0 (no API calls during heartbeats)
☐ Context size <10KB per operation (down from 50KB+)
☐ Session cleanup working (if using Slack/WhatsApp)
☐ Ollama handling heartbeats + brainless ops (zero API calls)
☐ Multi-model routing active (80%+ on Haiku/Ollama in logs)
☐ Escalation logic tested (trigger escalation, verify it works)
☐ Overnight batch job completed at <$2/hour
☐ Agent functionality unchanged (same output quality)
☐ Cost reduction >90% from baseline
☐ You understand maintenance (can re-run audit monthly)

Your Final Numbers

YOUR FINAL NUMBERS:

Before optimization: $_____ /day
After optimization:  $_____ /day
Reduction:           _____ %
Monthly savings:     $_____ 

CAPABILITIES UNLOCKED:
☐ Overnight batch processing economically viable
☐ Sub-agent orchestration affordable
☐ Can run 24/7 for ~$1/hour
☐ Production lead gen operational
Matt's Final Results: Baseline $90+/month → After all optimizations $3-10/month. Reduction: 97%. New capabilities: Overnight batch processing, 14 sub-agent orchestration, B2B lead gen at $0.006/lead.
Business Model Shift: "I can do my work during the day, then set it to research/generate leads overnight. Wake up to completed work that would cost tens of thousands if outsourced. This is insane." — Matt Ganzac. From personal AI assistant (overhead cost) → productized lead gen service (profit center).

Implementation Resources & Next Steps

Optimization Quick Reference

| Fix | Savings | Time |
| --- | --- | --- |
| Context Bloat | 80% | 30 min |
| Session History | Variable | 20 min |
| Heartbeats → Ollama | $2-5/day | 25 min |
| Multi-Model Routing | 15%+ | 45 min |
| Combined | 97% | ~2.5 hrs |

Dependency Map

OPTIMIZATION DEPENDENCY MAP:

Step 0 (Token Audit) ← REQUIRED FIRST (need baseline)
  │
  ├─ Step 1 (Context Bloat) ← HIGHEST IMPACT, do next
  │    │
  │    ├─ Step 2 (Session History) ← Optional: Slack/WhatsApp only
  │    │
  │    └─ Step 3 (Heartbeats) ← Independent, can do after Step 1
  │         │
  │         └─ Step 4 (Multi-Model) ← Requires Steps 1+3 complete
  │
  └─ Validation ← After all steps complete
