OpenClaw Token Optimization: Cut AI Agent Costs by 97%

Based on Matt Ganzac's discovery sequence • 📺 Source Video
DEVELOPER-LEVEL 97% COST CUT PRODUCTION-TESTED

Default OpenClaw burns $50-150/month on architectural waste. Four fixes cut costs to $3/month while enabling overnight batch processing and production lead gen at $1/hour.

The System Overview

Default OpenClaw burns $50-150/month on architectural waste. Follow Matt Ganzac's exact discovery sequence — with AI doing the heavy lifting — to cut costs to $3/month while enabling overnight batch processing and production lead gen at $1/hour.

  • Daily idle cost: $3 → $0 (heartbeat + context waste eliminated)
  • Overnight job cost: $150 → $6 (multi-model routing + caching)
  • Monthly spend: $90 → $3 (97% reduction, same output)

Core Insight

"One guy went to bed and woke up having burned $500." — Matt Ganzac

  • Kill context bloat – 50KB→5KB per operation (90% reduction)
  • Dump session history – 111KB Slack tax eliminated
  • Heartbeats to Ollama – $0 forever for system checks
  • Multi-model routing – 85% of ops on cheapest viable model
⚠️ CRITICAL PREREQUISITES

Before you start, verify:

  • ✓ Developer-level skills (log analysis, config editing)
  • ✓ Dedicated machine for OpenClaw (NOT your personal laptop)
  • ✓ API access to Anthropic, Brave Search, Hunter.io
  • ✓ Understanding: Agent will attempt logins, purchases, actions
  • ✓ Willingness to break things and troubleshoot
Real Stakes: If you're not comfortable with the above, this playbook isn't for you. If you're just exploring OpenClaw, start with their docs first.

Step 0: Measure Your Current Waste

Est. time: ~20 min • Foundation step
FOUNDATION
Before/After Transformation
| ❌ BEFORE (Flying Blind) | ✅ AFTER (Data-Driven) |
| --- | --- |
| Don't know where tokens go | Exact token breakdown by category |
| Assume cost is just "normal" | Top 3 waste sources identified |
| No visibility into waste sources | Baseline for measuring improvement |
| Can't prioritize fixes | Clear prioritization roadmap |
Pattern Recognition: Matt only discovered 111KB session history bloat BECAUSE rate limits forced investigation. Measuring reveals hidden waste you'd never find otherwise.

The Problem

You can't optimize what you don't measure. Most OpenClaw users assume $50-150/month is "normal cost of AI." In reality, 80-95% is architectural waste you can eliminate in one afternoon.

Why This Matters

Matt's Discovery Story:

  • Loaded $25 to Anthropic API
  • Was on track to spend $20/day IDLE (doing nothing)
  • Only found the waste by running token audit after hitting rate limits
  • Without measurement, would have just kept burning money

Do Exactly This

  1. Access your OpenClaw logs directory (location varies by OS)
  2. Use AI prompt below to generate token audit script
  3. Run the script and save output to file
  4. Identify your top 3 waste sources:
    • Context bloat (likely 50KB+ per operation)
    • Session history (if using Slack/WhatsApp)
    • Heartbeat API calls (every 30 min)
  5. Calculate your daily idle cost (API calls × token cost)
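A minimal sketch of the kind of audit script step 2 asks the AI to generate. The JSONL log format and the "category"/"tokens" field names are invented for illustration; adapt them to whatever your OpenClaw logs actually contain.

```python
import json
from collections import defaultdict

# Invented log format: one JSON object per line. Real OpenClaw logs
# will use a different schema -- adjust the field names accordingly.
SAMPLE_LOGS = [
    '{"category": "heartbeat", "tokens": 52000}',
    '{"category": "heartbeat", "tokens": 51000}',
    '{"category": "task", "tokens": 8000}',
]

def audit(lines, price_per_million=3.00):  # Sonnet-class input rate, $/1M tokens
    """Aggregate token usage by category and rank the waste sources."""
    totals = defaultdict(int)
    for line in lines:
        entry = json.loads(line)
        totals[entry["category"]] += entry["tokens"]
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(cat, tok, tok / 1_000_000 * price_per_million) for cat, tok in ranked]

for category, tokens, cost in audit(SAMPLE_LOGS):
    print(f"{category:12s} {tokens:>8,d} tokens  ${cost:.3f}")
```

Point the same aggregation at your real log directory; the top line of the ranking is your first fix target.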

Your Baseline Metrics

MY BASELINE METRICS (from audit):

Context size per operation: _____ KB
Daily heartbeat cost: $_____
Session history size: _____ KB
Top waste source: ___________
Current daily idle cost: $_____
Current model usage: ___________

Matt's Numbers (Comparison)

MATT'S BASELINE:

Context size: 50KB+
Daily heartbeat cost: $2-3
Session history: 111KB (Slack)
Top waste: Context bloat
Daily idle cost: $3
Model: 100% Sonnet
Action Item: Don't move to Step 1 until you have your baseline numbers. These metrics are your proof that fixes work.
🤖 AI Prompt Library: Token Audit

Prompt 1: Generate Token Audit Script

I'm running OpenClaw and need to audit my token usage to find waste.

Generate a Python script that:

1. READS: OpenClaw log files from [SPECIFY YOUR LOG DIRECTORY]
2. EXTRACTS: Token usage data by category
3. CALCULATES:
   - Context size per operation type
   - Session history loading frequency and size
   - Heartbeat token usage
   - Model routing distribution
   - Daily/weekly cost projections

4. OUTPUTS:
   - Summary dashboard with key metrics
   - Top 5 waste sources ranked by cost
   - Before/after projection if waste eliminated
   - CSV export for tracking over time

5. HANDLES:
   - Different log formats
   - Missing data gracefully
   - Rate limit errors separately

Generate complete, documented script ready to run.
Include installation instructions for dependencies.

Prompt 2: Analyze Audit Results

I ran a token audit on my OpenClaw instance. Help me interpret the results and prioritize fixes.

MY AUDIT RESULTS:
[Paste your audit output here]

Analyze and provide:

1. TOP 3 WASTE SOURCES:
   - Rank by cost impact
   - Estimate savings % if eliminated
   - Difficulty to fix (easy/medium/hard)

2. QUICK WIN IDENTIFICATION:
   - What can I fix in <30 min for biggest impact?
   - Which fixes are prerequisites for others?

3. PRIORITIZED FIX SEQUENCE:
   - Step 1: [Fix + expected savings]
   - Step 2: [Fix + expected savings]
   - Step 3: [Fix + expected savings]

4. RED FLAGS:
   - Any unusual patterns?
   - Signs of misconfiguration?
   - Potential security issues?
Pro Tip: Run audit BEFORE and AFTER each optimization step. Seeing the numbers drop is incredibly motivating — and it's your proof.
Quality Gates: What's "Good Enough" vs "Exceptional"?

✅ GOOD ENOUGH (Proceed to Step 1)

  • Audit script runs successfully
  • You have baseline daily token usage number
  • You identified at least 1 waste source >20% of total
  • You saved the audit output for comparison later

🌟 EXCEPTIONAL (Nice to Have)

  • Automated daily audit reports
  • Historical trend tracking
  • Breakdown by sub-agent/task type
  • Alert thresholds configured
Jason's Rule: "Perfect measurement is procrastination. Get your top 3 waste sources and move. You can refine the audit later."

Step 1: Eliminate Context Bloat (80% Savings)

Est. time: ~30 min • Prerequisite: Step 0
HIGHEST IMPACT
Before/After Transformation
| ❌ BEFORE (Context Explosion) | ✅ AFTER (Selective Loading) |
| --- | --- |
| 50KB context loaded per message | ~5KB context per operation |
| 75KB after a few days, 100KB+ after a week | Context size stable (doesn't grow) |
| 2-3M tokens/day just from heartbeats | Zero tokens on heartbeats |
| $2-3/day sitting completely idle | $0/day idle cost |
Real Example: Matt went from $20/day burn rate (projected) to $0/day idle after this single fix. Same agent functionality, 90% fewer tokens.

The Problem

Every heartbeat (every 30 min), every message, every prompt loads ALL your context files. As your memory grows, this compounds: 50KB → 75KB → 100KB+. You're burning tokens just to keep the agent awake.

Why This Matters

Matt's Numbers:

  • Initial context: 50KB per operation
  • After fixes: 5KB per operation (90% reduction)
  • Savings: $3/day → $0/day idle
  • Time to implement: 15 minutes of config changes

This is the SINGLE highest-leverage fix. Do this first.

Do Exactly This

  1. Locate your config file:
    Mac:     ~/.openclaw/config.json
    Windows: %APPDATA%\OpenClaw\config.json
    Linux:   ~/.config/openclaw/config.json
  2. Backup your current config (copy to safe location)
  3. Use AI prompt below to generate config modifications
  4. Apply the changes (edit config file)
  5. Restart OpenClaw and test with simple task
  6. Re-run token audit to verify reduction
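Steps 2-4 can be scripted rather than hand-edited. The sketch below backs up the config before touching it; the two context-limiting keys are hypothetical placeholders, not confirmed OpenClaw parameters (discovering the real ones is exactly what the prompt below asks the AI to do).

```python
import json
import shutil
from pathlib import Path

def apply_context_limits(config_path):
    """Back up config.json, then write selective-context settings.

    WARNING: "heartbeat_context" and "context_max_kb" are illustrative
    placeholder keys, not confirmed OpenClaw parameters.
    """
    config_path = Path(config_path)
    shutil.copy(config_path, config_path.with_name(config_path.name + ".bak"))
    cfg = json.loads(config_path.read_text())
    cfg["heartbeat_context"] = "none"  # hypothetical: skip context on heartbeats
    cfg["context_max_kb"] = 10         # hypothetical: cap per-operation context
    config_path.write_text(json.dumps(cfg, indent=2))
```

The backup copy is your rollback path if the agent breaks after the change.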

Your Verification

BEFORE THIS FIX:
Context per operation: _____ KB
Daily idle cost: $_____

AFTER THIS FIX:
Context per operation: _____ KB
Daily idle cost: $_____

SAVINGS: _____%

Troubleshooting

Common Failures:

  • Agent broke / won't complete tasks: Restore backup, use "preserve task context" variant prompt
  • Still loading 30KB+ context: Check if session history is the culprit (Step 2)
  • Rate limit errors: Implement pacing logic (covered in Step 4)
Action Item: This fix alone gets most people to break-even economics. If you do nothing else, do this.
🤖 AI Prompt Library: Context Management

Prompt 1: Generate Context Management Config

I'm running OpenClaw and need to eliminate context bloat.

CURRENT SITUATION:
- Every heartbeat loads full context (~50KB+)
- Every message loads all context + history
- Context size growing: 50KB → 75KB → 100KB+

MY CURRENT CONFIG:
[Paste your config.json here OR say "using default config"]

GOAL:
Generate modified config that:
1. Prevents context loading on heartbeats
2. Implements selective context loading by task type
3. Reduces typical operation from 50KB → <10KB
4. Maintains necessary context for task completion
5. Preserves agent functionality

OUTPUT:
- Complete modified config.json
- Explanation of each changed parameter
- Test procedure to verify context reduction
- Rollback instructions if something breaks

CRITICAL: Don't break the agent. Preserve task completion capability.

Prompt 2: Verify Context Reduction

I just modified my OpenClaw config to reduce context bloat.

Run this verification checklist:

1. AUDIT COMPARISON:
   Before config: [Your Step 0 context size]
   After config: [Your new audit context size]
   Reduction: ____%

2. FUNCTIONALITY TEST:
   - Can agent complete simple task? (Y/N)
   - Can agent access memory when needed? (Y/N)
   - Any error messages? [List them]

3. NEXT STEPS:
   - If reduction <50%: [What to check/fix]
   - If agent broke: [Rollback procedure]
   - If reduction >70%: [Proceed to Step 2]

Analyze my results and tell me if I'm good to proceed.

MY RESULTS:
[Paste audit comparison here]
Pro Tip: Start conservative. Aim for 50-70% reduction first. You can tighten further after validating agent still works.
📊 Real Example: Matt's Implementation
MATT'S IMPLEMENTATION:

Discovery: Token audit showed 50KB context per heartbeat
Fix Applied: Selective context loading config
Test: Simple task completion verified
Result: 50KB → 5KB (90% reduction)
Savings: $3/day → $0/day idle
Time: 15 min config + 5 min test

Key Learning: Rate limits forced him to investigate. Most users never look at token breakdown and just accept the cost.

Quality Gates: What's "Good Enough" vs "Exceptional"?

✅ GOOD ENOUGH (Proceed to Step 2)

  • Context size reduced by 50%+ (audit shows this)
  • Agent still completes basic tasks correctly
  • Daily idle cost dropped significantly
  • You have backup config if rollback needed

🌟 EXCEPTIONAL (Optimization Round 2)

  • Context size <10KB consistently
  • Task-specific context rules configured
  • 90%+ reduction from baseline
  • Automated monitoring alerts if bloat returns
Jason's Rule: "50% reduction = ship it and move on. Don't optimize the optimization before fixing the other 3 waste sources."

Step 2: Kill Session History Tax (Messaging Platforms)

Est. time: ~20 min • Slack/WhatsApp only
PLATFORM-SPECIFIC
Before/After Transformation
| ❌ BEFORE (History Explosion) | ✅ AFTER (Clean Sessions) |
| --- | --- |
| Slack loads entire chat history (111KB!) | "new session" command dumps history |
| Every message sends full history to API | History saved to memory (accessible if needed) |
| 1M+ tokens per prompt when using Slack | Normal token usage restored |
| Rate limit errors constantly | No more rate limit errors |
Discovery Method: Matt only found this BECAUSE rate limits forced log investigation. Web interface didn't have this issue — it's Slack-specific bloat.

The Problem

If you use Slack or WhatsApp to communicate with OpenClaw, it's loading your ENTIRE conversation history on every API call. Matt found 111KB of session text being sent every time he prompted the agent.

✅ Slack (confirmed issue)

⚠️ WhatsApp (likely same issue)

Web-only users: If you only use the web interface, SKIP this step entirely. This issue doesn't affect you.

Do Exactly This

  1. Verify you have this issue (token audit shows large session size)
  2. Use AI prompt below to generate "new session" command
  3. Integrate the command into your OpenClaw setup
  4. Test it by typing "new session" in Slack
  5. Verify dump worked (check memory storage, re-run audit)
  6. Make it a habit — type "new session" before expensive operations
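What "new session" does can be modeled in a few lines: dump the buffer to memory, then clear it. The buffer class and the file-based memory store here are toys invented for illustration; the real command operates on OpenClaw's Slack session state.

```python
import json
import time
from pathlib import Path

class SessionBuffer:
    """Toy stand-in for the chat history Slack keeps resending."""
    def __init__(self):
        self.messages = []

    def append(self, text):
        self.messages.append(text)

    def size_kb(self):
        return sum(len(m) for m in self.messages) / 1024

def new_session(buffer, memory_dir):
    """Dump the session to memory, report savings, clear the buffer."""
    memory_dir = Path(memory_dir)
    memory_dir.mkdir(parents=True, exist_ok=True)
    dump = memory_dir / f"session-{int(time.time())}.json"
    dump.write_text(json.dumps(buffer.messages))  # still recallable later
    saved_kb = buffer.size_kb()
    buffer.messages.clear()  # next API call carries zero history
    return dump, saved_kb
```

The key design point: history is moved to memory, not deleted, so the agent can still recall it on demand without paying for it on every call.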

When to Dump Session

TYPE "NEW SESSION" BEFORE:
- Running overnight batch jobs
- Expensive research tasks
- Multi-agent orchestration
- Any operation you want cost-optimized

KEEP SESSION WHEN:
- Ongoing conversation needs context
- Quick back-and-forth exchanges
- Currently debugging something

Matt's Usage Pattern

MATT'S APPROACH:

Morning: "new session" (clean slate)
During work: Keep session active
Before bed: "new session" before overnight jobs
After big tasks: "new session" to prevent bloat

Result: Never hits rate limits anymore
Session bloat: 111KB → 0KB consistently
Action Item: If you use Slack/WhatsApp, this fix is MANDATORY. You cannot hit production economics without it.
🤖 AI Prompt Library: Session Cleanup

Prompt 1: Generate Session Cleanup Command

I'm running OpenClaw with Slack integration and it's loading my entire Slack history (111KB) on every message.

CREATE: A custom OpenClaw command called "new_session" that:

FUNCTIONALITY:
1. Dumps current Slack/WhatsApp session buffer
2. Saves dumped content to agent memory (accessible for recall)
3. Clears the session buffer completely
4. Confirms cleanup completed with token savings estimate

TECHNICAL REQUIREMENTS:
- Callable by typing "new session" in Slack
- Memory format allows selective recall if needed
- No loss of critical information
- Immediate effect on next API call
- Logs the action for audit trail

OUTPUT:
- Complete command code (ready to integrate)
- Installation/integration instructions for OpenClaw
- Usage examples and best practices
- Verification method (how to confirm it worked)
- Expected token savings per operation after cleanup

PLATFORM: [Slack / WhatsApp / Other]

Prompt 2: Session History Audit

Help me verify session history is my token waste culprit.

MY TOKEN AUDIT SHOWS:
[Paste relevant audit sections]

Analyze:
1. Is session history loading the issue?
   - Expected indicators: [What to look for in logs]
   - Platform comparison: Does web UI show same pattern?

2. Estimated impact:
   - Current waste from session history: $___/day
   - Potential savings if fixed: $___/day

3. Verification procedure:
   - How do I confirm session dump worked?
   - What should audit show after fix?

4. Alternative causes:
   - If it's NOT session history, what else could cause this pattern?
Pro Tip: Compare token usage between Slack and web interface for same task. If Slack burns 10x more, session history is the culprit.
📊 Real Example: Matt's Session Discovery
MATT'S SESSION HISTORY DISCOVERY:

Problem: Hitting rate limits (429 errors) constantly
Investigation: Compared web UI vs Slack token usage
Finding: Slack used 1M+ tokens, web UI used ~50K tokens for the SAME task
Root cause: 111KB session history blob in every Slack API call
Fix: Created "new session" command
Result: Rate limits eliminated, token usage normalized
Time: 20 min to build command, 2 min to use

Critical Insight: Would never have found this without rate limits forcing investigation. Silent killer for Slack users.

Quality Gates

✅ GOOD ENOUGH (Proceed to Step 3)

  • "new session" command works in Slack/WhatsApp
  • Token audit shows session size dropped to near-zero
  • No more rate limit errors
  • Memory recall still functional when needed
Jason's Rule: "If it works in Slack, it's good enough. Don't build the perfect session manager before you fix heartbeats."

Step 3: Heartbeats to Ollama (Zero API Cost)

Est. time: ~25 min • Independent step
ZERO MARGINAL COST
Before/After Transformation
| ❌ BEFORE (Paying to Stay Awake) | ✅ AFTER (Local Heartbeats) |
| --- | --- |
| Heartbeat every 30 minutes via API | Heartbeat runs on Ollama (local, free) |
| Using Opus: $5/day idle | $0/day heartbeat cost (any model) |
| Using Sonnet: $2-3/day idle | Infinite heartbeat frequency if wanted |
| Using Haiku: $0.50/day idle | Same functionality, zero cost |
Why This Works: Heartbeat is brainless — just checking memory and task queue. Doesn't need cloud AI, can run entirely locally.

The Problem

Heartbeats keep your agent alive and checking for active tasks. But using Opus/Sonnet/Haiku for this is like hiring a neurosurgeon to take your temperature. It's complete overkill.

Why This Matters

Heartbeat Economics:

  • Frequency: Every 30 minutes (48/day)
  • Task: Check memory, check task queue, report status
  • Complexity: Literally just "system okay?" level logic
  • Cost on Opus: ~$5/day for brainless pings
  • Cost on Ollama: $0 forever

This is FREE money.

Do Exactly This

  1. Install Ollama (local LLM runtime, open source, free)
    • Download from ollama.ai
    • Install latest version
    • Verify it runs: ollama --version
  2. Add Ollama to OpenClaw config using prompt below
  3. Update heartbeat routing to use Ollama
  4. Test heartbeat manually (trigger one, verify it works)
  5. Run overnight and verify zero API calls in audit
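The steps above can be tested by hand with a short script. The /api/generate endpoint on port 11434 is Ollama's actual default REST API; the heartbeat prompt, the model name, and how OpenClaw wires this in are assumptions for illustration.

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

# Hypothetical heartbeat prompt -- the real check depends on your setup.
HEARTBEAT_PROMPT = "Reply OK if you can read this. This is a liveness check."

def heartbeat_payload(model="llama3"):
    """Build a non-streaming request body for Ollama's generate API."""
    return {"model": model, "prompt": HEARTBEAT_PROMPT, "stream": False}

def run_heartbeat(model="llama3"):
    """POST the heartbeat to the local Ollama server: zero API spend."""
    data = json.dumps(heartbeat_payload(model)).encode()
    req = request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["response"]
```

Schedule run_heartbeat() every 30 minutes in place of the API-backed heartbeat; if Ollama is down, the connection error is your failure signal.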

Perfect for Ollama (Free)

  • ✅ Heartbeats (system health checks)
  • ✅ File organization (moving, renaming)
  • ✅ CSV compilation (merging data)
  • ✅ Folder structure (creating directories)
  • ✅ Basic text formatting
  • ✅ Log file parsing

= Brainless operations with zero reasoning required

Still Use API For

  • 🔵 Web research (Haiku)
  • 🔵 Email writing (Sonnet)
  • 🔵 Code generation (Sonnet)
  • 🟣 Strategic reasoning (Opus)
  • 🟣 Complex analysis (Opus)

= Anything requiring actual intelligence

Action Item: Ollama is your "infinite brainless operations" cheat code. Use it aggressively for anything that doesn't require reasoning.
🤖 AI Prompt Library: Ollama Integration

Prompt 1: Configure Ollama Integration

I want to add Ollama (local LLM runtime) to my OpenClaw setup to eliminate API costs for brainless operations.

CURRENT SETUP:
[Paste your current OpenClaw config.json]

GOAL:
Generate config modifications to:

1. ADD OLLAMA:
   - Model: [your locally installed model, e.g. llama3]
   - Use for: heartbeats, file_ops, csv_compile, folder_structure
   - Cost tier: FREE
   - Routing: Default for "brainless" task category

2. UPDATE HEARTBEAT:
   - Route all heartbeats to Ollama
   - Remove API calls entirely
   - Maintain check frequency (30 min)
   - Preserve functionality

3. TESTING:
   - How to verify Ollama is handling heartbeats
   - How to confirm zero API calls
   - Fallback if Ollama fails

OUTPUT:
- Complete modified config.json
- Ollama installation verification steps
- Integration test procedure
- Troubleshooting common issues

Make it copy-paste ready.

Prompt 2: Identify "Brainless" Operations

Help me identify which of my OpenClaw operations should run on Ollama (free) vs API models (paid).

MY TYPICAL TASKS:
[List your common agent tasks here]

For each task, classify as:

1. BRAINLESS (Ollama - Free):
   - No reasoning required
   - File/data manipulation only
   - Pattern matching at most

2. LOW-COMPLEXITY (Haiku - $0.25/1M):
   - Basic research/collection
   - Simple formatting with light intelligence

3. MEDIUM-COMPLEXITY (Sonnet - $3/1M):
   - Writing, email drafting
   - Code generation

4. HIGH-COMPLEXITY (Opus - $15/1M):
   - Strategic reasoning
   - Novel problem solving

Then estimate my cost savings routing everything optimally.
Pro Tip: If you're uncertain whether a task needs API vs Ollama, run it on Ollama first. It'll fail fast if it can't handle it, then escalate automatically.
📊 Real Example: Matt's Ollama Implementation
MATT'S OLLAMA IMPLEMENTATION:

Problem: Spending $2-3/day on heartbeats alone (Sonnet)
Solution: Installed Ollama, routed heartbeats to local
Test: Let it run overnight, checked logs
Result: Zero API calls for heartbeats confirmed
Savings: $2-3/day → $0/day on heartbeats

Extended Use Cases Matt Found:
- File organization during overnight jobs (14 sub-agents)
- CSV compilation from multiple sources
- Folder structure creation
- Log parsing and cleanup

Total Ollama Usage: ~15% of all operations
Total Savings: These operations would have cost $20-30/month
Actual Cost: $0 forever

Installation Time: 10 min
Config Time: 15 min
ROI: Infinite (free forever)
Quality Gates

✅ GOOD ENOUGH (Proceed to Step 4)

  • Ollama installed and running
  • Heartbeats routed to Ollama successfully
  • Token audit shows zero API calls for heartbeats
  • Agent still functions normally
Jason's Rule: "If Ollama can do it for free, Ollama MUST do it. Every API call is money you're lighting on fire unnecessarily."

Step 4: Route by Task Complexity (15% Additional Savings)

Est. time: ~45 min • Requires Steps 1+3
ADVANCED OPTIMIZATION
Before/After Transformation
| ❌ BEFORE (Single Model Waste) | ✅ AFTER (Smart Routing) |
| --- | --- |
| 100% operations on Sonnet ($3/1M tokens) | 15% Ollama (free) — brainless ops |
| OR 100% on Opus ($15/1M tokens) | 75% Haiku ($0.25/1M) — data collection |
| Brainless tasks using expensive AI | 10% Sonnet ($3/1M) — writing/code |
| No escalation logic | <1% Opus ($15/1M) — strategic only |
| Monthly: $90-150 | Monthly: $3-10 |
Real Example: Matt's overnight job would have cost $150 on Opus alone; with multi-model routing and 95% cached tokens it cost him $6.

The Problem

Using Opus for everything is like hiring a Fortune 500 CEO to organize your filing cabinet. The work gets done, but you're burning massive money on overkill.

Task Complexity Distribution:

  • 85% of operations: Brainless or low-complexity
  • 10% of operations: Medium complexity
  • 5% of operations: High complexity

But default setups use ONE model for everything.

Cost Comparison for 1M Operations

| Routing Strategy | Cost | Savings vs Opus |
| --- | --- | --- |
| All Opus | $15,000 | baseline |
| All Sonnet | $3,000 | 80% |
| All Haiku | $250 | 98% |
| Optimized routing | $300-500 | 96-98% |
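The table follows from blended per-token rates. Assuming 1M operations works out to roughly 1B tokens (about 1K tokens per operation, an assumption for illustration), the arithmetic is:

```python
# Per-model prices from the tier framework, $ per 1M tokens.
PRICES = {"ollama": 0.00, "haiku": 0.25, "sonnet": 3.00, "opus": 15.00}
TOTAL_TOKENS_M = 1_000  # ~1B tokens, expressed in millions

def strategy_cost(mix):
    """mix maps model -> fraction of tokens; returns total dollars."""
    blended_rate = sum(PRICES[m] * frac for m, frac in mix.items())
    return blended_rate * TOTAL_TOKENS_M

print(strategy_cost({"opus": 1.0}))    # 15000.0
print(strategy_cost({"sonnet": 1.0}))  # 3000.0
print(strategy_cost({"haiku": 1.0}))   # 250.0
# A Matt-style mix lands in the optimized band even before cache discounts:
print(round(strategy_cost({"ollama": 0.15, "haiku": 0.75, "sonnet": 0.10}), 2))  # 487.5
```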

Do Exactly This

  1. Map your task types to complexity tiers
  2. Use AI prompt to generate 4-tier routing config
  3. Implement escalation logic (Haiku → Sonnet → Opus)
  4. Test each tier with representative tasks
  5. Run overnight batch job and analyze routing distribution
  6. Adjust thresholds based on results

4-Tier Routing Framework

TIER 0 - OLLAMA (Free):
Heartbeats, file ops, CSV work,
folder management

TIER 1 - HAIKU ($0.25/1M):
Web scraping, data collection,
list building, basic formatting
→ Escalate to Sonnet if blocked

TIER 2 - SONNET ($3/1M):
Writing, email drafting, code
generation, research synthesis
→ Escalate to Opus if blocked

TIER 3 - OPUS ($15/1M):
Strategic reasoning, novel problems,
complex logic
→ Final tier (no escalation)
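The tiers above reduce to a routing table plus an escalation chain. The task names mirror the framework; the run_on_model callback and its "None means blocked" convention are invented here for illustration.

```python
# Cheapest viable tier per task type, following the 4-tier framework.
TIER_FOR_TASK = {
    "heartbeat": "ollama", "file_ops": "ollama", "csv_compile": "ollama",
    "web_scrape": "haiku", "data_collection": "haiku", "list_building": "haiku",
    "email_draft": "sonnet", "code_gen": "sonnet",
    "strategic_reasoning": "opus",
}
ESCALATION = {"haiku": "sonnet", "sonnet": "opus"}  # Ollama/Opus don't escalate

def route(task_type, run_on_model):
    """Run on the cheapest viable model, climbing the chain on failure.

    run_on_model(model, task_type) returns a result, or None when blocked.
    """
    model = TIER_FOR_TASK.get(task_type, "haiku")  # Haiku is the default tier
    while True:
        result = run_on_model(model, task_type)
        if result is not None:
            return model, result
        if model not in ESCALATION:
            raise RuntimeError(f"{task_type} failed on final tier {model}")
        model = ESCALATION[model]
```

After a week, review your logs for escalations; any task type that always escalates should be reclassified one tier up.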

Matt's Actual Distribution

MATT'S ACTUAL USAGE:

Ollama: 15% (brainless ops)
Haiku: 75% (data collection)
Sonnet: 10% (writing/email)
Opus: <1% (strategic only)

OVERNIGHT JOB BREAKDOWN:
- 14 sub-agents running
- 10 agents: Haiku (web scraping)
- 2 agents: Sonnet (writing emails)
- 2 agents: Ollama (file organization)
- 0 agents: Opus (not needed)

Cost: $6 for 6 hours = $1/hour
Action Item: This step has the longest setup time but enables production economics. You can't run profitable overnight jobs without multi-model routing.
🤖 AI Prompt Library: Multi-Model Routing

Prompt 1: Generate Multi-Model Routing Config

I want to set up 4-tier model routing in OpenClaw to automatically use cheapest viable model for each task.

CURRENT SETUP:
[Paste your config.json]

MY TASK BREAKDOWN:
[Describe your common operations and their complexity]

GOAL:
Create complete multi-model routing configuration:

TIER 0 - OLLAMA (Free):
- Tasks: heartbeats, file_ops, csv_compile, folder_structure
- Escalation: None (if fails, log error)

TIER 1 - HAIKU ($0.25/1M):
- Tasks: data_collection, web_scrape, list_building, basic_formatting
- Escalation: Sonnet (if blocked or error)
- Default tier: Use unless explicitly routed elsewhere

TIER 2 - SONNET ($3/1M):
- Tasks: writing, email_draft, code_gen, research_synthesis
- Escalation: Opus (if blocked or error)

TIER 3 - OPUS ($15/1M):
- Tasks: strategic_reasoning, complex_logic, novel_problems
- Escalation: None (final tier)

OUTPUT:
- Complete routing config (JSON)
- Task classification rules
- Escalation logic implementation
- Test procedure for each tier
- Expected cost distribution

Make it production-ready and well-documented.

Prompt 2: Sub-Agent Orchestration

I want to create a sub-agent orchestration system for complex overnight batch jobs.

USE CASE:
[Describe your overnight job - e.g., B2B lead gen]

REQUIREMENTS:
- Spin up multiple specialized sub-agents
- Each routed to appropriate model tier
- Parallel execution where possible
- Results aggregated and organized
- Target: <$2/hour for full operation

DESIGN:
1. Agent Specialization:
   - How many sub-agents needed?
   - What does each specialize in?
   - Which model tier per agent?

2. Coordination:
   - Master agent responsibilities
   - Sub-agent communication
   - Error handling and escalation

3. Cost Optimization:
   - Task distribution for minimal cost
   - Cached token maximization
   - Parallel vs sequential trade-offs

OUTPUT:
- Sub-agent architecture diagram
- Task distribution algorithm
- Complete implementation code
- Test procedure (100-item trial)
- Scaling guidelines (1,000+ items)
Pro Tip: Start conservative with routing. Let Haiku try most things with auto-escalation. After a week, analyze escalation logs to optimize classification rules.
📊 Real Example: Matt's 6-Hour Overnight Job

Task: 1,000 Qualified B2B Leads

Agents 1-10 (Haiku):
  Web scraping distressed business signals
  Reading blogs and finding contact info
  Using Brave Search API + Hunter.io
  4 hours runtime

Agents 11-12 (Sonnet):
  Writing personalized cold outreach emails
  Creating follow-up sequences
  1.5 hours runtime

Agents 13-14 (Ollama):
  Organizing files into folders
  Compiling CSVs with proper headers
  0.5 hours runtime

RESULTS:
Total Cost: $6 for 6 hours
Per-hour: $1
Per-lead: $0.006

If run on Opus only: $150
If run on Sonnet only: $30
Savings: 96% vs Opus, 80% vs Sonnet

Key Insight: 95% of tokens were CACHED (repeated operations), further reducing cost. Caching + multi-model routing = production economics.

Deliverable: 1,000 leads + emails + follow-up sequences + organized spreadsheet + ready to execute outreach. All while Matt slept.
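Caching compounds with routing. Assuming cache reads bill at roughly 10% of the base input rate (an assumption here; check your provider's current prompt-caching pricing), a 95% hit ratio cuts the effective per-token rate by about 7x:

```python
def effective_rate(base_rate, cached_fraction, cache_read_multiplier=0.10):
    """Blended $/1M input tokens for a given cache hit ratio.

    cache_read_multiplier is an assumed discount -- verify against your
    provider's prompt-caching price sheet.
    """
    uncached = base_rate * (1 - cached_fraction)
    cached = base_rate * cache_read_multiplier * cached_fraction
    return uncached + cached

# Haiku at $0.25/1M with 95% of tokens cached:
print(effective_rate(0.25, 0.95))  # ~0.036/1M, roughly 7x cheaper
```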
Quality Gates

✅ GOOD ENOUGH (Production Ready)

  • 4-tier routing configured and functional
  • Escalation logic tested and working
  • 70%+ operations on Haiku or Ollama
  • Successfully completed multi-hour batch job
  • Cost tracking shows expected distribution

🌟 EXCEPTIONAL (Optimized)

  • 85%+ on Haiku/Ollama
  • Sub-agent orchestration working
  • Cached token ratio >80%
  • Custom routing rules per use case
  • Automated cost alerts if distribution drifts
Jason's Rule: "If overnight jobs cost <$2/hour and complete successfully, you're production-ready. Don't over-optimize before running real client work."

Validate Your Optimization: Final Token Audit

Est. time: ~15 min
PROOF
FINAL CHECK. Run your token audit one last time and verify ALL optimizations are active. This is your proof of 97% savings.

Complete Optimization Verification

✅ COMPLETE OPTIMIZATION VERIFICATION

Run final token audit and verify:

☐ Daily idle cost = $0 (no API calls during heartbeats)
☐ Context size <10KB per operation (down from 50KB+)
☐ Session cleanup working (if using Slack/WhatsApp)
☐ Ollama handling heartbeats + brainless ops (zero API calls)
☐ Multi-model routing active (80%+ on Haiku/Ollama in logs)
☐ Escalation logic tested (trigger escalation, verify it works)
☐ Overnight batch job completed at <$2/hour
☐ Agent functionality unchanged (same output quality)
☐ Cost reduction >90% from baseline
☐ You understand maintenance (can re-run audit monthly)

Your Final Numbers

YOUR FINAL NUMBERS:

Before optimization: $_____ /day
After optimization:  $_____ /day
Reduction:           _____ %
Monthly savings:     $_____ 

CAPABILITIES UNLOCKED:
☐ Overnight batch processing economically viable
☐ Sub-agent orchestration affordable
☐ Can run 24/7 for ~$1/hour
☐ Production lead gen operational
Matt's Final Results: Baseline $90+/month → After all optimizations $3-10/month. Reduction: 97%. New capabilities: Overnight batch processing, 14 sub-agent orchestration, B2B lead gen at $0.006/lead.
Business Model Shift: "I can do my work during the day, then set it to research/generate leads overnight. Wake up to completed work that would cost tens of thousands if outsourced. This is insane." — Matt Ganzac. From personal AI assistant (overhead cost) → productized lead gen service (profit center).

Implementation Resources & Next Steps

Optimization Quick Reference

| Fix | Savings | Time |
| --- | --- | --- |
| Context Bloat | 80% | 30 min |
| Session History | Variable | 20 min |
| Heartbeats → Ollama | $2-5/day | 25 min |
| Multi-Model Routing | 15%+ | 45 min |
| Combined | 97% | ~2.5 hrs |

Dependency Map

OPTIMIZATION DEPENDENCY MAP:

Step 0 (Token Audit) ← REQUIRED FIRST (need baseline)
  │
  ├─ Step 1 (Context Bloat) ← HIGHEST IMPACT, do next
  │    │
  │    ├─ Step 2 (Session History) ← Optional: Slack/WhatsApp only
  │    │
  │    └─ Step 3 (Heartbeats) ← Independent, can do after Step 1
  │         │
  │         └─ Step 4 (Multi-Model) ← Requires Steps 1+3 complete
  │
  └─ Validation ← After all steps complete
