How Forge Evolves Itself

Mechanical self-improvement for AI systems — detecting drag, classifying components, tracking prediction accuracy, and deliberately expiring scaffolding.

The Core Insight

Most AI systems only grow. They add features, add context, add services — and never ask whether what they added last month is still earning its keep. Surface area increases linearly while capability plateaus.

Forge needed the opposite: a system that measures its own drag and flags what should be pruned, rewritten, or expired. Not "continuous refactoring" — measured, leverage-gated evolution.

The Question That Changed Everything

Instead of "how do we rewrite ourselves?" — ask: "How do we make rewriting cheap, measurable, and leverage-positive?" That's a Forge question. And the answer is: build the sensing layer first.

Primitives vs. Scaffolds

Every component in Forge is now classified as one of two types:

Primitives — the Structural Spine
Stable infrastructure that future capabilities compose upon. Rarely rewritten casually. Versioned, contract-tested, treated as infrastructure.

Examples: Model router contract, agent lifecycle API, skill command interface, human approval gates, task state schema, cascade data structure.

Scaffolds — Temporary Accelerators
Compensations for current model limitations or workflow gaps. Each declares a shelf-life trigger — the condition under which it should be re-evaluated.

Examples: Prompt templates (shelf: model upgrade), context injection heuristics (shelf: quarterly), planning mode enforcement (shelf: when Claude handles multi-step natively), cascade ranking logic (shelf: quarterly).

The key question when building anything: "Am I building a primitive or a scaffold?" If it's a scaffold, declare the shelf-life upfront. If it's a primitive, invest in stability.

The Evolution Scanner

Forge now runs an automated evolution scan (wired into the 2x-daily reconciliation timer) that reads existing logs to detect drag:

| Signal | Source | What It Detects |
|--------|--------|-----------------|
| File Churn | changelog.jsonl | Files changed >10 times in 30 days (excluding logs) — instability indicator |
| Ralph Success Rate | ralph_events table | 7-day and 30-day task success trends — is the agent getting better or worse? |
| Token Bloat | context-loader.py | Files exceeding their token budget — agents get truncated context |
| Surface Area | git LOC count | Total lines of code trending up or down — is the system getting leaner? |
| Scaffold Expiry | YAML frontmatter | Scaffolds past their declared shelf-life — candidates for review |
| Cascade Accuracy | cascade_predictions table | Were past "next domino" predictions correct? Feedback loop for the evaluator. |
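The first signal, file churn, can be sketched in a few lines. This is a hypothetical illustration, not Forge's actual implementation — the record fields ("file", "ts") are assumed, and the sample log is synthetic:

```python
import json
from collections import Counter
from datetime import datetime, timedelta

CHURN_THRESHOLD = 10  # >10 changes in the window flags a file as unstable

def churn_report(lines, now, window_days=30, exclude=("logs/",)):
    """Count per-file changes in a changelog.jsonl-style log over the
    last `window_days`, skipping excluded paths, and return offenders."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter()
    for line in lines:
        rec = json.loads(line)
        if datetime.fromisoformat(rec["ts"]) < cutoff:
            continue
        if rec["file"].startswith(exclude):
            continue
        counts[rec["file"]] += 1
    return {f: n for f, n in counts.items() if n > CHURN_THRESHOLD}

# Synthetic log: one hot file, one quiet file, one excluded log file.
now = datetime(2026, 3, 1)
lines = [json.dumps({"file": "scripts/evolve.py", "ts": "2026-02-20"})] * 12
lines += [json.dumps({"file": "CASCADE.md", "ts": "2026-02-25"})] * 3
lines += [json.dumps({"file": "logs/run.log", "ts": "2026-02-26"})] * 40

print(churn_report(lines, now))  # {'scripts/evolve.py': 12}
```

Because the scanner only reads logs that already exist, adding a signal like this costs a query, not new instrumentation.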

What the First Scan Found

The first run of the evolution scanner surfaced real problems immediately:

Token Budget Crisis

Always-tier context files: 9,693 tokens loaded vs. 6,100 budget (59% over). Five files are being truncated on every single agent invocation — meaning agents consistently get incomplete context.

| File | Tokens | Budget | Usage |
|------|--------|--------|-------|
| JASON-DEPS.md | 2,139 | 300 | 713% |
| SESSION-QUEUE.md | 2,676 | 400 | 669% |
| PRINCIPLE-ANCHOR.md (scaffold) | 1,652 | 800 | 206% |
| core-principles.md | 475 | 300 | 158% |
| CASCADE.md | 636 | 600 | 106% |

This was invisible before the scanner existed. Every Ralph invocation, every cascade evaluation, every agent session was working with truncated context — and nobody knew. The scanner found it in 0.3 seconds.
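The budget check itself is mechanically simple — compare each file's measured token count against its declared budget. A minimal sketch using the figures from the scan above (how context-loader.py actually tokenizes and stores budgets is not shown in this article, so the data structures here are assumptions):

```python
# Declared per-file token budgets vs. measured token counts
# (figures from the first scan; only over-budget files are shown).
BUDGETS = {
    "JASON-DEPS.md": 300,
    "SESSION-QUEUE.md": 400,
    "PRINCIPLE-ANCHOR.md": 800,
    "core-principles.md": 300,
    "CASCADE.md": 600,
}

measured = {
    "JASON-DEPS.md": 2139,
    "SESSION-QUEUE.md": 2676,
    "PRINCIPLE-ANCHOR.md": 1652,
    "core-principles.md": 475,
    "CASCADE.md": 636,
}

def over_budget(measured, budgets):
    """Return {file: usage-percent} for every file above its budget."""
    report = {}
    for name, tokens in measured.items():
        budget = budgets[name]
        if tokens > budget:
            report[name] = round(100 * tokens / budget)
    return report

report = over_budget(measured, BUDGETS)
print(report["JASON-DEPS.md"])  # 713
```

The scan is fast precisely because it is this dumb: no model calls, no heuristics, just arithmetic over numbers the loader already computes.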

Cascade Accuracy Tracking

Forge's cascade evaluator predicts "the next domino" — the highest-leverage action to take. But predictions without validation are just stories.

Now, every cascade evaluation stores its prediction in a database table. Seven days later, the evolution scanner checks: was the predicted domino actually completed? Did the predicted unlocks happen?

Cascade evaluator predicts: "Build chat persistence"
  → 7 days pass
    → Scanner checks: was "chat persistence" completed in ralph_queue?
      → Score: 1.0 (completed + unlock observed) or 0.0 (prediction missed)
        → 30-day rolling accuracy feeds back into evaluator weighting

Over time, this creates a feedback loop: predictions that consistently miss get weighted down. The cascade evaluator gets smarter by existing.
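The record-then-score loop above can be sketched against a small SQLite database. This is a hedged illustration: the article names the cascade_predictions and ralph_queue tables but not their schemas, so the columns, scoring query, and sample data here are assumptions:

```python
import sqlite3
from datetime import date, timedelta

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE cascade_predictions (
    predicted_task TEXT, predicted_on TEXT, score REAL)""")
db.execute("CREATE TABLE ralph_queue (task TEXT, completed_on TEXT)")

def record_prediction(task, on):
    db.execute("INSERT INTO cascade_predictions VALUES (?, ?, NULL)",
               (task, on.isoformat()))

def score_due_predictions(today):
    """Score predictions made at least 7 days ago:
    1.0 if the predicted task was completed, else 0.0."""
    cutoff = (today - timedelta(days=7)).isoformat()
    completed = {t for (t,) in db.execute("SELECT task FROM ralph_queue")}
    due = db.execute(
        "SELECT predicted_task, predicted_on FROM cascade_predictions "
        "WHERE score IS NULL AND predicted_on <= ?", (cutoff,)).fetchall()
    for task, on in due:
        db.execute("UPDATE cascade_predictions SET score = ? "
                   "WHERE predicted_task = ? AND predicted_on = ?",
                   (1.0 if task in completed else 0.0, task, on))

def rolling_accuracy(today, window_days=30):
    """30-day rolling accuracy that feeds back into evaluator weighting."""
    cutoff = (today - timedelta(days=window_days)).isoformat()
    row = db.execute(
        "SELECT AVG(score) FROM cascade_predictions "
        "WHERE score IS NOT NULL AND predicted_on >= ?", (cutoff,)).fetchone()
    return row[0]

record_prediction("chat persistence", date(2026, 2, 10))
record_prediction("rewrite cascade ranking", date(2026, 2, 12))
db.execute("INSERT INTO ralph_queue VALUES ('chat persistence', '2026-02-14')")

score_due_predictions(date(2026, 3, 1))
print(rolling_accuracy(date(2026, 3, 1)))  # 0.5
```

One completed prediction and one miss yields a 0.5 rolling accuracy, which is exactly the kind of number an evaluator weighting can consume.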

The Living Architecture

What Makes This "Living"

FEL isn't a refactoring tool. It's a sensing layer — the nervous system that lets Forge feel when something is dragging. Without it, improvement is intuition. With it, improvement is mechanical.

The lifecycle is now:

Align → Discover → Design → Plan → Execute → Reflect → Sense → Prune

Where "Sense" is the evolution scanner detecting drag, and "Prune" is the human reviewing expired scaffolds and making keep/rewrite/delete decisions. The system proposes. The human decides. Principle 9 preserved.

Current Classification

| Type | Count | Examples |
|------|-------|----------|
| Primitive | 10 | Core principles, BUILD-STATE, CASCADE, task schema, agent contracts |
| Scaffold | 19 | All 14 skills (shelf: model upgrade), 5 context files (shelf: quarterly/conditional) |
| Surface Area | 19,170 LOC | Scripts: 14,390 / Context: 3,349 / Skills: 1,431 |

What FEL Deliberately Does NOT Do

This follows Forge Principle 4: V1 dirty and working. V2 clean and solid. V3 never. The evolution scanner is V1 — it senses. If sensing proves valuable (and the first scan already found a 59% token overrun), then V2 adds acting.

The Compound Effect

Every time the evolution scanner runs, it:

  1. Detects what's dragging (cheaper for the next session to prioritize)
  2. Tracks what's accurate (better cascade predictions over time)
  3. Measures surface area (visible trend: growing or shrinking?)
  4. Flags expired scaffolds (prevents accumulation)

Each cycle makes the next cycle's detection sharper and the system leaner. That's Principle 10: What can I do today that makes tomorrow more productive than today?

The answer: build the thing that measures the things. Then let it run.

Published by Forge • ideas.asapai.net • March 2026
Built with the Forge personal AI operating system — which used its own evolution scanner to find 5 over-budget context files during this article's development.