Mechanical self-improvement for AI systems — detecting drag, classifying components, tracking prediction accuracy, and deliberately expiring scaffolding.
Most AI systems only grow. They add features, add context, add services — and never ask whether what they added last month is still earning its keep. Surface area increases linearly while capability plateaus.
Forge needed the opposite: a system that measures its own drag and flags what should be pruned, rewritten, or expired. Not "continuous refactoring" — measured, leverage-gated evolution.
Instead of "how do we rewrite ourselves?" — ask: "How do we make rewriting cheap, measurable, and leverage-positive?" That's a Forge question. And the answer is: build the sensing layer first.
Every component in Forge is now classified as one of two types: primitives (stable, long-lived building blocks worth investing in) and scaffolds (temporary structures with a declared shelf-life).
The key question when building anything: "Am I building a primitive or a scaffold?" If it's a scaffold, declare the shelf-life upfront. If it's a primitive, invest in stability.
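As a sketch of what a declared shelf-life could look like in YAML frontmatter — the post doesn't show the actual fields, so `type:` and `expires:` here are assumptions — along with the expiry check the scanner would run:

```python
from datetime import date

def parse_frontmatter(text: str) -> dict:
    """Minimal parser for the assumed `type:` / `expires:` frontmatter fields."""
    if not text.startswith("---"):
        return {}
    _, block, _ = text.split("---", 2)
    fields = {}
    for line in block.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields

def is_expired(fields: dict, today: date) -> bool:
    """A scaffold past its declared shelf-life is a candidate for review."""
    if fields.get("type") != "scaffold":
        return False  # primitives carry no expiry
    try:
        return date.fromisoformat(fields.get("expires", "")) < today
    except ValueError:
        # Conditional shelf-lives ("model upgrade", "quarterly") need human review
        return False

doc = """---
type: scaffold
expires: 2025-06-01
---
# Some temporary scaffold
"""
fields = parse_frontmatter(doc)
print(is_expired(fields, date(2025, 9, 1)))  # True: past shelf-life
```

Conditional shelf-lives like "model upgrade" can't be resolved by date arithmetic, which is one reason the prune step stays human.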
Forge now runs an automated evolution scan (wired into the 2x-daily reconciliation timer) that reads existing logs to detect drag:
| Signal | Source | What It Detects |
|---|---|---|
| File Churn | changelog.jsonl | Files changed >10 times in 30 days (excluding logs) — instability indicator |
| Ralph Success Rate | ralph_events table | 7-day and 30-day task success trends — is the agent getting better or worse? |
| Token Bloat | context-loader.py | Files exceeding their token budget — agents get truncated context |
| Surface Area | git LOC count | Total lines of code trending up or down — is the system getting leaner? |
| Scaffold Expiry | YAML frontmatter | Scaffolds past their declared shelf-life — candidates for review |
| Cascade Accuracy | cascade_predictions table | Were past "next domino" predictions correct? Feedback loop for the evaluator. |
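A minimal sketch of the first signal, file churn — assuming a hypothetical changelog.jsonl schema with `file` and ISO-8601 `timestamp` fields, since the real schema isn't shown:

```python
import json
from collections import Counter
from datetime import datetime, timedelta

def churn_hotspots(changelog_lines, now, window_days=30, threshold=10):
    """Flag files changed more than `threshold` times inside the window.

    Each line is assumed to be a JSON object with `file` and `timestamp`
    fields (hypothetical schema)."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter()
    for line in changelog_lines:
        event = json.loads(line)
        if event["file"].endswith((".log", ".jsonl")):
            continue  # the scan excludes log files themselves
        if datetime.fromisoformat(event["timestamp"]) >= cutoff:
            counts[event["file"]] += 1
    return {path: n for path, n in counts.items() if n > threshold}

now = datetime(2025, 9, 30)
sample = [json.dumps({"file": "context-loader.py",
                      "timestamp": "2025-09-15T12:00:00"})] * 12
print(churn_hotspots(sample, now))  # {'context-loader.py': 12}
```

The other signals follow the same pattern: read a log or table that already exists, aggregate over a window, compare against a threshold.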
The very first run of the evolution scanner immediately found real problems:
Always-tier context files: 9,693 tokens loaded vs. 6,100 budget (59% over). Five files are being truncated on every single agent invocation — meaning agents consistently get incomplete context.
| File | Tokens Loaded | Budget | % of Budget |
|---|---|---|---|
| JASON-DEPS.md | 2,139 | 300 | 713% |
| SESSION-QUEUE.md | 2,676 | 400 | 669% |
| PRINCIPLE-ANCHOR.md (scaffold) | 1,652 | 800 | 206% |
| core-principles.md | 475 | 300 | 158% |
| CASCADE.md | 636 | 600 | 106% |
This was invisible before the scanner existed. Every Ralph invocation, every cascade evaluation, every agent session was working with truncated context — and nobody knew. The scanner found it in 0.3 seconds.
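The over-budget check itself is simple arithmetic. A sketch using the numbers from the table above — the token and budget values are from the scan, but the function is illustrative, not the actual context-loader.py code:

```python
# Always-tier budgets, per the scan results above
BUDGETS = {
    "JASON-DEPS.md": 300, "SESSION-QUEUE.md": 400,
    "PRINCIPLE-ANCHOR.md": 800, "core-principles.md": 300, "CASCADE.md": 600,
}

def token_bloat(token_counts: dict, budgets: dict) -> list:
    """Return (file, % of budget) for files over budget, worst first."""
    over = [(f, round(100 * n / budgets[f]))
            for f, n in token_counts.items()
            if n > budgets.get(f, float("inf"))]
    return sorted(over, key=lambda pair: -pair[1])

counts = {"JASON-DEPS.md": 2139, "SESSION-QUEUE.md": 2676,
          "PRINCIPLE-ANCHOR.md": 1652, "core-principles.md": 475,
          "CASCADE.md": 636}
print(token_bloat(counts, BUDGETS))
# [('JASON-DEPS.md', 713), ('SESSION-QUEUE.md', 669),
#  ('PRINCIPLE-ANCHOR.md', 206), ('core-principles.md', 158),
#  ('CASCADE.md', 106)]
```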
Forge's cascade evaluator predicts "the next domino" — the highest-leverage action to take. But predictions without validation are just stories.
Now, every cascade evaluation stores its prediction in a database table. Seven days later, the evolution scanner checks: was the predicted domino actually completed? Did the predicted unlocks happen?
Over time, this creates a feedback loop: predictions that consistently miss get weighted down. The cascade evaluator gets smarter by existing.
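A minimal sketch of that loop, assuming a hypothetical minimal schema for the `cascade_predictions` table (the real table's columns aren't shown in the post):

```python
import sqlite3
from datetime import datetime, timedelta

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE cascade_predictions (
    predicted_task TEXT, predicted_at TEXT,
    validated INTEGER DEFAULT 0, hit INTEGER)""")

def record_prediction(task: str, when: datetime):
    """Called at evaluation time: store the predicted next domino."""
    db.execute("INSERT INTO cascade_predictions (predicted_task, predicted_at)"
               " VALUES (?, ?)", (task, when.isoformat()))

def validate(completed_tasks: set, now: datetime, horizon_days: int = 7):
    """Called by the scanner: mark each due prediction as a hit or miss."""
    cutoff = (now - timedelta(days=horizon_days)).isoformat()
    rows = db.execute("""SELECT rowid, predicted_task FROM cascade_predictions
                         WHERE validated = 0 AND predicted_at <= ?""",
                      (cutoff,)).fetchall()
    for rowid, task in rows:
        db.execute("UPDATE cascade_predictions SET validated = 1, hit = ?"
                   " WHERE rowid = ?", (int(task in completed_tasks), rowid))

def accuracy() -> float:
    """Hit rate over validated predictions — the evaluator's feedback signal."""
    hits, total = db.execute(
        "SELECT COALESCE(SUM(hit), 0), COUNT(*) FROM cascade_predictions"
        " WHERE validated = 1").fetchone()
    return hits / total if total else 0.0

record_prediction("wire scanner into reconciliation timer", datetime(2025, 9, 1))
validate({"wire scanner into reconciliation timer"}, now=datetime(2025, 9, 10))
print(accuracy())  # 1.0
```

Down-weighting consistently missing predictions is then a matter of feeding `accuracy()` (or a per-category version of it) back into the evaluator's scoring.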
FEL isn't a refactoring tool. It's a sensing layer — the nervous system that lets Forge feel when something is dragging. Without it, improvement is intuition. With it, improvement is mechanical.
The lifecycle is now:
Align → Discover → Design → Plan → Execute → Reflect → Sense → Prune
Where "Sense" is the evolution scanner detecting drag, and "Prune" is the human reviewing expired scaffolds and making keep/rewrite/delete decisions. The system proposes. The human decides. Principle 9 preserved.
| Type | Count | Examples |
|---|---|---|
| Primitive | 10 | Core principles, BUILD-STATE, CASCADE, task schema, agent contracts |
| Scaffold | 19 | All 14 skills (shelf: model upgrade), 5 context files (shelf: quarterly/conditional) |
| Surface Area | 19,170 LOC | Scripts: 14,390 / Context: 3,349 / Skills: 1,431 |
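The surface-area row could be produced by a line count along these lines — a sketch in which the scripts/context/skills directory layout is an assumption inferred from the table, not a documented repository structure:

```python
from pathlib import Path

def loc_by_area(root, areas=("scripts", "context", "skills")):
    """Count non-blank lines per top-level area directory (assumed layout)."""
    totals = {}
    for area in areas:
        base = Path(root, area)
        totals[area] = sum(
            sum(1 for line in p.read_text(errors="ignore").splitlines()
                if line.strip())
            for p in base.rglob("*") if p.is_file()
        ) if base.exists() else 0
    return totals
```

Tracking this total scan-over-scan is what turns "is the system getting leaner?" from a feeling into a trend line.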
This follows Forge Principle 4: V1 dirty and working. V2 clean and solid. V3 never. The evolution scanner is V1 — it senses. If sensing proves valuable (and the first scan already found a 59% token overrun), then V2 adds acting.
Every time the evolution scanner runs, it re-reads the logs, re-checks the six drag signals, and validates the cascade predictions that have come due. Each cycle makes the next cycle's detection sharper and the system leaner. That's Principle 10: What can I do today that makes tomorrow more productive than today?
The answer: build the thing that measures the things. Then let it run.