Mechanical self-improvement for AI systems — detecting drag, classifying components, tracking prediction accuracy, and deliberately expiring scaffolding.
Most AI systems only grow. They add features, add context, add services — and never ask whether what they added last month is still earning its keep. Surface area increases linearly while capability plateaus.
Forge needed the opposite: a system that measures its own drag and flags what should be pruned, rewritten, or expired. Not "continuous refactoring" — measured, leverage-gated evolution.
Instead of "how do we rewrite ourselves?" — ask: "How do we make rewriting cheap, measurable, and leverage-positive?" That's a Forge question. And the answer is: build the sensing layer first.
Every component in Forge is now classified as one of two types: primitives (stable, long-lived building blocks worth investing in) and scaffolds (temporary structures with a declared shelf-life).
The key question when building anything: "Am I building a primitive or a scaffold?" If it's a scaffold, declare the shelf-life upfront. If it's a primitive, invest in stability.
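As a sketch of what a declared shelf-life could look like in YAML frontmatter — the post doesn't show the actual fields, so `type:` and `expires:` here are assumptions — along with the expiry check the scanner would run:

```python
from datetime import date

def parse_frontmatter(text: str) -> dict:
    """Minimal parser for the assumed `type:` / `expires:` frontmatter fields."""
    if not text.startswith("---"):
        return {}
    _, block, _ = text.split("---", 2)
    fields = {}
    for line in block.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields

def is_expired(fields: dict, today: date) -> bool:
    """A scaffold past its declared shelf-life is a candidate for review."""
    if fields.get("type") != "scaffold":
        return False  # primitives carry no expiry
    try:
        return date.fromisoformat(fields.get("expires", "")) < today
    except ValueError:
        # Conditional shelf-lives ("model upgrade", "quarterly") need human review
        return False

doc = """---
type: scaffold
expires: 2025-06-01
---
# Some temporary scaffold
"""
fields = parse_frontmatter(doc)
print(is_expired(fields, date(2025, 9, 1)))  # True: past shelf-life
```

Conditional shelf-lives like "model upgrade" can't be resolved by date arithmetic, which is one reason the prune step stays human.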
Forge now runs an automated evolution scan (wired into the 2x-daily reconciliation timer) that reads existing logs to detect drag:
| Signal | Source | What It Detects |
|---|---|---|
| File Churn | changelog.jsonl | Files changed >10 times in 30 days (excluding logs) — instability indicator |
| Ralph Success Rate | ralph_events table | 7-day and 30-day task success trends — is the agent getting better or worse? |
| Token Bloat | context-loader.py | Files exceeding their token budget — agents get truncated context |
| Surface Area | git LOC count | Total lines of code trending up or down — is the system getting leaner? |
| Scaffold Expiry | YAML frontmatter | Scaffolds past their declared shelf-life — candidates for review |
| Cascade Accuracy | cascade_predictions table | Were past "next domino" predictions correct? Feedback loop for the evaluator. |
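A minimal sketch of the first signal, file churn — assuming a hypothetical changelog.jsonl schema with `file` and ISO-8601 `timestamp` fields, since the real schema isn't shown:

```python
import json
from collections import Counter
from datetime import datetime, timedelta

def churn_hotspots(changelog_lines, now, window_days=30, threshold=10):
    """Flag files changed more than `threshold` times inside the window.

    Each line is assumed to be a JSON object with `file` and `timestamp`
    fields (hypothetical schema)."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter()
    for line in changelog_lines:
        event = json.loads(line)
        if event["file"].endswith((".log", ".jsonl")):
            continue  # the scan excludes log files themselves
        if datetime.fromisoformat(event["timestamp"]) >= cutoff:
            counts[event["file"]] += 1
    return {path: n for path, n in counts.items() if n > threshold}

now = datetime(2025, 9, 30)
sample = [json.dumps({"file": "context-loader.py",
                      "timestamp": "2025-09-15T12:00:00"})] * 12
print(churn_hotspots(sample, now))  # {'context-loader.py': 12}
```

The other signals follow the same pattern: read a log or table that already exists, aggregate over a window, compare against a threshold.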
The very first run of the evolution scanner immediately found real problems:
Always-tier context files: 9,693 tokens loaded vs. 6,100 budget (59% over). Five files are being truncated on every single agent invocation — meaning agents consistently get incomplete context.
| File | Tokens Loaded | Budget | % of Budget |
|---|---|---|---|
| JASON-DEPS.md | 2,139 | 300 | 713% |
| SESSION-QUEUE.md | 2,676 | 400 | 669% |
| PRINCIPLE-ANCHOR.md (scaffold) | 1,652 | 800 | 206% |
| core-principles.md | 475 | 300 | 158% |
| CASCADE.md | 636 | 600 | 106% |
This was invisible before the scanner existed. Every Ralph invocation, every cascade evaluation, every agent session was working with truncated context — and nobody knew. The scanner found it in 0.3 seconds.
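The over-budget check itself is simple arithmetic. A sketch using the numbers from the table above — the token and budget values are from the scan, but the function is illustrative, not the actual context-loader.py code:

```python
# Always-tier budgets, per the scan results above
BUDGETS = {
    "JASON-DEPS.md": 300, "SESSION-QUEUE.md": 400,
    "PRINCIPLE-ANCHOR.md": 800, "core-principles.md": 300, "CASCADE.md": 600,
}

def token_bloat(token_counts: dict, budgets: dict) -> list:
    """Return (file, % of budget) for files over budget, worst first."""
    over = [(f, round(100 * n / budgets[f]))
            for f, n in token_counts.items()
            if n > budgets.get(f, float("inf"))]
    return sorted(over, key=lambda pair: -pair[1])

counts = {"JASON-DEPS.md": 2139, "SESSION-QUEUE.md": 2676,
          "PRINCIPLE-ANCHOR.md": 1652, "core-principles.md": 475,
          "CASCADE.md": 636}
print(token_bloat(counts, BUDGETS))
# [('JASON-DEPS.md', 713), ('SESSION-QUEUE.md', 669),
#  ('PRINCIPLE-ANCHOR.md', 206), ('core-principles.md', 158),
#  ('CASCADE.md', 106)]
```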
Forge's cascade evaluator predicts "the next domino" — the highest-leverage action to take. But predictions without validation are just stories.
Now, every cascade evaluation stores its prediction in a database table. Seven days later, the evolution scanner checks: was the predicted domino actually completed? Did the predicted unlocks happen?
Over time, this creates a feedback loop: predictions that consistently miss get weighted down. The cascade evaluator gets smarter by existing.
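A minimal sketch of that loop, assuming a hypothetical minimal schema for the `cascade_predictions` table (the real table's columns aren't shown in the post):

```python
import sqlite3
from datetime import datetime, timedelta

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE cascade_predictions (
    predicted_task TEXT, predicted_at TEXT,
    validated INTEGER DEFAULT 0, hit INTEGER)""")

def record_prediction(task: str, when: datetime):
    """Called at evaluation time: store the predicted next domino."""
    db.execute("INSERT INTO cascade_predictions (predicted_task, predicted_at)"
               " VALUES (?, ?)", (task, when.isoformat()))

def validate(completed_tasks: set, now: datetime, horizon_days: int = 7):
    """Called by the scanner: mark each due prediction as a hit or miss."""
    cutoff = (now - timedelta(days=horizon_days)).isoformat()
    rows = db.execute("""SELECT rowid, predicted_task FROM cascade_predictions
                         WHERE validated = 0 AND predicted_at <= ?""",
                      (cutoff,)).fetchall()
    for rowid, task in rows:
        db.execute("UPDATE cascade_predictions SET validated = 1, hit = ?"
                   " WHERE rowid = ?", (int(task in completed_tasks), rowid))

def accuracy() -> float:
    """Hit rate over validated predictions — the evaluator's feedback signal."""
    hits, total = db.execute(
        "SELECT COALESCE(SUM(hit), 0), COUNT(*) FROM cascade_predictions"
        " WHERE validated = 1").fetchone()
    return hits / total if total else 0.0

record_prediction("wire scanner into reconciliation timer", datetime(2025, 9, 1))
validate({"wire scanner into reconciliation timer"}, now=datetime(2025, 9, 10))
print(accuracy())  # 1.0
```

Down-weighting consistently missing predictions is then a matter of feeding `accuracy()` (or a per-category version of it) back into the evaluator's scoring.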
FEL isn't a refactoring tool. It's a sensing layer — the nervous system that lets Forge feel when something is dragging. Without it, improvement is intuition. With it, improvement is mechanical.
The lifecycle is now:
Align → Discover → Design → Plan → Execute → Reflect → Sense → Prune
Where "Sense" is the evolution scanner detecting drag, and "Prune" is the human reviewing expired scaffolds and making keep/rewrite/delete decisions. The system proposes. The human decides. Principle 9 preserved.
| Type | Count | Examples |
|---|---|---|
| Primitive | 10 | Core principles, BUILD-STATE, CASCADE, task schema, agent contracts |
| Scaffold | 19 | All 14 skills (shelf: model upgrade), 5 context files (shelf: quarterly/conditional) |
| Surface Area | 19,170 LOC | Scripts: 14,390 / Context: 3,349 / Skills: 1,431 |
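The surface-area row could be produced by a line count along these lines — a sketch in which the scripts/context/skills directory layout is an assumption inferred from the table, not a documented repository structure:

```python
from pathlib import Path

def loc_by_area(root, areas=("scripts", "context", "skills")):
    """Count non-blank lines per top-level area directory (assumed layout)."""
    totals = {}
    for area in areas:
        base = Path(root, area)
        totals[area] = sum(
            sum(1 for line in p.read_text(errors="ignore").splitlines()
                if line.strip())
            for p in base.rglob("*") if p.is_file()
        ) if base.exists() else 0
    return totals
```

Tracking this total scan-over-scan is what turns "is the system getting leaner?" from a feeling into a trend line.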
This follows Forge Principle 4: V1 dirty and working. V2 clean and solid. V3 never. The evolution scanner is V1 — it senses. If sensing proves valuable (and the first scan already found a 59% token overrun), then V2 adds acting.
Every time the evolution scanner runs, it re-reads the logs, re-checks the six drag signals, and validates the cascade predictions that have come due. Each cycle makes the next cycle's detection sharper and the system leaner. That's Principle 10: What can I do today that makes tomorrow more productive than today?
The answer: build the thing that measures the things. Then let it run.