Content Arbitrage #3 — Agentic Harness Engineering: Observability-Driven Auto-Evolution
Content Arbitrage Thread #3 (Thu 2026-05-28)
Paper: Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses (arXiv:2604.25850)
Fudan/Peking researchers just auto-evolved a coding agent harness past human-designed baselines.
+7.3pp pass@1 on Terminal-Bench 2 (69.7% → 77.0%).
This turns harness engineering from a manual craft into an autonomous loop.
Here's how: 🧵
The Problem
Coding-agent harnesses (prompts/tools/middleware) are tuned by hand.
That's expensive and doesn't scale with base models.
Previous self-evolvers optimize prompts only and miss tools and memory.
Previous Approaches
Baselines: the human-designed Codex-CLI harness (71.9%) and the self-evolving methods ACE and TF-GRPO.
They fall short because of sparse signals in million-token trajectories, no clear edit attribution, and coupled components.
AHE's Approach
Three observability pillars.
Key insight: decouple the harness into editable files (prompt/tools/middleware/memory), distill trajectories into layered evidence, and pair each edit with a prediction that is verified in the next round.
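To make "layered evidence" concrete, here is a minimal sketch of distilling raw trajectories into a run-level summary, a per-task summary, and a drill-down list of failing steps. The `Step` type, the `distill` function, and the three-layer shape are all hypothetical, chosen for illustration; the paper's actual evidence format may look quite different.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One tool call in a trajectory (hypothetical shape)."""
    tool: str
    ok: bool
    detail: str


def distill(trajectories: Dict[str, List[Step]]) -> dict:
    """Build three layers: run summary -> per-task summary -> failing steps."""
    failures_by_tool: Counter = Counter()
    task_layer = {}
    for task_id, steps in trajectories.items():
        failures = [s for s in steps if not s.ok]
        for s in failures:
            failures_by_tool[s.tool] += 1
        task_layer[task_id] = {
            "steps": len(steps),
            "failed_steps": len(failures),
            # Drill-down layer: only the failing steps, not the whole log.
            "drill_down": [f"{s.tool}: {s.detail}" for s in failures],
        }
    return {
        "run": {
            "tasks": len(trajectories),
            "failures_by_tool": dict(failures_by_tool),
        },
        "tasks": task_layer,
    }
```

The point of the layering is that an evolver can read the small run-level summary first and open the drill-down only for tasks that look suspicious, instead of scanning every token.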
[Diagram in paper: AHE loop]
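A rough sketch of one round of that loop: an edit targets a single harness file, declares which tasks it expects to fix, and is kept only if the next evaluation agrees. Everything here (`Harness`, `Edit`, `verify`, `evolve_step`, `toy_evaluate`) is a hypothetical name, and the acceptance rule (predicted fixes must show up, nothing may regress) is an assumption made for illustration, not the paper's stated criterion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple


@dataclass
class Edit:
    component: str             # "prompt", "tools", "middleware" or "memory"
    new_content: str
    predicted_fixed: Set[str]  # task ids this edit claims it will fix


@dataclass
class Harness:
    files: Dict[str, str]      # one editable file per component

    def apply(self, edit: Edit) -> "Harness":
        # Return a new harness so the old one stays available for revert.
        updated = dict(self.files)
        updated[edit.component] = edit.new_content
        return Harness(updated)


Results = Dict[str, bool]      # task id -> passed?


def verify(edit: Edit, before: Results, after: Results) -> bool:
    """Assumed rule: every predicted fix happened and nothing regressed."""
    newly_passing = {t for t, ok in after.items() if ok and not before.get(t, False)}
    regressed = {t for t, ok in before.items() if ok and not after.get(t, False)}
    return edit.predicted_fixed <= newly_passing and not regressed


def evolve_step(
    harness: Harness, edit: Edit, evaluate: Callable[[Harness], Results]
) -> Tuple[Harness, bool]:
    before = evaluate(harness)
    candidate = harness.apply(edit)
    after = evaluate(candidate)
    if verify(edit, before, after):
        return candidate, True
    return harness, False      # prediction not confirmed: revert


def toy_evaluate(h: Harness) -> Results:
    # Stand-in for a real benchmark run.
    return {
        "t1": "retry" in h.files["middleware"],
        "t2": "solve" in h.files["prompt"],
    }
```

Because each edit touches one file and carries its own prediction, a failed round points at a specific edit rather than at the harness as a whole.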
Results
• Terminal-Bench 2: 77.0% pass@1 (+7.3pp)
• Beats Codex-CLI by 5.1pp, and the self-evolvers too
• Transfers to SWE-bench: top success at 12% fewer tokens
• Cross-model: +5.1 to +10.1pp on 3 model families
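A quick arithmetic check on the headline numbers, since percentage points and relative percent are easy to mix up. The helper names are made up for this example; the inputs are the scores quoted above.

```python
def pp_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Gain relative to the starting score, in percent."""
    return round((after - before) / before * 100, 1)


print(pp_gain(69.7, 77.0))        # 7.3  (seed harness -> evolved harness)
print(relative_gain(69.7, 77.0))  # 10.5 (same gain, relative to 69.7)
print(pp_gain(71.9, 77.0))        # 5.1  (margin over Codex-CLI)
```

So the 7.3 figure is a percentage-point gain; as a relative improvement it is about 10.5%.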
Why This Matters for Builders
• Weaker models gain the most (the patterns are general)
• Ablation: tools, middleware, and memory carry the gains; the prompt alone regresses
• Code: https://github.com/china-qijizhifeng/agentic-harness-engineering
Limitations
• 10 iterations on Terminal-Bench 2
• Assumes fixed base model
• Compute-heavy (but transfers frozen)
Takeaway
Observability > capability. File-level components + distilled evidence + predicted deltas = stable evolution.
Building Agents?
Evolve your harness like this, not just prompts.
Follow for research → builder insights.
Paper: https://arxiv.org/abs/2604.25850
Get the next paper breakdown in your inbox → Subscribe at patrick.technology