Content Arbitrage Thread #3 (Thu 2026-05-28)
Paper: Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses (arXiv:2604.25850)
Fudan/Peking researchers just auto-evolved a coding agent harness past human-designed baselines.
+7.3pp pass@1 on Terminal-Bench 2 (69.7% → 77.0%).
This turns harness engineering from manual craft to autonomous loop.
Here's how: 🧵
The Problem
Coding agent harnesses (prompts/tools/middleware) are manually tuned.
Expensive, doesn't scale with base models.
Previous self-evolvers optimize prompts only, missing tools/memory.
Previous Approaches
Human-designed: Codex-CLI (71.9%). Self-evolving: ACE, TF-GRPO.
They fall short because of sparse signals in million-token trajectories, no clear edit attribution, and coupled components.
AHE's Approach
3 observability pillars.
Key insight: Decouple harness into editable files (prompt/tools/middleware/memory), distill trajectories to layered evidence, pair edits with predictions verified next round.
[Diagram in paper: AHE loop]
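To make "pair edits with predictions verified next round" concrete, here is a minimal sketch, not the paper's implementation. The names `Harness`, `Edit`, `evolve_round`, the `tolerance` parameter, and the keep-or-roll-back rule are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    component: str          # which harness file the edit touches
    description: str        # what the edit is supposed to fix
    predicted_delta: float  # predicted pass-rate change, in percentage points


@dataclass
class Harness:
    # One entry per editable component, so each edit touches exactly one.
    files: dict = field(default_factory=lambda: {
        "prompt": "", "tools": "", "middleware": "", "memory": ""})


def evolve_round(harness, edit, new_content, evaluate, baseline, tolerance=1.0):
    """Apply one edit, re-evaluate, and roll back if the prediction is missed.

    Assumption: an edit is kept when the observed delta is no more than
    `tolerance` points below its prediction. The paper may use another rule.
    """
    previous = harness.files[edit.component]
    harness.files[edit.component] = new_content
    score = evaluate(harness)
    observed = score - baseline
    if observed >= edit.predicted_delta - tolerance:
        return score, True   # prediction held: keep the edit
    harness.files[edit.component] = previous
    return baseline, False   # prediction missed: restore the old file


# Usage with a stand-in evaluator (a real one would run the benchmark).
harness = Harness()
tool_edit = Edit("tools", "add a file-search tool", predicted_delta=2.0)
score, kept = evolve_round(
    harness, tool_edit, "search_tool_v1",
    evaluate=lambda h: 72.0 if h.files["tools"] else 69.7,
    baseline=69.7)
```

Because each edit touches one file and carries its own prediction, a missed prediction points at a single component to revert.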
Results
• Terminal-Bench 2: 77.0% pass@1 (+7.3pp)
• Beats Codex-CLI by +5.1pp, self-evolvers too
• Transfers to SWE-bench: Top success at 12% fewer tokens
• Cross-model: +5.1 to +10.1pp on 3 families
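The headline deltas are percentage-point differences, not relative gains. A quick check using only the scores quoted in this thread (the helper name `gains` is made up for illustration):

```python
def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = after - before
    relative_pct = 100.0 * (after - before) / before
    return round(absolute_pp, 1), round(relative_pct, 1)


print(gains(69.7, 77.0))  # vs. the 69.7% starting point -> (7.3, 10.5)
print(gains(71.9, 77.0))  # vs. Codex-CLI                -> (5.1, 7.1)
```

So 69.7% → 77.0% is +7.3 points, which works out to roughly a 10.5% relative improvement.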
Why This Matters for Builders
• Weaker models gain most (general patterns)
• Ablation: Tools/middleware/memory carry the gains; prompt-only edits regress
• Code: https://github.com/china-qijizhifeng/agentic-harness-engineering
Limitations
• 10 iterations on Terminal-Bench 2
• Assumes fixed base model
• Compute-heavy (but the evolved harness transfers frozen)
Takeaway
Observability > capability. File-level components + distilled evidence + predicted deltas = stable evolution.
Building Agents?
Evolve your harness like this, not just prompts.
Follow for research → builder insights.