Content Arbitrage #3 — Agentic Harness Engineering: Observability-Driven Auto-Evolution

Content Arbitrage Thread #3 (Thu 2026-05-28)

Paper: Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses (arXiv:2604.25850)

Fudan/Peking researchers just auto-evolved a coding agent harness past human-designed baselines.

+7.3pp pass@1 on Terminal-Bench 2 (69.7% → 77.0%).

This turns harness engineering from a manual craft into an autonomous loop.

Here's how: 🧵

The Problem

Coding agent harnesses (prompts/tools/middleware) are manually tuned.

That is expensive and doesn't scale with base models.

Previous self-evolvers optimize prompts only and miss tools and memory.

Previous Approaches

The human-designed Codex-CLI harness (71.9%) and self-evolving baselines such as ACE and TF-GRPO.

They fall short because of sparse signals in million-token trajectories, no clear attribution of edits to outcomes, and tightly coupled components.

AHE's Approach

AHE rests on three observability pillars.

Key insight: decouple the harness into editable files (prompt/tools/middleware/memory), distill trajectories into layered evidence, and pair each edit with a prediction that is verified in the next round.
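To make the "edit plus prediction" idea concrete, here is a minimal Python sketch. It is not the paper's code: `HarnessEdit`, `verify_edit`, the tolerance, and all numbers are hypothetical, and it assumes one edit per round so attribution is trivial.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HarnessEdit:
    """One edit to a single harness file, with a declared prediction (hypothetical schema)."""
    component: str                 # e.g. "prompt", "tools", "middleware", "memory"
    description: str
    predicted_delta_pp: float      # predicted pass-rate change, in percentage points
    observed_delta_pp: Optional[float] = None


def verify_edit(edit: HarnessEdit, baseline_pass_rate: float,
                next_round_pass_rate: float, tolerance_pp: float = 1.0) -> str:
    """Compare the prediction with the next round's outcome.

    Returns "keep" if the observed change reaches the prediction
    (minus an assumed tolerance), otherwise "revert".
    """
    edit.observed_delta_pp = round(next_round_pass_rate - baseline_pass_rate, 2)
    if edit.observed_delta_pp >= edit.predicted_delta_pp - tolerance_pp:
        return "keep"
    return "revert"


# Illustrative values only, not results from the paper.
edit = HarnessEdit("tools", "add retry wrapper around shell tool", predicted_delta_pp=2.0)
print(verify_edit(edit, baseline_pass_rate=70.0, next_round_pass_rate=71.5))
```

Because each edit targets one file and carries its own prediction, a miss points at a specific change instead of at the whole harness.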

[Diagram in paper: AHE loop]

Results

• Terminal-Bench 2: 77.0% pass@1 (+7.3pp)
• Beats Codex-CLI by +5.1pp, and the self-evolvers too
• Transfers to SWE-bench: top success rate with 12% fewer tokens
• Cross-model: +5.1 to +10.1pp across 3 model families
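A quick check of the headline numbers: the gains are differences in pass rate, so they are percentage points, and the relative improvement is a different figure. The helper names below are made up for illustration; the inputs are the pass rates quoted in this thread.

```python
def pp_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points."""
    return round(after - before, 1)


def relative_gain_percent(before: float, after: float) -> float:
    """Relative gain, as a percent of the starting pass rate."""
    return round((after - before) / before * 100, 1)


baseline, evolved, codex_cli = 69.7, 77.0, 71.9

print(pp_gain(baseline, evolved))                # 7.3 (percentage points)
print(relative_gain_percent(baseline, evolved))  # 10.5 (percent, relative)
print(pp_gain(codex_cli, evolved))               # 5.1 (percentage points vs Codex-CLI)
```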

Why This Matters for Builders

• Weaker models gain the most (the evolved patterns are general)
• Ablation: tools, middleware, and memory carry the gains; the prompt alone regresses
• Code: https://github.com/china-qijizhifeng/agentic-harness-engineering

Limitations

• Evolved for 10 iterations on Terminal-Bench 2
• Assumes a fixed base model
• Compute-heavy (though the evolved harness transfers frozen)

Takeaway

Observability > capability. File-level components + distilled evidence + predicted deltas = stable evolution.
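What "distilled evidence" could look like in practice: a rough Python sketch that collapses raw trajectories into three layers (run, task, step). The layer names, the trajectory schema, and `distill` are my assumptions for illustration, not the paper's format.

```python
from collections import Counter


def distill(trajectories: list[dict]) -> dict:
    """Collapse raw trajectories into layered evidence (hypothetical layout).

    Each trajectory is assumed to look like:
    {"task": str, "passed": bool, "steps": [{"tool": str, "ok": bool}, ...]}
    """
    failed = [t for t in trajectories if not t["passed"]]
    # Step layer: which tools fail most often inside failed tasks.
    tool_failures = Counter(
        step["tool"] for t in failed for step in t["steps"] if not step["ok"]
    )
    return {
        "run": {"total": len(trajectories), "passed": len(trajectories) - len(failed)},
        "tasks": [t["task"] for t in failed],
        "steps": tool_failures.most_common(3),
    }


# Toy data, not from the paper.
runs = [
    {"task": "build-kernel", "passed": False,
     "steps": [{"tool": "shell", "ok": False}, {"tool": "shell", "ok": False}]},
    {"task": "fix-tests", "passed": True,
     "steps": [{"tool": "edit", "ok": True}]},
]
print(distill(runs))
```

The point of the layering is that the evolver reads a short summary first and only drills into steps for the tasks that failed, rather than scanning million-token logs.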

Building Agents?

Evolve your harness like this, not just prompts.

Follow for research → builder insights.

Paper: https://arxiv.org/abs/2604.25850

