Content Arbitrage Thread #3 (Thu 2026-05-28)

Paper: Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses (arXiv:2604.25850)

Fudan/Peking researchers just auto-evolved a coding agent harness past human-designed baselines.

+7.3pp pass@1 on Terminal-Bench 2 (69.7% → 77.0%).

This turns harness engineering from manual craft to autonomous loop.

Here's how: 🧵

The Problem

Coding agent harnesses (prompts/tools/middleware) are manually tuned.

That's expensive and doesn't scale across base models.

Previous self-evolvers optimize prompts only, missing tools/memory.

Previous Approaches

Human-designed: Codex-CLI (71.9%). Self-evolving: ACE, TF-GRPO.

They fall short because of sparse signals in million-token trajectories, no clear edit attribution, and coupled components.

AHE's Approach

3 observability pillars.

Key insight: Decouple harness into editable files (prompt/tools/middleware/memory), distill trajectories to layered evidence, pair edits with predictions verified next round.
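To make "distill trajectories to layered evidence" concrete, here is a minimal sketch. It is not the paper's implementation: the `Step` record, the three layer names, and the rule that a task fails if any step fails are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Step:
    """One step of a hypothetical agent trajectory (illustrative schema)."""
    task_id: str
    tool: str
    ok: bool
    detail: str


def distill(steps: list[Step]) -> dict:
    """Collapse raw steps into three layers of evidence.

    Layers (assumed, not from the paper):
      summary        -> run-level counts
      per_task       -> pass/fail per task
      first_failures -> the first failing step of each failed task
    """
    tasks: dict[str, list[Step]] = {}
    for step in steps:
        tasks.setdefault(step.task_id, []).append(step)

    per_task: dict[str, str] = {}
    first_failures: dict[str, str] = {}
    for task_id, task_steps in tasks.items():
        failures = [s for s in task_steps if not s.ok]
        # Simplifying assumption: any failed step marks the task as failed.
        per_task[task_id] = "fail" if failures else "pass"
        if failures:
            first_failures[task_id] = f"{failures[0].tool}: {failures[0].detail}"

    failing_tools = Counter(s.tool for s in steps if not s.ok)
    return {
        "summary": {
            "tasks": len(tasks),
            "failed": len(first_failures),
            "failing_tools": dict(failing_tools),
        },
        "per_task": per_task,
        "first_failures": first_failures,
    }
```

The point of the layering: an evolver can read the small summary first and only drill into per-task detail where something failed, instead of scanning the full trajectory.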

[Diagram in paper: AHE loop]
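A rough sketch of the "pair edits with predictions, verify next round" idea. Everything here is hypothetical: the `Edit` record, the `evaluate` callback, and the keep/revert rule (keep only if the score improves and lands within a tolerance of the prediction) are assumptions, not the paper's actual acceptance criteria.

```python
from dataclasses import dataclass
from typing import Callable

# A harness modeled as named, independently editable components (assumed layout).
Harness = dict[str, str]


@dataclass
class Edit:
    component: str          # e.g. "prompt", "tools", "middleware", "memory"
    new_text: str
    predicted_delta: float  # expected score change, in percentage points


def apply_edit(harness: Harness, edit: Edit) -> Harness:
    """Return a copy of the harness with exactly one component replaced."""
    updated = dict(harness)
    updated[edit.component] = edit.new_text
    return updated


def evolve_round(
    harness: Harness,
    edit: Edit,
    evaluate: Callable[[Harness], float],
    tolerance: float = 1.0,
) -> tuple[Harness, float]:
    """Apply one edit, measure the observed delta, keep or revert.

    Hypothetical rule: keep the edit only if the score went up and the
    observed delta is within `tolerance` points of the prediction.
    """
    before = evaluate(harness)
    candidate = apply_edit(harness, edit)
    observed = evaluate(candidate) - before
    confirmed = observed > 0 and abs(observed - edit.predicted_delta) <= tolerance
    return (candidate if confirmed else harness), observed
```

Because each edit touches a single component and carries its own prediction, a missed prediction points at one file rather than at the harness as a whole.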

Results

• Terminal-Bench 2: 77.0% pass@1 (+7.3pp)
• Beats Codex-CLI by +5.1pp, self-evolvers too
• Transfers to SWE-bench: top success at 12% fewer tokens
• Cross-model: +5.1 to +10.1pp on 3 families
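A quick arithmetic check on the headline numbers, since percentage points and relative percent are easy to mix up. The helper names are made up for illustration; the inputs are the scores quoted in this thread.

```python
def pp_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Gain relative to the starting score, in percent."""
    return round(100 * (after - before) / before, 1)


print(pp_gain(69.7, 77.0))        # 7.3  (seed harness -> evolved harness)
print(pp_gain(71.9, 77.0))        # 5.1  (Codex-CLI -> evolved harness)
print(relative_gain(69.7, 77.0))  # 10.5 (the same jump as a relative improvement)
```

So the +7.3 is a percentage-point gain; relative to the 69.7% starting point it works out to roughly a 10.5% improvement.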

Why This Matters for Builders

• Weaker models gain most (general patterns)
• Ablation: tools, middleware, and memory carry the gains; the prompt alone regresses
• Code: https://github.com/china-qijizhifeng/agentic-harness-engineering

Limitations

• 10 iterations on Terminal-Bench 2
• Assumes fixed base model
• Compute-heavy (though the evolved harness transfers frozen)

Takeaway

Observability > capability. File-level components + distilled evidence + predicted deltas = stable evolution.

Building Agents?

Evolve your harness like this, not just prompts.

Follow for research → builder insights.

Paper: https://arxiv.org/abs/2604.25850
