TDAD: Test-Driven Agentic Development – Reducing Regressions in AI Coding Agents
AI coding agents promise to automate software fixes, but they frequently introduce regressions—breaking tests that previously passed. Benchmarks like SWE-bench emphasize resolution rates, overlooking this critical failure mode. Without targeted mitigation, agents undermine codebase stability, forcing developers to spend more time verifying changes than benefiting from automation.
The paper introduces TDAD (Test-Driven Agentic Development), an open-source tool and benchmark built on code-test graphs derived from abstract syntax trees (ASTs), with weighted impact analysis. For a proposed code change, TDAD constructs a graph linking the modified code to the tests it may affect and prioritizes the tests with the highest impact scores. The agent then verifies those tests selectively through a GraphRAG workflow.
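To make the graph idea concrete, here is a minimal Python sketch of an AST-derived code-test graph. It is not TDAD's implementation: the function names (`build_code_test_graph`, `rank_tests`), the call-count edge weight, and the sample tests are all assumptions chosen for illustration, and the real tool's weighting is likely richer.

```python
import ast
from collections import defaultdict


def called_names(func_node):
    """Count how often each function name is called inside one function body."""
    counts = defaultdict(int)
    for node in ast.walk(func_node):
        if isinstance(node, ast.Call):
            target = node.func
            if isinstance(target, ast.Name):         # plain call: parse(...)
                counts[target.id] += 1
            elif isinstance(target, ast.Attribute):  # method call: obj.parse(...)
                counts[target.attr] += 1
    return counts


def build_code_test_graph(test_source):
    """Map each called symbol to {test_name: weight} from a test module's AST.

    Assumption: the edge weight is simply the number of call sites in the test.
    """
    graph = defaultdict(dict)
    tree = ast.parse(test_source)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            for symbol, count in called_names(node).items():
                graph[symbol][node.name] = count
    return graph


def rank_tests(graph, changed_symbols):
    """Sum edge weights over the changed symbols; highest impact first."""
    scores = defaultdict(int)
    for symbol in changed_symbols:
        for test, weight in graph.get(symbol, {}).items():
            scores[test] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical test module, used only to exercise the sketch.
TEST_SOURCE = '''
def test_parse_basic():
    assert parse("1") == 1

def test_parse_and_render():
    value = parse("2")
    parse("3")
    assert render(value) == "2"

def test_render_only():
    assert render(5) == "5"
'''

graph = build_code_test_graph(TEST_SOURCE)
ranking = rank_tests(graph, {"parse"})
print(ranking)  # [('test_parse_and_render', 2), ('test_parse_basic', 1)]
```

A change to `parse` surfaces the two tests that call it and skips `test_render_only` entirely, which is the behavior the impact ranking is meant to provide.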
Evaluated on SWE-bench Verified:
- Test-level regression rate, baseline: 6.08%.
- Test-level regression rate, with TDAD: 1.82% (a 70% relative reduction).
- Resolution rate: 24% → 32%.
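The relative figures follow directly from the reported percentages. A quick check in Python, where `relative_change` is a helper introduced here for illustration:

```python
def relative_change(before, after):
    """Change from `before` to `after`, expressed as a fraction of `before`."""
    return (after - before) / before


regression_change = relative_change(6.08, 1.82)   # test-level regression rate
resolution_change = relative_change(24.0, 32.0)   # resolution rate

print(f"regression rate: {regression_change:+.1%}")   # regression rate: -70.1%
print(f"resolution rate: {resolution_change:+.1%}")   # resolution rate: +33.3%
```

The resolution gain is 8 percentage points in absolute terms, or about a third in relative terms.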
Smaller models such as Qwen3-Coder 30B benefit more from contextual test guidance than from procedural TDD prompts; the procedural prompts actually increased regressions to 9.94%. On a subset of tasks, an auto-improvement loop pushed the resolution rate to 60% with zero regressions.
For builders integrating agents into CI/CD pipelines, TDAD addresses a core reliability gap. Instead of running the full test suite after every agent edit, which is slow and wasteful, run targeted verification on the high-impact tests. This speeds up iteration while keeping the tests most likely to be affected in the loop. Graph-based prioritization scales to large repos, and the open-source implementation (https://github.com/pepealonso95/TDAD) integrates via simple Python APIs.
Builder takeaway:
For each agent-generated PR, pipe the diff into TDAD's graph analyzer and select fewer than 20% of the tests for regression checks, which saves roughly 80% of test compute if tests cost about the same to run. Experiment with GraphRAG to map the test-code coupling in your own repo.
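One way to apply this takeaway, sketched under assumptions: given per-test impact scores (here a hand-written dictionary standing in for whatever the graph analyzer reports), keep the top 20% and hand those test IDs to pytest. The helper names, the scores, and the `budget` parameter are hypothetical and do not come from TDAD's API.

```python
import math
import shlex


def select_within_budget(impact_scores, budget=0.2):
    """Keep the highest-scoring tests, capped at `budget` of the suite (at least one)."""
    if not impact_scores:
        return []
    limit = max(1, math.floor(len(impact_scores) * budget))
    ranked = sorted(impact_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    # Drop zero-score tests even if the budget would allow them.
    return [test_id for test_id, score in ranked[:limit] if score > 0]


def pytest_command(test_ids):
    """Build a pytest invocation that runs only the selected node IDs."""
    return "pytest " + " ".join(shlex.quote(test_id) for test_id in test_ids)


# Hypothetical impact scores for a ten-test suite (illustrative values only).
scores = {
    "tests/test_parser.py::test_parse_basic": 0.9,
    "tests/test_parser.py::test_parse_nested": 0.7,
    "tests/test_render.py::test_render_html": 0.4,
    "tests/test_render.py::test_render_text": 0.3,
    "tests/test_cli.py::test_help": 0.1,
    "tests/test_cli.py::test_version": 0.0,
    "tests/test_io.py::test_read": 0.0,
    "tests/test_io.py::test_write": 0.0,
    "tests/test_utils.py::test_slugify": 0.0,
    "tests/test_utils.py::test_chunk": 0.0,
}

selected = select_within_budget(scores, budget=0.2)
command = pytest_command(selected)
print(command)
```

With ten scored tests and a 20% budget, two tests are selected. In a CI job the resulting command would run in place of the full suite, with a periodic full run as a backstop for anything the graph misses.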
Source: Pepe Alonso — ArXiv cs.SE, March 2026