High Coverage, Zero Signal: Tackling AI Agent Test Bloat



We’ve all felt that specific hit of Silicon Valley dopamine: You ask an AI agent to build a feature, and it returns with the code and a 98% coverage report. In that moment, you feel like you've hacked the software development lifecycle. You’re a 10x developer; no, a 100x developer.

Then, Monday morning hits. Your CI pipeline, which used to take 2 minutes, now takes 15. Your test suite has quadrupled in size, but when you actually break a core piece of logic, fourteen different tests fail with the same cryptic error message.

You don't have a robust suite. You have Test Bloat.

AI agents are coverage-hunting missiles. They are programmed to find uncovered lines and eliminate them with extreme prejudice. But because they often lack global context, they end up "carpet bombing" your codebase—hitting the same code path from five slightly different angles just to see that coverage bar move a fraction of a percent to the right.

The Mechanics of the Mess: Why Agents Carpet-Bomb Your Suite #

To solve the problem, we have to understand the "economics" of an LLM.

1. The Append-Only Mental Model #

Refactoring a complex, parameterized test structure is a high-risk, high-reasoning task for an AI. It requires the agent to parse existing intent, understand a complex AST, and safely inject new data without breaking existing assertions.

Appending a new test() block at the end of the file, however, is a low-risk, high-reward "append-only" operation. It’s a bounded task with a clear success signal: "Is the test green? Did the coverage bar move?" The agent chooses the path of least resistance, which is almost always addition rather than integration.
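
To make that asymmetry concrete, here is a sketch of the pattern in Vitest syntax; `applyDiscount` and its import path are hypothetical stand-ins for whatever your agent is covering.

```typescript
import { expect, test } from "vitest";
import { applyDiscount } from "../src/pricing"; // hypothetical function under test

// An existing test already exercises the discount path...
test("applies a percentage discount code", () => {
  expect(applyDiscount(100, "SAVE10")).toBe(90);
});

// ...but appending is cheaper than integrating, so the agent bolts on a
// near-duplicate that walks the same code path with different constants.
test("discount code reduces the total", () => {
  expect(applyDiscount(200, "SAVE10")).toBe(180);
});
```

Both tests are green, both moved the coverage bar when first written, and the second one tells you nothing the first didn't.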

2. Local vs. Global Context #

An agent usually works within a narrow context window—typically the file it’s currently editing. It sees that auth.ts has 40% coverage and goes to work. What it doesn’t see are the three integration tests and the Playwright E2E suite that already exercise those same authentication paths. Lacking this "global map," it treats every uncovered line in the local file as a vacuum that must be filled.

3. The Success Hallucination #

Industry data shows that for many mature organizations, 30-50% of tests are redundant or obsolete. But an agent’s evaluation function for "Is this test good?" is binary: the test passed, and the coverage number went up.

It doesn't ask: "Does this test provide new information?" or "Will this triple the maintenance burden?" Research suggests that doubling a test suite typically triples the maintenance tax, as brittle tests begin to overlap and conflict.

The Audit: Quantifying the Overlap #

If you're staring at a 5,000-test suite and suspect half of it is redundant, you need a methodology to quantify the mess. Here is how I wrestle an existing suite back into shape.

1. Geometric Overlap: The Coverage Intersection Score (CIS) #

The "Coverage Intersection Score" moves beyond aggregate coverage. We don't care that 90% of the lines are hit; we care how many times each line is hit by different tests.

The Logic: Imagine you have three tests:

- Test A, a broad integration test that exercises lines 10-80 of auth.ts
- Test B, a unit test that exercises lines 40-95
- Test C, a narrow unit test that exercises only lines 20-35

To calculate the CIS of Test C relative to Test A, we use the formula:

Overlap(C, A) = |Lines(C) ∩ Lines(A)| / |Lines(C)|

In this case, Test C’s lines are a 100% subset of Test A’s, so Test C has a CIS of 1.0. Unless Test C is validating a radically different logical outcome (which we’ll check in step 2), Test C is pure geometric bloat. It is a "carpet bomb" hitting a spot already scorched by Test A.

The Method: Run your suite with a tool like Istanbul or LCOV, but generate reports per test file. Use a script (read: have your agent write you a script) to compare the JSON outputs. If you find tests where the "Intersection" is nearly equal to the "Total Lines Covered" by that test, you’ve found a candidate for the bin.
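
Here is a minimal sketch of that comparison in TypeScript. It assumes you have already flattened each per-test coverage report into a set of "file:line" strings; the loading step depends on your coverage tool's JSON format, so treat it as the part your agent writes for you.

```typescript
// cis-audit.ts — minimal sketch of a pairwise Coverage Intersection Score audit.
// Assumes each per-test coverage report is flattened into a set of "file:line" keys.

type CoverageMap = Map<string, Set<string>>; // test name -> covered "file:line" keys

function overlap(candidate: Set<string>, other: Set<string>): number {
  if (candidate.size === 0) return 0;
  let shared = 0;
  for (const line of candidate) {
    if (other.has(line)) shared++;
  }
  return shared / candidate.size; // |Lines(C) ∩ Lines(A)| / |Lines(C)|
}

function findBloatCandidates(coverage: CoverageMap, threshold = 0.95): string[] {
  const candidates: string[] = [];
  for (const [testC, linesC] of coverage) {
    for (const [testA, linesA] of coverage) {
      if (testA === testC) continue;
      const cis = overlap(linesC, linesA);
      if (cis >= threshold) {
        candidates.push(`${testC} is ${(cis * 100).toFixed(0)}% covered by ${testA}`);
        break; // one dominating test is enough to flag it
      }
    }
  }
  return candidates;
}

// Example with the three tests from above (hypothetical line sets):
const suite: CoverageMap = new Map([
  ["Test A", new Set(["auth.ts:10", "auth.ts:20", "auth.ts:30", "auth.ts:40"])],
  ["Test B", new Set(["auth.ts:40", "auth.ts:50", "auth.ts:60"])],
  ["Test C", new Set(["auth.ts:20", "auth.ts:30"])], // pure subset of Test A
]);

console.log(findBloatCandidates(suite));
// -> [ "Test C is 100% covered by Test A" ]
```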

2. Logical Overlap: Mutation Redundancy #

Two tests might cover the same lines but check different edge cases. A geometric check isn't enough; you need to see if the tests actually catch different bugs. This is where mutation testing (like StrykerJS) becomes the ultimate truth-teller.

Mutation testing injects tiny "mutants" (bugs) into your code—changing a > to a >= or a true to a false. A test "kills" a mutant by failing when it runs against the mutated code; if every test stays green, the mutant survives, and you’ve learned that nothing in your suite actually verifies that logic.

In a bloated agent-written suite, you will often find dozens of tests that all kill the same "easy" mutants (like a function returning early) but none that kill the "hard" mutants (like boundary condition logic).
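
If you want to try this on a Jest or Vitest project, a StrykerJS config sketch looks roughly like the one below. The field names follow Stryker's documented options, but treat the concrete values as assumptions to adjust; `perTest` coverage analysis is the setting that tracks which individual tests exercise each mutant, which is the granularity this redundancy check needs.

```typescript
// stryker.conf.mjs — hedged config sketch for a Vitest project
// (swap testRunner to "jest" if that's what you run).
export default {
  testRunner: "vitest",
  mutate: ["src/**/*.ts", "!src/**/*.test.ts"], // mutate source files, not the tests
  reporters: ["clear-text", "html"],
  coverageAnalysis: "perTest", // track which tests cover each mutant
  thresholds: { high: 80, low: 60, break: 50 }, // break the build below 50% mutation score
};
```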

3. Marginal Coverage Utility #

Every new test should be forced to answer one question: "What do you add that wasn't there before?"

Marginal Utility = TotalCoverage(Suite + NewTest) - TotalCoverage(Suite)

If the Delta is zero, and the Mutation Kill Count is zero, the test is a liability, not an asset.
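
Computing that delta is a one-liner once you have the flattened "file:line" sets from the CIS sketch above; the snippet below just counts the lines a proposed test would add to the suite's union.

```typescript
// marginal-utility.ts — sketch of the marginal coverage check, reusing the
// flattened "file:line" sets from the CIS audit above.

function marginalUtility(
  suiteLines: Set<string>,   // union of lines covered by the existing suite
  newTestLines: Set<string>, // lines covered by the proposed test alone
): number {
  let novel = 0;
  for (const line of newTestLines) {
    if (!suiteLines.has(line)) novel++;
  }
  return novel; // coverage the suite gains only because this test exists
}

// A proposed test that merely re-treads existing coverage scores zero:
const suiteUnion = new Set(["auth.ts:10", "auth.ts:20", "auth.ts:30"]);
const proposed = new Set(["auth.ts:20", "auth.ts:30"]);
console.log(marginalUtility(suiteUnion, proposed)); // -> 0: geometric bloat
```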

The Preventive: The AGENTS.md Protocol #

To stop the bloat from returning, you need to codify your testing discipline. I’ve moved away from proprietary config files and toward the AGENTS.md standard. It’s a "README for agents" that provides a predictable place for the context and constraints that LLMs need to stay disciplined.

By placing an AGENTS.md file in your root, you give your AI wingman a flight manual that prioritizes suite health over raw coverage numbers.

Here is the protocol I now bake into my projects:

```markdown
# AGENTS.md

## Testing Guidelines (Anti-Bloat Protocol)

- **Search Before Creation**: Before writing a new test, you MUST search the `tests/` directory (use grep or vector search) for existing tests covering the target functions.
- **Prioritize Table-Driven Tests**: If a test for the target logic exists, do not create a new `test()` block. Instead, refactor the existing test into a `test.each()` (Jest/Vitest) table and add your case as a new row of data.
- **Justify Redundancy**: If you propose a test that overlaps with existing coverage, you must explicitly state in your summary why a new test is required (e.g., "Hits a unique error boundary that existing integration tests miss").
- **Quality over Quantity**: Avoid "smoke tests" that only assert that a function doesn't throw. Every test must include at least one meaningful `expect()` that validates a business rule.
- **Clean Up**: If you notice three tests performing nearly identical assertions, you are encouraged to consolidate them into a single parameterized test.
```
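
And here is what guideline 2 looks like in practice, using the same hypothetical `applyDiscount` example from earlier: a disciplined agent adds a row to the table instead of appending another `test()` block.

```typescript
import { expect, test } from "vitest";
import { applyDiscount } from "../src/pricing"; // hypothetical function under test

test.each([
  { total: 100, code: "SAVE10", expected: 90 },    // the original case
  { total: 200, code: "SAVE10", expected: 180 },   // the would-be duplicate, now a row
  { total: 100, code: "EXPIRED", expected: 100 },  // a genuinely new edge case
])("applyDiscount($total, $code) -> $expected", ({ total, code, expected }) => {
  expect(applyDiscount(total, code)).toBe(expected);
});
```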

Wrapping Up: The Developer as Architect #

In the era of AI, the developer's role has shifted. We are no longer just "code writers"; we are Architects and Editors.

An agent will happily give you 100% coverage by writing 1,000 brittle tests. It’s your job to realize that a 100-test suite that catches everything is infinitely more valuable than a 1,000-test suite that catches the same things but takes ten times longer to run.

Don't let the "coverage-hunting missiles" dictate the health of your codebase. Use the AGENTS.md protocol to enforce discipline, audit your geometric overlap, and remember: every line of test code you don't have to maintain is a win.
