The Bigger Software Challenge with AI

Software testing and verification frontier

Published: Feb 16, 2026 | Author: Debarshi Basak | Reading time: 6 minutes

The challenge

As more and more software is generated by agents, a subtle but massive bottleneck is emerging: verification and testing still have to be done by humans, and verifying agent-generated work is a genuinely hard problem.

Mistakes in AI-generated code can be incredibly subtle and hard to catch. They compile. They pass a cursory review. But they compound into problems that the person who commissioned the code cannot debug, because they never understood the implementation in the first place.

Most test-driven development and software verification methods assume that the person writing the code has a deep understanding of the problem domain. With vibe-coded software, that assumption no longer holds. We don't think that's necessarily a bad thing; it's just a different paradigm, and one we should embrace. But it demands a fundamentally different approach to verification.

For example, as we build a database proxy, we need to ensure it works with JDBC, native drivers, and other database interfaces. There are countless interfaces out there, and you can argue the proxy hardens over time. But the surface area for subtle bugs is enormous: connection pooling edge cases, transaction isolation quirks, wire protocol inconsistencies. An AI can generate plausible-looking proxy code, but verifying correctness across all these interfaces requires a methodology that goes beyond "write a few unit tests and hope for the best."
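
To make "subtle" concrete, here is a deliberately simplified, hypothetical sketch of the kind of code an agent might produce. The pool interface and method names are ours, not any real driver API. It compiles, forwards queries correctly, and passes a happy-path test, but it leaks a pooled connection whenever a query raises:

```python
# Hypothetical sketch of a simplified proxy method an agent might generate.
# It looks reasonable and passes happy-path tests, but if execute() raises,
# the connection is never returned to the pool.

class NaiveProxy:
    def __init__(self, pool):
        self.pool = pool

    def run_query(self, sql):
        conn = self.pool.acquire()
        result = conn.execute(sql)   # an exception here skips the release below
        self.pool.release(conn)
        return result

    # The fix makes the invariant explicit: the connection is returned on
    # every path, success or failure.
    def run_query_safe(self, sql):
        conn = self.pool.acquire()
        try:
            return conn.execute(sql)
        finally:
            self.pool.release(conn)
```

A happy-path unit test never notices the difference; the leak only shows up under sustained error traffic.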

Why traditional testing breaks down

The fundamental issue is this: if an AI writes your code and the same AI writes your tests, the tests inherit the same assumptions and blind spots as the code. The tests pass, everything looks green, and you ship a bug that neither the code nor the tests were designed to catch.

TDD works because the developer's mental model of the problem acts as an independent check on the implementation. Write the test first, make it pass, refactor. But the test's value comes from the developer's understanding — their ability to think about edge cases, boundary conditions, and invariants that matter.
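
For concreteness, here is a minimal, hypothetical TDD-style sketch; FakePool and run_batch are names we made up. The test comes first and encodes an edge case the developer already understands, and that understanding, not the code, is what gives the test its value:

```python
# Minimal, hypothetical TDD-style example. The test is written first and
# encodes domain knowledge the developer already has: an empty batch must
# not touch the pool. FakePool and run_batch are illustrative names.

class FakePool:
    def __init__(self, size):
        self.available = size

    def acquire(self):
        self.available -= 1
        return object()              # stand-in for a real connection

    def release(self, conn):
        self.available += 1


def run_batch(pool, statements):
    """Written after the test, just enough to make it pass."""
    for _sql in statements:
        conn = pool.acquire()
        try:
            pass                     # a real proxy would forward _sql here
        finally:
            pool.release(conn)


def test_empty_batch_leaves_pool_untouched():
    pool = FakePool(size=2)
    run_batch(pool, [])
    assert pool.available == 2
```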

When you remove that deep understanding from the loop, you need something else to fill the gap. That something is separation of concerns between writing and verifying.

A new approach: Adversarial Specification Testing

We're proposing a methodology we call Adversarial Specification Testing (AST). The core idea is simple: separate the agent that writes the code from the agent that verifies it, and give the human a focused role reviewing specifications rather than implementations.

The three-agent architecture

Agent 1: Specification. The human writes their intent in natural language. A specification agent formalizes it into a behavioral contract — not code, but structured assertions about inputs, outputs, side effects, and invariants. For example, a database proxy might have the contract: "all connections are returned to the pool on transaction completion, queries are forwarded without modification, connection failures surface the original error to the caller, and concurrent requests never share a transaction context." The human reviews this contract. This is where human effort is most valuable — verifying that the spec captures what they actually want.
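
As an illustration, here is a hypothetical sketch of what a formalized contract could look like for the proxy example above. The structure and field names are ours, not a fixed schema; the point is that the human reviews assertions like these, not code:

```python
# Hypothetical sketch of a behavioral contract for the database proxy example.
# The structure (inputs, side effects, invariants) and field names are
# illustrative, not a fixed schema.

PROXY_CONTRACT = {
    "component": "database-proxy",
    "invariants": [
        "every acquired connection is returned to the pool when its "
        "transaction completes, on success and on failure",
        "queries are forwarded to the backend without modification",
        "a connection failure surfaces the original backend error to the caller",
        "concurrent requests never share a transaction context",
    ],
    "inputs": {
        "sql": "any statement the backend accepts",
        "concurrency": "up to pool_size simultaneous callers",
    },
    "side_effects": [
        "pool occupancy returns to its initial value after each request",
    ],
}
```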

Agent 2: Implementation. A standard coding agent generates the code. It has access to the contract but the contract doesn't dictate how to implement — only what the behavior must be.

Agent 3: Verification. This agent has access to the contract but not the implementation source. It independently generates property-based tests from the contract invariants, boundary and edge-case inputs designed to violate the contract, and mutation tests that subtly break the code to check if the test suite catches it. The independence is critical — it prevents the circular reasoning that happens when one agent writes both code and tests.
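
Here is a hypothetical sketch of the kind of property-based test a verification agent might derive from the contract's pool-return invariant. It uses the Hypothesis library; the pool and proxy classes are minimal stand-ins included only to make the sketch self-contained, since the verification agent itself would see only the contract and the public interface:

```python
# Sketch of an implementation-blind property test derived from the contract's
# pool-return invariant. Uses the Hypothesis library. InMemoryPool and
# ProxyUnderTest are hypothetical stand-ins, included only so the example runs.

from hypothesis import given, strategies as st

POOL_SIZE = 4


class InMemoryPool:
    """Test double standing in for the real connection pool."""

    def __init__(self, size):
        self._available = size

    def acquire(self):
        self._available -= 1
        return self                      # doubles as the "connection"

    def release(self, conn):
        self._available += 1

    def available(self):
        return self._available

    def execute(self, sql):
        if sql == "RAISE":
            raise RuntimeError("backend error")
        return sql


class ProxyUnderTest:
    """Stand-in for the generated proxy's public surface."""

    def __init__(self, pool):
        self.pool = pool

    def run_query(self, sql):
        conn = self.pool.acquire()
        try:
            return conn.execute(sql)
        finally:
            self.pool.release(conn)


# Mix statements that succeed with statements that make the backend raise,
# so the invariant is exercised on failure paths too.
statements = st.lists(st.sampled_from(["SELECT 1", "RAISE"]), max_size=20)


@given(statements)
def test_pool_occupancy_restored_after_every_request(batch):
    pool = InMemoryPool(size=POOL_SIZE)
    proxy = ProxyUnderTest(pool)
    for sql in batch:
        try:
            proxy.run_query(sql)
        except RuntimeError:
            pass                         # failures are allowed by the contract
    # ...but the pool must always be made whole again.
    assert pool.available() == POOL_SIZE
```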

The verification layers

AST stacks verification from cheapest to most expensive, so you catch problems early and only involve humans where they add the most value (a sketch of the full pipeline follows the layer descriptions):

Layer 0 — Static analysis. Type checking, linting, known vulnerability patterns. Fully automated, near-zero cost. This catches the obvious stuff before anything else runs.

Layer 1 — Contract verification. Does the code satisfy its behavioral specification? The verification agent generates tests from the contract and runs them. Low cost, high signal.

Layer 2 — Adversarial testing. Fuzzing, property-based testing, active attempts to find inputs that break invariants. This is where you catch the subtle bugs that pass normal test suites. Medium cost, catches what Layer 1 misses.

Layer 3 — Integration. Do components work together correctly? This involves both agents and human judgment, especially at system boundaries.

Layer 4 — Intent review. Does the specification actually capture what the human wanted? This is the only layer that requires deep human engagement, and it's focused on reviewing specifications, not implementations. That's a fundamentally more tractable task.
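
Here is the hypothetical orchestration sketch promised above. Every layer function is a stub standing in for real tooling, and the names are ours; the point is the ordering: cheap automated checks run first, and the pipeline stops escalating at the first failure:

```python
# Hypothetical sketch of an AST pipeline runner. Each layer function is a stub
# standing in for real tooling; the structure shows the cheap-to-expensive
# ordering and the early exit, not a production implementation.

from dataclasses import dataclass
from typing import Callable


@dataclass
class LayerResult:
    name: str
    passed: bool
    detail: str = ""


def layer0_static_analysis(artifact) -> LayerResult:
    # type checking, linting, known-vulnerability patterns
    return LayerResult("static-analysis", passed=True)


def layer1_contract_verification(artifact) -> LayerResult:
    # run tests the verification agent generated from the behavioral contract
    return LayerResult("contract-verification", passed=True)


def layer2_adversarial(artifact) -> LayerResult:
    # fuzzing, property-based testing, mutation testing
    return LayerResult("adversarial", passed=True)


def layer3_integration(artifact) -> LayerResult:
    # cross-component tests at system boundaries
    return LayerResult("integration", passed=True)


def layer4_intent_review(artifact) -> LayerResult:
    # queue the specification (not the code) for human review
    return LayerResult("intent-review", passed=True)


LAYERS: list[Callable] = [
    layer0_static_analysis,
    layer1_contract_verification,
    layer2_adversarial,
    layer3_integration,
    layer4_intent_review,
]


def run_pipeline(artifact) -> list[LayerResult]:
    """Run layers cheapest-first and stop at the first failure."""
    results = []
    for layer in LAYERS:
        result = layer(artifact)
        results.append(result)
        if not result.passed:
            break                        # don't pay for expensive layers yet
    return results
```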

How this differs from TDD

In TDD, the same developer writes tests from their understanding, then writes code to pass them. The tests are implementation-aware and assume deep domain knowledge. This works brilliantly when that knowledge exists.

In AST, the code and the tests are written by different agents. Tests are specification-aware but implementation-blind. The methodology assumes clear intent specification rather than deep domain knowledge. And the human's role shifts from "reviewer of code" to "reviewer of specifications", a much more natural and effective use of their attention.

The key insight is that humans are much better at answering "is this what I meant?" than "is this code correct?" AST leans into that strength.

Open questions

This isn't a solved problem. There are real challenges with this approach.

Specification completeness: How do you know your spec captures everything that matters? An incomplete spec means verified but wrong code. The spec itself needs adversarial review.

Emergent behavior: AI-generated systems can have emergent interactions between components that no single contract captures. Integration testing helps, but we need better tools for reasoning about system-level behavior.

Cost: Running three agents is 3x the compute. Is the verification worth it for every function, or only for critical paths? Likely the answer is tiered — full AST for core logic, lighter verification for scaffolding and glue code.
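
One hypothetical way to express that tiering is a simple policy mapping code categories to the layers that run; the category names and layer labels below are ours:

```python
# Hypothetical sketch of a tiered verification policy: core logic gets the
# full AST stack, scaffolding and glue code get only the cheap automated
# layers. Category and layer names are illustrative.

VERIFICATION_POLICY = {
    "core-logic":  ["static", "contract", "adversarial", "integration", "intent-review"],
    "scaffolding": ["static", "contract"],
    "glue-code":   ["static"],
}


def layers_for(path, category_of):
    """Look up which layers to run for a file, given a categorizer function."""
    return VERIFICATION_POLICY.get(category_of(path), ["static"])
```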

Conclusion

The software verification problem isn't going away — it's getting harder as AI generates more of our code. But the solution isn't to slow down adoption. It's to build verification methodologies that are native to the AI-assisted paradigm. Separate writing from checking. Let humans verify intent, not implementation. And treat adversarial testing not as an afterthought, but as a first-class part of the development workflow.

We're building tools at System32 that move in this direction. The frontier isn't code generation — it's code verification. And that's where the real leverage is.

Join the conversation

Discuss the future of software verification with us on Discord.
