Construction

Adversarial correctness testing at scale

1,375+ SQL test cases. Every one runs under AddressSanitizer. Every one must produce bit-identical output to PostgreSQL. This was not QA after the fact; it was the development methodology. Three agents worked in adversarial rounds: one writing tests designed to break the system, one reviewing code for structural problems, one fixing failures. The result: ~45.1K lines of C that beat PostgreSQL on every cached batch benchmark and DuckDB on 61 of 93 workloads.

[Diagram: the Challenger feeds adversarial test cases to the Writer; the Reviewer feeds actionable comments; the Writer writes and fixes code; the loop iterates until stable and 1,375+ tests pass. ~45,100 lines of C · 1,375+ tests · all under AddressSanitizer.]

The three agents worked in iterative rounds: the Challenger studied the current codebase and produced adversarial .sql test files, the Reviewer annotated the source with actionable comments, and the Writer ran the full suite under ASAN, diagnosed failures, and shipped fixes. This continued until every test passed with zero memory errors.

The three agents

The Challenger studies the current codebase and produces .sql test files targeting corner cases: empty tables, NULL propagation through expressions, multi-column ordering with mixed ASC/DESC, stale index entries after deletes, concurrent readers during writes, and edge cases in type casting. Each round adds 10–30 new tests.
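A Challenger round might look something like the sketch below: emit numbered .sql files, each probing one of the corner cases listed above. The file-naming scheme, directory layout, and queries are illustrative assumptions, not the project's actual conventions.

```python
# Hypothetical sketch of one Challenger round (names and layout are assumed).
from pathlib import Path

ADVERSARIAL_CASES = {
    "empty_table_agg": (
        "CREATE TABLE t (a INT);\n"
        "SELECT count(*), sum(a), min(a) FROM t;"   # aggregates over zero rows
    ),
    "null_propagation": (
        "SELECT 1 + NULL, NULL = NULL;"             # NULL flowing through expressions
    ),
    "mixed_order": (
        "CREATE TABLE u (a INT, b INT);\n"
        "INSERT INTO u VALUES (1, 2), (1, 1), (2, 0);\n"
        "SELECT * FROM u ORDER BY a ASC, b DESC;"   # multi-column mixed ASC/DESC
    ),
    "stale_index": (
        "CREATE TABLE v (k INT PRIMARY KEY);\n"
        "INSERT INTO v VALUES (1);\n"
        "DELETE FROM v;\n"
        "SELECT * FROM v WHERE k = 1;"              # index must not surface the dead row
    ),
}

def write_round(out_dir: str, round_no: int) -> list:
    """Write one round's tests as numbered .sql files and return their paths."""
    d = Path(out_dir)
    d.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, sql in ADVERSARIAL_CASES.items():
        p = d / f"round{round_no:03d}_{name}.sql"
        p.write_text(sql + "\n")
        paths.append(p)
    return paths
```

Each emitted file is self-contained, so a failure reproduces with a single input and no test-ordering dependencies.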

The Writer runs the new tests under AddressSanitizer, diagnoses failures, and ships fixes. It also implements new features requested by the test suite—if the Challenger writes a test for INTERSECT ALL, the Writer adds the parser rule, executor path, and wire serialisation to make it pass.
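The bit-identical requirement reduces to a byte comparison between two engines' raw output. A minimal sketch of that differential check, assuming the real suite pipes each .sql file to `psql` and to the ASAN-instrumented engine (built with something like `-fsanitize=address`); the `cat` stand-ins below exist only so the sketch runs anywhere:

```python
# Differential check sketch: same SQL in, byte-for-byte identical output required.
import subprocess

def run(cmd: list, sql: str) -> bytes:
    """Feed sql on stdin and capture raw stdout bytes (no decoding, no trimming)."""
    return subprocess.run(cmd, input=sql.encode(),
                          capture_output=True, check=True).stdout

def bit_identical(reference_cmd: list, test_cmd: list, sql: str) -> bool:
    """True iff both engines produce byte-for-byte identical output."""
    return run(reference_cmd, sql) == run(test_cmd, sql)

# Stand-in "engines": cat echoes its input, so the outputs match trivially.
assert bit_identical(["cat"], ["cat"], "SELECT 1;")
```

Comparing raw bytes rather than parsed rows is the stricter contract: it catches formatting, padding, and NULL-rendering drift that a semantic diff would forgive.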

The Reviewer reads the source after each round and annotates it with actionable comments: missing error paths, redundant allocations, opportunities to share code between the plan executor and legacy row-by-row path, and architectural improvements like converting ENUM storage from string-based to ordinal.

The feedback loop

Write code, run 1,375+ tests, fix failures, repeat. The adversarial model drove correctness the same way rigorous code review does on a human team—except all three sides were machines. A typical round:

  1. Challenger adds tests for a new feature or edge case.
  2. Writer implements the feature and fixes any regressions.
  3. Reviewer flags code-quality issues and suggests refactors.
  4. Writer addresses review comments and re-runs the full suite.
  5. Repeat until all tests pass under ASAN with zero leaks.

What the process produced

The adversarial loop forced correctness properties that would have been hard to specify upfront. Each round of tests exposed assumptions the code had silently made, and the architectural decisions that emerged were direct responses to what broke.

Explore further

Architecture  ·  Design philosophy  ·  Benchmarks  ·  Testing methodology  ·  Try it in the browser