1,375+ SQL test cases. Every one runs under AddressSanitizer. Every one must produce bit-identical output to PostgreSQL. This was not QA after the fact; it was the development methodology. Three agents worked in adversarial rounds: one writing tests designed to break the system, one reviewing code for structural problems, one fixing failures. The result: ~45.1K lines of C that beat PostgreSQL on every cached batch benchmark and DuckDB on 61 of 93 workloads.
The three agents worked in iterative rounds. The Challenger studied the current
codebase and produced .sql test files targeting adversarial inputs:
empty tables, NULL propagation through expressions, multi-column ordering with
mixed ASC/DESC, stale index entries after deletes, and type-cast edge cases.
The Reviewer read the source and annotated it with actionable comments: missing
error paths, redundant allocations, architectural improvements. The Writer ran the
full suite under AddressSanitizer, diagnosed failures, and shipped fixes. This
continued until every test passed with zero memory errors.
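The requirement that each test's output match PostgreSQL's bit for bit amounts to a plain byte comparison between the captured output and a stored expected result. The following is a minimal sketch of such a check, not the project's actual harness; the function name `first_mismatch` and its signature are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper (not from the project): returns the offset of the
 * first differing byte, or -1 when both buffers have the same length
 * and the same contents. */
long first_mismatch(const unsigned char *actual, size_t actual_len,
                    const unsigned char *expected, size_t expected_len)
{
    size_t n = actual_len < expected_len ? actual_len : expected_len;

    for (size_t i = 0; i < n; i++)
        if (actual[i] != expected[i])
            return (long)i;

    /* One buffer is a strict prefix of the other: the mismatch sits at
     * the end of the shorter one. */
    return actual_len == expected_len ? -1 : (long)n;
}
```

Reporting the offset rather than a plain yes/no makes a failing test easier to diagnose, since the first diverging byte usually points at the offending row or column.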
The Challenger studies the current codebase and produces
.sql test files targeting corner cases: empty tables, NULL
propagation through expressions, multi-column ordering with mixed
ASC/DESC, stale index entries after deletes, concurrent readers during
writes, and edge cases in type casting. Each round adds 10–30 new
tests.
The Writer runs the new tests under AddressSanitizer,
diagnoses failures, and ships fixes. It also implements new features
that the test suite demands: if the Challenger writes a test for
INTERSECT ALL, the Writer adds the parser rule, executor
path, and wire serialisation to make it pass.
The Reviewer reads the source after each round and annotates it with actionable comments: missing error paths, redundant allocations, opportunities to share code between the plan executor and legacy row-by-row path, and architectural improvements like converting ENUM storage from string-based to ordinal.
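To illustrate the kind of change the Reviewer might suggest, here is a minimal sketch of ordinal ENUM storage, assuming each ENUM type keeps its labels in declaration order and each row stores only an `int32_t` index. The `enum_type` struct and both function names are hypothetical and not taken from the project.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical ENUM type descriptor: labels kept in declaration order. */
typedef struct {
    const char *const *labels;
    int32_t            count;
} enum_type;

/* Input side: map a label to its ordinal, or -1 if the label is unknown. */
int32_t enum_ordinal(const enum_type *t, const char *label)
{
    for (int32_t i = 0; i < t->count; i++)
        if (strcmp(t->labels[i], label) == 0)
            return i;
    return -1;
}

/* Output side: map an ordinal back to its label, or NULL if out of range. */
const char *enum_label(const enum_type *t, int32_t ord)
{
    return (ord >= 0 && ord < t->count) ? t->labels[ord] : NULL;
}
```

With this layout, comparing two ENUM values is a single integer comparison, and the result follows declaration order, which is how PostgreSQL orders ENUM values. A string-based representation would need a label lookup on every comparison to get the same ordering.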
Write code, run 1,375+ tests, fix failures, repeat. The adversarial model drove correctness the same way rigorous code review does on a human team, except all three sides were machines.
The adversarial loop forced correctness properties that would be hard to specify upfront. Each round of tests exposed assumptions the code had silently made. The architectural decisions that emerged are direct responses to what broke:
int32_t internally, converted at I/O boundaries only.
switch on every tagged union with -Wswitch-enum and no default:. Adding a new type variant produces a compile error at every unhandled site.
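The exhaustive-switch rule can be sketched as follows. The `value` type and its tags are hypothetical stand-ins rather than the project's actual definitions. Note that `-Wswitch-enum` by itself only emits a warning, so the sketch assumes the build also passes `-Werror` to turn it into a hard compile error.

```c
#include <stdint.h>

/* Hypothetical tagged union; names are illustrative, not from the project. */
typedef enum { VAL_NULL, VAL_INT, VAL_TEXT } val_tag;

typedef struct {
    val_tag tag;
    union {
        int32_t     i;
        const char *s;
    } as;
} value;

/* Assumed build flags: cc -Wall -Wswitch-enum -Werror ...
 * There is no default: label, so adding a new tag (say VAL_BOOL) to
 * val_tag makes this switch fail to compile until the case is handled. */
const char *value_type_name(const value *v)
{
    switch (v->tag) {
    case VAL_NULL: return "null";
    case VAL_INT:  return "int";
    case VAL_TEXT: return "text";
    }
    /* Reached only if tag holds a value outside the enum's declared set. */
    return "invalid";
}
```

Leaving out `default:` is what makes the check useful: a catch-all branch would silently absorb any new variant, whereas the explicit case list forces every switch site to be revisited.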