The modern programming landscape has moved steadily toward more abstraction. ORMs wrap SQL. Query builders wrap ORMs. Connection pools wrap drivers. Frameworks wrap frameworks. At the database layer: buffer pools manage page files, WAL layers serialize writes, MVCC layers version every row, query planners rewrite your joins before executing them.
Each layer is individually justifiable. Together, they make it difficult to answer a simple question: why is this query slow? The answer might be in the ORM, the connection pool, the query planner, the buffer pool, the WAL checkpoint, or the MVCC vacuum process. The stack is too deep to hold in your head.
This engine is a counterexample: an in-memory SQL engine with no buffer pool, no WAL, no MVCC, no query planner, and no external dependencies. Every query flows through four stages implemented in four files. The full execution path from TCP bytes to DataRow bytes is traceable in roughly 500 lines across three files. When something is slow, the answer is in the code.
Most database engines call malloc/free per query,
per row, or per expression evaluation. Each call acquires a lock, searches a
free list, and potentially triggers a system call. Under concurrent load, the
allocator becomes a bottleneck.
The bump allocator (arena.h) handles all per-query scratch
memory. After the first query warms up the slab chain, subsequent queries
perform zero malloc/free calls.
query_arena_reset() rewinds the bump pointer to zero—no
free, no lock, no fragmentation. Hash tables for joins and aggregation, sort
index arrays, and plan nodes are all bump-allocated from
arena.scratch.
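The pattern is small enough to sketch in full. This is an illustrative bump arena, not the engine's actual arena.h; the struct layout, the names arena_t and arena_alloc, and the 16-byte alignment choice are assumptions for the example:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative bump arena; a single slab, no chaining. */
typedef struct {
    uint8_t *base;  /* start of the slab           */
    size_t   used;  /* bump pointer offset         */
    size_t   cap;   /* slab capacity in bytes      */
} arena_t;

static void *arena_alloc(arena_t *a, size_t n) {
    n = (n + 15) & ~(size_t)15;            /* 16-byte alignment */
    if (a->used + n > a->cap) return NULL; /* real code would chain a new slab */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

/* Reset rewinds the pointer: no free, no lock, no fragmentation. */
static void arena_reset(arena_t *a) { a->used = 0; }
```

Allocation is a pointer add and a bounds check; reset is a single store. That is the entire per-query memory lifecycle.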
Measured result: aggregate queries run 8–10× faster than PostgreSQL and DuckDB. The gap is not algorithmic—the same GROUP BY algorithm runs in both. The gap is allocator overhead.
PostgreSQL stores rows as heap tuples: variable-length byte sequences with attribute offsets. Reading a single column from 5,000 rows means jumping to 5,000 non-contiguous memory locations. Modern CPUs prefetch ahead, but only if access is sequential. Random pointer chasing defeats prefetching.
The engine stores data in a row-oriented format (table.rows) for
writes, but maintains a columnar scan cache (table.scan_cache)
for reads. The first scan after a write pays a one-time O(N) conversion cost
to build flat typed arrays per column. Every subsequent scan reads contiguous
memory. For a 5,000-row, 3-column table: 32 KB of sequential reads
instead of 5,000 pointer dereferences.
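The conversion itself is a plain transpose. The sketch below is illustrative; the struct names row_t, scan_cache_t, and cache_rebuild are assumptions, and the real table.scan_cache handles multiple column types rather than only INT:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { int32_t *vals; } row_t;  /* one heap row: ncols ints */

typedef struct {
    int32_t **cols;   /* cols[c][r]: flat typed array per column */
    size_t    nrows;
} scan_cache_t;

/* One-time O(N) conversion on the first scan after a write. */
static void cache_rebuild(scan_cache_t *sc, const row_t *rows,
                          size_t nrows, size_t ncols) {
    for (size_t c = 0; c < ncols; c++)
        for (size_t r = 0; r < nrows; r++)
            sc->cols[c][r] = rows[r].vals[c];  /* chase pointers once... */
    sc->nrows = nrows;                         /* ...then every scan is sequential */
}
```

After the rebuild, a scan over one column walks a single contiguous int32_t array, which is exactly the access pattern the hardware prefetcher rewards.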
Measured result: full-scan 32% faster than PostgreSQL. Parquet aggregate queries 2.8× faster than DuckDB on cached workloads, because the result cache serves them without touching Parquet at all.
The plan executor processes data in 1,024-row col_block chunks
(block.h). A 4-column INT block occupies
4 × 1,024 × 4 = 16 KB.
L1 data cache on a modern CPU is 32–48 KB. The entire working set
for a filter pass or aggregate accumulation fits in L1.
This is not a coincidence. 1,024 rows was chosen so that common analytical workloads (4–8 INT columns) fit in L1. The filter inner loop runs entirely from registers and L1—no cache misses during the hot path.
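The arithmetic can be checked at compile time. The struct below is a sketch of the idea, not the engine's col_block definition in block.h; a real block also carries type tags and a row count:

```c
#include <stdint.h>

#define COL_BLOCK_ROWS 1024

/* Illustrative 4-column INT block matching the sizing in the text. */
typedef struct {
    int32_t col[4][COL_BLOCK_ROWS];
} col_block4;

/* 4 cols x 1,024 rows x 4 bytes = 16 KB: inside a 32-48 KB L1 data cache. */
_Static_assert(sizeof(col_block4) == 16 * 1024, "block exceeds L1 budget");
```

Because the budget is enforced by the type, growing the block or widening a column is a deliberate decision that breaks the build, not a silent cache-miss regression.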
Measured result: correlated subquery 22ms vs PostgreSQL 3,665ms (166×). The subquery executes once per outer row; with L1 residency, each inner scan is effectively free.
A naive wire protocol implementation calls write() once per
result row. For a 5,000-row result, that is 5,000 system calls—each
crossing the kernel boundary, flushing registers, and returning. This was
the original bottleneck for ORDER BY workloads (4.65× slower than
PostgreSQL).
try_plan_send() in pgwire.c serializes
col_block arrays directly into a 64 KB wire buffer, then
flushes with a single write(). For a 5,000-row result:
5 syscalls instead of 5,000. The columnar data flows from scan cache
straight to the TCP socket with no row materialization in between.
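The buffering pattern looks roughly like this. The names wire_t and wire_put are hypothetical, and the real try_plan_send serializes col_block arrays in place rather than copying pre-built messages:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define WIRE_BUF (64 * 1024)

typedef struct {
    uint8_t buf[WIRE_BUF];
    size_t  len;
    int     fd;
} wire_t;

/* Append one serialized message; write() only when the 64 KB buffer fills. */
static int wire_put(wire_t *w, const uint8_t *msg, size_t n) {
    if (w->len + n > sizeof w->buf) {
        if (write(w->fd, w->buf, w->len) < 0)  /* one syscall per 64 KB */
            return -1;
        w->len = 0;
    }
    memcpy(w->buf + w->len, msg, n);
    w->len += n;
    return 0;
}
```

With ~60-byte DataRow messages, a 64 KB buffer holds on the order of a thousand rows per syscall, which is where the 5-versus-5,000 ratio comes from.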
Measured result: ORDER BY workload went from 4.65×
slower to 1.18× PostgreSQL after this change alone. Mixed analytical
throughput reaches 94,525 QPS—190× PostgreSQL—because
the result cache serves subsequent identical queries as a single
send_all() of cached bytes.
A system with fewer abstractions is not just faster—it is easier to debug. The call stack for a failing query bottoms out at a function you can read in full in a few minutes. There is no query planner that rewrote your join. There is no buffer pool eviction racing your scan. There is no MVCC garbage collection running concurrently with your transaction.
This has practical consequences for the code itself. One .c file per subsystem: pgwire.c owns the wire protocol, parser.c owns the SQL parser, plan.c owns the block executor. No function in parser.c calls anything in plan.c; the dependency graph is a DAG.
Every switch on an enum covers every variant explicitly. -Wswitch-enum is always on, and default: is forbidden on finite enums. Adding a new plan node type or column type produces a compile error at every unhandled dispatch site, not a silent runtime skip.
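A minimal sketch of the convention; the enum values here are illustrative, not the engine's actual plan node types:

```c
/* With -Wswitch-enum, a switch with no default: must name every variant,
   so adding PLAN_SORT to the enum breaks the build at this dispatch site. */
typedef enum { PLAN_SCAN, PLAN_FILTER, PLAN_AGG } plan_kind;

static const char *plan_name(plan_kind k) {
    switch (k) {                 /* deliberately no default: */
    case PLAN_SCAN:   return "scan";
    case PLAN_FILTER: return "filter";
    case PLAN_AGG:    return "agg";
    }
    return "?";                  /* unreachable; silences -Wreturn-type */
}
```

The compiler, not a code review, is what guarantees every dispatch site stays in sync with the enum.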
Every error path reports a real error (via arena_set_error with an appropriate SQLSTATE). Partial output, silent empty results for queries that should match, and wrong answers are all bugs with the same priority as crashes.
The abstraction problem is getting worse, not better. Each generation of tooling adds a new layer: the database driver, the ORM, the query builder, the API client, the caching layer. At the database layer, production systems now run planners that generate machine-readable hints that feed back into planners. The stack is no longer navigable by a single engineer.
This project is an existence proof in the other direction. ~45,100 lines of C11. Zero external dependencies. The entire system fits in a single developer’s head. It outperforms PostgreSQL on every cached analytical workload and DuckDB on 61 of 93 benchmarks. The performance is not the point—it is evidence that the design works.
The full execution path from TCP bytes to DataRow bytes is traceable in
roughly 500 lines across pgwire.c, plan.c,
and table.c. When a query fails, the answer is in those files.
That is the point.