Design philosophy

Why abstraction is not free

The modern programming landscape has moved steadily toward more abstraction. ORMs wrap SQL. Query builders wrap ORMs. Connection pools wrap drivers. Frameworks wrap frameworks. At the database layer: buffer pools manage page files, WAL layers serialize writes, MVCC layers version every row, query planners rewrite your joins before executing them.

Each layer is individually justifiable. Together, they make it difficult to answer a simple question: why is this query slow? The answer might be in the ORM, the connection pool, the query planner, the buffer pool, the WAL checkpoint, or the MVCC vacuum process. The stack is too deep to hold in your head.

This engine is a counterexample: an in-memory SQL engine with no buffer pool, no WAL, no MVCC, no query planner, and no external dependencies. Every query flows through four stages implemented in four files. The full execution path from TCP bytes to DataRow bytes is traceable in roughly 500 lines across three files. When something is slow, the answer is in the code.

So what: four decisions, four measured outcomes

1. No allocator in the hot path

Most database engines call malloc/free per query, per row, or per expression evaluation. Each call acquires a lock, searches a free list, and potentially triggers a system call. Under concurrent load, the allocator becomes a bottleneck.

The bump allocator (arena.h) handles all per-query scratch memory. After the first query warms up the slab chain, subsequent queries perform zero malloc/free calls. query_arena_reset() rewinds the bump pointer to zero—no free, no lock, no fragmentation. Hash tables for joins and aggregation, sort index arrays, and plan nodes are all bump-allocated from arena.scratch.

Measured result: aggregate queries run 8–10× faster than PostgreSQL and DuckDB. The gap is not algorithmic—the same GROUP BY algorithm runs in both. The gap is allocator overhead.

2. One data layout, not two

PostgreSQL stores rows as heap tuples: variable-length byte sequences with attribute offsets. Reading a single column from 5,000 rows means jumping to 5,000 non-contiguous memory locations. Modern CPUs prefetch ahead, but only if access is sequential. Random pointer chasing defeats prefetching.

The engine stores data in a row-oriented format (table.rows) for writes, but maintains a columnar scan cache (table.scan_cache) for reads. The first scan after a write pays a one-time O(N) conversion cost to build flat typed arrays per column. Every subsequent scan reads contiguous memory. For a 5,000-row, 3-column table: 32 KB of sequential reads instead of 5,000 pointer dereferences.

Measured result: full scans run 32% faster than PostgreSQL, and Parquet aggregate queries run 2.8× faster than DuckDB on cached workloads, because the result cache serves them without touching Parquet at all.

3. Block size matched to cache size

The plan executor processes data in 1,024-row col_block chunks (block.h). A 4-column INT block occupies 4 × 1,024 × 4 = 16 KB. L1 data cache on a modern CPU is 32–48 KB. The entire working set for a filter pass or aggregate accumulation fits in L1.

This is not a coincidence. 1,024 rows was chosen so that common analytical workloads (4–8 INT columns) fit in L1. The filter inner loop runs entirely from registers and L1—no cache misses during the hot path.

Measured result: a correlated subquery runs in 22 ms versus PostgreSQL's 3,665 ms (166×). The subquery executes once per outer row; with L1 residency, each inner scan is effectively free.

4. One syscall per 64 KB, not one per row

A naive wire protocol implementation calls write() once per result row. For a 5,000-row result, that is 5,000 system calls—each one crossing the kernel boundary, saving and restoring register state, and returning. This was the original bottleneck for ORDER BY workloads (4.65× slower than PostgreSQL).

try_plan_send() in pgwire.c serializes col_block arrays directly into a 64 KB wire buffer, then flushes with a single write(). For a 5,000-row result: 5 syscalls instead of 5,000. The columnar data flows from scan cache straight to the TCP socket with no row materialization in between.

Measured result: ORDER BY workload went from 4.65× slower to 1.18× PostgreSQL after this change alone. Mixed analytical throughput reaches 94,525 QPS—190× PostgreSQL—because the result cache serves subsequent identical queries as a single send_all() of cached bytes.

Now what: what this means for the codebase

A system with fewer abstractions is not just faster—it is easier to debug. The call stack for a failing query bottoms out at a function you can read in full in a few minutes. There is no query planner that rewrote your join. There is no buffer pool eviction racing your scan. There is no MVCC garbage collection running concurrently with your transaction.

This has practical consequences for the code itself: a flat call graph means a short stack trace, and a short stack trace means every performance question has a local answer.

Why now

The abstraction problem is getting worse, not better. Each generation of tooling adds a new layer: the database driver, the ORM, the query builder, the API client, the caching layer. At the database layer, production systems now run query planners whose machine-readable hints feed back into the planner itself. The stack is no longer navigable by a single engineer.

This project is an existence proof in the other direction. ~45,100 lines of C11. Zero external dependencies. The entire system fits in a single developer’s head. It outperforms PostgreSQL on every cached analytical workload and DuckDB on 61 of 93 benchmarks. The performance is not the point—it is evidence that the design works.

The full execution path from TCP bytes to DataRow bytes is traceable in roughly 500 lines across pgwire.c, plan.c, and table.c. When a query fails, the answer is in those files. That is the point.

Explore further

Architecture  ·  Benchmarks  ·  Construction methodology  ·  Source on GitHub  ·  Try it in the browser