Most database engines accumulate abstraction layers over decades: a buffer pool to manage disk pages, a WAL for durability, an MVCC layer for concurrent isolation, a query planner that rewrites your joins. Each layer adds correctness surface area and runtime overhead. mskql has none of these layers—not as a shortcut, but as a deliberate constraint.
The result: zero steady-state malloc/free calls per query,
a scan path whose working set fits in L1 cache, a call stack shallow enough to hold
in your head, and benchmarks that outperform PostgreSQL on every cached workload
and DuckDB on 61 of 93. ~45.1K lines of C. Zero external dependencies.
Four stages: wire protocol, parser, executor, storage.
When a query fails, the stack trace bottoms out at code you can read. There is no planner that rewrote your join. There is no buffer pool eviction racing your scan. The debuggability is a direct consequence of the same design decisions that make it fast.
Every query passes through four stages:
| Stage | Source | Description |
|---|---|---|
| Wire protocol | `pgwire.c` | Accepts TCP connections on port 5433, handles SSL negotiation, implements the Simple Query, Extended Query (prepared statements, portals, `$1`/`$2` parameter substitution), and COPY protocols (tab-delimited and CSV), intercepts `information_schema` and `pg_catalog` queries for tool compatibility, and serializes result rows |
| SQL parser | `parser.c` | Hand-written recursive-descent parser producing a typed `struct query` AST, no generators. ~7,600 lines with proper operator precedence, CTEs, window functions, set operations, and `generate_series()` table functions |
| Query executor | `query.c`, `database.c` | Block-oriented plan executor (`plan.c`, ~11,500 lines) handles most queries with vectorized operators: scan, filter (branchless vectorized comparison loops for INT/BIGINT/FLOAT), project, sort (radix sort for integers, composite multi-key radix, Top-N via binary heap), hash join (with `flat_col` dynamic columnar storage), hash aggregation, index scan, window functions, set operations, CTEs, Parquet foreign-table scans, correlated subquery decorrelation, and `generate_series()`. A scan cache with generation tracking skips redundant scans. Direct columnar-to-wire serialization bypasses row materialization. Complex queries (ROLLUP/CUBE) fall back to the legacy row-at-a-time evaluator (`query.c`, `database.c`) |
| Storage | `table.c`, `row.c`, `index.c`, `datetime.c` | Tables are arrays of rows; rows are arrays of typed cells. Native integer storage for temporal types (DATE as `int32`, TIMESTAMP/TIME as `int64` microseconds, INTERVAL as a 16-byte struct). Binary UUID storage (two `uint64`). ENUM stored as an `int32` ordinal. VECTOR(n) stored as contiguous `float[]` arrays (n × 4 bytes per row). B-tree indexes (single and composite multi-column) for equality lookups. Lazy copy-on-write snapshots for transaction rollback |
The engine uses six distinct memory regions, each with a different lifetime and allocation strategy. Understanding how they interact explains why the system achieves zero steady-state malloc/free calls and why queries outrun PostgreSQL on every cached benchmark.
| Region | Backing | Lifetime | Location |
|---|---|---|---|
| Row store | malloc'd dynamic arrays of rows, each a dynamic array of typed cells | Table lifetime | `table.rows` |
| Scan cache | Heap-allocated flat typed arrays per column | Invalidated on `generation++` | `table.scan_cache` |
| Arena pools | 17 dynamic arrays of typed structs, indexed by `uint32_t` | Connection-scoped, reset per query | `client_state.arena` |
| Bump slab chain | Linked list of fixed-size slabs (4 KB, doubling) | Reset per query (rewind, no free) | `arena.bump`, `arena.scratch`, `arena.result_text` |
| Block executor | 1,024-row `col_block` arrays, bump-allocated | Per-query (freed with scratch reset) | `plan_exec_ctx` via `arena.scratch` |
| Result cache | 8,192-slot hash table of serialized wire bytes | Invalidated on any write (`total_generation`) | `g_rcache[]` global |
A SELECT query touches all six regions in sequence.
Following a single query through the system reveals why each region
exists and how they eliminate overhead at every stage.
**1. Wire → Arena.** The pgwire layer receives raw SQL bytes over TCP. The parser allocates AST nodes into the arena's 17 typed pools (expressions, conditions, join descriptors, aggregate definitions), each referenced by `uint32_t` index, not pointer. String literals and identifiers go into the bump slab. After the first query warms up the dynamic-array capacities and bump slabs, subsequent queries reuse all existing memory: `query_arena_reset()` sets counts to zero and rewinds bump pointers without calling `free()`.
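The reset discipline can be sketched in a few lines. The struct shapes and field names here are illustrative, not the actual mskql definitions:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bump slab: rewound on reset, never freed mid-session. */
struct bump_slab {
    char   *base;
    size_t  cap;
    size_t  used;
};

/* Hypothetical typed pool: count resets to zero, capacity is kept. */
struct expr_pool {
    struct expr { int kind; uint32_t lhs, rhs; } *items;
    uint32_t count, cap;
};

struct arena {
    struct expr_pool exprs;
    struct bump_slab bump;
};

static char *bump_alloc(struct bump_slab *s, size_t n)
{
    if (s->used + n > s->cap)
        return NULL;            /* real code would chain a new slab here */
    char *p = s->base + s->used;
    s->used += n;
    return p;
}

/* Reset between queries: O(1), no free() — warmed capacity is reused. */
static void query_arena_reset(struct arena *a)
{
    a->exprs.count = 0;
    a->bump.used = 0;
}
```

Because the reset is just a counter rewind, a warmed-up connection never touches the allocator again.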
**2. Arena → Plan.** `plan_build_select()` reads the arena AST and allocates plan nodes into `arena.plan_nodes`. Auxiliary arrays (sort column indices, projection column maps, hash-join key mappings) are bump-allocated from `arena.scratch`. The plan builder is composable: `build_join()`, `build_aggregate()`, `build_window()`, and `build_set_op()` each return a `struct plan_result` with a status (`PLAN_OK`, `PLAN_NOTIMPL`, or `PLAN_ERROR`). Unimplemented features fall back to the legacy row-at-a-time executor automatically.
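A minimal sketch of that status-driven fallback contract. The enum names come from the text above; the `build_rollup()` and `try_plan_or_fallback()` helpers are hypothetical:

```c
#include <stdint.h>

enum plan_status { PLAN_OK, PLAN_NOTIMPL, PLAN_ERROR };

struct plan_result {
    enum plan_status status;
    uint32_t root;           /* index into a plan-node pool, not a pointer */
};

/* ROLLUP/CUBE are not in the block executor: decline, don't fail. */
static struct plan_result build_rollup(void)
{
    return (struct plan_result){ .status = PLAN_NOTIMPL, .root = 0 };
}

static int try_plan_or_fallback(struct plan_result r)
{
    switch (r.status) {
    case PLAN_OK:      return 1;   /* run the block executor */
    case PLAN_NOTIMPL: return 0;   /* fall back to row-at-a-time */
    case PLAN_ERROR:   return -1;  /* surface the error to the client */
    }
    return -1;
}
```

The key design point is that `PLAN_NOTIMPL` is an expected outcome, not an error: it routes the query to the legacy evaluator without the caller special-casing each feature.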
**3. Row store → Scan cache.** When `seq_scan_next()` runs, it checks `scan_cache.generation` against `table.generation`. On a cache hit, it copies a 1,024-row slice from the pre-built flat arrays into a `col_block`: a single `memcpy` per column. On a cache miss (first scan, or after an INSERT/UPDATE/DELETE bumped the generation), it rebuilds the cache by walking the row store once, an O(N) cost amortized across all subsequent queries until the next write.
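The generation handshake between a table and its scan cache might look roughly like this (struct shapes and names are illustrative):

```c
#include <stdint.h>

/* Any write bumps table.generation, so the next scan sees a mismatch
 * and rebuilds the flat columnar arrays exactly once. */
struct scan_cache { uint64_t generation; int built; };
struct table      { uint64_t generation; struct scan_cache cache; };

static int scan_cache_valid(const struct table *t)
{
    return t->cache.built && t->cache.generation == t->generation;
}

static void scan_cache_rebuild(struct table *t)
{
    /* real code walks the row store once, filling flat typed arrays */
    t->cache.built = 1;
    t->cache.generation = t->generation;
}

static void table_write(struct table *t)
{
    t->generation++;   /* invalidates the cache lazily — no eager work */
}
```

Invalidation is a single integer increment; no cache memory is freed or rebuilt until a scan actually needs it.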
**4. Scan cache → Block executor.** Filter, sort, hash join, hash aggregation, window functions, and set operations all process 1,024-row `col_block` chunks. A filter produces a selection vector (an array of active row indices) instead of copying data. Hash tables for joins and aggregation, sort index arrays, and window-function accumulators are all bump-allocated from `arena.scratch`: zero malloc calls during execution.
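A branchless selection-vector filter of the kind described for the INT path can be sketched like this (function name and types are illustrative):

```c
#include <stdint.h>

/* Instead of an if-per-row, write the candidate index unconditionally
 * and let the comparison result (0 or 1) advance the output cursor.
 * The result is a selection vector of passing row indices — no data
 * is copied, and the loop has no unpredictable branch. */
static int filter_gt_i32(const int32_t *col, int n, int32_t bound,
                         uint16_t *sel_out)
{
    int k = 0;
    for (int i = 0; i < n; i++) {
        sel_out[k] = (uint16_t)i;
        k += (col[i] > bound);   /* predicate becomes an increment */
    }
    return k;                    /* number of selected rows */
}
```

Downstream operators then iterate `sel_out[0..k)` against the same column arrays, so a selective filter costs almost nothing in memory traffic.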
**5. Block executor → Wire.** `try_plan_send()` serializes `col_block` arrays directly into a 64 KB wire buffer, then flushes each full buffer with a single `write()` system call. No row materialization occurs: data flows columnar from the scan cache straight to the TCP socket. For a 5,000-row result, this means 5 system calls instead of 5,000.
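The batching logic can be sketched as follows, with a flush counter standing in for the actual `write()` syscall (names and layout are illustrative):

```c
#include <stddef.h>
#include <string.h>

#define WIRE_BUF_CAP (64 * 1024)   /* matches the 64 KB wire buffer */

/* DataRow messages accumulate in one buffer; flush happens only when
 * it fills (or at end of result), so syscalls scale with buffer fills,
 * not with rows. */
struct wire_buf {
    char   buf[WIRE_BUF_CAP];
    size_t used;
    int    flushes;   /* stands in for write() syscalls */
};

static void wire_flush(struct wire_buf *w)
{
    if (w->used == 0) return;
    /* real code: write(fd, w->buf, w->used) — one syscall per flush */
    w->flushes++;
    w->used = 0;
}

static void wire_append(struct wire_buf *w, const void *msg, size_t len)
{
    if (w->used + len > WIRE_BUF_CAP)
        wire_flush(w);
    memcpy(w->buf + w->used, msg, len);
    w->used += len;
}
```

With 64-byte rows, the 64 KB buffer holds 1,024 rows per fill, which is where the "5 syscalls for 5,000 rows" figure comes from.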
**6. Wire → Result cache.** The serialized wire bytes (RowDescription + DataRows + CommandComplete) are stored in `g_rcache`, an 8,192-slot hash table keyed by SQL hash and `total_generation` (the sum of all table generations). The next identical query skips parsing, planning, and execution entirely: a single `send_all()` of cached bytes. The result cache is invalidated on any write to any table.
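A sketch of such a hash-plus-generation keyed cache probe. The slot layout and helper names are assumptions, not the actual `g_rcache` code:

```c
#include <stddef.h>
#include <stdint.h>

#define RCACHE_SLOTS 8192   /* matches the 8,192-slot table */

/* A slot is a hit only if both the SQL hash and total_generation
 * (sum of all table generations) match. Since any write bumps some
 * table's generation, every cached entry is implicitly invalidated
 * by any write — no explicit eviction pass is needed. */
struct rcache_slot {
    uint64_t    sql_hash;
    uint64_t    total_generation;
    const char *wire_bytes;   /* serialized RowDescription+DataRows+... */
    size_t      wire_len;
};

static struct rcache_slot g_rcache[RCACHE_SLOTS];

static const char *rcache_lookup(uint64_t sql_hash, uint64_t total_gen)
{
    struct rcache_slot *s = &g_rcache[sql_hash % RCACHE_SLOTS];
    if (s->wire_bytes && s->sql_hash == sql_hash &&
        s->total_generation == total_gen)
        return s->wire_bytes;   /* hit: one send, no parse/plan/exec */
    return NULL;
}

static void rcache_store(uint64_t sql_hash, uint64_t total_gen,
                         const char *bytes, size_t len)
{
    struct rcache_slot *s = &g_rcache[sql_hash % RCACHE_SLOTS];
    *s = (struct rcache_slot){ sql_hash, total_gen, bytes, len };
}
```

Folding the generation sum into the key trades a few false misses after writes for an invalidation scheme with zero bookkeeping cost.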
**Zero steady-state allocation.** After the first query warms up dynamic-array capacities and bump slabs, subsequent queries perform zero malloc/free calls. The parser, plan builder, and executor all reuse existing memory. This eliminates allocator contention and fragmentation entirely.
**Scan cache eliminates row-store overhead.** Converting 5,000 rows × 3 columns from a row-of-cells layout to flat typed arrays costs ~0.1 ms once. Subsequent scans read 32 KB of contiguous data instead of chasing 5,000 pointers through individually allocated cell structures.
**1,024-row blocks fit in L1 cache.** A 4-column INT block occupies 4 × 1,024 × 4 bytes = 16 KB. L1 data cache on modern CPUs is typically 32–48 KB, so the entire working set for a filter or aggregate pass fits in L1, eliminating cache misses during the inner loop.
**Batched wire send.** Accumulating DataRow messages into a 64 KB buffer and flushing with one `write()` call eliminates one system call per row. For a 5,000-row result: 5 `write()` calls instead of 5,000. This alone closed the full-scan performance gap with PostgreSQL. Write coalescing (`TCP_NODELAY` + `TCP_NOPUSH`/`TCP_CORK` + a 256 KB buffer) further reduces syscall overhead.
**Result cache.** Repeated identical SELECTs skip parsing, planning, and execution entirely. The throughput benchmark's analytical CTE workload runs at 94,525 QPS (190× PostgreSQL) because the result is served from cached wire bytes in microseconds. On Parquet workloads, the cache makes mskql faster than DuckDB on all 9 cached queries.
**Lazy copy-on-write transactions.** BEGIN records table names and generation numbers: O(1). The first write to a table triggers a `table_deep_copy()` of that table only: O(N) for that table. COMMIT is free (discard the snapshot). ROLLBACK swaps the saved copy back. This is why the transaction benchmark runs at 0.86× PostgreSQL despite mskql having no WAL.
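The lazy snapshot protocol, reduced to a single-field "table" for illustration (all names here are hypothetical, not the actual mskql structures):

```c
#include <stdint.h>

struct table { int data; uint64_t generation; };

/* BEGIN records nothing but intent; the deep copy is deferred to the
 * first write, so read-only transactions never pay for a snapshot. */
struct txn {
    struct table *t;
    struct table  snapshot;
    int           snapshotted;
};

static void txn_begin(struct txn *x, struct table *t)
{
    x->t = t;
    x->snapshotted = 0;          /* O(1): no copying yet */
}

static void txn_write(struct txn *x, int value)
{
    if (!x->snapshotted) {       /* first write: deep-copy this table */
        x->snapshot = *x->t;
        x->snapshotted = 1;
    }
    x->t->data = value;
    x->t->generation++;
}

/* COMMIT would simply discard the snapshot; ROLLBACK restores it. */
static void txn_rollback(struct txn *x)
{
    if (x->snapshotted)
        *x->t = x->snapshot;     /* swap the saved copy back */
}
```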
**JPL-style ownership.** The allocating module deallocates. The arena owns all parser output. Bump slabs own all executor scratch state. Each table owns its rows and scan cache. No module ever frees memory it did not allocate.
**The `arena_owns_text` flag.** When set on result rows, text cells live in the `result_text` bump slab. On query completion, all result text is bulk-freed by rewinding the bump pointer, with no per-cell `free()` needed. When the flag is not set (legacy path), each cell's text is individually `free()`'d.
**No cross-region pointers.** Plan nodes reference arena items by `uint32_t` index, not pointer. The bump slab chain never moves old slabs (new slabs are appended), so pointers into any slab remain valid for the lifetime of the query. This eliminates an entire class of use-after-free bugs.
**Dynamic-array macro (`dynamic_array.h`).** Type-safe `DYNAMIC_ARRAY(T)` with `da_push`, `da_reset`, and `da_free`. The backing array doubles on overflow. Used for table rows, arena pools, and all variable-length internal structures.
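One plausible shape for such a macro family; the real `dynamic_array.h` may differ in detail:

```c
#include <stdlib.h>

/* A type-safe growable array: an anonymous struct parameterized on T,
 * with count/capacity and geometric (doubling) growth on overflow. */
#define DYNAMIC_ARRAY(T) struct { T *items; size_t count, cap; }

#define da_push(da, item)                                              \
    do {                                                               \
        if ((da)->count == (da)->cap) {                                \
            (da)->cap = (da)->cap ? (da)->cap * 2 : 8;                 \
            (da)->items = realloc((da)->items,                         \
                                  (da)->cap * sizeof *(da)->items);    \
        }                                                              \
        (da)->items[(da)->count++] = (item);                           \
    } while (0)

/* Reset keeps capacity (the warm-up property); free releases it. */
#define da_reset(da) ((da)->count = 0)
#define da_free(da)  (free((da)->items), (da)->items = NULL,           \
                      (da)->count = (da)->cap = 0)
```

`da_reset` keeping capacity is what makes the per-query arena reset allocation-free after warm-up.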
**Zero-copy string views (`stringview.h`).** `sv` is a `{const char *data, size_t len}` pair. The parser operates entirely on string views into the original SQL text, making no copies until a value must outlive the input buffer.
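A minimal `sv` sketch (the helper names beyond `sv` itself are assumptions):

```c
#include <stddef.h>
#include <string.h>

/* A pointer/length pair into the original SQL text: slicing and
 * comparison without allocating or copying. */
typedef struct { const char *data; size_t len; } sv;

static sv sv_from(const char *s) { return (sv){ s, strlen(s) }; }

static int sv_eq(sv a, const char *s)
{
    return a.len == strlen(s) && memcmp(a.data, s, a.len) == 0;
}

/* Slice a token out of a query without allocating. */
static sv sv_slice(sv s, size_t off, size_t len)
{
    return (sv){ s.data + off, len };
}
```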
**Expression evaluator (`eval_expr()`).** A tagged-union AST (`struct expr`) supporting `CAST`/`::`, date/time arithmetic, 30+ built-in scalar functions (math, string, temporal), `CASE WHEN`, and subqueries.
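A miniature of the tagged-union walk, with arena-pool indices instead of child pointers. Only three expression kinds are shown; the real `struct expr` is far richer:

```c
#include <stdint.h>

enum expr_kind { EXPR_LIT_INT, EXPR_ADD, EXPR_MUL };

/* Children are uint32_t indices into the expression pool, matching
 * the no-cross-region-pointers rule described above. */
struct expr {
    enum expr_kind kind;
    int64_t  lit;        /* EXPR_LIT_INT */
    uint32_t lhs, rhs;   /* pool indices for binary kinds */
};

static int64_t eval_expr(const struct expr *pool, uint32_t idx)
{
    const struct expr *e = &pool[idx];
    switch (e->kind) {
    case EXPR_LIT_INT: return e->lit;
    case EXPR_ADD: return eval_expr(pool, e->lhs) + eval_expr(pool, e->rhs);
    case EXPR_MUL: return eval_expr(pool, e->lhs) * eval_expr(pool, e->rhs);
    }
    return 0;
}
```

With `-Wswitch-enum` and no `default:` label, adding a fourth `expr_kind` makes this switch a compile error rather than a silently skipped case.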
The benchmark numbers are a consequence of the architecture, not a separate optimization effort. Each structural decision eliminates a specific overhead:
- **malloc/free calls per query.** The bump allocator rewinds to zero on each reset: no free, no fragmentation, no allocator lock contention. Result: aggregate queries run 8–10× faster than PostgreSQL and DuckDB.
- **`write()` syscalls per row.** For a 5,000-row result: 5 syscalls instead of 5,000. This alone closed the full-scan gap with PostgreSQL.
Every test runs under AddressSanitizer. Arena allocation and index-based references
eliminate use-after-free and leak classes by construction. The bump slab chain’s
append-only growth means pointers never dangle within a query’s lifetime.
The `-Wswitch-enum` flag plus the rule against `default:` labels means that adding a new enum variant produces a compile error at every unhandled dispatch site, not a silent runtime skip.