Most database engines accumulate abstraction layers over decades: a buffer pool to manage disk pages, a WAL for durability, an MVCC layer for concurrent isolation, a query planner that rewrites your joins. Each layer adds correctness surface area and runtime overhead. mskql has none of these layers—not as a shortcut, but as a deliberate constraint.
The result: zero steady-state malloc/free calls per query,
a scan path whose working set fits in L1 cache, a call stack shallow enough to hold
in your head, and benchmarks that outperform PostgreSQL on every cached workload
and DuckDB on 61 of 93. ~45.1K lines of C. Zero external dependencies.
Four stages: wire protocol, parser, executor, storage.
When a query fails, the stack trace bottoms out at code you can read. There is no planner that rewrote your join. There is no buffer pool eviction racing your scan. The debuggability is a direct consequence of the same design decisions that make it fast.
Every query passes through four stages:
| Stage | Source | Description |
|---|---|---|
| Wire protocol | `pgwire.c` | Accepts TCP connections on port 5433, handles SSL negotiation, implements the Simple Query, Extended Query (prepared statements, portals, `$1`/`$2` parameter substitution), and COPY protocols (tab-delimited and CSV), intercepts `information_schema` and `pg_catalog` queries for tool compatibility, and serializes result rows |
| SQL parser | `parser.c` | Hand-written recursive-descent parser producing a typed `struct query` AST, no generators. ~7,600 lines with proper operator precedence, CTEs, window functions, set operations, and `generate_series()` table functions |
| Query executor | `query.c`, `database.c` | Block-oriented plan executor (`plan.c`, ~11,500 lines) handles most queries with vectorized operators: scan, filter (branchless vectorized comparison loops for INT/BIGINT/FLOAT), project, sort (radix sort for integers, composite multi-key radix, Top-N via binary heap), hash join (with `flat_col` dynamic columnar storage), hash aggregation, index scan, window functions, set operations, CTEs, Parquet foreign-table scans, correlated subquery decorrelation, and `generate_series()`. A scan cache with generation tracking skips redundant scans. Direct columnar-to-wire serialization bypasses row materialization. Complex queries (ROLLUP/CUBE) fall back to the legacy row-at-a-time evaluator (`query.c`, `database.c`) |
| Storage | `table.c`, `row.c`, `index.c`, `datetime.c` | Tables are arrays of rows; rows are arrays of typed cells. Native integer storage for temporal types (DATE as `int32`, TIMESTAMP/TIME as `int64` microseconds, INTERVAL as a 16-byte struct). Binary UUID storage (two `uint64`). ENUM stored as an `int32` ordinal. VECTOR(n) stored as contiguous `float[]` arrays (n × 4 bytes per row). B-tree indexes (single and composite multi-column) for equality lookups. Lazy copy-on-write snapshots for transaction rollback |
The engine uses six distinct memory regions, each with a different lifetime and allocation strategy. Understanding how they interact explains why the system achieves zero steady-state malloc/free calls and why queries outrun PostgreSQL on every cached benchmark.
| Region | Backing | Lifetime | Location |
|---|---|---|---|
| Row store | malloc'd dynamic arrays of rows, each a dynamic array of typed cells | Table lifetime | `table.rows` |
| Scan cache | Heap-allocated flat typed arrays per column | Invalidated on `generation++` | `table.scan_cache` |
| Arena pools | 17 dynamic arrays of typed structs, indexed by `uint32_t` | Connection-scoped, reset per query | `client_state.arena` |
| Bump slab chain | Linked list of fixed-size slabs (4 KB, doubling) | Reset per query (rewind, no free) | `arena.bump`, `arena.scratch`, `arena.result_text` |
| Block executor | 1,024-row `col_block` arrays, bump-allocated | Per-query (freed with scratch reset) | `plan_exec_ctx` via `arena.scratch` |
| Result cache | 8,192-slot hash table of serialized wire bytes | Invalidated on any write (`total_generation`) | `g_rcache[]` global |
A SELECT query touches all six regions in sequence.
Following a single query through the system reveals why each region
exists and how they eliminate overhead at every stage.
**1. Wire → Arena.** The pgwire layer receives raw SQL bytes over TCP. The parser allocates AST nodes into the arena's 17 typed pools (expressions, conditions, join descriptors, aggregate definitions), each referenced by `uint32_t` index, not pointer. String literals and identifiers go into the bump slab. After the first query warms up the dynamic-array capacities and bump slabs, subsequent queries reuse all existing memory: `query_arena_reset()` sets counts to zero and rewinds bump pointers without calling `free()`.
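The reset discipline can be sketched in a few lines. The struct shapes and field names here are illustrative, not the actual mskql definitions:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bump slab: rewound on reset, never freed mid-session. */
struct bump_slab {
    char   *base;
    size_t  cap;
    size_t  used;
};

/* Hypothetical typed pool: count resets to zero, capacity is kept. */
struct expr_pool {
    struct expr { int kind; uint32_t lhs, rhs; } *items;
    uint32_t count, cap;
};

struct arena {
    struct expr_pool exprs;
    struct bump_slab bump;
};

static char *bump_alloc(struct bump_slab *s, size_t n)
{
    if (s->used + n > s->cap)
        return NULL;            /* real code would chain a new slab here */
    char *p = s->base + s->used;
    s->used += n;
    return p;
}

/* Reset between queries: O(1), no free() — warmed capacity is reused. */
static void query_arena_reset(struct arena *a)
{
    a->exprs.count = 0;
    a->bump.used = 0;
}
```

Because the reset is just a counter rewind, a warmed-up connection never touches the allocator again.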
**2. Arena → Plan.** `plan_build_select()` reads the arena AST and allocates plan nodes into `arena.plan_nodes`. Auxiliary arrays (sort column indices, projection column maps, hash-join key mappings) are bump-allocated from `arena.scratch`. The plan builder is composable: `build_join()`, `build_aggregate()`, `build_window()`, and `build_set_op()` each return a `struct plan_result` with a status (`PLAN_OK`, `PLAN_NOTIMPL`, or `PLAN_ERROR`). Unimplemented features fall back to the legacy row-at-a-time executor automatically.
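A minimal sketch of that status-driven fallback contract. The enum names come from the text above; the `build_rollup()` and `try_plan_or_fallback()` helpers are hypothetical:

```c
#include <stdint.h>

enum plan_status { PLAN_OK, PLAN_NOTIMPL, PLAN_ERROR };

struct plan_result {
    enum plan_status status;
    uint32_t root;           /* index into a plan-node pool, not a pointer */
};

/* ROLLUP/CUBE are not in the block executor: decline, don't fail. */
static struct plan_result build_rollup(void)
{
    return (struct plan_result){ .status = PLAN_NOTIMPL, .root = 0 };
}

static int try_plan_or_fallback(struct plan_result r)
{
    switch (r.status) {
    case PLAN_OK:      return 1;   /* run the block executor */
    case PLAN_NOTIMPL: return 0;   /* fall back to row-at-a-time */
    case PLAN_ERROR:   return -1;  /* surface the error to the client */
    }
    return -1;
}
```

The key design point is that `PLAN_NOTIMPL` is an expected outcome, not an error: it routes the query to the legacy evaluator without the caller special-casing each feature.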
**3. Row store → Scan cache.** When `seq_scan_next()` runs, it checks `scan_cache.generation` against `table.generation`. On a cache hit, it copies a 1,024-row slice from the pre-built flat arrays into a `col_block`: a single `memcpy` per column. On a cache miss (first scan, or after an INSERT/UPDATE/DELETE bumped the generation), it rebuilds the cache by walking the row store once, an O(N) cost amortized across all subsequent queries until the next write.
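The generation handshake between a table and its scan cache might look roughly like this (struct shapes and names are illustrative):

```c
#include <stdint.h>

/* Any write bumps table.generation, so the next scan sees a mismatch
 * and rebuilds the flat columnar arrays exactly once. */
struct scan_cache { uint64_t generation; int built; };
struct table      { uint64_t generation; struct scan_cache cache; };

static int scan_cache_valid(const struct table *t)
{
    return t->cache.built && t->cache.generation == t->generation;
}

static void scan_cache_rebuild(struct table *t)
{
    /* real code walks the row store once, filling flat typed arrays */
    t->cache.built = 1;
    t->cache.generation = t->generation;
}

static void table_write(struct table *t)
{
    t->generation++;   /* invalidates the cache lazily — no eager work */
}
```

Invalidation is a single integer increment; no cache memory is freed or rebuilt until a scan actually needs it.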
**4. Scan cache → Block executor.** Filter, sort, hash join, hash aggregation, window functions, and set operations all process 1,024-row `col_block` chunks. A filter produces a selection vector (an array of active row indices) instead of copying data. Hash tables for joins and aggregation, sort index arrays, and window-function accumulators are all bump-allocated from `arena.scratch`: zero malloc calls during execution.
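A branchless selection-vector filter of the kind described for the INT path can be sketched like this (function name and types are illustrative):

```c
#include <stdint.h>

/* Instead of an if-per-row, write the candidate index unconditionally
 * and let the comparison result (0 or 1) advance the output cursor.
 * The result is a selection vector of passing row indices — no data
 * is copied, and the loop has no unpredictable branch. */
static int filter_gt_i32(const int32_t *col, int n, int32_t bound,
                         uint16_t *sel_out)
{
    int k = 0;
    for (int i = 0; i < n; i++) {
        sel_out[k] = (uint16_t)i;
        k += (col[i] > bound);   /* predicate becomes an increment */
    }
    return k;                    /* number of selected rows */
}
```

Downstream operators then iterate `sel_out[0..k)` against the same column arrays, so a selective filter costs almost nothing in memory traffic.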
**5. Block executor → Wire.** `try_plan_send()` serializes `col_block` arrays directly into a 64 KB wire buffer, then flushes each full buffer with a single `write()` system call. No row materialization occurs: data flows columnar from the scan cache straight to the TCP socket. For a 5,000-row result, this means 5 system calls instead of 5,000.
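The batching logic can be sketched as follows, with a flush counter standing in for the actual `write()` syscall (names and layout are illustrative):

```c
#include <stddef.h>
#include <string.h>

#define WIRE_BUF_CAP (64 * 1024)   /* matches the 64 KB wire buffer */

/* DataRow messages accumulate in one buffer; flush happens only when
 * it fills (or at end of result), so syscalls scale with buffer fills,
 * not with rows. */
struct wire_buf {
    char   buf[WIRE_BUF_CAP];
    size_t used;
    int    flushes;   /* stands in for write() syscalls */
};

static void wire_flush(struct wire_buf *w)
{
    if (w->used == 0) return;
    /* real code: write(fd, w->buf, w->used) — one syscall per flush */
    w->flushes++;
    w->used = 0;
}

static void wire_append(struct wire_buf *w, const void *msg, size_t len)
{
    if (w->used + len > WIRE_BUF_CAP)
        wire_flush(w);
    memcpy(w->buf + w->used, msg, len);
    w->used += len;
}
```

With 64-byte rows, the 64 KB buffer holds 1,024 rows per fill, which is where the "5 syscalls for 5,000 rows" figure comes from.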
**6. Wire → Result cache.** The serialized wire bytes (RowDescription + DataRows + CommandComplete) are stored in `g_rcache`, an 8,192-slot hash table keyed by SQL hash and `total_generation` (the sum of all table generations). The next identical query skips parsing, planning, and execution entirely: a single `send_all()` of cached bytes. The result cache is invalidated on any write to any table.
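A sketch of such a hash-plus-generation keyed cache probe. The slot layout and helper names are assumptions, not the actual `g_rcache` code:

```c
#include <stddef.h>
#include <stdint.h>

#define RCACHE_SLOTS 8192   /* matches the 8,192-slot table */

/* A slot is a hit only if both the SQL hash and total_generation
 * (sum of all table generations) match. Since any write bumps some
 * table's generation, every cached entry is implicitly invalidated
 * by any write — no explicit eviction pass is needed. */
struct rcache_slot {
    uint64_t    sql_hash;
    uint64_t    total_generation;
    const char *wire_bytes;   /* serialized RowDescription+DataRows+... */
    size_t      wire_len;
};

static struct rcache_slot g_rcache[RCACHE_SLOTS];

static const char *rcache_lookup(uint64_t sql_hash, uint64_t total_gen)
{
    struct rcache_slot *s = &g_rcache[sql_hash % RCACHE_SLOTS];
    if (s->wire_bytes && s->sql_hash == sql_hash &&
        s->total_generation == total_gen)
        return s->wire_bytes;   /* hit: one send, no parse/plan/exec */
    return NULL;
}

static void rcache_store(uint64_t sql_hash, uint64_t total_gen,
                         const char *bytes, size_t len)
{
    struct rcache_slot *s = &g_rcache[sql_hash % RCACHE_SLOTS];
    *s = (struct rcache_slot){ sql_hash, total_gen, bytes, len };
}
```

Folding the generation sum into the key trades a few false misses after writes for an invalidation scheme with zero bookkeeping cost.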
**Zero steady-state allocation.** After the first query warms up dynamic-array capacities and bump slabs, subsequent queries perform zero malloc/free calls. The parser, plan builder, and executor all reuse existing memory. This eliminates allocator contention and fragmentation entirely.
**Scan cache eliminates row-store overhead.** Converting 5,000 rows × 3 columns from a row-of-cells layout to flat typed arrays costs ~0.1 ms once. Subsequent scans read 32 KB of contiguous data instead of chasing 5,000 pointers through individually allocated cell structures.
**1,024-row blocks fit in L1 cache.** A 4-column INT block occupies 4 × 1,024 × 4 bytes = 16 KB. L1 data cache on modern CPUs is typically 32–48 KB, so the entire working set for a filter or aggregate pass fits in L1, eliminating cache misses during the inner loop.
**Batched wire send.** Accumulating DataRow messages into a 64 KB buffer and flushing with one `write()` call eliminates one system call per row. For a 5,000-row result: 5 `write()` calls instead of 5,000. This alone closed the full-scan performance gap with PostgreSQL. Write coalescing (`TCP_NODELAY` + `TCP_NOPUSH`/`TCP_CORK` + a 256 KB buffer) further reduces syscall overhead.
**Result cache.** Repeated identical SELECTs skip parsing, planning, and execution entirely. The throughput benchmark's analytical CTE workload runs at 94,525 QPS (190× PostgreSQL) because the result is served from cached wire bytes in microseconds. On Parquet workloads, the cache makes mskql faster than DuckDB on all 9 cached queries.
**Lazy copy-on-write transactions.** BEGIN records table names and generation numbers: O(1). The first write to a table triggers a `table_deep_copy()` of that table only: O(N) for that table. COMMIT is free (discard the snapshot). ROLLBACK swaps the saved copy back. This is why the transaction benchmark runs at 0.86× PostgreSQL despite mskql having no WAL.
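The lazy snapshot protocol, reduced to a single-field "table" for illustration (all names here are hypothetical, not the actual mskql structures):

```c
#include <stdint.h>

struct table { int data; uint64_t generation; };

/* BEGIN records nothing but intent; the deep copy is deferred to the
 * first write, so read-only transactions never pay for a snapshot. */
struct txn {
    struct table *t;
    struct table  snapshot;
    int           snapshotted;
};

static void txn_begin(struct txn *x, struct table *t)
{
    x->t = t;
    x->snapshotted = 0;          /* O(1): no copying yet */
}

static void txn_write(struct txn *x, int value)
{
    if (!x->snapshotted) {       /* first write: deep-copy this table */
        x->snapshot = *x->t;
        x->snapshotted = 1;
    }
    x->t->data = value;
    x->t->generation++;
}

/* COMMIT would simply discard the snapshot; ROLLBACK restores it. */
static void txn_rollback(struct txn *x)
{
    if (x->snapshotted)
        *x->t = x->snapshot;     /* swap the saved copy back */
}
```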
**JPL-style ownership.** The allocating module deallocates. The arena owns all parser output. Bump slabs own all executor scratch state. Each table owns its rows and scan cache. No module ever frees memory it did not allocate.
**The `arena_owns_text` flag.** When set on result rows, text cells live in the `result_text` bump slab. On query completion, all result text is bulk-freed by rewinding the bump pointer, with no per-cell `free()` needed. When the flag is not set (legacy path), each cell's text is individually `free()`'d.
**No cross-region pointers.** Plan nodes reference arena items by `uint32_t` index, not pointer. The bump slab chain never moves old slabs (new slabs are appended), so pointers into any slab remain valid for the lifetime of the query. This eliminates an entire class of use-after-free bugs.
**Dynamic-array macro (`dynamic_array.h`).** Type-safe `DYNAMIC_ARRAY(T)` with `da_push`, `da_reset`, and `da_free`. The backing array doubles on overflow. Used for table rows, arena pools, and all variable-length internal structures.
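One plausible shape for such a macro family; the real `dynamic_array.h` may differ in detail:

```c
#include <stdlib.h>

/* A type-safe growable array: an anonymous struct parameterized on T,
 * with count/capacity and geometric (doubling) growth on overflow. */
#define DYNAMIC_ARRAY(T) struct { T *items; size_t count, cap; }

#define da_push(da, item)                                              \
    do {                                                               \
        if ((da)->count == (da)->cap) {                                \
            (da)->cap = (da)->cap ? (da)->cap * 2 : 8;                 \
            (da)->items = realloc((da)->items,                         \
                                  (da)->cap * sizeof *(da)->items);    \
        }                                                              \
        (da)->items[(da)->count++] = (item);                           \
    } while (0)

/* Reset keeps capacity (the warm-up property); free releases it. */
#define da_reset(da) ((da)->count = 0)
#define da_free(da)  (free((da)->items), (da)->items = NULL,           \
                      (da)->count = (da)->cap = 0)
```

`da_reset` keeping capacity is what makes the per-query arena reset allocation-free after warm-up.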
**Zero-copy string views (`stringview.h`).** `sv` is a `{const char *data, size_t len}` pair. The parser operates entirely on string views into the original SQL text, making no copies until a value must outlive the input buffer.
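A minimal `sv` sketch (the helper names beyond `sv` itself are assumptions):

```c
#include <stddef.h>
#include <string.h>

/* A pointer/length pair into the original SQL text: slicing and
 * comparison without allocating or copying. */
typedef struct { const char *data; size_t len; } sv;

static sv sv_from(const char *s) { return (sv){ s, strlen(s) }; }

static int sv_eq(sv a, const char *s)
{
    return a.len == strlen(s) && memcmp(a.data, s, a.len) == 0;
}

/* Slice a token out of a query without allocating. */
static sv sv_slice(sv s, size_t off, size_t len)
{
    return (sv){ s.data + off, len };
}
```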
**Expression evaluator (`eval_expr()`).** A tagged-union AST (`struct expr`) supporting `CAST`/`::`, date/time arithmetic, 30+ built-in scalar functions (math, string, temporal), `CASE WHEN`, and subqueries.
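A miniature of the tagged-union walk, with arena-pool indices instead of child pointers. Only three expression kinds are shown; the real `struct expr` is far richer:

```c
#include <stdint.h>

enum expr_kind { EXPR_LIT_INT, EXPR_ADD, EXPR_MUL };

/* Children are uint32_t indices into the expression pool, matching
 * the no-cross-region-pointers rule described above. */
struct expr {
    enum expr_kind kind;
    int64_t  lit;        /* EXPR_LIT_INT */
    uint32_t lhs, rhs;   /* pool indices for binary kinds */
};

static int64_t eval_expr(const struct expr *pool, uint32_t idx)
{
    const struct expr *e = &pool[idx];
    switch (e->kind) {
    case EXPR_LIT_INT: return e->lit;
    case EXPR_ADD: return eval_expr(pool, e->lhs) + eval_expr(pool, e->rhs);
    case EXPR_MUL: return eval_expr(pool, e->lhs) * eval_expr(pool, e->rhs);
    }
    return 0;
}
```

With `-Wswitch-enum` and no `default:` label, adding a fourth `expr_kind` makes this switch a compile error rather than a silently skipped case.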
The benchmark numbers are a consequence of the architecture, not a separate optimization effort. Each structural decision eliminates a specific overhead:
- **malloc/free calls per query.** The bump allocator rewinds to zero on each reset: no free, no fragmentation, no allocator lock contention. Result: aggregate queries run 8–10× faster than PostgreSQL and DuckDB.
- **`write()` syscalls per row.** For a 5,000-row result: 5 syscalls instead of 5,000. This alone closed the full-scan gap with PostgreSQL.
Every test runs under AddressSanitizer. Arena allocation and index-based references
eliminate use-after-free and leak classes by construction. The bump slab chain’s
append-only growth means pointers never dangle within a query’s lifetime.
The `-Wswitch-enum` flag plus the rule against `default:` labels means that adding a new enum variant produces a compile error at every unhandled dispatch site, not a silent runtime skip.