Architecture

Simple by design. Fast because of it.

Aggregates 20× faster than PostgreSQL. Correlated subqueries 826× faster. Dashboard-style throughput at 97,000 queries per second. These numbers are not the result of exotic algorithms—they come from removing the layers that make traditional databases slow: buffer pools, WAL, MVCC, and query planners.

What remains is a four-stage pipeline—wire protocol, parser, executor, storage—in ~55K lines of C with zero external dependencies. Every query flows from TCP bytes to results through code you can read end-to-end.

When a query fails, the stack trace bottoms out in plain C you can step through: no planner rewrote your join, and no buffer pool eviction is racing your scan. The debuggability is a direct consequence of the same design decisions that make it fast.

The pipeline

Every query passes through four stages:

| Stage | Source | Description |
| --- | --- | --- |
| Wire protocol | pgwire.c | Accepts TCP connections on port 5433, handles SSL negotiation, implements the Simple Query, Extended Query (prepared statements, portals, $1/$2 parameter substitution), and COPY protocols (tab-delimited and CSV), intercepts information_schema and pg_catalog queries for tool compatibility, and serializes result rows |
| SQL parser | parser.c | Hand-written recursive-descent parser producing a typed struct query AST (no generators). ~7,600 lines with proper operator precedence, CTEs, window functions, set operations, and generate_series() table functions |
| Query executor | query.c, database.c | Block-oriented plan executor (plan.c, ~11,500 lines) handles most queries with vectorized operators: scan, filter (branchless vectorized comparison loops for INT/BIGINT/FLOAT), project, sort (radix sort for integers, composite multi-key radix, Top-N via binary heap), hash join (with flat_col dynamic columnar storage), hash aggregation, index scan, window functions, set operations, CTEs, Parquet foreign-table scans, correlated subquery decorrelation, and generate_series(). A scan cache with generation tracking skips redundant scans. Direct columnar-to-wire serialization bypasses row materialization. Complex queries (ROLLUP/CUBE) fall back to the legacy row-at-a-time evaluator (query.c, database.c) |
| Storage | table.c, row.c, index.c, datetime.c | Tables are arrays of rows; rows are arrays of typed cells. Native integer storage for temporal types (DATE as int32, TIMESTAMP/TIME as int64 microseconds, INTERVAL as 16-byte struct). Binary UUID storage (two uint64). ENUM stored as int32 ordinal. VECTOR(n) stored as contiguous float[] arrays (n × 4 bytes per row). B-tree indexes (single and composite multi-column) for equality lookups. Lazy copy-on-write snapshots for transaction rollback |

Memory model

After the first query warms up, every subsequent query allocates zero bytes of heap memory. No malloc, no free, no allocator contention. This is the single largest reason aggregates run 20× faster than PostgreSQL—both databases use the same GROUP BY algorithm, but PostgreSQL allocates and frees memory on every query. Six memory regions make this possible:

The six regions

| Region | Backing | Lifetime | Location |
| --- | --- | --- | --- |
| Row store | malloc'd dynamic arrays of rows, each a dynamic array of typed cells | Table lifetime | table.rows |
| Scan cache | Heap-allocated flat typed arrays per column | Invalidated on generation++ | table.scan_cache |
| Arena pools | 17 dynamic arrays of typed structs, indexed by uint32_t | Connection-scoped, reset per query | client_state.arena |
| Bump slab chain | Linked list of fixed-size slabs (4 KB → doubling) | Reset per query (rewind, no free) | arena.bump, arena.scratch, arena.result_text |
| Block executor | 1,024-row col_block arrays, bump-allocated | Per-query (freed with scratch reset) | plan_exec_ctx via arena.scratch |
| Result cache | 8,192-slot hash table of serialized wire bytes | Invalidated on any write (total_generation) | g_rcache[] global |

What happens when you run a query

A SELECT touches all six regions in sequence. Each handoff eliminates an overhead that traditional databases pay on every query.

1. Wire → Arena. The pgwire layer receives raw SQL bytes over TCP. The parser allocates AST nodes into the arena’s 17 typed pools—expressions, conditions, join descriptors, aggregate definitions—each referenced by uint32_t index, not pointer. String literals and identifiers go into the bump slab. After the first query warms up the dynamic-array capacities and bump slabs, subsequent queries reuse all existing memory: query_arena_reset() sets counts to zero and rewinds bump pointers without calling free().
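The index-not-pointer pool pattern can be sketched as below. This is a minimal illustration, not the mskql source: the struct names and field layout are assumptions. The key property is that allocation hands back a uint32_t index into a typed array, and reset only zeroes the count, so a warmed-up pool never calls malloc or free again.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch of one typed arena pool. AST nodes live in a
 * dynamic array and are referenced by uint32_t index rather than
 * pointer; reset zeroes the count but keeps the capacity, so the
 * next query reuses the same memory. */
struct expr { int kind; int32_t ival; };

struct expr_pool {
    struct expr *items;
    uint32_t     count;
    uint32_t     cap;
};

static uint32_t expr_alloc(struct expr_pool *p, int kind, int32_t ival) {
    if (p->count == p->cap) {                  /* grow by doubling */
        p->cap   = p->cap ? p->cap * 2 : 16;
        p->items = realloc(p->items, p->cap * sizeof *p->items);
    }
    p->items[p->count] = (struct expr){ kind, ival };
    return p->count++;                         /* handle = index, not pointer */
}

static void pool_reset(struct expr_pool *p) {
    p->count = 0;    /* no free(): the warmed-up capacity survives */
}
```

Because handles are indices, a realloc that moves the backing array cannot invalidate them, which is what lets the pool grow safely mid-parse.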

2. Arena → Plan. plan_build_select() reads the arena AST and allocates plan nodes into arena.plan_nodes. Auxiliary arrays—sort column indices, projection column maps, hash-join key mappings—are bump-allocated from arena.scratch. The plan builder is composable: build_join(), build_aggregate(), build_window(), and build_set_op() each return a struct plan_result with status (PLAN_OK, PLAN_NOTIMPL, or PLAN_ERROR). Unimplemented features fall back to the legacy row-at-a-time executor automatically.
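The builder contract above can be shown in miniature. The enum values come from the text; the build_aggregate body and dispatch helper are hypothetical stand-ins. The point is that an unsupported shape returns PLAN_NOTIMPL rather than an error, and the caller routes it to the legacy executor.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the composable plan-builder contract: each build_* returns
 * a status, and PLAN_NOTIMPL sends the query to the legacy
 * row-at-a-time executor instead of failing. */
enum plan_status { PLAN_OK, PLAN_NOTIMPL, PLAN_ERROR };

struct plan_result {
    enum plan_status status;
    uint32_t         root;    /* index into arena.plan_nodes on PLAN_OK */
};

static struct plan_result build_aggregate(int has_rollup) {
    if (has_rollup)           /* ROLLUP/CUBE: not in the block executor */
        return (struct plan_result){ PLAN_NOTIMPL, 0 };
    return (struct plan_result){ PLAN_OK, 1 };
}

/* 1 = block executor handled it, 0 = fall back to legacy, -1 = error */
static int dispatch(struct plan_result r) {
    switch (r.status) {
    case PLAN_OK:      return 1;
    case PLAN_NOTIMPL: return 0;
    default:           return -1;
    }
}
```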

3. Row store → Scan cache. When seq_scan_next() runs, it checks scan_cache.generation against table.generation. On a cache hit, it copies a 1,024-row slice from the pre-built flat arrays into a col_block—a single memcpy per column. On a cache miss (first scan, or after an INSERT/UPDATE/DELETE bumped the generation), it rebuilds the cache by walking the row store once—O(N) cost amortized across all subsequent queries until the next write.
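The generation check can be sketched as follows; field names are guesses at the shape, not the real structs. A write bumps the table's counter, so the next scan sees a mismatch and pays the one-time rebuild.

```c
#include <assert.h>
#include <stdint.h>

/* Generation-tracking sketch: any write bumps table.generation, so the
 * next scan detects staleness and rebuilds the flat arrays exactly once. */
struct scan_cache { uint64_t generation; int built; int rebuilds; };
struct table      { uint64_t generation; struct scan_cache cache; };

static void table_write(struct table *t) {
    t->generation++;          /* INSERT/UPDATE/DELETE invalidate lazily */
}

static void scan_begin(struct table *t) {
    if (!t->cache.built || t->cache.generation != t->generation) {
        /* walk the row store once, building flat typed arrays (elided) */
        t->cache.generation = t->generation;
        t->cache.built = 1;
        t->cache.rebuilds++;
    }
    /* cache hit: memcpy 1,024-row slices straight into col_blocks */
}
```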

4. Scan cache → Block executor. Filter, sort, hash join, hash aggregation, window functions, and set operations all process 1,024-row col_block chunks. A filter produces a selection vector (an array of active row indices) instead of copying data. Hash tables for joins and aggregation, sort index arrays, and window-function accumulators are all bump-allocated from arena.scratch—zero malloc calls during execution.
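A selection-vector filter in the branchless style the text describes might look like this (illustrative, not the mskql source). Every iteration unconditionally writes the row index; the comparison result, 0 or 1, decides whether the write cursor advances, so the loop body has no branch for the predicate.

```c
#include <assert.h>
#include <stdint.h>

/* Branchless selection-vector sketch: no data is copied; downstream
 * operators read only the surviving row indices in sel[]. */
static int filter_gt_i32(const int32_t *col, int n, int32_t lo, uint16_t *sel) {
    int out = 0;
    for (int i = 0; i < n; i++) {
        sel[out] = (uint16_t)i;
        out += (col[i] > lo);   /* predicate result advances the cursor */
    }
    return out;                 /* count of rows that passed */
}
```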

5. Block executor → Wire. try_plan_send() serializes col_block arrays directly into a 64 KB wire buffer, then flushes with a single write() system call. No row materialization occurs—data flows columnar from the scan cache straight to the TCP socket. For a 5,000-row result, this means 5 system calls instead of 5,000.
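The batching arithmetic can be made concrete with a sketch; struct and function names here are illustrative, and a counter stands in for the write() syscall. Bytes accumulate until the 64 KB buffer would overflow, then one flush empties it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Batched-send sketch: DataRow bytes accumulate in a 64 KB buffer, and
 * only a full buffer (or end of result) triggers a write. */
enum { WIRE_BUF = 65536 };

struct wire { unsigned char buf[WIRE_BUF]; size_t len; int writes; };

static void wire_flush(struct wire *w) {
    if (w->len == 0) return;
    /* write(fd, w->buf, w->len) in the real server */
    w->writes++;
    w->len = 0;
}

static void wire_put(struct wire *w, const void *bytes, size_t n) {
    if (w->len + n > WIRE_BUF)
        wire_flush(w);
    memcpy(w->buf + w->len, bytes, n);
    w->len += n;
}
```

With 5,000 rows of 60 bytes each, 1,092 rows fit per buffer, so the loop flushes four times and the final flush makes five write() calls in total.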

6. Wire → Result cache. The serialized wire bytes (RowDescription + DataRows + CommandComplete) are stored in g_rcache, an 8,192-slot hash table keyed by SQL hash and total_generation (the sum of all table generations). The next identical query skips parsing, planning, and execution entirely: a single send_all() of cached bytes. The result cache is invalidated on any write to any table.
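The two-part cache key can be sketched as below; field names are illustrative and the cached wire bytes are elided. A hit requires both the SQL hash and the global write counter to match, which is why bumping total_generation on any write invalidates every entry at once without touching the table.

```c
#include <assert.h>
#include <stdint.h>

/* Result-cache sketch keyed by SQL hash plus total_generation. */
enum { RCACHE_SLOTS = 8192 };

struct rcache_entry {
    uint64_t sql_hash;
    uint64_t gen;        /* total_generation at store time */
    int      valid;      /* serialized wire bytes elided */
};

static struct rcache_entry g_rcache[RCACHE_SLOTS];
static uint64_t total_generation;

static struct rcache_entry *rcache_lookup(uint64_t sql_hash) {
    struct rcache_entry *e = &g_rcache[sql_hash % RCACHE_SLOTS];
    if (e->valid && e->sql_hash == sql_hash && e->gen == total_generation)
        return e;        /* hit: one send_all() of cached bytes */
    return 0;
}

static void rcache_store(uint64_t sql_hash) {
    g_rcache[sql_hash % RCACHE_SLOTS] =
        (struct rcache_entry){ sql_hash, total_generation, 1 };
}
```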

Performance implications

Zero steady-state allocation. After the first query warms up dynamic-array capacities and bump slabs, subsequent queries perform zero malloc/free calls. The parser, plan builder, and executor all reuse existing memory. This eliminates allocator contention and fragmentation entirely.

Scan cache eliminates row-store overhead. Converting 5,000 rows × 3 columns from a row-of-cells layout to flat typed arrays costs ~0.1 ms once. Subsequent scans read 32 KB of contiguous data instead of chasing 5,000 pointers through individually-allocated cell structures.

1,024-row blocks fit L1 cache. A 4-column INT block occupies 4 × 1,024 × 4 = 16 KB. L1 data cache on modern CPUs is typically 32–48 KB. The entire working set for a filter or aggregate pass fits in L1, eliminating cache misses during the inner loop.

Batched wire send. Accumulating DataRow messages into a 64 KB buffer and flushing with one write() call eliminates one system call per row. For a 5,000-row result: 5 write() calls instead of 5,000. This alone closed the full-scan performance gap with PostgreSQL. Write coalescing (TCP_NODELAY + TCP_NOPUSH/CORK + 256 KB buffer) further reduces syscall overhead.

Result cache. Repeated identical SELECTs skip parsing, planning, and execution entirely. The throughput benchmark’s analytical CTE workload runs at 96,007 QPS (192× PostgreSQL) because the result is served from cached wire bytes in microseconds. On Parquet workloads, the cache makes mskql faster than DuckDB on 7 of 9 cached queries.

Lazy copy-on-write transactions. BEGIN records table names and generation numbers—O(1). The first write to a table triggers a table_deep_copy() of that table only—O(N) for that table. COMMIT is free (discard the snapshot). ROLLBACK swaps the saved copy back. This is why the transaction benchmark runs at 0.86× PostgreSQL despite mskql having no WAL.
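The lifecycle above can be sketched with a one-table "database" (illustrative only; the real table_deep_copy() duplicates rows, not a single struct). BEGIN is O(1), the first write pays for the copy, COMMIT discards it, and ROLLBACK swaps it back.

```c
#include <assert.h>
#include <stdint.h>

/* Lazy copy-on-write transaction sketch. */
struct table { int nrows; uint64_t generation; };
struct txn   { struct table snapshot; int have_snapshot; };

static void txn_begin(struct txn *tx)  { tx->have_snapshot = 0; }  /* O(1) */
static void txn_commit(struct txn *tx) { tx->have_snapshot = 0; }  /* free */

static void before_write(struct txn *tx, struct table *t) {
    if (!tx->have_snapshot) {      /* first write: copy this table only */
        tx->snapshot = *t;         /* stands in for table_deep_copy() */
        tx->have_snapshot = 1;
    }
}

static void txn_rollback(struct txn *tx, struct table *t) {
    if (tx->have_snapshot) *t = tx->snapshot;  /* swap saved copy back */
    tx->have_snapshot = 0;
}
```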

Ownership rules

JPL-style ownership. The allocating module deallocates. The arena owns all parser output. Bump slabs own all executor scratch state. Each table owns its rows and scan cache. No module ever frees memory it did not allocate.

arena_owns_text flag. When set on result rows, text cells live in the result_text bump slab. On query completion, all result text is bulk-freed by rewinding the bump pointer—no per-cell free() needed. When the flag is not set (legacy path), each cell’s text is individually free()’d.

No cross-region pointers. Plan nodes reference arena items by uint32_t index, not pointer. The bump slab chain never moves old slabs (new slabs are appended), so pointers into any slab remain valid for the lifetime of the query. This eliminates an entire class of use-after-free bugs.
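The append-only slab property can be demonstrated with a minimal bump allocator (an illustrative shape, not the mskql implementation). A full slab is never resized or moved; a new slab is simply linked in front, so pointers handed out earlier remain valid until the per-query rewind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Append-only slab-chain sketch: old slabs are never touched. */
struct slab { struct slab *next; size_t used, cap; unsigned char data[]; };
struct bump { struct slab *head; };

static void *bump_alloc(struct bump *b, size_t n) {
    if (!b->head || b->head->used + n > b->head->cap) {
        size_t cap = b->head ? b->head->cap * 2 : 4096;  /* 4 KB, doubling */
        if (cap < n) cap = n;
        struct slab *s = malloc(sizeof *s + cap);
        s->next = b->head;     /* chain grows; nothing is moved */
        s->used = 0;
        s->cap  = cap;
        b->head = s;
    }
    void *p = b->head->data + b->head->used;
    b->head->used += n;
    return p;
}
```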

Internal abstractions

Dynamic-array macro (dynamic_array.h): type-safe DYNAMIC_ARRAY(T) with da_push, da_reset, da_free. Backing array doubles on overflow. Used for table rows, arena pools, and all variable-length internal structures.
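A plausible shape for that macro pair is sketched below; the real dynamic_array.h may differ in detail. The backing array doubles on overflow, and da_reset keeps the capacity for reuse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical DYNAMIC_ARRAY(T)/da_push/da_reset sketch. */
#define DYNAMIC_ARRAY(T) struct { T *items; size_t count, cap; }

#define da_push(da, v)                                                  \
    do {                                                                \
        if ((da)->count == (da)->cap) {                                 \
            (da)->cap   = (da)->cap ? (da)->cap * 2 : 8;                \
            (da)->items = realloc((da)->items,                          \
                                  (da)->cap * sizeof *(da)->items);     \
        }                                                               \
        (da)->items[(da)->count++] = (v);                               \
    } while (0)

#define da_reset(da) ((void)((da)->count = 0))  /* reuse, don't free */
```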

Zero-copy string views (stringview.h): sv is a {const char *data, size_t len} pair. The parser operates entirely on string views into the original SQL text—no copies until a value must outlive the input buffer.
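A view-based keyword test shows the zero-copy idea; the sv struct matches the {data, len} pair described above, while the helper name is illustrative. The comparison reads the original SQL text in place, with no allocation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Zero-copy string-view sketch. */
typedef struct { const char *data; size_t len; } sv;

/* Case-insensitive compare of a view against a NUL-terminated keyword. */
static int sv_eq_ci(sv a, const char *kw) {
    size_t n = strlen(kw);
    if (a.len != n) return 0;
    for (size_t i = 0; i < n; i++)
        if (tolower((unsigned char)a.data[i]) != tolower((unsigned char)kw[i]))
            return 0;
    return 1;
}
```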

Expression evaluator (eval_expr()): a tagged-union AST (struct expr) supporting CAST/::, date/time arithmetic, 30+ built-in scalar functions (math, string, temporal), CASE WHEN, and subqueries.

Why it’s fast: architecture as cause

Every benchmark result traces directly to a structural decision. There is no separate "optimization layer"; the architecture is the optimization.

→ Full benchmark results  ·  Design philosophy: why abstraction is not free

Memory safety without runtime cost

The same design that makes queries fast also makes them safe. Arena allocation and index-based references eliminate use-after-free and leak classes by construction—no garbage collector, no reference counting. Every test runs under AddressSanitizer. The bump slab chain’s append-only growth means pointers never dangle within a query’s lifetime.

Explore further

Benchmarks  ·  Design philosophy  ·  Benchmark workload details  ·  SQL grammar  ·  Source on GitHub  ·  Try it in the browser