Architecture

Simple by design. Fast because of it.

Most database engines accumulate abstraction layers over decades: a buffer pool to manage disk pages, a WAL for durability, an MVCC layer for concurrent isolation, a query planner that rewrites your joins. Each layer adds correctness surface area and runtime overhead. mskql has none of these layers—not as a shortcut, but as a deliberate constraint.

The result: zero steady-state malloc/free calls per query, a scan path whose working set fits in L1 cache, a call stack shallow enough to hold in your head, and benchmarks that outperform PostgreSQL on every cached workload and DuckDB on 61 of 93. ~45.1K lines of C. Zero external dependencies. Four stages: wire protocol, parser, executor, storage.

When a query fails, the stack trace bottoms out at code you can read. There is no planner that rewrote your join. There is no buffer pool eviction racing your scan. The debuggability is a direct consequence of the same design decisions that make it fast.

The pipeline

Every query passes through four stages:

Wire protocol (pgwire.c). Accepts TCP connections on port 5433, handles SSL negotiation, implements the Simple Query, Extended Query (prepared statements, portals, $1/$2 parameter substitution), and COPY protocols (tab-delimited and CSV), intercepts information_schema and pg_catalog queries for tool compatibility, and serializes result rows.

SQL parser (parser.c). Hand-written recursive-descent parser producing a typed struct query AST—no generators. ~7,600 lines with proper operator precedence, CTEs, window functions, set operations, and generate_series() table functions.

Query executor (query.c, database.c). Block-oriented plan executor (plan.c, ~11,500 lines) handles most queries with vectorized operators: scan, filter (branchless vectorized comparison loops for INT/BIGINT/FLOAT), project, sort (radix sort for integers, composite multi-key radix, Top-N via binary heap), hash join (with flat_col dynamic columnar storage), hash aggregation, index scan, window functions, set operations, CTEs, Parquet foreign-table scans, correlated subquery decorrelation, and generate_series(). A scan cache with generation tracking skips redundant scans. Direct columnar-to-wire serialization bypasses row materialization. Complex queries (ROLLUP/CUBE) fall back to the legacy row-at-a-time evaluator (query.c, database.c).

Storage (table.c, row.c, index.c, datetime.c). Tables are arrays of rows; rows are arrays of typed cells. Native integer storage for temporal types (DATE as int32, TIMESTAMP/TIME as int64 microseconds, INTERVAL as 16-byte struct). Binary UUID storage (two uint64). ENUM stored as int32 ordinal. VECTOR(n) stored as contiguous float[] arrays (n × 4 bytes per row). B-tree indexes (single and composite multi-column) for equality lookups. Lazy copy-on-write snapshots for transaction rollback.
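
The storage layer's "rows are arrays of typed cells" design can be sketched as a tagged union; the names below are illustrative, not the real table.c definitions. Storing DATE as a native int32 means date arithmetic is plain integer arithmetic on the stored representation:

```c
#include <stdint.h>

/* Illustrative sketch of typed-cell storage: a row is an array of tagged
 * cells, and temporal types are stored as native integers (DATE as int32
 * days since epoch, TIMESTAMP as int64 microseconds). Hypothetical names. */
typedef enum { CELL_INT, CELL_BIGINT, CELL_DATE, CELL_TIMESTAMP, CELL_TEXT } cell_type;

typedef struct {
    cell_type type;
    union {
        int32_t i32;        /* INT, DATE (days since epoch) */
        int64_t i64;        /* BIGINT, TIMESTAMP (microseconds) */
        const char *text;   /* TEXT (borrowed from a bump slab) */
    } v;
} cell;

typedef struct { cell *cells; uint32_t ncells; } row;

/* DATE + n days is one integer add; no calendar struct is materialized. */
int32_t date_add_days(cell c, int32_t days) { return c.v.i32 + days; }
```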

Memory model

The engine uses six distinct memory regions, each with a different lifetime and allocation strategy. Understanding how they interact explains why the system achieves zero steady-state malloc/free calls and why it outperforms PostgreSQL on every cached workload.

The six regions

Row store (table.rows). Malloc’d dynamic arrays of rows, each a dynamic array of typed cells. Lives for the table lifetime.

Scan cache (table.scan_cache). Heap-allocated flat typed arrays per column. Invalidated on generation++.

Arena pools (client_state.arena). 17 dynamic arrays of typed structs, indexed by uint32_t. Connection-scoped, reset per query.

Bump slab chain (arena.bump, arena.scratch, arena.result_text). Linked list of fixed-size slabs (4 KB, doubling). Reset per query (rewind, no free).

Block executor (plan_exec_ctx via arena.scratch). 1,024-row col_block arrays, bump-allocated. Per-query; freed with the scratch reset.

Result cache (g_rcache[] global). An 8,192-slot hash table of serialized wire bytes. Invalidated on any write (total_generation).
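
The bump slab chain's discipline—fixed-size slabs linked together, pointer-bump allocation, and a reset that rewinds without freeing—can be sketched as follows. All names are hypothetical:

```c
#include <stdlib.h>
#include <stddef.h>

/* Minimal bump allocator in the spirit of the arena's slab chain. */
typedef struct slab {
    struct slab *next;
    size_t used, cap;
    unsigned char data[];       /* flexible array: the slab's storage */
} slab;

typedef struct { slab *head; size_t slab_size; } bump;

static slab *slab_new(size_t cap) {
    slab *s = malloc(sizeof(slab) + cap);
    s->next = NULL; s->used = 0; s->cap = cap;
    return s;
}

void *bump_alloc(bump *b, size_t n) {
    n = (n + 7) & ~(size_t)7;                 /* 8-byte alignment */
    if (!b->head || b->head->used + n > b->head->cap) {
        size_t cap = b->slab_size;
        while (cap < n) cap *= 2;             /* doubling for large requests */
        slab *s = slab_new(cap);
        s->next = b->head;                    /* old slabs never move */
        b->head = s;
    }
    void *p = b->head->data + b->head->used;
    b->head->used += n;
    return p;
}

/* Per-query reset: rewind every slab's cursor, free nothing. */
void bump_reset(bump *b) {
    for (slab *s = b->head; s; s = s->next) s->used = 0;
}
```

Because new slabs are appended rather than reallocated, a pointer into any slab stays valid for the whole query, which is the property the ownership rules below rely on.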

How they interact

A SELECT query touches all six regions in sequence. Following a single query through the system reveals why each region exists and how they eliminate overhead at every stage.

1. Wire → Arena. The pgwire layer receives raw SQL bytes over TCP. The parser allocates AST nodes into the arena’s 17 typed pools—expressions, conditions, join descriptors, aggregate definitions—each referenced by uint32_t index, not pointer. String literals and identifiers go into the bump slab. After the first query warms up the dynamic-array capacities and bump slabs, subsequent queries reuse all existing memory: query_arena_reset() sets counts to zero and rewinds bump pointers without calling free().
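
One of the arena's typed pools can be sketched like this: AST nodes live in a growable array, handles are uint32_t indices rather than pointers, and a reset drops everything by zeroing a count while keeping the capacity warm. The expr/expr_pool names are illustrative:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch of one arena pool: index-based references, reset
 * keeps capacity so the next query performs zero allocations. */
typedef struct { int op; uint32_t lhs, rhs; } expr;   /* index refs, not pointers */

typedef struct { expr *items; uint32_t count, cap; } expr_pool;

uint32_t expr_push(expr_pool *p, expr e) {
    if (p->count == p->cap) {                 /* only grows on first warm-up */
        p->cap = p->cap ? p->cap * 2 : 64;
        p->items = realloc(p->items, p->cap * sizeof(expr));
    }
    p->items[p->count] = e;
    return p->count++;                        /* handle = index into the pool */
}

/* query_arena_reset() analogue: count to zero, memory retained. */
void expr_pool_reset(expr_pool *p) { p->count = 0; }
```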

2. Arena → Plan. plan_build_select() reads the arena AST and allocates plan nodes into arena.plan_nodes. Auxiliary arrays—sort column indices, projection column maps, hash-join key mappings—are bump-allocated from arena.scratch. The plan builder is composable: build_join(), build_aggregate(), build_window(), and build_set_op() each return a struct plan_result with status (PLAN_OK, PLAN_NOTIMPL, or PLAN_ERROR). Unimplemented features fall back to the legacy row-at-a-time executor automatically.

3. Row store → Scan cache. When seq_scan_next() runs, it checks scan_cache.generation against table.generation. On a cache hit, it copies a 1,024-row slice from the pre-built flat arrays into a col_block—a single memcpy per column. On a cache miss (first scan, or after an INSERT/UPDATE/DELETE bumped the generation), it rebuilds the cache by walking the row store once—O(N) cost amortized across all subsequent queries until the next write.
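
The generation check is the whole trick: the table bumps a counter on every write, and the scan path rebuilds its flat column arrays only when the counters disagree. A minimal single-column sketch, with illustrative field names:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int32_t *rows;        /* stand-in for the row store (one INT column) */
    uint32_t nrows;
    uint64_t generation;  /* bumped by INSERT/UPDATE/DELETE */
} table_t;

typedef struct {
    int32_t *flat;        /* contiguous copy of the column */
    uint64_t generation;  /* generation the cache was built at */
    int rebuilds;         /* instrumentation for this sketch */
} scan_cache_t;

const int32_t *scan_column(table_t *t, scan_cache_t *c) {
    if (!c->flat || c->generation != t->generation) {        /* cache miss */
        free(c->flat);
        c->flat = malloc(t->nrows * sizeof(int32_t));
        memcpy(c->flat, t->rows, t->nrows * sizeof(int32_t)); /* O(N) once */
        c->generation = t->generation;
        c->rebuilds++;
    }
    return c->flat;       /* cache hit: no per-row work at all */
}
```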

4. Scan cache → Block executor. Filter, sort, hash join, hash aggregation, window functions, and set operations all process 1,024-row col_block chunks. A filter produces a selection vector (an array of active row indices) instead of copying data. Hash tables for joins and aggregation, sort index arrays, and window-function accumulators are all bump-allocated from arena.scratch—zero malloc calls during execution.
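
A selection-vector filter copies nothing: it emits the indices of rows that pass, and downstream operators iterate that index array. In the branchless form sketched below (a simplified stand-in for the real filter operator), the comparison result itself advances the output cursor:

```c
#include <stdint.h>

enum { BLOCK_ROWS = 1024 };   /* block size from the text */

/* Emit indices of rows where col[i] > threshold; returns the count.
 * The inner loop has no branch: out advances by the 0/1 comparison. */
uint32_t filter_gt(const int32_t *col, uint32_t n, int32_t threshold,
                   uint32_t *sel /* out: selection vector */) {
    uint32_t out = 0;
    for (uint32_t i = 0; i < n; i++) {
        sel[out] = i;
        out += (col[i] > threshold);
    }
    return out;
}
```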

5. Block executor → Wire. try_plan_send() serializes col_block arrays directly into a 64 KB wire buffer, then flushes with a single write() system call. No row materialization occurs—data flows columnar from the scan cache straight to the TCP socket. For a 5,000-row result, this means 5 system calls instead of 5,000.
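
The batching arithmetic can be made concrete with a small sketch: DataRow messages accumulate in a 64 KB buffer, and the flush callback (a stand-in for write() on the socket) fires only when the buffer fills or the result ends. Names are hypothetical:

```c
#include <stddef.h>
#include <string.h>

enum { WIREBUF_CAP = 64 * 1024 };

typedef struct {
    char buf[WIREBUF_CAP];
    size_t len;
    int flushes;                              /* counts "system calls" */
    void (*flush_fn)(const char *, size_t);   /* stand-in for write() */
} wirebuf;

void wirebuf_flush(wirebuf *w) {
    if (w->len == 0) return;
    if (w->flush_fn) w->flush_fn(w->buf, w->len);  /* one write() per flush */
    w->len = 0;
    w->flushes++;
}

void wirebuf_put(wirebuf *w, const void *msg, size_t n) {
    if (w->len + n > WIREBUF_CAP) wirebuf_flush(w);
    memcpy(w->buf + w->len, msg, n);
    w->len += n;
}
```

With 64-byte rows, 1,024 rows fit per buffer, so a 5,000-row result needs five flushes—the same 5-versus-5,000 ratio the text describes.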

6. Wire → Result cache. The serialized wire bytes (RowDescription + DataRows + CommandComplete) are stored in g_rcache, an 8,192-slot hash table keyed by SQL hash and total_generation (the sum of all table generations). The next identical query skips parsing, planning, and execution entirely—a single send_all() of cached bytes. The result cache is invalidated on any write to any table.
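
The keying scheme makes invalidation automatic: a slot is valid only if both the SQL hash and the global write counter match, and any write anywhere bumps total_generation, so a stale result can never be served. A sketch under assumed names (the FNV-1a hash is an illustrative choice, not necessarily the real one):

```c
#include <stdint.h>
#include <stddef.h>

enum { RCACHE_SLOTS = 8192 };

typedef struct {
    uint64_t sql_hash;
    uint64_t total_generation;   /* sum of all table generations at store time */
    const char *wire_bytes;      /* cached serialized result */
    size_t wire_len;
} rcache_slot;

uint64_t fnv1a(const char *s) {
    uint64_t h = 1469598103934665603ULL;
    for (; *s; s++) { h ^= (unsigned char)*s; h *= 1099511628211ULL; }
    return h;
}

const char *rcache_lookup(rcache_slot *cache, const char *sql,
                          uint64_t total_generation, size_t *len) {
    uint64_t h = fnv1a(sql);
    rcache_slot *slot = &cache[h % RCACHE_SLOTS];
    if (slot->wire_bytes && slot->sql_hash == h &&
        slot->total_generation == total_generation) {
        *len = slot->wire_len;
        return slot->wire_bytes;   /* hit: skip parse/plan/execute */
    }
    return NULL;                   /* miss, or invalidated by a write */
}

void rcache_store(rcache_slot *cache, const char *sql,
                  uint64_t total_generation, const char *bytes, size_t len) {
    uint64_t h = fnv1a(sql);
    rcache_slot *slot = &cache[h % RCACHE_SLOTS];
    slot->sql_hash = h;
    slot->total_generation = total_generation;
    slot->wire_bytes = bytes;
    slot->wire_len = len;
}
```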

Performance implications

Zero steady-state allocation. After the first query warms up dynamic-array capacities and bump slabs, subsequent queries perform zero malloc/free calls. The parser, plan builder, and executor all reuse existing memory. This eliminates allocator contention and fragmentation entirely.

Scan cache eliminates row-store overhead. Converting 5,000 rows × 3 columns from a row-of-cells layout to flat typed arrays costs ~0.1 ms once. Subsequent scans read 32 KB of contiguous data instead of chasing 5,000 pointers through individually-allocated cell structures.

1,024-row blocks fit L1 cache. A 4-column INT block occupies 4 × 1,024 × 4 = 16 KB. L1 data cache on modern CPUs is typically 32–48 KB. The entire working set for a filter or aggregate pass fits in L1, eliminating cache misses during the inner loop.

Batched wire send. Accumulating DataRow messages into a 64 KB buffer and flushing with one write() call eliminates one system call per row. For a 5,000-row result: 5 write() calls instead of 5,000. This alone closed the full-scan performance gap with PostgreSQL. Write coalescing (TCP_NODELAY + TCP_NOPUSH/CORK + 256 KB buffer) further reduces syscall overhead.

Result cache. Repeated identical SELECTs skip parsing, planning, and execution entirely. The throughput benchmark’s analytical CTE workload runs at 94,525 QPS (190× PostgreSQL) because the result is served from cached wire bytes in microseconds. On Parquet workloads, the cache makes mskql faster than DuckDB on all 9 cached queries.

Lazy copy-on-write transactions. BEGIN records table names and generation numbers—O(1). The first write to a table triggers a table_deep_copy() of that table only—O(N) for that table. COMMIT is free (discard the snapshot). ROLLBACK swaps the saved copy back. This is why the transaction benchmark runs at 0.86× PostgreSQL despite mskql having no WAL.
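
The laziness described above—O(1) BEGIN, deep copy only on first write, free COMMIT, swap-back ROLLBACK—can be sketched on a single-column table. All names here are illustrative:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int32_t *rows;
    uint32_t nrows;
    uint64_t generation;
} cow_table;

typedef struct {
    cow_table snapshot;        /* valid only while have_snapshot is set */
    uint64_t begin_generation;
    int have_snapshot;
} txn;

void txn_begin(txn *t, const cow_table *tab) {
    t->begin_generation = tab->generation;   /* O(1): no copying yet */
    t->have_snapshot = 0;
}

void txn_write(txn *t, cow_table *tab, uint32_t i, int32_t v) {
    if (!t->have_snapshot) {                 /* first write: deep copy, O(N) */
        t->snapshot = *tab;
        t->snapshot.rows = malloc(tab->nrows * sizeof(int32_t));
        memcpy(t->snapshot.rows, tab->rows, tab->nrows * sizeof(int32_t));
        t->have_snapshot = 1;
    }
    tab->rows[i] = v;
    tab->generation++;
}

void txn_commit(txn *t) {                    /* free: just discard the snapshot */
    if (t->have_snapshot) free(t->snapshot.rows);
    t->have_snapshot = 0;
}

void txn_rollback(txn *t, cow_table *tab) {  /* swap the saved copy back */
    if (t->have_snapshot) {
        free(tab->rows);
        *tab = t->snapshot;
        t->have_snapshot = 0;
    }
}
```

A read-only transaction never triggers the copy at all, which is why BEGIN/COMMIT around pure SELECTs costs essentially nothing.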

Ownership rules

JPL-style ownership. The allocating module deallocates. The arena owns all parser output. Bump slabs own all executor scratch state. Each table owns its rows and scan cache. No module ever frees memory it did not allocate.

arena_owns_text flag. When set on result rows, text cells live in the result_text bump slab. On query completion, all result text is bulk-freed by rewinding the bump pointer—no per-cell free() needed. When the flag is not set (legacy path), each cell’s text is individually free()’d.

No cross-region pointers. Plan nodes reference arena items by uint32_t index, not pointer. The bump slab chain never moves old slabs (new slabs are appended), so pointers into any slab remain valid for the lifetime of the query. This eliminates an entire class of use-after-free bugs.

Internal abstractions

Dynamic-array macro (dynamic_array.h): type-safe DYNAMIC_ARRAY(T) with da_push, da_reset, da_free. Backing array doubles on overflow. Used for table rows, arena pools, and all variable-length internal structures.
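
A type-safe dynamic-array macro in this spirit might look as follows; the macro names match the text, but the bodies are an illustrative sketch rather than the real dynamic_array.h:

```c
#include <stdlib.h>

/* Generate an anonymous struct per element type. */
#define DYNAMIC_ARRAY(T) struct { T *items; size_t count, cap; }

/* Push one element, doubling the backing array on overflow. */
#define da_push(da, item)                                              \
    do {                                                               \
        if ((da)->count == (da)->cap) {                                \
            (da)->cap = (da)->cap ? (da)->cap * 2 : 8;                 \
            (da)->items = realloc((da)->items,                         \
                                  (da)->cap * sizeof *(da)->items);    \
        }                                                              \
        (da)->items[(da)->count++] = (item);                           \
    } while (0)

#define da_reset(da) ((da)->count = 0)        /* keep capacity warm */
#define da_free(da)  (free((da)->items), (da)->items = NULL,           \
                      (da)->count = (da)->cap = 0)
```

The `sizeof *(da)->items` trick keeps the macro type-safe: the element size always comes from the array's own declared type.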

Zero-copy string views (stringview.h): sv is a {const char *data, size_t len} pair. The parser operates entirely on string views into the original SQL text—no copies until a value must outlive the input buffer.
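
Tokenizing on string views means the lexer never copies a byte: each token points into the original SQL text. A sketch with the sv pair from the text plus two hypothetical helpers:

```c
#include <string.h>
#include <stddef.h>

/* The view itself: a pointer into the original SQL text plus a length. */
typedef struct { const char *data; size_t len; } sv;

sv sv_from(const char *s) { return (sv){ s, strlen(s) }; }

int sv_eq(sv a, const char *lit) {
    return a.len == strlen(lit) && memcmp(a.data, lit, a.len) == 0;
}

/* Take the next space-delimited token without copying: both the token and
 * the remainder are views into the same buffer. */
sv sv_next_token(sv *rest) {
    while (rest->len && rest->data[0] == ' ') { rest->data++; rest->len--; }
    size_t n = 0;
    while (n < rest->len && rest->data[n] != ' ') n++;
    sv tok = { rest->data, n };
    rest->data += n; rest->len -= n;
    return tok;
}
```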

Expression evaluator (eval_expr()): a tagged-union AST (struct expr) supporting CAST/::, date/time arithmetic, 30+ built-in scalar functions (math, string, temporal), CASE WHEN, and subqueries.

Performance: architecture as cause

The benchmark numbers are a consequence of the architecture, not a separate optimization effort. Each structural decision eliminates a specific overhead.

→ Full benchmark results  ·  Design philosophy: why abstraction is not free

Safety by construction

Every test runs under AddressSanitizer. Arena allocation and index-based references eliminate use-after-free and leak classes by construction. The bump slab chain’s append-only growth means pointers never dangle within a query’s lifetime. The -Wswitch-enum flag plus the no-default: rule means adding a new enum variant produces a compile error at every unhandled dispatch site—not a silent runtime skip.

Explore further

Benchmarks  ·  Design philosophy  ·  Benchmark workload details  ·  SQL grammar  ·  Source on GitHub  ·  Try it in the browser