93 batch workloads and 15 throughput workloads. Each entry below shows the exact SQL, explains what database capability it exercises, and describes how mskql’s architecture produces the measured result. For the summary numbers, see Benchmarks.
Each batch workload runs a fixed number of iterations of the same query
through a single psql session. The timer measures total
wall-clock time including connection overhead. Source:
bench/bench_vs_pg.py.
SELECT * FROM bench_data;
A table with 5,000 rows and 3 columns (id INT,
category TEXT, value INT). The query
reads every row and sends all columns to the client. Repeated
200 times to amortize connection overhead.
Full scans test raw read throughput: how fast the engine can materialize rows and push them over the wire. This is the baseline for all read workloads. A database that is slow here will be slow everywhere.
mskql’s scan cache converts the row store to flat columnar
arrays once (O(N), ~0.1 ms for 5K rows). Subsequent scans
memcpy 1,024-row slices into col_block
arrays. try_plan_send serializes directly from
columnar blocks into a 64 KB wire buffer—5 system calls
instead of 5,000. No row materialization occurs.
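The scan-cache mechanism can be sketched in C. The struct layout and function names below are illustrative simplifications, not mskql's actual definitions: a one-time O(N) pass flattens the row store into contiguous typed arrays, and later scans `memcpy` fixed-size slices out of them.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_ROWS 1024

/* Illustrative row-store record and columnar scan cache. */
typedef struct { int id; int value; } row_t;

typedef struct {
    int32_t *id_col;     /* flat columnar copy of the id column    */
    int32_t *value_col;  /* flat columnar copy of the value column */
    size_t   nrows;
} scan_cache_t;

/* One-time O(N) conversion: row store -> contiguous columnar arrays. */
static void scan_cache_build(scan_cache_t *c, const row_t *rows, size_t n) {
    c->id_col    = malloc(n * sizeof(int32_t));
    c->value_col = malloc(n * sizeof(int32_t));
    c->nrows = n;
    for (size_t i = 0; i < n; i++) {
        c->id_col[i]    = rows[i].id;
        c->value_col[i] = rows[i].value;
    }
}

/* Subsequent scans memcpy 1,024-row slices out of the cache.
 * Returns the number of rows copied (0 when past the end). */
static size_t scan_cache_slice(const scan_cache_t *c, size_t start,
                               int32_t *out_id, int32_t *out_value) {
    if (start >= c->nrows) return 0;
    size_t n = c->nrows - start;
    if (n > BLOCK_ROWS) n = BLOCK_ROWS;
    memcpy(out_id,    c->id_col + start,    n * sizeof(int32_t));
    memcpy(out_value, c->value_col + start, n * sizeof(int32_t));
    return n;
}
```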
SELECT * FROM bench_data WHERE category = 'cat_42';
Same 5,000-row table. The WHERE clause filters to
rows matching one of 100 categories (~50 rows per category).
Tests predicate evaluation speed. The engine must scan all rows but only emit matching ones. This measures the cost of the filter operator relative to the scan.
The plan executor’s PLAN_FILTER node produces
a selection vector (array of matching row indices) instead of
copying data. Only matching rows are serialized to the wire.
The scan cache hit means the filter operates on contiguous
int32_t[] and char*[] arrays in L1 cache.
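A selection vector, as described above, records matching row indices rather than copying row data. A minimal sketch over a contiguous `int32_t` column (the function name is hypothetical):

```c
#include <stdint.h>
#include <stddef.h>

/* Filter a columnar block by equality, emitting a selection vector:
 * an array of matching row indices. No data is copied; downstream
 * nodes read the underlying column through the indices. */
static size_t filter_eq_i32(const int32_t *col, size_t n,
                            int32_t needle, uint32_t *sel) {
    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        if (col[i] == needle)
            sel[m++] = (uint32_t)i;  /* record the index, not the value */
    return m;                        /* number of matching rows */
}
```

Because the loop touches one contiguous array with no branches besides the predicate, it vectorizes well and stays in L1 cache.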
SELECT * FROM bench_data WHERE id = 42;
Point lookup by primary key (B-tree index). Returns exactly 1 row. Repeated 2,000 times with varying IDs.
Tests index lookup latency—the most common operation in OLTP workloads. The result is nearly tied (0.98×) because both databases use B-tree indexes and the overhead is dominated by network round-trips, not engine work.
mskql’s PLAN_INDEX_SCAN walks the B-tree to
find the matching row, then serializes it directly. The result
cache does not help here because each query has a different
WHERE value. The near-parity shows that mskql’s
index implementation is competitive with PostgreSQL’s.
SELECT * FROM bench_data ORDER BY value DESC;
Sort all 5,000 rows by a single integer column in descending order.
Tests sort performance on a medium dataset. Sorting is a core
operation used by ORDER BY, DISTINCT,
window functions, and merge joins.
The PLAN_SORT node collects all blocks from its
child, builds a flat index array, and sorts it with
qsort. The sort key array is bump-allocated from
arena.scratch—zero malloc calls.
After sorting, blocks are emitted in sorted order directly to
the wire.
SELECT * FROM bench_data ORDER BY category, value DESC;
Sort by two columns: category ascending, then
value descending within each category.
Multi-column sorts are common in reporting queries. The comparator must check the first column, then fall through to the second on ties. This tests the overhead of compound sort keys.
The sort node stores parallel arrays of sort columns and direction flags, all bump-allocated. The comparator loops through columns in order, short-circuiting on the first difference. Columnar storage means each column’s data is contiguous in memory, improving cache locality during the sort.
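The compound-key comparator can be sketched as an index sort over parallel column arrays with per-column direction flags. Since plain `qsort` has no context pointer, this sketch passes the sort spec through file-scope state, which is fine for a single-threaded executor; all names are illustrative:

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_SORT_COLS 4

static const int32_t *g_cols[MAX_SORT_COLS];  /* sort key columns   */
static int g_desc[MAX_SORT_COLS];             /* 1 = DESC, 0 = ASC  */
static int g_ncols;

/* Compare two row indices: check columns in order, short-circuiting
 * on the first difference and flipping the result for DESC columns. */
static int idx_cmp(const void *pa, const void *pb) {
    uint32_t a = *(const uint32_t *)pa, b = *(const uint32_t *)pb;
    for (int c = 0; c < g_ncols; c++) {
        int32_t va = g_cols[c][a], vb = g_cols[c][b];
        if (va != vb) {
            int r = (va < vb) ? -1 : 1;
            return g_desc[c] ? -r : r;
        }
        /* tie: fall through to the next sort column */
    }
    return 0;
}

/* Sort an index array; the column data itself never moves. */
static void sort_indices(uint32_t *idx, size_t n,
                         const int32_t **cols, const int *desc, int ncols) {
    g_ncols = ncols;
    for (int c = 0; c < ncols; c++) { g_cols[c] = cols[c]; g_desc[c] = desc[c]; }
    qsort(idx, n, sizeof(uint32_t), idx_cmp);
}
```

Sorting indices instead of rows means the (potentially wide) payload columns are only touched once, when blocks are emitted in sorted order.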
SELECT DISTINCT category FROM bench_data;
Extract the 100 unique category values from 5,000 rows.
DISTINCT tests deduplication speed. The 7×
speedup (0.14×) is one of the largest gaps because the
result set is tiny (100 rows) while the input is large (5,000).
The DISTINCT path uses the hash aggregation
path with no aggregate functions. The hash table is bump-allocated,
and the 100-row result is serialized in a single wire buffer
flush. The result cache further accelerates repeated executions.
INSERT INTO bench_data (id, category, value) VALUES (1, 'cat_1', 42);
-- ... repeated 10,000 times with different values
Tests per-statement insert overhead. Each insert is a separate SQL statement parsed, planned, and executed independently. The 3× speedup reflects mskql’s lack of WAL, fsync, and MVCC bookkeeping.
Each insert appends a row to the table’s dynamic array and bumps the table generation (invalidating the scan cache). The arena is reset between statements. No durability overhead: no WAL write, no fsync, no checkpoint. The result cache is invalidated by the generation bump.
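The generation-counter invalidation scheme described above can be sketched as follows; the struct and function names are illustrative, not mskql's actual API. Every write bumps the table generation, and any cache (scan, join, or result) is valid only while its recorded generation still matches:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t generation;   /* bumped on every insert/update/delete */
} table_t;

typedef struct {
    uint64_t built_at;     /* table generation when the cache was built */
    bool     valid;
} cache_t;

/* A write invalidates every dependent cache in O(1): no lists of
 * cache entries to walk, just one counter increment. */
static void table_write(table_t *t) {
    t->generation++;
}

static bool cache_usable(const cache_t *c, const table_t *t) {
    return c->valid && c->built_at == t->generation;
}

static void cache_rebuild(cache_t *c, const table_t *t) {
    /* ... rebuild columnar arrays / cached result bytes here ... */
    c->built_at = t->generation;
    c->valid = true;
}
```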
UPDATE bench_data SET value = value + 1 WHERE category = 'cat_42';
Update ~50 rows matching a category filter. Repeated 200 times across 20 different categories (~1,000 rows updated per pass through all 20 categories).
Tests in-place update speed. The 5× speedup (0.20×) is the second-largest gap after expression aggregates. Updates are expensive in PostgreSQL because of MVCC: each update creates a new tuple version and marks the old one dead.
Rows are updated in place—no tuple versioning, no dead tuple cleanup. The scan cache is invalidated by the generation bump, but the update itself is a direct cell modification in the row store. No WAL write, no vacuum needed.
-- Insert 2,000 rows, then:
DELETE FROM bench_data WHERE id % 2 = 0;
-- Repeated 50 times (re-inserting each time)
Tests the combined cost of bulk insert + bulk delete. This exercises the full write path including row removal and generation tracking.
Deletes remove rows from the dynamic array and bump the table generation. No dead tuple tracking, no vacuum. The 3× speedup is primarily from avoiding WAL and MVCC overhead.
BEGIN;
INSERT INTO bench_data VALUES (1, 'cat_1', 42);
-- ... 50 inserts
COMMIT;
Tests transaction overhead. The near-parity (0.86×) is notable because PostgreSQL’s transactions are backed by WAL with group commit, while mskql uses lazy copy-on-write snapshots with no durability.
BEGIN records table names and generation numbers
in O(1). The first write to a table triggers
table_deep_copy()—O(N) for that table only.
COMMIT discards the snapshot (free).
ROLLBACK swaps the saved copy back. The O(N) deep
copy on first write is why transactions are the closest result
to PostgreSQL.
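The lazy copy-on-write snapshot can be sketched for a single table; this is a simplified model with illustrative names, not mskql's `table_deep_copy()` implementation. `BEGIN` is O(1), the first write pays the O(N) deep copy, `COMMIT` just frees, and `ROLLBACK` swaps the saved copy back:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int32_t *rows;
    size_t   nrows;
    uint64_t generation;
} table_t;

typedef struct {
    table_t  saved;       /* deep copy, made lazily on first write */
    int      have_copy;
    uint64_t begin_gen;   /* generation recorded at BEGIN (O(1))   */
} txn_t;

static void txn_begin(txn_t *tx, const table_t *t) {
    tx->have_copy = 0;
    tx->begin_gen = t->generation;
}

/* First write inside the transaction triggers the O(N) deep copy;
 * later writes go straight to the live table. */
static void txn_write(txn_t *tx, table_t *t, int32_t new_row) {
    if (!tx->have_copy) {
        tx->saved.rows = malloc(t->nrows * sizeof(int32_t));
        memcpy(tx->saved.rows, t->rows, t->nrows * sizeof(int32_t));
        tx->saved.nrows = t->nrows;
        tx->have_copy = 1;
    }
    t->rows = realloc(t->rows, (t->nrows + 1) * sizeof(int32_t));
    t->rows[t->nrows++] = new_row;
    t->generation++;
}

static void txn_commit(txn_t *tx) {            /* discard snapshot: free */
    if (tx->have_copy) free(tx->saved.rows);
    tx->have_copy = 0;
}

static void txn_rollback(txn_t *tx, table_t *t) {  /* swap the copy back */
    if (tx->have_copy) {
        free(t->rows);
        t->rows = tx->saved.rows;
        t->nrows = tx->saved.nrows;
        tx->have_copy = 0;
    }
}
```

Read-only transactions never trigger the copy at all, which is why they cost almost nothing.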
SELECT category, SUM(value), COUNT(*), AVG(value)
FROM bench_data GROUP BY category;
Group 5,000 rows by 100 categories and compute 3 aggregates per group.
Tests hash aggregation speed. The 8× speedup (0.12×) comes from columnar processing: the aggregator reads contiguous integer arrays instead of extracting values from row tuples.
PLAN_HASH_AGG builds a hash table keyed by group
columns. For each 1,024-row block, it looks up the group and
updates accumulators in a tight loop. The hash table and
accumulators are bump-allocated. The 100-row result is serialized
in a single wire buffer flush. The result cache makes repeated
executions near-instant.
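The block-at-a-time hash aggregation can be sketched with a minimal open-addressing table; the names, hash function, and fixed sizing are illustrative. Each block contributes one tight loop over two contiguous arrays (`AVG` is derived as `sum / count` when the result is emitted):

```c
#include <stdint.h>
#include <stddef.h>

#define AGG_SLOTS 256                /* power of two, > number of groups */

typedef struct {
    int32_t key;                     /* group key (e.g. category id) */
    int64_t sum;                     /* SUM(value) accumulator       */
    int64_t count;                   /* COUNT(*) accumulator         */
    int     used;
} agg_slot_t;

/* Consume one columnar block: look up each row's group and update
 * accumulators in a tight loop over contiguous arrays. */
static void agg_consume_block(agg_slot_t *ht,
                              const int32_t *keys, const int32_t *vals,
                              size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint32_t h = (uint32_t)keys[i] * 2654435761u & (AGG_SLOTS - 1);
        while (ht[h].used && ht[h].key != keys[i])
            h = (h + 1) & (AGG_SLOTS - 1);    /* linear probing */
        if (!ht[h].used) { ht[h].used = 1; ht[h].key = keys[i]; }
        ht[h].sum   += vals[i];
        ht[h].count += 1;
    }
}

/* Look up a finished group (NULL if the key never appeared). */
static const agg_slot_t *agg_find(const agg_slot_t *ht, int32_t key) {
    uint32_t h = (uint32_t)key * 2654435761u & (AGG_SLOTS - 1);
    while (ht[h].used) {
        if (ht[h].key == key) return &ht[h];
        h = (h + 1) & (AGG_SLOTS - 1);
    }
    return 0;
}
```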
SELECT category, SUM(price * quantity)
FROM bench_data GROUP BY category;
Tests expression evaluation inside aggregates. The 10×
speedup (0.10×) is the largest gap in the batch suite.
The expression price * quantity must be evaluated
for every row before aggregation.
The PLAN_EXPR_PROJECT node evaluates expressions
in columnar batches, feeding results into the hash aggregator.
Combined with the result cache, repeated executions skip all
computation entirely.
SELECT region, product,
SUM(revenue), AVG(cost), MIN(margin),
MAX(units), COUNT(*)
FROM wide_bench GROUP BY region, product;
Tests aggregation on a wider table with more columns and a larger dataset. The 7× speedup shows that the columnar advantage scales with both row count and column count.
Each column is stored as a contiguous typed array in the scan cache. The hash aggregator processes each column independently, keeping the working set in L1 cache. A 4-column INT block is 16 KB—well within the 32–48 KB L1 data cache.
SELECT *, ROW_NUMBER() OVER (PARTITION BY category ORDER BY value DESC)
FROM bench_data;
Tests window function execution with partitioning and ordering. Window functions are among the most complex query operations, requiring sorting within each partition.
The PLAN_WINDOW node sorts all input blocks by
partition key + order key, then walks the sorted data to compute
window values. All sort indices and window accumulators are
bump-allocated from arena.scratch.
SELECT *, RANK() OVER (PARTITION BY region ORDER BY revenue DESC)
FROM wide_bench;
Tests window functions on a larger dataset (50K rows). RANK handles ties differently from ROW_NUMBER, requiring comparison with the previous row’s order value.
Same window node, but with RANK semantics: equal order values get the same rank, and the next rank skips ahead. The 10× larger dataset shows the columnar advantage holds at scale.
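The difference between ROW_NUMBER and RANK comes down to one comparison against the previous row. A sketch of the walk over data already sorted by (partition key, order key), with one partition column and one order column for simplicity:

```c
#include <stdint.h>
#include <stddef.h>

/* Compute ROW_NUMBER and RANK in a single pass over sorted input.
 * ROW_NUMBER always increments within a partition; RANK repeats on
 * ties in the order key and then skips ahead past them. */
static void window_rownum_rank(const int32_t *part, const int32_t *ord,
                               size_t n, int32_t *rownum, int32_t *rank) {
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || part[i] != part[i - 1]) {
            rownum[i] = 1;                   /* new partition resets both */
            rank[i]   = 1;
        } else {
            rownum[i] = rownum[i - 1] + 1;   /* always increments     */
            rank[i]   = (ord[i] == ord[i - 1])
                          ? rank[i - 1]      /* tie: same rank        */
                          : rownum[i];       /* gap: skip past ties   */
        }
    }
}
```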
SELECT c.name, o.amount, o.created
FROM customers c
JOIN orders o ON c.id = o.customer_id;
Tests hash join performance on a typical OLTP join pattern: a small dimension table (500 customers) joined to a larger fact table (2,000 orders).
PLAN_HASH_JOIN builds a hash table from the smaller
(inner) table, then probes it for each block of the larger
(outer) table. The hash table is bump-allocated. The join cache
(table.join_cache) stores the pre-built hash table
and is invalidated by the same generation counter as the scan
cache.
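The build/probe structure can be sketched with an integer-key open-addressing table; names and sizing are illustrative, and this version assumes unique inner keys (real code also handles duplicates and non-integer keys):

```c
#include <stdint.h>
#include <stddef.h>

#define JOIN_SLOTS 1024              /* power of two, > inner row count */

typedef struct { int32_t key; uint32_t row; int used; } join_slot_t;

/* Build phase: hash every inner-table key once. */
static void join_build(join_slot_t *ht, const int32_t *keys, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint32_t h = (uint32_t)keys[i] * 2654435761u & (JOIN_SLOTS - 1);
        while (ht[h].used) h = (h + 1) & (JOIN_SLOTS - 1);
        ht[h].key = keys[i]; ht[h].row = (uint32_t)i; ht[h].used = 1;
    }
}

/* Probe phase: for each outer row, find its matching inner row.
 * Returns the match count; out_inner[i]/out_outer[i] pair row indices. */
static size_t join_probe(const join_slot_t *ht,
                         const int32_t *outer_keys, size_t n,
                         uint32_t *out_inner, uint32_t *out_outer) {
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t h = (uint32_t)outer_keys[i] * 2654435761u & (JOIN_SLOTS - 1);
        while (ht[h].used) {
            if (ht[h].key == outer_keys[i]) {
                out_inner[m] = ht[h].row;
                out_outer[m] = (uint32_t)i;
                m++;
                break;
            }
            h = (h + 1) & (JOIN_SLOTS - 1);
        }
    }
    return m;
}
```

Caching the built table across queries (as the join cache does) means repeated executions pay only the probe phase.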
SELECT r.name, p.name, SUM(s.revenue)
FROM sales s
JOIN regions r ON s.region_id = r.id
JOIN products p ON s.product_id = p.id
GROUP BY r.name, p.name;
Tests a star-schema join pattern: one large fact table joined to two small dimension tables, followed by aggregation. This is the canonical analytical query pattern.
Two hash joins are chained, feeding into a hash aggregation node. All three hash tables are bump-allocated. The result (a small number of region × product combinations) is cached by the result cache, making repeated executions near-instant.
SELECT * FROM orders
WHERE customer_id IN (
SELECT id FROM customers WHERE region = 'West'
);
Tests IN subquery execution. The subquery is
converted to a hash semi-join internally, avoiding the naive
nested-loop approach.
PLAN_HASH_SEMI_JOIN builds a hash set from the
subquery result, then probes it for each row of the outer
query. This is O(N + M) instead of O(N × M) for a
naive correlated subquery.
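The O(N + M) shape is visible in a compact sketch: one pass to build a hash set from the subquery result, one pass to probe it. Names and the fixed table size are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

#define SET_SLOTS 512                /* power of two, > subquery rows */

static void set_add(int32_t *slots, int *used, int32_t key) {
    uint32_t h = (uint32_t)key * 2654435761u & (SET_SLOTS - 1);
    while (used[h] && slots[h] != key) h = (h + 1) & (SET_SLOTS - 1);
    slots[h] = key; used[h] = 1;
}

static int set_has(const int32_t *slots, const int *used, int32_t key) {
    uint32_t h = (uint32_t)key * 2654435761u & (SET_SLOTS - 1);
    while (used[h]) {
        if (slots[h] == key) return 1;
        h = (h + 1) & (SET_SLOTS - 1);
    }
    return 0;
}

/* Hash semi-join: build from the M subquery rows, probe the N outer
 * rows, emit a selection vector of outer rows that match. O(N + M). */
static size_t semi_join(const int32_t *sub, size_t m,
                        const int32_t *outer, size_t n, uint32_t *sel) {
    int32_t slots[SET_SLOTS];
    int used[SET_SLOTS] = {0};
    size_t k = 0;
    for (size_t i = 0; i < m; i++) set_add(slots, used, sub[i]);
    for (size_t i = 0; i < n; i++)
        if (set_has(slots, used, outer[i])) sel[k++] = (uint32_t)i;
    return k;
}
```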
SELECT * FROM (
SELECT * FROM wide_bench WHERE region = 'North'
ORDER BY revenue DESC
LIMIT 100
) sub;
Tests a multi-stage pipeline: filter → sort → limit, wrapped in a subquery. The 5× speedup (0.20×) shows the benefit of the block-oriented executor on compound operations.
The plan is a chain: SeqScan → Filter → Sort
→ Limit. Each node pulls blocks from its child
on demand. The limit node stops pulling after 100 rows,
avoiding unnecessary sort work on the remaining data.
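The pull-based, block-oriented executor can be sketched with function-pointer nodes; the struct is an illustrative simplification. The key property is that a LIMIT node simply stops calling its child once it has enough rows:

```c
#include <stddef.h>

typedef struct plan_node plan_node_t;
struct plan_node {
    /* Returns the row count of the next block, 0 when exhausted. */
    size_t (*next)(plan_node_t *self);
    plan_node_t *child;
    size_t limit, emitted;          /* used by the limit node */
    size_t pos, total, block;       /* used by the scan node  */
};

/* Source node: hands out fixed-size blocks until `total` rows. */
static size_t scan_next(plan_node_t *n) {
    size_t left = n->total - n->pos;
    size_t take = left < n->block ? left : n->block;
    n->pos += take;
    return take;
}

/* Limit node: pulls from its child only while rows are still needed. */
static size_t limit_next(plan_node_t *n) {
    if (n->emitted >= n->limit) return 0;   /* stop pulling early */
    size_t got = n->child->next(n->child);
    if (got > n->limit - n->emitted) got = n->limit - n->emitted;
    n->emitted += got;
    return got;
}
```

With a 5,000-row scan and LIMIT 100, only one 1,024-row block is ever pulled; the remaining ~4,000 rows are never touched.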
WITH filtered AS (
SELECT * FROM bench_data WHERE value > 50
)
SELECT category, COUNT(*) FROM filtered
GROUP BY category ORDER BY COUNT(*) DESC;
Tests CTE materialization and subsequent aggregation. CTEs are a common pattern for breaking complex queries into readable steps.
The CTE is materialized into a temporary table, then the outer query runs against it. The scan cache is built for the temp table on first access. The result cache caches the final wire output.
WITH summary AS (
SELECT region, SUM(revenue) AS total
FROM wide_bench GROUP BY region
)
SELECT * FROM summary ORDER BY total DESC LIMIT 5;
Tests a multi-stage analytical pipeline: aggregate in a CTE, then sort and limit the result. The closer ratio (0.75×) reflects that PostgreSQL’s query optimizer handles this pattern well.
The CTE materializes the aggregation result, then the outer query sorts and limits it. Both stages use the block executor. The result cache makes repeated executions near-instant.
WITH joined AS (
SELECT r.name AS region, p.name AS product, s.revenue
FROM sales s
JOIN regions r ON s.region_id = r.id
JOIN products p ON s.product_id = p.id
)
SELECT region, SUM(revenue) AS total
FROM joined GROUP BY region
ORDER BY total DESC;
The most complex batch workload: two joins, aggregation, and sorting in a single CTE pipeline. This is a realistic analytical query on a star schema.
The CTE materializes the join result, then the outer query aggregates and sorts. In the throughput suite, this same query runs at 94,525 QPS (190× PostgreSQL) because the 4-row result is served from the result cache in microseconds.
SELECT id, category, value FROM bench_data_a
UNION
SELECT id, category, value FROM bench_data_b;
Tests set operation performance. UNION requires
deduplication across two result sets, combining scan, hash,
and serialization costs.
PLAN_SET_OP pulls blocks from both children and
uses a hash table for deduplication. The hash table is
bump-allocated. The deduplicated result is serialized directly
to the wire.
SELECT * FROM generate_series(1, 10000);
Tests table-valued function performance. The 3× speedup (0.32×) was achieved by generating integers directly into columnar blocks instead of materializing a temporary table.
PLAN_GENERATE_SERIES produces 1,024-row blocks of
contiguous int32_t arrays in a tight loop—zero
malloc, zero row-store materialization. The blocks
are serialized directly to the wire via try_plan_send.
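Generating the series block-by-block, rather than materializing a temporary table, can be sketched as a cursor that fills one contiguous 1,024-value array per call (function name and signature are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

#define GS_BLOCK 1024

/* Fill `out` with the next block of generate_series(start, stop).
 * `*cursor` holds the next value to emit (callers initialize it to
 * `start`). Returns values produced, 0 when the series is exhausted. */
static size_t generate_series_block(int32_t start, int32_t stop,
                                    int32_t *cursor, int32_t *out) {
    size_t n = 0;
    int32_t v = (*cursor < start) ? start : *cursor;
    while (v <= stop && n < GS_BLOCK)
        out[n++] = v++;              /* contiguous int32_t output */
    *cursor = v;
    return n;
}
```

Each filled block can be handed straight to the wire serializer, so the whole table function runs with zero heap allocation.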
SELECT UPPER(name), LENGTH(name), ABS(score),
ROUND(score::numeric, 2)
FROM bench_data;
Tests expression projection with multiple scalar functions. The 2.5× speedup (0.40×) comes from batched wire send and scan cache hits.
PLAN_EXPR_PROJECT evaluates all 4 functions per
row within each 1,024-row block, writing results into output
col_block arrays. The input columns come from the
scan cache (contiguous arrays), and the output is serialized
directly to the wire—no row materialization.
SELECT * FROM wide_bench ORDER BY revenue DESC;
Tests sort performance at scale. The 0.65× ratio shows that sorting 50K rows still favors mskql, which sorts a flat index array over contiguous columnar key data.
The sort node collects all ~49 blocks (50K / 1,024), builds a
flat index array, and sorts it with qsort. The
sorted blocks are then serialized to the wire. The advantage
comes from mskql’s batched wire send (fewer system calls)
and the scan cache (a contiguous sort key array).
Each throughput workload runs 8 threads for 5 seconds, measuring queries
per second (QPS) and median latency (p50). Source:
bench/bench_throughput.c.
SELECT * FROM tp_data WHERE id = $1;
Point lookup by primary key with a random ID each iteration. 8 threads, each with a persistent connection.
The most common OLTP operation. The 3.37× throughput advantage comes from mskql’s lower per-query overhead: no planner, no MVCC snapshot, no buffer pool lookup.
Each query is parsed, the index is walked, and the single matching row is serialized. The arena is reset between queries. At 79,102 QPS, each query completes in ~101 microseconds including network round-trip.
SELECT * FROM tp_data;
5,000 rows, 8 concurrent threads.
Dead heat. Under concurrent load, full scans are network-bound: serializing 5,000 rows over TCP dominates the total time. The engine speed difference is invisible.
Both databases spend most of their time in write()
system calls. The mskql batched 64 KB buffer helps in
batch mode but under concurrent load the network becomes the
bottleneck.
SELECT * FROM tp_data WHERE category = 'cat_42';
Near-parity, same as full scan. The filter reduces the result set but the scan cost is similar. Network-bound under concurrent load.
The selection vector avoids copying non-matching rows, but the result set (~50 rows) is small enough that serialization time is negligible. The bottleneck is the scan itself.
INSERT INTO tp_data (id, category, value) VALUES ($1, $2, $3);
Tests concurrent write throughput. The 1.61× advantage is smaller than the 2.7× batch advantage because PostgreSQL’s group commit amortizes WAL overhead across concurrent writers.
Each insert appends to the row dynamic array and bumps the generation counter. The single-threaded executor serializes all inserts, but each one is fast enough (115μs) to sustain 63K QPS across 8 threads.
-- 80% reads:
SELECT * FROM tp_data WHERE id = $1;
-- 20% writes:
UPDATE tp_data SET value = $1 WHERE id = $2;
The only workload where PostgreSQL wins. PostgreSQL’s MVCC allows concurrent readers and writers without blocking. The mskql single-threaded executor serializes all queries, so writers block readers.
This is a fundamental architectural limitation: mskql processes one query at a time. Adding MVCC or a concurrent executor would close this gap but would add significant complexity to the ~45.1K-line codebase.
SELECT category, SUM(value), COUNT(*), AVG(value)
FROM tp_data GROUP BY category;
Nearly 7× throughput advantage on aggregation. The result cache serves the 100-row result from cached wire bytes after the first execution, skipping all computation.
First execution: hash aggregation on columnar blocks (~0.13 ms).
Subsequent executions: result cache hit, single
send_all() of cached bytes (~0.01 ms). The
cache is invalidated only when a write bumps
total_generation.
SELECT region, product,
SUM(revenue), AVG(cost), MIN(margin),
MAX(units), COUNT(*)
FROM wide_bench GROUP BY region, product;
Over 8× throughput on a wider aggregation (50K rows, 7 columns). The advantage holds at scale because the columnar executor and result cache both benefit from larger datasets.
Same mechanism as the simple aggregate: first execution builds the result via columnar hash aggregation, subsequent executions serve from the result cache. The 50K-row scan cache is ~200 KB of contiguous data.
SELECT *, RANK() OVER (PARTITION BY region ORDER BY revenue DESC)
FROM wide_bench;
Window functions on 50K rows are expensive in both databases. The modest 1.22× advantage shows that the result cache does not help here (the 50K-row result is too large to cache efficiently) and the sort dominates.
Each execution sorts 50K rows and computes RANK values. The columnar sort is slightly faster due to contiguous sort key arrays, but the O(N log N) cost dominates.
SELECT c.name, o.amount, o.created
FROM customers c JOIN orders o ON c.id = o.customer_id;
Nearly 2× throughput on a join producing ~2K result rows. The result is large enough that the result cache helps but serialization still takes significant time.
The hash join builds a hash table from the 500-row customers table and probes it for each block of the 2K-row orders table. The join cache stores the pre-built hash table across queries.
SELECT r.name, p.name, SUM(s.revenue)
FROM sales s
JOIN regions r ON s.region_id = r.id
JOIN products p ON s.product_id = p.id
GROUP BY r.name, p.name;
78× throughput. The second-largest gap in the entire benchmark suite. The small result set (region × product combinations) fits entirely in the result cache.
First execution: two hash joins + hash aggregation on columnar
blocks (~0.35 ms). All subsequent executions: result cache
hit, single send_all(). PostgreSQL re-executes the
full plan (planner + 2 hash joins + aggregation) every time.
WITH summary AS (
SELECT region, SUM(revenue) AS total
FROM wide_bench GROUP BY region
)
SELECT * FROM summary ORDER BY total DESC LIMIT 5;
Nearly 3× throughput on a CTE pipeline. The result cache serves the 5-row result from cached bytes after the first execution.
First execution materializes the CTE, aggregates, sorts, and limits. The 5-row result is cached. Subsequent executions skip all computation.
SELECT * FROM (
SELECT * FROM wide_bench WHERE region = 'North'
ORDER BY revenue DESC LIMIT 100
) sub;
Modest 1.3× advantage. The 100-row result is cached, but the first execution involves sorting ~10K filtered rows, which is expensive in both databases.
The plan chain (scan → filter → sort → limit) runs in the block executor. The limit node stops pulling after 100 rows. The result cache serves subsequent executions.
SELECT * FROM wide_bench ORDER BY revenue DESC;
2.5× throughput on sorting 50K rows. The throughput advantage is larger than the batch advantage (0.87×) because the result cache serves the sorted result from cached wire bytes after the first execution.
First execution: sort 50K rows (~213 ms). Subsequent
executions: result cache hit. The 50K-row serialized result
is ~2 MB of wire bytes, served in a single
send_all().
WITH joined AS (
SELECT r.name AS region, p.name AS product, s.revenue
FROM sales s
JOIN regions r ON s.region_id = r.id
JOIN products p ON s.product_id = p.id
)
SELECT region, SUM(revenue) AS total
FROM joined GROUP BY region ORDER BY total DESC;
190× throughput—the largest gap in the entire benchmark suite. The 4-row result (one per region) is served from the result cache in 85 microseconds. PostgreSQL re-executes the full plan (planner + 2 hash joins + aggregation + sort) every time, taking 18.4 ms.
The result cache is the key: the 4-row result is ~200 bytes of
serialized wire data. After the first execution, every subsequent
query is a hash lookup + send_all(). The cache is
invalidated only when a write bumps total_generation.
In a read-only throughput test, the cache is never invalidated.