mskql — Benchmark Workload Details

Full table scan

5,000 rows × 3 columns × 200 iterations | 0.67×

What

SELECT * FROM bench_data;

A table with 5,000 rows and 3 columns (id INT, category TEXT, value INT). The query reads every row and sends all columns to the client. Repeated 200 times to amortize connection overhead.

So what

Full scans test raw read throughput: how fast the engine can materialize rows and push them over the wire. This is the baseline for all read workloads. A database that is slow here will be slow everywhere.

Why it’s fast

mskql’s scan cache converts the row store to flat columnar arrays once (O(N), ~0.1 ms for 5K rows). Subsequent scans memcpy 1,024-row slices into col_block arrays. try_plan_send serializes directly from columnar blocks into a 64 KB wire buffer—5 system calls instead of 5,000. No row materialization occurs.

Filtered scan

5,000 rows × 500 iterations | 0.58×

What

SELECT * FROM bench_data WHERE category = 'cat_42';

Same 5,000-row table. The WHERE clause filters to rows matching one of 100 categories (~50 rows per category).

So what

Tests predicate evaluation speed. The engine must scan all rows but only emit matching ones. This measures the cost of the filter operator relative to the scan.

Why it’s fast

The plan executor’s PLAN_FILTER node produces a selection vector (array of matching row indices) instead of copying data. Only matching rows are serialized to the wire. The scan cache hit means the filter operates on contiguous int32_t[] and char*[] arrays in L1 cache.

Indexed point lookup

2,000 iterations | 0.96×

What

SELECT * FROM bench_data WHERE id = 42;

Point lookup by primary key (B-tree index). Returns exactly 1 row. Repeated 2,000 times with varying IDs.

So what

Tests index lookup latency—the most common operation in OLTP workloads. The result is nearly tied (0.98×) because both databases use B-tree indexes and the overhead is dominated by network round-trips, not engine work.

Why it’s fast

mskql’s PLAN_INDEX_SCAN walks the B-tree to find the matching row, then serializes it directly. The result cache does not help here because each query has a different WHERE value. The near-parity shows that mskql’s index implementation is competitive with PostgreSQL’s.

Sort

5,000 rows × 200 iterations | 0.55×

What

SELECT * FROM bench_data ORDER BY value DESC;

Sort all 5,000 rows by a single integer column in descending order.

So what

Tests sort performance on a medium dataset. Sorting is a core operation used by ORDER BY, DISTINCT, window functions, and merge joins.

Why it’s fast

The PLAN_SORT node collects all blocks from its child, builds a flat index array, and sorts it with qsort. The sort key array is bump-allocated from arena.scratch—zero malloc calls. After sorting, blocks are emitted in sorted order directly to the wire.

Multi-column sort

5,000 rows × 200 iterations | 0.53×

What

SELECT * FROM bench_data ORDER BY category, value DESC;

Sort by two columns: category ascending, then value descending within each category.

So what

Multi-column sorts are common in reporting queries. The comparator must check the first column, then fall through to the second on ties. This tests the overhead of compound sort keys.

Why it’s fast

The sort node stores parallel arrays of sort columns and direction flags, all bump-allocated. The comparator loops through columns in order, short-circuiting on the first difference. Columnar storage means each column’s data is contiguous in memory, improving cache locality during the sort.

Distinct

100 categories × 500 iterations | 0.15×

What

SELECT DISTINCT category FROM bench_data;

Extract the 100 unique category values from 5,000 rows.

So what

DISTINCT tests deduplication speed. The 7× speedup (0.14×) is one of the largest gaps because the result set is tiny (100 rows) while the input is large (5,000).

Why it’s fast

The DISTINCT path uses the hash aggregation path with no aggregate functions. The hash table is bump-allocated, and the 100-row result is serialized in a single wire buffer flush. The result cache further accelerates repeated executions.

Single-row inserts

10,000 inserts | 0.35×

What

INSERT INTO bench_data (id, category, value) VALUES (1, 'cat_1', 42);
-- ... repeated 10,000 times with different values

So what

Tests per-statement insert overhead. Each insert is a separate SQL statement parsed, planned, and executed independently. The 3× speedup reflects mskql’s lack of WAL, fsync, and MVCC bookkeeping.

Why it’s fast

Each insert appends a row to the table’s dynamic array and bumps the table generation (invalidating the scan cache). The arena is reset between statements. No durability overhead: no WAL write, no fsync, no checkpoint. The result cache is invalidated by the generation bump.

Bulk update

1,000 of 5,000 rows × 200 iterations | 0.23×

What

UPDATE bench_data SET value = value + 1 WHERE category = 'cat_42';

Update ~50 rows matching a category filter. Repeated 200 times across 20 different categories (1,000 total rows updated per iteration set).

So what

Tests in-place update speed. The 5× speedup (0.20×) is the second-largest gap after expression aggregates. Updates are expensive in PostgreSQL because of MVCC: each update creates a new tuple version and marks the old one dead.

Why it’s fast

Rows are updated in place—no tuple versioning, no dead tuple cleanup. The scan cache is invalidated by the generation bump, but the update itself is a direct cell modification in the row store. No WAL write, no vacuum needed.

Insert + delete

2,000 rows, delete half × 50 iterations | 0.32×

What

-- Insert 2,000 rows, then:
DELETE FROM bench_data WHERE id % 2 = 0;
-- Repeated 50 times (re-inserting each time)

So what

Tests the combined cost of bulk insert + bulk delete. This exercises the full write path including row removal and generation tracking.

Why it’s fast

Deletes remove rows from the dynamic array and bump the table generation. No dead tuple tracking, no vacuum. The 3× speedup is primarily from avoiding WAL and MVCC overhead.

Transactional inserts

100 transactions, 50 inserts each | 0.86×

What

BEGIN;
INSERT INTO bench_data VALUES (1, 'cat_1', 42);
-- ... 50 inserts
COMMIT;

So what

Tests transaction overhead. The near-parity (0.86×) is notable because PostgreSQL’s transactions are backed by WAL with group commit, while mskql uses lazy copy-on-write snapshots with no durability.

Why it’s fast

BEGIN records table names and generation numbers in O(1). The first write to a table triggers table_deep_copy()—O(N) for that table only. COMMIT discards the snapshot (free). ROLLBACK swaps the saved copy back. The O(N) deep copy on first write is why transactions are the closest result to PostgreSQL.

Group + sum

5,000 rows × 500 iterations | 0.12×

What

SELECT category, SUM(value), COUNT(*), AVG(value)
FROM bench_data GROUP BY category;

Group 5,000 rows by 100 categories and compute 3 aggregates per group.

So what

Tests hash aggregation speed. The 8× speedup (0.12×) comes from columnar processing: the aggregator reads contiguous integer arrays instead of extracting values from row tuples.

Why it’s fast

PLAN_HASH_AGG builds a hash table keyed by group columns. For each 1,024-row block, it looks up the group and updates accumulators in a tight loop. The hash table and accumulators are bump-allocated. The 100-row result is serialized in a single wire buffer flush. The result cache makes repeated executions near-instant.

Expression aggregate

SUM(price × qty), 5,000 rows × 500 iterations | 0.10×

What

SELECT category, SUM(price * quantity)
FROM bench_data GROUP BY category;

So what

Tests expression evaluation inside aggregates. The 10× speedup (0.10×) is the largest gap in the batch suite. The expression price * quantity must be evaluated for every row before aggregation.

Why it’s fast

The PLAN_EXPR_PROJECT node evaluates expressions in columnar batches, feeding results into the hash aggregator. Combined with the result cache, repeated executions skip all computation entirely.

Wide aggregate

7 columns, 50,000 rows × 20 iterations | 0.14×

What

SELECT region, product,
       SUM(revenue), AVG(cost), MIN(margin),
       MAX(units), COUNT(*)
FROM wide_bench GROUP BY region, product;

So what

Tests aggregation on a wider table with more columns and a larger dataset. The 7× speedup shows that the columnar advantage scales with both row count and column count.

Why it’s fast

Each column is stored as a contiguous typed array in the scan cache. The hash aggregator processes each column independently, keeping the working set in L1 cache. A 4-column INT block is 16 KB—well within the 32–48 KB L1 data cache.

Window function (ROW_NUMBER)

5,000 rows × 20 iterations | 0.55×

What

SELECT *, ROW_NUMBER() OVER (PARTITION BY category ORDER BY value DESC)
FROM bench_data;

So what

Tests window function execution with partitioning and ordering. Window functions are among the most complex query operations, requiring sorting within each partition.

Why it’s fast

The PLAN_WINDOW node sorts all input blocks by partition key + order key, then walks the sorted data to compute window values. All sort indices and window accumulators are bump-allocated from arena.scratch.

Window function (RANK)

50,000 rows × 5 iterations | 0.50×

What

SELECT *, RANK() OVER (PARTITION BY region ORDER BY revenue DESC)
FROM wide_bench;

So what

Tests window functions on a larger dataset (50K rows). RANK handles ties differently from ROW_NUMBER, requiring comparison with the previous row’s order value.

Why it’s fast

Same window node, but with RANK semantics: equal order values get the same rank, and the next rank skips ahead. The 10× larger dataset shows the columnar advantage holds at scale.

Two-table join

500 + 2,000 rows × 50 iterations | 0.44×

What

SELECT c.name, o.amount, o.created
FROM customers c
JOIN orders o ON c.id = o.customer_id;

So what

Tests hash join performance on a typical OLTP join pattern: a small dimension table (500 customers) joined to a larger fact table (2,000 orders).

Why it’s fast

PLAN_HASH_JOIN builds a hash table from the smaller (inner) table, then probes it for each block of the larger (outer) table. The hash table is bump-allocated. The join cache (table.join_cache) stores the pre-built hash table and is invalidated by the same generation counter as the scan cache.

Three-table join + group

50,000 rows × 5 iterations | 0.32×

What

SELECT r.name, p.name, SUM(s.revenue)
FROM sales s
JOIN regions r ON s.region_id = r.id
JOIN products p ON s.product_id = p.id
GROUP BY r.name, p.name;

So what

Tests a star-schema join pattern: one large fact table joined to two small dimension tables, followed by aggregation. This is the canonical analytical query pattern.

Why it’s fast

Two hash joins are chained, feeding into a hash aggregation node. All three hash tables are bump-allocated. The result (a small number of region × product combinations) is cached by the result cache, making repeated executions near-instant.

Subquery filter

IN subquery × 50 iterations | 0.43×

What

SELECT * FROM orders
WHERE customer_id IN (
    SELECT id FROM customers WHERE region = 'West'
);

So what

Tests IN subquery execution. The subquery is converted to a hash semi-join internally, avoiding the naive nested-loop approach.

Why it’s fast

PLAN_HASH_SEMI_JOIN builds a hash set from the subquery result, then probes it for each row of the outer query. This is O(N + M) instead of O(N × M) for a naive correlated subquery.

Subquery pipeline

filter + sort + limit, 50,000 rows × 20 iterations | 0.19×

What

SELECT * FROM (
    SELECT * FROM wide_bench WHERE region = 'North'
    ORDER BY revenue DESC
    LIMIT 100
) sub;

So what

Tests a multi-stage pipeline: filter → sort → limit, wrapped in a subquery. The 5× speedup (0.20×) shows the benefit of the block-oriented executor on compound operations.

Why it’s fast

The plan is a chain: SeqScan → Filter → Sort → Limit. Each node pulls blocks from its child on demand. The limit node stops pulling after 100 rows, avoiding unnecessary sort work on the remaining data.

CTE filter chain

5,000 rows × 200 iterations | 0.43×

What

WITH filtered AS (
    SELECT * FROM bench_data WHERE value > 50
)
SELECT category, COUNT(*) FROM filtered
GROUP BY category ORDER BY COUNT(*) DESC;

So what

Tests CTE materialization and subsequent aggregation. CTEs are a common pattern for breaking complex queries into readable steps.

Why it’s fast

The CTE is materialized into a temporary table, then the outer query runs against it. The scan cache is built for the temp table on first access. The result cache caches the final wire output.

CTE pipeline

agg → sort → limit, 50,000 rows × 20 iterations | 0.73×

What

WITH summary AS (
    SELECT region, SUM(revenue) AS total
    FROM wide_bench GROUP BY region
)
SELECT * FROM summary ORDER BY total DESC LIMIT 5;

So what

Tests a multi-stage analytical pipeline: aggregate in a CTE, then sort and limit the result. The closer ratio (0.75×) reflects that PostgreSQL’s query optimizer handles this pattern well.

Why it’s fast

The CTE materializes the aggregation result, then the outer query sorts and limits it. Both stages use the block executor. The result cache makes repeated executions near-instant.

CTE star-schema

join → agg → sort, 50,000 rows × 5 iterations | 0.26×

What

WITH joined AS (
    SELECT r.name AS region, p.name AS product, s.revenue
    FROM sales s
    JOIN regions r ON s.region_id = r.id
    JOIN products p ON s.product_id = p.id
)
SELECT region, SUM(revenue) AS total
FROM joined GROUP BY region
ORDER BY total DESC;

So what

The most complex batch workload: two joins, aggregation, and sorting in a single CTE pipeline. This is a realistic analytical query on a star schema.

Why it’s fast

The CTE materializes the join result, then the outer query aggregates and sorts. In the throughput test, this same query runs at 96,007 QPS (192× PostgreSQL) because the 4-row result is served from the result cache in microseconds.

Union

two 2,000-row tables × 50 iterations | 0.54×

What

SELECT id, category, value FROM bench_data_a
UNION
SELECT id, category, value FROM bench_data_b;

So what

Tests set operation performance. UNION requires deduplication across two result sets, combining scan, hash, and serialization costs.

Why it’s fast

PLAN_SET_OP pulls blocks from both children and uses a hash table for deduplication. The hash table is bump-allocated. The deduplicated result is serialized directly to the wire.

Generate series

10,000 rows × 200 iterations | 0.31×

What

SELECT * FROM generate_series(1, 10000);

So what

Tests table-valued function performance. The 3× speedup (0.32×) was achieved by generating integers directly into columnar blocks instead of materializing a temporary table.

Why it’s fast

PLAN_GENERATE_SERIES produces 1,024-row blocks of contiguous int32_t arrays in a tight loop—zero malloc, zero row-store materialization. The blocks are serialized directly to the wire via try_plan_send.

Scalar functions

4 functions, 5,000 rows × 200 iterations | 0.40×

What

SELECT UPPER(name), LENGTH(name), ABS(score),
       ROUND(score::numeric, 2)
FROM bench_data;

So what

Tests expression projection with multiple scalar functions. The 2.5× speedup (0.40×) comes from batched wire send and scan cache hits.

Why it’s fast

PLAN_EXPR_PROJECT evaluates all 4 functions per row within each 1,024-row block, writing results into output col_block arrays. The input columns come from the scan cache (contiguous arrays), and the output is serialized directly to the wire—no row materialization.

Large sort

50,000 rows × 10 iterations | 0.65×

What

SELECT * FROM wide_bench ORDER BY revenue DESC;

So what

Tests sort performance at scale. The 0.65× ratio shows that sorting 50K rows benefits from mskql’s multi-column radix sort on contiguous columnar arrays.

Why it’s fast

The sort node collects all ~49 blocks (50K / 1024), builds a flat index array, and sorts with qsort. The sorted blocks are then serialized to the wire. The remaining gap is from mskql’s batched wire send (fewer system calls) and scan cache (contiguous sort key array).

Indexed point lookup

80,214 vs 23,688 QPS | 3.39× | p50: 0.100 vs 0.252 ms

What

SELECT * FROM tp_data WHERE id = $1;

Point lookup by primary key with a random ID each iteration. 8 threads, each with a persistent connection.

So what

The most common OLTP operation. The 3.37× throughput advantage comes from mskql’s lower per-query overhead: no planner, no MVCC snapshot, no buffer pool lookup.

Why it’s fast

Each query is parsed, the index is walked, and the single matching row is serialized. The arena is reset between queries. At 80,214 QPS, each query completes in ~100 microseconds including network round-trip.

Full table scan

122 vs 120 QPS | 1.02× | p50: 66.0 vs 65.8 ms

What

SELECT * FROM tp_data;

5,000 rows, 8 concurrent threads.

So what

Dead heat. Under concurrent load, full scans are network-bound: serializing 5,000 rows over TCP dominates the total time. The engine speed difference is invisible.

Why it’s fast

Both databases spend most of their time in write() system calls. The mskql batched 64 KB buffer helps in batch mode but under concurrent load the network becomes the bottleneck.

Filtered scan

245 vs 241 QPS | 1.02× | p50: 32.6 vs 33.0 ms

What

SELECT * FROM tp_data WHERE category = 'cat_42';

So what

Near-parity, same as full scan. The filter reduces the result set but the scan cost is similar. Network-bound under concurrent load.

Why it’s fast

The selection vector avoids copying non-matching rows, but the result set (~50 rows) is small enough that serialization time is negligible. The bottleneck is the scan itself.

Single-row insert

64,102 vs 39,512 QPS | 1.62× | p50: 0.114 vs 0.193 ms

What

INSERT INTO tp_data (id, category, value) VALUES ($1, $2, $3);

So what

Tests concurrent write throughput. The 1.61× advantage is smaller than the 2.7× batch advantage because PostgreSQL’s group commit amortizes WAL overhead across concurrent writers.

Why it’s fast

Each insert appends to the row dynamic array and bumps the generation counter. The single-threaded executor serializes all inserts, but each one is fast enough (115μs) to sustain 63K QPS across 8 threads.

Mixed read/write

54,018 vs 58,712 QPS | 0.92× | p50: 0.148 vs 0.128 ms

What

-- 80% reads:
SELECT * FROM tp_data WHERE id = $1;
-- 20% writes:
UPDATE tp_data SET value = $1 WHERE id = $2;

So what

The only workload where PostgreSQL wins. PostgreSQL’s MVCC allows concurrent readers and writers without blocking. The mskql single-threaded executor serializes all queries, so writers block readers.

Why it’s fast

This is a fundamental architectural limitation: mskql processes one query at a time. Adding MVCC or a concurrent executor would close this gap but would add significant complexity to the ~52.8K-line codebase.

Group + sum

96,007 vs 14,082 QPS | 6.82× | p50: 0.083 vs 0.648 ms

What

SELECT category, SUM(value), COUNT(*), AVG(value)
FROM tp_data GROUP BY category;

So what

Nearly 7× throughput advantage on aggregation. The result cache serves the 100-row result from cached wire bytes after the first execution, skipping all computation.

Why it’s fast

First execution: hash aggregation on columnar blocks (~0.13 ms). Subsequent executions: result cache hit, single send_all() of cached bytes (~0.01 ms). The cache is invalidated only when a write bumps total_generation.

Wide aggregate

96,514 vs 1,048 QPS | 92.1× | p50: 0.083 vs 10.7 ms

What

SELECT region, product,
       SUM(revenue), AVG(cost), MIN(margin),
       MAX(units), COUNT(*)
FROM wide_bench GROUP BY region, product;

So what

Over 92× throughput on a wider aggregation (50K rows, 7 columns). The result cache serves the aggregated result from cached wire bytes, skipping all computation after the first execution.

Why it’s fast

Same mechanism as the simple aggregate: first execution builds the result via columnar hash aggregation, subsequent executions serve from the result cache. The 50K-row scan cache is ~200 KB of contiguous data.

Window function (RANK)

14 vs 12 QPS | 1.22× | p50: 636.6 vs 652.7 ms

What

SELECT *, RANK() OVER (PARTITION BY region ORDER BY revenue DESC)
FROM wide_bench;

So what

Window functions on 50K rows are expensive in both databases. The modest 1.22× advantage shows that the result cache does not help here (the 50K-row result is too large to cache efficiently) and the sort dominates.

Why it’s fast

Each execution sorts 50K rows and computes RANK values. The columnar sort is slightly faster due to contiguous sort key arrays, but the O(N log N) cost dominates.

Two-table join

569 vs 305 QPS | 1.87× | p50: 14.0 vs 26.1 ms

What

SELECT c.name, o.amount, o.created
FROM customers c JOIN orders o ON c.id = o.customer_id;

So what

Nearly 2× throughput on a join producing ~2K result rows. The result is large enough that the result cache helps but serialization still takes significant time.

Why it’s fast

The hash join builds a hash table from the 500-row customers table and probes it for each block of the 2K-row orders table. The join cache stores the pre-built hash table across queries.

Three-table join + group

36,153 vs 467 QPS | 77.4× | p50: 0.221 vs 22.1 ms

What

SELECT r.name, p.name, SUM(s.revenue)
FROM sales s
JOIN regions r ON s.region_id = r.id
JOIN products p ON s.product_id = p.id
GROUP BY r.name, p.name;

So what

77× throughput. The third-largest gap in the entire benchmark suite. The small result set (region × product combinations) fits entirely in the result cache.

Why it’s fast

First execution: two hash joins + hash aggregation on columnar blocks (~0.35 ms). All subsequent executions: result cache hit, single send_all(). PostgreSQL re-executes the full plan (planner + 2 hash joins + aggregation) every time.

CTE pipeline

5,712 vs 1,938 QPS | 2.95× | p50: 1.410 vs 4.280 ms

What

WITH summary AS (
    SELECT region, SUM(revenue) AS total
    FROM wide_bench GROUP BY region
)
SELECT * FROM summary ORDER BY total DESC LIMIT 5;

So what

Nearly 3× throughput on a CTE pipeline. The result cache serves the 5-row result from cached bytes after the first execution.

Why it’s fast

First execution materializes the CTE, aggregates, sorts, and limits. The 5-row result is cached. Subsequent executions skip all computation.

Subquery pipeline

1,189 vs 905 QPS | 1.31× | p50: 6.612 vs 8.680 ms

What

SELECT * FROM (
    SELECT * FROM wide_bench WHERE region = 'North'
    ORDER BY revenue DESC LIMIT 100
) sub;

So what

Modest 1.3× advantage. The 100-row result is cached, but the first execution involves sorting ~10K filtered rows, which is expensive in both databases.

Why it’s fast

The plan chain (scan → filter → sort → limit) runs in the block executor. The limit node stops pulling after 100 rows. The result cache serves subsequent executions.

Large sort

31 vs 12 QPS | 2.58× | p50: 210.5 vs 658.0 ms

What

SELECT * FROM wide_bench ORDER BY revenue DESC;

So what

2.5× throughput on sorting 50K rows. The throughput advantage is larger than the batch advantage (0.87×) because the result cache serves the sorted result from cached wire bytes after the first execution.

Why it’s fast

First execution: sort 50K rows (~213 ms). Subsequent executions: result cache hit. The 50K-row serialized result is ~2 MB of wire bytes, served in a single send_all().

CTE star-schema

96,007 vs 501 QPS | 192× | p50: 0.083 vs 18.2 ms

What

WITH joined AS (
    SELECT r.name AS region, p.name AS product, s.revenue
    FROM sales s
    JOIN regions r ON s.region_id = r.id
    JOIN products p ON s.product_id = p.id
)
SELECT region, SUM(revenue) AS total
FROM joined GROUP BY region ORDER BY total DESC;

So what

192× throughput—the largest gap in the entire benchmark suite. The 4-row result (one per region) is served from the result cache in 83 microseconds. PostgreSQL re-executes the full plan (planner + 2 hash joins + aggregation + sort) every time, taking 18.2 ms.

Why it’s fast

The result cache is the key: the 4-row result is ~200 bytes of serialized wire data. After the first execution, every subsequent query is a hash lookup + send_all(). The cache is invalidated only when a write bumps total_generation. In a read-only throughput test, the cache is never invalidated.

Benchmark Workload Details

Batch Latency Workloads

Reads & Scans

Full table scan

Filtered scan

Indexed point lookup

Sort

Multi-column sort

Distinct

Writes

Single-row inserts

Bulk update

Insert + delete

Transactional inserts

Aggregation & Analytics

Group + sum

Expression aggregate

Wide aggregate

Window function (ROW_NUMBER)

Window function (RANK)

Joins, Subqueries & Complex Queries

Two-table join

Three-table join + group

Subquery filter

Subquery pipeline

CTE filter chain

CTE pipeline

CTE star-schema

Union

Functions & Generation

Generate series

Scalar functions

Large Dataset (50K rows)

Large sort

Concurrent Throughput Workloads

Simple Reads & Writes

Indexed point lookup

Full table scan

Filtered scan

Single-row insert

Mixed read/write

Aggregation & Analytics

Group + sum

Wide aggregate

Window function (RANK)

Complex Queries

Two-table join

Three-table join + group

CTE pipeline

Subquery pipeline

Large sort

CTE star-schema