Benchmark Workload Details

What each workload tests, why it matters, and how mskql handles it

26 batch workloads and 14 throughput workloads. Each entry below shows the exact SQL, explains what database capability it exercises, and describes how mskql’s architecture produces the measured result. For the summary numbers, see Benchmarks.

Batch Latency Workloads

Each batch workload runs a fixed number of iterations of the same query through a single psql session. The timer measures total wall-clock time including connection overhead. Source: bench/bench_vs_pg.py.

Reads & Scans

Full table scan

5,000 rows × 3 columns × 200 iterations  |  0.67×
SELECT * FROM bench_data;

A table with 5,000 rows and 3 columns (id INT, category TEXT, value INT). The query reads every row and sends all columns to the client. Repeated 200 times to amortize connection overhead.

Full scans test raw read throughput: how fast the engine can materialize rows and push them over the wire. This is the baseline for all read workloads. A database that is slow here will be slow everywhere.

mskql’s scan cache converts the row store to flat columnar arrays once (O(N), ~0.1 ms for 5K rows). Subsequent scans memcpy 1,024-row slices into col_block arrays. try_plan_send serializes directly from columnar blocks into a 64 KB wire buffer—5 system calls instead of 5,000. No row materialization occurs.
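
The one-time conversion and slice-out steps can be sketched as follows. This is an illustrative reconstruction, not mskql’s actual code: `row_t`, `scan_cache_t`, and the function names are invented for the example, and only the `id` column is sliced.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_ROWS 1024

/* Hypothetical row-store row mirroring the bench_data table. */
typedef struct { int32_t id; const char *category; int32_t value; } row_t;

/* Columnar scan cache: one contiguous array per column. */
typedef struct {
    int32_t *id;
    const char **category;
    int32_t *value;
    size_t nrows;
} scan_cache_t;

/* One-time O(N) conversion from rows to flat columnar arrays. */
static void scan_cache_build(scan_cache_t *c, const row_t *rows, size_t n) {
    c->id = malloc(n * sizeof *c->id);
    c->category = malloc(n * sizeof *c->category);
    c->value = malloc(n * sizeof *c->value);
    c->nrows = n;
    for (size_t i = 0; i < n; i++) {
        c->id[i] = rows[i].id;
        c->category[i] = rows[i].category;
        c->value[i] = rows[i].value;
    }
}

/* Later scans memcpy 1,024-row slices out of the cached arrays. */
static size_t scan_cache_slice(const scan_cache_t *c, size_t start,
                               int32_t out_id[BLOCK_ROWS]) {
    size_t n = c->nrows - start;
    if (n > BLOCK_ROWS) n = BLOCK_ROWS;
    memcpy(out_id, c->id + start, n * sizeof *out_id);
    return n;
}
```

The point of the shape is that after the first scan, every block fill is a straight memcpy from contiguous memory rather than a walk over row tuples.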

Filtered scan

5,000 rows × 500 iterations  |  0.58×
SELECT * FROM bench_data WHERE category = 'cat_42';

Same 5,000-row table. The WHERE clause filters to rows matching one of 100 categories (~50 rows per category).

Tests predicate evaluation speed. The engine must scan all rows but only emit matching ones. This measures the cost of the filter operator relative to the scan.

The plan executor’s PLAN_FILTER node produces a selection vector (array of matching row indices) instead of copying data. Only matching rows are serialized to the wire. The scan cache hit means the filter operates on contiguous int32_t[] and char*[] arrays in L1 cache.
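
A selection vector in this style can be sketched in a few lines (illustrative function name, not the real PLAN_FILTER code): the filter records matching row indices and copies no data.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Build a selection vector: indices of rows whose category matches.
 * No row data is copied; downstream nodes read through the vector. */
static size_t filter_eq(const char **category, size_t nrows,
                        const char *needle, uint32_t *sel) {
    size_t nsel = 0;
    for (size_t i = 0; i < nrows; i++)
        if (strcmp(category[i], needle) == 0)
            sel[nsel++] = (uint32_t)i;
    return nsel;
}
```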

Indexed point lookup

2,000 iterations  |  0.96×
SELECT * FROM bench_data WHERE id = 42;

Point lookup by primary key (B-tree index). Returns exactly 1 row. Repeated 2,000 times with varying IDs.

Tests index lookup latency—the most common operation in OLTP workloads. The result is nearly tied (0.96×) because both databases use B-tree indexes and the overhead is dominated by network round-trips, not engine work.

mskql’s PLAN_INDEX_SCAN walks the B-tree to find the matching row, then serializes it directly. The result cache does not help here because each query has a different WHERE value. The near-parity shows that mskql’s index implementation is competitive with PostgreSQL’s.

Sort

5,000 rows × 200 iterations  |  0.55×
SELECT * FROM bench_data ORDER BY value DESC;

Sort all 5,000 rows by a single integer column in descending order.

Tests sort performance on a medium dataset. Sorting is a core operation used by ORDER BY, DISTINCT, window functions, and merge joins.

The PLAN_SORT node collects all blocks from its child, builds a flat index array, and sorts it with qsort. The sort key array is bump-allocated from arena.scratch—zero malloc calls. After sorting, blocks are emitted in sorted order directly to the wire.

Multi-column sort

5,000 rows × 200 iterations  |  0.53×
SELECT * FROM bench_data ORDER BY category, value DESC;

Sort by two columns: category ascending, then value descending within each category.

Multi-column sorts are common in reporting queries. The comparator must check the first column, then fall through to the second on ties. This tests the overhead of compound sort keys.

The sort node stores parallel arrays of sort columns and direction flags, all bump-allocated. The comparator loops through columns in order, short-circuiting on the first difference. Columnar storage means each column’s data is contiguous in memory, improving cache locality during the sort.
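
A compound comparator of this shape might look like the following sketch. Since qsort() carries no context argument, the sketch parks the column pointers in file-scope state; the real sort node presumably threads them through differently.

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Parallel sort-key arrays and a direction flag for the second key
 * (+1 ascending, -1 descending). File-scope state stands in for a
 * context argument that qsort() does not provide. */
static const char **g_cat;   /* first key: category, ascending */
static const int32_t *g_val; /* second key: value              */
static int g_val_dir;        /* +1 asc, -1 desc                */

static int cmp_rows(const void *pa, const void *pb) {
    uint32_t a = *(const uint32_t *)pa, b = *(const uint32_t *)pb;
    int c = strcmp(g_cat[a], g_cat[b]);       /* check first key      */
    if (c != 0) return c;                     /* short-circuit        */
    if (g_val[a] != g_val[b])                 /* fall through on ties */
        return (g_val[a] < g_val[b] ? -1 : 1) * g_val_dir;
    return 0;
}

/* Sort an index array; the column data itself never moves. */
static void sort_indices(uint32_t *idx, size_t n,
                         const char **cat, const int32_t *val,
                         int val_dir) {
    g_cat = cat; g_val = val; g_val_dir = val_dir;
    qsort(idx, n, sizeof *idx, cmp_rows);
}
```

Sorting indices rather than rows keeps each column’s data in place, which is what preserves the cache locality the paragraph describes.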

Distinct

100 categories × 500 iterations  |  0.15×
SELECT DISTINCT category FROM bench_data;

Extract the 100 unique category values from 5,000 rows.

DISTINCT tests deduplication speed. The ~7× speedup (0.15×) is one of the largest gaps because the result set is tiny (100 rows) while the input is large (5,000).

The DISTINCT path uses the hash aggregation path with no aggregate functions. The hash table is bump-allocated, and the 100-row result is serialized in a single wire buffer flush. The result cache further accelerates repeated executions.

Writes

Single-row inserts

10,000 inserts  |  0.35×
INSERT INTO bench_data (id, category, value) VALUES (1, 'cat_1', 42);
-- ... repeated 10,000 times with different values

Tests per-statement insert overhead. Each insert is a separate SQL statement parsed, planned, and executed independently. The 3× speedup reflects mskql’s lack of WAL, fsync, and MVCC bookkeeping.

Each insert appends a row to the table’s dynamic array and bumps the table generation (invalidating the scan cache). The arena is reset between statements. No durability overhead: no WAL write, no fsync, no checkpoint. The result cache is invalidated by the generation bump.

Bulk update

1,000 of 5,000 rows × 200 iterations  |  0.23×
UPDATE bench_data SET value = value + 1 WHERE category = 'cat_42';

Update ~50 rows matching a category filter. Repeated 200 times across 20 different categories (1,000 total rows updated per iteration set).

Tests in-place update speed. The ~4× speedup (0.23×) is the largest gap among the write workloads. Updates are expensive in PostgreSQL because of MVCC: each update creates a new tuple version and marks the old one dead.

Rows are updated in place—no tuple versioning, no dead tuple cleanup. The scan cache is invalidated by the generation bump, but the update itself is a direct cell modification in the row store. No WAL write, no vacuum needed.

Insert + delete

2,000 rows, delete half × 50 iterations  |  0.32×
-- Insert 2,000 rows, then:
DELETE FROM bench_data WHERE id % 2 = 0;
-- Repeated 50 times (re-inserting each time)

Tests the combined cost of bulk insert + bulk delete. This exercises the full write path including row removal and generation tracking.

Deletes remove rows from the dynamic array and bump the table generation. No dead tuple tracking, no vacuum. The 3× speedup is primarily from avoiding WAL and MVCC overhead.

Transactional inserts

100 transactions, 50 inserts each  |  0.86×
BEGIN;
INSERT INTO bench_data VALUES (1, 'cat_1', 42);
-- ... 50 inserts
COMMIT;

Tests transaction overhead. The near-parity (0.86×) is notable because PostgreSQL’s transactions are backed by WAL with group commit, while mskql uses lazy copy-on-write snapshots with no durability.

BEGIN records table names and generation numbers in O(1). The first write to a table triggers table_deep_copy()—O(N) for that table only. COMMIT discards the snapshot (free). ROLLBACK swaps the saved copy back. The O(N) deep copy on first write is why transactions are the closest result to PostgreSQL.
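
The begin/write/commit/rollback lifecycle can be sketched for a single table as below. Names like `table_t` and `txn_t` are invented for the example, and the real table_deep_copy() copies full row data, not just an int array.

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Minimal sketch of lazy copy-on-write transactions over one table. */
typedef struct {
    int32_t *rows;
    size_t nrows;
    uint64_t generation;  /* bumped on every write */
} table_t;

typedef struct {
    table_t *live;
    table_t snapshot;     /* filled lazily on first write */
    int copied;
} txn_t;

static void txn_begin(txn_t *t, table_t *tab) {  /* O(1): record only */
    t->live = tab;
    t->copied = 0;
}

static void txn_write(txn_t *t, size_t i, int32_t v) {
    if (!t->copied) {     /* first write: O(N) deep copy of this table */
        t->snapshot.nrows = t->live->nrows;
        t->snapshot.rows = malloc(t->snapshot.nrows * sizeof *t->snapshot.rows);
        memcpy(t->snapshot.rows, t->live->rows,
               t->snapshot.nrows * sizeof *t->snapshot.rows);
        t->copied = 1;
    }
    t->live->rows[i] = v;
    t->live->generation++;
}

static void txn_commit(txn_t *t) {    /* discard the snapshot: free */
    if (t->copied) free(t->snapshot.rows);
}

static void txn_rollback(txn_t *t) {  /* swap the saved copy back */
    if (t->copied) {
        memcpy(t->live->rows, t->snapshot.rows,
               t->snapshot.nrows * sizeof *t->live->rows);
        free(t->snapshot.rows);
    }
}
```

Read-only transactions never trigger the copy at all, which is why BEGIN and COMMIT are effectively free until the first write.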

Aggregation & Analytics

Group + sum

5,000 rows × 500 iterations  |  0.12×
SELECT category, SUM(value), COUNT(*), AVG(value)
FROM bench_data GROUP BY category;

Group 5,000 rows by 100 categories and compute 3 aggregates per group.

Tests hash aggregation speed. The 8× speedup (0.12×) comes from columnar processing: the aggregator reads contiguous integer arrays instead of extracting values from row tuples.

PLAN_HASH_AGG builds a hash table keyed by group columns. For each 1,024-row block, it looks up the group and updates accumulators in a tight loop. The hash table and accumulators are bump-allocated. The 100-row result is serialized in a single wire buffer flush. The result cache makes repeated executions near-instant.
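
The per-block accumulator loop can be sketched as follows. To stay short, the sketch assumes small integer group keys and indexes accumulators directly; the real PLAN_HASH_AGG uses a general hash table keyed by the group columns.

```c
#include <stdint.h>
#include <string.h>

#define MAX_GROUPS 128  /* 100 categories fit; real code grows a hash table */

typedef struct { int64_t sum; int64_t count; int used; } accum_t;

/* Consume one columnar block: group ids and values are contiguous
 * arrays, and accumulators are updated in a tight loop. */
static void agg_block(accum_t acc[MAX_GROUPS],
                      const int32_t *group, const int32_t *value, size_t n) {
    for (size_t i = 0; i < n; i++) {
        accum_t *a = &acc[group[i]];
        a->sum += value[i];
        a->count++;
        a->used = 1;
    }
}
/* AVG is derived at emit time as sum / count. */
```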

Expression aggregate

SUM(price × quantity), 5,000 rows × 500 iterations  |  0.10×
SELECT category, SUM(price * quantity)
FROM bench_data GROUP BY category;

Tests expression evaluation inside aggregates. The 10× speedup (0.10×) is the largest gap in the batch suite. The expression price * quantity must be evaluated for every row before aggregation.

The PLAN_EXPR_PROJECT node evaluates expressions in columnar batches, feeding results into the hash aggregator. Combined with the result cache, repeated executions skip all computation entirely.

Wide aggregate

7 columns, 50,000 rows × 20 iterations  |  0.14×
SELECT region, product,
       SUM(revenue), AVG(cost), MIN(margin),
       MAX(units), COUNT(*)
FROM wide_bench GROUP BY region, product;

Tests aggregation on a wider table with more columns and a larger dataset. The 7× speedup shows that the columnar advantage scales with both row count and column count.

Each column is stored as a contiguous typed array in the scan cache. The hash aggregator processes each column independently, keeping the working set in L1 cache. A 4-column INT block is 16 KB—well within the 32–48 KB L1 data cache.

Window function (ROW_NUMBER)

5,000 rows × 20 iterations  |  0.55×
SELECT *, ROW_NUMBER() OVER (PARTITION BY category ORDER BY value DESC)
FROM bench_data;

Tests window function execution with partitioning and ordering. Window functions are among the most complex query operations, requiring sorting within each partition.

The PLAN_WINDOW node sorts all input blocks by partition key + order key, then walks the sorted data to compute window values. All sort indices and window accumulators are bump-allocated from arena.scratch.

Window function (RANK)

50,000 rows × 5 iterations  |  0.50×
SELECT *, RANK() OVER (PARTITION BY region ORDER BY revenue DESC)
FROM wide_bench;

Tests window functions on a larger dataset (50K rows). RANK handles ties differently from ROW_NUMBER, requiring comparison with the previous row’s order value.

Same window node, but with RANK semantics: equal order values get the same rank, and the next rank skips ahead. The 10× larger dataset shows the columnar advantage holds at scale.
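
The tie-handling difference is small enough to show directly. A sketch of RANK over an order key already sorted within one partition (ROW_NUMBER would simply emit i + 1):

```c
#include <stdint.h>
#include <stddef.h>

/* RANK over a descending-sorted order key: ties share a rank, and the
 * next distinct value skips ahead past the tied run. */
static void rank_desc(const int32_t *key, int32_t *rank, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && key[i] == key[i - 1])
            rank[i] = rank[i - 1];       /* tie: same rank as previous */
        else
            rank[i] = (int32_t)(i + 1);  /* position skips past ties   */
    }
}
```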

Joins, Subqueries & Complex Queries

Two-table join

500 + 2,000 rows × 50 iterations  |  0.44×
SELECT c.name, o.amount, o.created
FROM customers c
JOIN orders o ON c.id = o.customer_id;

Tests hash join performance on a typical OLTP join pattern: a small dimension table (500 customers) joined to a larger fact table (2,000 orders).

PLAN_HASH_JOIN builds a hash table from the smaller (inner) table, then probes it for each block of the larger (outer) table. The hash table is bump-allocated. The join cache (table.join_cache) stores the pre-built hash table and is invalidated by the same generation counter as the scan cache.
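
The build/probe structure can be sketched with a chained hash table over the inner table’s join key (illustrative types and a fixed slot count; the real join cache sizes and allocates differently).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HASH_SLOTS 1024  /* power of two; real code sizes to the input */

/* Chained hash table over the smaller (build) side's join key. */
typedef struct bucket { int32_t key; uint32_t row; struct bucket *next; } bucket_t;

typedef struct {
    bucket_t *slots[HASH_SLOTS];
    bucket_t *pool;   /* one allocation for all buckets, bump-style */
    size_t used;
} join_ht_t;

static void ht_build(join_ht_t *ht, const int32_t *keys, size_t n) {
    memset(ht->slots, 0, sizeof ht->slots);
    ht->pool = malloc(n * sizeof *ht->pool);
    ht->used = 0;
    for (size_t i = 0; i < n; i++) {
        bucket_t *b = &ht->pool[ht->used++];
        uint32_t h = (uint32_t)keys[i] & (HASH_SLOTS - 1);
        b->key = keys[i];
        b->row = (uint32_t)i;
        b->next = ht->slots[h];
        ht->slots[h] = b;
    }
}

/* Probe one outer key; returns the build-side row index or -1. */
static long ht_probe(const join_ht_t *ht, int32_t key) {
    for (bucket_t *b = ht->slots[(uint32_t)key & (HASH_SLOTS - 1)]; b; b = b->next)
        if (b->key == key) return (long)b->row;
    return -1;
}
```

Caching the built table across queries, as the join cache does, means repeated joins pay only the probe cost until a write invalidates it.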

Three-table join + group

50,000 rows × 5 iterations  |  0.32×
SELECT r.name, p.name, SUM(s.revenue)
FROM sales s
JOIN regions r ON s.region_id = r.id
JOIN products p ON s.product_id = p.id
GROUP BY r.name, p.name;

Tests a star-schema join pattern: one large fact table joined to two small dimension tables, followed by aggregation. This is the canonical analytical query pattern.

Two hash joins are chained, feeding into a hash aggregation node. All three hash tables are bump-allocated. The result (a small number of region × product combinations) is cached by the result cache, making repeated executions near-instant.

Subquery filter

IN subquery × 50 iterations  |  0.43×
SELECT * FROM orders
WHERE customer_id IN (
    SELECT id FROM customers WHERE region = 'West'
);

Tests IN subquery execution. The subquery is converted to a hash semi-join internally, avoiding the naive nested-loop approach.

PLAN_HASH_SEMI_JOIN builds a hash set from the subquery result, then probes it for each row of the outer query. This is O(N + M) instead of O(N × M) for a naive correlated subquery.
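
The O(N + M) shape can be sketched as a build pass over the subquery result followed by a probe pass over the outer rows. To keep it short, the sketch assumes dense integer ids below a fixed bound and uses a membership array instead of a real hash set.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define MAX_ID 65536  /* sketch only: assumes ids < MAX_ID; real code hashes */

/* IN-subquery as a semi-join: build a membership set from the subquery
 * result (O(M)), then probe once per outer row (O(N)), for O(N + M)
 * total rather than O(N * M) for a nested loop. */
static size_t semi_join(const int32_t *sub_ids, size_t m,
                        const int32_t *outer_ids, size_t n,
                        uint32_t *sel) {
    static uint8_t member[MAX_ID];
    memset(member, 0, sizeof member);
    for (size_t i = 0; i < m; i++)      /* build */
        member[sub_ids[i]] = 1;
    size_t nsel = 0;
    for (size_t i = 0; i < n; i++)      /* probe */
        if (member[outer_ids[i]])
            sel[nsel++] = (uint32_t)i;
    return nsel;
}
```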

Subquery pipeline

filter + sort + limit, 50,000 rows × 20 iterations  |  0.19×
SELECT * FROM (
    SELECT * FROM wide_bench WHERE region = 'North'
    ORDER BY revenue DESC
    LIMIT 100
) sub;

Tests a multi-stage pipeline: filter → sort → limit, wrapped in a subquery. The ~5× speedup (0.19×) shows the benefit of the block-oriented executor on compound operations.

The plan is a chain: SeqScan → Filter → Sort → Limit. Each node pulls blocks from its child on demand. The limit node stops pulling after 100 rows, avoiding unnecessary sort work on the remaining data.
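
The pull model can be sketched as a limit node wrapping a child "next block" callback (illustrative types; the demo child just counts upward).

```c
#include <stdint.h>
#include <stddef.h>

/* Pull-model sketch: each node exposes a "give me the next block"
 * callback; the limit node stops pulling once it has emitted enough. */
typedef size_t (*next_block_fn)(void *state, int32_t out[1024]);

typedef struct {
    next_block_fn child;
    void *child_state;
    size_t limit, emitted;
} limit_node_t;

static size_t limit_next(limit_node_t *l, int32_t out[1024]) {
    if (l->emitted >= l->limit) return 0;       /* stop pulling early */
    size_t n = l->child(l->child_state, out);
    if (n > l->limit - l->emitted)
        n = l->limit - l->emitted;              /* clamp the last block */
    l->emitted += n;
    return n;
}

/* Demo child for the sketch: emits rows 0, 1, 2, ... forever. */
static size_t counter_child(void *state, int32_t out[1024]) {
    int32_t *next = state;
    for (size_t i = 0; i < 1024; i++) out[i] = (*next)++;
    return 1024;
}
```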

CTE filter chain

5,000 rows × 200 iterations  |  0.43×
WITH filtered AS (
    SELECT * FROM bench_data WHERE value > 50
)
SELECT category, COUNT(*) FROM filtered
GROUP BY category ORDER BY COUNT(*) DESC;

Tests CTE materialization and subsequent aggregation. CTEs are a common pattern for breaking complex queries into readable steps.

The CTE is materialized into a temporary table, then the outer query runs against it. The scan cache is built for the temp table on first access. The result cache caches the final wire output.

CTE pipeline

agg → sort → limit, 50,000 rows × 20 iterations  |  0.73×
WITH summary AS (
    SELECT region, SUM(revenue) AS total
    FROM wide_bench GROUP BY region
)
SELECT * FROM summary ORDER BY total DESC LIMIT 5;

Tests a multi-stage analytical pipeline: aggregate in a CTE, then sort and limit the result. The closer ratio (0.73×) reflects that PostgreSQL’s query optimizer handles this pattern well.

The CTE materializes the aggregation result, then the outer query sorts and limits it. Both stages use the block executor. The result cache makes repeated executions near-instant.

CTE star-schema

join → agg → sort, 50,000 rows × 5 iterations  |  0.26×
WITH joined AS (
    SELECT r.name AS region, p.name AS product, s.revenue
    FROM sales s
    JOIN regions r ON s.region_id = r.id
    JOIN products p ON s.product_id = p.id
)
SELECT region, SUM(revenue) AS total
FROM joined GROUP BY region
ORDER BY total DESC;

The most complex batch workload: two joins, aggregation, and sorting in a single CTE pipeline. This is a realistic analytical query on a star schema.

The CTE materializes the join result, then the outer query aggregates and sorts. In the throughput suite, this same query runs at 94,525 QPS (190× PostgreSQL) because the 4-row result is served from the result cache in microseconds.

Union

two 2,000-row tables × 50 iterations  |  0.54×
SELECT id, category, value FROM bench_data_a
UNION
SELECT id, category, value FROM bench_data_b;

Tests set operation performance. UNION requires deduplication across two result sets, combining scan, hash, and serialization costs.

PLAN_SET_OP pulls blocks from both children and uses a hash table for deduplication. The hash table is bump-allocated. The deduplicated result is serialized directly to the wire.

Functions & Generation

Generate series

10,000 rows × 200 iterations  |  0.31×
SELECT * FROM generate_series(1, 10000);

Tests table-valued function performance. The ~3× speedup (0.31×) was achieved by generating integers directly into columnar blocks instead of materializing a temporary table.

PLAN_GENERATE_SERIES produces 1,024-row blocks of contiguous int32_t arrays in a tight loop—zero malloc, zero row-store materialization. The blocks are serialized directly to the wire via try_plan_send.
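
The block-producing loop is simple enough to sketch directly (illustrative function name, and only the default step of 1):

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK_ROWS 1024

/* Fill one 1,024-row block of the series [*next, stop]; returns rows
 * produced (0 once exhausted). No temp table, no per-row allocation. */
static size_t series_next_block(int32_t *next, int32_t stop,
                                int32_t out[BLOCK_ROWS]) {
    size_t n = 0;
    while (*next <= stop && n < BLOCK_ROWS)
        out[n++] = (*next)++;
    return n;
}
```

For generate_series(1, 10000) this yields ten blocks, each handed straight to the wire serializer.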

Scalar functions

4 functions, 5,000 rows × 200 iterations  |  0.40×
SELECT UPPER(name), LENGTH(name), ABS(score),
       ROUND(score::numeric, 2)
FROM bench_data;

Tests expression projection with multiple scalar functions. The 2.5× speedup (0.40×) comes from batched wire send and scan cache hits.

PLAN_EXPR_PROJECT evaluates all 4 functions per row within each 1,024-row block, writing results into output col_block arrays. The input columns come from the scan cache (contiguous arrays), and the output is serialized directly to the wire—no row materialization.

Large Dataset (50K rows)

Large sort

50,000 rows × 10 iterations  |  0.65×
SELECT * FROM wide_bench ORDER BY revenue DESC;

Tests sort performance at scale. The 0.65× ratio shows that sorting 50K rows still benefits from operating on contiguous columnar sort-key arrays.

The sort node collects all ~49 blocks (50K / 1024), builds a flat index array, and sorts with qsort. The sorted blocks are then serialized to the wire. The remaining gap is from mskql’s batched wire send (fewer system calls) and scan cache (contiguous sort key array).

Concurrent Throughput Workloads

Each throughput workload runs 8 threads for 5 seconds, measuring queries per second (QPS) and median latency (p50). Source: bench/bench_throughput.c.

Simple Reads & Writes

Indexed point lookup

79,102 vs 23,461 QPS  |  3.37×  |  p50: 0.101 vs 0.254 ms
SELECT * FROM tp_data WHERE id = $1;

Point lookup by primary key with a random ID each iteration. 8 threads, each with a persistent connection.

The most common OLTP operation. The 3.37× throughput advantage comes from mskql’s lower per-query overhead: no planner, no MVCC snapshot, no buffer pool lookup.

Each query is parsed, the index is walked, and the single matching row is serialized. The arena is reset between queries. At 79,102 QPS, each query completes in ~101 microseconds including network round-trip.

Full table scan

122 vs 120 QPS  |  1.02×  |  p50: 66.0 vs 65.8 ms
SELECT * FROM tp_data;

5,000 rows, 8 concurrent threads.

Dead heat. Under concurrent load, full scans are network-bound: serializing 5,000 rows over TCP dominates the total time. The engine speed difference is invisible.

Both databases spend most of their time in write() system calls. The mskql batched 64 KB buffer helps in batch mode but under concurrent load the network becomes the bottleneck.

Filtered scan

245 vs 241 QPS  |  1.02×  |  p50: 32.6 vs 33.0 ms
SELECT * FROM tp_data WHERE category = 'cat_42';

Near-parity, same as full scan. The filter reduces the result set but the scan cost is similar. Network-bound under concurrent load.

The selection vector avoids copying non-matching rows, but the result set (~50 rows) is small enough that serialization time is negligible. The bottleneck is the scan itself.

Single-row insert

63,253 vs 39,189 QPS  |  1.61×  |  p50: 0.115 vs 0.195 ms
INSERT INTO tp_data (id, category, value) VALUES ($1, $2, $3);

Tests concurrent write throughput. The 1.61× advantage is smaller than the 2.7× batch advantage because PostgreSQL’s group commit amortizes WAL overhead across concurrent writers.

Each insert appends to the row dynamic array and bumps the generation counter. The single-threaded executor serializes all inserts, but each one is fast enough (115μs) to sustain 63K QPS across 8 threads.

Mixed read/write

54,812 vs 58,001 QPS  |  0.94×  |  p50: 0.145 vs 0.127 ms
-- 80% reads:
SELECT * FROM tp_data WHERE id = $1;
-- 20% writes:
UPDATE tp_data SET value = $1 WHERE id = $2;

The only workload where PostgreSQL wins. PostgreSQL’s MVCC allows concurrent readers and writers without blocking. The mskql single-threaded executor serializes all queries, so writers block readers.

This is a fundamental architectural limitation: mskql processes one query at a time. Adding MVCC or a concurrent executor would close this gap but would add significant complexity to the ~45.1K-line codebase.

Aggregation & Analytics

Group + sum

94,525 vs 13,929 QPS  |  6.79×  |  p50: 0.084 vs 0.652 ms
SELECT category, SUM(value), COUNT(*), AVG(value)
FROM tp_data GROUP BY category;

Nearly 7× throughput advantage on aggregation. The result cache serves the 100-row result from cached wire bytes after the first execution, skipping all computation.

First execution: hash aggregation on columnar blocks (~0.13 ms). Subsequent executions: result cache hit, single send_all() of cached bytes (~0.01 ms). The cache is invalidated only when a write bumps total_generation.
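
The hit/miss logic can be sketched as below. The sketch caches a single entry keyed by query text; the real cache holds many entries, stores serialized wire bytes, and shares the same total_generation counter across tables.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Single-entry result cache: keyed by query text, invalidated whenever
 * any write bumps total_generation. */
typedef struct {
    char query[256];
    uint64_t generation;     /* total_generation when the entry was filled */
    const void *wire_bytes;  /* pre-serialized result */
    size_t wire_len;
    int valid;
} result_cache_t;

static const void *cache_lookup(const result_cache_t *c, const char *q,
                                uint64_t total_generation, size_t *len) {
    if (c->valid && c->generation == total_generation &&
        strcmp(c->query, q) == 0) {
        *len = c->wire_len;
        return c->wire_bytes;  /* hit: just send_all() these bytes */
    }
    return NULL;               /* miss: execute, then store */
}

static void cache_store(result_cache_t *c, const char *q,
                        uint64_t total_generation,
                        const void *bytes, size_t len) {
    strncpy(c->query, q, sizeof c->query - 1);
    c->query[sizeof c->query - 1] = '\0';
    c->generation = total_generation;
    c->wire_bytes = bytes;
    c->wire_len = len;
    c->valid = 1;
}
```

Because validity is checked against the current generation rather than cleared eagerly, a single counter compare is all a write costs the cache.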

Wide aggregate

8,412 vs 1,036 QPS  |  8.12×  |  p50: 0.952 vs 10.8 ms
SELECT region, product,
       SUM(revenue), AVG(cost), MIN(margin),
       MAX(units), COUNT(*)
FROM wide_bench GROUP BY region, product;

Over 8× throughput on a wider aggregation (50K rows, 7 columns). The advantage holds at scale because the columnar executor and result cache both benefit from larger datasets.

Same mechanism as the simple aggregate: first execution builds the result via columnar hash aggregation, subsequent executions serve from the result cache. The 50K-row scan cache is ~200 KB of contiguous data.

Window function (RANK)

14 vs 12 QPS  |  1.22×  |  p50: 636.6 vs 652.7 ms
SELECT *, RANK() OVER (PARTITION BY region ORDER BY revenue DESC)
FROM wide_bench;

Window functions on 50K rows are expensive in both databases. The modest 1.22× advantage shows that the result cache does not help here (the 50K-row result is too large to cache efficiently) and the sort dominates.

Each execution sorts 50K rows and computes RANK values. The columnar sort is slightly faster due to contiguous sort key arrays, but the O(N log N) cost dominates.

Complex Queries

Two-table join

562 vs 302 QPS  |  1.86×  |  p50: 14.1 vs 26.3 ms
SELECT c.name, o.amount, o.created
FROM customers c JOIN orders o ON c.id = o.customer_id;

Nearly 2× throughput on a join producing ~2K result rows. The result is large enough that the result cache helps but serialization still takes significant time.

The hash join builds a hash table from the 500-row customers table and probes it for each block of the 2K-row orders table. The join cache stores the pre-built hash table across queries.

Three-table join + group

36,171 vs 462 QPS  |  78.3×  |  p50: 0.221 vs 22.3 ms
SELECT r.name, p.name, SUM(s.revenue)
FROM sales s
JOIN regions r ON s.region_id = r.id
JOIN products p ON s.product_id = p.id
GROUP BY r.name, p.name;

78× throughput. The second-largest gap in the entire benchmark suite. The small result set (region × product combinations) fits entirely in the result cache.

First execution: two hash joins + hash aggregation on columnar blocks (~0.35 ms). All subsequent executions: result cache hit, single send_all(). PostgreSQL re-executes the full plan (planner + 2 hash joins + aggregation) every time.

CTE pipeline

5,606 vs 1,920 QPS  |  2.92×  |  p50: 1.430 vs 4.312 ms
WITH summary AS (
    SELECT region, SUM(revenue) AS total
    FROM wide_bench GROUP BY region
)
SELECT * FROM summary ORDER BY total DESC LIMIT 5;

Nearly 3× throughput on a CTE pipeline. The result cache serves the 5-row result from cached bytes after the first execution.

First execution materializes the CTE, aggregates, sorts, and limits. The 5-row result is cached. Subsequent executions skip all computation.

Subquery pipeline

1,170 vs 898 QPS  |  1.30×  |  p50: 6.698 vs 8.775 ms
SELECT * FROM (
    SELECT * FROM wide_bench WHERE region = 'North'
    ORDER BY revenue DESC LIMIT 100
) sub;

Modest 1.3× advantage. The 100-row result is cached, but the first execution involves sorting ~10K filtered rows, which is expensive in both databases.

The plan chain (scan → filter → sort → limit) runs in the block executor. The limit node stops pulling after 100 rows. The result cache serves subsequent executions.

Large sort

30 vs 12 QPS  |  2.54×  |  p50: 212.8 vs 661.0 ms
SELECT * FROM wide_bench ORDER BY revenue DESC;

2.5× throughput on sorting 50K rows. The throughput advantage is larger than the batch advantage (0.65×) because the result cache serves the sorted result from cached wire bytes after the first execution.

First execution: sort 50K rows (~213 ms). Subsequent executions: result cache hit. The 50K-row serialized result is ~2 MB of wire bytes, served in a single send_all().

CTE star-schema

94,525 vs 496 QPS  |  190×  |  p50: 0.085 vs 18.4 ms
WITH joined AS (
    SELECT r.name AS region, p.name AS product, s.revenue
    FROM sales s
    JOIN regions r ON s.region_id = r.id
    JOIN products p ON s.product_id = p.id
)
SELECT region, SUM(revenue) AS total
FROM joined GROUP BY region ORDER BY total DESC;

190× throughput—the largest gap in the entire benchmark suite. The 4-row result (one per region) is served from the result cache in 85 microseconds. PostgreSQL re-executes the full plan (planner + 2 hash joins + aggregation + sort) every time, taking 18.4 ms.

The result cache is the key: the 4-row result is ~200 bytes of serialized wire data. After the first execution, every subsequent query is a hash lookup + send_all(). The cache is invalidated only when a write bumps total_generation. In a read-only throughput test, the cache is never invalidated.