Stress benchmarks: 500K-row workloads

Eight workloads at 500K rows test how well the engine scales beyond L2 cache.

Motivation

The standard benchmarks use 5K–50K rows, which fit comfortably in CPU cache. Real analytical workloads often involve hundreds of thousands or millions of rows. The stress benchmarks push mskql to 500K rows per table to expose scaling bottlenecks in sorting, hashing, joining, and window functions.

Workloads

All eight stress benchmarks run against mskql, PostgreSQL, and DuckDB. mskql and PostgreSQL are queried through the PostgreSQL wire protocol; DuckDB runs through its in-process CLI. Each benchmark runs a single iteration: no caching, no warm-up advantage.

Workload	mskql	PG	Duck	mskql/PG	mskql/Duck
stress_full_agg (500K rows)	360ms	764ms	74ms	0.47×	4.90×
stress_high_card_gb (100K groups)	249ms	1,229ms	245ms	0.20×	1.01×
stress_large_sort (500K rows)	408ms	557ms	87ms	0.73×	4.71×
stress_join_2way (500K + 10K)	350ms	2,284ms	109ms	0.15×	3.20×
stress_join_3way (3 large tables)	478ms	2,941ms	128ms	0.16×	3.73×
stress_filtered_expr (WHERE + computed)	298ms	641ms	134ms	0.46×	2.22×
stress_window (RANK, 500K rows)	327ms	482ms	115ms	0.68×	2.85×
stress_nested_cte (3-level CTE)	495ms	1,057ms	185ms	0.47×	2.67×
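
The two ratio columns are mskql's time divided by the other engine's time, so a value below 1× means mskql is faster. As a rough check, the Python sketch below recomputes them from the rounded millisecond figures in the table. The `timings` dictionary and `ratios` helper are illustrative names, not part of the benchmark suite. The published ratios were presumably computed from unrounded timings, so a few recomputed values differ in the second decimal place.

```python
# Timings in milliseconds, copied from the table above: (mskql, PostgreSQL, DuckDB).
timings = {
    "stress_full_agg": (360, 764, 74),
    "stress_high_card_gb": (249, 1229, 245),
    "stress_large_sort": (408, 557, 87),
    "stress_join_2way": (350, 2284, 109),
    "stress_join_3way": (478, 2941, 128),
    "stress_filtered_expr": (298, 641, 134),
    "stress_window": (327, 482, 115),
    "stress_nested_cte": (495, 1057, 185),
}


def ratios(mskql_ms, pg_ms, duck_ms):
    """Return (mskql/PG, mskql/DuckDB); a value below 1.0 means mskql is faster."""
    return mskql_ms / pg_ms, mskql_ms / duck_ms


for name, (m, pg, duck) in timings.items():
    vs_pg, vs_duck = ratios(m, pg, duck)
    # 1 / vs_pg is the "N times faster than PostgreSQL" figure used in the analysis.
    print(f"{name:22s} {vs_pg:.2f}x vs PG ({1 / vs_pg:.1f}x faster)  {vs_duck:.2f}x vs DuckDB")
```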

Analysis

mskql beats PostgreSQL on all 8 workloads, from 1.4× faster (stress_large_sort) to about 6.5× faster (stress_join_2way); stress_join_3way is close behind at roughly 6.2×. The join workloads show the largest gap: PostgreSQL’s hash join with 500K outer rows involves significant memory management overhead that mskql avoids with arena allocation.

DuckDB is faster than mskql on all 8 workloads in this setup, from 1.01× (stress_high_card_gb, essentially a tie) to 4.90× (stress_full_agg). This is expected: DuckDB runs in-process with zero wire overhead, while mskql serializes results through TCP. The in-process benchmarks (Section 2) show mskql winning 15 of 16 workloads when both engines run in-process.

The stress_high_card_gb result (249ms vs 245ms, 1.01×) is notable: mskql’s hash aggregation matches DuckDB’s SIMD-vectorized implementation at 100K groups, even with wire protocol overhead.
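
The high-cardinality case can be pictured as one hash table entry per group. The toy Python sketch below uses a plain dict in place of a real engine's hash table. The data and the `hash_group_sum` helper are hypothetical; this illustrates the general technique, not mskql's implementation.

```python
import random


def hash_group_sum(keys, values):
    """Hash aggregation: one pass over the rows, one table entry per distinct key."""
    sums = {}
    for k, v in zip(keys, values):
        # Each row costs one hash lookup; with many groups, most lookups
        # touch a different entry, so the table's memory behaviour dominates.
        sums[k] = sums.get(k, 0) + v
    return sums


random.seed(0)
n_rows, n_groups = 500_000, 100_000
keys = [random.randrange(n_groups) for _ in range(n_rows)]
values = [1] * n_rows

result = hash_group_sum(keys, values)
print(len(result), "groups,", sum(result.values()), "rows aggregated")
```

With far fewer groups the whole table would stay cache-resident; at 100K groups the lookups spread over a much larger working set, which is the behaviour this workload is meant to exercise.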

What these benchmarks stress

Memory allocation: 500K rows push well beyond L2 cache. Arena allocation avoids per-row malloc/free overhead.
Sort scalability: The radix sort and pdqsort combination handles 500K rows in 408ms, 1.4× faster than PostgreSQL’s tuplesort.
Hash table scaling: The high-cardinality GROUP BY populates the hash table with 100K distinct groups. The Swiss table implementation keeps probe sequences short at this size.
Join materialization: Two- and three-way joins at 500K rows test the hash join build/probe phases under memory pressure.
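
The build and probe phases mentioned for the join workloads follow a standard pattern: build a hash table on the smaller input, then stream the larger input past it. The Python sketch below is a minimal version, assuming an inner equi-join on a single key. The table contents and the `hash_join` function are illustrative and not taken from mskql.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner equi-join: build a hash table on one input, probe with the other."""
    # Build phase: index the (ideally smaller) input by its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: stream the larger input and emit one output row per match.
    out = []
    for row in probe_rows:
        for match in table.get(row[probe_key], ()):
            out.append({**match, **row})
    return out


# Hypothetical data: a small dimension table joined to a larger fact table
# (sizes reduced here to keep the example quick).
customers = [{"cust_id": i, "region": i % 5} for i in range(100)]
orders = [{"order_id": i, "cust_id": i % 150, "amount": i} for i in range(5_000)]

joined = hash_join(customers, orders, "cust_id", "cust_id")
print(len(joined), "joined rows")
```

Building on the smaller side keeps the hash table compact, and the output list stands in for the materialized join result, which is the part that grows with the size of the probe input.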