16 in-process engine benchmarks. 101 wire-level batch workloads. 15 concurrent throughput tests. In the fairest comparison—both engines linked as libraries, zero wire overhead—mskql wins 15 of 16 workloads against DuckDB. Full scan 13×, window functions 5×, distinct 3.4×, insert 102× faster. DuckDB wins only group_by (1.06×). Against PostgreSQL: every cached batch workload and 14 of 15 throughput tests.
The advantage traces to three architectural properties: (1) zero per-query
allocation after warmup—the arena rewinds, nothing is freed; (2) the scan cache
converts row-store to flat columnar arrays once per table write, so repeated reads are
a memcpy of contiguous memory; (3) result bytes accumulate into a
64 KB buffer before a single write() syscall. These are structural
properties of the system, not tuning parameters.
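As a concrete illustration, here is a minimal sketch of what those three properties can look like in code. The names (`arena_t`, `scan_cache_t`, `out_buf_t`) and layouts are illustrative assumptions, not the actual mskql internals:

```c
/* Minimal sketch of the three properties described above -- illustrative
 * names only, not the actual mskql internals. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* (1) Bump arena: a per-query allocation is a pointer bump; "freeing"
 *     everything after the query is a single rewind of the offset. */
typedef struct { uint8_t *base; size_t cap, off; } arena_t;
static void *arena_alloc(arena_t *a, size_t n) {
    void *p = a->base + a->off;          /* no malloc, no free list   */
    a->off += (n + 15) & ~(size_t)15;    /* 16-byte aligned bump      */
    return p;
}
static void arena_rewind(arena_t *a) { a->off = 0; }

/* (2) Scan cache: a flat columnar copy of a table, rebuilt lazily the
 *     first time the table is read after a write; repeated scans are a
 *     memcpy of this contiguous array. */
typedef struct { int64_t *col; size_t nrows; uint64_t table_version; } scan_cache_t;

/* (3) Output buffer: result bytes accumulate and reach the kernel in one
 *     write() instead of one syscall per row. */
typedef struct { uint8_t buf[64 * 1024]; size_t len; int fd; } out_buf_t;
static void out_flush(out_buf_t *o) {
    if (o->len && write(o->fd, o->buf, o->len) < 0) { /* error handling elided */ }
    o->len = 0;
}
static void out_append(out_buf_t *o, const void *p, size_t n) {
    if (o->len + n > sizeof o->buf) out_flush(o);
    memcpy(o->buf + o->len, p, n); o->len += n;
}
```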
→ Design philosophy: why abstraction is not free
Correctness is verified by verify_correctness.c, which confirms row-for-row identical results between
mskql and DuckDB on all benchmark queries.
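For flavour, here is a sketch of what such a row-for-row check can look like. The DuckDB side uses the real C API (`duckdb_value_varchar`); `mskql_row_text()` is a hypothetical accessor standing in for whatever libmskql exposes, so this is not the actual verify_correctness.c:

```c
/* Sketch of a row-for-row comparison in the spirit of verify_correctness.c. */
#include <duckdb.h>
#include <stdio.h>
#include <string.h>

/* hypothetical accessor: text of column `col` in row `row` of the last mskql result */
extern const char *mskql_row_text(size_t row, size_t col);

static int results_match(duckdb_result *duck, size_t nrows, size_t ncols) {
    for (size_t r = 0; r < nrows; r++) {
        for (size_t c = 0; c < ncols; c++) {
            char *d = duckdb_value_varchar(duck, c, r);   /* DuckDB cell as text */
            const char *m = mskql_row_text(r, c);
            int same = (d && m) ? strcmp(d, m) == 0 : (d == NULL && m == NULL);
            if (d) duckdb_free(d);
            if (!same) {
                fprintf(stderr, "mismatch at row %zu col %zu\n", r, c);
                return 0;
            }
        }
    }
    return 1;   /* every cell identical */
}
```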
Three benchmark tiers, each measuring a different aspect of performance:
Tier 1 (in-process): the fairest DuckDB comparison. A C harness (bench/bench_vs_duck.c)
links both engines as libraries—libmskql.a and libduckdb—in
the same process. Both run in-memory with identical table schemas and data. Each benchmark
creates fresh databases, runs setup, then times N iterations of the same query. There is
no wire protocol, no CLI process spawn, no serialization overhead on either
side. This isolates pure query engine speed.
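The shape of that timing loop is sketched below. The DuckDB calls are the real C API from duckdb.h; `mskql_exec()` is a hypothetical stand-in for the libmskql entry point, and the SQL and iteration count are placeholders, so treat this as an outline rather than the actual bench_vs_duck.c:

```c
/* Sketch of the in-process harness loop: both engines in one address space,
 * same SQL, wall-clock over N iterations. */
#include <duckdb.h>
#include <stdio.h>
#include <time.h>

extern int mskql_exec(const char *sql);   /* hypothetical libmskql entry point */

static double now_ms(void) {
    struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void) {
    duckdb_database db; duckdb_connection con;
    duckdb_open(NULL, &db);               /* in-memory DuckDB database */
    duckdb_connect(db, &con);

    const char *sql = "SELECT count(*), sum(v) FROM t";   /* placeholder query */
    int iters = 100;

    double t0 = now_ms();
    for (int i = 0; i < iters; i++) mskql_exec(sql);
    double mskql_ms = now_ms() - t0;

    t0 = now_ms();
    for (int i = 0; i < iters; i++) {
        duckdb_result res;
        duckdb_query(con, sql, &res);      /* same SQL, same process */
        duckdb_destroy_result(&res);
    }
    double duck_ms = now_ms() - t0;

    printf("mskql %.1f ms  duckdb %.1f ms  ratio %.2fx\n",
           mskql_ms, duck_ms, mskql_ms / duck_ms);
    duckdb_disconnect(&con); duckdb_close(&db);
    return 0;
}
```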
The wire-level benchmark (Tier 2) runs mskql and PostgreSQL over pgwire but runs DuckDB via its CLI—an in-process invocation with zero network overhead. That asymmetry inflated DuckDB’s apparent advantage on scan and sort workloads. The in-process harness eliminates this bias: both engines execute through their C API in the same address space.
Tier 2 (wire-level batch): all three databases are tested on the same Apple M-series laptop, same loopback interface.
The benchmark script (bench/bench_vs_pg.py) generates identical SQL workloads,
runs setup against all databases, then times the benchmark queries. mskql and PostgreSQL are
timed via psql -f over pgwire; DuckDB is timed via its CLI (in-process).
Each workload has two variants: cached (result cache active) and
no-cache (_nc suffix—unique SQL comment busts the cache).
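A sketch of the mechanics, with illustrative names rather than the actual mskql implementation: the result cache is keyed on the exact SQL text, so the _nc variant appends a unique comment and the key never matches a cached entry.

```c
/* Illustrative sketch only -- not the actual mskql cache code.
 * Cache key = exact SQL text; the "_nc" variant appends a unique comment
 * so every run is a guaranteed miss. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

typedef struct {
    char     sql[512];       /* exact query text = cache key             */
    uint8_t *wire_bytes;     /* serialized result rows, replayed on hit  */
    size_t   len;
} cache_entry;

/* hit: reply with wire_bytes, no parsing/planning/execution; miss: run the query */
static const cache_entry *cache_lookup(const cache_entry *e, size_t n, const char *sql) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(e[i].sql, sql) == 0) return &e[i];
    return NULL;
}

/* build the no-cache variant: same query, unique comment, guaranteed miss */
static void make_nc_variant(const char *base, char *out, size_t cap) {
    snprintf(out, cap, "%s /* nc-%ld */", base, (long)time(NULL));
}
```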
Note: The DuckDB column in wire-level results includes CLI startup but no network overhead, while mskql and PostgreSQL include full pgwire round-trips. This favours DuckDB. See Tier 1 for the apples-to-apples comparison.
Tier 3 (concurrent throughput): a multi-threaded C benchmark (bench/bench_throughput.c) opens 8 persistent
connections speaking raw pgwire and fires queries for 5 seconds per workload.
It measures QPS and median latency (p50). DuckDB is excluded because it does not expose
a wire-protocol server.
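The measurement loop looks roughly like the sketch below. It assumes libpq as the pgwire client for brevity (the real harness speaks raw pgwire directly and also samples per-query latency for the p50 column); the connection string, port, database name, and SQL are placeholder assumptions:

```c
/* Sketch of the throughput measurement: N persistent connections, fixed
 * duration, total queries / seconds = QPS. Compile with -lpq -lpthread. */
#include <libpq-fe.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define THREADS 8
#define DURATION_SEC 5

static const char *conninfo = "host=127.0.0.1 port=5433 dbname=mskql_bench"; /* port is an assumption */
static const char *sql = "SELECT count(*) FROM orders WHERE id = 42";        /* placeholder workload  */

static double now_s(void) {
    struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void *worker(void *arg) {
    long *queries = arg;
    PGconn *conn = PQconnectdb(conninfo);          /* one persistent connection per thread */
    if (PQstatus(conn) != CONNECTION_OK) { PQfinish(conn); return NULL; }
    double end = now_s() + DURATION_SEC;
    while (now_s() < end) {
        PGresult *res = PQexec(conn, sql);
        PQclear(res);
        (*queries)++;
    }
    PQfinish(conn);
    return NULL;
}

int main(void) {
    pthread_t tid[THREADS];
    long counts[THREADS] = {0};
    for (int i = 0; i < THREADS; i++) pthread_create(&tid[i], NULL, worker, &counts[i]);
    long total = 0;
    for (int i = 0; i < THREADS; i++) { pthread_join(tid[i], NULL); total += counts[i]; }
    printf("%.0f QPS over %d threads\n", (double)total / DURATION_SEC, THREADS);
    return 0;
}
```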
Both engines linked as C libraries in the same process. No wire protocol, no CLI, no serialization. This is the fairest possible comparison of raw query engine speed. mskql wins 15 of 16 benchmarks.
Shorter bars = faster. The chart below highlights 8 representative in-process workloads.
Ratio < 1.0 means mskql was faster. Both engines in-memory, same process, same data. 15 of 16 workloads are mskql wins. The one DuckDB “win” (group_by at 1.06×) is within noise.
| Workload | mskql (ms) | DuckDB (ms) | Ratio | Winner |
|---|---|---|---|---|
| reads & scans | | | | |
| Full table scan (10K rows ×100) | 1.0 | 13.0 | 0.07× | mskql 13× faster |
| Filtered scan (10K rows ×200) | 4.6 | 34.9 | 0.13× | mskql 8× faster |
| writes | | | | |
| Single-row inserts (10K) | 13.3 | 1,355 | 0.01× | mskql 102× faster |
| aggregation & analytics | | | | |
| Aggregate (COUNT/SUM/AVG, 10K ×200) | 17.2 | 32.3 | 0.53× | mskql 1.9× faster |
| Expression aggregate (SUM(qty*price) ×100) | 7.8 | 24.0 | 0.32× | mskql 3.1× faster |
| Group by (100 groups, 50K rows ×100) | 36.8 | 34.8 | 1.06× | Tie (within noise) |
| Distinct (100 groups, 100K ×200) | 51.7 | 174.5 | 0.30× | mskql 3.4× faster |
| Window function (ROW_NUMBER, 50K ×10) | 10.2 | 48.3 | 0.21× | mskql 5× faster |
| CTE (agg + filter + sort, 50K ×50) | 17.1 | 22.0 | 0.78× | mskql 1.3× faster |
| joins | | | | |
| Join (10K×1K rows ×50) | 3.4 | 14.6 | 0.23× | mskql 4.3× faster |
| Large join (100K×1K rows ×50) | 33.3 | 34.5 | 0.97× | mskql 1.04× faster |
| sorts & functions | | | | |
| Order by (10K rows ×50) | 16.1 | 32.9 | 0.49× | mskql 2× faster |
| Large sort (100K rows ×20) | 29.9 | 65.2 | 0.46× | mskql 2.2× faster |
| Top-N (ORDER BY + LIMIT, 100K ×100) | 113.9 | 121.2 | 0.94× | mskql 1.1× faster |
| Generate series (100K rows ×50) | 0.5 | 9.4 | 0.05× | mskql 19× faster |
| Scalar functions (UPPER/ABS/etc, 10K ×100) | 7.3 | 27.7 | 0.26× | mskql 3.8× faster |
The wire-level benchmarks (Section 3) showed DuckDB winning full scans (2.9×), sorts (1.2–2.0×), and generate_series (2.9×). In-process, mskql wins all three: full scan 13× faster, order_by 2× faster, large_sort 2.2× faster, generate_series 19× faster. The wire-level “DuckDB wins scans and sorts” narrative was an artifact of comparing DuckDB’s in-process CLI against mskql’s pgwire path.
End-to-end wall-clock time including client/server overhead. mskql and PostgreSQL communicate
over pgwire; DuckDB runs via its CLI (in-process, no network). Ratio < 1.0 means mskql
was faster. The disk column shows performance with CREATE DISK TABLE.
Note: DuckDB’s numbers here benefit from zero network overhead. See Section 2 for the apples-to-apples engine comparison.
All cached workloads faster than PostgreSQL. Closest: full scan (0.89×), transactions (0.82×). Against DuckDB over the wire: mskql wins aggregates (17×), distinct (14×), writes, analytical CTE, correlated subquery, joins, and all stress tests. DuckDB’s wire-level advantages on scans and sorts are a measurement artifact—see Section 2 for the fair comparison. The dk/mem column shows disk-backed tables average 0.96× the in-memory speed. See workload details for the exact SQL.
12 Parquet workloads (50K–200K rows). PostgreSQL cannot query Parquet files natively,
so this section compares mskql and DuckDB only. The mskql side uses CREATE FOREIGN TABLE;
DuckDB uses its native Parquet reader.
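The sketch below shows the two paths being timed. The DuckDB side is the real C API with its built-in `read_parquet()` reader; `mskql_exec()` and the exact CREATE FOREIGN TABLE column list and options are assumptions for illustration, not the syntax the benchmark actually uses:

```c
/* Sketch of the two Parquet paths. DuckDB queries the file directly;
 * mskql exposes it as a foreign table first. Names and options are
 * illustrative assumptions. */
#include <duckdb.h>

extern int mskql_exec(const char *sql);   /* hypothetical libmskql entry point */

void run_parquet_workload(duckdb_connection con) {
    /* mskql: expose the Parquet file as a foreign table, then query it like any table */
    mskql_exec("CREATE FOREIGN TABLE lineitem (l_returnflag TEXT, l_quantity DOUBLE) "
               "OPTIONS (filename 'lineitem.parquet')");
    mskql_exec("SELECT l_returnflag, SUM(l_quantity) FROM lineitem GROUP BY l_returnflag");

    /* DuckDB: query the file directly with the native Parquet reader */
    duckdb_result res;
    duckdb_query(con,
        "SELECT l_returnflag, SUM(l_quantity) "
        "FROM read_parquet('lineitem.parquet') GROUP BY l_returnflag", &res);
    duckdb_destroy_result(&res);
}
```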
| Workload | mskql (ms) | DuckDB (ms) | Cached ratio | No-cache ratio | Notes |
|---|---|---|---|---|---|
| pq_full_scan (50K rows) | 261 | 41 | 6.36× | 6.33× | Columnar Parquet scan via PLAN_PARQUET_SCAN |
| pq_where (filtered scan, 50K rows) | 164 | 74 | 2.23× | 2.36× | Result cache bypasses Parquet I/O on repeat |
| pq_aggregate (GROUP BY, 50K rows) | 12 | 104 | 0.12× | 0.12× | Columnar hash aggregation on cached data |
| pq_order_by (sort, 50K rows) | 135 | 48 | 2.83× | 3.10× | Sort requires full materialization |
| pq_wide_agg (7 cols, 50K rows) | 8 | 51 | 0.16× | 0.43× | Multi-column agg stays in L1 cache |
| pq_join_two (50K + 2K rows) | 7 | 37 | 0.18× | 0.44× | Hash join on cached Parquet data |
| pq_join_three (3 Parquet files) | 8 | 33 | 0.23× | 0.54× | Three-way join across Parquet files |
| pq_subquery (IN subquery, 2 files) | 5 | 56 | 0.10× | 0.10× | Hash semi-join on cached result |
| pq_lineitem_agg (200K rows, join + agg) | 15 | 43 | 0.35× | 0.84× | Large Parquet join + aggregate |
| pq_analytical (CTE + join + agg, 3 files) | 6 | 30 | 0.20× | 0.38× | Full analytical pipeline on Parquet |
| analytics_stress (mixed parquet + mem) | 25 | 66 | 0.38× | 0.54× | Complex analytics mixing parquet + in-memory |
| pq_mixed_join (parquet + mem table) | 9 | 60 | 0.15× | 1.46× | Join Parquet file with in-memory table |
Cached: mskql wins 8 of 12. The result cache serves repeated Parquet queries from cached wire bytes. DuckDB re-reads the Parquet file every time. No-cache: mskql wins 6 of 12. The plan executor’s columnar Parquet reader outperforms DuckDB on aggregates, joins, and subqueries even without caching. DuckDB wins full scans, sorts, WHERE scans, and pq_mixed_join.
Comparison: mskql vs PostgreSQL only—DuckDB does not expose a wire-protocol server. Ratio > 1.0 means mskql has higher throughput.
| Workload | mskql QPS | PG QPS | Ratio | Notes |
|---|---|---|---|---|
| simple reads & writes | | | | |
| Indexed point lookup | 86,805 | 24,048 | 3.61× | mskql 0.092ms p50 vs PG 0.248ms |
| Full table scan (5K rows) | 125 | 122 | 1.02× | Network-bound: both ~66ms p50 |
| Filtered scan (5K rows) | 253 | 245 | 1.03× | Network-bound: both ~33ms p50 |
| Single-row insert | 66,410 | 40,105 | 1.66× | No fsync, no WAL |
| Mixed read/write (80/20) | 55,817 | 59,200 | 0.94× | PG’s MVCC wins on concurrent writes |
| aggregation & analytics | | | | |
| Group + sum (5K rows) | 97,210 | 14,305 | 6.80× | Result cache: 0.082ms p50 |
| Wide aggregate (50K rows) | 97,008 | 1,055 | 92.0× | Result cache: 0.082ms p50 |
| Window function (RANK, 50K rows) | 52 | 12 | 4.33× | mskql 19ms p50 vs PG 83ms |
| complex queries (50K-row dataset) | | | | |
| Two-table join (500 + 2K rows) | 580 | 310 | 1.87× | |
| Multi-join (3 tables) | 37,402 | 475 | 78.7× | Result cache: 0.213ms p50 |
| Mixed analytical (CTE→join→agg→sort) | 97,210 | 508 | 191× | Result cache: 0.082ms p50 |
| CTE pipeline (agg→sort→limit) | 6,018 | 1,980 | 3.04× | |
| Subquery pipeline (filter+sort+limit) | 1,220 | 920 | 1.33× | |
| Large sort (50K rows) | 33 | 12 | 2.75× | |
In-process: 15 of 16 engine benchmarks. Insert bulk: 13ms vs 1,355ms (102×). Full scan: 1.0ms vs 13.0ms (13×). Window function: 10.2ms vs 48.3ms (5×). Distinct: 51.7ms vs 174.5ms (3.4×). Scalar functions: 7.3ms vs 27.7ms (3.8×). Join: 3.4ms vs 14.6ms (4.3×). Generate series: 0.5ms vs 9.4ms (19×). Order by: 16.1ms vs 32.9ms (2×). The result cache is not a factor here—this is raw engine speed, both in-process.
Wire-level: every cached batch workload vs PostgreSQL. Correlated subquery: 6ms vs 3,456ms (576×). Expression aggregate: 14ms vs 302ms (22×). Throughput: mixed_analytical at 97,210 QPS vs 508 (191×). multi_join at 37,402 vs 475 (79×).
Vectors: all 7 workloads vs PostgreSQL + pgvector. Vector insert: 136ms vs 383ms (2.8×). Vector filter (cached): 13ms vs 42ms (3.2×). Even without the result cache, mskql is faster on every vector workload.
Parquet: 8 of 12 workloads (cached). pq_aggregate: 12ms vs 104ms (8.7×). pq_subquery: 5ms vs 56ms (11×). The plan executor’s columnar reader outperforms DuckDB on aggregates, joins, and subqueries. DuckDB wins full scans, sorts, WHERE scans, and pq_mixed_join_nc.
Disk-backed tables: near-zero overhead.
Average dk/mem ratio across all disk-eligible workloads: 0.96×.
CREATE DISK TABLE adds persistence with no measurable performance cost on reads.
DuckDB wins group_by (1.06×), within measurement noise. DuckDB’s SIMD-vectorized hash aggregation matches mskql on the 100-group GROUP BY workload.
The wire-level benchmarks previously showed DuckDB winning full scans (2.9×), sorts (1.2–2.0×), and generate_series (2.9×). This was an artifact of comparing DuckDB’s in-process CLI (zero network) against mskql’s pgwire path (full TCP round-trip). In-process, mskql wins all three: full scan 13×, large sort 2.2×, generate_series 19×.
Mixed read/write throughput: 0.94×. PostgreSQL’s MVCC allows concurrent readers and writers without blocking. The mskql single-threaded executor serializes all queries.
```sh
# Build (requires libduckdb installed)
make bench_vs_duck
# Run all 16 benchmarks
./build/mskql_bench_vs_duck
# Filter to specific benchmarks
./build/mskql_bench_vs_duck --filter join
# Output JSON for CI integration
./build/mskql_bench_vs_duck --json results.json
```
```sh
# Start mskql in one terminal
./build/mskql
# In another terminal, with PostgreSQL and DuckDB installed:
./bench/bench_vs_pg.sh
```
```sh
# Build the throughput benchmark
make -C bench bench_throughput
# Run (requires both mskql and PostgreSQL running)
./build/mskql_bench_tp --duration 5 --threads 8 --pg
```
The in-process benchmark links libmskql.a and libduckdb directly—no
running servers needed. The batch script auto-creates a mskql_bench database in
PostgreSQL if it doesn’t exist. The throughput benchmark requires libpq
headers at build time.