16 in-process engine benchmarks. 101 wire-level batch workloads. 15 concurrent throughput tests. In the fairest comparison—both engines linked as libraries, zero wire overhead—mskql wins 15 of 16 workloads against DuckDB. Full scan 13×, window functions 5×, distinct 3.4×, insert 102× faster. DuckDB wins only group_by (1.06×). Against PostgreSQL: every cached batch workload and 14 of 15 throughput tests.
The advantage traces to three architectural properties: (1) zero per-query
allocation after warmup—the arena rewinds, nothing is freed; (2) the scan cache
converts row-store to flat columnar arrays once per table write, so repeated reads are
a memcpy of contiguous memory; (3) result bytes accumulate into a
64 KB buffer before a single write() syscall. These are structural
properties of the system, not tuning parameters.
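As a concrete illustration, here is a minimal sketch of what those three properties can look like in code. The names (`arena_t`, `scan_cache_t`, `out_buf_t`) and layouts are illustrative assumptions, not the actual mskql internals:

```c
/* Minimal sketch of the three properties described above -- illustrative
 * names only, not the actual mskql internals. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* (1) Bump arena: a per-query allocation is a pointer bump; "freeing"
 *     everything after the query is a single rewind of the offset. */
typedef struct { uint8_t *base; size_t cap, off; } arena_t;
static void *arena_alloc(arena_t *a, size_t n) {
    void *p = a->base + a->off;          /* no malloc, no free list   */
    a->off += (n + 15) & ~(size_t)15;    /* 16-byte aligned bump      */
    return p;
}
static void arena_rewind(arena_t *a) { a->off = 0; }

/* (2) Scan cache: a flat columnar copy of a table, rebuilt lazily the
 *     first time the table is read after a write; repeated scans are a
 *     memcpy of this contiguous array. */
typedef struct { int64_t *col; size_t nrows; uint64_t table_version; } scan_cache_t;

/* (3) Output buffer: result bytes accumulate and reach the kernel in one
 *     write() instead of one syscall per row. */
typedef struct { uint8_t buf[64 * 1024]; size_t len; int fd; } out_buf_t;
static void out_flush(out_buf_t *o) {
    if (o->len && write(o->fd, o->buf, o->len) < 0) { /* error handling elided */ }
    o->len = 0;
}
static void out_append(out_buf_t *o, const void *p, size_t n) {
    if (o->len + n > sizeof o->buf) out_flush(o);
    memcpy(o->buf + o->len, p, n); o->len += n;
}
```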
→ Design philosophy: why abstraction is not free
Correctness is verified by verify_correctness.c, which confirms row-for-row identical results between
mskql and DuckDB on all benchmark queries.
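For flavour, here is a sketch of what such a row-for-row check can look like. The DuckDB side uses the real C API (`duckdb_value_varchar`); `mskql_row_text()` is a hypothetical accessor standing in for whatever libmskql exposes, so this is not the actual verify_correctness.c:

```c
/* Sketch of a row-for-row comparison in the spirit of verify_correctness.c. */
#include <duckdb.h>
#include <stdio.h>
#include <string.h>

/* hypothetical accessor: text of column `col` in row `row` of the last mskql result */
extern const char *mskql_row_text(size_t row, size_t col);

static int results_match(duckdb_result *duck, size_t nrows, size_t ncols) {
    for (size_t r = 0; r < nrows; r++) {
        for (size_t c = 0; c < ncols; c++) {
            char *d = duckdb_value_varchar(duck, c, r);   /* DuckDB cell as text */
            const char *m = mskql_row_text(r, c);
            int same = (d && m) ? strcmp(d, m) == 0 : (d == NULL && m == NULL);
            if (d) duckdb_free(d);
            if (!same) {
                fprintf(stderr, "mismatch at row %zu col %zu\n", r, c);
                return 0;
            }
        }
    }
    return 1;   /* every cell identical */
}
```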
Three benchmark tiers, each measuring a different aspect of performance:
Tier 1 (in-process): the fairest DuckDB comparison. A C harness (bench/bench_vs_duck.c)
links both engines as libraries—libmskql.a and libduckdb—in
the same process. Both run in-memory with identical table schemas and data. Each benchmark
creates fresh databases, runs setup, then times N iterations of the same query. There is
no wire protocol, no CLI process spawn, no serialization overhead on either
side. This isolates pure query engine speed.
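The shape of that timing loop is sketched below. The DuckDB calls are the real C API from duckdb.h; `mskql_exec()` is a hypothetical stand-in for the libmskql entry point, and the SQL and iteration count are placeholders, so treat this as an outline rather than the actual bench_vs_duck.c:

```c
/* Sketch of the in-process harness loop: both engines in one address space,
 * same SQL, wall-clock over N iterations. */
#include <duckdb.h>
#include <stdio.h>
#include <time.h>

extern int mskql_exec(const char *sql);   /* hypothetical libmskql entry point */

static double now_ms(void) {
    struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void) {
    duckdb_database db; duckdb_connection con;
    duckdb_open(NULL, &db);               /* in-memory DuckDB database */
    duckdb_connect(db, &con);

    const char *sql = "SELECT count(*), sum(v) FROM t";   /* placeholder query */
    int iters = 100;

    double t0 = now_ms();
    for (int i = 0; i < iters; i++) mskql_exec(sql);
    double mskql_ms = now_ms() - t0;

    t0 = now_ms();
    for (int i = 0; i < iters; i++) {
        duckdb_result res;
        duckdb_query(con, sql, &res);      /* same SQL, same process */
        duckdb_destroy_result(&res);
    }
    double duck_ms = now_ms() - t0;

    printf("mskql %.1f ms  duckdb %.1f ms  ratio %.2fx\n",
           mskql_ms, duck_ms, mskql_ms / duck_ms);
    duckdb_disconnect(&con); duckdb_close(&db);
    return 0;
}
```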
The wire-level benchmark (Tier 2) runs mskql and PostgreSQL over pgwire but runs DuckDB via its CLI—an in-process invocation with zero network overhead. That asymmetry inflated DuckDB’s apparent advantage on scan and sort workloads. The in-process harness eliminates this bias: both engines execute through their C API in the same address space.
Tier 2 (wire-level batch): all three databases are tested on the same Apple M-series laptop, same loopback interface.
The benchmark script (bench/bench_vs_pg.py) generates identical SQL workloads,
runs setup against all databases, then times the benchmark queries. mskql and PostgreSQL are
timed via psql -f over pgwire; DuckDB is timed via its CLI (in-process).
Each workload has two variants: cached (result cache active) and
no-cache (_nc suffix—unique SQL comment busts the cache).
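A sketch of the mechanics, with illustrative names rather than the actual mskql implementation: the result cache is keyed on the exact SQL text, so the _nc variant appends a unique comment and the key never matches a cached entry.

```c
/* Illustrative sketch only -- not the actual mskql cache code.
 * Cache key = exact SQL text; the "_nc" variant appends a unique comment
 * so every run is a guaranteed miss. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

typedef struct {
    char     sql[512];       /* exact query text = cache key             */
    uint8_t *wire_bytes;     /* serialized result rows, replayed on hit  */
    size_t   len;
} cache_entry;

/* hit: reply with wire_bytes, no parsing/planning/execution; miss: run the query */
static const cache_entry *cache_lookup(const cache_entry *e, size_t n, const char *sql) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(e[i].sql, sql) == 0) return &e[i];
    return NULL;
}

/* build the no-cache variant: same query, unique comment, guaranteed miss */
static void make_nc_variant(const char *base, char *out, size_t cap) {
    snprintf(out, cap, "%s /* nc-%ld */", base, (long)time(NULL));
}
```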
Note: The DuckDB column in wire-level results includes CLI startup but no network overhead, while mskql and PostgreSQL include full pgwire round-trips. This favours DuckDB. See Tier 1 for the apples-to-apples comparison.
Tier 3 (concurrent throughput): a multi-threaded C benchmark (bench/bench_throughput.c) opens 8 persistent
connections speaking raw pgwire and fires queries for 5 seconds per workload.
It measures QPS and median latency (p50). DuckDB is excluded because it does not expose
a wire-protocol server.
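The measurement loop looks roughly like the sketch below. It assumes libpq as the pgwire client for brevity (the real harness speaks raw pgwire directly and also samples per-query latency for the p50 column); the connection string, port, database name, and SQL are placeholder assumptions:

```c
/* Sketch of the throughput measurement: N persistent connections, fixed
 * duration, total queries / seconds = QPS. Compile with -lpq -lpthread. */
#include <libpq-fe.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define THREADS 8
#define DURATION_SEC 5

static const char *conninfo = "host=127.0.0.1 port=5433 dbname=mskql_bench"; /* port is an assumption */
static const char *sql = "SELECT count(*) FROM orders WHERE id = 42";        /* placeholder workload  */

static double now_s(void) {
    struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void *worker(void *arg) {
    long *queries = arg;
    PGconn *conn = PQconnectdb(conninfo);          /* one persistent connection per thread */
    if (PQstatus(conn) != CONNECTION_OK) { PQfinish(conn); return NULL; }
    double end = now_s() + DURATION_SEC;
    while (now_s() < end) {
        PGresult *res = PQexec(conn, sql);
        PQclear(res);
        (*queries)++;
    }
    PQfinish(conn);
    return NULL;
}

int main(void) {
    pthread_t tid[THREADS];
    long counts[THREADS] = {0};
    for (int i = 0; i < THREADS; i++) pthread_create(&tid[i], NULL, worker, &counts[i]);
    long total = 0;
    for (int i = 0; i < THREADS; i++) { pthread_join(tid[i], NULL); total += counts[i]; }
    printf("%.0f QPS over %d threads\n", (double)total / DURATION_SEC, THREADS);
    return 0;
}
```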
Both engines linked as C libraries in the same process. No wire protocol, no CLI, no serialization. This is the fairest possible comparison of raw query engine speed. mskql wins 15 of 16 benchmarks.
Shorter bars = faster. The chart below highlights 8 representative in-process workloads.
Ratio < 1.0 means mskql was faster. Both engines in-memory, same process, same data. 15 of 16 workloads are mskql wins. The one DuckDB “win” (group_by at 1.06×) is within noise.
| Workload | mskql (ms) | DuckDB (ms) | Ratio | Winner |
|---|---|---|---|---|
| reads & scans | | | | |
| Full table scan (10K rows ×100) | 1.0 | 13.0 | 0.07× | mskql 13× faster |
| Filtered scan (10K rows ×200) | 4.6 | 34.9 | 0.13× | mskql 8× faster |
| writes | | | | |
| Single-row inserts (10K) | 13.3 | 1,355 | 0.01× | mskql 102× faster |
| aggregation & analytics | | | | |
| Aggregate (COUNT/SUM/AVG, 10K ×200) | 17.2 | 32.3 | 0.53× | mskql 1.9× faster |
| Expression aggregate (SUM(qty*price) ×100) | 7.8 | 24.0 | 0.32× | mskql 3.1× faster |
| Group by (100 groups, 50K rows ×100) | 36.8 | 34.8 | 1.06× | Tie (within noise) |
| Distinct (100 groups, 100K ×200) | 51.7 | 174.5 | 0.30× | mskql 3.4× faster |
| Window function (ROW_NUMBER, 50K ×10) | 10.2 | 48.3 | 0.21× | mskql 5× faster |
| CTE (agg + filter + sort, 50K ×50) | 17.1 | 22.0 | 0.78× | mskql 1.3× faster |
| joins | | | | |
| Join (10K×1K rows ×50) | 3.4 | 14.6 | 0.23× | mskql 4.3× faster |
| Large join (100K×1K rows ×50) | 33.3 | 34.5 | 0.97× | mskql 1.04× faster |
| sorts & functions | | | | |
| Order by (10K rows ×50) | 16.1 | 32.9 | 0.49× | mskql 2× faster |
| Large sort (100K rows ×20) | 29.9 | 65.2 | 0.46× | mskql 2.2× faster |
| Top-N (ORDER BY + LIMIT, 100K ×100) | 113.9 | 121.2 | 0.94× | mskql 1.1× faster |
| Generate series (100K rows ×50) | 0.5 | 9.4 | 0.05× | mskql 19× faster |
| Scalar functions (UPPER/ABS/etc, 10K ×100) | 7.3 | 27.7 | 0.26× | mskql 3.8× faster |
The wire-level benchmarks (Section 3) showed DuckDB winning full scans (2.9×), sorts (1.2–2.0×), and generate_series (2.9×). In-process, mskql wins all three: full scan 13× faster, order_by 2× faster, large_sort 2.2× faster, generate_series 19× faster. The wire-level “DuckDB wins scans and sorts” narrative was an artifact of comparing DuckDB’s in-process CLI against mskql’s pgwire path.
End-to-end wall-clock time including client/server overhead. mskql and PostgreSQL communicate
over pgwire; DuckDB runs via its CLI (in-process, no network). Ratio < 1.0 means mskql
was faster. The disk column shows performance with CREATE DISK TABLE.
Note: DuckDB’s numbers here benefit from zero network overhead. See Section 2 for the apples-to-apples engine comparison.
All cached workloads faster than PostgreSQL. Closest: full scan (0.89×), transactions (0.82×). Against DuckDB over the wire: mskql wins aggregates (17×), distinct (14×), writes, analytical CTE, correlated subquery, joins, and all stress tests. DuckDB’s wire-level advantages on scans and sorts are a measurement artifact—see Section 2 for the fair comparison. The dk/mem column shows disk-backed tables average 0.96× the in-memory speed. See workload details for the exact SQL.
12 Parquet workloads (50K–200K rows). PostgreSQL cannot query Parquet files natively,
so this section compares mskql and DuckDB only. The mskql side uses CREATE FOREIGN TABLE;
DuckDB uses its native Parquet reader.
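The sketch below shows the two paths being timed. The DuckDB side is the real C API with its built-in `read_parquet()` reader; `mskql_exec()` and the exact CREATE FOREIGN TABLE column list and options are assumptions for illustration, not the syntax the benchmark actually uses:

```c
/* Sketch of the two Parquet paths. DuckDB queries the file directly;
 * mskql exposes it as a foreign table first. Names and options are
 * illustrative assumptions. */
#include <duckdb.h>

extern int mskql_exec(const char *sql);   /* hypothetical libmskql entry point */

void run_parquet_workload(duckdb_connection con) {
    /* mskql: expose the Parquet file as a foreign table, then query it like any table */
    mskql_exec("CREATE FOREIGN TABLE lineitem (l_returnflag TEXT, l_quantity DOUBLE) "
               "OPTIONS (filename 'lineitem.parquet')");
    mskql_exec("SELECT l_returnflag, SUM(l_quantity) FROM lineitem GROUP BY l_returnflag");

    /* DuckDB: query the file directly with the native Parquet reader */
    duckdb_result res;
    duckdb_query(con,
        "SELECT l_returnflag, SUM(l_quantity) "
        "FROM read_parquet('lineitem.parquet') GROUP BY l_returnflag", &res);
    duckdb_destroy_result(&res);
}
```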
| Workload | mskql (ms) | DuckDB (ms) | Cached ratio | No-cache ratio | Notes |
|---|---|---|---|---|---|
| pq_full_scan (50K rows) | 261 | 41 | 6.36× | 6.33× | Columnar Parquet scan via PLAN_PARQUET_SCAN |
| pq_where (filtered scan, 50K rows) | 164 | 74 | 2.23× | 2.36× | Result cache bypasses Parquet I/O on repeat |
| pq_aggregate (GROUP BY, 50K rows) | 12 | 104 | 0.12× | 0.12× | Columnar hash aggregation on cached data |
| pq_order_by (sort, 50K rows) | 135 | 48 | 2.83× | 3.10× | Sort requires full materialization |
| pq_wide_agg (7 cols, 50K rows) | 8 | 51 | 0.16× | 0.43× | Multi-column agg stays in L1 cache |
| pq_join_two (50K + 2K rows) | 7 | 37 | 0.18× | 0.44× | Hash join on cached Parquet data |
| pq_join_three (3 Parquet files) | 8 | 33 | 0.23× | 0.54× | Three-way join across Parquet files |
| pq_subquery (IN subquery, 2 files) | 5 | 56 | 0.10× | 0.10× | Hash semi-join on cached result |
| pq_lineitem_agg (200K rows, join + agg) | 15 | 43 | 0.35× | 0.84× | Large Parquet join + aggregate |
| pq_analytical (CTE + join + agg, 3 files) | 6 | 30 | 0.20× | 0.38× | Full analytical pipeline on Parquet |
| analytics_stress (mixed parquet + mem) | 25 | 66 | 0.38× | 0.54× | Complex analytics mixing parquet + in-memory |
| pq_mixed_join (parquet + mem table) | 9 | 60 | 0.15× | 1.46× | Join Parquet file with in-memory table |
Cached: mskql wins 8 of 12. The result cache serves repeated Parquet queries from cached wire bytes. DuckDB re-reads the Parquet file every time. No-cache: mskql wins 6 of 12. The plan executor’s columnar Parquet reader outperforms DuckDB on aggregates, joins, and subqueries even without caching. DuckDB wins full scans, sorts, WHERE scans, and pq_mixed_join.
Comparison: mskql vs PostgreSQL only—DuckDB does not expose a wire-protocol server. Ratio > 1.0 means mskql has higher throughput.
| Workload | mskql QPS | PG QPS | Ratio | Notes |
|---|---|---|---|---|
| simple reads & writes | | | | |
| Indexed point lookup | 86,805 | 24,048 | 3.61× | mskql 0.092ms p50 vs PG 0.248ms |
| Full table scan (5K rows) | 125 | 122 | 1.02× | Network-bound: both ~66ms p50 |
| Filtered scan (5K rows) | 253 | 245 | 1.03× | Network-bound: both ~33ms p50 |
| Single-row insert | 66,410 | 40,105 | 1.66× | No fsync, no WAL |
| Mixed read/write (80/20) | 55,817 | 59,200 | 0.94× | PG’s MVCC wins on concurrent writes |
| aggregation & analytics | | | | |
| Group + sum (5K rows) | 97,210 | 14,305 | 6.80× | Result cache: 0.082ms p50 |
| Wide aggregate (50K rows) | 97,008 | 1,055 | 92.0× | Result cache: 0.082ms p50 |
| Window function (RANK, 50K rows) | 52 | 12 | 4.33× | mskql 19ms p50 vs PG 83ms |
| complex queries (50K-row dataset) | | | | |
| Two-table join (500 + 2K rows) | 580 | 310 | 1.87× | |
| Multi-join (3 tables) | 37,402 | 475 | 78.7× | Result cache: 0.213ms p50 |
| Mixed analytical (CTE→join→agg→sort) | 97,210 | 508 | 191× | Result cache: 0.082ms p50 |
| CTE pipeline (agg→sort→limit) | 6,018 | 1,980 | 3.04× | |
| Subquery pipeline (filter+sort+limit) | 1,220 | 920 | 1.33× | |
| Large sort (50K rows) | 33 | 12 | 2.75× | |
In-process: 15 of 16 engine benchmarks. Insert bulk: 13ms vs 1,355ms (102×). Full scan: 1.0ms vs 13.0ms (13×). Window function: 10.2ms vs 48.3ms (5×). Distinct: 51.7ms vs 174.5ms (3.4×). Scalar functions: 7.3ms vs 27.7ms (3.8×). Join: 3.4ms vs 14.6ms (4.3×). Generate series: 0.5ms vs 9.4ms (19×). Order by: 16.1ms vs 32.9ms (2×). The result cache is not a factor here—this is raw engine speed, both in-process.
Wire-level: every cached batch workload vs PostgreSQL. Correlated subquery: 6ms vs 3,456ms (576×). Expression aggregate: 14ms vs 302ms (22×). Throughput: mixed_analytical at 97,210 QPS vs 508 (191×). multi_join at 37,402 vs 475 (79×).
Vectors: all 7 workloads vs PostgreSQL + pgvector. Vector insert: 136ms vs 383ms (2.8×). Vector filter (cached): 13ms vs 42ms (3.2×). Even without the result cache, mskql is faster on every vector workload.
Parquet: 8 of 12 workloads (cached). pq_aggregate: 12ms vs 104ms (8.7×). pq_subquery: 5ms vs 56ms (11×). The plan executor’s columnar reader outperforms DuckDB on aggregates, joins, and subqueries. DuckDB wins full scans, sorts, WHERE scans, and pq_mixed_join_nc.
Disk-backed tables: near-zero overhead.
Average dk/mem ratio across all disk-eligible workloads: 0.96×.
CREATE DISK TABLE adds persistence with no measurable performance cost on reads.
DuckDB wins group_by (1.06×), within measurement noise. DuckDB’s SIMD-vectorized hash aggregation matches mskql on the 100-group GROUP BY workload.
The wire-level benchmarks previously showed DuckDB winning full scans (2.9×), sorts (1.2–2.0×), and generate_series (2.9×). This was an artifact of comparing DuckDB’s in-process CLI (zero network) against mskql’s pgwire path (full TCP round-trip). In-process, mskql wins all three: full scan 13×, large sort 2.2×, generate_series 19×.
Mixed read/write throughput: 0.94×. PostgreSQL’s MVCC allows concurrent readers and writers without blocking. The mskql single-threaded executor serializes all queries.
```sh
# Build (requires libduckdb installed)
make bench_vs_duck
# Run all 16 benchmarks
./build/mskql_bench_vs_duck
# Filter to specific benchmarks
./build/mskql_bench_vs_duck --filter join
# Output JSON for CI integration
./build/mskql_bench_vs_duck --json results.json
```
```sh
# Start mskql in one terminal
./build/mskql
# In another terminal, with PostgreSQL and DuckDB installed:
./bench/bench_vs_pg.sh
```
```sh
# Build the throughput benchmark
make -C bench bench_throughput
# Run (requires both mskql and PostgreSQL running)
./build/mskql_bench_tp --duration 5 --threads 8 --pg
```
The in-process benchmark links libmskql.a and libduckdb directly—no
running servers needed. The batch script auto-creates a mskql_bench database in
PostgreSQL if it doesn’t exist. The throughput benchmark requires libpq
headers at build time.