93 batch workloads. 15 concurrent throughput workloads. Three databases on the same machine. Faster than PostgreSQL on every cached batch workload—aggregates 8×, expression aggregates 10×, correlated subqueries 166×—and on 13 of 15 throughput tests. Against DuckDB, mskql wins 61 of 93 workloads; DuckDB wins full scans (4.3×) and sorts (4.7×). PostgreSQL wins mixed read/write (0.94×).
The advantage on cached analytical workloads traces to three architectural
properties: (1) zero per-query allocation after warmup—the arena rewinds,
nothing is freed; (2) the scan cache converts row-store to flat columnar arrays
once per table write, so repeated reads are a memcpy of contiguous
memory; (3) result bytes accumulate into a 64 KB buffer before a single
write() syscall—on a 5,000-row result, 5 syscalls instead of
5,000. These are structural properties of the system, not tuning parameters.
→ Design philosophy: why abstraction is not free
All three databases are tested on the same Apple M-series laptop; mskql and PostgreSQL are accessed over loopback, DuckDB in-process.
The benchmark script (bench/bench_vs_pg.py) generates identical SQL workloads
for each test, runs setup against all databases, then times the benchmark queries.
Each timing measures a single psql -f invocation (mskql, PostgreSQL)
or duckdb CLI invocation (DuckDB) containing all iterations of that workload.
Port assignment: mskql on 5433, PostgreSQL on 5432 with default configuration,
DuckDB in-process via its CLI.
Each workload has two variants: cached (result cache active) and
no-cache (_nc suffix—each query has a unique SQL
comment appended to bust the result cache). This isolates the effect of mskql’s
result cache from raw execution speed.
A multi-threaded C benchmark (bench/bench_throughput.c) opens 8 persistent
connections speaking raw pgwire and fires queries for 5 seconds per workload.
It measures QPS and median latency (p50). DuckDB is not included in throughput tests
because it does not expose a wire-protocol server.
Shorter bars = faster. The chart below highlights 10 representative workloads across all three databases.
Ratio < 1.0 means mskql was faster. All 28 workloads with PostgreSQL data: every one is an mskql win. Against DuckDB: wins on aggregates, joins, CTEs, subqueries; DuckDB wins full scans, sorts, wide output.
All 28 cached workloads faster than PostgreSQL. Closest: index lookups (0.96×), transactions (0.86×). Against DuckDB: mskql wins aggregates, distinct, writes, analytical CTE, correlated subquery, joins. DuckDB wins scans, sorts, window functions, generate_series, scalar functions. See workload details for the exact SQL.
9 Parquet workloads (50K–200K rows). PostgreSQL cannot query Parquet files natively,
so this section compares mskql and DuckDB only. The mskql side uses CREATE FOREIGN TABLE;
DuckDB uses its native Parquet reader.
| Workload | mskql (ms) | Duck (ms) | Cached | No-cache | Notes | Visual (cached) |
|---|---|---|---|---|---|---|
| pq_full_scan (50K rows) | 485 | 31 | 15.6× | 15.6× | Row-by-row Parquet read; no columnar pushdown yet | |
| pq_where (filtered scan, 50K rows) | 20 | 42 | 0.48× | 2.24× | Result cache bypasses Parquet I/O on repeat | |
| pq_aggregate (GROUP BY, 50K rows) | 26 | 73 | 0.36× | 1.46× | Columnar hash aggregation on cached data | |
| pq_order_by (sort, 50K rows) | 365 | 28 | 13.0× | 13.0× | Sort requires full materialization; DuckDB sorts columnar natively | |
| pq_wide_agg (7 cols, 50K rows) | 25 | 40 | 0.63× | 2.00× | Multi-column agg stays in L1 cache | |
| pq_join_two (50K + 2K rows) | 23 | 33 | 0.70× | 3.30× | Hash join on cached Parquet data | |
| pq_join_three (3 Parquet files) | 27 | 36 | 0.75× | 3.56× | Three-way join across Parquet files | |
| pq_subquery (IN subquery, 2 files) | 18 | 41 | 0.44× | 2.22× | Hash semi-join on cached result | |
| pq_analytical (CTE + join + agg, 3 files) | 26 | 39 | 0.67× | 3.08× | Full analytical pipeline on Parquet | |
Cached: mskql wins 7 of 9. The result cache serves repeated Parquet
queries from cached wire bytes. DuckDB re-reads the Parquet file every time.
No-cache: DuckDB wins 8 of 9. DuckDB’s native columnar Parquet
reader is faster on cold reads. On the mskql side, Parquet is read row-by-row via
PLAN_PARQUET_SCAN—no columnar pushdown yet.
Comparison: mskql vs PostgreSQL only—DuckDB does not expose a wire-protocol server. Ratio > 1.0 means mskql has higher throughput.
| Workload | mskql QPS | PG QPS | Ratio | Notes | Visual |
|---|---|---|---|---|---|
| simple reads & writes | |||||
| Indexed point lookup | 79,102 | 23,461 | 3.37× | mskql 0.101ms p50 vs PG 0.254ms | |
| Full table scan (5K rows) | 122 | 120 | 1.02× | Network-bound: both ~66ms p50 | |
| Filtered scan (5K rows) | 245 | 241 | 1.02× | Network-bound: both ~33ms p50 | |
| Single-row insert | 63,253 | 39,189 | 1.61× | No fsync, no WAL | |
| Mixed read/write (80/20) | 54,812 | 58,001 | 0.94× | PG’s MVCC wins on concurrent writes | |
| aggregation & analytics | |||||
| Group + sum (5K rows) | 94,525 | 13,929 | 6.79× | Result cache: 0.084ms p50 | |
| Wide aggregate (50K rows) | 8,412 | 1,036 | 8.12× | Result cache: 0.952ms p50 | |
| Window function (RANK, 50K rows) | 14 | 12 | 1.22× | Both slow: 50K-row window is CPU-bound | |
| complex queries (50K-row dataset) | |||||
| Two-table join (500 + 2K rows) | 562 | 302 | 1.86× | ||
| Multi-join (3 tables) | 36,171 | 462 | 78.3× | Result cache: 0.221ms p50 | |
| Mixed analytical (CTE→join→agg→sort) | 94,525 | 496 | 190× | Result cache: 0.085ms p50 | |
| CTE pipeline (agg→sort→limit) | 5,606 | 1,920 | 2.92× | ||
| Subquery pipeline (filter+sort+limit) | 1,170 | 898 | 1.30× | ||
| Large sort (50K rows) | 30 | 12 | 2.54× | ||
vs PostgreSQL: every cached batch workload. Correlated subquery: 22ms vs 3,665ms (166×). Expression aggregate: 33ms vs 323ms (10×). Throughput: mixed_analytical at 94,525 QPS vs 496 (190×). multi_join at 36,171 vs 462 (78×). The result cache and zero-allocation executor dominate on repeated analytical queries.
vs DuckDB: aggregates, writes, correlated subqueries, analytical CTEs, joins. Expression aggregate: 33ms vs 249ms (8×). Aggregate: 35ms vs 199ms (6×). Insert+delete: 2,654ms vs 15,972ms (6×). Bulk update: 55ms vs 85ms (1.5×). Analytical CTE: 20ms vs 32ms (1.6×). Correlated subquery: 22ms vs 38ms (1.7×).
Vectors: all 7 workloads vs PostgreSQL + pgvector.
Vector insert: 138ms vs 418ms (3×). Vector filter (cached): 13ms vs 45ms (3.5×).
Even without the result cache, mskql is faster on every vector workload—the columnar
float[] storage and zero-copy scan path outperform pgvector’s row-based storage.
Parquet (cached): 7 of 9 workloads. pq_aggregate: 26ms vs 73ms (2.8×). pq_subquery: 18ms vs 41ms (2.3×). The result cache makes repeated Parquet queries faster than DuckDB’s native reader.
Full scans: 4.3× faster. DuckDB’s columnar storage reads contiguous column data with SIMD vectorization. The mskql row store converts to columnar via the scan cache—fast, but not as fast as native columnar.
Sorts: 4.7× faster. DuckDB’s sort is SIMD-optimized for columnar data. The mskql plan executor sorts via multi-column radix sort on column blocks.
Window functions: 1.9–2.6× faster. DuckDB’s window function implementation is heavily optimized for columnar processing.
Parquet (cold): 8 of 9 workloads. DuckDB reads Parquet natively
with columnar pushdown. The mskql Parquet reader scans row-by-row via PLAN_PARQUET_SCAN.
Mixed read/write throughput: 0.94×. PostgreSQL’s MVCC allows concurrent readers and writers without blocking. The mskql single-threaded executor serializes all queries.
```sh
# Start mskql in one terminal
./build/mskql

# In another terminal, with PostgreSQL and DuckDB installed:
./bench/bench_vs_pg.sh

# Build the throughput benchmark
make -C bench bench_throughput

# Run (requires both mskql and PostgreSQL running)
./build/mskql_bench_tp --duration 5 --threads 8 --pg
```
The batch script auto-creates a mskql_bench database in PostgreSQL if it
doesn’t exist. DuckDB runs in-process via its CLI. The throughput benchmark
requires libpq headers at build time.