mskql vs PostgreSQL vs DuckDB

16 engine benchmarks, 101 wire workloads, 15 throughput tests

16 in-process engine benchmarks. 101 wire-level batch workloads. 15 concurrent throughput tests. In the fairest comparison—both engines linked as libraries, zero wire overhead—mskql wins 15 of 16 workloads against DuckDB. Full scan 13×, window functions 5×, distinct 3.4×, insert 102× faster. DuckDB wins only group_by (1.06×). Against PostgreSQL: every cached batch workload and 14 of 15 throughput tests.

The advantage traces to three architectural properties: (1) zero per-query allocation after warmup—the arena rewinds, nothing is freed; (2) the scan cache converts row-store to flat columnar arrays once per table write, so repeated reads are a memcpy of contiguous memory; (3) result bytes accumulate into a 64 KB buffer before a single write() syscall. These are structural properties of the system, not tuning parameters. → Design philosophy: why abstraction is not free
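
A minimal sketch of properties (1) and (3), with hypothetical names (Arena, arena_reset, OutBuf, out_write are illustrative, not the actual mskql internals):

/* Sketch of the arena and the output buffer, not the real mskql code. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

typedef struct { uint8_t *base; size_t used, cap; } Arena;

static void *arena_alloc(Arena *a, size_t n) {
    if (a->used + n > a->cap) return NULL;  /* real code would grow the arena */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

/* Property (1): after the query, nothing is freed; the arena rewinds. */
static void arena_reset(Arena *a) { a->used = 0; }

/* Property (3): result bytes accumulate; one write() per 64 KB, not per row. */
typedef struct { int fd; size_t len; uint8_t buf[64 * 1024]; } OutBuf;

static void out_flush(OutBuf *o) {
    if (o->len) { write(o->fd, o->buf, o->len); o->len = 0; }
}

static void out_write(OutBuf *o, const void *p, size_t n) {  /* assumes n <= 64 KB */
    if (o->len + n > sizeof o->buf) out_flush(o);
    memcpy(o->buf + o->len, p, n);
    o->len += n;
}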

⚠ Important caveat. This comparison is directional, not scientific. mskql is an in-memory database with no durability, no MVCC, no WAL, no background workers, and no cost-based optimizer. PostgreSQL is a production-grade RDBMS. DuckDB is a columnar analytical engine with SIMD vectorization. All numbers below measure wall-clock time for identical SQL workloads on the same machine. Correctness is verified by verify_correctness.c, which confirms row-for-row identical results between mskql and DuckDB on all benchmark queries.

1. Methodology

Three benchmark tiers, each measuring a different aspect of performance:

Tier 1: In-process engine comparison (mskql vs DuckDB)

The fairest DuckDB comparison. A C harness (bench/bench_vs_duck.c) links both engines as libraries—libmskql.a and libduckdb—in the same process. Both run in-memory with identical table schemas and data. Each benchmark creates fresh databases, runs setup, then times N iterations of the same query. There is no wire protocol, no CLI process spawn, no serialization overhead on either side. This isolates pure query engine speed.
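
In outline, the harness looks something like the sketch below. The duckdb_* calls are the real DuckDB C API; the mskql half is symmetric but its entry points are not documented on this page, so only the DuckDB side is shown.

/* Outline of an in-process timing loop against DuckDB's C API.
   The mskql half is symmetric: link libmskql.a, open an in-memory
   database, and run the identical SQL in the same loop shape. */
#include <stdio.h>
#include <time.h>
#include "duckdb.h"

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void) {
    duckdb_database db;
    duckdb_connection con;
    duckdb_open(NULL, &db);      /* NULL path = in-memory */
    duckdb_connect(db, &con);

    /* setup: same schema and data on both engines (workload illustrative) */
    duckdb_query(con, "CREATE TABLE t AS SELECT range AS v FROM range(10000)", NULL);

    const char *sql = "SELECT count(*) FROM t WHERE v > 5000";
    double t0 = now_ms();
    for (int i = 0; i < 100; i++) {        /* N timed iterations */
        duckdb_result res;
        duckdb_query(con, sql, &res);
        duckdb_destroy_result(&res);
    }
    printf("duckdb: %.1f ms\n", now_ms() - t0);

    duckdb_disconnect(&con);
    duckdb_close(&db);
    return 0;
}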

The wire-level benchmark (Tier 2) runs mskql and PostgreSQL over pgwire but runs DuckDB via its CLI—an in-process invocation with zero network overhead. That asymmetry inflated DuckDB’s apparent advantage on scan and sort workloads. The in-process harness eliminates this bias: both engines execute through their C API in the same address space.

Tier 2: Wire-level batch latency (mskql vs PostgreSQL vs DuckDB)

All three databases are tested on the same Apple M-series laptop, same loopback interface. The benchmark script (bench/bench_vs_pg.py) generates identical SQL workloads, runs setup against all databases, then times the benchmark queries. mskql and PostgreSQL are timed via psql -f over pgwire; DuckDB is timed via its CLI (in-process). Each workload has two variants: cached (result cache active) and no-cache (_nc suffix—unique SQL comment busts the cache).

Note: The DuckDB column in wire-level results includes CLI startup but no network overhead, while mskql and PostgreSQL include full pgwire round-trips. This favours DuckDB. See Tier 1 for the apples-to-apples comparison.
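
The _nc cache-bust is nothing more than a unique comment appended to otherwise identical SQL, roughly like this (illustrative, not the script's exact format):

/* Illustrative: the query text changes every run, so the result
   cache can never serve it, while the plan and data stay identical. */
#include <stdio.h>

int main(void) {
    char sql[256];
    for (int run = 0; run < 3; run++) {
        snprintf(sql, sizeof sql,
                 "SELECT category, SUM(price) FROM items GROUP BY category"
                 " /* nc:%d */", run);
        puts(sql);
    }
    return 0;
}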

Tier 3: Concurrent throughput (mskql vs PostgreSQL)

A multi-threaded C benchmark (bench/bench_throughput.c) opens 8 persistent connections speaking raw pgwire and fires queries for 5 seconds per workload. Measures QPS and median latency (p50). DuckDB is excluded because it does not expose a wire-protocol server.
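
A single-connection sketch of the measurement: the real harness speaks raw pgwire across 8 threads; this simplification uses libpq, and the connection string, table, and query are placeholders.

/* Simplified throughput loop: one connection, one workload, 5-second
   window, QPS and p50 from recorded per-query latencies. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <libpq-fe.h>

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

static int cmp_double(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

int main(void) {
    PGconn *conn = PQconnectdb("host=127.0.0.1 dbname=mskql_bench");
    if (PQstatus(conn) != CONNECTION_OK) return 1;

    static double lat[1 << 20];
    size_t n = 0;
    double end = now_ms() + 5000.0;          /* 5-second window */

    while (now_ms() < end && n < sizeof lat / sizeof *lat) {
        double t0 = now_ms();
        PGresult *r = PQexec(conn, "SELECT * FROM users WHERE id = 42");
        PQclear(r);
        lat[n++] = now_ms() - t0;            /* per-query latency */
    }
    qsort(lat, n, sizeof *lat, cmp_double);
    printf("QPS %.0f  p50 %.3f ms\n", n / 5.0, n ? lat[n / 2] : 0.0);
    PQfinish(conn);
    return 0;
}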

2. In-Process Engine Comparison — mskql vs DuckDB

Both engines linked as C libraries in the same process. No wire protocol, no CLI, no serialization. This is the fairest possible comparison of raw query engine speed. mskql wins 15 of 16 benchmarks.

Visual Summary

Eight representative in-process workloads (lower is faster):

| Workload | mskql | DuckDB |
|---|---|---|
| Insert bulk (10K) | 13 ms | 1,355 ms |
| Full table scan | 1.0 ms | 13.0 ms |
| Window function | 10.2 ms | 48.3 ms |
| Distinct | 51.7 ms | 174.5 ms |
| Join (10K×1K) | 3.4 ms | 14.6 ms |
| Large sort (100K) | 29.9 ms | 65.2 ms |
| Order by (10K) | 16.1 ms | 32.9 ms |
| Group by (50K) | 36.8 ms | 34.8 ms |

Full Results

Ratio < 1.0 means mskql was faster. Both engines in-memory, same process, same data. 15 of 16 workloads are mskql wins. The one DuckDB “win” (group_by at 1.06×) is within noise.

| Workload | mskql (ms) | DuckDB (ms) | Ratio | Winner |
|---|---|---|---|---|
| reads & scans | | | | |
| Full table scan (10K rows ×100) | 1.0 | 13.0 | 0.07× | mskql 13× faster |
| Filtered scan (10K rows ×200) | 4.6 | 34.9 | 0.13× | mskql 8× faster |
| writes | | | | |
| Single-row inserts (10K) | 13.3 | 1,355 | 0.01× | mskql 102× faster |
| aggregation & analytics | | | | |
| Aggregate (COUNT/SUM/AVG, 10K ×200) | 17.2 | 32.3 | 0.53× | mskql 1.9× faster |
| Expression aggregate (SUM(qty*price) ×100) | 7.8 | 24.0 | 0.32× | mskql 3.1× faster |
| Group by (100 groups, 50K rows ×100) | 36.8 | 34.8 | 1.06× | Tie (within noise) |
| Distinct (100 groups, 100K ×200) | 51.7 | 174.5 | 0.30× | mskql 3.4× faster |
| Window function (ROW_NUMBER, 50K ×10) | 10.2 | 48.3 | 0.21× | mskql 5× faster |
| CTE (agg + filter + sort, 50K ×50) | 17.1 | 22.0 | 0.78× | mskql 1.3× faster |
| joins | | | | |
| Join (10K×1K rows ×50) | 3.4 | 14.6 | 0.23× | mskql 4.3× faster |
| Large join (100K×1K rows ×50) | 33.3 | 34.5 | 0.97× | mskql 1.04× faster |
| sorts & functions | | | | |
| Order by (10K rows ×50) | 16.1 | 32.9 | 0.49× | mskql 2× faster |
| Large sort (100K rows ×20) | 29.9 | 65.2 | 0.46× | mskql 2.2× faster |
| Top-N (ORDER BY + LIMIT, 100K ×100) | 113.9 | 121.2 | 0.94× | mskql 1.1× faster |
| Generate series (100K rows ×50) | 0.5 | 9.4 | 0.05× | mskql 19× faster |
| Scalar functions (UPPER/ABS/etc, 10K ×100) | 7.3 | 27.7 | 0.26× | mskql 3.8× faster |

What the in-process results corrected

The wire-level benchmarks (Section 3) showed DuckDB winning full scans (2.9×), sorts (1.2–2.0×), and generate_series (2.9×). In-process, mskql wins all three: full scan 13× faster, order_by 2× faster, large_sort 2.2× faster, generate_series 19× faster. The wire-level “DuckDB wins scans and sorts” narrative was an artifact of comparing DuckDB’s in-process CLI against mskql’s pgwire path.

3. Wire-Level Batch Latency — Cached (single client, sequential)

End-to-end wall-clock time including client/server overhead. mskql and PostgreSQL communicate over pgwire; DuckDB runs via its CLI (in-process, no network). Ratio < 1.0 means mskql was faster. The disk column shows performance with CREATE DISK TABLE.

Note: DuckDB’s numbers here benefit from zero network overhead. See Section 2 for the apples-to-apples engine comparison.

| Workload | mskql (ms) | disk (ms) | PG (ms) | Duck (ms) | vs PG | vs Duck | dk/mem |
|---|---|---|---|---|---|---|---|
| reads & scans | | | | | | | |
| Full table scan (5K rows ×200) | 189 | 186 | 213 | 65 | 0.89× | 2.93× | 0.98× |
| Filtered scan (5K rows ×500) | 208 | 204 | 310 | 117 | 0.67× | 1.78× | 0.98× |
| Indexed point lookup (×2K) | 60 | | 93 | 257 | 0.65× | 0.23× | |
| Composite index lookup (×2K) | 47 | | 83 | 347 | 0.57× | 0.14× | |
| Sort (5K rows ×200) | 131 | 126 | 233 | 108 | 0.56× | 1.22× | 0.96× |
| Multi-column sort (5K rows ×200) | 168 | 166 | 274 | 107 | 0.61× | 1.57× | 0.99× |
| Distinct (100 categories ×500) | 17 | 14 | 247 | 230 | 0.07× | 0.07× | 0.81× |
| writes | | | | | | | |
| Single-row inserts (10K) | 264 | | 1,505 | 2,098 | 0.18× | 0.13× | |
| Bulk update (1K of 5K rows ×200) | 42 | 43 | 244 | 106 | 0.17× | 0.40× | 1.02× |
| Insert + delete (2K rows, delete half ×50) | 2,358 | | 8,651 | 17,080 | 0.27× | 0.14× | |
| Transactional inserts (100 txns, 50 each) | 131 | | 160 | 526 | 0.82× | 0.25× | |
| aggregation & analytics | | | | | | | |
| Group + sum (5K rows ×500) | 15 | 12 | 280 | 245 | 0.05× | 0.06× | 0.75× |
| Expression aggregate (SUM(price×qty) ×500) | 14 | 11 | 302 | 221 | 0.05× | 0.07× | 0.80× |
| Wide aggregate (7 cols, 50K rows ×20) | 6 | 6 | 150 | 43 | 0.04× | 0.14× | 0.99× |
| Window function (ROW_NUMBER, 5K ×20) | 24 | 24 | 53 | 47 | 0.46× | 0.52× | 0.98× |
| Window function (RANK, 50K rows ×5) | 56 | 55 | 150 | 44 | 0.37× | 1.27× | 0.98× |
| Correlated subquery | 6 | 6 | 3,456 | 45 | 0.00× | 0.12× | 1.02× |
| joins, subqueries & complex queries | | | | | | | |
| Two-table join (500 + 2K rows ×50) | 12 | 12 | 39 | 73 | 0.32× | 0.17× | 0.94× |
| Three-table join + group (50K rows ×5) | 7 | 7 | 89 | 33 | 0.08× | 0.22× | 0.96× |
| Subquery filter (IN subquery ×50) | 19 | 17 | 67 | 39 | 0.28× | 0.47× | 0.94× |
| Subquery pipeline (filter + sort + limit, 50K ×20) | 9 | 9 | 116 | 49 | 0.08× | 0.18× | 1.01× |
| CTE filter chain (5K rows ×200) | 24 | 22 | 89 | 69 | 0.27× | 0.35× | 0.93× |
| CTE pipeline (agg → sort → limit, 50K ×20) | 6 | 5 | 66 | 30 | 0.09× | 0.19× | 0.95× |
| Analytical CTE (join→agg→sort, 50K ×5) | 5 | 5 | 66 | 32 | 0.07× | 0.15× | 1.01× |
| Union (two 2K-row tables ×50) | 27 | 27 | 60 | 49 | 0.45× | 0.55× | 0.97× |
| functions & generation | | | | | | | |
| Generate series (10K rows ×200) | 124 | | 232 | 43 | 0.53× | 2.92× | |
| Scalar functions (4 functions, 5K rows ×200) | 221 | 220 | 630 | 98 | 0.35× | 2.26× | 1.00× |
| large dataset (50K rows) | | | | | | | |
| Large sort (50K rows ×10) | 135 | 135 | 212 | 50 | 0.64× | 2.70× | 1.00× |
| Wide output sort (10 cols, 50K rows) | 128 | 127 | 170 | 40 | 0.75× | 3.22× | 0.99× |
| worst case & complex (200K rows) | | | | | | | |
| Large dataset agg (200K rows) | 7 | 7 | 13 | 26 | 0.53× | 0.26× | 1.05× |
| Join reorder (3 tables, 50K orders) | 7 | 7 | 78 | 30 | 0.08× | 0.22× | 1.01× |
| String-heavy (CONCAT+UPPER+LIKE, 10K rows) | 12 | 11 | 35 | 43 | 0.35× | 0.28× | 0.93× |
| Nested CTE (3-level CTE, 50K rows) | 17 | 17 | 49 | 30 | 0.34× | 0.56× | 1.00× |
| Multi-index lookup (100K rows) | 9 | | 13 | 149 | 0.68× | 0.06× | |
| stress tests (500K rows) | | | | | | | |
| Stress full agg (500K rows) | 360 | | 764 | 74 | 0.47× | 4.90× | |
| Stress high-card GB (100K groups) | 249 | | 1,229 | 245 | 0.20× | 1.01× | |
| Stress large sort (500K rows) | 408 | | 557 | 87 | 0.73× | 4.71× | |
| Stress join 2-way (500K + 10K rows) | 350 | | 2,284 | 109 | 0.15× | 3.20× | |
| Stress join 3-way (3 large tables) | 478 | | 2,941 | 128 | 0.16× | 3.73× | |
| Stress filtered expr (WHERE + computed cols) | 298 | | 641 | 134 | 0.46× | 2.22× | |
| Stress window (RANK, 500K rows) | 327 | | 482 | 115 | 0.68× | 2.85× | |
| Stress nested CTE (3-level CTE, 500K rows) | 495 | | 1,057 | 185 | 0.47× | 2.67× | |
| vector (mskql vs PostgreSQL + pgvector) | | | | | | | |
| Vector insert (5K rows, VECTOR(16)) | 136 | | 383 | | 0.35× | | |
| Vector scan (5K rows ×200, cached) | 76 | | 134 | | 0.57× | | |
| Vector scan (no cache) | 107 | | 135 | | 0.79× | | |
| Vector wide scan (id + name + vec, cached) | 7 | | 17 | | 0.40× | | |
| Vector wide scan (no cache) | 11 | | 17 | | 0.66× | | |
| Vector filter (WHERE + vec output, cached) | 13 | | 42 | | 0.30× | | |
| Vector filter (no cache) | 28 | | 43 | | 0.64× | | |

All cached workloads faster than PostgreSQL. Closest: full scan (0.89×), transactions (0.82×). Against DuckDB over the wire: mskql wins aggregates (17×), distinct (14×), writes, analytical CTE, correlated subquery, joins, and all stress tests. DuckDB’s wire-level advantages on scans and sorts are a measurement artifact—see Section 2 for the fair comparison. The dk/mem column shows disk-backed tables average 0.96× the in-memory speed. See workload details for the exact SQL.
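
For reference, the disk variants differ only in DDL. A minimal sketch with a made-up schema (only the CREATE DISK TABLE prefix comes from this page):

/* Illustrative DDL pair; the column list is invented. */
#include <stdio.h>

int main(void) {
    const char *mem_ddl  = "CREATE TABLE items (id INT, category TEXT, price FLOAT)";
    const char *disk_ddl = "CREATE DISK TABLE items (id INT, category TEXT, price FLOAT)";
    printf("%s;\n%s;\n", mem_ddl, disk_ddl);
    return 0;
}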

4. Parquet Benchmarks — mskql vs DuckDB

12 Parquet workloads (50K–200K rows; two mix Parquet with in-memory tables). PostgreSQL cannot query Parquet files natively, so this section compares mskql and DuckDB only. The mskql side uses CREATE FOREIGN TABLE; DuckDB uses its native Parquet reader.
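
Roughly what each side runs, sketched below: the DuckDB statement is real DuckDB syntax (it can read a .parquet path directly in FROM); the mskql DDL is an assumed shape with the column list elided, and the TPC-H-style names are illustrative.

/* Illustrative setup for one Parquet workload. Only CREATE FOREIGN
   TABLE (mskql) and reading a .parquet path in FROM (DuckDB) are
   features named on this page; everything else is invented. */
#include <stdio.h>

int main(void) {
    const char *mskql_setup =
        "CREATE FOREIGN TABLE lineitem (...)";   /* backed by lineitem.parquet */
    const char *duckdb_query =
        "SELECT l_returnflag, SUM(l_quantity) "
        "FROM 'lineitem.parquet' GROUP BY l_returnflag";
    printf("mskql setup:  %s\nduckdb query: %s\n", mskql_setup, duckdb_query);
    return 0;
}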

| Workload | mskql (ms) | Duck (ms) | Ratio (cached) | Ratio (no-cache) | Notes |
|---|---|---|---|---|---|
| pq_full_scan (50K rows) | 261 | 41 | 6.36× | 6.33× | Columnar Parquet scan via PLAN_PARQUET_SCAN |
| pq_where (filtered scan, 50K rows) | 164 | 74 | 2.23× | 2.36× | Result cache bypasses Parquet I/O on repeat |
| pq_aggregate (GROUP BY, 50K rows) | 12 | 104 | 0.12× | 0.12× | Columnar hash aggregation on cached data |
| pq_order_by (sort, 50K rows) | 135 | 48 | 2.83× | 3.10× | Sort requires full materialization |
| pq_wide_agg (7 cols, 50K rows) | 8 | 51 | 0.16× | 0.43× | Multi-column agg stays in L1 cache |
| pq_join_two (50K + 2K rows) | 7 | 37 | 0.18× | 0.44× | Hash join on cached Parquet data |
| pq_join_three (3 Parquet files) | 8 | 33 | 0.23× | 0.54× | Three-way join across Parquet files |
| pq_subquery (IN subquery, 2 files) | 5 | 56 | 0.10× | 0.10× | Hash semi-join on cached result |
| pq_lineitem_agg (200K rows, join + agg) | 15 | 43 | 0.35× | 0.84× | Large Parquet join + aggregate |
| pq_analytical (CTE + join + agg, 3 files) | 6 | 30 | 0.20× | 0.38× | Full analytical pipeline on Parquet |
| analytics_stress (mixed parquet + mem) | 25 | 66 | 0.38× | 0.54× | Complex analytics mixing parquet + in-memory |
| pq_mixed_join (parquet + mem table) | 9 | 60 | 0.15× | 1.46× | Join Parquet file with in-memory table |

Cached: mskql wins 9 of 12. The result cache serves repeated Parquet queries from cached wire bytes; DuckDB re-reads the Parquet file every time. No-cache: mskql wins 8 of 12. The plan executor’s columnar Parquet reader outperforms DuckDB on aggregates, joins, and subqueries even without caching. DuckDB wins the full scan, the WHERE scan, and the sort in both modes, plus pq_mixed_join when the cache is busted.

5. Concurrent Throughput (8 threads, 5 seconds)

Comparison: mskql vs PostgreSQL only—DuckDB does not expose a wire-protocol server. Ratio > 1.0 means mskql has higher throughput.

| Workload | mskql QPS | PG QPS | Ratio | Notes |
|---|---|---|---|---|
| simple reads & writes | | | | |
| Indexed point lookup | 86,805 | 24,048 | 3.61× | mskql 0.092ms p50 vs PG 0.248ms |
| Full table scan (5K rows) | 125 | 122 | 1.02× | Network-bound: both ~66ms p50 |
| Filtered scan (5K rows) | 253 | 245 | 1.03× | Network-bound: both ~33ms p50 |
| Single-row insert | 66,410 | 40,105 | 1.66× | No fsync, no WAL |
| Mixed read/write (80/20) | 55,817 | 59,200 | 0.94× | PG’s MVCC wins on concurrent writes |
| aggregation & analytics | | | | |
| Group + sum (5K rows) | 97,210 | 14,305 | 6.80× | Result cache: 0.082ms p50 |
| Wide aggregate (50K rows) | 97,008 | 1,055 | 92.0× | Result cache: 0.082ms p50 |
| Window function (RANK, 50K rows) | 52 | 12 | 4.33× | mskql 19ms p50 vs PG 83ms |
| complex queries (50K-row dataset) | | | | |
| Two-table join (500 + 2K rows) | 580 | 310 | 1.87× | |
| Multi-join (3 tables) | 37,402 | 475 | 78.7× | Result cache: 0.213ms p50 |
| Mixed analytical (CTE→join→agg→sort) | 97,210 | 508 | 191× | Result cache: 0.082ms p50 |
| CTE pipeline (agg→sort→limit) | 6,018 | 1,980 | 3.04× | |
| Subquery pipeline (filter+sort+limit) | 1,220 | 920 | 1.33× | |
| Large sort (50K rows) | 33 | 12 | 2.75× | |

6. The Honest Picture

Where mskql wins vs DuckDB (in-process)

15 of 16 engine benchmarks. Insert bulk: 13ms vs 1,355ms (102×). Full scan: 1.0ms vs 13.0ms (13×). Window function: 10.2ms vs 48.3ms (5×). Distinct: 51.7ms vs 174.5ms (3.4×). Scalar functions: 7.3ms vs 27.7ms (3.8×). Join: 3.4ms vs 14.6ms (4.3×). Generate series: 0.5ms vs 9.4ms (19×). Order by: 16.1ms vs 32.9ms (2×). The result cache is not a factor here—this is raw engine speed, both in-process.

Wire-level: every cached batch workload vs PostgreSQL. Correlated subquery: 6ms vs 3,456ms (576×). Expression aggregate: 14ms vs 302ms (22×). Throughput: mixed_analytical at 97,210 QPS vs 508 (191×). multi_join at 37,402 vs 475 (79×).

Vectors: all 7 workloads vs PostgreSQL + pgvector. Vector insert: 136ms vs 383ms (2.8×). Vector filter (cached): 13ms vs 42ms (3.2×). Even without the result cache, mskql is faster on every vector workload.

Parquet: 9 of 12 workloads (cached). pq_aggregate: 12ms vs 104ms (8.7×). pq_subquery: 5ms vs 56ms (11×). The plan executor’s columnar reader outperforms DuckDB on aggregates, joins, and subqueries. DuckDB wins full scans, sorts, WHERE scans, and the no-cache mixed join (pq_mixed_join_nc).

Disk-backed tables: near-zero overhead. Average dk/mem ratio across all disk-eligible workloads: 0.96×. CREATE DISK TABLE adds persistence with no measurable performance cost on reads.

Where it’s a tie

DuckDB group_by (1.06×). Within measurement noise. DuckDB’s SIMD-vectorized hash aggregation matches mskql on high-cardinality GROUP BY.

What the wire-level data got wrong

The wire-level benchmarks previously showed DuckDB winning full scans (2.9×), sorts (1.2–2.0×), and generate_series (2.9×). This was an artifact of comparing DuckDB’s in-process CLI (zero network) against mskql’s pgwire path (full TCP round-trip). In-process, mskql wins all three: full scan 13×, large sort 2.2×, generate_series 19×.

Where PostgreSQL wins

Mixed read/write throughput: 0.94×. PostgreSQL’s MVCC allows concurrent readers and writers without blocking. The mskql single-threaded executor serializes all queries.

7. Running It Yourself

In-process engine comparison (mskql vs DuckDB)

# Build (requires libduckdb installed)
make bench_vs_duck

# Run all 16 benchmarks
./build/mskql_bench_vs_duck

# Filter to specific benchmarks
./build/mskql_bench_vs_duck --filter join

# Output JSON for CI integration
./build/mskql_bench_vs_duck --json results.json

Wire-level batch latency (3-way)

# Start mskql in one terminal
./build/mskql

# In another terminal, with PostgreSQL and DuckDB installed:
./bench/bench_vs_pg.sh

Concurrent throughput (mskql vs PG)

# Build the throughput benchmark
make -C bench bench_throughput

# Run (requires both mskql and PostgreSQL running)
./build/mskql_bench_tp --duration 5 --threads 8 --pg

The in-process benchmark links libmskql.a and libduckdb directly—no running servers needed. The batch script auto-creates a mskql_bench database in PostgreSQL if it doesn’t exist. The throughput benchmark requires libpq headers at build time.