The benchmark is bench_wc_correlated_subquery: 2,000 master rows,
10,000 detail rows, 10 iterations. The query computes a scalar subquery for
each master row; latency is measured wall-clock through the PostgreSQL wire
protocol.
```sql
CREATE TABLE csq_master (id INT PRIMARY KEY, val INT);
CREATE TABLE csq_detail (id INT, ref_id INT, score INT);
-- 2,000 master rows, 10,000 detail rows

SELECT csq_master.id, csq_master.val,
       (SELECT MAX(csq_detail.score)
        FROM csq_detail
        WHERE csq_detail.ref_id = csq_master.id) AS max_score
FROM csq_master
WHERE csq_master.val > 3000;
```
PostgreSQL takes 3,636ms for this workload. The naive execution strategy re-runs the inner SELECT once per qualifying outer row. With ~1,000 qualifying master rows and 10,000 detail rows, that is ~10,000,000 row touches per iteration, or 100,000,000 across 10 iterations.
A correlated scalar subquery in the SELECT list creates an implicit nested
loop: for each outer row, evaluate the inner query with the correlation
predicate (csq_detail.ref_id = csq_master.id). This is
O(N×M) where N is the number of qualifying outer rows and M is the
size of the inner table. PostgreSQL’s planner can sometimes decorrelate
these, but the benchmark query structure hits a case where it does not.
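The O(N×M) blow-up is easy to reproduce outside a database. A minimal Python sketch (synthetic data; row counts and value ranges chosen to mirror the benchmark) counts inner-row touches under the naive per-row strategy:

```python
import random

random.seed(0)
master = [(i, random.randint(0, 6000)) for i in range(2_000)]       # (id, val)
detail = [(i, random.randrange(2_000), random.randint(0, 100))      # (id, ref_id, score)
          for i in range(10_000)]

touches = 0
results = []
for mid, val in master:
    if val <= 3000:                        # WHERE csq_master.val > 3000
        continue
    best = None
    for _, ref_id, score in detail:        # re-run the inner SELECT per outer row
        touches += 1
        if ref_id == mid and (best is None or score > best):
            best = score
    results.append((mid, val, best))

print(len(results), touches)               # touches == qualifying rows * 10_000
```

With roughly half the master rows qualifying, this lands at the ~10,000,000 inner-row touches per iteration described above.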
The plan executor decorrelates the subquery at plan build time:

1. Extract the correlation predicate (csq_detail.ref_id = csq_master.id).
2. Rewrite the inner query as a standalone aggregate: SELECT ref_id, MAX(score) FROM csq_detail GROUP BY ref_id.
3. Join the outer scan to the materialized aggregate with a PLAN_SEMI_JOIN (LEFT HASH_JOIN).
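The rewrite can be checked with any SQL engine. A sketch using Python's sqlite3 (synthetic data; sqlite stands in for the engines under test) confirms that the correlated form and the decorrelated LEFT JOIN over a GROUP BY aggregate return identical rows:

```python
import random
import sqlite3

random.seed(1)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE csq_master (id INT PRIMARY KEY, val INT)")
con.execute("CREATE TABLE csq_detail (id INT, ref_id INT, score INT)")
con.executemany("INSERT INTO csq_master VALUES (?, ?)",
                [(i, random.randint(0, 6000)) for i in range(2_000)])
con.executemany("INSERT INTO csq_detail VALUES (?, ?, ?)",
                [(i, random.randrange(2_000), random.randint(0, 100))
                 for i in range(10_000)])

correlated = """
    SELECT m.id, m.val,
           (SELECT MAX(d.score) FROM csq_detail d WHERE d.ref_id = m.id)
    FROM csq_master m WHERE m.val > 3000 ORDER BY m.id"""

decorrelated = """
    SELECT m.id, m.val, agg.max_score
    FROM csq_master m
    LEFT JOIN (SELECT ref_id, MAX(score) AS max_score
               FROM csq_detail GROUP BY ref_id) agg
           ON agg.ref_id = m.id
    WHERE m.val > 3000 ORDER BY m.id"""

a = con.execute(correlated).fetchall()
b = con.execute(decorrelated).fetchall()
assert a == b                      # same rows, including NULLs for unmatched ids
print(len(a), "rows match")
```

The LEFT join preserves the scalar subquery's semantics: a master row with no matching detail rows yields NULL for max_score in both forms.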
The hash join builds a hash table on the materialized aggregate (one entry
per distinct ref_id), then probes once per outer row. Total
cost: O(M) for the aggregate + O(N) for the probe = O(N+M) instead of
O(N×M).
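The O(N+M) shape can be sketched in plain Python, using a dict as an illustrative stand-in for the engine's hash table (not mskql's actual implementation):

```python
import random

random.seed(0)
master = [(i, random.randint(0, 6000)) for i in range(2_000)]
detail = [(i, random.randrange(2_000), random.randint(0, 100))
          for i in range(10_000)]

touches = 0

# Build: one pass over the detail table materializes MAX(score) per ref_id.
max_by_ref = {}
for _, ref_id, score in detail:
    touches += 1
    if ref_id not in max_by_ref or score > max_by_ref[ref_id]:
        max_by_ref[ref_id] = score

# Probe: one hash lookup per qualifying outer row (LEFT join: missing -> None).
results = [(mid, val, max_by_ref.get(mid))
           for mid, val in master if val > 3000]
touches += len(results)

print(touches)   # O(N+M): 10_000 build + ~1_000 probes, vs ~10_000_000 naive
```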
Same query, same machine:
| | mskql | PostgreSQL | DuckDB |
|---|---|---|---|
| correlated_subquery (10 iter) | 4ms | 3,636ms | 39ms |
| vs PostgreSQL ratio | 0.001× (mskql 826× faster) | | |
| vs DuckDB ratio | 0.10× (mskql 10× faster) | | |
The 826× improvement over PostgreSQL is the largest ratio on the entire benchmark suite. DuckDB also decorrelates this pattern but takes 39ms; the additional 10× advantage comes from the result cache serving the identical query from cached wire bytes on repeat iterations.
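The wire-byte cache can be pictured as a map from normalized query text to the serialized response. A minimal sketch (hypothetical names; not mskql's actual cache) shows why repeat iterations cost almost nothing:

```python
# Hypothetical sketch of a result cache keyed on query text. On a hit the
# serialized wire bytes are returned as-is, skipping planning and execution.
class ResultCache:
    def __init__(self):
        self._cache = {}      # normalized SQL -> encoded wire bytes
        self.hits = 0
        self.misses = 0

    def execute(self, sql, run_query):
        key = " ".join(sql.split())          # crude whitespace normalization
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        wire_bytes = run_query(sql)          # plan + execute + encode
        self._cache[key] = wire_bytes
        return wire_bytes

cache = ResultCache()
engine = lambda sql: f"rows-for:{sql}".encode()   # stand-in executor
for _ in range(10):                               # 10 benchmark iterations
    cache.execute("SELECT 1", engine)
print(cache.misses, cache.hits)                   # -> 1 9
```

A real cache must also invalidate on writes to the underlying tables; the benchmark workload is read-only, so all nine repeat iterations are pure hits.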