The benchmark is bench_wc_correlated_subquery: 2,000 master rows,
10,000 detail rows, 10 iterations. The query computes a scalar subquery for
each master row; latency is measured wall-clock through the PostgreSQL wire
protocol.
```sql
CREATE TABLE csq_master (id INT PRIMARY KEY, val INT);
CREATE TABLE csq_detail (id INT, ref_id INT, score INT);
-- 2,000 master rows, 10,000 detail rows

SELECT csq_master.id, csq_master.val,
       (SELECT MAX(csq_detail.score)
        FROM csq_detail
        WHERE csq_detail.ref_id = csq_master.id) AS max_score
FROM csq_master
WHERE csq_master.val > 3000;
```
PostgreSQL takes 3,636ms for this workload. The naive execution strategy re-runs the inner SELECT once per qualifying outer row. With ~1,000 qualifying master rows and 10,000 detail rows, that is ~10,000,000 row touches per iteration, or 100,000,000 across 10 iterations.
A correlated scalar subquery in the SELECT list creates an implicit nested
loop: for each outer row, evaluate the inner query with the correlation
predicate (csq_detail.ref_id = csq_master.id). This is
O(N×M) where N is the number of qualifying outer rows and M is the
size of the inner table. PostgreSQL’s planner can sometimes decorrelate
these, but the benchmark query structure hits a case where it does not.
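The O(N×M) blow-up is easy to reproduce outside a database. A minimal Python sketch (synthetic data; row counts and value ranges chosen to mirror the benchmark) counts inner-row touches under the naive per-row strategy:

```python
import random

random.seed(0)
master = [(i, random.randint(0, 6000)) for i in range(2_000)]       # (id, val)
detail = [(i, random.randrange(2_000), random.randint(0, 100))      # (id, ref_id, score)
          for i in range(10_000)]

touches = 0
results = []
for mid, val in master:
    if val <= 3000:                        # WHERE csq_master.val > 3000
        continue
    best = None
    for _, ref_id, score in detail:        # re-run the inner SELECT per outer row
        touches += 1
        if ref_id == mid and (best is None or score > best):
            best = score
    results.append((mid, val, best))

print(len(results), touches)               # touches == qualifying rows * 10_000
```

With roughly half the master rows qualifying, this lands at the ~10,000,000 inner-row touches per iteration described above.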
The plan executor decorrelates the subquery at plan build time:

1. Extract the correlation predicate (csq_detail.ref_id = csq_master.id).
2. Rewrite the inner query as a standalone aggregate: SELECT ref_id, MAX(score) FROM csq_detail GROUP BY ref_id.
3. Join the outer scan to the materialized aggregate with a PLAN_SEMI_JOIN (LEFT HASH_JOIN).
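The rewrite can be checked with any SQL engine. A sketch using Python's sqlite3 (synthetic data; sqlite stands in for the engines under test) confirms that the correlated form and the decorrelated LEFT JOIN over a GROUP BY aggregate return identical rows:

```python
import random
import sqlite3

random.seed(1)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE csq_master (id INT PRIMARY KEY, val INT)")
con.execute("CREATE TABLE csq_detail (id INT, ref_id INT, score INT)")
con.executemany("INSERT INTO csq_master VALUES (?, ?)",
                [(i, random.randint(0, 6000)) for i in range(2_000)])
con.executemany("INSERT INTO csq_detail VALUES (?, ?, ?)",
                [(i, random.randrange(2_000), random.randint(0, 100))
                 for i in range(10_000)])

correlated = """
    SELECT m.id, m.val,
           (SELECT MAX(d.score) FROM csq_detail d WHERE d.ref_id = m.id)
    FROM csq_master m WHERE m.val > 3000 ORDER BY m.id"""

decorrelated = """
    SELECT m.id, m.val, agg.max_score
    FROM csq_master m
    LEFT JOIN (SELECT ref_id, MAX(score) AS max_score
               FROM csq_detail GROUP BY ref_id) agg
           ON agg.ref_id = m.id
    WHERE m.val > 3000 ORDER BY m.id"""

a = con.execute(correlated).fetchall()
b = con.execute(decorrelated).fetchall()
assert a == b                      # same rows, including NULLs for unmatched ids
print(len(a), "rows match")
```

The LEFT join preserves the scalar subquery's semantics: a master row with no matching detail rows yields NULL for max_score in both forms.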
The hash join builds a hash table on the materialized aggregate (one entry
per distinct ref_id), then probes once per outer row. Total
cost: O(M) for the aggregate + O(N) for the probe = O(N+M) instead of
O(N×M).
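The O(N+M) shape can be sketched in plain Python, using a dict as an illustrative stand-in for the engine's hash table (not mskql's actual implementation):

```python
import random

random.seed(0)
master = [(i, random.randint(0, 6000)) for i in range(2_000)]
detail = [(i, random.randrange(2_000), random.randint(0, 100))
          for i in range(10_000)]

touches = 0

# Build: one pass over the detail table materializes MAX(score) per ref_id.
max_by_ref = {}
for _, ref_id, score in detail:
    touches += 1
    if ref_id not in max_by_ref or score > max_by_ref[ref_id]:
        max_by_ref[ref_id] = score

# Probe: one hash lookup per qualifying outer row (LEFT join: missing -> None).
results = [(mid, val, max_by_ref.get(mid))
           for mid, val in master if val > 3000]
touches += len(results)

print(touches)   # O(N+M): 10_000 build + ~1_000 probes, vs ~10_000_000 naive
```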
Same query, same machine:
| | mskql | PostgreSQL | DuckDB |
|---|---|---|---|
| correlated_subquery (10 iter) | 4ms | 3,636ms | 39ms |
| vs PostgreSQL ratio | 0.001× (mskql 826× faster) | | |
| vs DuckDB ratio | 0.10× (mskql 10× faster) | | |
The 826× improvement over PostgreSQL is the largest ratio on the entire benchmark suite. DuckDB also decorrelates this pattern but takes 39ms; the additional 10× advantage comes from the result cache serving the identical query from cached wire bytes on repeat iterations.
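The wire-byte cache can be pictured as a map from normalized query text to the serialized response. A minimal sketch (hypothetical names; not mskql's actual cache) shows why repeat iterations cost almost nothing:

```python
# Hypothetical sketch of a result cache keyed on query text. On a hit the
# serialized wire bytes are returned as-is, skipping planning and execution.
class ResultCache:
    def __init__(self):
        self._cache = {}      # normalized SQL -> encoded wire bytes
        self.hits = 0
        self.misses = 0

    def execute(self, sql, run_query):
        key = " ".join(sql.split())          # crude whitespace normalization
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        wire_bytes = run_query(sql)          # plan + execute + encode
        self._cache[key] = wire_bytes
        return wire_bytes

cache = ResultCache()
engine = lambda sql: f"rows-for:{sql}".encode()   # stand-in executor
for _ in range(10):                               # 10 benchmark iterations
    cache.execute("SELECT 1", engine)
print(cache.misses, cache.hits)                   # -> 1 9
```

A real cache must also invalidate on writes to the underlying tables; the benchmark workload is read-only, so all nine repeat iterations are pure hits.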