The benchmark is `bench_mixed_analytical`: a CTE pipeline that joins,
aggregates, and sorts across a star-schema dataset of 50,000 order rows, 2,000
customers, and 200 products. Latency is measured wall-clock through the
PostgreSQL wire protocol over 500 iterations.
```sql
WITH order_summary AS (
    SELECT o.customer_id,
           SUM(o.quantity * p.price) AS total
    FROM orders o
    JOIN products p ON o.product_id = p.id
    GROUP BY o.customer_id
)
SELECT c.region, COUNT(*), SUM(os.total) AS revenue
FROM order_summary os
JOIN customers c ON os.customer_id = c.id
GROUP BY c.region
ORDER BY revenue DESC;
```
The plan executor (`plan.c`) handles SELECT queries through a columnar
pipeline: scan → filter → hash join → hash aggregate → sort →
top-N → project. This pipeline is significantly faster than the legacy
row-by-row executor in `query.c`. However, queries with CTEs were
falling back to the legacy path entirely.
The logical plan builder (`logical_build`) emitted
`L_SCAN(cte_name)` when it encountered a CTE reference in the FROM
clause. The plan executor then called `db_find_table_sv()` to resolve
the scan source, but CTE names are not real tables, so the lookup returned
NULL. This caused `plan_build_select()` to return failure, and the
entire query fell back to the legacy row-by-row executor.
The result: a query that should have run through the fast columnar pipeline (hash join + hash aggregate + sort on column blocks) instead ran through the slow path (per-row expression evaluation, row-based hash table, row-based sort).
In `plan_build_select()` (`plan.c`), before calling
`logical_build`, the function now iterates over the query's CTE
list. For each non-recursive CTE:

1. It materializes the CTE body into a temporary table via
   `db_exec()`, which may itself use the plan executor for the
   CTE's inner query.
2. The temp table is registered under the CTE's name, so a later
   `db_find_table_sv()` resolves it.

After the outer query completes, the temp tables are dropped. This approach
means `logical_build` sees real tables for every CTE reference,
and the full columnar pipeline handles the outer query's joins, aggregates,
and sorts.
Same query, same machine:
| | mskql | PostgreSQL | DuckDB |
|---|---|---|---|
| mixed_analytical (500 iter) | 7 ms | 65 ms | 28 ms |
| vs PostgreSQL | 0.11× (mskql 9× faster) | | |
| vs DuckDB | 0.25× (mskql 4× faster) | | |
The CTE materialization fix moved the entire analytical pipeline into the columnar executor. Combined with the result cache (which caches the wire bytes for repeated identical queries), this workload sustains 97,210 QPS under concurrent throughput testing.