Logical IR: three-stage query pipeline

A 7-node canonical representation between the parser and the physical planner.

Setup

The benchmark is bench_analytical_cte: a CTE pipeline with joins, aggregation, and sorting. The query touches three tables through a multi-stage plan. Measured as wall-clock latency through the PostgreSQL wire protocol.

WITH order_summary AS (
    SELECT o.customer_id,
           SUM(o.quantity * p.price) AS total
    FROM orders o
    JOIN products p ON o.product_id = p.id
    GROUP BY o.customer_id
)
SELECT c.region, COUNT(*), SUM(os.total) AS revenue
FROM order_summary os
JOIN customers c ON os.customer_id = c.id
GROUP BY c.region
ORDER BY revenue DESC;

Problem

The parser emits a struct query_select with ~40 boolean flags: has_join, has_group_by, has_order_by, has_distinct, has_having, has_window, has_set_op, and so on. The physical plan builder (plan.c) had to pattern-match this flat flag soup directly, leading to fragile if/else chains that grew with every new feature. Adding projection pushdown or predicate pushdown required touching every builder path.

Cause

There was no intermediate representation between parsing and physical planning. The parser produced a flat record of everything the query mentioned. The planner consumed it directly. This meant every optimization had to understand the raw parser output, and rewrite rules (merge two filters, push a filter past a join) had nowhere to operate.

Fix

Added logical.c (~660 lines) and logical.h (~180 lines) implementing a three-stage pipeline:

Stage 1: logical_build() — desugars the parser AST into a minimal tree of 7 node types:

L_SCAN — table scan, subquery, CTE, generate_series, or set operation
L_FILTER — predicate (WHERE or HAVING)
L_PROJECT — column selection and expressions
L_JOIN — left + right children with explicit ON condition
L_AGGREGATE — GROUP BY keys + aggregate expressions
L_SORT — ORDER BY
L_LIMIT — LIMIT + OFFSET

Sugar is fully lowered: HAVING becomes L_FILTER wrapping L_AGGREGATE. DISTINCT becomes L_AGGREGATE with all projected columns as group keys. DISTINCT ON desugars to L_SORT + L_AGGREGATE(first_row_only=1).

Stage 2: logical_normalize() — applies semantics-preserving tree rewrites in a single bottom-up pass:

Filter merge — L_FILTER(L_FILTER(x, p1), p2) → L_FILTER(x, p1 AND p2). Combines adjacent filters into one, reducing plan tree depth.
Predicate pushdown — L_FILTER(L_JOIN(a, b), pred_on_a) → L_JOIN(L_FILTER(a, pred), b). Moves single-column filters inside joins so the physical planner sees L_FILTER(L_SCAN) on the left child, enabling index scan detection.
Dead project elimination — L_PROJECT(L_PROJECT(x, c1), c2) → L_PROJECT(x, c2). Removes redundant projections.

Stage 3: logical_to_physical() (in plan.c) — pattern-matches the normalized logical tree to physical plan nodes (PLAN_SEQ_SCAN, PLAN_HASH_JOIN, PLAN_HASH_AGG, PLAN_TOP_N, etc.). This is where projection pushdown narrows scan width per join table based on which columns are actually needed.

Result

The logical IR enabled two optimizations that would have been impractical to implement against the raw parser output:

Predicate pushdown — filters that reference only the left side of a join are pushed below the join node. The physical planner then applies them during the scan, reducing the number of rows entering the join.
Projection pushdown — the physical planner analyzes which columns are referenced in join keys, GROUP BY, aggregates, and filters, then narrows each table scan to only the needed columns. For wide tables, this reduces memory traffic and intermediate tuple width.

The IR also made EXPLAIN (LOGICAL) possible: a human-readable dump of the canonical tree before physical planning, useful for debugging plan decisions.