Logical IR: three-stage query pipeline

A 7-node canonical representation between the parser and the physical planner.

Setup

The benchmark is bench_analytical_cte: a CTE pipeline with joins, aggregation, and sorting. The query touches three tables through a multi-stage plan. Measured as wall-clock latency through the PostgreSQL wire protocol.

WITH order_summary AS (
    SELECT o.customer_id,
           SUM(o.quantity * p.price) AS total
    FROM orders o
    JOIN products p ON o.product_id = p.id
    GROUP BY o.customer_id
)
SELECT c.region, COUNT(*), SUM(os.total) AS revenue
FROM order_summary os
JOIN customers c ON os.customer_id = c.id
GROUP BY c.region
ORDER BY revenue DESC;

Problem

The parser emits a struct query_select with ~40 boolean flags: has_join, has_group_by, has_order_by, has_distinct, has_having, has_window, has_set_op, and so on. The physical plan builder (plan.c) had to pattern-match this flat flag soup directly, leading to fragile if/else chains that grew with every new feature. Adding projection pushdown or predicate pushdown required touching every builder path.

Cause

There was no intermediate representation between parsing and physical planning. The parser produced a flat record of everything the query mentioned. The planner consumed it directly. This meant every optimization had to understand the raw parser output, and rewrite rules (merge two filters, push a filter past a join) had nowhere to operate.

Fix

Added logical.c (~660 lines) and logical.h (~180 lines) implementing a three-stage pipeline:

Stage 1: logical_build() — desugars the parser AST into a minimal tree of 7 node types:

Sugar is fully lowered: HAVING becomes L_FILTER wrapping L_AGGREGATE. DISTINCT becomes L_AGGREGATE with all projected columns as group keys. DISTINCT ON desugars to L_SORT + L_AGGREGATE(first_row_only=1).

Stage 2: logical_normalize() — applies semantics-preserving tree rewrites in a single bottom-up pass:

Stage 3: logical_to_physical() (in plan.c) — pattern-matches the normalized logical tree to physical plan nodes (PLAN_SEQ_SCAN, PLAN_HASH_JOIN, PLAN_HASH_AGG, PLAN_TOP_N, etc.). This is where projection pushdown narrows scan width per join table based on which columns are actually needed.

Result

The logical IR enabled two optimizations that would have been impractical to implement against the raw parser output:

The IR also made EXPLAIN (LOGICAL) possible: a human-readable dump of the canonical tree before physical planning, useful for debugging plan decisions.