Projection pushdown: only read what the query needs

The plan executor skips columns that aren’t referenced, reducing memory traffic and improving cache locality.

Setup

Wide tables with many columns are common in analytics. A query like SELECT name, score FROM events on a 10-column table should not read the other 8 columns. The benchmark is wc_wide_output: 50,000 rows with 10 columns, sorted by one column.

SELECT id, name, score
FROM wide_table
ORDER BY score DESC;

Problem

Without projection pushdown, PLAN_SEQ_SCAN reads all columns from the table into the pipeline, even if only a subset is needed. For wide tables, this wastes memory bandwidth and pollutes the CPU cache with unused data. The sort stage then shuffles full-width rows through the comparator.

How it works

The plan builder analyzes the SELECT list, WHERE clause, ORDER BY, and GROUP BY to determine which columns are actually referenced. It constructs a column mask and passes it to the scan node. The scan node only materializes the needed columns into col_block arrays.

  1. Column resolution: Walk the parsed query and collect every column reference (select exprs, filter operands, sort keys, group keys).
  2. Mask construction: Build a bitmap of needed column indices. If the query uses SELECT *, all columns are marked.
  3. Scan pruning: PLAN_SEQ_SCAN skips columns not in the mask. For columnar storage this means entire column arrays are never touched.
  4. Downstream nodes (sort, aggregate, join) operate on the narrower tuple, improving cache utilization.

Impact on Parquet

Projection pushdown is especially effective for Parquet foreign tables. The Parquet reader can skip entire column chunks at the I/O level, avoiding both decompression and deserialization of unused columns. A query selecting 2 of 10 columns reads ~80% less data from disk.

Result

The wide output sort benchmark (wc_wide_output, 10 columns, 50K rows):

mskqlPostgreSQLDuckDB
wc_wide_output (cached)128ms170ms40ms
vs PostgreSQL ratio0.75× (mskql 1.3× faster)

Projection pushdown is one of the reasons mskql beats PostgreSQL on every analytical workload despite having no query planner or cost model. The plan builder applies it unconditionally—there is no statistics-based decision about whether to prune columns.