Vectorized expression aggregates (VEC_PROJECT)

Direct column-array arithmetic replaces per-row expression evaluation.

Setup

The benchmark is bench_expression_agg: 5,000 rows, 500 iterations of an aggregate with an arithmetic expression in the SELECT list. Measured as wall-clock latency through the PostgreSQL wire protocol.

CREATE TABLE sales (
    id INT, region TEXT, product TEXT,
    quantity INT, amount NUMERIC(10,2),
    price NUMERIC(10,2)
);
-- 5,000 rows inserted

SELECT category, SUM(price * quantity)
FROM sales
GROUP BY category;
-- repeated 500 times

Problem

The expression price * quantity inside SUM() must be evaluated for every row before aggregation. The legacy executor evaluates this per-row: for each of 5,000 rows, call eval_expr(), which dispatches on expression type, reads two cells, performs the multiplication, and writes the result to a temporary cell. This is 5,000 function calls with type dispatch per iteration, 2,500,000 across 500 iterations.

Cause

The per-row evaluator eval_expr() handles arbitrary expression trees: nested function calls, CASE expressions, type coercions, NULL propagation. For the common case of a simple binary operation on two columns, this generality is overhead. The function call, type switch, cell read, cell write, and NULL check dominate the actual arithmetic.

Fix

The plan builder recognizes simple arithmetic expressions and emits VEC_PROJECT / VEC_FUNC_* nodes in the plan tree (plan.c). These operate directly on col_block arrays:

VEC_FUNC_MUL_I32, VEC_FUNC_MUL_F64 — multiply two contiguous column arrays in a tight loop with no per-element dispatch.
VEC_FUNC_ADD_*, VEC_FUNC_SUB_*, VEC_FUNC_DIV_* — same pattern for other arithmetic operations.
VEC_FUNC_CEIL, VEC_FUNC_FLOOR, VEC_FUNC_SQRT, VEC_FUNC_SIGN_*, VEC_FUNC_MOD_*, VEC_FUNC_POWER — unary and binary math functions vectorized.
VEC_FUNC_CAST_F64_TO_I32, VEC_FUNC_COALESCE_LIT — type casts and COALESCE with literal defaults.

The vectorized path reads contiguous int32_t[] or double[] arrays from the column block, applies the operation in a single loop, and writes the result to a new column block. NULL handling is done via a parallel uint8_t[] null bitmap, checked once per block rather than per element.

Result

Same query, same machine:

	mskql	PostgreSQL	DuckDB
expression_agg (500 iter, 5K rows)	15ms	299ms	220ms
vs PostgreSQL ratio	0.05× (mskql 20× faster)
vs DuckDB ratio	0.07× (mskql 15× faster)

The VEC_PROJECT path is used automatically by the plan builder when it recognizes eligible expressions. Queries with complex expression trees (nested CASE, multi-level function composition) fall back to the per-row evaluator. The plan builder makes this decision at build time with zero runtime overhead for either path.