Swiss Table hash join

Ctrl-byte overlay for cache-friendly hash join build and probe.

Setup

The benchmark is node_hash_join_large: a 100K-row outer table joined against a 50K-row build side. At this scale, the hash table exceeds L1 and pressures L2/L3 cache. Numbers are wall-clock latency reported by the internal C benchmark harness (bench.c).

SELECT o.id, o.amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
-- 100K outer rows, 50K build-side rows

Problem

The hash join uses a classic open-addressing hash table with linear probing. Each probe walks a run of bucket entries until it finds a match or an empty slot. When the build side is large enough to spill out of L1 cache, every probe step is a potential cache miss: the CPU stalls waiting for memory, and the join becomes memory-bound rather than compute-bound.
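To make the access pattern concrete, here is a minimal sketch of a classic linear-probing lookup. The names (entry_t, classic_probe) and the empty-slot convention are illustrative, not the actual layout in block.h; the point is that every probe step reads a full 8-byte entry from a potentially cold cache line.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical classic bucket: the full entry lives in the bucket array,
 * so every probe step reads an 8-byte entry. In this sketch hash == 0
 * marks an empty bucket. */
typedef struct {
    uint32_t hash; /* 0 = empty slot */
    uint32_t row;  /* build-side row index */
} entry_t;

/* Linear probing: on a mismatch, step to the next bucket. Once the table
 * outgrows L1, each step can land on a different cold cache line. */
static int64_t classic_probe(const entry_t *buckets, size_t nbuckets,
                             uint32_t h) {
    size_t i = h & (nbuckets - 1); /* nbuckets is a power of two */
    while (buckets[i].hash != 0) {
        if (buckets[i].hash == h)
            return (int64_t)buckets[i].row; /* candidate; caller checks key */
        i = (i + 1) & (nbuckets - 1);
    }
    return -1; /* miss */
}
```

Note that even a probe destined to miss must fetch the full entry at every step just to compare the hash.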

Cause

The classic hash table stores a full entry (hash + row index) in each bucket. A probe that lands on a non-matching bucket must read the full entry to compare the hash before stepping to the next bucket, and each step can touch a different cache line. For 50K entries, the hash table spans ~400KB, well beyond L1 (64KB) and deep into L2 territory.

Fix

Added a Swiss Table control-byte overlay to struct block_hash_table (block.h, plan.c). The design:

- A parallel ctrl array holds one byte per bucket: a sentinel for empty slots, or a 7-bit tag derived from the entry's hash.
- A probe scans the ctrl array first and reads the full bucket entry only when the 1-byte tag matches, so most non-matching slots are rejected without fetching their cache line.
- The ctrl array is 8x smaller than the entry array, so it stays cache-resident long after the entries have spilled to L2/L3.

Other hash table consumers (hash aggregation, semi-join, distinct, set operations) are unchanged—they still use classic buckets. The Swiss Table overlay is specific to hash join, where the build side is largest.
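A sketch of the ctrl-byte probe path, under the assumptions above. The names (swiss_table_t, swiss_probe, CTRL_EMPTY) and the exact tag/index split are hypothetical; the real layout in block.h and plan.c may differ.

```c
#include <stdint.h>
#include <stddef.h>

#define CTRL_EMPTY 0x80 /* high bit set: empty slot */

typedef struct {
    uint32_t hash;
    uint32_t row;
} entry_t;

/* Hypothetical overlay: a byte-per-slot ctrl array sits alongside the
 * full entry array. An occupied ctrl byte holds the low 7 bits of the
 * entry's hash; the remaining hash bits pick the starting slot. */
typedef struct {
    uint8_t *ctrl;    /* 1 byte per slot: CTRL_EMPTY or a 7-bit tag */
    entry_t *entries; /* full entries, touched only on a tag match */
    size_t   nslots;  /* power of two */
} swiss_table_t;

/* Probe: scan the compact ctrl array first. Non-matching slots are
 * rejected on the 1-byte tag without touching the entry array. */
static int64_t swiss_probe(const swiss_table_t *t, uint32_t h) {
    uint8_t tag = (uint8_t)(h & 0x7f);
    size_t i = (h >> 7) & (t->nslots - 1);
    while (t->ctrl[i] != CTRL_EMPTY) {
        if (t->ctrl[i] == tag && t->entries[i].hash == h)
            return (int64_t)t->entries[i].row; /* caller verifies join key */
        i = (i + 1) & (t->nslots - 1);
    }
    return -1;
}
```

Because a 7-bit tag mismatch rejects a slot with probability ~127/128, the probe loop rarely needs the entry array at all on its way to a match or a miss.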

Result

A/B comparison, -O3 -march=native -flto:

Benchmark                                Change
node_hash_join_large (100K×50K, L2/L3)   ~8–10% faster total, 7.5% faster p50
node_hash_join (100K×1K, L1-resident)    ~2% faster
Small joins (500–2000 rows)              0–5%, within noise

The benefit scales with build-side size. When the hash table fits in L1, the ctrl-byte check is redundant—the full entry is already in cache. When it spills to L2/L3, the compact ctrl array provides a fast-miss filter that avoids most cache-line fetches. No regressions on any benchmark.