Swiss Table hash join

Ctrl-byte overlay for cache-friendly hash join build and probe.

Setup

The benchmark is node_hash_join_large: a 100K-row outer table joined against a 50K-row build side. At this scale, the hash table exceeds L1 and pressures L2/L3 cache. Measured as wall-clock latency of the internal C benchmark harness (bench.c).

SELECT o.id, o.amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
-- 100K outer rows, 50K build-side rows

Problem

The hash join uses a classic open-addressing hash table with linear probing. Each probe follows a chain of bucket entries. When the build side is large enough to spill out of L1 cache, every probe step is a potential cache miss. The CPU stalls waiting for memory, and the join becomes memory-bound rather than compute-bound.

Cause

The classic hash table stores full entries (hash + row index) at each bucket. A probe that lands on a non-matching bucket must read the full entry to check the hash, then follow the chain pointer. Each step touches a different cache line. For 50K entries, the hash table spans ~400KB—well beyond L1 (64KB) and into L2 territory.

Fix

Added a Swiss Table control-byte overlay to struct block_hash_table (block.h, plan.c). The design:

ctrl[] — a dense uint8_t array, one byte per slot. Each byte stores a 7-bit hash tag (bits 25–31 of the full hash) or 0x80 (empty). The entire ctrl array for 50K entries fits in ~50KB—L1-resident.
slot_entry[] — maps each slot to the bucket index in the existing entries[] array.
Probe path — swiss_ht_probe() scans a 16-byte group of ctrl bytes starting at the hash-derived position. The compiler auto-vectorizes this to a single NEON/SSE comparison. If no ctrl byte matches the 7-bit tag, the probe immediately moves to the next group—no entry reads needed. Only on a tag match does it dereference the entry for a full key comparison.
Duplicate keys — chained via the existing nexts[] array. The Swiss Table finds the first match; subsequent duplicates follow the chain as before.

Other hash table consumers (hash aggregation, semi-join, distinct, set operations) are unchanged—they still use classic buckets. The Swiss Table overlay is specific to hash join, where the build side is largest.

Result

A/B comparison, -O3 -march=native -flto:

Benchmark	Before	After	Change
node_hash_join_large (100K×50K, L2/L3)	—	—	~8–10% faster total, 7.5% faster p50
node_hash_join (100K×1K, L1-resident)	—	—	~2% faster
Small joins (500–2000 rows)	—	—	0–5%, within noise

The benefit scales with build-side size. When the hash table fits in L1, the ctrl-byte check is redundant—the full entry is already in cache. When it spills to L2/L3, the compact ctrl array provides a fast-miss filter that avoids most cache-line fetches. No regressions on any benchmark.