The benchmark is node_hash_join_large: a 100K-row outer table
joined against a 50K-row build side. At this scale, the hash table exceeds
L1 and pressures L2/L3 cache. Measured as wall-clock latency of the
internal C benchmark harness (bench.c).
SELECT o.id, o.amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
-- 100K outer rows, 50K build-side rows
The hash join uses a classic open-addressing hash table with linear probing. Each probe follows a chain of bucket entries. When the build side is large enough to spill out of L1 cache, every probe step is a potential cache miss. The CPU stalls waiting for memory, and the join becomes memory-bound rather than compute-bound.
The classic hash table stores full entries (hash + row index) at each bucket. A probe that lands on a non-matching bucket must read the full entry to check the hash, then follow the chain pointer. Each step touches a different cache line. For 50K entries, the hash table spans ~400KB—well beyond L1 (64KB) and into L2 territory.
Added a Swiss Table control-byte overlay to
struct block_hash_table (block.h,
plan.c). The design:
ctrl[] — a dense uint8_t
array, one byte per slot. Each byte stores a 7-bit hash tag
(bits 25–31 of the full hash) or 0x80 (empty).
The entire ctrl array for 50K entries fits in ~50KB—L1-resident.slot_entry[] — maps each slot to
the bucket index in the existing entries[] array.swiss_ht_probe()
scans a 16-byte group of ctrl bytes starting at the hash-derived
position. The compiler auto-vectorizes this to a single NEON/SSE
comparison. If no ctrl byte matches the 7-bit tag, the probe
immediately moves to the next group—no entry reads needed.
Only on a tag match does it dereference the entry for a full key
comparison.nexts[] array. The Swiss Table finds the first match;
subsequent duplicates follow the chain as before.Other hash table consumers (hash aggregation, semi-join, distinct, set operations) are unchanged—they still use classic buckets. The Swiss Table overlay is specific to hash join, where the build side is largest.
A/B comparison, -O3 -march=native -flto:
| Benchmark | Before | After | Change |
|---|---|---|---|
| node_hash_join_large (100K×50K, L2/L3) | — | — | ~8–10% faster total, 7.5% faster p50 |
| node_hash_join (100K×1K, L1-resident) | — | — | ~2% faster |
| Small joins (500–2000 rows) | — | — | 0–5%, within noise |
The benefit scales with build-side size. When the hash table fits in L1, the ctrl-byte check is redundant—the full entry is already in cache. When it spills to L2/L3, the compact ctrl array provides a fast-miss filter that avoids most cache-line fetches. No regressions on any benchmark.