Notes

Implementation details, measured.

Each note describes one optimization: the query used to measure it, what the code was doing before, what changed, and the before/after numbers. All measurements use the same benchmark harness (bench-vs-pg) on the same machine.

Stress benchmarks: 500K-row workloads
Eight workloads at 500K rows. mskql beats PostgreSQL on all 8. DuckDB wins over the wire but ties on high-cardinality GROUP BY.
Projection pushdown
Only read columns the query references. Especially effective for Parquet foreign tables where entire column chunks are skipped.
Logical IR: three-stage query pipeline
7-node canonical representation between parser and physical planner. Predicate pushdown, filter merge, projection pushdown, dead project elimination.
Swiss Table hash join
Ctrl-byte overlay for cache-friendly hash join build and probe. ~8–10% faster on L2/L3 pressure workloads (100K×50K join).
Hand-rolled Parquet reader
Replaced third-party Carquet library with 884 lines of C. Direct Thrift compact protocol decoder with zstd/zlib decompression.
HNSW vector index
Approximate nearest-neighbor search via hierarchical navigable small-world graphs. L2, cosine, and inner-product distance metrics in 608 lines of C.
Eliminating the legacy row-store
One storage format, one code path: flat columnar arrays everywhere. Removed dual row/column sync overhead.
Result cache and join cache
Wire-byte result cache + hash-join build-side cache. mixed_analytical throughput: 97,210 QPS vs PostgreSQL 508 QPS (191×).
Disk-backed tables: near-zero overhead persistence
CREATE DISK TABLE adds persistence at 0.96× the in-memory speed.
Vectorized expression aggregates (VEC_PROJECT)
Direct column-array arithmetic replaces per-row expression evaluation. expression_agg: 14ms vs PostgreSQL 302ms (22×).
Top-N sort: fused SORT + LIMIT via binary heap
Binary max-heap of size K avoids full sort. subquery_complex with LIMIT 500 on 50K rows.
Correlated subquery decorrelation
Materialized aggregate + LEFT HASH_JOIN replaces O(N×M) nested loop. 6ms vs PostgreSQL 3,456ms (576×).
CTE materialization: from legacy fallback to plan executor
Pre-materialize non-recursive CTEs as temp tables before plan build. mixed_analytical CTE pipeline fully in the columnar executor.
Radix sort for integer ORDER BY
8-bit radix sort replaces comparison-based sort for integer keys. large_sort: 135ms vs PostgreSQL 212ms.
Zero steady-state allocation: the arena model
Bump-slab arena with pointer rewind replaces malloc/free per query. aggregate: 15ms vs PostgreSQL 280ms (19×).
Closing the ORDER BY gap: 4.65× slower to 1.18×
Direct columnar→wire serialization + flat sort comparator. order_by: 1,676ms → 429ms.