Notes

Implementation details, measured.

Each note describes one optimization: the query used to measure it, what the code was doing before, what changed, and the before/after numbers. All measurements use the same benchmark harness (bench-vs-pg) on the same machine.

  1. Stress benchmarks: 500K-row workloads
    Eight workloads at 500K rows. mskql beats PostgreSQL on all 8. DuckDB wins over the wire but ties on high-cardinality GROUP BY.
  2. Projection pushdown
    Only read columns the query references. Especially effective for Parquet foreign tables where entire column chunks are skipped.
  3. Logical IR: three-stage query pipeline
    7-node canonical representation between parser and physical planner. Predicate pushdown, filter merge, projection pushdown, dead project elimination.
  4. Swiss Table hash join
    Ctrl-byte overlay for cache-friendly hash join build and probe. ~8–10% faster on L2/L3 pressure workloads (100K×50K join).
  5. Hand-rolled Parquet reader
    Replaced third-party Carquet library with 884 lines of C. Direct Thrift compact protocol decoder with zstd/zlib decompression.
  6. HNSW vector index
    Approximate nearest-neighbor search via hierarchical navigable small-world graphs. L2, cosine, and inner-product distance metrics in 608 lines of C.
  7. Eliminating the legacy row-store
    One storage format, one code path: flat columnar arrays everywhere. Removed dual row/column sync overhead.
  8. Result cache and join cache
    Wire-byte result cache + hash-join build-side cache. mixed_analytical throughput: 97,210 QPS vs PostgreSQL 508 QPS (191×).
  9. Disk-backed tables: near-zero overhead persistence
    CREATE DISK TABLE adds persistence at 0.96× the in-memory speed.
  10. Vectorized expression aggregates (VEC_PROJECT)
    Direct column-array arithmetic replaces per-row expression evaluation. expression_agg: 14ms vs PostgreSQL 302ms (22×).
  11. Top-N sort: fused SORT + LIMIT via binary heap
    Binary max-heap of size K avoids full sort. subquery_complex with LIMIT 500 on 50K rows.
  12. Correlated subquery decorrelation
    Materialized aggregate + LEFT HASH_JOIN replaces O(N×M) nested loop. 6ms vs PostgreSQL 3,456ms (576×).
  13. CTE materialization: from legacy fallback to plan executor
    Pre-materialize non-recursive CTEs as temp tables before plan build. mixed_analytical CTE pipeline fully in the columnar executor.
  14. Radix sort for integer ORDER BY
    8-bit radix sort replaces comparison-based sort for integer keys. large_sort: 135ms vs PostgreSQL 212ms.
  15. Zero steady-state allocation: the arena model
    Bump-slab arena with pointer rewind replaces malloc/free per query. aggregate: 15ms vs PostgreSQL 280ms (19×).
  16. Closing the ORDER BY gap: 4.65× slower to 1.18×
    Direct columnar→wire serialization + flat sort comparator. order_by: 1,676ms → 429ms.