← mskql
Notes
Implementation details, measured.
Each note describes one optimization: the query used to measure it, what
the code was doing before, what changed, and the before/after numbers.
All measurements use the same benchmark harness
(bench-vs-pg) on the same machine.
-
Stress benchmarks: 500K-row workloads
Eight workloads at 500K rows. mskql beats PostgreSQL on all 8.
DuckDB wins over the wire but ties on high-cardinality GROUP BY.
-
Projection pushdown
Only read columns the query references. Especially effective for
Parquet foreign tables where entire column chunks are skipped.
-
Logical IR: three-stage query pipeline
7-node canonical representation between parser and physical planner.
Predicate pushdown, filter merge, projection pushdown, dead project elimination.
-
Swiss Table hash join
Ctrl-byte overlay for cache-friendly hash join build and probe.
~8–10% faster on L2/L3 pressure workloads (100K×50K join).
-
Hand-rolled Parquet reader
Replaced third-party Carquet library with 884 lines of C.
Direct Thrift compact protocol decoder with zstd/zlib decompression.
-
HNSW vector index
Approximate nearest-neighbor search via hierarchical navigable small-world graphs.
L2, cosine, and inner-product distance metrics in 608 lines of C.
-
Eliminating the legacy row-store
One storage format, one code path: flat columnar arrays everywhere.
Removed dual row/column sync overhead.
-
Result cache and join cache
Wire-byte result cache + hash-join build-side cache.
mixed_analytical throughput: 97,210 QPS vs PostgreSQL 508 QPS (191×).
-
Disk-backed tables: near-zero overhead persistence
CREATE DISK TABLE adds persistence at 0.96× the in-memory speed.
-
Vectorized expression aggregates (VEC_PROJECT)
Direct column-array arithmetic replaces per-row expression evaluation.
expression_agg: 14ms vs PostgreSQL 302ms (22×).
-
Top-N sort: fused SORT + LIMIT via binary heap
Binary max-heap of size K avoids full sort.
subquery_complex with LIMIT 500 on 50K rows.
-
Correlated subquery decorrelation
Materialized aggregate + LEFT HASH_JOIN replaces O(N×M) nested loop.
6ms vs PostgreSQL 3,456ms (576×).
-
CTE materialization: from legacy fallback to plan executor
Pre-materialize non-recursive CTEs as temp tables before plan build.
mixed_analytical CTE pipeline fully in the columnar executor.
-
Radix sort for integer ORDER BY
8-bit radix sort replaces comparison-based sort for integer keys.
large_sort: 135ms vs PostgreSQL 212ms.
-
Zero steady-state allocation: the arena model
Bump-slab arena with pointer rewind replaces malloc/free per query.
aggregate: 15ms vs PostgreSQL 280ms (19×).
-
Closing the ORDER BY gap: 4.65× slower to 1.18×
Direct columnar→wire serialization + flat sort comparator.
order_by: 1,676ms → 429ms.