The benchmark is bench_order_by: 5,000 rows, 200 iterations of
SELECT * FROM t ORDER BY val DESC over a table with integer and text
columns. Measured as wall-clock latency through the PostgreSQL wire protocol
(psycopg2, localhost, same machine for all engines).
CREATE TABLE t (id INT, val INT, label TEXT);
-- 5,000 rows inserted
SELECT * FROM t ORDER BY val DESC;
-- repeated 200 times, total wall-clock measured
Before the fix, mskql took 1,676ms for this workload; PostgreSQL took 356ms. That made mskql 4.65× slower on ORDER BY, the worst ratio in the entire benchmark suite. For an in-memory database, returning sorted results slower than a disk-based engine is a fundamental problem.
Profiling showed two bottlenecks:
1. One write() syscall per result row.
The wire serializer called send_data_rows() which issued one
write() per DataRow message. For 5,000 rows × 200 iterations,
that is 1,000,000 syscalls. Each syscall costs ~1µs of kernel overhead,
adding ~1 second of pure syscall latency to the workload.
2. Per-row block/row index decoding in the sort comparator.
The sort comparator sort_index_cmp() packed the block index and row
index into a single integer, then decoded them on every comparison. For 5,000
rows, qsort makes ~60,000 comparisons per sort (roughly n log2 n), each doing
two decode operations plus pointer chasing through all_cols[block * ncols + col].
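To make the second bottleneck concrete, here is a minimal sketch of a comparator that works on packed block/row indices. It is not the original sort_index_cmp(). The 16/16 bit split, the `col_chunk` type, the global variables, and every function name are assumptions made for illustration. Only the all_cols[block * ncols + col] lookup shape comes from the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed layout: high 16 bits = block index, low 16 bits = row in block. */
#define ROW_BITS 16u
#define ROW_MASK ((1u << ROW_BITS) - 1u)

/* Hypothetical stand-in for one column of one block. */
typedef struct {
    const int32_t *values;
} col_chunk;

/* qsort() passes no context pointer, so this sketch uses globals. */
static const col_chunk *g_all_cols; /* indexed as all_cols[block * ncols + col] */
static size_t g_ncols;
static size_t g_sort_col;

static uint32_t pack_index(uint32_t block, uint32_t row)
{
    assert(row <= ROW_MASK);
    return (block << ROW_BITS) | row;
}

/* Two decode operations plus two dependent loads for every key fetch. */
static int32_t packed_key(uint32_t packed)
{
    uint32_t block = packed >> ROW_BITS;
    uint32_t row = packed & ROW_MASK;
    return g_all_cols[block * g_ncols + g_sort_col].values[row];
}

/* Descending order: the larger key sorts first. */
static int packed_cmp_desc(const void *a, const void *b)
{
    int32_t ka = packed_key(*(const uint32_t *)a);
    int32_t kb = packed_key(*(const uint32_t *)b);
    return (ka < kb) - (ka > kb);
}
```

Every call to `packed_cmp_desc` runs `packed_key` twice, so the shift, mask, multiply, and pointer hops are repeated for each of the tens of thousands of comparisons in a single sort.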
Batched wire writes (pgwire.c:
try_plan_send()). For SELECT queries that the plan executor can
handle, this function bypasses block_to_rows() and
send_data_rows() entirely. It pulls column blocks via
plan_next_block() and serializes directly from col_block
arrays into pgwire DataRow messages. Messages accumulate in a 64 KB buffer
and flush with a single write() call.
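The buffering idea can be sketched in a few lines of C. This is not the code in pgwire.c. The `out_buf` type and the `out_init`, `out_append`, `out_flush`, and `write_all` helpers are hypothetical names, and the handling of messages larger than the buffer is an assumption. Only the 64 KB capacity and the accumulate-then-write pattern come from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define OUT_BUF_CAP (64 * 1024) /* 64 KB, as described in the text */

typedef struct {
    int fd;                /* client socket (any writable descriptor) */
    size_t len;            /* bytes currently buffered */
    unsigned long flushes; /* flush counter, kept only for illustration */
    unsigned char buf[OUT_BUF_CAP];
} out_buf;

static void out_init(out_buf *ob, int fd)
{
    assert(ob != NULL);
    ob->fd = fd;
    ob->len = 0;
    ob->flushes = 0;
}

/* Loop until everything is written; write() may accept fewer bytes. */
static int write_all(int fd, const unsigned char *p, size_t n)
{
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Send whatever is buffered; normally one write() per flush. */
static int out_flush(out_buf *ob)
{
    if (ob->len == 0)
        return 0;
    if (write_all(ob->fd, ob->buf, ob->len) != 0)
        return -1;
    ob->len = 0;
    ob->flushes++;
    return 0;
}

/* Append one serialized message, flushing first if it would not fit. */
static int out_append(out_buf *ob, const void *msg, size_t n)
{
    if (n > OUT_BUF_CAP - ob->len && out_flush(ob) != 0)
        return -1;
    if (n > OUT_BUF_CAP) /* oversized message: send it unbuffered */
        return write_all(ob->fd, msg, n);
    memcpy(ob->buf + ob->len, msg, n);
    ob->len += n;
    return 0;
}
```

A caller would invoke `out_append` once per serialized DataRow and `out_flush` once at the end of the result set, so the number of syscalls tracks the number of bytes sent rather than the number of rows.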
Flat sort comparator (plan.c:
sort_flat_cmp()). The new comparator uses flat contiguous arrays
(flat_keys[], flat_nulls[], key_types[])
stored in block_sort_ctx. Sorted indices are simple 0..total-1
flat indices with direct array lookups—no block/row encode/decode, no
pointer chasing.
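A flat comparator along these lines might look like the sketch below. It is not sort_flat_cmp() itself. The `flat_sort_ctx` struct, its fields beyond flat_keys and flat_nulls, the single 64-bit integer key (so no key_types[] dispatch), and the `flat_sort` helper are all assumptions. NULLs are treated as larger than any value, which matches PostgreSQL's default of NULLS FIRST under DESC.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical context: one contiguous key array for the whole result. */
typedef struct {
    const int64_t *flat_keys;  /* sort key for each row, 0..total-1 */
    const uint8_t *flat_nulls; /* 1 if the key for that row is NULL */
    int desc;                  /* nonzero for ORDER BY ... DESC */
} flat_sort_ctx;

/* qsort() passes no context pointer, so this sketch uses a global. */
static const flat_sort_ctx *g_ctx;

static int flat_cmp(const void *a, const void *b)
{
    uint32_t ia = *(const uint32_t *)a;
    uint32_t ib = *(const uint32_t *)b;
    int na = g_ctx->flat_nulls[ia];
    int nb = g_ctx->flat_nulls[ib];
    int c;

    if (na || nb) {
        c = na - nb; /* NULL compares as larger than any value */
    } else {
        int64_t ka = g_ctx->flat_keys[ia]; /* direct array lookup */
        int64_t kb = g_ctx->flat_keys[ib];
        c = (ka > kb) - (ka < kb);
    }
    if (g_ctx->desc)
        c = -c;
    if (c != 0)
        return c;
    /* Tie-break on the flat index so the output is deterministic. */
    return (ia > ib) - (ia < ib);
}

/* Fill idx with 0..total-1 and sort it by the keys in ctx. */
static void flat_sort(uint32_t *idx, size_t total, const flat_sort_ctx *ctx)
{
    assert(ctx != NULL);
    for (size_t i = 0; i < total; i++)
        idx[i] = (uint32_t)i;
    g_ctx = ctx;
    qsort(idx, total, sizeof idx[0], flat_cmp);
}
```

Each comparison now touches only two small arrays indexed by the same integer, which keeps the inner loop free of shifts, masks, and per-block pointer dereferences.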
Same query, same machine, same benchmark harness:
| | Before | After | Speedup |
|---|---|---|---|
| order_by (wall-clock, 200 iter) | 1,676ms | 429ms | 3.9× |
| vs PostgreSQL ratio | 4.65× slower | 1.18× slower | — |
| Internal C bench (no wire) | 1,752ms | 142ms | 12.3× |
The wire batching alone accounted for the majority of the wall-clock improvement. The flat comparator reduced the internal sort time by 12.3×, but that was masked by the syscall overhead in the "before" measurement.