Closing the ORDER BY gap: from 4.65× slower to 1.18× slower

Direct columnar→wire serialization and flat sort comparator.

Setup

The benchmark is bench_order_by: 200 iterations of SELECT * FROM t ORDER BY val DESC over a 5,000-row table with integer and text columns. Latency is measured as total wall-clock time through the PostgreSQL wire protocol (psycopg2, localhost, same machine for all engines).

CREATE TABLE t (id INT, val INT, label TEXT);
-- 5,000 rows inserted

SELECT * FROM t ORDER BY val DESC;
-- repeated 200 times, total wall-clock measured

Problem

Before the fix, mskql took 1,676ms for this workload; PostgreSQL took 356ms. That made mskql 4.65× slower on ORDER BY, the worst ratio in the entire benchmark suite. For an in-memory database, returning sorted results slower than a disk-based engine is a fundamental problem.

Cause

Profiling showed two bottlenecks:

1. One write() syscall per result row. The wire serializer called send_data_rows(), which issued one write() per DataRow message. For 5,000 rows × 200 iterations, that is 1,000,000 syscalls. At roughly 1µs of kernel overhead each, that adds about 1 second of pure syscall latency to the workload.

2. Per-row block/row index decoding in the sort comparator. The sort comparator sort_index_cmp() encoded the block index and row index into a single integer, then decoded them on every comparison. For 5,000 rows, qsort makes roughly n log2 n ≈ 60,000 comparisons per query, each doing two decode operations plus pointer chasing through all_cols[block * ncols + col].
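
The two estimates above can be sanity-checked with a small C sketch. This is not mskql code: the helper names are hypothetical, the input is pseudo-random integers rather than the benchmark table, and the exact comparison count depends on the C library's qsort implementation, so expect a figure in the neighborhood of n log2 n rather than exactly 60,000.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helpers for a back-of-envelope check; not part of mskql. */
static long cmp_calls = 0;

/* Descending integer comparator that counts how often qsort calls it. */
static int counting_cmp_desc(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    cmp_calls++;
    return (x < y) - (x > y);
}

/* Number of write() calls when every DataRow is written on its own. */
long syscalls_per_row_writes(long rows, long iterations) {
    return rows * iterations;
}

/* Sort n pseudo-random ints and return the number of comparisons made.
 * Returns -1 if the allocation fails. */
long count_qsort_comparisons(size_t n) {
    assert(n > 0);
    int *vals = malloc(n * sizeof *vals);
    if (!vals) return -1;

    unsigned int state = 12345u;              /* fixed seed: repeatable input */
    for (size_t i = 0; i < n; i++) {
        state = state * 1103515245u + 12345u; /* simple LCG step */
        vals[i] = (int)((state >> 16) & 0x7fff);
    }

    cmp_calls = 0;
    qsort(vals, n, sizeof *vals, counting_cmp_desc);
    free(vals);
    return cmp_calls;
}
```

Calling syscalls_per_row_writes(5000, 200) gives the 1,000,000 figure from item 1, and count_qsort_comparisons(5000) gives a per-query comparison count to multiply by the 200 iterations.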

Fix

Batched wire writes (pgwire.c: try_plan_send()). For SELECT queries that the plan executor can handle, this function bypasses block_to_rows() and send_data_rows() entirely. It pulls column blocks via plan_next_block() and serializes directly from col_block arrays into pgwire DataRow messages. Messages accumulate in a 64 KB buffer and flush with a single write() call.
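
The batching idea can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the code in try_plan_send(): out_buf, out_flush() and append_data_row() are hypothetical names, values are assumed to be already rendered as text, and a row larger than the buffer is simply rejected. The DataRow layout ('D', Int32 length, Int16 column count, then an Int32 length plus bytes per column, with -1 meaning NULL) follows the PostgreSQL protocol.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define OUT_CAP (64 * 1024)          /* 64 KB, as in the text */

/* Hypothetical output buffer; names are illustrative only. */
typedef struct {
    int fd;                          /* client socket (or any fd) */
    size_t len;                      /* bytes currently buffered */
    unsigned long flushes;           /* number of non-empty flushes */
    unsigned char data[OUT_CAP];
} out_buf;

static void put_be16(unsigned char *p, uint16_t v) {
    p[0] = (unsigned char)(v >> 8); p[1] = (unsigned char)v;
}

static void put_be32(unsigned char *p, uint32_t v) {
    p[0] = (unsigned char)(v >> 24); p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);  p[3] = (unsigned char)v;
}

/* Write everything buffered. Normally one write(); loops on short writes. */
int out_flush(out_buf *b) {
    size_t off = 0;
    while (off < b->len) {
        ssize_t n = write(b->fd, b->data + off, b->len - off);
        if (n < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        off += (size_t)n;
    }
    if (b->len > 0) b->flushes++;
    b->len = 0;
    return 0;
}

/* Append one text-format DataRow; vals[i] == NULL encodes SQL NULL.
 * Flushes first if the message would not fit. Returns 0 on success. */
int append_data_row(out_buf *b, const char *const *vals, int ncols) {
    assert(ncols >= 0 && ncols <= INT16_MAX);
    size_t body = 4 + 2;             /* length field + column count */
    for (int i = 0; i < ncols; i++)
        body += 4 + (vals[i] ? strlen(vals[i]) : 0);
    size_t total = 1 + body;         /* plus the 'D' type byte */
    if (total > OUT_CAP) return -1;  /* sketch: oversized rows unsupported */
    if (b->len + total > OUT_CAP && out_flush(b) != 0) return -1;

    unsigned char *p = b->data + b->len;
    *p++ = 'D';
    put_be32(p, (uint32_t)body);  p += 4;
    put_be16(p, (uint16_t)ncols); p += 2;
    for (int i = 0; i < ncols; i++) {
        if (!vals[i]) { put_be32(p, UINT32_MAX); p += 4; continue; }
        size_t n = strlen(vals[i]);
        put_be32(p, (uint32_t)n); p += 4;
        memcpy(p, vals[i], n);    p += n;
    }
    b->len += total;
    return 0;
}
```

A caller would invoke append_data_row() once per result row and out_flush() once at the end of the result set, so the number of write() calls scales with the number of bytes sent divided by the buffer size rather than with the number of rows.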

Flat sort comparator (plan.c: sort_flat_cmp()). The new comparator uses flat contiguous arrays (flat_keys[], flat_nulls[], key_types[]) stored in block_sort_ctx. Sorted indices are simple 0..total-1 flat indices resolved with direct array lookups, with no block/row encode/decode and no pointer chasing.
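
A simplified sketch of the flat-index approach, in C, is shown below. It is an assumption-laden illustration rather than the contents of plan.c: it handles a single 64-bit integer key instead of a key_types[] dispatch, the struct and function names (flat_sort_ctx, flat_cmp, flat_sort_indices) are hypothetical, the context is passed through a file-scope pointer because standard qsort has no context argument, and NULLs are ordered as larger than any value, matching PostgreSQL's default.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified sort context: one integer key per row,
 * with all blocks already copied into contiguous arrays. */
typedef struct {
    const int64_t *flat_keys;         /* key value for flat row i */
    const unsigned char *flat_nulls;  /* 1 if the key for row i is NULL */
    int desc;                         /* nonzero for ORDER BY ... DESC */
} flat_sort_ctx;

/* Standard qsort takes no context pointer, so the sketch uses a global.
 * This makes it non-reentrant. */
static const flat_sort_ctx *g_ctx;

static int flat_cmp(const void *a, const void *b) {
    uint32_t ia = *(const uint32_t *)a, ib = *(const uint32_t *)b;
    int na = g_ctx->flat_nulls[ia], nb = g_ctx->flat_nulls[ib];
    int c;

    if (na != nb) {
        /* NULL sorts as the largest value: last for ASC, first for DESC. */
        c = na - nb;
    } else if (na) {
        c = 0;                        /* both NULL */
    } else {
        int64_t ka = g_ctx->flat_keys[ia], kb = g_ctx->flat_keys[ib];
        c = (ka > kb) - (ka < kb);    /* direct lookup, no decoding */
    }
    if (g_ctx->desc) c = -c;
    if (c != 0) return c;
    /* qsort is not stable; break ties by original position. */
    return (ia > ib) - (ia < ib);
}

/* Fill idx with 0..total-1 and sort it by the key arrays in ctx. */
void flat_sort_indices(const flat_sort_ctx *ctx, uint32_t *idx, size_t total) {
    assert(ctx != NULL && idx != NULL);
    for (size_t i = 0; i < total; i++) idx[i] = (uint32_t)i;
    g_ctx = ctx;
    qsort(idx, total, sizeof *idx, flat_cmp);
    g_ctx = NULL;
}
```

The cost of this layout is an up-front pass that copies each block's key column into flat_keys[] and flat_nulls[]. That pass is linear in the row count, while the comparator it simplifies runs roughly n log2 n times.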

Result

Same query, same machine, same benchmark harness:

                                  Before         After          Speedup
order_by (wall-clock, 200 iter)   1,676ms        429ms          3.9×
vs PostgreSQL ratio               4.65× slower   1.18× slower
Internal C bench (no wire)        1,752ms        142ms          12.3×

The wire batching alone accounted for the majority of the wall-clock improvement. The flat comparator reduced the internal sort time by 12.3×, but that was masked by the syscall overhead in the "before" measurement.