12 Parquet benchmarks: full scan, filtered scan, aggregation, joins, subqueries, and analytical queries on files ranging from 50K to 200K rows. Measured as wall-clock latency through the PostgreSQL wire protocol.
CREATE FOREIGN TABLE events
OPTIONS (filename '/data/events.parquet');
SELECT event_type, COUNT(*), SUM(amount)
FROM events
GROUP BY event_type;
The original Parquet support used Carquet, a third-party C library built via CMake. It worked, but introduced a heavy build dependency: a full CMake clone-and-build step, a large dependency tree, and a schema traversal API that did not map cleanly to mskql’s flat columnar storage. The build was slow, the code was opaque, and bugs in the Thrift decoding layer were impossible to diagnose without understanding the library’s internals.
Parquet files use Apache Thrift compact protocol for metadata serialization. The format is simple—zigzag varints, field deltas, nested structs—but the field IDs must match the official Parquet Thrift IDL exactly. Any off-by-one in field mapping silently corrupts the metadata parse, producing wrong column counts, wrong offsets, or garbage compression codecs.
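To make the decoding rules concrete, here is a minimal sketch of a Thrift compact-protocol field reader in C. It is not the code from pq_reader.c: the names tc_cursor, tc_read_varint, tc_unzigzag, tc_read_field, parse_row_group and row_group_info are hypothetical. To stay short it assumes every field it meets is an i64, whereas a real RowGroup also carries a list of column chunks in field 1. The field IDs used (2 for total_byte_size, 3 for num_rows) follow the Parquet Thrift IDL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cursor over an in-memory Thrift compact buffer. */
typedef struct {
    const uint8_t *p;
    const uint8_t *end;
} tc_cursor;

/* Unsigned LEB128 varint. Returns 0 on success, -1 on truncated or overlong input. */
static int tc_read_varint(tc_cursor *c, uint64_t *out) {
    assert(c && out);
    uint64_t v = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        if (c->p >= c->end) return -1;
        uint8_t b = *c->p++;
        v |= (uint64_t)(b & 0x7F) << shift;
        if (!(b & 0x80)) { *out = v; return 0; }
    }
    return -1;
}

/* Zigzag decode: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ... */
static int64_t tc_unzigzag(uint64_t n) {
    return (int64_t)(n >> 1) ^ -(int64_t)(n & 1);
}

/* Reads one field header. Returns 1 if a field follows, 0 on struct stop, -1 on error.
 * *field_id must hold the previous field ID (0 at the start of a struct). */
static int tc_read_field(tc_cursor *c, int16_t *field_id, uint8_t *type) {
    if (c->p >= c->end) return -1;
    uint8_t b = *c->p++;
    if (b == 0x00) return 0;              /* stop byte: test before splitting nibbles */
    uint8_t delta = b >> 4;
    *type = b & 0x0F;
    if (delta != 0) {
        *field_id = (int16_t)(*field_id + delta);   /* short form: delta from previous ID */
    } else {
        uint64_t raw;                                /* long form: zigzag varint field ID */
        if (tc_read_varint(c, &raw) != 0) return -1;
        *field_id = (int16_t)tc_unzigzag(raw);
    }
    return 1;
}

enum { TC_I64 = 6 };                                  /* compact-protocol type code for i64 */
enum { RG_TOTAL_BYTE_SIZE = 2, RG_NUM_ROWS = 3 };     /* RowGroup field IDs per the IDL */

typedef struct {
    int64_t total_byte_size;
    int64_t num_rows;
} row_group_info;

/* Simplified RowGroup parse: accepts i64 fields only, ignores unknown IDs. */
static int parse_row_group(tc_cursor *c, row_group_info *rg) {
    int16_t id = 0;
    uint8_t type = 0;
    int r;
    while ((r = tc_read_field(c, &id, &type)) == 1) {
        uint64_t raw;
        if (type != TC_I64) return -1;
        if (tc_read_varint(c, &raw) != 0) return -1;
        int64_t v = tc_unzigzag(raw);
        if (id == RG_TOTAL_BYTE_SIZE) rg->total_byte_size = v;
        else if (id == RG_NUM_ROWS)   rg->num_rows = v;
    }
    return r;   /* 0 when the stop byte was reached cleanly */
}
```

As a usage example, the bytes 0x26 0x80 0x40 0x16 0xA0 0x8D 0x06 0x00 encode field 2 = 4096 and field 3 = 50000, followed by the stop byte. Swapping the two enum values would still parse without error, which is why a wrong field ID produces silently wrong metadata rather than a failure.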
Replaced Carquet with pq_reader.c (884 lines): a minimal read-only Parquet decoder that handles the full read path from file open to materialized columns. The build dependency shrank from a full CMake project to two linker flags: -lzstd -lz.
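The first step of that read path, locating the Thrift-encoded metadata, can be sketched as follows. This is an illustration based on the Parquet file layout ("PAR1", data, FileMetaData, a 4-byte little-endian metadata length, "PAR1"), not an excerpt from pq_reader.c. The function name pq_locate_footer is hypothetical, and the file is assumed to be fully loaded into memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Finds the FileMetaData block inside an in-memory Parquet file.
 * Returns 0 and fills *meta_off / *meta_len, or -1 if the framing is invalid. */
static int pq_locate_footer(const uint8_t *file, size_t size,
                            size_t *meta_off, uint32_t *meta_len) {
    assert(file && meta_off && meta_len);
    if (size < 12) return -1;                       /* head magic + length + tail magic */
    if (memcmp(file, "PAR1", 4) != 0) return -1;
    if (memcmp(file + size - 4, "PAR1", 4) != 0) return -1;

    /* The metadata length sits just before the trailing magic, little-endian. */
    const uint8_t *l = file + size - 8;
    uint32_t len = (uint32_t)l[0]
                 | (uint32_t)l[1] << 8
                 | (uint32_t)l[2] << 16
                 | (uint32_t)l[3] << 24;
    if (len > size - 12) return -1;                 /* length must fit inside the file */

    *meta_off = size - 8 - len;
    *meta_len = len;
    return 0;
}
```

The returned offset and length delimit the bytes that a Thrift compact-protocol decoder would then parse as FileMetaData.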
The initial implementation had 5 bugs in Thrift field ID mapping that prevented it from opening any Parquet file. These were invisible in testing because the test runner swallowed setup errors (psql returns exit 0 on SQL errors unless ON_ERROR_STOP is set). After fixing the test harness to detect setup failures:
0x00 (Thrift struct stop) was treated as field delta=0, consuming extra bytes and corrupting the stream.
[...] converted_type instead of num_children).
[...] num_rows but the spec says field 3 (field 2 is total_byte_size).
Zero external C dependencies beyond zstd and zlib. The Parquet reader compiles in under a second. All 12 Parquet benchmarks pass with the same or better performance. The code is fully auditable: every field ID, every encoding, every decompression call is visible in a single 884-line file.