Parquet Foreign Tables

Tutorial 11 — Map Parquet files to SQL tables, query with standard SQL

This tutorial requires the native mskql server. Parquet support uses filesystem access and is not available in the WebAssembly playground. Start the server with make && ./build/mskql and connect with psql -h 127.0.0.1 -p 5433 -U test -d mskql.

1. What is a foreign table?

A foreign table maps an external Parquet file to a SQL table name. No data is copied into memory—mskql reads the Parquet file on demand. Once mapped, you query it with the same SQL you use for regular tables: SELECT, WHERE, JOIN, GROUP BY, ORDER BY, subqueries, CTEs.

2. Create a foreign table

CREATE FOREIGN TABLE events (
    id          INT,
    user_id     INT,
    event_type  TEXT,
    amount      INT,
    score       INT,
    created_at  TEXT
) OPTIONS (filename '/path/to/events.parquet');

The column definitions must match the Parquet file’s schema. Supported Parquet types: INT32, INT64, FLOAT, DOUBLE, BYTE_ARRAY (→ TEXT), INT96 (→ TIMESTAMP).

3. Query Parquet data

Filtered scan

SELECT * FROM events
WHERE amount > 500 AND score > 5000;

Aggregation

SELECT event_type, COUNT(*), SUM(amount), AVG(score)
FROM events
GROUP BY event_type;

Sort

SELECT * FROM events
ORDER BY score DESC;

4. Join Parquet files

Map multiple Parquet files and join them with standard SQL.

CREATE FOREIGN TABLE orders (
    id          INT,
    customer_id INT,
    product_id  INT,
    quantity    INT,
    amount      INT
) OPTIONS (filename '/path/to/orders.parquet');

CREATE FOREIGN TABLE customers (
    id     INT,
    name   TEXT,
    region TEXT,
    tier   TEXT
) OPTIONS (filename '/path/to/customers.parquet');

SELECT customers.region, COUNT(*), SUM(orders.amount)
FROM orders
JOIN customers ON orders.customer_id = customers.id
GROUP BY customers.region
ORDER BY customers.region;

5. Mix Parquet and in-memory tables

Foreign tables and regular tables work together in the same query.

CREATE TABLE local_regions (
    id          INT,
    region_name TEXT,
    tax_rate    INT
);
INSERT INTO local_regions VALUES (0, 'north', 10);
INSERT INTO local_regions VALUES (1, 'south', 8);
INSERT INTO local_regions VALUES (2, 'east', 12);
INSERT INTO local_regions VALUES (3, 'west', 9);

CREATE FOREIGN TABLE sales (
    id        INT,
    region_id INT,
    amount    INT
) OPTIONS (filename '/path/to/sales.parquet');

SELECT lr.region_name, SUM(sales.amount),
       SUM(sales.amount * lr.tax_rate / 100) AS tax
FROM sales
JOIN local_regions lr ON sales.region_id = lr.id
GROUP BY lr.region_name
ORDER BY lr.region_name;

6. Subqueries across Parquet files

SELECT * FROM events
WHERE user_id IN (
    SELECT id FROM customers WHERE tier = 'premium'
)
AND amount > 500
ORDER BY score DESC
LIMIT 100;

7. Analytical CTEs

CREATE FOREIGN TABLE products (
    id       INT,
    name     TEXT,
    category TEXT,
    price    INT
) OPTIONS (filename '/path/to/products.parquet');

WITH order_totals AS (
    SELECT orders.customer_id,
           SUM(orders.quantity * products.price) AS total
    FROM orders
    JOIN products ON orders.product_id = products.id
    GROUP BY orders.customer_id
)
SELECT customers.region, COUNT(*), SUM(ot.total)
FROM order_totals ot
JOIN customers ON ot.customer_id = customers.id
GROUP BY customers.region
ORDER BY customers.region;

8. Performance

9 Parquet workloads benchmarked against DuckDB on 50K–200K row files. On cached queries, mskql is faster than DuckDB on 7 of 9. DuckDB faster on 8 of 9 cold (no-cache) queries.

Workloadmskql (ms)DuckDB (ms)Cached ratio
pq_aggregate (GROUP BY)28760.37×
pq_subquery (IN subquery)21440.49×
pq_where (filtered scan)22440.50×
pq_wide_agg (7 columns)27420.65×
pq_analytical (CTE+join+agg)29410.71×
pq_join_three (3 files)30380.79×
pq_join_two (2 files)28350.80×
pq_full_scan (50K rows)4493214.0×
pq_order_by (sort)3712912.8×

The result cache serves repeated queries from cached wire bytes—no Parquet re-read. DuckDB re-reads the file every time. On cold reads, DuckDB’s native columnar Parquet reader is faster.

→ Full Parquet benchmark results

9. Cleanup

DROP FOREIGN TABLE events;
DROP FOREIGN TABLE orders;
DROP FOREIGN TABLE customers;
DROP FOREIGN TABLE products;
DROP FOREIGN TABLE sales;
DROP TABLE local_regions;

Explore further

All tutorials  ·  SQL reference  ·  Grammar  ·  Benchmarks  ·  Architecture