LucaCappelletti94 · Jun 7, 2026
diff --git a/.cargo/config.toml b/.cargo/config.toml
@@ -1,4 +1,4 @@
-# `cargo regen` rebuilds every input to web/assets/bench.json with one command:
+# `cargo regen` rebuilds every input to web/assets/bench.json.zst with one command:
 # the timing benches, the two memory benches (a separate process each, since
 # they install a counting global allocator), and finally the export.
 [alias]

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,11 +1,22 @@
 # Changelog
 
+## June 2026: real engines, batch axis, and the time machine
+
+- Validity is now graded against the real database engines (PostgreSQL, SQLite, MySQL, ClickHouse, DuckDB, SQL Server), run once locally in Docker via testcontainers by the `oracle` crate, with the labels committed under `oracle/labels` so grading and CI need no Docker. Library oracles are gone.
+- Fixed the SQLite oracle mislabeling grammar errors as valid (it only recognized a few syntax-error phrasings, so rejections like "ORDER BY clause should come after INTERSECT not before" slipped through, reported in gwenn/lemon-rs#102). The classifier now treats any prepare error as invalid unless it is a missing-object error.
+- Added turso_parser (the SQLite parser from Turso) as a tenth library, and per-statement rejection reasons on the failing-statement lists.
+- The SQLite corpus now includes the SQLite project's own official test suite (29,344 statements, total corpus 340,938), which finally spreads the parsers on real SQLite grammar instead of leaving everyone near 100 percent.
+- A batch (whole-script) axis times and measures memory for each parser's whole accepted set parsed as one script, normalized per statement and shown next to the single-statement means, with a completeness guard so a parser that bails out partway never reports a misleading number.
+- A time machine benchmarks historical releases of every pure-Rust parser (59 versions across 8 families, including every sqlparser-rs minor since 0.30): each parser page gains a version picker plus date-axis trends for parse time and memory (median with interquartile bars) and for accept/recall and false positives. The FFI pg_query is excluded (two libpg_query builds collide at link), as is qusql-parse 0.1.0 (pathological parse time on parts of the corpus).
+- The committed snapshots are now zstd-compressed and decompressed in the browser (`bench.json.zst` is about 26x smaller than the old raw JSON), keeping the site free of runtime fetches.
+- One-command regeneration: `cargo regen` runs the timing benches, the memory benches, the time-machine passes, and the export in order.
+
 ## May 2026 refresh
 
 - All benchmarked crates were updated to their latest versions (sqlparser 0.62, polyglot-sql 0.4.1, qusql-parse 0.8, databend-common-ast 0.2.5, sqlglot-rust 0.9.37, pg_query and orql to latest commits).
 - Removed pg_parse and the pg_query_parser/pg_parse_parser Cargo features. pg_query.rs (libpg_query) is now an unconditional dependency and the sole PostgreSQL reference.
-- Two parsers were added: **sqlglot-rust** (standalone 30-dialect parser) and **sqlite3-parser / lemon-rs** (SQLite's real Lemon grammar).
-- The benchmark went from PostgreSQL-only to **multi-dialect**: every parser is now run in the dialect that matches the corpus it is being tested against.
+- Two parsers were added: sqlglot-rust (standalone 30-dialect parser) and sqlite3-parser / lemon-rs (SQLite's real Lemon grammar).
+- The benchmark went from PostgreSQL-only to multi-dialect: every parser is now run in the dialect that matches the corpus it is being tested against.
 - The corpus was expanded from a few thousand PostgreSQL statements to 311,594 statements over 13 dialects, now shipped pre-built and compressed as `datasets.tar.zst`.
 - A data-quality pass removed mislabeled or non-SQL content: BiomedSQL (natural-language answers) and a metadata-contaminated Trino file were dropped, Stack Exchange Data Explorer queries were relabeled from SQLite to T-SQL, Oracle SQL\*Plus directives were stripped, and the SQL Server samples were dropped because their `GO` batch separators defeated statement segmentation.
 - The five separate tools were consolidated into a single `sqlbench` binary (`correctness`, `correctness --per-file`, `plot`), and the grading core was extracted into testable library modules.

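Note on the SQLite oracle fix in the changelog entry: the rule it describes (any prepare error is invalid unless it is a missing-object error) can be sketched as below. This is only an illustration under stated assumptions, not the `oracle` crate's code: `Validity`, `is_missing_object`, and `classify` are hypothetical names, and matching on a `no such ` prefix is an assumed heuristic based on SQLite's usual "no such table" and "no such column" messages.

```rust
/// Label assigned to one corpus statement.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Validity {
    Valid,
    Invalid,
}

/// Hypothetical helper: does this error describe a missing object (table,
/// column, ...) rather than a grammar problem? The "no such " prefix is an
/// assumption for illustration, not the oracle's real matching rule.
fn is_missing_object(message: &str) -> bool {
    message.to_ascii_lowercase().starts_with("no such ")
}

/// `prepare_error` is `None` when the engine prepared the statement and
/// `Some(message)` when it reported an error.
fn classify(prepare_error: Option<&str>) -> Validity {
    match prepare_error {
        // Prepared cleanly: the grammar was accepted.
        None => Validity::Valid,
        // The statement parsed, but it names something that does not exist.
        Some(message) if is_missing_object(message) => Validity::Valid,
        // Every other prepare error is treated as a rejection.
        Some(_) => Validity::Invalid,
    }
}

fn main() {
    let cases = [
        None,
        Some("no such table: users"),
        Some("ORDER BY clause should come after INTERSECT not before"),
    ];
    for case in cases {
        println!("{:?} -> {:?}", case, classify(case));
    }
}
```

The point of defaulting to `Invalid` is that an unrecognized error phrasing can no longer slip through as valid, which is the failure mode the changelog entry describes.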
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -14,7 +14,7 @@ No unsafe code is allowed (`unsafe_code = "forbid"`). Clippy runs with pedantic
 
 ## Results website
 
-The site under `web/` is a Dioxus -> WASM app that renders a committed snapshot, `web/assets/bench.json`, produced by `sqlbench export`. CI (`.github/workflows/pages.yml`) only builds and deploys the committed crates, so regenerate the snapshot manually after changing the corpus or parsers:
+The site under `web/` is a Dioxus -> WASM app that renders a committed snapshot, `web/assets/bench.json.zst`, produced by `sqlbench export`. CI (`.github/workflows/pages.yml`) only builds and deploys the committed crates, so regenerate the snapshot manually after changing the corpus or parsers:
 
 ```bash
 cargo regen          # one command: timing benches + memory benches + export (long)
@@ -27,13 +27,25 @@ cd web && dx serve   # preview at http://127.0.0.1:8080/sql_ast_benchmark/
 cargo bench                              # write target/bench_dist/ + target/batch_dist/ timings
 cargo run --release -p membench          # write target/mem_dist/ per-statement memory
 cargo run --release -p membench -- batch # write target/batch_mem_dist/ whole-script memory
-cargo run --bin sqlbench -- export       # read all of the above, write web/assets/bench.json
+cargo run --bin sqlbench -- export       # read all of the above, write web/assets/bench.json.zst
 ```
 
 `export` reads whatever timing, memory, and batch summaries are present under `target/` and warns (rather than fails) for any that are missing, so the memory and batch columns stay empty until their producers have been run.
 
 The charts are rendered in the browser from the JSON by the shared `viz` crate (plotters, SVG backend), so no chart images are committed.
 
+## Time machine (per-version history)
+
+The `timemachine` crate benchmarks several historical versions of each pure-Rust parser and writes `web/assets/history.json.zst` (committed, embedded and decompressed in wasm with `ruzstd`, so the site still does no runtime fetch). It hosts many versions of one crate at once with `package`-rename aliases, which works because different `0.x` minors are semver-incompatible. The FFI parsers (`pg_query`) are excluded: two libpg_query builds export the same C symbols and collide at link.
+
+Every version implements the `sql_ast_benchmark::Parser` trait (the same trait `BenchParser` uses), so the main crate's grading, timing, and memory code drive the whole history unchanged. Adding a version is three lines:
+
+1. a `package`-rename alias in `timemachine/Cargo.toml`, e.g. `sqlparser_v0_58 = { package = "sqlparser", version = "=0.58.0" }`
+2. one macro invocation in `timemachine/src/families/<family>.rs`, e.g. `sqlparser_version!(SqlparserV0_58, sqlparser_v0_58, "0.58.0", "2025-01-01")` (an API break gets its own hand-written `impl Parser` instead)
+3. one entry in `timemachine/src/registry.rs`
+
+A new family is a new `families/<name>.rs` with its own adapter (each library has a different parse API) plus its aliases and registry entries.
+
 ## Coverage
 
 ```bash

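Note on the "three lines per version" recipe in the CONTRIBUTING.md change: the pattern of one macro invocation per version plus one registry entry might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Parser` trait here is a stand-in (the real `sql_ast_benchmark::Parser` methods are not shown in this change), `demo_version!` mirrors the shape of the `sqlparser_version!` invocation quoted above without being its definition, and the `demo_v0_1` / `demo_v0_2` modules stand in for `package`-renamed dependencies of a hypothetical `demo` family with placeholder versions and dates.

```rust
/// Stand-in for the shared parser trait; this shape is assumed.
trait Parser {
    fn family(&self) -> &'static str;
    fn version(&self) -> &'static str;
    fn release_date(&self) -> &'static str;
    fn accepts(&self, sql: &str) -> bool;
}

// Stand-ins for two renamed copies of one crate. In a real workspace these
// would be separate dependencies declared with `package = "..."` aliases.
mod demo_v0_1 {
    pub fn parse(sql: &str) -> Result<(), String> {
        if sql.trim().is_empty() { Err("empty input".into()) } else { Ok(()) }
    }
}
mod demo_v0_2 {
    pub fn parse(sql: &str) -> Result<(), String> {
        if sql.trim().is_empty() { Err("empty input".into()) } else { Ok(()) }
    }
}

/// One invocation per version: versions that share a parse API get the same
/// adapter and differ only in the crate path, version string, and date.
macro_rules! demo_version {
    ($ty:ident, $krate:ident, $version:literal, $date:literal) => {
        struct $ty;
        impl Parser for $ty {
            fn family(&self) -> &'static str { "demo" }
            fn version(&self) -> &'static str { $version }
            fn release_date(&self) -> &'static str { $date }
            fn accepts(&self, sql: &str) -> bool { $krate::parse(sql).is_ok() }
        }
    };
}

// Placeholder versions and dates, not real releases.
demo_version!(DemoV0_1, demo_v0_1, "0.1.0", "2000-01-01");
demo_version!(DemoV0_2, demo_v0_2, "0.2.0", "2000-02-01");

/// The registry is what the grading and timing code would iterate over.
fn registry() -> Vec<Box<dyn Parser>> {
    vec![Box::new(DemoV0_1), Box::new(DemoV0_2)]
}

fn main() {
    for parser in registry() {
        println!(
            "{} {} ({}): accepts SELECT 1 = {}",
            parser.family(),
            parser.version(),
            parser.release_date(),
            parser.accepts("SELECT 1")
        );
    }
}
```

In this shape, a version whose API differs from its neighbours would skip the macro and get a hand-written `impl Parser`, which matches the exception the recipe mentions.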
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,5 +1,5 @@
 [workspace]
-members = [".", "viz", "web", "membench", "oracle"]
+members = [".", "viz", "web", "membench", "oracle", "timemachine"]
 default-members = ["."]
 resolver = "2"
 

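Note on the memory axis described in the README text of this change: "peak and retained bytes per statement under a counting allocator" can be illustrated with the generic technique below. This is a minimal sketch, not `membench`'s implementation: `Counting`, `LIVE`, `PEAK`, and `measure` are hypothetical names, and the closure stands in for a parse call. Implementing `GlobalAlloc` needs `unsafe`, which the main crate forbids, and how `membench` handles that is not shown in this change.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and tracks live and peak byte counts.
struct Counting;

static LIVE: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            // Add to the live total, then raise the peak if we exceeded it.
            let live = LIVE.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(live, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

// Installing the allocator is process-wide, which is why a measurement like
// this has to live in its own binary, apart from the timing benches.
#[global_allocator]
static GLOBAL: Counting = Counting;

/// Runs `work` and returns (peak bytes above the starting level, bytes still
/// live afterwards, the produced value). The value is returned so that it is
/// still alive when the retained figure is read.
fn measure<T>(work: impl FnOnce() -> T) -> (usize, usize, T) {
    let before = LIVE.load(Ordering::Relaxed);
    PEAK.store(before, Ordering::Relaxed);
    let value = work();
    let peak = PEAK.load(Ordering::Relaxed).saturating_sub(before);
    let retained = LIVE.load(Ordering::Relaxed).saturating_sub(before);
    (peak, retained, value)
}

fn main() {
    let (peak, retained, kept) = measure(|| {
        // Scratch space plays the role of temporary parser state.
        let scratch: Vec<u8> = Vec::with_capacity(4096);
        black_box(&scratch);
        // The returned buffer plays the role of the AST that outlives parsing.
        let kept: Vec<u8> = Vec::with_capacity(1024);
        drop(scratch);
        kept
    });
    println!("peak = {peak} B, retained = {retained} B, kept capacity = {}", kept.capacity());
}
```

Peak captures transient working memory during the call, while retained captures what the caller keeps afterwards, so a parser with a compact AST but heavy scratch usage shows a large gap between the two.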
diff --git a/README.md b/README.md
@@ -6,15 +6,15 @@
 [![Rust](https://img.shields.io/badge/rust-2021_edition-orange.svg)](https://www.rust-lang.org)
 [![Explorer](https://img.shields.io/website?url=https%3A%2F%2Fsql-ast-benchmark.luca.phd&label=explorer&up_message=online&down_message=offline)](https://sql-ast-benchmark.luca.phd)
 
-Benchmarking Rust SQL parsers on a real-world corpus of 311,594 statements across 13 SQL dialects. Each parser runs in its best-matching dialect, and correctness is graded against a real reference parser where one exists.
+Benchmarking Rust SQL parsers on a real-world corpus of 340,938 statements across 13 SQL dialects. Each parser runs in its best-matching dialect, and correctness is graded against a real reference parser where one exists.
 
 ## Abstract
 
 Choosing a SQL parser for a Rust project means weighing dialect coverage, correctness, and speed, yet those trade-offs are seldom measured on realistic input. We benchmarked the actively maintained Rust SQL parsers on a large, multi-dialect corpus of real-world statements so the choice can rest on evidence rather than on each library's own claims.
 
-We evaluated nine parser libraries: [sqlparser-rs](https://github.com/sqlparser-rs/sqlparser-rs) (Apache DataFusion), [pg_query.rs](https://github.com/pganalyze/pg_query.rs) and its faster summary mode (Rust bindings to [libpg_query](https://github.com/pganalyze/libpg_query), PostgreSQL's own parser), [databend-common-ast](https://crates.io/crates/databend-common-ast), [polyglot-sql](https://github.com/tobilg/polyglot), [sqlglot-rust](https://crates.io/crates/sqlglot-rust), [qusql-parse](https://crates.io/crates/qusql-parse), [sqlite3-parser](https://crates.io/crates/sqlite3-parser) (lemon-rs), and [turso_parser](https://crates.io/crates/turso_parser) (the SQLite parser from Turso), plus [orql](https://codeberg.org/xitep/orql) on Oracle. We ran them against a corpus of 311,594 statements spanning 13 dialects, drawn from each engine's own regression suites and official samples and committed compressed so every run is reproducible.
+We evaluated nine parser libraries: [sqlparser-rs](https://github.com/sqlparser-rs/sqlparser-rs) (Apache DataFusion), [pg_query.rs](https://github.com/pganalyze/pg_query.rs) and its faster summary mode (Rust bindings to [libpg_query](https://github.com/pganalyze/libpg_query), PostgreSQL's own parser), [databend-common-ast](https://crates.io/crates/databend-common-ast), [polyglot-sql](https://github.com/tobilg/polyglot), [sqlglot-rust](https://crates.io/crates/sqlglot-rust), [qusql-parse](https://crates.io/crates/qusql-parse), [sqlite3-parser](https://crates.io/crates/sqlite3-parser) (lemon-rs), and [turso_parser](https://crates.io/crates/turso_parser) (the SQLite parser from Turso), plus [orql](https://codeberg.org/xitep/orql) on Oracle. We ran them against a corpus of 340,938 statements spanning 13 dialects, drawn from each engine's own regression suites and official samples and committed compressed so every run is reproducible.
 
-We exercised each parser in the dialect that matches the corpus under test. Where a dialect has a runnable engine, we labelled each statement valid or invalid with the real database engine itself, run in Docker via [testcontainers](https://github.com/testcontainers/testcontainers-rs): a statement counts as valid unless the engine reports a syntax error, so a missing table or column still counts as parsed. Against that ground truth we scored the parsers on recall (valid statements accepted), false positives (invalid statements wrongly accepted), display round-trip stability, and canonical-form fidelity. The other dialects have no runnable engine, so their statements count as provenance-valid and the metric is simply the acceptance rate. Across all dialects, we captured speed as a per-statement parse-time distribution over every accepted statement.
+We exercised each parser in the dialect that matches the corpus under test. Where a dialect has a runnable engine, we labelled each statement valid or invalid with the real database engine itself, run in Docker via [testcontainers](https://github.com/testcontainers/testcontainers-rs): a statement counts as valid unless the engine reports a syntax error, so a missing table or column still counts as parsed. Against that ground truth we scored the parsers on recall (valid statements accepted), false positives (invalid statements wrongly accepted), display round-trip stability, and canonical-form fidelity. The other dialects have no runnable engine, so their statements count as provenance-valid and the metric is simply the acceptance rate. Across all dialects, we captured speed as a per-statement parse-time distribution over every accepted statement, and memory as the peak and retained bytes per statement under a counting allocator. A batch axis additionally parses each parser's whole accepted set as a single script, showing what bulk parsing amortizes, and a time machine benchmarks the historical releases of every pure-Rust parser (59 versions in total, including every sqlparser-rs minor since January 2023), so each parser page also charts how coverage, speed, and memory evolved across releases.
 
 On their home dialect the reference bindings are exact by construction, so the more telling comparison is among the pure-Rust parsers. There, [sqlparser-rs](https://github.com/sqlparser-rs/sqlparser-rs) is the most broadly capable, the permissive parsers such as [polyglot-sql](https://github.com/tobilg/polyglot) accept the most statements but pay for it with a high false-positive rate, and the stricter parsers reject more in exchange for precision. Speed spans more than an order of magnitude, from well under a microsecond per statement for the fastest parsers to the low single-digit microseconds for most, with [polyglot-sql](https://github.com/tobilg/polyglot) a clear outlier at roughly fifteen. No parser leads on every axis, so the right choice comes down to what a given project values most: broad coverage, few false positives, or raw speed.
 
@@ -36,31 +36,35 @@ Per-parser repository metadata (stars, contributors, fuzzing, test and benchmark
 
 ## Corpus
 
-311,594 statements across 34 files and 13 dialects, committed compressed as `datasets.tar.zst` (5.3 MB) and unpacked to `datasets/{dialect}/{name}.txt`, one statement per line. The commands below extract it automatically on first use. All sources are openly licensed (Apache-2.0, MIT, BSD, public domain or CC-BY), drawn from each engine's own regression suites and official samples. Natural-language-with-embedded-SQL datasets are intentionally excluded.
+340,938 statements across 32 files and 13 dialects, committed compressed as `datasets.tar.zst` (5.6 MB) and unpacked to `datasets/{dialect}/{name}.txt`, one statement per line. The commands below extract it automatically on first use. All sources are openly licensed (Apache-2.0, MIT, BSD, public domain or CC-BY), drawn from each engine's own regression suites and official samples. The SQLite corpus includes the SQLite project's own official test suite (public domain), which exercises SQLite-specific grammar such as PRAGMAs, virtual tables, recursive CTEs, and upsert. Natural-language-with-embedded-SQL datasets are intentionally excluded.
 
 Correctness is defined per dialect. Dialects with a runnable engine are graded against that real database engine, run in Docker via testcontainers by the `oracle` crate: a statement is valid unless the engine reports a syntax error (a missing table or column still counts as parsed). The validity labels are computed once and committed under `oracle/labels`, so grading and CI need no Docker. That reference splits the corpus into valid and invalid and scores recall, false positives, round-trip, and fidelity. Dialects with no runnable engine (cloud services, heavy JVM engines) have no reference, so their statements count as provenance-valid (sourced from each engine's own suites) and the metric is acceptance rate. Speed is a per-statement parse-time distribution over every accepted statement, timed with an adaptive iteration count on a no-`catch_unwind` path. Memory is measured separately with a counting allocator, as peak live bytes and retained (AST) bytes per statement. A companion batch axis parses each parser's whole accepted set as one script and normalizes the time and memory by the statement count, showing what bulk parsing amortizes against parsing one statement at a time. A batch that does not parse the whole set (a parser that bails out partway) is dropped rather than reported, and parsers without a multi-statement entry point (databend-common-ast) sit out the batch axis.
 
 ## Running
 
-The corpus auto-extracts on first use. To rebuild the whole explorer snapshot (`web/assets/bench.json`) with one command:
+The corpus auto-extracts on first use. To rebuild the whole explorer snapshot (`web/assets/bench.json.zst`) with one command:
 
 ```bash
-cargo regen   # timing benches + memory benches + export, in order
+cargo regen   # timing benches + memory benches + time-machine + export, in order
 ```
 
 That is an alias (see `.cargo/config.toml`) for `cargo run --release --bin sqlbench -- regen`. The memory measurement installs a counting global allocator, so it has to run in its own process, separate from the timing bench (which must stay on the default allocator for fair numbers). The `regen` command orchestrates that sequence so you do not have to. The individual steps, if you want to run one on its own:
 
 ```bash
-cargo run --release --bin sqlbench correctness --per-file    # per-file acceptance, every dialect
-cargo run --release --bin sqlbench correctness               # reference + provenance correctness
-cargo bench                                                  # parse time (per-statement and batch), every dialect
-cargo run --release -p membench                              # per-statement memory (peak + retained bytes)
-cargo run --release -p membench -- batch                     # whole-script (batch) memory, per statement
-cargo run --release --bin sqlbench export                    # regenerate web/assets/bench.json for the explorer
+cargo run --release --bin sqlbench correctness --per-file       # per-file acceptance, every dialect
+cargo run --release --bin sqlbench correctness                  # reference + provenance correctness
+cargo bench                                                     # parse time (per-statement and batch), every dialect
+cargo run --release -p membench                                 # per-statement memory (peak + retained bytes)
+cargo run --release -p membench -- batch                        # whole-script (batch) memory, per statement
+cargo run --release -p timemachine --bin timemachine-mem -- --full   # per-version memory (writes a sidecar)
+cargo run --release -p timemachine --bin timemachine -- --full       # per-version time + correctness, writes history
+cargo run --release --bin sqlbench export                       # regenerate web/assets/bench.json.zst for the explorer
 ```
 
 `cargo bench` runs both the per-statement (`parsing`) and whole-script (`batch_parsing`) timing benches. Add `--bench batch_parsing` to run only the batch one. `export` reads whatever the benches left under `target/`, warning rather than failing for any missing source, so the memory and batch columns stay empty until their producers have run.
 
+The `timemachine` crate benchmarks several historical versions of each pure-Rust parser at once (via `package`-rename aliases in `timemachine/Cargo.toml`) and writes a compressed `web/assets/history.json.zst` that the explorer embeds and decompresses in the browser. Each parser page then shows how that library's time, memory, and correctness changed across releases, with a version picker. Cargo can only host semver-incompatible versions side by side, so the milestones are the latest patch of every `0.x` minor the shared adapter compiles against: `sqlparser-rs` gets 33 points (every minor from 0.30, January 2023, through 0.62), `sqlite3-parser` eight, `qusql-parse` seven (its 0.1.0 release parses pathologically slowly on parts of the corpus and is excluded), `polyglot-sql` four, `databend-common-ast` three, `sqlglot-rust` two, while `turso_parser` and `orql` have a single published release and so show one point. The FFI parsers (`pg_query`) are excluded because two builds of libpg_query collide at link. Without `--full` the runners use a small per-dialect sample, which is a fast pipeline check rather than publishable numbers.
+
 Validity labels for the reference dialects are produced by the `oracle` crate (real engines in Docker via testcontainers) and committed under `oracle/labels`, so `correctness` and `export` need no Docker. Regenerate them with `cargo run --release -p oracle`.
 
 ### Requirements