Add the time machine: per-version benchmark history for each parser #18
Merged
Benchmarks several historical versions of each pure-Rust parser and shows, on every parser page, how its parse time, memory, and correctness changed across releases, with a version picker and trend charts.

Architecture:
- A new Parser trait in the main crate abstracts the measurable surface (parse, batch, memory, reprint, grading hooks). BenchParser implements it as a zero-behavior-change delegation shim, and the grading engine in report.rs plus the benches and membench now drive over dyn Parser, keyed by a ParserId (family + version).
- A new timemachine crate hosts many semver-incompatible versions of a crate at once via package-rename aliases (different 0.x minors coexist). Each version is one macro invocation implementing Parser against its renamed crate, so adding a version is three lines (alias, macro call, registry entry). The FFI parsers (pg_query) are excluded because two libpg_query builds collide at link.
- 23 historical builds across 8 families: sqlparser-rs 5, qusql-parse 5, sqlite3-parser 4, polyglot-sql 3, sqlglot-rust 2, databend-common-ast 2, turso_parser 1, orql 1.
- Two runner binaries (timing plus grading, and memory with the counting allocator) reuse the main crate's grading and summary helpers and write one combined per-family history. Each version-and-dialect pair is panic-isolated so a single old-parser panic skips that pair instead of aborting the run.

Data delivery and size:
- Both the main snapshot and the history are stored zstd-compressed (bench.json.zst 127 KB, down ~26x from a 3.3 MB raw bench.json, and history.json.zst 327 KB) and embedded in the wasm viewer, which decompresses them with ruzstd. The site still does no runtime fetch, staying immune to GitHub Pages base-path issues.

Visualization:
- The trend uses median with an interquartile (p25 to p75) band on a release-date x-axis, because parse-time and memory distributions are heavily right-skewed and the mean is outlier-dominated.
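A minimal sketch of the two architecture ideas above: a trait object keyed by family and version, and per-pair panic isolation. The trait here is reduced to a single `parse` method, and `run_pair`, `PairOutcome`, `Strict`, and `Fragile` are hypothetical names for illustration, not the PR's actual items. It assumes the default `panic = "unwind"` profile; with `panic = "abort"` nothing can be caught.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical identifier: parser family plus crate version.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ParserId {
    pub family: &'static str,
    pub version: &'static str,
}

/// Reduced stand-in for the trait; the real one also covers batch,
/// memory, reprint, and grading hooks.
pub trait Parser {
    fn id(&self) -> ParserId;
    /// Returns true when the statement is accepted.
    fn parse(&self, sql: &str) -> bool;
}

/// Outcome of one (version, dialect) pair.
#[derive(Debug, PartialEq)]
pub enum PairOutcome {
    /// Number of statements the parser accepted.
    Accepted(usize),
    /// The parser panicked somewhere in this pair, so the pair is skipped.
    Skipped,
}

/// Runs one pair behind `catch_unwind` so a panic in an old parser
/// skips only this pair instead of aborting the whole run.
pub fn run_pair(parser: &dyn Parser, corpus: &[&str]) -> PairOutcome {
    let result = panic::catch_unwind(AssertUnwindSafe(|| {
        corpus.iter().filter(|sql| parser.parse(sql)).count()
    }));
    match result {
        Ok(accepted) => PairOutcome::Accepted(accepted),
        Err(_) => PairOutcome::Skipped,
    }
}

/// Toy parser that accepts only statements starting with SELECT.
struct Strict;
impl Parser for Strict {
    fn id(&self) -> ParserId {
        ParserId { family: "strict", version: "0.1.0" }
    }
    fn parse(&self, sql: &str) -> bool {
        sql.trim_start().to_ascii_uppercase().starts_with("SELECT")
    }
}

/// Toy parser that panics on JOIN, standing in for an old buggy release.
struct Fragile;
impl Parser for Fragile {
    fn id(&self) -> ParserId {
        ParserId { family: "fragile", version: "0.0.1" }
    }
    fn parse(&self, sql: &str) -> bool {
        if sql.contains("JOIN") {
            panic!("old parser bug");
        }
        true
    }
}
```

Calling `run_pair(&Fragile, &["SELECT 1", "SELECT * FROM a JOIN b"])` yields `PairOutcome::Skipped`, while the same corpus through `Strict` is counted normally. A parser that never returns is not a panic and would not be caught this way.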
Also adds std to the perf and memory schema, shared stats helpers (std_dev, dist_from, perf_from), the cargo regen wiring for the time-machine passes, and updated README and CONTRIBUTING with the add-a-version recipe. Includes the regenerated real benchmark data.
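To illustrate why the trend charts plot median and interquartile range rather than the mean, here is a small sketch of such a summary. `quantile`, `Summary`, and `summarize` are hypothetical names, not the PR's std_dev, dist_from, or perf_from helpers. The linear interpolation and the population (divide by n) standard deviation are assumptions, since the description does not say which conventions the project uses.

```rust
/// Linearly interpolated quantile of a sorted, non-empty slice (q in 0..=1).
fn quantile(sorted: &[f64], q: f64) -> f64 {
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo as f64)
}

/// Hypothetical summary record for one version-and-dialect pair.
#[derive(Debug, PartialEq)]
pub struct Summary {
    pub p25: f64,
    pub median: f64,
    pub p75: f64,
    pub mean: f64,
    pub std: f64,
}

/// Returns None for an empty sample set.
pub fn summarize(samples: &[f64]) -> Option<Summary> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len() as f64;
    let mean = sorted.iter().sum::<f64>() / n;
    // Population variance (divide by n); an assumption, see the note above.
    let variance = sorted.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    Some(Summary {
        p25: quantile(&sorted, 0.25),
        median: quantile(&sorted, 0.5),
        p75: quantile(&sorted, 0.75),
        mean,
        std: variance.sqrt(),
    })
}
```

For the made-up samples `[1.0, 2.0, 3.0, 4.0, 100.0]` the median is 3.0 and the band is 2.0 to 4.0, while the single outlier drags the mean to 22.0, well outside the band.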
The SQLite reference oracle classified an EXPLAIN error as invalid only when the message contained "syntax error", "incomplete input", or "unrecognized token". SQLite reports many parse and grammar rejections with other wording (for example "ORDER BY clause should come after INTERSECT not before", reported in gwenn/lemon-rs#102, and "RIGHT and FULL OUTER JOINs are not currently supported"), so those statements were mislabeled valid. That put them on the failing lists of parsers that correctly reject them, unfairly penalizing strict parsers.

Invert the rule to match the documented intent: EXPLAIN resolves names, so an error means the statement is invalid unless it is a missing-object or binding error (no such table/column/function, ambiguous column), which means it parsed and only references objects we did not create. Add unit tests for the reported case and the missing-object cases.

Re-labeling the SQLite corpus against real SQLite flipped exactly one statement valid to invalid (the rest already produced standard syntax-error messages the old code caught). After re-export, sqlite3-parser has zero failing statements (recall 100, false positive 0) and the statement is removed from the failing lists. Regenerated oracle/labels/sqlite.tsv.zst and web/assets/bench.json.zst, and removed two now-orphaned failure downloads.
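The inverted rule can be sketched as follows. `Label`, `BINDING_ERRORS`, and `classify` are hypothetical names, and the exact substrings matched are an assumption based on the error categories listed in this commit message; the sketch covers only those categories.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Label {
    Valid,
    Invalid,
}

/// Message fragments treated as missing-object or binding errors.
/// Assumed wording, taken from the categories named in the commit message.
const BINDING_ERRORS: [&str; 4] = [
    "no such table",
    "no such column",
    "no such function",
    "ambiguous column",
];

/// Classifies the result of running EXPLAIN on a statement.
/// `None` means EXPLAIN succeeded; `Some(msg)` carries the error text.
pub fn classify(explain_error: Option<&str>) -> Label {
    match explain_error {
        None => Label::Valid,
        Some(msg) => {
            let msg = msg.to_ascii_lowercase();
            if BINDING_ERRORS.iter().any(|marker| msg.contains(marker)) {
                // The statement parsed; it only references objects
                // that were never created.
                Label::Valid
            } else {
                // Any other error is treated as a parse or grammar rejection.
                Label::Invalid
            }
        }
    }
}
```

With this shape, the default for an unrecognized message is Invalid, so new grammar-error wording no longer slips through as valid; the cost is that any new kind of binding error has to be added to the list explicitly.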
Our SQLite corpus was only Spider and sql-create-context queries (12,119 statements), almost all plain SELECT/CREATE, so it barely exercised SQLite-specific grammar and left every parser near 100 percent recall. This adds the SQLite project's own official test suite (public domain), extracted from the codeschool/sqlite-parser official-suite, with a SQLite-aware splitter that respects string and identifier quoting and comments, normalizes each statement to one line, and dedupes within the suite and against the existing corpus: 29,344 new statements. The SQLite corpus grows to 41,463 and the total corpus from 311,594 to 340,938. The repacked datasets.tar.zst carries them (datasets/ is gitignored).

The new corpus surfaced a second gap in the SQLite oracle classifier: "unknown database" (a db-qualified reference such as CREATE TABLE db2.t(x) without ATTACH) is a missing-object error, so the statement parsed and should count as valid, not as a parse error. Added it to the missing-object set, with a unit test. Re-labeled against real SQLite: 79 invalid, all genuinely invalid SQL (OFFSET without LIMIT, AS inside an aggregate, reserved words used as identifiers, bracket-identifier concatenation, en-dashes). The suite itself contributes only valid statements.

After a full regen, SQLite now meaningfully differentiates the parsers on real coverage instead of a trivial corpus: sqlite3-parser and turso lead at 99.1 percent recall (sqlite3-parser still rejects 363 of SQLite's own official statements, the lemon-rs divergence), then polyglot-sql 97.5, sqlparser-rs 97.3, sqlglot-rust 92.7, qusql-parse 80.5. Regenerated bench.json.zst and history.json.zst, refreshed the failure downloads, and updated the README counts and provenance.
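A simplified sketch of a quote-aware and comment-aware splitter with dedupe, in the spirit of the one described above. `split_statements` and `dedupe` are hypothetical names and this is not the PR's implementation. Assumptions: comments are dropped rather than preserved, whitespace is collapsed only outside quoted regions (so a literal containing a newline still spans lines), and trigger bodies with inner semicolons are not handled.

```rust
use std::collections::HashSet;

/// Appends one space unless the buffer is empty or already ends with one.
fn push_space(cur: &mut String) {
    if !cur.is_empty() && !cur.ends_with(' ') {
        cur.push(' ');
    }
}

/// Moves the trimmed current statement into `out` if it is non-empty.
fn flush(cur: &mut String, out: &mut Vec<String>) {
    let stmt = cur.trim();
    if !stmt.is_empty() {
        out.push(stmt.to_string());
    }
    cur.clear();
}

/// Splits a script on `;` outside strings, quoted identifiers, and comments.
pub fn split_statements(script: &str) -> Vec<String> {
    let chars: Vec<char> = script.chars().collect();
    let mut out = Vec::new();
    let mut cur = String::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        let next = chars.get(i + 1).copied();
        match c {
            // String or quoted identifier: copy verbatim through the closing
            // delimiter. A doubled quote ('') simply reopens the region.
            '\'' | '"' | '`' | '[' => {
                let close = if c == '[' { ']' } else { c };
                cur.push(c);
                i += 1;
                while i < chars.len() {
                    cur.push(chars[i]);
                    i += 1;
                    if chars[i - 1] == close {
                        break;
                    }
                }
            }
            // Line comment: skip to the end of the line.
            '-' if next == Some('-') => {
                while i < chars.len() && chars[i] != '\n' {
                    i += 1;
                }
            }
            // Block comment: skip through the closing */ and leave one space.
            '/' if next == Some('*') => {
                i += 2;
                while i < chars.len() && !(chars[i] == '*' && chars.get(i + 1) == Some(&'/')) {
                    i += 1;
                }
                i = (i + 2).min(chars.len());
                push_space(&mut cur);
            }
            ';' => {
                flush(&mut cur, &mut out);
                i += 1;
            }
            c if c.is_whitespace() => {
                push_space(&mut cur);
                i += 1;
            }
            _ => {
                cur.push(c);
                i += 1;
            }
        }
    }
    flush(&mut cur, &mut out);
    out
}

/// Keeps first occurrences only. Pre-seeding `seen` with the existing corpus
/// dedupes against it as well as within the new batch.
pub fn dedupe(statements: Vec<String>, seen: &mut HashSet<String>) -> Vec<String> {
    statements.into_iter().filter(|s| seen.insert(s.clone())).collect()
}
```

Sharing one `seen` set across the existing corpus and the new suite is what makes the "within the suite and against the existing corpus" dedupe a single pass.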
The per-version correctness was already in the history and shown in the selected-version table, but the trend charts only covered time and memory. Add a percentage trend chart (linear axis hugging the data, points at each release date, one line per dialect) and render two of them on the parser page: the share of expected statements accepted (recall on reference dialects, acceptance rate elsewhere) and, where a reference engine exists, the share of invalid statements wrongly accepted. The false-positive chart is omitted for parsers with no reference dialect. Uses the data already in history.json.zst, so no regeneration is needed. The sqlglot-rust page now visibly shows its 0.9.37 to 0.10.0 recall jump across every dialect.
The milestones were a handful of hand-picked releases per family, which hides how often some libraries ship meaningful changes as 0.x minors. Expand to the latest patch of every minor the shared adapters compile against, found by probing: sqlparser-rs grows from 5 to 33 points (every minor from 0.30, January 2023, through 0.62), sqlite3-parser from 4 to 8 (0.9 to 0.16; earlier releases belong to its fallible-iterator 0.2 era), qusql-parse from 5 to 7 (0.2 to 0.8), polyglot-sql from 3 to 4 (adding 0.2 and bumping 0.1 to its latest patch), databend-common-ast from 2 to 3 (adding 0.0.3). In total 59 versions across 8 families, up from 23.

qusql-parse 0.1.0 is excluded: at full-corpus scale its parser effectively never returns on parts of the MySQL corpus (pathological parse time, not a panic, so the per-pair isolation cannot catch it). The exclusion is documented in the manifest, the family module, and the README. Bumping polyglot 0.1 to 0.1.15 incidentally fixed the panic that previously skipped its MySQL pair, so the regenerated history has no skipped pairs at all.

The memory runner's per-family sidecars now double as checkpoints: a family whose sidecar exists is skipped, so an interrupted run resumes where it stopped (delete target/timemachine/ for a fresh measurement). That cut recovery from the qusql hang to minutes instead of a full rerun.

Includes the regenerated web/assets/history.json.zst (990 KB compressed). The denser data already shows real stories: sqlparser-rs PostgreSQL median parse time drifts from 3.6 to 6.4 microseconds across 2023 to 2025 and recovers to 4.4 around 0.60, and qusql-parse recall jumps from 0.1 percent at 0.2.1 to 73 percent at 0.8.0.
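The "latest patch of every minor" selection can be sketched as below. `parse_version` and `latest_patch_per_minor` are hypothetical helpers, not part of the PR, and the sketch does not perform the compile probing described above. It assumes plain `major.minor.patch` strings and ignores anything else, such as pre-releases. It also groups 0.0.x releases under one minor, although Cargo treats each 0.0.x patch as incompatible, so a family like databend-common-ast would need separate handling.

```rust
use std::collections::BTreeMap;

/// Parses "major.minor.patch"; returns None for anything else.
fn parse_version(v: &str) -> Option<(u64, u64, u64)> {
    let mut parts = v.split('.').map(|p| p.parse::<u64>().ok());
    let triple = (parts.next()??, parts.next()??, parts.next()??);
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some(triple)
}

/// For each (major, minor) line, keeps the highest patch,
/// returned in ascending version order.
pub fn latest_patch_per_minor(published: &[&str]) -> Vec<String> {
    let mut best: BTreeMap<(u64, u64), u64> = BTreeMap::new();
    for (major, minor, patch) in published.iter().filter_map(|v| parse_version(v)) {
        let entry = best.entry((major, minor)).or_insert(patch);
        *entry = (*entry).max(patch);
    }
    best.into_iter()
        .map(|((major, minor), patch)| format!("{major}.{minor}.{patch}"))
        .collect()
}
```

Given the made-up list `["0.1.0", "0.1.15", "0.2.0", "0.1.3", "0.2.1", "0.3.0-rc1"]`, this keeps `0.1.15` and `0.2.1` and drops the pre-release.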
Extend the abstract (README and web identically) with the memory methodology, the batch axis, and the time machine. Fix the stale corpus count in the web abstract (311,594 to 340,938). Add a June 2026 changelog section covering real-engine oracles, the SQLite oracle fix, turso_parser, the official SQLite suite, the batch axis, the time machine, compressed snapshots, and cargo regen.
Benchmarks 59 historical versions across the 8 pure-Rust parser families via Cargo package-rename aliases (every semver-incompatible 0.x minor; sqlparser-rs alone gets 33). Each parser page gains a version picker and release-date trend charts for parse time and memory (median with interquartile bars), accept/recall, and false positives. The head version is measured in the same run as the old ones. pg_query is excluded (two libpg_query builds collide at link time), as is qusql-parse 0.1.0 (pathologically slow).
Also: fixes the SQLite oracle labeling grammar errors as valid (gwenn/lemon-rs#102), ingests the SQLite official test suite (corpus now 340,938 statements), compresses the embedded snapshots with zstd, and adds cargo regen to rebuild everything in one command.