Add grammar checks to developer guide CI #5002
Merged
Catches the class of bug from #5000 (paragraph rendered as "many ways to animate..." with no leading subject), which the existing Vale config could not detect. Vale is a style/regex linter and its asciidoc tokenizer fragments paragraphs at inline markup, so anchor regexes match on every text run between inline elements rather than on real paragraph starts.
Two new checks run in the existing developer-guide-docs workflow:
- Hard gate (build-failing): a Ruby script using the asciidoctor parser walks every paragraph block and flags ones whose first prose word starts lowercase. It skips paragraphs that begin with code/kbd/link/image elements, code-like identifiers (com.foo.Bar, iosScrollMotionBool), and single-word transitional connectors between code blocks. A baseline file locks in 107 pre-existing findings; only new violations fail CI.
- Soft gate (advisory): a Python wrapper around LanguageTool (via language-tool-python) strips the rendered HTML to plain text, runs the grammar pass, and uploads the JSON report as a CI artifact. It never fails the build: LanguageTool's false-positive rate on technical prose is too high to enforce, but its findings are useful for review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
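
The hard gate's skip rules can be illustrated with a minimal Python sketch. The PR's actual check is a Ruby script that walks asciidoctor's parsed blocks; this version works on already-extracted paragraph text instead, and the `CONNECTORS` set, the `CODE_LIKE` pattern, and the `should_flag` name are assumptions made for illustration.

```python
import re

# Assumed connector list; the PR description names "becomes", "to", and "and".
CONNECTORS = {"becomes", "to", "and"}

# Assumed shape of a "code-like identifier": a dotted name (com.foo.Bar)
# or a camelCase name (iosScrollMotionBool).
CODE_LIKE = re.compile(r"^(?:\w+(?:\.\w+)+|[a-z]+[A-Z]\w*)$")


def should_flag(paragraph: str) -> bool:
    """Return True when the first prose word starts with a lowercase letter."""
    words = paragraph.split()
    if not words:
        return False
    first = words[0].strip(".,:;!?()\"'")
    # A paragraph that is just one connector word sits between code blocks.
    if len(words) == 1 and first.lower() in CONNECTORS:
        return False
    # Identifiers are allowed to start lowercase.
    if CODE_LIKE.match(first):
        return False
    return first[:1].islower()
```

Under these assumptions, `should_flag("many ways to animate a component")` returns `True`, while the repaired "There are many ways..." form and a paragraph opening with `com.foo.Bar` are left alone.
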
Three follow-up fixes after the first CI run on this branch:
- Vale's write-good.ThereIs rule was blocking "There are many ways to animate...", the documented fix to PR #5000 that motivated this whole effort. The rule's premise (don't lead with existential "there are") is wrong for technical reference, where it's the natural way to introduce a count or set. Disabled in .vale.ini with justification, matching the precedent for other write-good rules already turned off.
- The LanguageTool step crashed with "Connection reset by peer" because the local LT server was fed the full 3 MB of stripped guide text in a single request. The script now splits on paragraph boundaries into ~40 KB chunks and aggregates results. It also always writes a report file, even when LT fails to start, so the downstream summarizer and quality gate read a valid file instead of falling back to 0.
- The PR comment didn't surface the new paragraph-capitalization or LanguageTool checks. Added summarize_reports.py subcommands for both, wired matching steps and env vars, and extended the github-script block to render their summaries with artifact links. The paragraph-capitalization report now also serializes total/new/baseline counts so the summary can say "1 new (107 baseline ignored)" instead of just "exit code 1".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
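
The chunking fix described above can be sketched roughly as follows. This is an illustrative Python version, not the code in run_languagetool.py: the function names, the blank-line paragraph separator, and the `check` callable (standing in for one LanguageTool request) are all assumptions.

```python
from typing import Callable, List


def chunk_paragraphs(text: str, max_bytes: int = 40_000) -> List[str]:
    """Group blank-line-separated paragraphs into chunks of about max_bytes."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for para in text.split("\n\n"):
        if not para.strip():
            continue  # ignore empty paragraphs
        para_size = len(para.encode("utf-8")) + 2  # +2 for the separator
        # Flush before the chunk would exceed the budget; a single oversized
        # paragraph still becomes its own chunk rather than being cut.
        if current and size + para_size > max_bytes:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += para_size
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def check_all(text: str, check: Callable[[str], list], max_bytes: int = 40_000) -> list:
    """Run check() once per chunk and concatenate the results."""
    results: list = []
    for chunk in chunk_paragraphs(text, max_bytes):
        results.extend(check(chunk))
    return results
```

Splitting only at paragraph boundaries keeps each sentence intact inside one request, so the grammar pass never sees a sentence cut in half.
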
User feedback was twofold: (1) the baseline mechanism was hiding 107 real
prose bugs rather than enforcing the rule, so fix them and remove the
baseline; (2) the LanguageTool advisory was reporting 0 matches in CI,
which meant something was broken — locally LT reports 7614 matches.
Changes:
- Capitalized the first prose word of every flagged paragraph across 23
developer guide files. The shape of the fixes varied: most were a
single-letter capitalization ("if you" → "If you", "that's" → "That's"),
a few needed "There are" prepended to a missing-subject sentence
("two ways..." → "There are two ways..."), and a few opaque
"that's: ..." pseudo-list intros became "In other words: ..." so the
text reads as a sentence.
- Removed the baseline mechanism from check_paragraph_capitalization.rb
entirely. No --baseline flag, no --update-baseline flag, no baseline
JSON file. The check is now a strict gate: any paragraph that starts
with a lowercase prose word fails CI.
- Tightened the script's "skip if leading element is code/kbd/link/img"
heuristic to also accept formatting wrappers (strong/em/b/i/mark/u/
sub/sup) around the identifier. css.asciidoc:443's
`**`repeating-linear-gradient` / `repeating-radial-gradient`**` glossary
entry renders as `<strong><code>...</code> / <code>...</code></strong>`
which the old regex missed.
- Fixed the LanguageTool advisory check. The CI step was crashing with
AttributeError: 'Match' object has no attribute 'rule_id'. The pinned
language-tool-python==2.9.4 uses camelCase accessors (ruleId,
errorLength) while newer releases use snake_case (rule_id,
error_length). Added a small _attr() helper that tries both names.
Serialization is now wrapped in a try/except inside a try/finally, so
the JSON report is written even when LT raises. The original code
failed silently because the JSON dump only ran on the happy path and
`LT_COUNT="$(python3 ... || echo 0)"` papered over the missing file.
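
The formatting-wrapper rule from the third bullet above could be expressed as one pattern over a paragraph's rendered HTML. The sketch below is in Python and is not the regex from check_paragraph_capitalization.rb (which is Ruby); the constant and function names are hypothetical.

```python
import re

# Formatting wrappers listed in the commit message.
WRAPPERS = "strong|em|b|i|mark|u|sub|sup"
# Leading elements that exempt a paragraph from the capitalization check.
LEADING = "code|kbd|a|img"

# Zero or more opening wrapper tags, then one of the exempt elements.
# The \b after each tag name keeps <b> from matching <body>, <i> from
# matching <img>, and <a> from matching <abbr>.
LEADS_WITH_CODE = re.compile(
    rf"^\s*(?:<(?:{WRAPPERS})\b[^>]*>\s*)*<(?:{LEADING})\b",
    re.IGNORECASE,
)


def leads_with_exempt_element(html: str) -> bool:
    """Return True when the paragraph HTML opens with code/kbd/a/img,
    optionally nested inside formatting wrappers."""
    return LEADS_WITH_CODE.match(html) is not None
```

With this pattern the css.asciidoc glossary entry, rendered as `<strong><code>...</code> / <code>...</code></strong>`, counts as starting with a code element, while `<strong>note</strong> that ...` does not.
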
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three rewrites in the previous commit tripped existing Vale rules:
- "It is defined in..." → write as a contraction ("It's defined in..."),
matching Microsoft.Contractions.
- "Let's fix the example above..." → "The example above can be extended...";
Microsoft.We bans first-person plural.
- "So far you've relied on..." → "Up to this point you've relied on..."; the
write-good.So rule bans sentence-initial "So ".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
shai-almog added a commit that referenced this pull request on May 22, 2026:
Brings in PR #5002's scripts/developer-guide/check_paragraph_capitalization.rb so GitHub's auto-CodeQL Ruby scan has something to process. Without this file the dynamic 'Analyze (ruby)' job errors with 'CodeQL could not process any code written in Ruby' even though my PR adds no Ruby content.
Summary
Catches the class of bug from #5000 (paragraph rendered as "many ways to animate..." with no leading subject), which the existing Vale config could not detect. Vale is a style/regex linter; its asciidoc tokenizer fragments paragraphs at inline markup (#kbd#, code spans, links), so anchor regexes match on every text run between inline elements rather than on real paragraph starts. Confirmed locally: running the broken sentence through Vale 3.14 with this repo's `.vale.ini` produces zero alerts.
Two new checks run inside the existing `developer-guide-docs` workflow.
Hard gate (build-failing)
- `scripts/developer-guide/check_paragraph_capitalization.rb` walks every paragraph block via asciidoctor's parser and flags ones whose first prose word starts lowercase.
- Skips paragraphs that begin with `<code>`/`<kbd>`/`<a>`/`<img>` elements, code-like identifiers (`com.foo.Bar`, `iosScrollMotionBool`), and single-word transitional connectors between code blocks (`becomes`, `to`, `and`).
- `docs/developer-guide/paragraph-capitalization-baseline.json` locks in 107 pre-existing findings so this PR doesn't snowball into a 200-paragraph rewrite. Only new violations fail CI. Maintainers can fix baseline entries over time and regenerate with `--update-baseline`.
Soft gate (advisory)
- `scripts/developer-guide/run_languagetool.py` strips the rendered HTML to plain text and runs LanguageTool via `language-tool-python`.
- Uploads the JSON report as the `developer-guide-languagetool` artifact. Never fails the build: LanguageTool's false-positive rate on technical prose is too high to enforce, but its findings are useful spot-check signal in PR review.
- Findings include `UPPERCASE_SENTENCE_START` matches, confirming it catches the same class of bug as the hard gate.
Test plan
- `python3 -c "import yaml; yaml.safe_load(...)"` confirms the workflow YAML parses.
🤖 Generated with Claude Code