Add grammar checks to developer guide CI#5002

Merged
shai-almog merged 4 commits into master from docs-grammar-checks
May 22, 2026

@shai-almog
Collaborator

Summary

Catches the class of bug from #5000 (paragraph rendered as "many ways to animate..." with no leading subject) which the existing Vale config could not detect. Vale is a style/regex linter; its asciidoc tokenizer fragments paragraphs at inline markup (#kbd#, code spans, links), so anchor regexes match on every text run between inline elements rather than on real paragraph starts. Confirmed locally — running the broken sentence through Vale 3.14 with this repo's .vale.ini produces zero alerts.
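
To make the tokenizer problem concrete, here is a toy model in Python. It is not Vale's tokenizer: `INLINE_CODE`, `text_runs`, and `lowercase_hits` are names invented for this illustration, and the anchored regex is an assumed example of the kind of rule one might try to write.

```python
import re

# Toy stand-in for a tokenizer that splits a paragraph at inline code spans.
# This is NOT Vale's implementation, only a model of the behaviour described.
INLINE_CODE = re.compile(r"`[^`]*`")

# An anchored rule meant to catch paragraphs that start with a lowercase word.
STARTS_LOWERCASE = re.compile(r"^\s*[a-z]")


def text_runs(paragraph: str) -> list[str]:
    """Return the non-empty text runs between inline code spans."""
    return [run for run in INLINE_CODE.split(paragraph) if run.strip()]


def lowercase_hits(paragraph: str) -> list[str]:
    """Apply the anchored rule to every run, as a fragmenting tokenizer would."""
    return [run for run in text_runs(paragraph) if STARTS_LOWERCASE.match(run)]


# A perfectly well-formed paragraph with one inline code span.
good = "Use the `Form` class to show a screen."
hits = lowercase_hits(good)
```

In this model the rule fires on the run that follows the code span even though the paragraph itself starts with a capital, which is the false-positive pattern that rules out an anchored regex as a gate.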

Two new checks run inside the existing developer-guide-docs workflow.

Hard gate (build-failing)

  • scripts/developer-guide/check_paragraph_capitalization.rb walks every paragraph block via asciidoctor's parser and flags ones whose first prose word starts lowercase.
  • Skips paragraphs that begin with <code>/<kbd>/<a>/<img> elements, code-like identifiers (com.foo.Bar, iosScrollMotionBool), and single-word transitional connectors between code blocks (becomes, to, and).
  • docs/developer-guide/paragraph-capitalization-baseline.json locks in 107 pre-existing findings so this PR doesn't snowball into a 200-paragraph rewrite. Only new violations fail CI. Maintainers can fix baseline entries over time and regenerate with --update-baseline.
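
The skip rules above can be sketched in Python. This is only an illustration, not the Ruby script from this PR: the function name, the regexes, and the reading of "single-word connector" as a paragraph that consists of exactly one connector word are all assumptions, and the sketch works on rendered paragraph HTML rather than on asciidoctor's parsed blocks.

```python
import re

# Rendered HTML elements that exempt a paragraph when they lead it.
LEADING_ELEMENT = re.compile(r"^\s*<(code|kbd|a|img)\b", re.IGNORECASE)
# Dotted or camelCase tokens such as com.foo.Bar or iosScrollMotionBool.
CODE_LIKE = re.compile(r"^[A-Za-z_]\w*(\.[A-Za-z_]\w*)+$|^[a-z]+[A-Z]\w*$")
# Single-word connectors that sit between two code blocks.
CONNECTORS = {"becomes", "to", "and"}
TAG = re.compile(r"<[^>]+>")


def starts_lowercase(paragraph_html: str) -> bool:
    """Return True when the paragraph should be flagged."""
    if LEADING_ELEMENT.match(paragraph_html):
        return False
    # Drop any remaining tags and look at the first whitespace-separated word.
    words = TAG.sub(" ", paragraph_html).split()
    if not words:
        return False
    first = words[0].strip(".,:;!?()\"'")
    if len(words) == 1 and first.lower() in CONNECTORS:
        return False
    if CODE_LIKE.match(first):
        return False
    return first[:1].islower()
```

With these rules, `starts_lowercase("many ways to animate a form.")` is flagged, while a paragraph that opens with `<code>`, with `com.foo.Bar`, or that consists of the single word `becomes` is skipped.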

Soft gate (advisory)

  • scripts/developer-guide/run_languagetool.py strips the rendered HTML to plain text and runs LanguageTool via language-tool-python.
  • JSON report is uploaded as the developer-guide-languagetool artifact. Never fails the build — LanguageTool's false-positive rate on technical prose is too high to enforce, but its findings are useful spot-check signal in PR review.
  • On the current guide it flagged 514 UPPERCASE_SENTENCE_START matches, confirming it catches the same class of bug as the hard gate.
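
The HTML-to-text step could look roughly like the following, using only the standard library. It is a sketch and not the code in run_languagetool.py: the class name, the set of block-level tags, and the decision to drop `<script>` and `<style>` content are assumptions. The resulting text is what would then be handed to LanguageTool.

```python
from html.parser import HTMLParser

# Closing one of these tags ends a paragraph in the plain-text output (assumed set).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6", "pre", "tr"}
# Content inside these tags is dropped entirely.
SKIPPED_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._parts.append("\n\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self) -> str:
        # Collapse whitespace inside each paragraph and drop empty paragraphs.
        paragraphs = "".join(self._parts).split("\n\n")
        cleaned = [" ".join(p.split()) for p in paragraphs]
        return "\n\n".join(p for p in cleaned if p)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Keeping one blank line between paragraphs preserves paragraph boundaries in the plain text, so sentence-start rules see each paragraph as a separate unit.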

Test plan

  • Re-introduce the bug from PR #5000 (Update Animations.asciidoc) locally → script exits 1, names the file/line/offending word.
  • Restore the fix → script exits 0 with "0 new finding(s)".
  • python3 -c "import yaml; yaml.safe_load(...)" confirms the workflow YAML parses.
  • CI passes on this PR (will be visible once the workflow runs).

🤖 Generated with Claude Code

Catches the class of bug from #5000 (paragraph rendered as "many ways to
animate..." with no leading subject) which the existing Vale config could
not detect — Vale is a style/regex linter and its asciidoc tokenizer
fragments paragraphs at inline markup, so anchor regexes match on every
text run between inline elements rather than on real paragraph starts.

Two new checks run in the existing developer-guide-docs workflow:

- Hard gate (build-failing): a Ruby script using the asciidoctor parser
  walks every paragraph block and flags ones whose first prose word
  starts lowercase. Skips paragraphs that begin with code/kbd/link/image
  elements, code-like identifiers (com.foo.Bar, iosScrollMotionBool),
  and single-word transitional connectors between code blocks. A baseline
  file locks in 107 pre-existing findings; only new violations fail CI.

- Soft gate (advisory): a Python wrapper around LanguageTool (via
  language-tool-python) strips the rendered HTML to plain text, runs the
  grammar pass, and uploads the JSON report as a CI artifact. Never
  fails the build — LanguageTool's false-positive rate on technical
  prose is too high to enforce, but its findings are useful for review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented May 22, 2026

Developer Guide build artifacts are available for download from this workflow run:

Developer Guide quality checks:

  • AsciiDoc linter: No issues found (report)
  • Vale: No alerts found (report)
  • Paragraph capitalization: No paragraph capitalization issues (report)
  • LanguageTool (advisory): 7608 advisory match(es) — top: CONSECUTIVE_SPACES (2376), MORFOLOGIK_RULE_EN_US (2297), COMMA_PARENTHESIS_WHITESPACE (835) (report)
  • Image references: No unused images detected (report)

@github-actions
Contributor

Cloudflare Preview

shai-almog and others added 3 commits May 22, 2026 07:06
Three follow-up fixes after the first CI run on this branch:

- Vale's write-good.ThereIs rule was blocking "There are many ways to
  animate..." — the documented fix to PR #5000 that motivated this whole
  effort. The rule's premise (don't lead with existential "there are")
  is wrong for technical reference, where it's the natural way to
  introduce a count or set. Disabled in .vale.ini with justification,
  matching the precedent for other write-good rules already turned off.

- The LanguageTool step crashed with "Connection reset by peer" because
  the local LT server was fed the full 3 MB of stripped guide text in a
  single request. The script now splits on paragraph boundaries into
  ~40 KB chunks and aggregates results, and always writes a report file
  even when LT fails to start, so the summarizer/quality-gate downstream
  read a valid file instead of falling back to 0.

- The PR comment didn't surface the new paragraph-capitalization or
  LanguageTool checks. Added summarize_reports.py subcommands for both,
  wired matching steps and env vars, and extended the github-script
  block to render their summaries with artifact links. The paragraph-
  capitalization report now also serializes total/new/baseline counts so
  the summary can say "1 new (107 baseline ignored)" instead of just
  "exit code 1".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
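
The chunking fix in the commit above can be sketched as follows. The function names, the `check` callable, and the handling of a single paragraph larger than the limit (it becomes its own oversized chunk) are assumptions for illustration; only the roughly 40 KB target and the split on paragraph boundaries come from the commit message.

```python
def chunk_paragraphs(text: str, max_bytes: int = 40_000) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of roughly max_bytes."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue
        length = len(paragraph.encode("utf-8")) + 2  # +2 for the separator
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and size + length > max_bytes:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(paragraph)
        size += length
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def check_in_chunks(text: str, check, max_bytes: int = 40_000) -> list:
    """Run check(chunk) on every chunk and concatenate the results.

    check is a hypothetical callable that returns a list of matches.
    """
    matches = []
    for chunk in chunk_paragraphs(text, max_bytes):
        matches.extend(check(chunk))
    return matches
```

Splitting only at paragraph boundaries keeps every sentence intact within one request, so per-sentence rules behave the same as they would on the full text.
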
User feedback was twofold: (1) the baseline mechanism was hiding 107 real
prose bugs rather than enforcing the rule, so fix them and remove the
baseline; (2) the LanguageTool advisory was reporting 0 matches in CI,
which meant something was broken — locally LT reports 7614 matches.

Changes:

- Capitalized the first prose word of every flagged paragraph across 23
  developer guide files. The shape of the fixes varied: most were a
  single-letter capitalization ("if you" → "If you", "that's" → "That's"),
  a few needed "There are" prepended to a missing-subject sentence
  ("two ways..." → "There are two ways..."), and a few opaque
  "that's: ..." pseudo-list intros became "In other words: ..." so the
  text reads as a sentence.

- Removed the baseline mechanism from check_paragraph_capitalization.rb
  entirely. No --baseline flag, no --update-baseline flag, no baseline
  JSON file. The check is now a strict gate: any paragraph that starts
  with a lowercase prose word fails CI.

- Tightened the script's "skip if leading element is code/kbd/link/img"
  heuristic to also accept formatting wrappers (strong/em/b/i/mark/u/
  sub/sup) around the identifier. css.asciidoc:443's
  `**`repeating-linear-gradient` / `repeating-radial-gradient`**` glossary
  entry renders as `<strong><code>...</code> / <code>...</code></strong>`
  which the old regex missed.

- Fixed the LanguageTool advisory check. The CI step was crashing with
  AttributeError: 'Match' object has no attribute 'rule_id'. The pinned
  language-tool-python==2.9.4 uses camelCase accessors (ruleId,
  errorLength) while newer releases use snake_case (rule_id,
  error_length). Added a small _attr() helper that tries both names.
  Serialization is now wrapped in a try/except inside a try/finally, so
  the JSON report is written even when LT raises. The original code
  failed silently because the JSON dump only ran on the happy path and
  `LT_COUNT="$(python3 ... || echo 0)"` papered over the missing file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
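
A minimal sketch of the LanguageTool fix described in the commit above. The helper name `_attr` comes from the commit message, but its signature here is a guess; `serialize`, `write_report`, the `run_check` callable, and the report layout are hypothetical. `SimpleNamespace` objects stand in for LanguageTool match objects so the sketch runs without the library.

```python
import json
from types import SimpleNamespace


def _attr(obj, *names, default=None):
    """Return the first attribute from names that obj actually has."""
    for name in names:
        if hasattr(obj, name):
            return getattr(obj, name)
    return default


def serialize(match) -> dict:
    # Try the snake_case name first, then the camelCase one.
    return {
        "rule_id": _attr(match, "rule_id", "ruleId"),
        "error_length": _attr(match, "error_length", "errorLength"),
    }


def write_report(path, run_check) -> int:
    """Run run_check() and always leave a valid JSON report at path."""
    report = {"matches": [], "error": None}
    try:
        try:
            report["matches"] = [serialize(m) for m in run_check()]
        except Exception as exc:  # advisory check: record the failure, don't raise
            report["error"] = f"{type(exc).__name__}: {exc}"
    finally:
        # Runs on every path, so downstream steps never see a missing file.
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(report, handle, indent=2)
    return len(report["matches"])


# Stand-ins for match objects from older and newer library releases.
old_style = SimpleNamespace(ruleId="MORFOLOGIK_RULE_EN_US", errorLength=4)
new_style = SimpleNamespace(rule_id="CONSECUTIVE_SPACES", error_length=2)
```

Recording the error inside the report, rather than relying on a shell fallback to 0, lets a reviewer tell "no matches" apart from "the check never ran".
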
Three rewrites in the previous commit tripped existing Vale rules:

- "It is defined in..." → "It's defined in..."; Microsoft.Contractions
  requires the contraction.
- "Let's fix the example above..." → "The example above can be extended...";
  Microsoft.We bans first-person plural.
- "So far you've relied on..." → "Up to this point you've relied on..."; the
  write-good.So rule bans sentence-initial "So ".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shai-almog shai-almog merged commit 8ad4cc0 into master May 22, 2026
8 checks passed
shai-almog added a commit that referenced this pull request May 22, 2026
Brings in PR #5002's scripts/developer-guide/check_paragraph_capitalization.rb
so GitHub's auto-CodeQL Ruby scan has something to process. Without this
file the dynamic 'Analyze (ruby)' job errors with 'CodeQL could not
process any code written in Ruby' even though my PR adds no Ruby content.