feat(page-cluster): add frequency-based template/content token split #901

Merged
YusukeHirao merged 2 commits into dev from feat/page-cluster-frequency-split on Jul 3, 2026
@YusukeHirao YusukeHirao commented Jul 3, 2026

Member

Summary

  • Add computeDocumentFrequency() and splitTokensByFrequency() to @d-zero/page-cluster: a preprocessing layer that separates a page's shared site chrome (header/nav/footer) from page-specific content by document frequency, before either half is compared with jaccardSimilarity() (from PR #900, "feat(page-cluster): add Jaccard similarity and array edit distance primitives", still open).
  • A single flat Jaccard over a page's full token set has two failure modes: common chrome dilutes genuine content differences at loose similarity thresholds, and page-specific content variation (e.g. a freeform CMS block-editor page) swamps a real layout match. Splitting first and comparing the template and content axes separately fixes both. The split uses no tag-name or class-name semantics, only document-frequency statistics.
  • Validated against two real crawls (a small single-layout corporate site, and a much larger site that turned out to be a federation of independent sub-sections with no single dominant layout): the split works cleanly on a homogeneous corpus, and requires scoping to one section first on a heterogeneous multi-section site.
  • Scope intentionally excludes the classifier core (MinHash/LSH, medoid clustering) and auto-discovery of homogeneous page groups for large multi-section sites — both still need separate design decisions.
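
For readers without the diff at hand, the two functions named above could look roughly like the sketch below. This is an illustration, not the merged implementation: the signatures, the `TokenSplit` shape, the field names inside `DocumentFrequency`, and the default threshold of 0.8 are all assumptions made for this example. Only the function names, the bundled `DocumentFrequency` result, and the inclusive cutoff come from this PR.

```typescript
/** Document-frequency table bundled with the page count it was computed from. */
interface DocumentFrequency {
	readonly pageCount: number;
	readonly documentFrequency: ReadonlyMap<string, number>;
}

/** Hypothetical result shape: one token set per axis. */
interface TokenSplit {
	readonly template: Set<string>;
	readonly content: Set<string>;
}

/** Counts, for every token, the number of pages it appears on. */
function computeDocumentFrequency(pages: readonly (readonly string[])[]): DocumentFrequency {
	const documentFrequency = new Map<string, number>();
	for (const tokens of pages) {
		// De-duplicate first so a token counts at most once per page.
		new Set(tokens).forEach((token) => {
			documentFrequency.set(token, (documentFrequency.get(token) ?? 0) + 1);
		});
	}
	return { pageCount: pages.length, documentFrequency };
}

/** Puts tokens found on at least `threshold` of all pages on the template axis. */
function splitTokensByFrequency(
	tokens: readonly string[],
	df: DocumentFrequency,
	threshold = 0.8, // assumed default, for illustration only
): TokenSplit {
	const template = new Set<string>();
	const content = new Set<string>();
	new Set(tokens).forEach((token) => {
		const count = df.documentFrequency.get(token) ?? 0;
		const ratio = df.pageCount > 0 ? count / df.pageCount : 0;
		// Inclusive boundary: a ratio equal to the threshold counts as template.
		(ratio >= threshold ? template : content).add(token);
	});
	return { template, content };
}

const crawl = [
	['header', 'nav', 'hero'],
	['header', 'nav', 'table'],
	['header', 'footer', 'form'],
];
const crawlFrequency = computeDocumentFrequency(crawl);
const firstPageSplit = splitTokensByFrequency(crawl[0] ?? [], crawlFrequency, 0.6);
console.log(Array.from(firstPageSplit.template)); // [ 'header', 'nav' ]
console.log(Array.from(firstPageSplit.content)); // [ 'hero' ]
```

The sketch omits threshold validation and floating-point tolerance at the boundary to keep the core idea visible.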

Test plan

  • yarn build (28 projects)
  • yarn lint
  • yarn test (1130 tests)
  • /code-review xhigh — 9 findings (all in the frequency-cutoff comparison: unvalidated threshold, floating-point boundary rounding, pageCount/documentFrequency desync risk), all fixed
  • /qa-engineer — no additional findings

Note

PR #900 (jaccardSimilarity/arrayEditDistance) is still open/unmerged; this branch has no compile-time dependency on it, but the two PRs are conceptually part of the same classifier-core preprocessing layer.

🤖 Generated with Claude Code

@YusukeHirao YusukeHirao requested a review from yusasa16 as a code owner July 3, 2026 09:10
Add computeDocumentFrequency() and splitTokensByFrequency(), a preprocessing
layer that separates a page's shared site chrome (header/nav/footer) from
its page-specific content by document frequency, before either half is
compared with jaccardSimilarity(). A single flat Jaccard over a page's full
token set has two failure modes: common chrome dilutes genuine content
differences at loose similarity thresholds, and page-specific content
variation (e.g. a freeform CMS block-editor page, where the exact block mix
differs per page) swamps a real layout match. Splitting first and comparing
each axis separately fixes both.
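
The dilution effect described above can be reproduced with a toy example. The `jaccard` helper here is a local stand-in written for this sketch, not the `jaccardSimilarity()` from PR #900, and treating two empty sets as identical (similarity 1) is an assumption. The token names are made up.

```typescript
/** Local stand-in for a Jaccard index over two token sets. */
function jaccard(a: ReadonlySet<string>, b: ReadonlySet<string>): number {
	// Assumption: two empty sets are treated as identical.
	if (a.size === 0 && b.size === 0) return 1;
	let intersection = 0;
	a.forEach((token) => {
		if (b.has(token)) intersection += 1;
	});
	return intersection / (a.size + b.size - intersection);
}

/** Merges two sets without relying on iterator spread. */
function union(a: ReadonlySet<string>, b: ReadonlySet<string>): Set<string> {
	const merged = new Set<string>();
	a.forEach((token) => merged.add(token));
	b.forEach((token) => merged.add(token));
	return merged;
}

// Two pages that share the same chrome but have entirely different content.
const pageA = {
	template: new Set(['header', 'nav', 'footer']),
	content: new Set(['hero', 'cards']),
};
const pageB = {
	template: new Set(['header', 'nav', 'footer']),
	content: new Set(['table', 'form']),
};

// Flat comparison: shared chrome makes the pages look moderately similar.
const flatScore = jaccard(
	union(pageA.template, pageA.content),
	union(pageB.template, pageB.content),
);

// Per-axis comparison: same layout, unrelated content.
const templateScore = jaccard(pageA.template, pageB.template);
const contentScore = jaccard(pageA.content, pageB.content);

console.log(flatScore.toFixed(2)); // 0.43 (3 shared tokens out of 7)
console.log(templateScore); // 1
console.log(contentScore); // 0
```

The flat score of 3/7 hides both facts that the per-axis scores make explicit: the layout matches exactly and the content does not overlap at all.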

Validated against two real crawls: a small single-layout corporate site (a
few hundred pages) showed a clean bimodal frequency split stable across a
wide threshold range; a much larger site that turned out to be a federation
of independent sub-sections (no single section covering even half the
pages) showed the split requires a homogeneous input, and recovers cleanly
once scoped to one section.
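
One way to check whether a corpus is homogeneous enough for the split is to bucket tokens by the fraction of pages they appear on and look for two clusters, one near 0 and one near 1. The helper below is a hypothetical diagnostic written for this note, not part of the package, and the sample counts are invented.

```typescript
interface DocumentFrequency {
	readonly pageCount: number;
	readonly documentFrequency: ReadonlyMap<string, number>;
}

/** Buckets tokens by document-frequency ratio into `bins` equal-width bins. */
function frequencyHistogram(df: DocumentFrequency, bins = 10): number[] {
	const histogram = new Array<number>(bins).fill(0);
	if (df.pageCount <= 0) return histogram;
	df.documentFrequency.forEach((count) => {
		// Integer arithmetic avoids rounding surprises at bin edges;
		// a ratio of exactly 1 is clamped into the last bin.
		const bin = Math.min(bins - 1, Math.floor((count * bins) / df.pageCount));
		histogram[bin] = (histogram[bin] ?? 0) + 1;
	});
	return histogram;
}

// Invented counts for a 10-page, single-layout corpus.
const sampleFrequency: DocumentFrequency = {
	pageCount: 10,
	documentFrequency: new Map([
		['header', 10],
		['nav', 10],
		['footer', 9],
		['hero', 1],
		['table', 1],
		['form', 2],
	]),
};

const sampleHistogram = frequencyHistogram(sampleFrequency);
console.log(sampleHistogram); // [ 0, 2, 1, 0, 0, 0, 0, 0, 0, 3 ]
```

An empty middle, as in this sample, means any threshold inside the gap yields the same split. On a multi-section site the middle bins would be expected to fill up, because each section's chrome appears on only a fraction of all pages.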

code-review (xhigh) surfaced 9 findings, all in the frequency-cutoff
comparison: unvalidated threshold allowing degenerate cutoffs (0, NaN, or a
percentage instead of a fraction), floating-point rounding at the
documented inclusive boundary, and pageCount being passable out of sync
with the documentFrequency it was computed from. Fixed by validating
threshold eagerly, applying an epsilon tolerance to the boundary
comparison, and bundling pageCount with documentFrequency into one
DocumentFrequency result so they cannot be passed independently.
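
The three fixes can be sketched as follows. The epsilon value, the error message, the accepted range of (0, 1], and the helper names `assertValidThreshold` and `isTemplateToken` are assumptions for illustration; the merged code may differ in all of them.

```typescript
/** pageCount travels with the table it was computed from, so the two cannot drift apart. */
interface DocumentFrequency {
	readonly pageCount: number;
	readonly documentFrequency: ReadonlyMap<string, number>;
}

// Assumed tolerance; large enough to absorb rounding, far smaller than 1 / pageCount.
const BOUNDARY_EPSILON = 1e-9;

/** Rejects degenerate cutoffs such as 0, NaN, or a percentage like 80. */
function assertValidThreshold(threshold: number): void {
	if (!Number.isFinite(threshold) || threshold <= 0 || threshold > 1) {
		throw new RangeError(`threshold must be a fraction in (0, 1], got ${threshold}`);
	}
}

/** Inclusive cutoff with tolerance for floating-point rounding at the boundary. */
function isTemplateToken(token: string, df: DocumentFrequency, threshold: number): boolean {
	assertValidThreshold(threshold); // validate eagerly, before any comparison
	if (df.pageCount <= 0) return false;
	const ratio = (df.documentFrequency.get(token) ?? 0) / df.pageCount;
	return ratio + BOUNDARY_EPSILON >= threshold;
}

const tenPages: DocumentFrequency = {
	pageCount: 10,
	documentFrequency: new Map([['header', 3]]),
};

// 0.1 + 0.2 evaluates to 0.30000000000000004, slightly above 3 / 10.
const computedThreshold = 0.1 + 0.2;
console.log(3 / 10 >= computedThreshold); // false: the naive comparison misses the boundary
console.log(isTemplateToken('header', tenPages, computedThreshold)); // true
```

Taking the `DocumentFrequency` bundle as a single argument is what removes the desync risk: there is no separate `pageCount` parameter for a caller to get wrong.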
@YusukeHirao YusukeHirao force-pushed the feat/page-cluster-frequency-split branch from c9323a9 to 68acfe8 Compare July 3, 2026 09:18
@YusukeHirao YusukeHirao merged commit b233683 into dev Jul 3, 2026
6 checks passed
@YusukeHirao YusukeHirao deleted the feat/page-cluster-frequency-split branch July 3, 2026 09:44