
feat(page-cluster): add Jaccard similarity and array edit distance primitives #900

Merged: YusukeHirao merged 1 commit into dev from feat/page-cluster-distance-primitives on Jul 3, 2026

@YusukeHirao (Member)

Summary

  • Add jaccardSimilarity() (set-based structural overlap) and arrayEditDistance() (order-sensitive element-wise Levenshtein distance) to @d-zero/page-cluster, as pure, parameter-free building blocks for the future MinHash/LSH + medoid clustering classifier.
  • Scope intentionally excludes the classifier itself (MinHash/LSH signature generation, banding, medoid hierarchical clustering): those need additional implementation-parameter decisions (hash function count, band/row config, merge-distance threshold, linkage method) not yet settled.
  • Validated the direction against a real 302-page crawl archive of www.d-zero.co.jp: Jaccard similarity over tokenize() output already groups pages into sensible template families (service pages, work/report/news detail pages, tag listings) at a 0.8–0.9 threshold.
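For readers who have not seen the diff, here is a minimal sketch of what a set-based Jaccard similarity can look like. It is not the code merged in this PR: the signature, the `ReadonlySet` parameters, and the choice to return 1 for two empty sets are assumptions made for illustration, and the token names are made up.

```typescript
// Hypothetical sketch; the actual jaccardSimilarity() in @d-zero/page-cluster may differ.
function jaccardSimilarity<T>(a: ReadonlySet<T>, b: ReadonlySet<T>): number {
	// Assumption: two empty sets count as identical (similarity 1).
	if (a.size === 0 && b.size === 0) {
		return 1;
	}
	// Walk the smaller set so the intersection count costs O(min(|a|, |b|)).
	const small = a.size <= b.size ? a : b;
	const large = small === a ? b : a;
	let intersection = 0;
	small.forEach((item) => {
		if (large.has(item)) {
			intersection++;
		}
	});
	// |A ∪ B| = |A| + |B| - |A ∩ B|
	const union = a.size + b.size - intersection;
	return intersection / union;
}

// Made-up token sets standing in for tokenize() output.
const pageA = new Set(['header', 'nav', 'main', 'article', 'footer']);
const pageB = new Set(['header', 'nav', 'main', 'aside', 'footer']);

// 4 shared tokens out of 6 distinct tokens, roughly 0.667
console.log(jaccardSimilarity(pageA, pageB));
```

Because the result depends only on set membership, two pages with the same tokens in a different order score 1; order-sensitive comparison is left to the edit-distance primitive.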

Test plan

  • yarn build (28 projects)
  • yarn lint
  • yarn test (1122 tests)
  • /code-review xhigh — 4 findings, all fixed (orphaned JSDoc, dedup-collapsed test literal, ReadonlySet/naming hardening)
  • /qa-engineer — added a differing-set-size symmetry test to close a coverage gap

🤖 Generated with Claude Code

…imitives

Add the two structural-distance building blocks a future MinHash/LSH +
medoid clustering classifier will need: jaccardSimilarity() for set-based
overlap (candidate scoring, medoid merge distance) and arrayEditDistance()
for order-sensitive comparisons on tokenize() output (merge-distance
refinement, quality checks). Both are parameter-free and independently
testable ahead of the classifier itself, whose MinHash/LSH and clustering
parameters still need a separate design pass.
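As an illustration of the order-sensitive primitive described above, the following is a minimal sketch of Levenshtein distance applied to array elements instead of characters. The signature and the use of strict equality (`===`) to compare elements are assumptions, not the merged implementation.

```typescript
// Hypothetical sketch; the actual arrayEditDistance() in @d-zero/page-cluster may differ.
function arrayEditDistance<T>(a: readonly T[], b: readonly T[]): number {
	// prev[j] = distance between the first (i - 1) elements of a and the first j elements of b.
	let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
	for (let i = 1; i <= a.length; i++) {
		// Turning the first i elements of a into an empty array takes i deletions.
		const curr: number[] = [i];
		for (let j = 1; j <= b.length; j++) {
			const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
			curr.push(
				Math.min(
					prev[j] + 1, // delete a[i - 1]
					curr[j - 1] + 1, // insert b[j - 1]
					prev[j - 1] + substitutionCost, // keep or substitute
				),
			);
		}
		prev = curr;
	}
	return prev[b.length];
}

// Same tokens, two of them swapped: Jaccard similarity would be 1,
// but the edit distance is 2 (two substitutions).
const original = ['header', 'nav', 'main', 'footer'];
const reordered = ['header', 'main', 'nav', 'footer'];
console.log(arrayEditDistance(original, reordered)); // 2
```

Keeping only two rows of the dynamic-programming table holds memory at O(|b|) while the running time stays O(|a| × |b|).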

Validated against a real 302-page crawl of www.d-zero.co.jp: Jaccard
similarity on tokenize() output already groups pages into their actual
template families (service pages, work/report/news detail pages, tag
listings) at a 0.8-0.9 threshold, without any hashing or clustering
machinery.
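The PR does not say how the pages were grouped during this validation. One simple way to try a threshold on Jaccard scores is a greedy single pass, sketched below. Everything here is hypothetical: the `Page` type, the `groupByThreshold` helper, the URLs, and the token sets are made up for illustration, and this is not the MinHash/LSH or medoid clustering planned for the classifier.

```typescript
// Hypothetical types and helpers; none of these names come from @d-zero/page-cluster.
type Page = { url: string; tokens: ReadonlySet<string> };

function jaccard(a: ReadonlySet<string>, b: ReadonlySet<string>): number {
	if (a.size === 0 && b.size === 0) return 1; // assumption: empty sets are identical
	let intersection = 0;
	a.forEach((token) => {
		if (b.has(token)) intersection++;
	});
	return intersection / (a.size + b.size - intersection);
}

// Greedy grouping: each page joins the first group whose first member
// is at least `threshold` similar; otherwise it starts a new group.
function groupByThreshold(pages: readonly Page[], threshold: number): string[][] {
	const groups: { representative: Page; urls: string[] }[] = [];
	for (const page of pages) {
		const match = groups.find(
			(group) => jaccard(group.representative.tokens, page.tokens) >= threshold,
		);
		if (match) {
			match.urls.push(page.url);
		} else {
			groups.push({ representative: page, urls: [page.url] });
		}
	}
	return groups.map((group) => group.urls);
}

// Made-up pages: two "service"-like and two "news"-like token sets.
const pages: Page[] = [
	{ url: '/service/a', tokens: new Set(['header', 'nav', 'main', 'section', 'footer']) },
	{ url: '/news/1', tokens: new Set(['header', 'article', 'time', 'h1', 'footer']) },
	{ url: '/service/b', tokens: new Set(['header', 'nav', 'main', 'section', 'footer', 'aside']) },
	{ url: '/news/2', tokens: new Set(['header', 'article', 'time', 'h1', 'footer', 'figure']) },
];

// Within each family the similarity is 5/6 (about 0.83); across families it is 0.25 or less.
console.log(groupByThreshold(pages, 0.8));
```

A greedy pass like this depends on input order and compares every page against every group representative, which is the kind of cost that hashing and clustering machinery is normally introduced to reduce.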
@YusukeHirao YusukeHirao requested a review from yusasa16 as a code owner July 3, 2026 07:29
@YusukeHirao YusukeHirao merged commit 38f1257 into dev Jul 3, 2026
6 checks passed
@YusukeHirao YusukeHirao deleted the feat/page-cluster-distance-primitives branch July 3, 2026 09:14