feat(page-cluster): add URL-path and stylesheet blocking-key derivation #902

Merged
YusukeHirao merged 2 commits into dev from feat/page-cluster-blocking-keys on Jul 3, 2026

@YusukeHirao

Member

Summary

  • Add derivePathGroupKey and deriveStylesheetGroupKey: two independent "blocking key" functions (in the record-linkage sense; see Fellegi-Sunter, canopy clustering, DNF blocking schemes) that turn a page's URL path segments and its list of loaded stylesheet URLs into coarse partition keys.
  • These are meant to feed a future classifier's homogeneous-group discovery step, which runs before the expensive structural comparison that tokenize()/jaccardSimilarity()/arrayEditDistance() already provide (PRs #900 "feat(page-cluster): add Jaccard similarity and array edit distance primitives" and #901 "feat(page-cluster): add frequency-based template/content token split"). Combining the two keys into an actual grouping decision is intentionally out of scope here: the blocking literature recommends keeping independent blocking predicates separate (OR-combined) rather than merging them into one composite key.
  • Export HASH_LENGTH from hash-content.ts so deriveStylesheetGroupKey reuses the same hash truncation length instead of introducing a second one.
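
To make the idea in the summary concrete, here is a minimal sketch of what two such blocking keys and their OR-combination could look like. It is not the code merged in this PR: the function names with the `Sketch` suffix, the `depth` parameter, the sort-and-deduplicate step, the SHA-256 choice, and the value of `HASH_LENGTH_ASSUMED` are all assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// Assumed value. The PR reuses HASH_LENGTH from hash-content.ts; its real value is not shown here.
const HASH_LENGTH_ASSUMED = 16;

/** Hypothetical path key: the first `depth` non-empty path segments. */
function pathGroupKeySketch(segments: readonly string[], depth = 1): string {
  // Dropping empty segments means a trailing slash does not change the key.
  const kept = segments.filter((s) => s.length > 0).slice(0, depth);
  return "/" + kept.join("/");
}

/** Hypothetical stylesheet key: truncated hash of the sorted, de-duplicated href list. */
function stylesheetGroupKeySketch(hrefs: readonly string[]): string {
  const unique = [...new Set(hrefs)].sort();
  // JSON.stringify keeps element boundaries unambiguous before hashing.
  return createHash("sha256")
    .update(JSON.stringify(unique))
    .digest("hex")
    .slice(0, HASH_LENGTH_ASSUMED);
}

/** Hypothetical page record holding only the inputs the two keys need. */
interface PageSketch {
  segments: string[];
  hrefs: string[];
}

/** OR-combination: a pair is a candidate if either blocking key matches. */
function isCandidatePair(a: PageSketch, b: PageSketch): boolean {
  return (
    pathGroupKeySketch(a.segments) === pathGroupKeySketch(b.segments) ||
    stylesheetGroupKeySketch(a.hrefs) === stylesheetGroupKeySketch(b.hrefs)
  );
}
```

Keeping the two predicates separate, as in `isCandidatePair`, lets pages that share a stylesheet set but live under different path prefixes still reach the structural comparison; a single composite key would only pair pages that agree on both.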

Test plan

  • yarn build
  • yarn lint
  • yarn test (1164 tests passing)
  • New unit tests for both functions, including edge cases found during /code-review xhigh (trailing-slash path segments, delimiter-collision in stylesheet hashing, bare-string type-safety hole)
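
The delimiter-collision edge case mentioned in the test plan can be illustrated with a small sketch. This shows the general problem and one common remedy, not the fix actually applied in this PR; `naiveKey`, `safeKey`, and the sample hrefs are hypothetical.

```typescript
import { createHash } from "node:crypto";

const sha = (s: string): string => createHash("sha256").update(s).digest("hex");

// Naive: join with a delimiter that may also appear inside an href.
const naiveKey = (hrefs: readonly string[]): string => sha(hrefs.join(","));

// Safer: JSON.stringify quotes and escapes each element, so list boundaries stay unambiguous.
const safeKey = (hrefs: readonly string[]): string => sha(JSON.stringify(hrefs));

const oneHref = ["/a.css,/b.css"]; // a single href that contains a comma
const twoHrefs = ["/a.css", "/b.css"]; // two separate hrefs

console.log(naiveKey(oneHref) === naiveKey(twoHrefs)); // true: both lists join to the same string
console.log(safeKey(oneHref) === safeKey(twoHrefs)); // false: the serialized lists differ
```

With the naive join, two different stylesheet lists end up in the same block; an unambiguous serialization before hashing avoids that.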

Add derivePathGroupKey and deriveStylesheetGroupKey, two independent
"blocking key" functions (record-linkage sense) that turn a page's URL
path segments and loaded-stylesheet list into coarse partition keys.
These feed a future classifier's homogeneous-group discovery step, ahead
of the expensive structural comparison tokenize()/jaccardSimilarity()
already provide.

Export HASH_LENGTH from hash-content.ts so deriveStylesheetGroupKey
reuses the same hash truncation length instead of picking its own.

Add "hrefs" to the cspell word list, used across the new files' JSDoc
and tests.
@YusukeHirao YusukeHirao requested a review from yusasa16 as a code owner July 3, 2026 12:06
@YusukeHirao YusukeHirao merged commit c3e06e2 into dev Jul 3, 2026
6 checks passed
@YusukeHirao YusukeHirao deleted the feat/page-cluster-blocking-keys branch July 3, 2026 12:25