
fix: resolve backlinks through redirects (http/https merge, #71) #78

Merged
YusukeHirao merged 4 commits into dev from fix/http-https-backlink-resolution
Jun 14, 2026


@YusukeHirao

Member

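Summary

Resolve incoming links (backlinks / referrers) through redirects at read time, so that backlinks previously split between a redirect source and its destination (e.g. http://x → https://x) are merged onto the canonical page (#71). No normalization is performed: redirect edges are preserved and aggregation happens in the read layer.

Until now, only the report path (getPagesWithRels) resolved through redirects. The viewer/mcp/cli paths (getReferrersOfPage / getPageDetail / listPageLinks) did not, so their backlinks were split. This PR aligns all four paths on the same semantics.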

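Changes

  • crawler getReferrersOfPage: resolves to the final destination in a single hop via COALESCE(target.redirectDestId, target.id). It also returns through/throughId (the URL the anchor actually pointed at, i.e. the redirect source), and the Page.getReferrers/getRequests fallbacks are shaped to return the complete Referrer shape (so the report's [REDIRECTED FROM] note also works on the non-preloaded path).
  • query getPageDetail.inboundLinks / listPageLinks.referrerCount: aggregate across redirects using the same single-hop resolution.
  • Intentional asymmetry: outbound links keep their raw target (an audit signal that the page links to a redirecting URL). Only inbound links are aggregated onto the canonical page.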
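
The single-hop resolution described in the list above can be illustrated with a minimal in-memory sketch. This is not the project's actual query: `PageRow`, `AnchorRow`, `referrersOf`, and the sample rows are hypothetical stand-ins, and the `??` operator plays the role of `COALESCE(target.redirectDestId, target.id)`.

```typescript
// Hypothetical in-memory stand-ins for the pages and anchors tables.
interface PageRow { id: number; url: string; redirectDestId: number | null }
interface AnchorRow { pageId: number; hrefId: number } // page `pageId` links to page `hrefId`
interface Referrer { url: string; through: string; throughId: number }

const pages: PageRow[] = [
  { id: 1, url: "http://x/", redirectDestId: 2 }, // redirects to https://x/
  { id: 2, url: "https://x/", redirectDestId: null },
  { id: 3, url: "https://x/a", redirectDestId: null },
  { id: 4, url: "https://x/b", redirectDestId: null },
];

const anchors: AnchorRow[] = [
  { pageId: 3, hrefId: 1 }, // /a links to the http URL (the redirect source)
  { pageId: 4, hrefId: 2 }, // /b links to the https URL directly
];

function referrersOf(pageId: number): Referrer[] {
  const byId = new Map(pages.map((p) => [p.id, p] as const));
  const result: Referrer[] = [];
  for (const anchor of anchors) {
    const target = byId.get(anchor.hrefId);
    const source = byId.get(anchor.pageId);
    if (!target || !source) continue;
    // In-memory equivalent of COALESCE(target.redirectDestId, target.id):
    // one hop is enough because redirectDestId is assumed to be pre-flattened.
    const resolvedId = target.redirectDestId ?? target.id;
    if (resolvedId === pageId) {
      // `through` keeps the URL the anchor actually pointed at.
      result.push({ url: source.url, through: target.url, throughId: target.id });
    }
  }
  return result;
}
```

In this sketch both anchors count as referrers of `https://x/`, while `http://x/` itself reports none, which matches the merge behavior the PR describes.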

Tests

  • crawler database.spec / query get-page-detail.spec / list-page-links.spec: DB-level merge verification using real http/https URLs (backlinks are not split, through/throughId are returned, and the redirect-source side has no backlinks).
  • crawler page.spec: through/throughId mapping of the non-preloaded fallback (eliminates ghost code).
  • E2E (redirect.e2e): verifies end to end, via crawl→archive→getReferrers, that a page pointing at /redirect/start (301→302→dest) is counted as a backlink of the final /redirect/dest (the mechanism is scheme-agnostic).

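Documentation

Added a section "Redirect-transparent resolution of backlinks/references (#71)" to ARCHITECTURE.md (rationale for single-hop resolution, consistency across read paths, and the design intent behind the inbound/outbound asymmetry).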

Review

Ran /code-review xhigh/qa-engineer/product-manager and addressed all findings (fixed the missing through/throughId, added a test for the fallback ghost code, added the E2E test, and added the documentation section).

Closes #71

🤖 Generated with Claude Code

YusukeHirao and others added 4 commits June 14, 2026 09:20
…throughId

getReferrersOfPage now counts an anchor pointing at a redirect source (e.g.
http://x that 301s to https://x) as a referrer of the redirect's final
destination, mirroring getPagesWithRels. redirectDestId is pre-flattened to the
final dest, so COALESCE(target.redirectDestId, target.id) is a single hop.

Also select target.url/id as through/throughId so the Page.getReferrers /
getRequests fallbacks return the full Referrer shape (report's "[REDIRECTED
FROM]" note works on this non-preloaded path too).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
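
The commit above relies on redirectDestId being pre-flattened to the final destination. The PR does not show how that flattening is done, so the following is only a sketch of the idea under assumed names: `flattenRedirects` and the edge map are hypothetical, and the loop guard is an illustrative choice rather than the project's behavior.

```typescript
// Hypothetical sketch: collapse redirect chains so every redirecting page
// maps directly to its final destination (what "pre-flattened" implies).
function flattenRedirects(next: Map<number, number>): Map<number, number> {
  const finalDest = new Map<number, number>();
  for (const start of Array.from(next.keys())) {
    const seen = new Set<number>([start]);
    let current = start;
    // Follow the chain until reaching a page that does not redirect.
    while (next.has(current)) {
      const hop = next.get(current)!;
      if (seen.has(hop)) break; // redirect loop: stop at the last page visited
      seen.add(hop);
      current = hop;
    }
    if (current !== start) finalDest.set(start, current);
  }
  return finalDest;
}

// Example chain: page 1 redirects to page 2, which redirects to page 3.
const rawRedirects = new Map<number, number>([
  [1, 2],
  [2, 3],
]);
const flattened = flattenRedirects(rawRedirects);
```

With the chain flattened ahead of time, readers never need to walk it: a lookup on page 1 or page 2 yields page 3 directly, which is why a single COALESCE suffices at read time.
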
getPageDetail.inboundLinks and listPageLinks.referrerCount now resolve incoming
links through redirects, so links to a redirect source merge onto the canonical
destination (#71) instead of splitting across the http/https pair. Same
single-hop COALESCE semantics as crawler's redirectTable().

Outbound links intentionally stay raw (audit signal that a page links to a
redirecting URL); documented inline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
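
The inbound/outbound asymmetry in the commit above can be sketched as two small functions over the same data. All names here (`LinkTarget`, `LinkRow`, `referrerCount`, `outboundLinks`) are hypothetical illustrations, not the query package's API.

```typescript
// Hypothetical rows: a page, and a link from page `fromId` to page `toId`.
interface LinkTarget { id: number; url: string; redirectDestId: number | null }
interface LinkRow { fromId: number; toId: number }

const targets: LinkTarget[] = [
  { id: 1, url: "http://x/", redirectDestId: 2 }, // redirects to https://x/
  { id: 2, url: "https://x/", redirectDestId: null },
  { id: 3, url: "https://x/a", redirectDestId: null },
];
const links: LinkRow[] = [{ fromId: 3, toId: 1 }]; // /a links to the http URL

const targetById = new Map(targets.map((t) => [t.id, t] as const));

// Inbound: count links whose redirect-resolved target is the given page.
function referrerCount(pageId: number): number {
  return links.filter((link) => {
    const target = targetById.get(link.toId);
    return target !== undefined && (target.redirectDestId ?? target.id) === pageId;
  }).length;
}

// Outbound: report the raw URL the page links to, without resolving redirects,
// so a link to a redirecting URL stays visible as an audit signal.
function outboundLinks(pageId: number): string[] {
  return links
    .filter((link) => link.fromId === pageId)
    .map((link) => targetById.get(link.toId)?.url)
    .filter((url): url is string => url !== undefined);
}
```

Here the link from `/a` is counted toward `https://x/` on the inbound side, yet `/a` still lists `http://x/` as its outbound target.
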
A page linking /redirect/start (which 301/302s to /redirect/dest) now shows up
as a referrer of the final /redirect/dest, with through pointing at the redirect
source. Without redirect-resolved referrers the destination's backlinks are
empty. Uses the existing http->http chain (the mechanism is scheme-agnostic).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an ARCHITECTURE.md section explaining that incoming links resolve through
redirects (single-hop COALESCE on pre-flattened redirectDestId), the read-path
consistency across getPagesWithRels / getReferrersOfPage / getPageDetail /
listPageLinks, and the intentional inbound/outbound asymmetry (outbound stays
raw for audit visibility — do not "unify" it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@YusukeHirao YusukeHirao merged commit 795a858 into dev Jun 14, 2026
5 checks passed
@YusukeHirao YusukeHirao deleted the fix/http-https-backlink-resolution branch June 14, 2026 08:28


Development

Successfully merging this pull request may close these issues.

fix(crawler): the same URL under http and https is registered twice as separate pages
