fix(crawler): the same URL under http and https is registered twice as separate pages

## Summary

The same resource, differing only in scheme (`http://…` and `https://…`), is **registered twice as separate records** in the `pages` table, which splits its pages, backlinks, and resources.

## Root cause

`#getIdByUrl` / `#insertPage` in `packages/@nitpicker/crawler/src/archive/database.ts` **insert the URL string as-is** and do not normalize the scheme (`database.ts:958-985`). `pages.url` has a unique constraint, but `http://example.com/a` and `https://example.com/a` are distinct keys under it.

The crawler-side `protocolAgnosticKey` (`packages/@nitpicker/crawler/src/crawler/protocol-agnostic-key.ts:9-11`) is used only for dedup in `LinkList` and **is not propagated to DB insertion**.
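The mismatch can be reproduced in isolation. The following is a minimal sketch, not the actual `database.ts` code: `RawUrlTable` is a hypothetical in-memory stand-in for a table with a unique constraint on the raw URL string, and `schemeAgnosticKey` is a hypothetical stand-in for `protocolAgnosticKey`, whose real implementation is not shown in this issue.

```typescript
// Hypothetical stand-in for the crawler's protocolAgnosticKey.
// Assumption: it drops a leading "http:" or "https:" and nothing else.
function schemeAgnosticKey(url: string): string {
  return url.replace(/^https?:/i, "");
}

// Hypothetical in-memory model of a table whose unique key is the raw URL.
class RawUrlTable {
  private idByUrl = new Map<string, number>();
  private nextId = 1;

  // Returns the existing id for this exact string, or inserts a new row.
  getOrInsert(url: string): number {
    const existing = this.idByUrl.get(url);
    if (existing !== undefined) return existing;
    const id = this.nextId++;
    this.idByUrl.set(url, id);
    return id;
  }

  count(): number {
    return this.idByUrl.size;
  }
}

const rawTable = new RawUrlTable();
const idA = rawTable.getOrInsert("http://example.com/a");
const idB = rawTable.getOrInsert("https://example.com/a");

// The link-level key treats both URLs as one entry...
const sameLinkKey =
  schemeAgnosticKey("http://example.com/a") ===
  schemeAgnosticKey("https://example.com/a");

// ...while the raw-string unique key lets both rows in.
const splitInTable = idA !== idB && rawTable.count() === 2;
```

Under these assumptions both `sameLinkKey` and `splitInTable` are true: the two layers disagree on what counts as the same URL.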

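## Impact / evidence

In a production crawl archive, the same PDF was split into two internal pages (`isTarget=1`):

- `http://…/foo.pdf` (id A)
- `https://…/foo.pdf` (id B)

The referring pages were split as well (some point to one URL, some to the other), so the viewer shows the same file as two pages with its backlinks divided between them.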

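## Considerations

Full scheme normalization has side effects (some sites may actually serve different content over http and https). The policy needs to be decided together with how scope and canonical are handled.

Candidates:

- At minimum, dedup **only scheme differences within the same scope** (consistent with the `(hostname, port, path)` scope check).
- In `#getIdByUrl`, look up existing rows by a protocol-agnostic key and, on a hit, consolidate into that page.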
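The second candidate could look roughly like the sketch below. Everything here is hypothetical: `PageStore`, `dedupKey`, and the in-memory maps stand in for the real `#getIdByUrl` and SQL lookup, and the scope check from the first candidate is omitted. The key is assumed to be host (including any non-default port), path, and query; the fragment is dropped.

```typescript
interface PageRow {
  id: number;
  url: string; // the URL as first seen, stored unchanged
}

// Hypothetical lookup key: ignores the scheme for http/https only.
// Other schemes fall back to the raw string so they are never merged.
function dedupKey(rawUrl: string): string {
  const u = new URL(rawUrl);
  if (u.protocol !== "http:" && u.protocol !== "https:") return rawUrl;
  // u.host omits default ports (80 for http, 443 for https).
  return `//${u.host}${u.pathname}${u.search}`;
}

// Hypothetical in-memory store that consolidates scheme variants.
class PageStore {
  private rows: PageRow[] = [];
  private idByKey = new Map<string, number>();

  // Returns the id of an existing row with the same scheme-agnostic key,
  // or inserts a new row when there is none.
  getIdByUrl(url: string): number {
    const key = dedupKey(url);
    const hit = this.idByKey.get(key);
    if (hit !== undefined) return hit;
    const id = this.rows.length + 1;
    this.rows.push({ id, url });
    this.idByKey.set(key, id);
    return id;
  }

  count(): number {
    return this.rows.length;
  }
}

const store = new PageStore();
const pdfHttp = store.getIdByUrl("http://example.com/foo.pdf");
const pdfHttps = store.getIdByUrl("https://example.com/foo.pdf");
const otherPage = store.getIdByUrl("https://example.com/bar.pdf");
```

In this sketch `pdfHttp` and `pdfHttps` resolve to the same id and `otherPage` gets its own. Which variant's URL is kept in the row (first seen here) is one of the design decisions the canonical policy would have to settle.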

## Priority note

Impact is more limited than ① (duplicate anchors/images). Because a design decision is involved, this is a bug that leans toward an enhancement.

