From fa61cdad4211d14a168d1fe603f4714e63a69139 Mon Sep 17 00:00:00 2001 From: Yusuke Hirao Date: Fri, 3 Jul 2026 22:57:25 +0900 Subject: [PATCH 1/3] docs(repo): document viewer_anchor_facts read model design (issue #114) Records the read/write/storage-optimized design for broken/external link listing, superseding the destination-summary-only approach from PR #157. --- ARCHITECTURE.md | 25 +++++++++++++++++-------- CLAUDE.md | 2 +- README.md | 2 +- 3 files changed, 19 insertions(+), 10 deletions(-) diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 12edd194..538d88ed 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -178,9 +178,10 @@ crawler/src/ - **`getSummary`**: サイト全体の統計(内部/外部のページ数とコンテンツ数、ステータス分布、Content-Type 分布、メタデータ充足率) - **`getPageDetail`**: 単一ページの詳細情報(メタデータ、アウトバウンド/インバウンドリンク、リダイレクト元) - **`getPageHtml`**: HTML スナップショット取得(truncation サポート) -- **`listLinks`**: リンク分析(`type: 'broken' | 'external'`、anchor 単位 = 1 行 1 `` タグ、重複排除なし)。dest は `pages.redirectDestId` 経由で canonical destination まで解決した上で broken/external 判定(`includeRedirectSources: true` で解決を無効化し literal を見る)。関数自体は変更していないため CLI/MCP は従来通り `type: 'external'` で anchor 単位の生データを取得できるが、**viewer の `/api/links?type=external` だけは `listExternalLinks` に切り替え済み**(後述) — 「外部リンク」ビューは宛先ごとに集約した一覧を必要とするため -- **`listExternalLinks`**: viewer の「外部リンク」ビュー用の legacy 経路(read model が無い/古いアーカイブのフォールバック)。外部リンク先を canonical destination(`listLinks` と同じ `COALESCE(canonical.*, dest.*)` 解決パターン)ごとに `GROUP BY` で重複排除し、`referrerCount`(`COUNT(DISTINCT source.id)` — 同一ページからの複数アンカーは 1 件として数える)を付与した一覧。ページネーションの `total` は distinct 宛先数(anchor 数ではない)を GROUP BY サブクエリでラップして算出 — `paginateQuery` ヘルパーは素朴な `count(idColumn)` のため GROUP BY 済みクエリと非互換で使えない。宛先の詳細(参照元ページ一覧)は新規ビューを作らず既存の `getPageDetail`(`isExternal`/`scraped` 制約なし)の `inboundLinks` をそのまま再利用する。**`viewer_external_links` read model が current な場合は `listViewerExternalLinks` に切り替わる**(後述の「設計注意(外部リンク read model)」参照)— この関数自体はそのフォールバックとして無変更のまま残る -- **`listViewerExternalLinks`**: `viewer_external_links` read 
model 専用の fast path。`listExternalLinks` と同じオプション/レスポンス形だが、集計(JOIN + GROUP BY + COUNT DISTINCT)は read model ビルド時に1回だけ実行済みなので、実行時は単純な indexed SELECT + `paginateQuery`(GROUP BY 不要になったため素朴な helper がそのまま使える) +- **`listLinks`**: リンク分析(`type: 'broken' | 'external'`、anchor 単位 = 1 行 1 `` タグ、重複排除なし)。dest は `pages.redirectDestId` 経由で canonical destination まで解決した上で broken/external 判定(`includeRedirectSources: true` で解決を無効化し literal を見る)。関数自体は変更していないため CLI/MCP は従来通り `type: 'broken' | 'external'` で anchor 単位の生データを取得できる。viewer 側は `type: 'external'` は `listExternalLinks`/`listViewerExternalLinks`、`type: 'broken'` は `listViewerBrokenLinks` が current な read model を持つ場合に切り替わり(後述)、この関数は両方の legacy フォールバックとしてのみ残る +- **`listExternalLinks`**: viewer の「外部リンク」ビュー用の legacy 経路(read model が無い/古いアーカイブのフォールバック)。外部リンク先を canonical destination(`listLinks` と同じ `COALESCE(canonical.*, dest.*)` 解決パターン)ごとに `GROUP BY` で重複排除し、`referrerCount`(`COUNT(DISTINCT source.id)` — 同一ページからの複数アンカーは 1 件として数える)を付与した一覧。ページネーションの `total` は distinct 宛先数(anchor 数ではない)を GROUP BY サブクエリでラップして算出 — `paginateQuery` ヘルパーは素朴な `count(idColumn)` のため GROUP BY 済みクエリと非互換で使えない。宛先の詳細(参照元ページ一覧)は新規ビューを作らず既存の `getPageDetail`(`isExternal`/`scraped` 制約なし)の `inboundLinks` をそのまま再利用する。**`viewer_external_links` read model が current な場合は `listViewerExternalLinks` に切り替わる**(後述の「設計注意(viewer_anchor_facts read model、issue #114)」参照)— この関数自体はそのフォールバックとして無変更のまま残る +- **`listViewerExternalLinks`**: `viewer_external_links` read model 専用の fast path。`listExternalLinks` と同じオプション/レスポンス形。集計は read model ビルド時、`viewer_anchor_facts` を組み立てるのと同じ `anchors` スキャン1回から**メモリ上で**導出済み(`deriveExternalLinkSummaryRows`、issue #114 で `computeExternalLinkRows` の独自スキャンを置き換え)なので、実行時は単純な indexed SELECT + `paginateQuery`(GROUP BY 不要になったため素朴な helper がそのまま使える) +- **`listViewerBrokenLinks`**: `viewer_anchor_facts` read model 専用の fast path(issue #114)。`viewer_pages` と同じ4系統(初回/forward keyset/backward keyset/offset直読み)のcursorページネーションを実装し、`/api/links?type=broken` の `nextCursor`/`prevCursor` 
契約を担う。`urlPattern`(source/dest 2列に跨る LIKE)と `includeRedirectSources`(read modelは正規化済みdestinationしか持たない)を指定された場合は fast path が使えないため `listLinks` にフォールバックする — `/api/pages` の `urlPattern`/`directory` 除外と同じ考え方 - **`listIsolatedPages`** / **`listIsolatedClusters`** / **`getIsolatedCluster`**: inventory subgraph の **完全孤立** (singleton) / **孤立集合** (connected component, size ≥ 2)。crawled-wins downgrade の不変量により crawled 行は定義上 isolated 判定から除外される。cluster の edge は redirect 解決済み anchor を無向で見た weakly connected component(共通ヘルパー `compute-isolated-clusters.ts` が `resolve-redirect-chain` + union-find で計算) - **`listResources`**: サブリソース一覧(CSS, JS, 画像、フォント) - **`listImages`**: 画像一覧(alt 欠損、寸法欠損、オーバーサイズ検出) @@ -255,7 +256,7 @@ nitpicker viewer → SIGINT/SIGTERM: manager.closeAll() → server.close() → resolve(CLI が exit) ``` -**REST API(アーカイブは起動時固定なので archiveId 不要):** `GET /api/summary`, `/api/pages`(`hasCSP`/`hasXFrameOptions`/`hasXContentTypeOptions`/`hasHSTS` の 4 列を含む。旧 `/api/headers`・「Headers」ビューは「ページ」ビューへ統合済み、CLI/MCP 向けの `checkHeaders` 自体は残存), `/api/pages/detail?url=`, `/api/pages/html?url=`, `/api/links?type=`(`broken` は `listLinks` 経由で anchor 単位のまま、canonical destination が HTTP 404 のみ。403/5xx/未取得(NULL) は broken 扱いしない。`external` は canonical destination ごとに重複排除され `referrerCount` を返す — read model が current なら `listViewerExternalLinks`、そうでなければ `listExternalLinks` にフォールバック(`/api/pages` と同じ二層構成)。宛先の参照元一覧は新規エンドポイントを作らず既存の `/api/pages/detail` の inboundLinks を再利用する), `/api/resources`, `/api/resources/referrers?resourceUrl=`, `/api/images`, `/api/violations`, `/api/duplicates`, `/api/mismatches`, `/api/graph`(内部ページのリンクグラフ、`getLinkGraph`), `/api/directory-tree`(全 root の初期 3 depth ツリー、`getDirectoryTree`), `/api/directory-tree/children?nodeId=`(1 ノード直下の子ディレクトリ、`listDirectoryChildren`), `/api/directory-tree/pages?nodeId=&cursor=&limit=`(1 ディレクトリ直下ページの cursor 一覧、`listDirectoryPages`), `/api/info`(開いているアーカイブの絶対パス、フッター表示用)。クエリパラメータ → query options 変換は `query-params/to-number.ts` / `to-boolean.ts`、エラーは 
`sanitize-error-message.ts` で絶対パスを伏せて JSON 返却(mcp-server と同方針)。旧 `/api/page-links`(`listPageLinks`)は「ページリンク」ビューの廃止に伴い削除 — per-page の status/referrers/redirect-from は Page Detail ビュー(`/api/pages/detail`)の inbound/outbound/redirectFrom で個別ページ単位に確認する。`getPageDetail` は `isSkipped`/`skipReason`(robots.txt / `excludeUrls` による除外理由)も返すようになり、URL 既知の場合は除外理由を引き続き確認できる。**受容したギャップ**: `listPages` / `listPagesByTag` / `listPagesByJsonLdType` はすべて `scraped = 1` 前提のため、「除外されて一度も取得されていない URL 一覧」を一括列挙する手段は無くなった(旧 `listPageLinks` だけが `scraped` 制約なしだった)。URL が分かっていれば `getPageDetail` で確認できるが、一括把握が必要な場合は `nitpicker query error-kinds` や archive の `pages` テーブルを直接クエリすること。 +**REST API(アーカイブは起動時固定なので archiveId 不要):** `GET /api/summary`, `/api/pages`(`hasCSP`/`hasXFrameOptions`/`hasXContentTypeOptions`/`hasHSTS` の 4 列を含む。旧 `/api/headers`・「Headers」ビューは「ページ」ビューへ統合済み、CLI/MCP 向けの `checkHeaders` 自体は残存), `/api/pages/detail?url=`, `/api/pages/html?url=`, `/api/links?type=`(`broken` は canonical destination が HTTP 404 のみ(403/5xx/未取得(NULL) は broken 扱いしない)を `nextCursor`/`prevCursor` 付きで返す — read model が current かつ `urlPattern`/`includeRedirectSources` 未指定なら `listViewerBrokenLinks`(`viewer_anchor_facts` fast path、keyset cursor)、そうでなければ `listLinks`(legacy、anchor 単位、offset を文字列化した疑似cursor)にフォールバック。`external` は canonical destination ごとに重複排除され `referrerCount` を返す — read model が current なら `listViewerExternalLinks`、そうでなければ `listExternalLinks` にフォールバック(同じ二層構成だが除外条件なし)。宛先の参照元一覧は新規エンドポイントを作らず既存の `/api/pages/detail` の inboundLinks を再利用する), `/api/resources`, `/api/resources/referrers?resourceUrl=`, `/api/images`, `/api/violations`, `/api/duplicates`, `/api/mismatches`, `/api/graph`(内部ページのリンクグラフ、`getLinkGraph`), `/api/directory-tree`(全 root の初期 3 depth ツリー、`getDirectoryTree`), `/api/directory-tree/children?nodeId=`(1 ノード直下の子ディレクトリ、`listDirectoryChildren`), `/api/directory-tree/pages?nodeId=&cursor=&limit=`(1 ディレクトリ直下ページの cursor 一覧、`listDirectoryPages`), `/api/info`(開いているアーカイブの絶対パス、フッター表示用)。クエリパラメータ → query options 変換は 
`query-params/to-number.ts` / `to-boolean.ts`、エラーは `sanitize-error-message.ts` で絶対パスを伏せて JSON 返却(mcp-server と同方針)。旧 `/api/page-links`(`listPageLinks`)は「ページリンク」ビューの廃止に伴い削除 — per-page の status/referrers/redirect-from は Page Detail ビュー(`/api/pages/detail`)の inbound/outbound/redirectFrom で個別ページ単位に確認する。`getPageDetail` は `isSkipped`/`skipReason`(robots.txt / `excludeUrls` による除外理由)も返すようになり、URL 既知の場合は除外理由を引き続き確認できる。**受容したギャップ**: `listPages` / `listPagesByTag` / `listPagesByJsonLdType` はすべて `scraped = 1` 前提のため、「除外されて一度も取得されていない URL 一覧」を一括列挙する手段は無くなった(旧 `listPageLinks` だけが `scraped` 制約なしだった)。URL が分かっていれば `getPageDetail` で確認できるが、一括把握が必要な場合は `nitpicker query error-kinds` や archive の `pages` テーブルを直接クエリすること。 **バイナリ:** なし(CLI の `viewer` サブコマンド経由で起動) @@ -341,13 +342,21 @@ nitpicker viewer > > **`getDirectoryTree` の ORDER BY は `path_sort_key` 単独、`root_key` を含めない**: 全 root を 1 クエリで返す設計上、`root_key` の等価フィルタが存在しないため、`vdn_root_depth_path (root_key, depth, path_sort_key, node_id)` のような `root_key` 先頭 index は `depth <= 3` という range 条件との組み合わせで一切活用できず、`EXPLAIN QUERY PLAN` で実測すると `USE TEMP B-TREE FOR LAST TERM OF ORDER BY` が付く(PR #96 の `idx_pages_listfilter` column 順ミスと同型の教訓)。`path_sort_key` を先頭に置いた `vdn_path_depth (path_sort_key, depth, node_id)` に張り替え、`ORDER BY path_sort_key` のみに変更することで `SCAN ... 
USING INDEX vdn_path_depth`(sort 無し、`depth` は残差フィルタ)に収まることを確認済み。root_key を ORDER BY から外しても、grouping は JS 側で `Map` に振り分けるだけなので各 root 内の相対順序(`path_sort_key` 昇順)は保たれる。**検索キーワード**: 「directory-tree」「ディレクトリツリー」「has_children」「vdn_path_depth」「USE TEMP B-TREE」。 -> **設計注意(外部リンク read model):** `listExternalLinks`(PR #153)は `anchors JOIN pages(source) JOIN pages(dest) LEFT JOIN pages(canonical)` を `COALESCE` 計算列で `GROUP BY` し `COUNT(DISTINCT source.id)` を求める形で、リクエストごとにこの JOIN+集計を(`total` 用サブクエリと data 用の)2 回実行していた。SQLite は `COUNT(DISTINCT ...)` で既存 index を使わず別の b-tree を都度構築することが知られており(SQLite forum 実測: `count(distinct id)` 単体 6.4 秒、他の集約と同一クエリに混ぜると 55.2 秒まで悪化する例が報告されている)、`GROUP BY` も式インデックス(`CREATE INDEX` の式と `WHERE`/`GROUP BY` の式が構文的に完全一致しないと使われない)では確実に解決できない。回避策として同フォーラムが推奨するのは集計をあらかじめ一時テーブルに書き出す方式で、これは本リポジトリの `viewer_pages`/`viewer_directory_nodes`(issue #106〜#112)と同じ「read model を作って計測してから最適化する」方針そのものである。 +> **設計注意(viewer_anchor_facts read model、issue #114):** `listExternalLinks`(PR #153)は `anchors JOIN pages(source) JOIN pages(dest) LEFT JOIN pages(canonical)` を `COALESCE` 計算列で `GROUP BY` し `COUNT(DISTINCT source.id)` を求める形で、リクエストごとにこの JOIN+集計を(`total` 用サブクエリと data 用の)2 回実行していた。SQLite は `COUNT(DISTINCT ...)` で既存 index を使わず別の b-tree を都度構築することが知られており(SQLite forum 実測: `count(distinct id)` 単体 6.4 秒、他の集約と同一クエリに混ぜると 55.2 秒まで悪化する例が報告されている)、`GROUP BY` も式インデックス(`CREATE INDEX` の式と `WHERE`/`GROUP BY` の式が構文的に完全一致しないと使われない)では確実に解決できない。加えて `/api/links?type=broken`(`listLinks`)は fast path を持たず 13-16 秒級の anchor スキャンのまま、ページネーションも offset ベースで `#103` 自身の "Do not introduce large OFFSET based pagination for virtualized lists" に反していた。issue #114 は broken/external 両方を `viewer_anchor_facts` に載せる設計を提示していたが、実装時に以下3点を **read/write/storage のいずれも妥協しない** 基準で再検討した。 > -> `viewer_external_links`(`dest_page_id` PK / `dest_url` / `status` / `referrer_count`)は `buildViewerReadModel` の同じトランザクション内で `computeExternalLinkRows`(`viewer-read-model/compute-external-link-rows.ts`)が構築する。集計ロジック(`COALESCE` 
解決・`COUNT(DISTINCT source.id)`)は `listExternalLinks` から一切変更せずそのまま移植 — `referrerCount` は `getPageDetail.inboundLinks`(#71)と同じ数え方(重複アンカーは 1 referrer)を保つ契約があるため。`viewer_pages`/directory tree と違い、`sourceRows`(`pages` のみ)を再利用できず `anchors` への専用クエリが必要(リンク情報は `anchors` にしかない)。 +> 1. **`url_refs`/`content_items`(issue #139 のref-table方式)は使わない**: `#114` が参照するドキュメント上のスキーマは正規化されたURL辞書テーブルを前提にするが、`#139` はまだ着手されておらず、`#103` 自身の実行順序も `#139` を `#114` より後(16番目 vs 7番目)に置いている。今すぐref-table化するのは前提条件が揃っていない。代わりに `viewer_pages.url_sort_key` と同じ発想で `source_url_sort_key`/`dest_url_sort_key` を build 時にコピーしたテキスト列とする——indexed `ORDER BY` にはjoin前のsort keyが必須で、これは避けられないコスト。ただしフルURLを複数列・複数箇所に複製するのではなく「表示に使う実際の値そのもの」を1列に持たせるだけに絞り、`viewer_pages`のように`url`と`url_sort_key`を別々に複製することもしない。実測: 5万ページ・40万anchor規模で追加DBサイズ152 MiB(後述のベンチマーク)——`#114`が警告する1300万行規模での「+5GB」は現実のベンチマーク規模(40万行オーダー、CLAUDE.md/ARCHITECTURE.mdの既存ベンチマーク全てがこの規模)とは2桁以上異なり、今この規模で正規化コストを払う判断はしない。将来的に本当に1300万行規模に達したら `#139` のref-table化を検討する、というスコープの切り方。 +> 2. **`viewer_external_links` はテーブルとして分離したまま維持し、`viewer_anchor_facts` から1回のスキャンで導出する**: `viewer_anchor_facts` は `(source_page_id, dest_page_id)` ペア単位でdedup済みのedgeテーブルなので、宛先ごとのreferrerCountは「そのdest_page_idを持つedge行の数」を数えるだけで求まる(edge単位で既にdedup済みなので `COUNT(DISTINCT source)` と数学的に同値)——`GROUP BY`のランタイム再導入は不要。`compute-anchor-fact-rows.ts` が `anchors` を1回だけスキャンし、その結果(メモリ上の配列)から `derive-external-link-summary-rows.ts`(純粋関数、DBアクセス無し)が `viewer_external_links` 行を導出する。旧 `compute-external-link-rows.ts`(独自の2回目の `anchors` スキャン)は廃止。テーブルを統合して edge 単位一本化すると External Links ビューの「宛先ごとの参照元数」というUXが失われる(PR #153 のUX決定を破壊する)ため、2つの独立したテーブルとして残す判断をした。 +> 3. 
**Broken Links は edge dedup(`count` 列)を採用**: 同一 `(source_page_id, dest_page_id)` ペアの重複アンカー(同じリンクがヘッダー/フッターに複数回出現する等)は1行に集約し `count` で観測数を持つ。read(走査行数減)・write(build時に1回集約するだけ)・storage(重複edgeの行数削減)のいずれでも1anchor=1行より優れる——`listLinks`(legacy)とは総件数が変わり得るが、これは `/api/pages` の plain sort vs natural sort と同種の、根拠のある fast path/legacy 分岐として受容する。 > -> **keyset cursor ではなく `paginateQuery`(offset ベース)を使う**: `viewer_pages` が `status_sort_key`/`status_desc_key`/`NULL_STATUS_SENTINEL` という仕掛けを持つのは keyset cursor 特有の要件(SQL の 3 値論理で `NULL` 比較が壊れる、`DESC` を常に `ASC` 方向スキャンにする必要がある)で、`/api/links?type=external` の REST 契約はそもそも offset ベースのまま変更していないため、この複雑さは不要。`viewer_external_links` の 3 index(`vel_url` / `vel_status` / `vel_referrer_count`)はいずれも単純な単方向 index で、`DESC` は同じ index の逆順スキャンで足りる。 +> **スキーマ**: `viewer_anchor_facts(edge_id PK, source_page_id, dest_page_id, source_url_sort_key, dest_url_sort_key, status, status_sort_key, status_desc_key, count, is_broken, is_external_link)`。`is_external_link` は永続化するが(SQLite の INTEGER 0/1 は実質無コスト)index は張らない——read時にこの列でフィルタするクエリは存在せず、build時の `deriveExternalLinkSummaryRows` の in-memory pass でのみ使われるため。`status_desc_key`(`viewer_pages` と同じ負数キー)が必要な理由: `docs/viewer-sql-query-plan.md` の Stable Ordering 規則は `status desc` でも `source_url_sort_key`/`edge_id` のタイブレークを ASC のまま保つが、row-value keyset タプル比較は列ごとに方向を混在させられないため、主キーを負数化して常に ASC スキャンにする(`sourceUrl`/`destUrl` は主キーとタイブレークが同方向に揃うため、この仕掛けは不要)。 > -> **fast path / legacy の二層構成**: `register-links-route.ts` は `/api/pages` と同じパターンで `isViewerReadModelCurrent` を見て `listViewerExternalLinks`(fast path)と `listExternalLinks`(legacy、無変更のまま残存)を切り替える。`urlPattern`/`status` はどちらの経路でも同じ列に対応するため、`/api/pages` の `hasCSP` 等のような「特定フィルタ指定時は強制 legacy」という除外条件は無い。スキーマ変更を伴うため `VIEWER_READ_MODEL_SCHEMA_VERSION` を 4→5 に bump し、旧バージョンの read model は自動再ビルド対象にした。**検索キーワード**: 「external links」「外部リンク」「COUNT DISTINCT」「viewer_external_links」「GROUP BY 遅い」。 +> **cursor pagination**: `viewer-anchor-facts-cursor/` を `viewer-pages-cursor/` 
を模した専用モジュールとして新設(既存2つの cursor 実装 `viewer-pages-cursor`/`directory-pages-cursor` のどちらも汎用ジェネリック化されていない慣習に合わせ、共有モジュール化はしない)。`listViewerBrokenLinks` は `listViewerPages` と同じ4系統(初回/forward keyset/backward keyset/offset直読み)を実装するが、`source_url_sort_key`/`dest_url_sort_key`/`status` が既にwriteモデルへの再joinなしで表示可能な値そのものなので、`listViewerPages`のような「id解決→limit後にwideテーブルへjoin」というステップが不要——`viewer_anchor_facts`単体へのSELECTがそのまま最終結果になる。 +> +> **ページネーション契約の変更**: `/api/links?type=broken` は従来 `{items, total}` のみのoffsetベース応答だったが、`#103`の"Do not introduce large OFFSET based pagination for virtualized lists"を満たすため`nextCursor`/`prevCursor`を持つ契約に変更した(`/api/pages`が`listPages`→`listViewerPages`昇格時に行った移行と同型)。legacy (`listLinks`) 経路は `buildLegacyPagesCursors`(offsetを文字列化した疑似cursor)で同じ契約を満たし、フロントの `useLinksInfinite`(`use-links-infinite.ts`)は`nextCursor`のみを見ればどちらの経路でも動作する。**列・ソート・フィルタの見た目のUIは無変更**——`broken-links-view.tsx`はsourceUrl/destUrl/statusの3列のみ表示し、変更したのはhookの内部実装(offset→cursor)のみ。MPAページネーション(`usePagedQuery`経由)はこの変更と無関係(`/api/links`はplainなoffset/limitも引き続き受け付ける)。 +> +> **fast path / legacy の二層構成**: `register-links-route.ts` は `external`/`broken` 両方で `isViewerReadModelCurrent` を見て切り替える。`external`は除外条件なし(`urlPattern`/`status`が両経路の同じ列に対応)。`broken`は `urlPattern`(source/dest 2列に跨るLIKEで単一indexで満たせない)または `includeRedirectSources`(read modelは正規化済みdestinationしか持たない)が指定された場合に強制legacy——`/api/pages`の`urlPattern`/`directory`除外と同じ考え方。スキーマ変更を伴うため `VIEWER_READ_MODEL_SCHEMA_VERSION` を 5→6 に bump。 +> +> **ベンチマーク実測**(`scripts/bench-viewer-anchor-facts.mjs`、synthetic archive、実顧客データ不使用): 5万ページ・40万anchor規模で、read model build time 5.9秒、追加DBサイズ152 MiB(`viewer_anchor_facts`はedge dedup後35万行)。`/api/links?type=broken`は `sourceUrl`/`destUrl`/`status` 昇順・降順の全5パターンで `EXPLAIN QUERY PLAN` が一貫して `SEARCH viewer_anchor_facts USING COVERING INDEX vaf_broken_*` (`TEMP B-TREE`無し)となり、**warm p50 1.2ms、p95 1.2-1.8ms**、cold(初回) 1.2-9.6ms——`docs/viewer-sql-query-plan.md`のtarget(20-80ms)を大幅に上回る。旧 `listLinks` の13-16秒から数千倍の改善。**検索キーワード**: 「broken 
links」「external links」「COUNT DISTINCT」「viewer_anchor_facts」「viewer_external_links」「GROUP BY 遅い」「issue #114」。 ### @nitpicker/cli diff --git a/CLAUDE.md b/CLAUDE.md index c70a8e90..2ec491b1 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -82,7 +82,7 @@ packages/ > **Note (ディレクトリツリー read model、issue #107)**: `viewer_directory_nodes` / `viewer_directory_pages` は `viewer_pages` を返す `sourceRows` を再利用し `buildDirectoryTreeRows` が純粋関数としてメモリ上に構築する。**root_key はホスト単位、ただし internal ページを 1 件も持たないホストは除外**(外部リンク先ドメインの無意味な 1 ページツリーを防ぐ)。**ディレクトリ/ページ境界は末尾スラッシュで判定**(`/blog/2024/post-1` と `/blog/2024/` は同じ `/blog/2024/` ノードに着地)。**`has_children` は `direct_child_dir_count > 0` のみ**(`direct_page_count` を含めると構築ロジック上絶対に `false` にならないため、UI の展開矢印が意味を持つよう子ディレクトリの有無だけを見る)。この機能に legacy フォールバックは存在しないため、3関数(`getDirectoryTree`/`listDirectoryChildren`/`listDirectoryPages`)とも `hasViewerReadModel` ではなく `isViewerReadModelCurrent` を guard に使う。詳細は ARCHITECTURE.md の `@nitpicker/viewer` 節「設計注意(ディレクトリツリー read model...)」を正とする。 -> **Note (外部リンク read model)**: `listExternalLinks`(PR #153)は `anchors` の JOIN + `COALESCE` 計算列での `GROUP BY` + `COUNT(DISTINCT source.id)` をリクエストごとに(`total` 用と data 用で)2 回実行していた。SQLite の `COUNT(DISTINCT ...)` は既存 index を使わず別 b-tree を都度構築する既知のパフォーマンス病理を持つため(実測: 単体 6.4 秒、他の集約と混ぜると 55.2 秒まで悪化する例が SQLite forum に報告されている)、`viewer_pages`/`viewer_directory_nodes` と同じ read model パターンに乗せた。`viewer_external_links`(`dest_page_id` PK / `dest_url` / `status` / `referrer_count`)は `buildViewerReadModel` 内で `computeExternalLinkRows` が `anchors` への専用クエリ(`sourceRows` 再利用不可 — リンク情報は `pages` にはない)で1回だけ集計して構築する。集計ロジック自体(`COALESCE` 解決、referrer 重複排除)は `listExternalLinks` から無変更で移植 — `getPageDetail.inboundLinks`(#71)とのカウント粒度契約を崩さないため。ページネーションは keyset cursor ではなく `paginateQuery`(offset ベース、REST 契約が offset のままなので不要な複雑さを持ち込まない)。`register-links-route.ts` は `/api/pages` と同じ二層構成で `isViewerReadModelCurrent` を見て `listViewerExternalLinks`(fast path)↔ `listExternalLinks`(legacy、無変更で残存)を切り替える。スキーマ変更のため 
`VIEWER_READ_MODEL_SCHEMA_VERSION` を 4→5 に bump。詳細は ARCHITECTURE.md の `@nitpicker/viewer` 節「設計注意(外部リンク read model)」を正とする。 +> **Note (viewer_anchor_facts read model、issue #114)**: `listExternalLinks`(PR #153)は `anchors` の JOIN + `COALESCE` 計算列での `GROUP BY` + `COUNT(DISTINCT source.id)` を、`listLinks(type:'broken')` はfast pathなしの13-16秒級anchorスキャン+offsetページネーションのまま、それぞれ抱えていた。issue #114 は broken/external 両方を `viewer_anchor_facts` に載せる設計を提示していたが、実装は「read/write/storageのいずれも妥協しない」基準で再検討し、ドキュメント通りのref-table(`url_refs`/`content_items`、issue #139)方式は採用しなかった(#139はまだ未着手で `#103` の実行順序上も `#114` より後)。代わりに `source_url_sort_key`/`dest_url_sort_key` を `viewer_pages.url_sort_key` と同じ発想でインライン複製するのみに絞った。`viewer_anchor_facts`(`edge_id` PK、`(source_page_id, dest_page_id)` ペア単位でdedupし`count`で重複anchorを吸収、`is_broken`/`is_external_link`フラグ、`status_sort_key`/`status_desc_key`)は `compute-anchor-fact-rows.ts` が `anchors` を1回だけスキャンして構築する。`viewer_external_links` はこの1回のスキャン結果から `derive-external-link-summary-rows.ts`(純粋関数、DBアクセス無し)が導出するよう変更——旧 `compute-external-link-rows.ts`(独自の2回目の`anchors`スキャン)は廃止。`listViewerBrokenLinks` は `listViewerPages` と同じ4系統cursorページネーションを実装し、`/api/links?type=broken` の応答契約もoffsetのみから `nextCursor`/`prevCursor` 付きに変更した(`#103`の"large OFFSETを使うな"に対応、フロントのUI見た目は無変更)。`register-links-route.ts` は `external`/`broken` 両方で `isViewerReadModelCurrent` による二層dispatchを持つ(`broken`は`urlPattern`/`includeRedirectSources`指定時に強制legacy)。スキーマ変更のため `VIEWER_READ_MODEL_SCHEMA_VERSION` を 5→6 に bump。5万ページ・40万anchor規模の実測で `viewer_anchor_facts` はwarm p50 1.2ms(旧13-16秒から数千倍改善)。詳細は ARCHITECTURE.md の `@nitpicker/viewer` 節「設計注意(viewer_anchor_facts read model、issue #114)」を正とする。 ## CLI コマンド diff --git a/README.md b/README.md index fd6559e0..db3e9222 100644 --- a/README.md +++ b/README.md @@ -285,7 +285,7 @@ npx @nitpicker/cli viewer-build [--force] - **MPA**: Prev / Next + ページ番号 + ジャンプ入力。現在ページとページサイズはどちらも URL クエリ(`?page=N` / `?pageSize=N`、ともに 1-indexed)に乗るため deep-link / 共有 / ブラウザ戻る/進むが完全に成立する(ページサイズが URL 
に無いと、`?page=5` を共有しても受け手側のサイズ次第で別の行が見えてしまう)。表示件数は 50 / 100 / 200。フィルタ変更で `?page=` は自動クリア、ページサイズ変更時も `?page=` を 1 に戻す(旧オフセットは新しい窓では意味を持たない)。デフォルト値(page=1, pageSize=100)は URL から省略 - **仮想スクロール**: TanStack Query infinite query + TanStack Virtual。**10 万行規模をクライアント全件ロードせず一定メモリで表示**するため、deep-link は捨てて巨大データの探索性を優先したいときの opt-in -モード本体は localStorage(`nitpicker-pagination-mode`)。ページサイズも localStorage(`nitpicker-page-size`)に保存されるが、これは新規タブ・直 URL 訪問時の hint であり、URL の `?pageSize=` が常に優先される。両モードとも backend は同じ `limit`/`offset` API(無改修)。 +モード本体は localStorage(`nitpicker-pagination-mode`)。ページサイズも localStorage(`nitpicker-page-size`)に保存されるが、これは新規タブ・直 URL 訪問時の hint であり、URL の `?pageSize=` が常に優先される。両モードとも同じ REST エンドポイントを叩くが、継続方法はビュー次第: MPA は常に `?page=`/`?pageSize=` から `limit`/`offset` を組み立てる一方、仮想スクロールは Pages / Broken Links では read model のキーセット `nextCursor` を、それ以外のビューでは `limit`/`offset` を使う。 ### Errors ビュー From e53175c69d6731a5cfa894b6e7d8f26624272d9b Mon Sep 17 00:00:00 2001 From: Yusuke Hirao Date: Fri, 3 Jul 2026 22:57:40 +0900 Subject: [PATCH 2/3] feat(query): add viewer_anchor_facts edge read model for broken links (#114) Add viewer_anchor_facts, an edge-deduped (source_page_id, dest_page_id) table backing a new cursor-paginated listViewerBrokenLinks fast path, and derive viewer_external_links from the same single anchors scan instead of a separate GROUP BY + COUNT(DISTINCT) query. Bumps VIEWER_READ_MODEL_SCHEMA_VERSION to 6. 
--- .../src/list-viewer-broken-links.spec.ts | 610 ++++++++++++++++++ .../query/src/list-viewer-broken-links.ts | 307 +++++++++ .../query/src/list-viewer-external-links.ts | 2 +- packages/@nitpicker/query/src/query.ts | 1 + packages/@nitpicker/query/src/types.ts | 57 ++ .../build-anchor-facts-filter-key.spec.ts | 23 + .../build-anchor-facts-filter-key.ts | 16 + .../decode-anchor-facts-cursor.spec.ts | 101 +++ .../decode-anchor-facts-cursor.ts | 89 +++ .../encode-anchor-facts-cursor.spec.ts | 19 + .../encode-anchor-facts-cursor.ts | 10 + .../extract-anchor-facts-sort-values.spec.ts | 25 + .../extract-anchor-facts-sort-values.ts | 15 + .../get-anchor-facts-sort-spec.spec.ts | 47 ++ .../get-anchor-facts-sort-spec.ts | 31 + .../src/viewer-anchor-facts-cursor/types.ts | 101 +++ .../build-viewer-read-model.spec.ts | 59 +- .../build-viewer-read-model.ts | 50 +- .../compute-anchor-fact-rows.spec.ts | 524 +++++++++++++++ .../compute-anchor-fact-rows.ts | 70 ++ .../compute-external-link-rows.spec.ts | 321 --------- .../compute-external-link-rows.ts | 54 -- .../create-viewer-read-model-tables.spec.ts | 18 +- .../create-viewer-read-model-tables.ts | 74 ++- .../derive-external-link-summary-rows.spec.ts | 99 +++ .../derive-external-link-summary-rows.ts | 37 ++ .../drop-viewer-read-model-tables.spec.ts | 4 +- .../drop-viewer-read-model-tables.ts | 5 +- .../viewer-read-model/null-status-sentinel.ts | 26 + .../query/src/viewer-read-model/types.ts | 56 +- .../viewer-read-model-schema-version.ts | 2 +- scripts/bench-viewer-anchor-facts.mjs | 310 +++++++++ 32 files changed, 2738 insertions(+), 425 deletions(-) create mode 100644 packages/@nitpicker/query/src/list-viewer-broken-links.spec.ts create mode 100644 packages/@nitpicker/query/src/list-viewer-broken-links.ts create mode 100644 packages/@nitpicker/query/src/viewer-anchor-facts-cursor/build-anchor-facts-filter-key.spec.ts create mode 100644 packages/@nitpicker/query/src/viewer-anchor-facts-cursor/build-anchor-facts-filter-key.ts 
create mode 100644 packages/@nitpicker/query/src/viewer-anchor-facts-cursor/decode-anchor-facts-cursor.spec.ts create mode 100644 packages/@nitpicker/query/src/viewer-anchor-facts-cursor/decode-anchor-facts-cursor.ts create mode 100644 packages/@nitpicker/query/src/viewer-anchor-facts-cursor/encode-anchor-facts-cursor.spec.ts create mode 100644 packages/@nitpicker/query/src/viewer-anchor-facts-cursor/encode-anchor-facts-cursor.ts create mode 100644 packages/@nitpicker/query/src/viewer-anchor-facts-cursor/extract-anchor-facts-sort-values.spec.ts create mode 100644 packages/@nitpicker/query/src/viewer-anchor-facts-cursor/extract-anchor-facts-sort-values.ts create mode 100644 packages/@nitpicker/query/src/viewer-anchor-facts-cursor/get-anchor-facts-sort-spec.spec.ts create mode 100644 packages/@nitpicker/query/src/viewer-anchor-facts-cursor/get-anchor-facts-sort-spec.ts create mode 100644 packages/@nitpicker/query/src/viewer-anchor-facts-cursor/types.ts create mode 100644 packages/@nitpicker/query/src/viewer-read-model/compute-anchor-fact-rows.spec.ts create mode 100644 packages/@nitpicker/query/src/viewer-read-model/compute-anchor-fact-rows.ts delete mode 100644 packages/@nitpicker/query/src/viewer-read-model/compute-external-link-rows.spec.ts delete mode 100644 packages/@nitpicker/query/src/viewer-read-model/compute-external-link-rows.ts create mode 100644 packages/@nitpicker/query/src/viewer-read-model/derive-external-link-summary-rows.spec.ts create mode 100644 packages/@nitpicker/query/src/viewer-read-model/derive-external-link-summary-rows.ts create mode 100644 packages/@nitpicker/query/src/viewer-read-model/null-status-sentinel.ts create mode 100644 scripts/bench-viewer-anchor-facts.mjs diff --git a/packages/@nitpicker/query/src/list-viewer-broken-links.spec.ts b/packages/@nitpicker/query/src/list-viewer-broken-links.spec.ts new file mode 100644 index 00000000..15583c43 --- /dev/null +++ b/packages/@nitpicker/query/src/list-viewer-broken-links.spec.ts @@ -0,0 
+1,610 @@ +import path from 'node:path'; + +import { tryParseUrl as parseUrl } from '@d-zero/shared/parse-url'; +import { Archive } from '@nitpicker/crawler'; +import { afterAll, beforeAll, describe, expect, it } from 'vitest'; + +import { listViewerBrokenLinks } from './list-viewer-broken-links.js'; +import { buildViewerReadModel } from './viewer-read-model/build-viewer-read-model.js'; + +const __filename = new URL(import.meta.url).pathname; +const __dirname = path.dirname(__filename); +const workingDir = path.resolve(__dirname, '__test_fixtures_list_viewer_broken_links__'); + +const META = { + lang: null, + title: null, + description: null, + keywords: null, + noindex: false, + nofollow: false, + noarchive: false, + canonical: null, + alternate: null, + 'og:type': null, + 'og:title': null, + 'og:site_name': null, + 'og:description': null, + 'og:url': null, + 'og:image': null, + 'twitter:card': null, +}; + +describe('listViewerBrokenLinks', () => { + let archive: InstanceType<typeof Archive>; + const archiveFilePath = path.resolve( + workingDir, + 'list-viewer-broken-links-test.nitpicker', + ); + + beforeAll(async () => { + const { mkdirSync } = await import('node:fs'); + mkdirSync(workingDir, { recursive: true }); + + archive = await Archive.create({ filePath: archiveFilePath, cwd: workingDir }); + await archive.setConfig({ + baseUrl: 'https://example.com', + name: 'test', + version: '0.10.0', + recursive: true, + interval: 0, + image: true, + fetchExternal: false, + parallels: 1, + roots: ['https://example.com'], + excludes: [], + excludeKeywords: [], + excludeUrls: [], + maxExcludedDepth: 0, + retry: 3, + fromList: false, + disableQueries: false, + userAgent: 'test', + ignoreRobots: false, + }); + + await archive.setPage({ + url: parseUrl('https://example.com/page-a')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 
'Page A' }, + anchorList: [ + { + href: parseUrl('https://example.com/broken-a')!, + isExternal: false, + title: null, + textContent: 'Broken A', + }, + { + href: parseUrl('https://example.com/forbidden')!, + isExternal: false, + title: null, + textContent: 'Forbidden', + }, + { + href: parseUrl('https://example.com/server-error')!, + isExternal: false, + title: null, + textContent: 'Server error', + }, + { + href: parseUrl('https://example.com/never-fetched')!, + isExternal: false, + title: null, + textContent: 'Never fetched', + }, + ], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://example.com/page-b')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Page B' }, + anchorList: [ + { + href: parseUrl('https://example.com/broken-b')!, + isExternal: false, + title: null, + textContent: 'Broken B', + }, + ], + imageList: [], + isSkipped: false, + }); + + await archive.setPage({ + url: parseUrl('https://example.com/broken-a')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 404, + statusText: 'Not Found', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://example.com/broken-b')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 404, + statusText: 'Not Found', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://example.com/forbidden')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 403, + statusText: 'Forbidden', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: 
'', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://example.com/server-error')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 500, + statusText: 'Internal Server Error', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + // No `setPage` call for https://example.com/never-fetched: the anchor + // on page-a above already caused the crawler to insert a discovery + // placeholder row for it (scraped=0, status=NULL) — matching + // list-links.ts's scope note that such rows must never satisfy + // `status = 404`. + + await buildViewerReadModel(archive); + }); + + afterAll(async () => { + if (archive) { + await archive.releaseHandle(); + } + const { rmSync } = await import('node:fs'); + rmSync(workingDir, { recursive: true, force: true }); + }); + + it('returns only 404 destinations, excluding 403/5xx/never-fetched', async () => { + const result = await listViewerBrokenLinks(archive); + expect(result.items.map((item) => item.destUrl).toSorted()).toEqual([ + 'https://example.com/broken-a', + 'https://example.com/broken-b', + ]); + expect(result.total).toBe(2); + }); + + it('reports source, dest, and status but always null textContent (not stored in the fast path)', async () => { + const result = await listViewerBrokenLinks(archive, { sortBy: 'destUrl' }); + expect(result.items[0]).toMatchObject({ + sourceUrl: 'https://example.com/page-a', + destUrl: 'https://example.com/broken-a', + status: 404, + isExternal: false, + textContent: null, + }); + }); + + it('filters by status (broken links are always 404, so a non-404 filter matches nothing)', async () => { + const matching = await listViewerBrokenLinks(archive, { status: 404 }); + expect(matching.total).toBe(2); + const nonMatching = await listViewerBrokenLinks(archive, { status: 500 }); + 
expect(nonMatching.total).toBe(0); + }); + + it('sorts by destUrl ascending', async () => { + const result = await listViewerBrokenLinks(archive, { + sortBy: 'destUrl', + sortOrder: 'asc', + }); + expect(result.items.map((item) => item.destUrl)).toEqual([ + 'https://example.com/broken-a', + 'https://example.com/broken-b', + ]); + }); + + it('status ties (every broken link is 404) still paginate without duplicates or gaps, in both directions', async () => { + // Every row here has the exact same status_sort_key/status_desc_key — + // this is what the source_url_sort_key tie-breaker in the keyset + // tuple exists to disambiguate. + const [pageAsc0, pageAsc1] = await Promise.all([ + listViewerBrokenLinks(archive, { + sortBy: 'status', + sortOrder: 'asc', + limit: 1, + offset: 0, + }), + listViewerBrokenLinks(archive, { + sortBy: 'status', + sortOrder: 'asc', + limit: 1, + offset: 1, + }), + ]); + expect([pageAsc0.items[0]!.destUrl, pageAsc1.items[0]!.destUrl].toSorted()).toEqual([ + 'https://example.com/broken-a', + 'https://example.com/broken-b', + ]); + + const [pageDesc0, pageDesc1] = await Promise.all([ + listViewerBrokenLinks(archive, { + sortBy: 'status', + sortOrder: 'desc', + limit: 1, + offset: 0, + }), + listViewerBrokenLinks(archive, { + sortBy: 'status', + sortOrder: 'desc', + limit: 1, + offset: 1, + }), + ]); + expect([pageDesc0.items[0]!.destUrl, pageDesc1.items[0]!.destUrl].toSorted()).toEqual( + ['https://example.com/broken-a', 'https://example.com/broken-b'], + ); + }); + + it('paginates forward via nextCursor with no duplicates or gaps', async () => { + const page1 = await listViewerBrokenLinks(archive, { sortBy: 'destUrl', limit: 1 }); + expect(page1.items).toHaveLength(1); + expect(page1.nextCursor).not.toBeNull(); + expect(page1.prevCursor).toBeNull(); + + const page2 = await listViewerBrokenLinks(archive, { + sortBy: 'destUrl', + limit: 1, + cursor: page1.nextCursor!, + }); + expect(page2.items).toHaveLength(1); + 
expect(page2.nextCursor).toBeNull(); + expect(page2.prevCursor).not.toBeNull(); + + expect([...page1.items, ...page2.items].map((item) => item.destUrl)).toEqual([ + 'https://example.com/broken-a', + 'https://example.com/broken-b', + ]); + }); + + it('walks backward from a forward cursor via direction: "prev" and restores the same page', async () => { + const page1 = await listViewerBrokenLinks(archive, { sortBy: 'destUrl', limit: 1 }); + const page2 = await listViewerBrokenLinks(archive, { + sortBy: 'destUrl', + limit: 1, + cursor: page1.nextCursor!, + }); + const back = await listViewerBrokenLinks(archive, { + sortBy: 'destUrl', + limit: 1, + cursor: page2.prevCursor!, + direction: 'prev', + }); + expect(back.items).toEqual(page1.items); + }); + + it('supports a direct offset read for MPA page-number jumps', async () => { + const result = await listViewerBrokenLinks(archive, { + sortBy: 'destUrl', + limit: 1, + offset: 1, + }); + expect(result.items).toHaveLength(1); + expect(result.items[0]!.destUrl).toBe('https://example.com/broken-b'); + }); + + it('throws on a cursor minted under a different sort/filter combination', async () => { + const page1 = await listViewerBrokenLinks(archive, { sortBy: 'destUrl', limit: 1 }); + await expect( + listViewerBrokenLinks(archive, { + sortBy: 'sourceUrl', + limit: 1, + cursor: page1.nextCursor!, + }), + ).rejects.toThrow(/does not match/); + }); +}); + +/** + * Mirrors `list-links.spec.ts`'s redirect-resolution coverage: a broken + * anchor reached both directly and via an internal redirect source must + * yield separate edge rows (one per distinct referring page) that + * both report the canonical (post-redirect) destination and status. 
+ */ +describe('listViewerBrokenLinks: redirect resolution', () => { + const redirectWorkingDir = path.resolve( + __dirname, + '__test_fixtures_list_viewer_broken_links_redirect__', + ); + let archive: InstanceType<typeof Archive>; + const archiveFilePath = path.resolve( + redirectWorkingDir, + 'list-viewer-broken-links-redirect-test.nitpicker', + ); + + beforeAll(async () => { + const { mkdirSync } = await import('node:fs'); + mkdirSync(redirectWorkingDir, { recursive: true }); + archive = await Archive.create({ + filePath: archiveFilePath, + cwd: redirectWorkingDir, + }); + await archive.setConfig({ + baseUrl: 'https://example.com', + name: 'test', + version: '0.10.0', + recursive: true, + interval: 0, + image: true, + fetchExternal: false, + parallels: 1, + roots: ['https://example.com'], + excludes: [], + excludeKeywords: [], + excludeUrls: [], + maxExcludedDepth: 0, + retry: 3, + fromList: false, + disableQueries: false, + userAgent: 'test', + ignoreRobots: false, + }); + + await archive.setPage({ + url: parseUrl('https://example.com/direct')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Direct' }, + anchorList: [ + { + href: parseUrl('https://example.com/canonical-target')!, + isExternal: false, + title: null, + textContent: 'Direct link', + }, + ], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://example.com/via-redirect')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Via redirect' }, + anchorList: [ + { + href: parseUrl('https://example.com/old')!, + isExternal: false, + title: null, + textContent: 'Old link', + hash: null, + }, + ], + imageList: [], + isSkipped: false, + }); + await 
archive.setPage({ + url: 
parseUrl('https://example.com/canonical-target')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 404, + statusText: 'Not Found', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + await archive.setRedirect({ + url: parseUrl('https://example.com/old')!, + redirectPaths: ['https://example.com/canonical-target'], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + + await buildViewerReadModel(archive); + }); + + afterAll(async () => { + if (archive) { + await archive.releaseHandle(); + } + const { rmSync } = await import('node:fs'); + rmSync(redirectWorkingDir, { recursive: true, force: true }); + }); + + it('reports the canonical destination for both the direct and redirect-source-routed anchors', async () => { + const result = await listViewerBrokenLinks(archive, { sortBy: 'sourceUrl' }); + expect(result.items).toHaveLength(2); + for (const item of result.items) { + expect(item).toMatchObject({ + destUrl: 'https://example.com/canonical-target', + status: 404, + }); + } + expect(result.items.map((item) => item.sourceUrl).toSorted()).toEqual([ + 'https://example.com/direct', + 'https://example.com/via-redirect', + ]); + }); +}); + +/** + * A broken link and an external link are independent judgments on the same + * `viewer_anchor_facts` row (`is_broken`/`is_external_link` are separate + * flags) — a destination can be both. Isolated into its own archive so it + * doesn't perturb the main describe block's exact item/pagination counts. 
+ */ +describe('listViewerBrokenLinks: a destination that is both broken and external', () => { + const brokenExternalWorkingDir = path.resolve( + __dirname, + '__test_fixtures_list_viewer_broken_links_broken_external__', + ); + let archive: InstanceType<typeof Archive>; + const archiveFilePath = path.resolve( + brokenExternalWorkingDir, + 'list-viewer-broken-links-broken-external-test.nitpicker', + ); + + beforeAll(async () => { + const { mkdirSync } = await import('node:fs'); + mkdirSync(brokenExternalWorkingDir, { recursive: true }); + archive = await Archive.create({ + filePath: archiveFilePath, + cwd: brokenExternalWorkingDir, + }); + await archive.setConfig({ + baseUrl: 'https://example.com', + name: 'test', + version: '0.10.0', + recursive: true, + interval: 0, + image: true, + fetchExternal: false, + parallels: 1, + roots: ['https://example.com'], + excludes: [], + excludeKeywords: [], + excludeUrls: [], + maxExcludedDepth: 0, + retry: 3, + fromList: false, + disableQueries: false, + userAgent: 'test', + ignoreRobots: false, + }); + + await archive.setPage({ + url: parseUrl('https://example.com/page-a')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Page A' }, + anchorList: [ + { + href: parseUrl('https://external.example.com/broken-ext')!, + isExternal: true, + title: null, + textContent: 'Broken external', + }, + ], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://external.example.com/broken-ext')!, + redirectPaths: [], + isExternal: true, + isTarget: false, + status: 404, + statusText: 'Not Found', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + + await buildViewerReadModel(archive); + }); + + afterAll(async () => { + if (archive) { + 
await 
archive.releaseHandle(); + } + const { rmSync } = await import('node:fs'); + rmSync(brokenExternalWorkingDir, { recursive: true, force: true }); + }); + + it('reports isExternal: true for a broken destination that is also external', async () => { + const result = await listViewerBrokenLinks(archive); + expect(result.items).toEqual([ + expect.objectContaining({ + sourceUrl: 'https://example.com/page-a', + destUrl: 'https://external.example.com/broken-ext', + status: 404, + isExternal: true, + }), + ]); + }); +}); diff --git a/packages/@nitpicker/query/src/list-viewer-broken-links.ts b/packages/@nitpicker/query/src/list-viewer-broken-links.ts new file mode 100644 index 00000000..89d40e6c --- /dev/null +++ b/packages/@nitpicker/query/src/list-viewer-broken-links.ts @@ -0,0 +1,307 @@ +import type { + CursorPaginatedLinkList, + LinkEntry, + ListViewerBrokenLinksOptions, +} from './types.js'; +import type { + AnchorFactsKeysetRow, + AnchorFactsSortSpec, +} from './viewer-anchor-facts-cursor/types.js'; +import type { ArchiveAccessor } from '@nitpicker/crawler'; +import type { Knex } from 'knex'; + +import { buildAnchorFactsFilterKey } from './viewer-anchor-facts-cursor/build-anchor-facts-filter-key.js'; +import { decodeAnchorFactsCursor } from './viewer-anchor-facts-cursor/decode-anchor-facts-cursor.js'; +import { encodeAnchorFactsCursor } from './viewer-anchor-facts-cursor/encode-anchor-facts-cursor.js'; +import { extractAnchorFactsSortValues } from './viewer-anchor-facts-cursor/extract-anchor-facts-sort-values.js'; +import { getAnchorFactsSortSpec } from './viewer-anchor-facts-cursor/get-anchor-facts-sort-spec.js'; +import { VIEWER_READ_MODEL_SCHEMA_VERSION } from './viewer-read-model/viewer-read-model-schema-version.js'; + +/** + * Adds a keyset comparison tuple as a `WHERE` predicate — `(col1, col2, …) + * {>|<} (?, ?, …)` — using SQLite's row-value comparison. 
Column names come + * from the fixed {@link AnchorFactsSortSpec} column set, never from request + * input, so interpolating them into the SQL text (rather than parameter + * binding, which only covers values) carries no injection risk. Mirrors + * `list-viewer-pages.ts`'s identical helper — not shared as a common module + * since the two existing keyset-cursor implementations in this package have + * never been generalised into one, matching `list-directory-pages.ts`'s + * independent, table-specific cursor scheme. + * @param qb - The query builder to constrain. + * @param columns - The keyset tuple columns, in comparison order. + * @param operator - `'>'` for a forward (ascending-tuple) seek, `'<'` for a + * backward one. + * @param values - The boundary row's tuple values, in `columns` order. + */ +function applyKeysetPredicate( + qb: Knex.QueryBuilder, + columns: readonly string[], + operator: '>' | '<', + values: readonly (string | number)[], +): void { + const columnList = columns.join(', '); + const placeholders = columns.map(() => '?').join(', '); + qb.whereRaw(`(${columnList}) ${operator} (${placeholders})`, [...values]); +} + +/** + * Applies the (currently sole) filter — `status` — on top of the fixed + * `is_broken = 1` predicate every read shares. + * @param qb - The query builder to constrain. + * @param options - The caller's filter options. + */ +function applyBrokenLinksFilters( + qb: Knex.QueryBuilder, + options: ListViewerBrokenLinksOptions, +): void { + qb.where('is_broken', 1); + if (options.status != null) { + qb.where('status', options.status); + } +} + +/** + * Counts the total `is_broken = 1` rows matching the caller's filters. + * @param knex - The archive's Knex instance. + * @param options - The caller's filter options. + * @returns The total matching row count. 
+ */ +async function countAnchorFactsTotal( + knex: Knex, + options: ListViewerBrokenLinksOptions, +): Promise<number> { + const qb = knex('viewer_anchor_facts'); + applyBrokenLinksFilters(qb, options); + const result = await qb.count<{ count: string }[]>({ count: '*' }); + return Number(result[0]?.count ?? 0); +} + +/** + * Runs one `viewer_anchor_facts` read: applies filters, an optional keyset + * predicate, an `ORDER BY` in `orderDirection`, and `limit + 1` rows (the + * `+1` lets the caller detect "is there another row past this page" + * without a second query). Unlike `list-viewer-pages.ts`'s equivalent, no + * id-then-join step follows: `source_url_sort_key`/`dest_url_sort_key`/ + * `status` are already the exact display values, so this window read IS + * the final row set. + * @param knex - The archive's Knex instance. + * @param options - The caller's filter options. + * @param spec - The resolved sort spec (columns to select/order by). + * @param orderDirection - The physical scan direction for this read. + * @param limit - The page size (the read fetches `limit + 1` rows). + * @param keyset - The keyset predicate to apply, or `undefined` for an + * unconstrained (initial / offset) read. + * @param keyset.operator - `'>'` or `'<'`, per {@link applyKeysetPredicate}. + * @param keyset.values - The boundary row's tuple values. + * @param offset - Row offset for a direct `OFFSET` read (page-number jumps). + * Ignored when `keyset` is supplied. + * @returns Up to `limit + 1` rows. 
+ */ +async function readAnchorFactsWindow( + knex: Knex, + options: ListViewerBrokenLinksOptions, + spec: AnchorFactsSortSpec, + orderDirection: 'asc' | 'desc', + limit: number, + keyset: { operator: '>' | '<'; values: readonly (string | number)[] } | undefined, + offset: number, +): Promise< + (AnchorFactsKeysetRow & { + source_url_sort_key: string; + dest_url_sort_key: string; + status: number | null; + is_external_link: number; + })[] +> { + const qb = knex('viewer_anchor_facts'); + applyBrokenLinksFilters(qb, options); + if (keyset) { + applyKeysetPredicate(qb, spec.columns, keyset.operator, keyset.values); + } + const selectColumns = [ + ...new Set([ + 'edge_id', + 'source_url_sort_key', + 'dest_url_sort_key', + 'status', + 'status_sort_key', + 'status_desc_key', + 'is_external_link', + ...spec.columns, + ]), + ]; + let query = qb + .select(selectColumns) + .orderBy(spec.columns.map((column) => ({ column, order: orderDirection }))) + .limit(limit + 1); + if (!keyset && offset > 0) { + query = query.offset(offset); + } + return query; +} + +/** + * Maps one raw window row to the public {@link LinkEntry} shape. + * `textContent` is always `null`: `viewer_anchor_facts` doesn't store per- + * anchor text (broken-links-view.tsx never renders it, and storing it would + * duplicate potentially large strings across every edge row — see + * ARCHITECTURE.md「設計注意(viewer_anchor_facts read model、issue + * #114)」). `isExternal` reflects the edge's `is_external_link` flag — + * broken and external are independent judgments, so a broken link CAN also + * be external. + * @param row - One row from {@link readAnchorFactsWindow}. + * @param row.source_url_sort_key + * @param row.dest_url_sort_key + * @param row.status + * @param row.is_external_link + * @returns The corresponding {@link LinkEntry}. 
+ */ +function toLinkEntry(row: { + source_url_sort_key: string; + dest_url_sort_key: string; + status: number | null; + is_external_link?: number; +}): LinkEntry { + return { + sourceUrl: row.source_url_sort_key, + destUrl: row.dest_url_sort_key, + status: row.status, + isExternal: !!row.is_external_link, + textContent: null, + }; +} + +/** + * Lists broken links from `viewer_anchor_facts` — the read-model-backed, + * cursor-paginated counterpart of `listLinks(accessor, { type: 'broken' })` + * that powers `/api/links?type=broken`'s fast path. + * + * Filter/sort resolution runs entirely against `viewer_anchor_facts`; there + * is no id-then-join step (unlike `listViewerPages`) because + * `source_url_sort_key`/`dest_url_sort_key`/`status` are already the exact + * display values — see that table's `create-viewer-read-model-tables.ts` + * docs for why this doesn't reintroduce the URL-duplication cost issue + * #114 warns about at 13M-edge scale (negligible at this package's actual + * benchmark scale; see ARCHITECTURE.md). + * + * The initial read (no `cursor`), the forward keyset read, the backward + * keyset read, and the direct-`offset` read are four separate code paths — + * no `(:cursor IS NULL OR …)`-style nullable predicate ties them together, + * mirroring `listViewerPages`. + * @param accessor - The archive accessor to query. Callers are responsible + * for confirming the read model is built and current (see + * `isViewerReadModelCurrent`) AND that `urlPattern` is not set (see + * `ListViewerBrokenLinksOptions`'s docs) before calling this. + * @param options - Filter, sort, and pagination options. + * @returns A cursor-paginated list of broken-link entries. + * @throws {Error} If `options.cursor` is malformed, stale, or was minted + * under a different filter/sort combination. 
+ * @example + * // Virtual-scroll continuation (the caller only ever inspects nextCursor): + * const page1 = await listViewerBrokenLinks(accessor, { limit: 100 }); + * const page2 = page1.nextCursor + * ? await listViewerBrokenLinks(accessor, { limit: 100, cursor: page1.nextCursor }) + * : null; + */ +export async function listViewerBrokenLinks( + accessor: ArchiveAccessor, + options: ListViewerBrokenLinksOptions = {}, +): Promise<CursorPaginatedLinkList> { + const knex = accessor.getKnex(); + const limit = options.limit ?? 100; + const sortBy = options.sortBy ?? 'sourceUrl'; + const sortOrder = options.sortOrder ?? 'asc'; + const spec = getAnchorFactsSortSpec(sortBy, sortOrder); + const filterKey = buildAnchorFactsFilterKey(options); + + const total = await countAnchorFactsTotal(knex, options); + + /** + * Builds the final result from a `limit`-or-fewer window, already in + * final display order. + * @param window - The trimmed row window. + * @param hasMoreAfter - Whether a subsequent page exists. + * @param hasMoreBefore - Whether a preceding page exists. + * @returns The full paginated result. + */ + function buildResult( + window: Awaited<ReturnType<typeof readAnchorFactsWindow>>, + hasMoreAfter: boolean, + hasMoreBefore: boolean, + ): CursorPaginatedLinkList { + const items = window.map((row) => toLinkEntry(row)); + const lastRow = window.at(-1); + const firstRow = window[0]; + const nextCursor = + hasMoreAfter && lastRow + ? encodeAnchorFactsCursor({ + v: VIEWER_READ_MODEL_SCHEMA_VERSION, + filterKey, + sortBy, + sortOrder, + values: extractAnchorFactsSortValues(spec, lastRow), + }) + : null; + const prevCursor = + hasMoreBefore && firstRow + ? 
encodeAnchorFactsCursor({ + v: VIEWER_READ_MODEL_SCHEMA_VERSION, + filterKey, + sortBy, + sortOrder, + values: extractAnchorFactsSortValues(spec, firstRow), + }) + : null; + return { items, total, nextCursor, prevCursor }; + } + + if (options.cursor) { + const decoded = decodeAnchorFactsCursor(options.cursor, { + filterKey, + sortBy, + sortOrder, + columns: spec.columns, + }); + if (options.direction === 'prev') { + const oppositeDirection = spec.scanDirection === 'asc' ? 'desc' : 'asc'; + const fetched = await readAnchorFactsWindow( + knex, + options, + spec, + oppositeDirection, + limit, + { operator: spec.scanDirection === 'asc' ? '<' : '>', values: decoded.values }, + 0, + ); + const hasMoreBefore = fetched.length > limit; + const window = fetched.slice(0, limit).toReversed(); + return buildResult(window, true, hasMoreBefore); + } + const fetched = await readAnchorFactsWindow( + knex, + options, + spec, + spec.scanDirection, + limit, + { operator: spec.scanDirection === 'asc' ? '>' : '<', values: decoded.values }, + 0, + ); + const hasMoreAfter = fetched.length > limit; + const window = fetched.slice(0, limit); + return buildResult(window, hasMoreAfter, true); + } + + const offset = options.offset ?? 
0; + const fetched = await readAnchorFactsWindow( + knex, + options, + spec, + spec.scanDirection, + limit, + undefined, + offset, + ); + const hasMoreAfter = fetched.length > limit; + const window = fetched.slice(0, limit); + return buildResult(window, hasMoreAfter, offset > 0); +} diff --git a/packages/@nitpicker/query/src/list-viewer-external-links.ts b/packages/@nitpicker/query/src/list-viewer-external-links.ts index 815d6c18..22980c6b 100644 --- a/packages/@nitpicker/query/src/list-viewer-external-links.ts +++ b/packages/@nitpicker/query/src/list-viewer-external-links.ts @@ -9,7 +9,7 @@ import { paginateQuery } from './paginate-query.js'; * model — the fast-path counterpart of {@link listExternalLinks}, backed by * a table pre-aggregated once at read-model build time instead of a live * `anchors` JOIN + `GROUP BY` per request (see - * ARCHITECTURE.md「設計注意(外部リンク read model)」for why the live + * ARCHITECTURE.md「設計注意(viewer_anchor_facts read model、issue #114)」for why the live * version's `GROUP BY` + `COUNT(DISTINCT ...)` combination is a known * SQLite performance pitfall). 
* diff --git a/packages/@nitpicker/query/src/query.ts b/packages/@nitpicker/query/src/query.ts index 75cc61d2..59b84fa5 100644 --- a/packages/@nitpicker/query/src/query.ts +++ b/packages/@nitpicker/query/src/query.ts @@ -50,6 +50,7 @@ export { listPagesByJsonLdType } from './list-pages-by-jsonld-type.js'; export { listPagesByTag } from './list-pages-by-tag.js'; export { listResources } from './list-resources.js'; export { listUnusedResources } from './list-unused-resources.js'; +export { listViewerBrokenLinks } from './list-viewer-broken-links.js'; export { listViewerExternalLinks } from './list-viewer-external-links.js'; export { listViewerPages } from './list-viewer-pages.js'; export { prepareUrlSortTempTable } from './url-sort-temp-table.js'; diff --git a/packages/@nitpicker/query/src/types.ts b/packages/@nitpicker/query/src/types.ts index a0089e95..9ed7b38b 100644 --- a/packages/@nitpicker/query/src/types.ts +++ b/packages/@nitpicker/query/src/types.ts @@ -1129,6 +1129,63 @@ export interface LinkAnalysisResult { total: number; } +/** + * Filter/sort/pagination options for {@link listViewerBrokenLinks} — the + * `viewer_anchor_facts` read-model fast path for broken-link listing. + * + * `urlPattern` and `includeRedirectSources` are deliberately absent: + * `urlPattern` matches source OR destination across two columns + * (`ListLinksOptions`'s semantics), which no single index on + * `viewer_anchor_facts` can satisfy, so callers with a `urlPattern` set + * must use `listLinks` instead (see `register-links-route.ts`). + * `includeRedirectSources` has no equivalent here: `viewer_anchor_facts` + * only ever stores the canonical (redirect-resolved) destination. + */ +export interface ListViewerBrokenLinksOptions { + /** Filter by destination HTTP status. Broken links are always `404`, so this is effectively a no-op unless set to a non-`404` value (which then matches nothing). */ + status?: number; + /** Field to sort results by. Defaults to `'sourceUrl'`. 
*/ + sortBy?: 'sourceUrl' | 'destUrl' | 'status'; + /** Sort direction. Defaults to `'asc'`. */ + sortOrder?: SortOrder; + /** Maximum number of results to return. Defaults to 100. */ + limit?: number; + /** + * Opaque keyset cursor from a previous {@link CursorPaginatedLinkList}'s + * `nextCursor`/`prevCursor`. Mutually exclusive with `offset` — when both + * are supplied, `cursor` wins. Omit for the first page. + */ + cursor?: string; + /** + * Direction to walk from `cursor`: `'next'` (forward, default) or + * `'prev'` (backward). Ignored when `cursor` is omitted. + */ + direction?: 'next' | 'prev'; + /** + * Row offset for page-number jumps (MPA pagination). Mutually exclusive + * with `cursor`. + */ + offset?: number; +} + +/** + * Paginated result wrapper for {@link listViewerBrokenLinks} — + * {@link LinkAnalysisResult} plus keyset cursors for virtual-scroll + * continuation. + */ +export interface CursorPaginatedLinkList extends LinkAnalysisResult { + /** + * Opaque cursor to fetch the next page in the current sort order, or + * `null` when this is the last page. + */ + nextCursor: string | null; + /** + * Opaque cursor to fetch the previous page in the current sort order, or + * `null` when this is already the first page. + */ + prevCursor: string | null; +} + /** * Filter/sort/pagination options for {@link listExternalLinks}. 
diff --git a/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/build-anchor-facts-filter-key.spec.ts b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/build-anchor-facts-filter-key.spec.ts new file mode 100644 index 00000000..4675423b --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/build-anchor-facts-filter-key.spec.ts @@ -0,0 +1,23 @@ +import { describe, expect, it } from 'vitest'; + +import { buildAnchorFactsFilterKey } from './build-anchor-facts-filter-key.js'; + +describe('buildAnchorFactsFilterKey', () => { + it('produces the same key for an empty options object and an explicit status: undefined', () => { + expect(buildAnchorFactsFilterKey({})).toBe( + buildAnchorFactsFilterKey({ status: undefined }), + ); + }); + + it('produces a different key for different status values', () => { + expect(buildAnchorFactsFilterKey({ status: 404 })).not.toBe( + buildAnchorFactsFilterKey({ status: 500 }), + ); + }); + + it('produces a different key when status is set vs unset', () => { + expect(buildAnchorFactsFilterKey({})).not.toBe( + buildAnchorFactsFilterKey({ status: 404 }), + ); + }); +}); diff --git a/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/build-anchor-facts-filter-key.ts b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/build-anchor-facts-filter-key.ts new file mode 100644 index 00000000..50b435dd --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/build-anchor-facts-filter-key.ts @@ -0,0 +1,16 @@ +import type { AnchorFactsCursorFilterKeyInput } from './types.js'; + +/** + * Builds the normalized `filterKey` embedded in a cursor. Two calls with the + * same effective filters (whether `status` is explicitly `undefined` or + * omitted) always produce the same string. + * @param filters - The filter-affecting subset of the caller's options. + * @returns A canonical JSON string uniquely identifying the filter set. 
+ */ +export function buildAnchorFactsFilterKey( + filters: AnchorFactsCursorFilterKeyInput, +): string { + return JSON.stringify({ + status: filters.status ?? null, + }); +} diff --git a/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/decode-anchor-facts-cursor.spec.ts b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/decode-anchor-facts-cursor.spec.ts new file mode 100644 index 00000000..67ac7cc0 --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/decode-anchor-facts-cursor.spec.ts @@ -0,0 +1,101 @@ +import { describe, expect, it } from 'vitest'; + +import { VIEWER_READ_MODEL_SCHEMA_VERSION } from '../viewer-read-model/viewer-read-model-schema-version.js'; + +import { decodeAnchorFactsCursor } from './decode-anchor-facts-cursor.js'; +import { encodeAnchorFactsCursor } from './encode-anchor-facts-cursor.js'; + +const PAYLOAD_BASE = { + filterKey: '{"status":null}', + sortBy: 'sourceUrl' as const, + sortOrder: 'asc' as const, +}; + +const EXPECTED = { + ...PAYLOAD_BASE, + columns: ['source_url_sort_key', 'edge_id'] as const, +}; + +describe('decodeAnchorFactsCursor', () => { + it('decodes a cursor that matches the expected filter/sort', () => { + const cursor = encodeAnchorFactsCursor({ + v: VIEWER_READ_MODEL_SCHEMA_VERSION, + ...PAYLOAD_BASE, + values: ['https://example.com/a', 1], + }); + expect(decodeAnchorFactsCursor(cursor, EXPECTED)).toEqual({ + v: VIEWER_READ_MODEL_SCHEMA_VERSION, + ...PAYLOAD_BASE, + values: ['https://example.com/a', 1], + }); + }); + + it('throws on an undecodable string', () => { + expect(() => decodeAnchorFactsCursor('%%%not-base64%%%', EXPECTED)).toThrow( + /not decodable/, + ); + }); + + it('throws on a decodable but malformed payload', () => { + const cursor = Buffer.from(JSON.stringify({ foo: 'bar' }), 'utf8').toString( + 'base64url', + ); + expect(() => decodeAnchorFactsCursor(cursor, EXPECTED)).toThrow(/malformed/); + }); + + it('throws on a cursor minted under a stale schema version', () 
=> { + const cursor = encodeAnchorFactsCursor({ + v: VIEWER_READ_MODEL_SCHEMA_VERSION - 1, + ...PAYLOAD_BASE, + values: ['https://example.com/a', 1], + }); + expect(() => decodeAnchorFactsCursor(cursor, EXPECTED)).toThrow(/[Ss]tale/); + }); + + it('throws on a cursor minted under a different filter', () => { + const cursor = encodeAnchorFactsCursor({ + v: VIEWER_READ_MODEL_SCHEMA_VERSION, + ...PAYLOAD_BASE, + filterKey: '{"status":404}', + values: ['https://example.com/a', 1], + }); + expect(() => decodeAnchorFactsCursor(cursor, EXPECTED)).toThrow(/does not match/); + }); + + it('throws on a cursor minted under a different sort', () => { + const cursor = encodeAnchorFactsCursor({ + v: VIEWER_READ_MODEL_SCHEMA_VERSION, + ...PAYLOAD_BASE, + sortBy: 'status', + values: [404, 'https://example.com/a', 1], + }); + expect(() => decodeAnchorFactsCursor(cursor, EXPECTED)).toThrow(/does not match/); + }); + + it('throws on a values array whose length does not match the expected column count', () => { + const cursor = encodeAnchorFactsCursor({ + v: VIEWER_READ_MODEL_SCHEMA_VERSION, + ...PAYLOAD_BASE, + values: ['https://example.com/a'], + }); + expect(() => decodeAnchorFactsCursor(cursor, EXPECTED)).toThrow(/keyset value count/); + }); + + it('throws on a numeric-column position holding a string value', () => { + const cursor = encodeAnchorFactsCursor({ + v: VIEWER_READ_MODEL_SCHEMA_VERSION, + ...PAYLOAD_BASE, + values: ['https://example.com/a', 'not-a-number'], + }); + expect(() => decodeAnchorFactsCursor(cursor, EXPECTED)).toThrow(/must be a number/); + }); + + it('throws on a text-column position holding a numeric value', () => { + const cursor = encodeAnchorFactsCursor({ + v: VIEWER_READ_MODEL_SCHEMA_VERSION, + ...PAYLOAD_BASE, + values: [123, 1], + }); + expect(() => decodeAnchorFactsCursor(cursor, EXPECTED)).toThrow(/must be a string/); + }); +}); diff --git a/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/decode-anchor-facts-cursor.ts 
b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/decode-anchor-facts-cursor.ts new file mode 100644 index 00000000..9735f995 --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/decode-anchor-facts-cursor.ts @@ -0,0 +1,89 @@ +import type { AnchorFactsCursorPayload, AnchorFactsSortColumn } from './types.js'; + +import { VIEWER_READ_MODEL_SCHEMA_VERSION } from '../viewer-read-model/viewer-read-model-schema-version.js'; + +import { isNumericAnchorFactsSortColumn } from './types.js'; + +/** + * The current request's identity to validate a decoded cursor against. + */ +export interface ExpectedAnchorFactsCursor { + /** See `buildAnchorFactsFilterKey`. */ + filterKey: string; + /** The current request's sort field. */ + sortBy: 'sourceUrl' | 'destUrl' | 'status'; + /** The current request's sort direction. */ + sortOrder: 'asc' | 'desc'; + /** + * `getAnchorFactsSortSpec(sortBy, sortOrder).columns` — both the exact + * tuple length `payload.values` must carry, and the per-position type + * (`isNumericAnchorFactsSortColumn`) each value is checked against. + * Without the length check, a `values` array of the wrong length would + * reach the keyset predicate's positional column/value zip and build a + * malformed SQL comparison; without the per-position type check, a + * same-length but wrong-typed `values` array (e.g. a string standing in + * for `edge_id`) would silently seek to the wrong keyset boundary via + * SQLite's type-affinity comparison rules instead of erroring. + */ + columns: readonly AnchorFactsSortColumn[]; +} + +/** + * Decodes and validates an opaque cursor against the caller's current + * filters/sort. Rejects cursors minted under a different schema version or a + * different effective filter/sort combination — replaying a cursor across a + * changed query would silently seek to a nonsensical position. + * @param cursor - The opaque cursor string from the request. 
+ * @param expected - The current request's filter key + sort, to validate against. + * @returns The decoded, validated payload. + * @throws {Error} If the cursor is malformed, stale, or was minted under a + * different filter/sort combination. + */ +export function decodeAnchorFactsCursor( + cursor: string, + expected: ExpectedAnchorFactsCursor, +): AnchorFactsCursorPayload { + let payload: AnchorFactsCursorPayload; + try { + payload = JSON.parse(Buffer.from(cursor, 'base64url').toString('utf8')); + } catch { + throw new Error('Invalid /api/links?type=broken cursor: not decodable'); + } + if ( + typeof payload !== 'object' || + payload === null || + !Array.isArray(payload.values) || + typeof payload.filterKey !== 'string' || + typeof payload.v !== 'number' + ) { + throw new Error('Invalid /api/links?type=broken cursor: malformed payload'); + } + if (payload.v !== VIEWER_READ_MODEL_SCHEMA_VERSION) { + throw new Error( + 'Stale /api/links?type=broken cursor: read-model schema has changed since it was issued', + ); + } + if ( + payload.filterKey !== expected.filterKey || + payload.sortBy !== expected.sortBy || + payload.sortOrder !== expected.sortOrder + ) { + throw new Error( + 'Invalid /api/links?type=broken cursor: does not match the current filter/sort combination', + ); + } + if (payload.values.length !== expected.columns.length) { + throw new Error( + 'Invalid /api/links?type=broken cursor: unexpected keyset value count', + ); + } + for (const [i, column] of expected.columns.entries()) { + const expectedType = isNumericAnchorFactsSortColumn(column) ? 
'number' : 'string'; + if (typeof payload.values[i] !== expectedType) { + throw new TypeError( + `Invalid /api/links?type=broken cursor: value at position ${i} must be a ${expectedType}`, + ); + } + } + return payload; +} diff --git a/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/encode-anchor-facts-cursor.spec.ts b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/encode-anchor-facts-cursor.spec.ts new file mode 100644 index 00000000..786c73e2 --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/encode-anchor-facts-cursor.spec.ts @@ -0,0 +1,19 @@ +import { describe, expect, it } from 'vitest'; + +import { encodeAnchorFactsCursor } from './encode-anchor-facts-cursor.js'; + +describe('encodeAnchorFactsCursor', () => { + it('round-trips through base64url without loss', () => { + const payload = { + v: 6, + filterKey: '{"status":null}', + sortBy: 'sourceUrl' as const, + sortOrder: 'asc' as const, + values: ['https://example.com/a', 1], + }; + const cursor = encodeAnchorFactsCursor(payload); + expect(JSON.parse(Buffer.from(cursor, 'base64url').toString('utf8'))).toEqual( + payload, + ); + }); +}); diff --git a/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/encode-anchor-facts-cursor.ts b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/encode-anchor-facts-cursor.ts new file mode 100644 index 00000000..1c5170d1 --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/encode-anchor-facts-cursor.ts @@ -0,0 +1,10 @@ +import type { AnchorFactsCursorPayload } from './types.js'; + +/** + * Encodes a cursor payload as an opaque, URL-safe string. + * @param payload - The cursor payload to encode. + * @returns The base64url-encoded cursor. 
+ */ +export function encodeAnchorFactsCursor(payload: AnchorFactsCursorPayload): string { + return Buffer.from(JSON.stringify(payload), 'utf8').toString('base64url'); +} diff --git a/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/extract-anchor-facts-sort-values.spec.ts b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/extract-anchor-facts-sort-values.spec.ts new file mode 100644 index 00000000..dacdfa03 --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/extract-anchor-facts-sort-values.spec.ts @@ -0,0 +1,25 @@ +import type { AnchorFactsKeysetRow } from './types.js'; + +import { describe, expect, it } from 'vitest'; + +import { extractAnchorFactsSortValues } from './extract-anchor-facts-sort-values.js'; + +describe('extractAnchorFactsSortValues', () => { + it('extracts values in spec.columns order, ignoring columns not in the spec', () => { + const row: AnchorFactsKeysetRow = { + source_url_sort_key: 'https://example.com/a', + dest_url_sort_key: 'https://example.com/b', + status_sort_key: 404, + status_desc_key: -404, + edge_id: 7, + }; + const values = extractAnchorFactsSortValues( + { + columns: ['status_sort_key', 'source_url_sort_key', 'edge_id'], + scanDirection: 'asc', + }, + row, + ); + expect(values).toEqual([404, 'https://example.com/a', 7]); + }); +}); diff --git a/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/extract-anchor-facts-sort-values.ts b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/extract-anchor-facts-sort-values.ts new file mode 100644 index 00000000..6df7b435 --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/extract-anchor-facts-sort-values.ts @@ -0,0 +1,15 @@ +import type { AnchorFactsKeysetRow, AnchorFactsSortSpec } from './types.js'; + +/** + * Extracts a row's keyset tuple values in `spec.columns` order — the values + * bound into a cursor's comparison tuple. + * @param spec - The sort spec whose columns to read. 
+ * @param row - The source row (must carry every column in `spec.columns`). + * @returns The tuple values, in `spec.columns` order. + */ +export function extractAnchorFactsSortValues( + spec: AnchorFactsSortSpec, + row: AnchorFactsKeysetRow, +): (string | number)[] { + return spec.columns.map((column) => row[column]); +} diff --git a/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/get-anchor-facts-sort-spec.spec.ts b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/get-anchor-facts-sort-spec.spec.ts new file mode 100644 index 00000000..e3dfdd30 --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/get-anchor-facts-sort-spec.spec.ts @@ -0,0 +1,47 @@ +import { describe, expect, it } from 'vitest'; + +import { getAnchorFactsSortSpec } from './get-anchor-facts-sort-spec.js'; + +describe('getAnchorFactsSortSpec', () => { + it('sorts by sourceUrl ascending using source_url_sort_key/edge_id, scanned ascending', () => { + expect(getAnchorFactsSortSpec('sourceUrl', 'asc')).toEqual({ + columns: ['source_url_sort_key', 'edge_id'], + scanDirection: 'asc', + }); + }); + + it('sorts by sourceUrl descending by flipping the scan direction, no negated key needed', () => { + expect(getAnchorFactsSortSpec('sourceUrl', 'desc')).toEqual({ + columns: ['source_url_sort_key', 'edge_id'], + scanDirection: 'desc', + }); + }); + + it('sorts by destUrl ascending using dest_url_sort_key/edge_id, scanned ascending', () => { + expect(getAnchorFactsSortSpec('destUrl', 'asc')).toEqual({ + columns: ['dest_url_sort_key', 'edge_id'], + scanDirection: 'asc', + }); + }); + + it('sorts by destUrl descending by flipping the scan direction', () => { + expect(getAnchorFactsSortSpec('destUrl', 'desc')).toEqual({ + columns: ['dest_url_sort_key', 'edge_id'], + scanDirection: 'desc', + }); + }); + + it('sorts by status ascending using status_sort_key with a source_url_sort_key tie-breaker, scanned ascending', () => { + expect(getAnchorFactsSortSpec('status', 
'asc')).toEqual({ + columns: ['status_sort_key', 'source_url_sort_key', 'edge_id'], + scanDirection: 'asc', + }); + }); + + it('sorts by status descending using the negated status_desc_key, ALWAYS scanned ascending', () => { + expect(getAnchorFactsSortSpec('status', 'desc')).toEqual({ + columns: ['status_desc_key', 'source_url_sort_key', 'edge_id'], + scanDirection: 'asc', + }); + }); +}); diff --git a/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/get-anchor-facts-sort-spec.ts b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/get-anchor-facts-sort-spec.ts new file mode 100644 index 00000000..7a7f7783 --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/get-anchor-facts-sort-spec.ts @@ -0,0 +1,31 @@ +import type { AnchorFactsSortSpec } from './types.js'; + +/** + * Resolves the keyset sort plan for a `sortBy`/`sortOrder` pair. + * @param sortBy - The field to sort by. + * @param sortOrder - The sort direction. + * @returns The resolved {@link AnchorFactsSortSpec}. + */ +export function getAnchorFactsSortSpec( + sortBy: 'sourceUrl' | 'destUrl' | 'status', + sortOrder: 'asc' | 'desc', +): AnchorFactsSortSpec { + switch (sortBy) { + case 'status': { + return { + columns: [ + sortOrder === 'desc' ? 
'status_desc_key' : 'status_sort_key', + 'source_url_sort_key', + 'edge_id', + ], + scanDirection: 'asc', + }; + } + case 'destUrl': { + return { columns: ['dest_url_sort_key', 'edge_id'], scanDirection: sortOrder }; + } + default: { + return { columns: ['source_url_sort_key', 'edge_id'], scanDirection: sortOrder }; + } + } +} diff --git a/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/types.ts b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/types.ts new file mode 100644 index 00000000..86a2f4c4 --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-anchor-facts-cursor/types.ts @@ -0,0 +1,101 @@ +/** + * The columns (in tuple order) that make up a given sort's keyset — both the + * `ORDER BY` clause and the cursor comparison tuple. Always ends in + * `edge_id`, the stable tie-breaker. + */ +export type AnchorFactsSortColumn = + | 'source_url_sort_key' + | 'dest_url_sort_key' + | 'status_sort_key' + | 'status_desc_key' + | 'edge_id'; + +/** + * Resolved sort plan for one `sortBy`/`sortOrder` pair: which + * `viewer_anchor_facts` columns form the keyset tuple, and which physical + * scan direction (`asc`/`desc`) reads them in display order. + * + * `status` desc uses `status_desc_key` (`= -status_sort_key`) walked + * ascending, so the `source_url_sort_key`/`edge_id` tie-breakers stay + * ascending too — ties always display in source-URL order regardless of the + * primary sort direction, mirroring `viewer_pages`'s identical + * `ViewerPagesSortSpec` rationale (`docs/viewer-sql-query-plan.md`'s "Stable + * Ordering" section). `sourceUrl`/`destUrl` don't need this negation trick + * (text has no numeric negation, and their tie-breaker — `edge_id` alone — + * flips direction together with the primary column, so no per-column + * direction mixing occurs). + */ +export interface AnchorFactsSortSpec { + /** Keyset tuple columns, in comparison/`ORDER BY` order. 
*/ + readonly columns: readonly AnchorFactsSortColumn[]; + /** Physical scan direction that yields display order for `columns`. */ + readonly scanDirection: 'asc' | 'desc'; +} + +/** One `viewer_anchor_facts` row's worth of keyset column values, keyed by column name. */ +export type AnchorFactsKeysetRow = Record<AnchorFactsSortColumn, string | number> & { + edge_id: number; +}; + +/** + * The `viewer_anchor_facts` columns whose keyset value is a SQLite INTEGER + * (bound as a JS `number`) rather than TEXT (`string`). Used by + * `decodeAnchorFactsCursor` to reject a cursor whose `values` array has the + * right length but a value of the wrong type at some position (e.g. a + * string where `edge_id` belongs) — SQLite's type-affinity comparison rules + * would otherwise silently seek to the wrong keyset boundary instead of + * erroring. + */ +const NUMERIC_ANCHOR_FACTS_SORT_COLUMNS: ReadonlySet<AnchorFactsSortColumn> = new Set([ + 'status_sort_key', + 'status_desc_key', + 'edge_id', +]); + +/** + * Whether `column`'s keyset value is a SQLite INTEGER (`number`) rather + * than TEXT (`string`). + * @param column - The sort-spec column to check. + * @returns `true` for `status_sort_key`/`status_desc_key`/`edge_id`, `false` for the URL sort-key columns. + */ +export function isNumericAnchorFactsSortColumn(column: AnchorFactsSortColumn): boolean { + return NUMERIC_ANCHOR_FACTS_SORT_COLUMNS.has(column); +} + +/** + * The subset of `ListViewerBrokenLinksOptions` that affects which rows + * match — used to build a cursor's `filterKey` so a cursor minted under one + * filter/sort combination can't silently be replayed under another.
Unlike + * `viewer_pages`, `is_broken` itself is never variable here (this cursor + * family only ever backs the broken-link listing), and `urlPattern` is + * excluded entirely: it matches source OR destination across two columns + * (`list-links.ts`'s semantics), which no single index here can satisfy, so + * the caller (`register-links-route.ts`) forces the legacy fallback instead + * of ever reaching this cursor machinery with a `urlPattern` set — the same + * precedent `register-pages-route.ts` already established for `/api/pages`. + */ +export interface AnchorFactsCursorFilterKeyInput { + /** See `ListViewerBrokenLinksOptions.status`. */ + status?: number; +} + +/** + * Decoded shape of an opaque `/api/links?type=broken` viewer cursor. + */ +export interface AnchorFactsCursorPayload { + /** + * The read-model schema version the cursor was minted under (see + * `VIEWER_READ_MODEL_SCHEMA_VERSION`). A schema bump changes column + * meanings (or removes them), so a cursor from a stale schema must never + * be replayed. + */ + v: number; + /** See `buildAnchorFactsFilterKey`. */ + filterKey: string; + /** The sort field the cursor was minted under. */ + sortBy: 'sourceUrl' | 'destUrl' | 'status'; + /** The sort direction the cursor was minted under. */ + sortOrder: 'asc' | 'desc'; + /** The boundary row's keyset tuple values, in sort-spec column order. */ + values: (string | number)[]; +} diff --git a/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.spec.ts b/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.spec.ts index 5de18cc9..385247bf 100644 --- a/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.spec.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.spec.ts @@ -1102,7 +1102,9 @@ describe('buildViewerReadModel', () => { isSkipped: false, }); - // A second, distinct referring page to the same destination. 
+ // A second, distinct referring page to the same destination, plus two + // duplicate anchors to a broken destination — must collapse to one + // viewer_anchor_facts row with count=2, not two rows. await archive.setPage({ url: parseUrl('https://example.com/page-b')!, redirectPaths: [], @@ -1122,6 +1124,18 @@ describe('buildViewerReadModel', () => { title: null, textContent: 'Ad sidebar', }, + { + href: parseUrl('https://example.com/broken')!, + isExternal: false, + title: null, + textContent: 'Broken link 1', + }, + { + href: parseUrl('https://example.com/broken')!, + isExternal: false, + title: null, + textContent: 'Broken link 2', + }, ], imageList: [], isSkipped: false, @@ -1143,6 +1157,22 @@ describe('buildViewerReadModel', () => { imageList: [], isSkipped: false, }); + await archive.setPage({ + url: parseUrl('https://example.com/broken')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 404, + statusText: 'Not Found', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); await buildViewerReadModel(archive); }); @@ -1171,5 +1201,32 @@ describe('buildViewerReadModel', () => { expect(rows).toHaveLength(1); expect(rows[0]).toMatchObject({ referrer_count: 2 }); }); + + it('populates viewer_anchor_facts with one row per unique (source,dest) pair, collapsing duplicate anchors via count', async () => { + const rows = await archive + .getKnex()('viewer_anchor_facts') + .where('dest_url_sort_key', 'https://example.com/broken') + .select('*'); + expect(rows).toHaveLength(1); + expect(rows[0]).toMatchObject({ count: 2, is_broken: 1, is_external_link: 0 }); + }); + + it('flags the external-destination edges as is_external_link without indexing them for read (no vaf_external_* index exists)', async () => { + const rows = await archive + .getKnex()('viewer_anchor_facts') + .where('dest_url_sort_key', 'https://ads.example.com') + .select('*'); + 
expect(rows).toHaveLength(2); + for (const row of rows) { + expect(row).toMatchObject({ is_broken: 0, is_external_link: 1 }); + } + }); + + it('rebuilds viewer_anchor_facts idempotently — a second build leaves the same row count', async () => { + await buildViewerReadModel(archive); + const rows = await archive.getKnex()('viewer_anchor_facts').select('*'); + // 2 edges to ads.example.com (page-a, page-b) + 1 edge to /broken (page-b). + expect(rows).toHaveLength(3); + }); }); }); diff --git a/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.ts b/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.ts index 664ea08d..07963cc7 100644 --- a/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.ts @@ -5,36 +5,17 @@ import { classifyContentType } from '../classify-content-type.js'; import { excludeSkippedPages } from '../exclude-skipped-pages.js'; import { buildDirectoryTreeRows } from './build-directory-tree-rows.js'; -import { computeExternalLinkRows } from './compute-external-link-rows.js'; +import { computeAnchorFactRows } from './compute-anchor-fact-rows.js'; import { computePageFacetBuckets } from './compute-page-facet-buckets.js'; import { createViewerReadModelTables } from './create-viewer-read-model-tables.js'; +import { deriveExternalLinkSummaryRows } from './derive-external-link-summary-rows.js'; import { dropViewerReadModelTables } from './drop-viewer-read-model-tables.js'; +import { NULL_STATUS_SENTINEL } from './null-status-sentinel.js'; import { VIEWER_READ_MODEL_SCHEMA_VERSION } from './viewer-read-model-schema-version.js'; /** Number of rows written per `INSERT` statement while populating `viewer_pages`. */ const INSERT_CHUNK_SIZE = 500; -/** - * Sentinel `status_sort_key` value substituted for `null` status (errored / - * not-yet-classified rows). 
Chosen smaller than any real HTTP status code - * (100-599) so unknown-status rows keep sorting first in ascending order — - * matching `listPages`'s prior behavior of ordering directly on the nullable - * `status` column, where SQLite treats `NULL` as smaller than any value. - * - * Deliberately distinct from `-1`, which `Database.resetFailedPages` already - * uses as the "hard failure" HTTP status sentinel (see that function's docs) - * — reusing `-1` here would conflate two different populations of rows in - * `status_sort_key` ordering and in any future `status = -1` equality filter. - * - * Keyset cursor comparisons need a NEVER-`null` sort-key column: SQL's - * three-valued logic makes `NULL > x` / `NULL < x` always evaluate to - * `NULL` (never true), which would silently break tuple comparisons like - * `(status_sort_key, url_sort_key, page_id) > (?, ?, ?)` for rows whose - * status is unknown. Substituting a sentinel keeps every row on this column - * strictly orderable. - */ -const NULL_STATUS_SENTINEL = -32_768; - /** * Row shape read from the write-model `pages` table while populating * `viewer_pages`. 
Column names match `pages` verbatim (see @@ -201,14 +182,16 @@ function toViewerPageInsertRow(row: PagesSourceRow): ViewerPageInsertRow { } /** - * Performs a full rebuild of the viewer read model: drops all 8 tables if + * Performs a full rebuild of the viewer read model: drops all 9 tables if * present, recreates them, populates `viewer_pages` from the current * `pages` write-model table, populates `viewer_directory_nodes`/ * `viewer_directory_pages` from that same page set (see * `buildDirectoryTreeRows` for the tree-building rules), populates - * `viewer_external_links` from a dedicated `anchors` aggregation query (see - * `computeExternalLinkRows` — unlike the directory tree, this cannot reuse - * `sourceRows`, since link data lives on `anchors`, not `pages`), seeds one + * `viewer_anchor_facts` from a single `anchors` aggregation query (see + * `computeAnchorFactRows` — unlike the directory tree, this cannot reuse + * `sourceRows`, since link data lives on `anchors`, not `pages`) and derives + * `viewer_external_links` from those same in-memory rows with no second + * `anchors` scan (see `deriveExternalLinkSummaryRows`), seeds one * smoke-test row into `viewer_query_profiles`, writes the * `viewer_count_buckets` totals row plus one row per distinct Pages-list * facet value (see `computePageFacetBuckets`), and writes the @@ -325,10 +308,19 @@ export async function buildViewerReadModel( // Unlike `viewer_pages`/the directory tree, this needs its own `anchors` // query — `sourceRows` (loaded from `pages` only) has no anchor/link - // data. Runs once, here, instead of on every `/api/links?type=external` - // request — see `computeExternalLinkRows`'s docs for the SQLite + // data. Runs once, here, instead of on every `/api/links?type=broken` + // request — see `computeAnchorFactRows`'s docs for the SQLite // performance rationale. 
- const externalLinkRows = await computeExternalLinkRows(trx); + const anchorFactRows = await computeAnchorFactRows(trx); + for (let start = 0; start < anchorFactRows.length; start += INSERT_CHUNK_SIZE) { + await trx('viewer_anchor_facts').insert( + anchorFactRows.slice(start, start + INSERT_CHUNK_SIZE), + ); + } + + // Derived in memory from the anchor-fact rows already computed above — + // no second `anchors` scan (see `deriveExternalLinkSummaryRows`'s docs). + const externalLinkRows = deriveExternalLinkSummaryRows(anchorFactRows); for (let start = 0; start < externalLinkRows.length; start += INSERT_CHUNK_SIZE) { await trx('viewer_external_links').insert( externalLinkRows.slice(start, start + INSERT_CHUNK_SIZE), diff --git a/packages/@nitpicker/query/src/viewer-read-model/compute-anchor-fact-rows.spec.ts b/packages/@nitpicker/query/src/viewer-read-model/compute-anchor-fact-rows.spec.ts new file mode 100644 index 00000000..55117c35 --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-read-model/compute-anchor-fact-rows.spec.ts @@ -0,0 +1,524 @@ +import path from 'node:path'; + +import { tryParseUrl as parseUrl } from '@d-zero/shared/parse-url'; +import { Archive } from '@nitpicker/crawler'; +import { afterAll, beforeAll, describe, expect, it } from 'vitest'; + +import { computeAnchorFactRows } from './compute-anchor-fact-rows.js'; + +const __filename = new URL(import.meta.url).pathname; +const __dirname = path.dirname(__filename); + +const BASE_CONFIG = { + baseUrl: 'https://example.com', + name: 'test', + version: '0.10.0', + recursive: true, + interval: 0, + image: true, + fetchExternal: false, + parallels: 1, + roots: ['https://example.com'], + excludes: [], + excludeKeywords: [], + excludeUrls: [], + maxExcludedDepth: 0, + retry: 3, + fromList: false, + disableQueries: false, + userAgent: 'test', + ignoreRobots: false, +}; + +const META = { + lang: null, + title: null, + description: null, + keywords: null, + noindex: false, + nofollow: false, + 
noarchive: false, + canonical: null, + alternate: null, + 'og:type': null, + 'og:title': null, + 'og:site_name': null, + 'og:description': null, + 'og:url': null, + 'og:image': null, + 'twitter:card': null, +}; + +describe('computeAnchorFactRows', () => { + const workingDir = path.resolve( + __dirname, + '__test_fixtures_compute_anchor_fact_rows__', + ); + let archive: InstanceType<typeof Archive>; + const archiveFilePath = path.resolve( + workingDir, + 'compute-anchor-fact-rows-test.nitpicker', + ); + + beforeAll(async () => { + const { mkdirSync } = await import('node:fs'); + mkdirSync(workingDir, { recursive: true }); + archive = await Archive.create({ filePath: archiveFilePath, cwd: workingDir }); + await archive.setConfig(BASE_CONFIG); + + // Page A: two anchors to /broken (same pair, must collapse to one + // row with count=2), one anchor to ads.example.com (external). + await archive.setPage({ + url: parseUrl('https://example.com/page-a')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Page A' }, + anchorList: [ + { + href: parseUrl('https://example.com/broken')!, + isExternal: false, + title: null, + textContent: 'Broken link 1', + }, + { + href: parseUrl('https://example.com/broken')!, + isExternal: false, + title: null, + textContent: 'Broken link 2', + }, + { + href: parseUrl('https://ads.example.com/')!, + isExternal: true, + title: null, + textContent: 'Ad', + }, + ], + imageList: [], + isSkipped: false, + }); + + // Page B: anchor to a 403 destination (must NOT be flagged broken) + // and a 500 destination (must NOT be flagged broken either).
+ await archive.setPage({ + url: parseUrl('https://example.com/page-b')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Page B' }, + anchorList: [ + { + href: parseUrl('https://example.com/forbidden')!, + isExternal: false, + title: null, + textContent: 'Forbidden', + }, + { + href: parseUrl('https://example.com/server-error')!, + isExternal: false, + title: null, + textContent: 'Server error', + }, + ], + imageList: [], + isSkipped: false, + }); + + await archive.setPage({ + url: parseUrl('https://example.com/broken')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 404, + statusText: 'Not Found', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://example.com/forbidden')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 403, + statusText: 'Forbidden', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://example.com/server-error')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 500, + statusText: 'Internal Server Error', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://ads.example.com/')!, + redirectPaths: [], + isExternal: true, + isTarget: false, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + }); + + afterAll(async () => { + 
if (archive) { + await archive.releaseHandle(); + } + const { rmSync } = await import('node:fs'); + rmSync(workingDir, { recursive: true, force: true }); + }); + + it('collapses duplicate anchors between the same (source,dest) pair into one row with count', async () => { + const knex = archive.getKnex(); + const rows = await knex.transaction((trx) => computeAnchorFactRows(trx)); + const broken = rows.find( + (row) => row.dest_url_sort_key === 'https://example.com/broken', + ); + expect(broken).toMatchObject({ count: 2, is_broken: 1 }); + }); + + it('flags only 404 destinations as broken, not 403 or 5xx', async () => { + const knex = archive.getKnex(); + const rows = await knex.transaction((trx) => computeAnchorFactRows(trx)); + const forbidden = rows.find( + (row) => row.dest_url_sort_key === 'https://example.com/forbidden', + ); + const serverError = rows.find( + (row) => row.dest_url_sort_key === 'https://example.com/server-error', + ); + expect(forbidden).toMatchObject({ status: 403, is_broken: 0 }); + expect(serverError).toMatchObject({ status: 500, is_broken: 0 }); + }); + + it('flags external destinations via is_external_link, not is_broken', async () => { + const knex = archive.getKnex(); + const rows = await knex.transaction((trx) => computeAnchorFactRows(trx)); + const ads = rows.find((row) => row.dest_url_sort_key === 'https://ads.example.com'); + expect(ads).toMatchObject({ count: 1, is_broken: 0, is_external_link: 1 }); + }); + + it('substitutes NULL_STATUS_SENTINEL only when status is null, never a real status', async () => { + const knex = archive.getKnex(); + const rows = await knex.transaction((trx) => computeAnchorFactRows(trx)); + const broken = rows.find( + (row) => row.dest_url_sort_key === 'https://example.com/broken', + )!; + expect(broken.status_sort_key).toBe(404); + }); + + it('sets status_desc_key to the negation of status_sort_key', async () => { + const knex = archive.getKnex(); + const rows = await knex.transaction((trx) => 
computeAnchorFactRows(trx)); + const broken = rows.find( + (row) => row.dest_url_sort_key === 'https://example.com/broken', + )!; + expect(broken.status_desc_key).toBe(-404); + }); +}); + +/** + * Mirrors `list-links.spec.ts`'s redirect-resolution describe block: an + * anchor to an internal redirect-source page and an anchor directly to the + * same canonical destination must resolve to the same `dest_page_id` (one + * row per source page, both pointing at the canonical destination), and the + * broken/external judgment must use the canonical destination, not the + * literal redirect-source. + */ +describe('computeAnchorFactRows — redirect resolution', () => { + const workingDir = path.resolve( + __dirname, + '__test_fixtures_compute_anchor_fact_rows_redirect__', + ); + let archive: InstanceType<typeof Archive>; + const archiveFilePath = path.resolve( + workingDir, + 'compute-anchor-fact-rows-redirect-test.nitpicker', + ); + + beforeAll(async () => { + const { mkdirSync } = await import('node:fs'); + mkdirSync(workingDir, { recursive: true }); + archive = await Archive.create({ filePath: archiveFilePath, cwd: workingDir }); + await archive.setConfig(BASE_CONFIG); + + await archive.setPage({ + url: parseUrl('https://example.com/direct')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Direct' }, + anchorList: [ + { + href: parseUrl('https://example.com/canonical-target')!, + isExternal: false, + title: null, + textContent: 'Direct link', + }, + ], + imageList: [], + isSkipped: false, + }); + + await archive.setPage({ + url: parseUrl('https://example.com/via-redirect')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Via redirect' }, + anchorList: [ + { + href: parseUrl('https://example.com/old')!, + isExternal: false,
+ title: null, + textContent: 'Old link', + hash: null, + }, + ], + imageList: [], + isSkipped: false, + }); + + await archive.setPage({ + url: parseUrl('https://example.com/canonical-target')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 404, + statusText: 'Not Found', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + + await archive.setRedirect({ + url: parseUrl('https://example.com/old')!, + redirectPaths: ['https://example.com/canonical-target'], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + }); + + afterAll(async () => { + if (archive) { + await archive.releaseHandle(); + } + const { rmSync } = await import('node:fs'); + rmSync(workingDir, { recursive: true, force: true }); + }); + + it('collapses a redirect-source anchor and a direct anchor onto the same canonical dest_page_id, judged broken via the canonical status', async () => { + const knex = archive.getKnex(); + const rows = await knex.transaction((trx) => computeAnchorFactRows(trx)); + const targetRows = rows.filter( + (row) => row.dest_url_sort_key === 'https://example.com/canonical-target', + ); + expect(targetRows).toHaveLength(2); + expect(new Set(targetRows.map((row) => row.dest_page_id)).size).toBe(1); + for (const row of targetRows) { + expect(row).toMatchObject({ status: 404, is_broken: 1, count: 1 }); + } + }); +}); + +/** + * Mirrors the internal-destination redirect-resolution block above, but for + * a canonical destination that is itself external — `is_external_link` must + * come from the canonical page's `isExternal`, not the (always-internal) + * redirect-source page's, and the two anchors (one direct, one via an + * internal redirect-source) must still collapse onto 
one `dest_page_id`. + */ +describe('computeAnchorFactRows — redirect resolution to an external destination', () => { + const workingDir = path.resolve( + __dirname, + '__test_fixtures_compute_anchor_fact_rows_redirect_external__', + ); + let archive: InstanceType<typeof Archive>; + const archiveFilePath = path.resolve( + workingDir, + 'compute-anchor-fact-rows-redirect-external-test.nitpicker', + ); + + beforeAll(async () => { + const { mkdirSync } = await import('node:fs'); + mkdirSync(workingDir, { recursive: true }); + archive = await Archive.create({ filePath: archiveFilePath, cwd: workingDir }); + await archive.setConfig(BASE_CONFIG); + + await archive.setPage({ + url: parseUrl('https://example.com/direct-ext')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Direct ext' }, + anchorList: [ + { + href: parseUrl('https://external.example.com/target')!, + isExternal: true, + title: null, + textContent: 'Direct external link', + }, + ], + imageList: [], + isSkipped: false, + }); + + await archive.setPage({ + url: parseUrl('https://example.com/via-redirect-ext')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Via redirect ext' }, + anchorList: [ + { + href: parseUrl('https://example.com/old-ext')!, + isExternal: false, + title: null, + textContent: 'Old external link', + hash: null, + }, + ], + imageList: [], + isSkipped: false, + }); + + await archive.setPage({ + url: parseUrl('https://external.example.com/target')!, + redirectPaths: [], + isExternal: true, + isTarget: false, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + 
}); + + await archive.setRedirect({ + url: parseUrl('https://example.com/old-ext')!, + redirectPaths: ['https://external.example.com/target'], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + }); + + afterAll(async () => { + if (archive) { + await archive.releaseHandle(); + } + const { rmSync } = await import('node:fs'); + rmSync(workingDir, { recursive: true, force: true }); + }); + + it('collapses a redirect-source anchor and a direct anchor onto the same external canonical dest_page_id, flagged external via the canonical page', async () => { + const knex = archive.getKnex(); + const rows = await knex.transaction((trx) => computeAnchorFactRows(trx)); + const targetRows = rows.filter( + (row) => row.dest_url_sort_key === 'https://external.example.com/target', + ); + expect(targetRows).toHaveLength(2); + expect(new Set(targetRows.map((row) => row.dest_page_id)).size).toBe(1); + for (const row of targetRows) { + expect(row).toMatchObject({ + status: 200, + is_broken: 0, + is_external_link: 1, + count: 1, + }); + } + }); +}); diff --git a/packages/@nitpicker/query/src/viewer-read-model/compute-anchor-fact-rows.ts b/packages/@nitpicker/query/src/viewer-read-model/compute-anchor-fact-rows.ts new file mode 100644 index 00000000..dba94774 --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-read-model/compute-anchor-fact-rows.ts @@ -0,0 +1,70 @@ +import type { AnchorFactInsertRow } from './types.js'; +import type { Knex } from 'knex'; + +import { NULL_STATUS_SENTINEL } from './null-status-sentinel.js'; + +/** + * Computes one row per unique `(source_page_id, dest_page_id)` pair for + * bulk insert into `viewer_anchor_facts`, for `viewer_external_links` to + * derive its summary from afterward (`deriveExternalLinkSummaryRows`) — + * this is the only `anchors` scan the read-model build 
performs for either + * table. + * + * Redirect resolution (`COALESCE(canonical.*, dest.*)`) and the broken-link + * definition (`status = 404` strictly — see `list-links.ts`'s scope note: + * 403/5xx/unknown never count as broken) are lifted verbatim from + * `list-links.ts`/`list-external-links.ts`'s live queries. Duplicate + * anchors between the same pair (e.g. a nav link repeated in header and + * footer) collapse into one row via `count` — see + * ARCHITECTURE.md「設計注意(viewer_anchor_facts read model、issue + * #114)」for why this is a genuine read/write/storage improvement, not + * just a shortcut. + * @param trx - An open Knex transaction (a plain `Knex` instance also + * works, e.g. in tests). + * @returns One row per unique `(source_page_id, dest_page_id)` pair. + */ +export async function computeAnchorFactRows(trx: Knex): Promise<AnchorFactInsertRow[]> { + const destIdExpression = 'COALESCE("canonical"."id", "dest"."id")'; + const destUrlExpression = 'COALESCE("canonical"."url", "dest"."url")'; + const statusExpression = 'COALESCE("canonical"."status", "dest"."status")'; + const isExternalExpression = 'COALESCE("canonical"."isExternal", "dest"."isExternal")'; + + const rows: { + sourcePageId: number; + destPageId: number; + sourceUrl: string; + destUrl: string; + status: number | null; + isExternal: 0 | 1; + count: number; + }[] = await trx('anchors') + .join('pages as source', 'anchors.pageId', '=', 'source.id') + .join('pages as dest', 'anchors.hrefId', '=', 'dest.id') + .leftJoin('pages as canonical', 'dest.redirectDestId', '=', 'canonical.id') + .groupBy('source.id', trx.raw(destIdExpression)) + .select( + 'source.id as sourcePageId', + trx.raw(`${destIdExpression} as "destPageId"`), + 'source.url as sourceUrl', + trx.raw(`${destUrlExpression} as "destUrl"`), + trx.raw(`${statusExpression} as "status"`), + trx.raw(`${isExternalExpression} as "isExternal"`), + trx.raw('count(*) as "count"'), + ); + + return rows.map((row) => { + const statusSortKey = row.status ?? 
NULL_STATUS_SENTINEL; + return { + source_page_id: row.sourcePageId, + dest_page_id: row.destPageId, + source_url_sort_key: row.sourceUrl, + dest_url_sort_key: row.destUrl, + status: row.status, + status_sort_key: statusSortKey, + status_desc_key: -statusSortKey, + count: Number(row.count), + is_broken: row.status === 404 ? 1 : 0, + is_external_link: row.isExternal ? 1 : 0, + }; + }); +} diff --git a/packages/@nitpicker/query/src/viewer-read-model/compute-external-link-rows.spec.ts b/packages/@nitpicker/query/src/viewer-read-model/compute-external-link-rows.spec.ts deleted file mode 100644 index bde2b0d6..00000000 --- a/packages/@nitpicker/query/src/viewer-read-model/compute-external-link-rows.spec.ts +++ /dev/null @@ -1,321 +0,0 @@ -import path from 'node:path'; - -import { tryParseUrl as parseUrl } from '@d-zero/shared/parse-url'; -import { Archive } from '@nitpicker/crawler'; -import { afterAll, beforeAll, describe, expect, it } from 'vitest'; - -import { computeExternalLinkRows } from './compute-external-link-rows.js'; - -const __filename = new URL(import.meta.url).pathname; -const __dirname = path.dirname(__filename); - -const BASE_CONFIG = { - baseUrl: 'https://example.com', - name: 'test', - version: '0.10.0', - recursive: true, - interval: 0, - image: true, - fetchExternal: false, - parallels: 1, - roots: ['https://example.com'], - excludes: [], - excludeKeywords: [], - excludeUrls: [], - maxExcludedDepth: 0, - retry: 3, - fromList: false, - disableQueries: false, - userAgent: 'test', - ignoreRobots: false, -}; - -const META = { - lang: null, - title: null, - description: null, - keywords: null, - noindex: false, - nofollow: false, - noarchive: false, - canonical: null, - alternate: null, - 'og:type': null, - 'og:title': null, - 'og:site_name': null, - 'og:description': null, - 'og:url': null, - 'og:image': null, - 'twitter:card': null, -}; - -describe('computeExternalLinkRows', () => { - const workingDir = path.resolve( - __dirname, - 
'__test_fixtures_compute_external_link_rows__', - ); - let archive: InstanceType; - const archiveFilePath = path.resolve( - workingDir, - 'compute-external-link-rows-test.nitpicker', - ); - - beforeAll(async () => { - const { mkdirSync } = await import('node:fs'); - mkdirSync(workingDir, { recursive: true }); - archive = await Archive.create({ filePath: archiveFilePath, cwd: workingDir }); - await archive.setConfig(BASE_CONFIG); - - // Page A: two anchors to ads.example.com (same page, must count as one - // referrer, not two), plus one to tracking. - await archive.setPage({ - url: parseUrl('https://example.com/page-a')!, - redirectPaths: [], - isExternal: false, - isTarget: true, - status: 200, - statusText: 'OK', - contentType: 'text/html', - contentLength: 100, - responseHeaders: {}, - html: '', - meta: { ...META, title: 'Page A' }, - anchorList: [ - { - href: parseUrl('https://ads.example.com/')!, - isExternal: true, - title: null, - textContent: 'Ad banner', - }, - { - href: parseUrl('https://ads.example.com/')!, - isExternal: true, - title: null, - textContent: 'Ad footer', - }, - { - href: parseUrl('https://tracking.example.com/')!, - isExternal: true, - title: null, - textContent: 'Tracking', - }, - ], - imageList: [], - isSkipped: false, - }); - - // Page B: a second, distinct referrer to ads.example.com. 
- await archive.setPage({ - url: parseUrl('https://example.com/page-b')!, - redirectPaths: [], - isExternal: false, - isTarget: true, - status: 200, - statusText: 'OK', - contentType: 'text/html', - contentLength: 100, - responseHeaders: {}, - html: '', - meta: { ...META, title: 'Page B' }, - anchorList: [ - { - href: parseUrl('https://ads.example.com/')!, - isExternal: true, - title: null, - textContent: 'Ad sidebar', - }, - ], - imageList: [], - isSkipped: false, - }); - - await archive.setPage({ - url: parseUrl('https://ads.example.com/')!, - redirectPaths: [], - isExternal: true, - isTarget: false, - status: 200, - statusText: 'OK', - contentType: 'text/html', - contentLength: 100, - responseHeaders: {}, - html: '', - meta: META, - anchorList: [], - imageList: [], - isSkipped: false, - }); - await archive.setPage({ - url: parseUrl('https://tracking.example.com/')!, - redirectPaths: [], - isExternal: true, - isTarget: false, - status: 404, - statusText: 'Not Found', - contentType: 'text/html', - contentLength: 0, - responseHeaders: {}, - html: '', - meta: META, - anchorList: [], - imageList: [], - isSkipped: false, - }); - }); - - afterAll(async () => { - if (archive) { - await archive.releaseHandle(); - } - const { rmSync } = await import('node:fs'); - rmSync(workingDir, { recursive: true, force: true }); - }); - - it('groups anchors by canonical destination, one row per unique destination', async () => { - const knex = archive.getKnex(); - const rows = await knex.transaction((trx) => computeExternalLinkRows(trx)); - expect(rows).toHaveLength(2); - }); - - it('counts referrers by distinct page id, not anchor count', async () => { - // Page A has two tags to ads.example.com; combined with page B - // that's 2 distinct referring pages, not 3 anchors. 
- const knex = archive.getKnex(); - const rows = await knex.transaction((trx) => computeExternalLinkRows(trx)); - const ads = rows.find((row) => row.dest_url === 'https://ads.example.com'); - expect(ads).toMatchObject({ status: 200, referrer_count: 2 }); - }); - - it('carries the canonical destination status through', async () => { - const knex = archive.getKnex(); - const rows = await knex.transaction((trx) => computeExternalLinkRows(trx)); - const tracking = rows.find((row) => row.dest_url === 'https://tracking.example.com'); - expect(tracking).toMatchObject({ status: 404, referrer_count: 1 }); - }); -}); - -/** - * Mirrors `list-external-links.spec.ts`'s redirect-resolution describe - * block: an anchor to an internal redirect-source page and an anchor - * directly to the same external canonical destination must collapse into a - * single `viewer_external_links` row, not two. - */ -describe('computeExternalLinkRows — redirect resolution', () => { - const workingDir = path.resolve( - __dirname, - '__test_fixtures_compute_external_link_rows_redirect__', - ); - let archive: InstanceType; - const archiveFilePath = path.resolve( - workingDir, - 'compute-external-link-rows-redirect-test.nitpicker', - ); - - beforeAll(async () => { - const { mkdirSync } = await import('node:fs'); - mkdirSync(workingDir, { recursive: true }); - archive = await Archive.create({ filePath: archiveFilePath, cwd: workingDir }); - await archive.setConfig(BASE_CONFIG); - - await archive.setPage({ - url: parseUrl('https://example.com/direct')!, - redirectPaths: [], - isExternal: false, - isTarget: true, - status: 200, - statusText: 'OK', - contentType: 'text/html', - contentLength: 100, - responseHeaders: {}, - html: '', - meta: { ...META, title: 'Direct' }, - anchorList: [ - { - href: parseUrl('https://redirect-target.example.com/')!, - isExternal: true, - title: null, - textContent: 'Direct link', - }, - ], - imageList: [], - isSkipped: false, - }); - - await archive.setPage({ - url: 
parseUrl('https://example.com/via-redirect')!, - redirectPaths: [], - isExternal: false, - isTarget: true, - status: 200, - statusText: 'OK', - contentType: 'text/html', - contentLength: 100, - responseHeaders: {}, - html: '', - meta: { ...META, title: 'Via redirect' }, - anchorList: [ - { - href: parseUrl('https://example.com/old')!, - isExternal: false, - title: null, - textContent: 'Old link', - hash: null, - }, - ], - imageList: [], - isSkipped: false, - }); - - await archive.setPage({ - url: parseUrl('https://redirect-target.example.com/')!, - redirectPaths: [], - isExternal: true, - isTarget: false, - status: 200, - statusText: 'OK', - contentType: 'text/html', - contentLength: 100, - responseHeaders: {}, - html: '', - meta: META, - anchorList: [], - imageList: [], - isSkipped: false, - }); - - await archive.setRedirect({ - url: parseUrl('https://example.com/old')!, - redirectPaths: ['https://redirect-target.example.com/'], - isExternal: false, - isTarget: true, - status: 200, - statusText: 'OK', - contentType: 'text/html', - contentLength: 100, - responseHeaders: {}, - html: '', - meta: META, - anchorList: [], - imageList: [], - isSkipped: false, - }); - }); - - afterAll(async () => { - if (archive) { - await archive.releaseHandle(); - } - const { rmSync } = await import('node:fs'); - rmSync(workingDir, { recursive: true, force: true }); - }); - - it('collapses a redirect-source anchor and a direct anchor onto the same canonical destination row', async () => { - const knex = archive.getKnex(); - const rows = await knex.transaction((trx) => computeExternalLinkRows(trx)); - expect(rows).toHaveLength(1); - expect(rows[0]).toMatchObject({ - dest_url: 'https://redirect-target.example.com', - referrer_count: 2, - }); - }); -}); diff --git a/packages/@nitpicker/query/src/viewer-read-model/compute-external-link-rows.ts b/packages/@nitpicker/query/src/viewer-read-model/compute-external-link-rows.ts deleted file mode 100644 index a09dc2af..00000000 --- 
a/packages/@nitpicker/query/src/viewer-read-model/compute-external-link-rows.ts +++ /dev/null @@ -1,54 +0,0 @@ -import type { ExternalLinkInsertRow } from './types.js'; -import type { Knex } from 'knex'; - -/** - * Computes every unique external destination reached from the site, for - * bulk insert into `viewer_external_links`. - * - * The aggregation itself (`COALESCE(canonical.*, dest.*)` redirect - * resolution, `GROUP BY` on the canonical destination id, `COUNT(DISTINCT - * source.id)` for the referrer count) is lifted verbatim from - * `list-external-links.ts`'s live query — see that file's docs for why the - * counting grain must stay in lockstep with `getPageDetail.inboundLinks` - * (#71). The only difference here is that this runs once, at read-model - * build time, against the full `anchors` table with no `LIMIT`/`OFFSET` — - * see ARCHITECTURE.md「設計注意(外部リンク read model)」for why running - * this JOIN + `GROUP BY` + `COUNT(DISTINCT ...)` combination on every - * `/api/links?type=external` request is a known SQLite performance - * pitfall, and why materialising it once avoids it. - * @param trx - An open Knex transaction (a plain `Knex` instance also - * works, e.g. in tests). - * @returns One row per unique canonical external destination. 
- */ -export async function computeExternalLinkRows( - trx: Knex, -): Promise { - const destIdExpression = 'COALESCE("canonical"."id", "dest"."id")'; - const destUrlExpression = 'COALESCE("canonical"."url", "dest"."url")'; - const statusExpression = 'COALESCE("canonical"."status", "dest"."status")'; - - const rows: { - destPageId: number; - destUrl: string; - status: number | null; - referrerCount: number; - }[] = await trx('anchors') - .join('pages as source', 'anchors.pageId', '=', 'source.id') - .join('pages as dest', 'anchors.hrefId', '=', 'dest.id') - .leftJoin('pages as canonical', 'dest.redirectDestId', '=', 'canonical.id') - .whereRaw(`COALESCE("canonical"."isExternal", "dest"."isExternal") = 1`) - .groupBy(trx.raw(destIdExpression)) - .select( - trx.raw(`${destIdExpression} as "destPageId"`), - trx.raw(`${destUrlExpression} as "destUrl"`), - trx.raw(`${statusExpression} as "status"`), - trx.raw('count(distinct "source"."id") as "referrerCount"'), - ); - - return rows.map((row) => ({ - dest_page_id: row.destPageId, - dest_url: row.destUrl, - status: row.status, - referrer_count: Number(row.referrerCount), - })); -} diff --git a/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.spec.ts b/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.spec.ts index 570ef951..dc9fdff6 100644 --- a/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.spec.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.spec.ts @@ -30,7 +30,7 @@ describe('createViewerReadModelTables', () => { rmSync(workingDir, { recursive: true, force: true }); }); - it('creates all 8 tables and the named viewer_pages indexes', async () => { + it('creates all 9 tables and the named viewer_pages indexes', async () => { const knex = archive.getKnex(); await knex.transaction((trx) => createViewerReadModelTables(trx)); @@ -43,6 +43,7 @@ describe('createViewerReadModelTables', () => { 
'viewer_directory_nodes', 'viewer_directory_pages', 'viewer_external_links', + 'viewer_anchor_facts', ]) { expect(await knex.schema.hasTable(table)).toBe(true); } @@ -71,6 +72,21 @@ describe('createViewerReadModelTables', () => { for (const indexName of ['vel_url', 'vel_status', 'vel_referrer_count']) { expect(externalLinkIndexNames.has(indexName)).toBe(true); } + + const anchorFactIndexRows: Array<{ name: string }> = await knex('sqlite_master') + .where({ type: 'index', tbl_name: 'viewer_anchor_facts' }) + .select('name'); + const anchorFactIndexNames = new Set(anchorFactIndexRows.map((r) => r.name)); + for (const indexName of [ + 'vaf_broken_source', + 'vaf_broken_dest', + 'vaf_broken_status', + 'vaf_broken_status_desc', + 'vaf_source', + 'vaf_dest', + ]) { + expect(anchorFactIndexNames.has(indexName)).toBe(true); + } }); it('viewer_query_profiles enforces a composite (scope, profile_key) key, not a single-column rowid', async () => { diff --git a/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.ts b/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.ts index ecfed6de..0644ae20 100644 --- a/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.ts @@ -1,14 +1,14 @@ import type { Knex } from 'knex'; /** - * Creates all 8 viewer-read-model tables (and `viewer_pages`'s named + * Creates all 9 viewer-read-model tables (and `viewer_pages`'s named * indexes) against the given connection. Assumes none of the tables * currently exist — callers (`buildViewerReadModel`) are responsible for * dropping any prior version first, inside the same transaction, so this * function is not itself idempotent. 
* * Every statement runs via `raw()` rather than knex's chainable schema - * builder: 5 of the 8 tables need `WITHOUT ROWID` / a composite primary key + * builder: 5 of the 9 tables need `WITHOUT ROWID` / a composite primary key * / a `CHECK` constraint / a table-level `UNIQUE` constraint, none of which * the chainable builder can express (the same reason `page_html_blobs` / * `page_html_ref` drop to `raw()` in `@nitpicker/crawler`'s @@ -159,14 +159,16 @@ export async function createViewerReadModelTables(trx: Knex): Promise { ); // Pre-aggregated, deduplicated-by-canonical-destination external link - // list — see `computeExternalLinkRows`'s docs for why this needs its own - // `anchors` query rather than reusing `viewer_pages`'s `sourceRows` (the - // aggregation joins `anchors` at build time instead of on every read, - // see ARCHITECTURE.md「設計注意(外部リンク read model)」for the - // SQLite COUNT(DISTINCT)/GROUP BY performance rationale). No + // summary — derived in memory from `viewer_anchor_facts` rows (see + // `deriveExternalLinkSummaryRows`'s docs) rather than its own `anchors` + // scan, so building this table costs no extra JOIN over the one + // `computeAnchorFactRows` already does. See ARCHITECTURE.md「設計注意 + // (viewer_anchor_facts read model、issue #114)」for the SQLite + // COUNT(DISTINCT)/GROUP BY performance rationale this sidesteps. No // `_desc_key` columns like `viewer_pages` needs: pagination here is - // plain offset-based (via `paginateQuery`), not keyset-cursor, so a - // single ascending index scanned backward is enough for DESC. + // plain offset-based (via + // `paginateQuery`), not keyset-cursor, so a single ascending index + // scanned backward is enough for DESC. 
await trx.raw(` CREATE TABLE viewer_external_links ( dest_page_id integer primary key, @@ -182,4 +184,58 @@ export async function createViewerReadModelTables(trx: Knex): Promise { await trx.raw( 'CREATE INDEX vel_referrer_count ON viewer_external_links(referrer_count, dest_url, dest_page_id)', ); + + // Edge-level (one row per unique (source_page_id, dest_page_id) pair, + // with `count` absorbing duplicate anchor observations between the same + // pair) fact table backing broken-link listing. Deliberately has no + // `url_refs`/`content_items` ref-table indirection (issue #139 — not + // landed, and #103's own execution order places it after this table): + // `source_url_sort_key`/`dest_url_sort_key` are inline text, copied at + // build time exactly like `viewer_pages.url_sort_key`, so indexed + // `ORDER BY` works without a pre-join. Full URL text for the OTHER + // (non-sort-key) display columns is resolved by joining back to `pages` + // only after the id set is limit-bounded (same limit-before-join + // pattern as `joinViewerPageIdsToListItems`). `is_external_link` is + // stored (SQLite INTEGER 0/1 costs ~0 bytes) but intentionally has no + // index: nothing reads this table filtered by it — it exists only for + // `deriveExternalLinkSummaryRows`'s in-memory pass over the full row + // set at build time. `status_desc_key` mirrors `viewer_pages`'s same + // column for the same reason: `docs/viewer-sql-query-plan.md`'s Stable + // Ordering rule keeps the `source_url_sort_key`/`edge_id` tie-breakers + // ascending even when the primary sort is `status desc` — a row-value + // keyset tuple comparison can't mix per-column directions, so the + // primary column is negated and walked ascending instead. See + // ARCHITECTURE.md「設計注意(viewer_anchor_facts read model、issue + // #114)」for the full read/write/storage rationale. 
+ await trx.raw(` + CREATE TABLE viewer_anchor_facts ( + edge_id integer primary key, + source_page_id integer not null, + dest_page_id integer not null, + source_url_sort_key text not null, + dest_url_sort_key text not null, + status integer, + status_sort_key integer not null, + status_desc_key integer not null, + count integer not null, + is_broken integer not null, + is_external_link integer not null + ) + `); + await trx.raw( + 'CREATE INDEX vaf_broken_source ON viewer_anchor_facts(is_broken, source_url_sort_key, edge_id)', + ); + await trx.raw( + 'CREATE INDEX vaf_broken_dest ON viewer_anchor_facts(is_broken, dest_url_sort_key, edge_id)', + ); + await trx.raw( + 'CREATE INDEX vaf_broken_status ON viewer_anchor_facts(is_broken, status_sort_key, source_url_sort_key, edge_id)', + ); + await trx.raw( + 'CREATE INDEX vaf_broken_status_desc ON viewer_anchor_facts(is_broken, status_desc_key, source_url_sort_key, edge_id)', + ); + await trx.raw( + 'CREATE INDEX vaf_source ON viewer_anchor_facts(source_page_id, edge_id)', + ); + await trx.raw('CREATE INDEX vaf_dest ON viewer_anchor_facts(dest_page_id, edge_id)'); } diff --git a/packages/@nitpicker/query/src/viewer-read-model/derive-external-link-summary-rows.spec.ts b/packages/@nitpicker/query/src/viewer-read-model/derive-external-link-summary-rows.spec.ts new file mode 100644 index 00000000..93c720a8 --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-read-model/derive-external-link-summary-rows.spec.ts @@ -0,0 +1,99 @@ +import type { AnchorFactInsertRow } from './types.js'; + +import { describe, expect, it } from 'vitest'; + +import { deriveExternalLinkSummaryRows } from './derive-external-link-summary-rows.js'; + +/** + * Builds a minimal {@link AnchorFactInsertRow} with sensible defaults, + * overridable per test. + * @param overrides - Fields to override. + * @returns The constructed row. 
+ */ +function makeFact(overrides: Partial<AnchorFactInsertRow>): AnchorFactInsertRow { + return { + source_page_id: 1, + dest_page_id: 100, + source_url_sort_key: 'https://example.com/page', + dest_url_sort_key: 'https://ads.example.com', + status: 200, + status_sort_key: 200, + status_desc_key: -200, + count: 1, + is_broken: 0, + is_external_link: 1, + ...overrides, + }; +} + +describe('deriveExternalLinkSummaryRows', () => { + it('returns an empty array when there are no external-link facts', () => { + const facts = [makeFact({ is_external_link: 0 })]; + expect(deriveExternalLinkSummaryRows(facts)).toEqual([]); + }); + + it('excludes broken (non-external) facts from the summary', () => { + const facts = [ + makeFact({ source_page_id: 1, is_external_link: 0, is_broken: 1 }), + makeFact({ source_page_id: 2, is_external_link: 1 }), + ]; + expect(deriveExternalLinkSummaryRows(facts)).toEqual([ + { + dest_page_id: 100, + dest_url: 'https://ads.example.com', + status: 200, + referrer_count: 1, + }, + ]); + }); + + it('counts referrer_count as the number of distinct-source edge rows sharing a destination', () => { + const facts = [ + makeFact({ source_page_id: 1 }), + makeFact({ source_page_id: 2 }), + makeFact({ source_page_id: 3 }), + ]; + const [summary] = deriveExternalLinkSummaryRows(facts); + expect(summary).toMatchObject({ dest_page_id: 100, referrer_count: 3 }); + }); + + it('does not inflate referrer_count using the edge-level count column (duplicate anchors already collapsed upstream)', () => { + const facts = [makeFact({ source_page_id: 1, count: 5 })]; + const [summary] = deriveExternalLinkSummaryRows(facts); + expect(summary).toMatchObject({ referrer_count: 1 }); + }); + + it('produces one summary row per unique dest_page_id', () => { + const facts = [ + makeFact({ + source_page_id: 1, + dest_page_id: 100, + dest_url_sort_key: 'https://ads.example.com', + }), + makeFact({ + source_page_id: 1, + dest_page_id: 200, + dest_url_sort_key: 'https://tracking.example.com', + status: 
404, + }), + ]; + const summaries = deriveExternalLinkSummaryRows(facts); + expect(summaries).toHaveLength(2); + expect(summaries).toEqual( + expect.arrayContaining([ + { + dest_page_id: 100, + dest_url: 'https://ads.example.com', + status: 200, + referrer_count: 1, + }, + { + dest_page_id: 200, + dest_url: 'https://tracking.example.com', + status: 404, + referrer_count: 1, + }, + ]), + ); + }); +}); diff --git a/packages/@nitpicker/query/src/viewer-read-model/derive-external-link-summary-rows.ts b/packages/@nitpicker/query/src/viewer-read-model/derive-external-link-summary-rows.ts new file mode 100644 index 00000000..a4af50e0 --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-read-model/derive-external-link-summary-rows.ts @@ -0,0 +1,37 @@ +import type { AnchorFactInsertRow, ExternalLinkInsertRow } from './types.js'; + +/** + * Derives `viewer_external_links` rows from already-computed + * {@link AnchorFactInsertRow} rows — no `anchors` scan of its own. + * + * `AnchorFactInsertRow` is already deduplicated one row per unique + * `(source_page_id, dest_page_id)` pair, so the number of `is_external_link` + * rows sharing a `dest_page_id` IS the distinct-referrer count — equivalent + * to `COUNT(DISTINCT source.id)` over the raw `anchors` table, but computed + * by counting already-grouped rows instead of a second aggregation pass. + * @param anchorFacts - The full `viewer_anchor_facts` row set for this + * build (as computed by `computeAnchorFactRows`). + * @returns One row per unique external destination. 
+ */ +export function deriveExternalLinkSummaryRows( + anchorFacts: readonly AnchorFactInsertRow[], +): ExternalLinkInsertRow[] { + const summaries = new Map<number, ExternalLinkInsertRow>(); + for (const fact of anchorFacts) { + if (!fact.is_external_link) { + continue; + } + const existing = summaries.get(fact.dest_page_id); + if (existing) { + existing.referrer_count += 1; + } else { + summaries.set(fact.dest_page_id, { + dest_page_id: fact.dest_page_id, + dest_url: fact.dest_url_sort_key, + status: fact.status, + referrer_count: 1, + }); + } + } + return [...summaries.values()]; +} diff --git a/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.spec.ts b/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.spec.ts index f873e94e..75fa4b37 100644 --- a/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.spec.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.spec.ts @@ -37,7 +37,7 @@ describe('dropViewerReadModelTables', () => { ).resolves.toBeUndefined(); }); - it('drops all 8 tables after they were created', async () => { + it('drops all 9 tables after they were created', async () => { const knex = archive.getKnex(); await knex.transaction((trx) => createViewerReadModelTables(trx)); for (const table of [ @@ -49,6 +49,7 @@ describe('dropViewerReadModelTables', () => { 'viewer_directory_nodes', 'viewer_directory_pages', 'viewer_external_links', + 'viewer_anchor_facts', ]) { expect(await knex.schema.hasTable(table)).toBe(true); } @@ -63,6 +64,7 @@ describe('dropViewerReadModelTables', () => { 'viewer_directory_nodes', 'viewer_directory_pages', 'viewer_external_links', + 'viewer_anchor_facts', ]) { expect(await knex.schema.hasTable(table)).toBe(false); } diff --git a/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.ts b/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.ts index 5f665079..60666545 100644 --- 
a/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.ts @@ -1,16 +1,17 @@ import type { Knex } from 'knex'; /** - * Drops all 8 viewer-read-model tables if present, against the given + * Drops all 9 viewer-read-model tables if present, against the given * connection. Shared by `buildViewerReadModel` (which drops before * recreating, inside its own rebuild transaction) and - * `dropViewerReadModel` (which drops with no recreate), so the 8-table + * `dropViewerReadModel` (which drops with no recreate), so the 9-table * list only needs to be kept in sync with `createViewerReadModelTables` * in one place. * @param trx - An open Knex transaction (a plain `Knex` instance also * works, e.g. in tests). */ export async function dropViewerReadModelTables(trx: Knex): Promise<void> { + await trx.schema.dropTableIfExists('viewer_anchor_facts'); await trx.schema.dropTableIfExists('viewer_external_links'); await trx.schema.dropTableIfExists('viewer_directory_pages'); await trx.schema.dropTableIfExists('viewer_directory_nodes'); diff --git a/packages/@nitpicker/query/src/viewer-read-model/null-status-sentinel.ts b/packages/@nitpicker/query/src/viewer-read-model/null-status-sentinel.ts new file mode 100644 index 00000000..32bb96c3 --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-read-model/null-status-sentinel.ts @@ -0,0 +1,26 @@ +/** + * Sentinel `status_sort_key` value substituted for `null` status (errored / + * not-yet-classified rows, or destinations never fetched). Chosen smaller + * than any real HTTP status code (100-599) so unknown-status rows keep + * sorting first in ascending order — matching the legacy write-model + * queries' prior behavior of ordering directly on the nullable `status` + * column, where SQLite treats `NULL` as smaller than any value. 
+ * + * Deliberately distinct from `-1`, which `Database.resetFailedPages` already + * uses as the "hard failure" HTTP status sentinel (see that function's docs) + * — reusing `-1` here would conflate two different populations of rows in + * `status_sort_key` ordering and in any future `status = -1` equality filter. + * + * Keyset cursor comparisons need a NEVER-`null` sort-key column: SQL's + * three-valued logic makes `NULL > x` / `NULL < x` always evaluate to + * `NULL` (never true), which would silently break tuple comparisons like + * `(status_sort_key, url_sort_key, page_id) > (?, ?, ?)` for rows whose + * status is unknown. Substituting a sentinel keeps every row on this column + * strictly orderable. + * + * Shared by `viewer_pages` (`build-viewer-read-model.ts`) and + * `viewer_anchor_facts` (`compute-anchor-fact-rows.ts`) so the same + * status-ordering convention holds across both keyset-paginated read + * models. + */ +export const NULL_STATUS_SENTINEL = -32_768; diff --git a/packages/@nitpicker/query/src/viewer-read-model/types.ts b/packages/@nitpicker/query/src/viewer-read-model/types.ts index 954c3b7d..4aef84e3 100644 --- a/packages/@nitpicker/query/src/viewer-read-model/types.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/types.ts @@ -161,7 +161,8 @@ export interface DirectoryTreeBuildResult { /** * One row to insert into `viewer_external_links`, one per unique canonical * (redirect-resolved) external destination. Produced by - * `computeExternalLinkRows`. + * `deriveExternalLinkSummaryRows` from the already-computed + * {@link AnchorFactInsertRow} set — no separate `anchors` scan. */ export interface ExternalLinkInsertRow { /** `COALESCE(canonical.id, dest.id)` — the canonical destination's `pages.id`. */ @@ -171,10 +172,55 @@ export interface ExternalLinkInsertRow { /** `COALESCE(canonical.status, dest.status)` — the canonical destination's HTTP status, or `null` if unknown. 
*/ status: number | null; /** - * `COUNT(DISTINCT source.id)` — the number of distinct internal pages - * linking to this destination. Must stay in the same counting grain as - * `getPageDetail.inboundLinks` (see that function's docs, #71) — - * multiple anchors from the same page count once. + * The number of distinct internal pages linking to this destination — + * the count of {@link AnchorFactInsertRow} rows sharing this + * `dest_page_id`, since those rows are already deduplicated one-per- + * `(source_page_id, dest_page_id)` pair. Must stay in the same counting + * grain as `getPageDetail.inboundLinks` (see that function's docs, #71) + * — multiple anchors from the same page count once. */ referrer_count: number; } + +/** + * One row to insert into `viewer_anchor_facts`, one per unique + * `(source_page_id, dest_page_id)` pair — duplicate anchor observations + * between the same pair collapse into a single row via `count`. Produced by + * `computeAnchorFactRows`. + */ +export interface AnchorFactInsertRow { + /** `anchors.pageId` — the referring page's `pages.id`. */ + source_page_id: number; + /** `COALESCE(canonical.id, dest.id)` — the canonical destination's `pages.id`. */ + dest_page_id: number; + /** + * The referring page's URL, verbatim — copied at build time so indexed + * `ORDER BY`/keyset comparisons don't need a pre-join, the same + * rationale as `viewer_pages.url_sort_key`. + */ + source_url_sort_key: string; + /** `COALESCE(canonical.url, dest.url)`, verbatim — same rationale as {@link source_url_sort_key}. */ + dest_url_sort_key: string; + /** `COALESCE(canonical.status, dest.status)` — the canonical destination's HTTP status, or `null` if unknown. */ + status: number | null; + /** `status`, or `NULL_STATUS_SENTINEL` when `status` is `null` — see that constant's docs. 
*/ + status_sort_key: number; + /** + * The negation of {@link status_sort_key} — walking this column + * ascending yields `status desc` display order while keeping the + * `source_url_sort_key`/`edge_id` tie-breakers ascending too, the same + * `viewer_pages.status_desc_key` rationale (a row-value keyset tuple + * comparison can't mix per-column directions). + */ + status_desc_key: number; + /** Number of raw anchor observations collapsed into this `(source_page_id, dest_page_id)` row. */ + count: number; + /** `1` iff the canonical destination's status is `404` (see `list-links.ts`'s broken-link scope note — 403/5xx/unknown never count). */ + is_broken: number; + /** + * `1` iff the canonical destination is external. Not indexed — consumed + * only by `deriveExternalLinkSummaryRows`'s in-memory pass at build + * time, never by an indexed read query. + */ + is_external_link: number; +} diff --git a/packages/@nitpicker/query/src/viewer-read-model/viewer-read-model-schema-version.ts b/packages/@nitpicker/query/src/viewer-read-model/viewer-read-model-schema-version.ts index 978f7a00..19fee1bc 100644 --- a/packages/@nitpicker/query/src/viewer-read-model/viewer-read-model-schema-version.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/viewer-read-model-schema-version.ts @@ -7,4 +7,4 @@ * `viewer_read_model_meta.schema_version` to decide whether a rebuild is * needed. */ -export const VIEWER_READ_MODEL_SCHEMA_VERSION = 5; +export const VIEWER_READ_MODEL_SCHEMA_VERSION = 6; diff --git a/scripts/bench-viewer-anchor-facts.mjs b/scripts/bench-viewer-anchor-facts.mjs new file mode 100644 index 00000000..1c98c8a6 --- /dev/null +++ b/scripts/bench-viewer-anchor-facts.mjs @@ -0,0 +1,310 @@ +#!/usr/bin/env node +/** + * Benchmarks `/api/links?type=broken`'s `viewer_anchor_facts` read-model + * fast path (issue #114) on a synthetic archive with hundreds of thousands + * of anchor records — no real customer archive is ever read or referenced. 
+ * + * Records, per `docs/viewer-implementation-plan.md`'s Benchmark Contract: + * + * - page/anchor row counts, read-model build time, added DB size + * - `/api/links?type=broken` cold (first request after the just-built + * DB) and warm p50/p95 timing, per sort combination + * - `EXPLAIN QUERY PLAN` for each combination's read query + * + * "Cold"/"warm" follow the same convention as + * `bench-viewer-pages-read-model.mjs` and CLAUDE.md's `getSummary` cache + * note. + * + * USAGE + * ----- + * + * yarn build && node scripts/bench-viewer-anchor-facts.mjs + * + * Sizes (page counts) default to {50,000}; override via `BENCH_SIZES=…` + * (comma separated). Each page gets a fixed anchor fan-out, so the anchor + * (and viewer_anchor_facts) row count is roughly 8x the page count. Always + * disk-backed (never `:memory:`) — the whole point is measuring realistic + * cold-cache I/O, which an in-memory DB can't produce. + */ + +/* eslint-disable no-console, import-x/no-extraneous-dependencies */ + +import { mkdirSync, rmSync, statSync } from 'node:fs'; +import { tmpdir } from 'node:os'; +import path from 'node:path'; +import process from 'node:process'; + +import knex from 'knex'; + +import { initSchema } from '../packages/@nitpicker/crawler/lib/archive/init-schema.js'; +import { LibsqlDialect } from '../packages/@nitpicker/crawler/lib/archive/libsql-dialect.js'; +import { listViewerBrokenLinks } from '../packages/@nitpicker/query/lib/list-viewer-broken-links.js'; +import { buildViewerReadModel } from '../packages/@nitpicker/query/lib/viewer-read-model/build-viewer-read-model.js'; +import { createApp } from '../packages/@nitpicker/viewer/lib/create-app.js'; + +const SIZES = process.env.BENCH_SIZES + ? process.env.BENCH_SIZES.split(',').map((s) => Number(s.trim())) + : [50_000]; + +/** Anchors created per page — tunes the anchor:page row-count ratio. */ +const ANCHOR_FANOUT = 8; + +/** Repeated warm requests per matrix entry, for p50/p95. 
*/ +const WARM_ITERATIONS = 30; + +/** + * Sort combinations benchmarked per `broken-links-view.tsx`'s exposed sort + * controls (`sourceUrl`/`destUrl`/`status`, both directions). + */ +const MATRIX = [ + { label: 'default (sourceUrl asc)', query: 'type=broken&limit=100' }, + { + label: 'sourceUrl desc', + query: 'type=broken&limit=100&sortBy=sourceUrl&sortOrder=desc', + }, + { label: 'destUrl asc', query: 'type=broken&limit=100&sortBy=destUrl&sortOrder=asc' }, + { label: 'status asc', query: 'type=broken&limit=100&sortBy=status&sortOrder=asc' }, + { label: 'status desc', query: 'type=broken&limit=100&sortBy=status&sortOrder=desc' }, +]; + +/** + * Materialises a disk-backed synthetic archive DB with `n` `pages` rows and + * `n * ANCHOR_FANOUT` `anchors` rows. Status mix (200/301/404/500/null) + * matches `bench-viewer-pages-read-model.mjs`'s real-world skew. Anchor + * targets are deterministic offsets from the source page index, including + * one guaranteed duplicate target per page (exercises `count` dedup) and a + * regular hit rate on 404 destinations (exercises `is_broken`). + * @param {number} n - The number of page rows to insert. + * @returns {Promise<{db: import('knex').Knex, dbFilePath: string, cleanupDir: string, anchorRowCount: number}>} + * The seeded Knex instance, its backing file/dir (for size + cleanup), + * and the total anchor row count inserted. 
+ */ +async function makeDb(n) { + const cleanupDir = path.join( + tmpdir(), + `nitpicker-bench-viewer-anchor-facts-${n}-${process.pid}`, + ); + rmSync(cleanupDir, { recursive: true, force: true }); + mkdirSync(cleanupDir, { recursive: true }); + const dbFilePath = path.join(cleanupDir, 'db.sqlite'); + + const db = knex({ + client: LibsqlDialect, + connection: { filename: dbFilePath }, + useNullAsDefault: true, + }); + await initSchema(db); + + const STATUSES = [200, 200, 200, 200, 301, 404, 500, null]; + const CHUNK = 100; + + const pageRows = []; + for (let i = 0; i < n; i++) { + const padded = String(i).padStart(8, '0'); + pageRows.push({ + url: `https://example.com/page-${padded}`, + scraped: 1, + isTarget: 1, + isExternal: 0, + isSkipped: 0, + redirectDestId: null, + status: STATUSES[i % STATUSES.length], + statusText: 'OK', + contentType: 'text/html', + contentLength: 1000, + title: `Page ${padded}`, + source: 'crawled', + tag_count: 0, + jsonld_count: 0, + }); + if (pageRows.length >= CHUNK) { + await db('pages').insert(pageRows); + pageRows.length = 0; + } + } + if (pageRows.length > 0) { + await db('pages').insert(pageRows); + } + + const idRows = await db('pages').select('id').orderBy('id'); + const idByIndex = idRows.map((row) => row.id); + + let anchorRowCount = 0; + const anchorRows = []; + for (let i = 0; i < n; i++) { + const sourceId = idByIndex[i]; + for (let k = 0; k < ANCHOR_FANOUT; k++) { + // A fixed prime-step walk spreads targets across the whole page + // set deterministically; k === ANCHOR_FANOUT - 1 repeats the k=0 + // target on purpose, so every page has at least one duplicate + // (source,dest) pair collapsing into a viewer_anchor_facts row + // with count=2. + const step = k === ANCHOR_FANOUT - 1 ? 
0 : k; + const targetIndex = (i + 1 + step * 97) % n; + anchorRows.push({ pageId: sourceId, hrefId: idByIndex[targetIndex] }); + anchorRowCount++; + } + if (anchorRows.length >= CHUNK) { + await db('anchors').insert(anchorRows); + anchorRows.length = 0; + } + } + if (anchorRows.length > 0) { + await db('anchors').insert(anchorRows); + } + + return { db, dbFilePath, cleanupDir, anchorRowCount }; +} + +/** + * Builds the viewer read model against the seeded DB, timing the build and + * measuring the DB file's size delta. + * @param {import('knex').Knex} db - The seeded Knex instance. + * @param {string} dbFilePath - The DB's backing file path (for `statSync`). + * @returns {Promise<{buildMs: number, sizeBeforeBytes: number, sizeAfterBytes: number, anchorFactRowCount: number}>} + * Build timing and size metrics. + */ +async function buildReadModel(db, dbFilePath) { + const sizeBeforeBytes = statSync(dbFilePath).size; + const accessorStub = { readOnly: false, getKnex: () => db }; + const start = process.hrtime.bigint(); + await buildViewerReadModel(accessorStub); + const buildMs = Number(process.hrtime.bigint() - start) / 1e6; + const sizeAfterBytes = statSync(dbFilePath).size; + + const anchorFactRowCount = await db('viewer_anchor_facts').count('* as count'); + + return { + buildMs, + sizeBeforeBytes, + sizeAfterBytes, + anchorFactRowCount: Number(anchorFactRowCount[0]?.count ?? 0), + }; +} + +/** + * Runs `EXPLAIN QUERY PLAN` for one matrix entry's window query, built via + * `db.raw` against the same `is_broken = 1` + `ORDER BY` shape + * `list-viewer-broken-links.ts`'s `readAnchorFactsWindow` issues. + * @param {import('knex').Knex} db - The Knex instance. + * @param {string} orderByColumns - The `ORDER BY` column list (no `is_broken` — that's a fixed `WHERE`). + * @returns {Promise<string>} One `|`-joined line of `EXPLAIN QUERY PLAN` detail rows.
+ */ +async function explainMatrixEntry(db, orderByColumns) { + const sql = `SELECT edge_id FROM viewer_anchor_facts WHERE is_broken = 1 ORDER BY ${orderByColumns} LIMIT 101`; + const plan = await db.raw(`EXPLAIN QUERY PLAN ${sql}`); + return plan.map((row) => row.detail).join(' | '); +} + +/** + * Times `iterations` sequential HTTP round-trips through the real Hono app + * for one query string, returning p50/p95 in milliseconds. + * @param {import('hono').Hono} app - The app under test. + * @param {string} query - The `/api/links` query string (no leading `?`). + * @param {number} iterations - Number of warm requests to time. + * @returns {Promise<{p50: number, p95: number}>} Warm latency percentiles. + */ +async function timeWarmRequests(app, query, iterations) { + const timings = []; + for (let i = 0; i < iterations; i++) { + const start = process.hrtime.bigint(); + const res = await app.request(`/api/links?${query}`); + await res.text(); + timings.push(Number(process.hrtime.bigint() - start) / 1e6); + } + timings.sort((a, b) => a - b); + const p50 = timings[Math.floor(timings.length * 0.5)]; + const p95 = timings[Math.floor(timings.length * 0.95)]; + return { p50, p95 }; +} + +const EXPLAIN_ORDER_BY = { + 'default (sourceUrl asc)': 'source_url_sort_key, edge_id', + 'sourceUrl desc': 'source_url_sort_key DESC, edge_id DESC', + 'destUrl asc': 'dest_url_sort_key, edge_id', + 'status asc': 'status_sort_key, source_url_sort_key, edge_id', + 'status desc': 'status_desc_key, source_url_sort_key, edge_id', +}; + +/** + * Runs the full matrix (EXPLAIN + cold/warm HTTP timing) against one + * already-built read model, printing a results table and a copy-pasteable + * Markdown summary block. + * @param {import('knex').Knex} db - The Knex instance with a built read model. + * @param {number} n - The page-row count this DB was seeded with (for the report header). 
+ */ +async function runMatrix(db, n) { + const accessorStub = { getKnex: () => db }; + const app = createApp({ + context: { archiveId: 'bench', manager: { get: () => accessorStub } }, + publicDir: '/tmp/no-such-dir-bench', + }); + + const results = []; + for (const entry of MATRIX) { + const explain = await explainMatrixEntry(db, EXPLAIN_ORDER_BY[entry.label]); + const coldStart = process.hrtime.bigint(); + const coldRes = await app.request(`/api/links?${entry.query}`); + await coldRes.text(); + const coldMs = Number(process.hrtime.bigint() - coldStart) / 1e6; + const { p50, p95 } = await timeWarmRequests(app, entry.query, WARM_ITERATIONS); + results.push({ ...entry, coldMs, p50, p95, explain }); + } + + console.log('\n sort cold p50 p95'); + for (const r of results) { + console.log( + ` ${r.label.padEnd(35)} ${`${r.coldMs.toFixed(1)}ms`.padStart(8)} ${`${r.p50.toFixed(1)}ms`.padStart(8)} ${`${r.p95.toFixed(1)}ms`.padStart(8)}`, + ); + console.log(` EXPLAIN: ${r.explain}`); + } + + console.log( + '\n### Markdown summary (paste into PR/ARCHITECTURE.md, no archive-identifying details)\n', + ); + console.log( + `\`${n.toLocaleString()} synthetic pages\` — /api/links?type=broken viewer_anchor_facts fast path:\n`, + ); + console.log('| sort | cold | warm p50 | warm p95 | EXPLAIN QUERY PLAN |'); + console.log('| --- | --- | --- | --- | --- |'); + for (const r of results) { + console.log( + `| ${r.label} | ${r.coldMs.toFixed(1)}ms | ${r.p50.toFixed(1)}ms | ${r.p95.toFixed(1)}ms | ${r.explain} |`, + ); + } + + // listViewerBrokenLinks function-level sanity check — confirms the HTTP + // numbers above aren't dominated by Hono/JSON overhead alone. 
+ const directStart = process.hrtime.bigint(); + await listViewerBrokenLinks(accessorStub, { limit: 100 }); + const directMs = Number(process.hrtime.bigint() - directStart) / 1e6; + console.log( + `\nDirect \`listViewerBrokenLinks\` call (no HTTP layer), default sort: ${directMs.toFixed(1)}ms`, + ); +} + +for (const n of SIZES) { + console.log( + `\n══════════ ${n.toLocaleString()} pages (~${(n * ANCHOR_FANOUT).toLocaleString()} anchors) ══════════`, + ); + const { db, dbFilePath, cleanupDir, anchorRowCount } = await makeDb(n); + try { + const seedSizeBytes = statSync(dbFilePath).size; + console.log(` seeded DB size: ${(seedSizeBytes / 1024 / 1024).toFixed(1)} MiB`); + console.log(` anchors inserted: ${anchorRowCount.toLocaleString()}`); + + const { buildMs, sizeBeforeBytes, sizeAfterBytes, anchorFactRowCount } = + await buildReadModel(db, dbFilePath); + const addedBytes = sizeAfterBytes - sizeBeforeBytes; + console.log(` read-model build time: ${buildMs.toFixed(0)}ms`); + console.log( + ` read-model added DB size: ${(addedBytes / 1024 / 1024).toFixed(1)} MiB (viewer_anchor_facts rows after edge dedup: ${anchorFactRowCount.toLocaleString()})`, + ); + + await runMatrix(db, n); + } finally { + await db.destroy(); + rmSync(cleanupDir, { recursive: true, force: true }); + } +} +console.log('\nDone.'); From d619c4a04c9968906002d7e76b0e6a686213fda1 Mon Sep 17 00:00:00 2001 From: Yusuke Hirao Date: Fri, 3 Jul 2026 22:58:37 +0900 Subject: [PATCH 3/3] feat(viewer): route broken links through the viewer_anchor_facts fast path /api/links?type=broken now dispatches to the cursor-paginated listViewerBrokenLinks when the read model is current, falling back to the legacy offset-based listLinks otherwise. useLinksInfinite switches its virtual-scroll pageParam from offset to the server-issued nextCursor to match. 
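The dispatch rule described above can be sketched as a pure function. This is a simplified illustration rather than the route's actual code: `chooseBrokenLinksBackend`, `BrokenLinksQuery` and `FAST_PATH_SORT_KEYS` are hypothetical names, and the read-model freshness check is assumed to have been resolved into a boolean beforehand.

```typescript
// Hypothetical, simplified mirror of the broken-links dispatch decision.
// The names below are illustrative and do not exist in the codebase.
type BrokenLinksQuery = {
  urlPattern?: string;
  includeRedirectSources?: string;
  sortBy?: string;
};

// Sort keys the read-model fast path is described as supporting.
const FAST_PATH_SORT_KEYS = new Set<string>(['sourceUrl', 'destUrl', 'status']);

function chooseBrokenLinksBackend(
  q: BrokenLinksQuery,
  readModelCurrent: boolean,
): 'fast' | 'legacy' {
  // An unsupported sort key forces the legacy path instead of silently
  // falling back to a different ordering.
  const unsupportedSort = Boolean(q.sortBy && !FAST_PATH_SORT_KEYS.has(q.sortBy));
  const forcesLegacy =
    Boolean(q.urlPattern) || q.includeRedirectSources === 'true' || unsupportedSort;
  return !forcesLegacy && readModelCurrent ? 'fast' : 'legacy';
}
```

Keeping the decision free of I/O means the same query string resolves to the same ordering whether or not the read model happens to be current, which is the property the route's tests assert.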
--- .../src/routes/register-links-route.spec.ts | 247 ++++++++++++++++++ .../viewer/src/routes/register-links-route.ts | 111 ++++++-- .../viewer/web/api/use-links-infinite.ts | 29 +- 3 files changed, 345 insertions(+), 42 deletions(-) diff --git a/packages/@nitpicker/viewer/src/routes/register-links-route.spec.ts b/packages/@nitpicker/viewer/src/routes/register-links-route.spec.ts index e9d55294..ae351bea 100644 --- a/packages/@nitpicker/viewer/src/routes/register-links-route.spec.ts +++ b/packages/@nitpicker/viewer/src/routes/register-links-route.spec.ts @@ -120,6 +120,12 @@ async function buildFixture(workingDir: string, withReadModel: boolean) { title: null, textContent: 'Ad sidebar', }, + { + href: parseUrl('https://example.com/broken')!, + isExternal: false, + title: null, + textContent: 'Broken link', + }, ], imageList: [], isSkipped: false, @@ -140,6 +146,22 @@ async function buildFixture(workingDir: string, withReadModel: boolean) { imageList: [], isSkipped: false, }); + await archive.setPage({ + url: parseUrl('https://example.com/broken')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 404, + statusText: 'Not Found', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); if (withReadModel) { await buildViewerReadModel(archive); @@ -221,3 +243,228 @@ describe('registerLinksRoute — /api/links?type=external (integration)', () => }); }); }); + +describe('registerLinksRoute — /api/links?type=broken (integration)', () => { + describe('fast path (viewer_anchor_facts read model built)', () => { + const workingDir = path.resolve( + __dirname, + '__test_fixtures_register_links_route_broken_fast__', + ); + let fixture: Awaited<ReturnType<typeof buildFixture>>; + + beforeAll(async () => { + fixture = await buildFixture(workingDir, true); + }); + + afterAll(async () => { + await fixture.manager.closeAll(); + const { rmSync } = await import('node:fs'); + rmSync(workingDir, {
recursive: true, force: true }); + }); + + it('returns the broken-link shape with a nextCursor contract', async () => { + const res = await fixture.app.request('/api/links?type=broken'); + const body = (await res.json()) as { + items: { sourceUrl: string; destUrl: string; status: number | null }[]; + total: number; + nextCursor: string | null; + prevCursor: string | null; + }; + expect(body.total).toBe(1); + expect(body.items).toEqual([ + { + sourceUrl: 'https://example.com/page-b', + destUrl: 'https://example.com/broken', + status: 404, + isExternal: false, + textContent: null, + }, + ]); + expect(body.nextCursor).toBeNull(); + expect(body.prevCursor).toBeNull(); + }); + + it('forces the legacy fallback when urlPattern is set, since no single index covers source-OR-dest matching', async () => { + const res = await fixture.app.request( + `/api/links?type=broken&urlPattern=${encodeURIComponent('%page-b%')}`, + ); + const body = (await res.json()) as { + items: { sourceUrl: string }[]; + total: number; + }; + expect(body.total).toBe(1); + expect(body.items[0]!.sourceUrl).toBe('https://example.com/page-b'); + }); + }); + + describe('fast path — sortBy outside the read model’s narrower union', () => { + const workingDir = path.resolve( + __dirname, + '__test_fixtures_register_links_route_broken_unsupported_sort__', + ); + let fixture: Awaited<ReturnType<typeof buildFixture>>; + + beforeAll(async () => { + const { mkdirSync } = await import('node:fs'); + mkdirSync(workingDir, { recursive: true }); + const archive = await Archive.create({ + filePath: path.resolve(workingDir, 'fixture.nitpicker'), + cwd: workingDir, + }); + await archive.setConfig(BASE_CONFIG); + + // `s1`'s broken destination is external, `s2`'s is internal. + // Sorting by `sourceUrl` (the fast path's silent fallback if the + // unsupported-sort guard were missing) would place `s1` before + // `s2` (alphabetical).
Sorting by `isExternal` ascending (only + // `listLinks`, the legacy path, supports this) places the + // internal destination (`s2`) first instead — a result only + // reachable by actually forcing the legacy fallback. + await archive.setPage({ + url: parseUrl('https://example.com/s1')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [ + { + href: parseUrl('https://ext.example.com/e1')!, + isExternal: true, + title: null, + textContent: 'External broken', + }, + ], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://example.com/s2')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [ + { + href: parseUrl('https://example.com/i1')!, + isExternal: false, + title: null, + textContent: 'Internal broken', + }, + ], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://ext.example.com/e1')!, + redirectPaths: [], + isExternal: true, + isTarget: false, + status: 404, + statusText: 'Not Found', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://example.com/i1')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 404, + statusText: 'Not Found', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + + await buildViewerReadModel(archive); + + const manager = new ArchiveManager(); + const { archiveId, mode } = await manager.open(archive.tmpDir); + const app = createApp({ + context: { + manager, + 
archiveId, + filePath: archive.tmpDir, + mode, + crawlerLockHolder: null, + }, + publicDir: '/tmp/no-such-dir-register-links-route-spec', + }); + fixture = { app, archive, manager }; + }); + + afterAll(async () => { + await fixture.manager.closeAll(); + const { rmSync } = await import('node:fs'); + rmSync(workingDir, { recursive: true, force: true }); + }); + + it('forces the legacy fallback for sortBy=isExternal, which viewer_anchor_facts has no index for', async () => { + const res = await fixture.app.request( + '/api/links?type=broken&sortBy=isExternal&sortOrder=asc', + ); + const body = (await res.json()) as { items: { sourceUrl: string }[] }; + expect(body.items.map((item) => item.sourceUrl)).toEqual([ + 'https://example.com/s2', + 'https://example.com/s1', + ]); + }); + }); + + describe('legacy fallback path (no read model built)', () => { + const workingDir = path.resolve( + __dirname, + '__test_fixtures_register_links_route_broken_legacy__', + ); + let fixture: Awaited<ReturnType<typeof buildFixture>>; + + beforeAll(async () => { + fixture = await buildFixture(workingDir, false); + }); + + afterAll(async () => { + await fixture.manager.closeAll(); + const { rmSync } = await import('node:fs'); + rmSync(workingDir, { recursive: true, force: true }); + }); + + it('returns the same broken-link shape via the legacy live query, with an offset-string nextCursor', async () => { + const res = await fixture.app.request('/api/links?type=broken'); + const body = (await res.json()) as { + items: { sourceUrl: string; destUrl: string; status: number | null }[]; + total: number; + nextCursor: string | null; + }; + expect(body.total).toBe(1); + expect(body.items[0]).toMatchObject({ + sourceUrl: 'https://example.com/page-b', + destUrl: 'https://example.com/broken', + status: 404, + }); + expect(body.nextCursor).toBeNull(); + }); + }); +}); diff --git a/packages/@nitpicker/viewer/src/routes/register-links-route.ts b/packages/@nitpicker/viewer/src/routes/register-links-route.ts index e06d3c19..84824344
100644 --- a/packages/@nitpicker/viewer/src/routes/register-links-route.ts +++ b/packages/@nitpicker/viewer/src/routes/register-links-route.ts @@ -5,14 +5,34 @@ import { isViewerReadModelCurrent, listExternalLinks, listLinks, + listViewerBrokenLinks, listViewerExternalLinks, } from '@nitpicker/query'; +import { buildLegacyPagesCursors } from '../query-params/build-legacy-pages-cursors.js'; +import { parseLegacyPagesCursor } from '../query-params/parse-legacy-pages-cursor.js'; import { toNumber } from '../query-params/to-number.js'; /** Valid `type` values for the links route. */ const VALID_LINK_TYPES = ['broken', 'external'] as const; +/** Default page size, matching `listLinks`/`listViewerBrokenLinks`'s own default. */ +const DEFAULT_LIMIT = 100; + +/** + * `sortBy` values `listViewerBrokenLinks` supports — a strict subset of + * `listLinks`'s 5 (`sourceUrl`/`destUrl`/`status`/`isExternal`/ + * `textContent`), since `viewer_anchor_facts` has no index on + * `is_external_link` and stores no anchor text at all (see + * `list-viewer-broken-links.ts`'s docs). A request for `isExternal`/ + * `textContent` must force the legacy fallback rather than silently + * falling through `getAnchorFactsSortSpec`'s `sourceUrl` default — a + * bookmarked/shared `?sortBy=isExternal` URL must sort the same way + * whether or not the read model happens to be current, not silently + * change order depending on internal cache state. + */ +const BROKEN_LINKS_FAST_PATH_SORT_KEYS = new Set(['sourceUrl', 'destUrl', 'status']); + /** * Registers `GET /api/links?type=broken|external` — link analysis. * @@ -27,22 +47,38 @@ const VALID_LINK_TYPES = ['broken', 'external'] as const; * `sourceUrl`/`isExternal`/`textContent` sort keys, an added * `referrerCount` sort key). 
* - * `external` dispatches to one of two backends per request, the same - * two-layer pattern `register-pages-route.ts` uses for `/api/pages`: + * Both `external` and `broken` dispatch to one of two backends per request, + * the same two-layer pattern `register-pages-route.ts` uses for + * `/api/pages`: * - * - `listViewerExternalLinks` (the `viewer_external_links` read-model fast - * path) when the read model is built and current. Unlike `/api/pages`, - * there is no filter that forces a legacy fallback: `urlPattern`/`status` - * both map directly onto `viewer_external_links` columns. - * - `listExternalLinks` (the legacy live `anchors` JOIN + `GROUP BY` query) - * otherwise — covers archives predating the read model. Both share the - * same options/response shape, so callers see no difference beyond speed. + * - `external`: `listViewerExternalLinks` (the `viewer_external_links` + * read-model fast path) when the read model is current — no filter forces + * a legacy fallback here, since `urlPattern`/`status` both map directly + * onto `viewer_external_links` columns. Otherwise `listExternalLinks` + * (the legacy live `anchors` JOIN + `GROUP BY` query). + * - `broken`: `listViewerBrokenLinks` (the `viewer_anchor_facts` read-model + * fast path, cursor-paginated) when the read model is current AND none of + * `urlPattern`, `includeRedirectSources`, or an unsupported `sortBy` + * (`isExternal`/`textContent` — see `BROKEN_LINKS_FAST_PATH_SORT_KEYS`) is + * set — `urlPattern` matches source OR destination across two columns, + * which no single index can satisfy; `includeRedirectSources` has no + * read-model equivalent (`viewer_anchor_facts` only ever stores the + * canonical destination); and the fast path's narrower `sortBy` union + * means an unsupported value must force the legacy fallback rather than + * silently resolving to a different sort. Otherwise `listLinks` (legacy, + * anchor-scan-bound, offset-based). 
The + * legacy path's `cursor` is a plain decimal offset string (see + * `buildLegacyPagesCursors`), not the fast path's opaque keyset token, but + * exposes the same `nextCursor`-only contract so `useLinksInfinite`'s + * virtual scroll keeps paginating past the first page regardless of which + * backend served it. * @param app - The Hono application. * @param context - The opened archive context. */ export function registerLinksRoute(app: Hono, context: ArchiveContext): void { app.get('/api/links', async (c) => { - const type = c.req.query('type'); + const q = c.req.query(); + const type = q.type; if (!type || !(VALID_LINK_TYPES as readonly string[]).includes(type)) { return c.json( { @@ -52,11 +88,11 @@ export function registerLinksRoute(app: Hono, context: ArchiveContext): void { ); } const accessor = context.manager.get(context.archiveId); - const limit = toNumber(c.req.query('limit')); - const offset = toNumber(c.req.query('offset')); - const urlPattern = c.req.query('urlPattern'); - const status = toNumber(c.req.query('status')); - const sortOrder = c.req.query('sortOrder') as 'asc' | 'desc' | undefined; + const limit = toNumber(q.limit); + const offset = toNumber(q.offset); + const urlPattern = q.urlPattern; + const status = toNumber(q.status); + const sortOrder = q.sortOrder as 'asc' | 'desc' | undefined; if (type === 'external') { const params = { @@ -64,11 +100,7 @@ export function registerLinksRoute(app: Hono, context: ArchiveContext): void { offset, urlPattern, status, - sortBy: c.req.query('sortBy') as - | 'destUrl' - | 'status' - | 'referrerCount' - | undefined, + sortBy: q.sortBy as 'destUrl' | 'status' | 'referrerCount' | undefined, sortOrder, }; const result = (await isViewerReadModelCurrent(accessor)) @@ -77,15 +109,36 @@ export function registerLinksRoute(app: Hono, context: ArchiveContext): void { return c.json(result); } - const includeRedirectSources = c.req.query('includeRedirectSources') === 'true'; - const result = await listLinks(accessor, 
{ + const includeRedirectSources = q.includeRedirectSources === 'true'; + const usesUnsupportedSort = Boolean( + q.sortBy && !BROKEN_LINKS_FAST_PATH_SORT_KEYS.has(q.sortBy), + ); + const usesWideTableOnlyFilter = Boolean( + urlPattern || includeRedirectSources || usesUnsupportedSort, + ); + if (!usesWideTableOnlyFilter && (await isViewerReadModelCurrent(accessor))) { + const result = await listViewerBrokenLinks(accessor, { + limit, + offset, + status, + sortBy: q.sortBy as 'sourceUrl' | 'destUrl' | 'status' | undefined, + sortOrder, + cursor: q.cursor || undefined, + direction: q.direction === 'prev' ? 'prev' : undefined, + }); + return c.json(result); + } + + const legacyLimit = limit ?? DEFAULT_LIMIT; + const legacyOffset = parseLegacyPagesCursor(q.cursor, offset ?? 0); + const legacyResult = await listLinks(accessor, { type: 'broken', - limit, - offset, + limit: legacyLimit, + offset: legacyOffset, includeRedirectSources, urlPattern, status, - sortBy: c.req.query('sortBy') as + sortBy: q.sortBy as | 'sourceUrl' | 'destUrl' | 'status' @@ -94,6 +147,12 @@ export function registerLinksRoute(app: Hono, context: ArchiveContext): void { | undefined, sortOrder, }); - return c.json(result); + const { nextCursor, prevCursor } = buildLegacyPagesCursors({ + offset: legacyOffset, + itemCount: legacyResult.items.length, + total: legacyResult.total, + limit: legacyLimit, + }); + return c.json({ ...legacyResult, nextCursor, prevCursor }); }); } diff --git a/packages/@nitpicker/viewer/web/api/use-links-infinite.ts b/packages/@nitpicker/viewer/web/api/use-links-infinite.ts index 117ef4ff..1d55163d 100644 --- a/packages/@nitpicker/viewer/web/api/use-links-infinite.ts +++ b/packages/@nitpicker/viewer/web/api/use-links-infinite.ts @@ -1,10 +1,9 @@ import type { InfiniteQueryOptions } from './infinite-query-options.js'; -import type { LinkEntry } from '@nitpicker/query'; +import type { CursorPaginatedLinkList, LinkEntry } from '@nitpicker/query'; import { useInfiniteQuery } from 
'@tanstack/react-query'; import { apiGet } from './api-client.js'; -import { getNextOffset } from './get-next-offset.js'; import { PAGE_SIZE } from './page-size.js'; /** @@ -32,16 +31,15 @@ export interface LinksFilter { sortOrder?: string; } -/** Paginated link analysis response shape. */ -interface LinksPage { - /** Rows for this page. */ - items: LinkRow[]; - /** Total matching rows. */ - total: number; -} - /** - * Infinite-scrolling broken-link analysis. + * Infinite-scrolling broken-link analysis. Fetches `PAGE_SIZE` rows per + * request and advances via the server-issued `nextCursor` (keyset + * pagination) rather than a growing `offset` — the same contract + * `usePagesInfinite` uses for `/api/pages`. `/api/links?type=broken` serves + * this from the `viewer_anchor_facts` read model when available, falling + * back to the legacy anchor-scan path (whose `nextCursor` is a plain + * offset-as-string, per `buildLegacyPagesCursors`) otherwise — this hook + * never needs to know which backend served a given page. * @param type - The link analysis type. * @param filter * @param options - Optional flags (`enabled`). @@ -54,16 +52,15 @@ export function useLinksInfinite( ) { return useInfiniteQuery({ queryKey: ['links', type, filter], - initialPageParam: 0, + initialPageParam: null as string | null, queryFn: ({ pageParam }) => - apiGet('/api/links', { + apiGet('/api/links', { type, ...filter, limit: PAGE_SIZE, - offset: pageParam, + cursor: pageParam ?? undefined, }), - getNextPageParam: (lastPage, _allPages, lastPageParam) => - getNextOffset(lastPage, lastPageParam), + getNextPageParam: (lastPage) => lastPage.nextCursor ?? undefined, enabled: options?.enabled ?? true, }); }
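The `nextCursor`-only contract the hook relies on can be illustrated with a small in-memory sketch of the legacy "offset as string" cursor. The real `buildLegacyPagesCursors` and `parseLegacyPagesCursor` helpers are not shown in this patch, so their exact behaviour is an assumption here; `fetchLegacyPage`, `collectAll` and `Page` are hypothetical names used only for illustration.

```typescript
// Hypothetical response shape: only the fields the sketch needs.
type Page<T> = { items: T[]; total: number; nextCursor: string | null };

// Assumed behaviour of an offset-backed backend that speaks the cursor
// contract: the cursor is the decimal offset of the next unread row.
function fetchLegacyPage<T>(
  rows: readonly T[],
  cursor: string | null,
  limit: number,
): Page<T> {
  const parsed = cursor === null ? 0 : Number.parseInt(cursor, 10);
  // Malformed or negative cursors restart from the first row.
  const offset = Number.isFinite(parsed) && parsed >= 0 ? parsed : 0;
  const items = rows.slice(offset, offset + limit);
  const end = offset + items.length;
  // No cursor is issued once the window reaches the end (or returns nothing).
  const nextCursor = items.length > 0 && end < rows.length ? String(end) : null;
  return { items, total: rows.length, nextCursor };
}

// Walks every page the way an infinite-scroll client would: start with no
// cursor, then follow nextCursor until the server stops issuing one.
function collectAll<T>(rows: readonly T[], limit: number): T[] {
  const out: T[] = [];
  let cursor: string | null = null;
  do {
    const page: Page<T> = fetchLegacyPage(rows, cursor, limit);
    out.push(...page.items);
    cursor = page.nextCursor;
  } while (cursor !== null);
  return out;
}
```

Because the client only ever echoes back whatever `nextCursor` it received, it does not need to know whether that string is an offset or an opaque keyset token.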