diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 063da47..12edd19 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -179,7 +179,8 @@ crawler/src/ - **`getPageDetail`**: 単一ページの詳細情報(メタデータ、アウトバウンド/インバウンドリンク、リダイレクト元) - **`getPageHtml`**: HTML スナップショット取得(truncation サポート) - **`listLinks`**: リンク分析(`type: 'broken' | 'external'`、anchor 単位 = 1 行 1 `<a>` タグ、重複排除なし)。dest は `pages.redirectDestId` 経由で canonical destination まで解決した上で broken/external 判定(`includeRedirectSources: true` で解決を無効化し literal を見る)。関数自体は変更していないため CLI/MCP は従来通り `type: 'external'` で anchor 単位の生データを取得できるが、**viewer の `/api/links?type=external` だけは `listExternalLinks` に切り替え済み**(後述) — 「外部リンク」ビューは宛先ごとに集約した一覧を必要とするため -- **`listExternalLinks`**: viewer の「外部リンク」ビュー専用。外部リンク先を canonical destination(`listLinks` と同じ `COALESCE(canonical.*, dest.*)` 解決パターン)ごとに `GROUP BY` で重複排除し、`referrerCount`(`COUNT(DISTINCT source.id)` — 同一ページからの複数アンカーは 1 件として数える)を付与した一覧。ページネーションの `total` は distinct 宛先数(anchor 数ではない)を GROUP BY サブクエリでラップして算出 — `paginateQuery` ヘルパーは素朴な `count(idColumn)` のため GROUP BY 済みクエリと非互換で使えない。宛先の詳細(参照元ページ一覧)は新規ビューを作らず既存の `getPageDetail`(`isExternal`/`scraped` 制約なし)の `inboundLinks` をそのまま再利用する +- **`listExternalLinks`**: viewer の「外部リンク」ビュー用の legacy 経路(read model が無い/古いアーカイブのフォールバック)。外部リンク先を canonical destination(`listLinks` と同じ `COALESCE(canonical.*, dest.*)` 解決パターン)ごとに `GROUP BY` で重複排除し、`referrerCount`(`COUNT(DISTINCT source.id)` — 同一ページからの複数アンカーは 1 件として数える)を付与した一覧。ページネーションの `total` は distinct 宛先数(anchor 数ではない)を GROUP BY サブクエリでラップして算出 — `paginateQuery` ヘルパーは素朴な `count(idColumn)` のため GROUP BY 済みクエリと非互換で使えない。宛先の詳細(参照元ページ一覧)は新規ビューを作らず既存の `getPageDetail`(`isExternal`/`scraped` 制約なし)の `inboundLinks` をそのまま再利用する。**`viewer_external_links` read model が current な場合は `listViewerExternalLinks` に切り替わる**(後述の「設計注意(外部リンク read model)」参照)— この関数自体はそのフォールバックとして無変更のまま残る +- **`listViewerExternalLinks`**: `viewer_external_links` read model 専用の fast path。`listExternalLinks` と同じオプション/レスポンス形だが、集計(JOIN + GROUP BY + COUNT DISTINCT)は read 
model ビルド時に1回だけ実行済みなので、実行時は単純な indexed SELECT + `paginateQuery`(GROUP BY 不要になったため素朴な helper がそのまま使える) - **`listIsolatedPages`** / **`listIsolatedClusters`** / **`getIsolatedCluster`**: inventory subgraph の **完全孤立** (singleton) / **孤立集合** (connected component, size ≥ 2)。crawled-wins downgrade の不変量により crawled 行は定義上 isolated 判定から除外される。cluster の edge は redirect 解決済み anchor を無向で見た weakly connected component(共通ヘルパー `compute-isolated-clusters.ts` が `resolve-redirect-chain` + union-find で計算) - **`listResources`**: サブリソース一覧(CSS, JS, 画像、フォント) - **`listImages`**: 画像一覧(alt 欠損、寸法欠損、オーバーサイズ検出) @@ -254,7 +255,7 @@ nitpicker viewer → SIGINT/SIGTERM: manager.closeAll() → server.close() → resolve(CLI が exit) ``` -**REST API(アーカイブは起動時固定なので archiveId 不要):** `GET /api/summary`, `/api/pages`(`hasCSP`/`hasXFrameOptions`/`hasXContentTypeOptions`/`hasHSTS` の 4 列を含む。旧 `/api/headers`・「Headers」ビューは「ページ」ビューへ統合済み、CLI/MCP 向けの `checkHeaders` 自体は残存), `/api/pages/detail?url=`, `/api/pages/html?url=`, `/api/links?type=`(`broken` は `listLinks` 経由で anchor 単位のまま、canonical destination が HTTP 404 のみ。403/5xx/未取得(NULL) は broken 扱いしない。`external` は `listExternalLinks` 経由で canonical destination ごとに重複排除され `referrerCount` を返す — 宛先の参照元一覧は新規エンドポイントを作らず既存の `/api/pages/detail` の inboundLinks を再利用する), `/api/resources`, `/api/resources/referrers?resourceUrl=`, `/api/images`, `/api/violations`, `/api/duplicates`, `/api/mismatches`, `/api/graph`(内部ページのリンクグラフ、`getLinkGraph`), `/api/directory-tree`(全 root の初期 3 depth ツリー、`getDirectoryTree`), `/api/directory-tree/children?nodeId=`(1 ノード直下の子ディレクトリ、`listDirectoryChildren`), `/api/directory-tree/pages?nodeId=&cursor=&limit=`(1 ディレクトリ直下ページの cursor 一覧、`listDirectoryPages`), `/api/info`(開いているアーカイブの絶対パス、フッター表示用)。クエリパラメータ → query options 変換は `query-params/to-number.ts` / `to-boolean.ts`、エラーは `sanitize-error-message.ts` で絶対パスを伏せて JSON 返却(mcp-server と同方針)。旧 `/api/page-links`(`listPageLinks`)は「ページリンク」ビューの廃止に伴い削除 — per-page の status/referrers/redirect-from は Page Detail 
ビュー(`/api/pages/detail`)の inbound/outbound/redirectFrom で個別ページ単位に確認する。`getPageDetail` は `isSkipped`/`skipReason`(robots.txt / `excludeUrls` による除外理由)も返すようになり、URL 既知の場合は除外理由を引き続き確認できる。**受容したギャップ**: `listPages` / `listPagesByTag` / `listPagesByJsonLdType` はすべて `scraped = 1` 前提のため、「除外されて一度も取得されていない URL 一覧」を一括列挙する手段は無くなった(旧 `listPageLinks` だけが `scraped` 制約なしだった)。URL が分かっていれば `getPageDetail` で確認できるが、一括把握が必要な場合は `nitpicker query error-kinds` や archive の `pages` テーブルを直接クエリすること。 +**REST API(アーカイブは起動時固定なので archiveId 不要):** `GET /api/summary`, `/api/pages`(`hasCSP`/`hasXFrameOptions`/`hasXContentTypeOptions`/`hasHSTS` の 4 列を含む。旧 `/api/headers`・「Headers」ビューは「ページ」ビューへ統合済み、CLI/MCP 向けの `checkHeaders` 自体は残存), `/api/pages/detail?url=`, `/api/pages/html?url=`, `/api/links?type=`(`broken` は `listLinks` 経由で anchor 単位のまま、canonical destination が HTTP 404 のみ。403/5xx/未取得(NULL) は broken 扱いしない。`external` は canonical destination ごとに重複排除され `referrerCount` を返す — read model が current なら `listViewerExternalLinks`、そうでなければ `listExternalLinks` にフォールバック(`/api/pages` と同じ二層構成)。宛先の参照元一覧は新規エンドポイントを作らず既存の `/api/pages/detail` の inboundLinks を再利用する), `/api/resources`, `/api/resources/referrers?resourceUrl=`, `/api/images`, `/api/violations`, `/api/duplicates`, `/api/mismatches`, `/api/graph`(内部ページのリンクグラフ、`getLinkGraph`), `/api/directory-tree`(全 root の初期 3 depth ツリー、`getDirectoryTree`), `/api/directory-tree/children?nodeId=`(1 ノード直下の子ディレクトリ、`listDirectoryChildren`), `/api/directory-tree/pages?nodeId=&cursor=&limit=`(1 ディレクトリ直下ページの cursor 一覧、`listDirectoryPages`), `/api/info`(開いているアーカイブの絶対パス、フッター表示用)。クエリパラメータ → query options 変換は `query-params/to-number.ts` / `to-boolean.ts`、エラーは `sanitize-error-message.ts` で絶対パスを伏せて JSON 返却(mcp-server と同方針)。旧 `/api/page-links`(`listPageLinks`)は「ページリンク」ビューの廃止に伴い削除 — per-page の status/referrers/redirect-from は Page Detail ビュー(`/api/pages/detail`)の inbound/outbound/redirectFrom で個別ページ単位に確認する。`getPageDetail` は `isSkipped`/`skipReason`(robots.txt / `excludeUrls` 
による除外理由)も返すようになり、URL 既知の場合は除外理由を引き続き確認できる。**受容したギャップ**: `listPages` / `listPagesByTag` / `listPagesByJsonLdType` はすべて `scraped = 1` 前提のため、「除外されて一度も取得されていない URL 一覧」を一括列挙する手段は無くなった(旧 `listPageLinks` だけが `scraped` 制約なしだった)。URL が分かっていれば `getPageDetail` で確認できるが、一括把握が必要な場合は `nitpicker query error-kinds` や archive の `pages` テーブルを直接クエリすること。 **バイナリ:** なし(CLI の `viewer` サブコマンド経由で起動) @@ -340,6 +341,14 @@ nitpicker viewer > > **`getDirectoryTree` の ORDER BY は `path_sort_key` 単独、`root_key` を含めない**: 全 root を 1 クエリで返す設計上、`root_key` の等価フィルタが存在しないため、`vdn_root_depth_path (root_key, depth, path_sort_key, node_id)` のような `root_key` 先頭 index は `depth <= 3` という range 条件との組み合わせで一切活用できず、`EXPLAIN QUERY PLAN` で実測すると `USE TEMP B-TREE FOR LAST TERM OF ORDER BY` が付く(PR #96 の `idx_pages_listfilter` column 順ミスと同型の教訓)。`path_sort_key` を先頭に置いた `vdn_path_depth (path_sort_key, depth, node_id)` に張り替え、`ORDER BY path_sort_key` のみに変更することで `SCAN ... USING INDEX vdn_path_depth`(sort 無し、`depth` は残差フィルタ)に収まることを確認済み。root_key を ORDER BY から外しても、grouping は JS 側で `Map` に振り分けるだけなので各 root 内の相対順序(`path_sort_key` 昇順)は保たれる。**検索キーワード**: 「directory-tree」「ディレクトリツリー」「has_children」「vdn_path_depth」「USE TEMP B-TREE」。 +> **設計注意(外部リンク read model):** `listExternalLinks`(PR #153)は `anchors JOIN pages(source) JOIN pages(dest) LEFT JOIN pages(canonical)` を `COALESCE` 計算列で `GROUP BY` し `COUNT(DISTINCT source.id)` を求める形で、リクエストごとにこの JOIN+集計を(`total` 用サブクエリと data 用の)2 回実行していた。SQLite は `COUNT(DISTINCT ...)` で既存 index を使わず別の b-tree を都度構築することが知られており(SQLite forum 実測: `count(distinct id)` 単体 6.4 秒、他の集約と同一クエリに混ぜると 55.2 秒まで悪化する例が報告されている)、`GROUP BY` も式インデックス(`CREATE INDEX` の式と `WHERE`/`GROUP BY` の式が構文的に完全一致しないと使われない)では確実に解決できない。回避策として同フォーラムが推奨するのは集計をあらかじめ一時テーブルに書き出す方式で、これは本リポジトリの `viewer_pages`/`viewer_directory_nodes`(issue #106〜#112)と同じ「read model を作って計測してから最適化する」方針そのものである。 +> +> `viewer_external_links`(`dest_page_id` PK / `dest_url` / `status` / `referrer_count`)は `buildViewerReadModel` の同じトランザクション内で 
`computeExternalLinkRows`(`viewer-read-model/compute-external-link-rows.ts`)が構築する。集計ロジック(`COALESCE` 解決・`COUNT(DISTINCT source.id)`)は `listExternalLinks` から一切変更せずそのまま移植 — `referrerCount` は `getPageDetail.inboundLinks`(#71)と同じ数え方(重複アンカーは 1 referrer)を保つ契約があるため。`viewer_pages`/directory tree と違い、`sourceRows`(`pages` のみ)を再利用できず `anchors` への専用クエリが必要(リンク情報は `anchors` にしかない)。 +> +> **keyset cursor ではなく `paginateQuery`(offset ベース)を使う**: `viewer_pages` が `status_sort_key`/`status_desc_key`/`NULL_STATUS_SENTINEL` という仕掛けを持つのは keyset cursor 特有の要件(SQL の 3 値論理で `NULL` 比較が壊れる、`DESC` を常に `ASC` 方向スキャンにする必要がある)で、`/api/links?type=external` の REST 契約はそもそも offset ベースのまま変更していないため、この複雑さは不要。`viewer_external_links` の 3 index(`vel_url` / `vel_status` / `vel_referrer_count`)はいずれも単純な単方向 index で、`DESC` は同じ index の逆順スキャンで足りる。 +> +> **fast path / legacy の二層構成**: `register-links-route.ts` は `/api/pages` と同じパターンで `isViewerReadModelCurrent` を見て `listViewerExternalLinks`(fast path)と `listExternalLinks`(legacy、無変更のまま残存)を切り替える。`urlPattern`/`status` はどちらの経路でも同じ列に対応するため、`/api/pages` の `hasCSP` 等のような「特定フィルタ指定時は強制 legacy」という除外条件は無い。スキーマ変更を伴うため `VIEWER_READ_MODEL_SCHEMA_VERSION` を 4→5 に bump し、旧バージョンの read model は自動再ビルド対象にした。**検索キーワード**: 「external links」「外部リンク」「COUNT DISTINCT」「viewer_external_links」「GROUP BY 遅い」。 + ### @nitpicker/cli `@d-zero/roar` ベースの統合 CLI。7つのサブコマンドを提供。全 analyze プラグインを `dependencies` に含んでおり、`npx` 実行時に `@nitpicker/core` の動的 `import()` がプラグインモジュールを解決できるようにしている。 diff --git a/CLAUDE.md b/CLAUDE.md index 5bc8dd4..c70a8e9 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -82,6 +82,8 @@ packages/ > **Note (ディレクトリツリー read model、issue #107)**: `viewer_directory_nodes` / `viewer_directory_pages` は `viewer_pages` を返す `sourceRows` を再利用し `buildDirectoryTreeRows` が純粋関数としてメモリ上に構築する。**root_key はホスト単位、ただし internal ページを 1 件も持たないホストは除外**(外部リンク先ドメインの無意味な 1 ページツリーを防ぐ)。**ディレクトリ/ページ境界は末尾スラッシュで判定**(`/blog/2024/post-1` と `/blog/2024/` は同じ `/blog/2024/` ノードに着地)。**`has_children` は `direct_child_dir_count > 0` 
のみ**(`direct_page_count` を含めると構築ロジック上絶対に `false` にならないため、UI の展開矢印が意味を持つよう子ディレクトリの有無だけを見る)。この機能に legacy フォールバックは存在しないため、3関数(`getDirectoryTree`/`listDirectoryChildren`/`listDirectoryPages`)とも `hasViewerReadModel` ではなく `isViewerReadModelCurrent` を guard に使う。詳細は ARCHITECTURE.md の `@nitpicker/viewer` 節「設計注意(ディレクトリツリー read model...)」を正とする。 +> **Note (外部リンク read model)**: `listExternalLinks`(PR #153)は `anchors` の JOIN + `COALESCE` 計算列での `GROUP BY` + `COUNT(DISTINCT source.id)` をリクエストごとに(`total` 用と data 用で)2 回実行していた。SQLite の `COUNT(DISTINCT ...)` は既存 index を使わず別 b-tree を都度構築する既知のパフォーマンス病理を持つため(実測: 単体 6.4 秒、他の集約と混ぜると 55.2 秒まで悪化する例が SQLite forum に報告されている)、`viewer_pages`/`viewer_directory_nodes` と同じ read model パターンに乗せた。`viewer_external_links`(`dest_page_id` PK / `dest_url` / `status` / `referrer_count`)は `buildViewerReadModel` 内で `computeExternalLinkRows` が `anchors` への専用クエリ(`sourceRows` 再利用不可 — リンク情報は `pages` にはない)で1回だけ集計して構築する。集計ロジック自体(`COALESCE` 解決、referrer 重複排除)は `listExternalLinks` から無変更で移植 — `getPageDetail.inboundLinks`(#71)とのカウント粒度契約を崩さないため。ページネーションは keyset cursor ではなく `paginateQuery`(offset ベース、REST 契約が offset のままなので不要な複雑さを持ち込まない)。`register-links-route.ts` は `/api/pages` と同じ二層構成で `isViewerReadModelCurrent` を見て `listViewerExternalLinks`(fast path)↔ `listExternalLinks`(legacy、無変更で残存)を切り替える。スキーマ変更のため `VIEWER_READ_MODEL_SCHEMA_VERSION` を 4→5 に bump。詳細は ARCHITECTURE.md の `@nitpicker/viewer` 節「設計注意(外部リンク read model)」を正とする。 + ## CLI コマンド ```sh diff --git a/packages/@nitpicker/query/src/list-viewer-external-links.spec.ts b/packages/@nitpicker/query/src/list-viewer-external-links.spec.ts new file mode 100644 index 0000000..ea0839e --- /dev/null +++ b/packages/@nitpicker/query/src/list-viewer-external-links.spec.ts @@ -0,0 +1,481 @@ +import path from 'node:path'; + +import { tryParseUrl as parseUrl } from '@d-zero/shared/parse-url'; +import { Archive } from '@nitpicker/crawler'; +import { afterAll, beforeAll, describe, expect, it } from 'vitest'; + +import { 
listViewerExternalLinks } from './list-viewer-external-links.js'; +import { buildViewerReadModel } from './viewer-read-model/build-viewer-read-model.js'; + +const __filename = new URL(import.meta.url).pathname; +const __dirname = path.dirname(__filename); +const workingDir = path.resolve( + __dirname, + '__test_fixtures_list_viewer_external_links__', +); + +const META = { + lang: null, + title: null, + description: null, + keywords: null, + noindex: false, + nofollow: false, + noarchive: false, + canonical: null, + alternate: null, + 'og:type': null, + 'og:title': null, + 'og:site_name': null, + 'og:description': null, + 'og:url': null, + 'og:image': null, + 'twitter:card': null, +}; + +/** + * Mirrors `list-external-links.spec.ts`'s fixture and test cases, but + * against `listViewerExternalLinks` (the `viewer_external_links` read-model + * fast path) after `buildViewerReadModel` has populated the table — pinning + * that both backends agree on filter/sort/pagination/tie-break behavior. + */ +describe('listViewerExternalLinks', () => { + let archive: InstanceType<typeof Archive>; + const archiveFilePath = path.resolve( + workingDir, + 'list-viewer-external-links-test.nitpicker', + ); + + beforeAll(async () => { + const { mkdirSync } = await import('node:fs'); + mkdirSync(workingDir, { recursive: true }); + + archive = await Archive.create({ filePath: archiveFilePath, cwd: workingDir }); + await archive.setConfig({ + baseUrl: 'https://example.com', + name: 'test', + version: '0.10.0', + recursive: true, + interval: 0, + image: true, + fetchExternal: false, + parallels: 1, + roots: ['https://example.com'], + excludes: [], + excludeKeywords: [], + excludeUrls: [], + maxExcludedDepth: 0, + retry: 3, + fromList: false, + disableQueries: false, + userAgent: 'test', + ignoreRobots: false, + }); + + // Page A: two anchors to ads.example.com (same page, must count as one + // referrer, not two), plus one to tracking, one to solo. 
+ await archive.setPage({ + url: parseUrl('https://example.com/page-a')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Page A' }, + anchorList: [ + { + href: parseUrl('https://ads.example.com/')!, + isExternal: true, + title: null, + textContent: 'Ad banner', + }, + { + href: parseUrl('https://ads.example.com/')!, + isExternal: true, + title: null, + textContent: 'Ad footer', + }, + { + href: parseUrl('https://tracking.example.com/')!, + isExternal: true, + title: null, + textContent: 'Tracking', + }, + { + href: parseUrl('https://solo.example.com/')!, + isExternal: true, + title: null, + textContent: 'Solo', + }, + ], + imageList: [], + isSkipped: false, + }); + + // Page B: anchors to ads.example.com (2nd distinct referrer) and tracking. + await archive.setPage({ + url: parseUrl('https://example.com/page-b')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Page B' }, + anchorList: [ + { + href: parseUrl('https://ads.example.com/')!, + isExternal: true, + title: null, + textContent: 'Ad sidebar', + }, + { + href: parseUrl('https://tracking.example.com/')!, + isExternal: true, + title: null, + textContent: 'Tracking again', + }, + ], + imageList: [], + isSkipped: false, + }); + + // Page C: anchor to ads.example.com (3rd distinct referrer). 
+ await archive.setPage({ + url: parseUrl('https://example.com/page-c')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Page C' }, + anchorList: [ + { + href: parseUrl('https://ads.example.com/')!, + isExternal: true, + title: null, + textContent: 'Ad again', + }, + ], + imageList: [], + isSkipped: false, + }); + + // External destination rows. ads/solo resolve 200; tracking resolves 404. + await archive.setPage({ + url: parseUrl('https://ads.example.com/')!, + redirectPaths: [], + isExternal: true, + isTarget: false, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://tracking.example.com/')!, + redirectPaths: [], + isExternal: true, + isTarget: false, + status: 404, + statusText: 'Not Found', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://solo.example.com/')!, + redirectPaths: [], + isExternal: true, + isTarget: false, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + + await buildViewerReadModel(archive); + }); + + afterAll(async () => { + if (archive) { + await archive.releaseHandle(); + } + const { rmSync } = await import('node:fs'); + rmSync(workingDir, { recursive: true, force: true }); + }); + + it('宛先を1つの行に集約し、参照元ページ数を返す', async () => { + const result = await listViewerExternalLinks(archive); + const ads = result.items.find((item) => item.destUrl === 'https://ads.example.com'); + 
expect(ads).toBeDefined(); + expect(ads!.referrerCount).toBe(3); + }); + + it('総件数はアンカー数ではなく宛先の異なり数になる', async () => { + const result = await listViewerExternalLinks(archive); + expect(result.total).toBe(3); + expect(result.items).toHaveLength(3); + }); + + it('status で宛先をフィルタする', async () => { + const result = await listViewerExternalLinks(archive, { status: 404 }); + expect(result.items).toHaveLength(1); + expect(result.items[0]).toMatchObject({ + destUrl: 'https://tracking.example.com', + status: 404, + referrerCount: 2, + }); + }); + + it('urlPattern は宛先URLのみを対象にする(リンク元URLにはマッチしない)', async () => { + const matching = await listViewerExternalLinks(archive, { urlPattern: '%ads%' }); + expect(matching.items).toHaveLength(1); + expect(matching.items[0]!.destUrl).toBe('https://ads.example.com'); + + const sourceOnly = await listViewerExternalLinks(archive, { urlPattern: '%page-a%' }); + expect(sourceOnly.items).toHaveLength(0); + }); + + it('referrerCount の降順でソートできる', async () => { + const result = await listViewerExternalLinks(archive, { + sortBy: 'referrerCount', + sortOrder: 'desc', + }); + expect(result.items.map((item) => item.destUrl)).toEqual([ + 'https://ads.example.com', + 'https://tracking.example.com', + 'https://solo.example.com', + ]); + }); + + it('ページネーションが機能する', async () => { + const result = await listViewerExternalLinks(archive, { + sortBy: 'referrerCount', + sortOrder: 'desc', + limit: 1, + offset: 1, + }); + expect(result.items).toHaveLength(1); + expect(result.items[0]!.destUrl).toBe('https://tracking.example.com'); + }); + + it('status でタイが発生してもページネーションで宛先が重複・欠落しない', async () => { + // ads and solo both resolve to status 200 (a tie). Paginating one row at + // a time must still cover every distinct destination exactly once — + // this is what the dest_page_id tiebreaker in the ORDER BY clause + // guarantees. 
+ const pages = await Promise.all([ + listViewerExternalLinks(archive, { + sortBy: 'status', + sortOrder: 'asc', + limit: 1, + offset: 0, + }), + listViewerExternalLinks(archive, { + sortBy: 'status', + sortOrder: 'asc', + limit: 1, + offset: 1, + }), + listViewerExternalLinks(archive, { + sortBy: 'status', + sortOrder: 'asc', + limit: 1, + offset: 2, + }), + ]); + const seen = pages.map((page) => page.items[0]!.destUrl).toSorted(); + expect(seen).toEqual([ + 'https://ads.example.com', + 'https://solo.example.com', + 'https://tracking.example.com', + ]); + }); + + it('未知の sortBy 値では destUrl ソートにフォールバックする(例外を投げない)', async () => { + const result = await listViewerExternalLinks(archive, { + sortBy: 'sourceUrl' as unknown as 'destUrl', + }); + expect(result.items.map((item) => item.destUrl)).toEqual([ + 'https://ads.example.com', + 'https://solo.example.com', + 'https://tracking.example.com', + ]); + }); +}); + +/** + * Mirrors `list-external-links.spec.ts`'s redirect-resolution describe + * block for the fast path. 
+ */ +describe('listViewerExternalLinks — redirect resolution', () => { + let archive: InstanceType<typeof Archive>; + const redirectWorkingDir = path.resolve( + __dirname, + '__test_fixtures_list_viewer_external_links_redirect__', + ); + const redirectArchiveFilePath = path.resolve( + redirectWorkingDir, + 'list-viewer-external-links-redirect-test.nitpicker', + ); + + beforeAll(async () => { + const { mkdirSync } = await import('node:fs'); + mkdirSync(redirectWorkingDir, { recursive: true }); + archive = await Archive.create({ + filePath: redirectArchiveFilePath, + cwd: redirectWorkingDir, + }); + await archive.setConfig({ + baseUrl: 'https://example.com', + name: 'test', + version: '0.10.0', + recursive: true, + interval: 0, + image: true, + fetchExternal: false, + parallels: 1, + roots: ['https://example.com'], + excludes: [], + excludeKeywords: [], + excludeUrls: [], + maxExcludedDepth: 0, + retry: 3, + fromList: false, + disableQueries: false, + userAgent: 'test', + ignoreRobots: false, + }); + + await archive.setPage({ + url: parseUrl('https://example.com/direct')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Direct' }, + anchorList: [ + { + href: parseUrl('https://redirect-target.example.com/')!, + isExternal: true, + title: null, + textContent: 'Direct link', + }, + ], + imageList: [], + isSkipped: false, + }); + + await archive.setPage({ + url: parseUrl('https://example.com/via-redirect')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Via redirect' }, + anchorList: [ + { + href: parseUrl('https://example.com/old')!, + isExternal: false, + title: null, + textContent: 'Old link', + hash: null, + }, + ], + imageList: [], + isSkipped: false, + }); + + await 
archive.setPage({ + url: parseUrl('https://redirect-target.example.com/')!, + redirectPaths: [], + isExternal: true, + isTarget: false, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + + await archive.setRedirect({ + url: parseUrl('https://example.com/old')!, + redirectPaths: ['https://redirect-target.example.com/'], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + + await buildViewerReadModel(archive); + }); + + afterAll(async () => { + if (archive) { + await archive.releaseHandle(); + } + const { rmSync } = await import('node:fs'); + rmSync(redirectWorkingDir, { recursive: true, force: true }); + }); + + it('リダイレクト元経由と直接リンクが同じ正規宛先に集約される', async () => { + const result = await listViewerExternalLinks(archive); + expect(result.items).toHaveLength(1); + expect(result.items[0]).toMatchObject({ + destUrl: 'https://redirect-target.example.com', + referrerCount: 2, + }); + }); +}); diff --git a/packages/@nitpicker/query/src/list-viewer-external-links.ts b/packages/@nitpicker/query/src/list-viewer-external-links.ts new file mode 100644 index 0000000..815d6c1 --- /dev/null +++ b/packages/@nitpicker/query/src/list-viewer-external-links.ts @@ -0,0 +1,83 @@ +import type { ListExternalLinksOptions, PaginatedExternalLinkList } from './types.js'; +import type { ArchiveAccessor } from '@nitpicker/crawler'; + +import { applyListOrder } from './apply-list-order.js'; +import { paginateQuery } from './paginate-query.js'; + +/** + * Lists unique external destinations from the `viewer_external_links` read + * model — the fast-path counterpart of {@link listExternalLinks}, backed by + * a table pre-aggregated once at read-model build time instead of a live + 
* `anchors` JOIN + `GROUP BY` per request (see + * ARCHITECTURE.md「設計注意(外部リンク read model)」for why the live + * version's `GROUP BY` + `COUNT(DISTINCT ...)` combination is a known + * SQLite performance pitfall). + * + * Same options/response shape as {@link listExternalLinks} — callers switch + * between the two purely based on whether the read model is current (see + * `register-links-route.ts`), with no visible contract difference. One + * accepted difference: `destUrl` sorts by plain `BINARY` collation here + * (matching `viewer_pages.url_sort_key`'s precedent), not the natural/ + * numeric-aware sort {@link listExternalLinks} uses via + * `ensureUrlSortTempTable` — the same fast-path/legacy sort divergence + * already accepted for `/api/pages`. + * @param accessor - The archive accessor to query. Callers are responsible + * for confirming the read model is built and current (see + * `isViewerReadModelCurrent`) before calling this — it assumes + * `viewer_external_links` exists and trusts its content. + * @param options - Filter, sort, and pagination options. + * @returns A paginated list of unique external destinations. + * @example + * if (await isViewerReadModelCurrent(accessor)) { + * const page = await listViewerExternalLinks(accessor, { limit: 100 }); + * } + */ +export async function listViewerExternalLinks( + accessor: ArchiveAccessor, + options: ListExternalLinksOptions = {}, +): Promise<PaginatedExternalLinkList> { + const knex = accessor.getKnex(); + const limit = options.limit ?? 100; + const offset = options.offset ?? 0; + const sortOrder = options.sortOrder ?? 
'asc'; + + const baseQuery = knex('viewer_external_links'); + if (options.urlPattern) { + baseQuery.where('dest_url', 'like', options.urlPattern); + } + if (options.status != null) { + baseQuery.where('status', options.status); + } + + const sortColumns: Record<'destUrl' | 'status' | 'referrerCount', { column: string }> = + { + destUrl: { column: '"viewer_external_links"."dest_url"' }, + status: { column: '"viewer_external_links"."status"' }, + referrerCount: { column: '"viewer_external_links"."referrer_count"' }, + }; + const sortBy = + options.sortBy && options.sortBy in sortColumns ? options.sortBy : 'destUrl'; + + return paginateQuery({ + baseQuery, + countColumn: 'dest_page_id', + applySelect: (q) => + applyListOrder( + q.select('dest_url as destUrl', 'status', 'referrer_count as referrerCount'), + knex, + sortBy, + sortOrder, + sortColumns, + // Tiebreaker: mirrors `listExternalLinks`'s `ORDER BY destId asc` — + // without it, ties on `status`/`referrerCount` make offset + // pagination duplicate or skip destinations across pages. 
+ ).orderBy('dest_page_id', 'asc'), + limit, + offset, + mapRow: (row: { destUrl: string; status: number | null; referrerCount: number }) => ({ + destUrl: row.destUrl, + status: row.status, + referrerCount: Number(row.referrerCount), + }), + }); +} diff --git a/packages/@nitpicker/query/src/query.ts b/packages/@nitpicker/query/src/query.ts index a8e95df..75cc61d 100644 --- a/packages/@nitpicker/query/src/query.ts +++ b/packages/@nitpicker/query/src/query.ts @@ -50,6 +50,7 @@ export { listPagesByJsonLdType } from './list-pages-by-jsonld-type.js'; export { listPagesByTag } from './list-pages-by-tag.js'; export { listResources } from './list-resources.js'; export { listUnusedResources } from './list-unused-resources.js'; +export { listViewerExternalLinks } from './list-viewer-external-links.js'; export { listViewerPages } from './list-viewer-pages.js'; export { prepareUrlSortTempTable } from './url-sort-temp-table.js'; export * from './types.js'; diff --git a/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.spec.ts b/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.spec.ts index a6d63c9..5de18cc 100644 --- a/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.spec.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.spec.ts @@ -1055,4 +1055,121 @@ describe('buildViewerReadModel', () => { ).toEqual([]); }); }); + + describe('external links population', () => { + const workingDir = path.resolve( + __dirname, + '__test_fixtures_build_read_model_external_links__', + ); + const archiveFilePath = path.resolve(workingDir, 'external-links-test.nitpicker'); + let archive: InstanceType<typeof Archive>; + + beforeAll(async () => { + const { mkdirSync } = await import('node:fs'); + mkdirSync(workingDir, { recursive: true }); + archive = await Archive.create({ filePath: archiveFilePath, cwd: workingDir }); + await archive.setConfig(BASE_CONFIG); + + // Two anchors on the same page to the same destination 
— must count + // as one referrer in viewer_external_links, not two. + await archive.setPage({ + url: parseUrl('https://example.com/page-a')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [ + { + href: parseUrl('https://ads.example.com/')!, + isExternal: true, + title: null, + textContent: 'Ad banner', + }, + { + href: parseUrl('https://ads.example.com/')!, + isExternal: true, + title: null, + textContent: 'Ad footer', + }, + ], + imageList: [], + isSkipped: false, + }); + + // A second, distinct referring page to the same destination. + await archive.setPage({ + url: parseUrl('https://example.com/page-b')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [ + { + href: parseUrl('https://ads.example.com/')!, + isExternal: true, + title: null, + textContent: 'Ad sidebar', + }, + ], + imageList: [], + isSkipped: false, + }); + + await archive.setPage({ + url: parseUrl('https://ads.example.com/')!, + redirectPaths: [], + isExternal: true, + isTarget: false, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + + await buildViewerReadModel(archive); + }); + + afterAll(async () => { + if (archive) { + await archive.releaseHandle(); + } + const { rmSync } = await import('node:fs'); + rmSync(workingDir, { recursive: true, force: true }); + }); + + it('populates viewer_external_links with one row per unique canonical destination', async () => { + const rows = await archive.getKnex()('viewer_external_links').select('*'); + expect(rows).toHaveLength(1); + expect(rows[0]).toMatchObject({ + dest_url: 
'https://ads.example.com', + status: 200, + referrer_count: 2, + }); + }); + + it('rebuilds viewer_external_links idempotently — a second build leaves the same row count', async () => { + await buildViewerReadModel(archive); + const rows = await archive.getKnex()('viewer_external_links').select('*'); + expect(rows).toHaveLength(1); + expect(rows[0]).toMatchObject({ referrer_count: 2 }); + }); + }); }); diff --git a/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.ts b/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.ts index 0bf5b4a..664ea08 100644 --- a/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/build-viewer-read-model.ts @@ -5,6 +5,7 @@ import { classifyContentType } from '../classify-content-type.js'; import { excludeSkippedPages } from '../exclude-skipped-pages.js'; import { buildDirectoryTreeRows } from './build-directory-tree-rows.js'; +import { computeExternalLinkRows } from './compute-external-link-rows.js'; import { computePageFacetBuckets } from './compute-page-facet-buckets.js'; import { createViewerReadModelTables } from './create-viewer-read-model-tables.js'; import { dropViewerReadModelTables } from './drop-viewer-read-model-tables.js'; @@ -200,11 +201,14 @@ function toViewerPageInsertRow(row: PagesSourceRow): ViewerPageInsertRow { } /** - * Performs a full rebuild of the viewer read model: drops all 7 tables if + * Performs a full rebuild of the viewer read model: drops all 8 tables if * present, recreates them, populates `viewer_pages` from the current * `pages` write-model table, populates `viewer_directory_nodes`/ * `viewer_directory_pages` from that same page set (see - * `buildDirectoryTreeRows` for the tree-building rules), seeds one + * `buildDirectoryTreeRows` for the tree-building rules), populates + * `viewer_external_links` from a dedicated `anchors` aggregation query (see + * `computeExternalLinkRows` — 
unlike the directory tree, this cannot reuse + * `sourceRows`, since link data lives on `anchors`, not `pages`), seeds one * smoke-test row into `viewer_query_profiles`, writes the * `viewer_count_buckets` totals row plus one row per distinct Pages-list * facet value (see `computePageFacetBuckets`), and writes the @@ -319,6 +323,18 @@ export async function buildViewerReadModel( ); } + // Unlike `viewer_pages`/the directory tree, this needs its own `anchors` + // query — `sourceRows` (loaded from `pages` only) has no anchor/link + // data. Runs once, here, instead of on every `/api/links?type=external` + // request — see `computeExternalLinkRows`'s docs for the SQLite + // performance rationale. + const externalLinkRows = await computeExternalLinkRows(trx); + for (let start = 0; start < externalLinkRows.length; start += INSERT_CHUNK_SIZE) { + await trx('viewer_external_links').insert( + externalLinkRows.slice(start, start + INSERT_CHUNK_SIZE), + ); + } + const total = insertRows.length; await trx('viewer_query_profiles').insert({ scope: 'pages', diff --git a/packages/@nitpicker/query/src/viewer-read-model/compute-external-link-rows.spec.ts b/packages/@nitpicker/query/src/viewer-read-model/compute-external-link-rows.spec.ts new file mode 100644 index 0000000..bde2b0d --- /dev/null +++ b/packages/@nitpicker/query/src/viewer-read-model/compute-external-link-rows.spec.ts @@ -0,0 +1,321 @@ +import path from 'node:path'; + +import { tryParseUrl as parseUrl } from '@d-zero/shared/parse-url'; +import { Archive } from '@nitpicker/crawler'; +import { afterAll, beforeAll, describe, expect, it } from 'vitest'; + +import { computeExternalLinkRows } from './compute-external-link-rows.js'; + +const __filename = new URL(import.meta.url).pathname; +const __dirname = path.dirname(__filename); + +const BASE_CONFIG = { + baseUrl: 'https://example.com', + name: 'test', + version: '0.10.0', + recursive: true, + interval: 0, + image: true, + fetchExternal: false, + parallels: 1, + roots: 
['https://example.com'], + excludes: [], + excludeKeywords: [], + excludeUrls: [], + maxExcludedDepth: 0, + retry: 3, + fromList: false, + disableQueries: false, + userAgent: 'test', + ignoreRobots: false, +}; + +const META = { + lang: null, + title: null, + description: null, + keywords: null, + noindex: false, + nofollow: false, + noarchive: false, + canonical: null, + alternate: null, + 'og:type': null, + 'og:title': null, + 'og:site_name': null, + 'og:description': null, + 'og:url': null, + 'og:image': null, + 'twitter:card': null, +}; + +describe('computeExternalLinkRows', () => { + const workingDir = path.resolve( + __dirname, + '__test_fixtures_compute_external_link_rows__', + ); + let archive: InstanceType<typeof Archive>; + const archiveFilePath = path.resolve( + workingDir, + 'compute-external-link-rows-test.nitpicker', + ); + + beforeAll(async () => { + const { mkdirSync } = await import('node:fs'); + mkdirSync(workingDir, { recursive: true }); + archive = await Archive.create({ filePath: archiveFilePath, cwd: workingDir }); + await archive.setConfig(BASE_CONFIG); + + // Page A: two anchors to ads.example.com (same page, must count as one + // referrer, not two), plus one to tracking. + await archive.setPage({ + url: parseUrl('https://example.com/page-a')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Page A' }, + anchorList: [ + { + href: parseUrl('https://ads.example.com/')!, + isExternal: true, + title: null, + textContent: 'Ad banner', + }, + { + href: parseUrl('https://ads.example.com/')!, + isExternal: true, + title: null, + textContent: 'Ad footer', + }, + { + href: parseUrl('https://tracking.example.com/')!, + isExternal: true, + title: null, + textContent: 'Tracking', + }, + ], + imageList: [], + isSkipped: false, + }); + + // Page B: a second, distinct referrer to ads.example.com.
+ await archive.setPage({ + url: parseUrl('https://example.com/page-b')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Page B' }, + anchorList: [ + { + href: parseUrl('https://ads.example.com/')!, + isExternal: true, + title: null, + textContent: 'Ad sidebar', + }, + ], + imageList: [], + isSkipped: false, + }); + + await archive.setPage({ + url: parseUrl('https://ads.example.com/')!, + redirectPaths: [], + isExternal: true, + isTarget: false, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://tracking.example.com/')!, + redirectPaths: [], + isExternal: true, + isTarget: false, + status: 404, + statusText: 'Not Found', + contentType: 'text/html', + contentLength: 0, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + }); + + afterAll(async () => { + if (archive) { + await archive.releaseHandle(); + } + const { rmSync } = await import('node:fs'); + rmSync(workingDir, { recursive: true, force: true }); + }); + + it('groups anchors by canonical destination, one row per unique destination', async () => { + const knex = archive.getKnex(); + const rows = await knex.transaction((trx) => computeExternalLinkRows(trx)); + expect(rows).toHaveLength(2); + }); + + it('counts referrers by distinct page id, not anchor count', async () => { + // Page A has two <a> tags to ads.example.com; combined with page B + // that's 2 distinct referring pages, not 3 anchors.
+ const knex = archive.getKnex(); + const rows = await knex.transaction((trx) => computeExternalLinkRows(trx)); + const ads = rows.find((row) => row.dest_url === 'https://ads.example.com'); + expect(ads).toMatchObject({ status: 200, referrer_count: 2 }); + }); + + it('carries the canonical destination status through', async () => { + const knex = archive.getKnex(); + const rows = await knex.transaction((trx) => computeExternalLinkRows(trx)); + const tracking = rows.find((row) => row.dest_url === 'https://tracking.example.com'); + expect(tracking).toMatchObject({ status: 404, referrer_count: 1 }); + }); +}); + +/** + * Mirrors `list-external-links.spec.ts`'s redirect-resolution describe + * block: an anchor to an internal redirect-source page and an anchor + * directly to the same external canonical destination must collapse into a + * single `viewer_external_links` row, not two. + */ +describe('computeExternalLinkRows — redirect resolution', () => { + const workingDir = path.resolve( + __dirname, + '__test_fixtures_compute_external_link_rows_redirect__', + ); + let archive: InstanceType<typeof Archive>; + const archiveFilePath = path.resolve( + workingDir, + 'compute-external-link-rows-redirect-test.nitpicker', + ); + + beforeAll(async () => { + const { mkdirSync } = await import('node:fs'); + mkdirSync(workingDir, { recursive: true }); + archive = await Archive.create({ filePath: archiveFilePath, cwd: workingDir }); + await archive.setConfig(BASE_CONFIG); + + await archive.setPage({ + url: parseUrl('https://example.com/direct')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Direct' }, + anchorList: [ + { + href: parseUrl('https://redirect-target.example.com/')!, + isExternal: true, + title: null, + textContent: 'Direct link', + }, + ], + imageList: [], + isSkipped: false, + }); + + await archive.setPage({ + url:
parseUrl('https://example.com/via-redirect')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: { ...META, title: 'Via redirect' }, + anchorList: [ + { + href: parseUrl('https://example.com/old')!, + isExternal: false, + title: null, + textContent: 'Old link', + hash: null, + }, + ], + imageList: [], + isSkipped: false, + }); + + await archive.setPage({ + url: parseUrl('https://redirect-target.example.com/')!, + redirectPaths: [], + isExternal: true, + isTarget: false, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + + await archive.setRedirect({ + url: parseUrl('https://example.com/old')!, + redirectPaths: ['https://redirect-target.example.com/'], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + }); + + afterAll(async () => { + if (archive) { + await archive.releaseHandle(); + } + const { rmSync } = await import('node:fs'); + rmSync(workingDir, { recursive: true, force: true }); + }); + + it('collapses a redirect-source anchor and a direct anchor onto the same canonical destination row', async () => { + const knex = archive.getKnex(); + const rows = await knex.transaction((trx) => computeExternalLinkRows(trx)); + expect(rows).toHaveLength(1); + expect(rows[0]).toMatchObject({ + dest_url: 'https://redirect-target.example.com', + referrer_count: 2, + }); + }); +}); diff --git a/packages/@nitpicker/query/src/viewer-read-model/compute-external-link-rows.ts b/packages/@nitpicker/query/src/viewer-read-model/compute-external-link-rows.ts new file mode 100644 index 0000000..a09dc2a --- /dev/null +++ 
b/packages/@nitpicker/query/src/viewer-read-model/compute-external-link-rows.ts @@ -0,0 +1,54 @@ +import type { ExternalLinkInsertRow } from './types.js'; +import type { Knex } from 'knex'; + +/** + * Computes every unique external destination reached from the site, for + * bulk insert into `viewer_external_links`. + * + * The aggregation itself (`COALESCE(canonical.*, dest.*)` redirect + * resolution, `GROUP BY` on the canonical destination id, `COUNT(DISTINCT + * source.id)` for the referrer count) is lifted verbatim from + * `list-external-links.ts`'s live query — see that file's docs for why the + * counting grain must stay in lockstep with `getPageDetail.inboundLinks` + * (#71). The only difference here is that this runs once, at read-model + * build time, against the full `anchors` table with no `LIMIT`/`OFFSET` — + * see ARCHITECTURE.md「設計注意(外部リンク read model)」for why running + * this JOIN + `GROUP BY` + `COUNT(DISTINCT ...)` combination on every + * `/api/links?type=external` request is a known SQLite performance + * pitfall, and why materialising it once avoids it. + * @param trx - An open Knex transaction (a plain `Knex` instance also + * works, e.g. in tests). + * @returns One row per unique canonical external destination. 
+ */ +export async function computeExternalLinkRows( + trx: Knex, +): Promise<ExternalLinkInsertRow[]> { + const destIdExpression = 'COALESCE("canonical"."id", "dest"."id")'; + const destUrlExpression = 'COALESCE("canonical"."url", "dest"."url")'; + const statusExpression = 'COALESCE("canonical"."status", "dest"."status")'; + + const rows: { + destPageId: number; + destUrl: string; + status: number | null; + referrerCount: number; + }[] = await trx('anchors') + .join('pages as source', 'anchors.pageId', '=', 'source.id') + .join('pages as dest', 'anchors.hrefId', '=', 'dest.id') + .leftJoin('pages as canonical', 'dest.redirectDestId', '=', 'canonical.id') + .whereRaw(`COALESCE("canonical"."isExternal", "dest"."isExternal") = 1`) + .groupBy(trx.raw(destIdExpression)) + .select( + trx.raw(`${destIdExpression} as "destPageId"`), + trx.raw(`${destUrlExpression} as "destUrl"`), + trx.raw(`${statusExpression} as "status"`), + trx.raw('count(distinct "source"."id") as "referrerCount"'), + ); + + return rows.map((row) => ({ + dest_page_id: row.destPageId, + dest_url: row.destUrl, + status: row.status, + referrer_count: Number(row.referrerCount), + })); +} diff --git a/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.spec.ts b/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.spec.ts index f34505c..570ef95 100644 --- a/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.spec.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.spec.ts @@ -30,7 +30,7 @@ describe('createViewerReadModelTables', () => { rmSync(workingDir, { recursive: true, force: true }); }); - it('creates all 7 tables and the named viewer_pages indexes', async () => { + it('creates all 8 tables and the named viewer_pages indexes', async () => { const knex = archive.getKnex(); await knex.transaction((trx) => createViewerReadModelTables(trx)); @@ -42,6 +42,7 @@ describe('createViewerReadModelTables', () => {
'viewer_page_anchors', 'viewer_directory_nodes', 'viewer_directory_pages', + 'viewer_external_links', ]) { expect(await knex.schema.hasTable(table)).toBe(true); } @@ -62,6 +63,14 @@ describe('createViewerReadModelTables', () => { ]) { expect(indexNames.has(indexName)).toBe(true); } + + const externalLinkIndexRows: Array<{ name: string }> = await knex('sqlite_master') + .where({ type: 'index', tbl_name: 'viewer_external_links' }) + .select('name'); + const externalLinkIndexNames = new Set(externalLinkIndexRows.map((r) => r.name)); + for (const indexName of ['vel_url', 'vel_status', 'vel_referrer_count']) { + expect(externalLinkIndexNames.has(indexName)).toBe(true); + } }); it('viewer_query_profiles enforces a composite (scope, profile_key) key, not a single-column rowid', async () => { @@ -122,4 +131,22 @@ describe('createViewerReadModelTables', () => { .orderBy('page_id'); expect(rows.map((r) => r.page_id)).toEqual([1, 2]); }); + + it('viewer_external_links rejects a duplicate dest_page_id', async () => { + const knex = archive.getKnex(); + await knex('viewer_external_links').insert({ + dest_page_id: 1, + dest_url: 'https://ads.example.com/', + status: 200, + referrer_count: 1, + }); + await expect( + knex('viewer_external_links').insert({ + dest_page_id: 1, + dest_url: 'https://ads.example.com/duplicate', + status: 200, + referrer_count: 2, + }), + ).rejects.toThrow(); + }); }); diff --git a/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.ts b/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.ts index 872e805..ecfed6d 100644 --- a/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/create-viewer-read-model-tables.ts @@ -1,14 +1,14 @@ import type { Knex } from 'knex'; /** - * Creates all 7 viewer-read-model tables (and `viewer_pages`'s named + * Creates all 8 viewer-read-model tables (and `viewer_pages`'s named * 
indexes) against the given connection. Assumes none of the tables * currently exist — callers (`buildViewerReadModel`) are responsible for * dropping any prior version first, inside the same transaction, so this * function is not itself idempotent. * * Every statement runs via `raw()` rather than knex's chainable schema - * builder: 5 of the 7 tables need `WITHOUT ROWID` / a composite primary key + * builder: 5 of the 8 tables need `WITHOUT ROWID` / a composite primary key * / a `CHECK` constraint / a table-level `UNIQUE` constraint, none of which * the chainable builder can express (the same reason `page_html_blobs` / * `page_html_ref` drop to `raw()` in `@nitpicker/crawler`'s @@ -157,4 +157,29 @@ export async function createViewerReadModelTables(trx: Knex): Promise<void> { await trx.raw( 'CREATE INDEX vdp_node_url ON viewer_directory_pages(node_id, page_url_sort_key, page_id)', ); + + // Pre-aggregated, deduplicated-by-canonical-destination external link + // list — see `computeExternalLinkRows`'s docs for why this needs its own + // `anchors` query rather than reusing `viewer_pages`'s `sourceRows` (the + // aggregation joins `anchors` at build time instead of on every read, + // see ARCHITECTURE.md「設計注意(外部リンク read model)」for the + // SQLite COUNT(DISTINCT)/GROUP BY performance rationale). No + // `_desc_key` columns like `viewer_pages` needs: pagination here is + // plain offset-based (via `paginateQuery`), not keyset-cursor, so a + // single ascending index scanned backward is enough for DESC.
+ await trx.raw(` + CREATE TABLE viewer_external_links ( + dest_page_id integer primary key, + dest_url text not null, + status integer, + referrer_count integer not null + ) + `); + await trx.raw('CREATE INDEX vel_url ON viewer_external_links(dest_url, dest_page_id)'); + await trx.raw( + 'CREATE INDEX vel_status ON viewer_external_links(status, dest_url, dest_page_id)', + ); + await trx.raw( + 'CREATE INDEX vel_referrer_count ON viewer_external_links(referrer_count, dest_url, dest_page_id)', + ); } diff --git a/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.spec.ts b/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.spec.ts index 1ebffad..f873e94 100644 --- a/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.spec.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.spec.ts @@ -37,7 +37,7 @@ describe('dropViewerReadModelTables', () => { ).resolves.toBeUndefined(); }); - it('drops all 7 tables after they were created', async () => { + it('drops all 8 tables after they were created', async () => { const knex = archive.getKnex(); await knex.transaction((trx) => createViewerReadModelTables(trx)); for (const table of [ @@ -48,6 +48,7 @@ describe('dropViewerReadModelTables', () => { 'viewer_page_anchors', 'viewer_directory_nodes', 'viewer_directory_pages', + 'viewer_external_links', ]) { expect(await knex.schema.hasTable(table)).toBe(true); } @@ -61,6 +62,7 @@ describe('dropViewerReadModelTables', () => { 'viewer_page_anchors', 'viewer_directory_nodes', 'viewer_directory_pages', + 'viewer_external_links', ]) { expect(await knex.schema.hasTable(table)).toBe(false); } diff --git a/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.ts b/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.ts index e53d560..5f66507 100644 --- 
a/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/drop-viewer-read-model-tables.ts @@ -1,16 +1,17 @@ import type { Knex } from 'knex'; /** - * Drops all 7 viewer-read-model tables if present, against the given + * Drops all 8 viewer-read-model tables if present, against the given * connection. Shared by `buildViewerReadModel` (which drops before * recreating, inside its own rebuild transaction) and - * `dropViewerReadModel` (which drops with no recreate), so the 7-table + * `dropViewerReadModel` (which drops with no recreate), so the 8-table * list only needs to be kept in sync with `createViewerReadModelTables` * in one place. * @param trx - An open Knex transaction (a plain `Knex` instance also * works, e.g. in tests). */ export async function dropViewerReadModelTables(trx: Knex): Promise<void> { + await trx.schema.dropTableIfExists('viewer_external_links'); await trx.schema.dropTableIfExists('viewer_directory_pages'); await trx.schema.dropTableIfExists('viewer_directory_nodes'); await trx.schema.dropTableIfExists('viewer_page_anchors'); diff --git a/packages/@nitpicker/query/src/viewer-read-model/types.ts b/packages/@nitpicker/query/src/viewer-read-model/types.ts index 2ded669..954c3b7 100644 --- a/packages/@nitpicker/query/src/viewer-read-model/types.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/types.ts @@ -157,3 +157,24 @@ export interface DirectoryTreeBuildResult { /** Every direct page-to-node membership row. */ pages: DirectoryPageInsertRow[]; } + +/** + * One row to insert into `viewer_external_links`, one per unique canonical + * (redirect-resolved) external destination. Produced by + * `computeExternalLinkRows`. + */ +export interface ExternalLinkInsertRow { + /** `COALESCE(canonical.id, dest.id)` — the canonical destination's `pages.id`. */ + dest_page_id: number; + /** `COALESCE(canonical.url, dest.url)` — the canonical destination URL, verbatim.
*/ + dest_url: string; + /** `COALESCE(canonical.status, dest.status)` — the canonical destination's HTTP status, or `null` if unknown. */ + status: number | null; + /** + * `COUNT(DISTINCT source.id)` — the number of distinct internal pages + * linking to this destination. Must stay in the same counting grain as + * `getPageDetail.inboundLinks` (see that function's docs, #71) — + * multiple anchors from the same page count once. + */ + referrer_count: number; +} diff --git a/packages/@nitpicker/query/src/viewer-read-model/viewer-read-model-schema-version.ts b/packages/@nitpicker/query/src/viewer-read-model/viewer-read-model-schema-version.ts index 988f7b1..978f7a0 100644 --- a/packages/@nitpicker/query/src/viewer-read-model/viewer-read-model-schema-version.ts +++ b/packages/@nitpicker/query/src/viewer-read-model/viewer-read-model-schema-version.ts @@ -7,4 +7,4 @@ * `viewer_read_model_meta.schema_version` to decide whether a rebuild is * needed. */ -export const VIEWER_READ_MODEL_SCHEMA_VERSION = 4; +export const VIEWER_READ_MODEL_SCHEMA_VERSION = 5; diff --git a/packages/@nitpicker/viewer/src/routes/register-links-route.spec.ts b/packages/@nitpicker/viewer/src/routes/register-links-route.spec.ts new file mode 100644 index 0000000..e9d5529 --- /dev/null +++ b/packages/@nitpicker/viewer/src/routes/register-links-route.spec.ts @@ -0,0 +1,223 @@ +import path from 'node:path'; + +import { tryParseUrl as parseUrl } from '@d-zero/shared/parse-url'; +import { Archive } from '@nitpicker/crawler'; +import { ArchiveManager, buildViewerReadModel } from '@nitpicker/query'; +import { afterAll, beforeAll, describe, expect, it } from 'vitest'; + +import { createApp } from '../create-app.js'; + +const __filename = new URL(import.meta.url).pathname; +const __dirname = path.dirname(__filename); + +const BASE_CONFIG = { + baseUrl: 'https://example.com', + name: 'test', + version: '0.10.0', + recursive: true, + interval: 0, + image: true, + fetchExternal: false, + parallels: 1, + 
roots: ['https://example.com'], + excludes: [], + excludeKeywords: [], + excludeUrls: [], + maxExcludedDepth: 0, + retry: 3, + fromList: false, + disableQueries: false, + userAgent: 'test', + ignoreRobots: false, +}; + +const META = { + lang: null, + title: null, + description: null, + keywords: null, + noindex: false, + nofollow: false, + noarchive: false, + canonical: null, + alternate: null, + 'og:type': null, + 'og:title': null, + 'og:site_name': null, + 'og:description': null, + 'og:url': null, + 'og:image': null, + 'twitter:card': null, +}; + +/** + * Builds a fixture archive with 2 internal pages linking to the same + * external destination (one page with 2 anchors, one with 1 — referrer + * count must land on 2, not 3) and returns an in-process Hono app wired to + * it via the same read-only-open path the real viewer uses, mirroring + * `register-pages-route.spec.ts`'s `buildFixture` helper. + * @param workingDir - Unique scratch directory for this fixture. + * @param withReadModel - Whether to build the `viewer_external_links` read + * model before opening read-only (exercises the fast path) or leave it + * unbuilt (exercises the legacy fallback path). + * @returns The app, archive, and manager — callers must close both in + * `afterAll`. 
+ */ +async function buildFixture(workingDir: string, withReadModel: boolean) { + const { mkdirSync } = await import('node:fs'); + mkdirSync(workingDir, { recursive: true }); + const archive = await Archive.create({ + filePath: path.resolve(workingDir, 'fixture.nitpicker'), + cwd: workingDir, + }); + await archive.setConfig(BASE_CONFIG); + + await archive.setPage({ + url: parseUrl('https://example.com/page-a')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [ + { + href: parseUrl('https://ads.example.com/')!, + isExternal: true, + title: null, + textContent: 'Ad banner', + }, + { + href: parseUrl('https://ads.example.com/')!, + isExternal: true, + title: null, + textContent: 'Ad footer', + }, + ], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://example.com/page-b')!, + redirectPaths: [], + isExternal: false, + isTarget: true, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [ + { + href: parseUrl('https://ads.example.com/')!, + isExternal: true, + title: null, + textContent: 'Ad sidebar', + }, + ], + imageList: [], + isSkipped: false, + }); + await archive.setPage({ + url: parseUrl('https://ads.example.com/')!, + redirectPaths: [], + isExternal: true, + isTarget: false, + status: 200, + statusText: 'OK', + contentType: 'text/html', + contentLength: 100, + responseHeaders: {}, + html: '', + meta: META, + anchorList: [], + imageList: [], + isSkipped: false, + }); + + if (withReadModel) { + await buildViewerReadModel(archive); + } + + const manager = new ArchiveManager(); + const { archiveId, mode } = await manager.open(archive.tmpDir); + const app = createApp({ + context: { + manager, + archiveId, + filePath: archive.tmpDir, + mode, + crawlerLockHolder: null, + }, + 
+ publicDir: '/tmp/no-such-dir-register-links-route-spec', + }); + return { app, archive, manager }; +} + +describe('registerLinksRoute — /api/links?type=external (integration)', () => { + describe('fast path (viewer_external_links read model built)', () => { + const workingDir = path.resolve( + __dirname, + '__test_fixtures_register_links_route_fast__', + ); + let fixture: Awaited<ReturnType<typeof buildFixture>>; + + beforeAll(async () => { + fixture = await buildFixture(workingDir, true); + }); + + afterAll(async () => { + await fixture.manager.closeAll(); + const { rmSync } = await import('node:fs'); + rmSync(workingDir, { recursive: true, force: true }); + }); + + it('returns the destination-deduped shape with the correct referrer count', async () => { + const res = await fixture.app.request('/api/links?type=external'); + const body = (await res.json()) as { + items: { destUrl: string; status: number | null; referrerCount: number }[]; + total: number; + }; + expect(body.total).toBe(1); + expect(body.items).toEqual([ + { destUrl: 'https://ads.example.com', status: 200, referrerCount: 2 }, + ]); + }); + }); + + describe('legacy fallback path (no read model built)', () => { + const workingDir = path.resolve( + __dirname, + '__test_fixtures_register_links_route_legacy__', + ); + let fixture: Awaited<ReturnType<typeof buildFixture>>; + + beforeAll(async () => { + fixture = await buildFixture(workingDir, false); + }); + + afterAll(async () => { + await fixture.manager.closeAll(); + const { rmSync } = await import('node:fs'); + rmSync(workingDir, { recursive: true, force: true }); + }); + + it('returns the same destination-deduped shape via the legacy live query', async () => { + const res = await fixture.app.request('/api/links?type=external'); + const body = (await res.json()) as { + items: { destUrl: string; status: number | null; referrerCount: number }[]; + total: number; + }; + expect(body.total).toBe(1); + expect(body.items).toEqual([ + { destUrl: 'https://ads.example.com', status: 200, referrerCount: 2 }, + ]); + }); + });
+}); diff --git a/packages/@nitpicker/viewer/src/routes/register-links-route.ts b/packages/@nitpicker/viewer/src/routes/register-links-route.ts index 5b24d98..e06d3c1 100644 --- a/packages/@nitpicker/viewer/src/routes/register-links-route.ts +++ b/packages/@nitpicker/viewer/src/routes/register-links-route.ts @@ -1,7 +1,12 @@ import type { ArchiveContext } from '../types.js'; import type { Hono } from 'hono'; -import { listExternalLinks, listLinks } from '@nitpicker/query'; +import { + isViewerReadModelCurrent, + listExternalLinks, + listLinks, + listViewerExternalLinks, +} from '@nitpicker/query'; import { toNumber } from '../query-params/to-number.js'; @@ -16,11 +21,22 @@ const VALID_LINK_TYPES = ['broken', 'external'] as const; * `/api/isolated-clusters`. `broken` stays anchor-level (one row per `<a>` * tag, resolved through `pages.redirectDestId` to the canonical final * destination unless `includeRedirectSources=true`) via `listLinks`. - * `external` is deduplicated by canonical destination via - * `listExternalLinks` — one row per unique destination with a - * `referrerCount` — so its response shape and query params differ (no - * `includeRedirectSources`, no `sourceUrl`/`isExternal`/`textContent` - * sort keys, an added `referrerCount` sort key). + * `external` is deduplicated by canonical destination — one row per unique + * destination with a `referrerCount` — so its response shape and query + * params differ (no `includeRedirectSources`, no + * `sourceUrl`/`isExternal`/`textContent` sort keys, an added + * `referrerCount` sort key). + * + * `external` dispatches to one of two backends per request, the same + * two-layer pattern `register-pages-route.ts` uses for `/api/pages`: + * + * - `listViewerExternalLinks` (the `viewer_external_links` read-model fast + * path) when the read model is built and current.
Unlike `/api/pages`, + * there is no filter that forces a legacy fallback: `urlPattern`/`status` + * both map directly onto `viewer_external_links` columns. + * - `listExternalLinks` (the legacy live `anchors` JOIN + `GROUP BY` query) + * otherwise — covers archives predating the read model. Both share the + * same options/response shape, so callers see no difference beyond speed. * @param app - The Hono application. * @param context - The opened archive context. */ @@ -43,7 +59,7 @@ export function registerLinksRoute(app: Hono, context: ArchiveContext): void { const sortOrder = c.req.query('sortOrder') as 'asc' | 'desc' | undefined; if (type === 'external') { - const result = await listExternalLinks(accessor, { + const params = { limit, offset, urlPattern, @@ -54,7 +70,10 @@ export function registerLinksRoute(app: Hono, context: ArchiveContext): void { | 'referrerCount' | undefined, sortOrder, - }); + }; + const result = (await isViewerReadModelCurrent(accessor)) + ? await listViewerExternalLinks(accessor, params) + : await listExternalLinks(accessor, params); return c.json(result); }
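
Reviewer note: the counting grain that `computeExternalLinkRows` materialises (group by the redirect-resolved destination, count distinct referring pages) can be sketched in memory as below. This is an illustrative sketch only, not the project's code: `PageRow`, `AnchorRow`, and `aggregateExternalLinks` are hypothetical names, and the sample data merely imitates the redirect fixture from the spec above.

```typescript
// Hypothetical in-memory stand-ins for the `pages` and `anchors` tables.
interface PageRow {
  id: number;
  url: string;
  isExternal: boolean;
  status: number | null;
  redirectDestId: number | null; // set on redirect-source pages
}

interface AnchorRow {
  pageId: number; // referring page
  hrefId: number; // literal link target
}

interface ExternalLinkRow {
  dest_page_id: number;
  dest_url: string;
  status: number | null;
  referrer_count: number;
}

// Assumed equivalent of the SQL: one-hop COALESCE(canonical.*, dest.*)
// resolution, GROUP BY canonical id, COUNT(DISTINCT source.id).
function aggregateExternalLinks(pages: PageRow[], anchors: AnchorRow[]): ExternalLinkRow[] {
  const byId = new Map<number, PageRow>();
  for (const page of pages) byId.set(page.id, page);

  const groups = new Map<number, { page: PageRow; sources: Set<number> }>();
  for (const anchor of anchors) {
    const dest = byId.get(anchor.hrefId);
    // The inner joins drop anchors whose source or target page is missing.
    if (dest === undefined || !byId.has(anchor.pageId)) continue;
    const canonical = dest.redirectDestId === null ? undefined : byId.get(dest.redirectDestId);
    const resolved = canonical ?? dest;
    if (!resolved.isExternal) continue;
    let group = groups.get(resolved.id);
    if (group === undefined) {
      group = { page: resolved, sources: new Set<number>() };
      groups.set(resolved.id, group);
    }
    // A Set keeps several anchors from one page down to a single referrer.
    group.sources.add(anchor.pageId);
  }

  return [...groups.values()].map(({ page, sources }) => ({
    dest_page_id: page.id,
    dest_url: page.url,
    status: page.status,
    referrer_count: sources.size,
  }));
}

// Sample data: page 1 links directly (twice), page 2 links via redirect source 4.
const pages: PageRow[] = [
  { id: 1, url: 'https://example.com/direct', isExternal: false, status: 200, redirectDestId: null },
  { id: 2, url: 'https://example.com/via-redirect', isExternal: false, status: 200, redirectDestId: null },
  { id: 3, url: 'https://redirect-target.example.com', isExternal: true, status: 200, redirectDestId: null },
  { id: 4, url: 'https://example.com/old', isExternal: false, status: 200, redirectDestId: 3 },
];
const anchors: AnchorRow[] = [
  { pageId: 1, hrefId: 3 },
  { pageId: 1, hrefId: 3 },
  { pageId: 2, hrefId: 4 },
];

const rows = aggregateExternalLinks(pages, anchors);
console.log(rows);
```

With this sample data the three anchors collapse into a single row for page 3 with `referrer_count` 2, which is the behaviour the redirect-resolution spec asserts. The real query runs in SQLite over the full `anchors` table; the sketch only illustrates why a set of source ids, rather than an anchor tally, is the right grain.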