From ddf869d2c8f3020db0366d2f90b1c57800bf2378 Mon Sep 17 00:00:00 2001
From: myoshizumi
Date: Wed, 12 Nov 2025 22:37:56 +0900
Subject: [PATCH] SQL: Basic Select 1050. Actors and Directors Who Cooperated At Least Three Times

---
 ...ooperated_At_Least_Three_Times_mysql.ipynb | 222 ++++++++++++++
 ...operated_At_Least_Three_Times_pandas.ipynb | 228 +++++++++++++++
 ...perated_At_Least_Three_Times_posgres.ipynb | 275 ++++++++++++++++++
 3 files changed, 725 insertions(+)
 create mode 100644 SQL/Leetcode/Basic select/1050. Actors and Directors Who Cooperated At Least Three Times/gpt5 thinking customized/Actors_and_Directors_Who_Cooperated_At_Least_Three_Times_mysql.ipynb
 create mode 100644 SQL/Leetcode/Basic select/1050. Actors and Directors Who Cooperated At Least Three Times/gpt5 thinking customized/Actors_and_Directors_Who_Cooperated_At_Least_Three_Times_pandas.ipynb
 create mode 100644 SQL/Leetcode/Basic select/1050. Actors and Directors Who Cooperated At Least Three Times/gpt5 thinking customized/Actors_and_Directors_Who_Cooperated_At_Least_Three_Times_posgres.ipynb

diff --git a/SQL/Leetcode/Basic select/1050. Actors and Directors Who Cooperated At Least Three Times/gpt5 thinking customized/Actors_and_Directors_Who_Cooperated_At_Least_Three_Times_mysql.ipynb b/SQL/Leetcode/Basic select/1050. Actors and Directors Who Cooperated At Least Three Times/gpt5 thinking customized/Actors_and_Directors_Who_Cooperated_At_Least_Three_Times_mysql.ipynb
new file mode 100644
index 00000000..d9918f99
--- /dev/null
+++ b/SQL/Leetcode/Basic select/1050. 
Actors and Directors Who Cooperated At Least Three Times/gpt5 thinking customized/Actors_and_Directors_Who_Cooperated_At_Least_Three_Times_mysql.ipynb
@@ -0,0 +1,222 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "015e0603",
+   "metadata": {},
+   "source": [
+    "# MySQL 8.0.40\n",
+    "\n",
+    "## 0) Assumptions\n",
+    "\n",
+    "* Engine: **MySQL 8**\n",
+    "* Ordering: any (no `ORDER BY`)\n",
+    "* `NOT IN` is not used\n",
+    "* Matching is **ID-based**; output follows the spec: `actor_id, director_id`\n",
+    "\n",
+    "## 1) Problem\n",
+    "\n",
+    "* `Find every (actor_id, director_id) pair in which the actor and the director have cooperated at least three times.`\n",
+    "* Example input table: `ActorDirector(actor_id INT, director_id INT, timestamp INT PRIMARY KEY)`\n",
+    "* Output spec: `actor_id, director_id` (no duplicates, any order)\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 2) Optimal solution (single query)\n",
+    "\n",
+    "> Count cooperations per pair with a window aggregate, then project only the pairs with three or more. `DISTINCT` removes the duplicates.\n",
+    "\n",
+    "```sql\n",
+    "WITH win AS (\n",
+    "    SELECT\n",
+    "        actor_id,\n",
+    "        director_id,\n",
+    "        COUNT(*) OVER (PARTITION BY actor_id, director_id) AS coop_cnt\n",
+    "    FROM ActorDirector\n",
+    ")\n",
+    "SELECT DISTINCT\n",
+    "    actor_id,\n",
+    "    director_id\n",
+    "FROM win\n",
+    "WHERE coop_cnt >= 3;\n",
+    "\n",
+    "-- Runtime 349 ms\n",
+    "-- Beats 56.27%\n",
+    "\n",
+    "```\n",
+    "\n",
+    "* Key point: `COUNT(*) OVER (PARTITION BY actor_id, director_id)` computes the total row count per pair in a single pass.\n",
+    "* Only the spec columns are returned, with no ordering.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 3) Alternative solution\n",
+    "\n",
+    "> Plain aggregation is fast enough here. In practice this tends to be the lowest-cost option.\n",
+    "\n",
+    "```sql\n",
+    "SELECT\n",
+    "    actor_id,\n",
+    "    director_id\n",
+    "FROM ActorDirector\n",
+    "GROUP BY actor_id, director_id\n",
+    "HAVING COUNT(*) >= 3;\n",
+    "\n",
+    "-- Runtime 348 ms\n",
+    "-- Beats 57.64%\n",
+    "\n",
+    "```\n",
+    "\n",
+    "* Extra measure (for reference, only when needed): if exact duplicate rows must be removed first, build a derived table with `SELECT DISTINCT actor_id, director_id, timestamp ...` and run the `GROUP BY` over it.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 4) Key points\n",
+    "\n",
+    "* **Approach**: count occurrences per pair → filter on the threshold (3) → project only the required columns.\n",
+    "* **NULLs / duplicates**: the input columns are INT and assumed non-NULL; `timestamp` is the primary key, so rows are never duplicated.\n",
+    "* **Indexes**: with only `PRIMARY KEY(timestamp)`, the per-pair aggregation tends to become a full table scan.\n",
+    "\n",
+    "  * If possible, create a **composite index on `(actor_id, director_id)`** so that the `GROUP BY` / `PARTITION BY` aggregation runs more efficiently.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 5) Complexity (rough)\n",
+    "\n",
+    "* Window solution: **O(N)** to **O(N log N)** for the per-partition aggregation (depends on the implementation and the in-memory algorithm).\n",
+    "* `GROUP BY` solution: around **O(N log N)** with sort-based or temporary-table aggregation (effectively close to **O(N)** with a suitable index).\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 6) Diagram (Mermaid, ultra-conservative syntax)\n",
+    "\n",
+    "```mermaid\n",
+    "flowchart TD\n",
+    "    A[Input table]\n",
+    "    B[Compute count per pair]\n",
+    "    C[Keep counts of 3 or more]\n",
+    "    D[Output actor ID director ID]\n",
+    "    A --> B\n",
+    "    B --> C\n",
+    "    C --> D\n",
+    "```\n",
+    "\n",
+    "---\n",
+    "\n",
+    "### Supplement (notes for production use)\n",
+    "\n",
+    "* With large data volumes, the `GROUP BY` version plus a composite index on `(actor_id, director_id)` tends to be the simplest and fastest setup.\n",
+    "* The result order is arbitrary, so **leaving out `ORDER BY`** avoids an unnecessary sort.\n",
+    "\n",
+    "Bottom line: **the query itself is already the shortest path**. What makes the difference is **the execution plan (indexes, statistics, memory) rather than the implementation**. Applying the steps below in order can shorten the runtime noticeably.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Quick-win improvements (apply in order)\n",
+    "\n",
+    "1. **Add a composite index (most important)**\n",
+    "   The goal is to let `GROUP BY actor_id, director_id` walk the index in order and so **avoid the sort / temporary table**.\n",
+    "\n",
+    "```sql\n",
+    "CREATE INDEX idx_actor_director ON ActorDirector (actor_id, director_id);\n",
+    "```\n",
+    "\n",
+    "* With this, the `GROUP BY` version becomes essentially an **in-order index scan followed by aggregation**.\n",
+    "* Because the result columns are exactly the index key, the query is answered from the **secondary index alone** (InnoDB secondary-index leaves also carry the PK, but it is not referenced here).\n",
+    "\n",
+    "2. **Shelve the window version and use `GROUP BY`**\n",
+    "   The window version carries every input row through `COUNT(*) OVER` only for `DISTINCT` to collapse them again, which is wasted work. The leaner form is:\n",
+    "\n",
+    "```sql\n",
+    "SELECT actor_id, director_id\n",
+    "FROM ActorDirector\n",
+    "GROUP BY actor_id, director_id\n",
+    "HAVING COUNT(*) >= 3;\n",
+    "```\n",
+    "\n",
+    "> When the index is used, this can run **without a filesort or a temporary table**.\n",
+    "\n",
+    "3. **Refresh the statistics**\n",
+    "\n",
+    "```sql\n",
+    "ANALYZE TABLE ActorDirector;\n",
+    "```\n",
+    "\n",
+    "* With stale statistics the optimizer sometimes fails to pick the index.\n",
+    "\n",
+    "4. **Check how the grouping is executed (MySQL 8)**\n",
+    "   In some cases the grouping goes through an internal temporary table and slows down. Only when the plan degrades, steer it with an index hint (`GROUP_INDEX`, MySQL 8.0.20 and later).\n",
+    "\n",
+    "```sql\n",
+    "-- Only when a bad plan shows up\n",
+    "SELECT /*+ GROUP_INDEX(ActorDirector idx_actor_director) */\n",
+    "    actor_id, director_id\n",
+    "FROM ActorDirector\n",
+    "GROUP BY actor_id, director_id\n",
+    "HAVING COUNT(*) >= 3;\n",
+    "```\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Further options (depending on data volume and update frequency)\n",
+    "\n",
+    "* **Aggregate summary table** (a stand-in for a materialized view)\n",
+    "\n",
+    "  ```sql\n",
+    "  -- Initial load\n",
+    "  CREATE TABLE CoopSummary AS\n",
+    "  SELECT actor_id, director_id, COUNT(*) AS coop_cnt\n",
+    "  FROM ActorDirector\n",
+    "  GROUP BY actor_id, director_id;\n",
+    "\n",
+    "  CREATE UNIQUE INDEX ux_coop ON CoopSummary(actor_id, director_id);\n",
+    "  ```\n",
+    "\n",
+    "  * After that, apply increments in batches (aggregate only the new `ActorDirector` rows → `INSERT ... ON DUPLICATE KEY UPDATE`).\n",
+    "  * This problem is an existence check on a threshold of 3, so `WHERE coop_cnt >= 3` answers it **immediately**.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Expected effect (rough guide)\n",
+    "\n",
+    "* Adding the composite index alone not uncommonly gives a 2x to 10x speedup, even at **medium to large scale**.\n",
+    "* At the scale of your measurement (~350 ms), it could drop into the **tens of milliseconds to the 100 ms range** (depending on I/O and memory).\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Checklist (always run EXPLAIN)\n",
+    "\n",
+    "```sql\n",
+    "EXPLAIN\n",
+    "SELECT actor_id, director_id\n",
+    "FROM ActorDirector\n",
+    "GROUP BY actor_id, director_id\n",
+    "HAVING COUNT(*) >= 3;\n",
+    "```\n",
+    "\n",
+    "* Is `key = idx_actor_director` selected?\n",
+    "* Does `Extra` show **Using index**, and are **Using temporary / Using filesort gone**?\n",
+    "* Is the row estimate (`rows`) realistic (are the statistics being used)?\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Summary (prescription)\n",
+    "\n",
+    "* **Base query**: `GROUP BY ... HAVING COUNT(*) >= 3`\n",
+    "* **Required index**: `(actor_id, director_id)`\n",
+    "* **Statistics refresh**: `ANALYZE TABLE`\n",
+    "* (Only when needed) **`GROUP_INDEX` hint**\n",
+    "* If the production workload is still heavy, an **aggregate summary table** addresses the root cause\n",
+    "\n",
+    "Handled in this order, the numbers can still come down. The bottleneck is not the query but the **plan**.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/SQL/Leetcode/Basic select/1050. 
Actors and Directors Who Cooperated At Least Three Times/gpt5 thinking customized/Actors_and_Directors_Who_Cooperated_At_Least_Three_Times_pandas.ipynb b/SQL/Leetcode/Basic select/1050. Actors and Directors Who Cooperated At Least Three Times/gpt5 thinking customized/Actors_and_Directors_Who_Cooperated_At_Least_Three_Times_pandas.ipynb
new file mode 100644
index 00000000..7afa5331
--- /dev/null
+++ b/SQL/Leetcode/Basic select/1050. Actors and Directors Who Cooperated At Least Three Times/gpt5 thinking customized/Actors_and_Directors_Who_Cooperated_At_Least_Three_Times_pandas.ipynb
@@ -0,0 +1,228 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "e3eef12d",
+   "metadata": {},
+   "source": [
+    "# For pandas 2.2.2\n",
+    "\n",
+    "## 0) Assumptions\n",
+    "\n",
+    "* Environment: **Python 3.10.15 / pandas 2.2.2**\n",
+    "* **Keep the given signature exactly** (function name, argument names, returned columns, column order)\n",
+    "* No I/O; no unnecessary `print` or `sort_values`\n",
+    "\n",
+    "## 1) Problem\n",
+    "\n",
+    "* `Extract the (actor_id, director_id) pairs that have cooperated at least 3 times.`\n",
+    "* Input DF: `ActorDirector(actor_id: int, director_id: int, timestamp: int)` (one row per cooperation; `timestamp` is unique)\n",
+    "* Output: `actor_id, director_id` (no duplicates, any order)\n",
+    "\n",
+    "## 2) Implementation (given signature kept exactly)\n",
+    "\n",
+    "> The principle is **column pruning → group processing (aggregation) → conditional filter → final projection**, which settles the answer with minimal memory.\n",
+    "\n",
+    "```python\n",
+    "import pandas as pd\n",
+    "\n",
+    "def find_cooperative_pairs(actor_director: pd.DataFrame) -> pd.DataFrame:\n",
+    "    \"\"\"\n",
+    "    Returns:\n",
+    "        pd.DataFrame: columns, in this order, are ['actor_id', 'director_id']\n",
+    "    \"\"\"\n",
+    "    # 1) Column pruning (keep only the needed columns)\n",
+    "    df = actor_director[['actor_id', 'director_id']]\n",
+    "\n",
+    "    # 2) Count rows per pair\n",
+    "    cnt = (\n",
+    "        df.groupby(['actor_id', 'director_id'], as_index=False)\n",
+    "          .size()  # → columns: ['actor_id','director_id','size']\n",
+    "    )\n",
+    "\n",
+    "    # 3) Filter on the threshold (3 or more)\n",
+    "    kept = cnt.loc[cnt['size'] >= 3, ['actor_id', 'director_id']]\n",
+    "\n",
+    "    # 4) Return only the spec columns (any order, no duplicates)\n",
+    "    return kept\n",
+    "\n",
+    "# LeetCode submission result\n",
+    "# Runtime 282 ms\n",
+    "# Beats 63.13%\n",
+    "# Memory 67.90 MB\n",
+    "# Beats 28.71%\n",
+    "\n",
+    "```\n",
+    "\n",
+    "### Alternative (semi-join style with `transform`: equivalent, somewhat heavier)\n",
+    "\n",
+    "```python\n",
+    "def find_cooperative_pairs_alt(actor_director: pd.DataFrame) -> pd.DataFrame:\n",
+    "    df = actor_director[['actor_id', 'director_id']]\n",
+    "    coop_cnt = df.groupby(['actor_id', 'director_id'])['actor_id'].transform('size')\n",
+    "    kept_pairs = df.loc[coop_cnt >= 3, ['actor_id', 'director_id']].drop_duplicates()\n",
+    "    return kept_pairs\n",
+    "\n",
+    "# LeetCode submission result\n",
+    "# Runtime 305 ms\n",
+    "# Beats 27.71%\n",
+    "# Memory 67.52 MB\n",
+    "# Beats 79.84%\n",
+    "\n",
+    "```\n",
+    "\n",
+    "## 3) Algorithm notes\n",
+    "\n",
+    "* APIs used:\n",
+    "\n",
+    "  * `DataFrame.groupby(['actor_id','director_id']).size()`: computes the number of occurrences per pair (no window needed)\n",
+    "  * `loc[...]`: threshold filter\n",
+    "  * `drop_duplicates()`: final de-duplication in the alternative (not needed in the main version)\n",
+    "* **NULLs / duplicates / types**:\n",
+    "\n",
+    "  * The input is assumed to be integers with no NULLs (per the problem spec). If NULLs are possible, consider `dropna(subset=[...])` as preprocessing.\n",
+    "  * `timestamp` is the primary key but is unused here (only the ID-based cooperation count is needed).\n",
+    "  * The result has **no duplicates** and a fixed column order `['actor_id','director_id']`.\n",
+    "\n",
+    "## 4) Complexity (rough)\n",
+    "\n",
+    "* `groupby.size()`: **O(N)** to **O(N log n_g)** (implementation-dependent; roughly linear on average)\n",
+    "* Memory: proportional to the number of pairs (`O(#unique(actor_id, director_id))`)\n",
+    "\n",
+    "## 5) Diagram (Mermaid, ultra-conservative syntax)\n",
+    "\n",
+    "```mermaid\n",
+    "flowchart TD\n",
+    "    A[Input DataFrame]\n",
+    "    B[Column pruning actor_id director_id]\n",
+    "    C[Group processing count with size]\n",
+    "    D[Filter count of 3 or more]\n",
+    "    E[Output actor_id director_id]\n",
+    "    A --> B\n",
+    "    B --> C\n",
+    "    C --> D\n",
+    "    D --> E\n",
+    "```\n",
+    "\n",
+    "**Supplement (performance tips)**\n",
+    "When the data is extremely large, input that is already sorted by `actor_id, director_id` (or fed in partitions on the same key) can sometimes improve the chunked-processing efficiency of `groupby` (not used in this answer because `sort_values` is disallowed). For workloads that repeat many queries, caching a pair → count summary DataFrame is an effective strategy.\n",
+    "\n",
+    "Up front: the **`value_counts` version** and the **NumPy version** are candidates that can be faster and lighter on memory, though the gain is workload-dependent and should be measured. Both follow the same principle of column pruning → aggregation → threshold filter.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Improvement A (minimal change, `value_counts`)\n",
+    "\n",
+    "`DataFrame.value_counts` groups on the given columns internally, so it usually performs about the same as `groupby.size()` while being shorter to write; measure before assuming a speedup.\n",
+    "\n",
+    "```python\n",
+    "import pandas as pd\n",
+    "\n",
+    "def find_cooperative_pairs(actor_director: pd.DataFrame) -> pd.DataFrame:\n",
+    "    \"\"\"\n",
+    "    Returns:\n",
+    "        pd.DataFrame: ['actor_id', 'director_id']\n",
+    "    \"\"\"\n",
+    "    df = actor_director[['actor_id', 'director_id']]\n",
+    "\n",
+    "    kept = (\n",
+    "        df.value_counts(['actor_id', 'director_id'])  # Series: MultiIndex -> count\n",
+    "          .loc[lambda s: s >= 3]                      # threshold\n",
+    "          .reset_index()[['actor_id', 'director_id']] # spec columns only\n",
+    "    )\n",
+    "    return kept\n",
+    "\n",
+    "# LeetCode submission result\n",
+    "# Runtime 292 ms\n",
+    "# Beats 45.42%\n",
+    "# Memory 67.55 MB\n",
+    "# Beats 79.84%\n",
+    "\n",
+    "```\n",
+    "\n",
+    "**Rationale**\n",
+    "\n",
+    "* `value_counts` relies on the same hash-based grouping as `groupby`, so any difference from `groupby.size()` is usually small\n",
+    "* Only the keys are returned, so `reset_index()` followed by column selection finishes the job (the default `sort=True` orders by count, which is harmless here because order is unconstrained; `sort=False` skips that step)\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Improvement B (direct NumPy, aiming for throughput)\n",
+    "\n",
+    "For when you want to push further on huge data. `np.unique(axis=0)` has a **sorting side effect**, but since order is unconstrained here that is not a problem.\n",
+    "\n",
+    "```python\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "\n",
+    "def find_cooperative_pairs_numpy(actor_director: pd.DataFrame) -> pd.DataFrame:\n",
+    "    \"\"\"\n",
+    "    Returns:\n",
+    "        pd.DataFrame: ['actor_id', 'director_id']\n",
+    "    \"\"\"\n",
+    "    a = actor_director[['actor_id', 'director_id']].to_numpy(copy=False)\n",
+    "    # uniques comes back sorted lexicographically (fine, since any order is accepted)\n",
+    "    uniques, counts = np.unique(a, axis=0, return_counts=True)\n",
+    "    res = uniques[counts >= 3]\n",
+    "    return pd.DataFrame(res, columns=['actor_id', 'director_id'])\n",
+    "\n",
+    "# LeetCode submission result\n",
+    "# Runtime 286 ms\n",
+    "# Beats 55.73%\n",
+    "# Memory 66.76 MB\n",
+    "# Beats 99.65%\n",
+    "\n",
+    "```\n",
+    "\n",
+    "**Rationale**\n",
+    "\n",
+    "* Keeps extra objects to a minimum (`copy=False`)\n",
+    "* Can be faster and leaner than `groupby` / `value_counts` in some cases (especially many duplicates over a large row count)\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Fine-tuning options (as needed)\n",
+    "\n",
+    "* **dtype downcasting**: `int64` → `int32` reduces memory (if the IDs fit in 32 bits)\n",
+    "\n",
+    "  ```python\n",
+    "  actor_director = actor_director.astype({'actor_id':'int32','director_id':'int32'}, copy=False)\n",
+    "  ```\n",
+    "* **Categoricals (heavy duplication)**: with extreme duplication, converting to `category` can help `groupby.size()` (the cast has a cost, so measure)\n",
+    "\n",
+    "  ```python\n",
+    "  ad = actor_director.assign(\n",
+    "      actor_id=actor_director['actor_id'].astype('category'),\n",
+    "      director_id=actor_director['director_id'].astype('category')\n",
+    "  )\n",
+    "  # then groupby([...], observed=True).size() as before (observed=True skips unobserved category combinations)\n",
+    "  ```\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Why it can be fast (brief)\n",
+    "\n",
+    "* `value_counts`: hashes the key set, aggregates, and returns a Series in one call; it shares the grouping machinery with `groupby.size()`, so expect comparable speed.\n",
+    "* NumPy: bypasses pandas indexing and allocation and works on the array in one shot. `axis=0` does trigger a sort, but that does not conflict with this problem's requirements.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Summary (recommended order)\n",
+    "\n",
+    "1. **Try the `value_counts` version first** (low risk, small change).\n",
+    "2. To push further, or at very large scale, try the **NumPy version**.\n",
+    "3. Alongside that, use **`int32` downcasting** to hold memory down.\n",
+    "\n",
+    "Swapping in this order may leave room to bring the measured **282 ms down toward roughly 200 ms (environment-dependent)**.\n",
+    "\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/SQL/Leetcode/Basic select/1050. Actors and Directors Who Cooperated At Least Three Times/gpt5 thinking customized/Actors_and_Directors_Who_Cooperated_At_Least_Three_Times_posgres.ipynb b/SQL/Leetcode/Basic select/1050. Actors and Directors Who Cooperated At Least Three Times/gpt5 thinking customized/Actors_and_Directors_Who_Cooperated_At_Least_Three_Times_posgres.ipynb
new file mode 100644
index 00000000..8abd21d4
--- /dev/null
+++ b/SQL/Leetcode/Basic select/1050. 
Actors and Directors Who Cooperated At Least Three Times/gpt5 thinking customized/Actors_and_Directors_Who_Cooperated_At_Least_Three_Times_posgres.ipynb
@@ -0,0 +1,275 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4822e96e",
+   "metadata": {},
+   "source": [
+    "# PostgreSQL 16.6+\n",
+    "\n",
+    "## 0) Assumptions\n",
+    "\n",
+    "* Engine: **PostgreSQL 16.6+**\n",
+    "* Ordering: any (no `ORDER BY`)\n",
+    "* Avoid `NOT IN` (not used in this problem)\n",
+    "* Matching is **ID-based**; output follows the spec: `actor_id, director_id`\n",
+    "\n",
+    "## 1) Problem\n",
+    "\n",
+    "* `Extract the cooperating pairs whose identical (actor_id, director_id) appears three or more times.`\n",
+    "* Input: `ActorDirector(actor_id int, director_id int, \"timestamp\" int PRIMARY KEY)`\n",
+    "* Output: `actor_id, director_id` (no duplicates, any order)\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 2) Optimal solution (single query)\n",
+    "\n",
+    "> In PostgreSQL this can likewise be written plainly as **window + outer de-duplication** (meets the requirements). In practice, though, the **`GROUP BY` version** described below tends to be the **fastest**.\n",
+    "\n",
+    "```sql\n",
+    "WITH pre AS (\n",
+    "    SELECT actor_id, director_id\n",
+    "    FROM ActorDirector\n",
+    "),\n",
+    "win AS (\n",
+    "    SELECT\n",
+    "        actor_id,\n",
+    "        director_id,\n",
+    "        COUNT(*) OVER (PARTITION BY actor_id, director_id) AS coop_cnt\n",
+    "    FROM pre\n",
+    ")\n",
+    "SELECT DISTINCT\n",
+    "    actor_id,\n",
+    "    director_id\n",
+    "FROM win\n",
+    "WHERE coop_cnt >= 3;\n",
+    "\n",
+    "-- LeetCode submission result\n",
+    "-- Runtime:\n",
+    "--   314 ms\n",
+    "-- Beats:\n",
+    "--   44.19%\n",
+    "\n",
+    "```\n",
+    "\n",
+    "### Alternative (recommended: single aggregation)\n",
+    "\n",
+    "> PostgreSQL's planner handles aggregation well, so this is generally faster.\n",
+    "\n",
+    "```sql\n",
+    "SELECT\n",
+    "    actor_id,\n",
+    "    director_id\n",
+    "FROM ActorDirector\n",
+    "GROUP BY actor_id, director_id\n",
+    "HAVING COUNT(*) >= 3;\n",
+    "\n",
+    "-- LeetCode submission result\n",
+    "-- Runtime:\n",
+    "--   295 ms\n",
+    "-- Beats:\n",
+    "--   69.28%\n",
+    "\n",
+    "```\n",
+    "\n",
+    "> Same result. It is **easier to read and avoids carrying every row through a window step**, so start here.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 3) Key points\n",
+    "\n",
+    "* The **minimal sufficient condition** is simply that the per-pair count is ≥ 3.\n",
+    "  Even with a window, attach the total via `COUNT(*) OVER (PARTITION BY ...)` first and then filter on the threshold.\n",
+    "* **As a set operation, `GROUP BY ... HAVING COUNT(*) >= 3` is the shortest route**.\n",
+    "  The window version keeps every input row until the outer `DISTINCT`, so memory and I/O tend to grow.\n",
+    "* **Column name**: `timestamp` is not a reserved word in PostgreSQL, but it is also a type name, so double-quoting it in DDL is the safe choice (the examples above assume the table already exists).\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 4) Complexity (rough)\n",
+    "\n",
+    "* Window: **O(Σ n_g log n_g)** (depends on sorting / buffering within partitions)\n",
+    "* Aggregation (recommended query): **O(N)** to **O(N log N)** (HashAggregate or GroupAggregate)\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 5) Diagram (Mermaid, ultra-conservative syntax)\n",
+    "\n",
+    "```mermaid\n",
+    "flowchart TD\n",
+    "    A[Input table]\n",
+    "    B[Compute count per pair]\n",
+    "    C[Keep threshold 3 or more]\n",
+    "    D[Output actor_id director_id]\n",
+    "    A --> B\n",
+    "    B --> C\n",
+    "    C --> D\n",
+    "```\n",
+    "\n",
+    "---\n",
+    "\n",
+    "### Production notes (speed tuning)\n",
+    "\n",
+    "* **Index**\n",
+    "  Gives the planner the option of aggregating in a single in-order pass.\n",
+    "\n",
+    "  ```sql\n",
+    "  CREATE INDEX ON ActorDirector (actor_id, director_id);\n",
+    "  ```\n",
+    "\n",
+    "  * This makes it easier for the planner to pick whichever of `GroupAggregate` (in index order) or `HashAggregate` is better.\n",
+    "* **Statistics**\n",
+    "\n",
+    "  ```sql\n",
+    "  ANALYZE ActorDirector;\n",
+    "  ```\n",
+    "\n",
+    "  With stale statistics the planner tends to choose a full scan.\n",
+    "* **Work memory**\n",
+    "  If `HashAggregate` spills with a large number of groups, adjust per session:\n",
+    "\n",
+    "  ```sql\n",
+    "  SET work_mem = '128MB'; -- size to the workload\n",
+    "  ```\n",
+    "* **Verification**\n",
+    "  Use `EXPLAIN (ANALYZE, BUFFERS)` to check `HashAggregate` vs `GroupAggregate`, `Rows Removed by Filter`, and `Buffers: shared hit/read`.\n",
+    "  If it must shrink further, the standard approach is to batch-update a **summary table** (coop_cnt per actor_id, director_id) in the ingestion pipeline.\n",
+    "\n",
+    "Bottom line: **`GROUP BY ... HAVING COUNT(*) >= 3` is the shortest course**. From here on, the gains come from **physical design and the execution plan**. Keep the implementation as is and optimize around it.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Recommended tuning steps (by priority)\n",
+    "\n",
+    "### 1) Composite index (most important)\n",
+    "\n",
+    "Steers the per-pair aggregation toward a **single in-order index pass**.\n",
+    "\n",
+    "```sql\n",
+    "-- With live traffic, concurrent creation is recommended\n",
+    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_actor_director\n",
+    "    ON ActorDirector (actor_id, director_id);\n",
+    "```\n",
+    "\n",
+    "Effect:\n",
+    "\n",
+    "* `GroupAggregate` is more likely to run on top of an **Index Only Scan / Index Scan**\n",
+    "* The returned columns are exactly the key, so the index is **covering** (effectively an Index Only Scan once the visibility map is set)\n",
+    "\n",
+    "### 2) Statistics and visibility (aiming for Index Only Scan)\n",
+    "\n",
+    "```sql\n",
+    "ANALYZE ActorDirector;          -- refresh statistics (required)\n",
+    "VACUUM (ANALYZE) ActorDirector; -- sets the visibility map (VM), raising the Index Only Scan rate\n",
+    "```\n",
+    "\n",
+    "* As the visibility map fills in, the number of heap fetches drops.\n",
+    "\n",
+    "### 3) Shelve the window version if this is sufficient\n",
+    "\n",
+    "The window keeps every input row, which adds I/O. The leaner mainstay is below.\n",
+    "\n",
+    "```sql\n",
+    "SELECT actor_id, director_id\n",
+    "FROM ActorDirector\n",
+    "GROUP BY actor_id, director_id\n",
+    "HAVING COUNT(*) >= 3;\n",
+    "```\n",
+    "\n",
+    "### 4) Handling HashAggregate spills (only when needed)\n",
+    "\n",
+    "If `EXPLAIN (ANALYZE, BUFFERS)` shows HashAggregate spilling to disk, temporarily:\n",
+    "\n",
+    "```sql\n",
+    "SET work_mem = '128MB';    -- adjust to the workload\n",
+    "-- for comparison (can also make things worse, so measure)\n",
+    "SET enable_hashagg = on;   -- default\n",
+    "-- or\n",
+    "SET enable_hashagg = off;  -- for comparison: pushes toward GroupAggregate\n",
+    "```\n",
+    "\n",
+    "* Measure whether the **spilling** (`Disk Usage: ...` on the HashAggregate node) disappears, or whether `GroupAggregate` is faster.\n",
+    "\n",
+    "### 5) Use parallel execution (for large tables)\n",
+    "\n",
+    "```sql\n",
+    "SET max_parallel_workers_per_gather = 2; -- within what the environment allows\n",
+    "```\n",
+    "\n",
+    "* Makes plans with a parallel scan (`Parallel Seq Scan` / `Parallel Index Scan`) feeding partial aggregation more likely.\n",
+    "\n",
+    "### 6) Optimize physical layout (if updates are infrequent)\n",
+    "\n",
+    "```sql\n",
+    "-- Rewrite the table in index order (when downtime is acceptable)\n",
+    "CLUSTER ActorDirector USING idx_actor_director;\n",
+    "-- pg_repack is an option for doing this online\n",
+    "```\n",
+    "\n",
+    "* Rows become contiguous by `(actor_id, director_id)` → better cache efficiency.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## EXPLAIN checklist (ideal picture)\n",
+    "\n",
+    "* `GroupAggregate` or `HashAggregate` is at the top\n",
+    "* The scan node uses the index (`Index Scan` / `Index Only Scan using idx_actor_director`)\n",
+    "* The aggregate node reports no `Disk Usage` (no spill)\n",
+    "* In `Buffers`, `shared read` is small and `hit` is large\n",
+    "* `Rows Removed by Filter` is small (few unnecessary reads)\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Structural measures for even larger scale\n",
+    "\n",
+    "### Summary table (a stand-in for a materialized view)\n",
+    "\n",
+    "For high-frequency lookups this is consistently the fastest.\n",
+    "\n",
+    "```sql\n",
+    "-- Initial build\n",
+    "CREATE TABLE CoopSummary AS\n",
+    "SELECT actor_id, director_id, COUNT(*) AS coop_cnt\n",
+    "FROM ActorDirector\n",
+    "GROUP BY actor_id, director_id;\n",
+    "\n",
+    "CREATE UNIQUE INDEX IF NOT EXISTS ux_coop ON CoopSummary(actor_id, director_id);\n",
+    "\n",
+    "-- Lookup\n",
+    "SELECT actor_id, director_id\n",
+    "FROM CoopSummary\n",
+    "WHERE coop_cnt >= 3;\n",
+    "```\n",
+    "\n",
+    "Apply updates incrementally in batches (aggregate the new rows → `INSERT ... ON CONFLICT ... DO UPDATE`).\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Expected improvement (rough guide)\n",
+    "\n",
+    "* With **index + statistics refresh** alone, 350 ms → the **100 ms range** is quite realistic (depending on data volume and I/O).\n",
+    "* With **Index Only Scan**, **parallelism**, and **physical layout** also kicking in, double-digit milliseconds is possible in some cases.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Summary (order of execution)\n",
+    "\n",
+    "1. `CREATE INDEX CONCURRENTLY ON ActorDirector(actor_id, director_id);`\n",
+    "2. `VACUUM (ANALYZE) ActorDirector;`\n",
+    "3. The mainstay query is `GROUP BY ... HAVING COUNT(*) >= 3`\n",
+    "4. If still slow, tune `work_mem` / `enable_hashagg` with measurements\n",
+    "5. If a frequently used query is still heavy, move to a **summary table**\n",
+    "\n",
+    "Worked through in this order, there is **still time to shave off** the current figures (295 ms / 314 ms).\n",
+    "\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}