@@ -0,0 +1,222 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "015e0603",
"metadata": {},
"source": [
"# MySQL 8.0.40\n",
"\n",
"## 0) 前提\n",
"\n",
"* エンジン: **MySQL 8**\n",
"* 並び順: 任意(`ORDER BY` なし)\n",
"* `NOT IN` は未使用\n",
"* 判定は **ID 基準**、表示は仕様どおり `actor_id, director_id`\n",
"\n",
"## 1) 問題\n",
"\n",
"* `ある俳優 (actor_id) と監督 (director_id) の組み合わせで、協働回数が3回以上のペアを求める。`\n",
"* 入力テーブル例: `ActorDirector(actor_id INT, director_id INT, timestamp INT PRIMARY KEY)`\n",
"* 出力仕様: `actor_id, director_id`(重複なし・順不同)\n",
"\n",
"---\n",
"\n",
"## 2) 最適解(単一クエリ)\n",
"\n",
"> ウィンドウ集計でペアごとの協働回数を数え、3回以上のみを射影。重複除去は `DISTINCT`。\n",
"\n",
"```sql\n",
"WITH win AS (\n",
" SELECT\n",
" actor_id,\n",
" director_id,\n",
" COUNT(*) OVER (PARTITION BY actor_id, director_id) AS coop_cnt\n",
" FROM ActorDirector\n",
")\n",
"SELECT DISTINCT\n",
" actor_id,\n",
" director_id\n",
"FROM win\n",
"WHERE coop_cnt >= 3;\n",
"\n",
"Runtime 349 ms\n",
"Beats 56.27%\n",
"\n",
"```\n",
"\n",
"* ポイント: `COUNT(*) OVER (PARTITION BY actor_id, director_id)` でペアごとの総件数を1パスで算出。\n",
"* 出力は仕様列のみ、順序指定なし。\n",
"\n",
"---\n",
"\n",
"## 3) 代替解\n",
"\n",
"> 集約で十分に速いケース。実務ではこちらが最小コストになりやすい。\n",
"\n",
"```sql\n",
"SELECT\n",
" actor_id,\n",
" director_id\n",
"FROM ActorDirector\n",
"GROUP BY actor_id, director_id\n",
"HAVING COUNT(*) >= 3;\n",
"\n",
"Runtime 348 ms\n",
"Beats 57.64%\n",
"\n",
"```\n",
"\n",
"* 追加の手段(参考・必要時のみ): 事前にユニークな重複を除く必要があれば `SELECT DISTINCT actor_id, director_id, timestamp ...` の下位派生を作り `GROUP BY`。\n",
"\n",
"---\n",
"\n",
"## 4) 要点解説\n",
"\n",
"* **方針**: ペア単位で出現回数をカウント → しきい値(3回)でフィルタ → 必要列のみ投影。\n",
"* **NULL/重複**: 入力列はINTでNULL前提なし、`timestamp` は主キーで重複なし。\n",
"* **インデックス**: `PRIMARY KEY(timestamp)` だけだとペア集計で全表走査になりがち。\n",
"\n",
" * 可能なら **複合インデックス `(actor_id, director_id)`** を作成すると `GROUP BY` / `PARTITION BY` の集約が効率化。\n",
"\n",
"---\n",
"\n",
"## 5) 計算量(概算)\n",
"\n",
"* ウィンドウ解: 各パーティション集計で **O(N)**〜**O(N log N)**(実装・メモリアルゴ次第)。\n",
"* `GROUP BY` 解: ソート/ハッシュ集約で **O(N log N)** 近辺(適切なインデックスで実効はほぼ **O(N)**)。\n",
"\n",
"---\n",
"\n",
"## 6) 図解(Mermaid 超保守版)\n",
"\n",
"```mermaid\n",
"flowchart TD\n",
" A[入力 テーブル]\n",
" B[ペア単位の回数を算出]\n",
" C[回数が3以上を抽出]\n",
" D[出力 俳優ID 監督ID]\n",
" A --> B\n",
" B --> C\n",
" C --> D\n",
"```\n",
"\n",
"---\n",
"\n",
"### 補足(実運用メモ)\n",
"\n",
"* データ量が多い場合は `GROUP BY` 案+ `(actor_id, director_id)` の複合インデックスが最も簡潔で速い構成になりやすいです。\n",
"* 結果順は任意のため、**`ORDER BY` を付けない**ことで不要なソートを省きます。\n",
"\n",
"結論:**クエリ自体は最短経路**です。差が出るのは **実装より実行計画(インデックス・統計・メモリ)**。以下を順に打つと体感で大きく縮みます。\n",
"\n",
"---\n",
"\n",
"## 即効性のある改善(順に適用)\n",
"\n",
"1. **複合インデックスを追加(最重要)**\n",
" `GROUP BY actor_id, director_id` をインデックス順でなぞらせ、**ソート/テンポラリ回避**を狙います。\n",
"\n",
"```sql\n",
"CREATE INDEX idx_actor_director ON ActorDirector (actor_id, director_id);\n",
"```\n",
"\n",
"* これで `GROUP BY` 案はほぼ **インデックス順走査→集約** に変わります。\n",
"* 結果列がキーだけなので、**セカンダリインデックスのみ**で完結 (InnoDB はセカンダリ葉にPK含むが今回未参照)。\n",
"\n",
"2. **ウィンドウ版は封印、`GROUP BY` を採用**\n",
" ウィンドウ版は `COUNT OVER` → `DISTINCT` で無駄に行を増やします。最速はこれ:\n",
"\n",
"```sql\n",
"SELECT actor_id, director_id\n",
"FROM ActorDirector\n",
"GROUP BY actor_id, director_id\n",
"HAVING COUNT(*) >= 3;\n",
"```\n",
"\n",
 インデックス">
"> When the index is used, the query can run **without a filesort or a temporary table**.\n",
"\n",
"3. **Refresh the statistics**\n",
"\n",
"```sql\n",
"ANALYZE TABLE ActorDirector;\n",
"```\n",
"\n",
"* 古い統計だとインデックスを握ってくれないことがあります。\n",
"\n",
"4. **ハッシュ集約の挙動を確認(MySQL 8)**\n",
" 場合によってはハッシュ集約が一時領域を使い遅くなることがあります。悪化時のみヒントで切替。\n",
"\n",
"```sql\n",
"-- 悪い計画が出たときだけ\n",
"SELECT /*+ NO_HASH_AGGREGATION() */\n",
" actor_id, director_id\n",
"FROM ActorDirector\n",
"GROUP BY actor_id, director_id\n",
"HAVING COUNT(*) >= 3;\n",
"```\n",
"\n",
"---\n",
"\n",
"## 追加の選択肢(データ量・更新頻度しだい)\n",
"\n",
"* **集計サマリ表**(マテビュー代替)\n",
"\n",
" ```sql\n",
" -- 初期ロード\n",
" CREATE TABLE CoopSummary AS\n",
" SELECT actor_id, director_id, COUNT(*) AS coop_cnt\n",
" FROM ActorDirector\n",
" GROUP BY actor_id, director_id;\n",
"\n",
" CREATE UNIQUE INDEX ux_coop ON CoopSummary(actor_id, director_id);\n",
" ```\n",
"\n",
" * 以降はバッチで増分反映(新規 `ActorDirector` のみ集計→`INSERT ... ON DUPLICATE KEY UPDATE`)。\n",
" * 本問は「閾値3以上の存在判定」なので `WHERE coop_cnt >= 3` で**即時応答**。\n",
"\n",
"---\n",
"\n",
"## 期待できる効果(目安)\n",
"\n",
"* 複合インデックス導入だけで、**中〜大規模**でも 2〜10倍短縮が珍しくありません。\n",
"* あなたの計測(~350ms)規模なら、**2桁〜100ms台**まで落ちる可能性が高いです(I/O・メモリ次第)。\n",
"\n",
"---\n",
"\n",
"## チェックリスト(必ず EXPLAIN)\n",
"\n",
"```sql\n",
"EXPLAIN\n",
"SELECT actor_id, director_id\n",
"FROM ActorDirector\n",
"GROUP BY actor_id, director_id\n",
"HAVING COUNT(*) >= 3;\n",
"```\n",
"\n",
"* `key = idx_actor_director` が選ばれているか\n",
"* `Extra` に **Using index**、**Using temporary/Using filesort が消えているか**\n",
"* 行数見積り(rows)が現実的か(統計が効いているか)\n",
"\n",
"---\n",
"\n",
"## まとめ(処方箋)\n",
"\n",
"* **ベースクエリ**:`GROUP BY ... HAVING COUNT(*) >= 3`\n",
"* **必須インデックス**:`(actor_id, director_id)`\n",
"* **統計更新**:`ANALYZE TABLE`\n",
"* (必要時のみ)**NO_HASH_AGGREGATION ヒント**\n",
"* 本番ワークロードでまだ重いなら、**集計サマリ表**で根本対策\n",
"\n",
"この順で手当すれば、数字はまだ縮みます。ボトルネックはクエリじゃなく**計画**—です。\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}