diff --git a/SQL/Leetcode/Intermediate Select/1045. Customers Who Bought All Products/gpt/Customers_Who_Bought_All_Products_mysql.ipynb b/SQL/Leetcode/Intermediate Select/1045. Customers Who Bought All Products/gpt/Customers_Who_Bought_All_Products_mysql.ipynb new file mode 100644 index 00000000..22002cbb --- /dev/null +++ b/SQL/Leetcode/Intermediate Select/1045. Customers Who Bought All Products/gpt/Customers_Who_Bought_All_Products_mysql.ipynb @@ -0,0 +1,401 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "4f6944bf", + "metadata": {}, + "source": [ + "# MySQL 8.0.40\n", + "\n", + "## 0) Assumptions\n", + "\n", + "* Engine: **MySQL 8**\n", + "* Ordering: any (no `ORDER BY`)\n", + "* Avoid `NOT IN` because of its NULL pitfall (use `EXISTS` / `LEFT JOIN ... IS NULL` instead)\n", + "* Matching is **ID-based** (`customer_id`); duplicates in `Customer` are removed up front\n", + "\n", + "---\n", + "\n", + "## 1) Problem\n", + "\n", + "* Problem statement:\n", + " Report every `customer_id` that has bought **all product_key values** present in `Product`.\n", + "* Input tables:\n", + " `Customer(customer_id, product_key)` (may contain duplicate rows)\n", + " `Product(product_key)` (primary key)\n", + "* Output specification:\n", + " Column: `customer_id` only, in any order\n", + "\n", + "---\n", + "\n", + "## 2) Best solution (single query)\n", + "\n", + "> First remove duplicate **customer × product** pairs, then compare each customer's distinct purchase count with the total number of rows in `Product`.\n", + "> No window function is needed, so none is used (a `COUNT(DISTINCT)`-style comparison is the shortest robust approach).\n", + "\n", + "```sql\n", + "WITH uniq AS ( -- deduplicate customer × product pairs (NULL keys never match a product, so they can be dropped)\n", + " SELECT DISTINCT customer_id, product_key\n", + " FROM Customer\n", + " WHERE product_key IS NOT NULL\n", + "),\n", + "prod AS ( -- total number of products\n", + " SELECT COUNT(*) AS total_products\n", + " FROM Product\n", + ")\n", + "SELECT u.customer_id\n", + "FROM uniq AS u\n", + "GROUP BY u.customer_id\n", + "HAVING COUNT(*) = (SELECT total_products FROM prod);\n", + "```\n", + "\n", + "LeetCode result: runtime 550 ms,\n", + "beats 48.60%.\n", + "\n", + "\n", + "* Key points\n", + "\n", + " * Apply **`DISTINCT`** so that duplicate rows in `Customer` do not inflate the count\n", + " * `Product` has a primary key and therefore no duplicates, so the total is simply
`COUNT(*)`\n", + " * No `ORDER BY` is needed (any order is accepted)\n", + "\n", + "---\n", + "\n", + "## 3) Alternative solutions\n", + "\n", + "> The standard pattern for **relational division**:\n", + "> for customer `cu`, there is **no** `product_key` in `Product` that the customer has not bought. This is expressed with a double `NOT EXISTS` or with `LEFT JOIN ... IS NULL`.\n", + "\n", + "### A) `LEFT JOIN` with aggregate comparison (no `IS NULL` test is actually used)\n", + "\n", + "```sql\n", + "SELECT DISTINCT cu.customer_id\n", + "FROM Customer cu\n", + "-- attach all purchase rows of the same customer (self-join on customer_id)\n", + "LEFT JOIN Customer c2\n", + " ON c2.customer_id = cu.customer_id\n", + " AND c2.product_key IS NOT NULL\n", + "LEFT JOIN Product p\n", + " ON p.product_key = c2.product_key\n", + "-- p.product_key is NULL for purchased keys that are not in Product,\n", + "-- so COUNT(DISTINCT p.product_key) counts only products that exist in Product\n", + "WHERE 1=1\n", + "GROUP BY cu.customer_id\n", + "HAVING COUNT(DISTINCT p.product_key) = (SELECT COUNT(*) FROM Product);\n", + "```\n", + "\n", + "LeetCode result: runtime 694 ms,\n", + "beats 11.75%.\n", + "\n", + "\n", + "### B) Double `NOT EXISTS` version (no NULL pitfall)\n", + "\n", + "```sql\n", + "SELECT DISTINCT cu.customer_id\n", + "FROM Customer cu\n", + "WHERE NOT EXISTS (\n", + " SELECT 1\n", + " FROM Product p\n", + " WHERE NOT EXISTS (\n", + " SELECT 1\n", + " FROM Customer c\n", + " WHERE c.customer_id = cu.customer_id\n", + " AND c.product_key = p.product_key\n", + " )\n", + ");\n", + "```\n", + "\n", + "LeetCode result: Time Limit Exceeded,\n", + "5 / 9 test cases passed.\n", + "\n", + "\n", + "> In practice, either the **aggregate comparison** (section 2 or A) or **B) double `NOT EXISTS`** can balance performance and readability; note that B timed out on LeetCode here.\n", + "> Make the final choice based on data volume and the available indexes (`Customer(customer_id, product_key)`, `Product(product_key)`).\n", + "\n", + "---\n", + "\n", + "## 4) Key points\n", + "\n", + "* **Duplicates**: remove duplicates in `Customer` up front with **`SELECT DISTINCT`** so that the distinct purchase count is correct\n", + "* **NULL safety**: only keys on the `Product` side matter, so stating `Customer.product_key IS NOT NULL` explicitly is optional but harmless\n", + "* **Division patterns**:\n", + "\n", + " * Aggregate comparison (`COUNT(DISTINCT) = total`)\n", + " * Double `NOT EXISTS` (states directly that no unpurchased product exists)\n", + " * `LEFT JOIN ...
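IS NULL` (detect an unpurchased product via `IS NULL`)\n", + "* **Indexes**:\n", + "\n", + " * The primary key on `Product(product_key)` is sufficient\n", + " * A composite index on `Customer(customer_id, product_key)` is recommended (it speeds up joins, `EXISTS` probes, and deduplication)\n", + "\n", + "---\n", + "\n", + "## 5) Complexity (rough estimate)\n", + "\n", + "* `DISTINCT` in `uniq`: **O(N log N)** (N = rows in `Customer`; close to **O(N)** in practice with the composite index)\n", + "* Aggregation `GROUP BY customer_id`: **O(N log N)** (less with hashing or an index)\n", + "* `COUNT(*)` subquery (all of `Product`): **O(P)** (P = rows in `Product`; answered by scanning the primary key index)\n", + "\n", + "---\n", + "\n", + "## 6) Diagram (Mermaid, deliberately conservative syntax)\n", + "\n", + "```mermaid\n", + "flowchart TD\n", + " A[Customer input]\n", + " B[Product input]\n", + " C[Deduplicate with DISTINCT on customer_id and product_key]\n", + " D[Count all Product rows]\n", + " E[Count distinct purchases per customer]\n", + " F[Compare distinct purchases with Product total]\n", + " G[Output customer_id only]\n", + "\n", + " A --> C\n", + " B --> D\n", + " C --> E\n", + " D --> F\n", + " E --> F\n", + " F --> G\n", + "```\n", + "\n", + "---\n", + "\n", + "### A short note\n", + "\n", + "If the requirement is fixed at **all products**, the `COUNT(DISTINCT)` comparison is the shortest and clearest. To extend it later to something like buying everything within each category, the same design carries over: add a **partition key (category_id or similar)** to the GROUP BY and compare against the total for that partition.\n", + "\n", + "Conclusion first: **the first query may be incorrect.** It removes duplicates on the `Customer` side, but it **also counts products that do not exist in `Product`**. If `Customer` can contain such keys, then with `Product={5,6}` and a customer who bought `{5,6,7}` the customer is wrongly dropped (3≠2).\n", + "In addition, the second query has an unnecessary self-join, and the third tends to hit TLE with its nested `NOT EXISTS` when indexes are missing.\n", + "\n", + "Suggested improvements follow.\n", + "\n", + "---\n", + "\n", + "## ✅ Correct and fast (recommended, single query)\n", + "\n", + "*Key idea: first restrict to the **intersection of `Customer` and `Product` (only keys that exist in Product)**, then count distinct keys per customer and test for **equality with the total number of products**. The **Product total is computed once** through a `CROSS JOIN`.*\n", + "\n", + "```sql\n", + "-- ☆ Run after creating the required indexes (described later)\n", + "SELECT c.customer_id\n", + "FROM (\n", + " SELECT DISTINCT customer_id, product_key -- remove duplicates\n", + " FROM Customer\n", + " WHERE product_key IS NOT NULL\n", + ") AS c\n", + "JOIN Product p\n", + " ON p.product_key = c.product_key -- only keys that exist in Product\n", + "CROSS JOIN (\n", + " SELECT COUNT(*) AS total_products\n", + " FROM Product\n", + ") AS pr\n", + "GROUP BY c.customer_id\n", + "HAVING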
COUNT(*) = pr.total_products; -- distinct purchases must equal the number of products\n", + "```\n", + "\n", + "MySQL result: error `Unknown column 'pr.total_products' in 'having clause'`.\n", + "\n", + "\n", + "### Why this helps\n", + "\n", + "* **Correctness**: products that are not in `Product` are not counted\n", + "* **Speed**:\n", + "\n", + " * The `CROSS JOIN` computes the **Product total once** (an uncorrelated scalar subquery is normally evaluated once as well, so this gain is not guaranteed)\n", + " * A **composite index** matching the `JOIN` key reduces scanning\n", + " * `DISTINCT` can often be answered **from the index alone** when a `(customer_id, product_key)` composite index exists\n", + "\n", + "---\n", + "\n", + "## 🔧 Indexes (very important)\n", + "\n", + "```sql\n", + "-- Product is assumed to already have its primary key\n", + "ALTER TABLE Customer\n", + " ADD INDEX idx_cust_prod (customer_id, product_key),\n", + " ADD INDEX idx_prod_cust (product_key, customer_id);\n", + "```\n", + "\n", + "* The main work is `JOIN Product p ON p.product_key = c.product_key` plus\n", + " `GROUP BY c.customer_id` over `DISTINCT (customer_id, product_key)`.\n", + " **Two** indexes are provided so that either step can be **covered**; MySQL normally uses one index per table access, so the optimizer picks whichever suits the execution plan.\n", + "\n", + "---\n", + "\n", + "## 🧹 Improvements to the existing queries\n", + "\n", + "### Your first query (CTE version)\n", + "\n", + "```sql\n", + "WITH uniq AS (\n", + " SELECT DISTINCT customer_id, product_key\n", + " FROM Customer\n", + " WHERE product_key IS NOT NULL\n", + "),\n", + "prod AS (SELECT COUNT(*) AS total_products FROM Product)\n", + "SELECT u.customer_id\n", + "FROM uniq u\n", + "GROUP BY u.customer_id\n", + "HAVING COUNT(*) = (SELECT total_products FROM prod);\n", + "```\n", + "\n", + "LeetCode result: runtime 549 ms,\n", + "beats 49.26%.\n", + "\n", + "\n", + "**Problem**: it also counts `product_key` values that are not in `Product`, which risks a **wrong result**.\n", + "**Fix**: join `uniq` with `Product` before aggregating, or switch to the recommended query above.\n", + "\n", + "### Your second query (LEFT JOIN with self-join)\n", + "\n", + "* Self-joining `Customer cu` to `Customer c2` and then joining to `Product` is **redundant**.\n", + "* Build `DISTINCT (customer_id, product_key)` and **join it directly to `Product`**.\n", + "* In other words, replacing it with the recommended query above is **simpler and faster**.\n", + "\n", + "### Your third query (NOT EXISTS inside NOT EXISTS)\n", + "\n", + "* **Correct but heavy.** The TLE is most likely caused by missing indexes.\n", + "* Adding the two composite indexes above should help, but relational division still **tends to be slow when cardinality is high**.\n", + "* If you still want to use it, the form below is readable and easier for the optimizer; it assumes the `idx_cust_prod` index above exists:\n", + "\n", +
"```sql\n", + "SELECT DISTINCT cu.customer_id\n", + "FROM Customer cu\n", + "WHERE NOT EXISTS (\n", + " SELECT 1\n", + " FROM Product p\n", + " WHERE NOT EXISTS (\n", + " SELECT 1\n", + " FROM Customer c FORCE INDEX (idx_cust_prod)\n", + " WHERE c.customer_id = cu.customer_id\n", + " AND c.product_key = p.product_key\n", + " )\n", + ");\n", + "```\n", + "\n", + "> ただし本問題では **集約比較(推奨案)** が一般に速く、運用も安定です。\n", + "\n", + "---\n", + "\n", + "## 📏 期待できる効果(目安)\n", + "\n", + "* 正しく `Product` と交差してから数える+スカラサブクエリ排除(`CROSS JOIN`)+複合索引の適用で、\n", + " **550ms → 二桁〜低三桁ms台** まで下がるケースが多いです(データ分布・ヒストグラム次第)。\n", + "\n", + "---\n", + "\n", + "## 📝 補足Tips\n", + "\n", + "* `Product` の件数が極小(例:数個)なら、実行計画はほぼ何でも速いですが、**将来の拡張**(カテゴリ別「全買い」等)にも今の設計はそのまま拡張できます(`GROUP BY customer_id, category_id`/`COUNT(DISTINCT CASE WHEN p.category_id=... THEN product_key END)` など)。\n", + "* 一時テーブルで `(customer_id, product_key)` のユニーク集合を **事前作成**(永続化や再利用前提)できるなら、更に安定します。\n", + "\n", + "---\n", + "\n", + "### まとめ\n", + "\n", + "* **誤判定の芽**を摘む(`Product` と交差)\n", + "* **再評価を断つ**(`CROSS JOIN` で総数1回)\n", + "* **複合索引2本**でカバーリング\n", + " → シンプル&堅牢&高速、これでいきましょう。\n", + "\n", + " 原因は **`ONLY_FULL_GROUP_BY` が有効**な環境で、`HAVING` 句に **集計していない列(`pr.total_products`)** を直接参照しているためです。MySQL 8 では `HAVING` 内で非集計列を使うなら、その列は **`GROUP BY` に含める**か **集計関数で包む**必要があります。`pr.total_products` は定数ですが、規約上は同様に扱われます。\n", + "\n", + "### 修正案(どちらでもOK)\n", + "\n", + "#### 1) `GROUP BY` に定数列を追加(最も素直)\n", + "\n", + "```sql\n", + "SELECT c.customer_id\n", + "FROM (\n", + " SELECT DISTINCT customer_id, product_key\n", + " FROM Customer\n", + " WHERE product_key IS NOT NULL\n", + ") AS c\n", + "JOIN Product p\n", + " ON p.product_key = c.product_key\n", + "CROSS JOIN (\n", + " SELECT COUNT(*) AS total_products\n", + " FROM Product\n", + ") AS pr\n", + "GROUP BY c.customer_id, pr.total_products\n", + "HAVING COUNT(*) = pr.total_products;\n", + "\n", + "Runtime 555 ms\n", + "Beats 45.18%\n", + "\n", + "```\n", + "\n", + "#### 2) `HAVING` 側で集計関数を使う(`MAX` or 
`ANY_VALUE`)\n", + "\n", + "```sql\n", + "SELECT c.customer_id\n", + "FROM (\n", + " SELECT DISTINCT customer_id, product_key\n", + " FROM Customer\n", + " WHERE product_key IS NOT NULL\n", + ") AS c\n", + "JOIN Product p\n", + " ON p.product_key = c.product_key\n", + "CROSS JOIN (\n", + " SELECT COUNT(*) AS total_products\n", + " FROM Product\n", + ") AS pr\n", + "GROUP BY c.customer_id\n", + "HAVING COUNT(*) = MAX(pr.total_products); -- pr.total_products is constant across all rows\n", + "```\n", + "\n", + "LeetCode result: runtime 499 ms,\n", + "beats 85.37%.\n", + "\n", + "\n", + "> Both have identical logic and keep the two benefits: **only keys that exist in `Product` are counted**, and **the total is computed once**.\n", + "\n", + "---\n", + "\n", + "### Reference (equivalent alternative)\n", + "\n", + "If you prefer a simpler `HAVING`, an uncorrelated scalar subquery is also evaluated only once by the optimizer:\n", + "\n", + "```sql\n", + "SELECT c.customer_id\n", + "FROM (\n", + " SELECT DISTINCT customer_id, product_key\n", + " FROM Customer\n", + " WHERE product_key IS NOT NULL\n", + ") AS c\n", + "JOIN Product p\n", + " ON p.product_key = c.product_key\n", + "GROUP BY c.customer_id\n", + "HAVING COUNT(*) = (SELECT COUNT(*) FROM Product);\n", + "```\n", + "\n", + "LeetCode result: runtime 551 ms,\n", + "beats 47.91%.\n", + "\n", + "\n", + "(This version does not have the original bug of **counting product_key values that are not in `Product`**, because it joins with `Product`.)\n", + "\n", + "---\n", + "\n", + "### Finishing touch: indexes (key to performance)\n", + "\n", + "```sql\n", + "ALTER TABLE Customer\n", + " ADD INDEX idx_cust_prod (customer_id, product_key),\n", + " ADD INDEX idx_prod_cust (product_key, customer_id);\n", + "-- The primary key on Product(product_key) is sufficient\n", + "```\n", + "\n", + "This resolves “Unknown column 'pr.total_products' in 'having clause'” and gives both correctness and speed.\n", + "\n" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/SQL/Leetcode/Intermediate Select/1045. Customers Who Bought All Products/gpt/Customers_Who_Bought_All_Products_pandas.ipynb b/SQL/Leetcode/Intermediate Select/1045.
Customers Who Bought All Products/gpt/Customers_Who_Bought_All_Products_pandas.ipynb new file mode 100644 index 00000000..51b539a7 --- /dev/null +++ b/SQL/Leetcode/Intermediate Select/1045. Customers Who Bought All Products/gpt/Customers_Who_Bought_All_Products_pandas.ipynb @@ -0,0 +1,221 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "ca447133", + "metadata": {}, + "source": [ + "\n", + "# For pandas 2.2.2\n", + "\n", + "## 0) Assumptions\n", + "\n", + "* Environment: **Python 3.10.15 / pandas 2.2.2**\n", + "* **Follow the given signature exactly** (function name, argument names, returned columns, column order)\n", + "* No I/O, and no unnecessary `print` or `sort_values`\n", + "\n", + "---\n", + "\n", + "## 1) Problem\n", + "\n", + "* Problem statement:\n", + " Return every `customer_id` that has bought **all `product_key` values** present in `Product`.\n", + "* Input DataFrames:\n", + " `Customer(customer_id:int, product_key:int)` (may contain duplicate rows; `customer_id` is non-NULL)\n", + " `Product(product_key:int)` (primary key)\n", + "* Output:\n", + " Column: `customer_id` only (any order, no duplicates)\n", + "\n", + "---\n", + "\n", + "## 2) Implementation (given signature followed exactly)\n", + "\n", + "> The approach is **keep only the needed columns → semi-join (existing products only) → deduplicate → aggregate per customer → filter**.\n", + "\n", + "```python\n", + "import pandas as pd\n", + "\n", + "def customers_bought_all_products(customer: pd.DataFrame, product: pd.DataFrame) -> pd.DataFrame:\n", + "    \"\"\"\n", + "    Returns:\n", + "        pd.DataFrame: column names and order are ['customer_id']\n", + "    \"\"\"\n", + "    # Number of distinct products (it is a PK, but nunique is used for safety)\n", + "    total_products = product['product_key'].nunique()\n", + "\n", + "    # If Product is empty, every customer satisfies the condition vacuously\n", + "    if total_products == 0:\n", + "        return customer[['customer_id']].drop_duplicates()\n", + "\n", + "    # Keep only product_key values that exist in Product (semi-join)\n", + "    valid_keys = product['product_key']\n", + "    pairs = customer.loc[\n", + "        customer['product_key'].isin(valid_keys) & customer['product_key'].notna(),\n", + "        ['customer_id', 'product_key']\n", + "    ].drop_duplicates()\n", + "\n", + "    # Count distinct purchases per customer and keep those equal to the product total\n", + "    bought_cnt =
pairs.groupby('customer_id')['product_key'].nunique()\n", + "    keep_ids = bought_cnt.index[bought_cnt.eq(total_products)]\n", + "\n", + "    # Return the specified column only (row order is arbitrary)\n", + "    out = pd.DataFrame({'customer_id': keep_ids})\n", + "    return out\n", + "```\n", + "\n", + "LeetCode result:\n", + "runtime 294 ms\n", + "(beats 41.27%),\n", + "memory 67.23 MB\n", + "(beats 98.63%).\n", + "\n", + "\n", + "* Additional notes\n", + "\n", + " * The returned column is **`['customer_id']` only**. Order is arbitrary, so `sort_values` is unnecessary.\n", + " * `drop_duplicates` prevents duplicate `Customer` rows from inflating the count.\n", + " * `isin(valid_keys)` restricts the rows to **keys that exist in Product** (this guarantees correctness).\n", + "\n", + "---\n", + "\n", + "## 3) Algorithm\n", + "\n", + "* APIs used:\n", + "\n", + " * `Series.nunique()`: computes the number of products\n", + " * `Series.isin()`: semi-join (restrict to keys that exist in `Product`)\n", + " * `DataFrame.drop_duplicates()`: removes duplicate `(customer_id, product_key)` pairs\n", + " * `GroupBy.nunique()`: distinct purchases per customer\n", + "* **NULLs / duplicates / types**\n", + "\n", + " * NULL `product_key` values are excluded (explicitly, with `notna()`)\n", + " * Duplicates in `Customer` are removed at the `(customer_id, product_key)` level\n", + " * Integer types are assumed. If types are mixed, normalize them first with `astype('Int64')` or similar\n", + "\n", + "---\n", + "\n", + "## 4) Complexity (rough estimate)\n", + "\n", + "* `isin`: **O(N)** (hash-set lookup)\n", + "* `drop_duplicates`: **O(N)** to **O(N log N)**\n", + "* `groupby().nunique()`: roughly **O(N)** (hash aggregation)\n", + "* Memory: the temporary set of distinct `(customer_id, product_key)` pairs\n", + "\n", + "---\n", + "\n", + "## 5) Diagram (Mermaid, deliberately conservative syntax)\n", + "\n", + "```mermaid\n", + "flowchart TD\n", + " A[Customer] --> B[Keep only customer_id, product_key]\n", + " C[Product] --> D[Total products via nunique]\n", + " B --> E[Filter to keys that exist in Product]\n", + " E --> F[Remove duplicates with drop_duplicates]\n", + " F --> G[groupby customer_id then nunique]\n", + " D --> H[Keep customers matching the total]\n", + " G --> H\n", + " H --> I[Output customer_id]\n", + "```\n", + "\n", + "---\n", + "\n", + "### Practical tips (for large data)\n", + "\n", + "* Using `valid_keys = product['product_key'].unique()` to get a NumPy array before calling `isin` also works (it is equivalent).\n", + "* To reduce memory further, you can skip building `pairs` and apply `groupby('customer_id')['product_key'].nunique()` directly; even then,
it is essential to restrict the rows to **keys that exist in Product** first.\n", + "\n", + "These are good numbers (especially memory 👍). **The runtime can probably still be reduced.** The main cost is likely the hashing done by `isin` and `nunique`, so that is where trimming pays off.\n", + "\n", + "The key points first:\n", + "\n", + "* Replace `isin` with **categorical encoding** (each key is mapped to an integer code through a hash lookup, O(1) per row)\n", + "* Replace `nunique` with **`drop_duplicates` → `value_counts`** (once the pairs are unique, `nunique` is unnecessary)\n", + "* Handle the scalar branch first (if `Product` is empty, every customer passes)\n", + "\n", + "Below is the **replacement version (same signature)**.\n", + "\n", + "```python\n", + "import pandas as pd\n", + "\n", + "def customers_bought_all_products(customer: pd.DataFrame, product: pd.DataFrame) -> pd.DataFrame:\n", + "    \"\"\"\n", + "    Returns:\n", + "        pd.DataFrame: column names and order are ['customer_id']\n", + "    \"\"\"\n", + "    # 1) Number of distinct products (if 0, every customer passes vacuously)\n", + "    #    product_key is assumed to be a PK, but nunique() is used for safety\n", + "    total_products = product['product_key'].nunique()\n", + "    if total_products == 0:\n", + "        return customer[['customer_id']].drop_duplicates()\n", + "\n", + "    # 2) Encode Customer.product_key as a categorical over Product's unique keys\n", + "    #    → keys that do not exist in Product (and NaN) get code -1\n", + "    prod_keys = pd.unique(product['product_key'])\n", + "    cat = pd.Categorical(customer['product_key'], categories=prod_keys, ordered=False)\n", + "    codes = pd.Series(cat.codes, index=customer.index)  # small signed int dtype; -1 means not in Product\n", + "\n", + "    # 3) Keep only keys that exist in Product (drop -1)\n", + "    mask = codes.ge(0)\n", + "    # Keep the minimal columns and remove duplicate (customer_id, code) pairs first\n", + "    pairs = customer.loc[mask, ['customer_id']].assign(code=codes[mask]).drop_duplicates()\n", + "\n", + "    # 4) (customer_id, code) is already unique, so the distinct purchase count\n", + "    #    per customer is just a row count, which value_counts provides\n", + "    cnt = pairs['customer_id'].value_counts(sort=False)\n", + "\n", + "    # 5) Keep customer_id values whose count equals the number of products\n", + "    keep_ids = cnt.index[cnt.eq(total_products)]\n", + "\n", + "    # 6) Return the specified column only (row order is arbitrary)\n", + "    out = pd.DataFrame({'customer_id': keep_ids})\n", + "    return out\n", + "```\n", + "\n", + "LeetCode result:\n", + "runtime 271 ms\n", + "(beats 79.67%),\n", + "memory 68.04 MB\n", + "(beats 37.48%).\n", + "\n", + "\n", + "###
Why it can be faster (where the bottlenecks go)\n", + "\n", + "* **`isin` → the `.codes` of a Categorical**\n", + " `isin` needs a hash lookup for every row. `Categorical(categories=prod_keys)` also maps each row through a **vectorized O(1) lookup**, but it yields small integer codes, with unknown keys mapped to **-1**, so the filter and the later deduplication work on compact integers.\n", + "* **`nunique` → `drop_duplicates` + `value_counts`**\n", + " Once `(customer_id, product_key)` pairs are deduplicated, all that remains is to **count rows per customer ID**. `value_counts(sort=False)` is a plain counter, so the extra hashing that `nunique` performs is not needed.\n", + "* **Early return** (empty Product)\n", + " Rare in benchmarks, but handling the empty case up front keeps the function correct and its behavior stable.\n", + "\n", + "### Further fine-tuning (as needed)\n", + "\n", + "* When `customer` is very large and `customer['product_key']` has many missing values:\n", + " building `cat = pd.Categorical(customer['product_key'].dropna(), categories=prod_keys)` and then realigning on the index may be slightly faster and lighter (although the current `.ge(0)` filter is usually enough).\n", + "* `prod_keys = product['product_key'].to_numpy()` also works, but only when `product_key` is already unique, because `Categorical` requires unique categories.\n", + "* `groupby('customer_id', sort=False).size()` is an equivalent alternative to `value_counts`.\n", + "\n", + "### Complexity (updated)\n", + "\n", + "* Categorical encoding: **O(N)** (hash lookups)\n", + "* `drop_duplicates`: **O(N)** to **O(N log N)** (implementation dependent)\n", + "* `value_counts`: **O(N)**\n", + " Overall **roughly linear**. In the runs recorded here the runtime went from 294 ms to 271 ms (about 8%); whether a larger gain is achievable depends on the distribution and on N.\n", + "\n", + "---\n", + "\n", + "#### Summary\n", + "\n", + "* The original implementation is good, but **dropping `isin` and `nunique`** can still improve it.\n", + "* The replacement version above keeps to the principles of **no I/O, no sort, minimal columns** and remains logically correct.\n", + "\n" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/SQL/Leetcode/Intermediate Select/1045. Customers Who Bought All Products/gpt/Customers_Who_Bought_All_Products_posgres.ipynb b/SQL/Leetcode/Intermediate Select/1045. Customers Who Bought All Products/gpt/Customers_Who_Bought_All_Products_posgres.ipynb new file mode 100644 index 00000000..1da96c98 --- /dev/null +++ b/SQL/Leetcode/Intermediate Select/1045.
Customers Who Bought All Products/gpt/Customers_Who_Bought_All_Products_posgres.ipynb @@ -0,0 +1,293 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "201eab96", + "metadata": {}, + "source": [ + "# PostgreSQL 16.6+\n", + "\n", + "## 0) Assumptions\n", + "\n", + "* Engine: **PostgreSQL 16.6+**\n", + "* Ordering: any\n", + "* Avoid `NOT IN` (`EXISTS` / `LEFT JOIN ... IS NULL` are preferred)\n", + "* Matching is ID-based; the output follows the specification (`customer_id`)\n", + "\n", + "---\n", + "\n", + "## 1) Problem\n", + "\n", + "* Problem statement:\n", + " Find every `customer_id` that has bought **all `product_key` values** present in the Product table\n", + "* Input:\n", + " `Customer(customer_id int, product_key int)` (may contain duplicate rows)\n", + " `Product(product_key int)` (primary key)\n", + "* Output:\n", + " Column: `customer_id` only, any order, no duplicates\n", + "\n", + "---\n", + "\n", + "## 2) Best solution (single query)\n", + "\n", + "> Remove duplicates in `Customer` first, restrict to **keys that exist in `Product`**, then count rows per customer and test for equality with the total number of `Product` rows. In PostgreSQL this uses CTEs, and the total is evaluated only once.\n", + "\n", + "```sql\n", + "WITH uniq AS ( -- distinct customer × product pairs\n", + " SELECT DISTINCT customer_id, product_key\n", + " FROM Customer\n", + " WHERE product_key IS NOT NULL\n", + "),\n", + "cp AS ( -- restrict to keys that exist in Product\n", + " SELECT u.customer_id, u.product_key\n", + " FROM uniq u\n", + " JOIN Product p USING (product_key)\n", + "),\n", + "prod AS ( -- Product total (computed once)\n", + " SELECT COUNT(*) AS total_products\n", + " FROM Product\n", + "),\n", + "win AS ( -- distinct purchases per customer\n", + " SELECT\n", + " customer_id,\n", + " COUNT(*) AS bought_cnt\n", + " FROM cp\n", + " GROUP BY customer_id\n", + ")\n", + "SELECT w.customer_id\n", + "FROM win w\n", + "CROSS JOIN prod pr\n", + "WHERE w.bought_cnt = pr.total_products;\n", + "```\n", + "\n", + "LeetCode result: runtime 513 ms,\n", + "beats 54.78%.\n", + "\n", + "\n", + "* Key points\n", + "\n", + " * **Correctness**: `product_key` values that are not in `Product` are not counted (they are removed by `JOIN Product`)\n", + " * **Efficiency**: `prod` is evaluated once, and deduplicating in `uniq` keeps the aggregation light\n", + "\n", + "### Alternative (classic relational division: double `NOT EXISTS`)\n", + "\n", + "```sql\n", 
+ "SELECT DISTINCT c.customer_id\n", + "FROM Customer c\n", + "WHERE NOT EXISTS (\n", + " SELECT 1\n", + " FROM Product p\n", + " WHERE NOT EXISTS (\n", + " SELECT 1\n", + " FROM Customer c2\n", + " WHERE c2.customer_id = c.customer_id\n", + " AND c2.product_key = p.product_key\n", + " )\n", + ");\n", + "```\n", + "\n", + "LeetCode result: runtime 1872 ms,\n", + "beats 5.06%.\n", + "\n", + "\n", + "> Readable, but the nested `EXISTS` tends to get heavy as data grows. The aggregate comparison above is more stable.\n", + "\n", + "---\n", + "\n", + "## 3) Key points\n", + "\n", + "* **Duplicates / NULLs**: duplicates in `Customer` are removed with **`DISTINCT`**, and `product_key IS NOT NULL` is stated explicitly.\n", + "* **Standard division pattern**:\n", + "\n", + " 1. `JOIN Product` restricts the rows to keys that exist\n", + " 2. Count **distinct purchases** per customer\n", + " 3. Test for **equality with the number of products**\n", + "* **No window function needed**: this is a plain set comparison, so doing without window functions is the shortest and fastest route.\n", + "\n", + "---\n", + "\n", + "## 4) Complexity (rough estimate)\n", + "\n", + "* Deduplication in `uniq`: **O(N log N)** (close to linear in practice with an index on `(customer_id, product_key)`)\n", + "* Join in `cp`: roughly **O(N)** (because `Product(product_key)` is the PK)\n", + "* Per-customer aggregation: **O(N log N)** (close to linear with hash aggregation)\n", + "\n", + "---\n", + "\n", + "## 5) Diagram (Mermaid, deliberately conservative syntax)\n", + "\n", + "```mermaid\n", + "flowchart TD\n", + " A[Customer input] --> B[Deduplicate with DISTINCT]\n", + " C[Product input] --> D[Count all Product rows]\n", + " B --> E[JOIN to keys that exist in Product]\n", + " E --> F[Count distinct purchases per customer]\n", + " D --> G[Test equality with the product total]\n", + " F --> G\n", + " G --> H[Output customer_id]\n", + "```\n", + "\n", + "---\n", + "\n", + "### Operational notes (recommended indexes)\n", + "\n", + "```sql\n", + "-- The primary key on Product is sufficient\n", + "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_customer_cust_prod\n", + " ON Customer (customer_id, product_key);\n", + "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_customer_prod_cust\n", + " ON Customer (product_key, customer_id);\n", + "```\n", + "\n", + "* These cover the deduplication, join, and aggregation paths and give the planner more freedom.\n", + "\n", + "Judging by this result, the CTE version is already correct and usable, but there is still room for tuning, mainly in three areas.\n", + "\n", + "---\n", + "\n", + "## 1.
Inline the CTEs and let the planner optimize\n", + "\n", + "Since PostgreSQL 12, a non-recursive, side-effect-free CTE that is referenced only once is inlined by default, so `WITH` is no longer an automatic optimization fence. A CTE that is referenced more than once or marked `MATERIALIZED` is still materialized, which can cause extra scans.\n", + "`uniq -> cp -> win` can be composed, so collapse them into a single query and give the planner full freedom.\n", + "\n", + "```sql\n", + "SELECT x.customer_id\n", + "FROM (\n", + " SELECT\n", + " c.customer_id,\n", + " COUNT(DISTINCT c.product_key) AS bought_cnt\n", + " FROM Customer c\n", + " JOIN Product p\n", + " ON p.product_key = c.product_key -- count only keys that exist in Product\n", + " WHERE c.product_key IS NOT NULL -- guard against NULL\n", + " GROUP BY c.customer_id\n", + ") AS x\n", + "CROSS JOIN (\n", + " SELECT COUNT(*) AS total_products\n", + " FROM Product\n", + ") AS pr\n", + "WHERE x.bought_cnt = pr.total_products;\n", + "```\n", + "\n", + "LeetCode result: runtime 502 ms,\n", + "beats 62.73%.\n", + "\n", + "\n", + "### What this aims for\n", + "\n", + "* PostgreSQL optimizes the `SELECT DISTINCT` of `uniq` and the join of `cp` together\n", + "* `COUNT(DISTINCT ...)` produces the distinct product count per customer in one step\n", + " → with no hand-built intermediate result, the plan is simpler\n", + "\n", + "This form is logically equivalent to your CTE version (the role of `uniq` is absorbed by `COUNT(DISTINCT ...)`).\n", + "\n", + "### Caveat\n", + "\n", + "* With very large data, the per-group deduplication in `COUNT(DISTINCT c.product_key)` can dominate the cost. At typical LeetCode or business data sizes it is rarely the bottleneck and is often no slower than the CTE version (502 ms versus 513 ms in the runs recorded here).\n", + "\n", + "---\n", + "\n", + "## 2.
Drop the unnecessary CROSS JOIN and compare against a constant (further simplification)\n", + "\n", + "PostgreSQL evaluates an uncorrelated scalar subquery once and reuses the result. The `CROSS JOIN` itself is not wrong, but for readability and planner freedom the final comparison can safely go back to a scalar subquery.\n", + "\n", + "```sql\n", + "SELECT\n", + " c.customer_id\n", + "FROM Customer c\n", + "JOIN Product p\n", + " ON p.product_key = c.product_key\n", + "WHERE c.product_key IS NOT NULL\n", + "GROUP BY c.customer_id\n", + "HAVING COUNT(DISTINCT c.product_key) = (\n", + " SELECT COUNT(*) FROM Product\n", + ");\n", + "```\n", + "\n", + "LeetCode result: runtime 595 ms,\n", + "beats 21.28%.\n", + "\n", + "\n", + "Benefits:\n", + "\n", + "* Easy to read\n", + "* A simpler execution plan (one aggregation plus one scalar subquery)\n", + "* More likely than multi-stage CTEs to benefit from inline hash aggregation\n", + "\n", + "In practice this single query is often all you need, and it is also accepted on LeetCode-style judges.\n", + "\n", + "---\n", + "\n", + "## 3. Index design to go below 513 ms\n", + "\n", + "However polished the logic is, performance plateaus while the I/O bottleneck stays wide. Indexes that serve this WHERE, JOIN, GROUP BY, and COUNT(DISTINCT) can substantially reduce the amount of data scanned.\n", + "\n", + "Two indexes are recommended.\n", + "\n", + "```sql\n", + "-- 1. customer → product direction\n", + "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_customer_customer_product\n", + " ON Customer (customer_id, product_key);\n", + "\n", + "-- 2.
product → customer direction (for the JOIN)\n", + "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_customer_product_customer\n", + " ON Customer (product_key, customer_id);\n", + "```\n", + "\n", + "Why two?\n", + "\n", + "* For `JOIN Product p ON p.product_key = c.product_key`, filtering and lookups on `Customer.product_key` benefit\n", + "* For `GROUP BY c.customer_id` + `COUNT(DISTINCT c.product_key)`, the `(customer_id, product_key)` order benefits\n", + "\n", + " * PostgreSQL deduplicates the values for `COUNT(DISTINCT ...)` within each group, so an index ordered by `(customer_id, product_key)` keeps each customer's products close together\n", + " * In other words, it is easier on memory\n", + "\n", + "`Product.product_key` is assumed to be the PK and therefore already indexed, so leave it as is.\n", + "\n", + "With these in place the planner can choose index scans or hash aggregation without a hand-written step such as the `win` CTE.\n", + "The effect is largest when the data has\n", + "\n", + "* far more Customer rows than Product rows\n", + "* particular customer_id values with very many product_key values (heavy users)\n", + " as its distribution.\n", + "\n", + "---\n", + "\n", + "## Summary (what should change?)\n", + "\n", + "* **The first optimization to apply**\n", + " Replace the query with the minimal version below: collapse the CTEs, rely on `COUNT(DISTINCT)`, and compare with a scalar subquery. In the runs recorded here it measured 595 ms against 513 ms for the CTE version, so the confirmed benefit is simplicity rather than speed.\n", + "\n", + " ```sql\n", + " SELECT\n", + " c.customer_id\n", + " FROM Customer c\n", + " JOIN Product p\n", + " ON p.product_key = c.product_key\n", + " WHERE c.product_key IS NOT NULL\n", + " GROUP BY c.customer_id\n", + " HAVING COUNT(DISTINCT c.product_key) = (\n", + " SELECT COUNT(*) FROM Product\n", + " );\n", + " ```\n", + "\n", + "* **The additional optimization to apply**\n", + " Create the two composite indexes `Customer(customer_id, product_key)` and `Customer(product_key, customer_id)`.\n", + " This is the most realistic way to push the runtime down.\n", + "\n", + "* **What not to bother with**\n", + " The double `NOT EXISTS` version is slow (1872 ms) and can be kept as a slow but correct reference implementation. Optimizing it further is of little value.\n", + "\n", + "In short, the logic already passes. What remains is a planner-friendly query shape plus the indexes; with those two, there is still room to go below 513 ms.\n", + "\n" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}