| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "id": "040332a5", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | + "# For pandas 2.2.2\n", |
| 9 | + "\n", |
| 10 | + "## 0) Prerequisites\n", |
| 11 | + "\n", |
| 12 | + "* Environment: **Python 3.10.15 / pandas 2.2.2**\n", |
| 13 | + "* **Follow the specified signature exactly** (function name, argument names, returned columns, and their order)\n", |
| 14 | + "* No I/O, and no unnecessary `print` or `sort_values` calls (use `nlargest` for ordering)\n", |
| 15 | + "\n", |
| 16 | + "## 1) Problem\n", |
| 17 | + "\n", |
| 18 | + "* `From Cinema, select the movies whose ID is odd and whose description is not \"boring\", and return them in descending order of rating.`\n", |
| 19 | + "* Input DF: `Cinema(id: int, movie: str, description: str, rating: float)`\n", |
| 20 | + "* Output: columns `id, movie, description, rating`, in **descending order of rating** (the order of ties is arbitrary)\n", |
| 21 | + "\n", |
| 22 | + "## 2) Implementation (specified signature, followed exactly)\n", |
| 23 | + "\n", |
| 24 | + "> Column pruning → condition filter → descending order via `nlargest` (to satisfy the `sort_values` ban)\n", |
| 25 | + "\n", |
| 26 | + "```python\n", |
| 27 | + "import pandas as pd\n", |
| 28 | + "\n", |
| 29 | + "def select_non_boring_odd_movies(cinema: pd.DataFrame) -> pd.DataFrame:\n", |
| 30 | + "    \"\"\"\n", |
| 31 | + "    Returns:\n", |
| 32 | + "        pd.DataFrame: column names and order are ['id', 'movie', 'description', 'rating'],\n", |
| 33 | + "        in descending order of rating (the order of ties is arbitrary)\n", |
| 34 | + "    \"\"\"\n", |
| 35 | + "    # Keep only the required columns (column pruning)\n", |
| 36 | + "    cols = ['id', 'movie', 'description', 'rating']\n", |
| 37 | + "    c = cinema.loc[:, cols]\n", |
| 38 | + "\n", |
| 39 | + "    # Condition: odd ID, and description is not 'boring' (and not NULL)\n", |
| 40 | + "    # The spec excludes NULL, so the notna() check is explicit\n", |
| 41 | + "    mask = (c['id'] % 2 == 1) & c['description'].notna() & (c['description'] != 'boring')\n", |
| 42 | + "    kept = c.loc[mask]\n", |
| 43 | + "\n", |
| 44 | + "    # Ordering: sort_values is banned, so nlargest provides the descending order\n", |
| 45 | + "    # Taking n == len(kept) is equivalent to a full sort by rating DESC\n", |
| 46 | + "    out = kept.nlargest(len(kept), columns='rating')\n", |
| 47 | + "\n", |
| 48 | + "    # Return with the specified columns and order\n", |
| 49 | + "    return out[['id', 'movie', 'description', 'rating']]\n", |
| 50 | + "```\n", |
| 51 | + "\n", |
| 52 | + "Measured result:\n", |
| 53 | + "\n", |
| 54 | + "* Runtime: 276 ms (beats 35.04%)\n", |
| 55 | + "* Memory: 67.14 MB (beats 68.84%)\n", |
| 58 | + "\n", |
| 59 | + "## 3) How the algorithm works\n", |
| 60 | + "\n", |
| 61 | + "* APIs used\n", |
| 62 | + "\n", |
| 63 | + "    * Boolean mask: `Series %`, `Series.notna()`, comparison `!=`\n", |
| 64 | + "    * Column pruning: `DataFrame.loc[:, cols]`\n", |
| 65 | + "    * Descending retrieval: `DataFrame.nlargest(n, 'rating')` (to satisfy the no-`sort_values` requirement)\n", |
| 66 | + "* **NULL / duplicates / types**\n", |
| 67 | + "\n", |
| 68 | + "    * `description.notna()` excludes `NULL` (stated explicitly to match the spec)\n", |
| 69 | + "    * `rating` must be a numeric column (float). If strings are mixed in, consider applying `to_numeric` beforehand\n", |
| 70 | + "    * `id` is assumed to be the primary key, so duplicate rows are not expected\n", |
| 71 | + "\n", |
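| 72 | + "## 4) Complexity (rough estimate)\n", |
| 73 | + "\n", |
| 74 | + "* Filter (boolean mask): **O(N)**\n", |
| 75 | + "* `nlargest(len(kept), 'rating')`: **O(M log M)** (M is the number of remaining rows). Because `n` covers every remaining row, the call amounts to a full descending sort and the usual top-n selection shortcut saves nothing\n", |
| 76 | + "\n", |
| 77 | + "    * The order matches a full descending sort, but this is the minimal way to meet the requirement without calling `sort_values` directly\n", |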
| 78 | + "\n", |
| 79 | + "## 5) Diagram (Mermaid, ultra-conservative syntax)\n", |
| 80 | + "\n", |
| 81 | + "```mermaid\n", |
| 82 | + "flowchart TD\n", |
| 83 | + "    A[Input Cinema DF] --> B[Column pruning id,movie,description,rating]\n", |
| 84 | + "    B --> C[Filter id odd and description not NULL and != 'boring']\n", |
| 85 | + "    C --> D[nlargest for rating descending]\n", |
| 86 | + "    D --> E[Output id,movie,description,rating]\n", |
| 87 | + "```\n", |
| 88 | + "\n", |
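| 89 | + "There is still room to tighten this by **reducing pandas overhead** and **reducing copies**. While still honoring the `sort_values` ban, it tends to be faster to apply `nlargest` to a **Series** and reorder by index, or to do the **ranking with NumPy alone**.\n", |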
| 90 | + "\n", |
| 91 | + "---\n", |
| 92 | + "\n", |
| 93 | + "## 1) Low-risk version (pandas-oriented, minimal changes)\n", |
| 94 | + "\n", |
| 95 | + "Key points\n", |
| 96 | + "\n", |
| 97 | + "* Remove the intermediate copy `c = cinema.loc[:, cols]`\n", |
| 98 | + "* Build the mask from **NumPy arrays** (`to_numpy()`)\n", |
| 99 | + "* Order rows by taking the index from **Series.nlargest** → project the columns at the end\n", |
| 100 | + "\n", |
| 101 | + "```python\n", |
| 102 | + "import pandas as pd\n", |
| 103 | + "\n", |
| 104 | + "def select_non_boring_odd_movies(cinema: pd.DataFrame) -> pd.DataFrame:\n", |
| 105 | + "    \"\"\"\n", |
| 106 | + "    Returns:\n", |
| 107 | + "        pd.DataFrame: ['id', 'movie', 'description', 'rating'] in descending order of rating\n", |
| 108 | + "    \"\"\"\n", |
| 109 | + "    id_arr = cinema['id'].to_numpy()\n", |
| 110 | + "    desc = cinema['description']\n", |
| 111 | + "    # Odd ID, non-NULL description, and description != 'boring'.\n", |
| 112 | + "    # notna() handles the NULL check: `arr == arr` drops NaN but keeps None.\n", |
| 113 | + "    mask = ((id_arr & 1) == 1) & desc.notna().to_numpy() & (desc.to_numpy() != 'boring')\n", |
| 114 | + "\n", |
| 115 | + "    # Index labels of the selected rows\n", |
| 116 | + "    idx = cinema.index[mask]\n", |
| 117 | + "    # Index in descending rating order via Series.nlargest (usually lighter than DataFrame.nlargest)\n", |
| 118 | + "    top_idx = cinema.loc[idx, 'rating'].nlargest(idx.size).index\n", |
| 119 | + "\n", |
| 120 | + "    return cinema.loc[top_idx, ['id', 'movie', 'description', 'rating']]\n", |
| 121 | + "```\n", |
| 122 | + "\n", |
| 123 | + "Measured result:\n", |
| 124 | + "\n", |
| 125 | + "* Runtime: 261 ms (beats 65.91%)\n", |
| 126 | + "* Memory: 67.39 MB (beats 28.60%)\n", |
| 129 | + "\n", |
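| 130 | + "**Rationale**\n", |
| 131 | + "\n", |
| 132 | + "* `Series.nlargest` works on the target column only, so it touches less memory than `DataFrame.nlargest` and tends to run faster.\n", |
| 133 | + "* No intermediate DataFrame is created, which cuts copies (this tends to lighten both memory and CPU load).\n", |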
| 134 | + "\n", |
| 135 | + "---\n", |
| 136 | + "\n", |
| 137 | + "## 2) Aggressive fastest version (NumPy-driven)\n", |
| 138 | + "\n", |
| 139 | + "Key points\n", |
| 140 | + "\n", |
| 141 | + "* Delegate the ordering entirely to **`np.argsort`** (minimizing the cost of building pandas indexers)\n", |
| 142 | + "* Read `rating` for all rows into an array just once\n", |
| 143 | + "\n", |
| 144 | + "```python\n", |
| 145 | + "import pandas as pd\n", |
| 146 | + "import numpy as np\n", |
| 147 | + "\n", |
| 148 | + "def select_non_boring_odd_movies(cinema: pd.DataFrame) -> pd.DataFrame:\n", |
| 149 | + "    \"\"\"\n", |
| 150 | + "    Returns:\n", |
| 151 | + "        pd.DataFrame: ['id', 'movie', 'description', 'rating'] in descending order of rating\n", |
| 152 | + "    \"\"\"\n", |
| 153 | + "    id_arr = cinema['id'].to_numpy()\n", |
| 154 | + "    desc = cinema['description']\n", |
| 155 | + "    rate_arr = cinema['rating'].to_numpy()\n", |
| 156 | + "\n", |
| 157 | + "    # Odd ID, non-NULL description, and description != 'boring'.\n", |
| 158 | + "    # notna() handles the NULL check: `arr == arr` drops NaN but keeps None.\n", |
| 159 | + "    mask = ((id_arr & 1) == 1) & desc.notna().to_numpy() & (desc.to_numpy() != 'boring')\n", |
| 160 | + "    sel = np.flatnonzero(mask)\n", |
| 161 | + "\n", |
| 162 | + "    # Positions in descending rating order (within the selected rows only)\n", |
| 163 | + "    order_in_sel = np.argsort(rate_arr[sel])[::-1]\n", |
| 164 | + "    row_pos = sel[order_in_sel]\n", |
| 165 | + "\n", |
| 166 | + "    return cinema.iloc[row_pos, :][['id', 'movie', 'description', 'rating']]\n", |
| 167 | + "```\n", |
| 168 | + "\n", |
| 169 | + "Measured result:\n", |
| 170 | + "\n", |
| 171 | + "* Runtime: 256 ms (beats 76.61%)\n", |
| 172 | + "* Memory: 67.11 MB (beats 68.84%)\n", |
| 173 | + "\n", |
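| 174 | + "**Rationale**\n", |
| 175 | + "\n", |
| 176 | + "* `nlargest(len)` is in effect a full descending sort, so it can be costly even though the method is designed for top-n selection.\n", |
| 177 | + "* `np.argsort` runs fast on a plain array, and the **row-position array → `iloc`** path is very lightweight.\n", |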
| 178 | + "\n", |
| 179 | + "---\n", |
| 180 | + "\n", |
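| 181 | + "## 3) Additional fine-grained optimization hints\n", |
| 182 | + "\n", |
| 183 | + "* **Revisit dtypes**\n", |
| 184 | + "\n", |
| 185 | + "    * If `id` can be narrowed to `int32` and `rating` to `float32`, memory use drops (better CPU cache efficiency).\n", |
| 186 | + "    * With many strings, `pd.StringDtype()` (or `string[pyarrow]` if the pyarrow backend is available) can shrink the footprint.\n", |
| 187 | + "* **Consistent column access**\n", |
| 188 | + "\n", |
| 189 | + "    * When the same column is used several times, **convert it to an array once and reuse it** (call `to_numpy()` only once, as in the code above).\n", |
| 190 | + "* **Order of conditions**\n", |
| 191 | + "\n", |
| 192 | + "    * Evaluating the more selective condition first (here the `description` check is often more selective than `id & 1`) makes little difference to speed, because NumPy boolean operations do not short-circuit. Building the mask in one shot from arrays is faster.\n", |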
| 193 | + "\n", |
| 194 | + "---\n", |
| 195 | + "\n", |
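| 196 | + "## 4) Rough expected gains\n", |
| 197 | + "\n", |
| 198 | + "* Low-risk version: removing the intermediate copy and switching to `Series.nlargest` can often cut **roughly 10–25%**, although the measured run above improved by only about 5% (276 ms to 261 ms).\n", |
| 199 | + "* NumPy version: depending on data size and column count, it can improve by **a further few ms to a few tens of ms** (261 ms to 256 ms in the measured run above).\n", |
| 200 | + "\n", |
| 201 | + "> Try the **low-risk version** first, then the **NumPy version** if the gain is not enough.\n", |
| 202 | + "> If it still needs tightening, consider dtype optimization or filtering at a preprocessing stage (for example, passing only odd IDs from upstream).\n", |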
| 203 | + "\n" |
| 204 | + ] |
| 205 | + } |
| 206 | + ], |
| 207 | + "metadata": { |
| 208 | + "language_info": { |
| 209 | + "name": "python" |
| 210 | + } |
| 211 | + }, |
| 212 | + "nbformat": 4, |
| 213 | + "nbformat_minor": 5 |
| 214 | +} |