Commit 434843e

Merge pull request #201 from myoshi2891/dev-from-macmini
SQL: 620. Not Boring Movies Basic Select
2 parents 6eb2d1b + f1fcdb9 commit 434843e

3 files changed

Lines changed: 588 additions & 0 deletions

Lines changed: 178 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,178 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "7b1caa6a",
6+
"metadata": {},
7+
"source": [
8+
"# MySQL 8.0.40\n",
9+
"\n",
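10+
"## 0) Assumptions\n",
11+
"\n",
12+
"* Engine: **MySQL 8**\n",
13+
"* Ordering: this problem **explicitly requires descending order** (`ORDER BY rating DESC`)\n",
14+
"* Avoid `NOT IN` because of its NULL pitfall (not needed for this problem)\n",
15+
"* Filtering is **based on ID**; the output uses the column names and order given in the spec\n",
16+
"\n",
17+
"## 1) Problem\n",
18+
"\n",
19+
"* `From the Cinema table, select the movies whose ID is odd and whose description is not \"boring\", and return them in descending order of rating.`\n",
20+
"* Example input table: `Cinema(id, movie, description, rating)`\n",
21+
"* Output spec: return `id, movie, description, rating` in **descending order of rating**\n",
22+
"\n",
23+
"## 2) Best solution (single query)\n",
24+
"\n",
25+
"> No window function is needed here. A simple filter plus a descending sort does it in one query.\n",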
26+
"\n",
27+
"```sql\n",
28+
"SELECT\n",
29+
" id,\n",
30+
" movie,\n",
31+
" description,\n",
32+
" rating\n",
33+
"FROM Cinema\n",
34+
"WHERE (id % 2) = 1 -- odd IDs only\n",
35+
" AND description <> 'boring' -- exclude by string comparison (NULL rows are dropped automatically)\n",
36+
"ORDER BY rating DESC; -- descending order required by the spec\n",
37+
"\n",
38+
"-- Runtime 241 ms\n",
39+
"-- Beats 34.15%\n",
40+
"\n",
41+
"```\n",
42+
"\n",
43+
"## 3) Alternative solution\n",
44+
"\n",
45+
"> Equivalent. Only the odd check changes to a bitwise operation; the execution plan and the result are the same.\n",
46+
"\n",
47+
"```sql\n",
48+
"SELECT\n",
49+
" id,\n",
50+
" movie,\n",
51+
" description,\n",
52+
" rating\n",
53+
"FROM Cinema\n",
54+
"WHERE (id & 1) = 1\n",
55+
" AND description <> 'boring'\n",
56+
"ORDER BY rating DESC;\n",
57+
"\n",
58+
"-- Runtime 249 ms\n",
59+
"-- Beats 25.20%\n",
60+
"\n",
61+
"```\n",
62+
"\n",
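63+
"## 4) Key points\n",
64+
"\n",
65+
"* **Odd check**: either `id % 2 = 1` or `(id & 1) = 1` works. Both are safe for a positive integer primary key; for a negative odd `id`, `id % 2` evaluates to `-1` in MySQL, so `= 1` would miss it.\n",
66+
"* **Excluding \"boring\"**: `<> 'boring'` does not match rows where `description IS NULL` (the comparison evaluates to `UNKNOWN`, so WHERE drops them). To include NULLs, add `OR description IS NULL`.\n",
67+
"* **Ordering**: the template says to prefer unordered output, but this problem **requires descending order**, so `ORDER BY rating DESC` is included.\n",
68+
"\n",
69+
"## 5) Complexity (rough estimate)\n",
70+
"\n",
71+
"* Full table scan: **O(N)**. The arithmetic on `id` (PK) is constant time per row, and so is `description <> 'boring'`. Sorting the M remaining rows adds O(M log M).\n",
72+
"* Indexes: these predicates have low selectivity, so an index helps little (and sorting by `rating` tends to trigger a filesort).\n",
73+
"\n",
74+
"## 6) Diagram (Mermaid, ultra-conservative syntax)\n",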
75+
"\n",
76+
"```mermaid\n",
77+
"flowchart TD\n",
78+
" A[Input Cinema] --> B[Filter odd id]\n",
79+
" B --> C[\"Filter description <> \\\"boring\\\"\"]\n",
80+
" C --> D[Sort by rating descending]\n",
81+
" D --> E[Output id, movie, description, rating]\n",
82+
"```\n",
83+
"\n",
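84+
"Within the scope of this problem (query only, no schema changes), this is already close to the shortest path. The 241 ms vs 249 ms difference is within measurement noise, so neither `% 2` nor `& 1` is measurably better.\n",
85+
"Even so, here are some practical improvements for making it slightly better or clearer.\n",
86+
"\n",
87+
"## Tweaks possible within the query alone\n",
88+
"\n",
89+
"1. **Make the ordering deterministic (stable tie-breaking)**\n",
90+
" When several movies share the same rating, their relative order can vary. Add a tie-breaker if the output is graded or must be reproducible.\n",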
91+
"\n",
92+
"```sql\n",
93+
"SELECT id, movie, description, rating\n",
94+
"FROM Cinema\n",
95+
"WHERE (id & 1) = 1\n",
96+
" AND description <> 'boring'\n",
97+
"ORDER BY rating DESC, id DESC; -- deterministic tie-break\n",
98+
"\n",
99+
"-- Runtime 222 ms\n",
100+
"-- Beats 67.10%\n",
101+
"\n",
102+
"```\n",
103+
"\n",
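104+
"Note: virtually no speed impact; the gain is readability and reproducibility.\n",
105+
"\n",
106+
"2. **State the NULL handling explicitly**\n",
107+
" If \"not boring\" is read strictly as \"NULL is excluded\", spelling that out makes the intent unmistakable.\n",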
108+
"\n",
109+
"```sql\n",
110+
"... AND description IS NOT NULL\n",
111+
" AND description <> 'boring'\n",
112+
"```\n",
113+
"\n",
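114+
"Conversely, to **allow NULLs as well**, add `OR description IS NULL`.\n",
115+
"\n",
116+
"3. **Pin down case sensitivity**\n",
117+
" Whether values such as \"Boring\" match depends on the collation in effect. For a case-sensitive comparison, use a binary comparison.\n",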
118+
"\n",
119+
"```sql\n",
120+
"... AND BINARY description <> 'boring' -- note: the BINARY operator is deprecated as of MySQL 8.0.27\n",
121+
"-- or\n",
122+
"... AND description COLLATE utf8mb4_bin <> 'boring'\n",
123+
"```\n",
124+
"\n",
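125+
"> That is the realistic extent of query-only improvements for LeetCode. Speed barely changes; the main benefit is **stable results and clearer intent**.\n",
126+
"\n",
127+
"## Optimizations for production settings where schema changes are allowed (for reference)\n",
128+
"\n",
129+
"> For cases where the table is large and speed really matters in production.\n",
130+
"\n",
131+
"* **Index to avoid a filesort** (satisfies `ORDER BY rating DESC` in index order)\n",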
132+
"\n",
133+
" ```sql\n",
134+
" CREATE INDEX idx_cinema_rating_desc ON Cinema (rating DESC, id);\n",
135+
" ```\n",
136+
"\n",
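137+
" *Effect*: greatly reduces the sorting cost (though `description <> 'boring'` is still evaluated as a residual filter).\n",
138+
" *Note*: the selected columns `movie, description` are fetched from the secondary index via a primary-key lookup (InnoDB behavior).\n",
139+
"\n",
140+
"* **Make the odd/even check sargable with a functional index**\n",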
141+
"\n",
142+
" ```sql\n",
143+
" CREATE INDEX idx_cinema_odd_rating ON Cinema ( (id & 1), rating DESC );\n",
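144+
" -- then the query uses WHERE (id & 1) = 1 AND description <> 'boring'\n",
145+
" ```\n",
146+
"\n",
147+
" *Effect*: `(id & 1) = 1` first halves the range, then the index is scanned in `rating DESC` order,\n",
148+
" and `description <> 'boring'` is applied as a filter. This tends to pay off on large data sets. Functional key parts require MySQL 8.0.13 or later and cannot reference an `AUTO_INCREMENT` column.\n",
149+
"\n",
150+
"* **When `description` is highly selective**\n",
151+
" How much this helps depends on how common \"boring\" is, but if the goal is to **reduce the amount of data read**:\n",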
152+
"\n",
153+
" ```sql\n",
154+
" CREATE INDEX idx_cinema_desc_rating ON Cinema (description, rating DESC);\n",
155+
" ```\n",
156+
"\n",
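157+
" *Caveat*: `<> 'boring'` is split into two ranges (`< 'boring'` and `> 'boring'`), so\n",
158+
" the index benefit depends heavily on the data distribution. Decide based on measurements and EXPLAIN.\n",
159+
"\n",
160+
"## Summary\n",
161+
"\n",
162+
"* **The query by itself is already at an optimal level.** The measured 241 ms vs 249 ms difference is noise; either form is fine.\n",
163+
"* Small quality improvements: **add a tie-breaker** and **make the NULL and case-sensitivity handling explicit**.\n",
164+
"* For production speed, a **descending index for `ORDER BY`** and a **functional index on `(id & 1)`** help.\n",
165+
"\n",
166+
"If needed, the `EXPLAIN` output and the index candidates can be examined in more depth under assumed data volumes and distributions.\n",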
167+
"\n"
168+
]
169+
}
170+
],
171+
"metadata": {
172+
"language_info": {
173+
"name": "python"
174+
}
175+
},
176+
"nbformat": 4,
177+
"nbformat_minor": 5
178+
}
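
Reviewer sketch for the SQL notebook above: the NULL handling and the tie-break it describes can be sanity-checked by running the same predicates against an in-memory database. This is an illustration under stated assumptions, not part of the commit. It uses Python's stdlib `sqlite3` as a stand-in for MySQL 8 (for positive ids, `%`, `<>`, `IS NULL`, and `ORDER BY` behave the same way here; collation and case-sensitivity rules differ and are not exercised). The rows and the `run` helper are made up for the example.

```python
import sqlite3

# Hypothetical sample rows: id 7 has a NULL description, ids 1 and 9 tie on rating.
ROWS = [
    (1, "War", "great 3D", 8.9),
    (2, "Science", "fiction", 8.5),
    (3, "Irish", "boring", 6.2),
    (5, "House card", "Interesting", 9.1),
    (7, "Untitled", None, 9.5),
    (9, "Rerun", "fine", 8.9),
]

# Same predicates as the notebook, with the id DESC tie-breaker added.
STRICT = """
SELECT id FROM Cinema
WHERE (id % 2) = 1 AND description <> 'boring'
ORDER BY rating DESC, id DESC
"""

# Variant that keeps rows whose description is NULL.
WITH_NULLS = """
SELECT id FROM Cinema
WHERE (id % 2) = 1 AND (description <> 'boring' OR description IS NULL)
ORDER BY rating DESC, id DESC
"""


def run(query):
    """Load ROWS into a fresh in-memory table and return the selected ids."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE Cinema ("
            "id INTEGER PRIMARY KEY, movie TEXT, description TEXT, rating REAL)"
        )
        conn.executemany("INSERT INTO Cinema VALUES (?, ?, ?, ?)", ROWS)
        return [row[0] for row in conn.execute(query)]
    finally:
        conn.close()


strict_ids = run(STRICT)
with_null_ids = run(WITH_NULLS)
print(strict_ids)
print(with_null_ids)
```

With this sample, the strict query drops the NULL row (id 7) and orders the 8.9 tie as 9 then 1; the second query keeps id 7 and places it first because of its 9.5 rating.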
Lines changed: 214 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,214 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "040332a5",
6+
"metadata": {},
7+
"source": [
8+
"# For pandas 2.2.2\n",
9+
"\n",
10+
"## 0) Assumptions\n",
11+
"\n",
12+
"* Environment: **Python 3.10.15 / pandas 2.2.2**\n",
13+
"* **Strict adherence to the given signature** (function name, argument names, returned columns, order)\n",
14+
"* No I/O, no unnecessary `print` or `sort_values` (use `nlargest` for ordering)\n",
15+
"\n",
16+
"## 1) Problem\n",
17+
"\n",
18+
"* `From Cinema, select the movies whose ID is odd and whose description is not \"boring\", and return them in descending order of rating.`\n",
19+
"* Input DF: `Cinema(id: int, movie: str, description: str, rating: float)`\n",
20+
"* Output: columns `id, movie, description, rating`, in **descending order of rating** (order among ties is arbitrary)\n",
21+
"\n",
22+
"## 2) Implementation (strict signature)\n",
23+
"\n",
24+
"> Select columns → filter → descending order via `nlargest` (to honor the no-`sort_values` rule)\n",
25+
"\n",
26+
"```python\n",
27+
"import pandas as pd\n",
28+
"\n",
29+
"def select_non_boring_odd_movies(cinema: pd.DataFrame) -> pd.DataFrame:\n",
30+
" \"\"\"\n",
31+
" Returns:\n",
32+
" pd.DataFrame: column names and order are ['id', 'movie', 'description', 'rating'],\n",
33+
" returned in descending order of rating (order among ties is arbitrary)\n",
34+
" \"\"\"\n",
35+
" # Keep only the required columns\n",
36+
" cols = ['id', 'movie', 'description', 'rating']\n",
37+
" c = cinema.loc[:, cols]\n",
38+
"\n",
39+
" # Condition: odd ID, and description is not 'boring' (and is non-NULL)\n",
40+
" # The spec excludes NULLs, so notna() states that explicitly\n",
41+
" mask = (c['id'] % 2 == 1) & c['description'].notna() & (c['description'] != 'boring')\n",
42+
" kept = c.loc[mask]\n",
43+
"\n",
44+
" # Ordering: sort_values is disallowed, so nlargest provides the descending order\n",
45+
" # With n == len(kept), this is equivalent to a full sort by rating DESC\n",
46+
" out = kept.nlargest(len(kept), columns='rating')\n",
47+
"\n",
48+
" # Return with the spec's columns and order unchanged\n",
49+
" return out[['id', 'movie', 'description', 'rating']]\n",
50+
"\n",
51+
"# LeetCode result:\n",
52+
"# Runtime 276 ms\n",
53+
"# Beats 35.04%\n",
54+
"# Memory 67.14 MB\n",
55+
"# Beats 68.84%\n",
56+
"\n",
57+
"```\n",
58+
"\n",
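59+
"## 3) Algorithm\n",
60+
"\n",
61+
"* APIs used\n",
62+
"\n",
63+
" * Boolean mask: `Series %`, `Series.notna()`, comparison `!=`\n",
64+
" * Column selection: `DataFrame.loc[:, cols]`\n",
65+
" * Descending order: `DataFrame.nlargest(n, 'rating')` (to meet the no-`sort_values` requirement)\n",
66+
"* **NULL / duplicates / types**\n",
67+
"\n",
68+
" * `description.notna()` excludes `NULL` (made explicit to match the spec)\n",
69+
" * `rating` must be a numeric column (float). If strings are mixed in, consider `to_numeric` beforehand\n",
70+
" * `id` is assumed to be a primary key, so duplicate rows are not expected\n",
71+
"\n",
72+
"## 4) Complexity (rough estimate)\n",
73+
"\n",
74+
"* Filter (boolean mask): **O(N)**\n",
75+
"* `nlargest(len(kept), 'rating')`: internally a selection algorithm plus a partial sort, **O(M log M)** (M = number of remaining rows)\n",
76+
"\n",
77+
" * Same order as a full descending sort, but it is the minimal way to meet the requirement without `sort_values`\n",
78+
"\n",
79+
"## 5) Diagram (Mermaid, ultra-conservative syntax)\n",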
80+
"\n",
81+
"```mermaid\n",
82+
"flowchart TD\n",
83+
" A[Input Cinema DF] --> B[Select columns id,movie,description,rating]\n",
84+
" B --> C[Filter odd id and description non-NULL and != 'boring']\n",
85+
" C --> D[nlargest for rating descending]\n",
86+
" D --> E[Output id,movie,description,rating]\n",
87+
"```\n",
88+
"\n",
89+
"There is still room to reduce **pandas overhead** and **copies**. While still avoiding `sort_values`, applying `nlargest` to a **Series** and reordering by index, or **ranking with NumPy alone**, tends to be faster.\n",
90+
"\n",
91+
"---\n",
92+
"\n",
93+
"## 1) Low-risk version (pandas-oriented, minimal change)\n",
94+
"\n",
95+
"Key points\n",
96+
"\n",
97+
"* Remove the intermediate copy `c = cinema.loc[:, cols]`\n",
98+
"* Build the mask from **NumPy arrays** (`to_numpy()`)\n",
99+
"* Get the ordered index via **Series.nlargest**, then project the columns at the end\n",
100+
"\n",
101+
"```python\n",
102+
"import pandas as pd\n",
103+
"import numpy as np\n",
104+
"\n",
105+
"def select_non_boring_odd_movies(cinema: pd.DataFrame) -> pd.DataFrame:\n",
106+
" \"\"\"\n",
107+
" Returns:\n",
108+
" pd.DataFrame: ['id', 'movie', 'description', 'rating'] in descending order of rating\n",
109+
" \"\"\"\n",
110+
" id_arr = cinema['id'].to_numpy()\n",
111+
" desc_arr = cinema['description'].to_numpy()\n",
112+
" # Exclude NULL (None or NaN) + exclude 'boring' + odd ID\n",
113+
" mask = ((id_arr & 1) == 1) & pd.notna(desc_arr) & (desc_arr != 'boring')\n",
114+
"\n",
115+
" # Index labels of the selected rows\n",
116+
" idx = cinema.index[mask]\n",
117+
" # Get the index in descending rating order via Series.nlargest (lighter than DataFrame.nlargest)\n",
118+
" top_idx = cinema.loc[idx, 'rating'].nlargest(idx.size).index\n",
119+
"\n",
120+
" return cinema.loc[top_idx, ['id', 'movie', 'description', 'rating']]\n",
121+
"\n",
122+
"# LeetCode result:\n",
123+
"# Runtime 261 ms\n",
124+
"# Beats 65.91%\n",
125+
"# Memory 67.39 MB\n",
126+
"# Beats 28.60%\n",
127+
"\n",
128+
"```\n",
129+
"\n",
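130+
"**Rationale**\n",
131+
"\n",
132+
"* `Series.nlargest` works on the target column only, so it touches less memory than `DataFrame.nlargest` and tends to be faster.\n",
133+
"* No intermediate DataFrame is created, which reduces copying and tends to lighten both memory and CPU (in the run above, memory was essentially unchanged: 67.39 MB vs 67.14 MB).\n",
134+
"\n",
135+
"---\n",
136+
"\n",
137+
"## 2) Aggressive fastest version (NumPy-driven)\n",
138+
"\n",
139+
"Key points\n",
140+
"\n",
141+
"* Delegate the ordering entirely to **`np.argsort`** (minimizes the cost of building pandas indexers)\n",
142+
"* Read the `rating` of all rows into an array just once\n",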
143+
"\n",
144+
"```python\n",
145+
"import pandas as pd\n",
146+
"import numpy as np\n",
147+
"\n",
148+
"def select_non_boring_odd_movies(cinema: pd.DataFrame) -> pd.DataFrame:\n",
149+
" \"\"\"\n",
150+
" Returns:\n",
151+
" pd.DataFrame: ['id', 'movie', 'description', 'rating'] in descending order of rating\n",
152+
" \"\"\"\n",
153+
" id_arr = cinema['id'].to_numpy()\n",
154+
" desc_arr = cinema['description'].to_numpy()\n",
155+
" rate_arr = cinema['rating'].to_numpy()\n",
156+
"\n",
157+
" mask = ((id_arr & 1) == 1) & pd.notna(desc_arr) & (desc_arr != 'boring')\n",
158+
" sel = np.flatnonzero(mask)\n",
159+
"\n",
160+
" # Positions in descending rating order (within the selection only)\n",
161+
" order_in_sel = np.argsort(rate_arr[sel])[::-1]\n",
162+
" row_pos = sel[order_in_sel]\n",
163+
"\n",
164+
" return cinema.iloc[row_pos, :][['id', 'movie', 'description', 'rating']]\n",
165+
"\n",
166+
"# LeetCode result:\n",
167+
"# Runtime 256 ms\n",
168+
"# Beats 76.61%\n",
169+
"# Memory 67.11 MB\n",
170+
"# Beats 68.84%\n",
171+
"\n",
172+
"```\n",
173+
"\n",
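174+
"**Rationale**\n",
175+
"\n",
176+
"* `nlargest(len)` is effectively a full descending sort, and even with selection plus a partial sort internally, its cost can be significant.\n",
177+
"* `np.argsort` runs fast on a plain array, and the **row-position array → `iloc`** path is very light.\n",
178+
"\n",
179+
"---\n",
180+
"\n",
181+
"## 3) Additional fine-grained optimization hints\n",
182+
"\n",
183+
"* **Review dtypes**\n",
184+
"\n",
185+
" * If `id` can be downcast to `int32` and `rating` to `float32`, memory shrinks (better CPU cache efficiency).\n",
186+
" * With many strings, `string[pyarrow]` (if pyarrow is available) can shrink the footprint; the default Python-backed `pd.StringDtype()` mainly gives a dedicated string type rather than a memory saving.\n",
187+
"* **Consistent column access**\n",
188+
"\n",
189+
" * When the same column is used several times, **convert it to an array once and reuse it** (call `to_numpy()` only once, as in the code above).\n",
190+
"* **Order of evaluating conditions**\n",
191+
"\n",
192+
" * Evaluating the more selective condition first (here the `description` check often filters more than `id & 1`) makes little difference, because NumPy boolean operations do not short-circuit. Building the full mask in one vectorized expression is faster.\n",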
193+
"\n",
194+
"---\n",
195+
"\n",
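196+
"## 4) Rough expected gains\n",
197+
"\n",
198+
"* Low-risk version: removing the intermediate copy and using `Series.nlargest` often gives about a **10–25% reduction**; the LeetCode run above went from 276 ms to 261 ms (about 5%).\n",
199+
"* NumPy version: depending on data size and column count, it can improve by **a few ms to a few tens of ms more** (261 ms to 256 ms in the run above).\n",
200+
"\n",
201+
"> Try the **low-risk version** first, then the **NumPy version** if the gain is not enough.\n",
202+
"> If more is still needed, consider dtype optimization or filtering upstream (for example, passing in only odd IDs).\n",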
203+
"\n"
204+
]
205+
}
206+
],
207+
"metadata": {
208+
"language_info": {
209+
"name": "python"
210+
}
211+
},
212+
"nbformat": 4,
213+
"nbformat_minor": 5
214+
}
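
Reviewer sketch for the pandas notebook above: the NumPy-based variants need a null check that catches every kind of missing value. A self-comparison such as `arr == arr` only flags NaN, while Python `None` compares equal to itself, so an explicit `pd.notna` is the safer choice. The example below is a minimal illustration under stated assumptions: missing descriptions may arrive as `None` in an object column, and the sample frame plus the helper names `select_with_argsort` and `select_with_nlargest` are hypothetical. It also checks that the `np.argsort` route selects the same rows as `nlargest` on that sample.

```python
import numpy as np
import pandas as pd

# Self-comparison does not flag None (None == None is True); pd.notna does.
obj = np.array(["fine", None], dtype=object)
self_eq = (obj == obj).tolist()
explicit = pd.notna(obj).tolist()

# Hypothetical sample; object dtype keeps the None in the description column.
cinema = pd.DataFrame({
    "id": [1, 2, 3, 5, 7],
    "movie": ["War", "Science", "Irish", "House card", "Untitled"],
    "description": pd.Series(
        ["great 3D", "fiction", "boring", "Interesting", None], dtype=object
    ),
    "rating": [8.9, 8.5, 6.2, 9.1, 9.5],
})
COLS = ["id", "movie", "description", "rating"]


def select_with_argsort(df: pd.DataFrame) -> pd.DataFrame:
    """NumPy-driven variant: build the mask on arrays, order with argsort."""
    ids = df["id"].to_numpy()
    desc = df["description"].to_numpy()
    rate = df["rating"].to_numpy()
    mask = ((ids & 1) == 1) & pd.notna(desc) & (desc != "boring")
    sel = np.flatnonzero(mask)
    # Ascending argsort reversed gives descending order; tie order is not guaranteed.
    order = np.argsort(rate[sel])[::-1]
    return df.iloc[sel[order]][COLS]


def select_with_nlargest(df: pd.DataFrame) -> pd.DataFrame:
    """Baseline variant: pandas mask plus nlargest over all kept rows."""
    mask = (
        (df["id"] % 2 == 1)
        & df["description"].notna()
        & (df["description"] != "boring")
    )
    kept = df.loc[mask, COLS]
    return kept.nlargest(len(kept), columns="rating")


fast = select_with_argsort(cinema)
base = select_with_nlargest(cinema)
print(fast["id"].tolist(), base["id"].tolist())
```

On this sample both helpers drop the `None` row (id 7) and return ids 5 then 1. Ties are not exercised here, and the two approaches may order tied ratings differently.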
