@@ -0,0 +1,178 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "7b1caa6a",
"metadata": {},
"source": [
"# MySQL 8.0.40\n",
"\n",
"## 0) 前提\n",
"\n",
"* エンジン: **MySQL 8**\n",
"* 並び順: 本問は **仕様で降順指定あり**(`ORDER BY rating DESC`)\n",
"* `NOT IN` は NULL 罠のため回避(本問では不要)\n",
"* 判定は **ID 基準**、表示は仕様どおりの列名と順序\n",
"\n",
"## 1) 問題\n",
"\n",
"* `映画テーブル Cinema から、ID が奇数かつ description が \"boring\" ではない映画を抽出し、rating の降順で返せ。`\n",
"* 入力テーブル例: `Cinema(id, movie, description, rating)`\n",
"* 出力仕様: `id, movie, description, rating` を **rating 降順**で返す\n",
"\n",
"## 2) 最適解(単一クエリ)\n",
"\n",
"> 本問はウィンドウ不要。単純な条件抽出+降順ソートで 1 クエリ。\n",
"\n",
"```sql\n",
"SELECT\n",
" id,\n",
" movie,\n",
" description,\n",
" rating\n",
"FROM Cinema\n",
"WHERE (id % 2) = 1 -- 奇数IDのみ\n",
" AND description <> 'boring' -- 文字列一致で除外(NULL は自動的に除外される)\n",
"ORDER BY rating DESC; -- 仕様で降順指定\n",
"\n",
"Runtime 241 ms\n",
"Beats 34.15%\n",
"\n",
"```\n",
"\n",
"## 3) 代替解\n",
"\n",
"> 同義。奇数判定をビット演算にするだけ。実行計画・結果は同等。\n",
"\n",
"```sql\n",
"SELECT\n",
" id,\n",
" movie,\n",
" description,\n",
" rating\n",
"FROM Cinema\n",
"WHERE (id & 1) = 1\n",
" AND description <> 'boring'\n",
"ORDER BY rating DESC;\n",
"\n",
"Runtime 249 ms\n",
"Beats 25.20%\n",
"\n",
"```\n",
"\n",
"## 4) 要点解説\n",
"\n",
"* **奇数判定**: `id % 2 = 1` でも `(id & 1) = 1` でも可。整数主キーならどちらも安全。\n",
"* **\"boring\" 除外**: `<> 'boring'` は `description IS NULL` 行を含めない(`UNKNOWN` のため WHERE で落ちる)。NULL を含めたいなら `OR description IS NULL` を追加する。\n",
"* **順序**: 本テンプレでは「順不同を優先」とあるが、本問は仕様で **降順必須** のため `ORDER BY rating DESC` を入れる。\n",
"\n",
"## 5) 計算量(概算)\n",
"\n",
"* テーブルフルスキャン時: **O(N)**。`id`(PK)の演算は行ごと定数時間、`description <> 'boring'` も行ごと定数時間。\n",
"* インデックス: 本条件は選択性が低いので索引効果は限定的(`rating` で並び替えるため filesort が走りやすい)。\n",
"\n",
"## 6) 図解(Mermaid 超保守版)\n",
"\n",
"```mermaid\n",
"flowchart TD\n",
" A[入力 Cinema] --> B[条件抽出 id 奇数]\n",
" B --> C[\"条件抽出 description <> \\\"boring\\\"\"]\n",
" C --> D[rating 降順で並べ替え]\n",
" D --> E[出力 id, movie, description, rating]\n",
"```\n",
"\n",
"この問題の範囲(クエリだけで勝負、スキーマ変更なし)だと、ほぼ最短距離が書けています。差分 241ms vs 249ms は誤差レベルで、`%2` と `&1` の優劣は測定ノイズの域です。\n",
"それでも「もう少しだけ良くする/明確にする」観点で、実用的な改善ポイントを挙げます。\n",
"\n",
"## クエリだけでできる微調整\n",
"\n",
"1. **順序の決定性を高める(同率タイを安定化)**\n",
" rating が同じ映画がある場合、返却順が揺れます。審査系や再現性重視ならタイブレークを追加。\n",
"\n",
"```sql\n",
"SELECT id, movie, description, rating\n",
"FROM Cinema\n",
"WHERE (id & 1) = 1\n",
" AND description <> 'boring'\n",
"ORDER BY rating DESC, id DESC; -- 安定化\n",
"\n",
"Runtime 222 ms\n",
"Beats 67.10%\n",
"\n",
"```\n",
"\n",
"※速度影響はほぼ無し、可読性と再現性の向上。\n",
"\n",
"2. **NULL の扱いを明示**\n",
" 仕様どおり「boring ではない」を厳密に“NULL は含めない”と解釈するなら明示しておくと安心。\n",
"\n",
"```sql\n",
"... AND description IS NOT NULL\n",
" AND description <> 'boring'\n",
"```\n",
"\n",
"逆に **NULL も許可**したいなら `OR description IS NULL` を併記。\n",
"\n",
"3. **大文字小文字の扱いを固定**\n",
" 環境の照合順序で “Boring” などの扱いが変わります。区別したいならバイナリ比較に。\n",
"\n",
"```sql\n",
"... AND BINARY description <> 'boring'\n",
"-- もしくは\n",
"... AND description COLLATE utf8mb4_bin <> 'boring'\n",
"```\n",
"\n",
"> ここまでが LeetCode 的に現実的な「クエリ単体」改善。速度はほぼ変わらず、**結果の安定性・意図の明確さ**が主効果です。\n",
"\n",
"## スキーマ変更が許される現場向け最適化(参考)\n",
"\n",
"> テーブルが大きく、実運用で速度を突き詰めたい場合。\n",
"\n",
"* **filesort 回避用インデックス**(`ORDER BY rating DESC` をインデックス順で満たす)\n",
"\n",
" ```sql\n",
" CREATE INDEX idx_cinema_rating_desc ON Cinema (rating DESC, id);\n",
" ```\n",
"\n",
" *効果*: 並べ替えコストを大幅削減(ただし `description <> 'boring'` は残差条件で評価)。\n",
" *ポイント*: 取り出し列 `movie, description` は二次索引から PK 経由でルックアップされます(InnoDB 仕様)。\n",
"\n",
"* **関数インデックスで偶数/奇数を sargable に**\n",
"\n",
" ```sql\n",
" CREATE INDEX idx_cinema_odd_rating ON Cinema ( (id & 1), rating DESC );\n",
" -- そしてクエリは WHERE (id & 1) = 1 AND description <> 'boring'\n",
" ```\n",
"\n",
" *効果*: まず `(id & 1)=1` で範囲を半減 → そのまま `rating DESC` でインデックススキャンし、\n",
" `description <> 'boring'` はフィルタで落とす。大規模データで効きやすいです。\n",
"\n",
"* **description 側の選択性が高い場合**\n",
" “boring” が多い/少ないで有効度が変わりますが、**読み取り量を減らす**方向に張るなら:\n",
"\n",
" ```sql\n",
" CREATE INDEX idx_cinema_desc_rating ON Cinema (description, rating DESC);\n",
" ```\n",
"\n",
" *注意*: `<> 'boring'` はレンジ分割(`< 'boring'` と `> 'boring'`)になるため、\n",
" インデックス効用はデータ分布に強く依存します。実測&EXPLAIN で判断を。\n",
"\n",
"## まとめ\n",
"\n",
"* **クエリ単体では既に最適レベル**。実測 241ms と 249ms の差は誤差で、どちらも OK。\n",
"* 品質面の小改善は **タイブレーク追加**・**NULL/大文字小文字の扱いを明示**。\n",
"* 実運用で速度を上げるなら **`ORDER BY` 用の降順インデックス** と **(id & 1) の関数インデックス**が効きます。\n",
"\n",
"必要なら、あなたの想定データ量・分布を仮定して `EXPLAIN` の読み方とインデックス案をもう少し踏み込みで出します。\n",
"\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -0,0 +1,214 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "040332a5",
"metadata": {},
"source": [
"# Pandas 2.2.2用\n",
"\n",
"## 0) 前提\n",
"\n",
"* 環境: **Python 3.10.15 / pandas 2.2.2**\n",
"* **指定シグネチャ厳守**(関数名・引数名・返却列・順序)\n",
"* I/O 禁止、不要な `print` や `sort_values` 禁止(並び替えは `nlargest` を使用)\n",
"\n",
"## 1) 問題\n",
"\n",
"* `Cinema から、ID が奇数かつ description が \"boring\" ではない映画を抽出し、rating の降順で返す。`\n",
"* 入力 DF: `Cinema(id: int, movie: str, description: str, rating: float)`\n",
"* 出力: 列は `id, movie, description, rating`。**rating 降順**(同率時の順序は任意)\n",
"\n",
"## 2) 実装(指定シグネチャ厳守)\n",
"\n",
"> 列最小化 → 条件抽出 → `nlargest` で降順(`sort_values` 禁止対応)\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"def select_non_boring_odd_movies(cinema: pd.DataFrame) -> pd.DataFrame:\n",
" \"\"\"\n",
" Returns:\n",
" pd.DataFrame: columns, in order, are ['id', 'movie', 'description', 'rating'],\n",
" sorted by rating descending (order among ties is arbitrary)\n",
" \"\"\"\n",
" # Keep only the required columns\n",
" cols = ['id', 'movie', 'description', 'rating']\n",
" c = cinema.loc[:, cols]\n",
"\n",
" # Condition: odd ID, and description is non-null and not 'boring'\n",
" # NULL is excluded by the specification, so notna() is stated explicitly\n",
" mask = (c['id'] % 2 == 1) & c['description'].notna() & (c['description'] != 'boring')\n",
" kept = c.loc[mask]\n",
"\n",
" # Sorting: sort_values is banned, so nlargest produces the descending order\n",
" # With n == len(kept) this is equivalent to a full sort by rating DESC\n",
" out = kept.nlargest(len(kept), columns='rating')\n",
"\n",
" # Return with the specified columns and order\n",
" return out[['id', 'movie', 'description', 'rating']]\n",
"```\n",
"\n",
"Runtime 276 ms, Beats 35.04%; Memory 67.14 MB, Beats 68.84%\n",
"\n",
"## 3) アルゴリズム説明\n",
"\n",
"* 使用 API\n",
"\n",
" * ブールマスク: `Series %`, `Series.notna()`, 比較 `!=`\n",
" * 列最小化: `DataFrame.loc[:, cols]`\n",
" * 降順取得: `DataFrame.nlargest(n, 'rating')`(`sort_values` 非使用要件に対応)\n",
"* **NULL / 重複 / 型**\n",
"\n",
" * `description.notna()` により `NULL` を除外(仕様に合わせて明示)\n",
" * `rating` は数値列必須(float)。文字列混在の場合は事前に `to_numeric` を検討\n",
" * 主キー `id` 前提なので重複行は想定しない\n",
"\n",
"## 4) 計算量(概算)\n",
"\n",
"* フィルタ(ブールマスク): **O(N)**\n",
"* `nlargest(len(kept), 'rating')`: 内部的には選択アルゴリズム+部分ソートで **O(M log M)**(M は残件数)\n",
"\n",
" * 全件降順と同等のオーダーだが、`sort_values` を使わず要件を満たす手段として最小限\n",
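"\n",
"The equivalence claimed above can be seen on a tiny frame. This is a sketch with hypothetical data; the ratings are distinct so that tie handling does not affect the comparison, and `sort_values` appears only as a reference for the check.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical data with distinct ratings so the order is unambiguous.\n",
"kept = pd.DataFrame({'id': [1, 5, 7], 'rating': [8.9, 7.2, 9.1]})\n",
"\n",
"via_nlargest = kept.nlargest(len(kept), columns='rating')\n",
"via_sort = kept.sort_values('rating', ascending=False)  # reference only\n",
"\n",
"same_order = via_nlargest['id'].tolist() == via_sort['id'].tolist()\n",
"print(via_nlargest['id'].tolist(), same_order)\n",
"```\n",
"\n",
"Both paths return ids 7, 1, 5 here.\n",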
"\n",
"## 5) 図解(Mermaid 超保守版)\n",
"\n",
"```mermaid\n",
"flowchart TD\n",
" A[入力 Cinema DF] --> B[列最小化 id,movie,description,rating]\n",
" B --> C[条件抽出 id 奇数 かつ description 非NULL かつ != 'boring']\n",
" C --> D[nlargest で rating 降順]\n",
" D --> E[出力 id,movie,description,rating]\n",
"```\n",
"\n",
"**Pandas のオーバーヘッド削減**と**コピー削減**でまだ詰められます。`sort_values` 禁止の前提は守りつつ、`nlargest` の対象を **Series** にしてインデックスで並べ替える・もしくは **NumPy だけで順位付け**するのが速いです。\n",
"\n",
"---\n",
"\n",
"## 1) 低リスク版(Pandas寄り・最小変更)\n",
"\n",
"ポイント\n",
"\n",
"* 中間の `c = cinema.loc[:, cols]` コピーを削除\n",
"* マスクは **NumPy 配列**で作成(`to_numpy()`)\n",
"* 並び替えは **Series.nlargest** でインデックスを取得 → 最後に列投影\n",
"\n",
"```python\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"def select_non_boring_odd_movies(cinema: pd.DataFrame) -> pd.DataFrame:\n",
" \"\"\"\n",
" Returns:\n",
" pd.DataFrame: ['id', 'movie', 'description', 'rating'] sorted by rating descending\n",
" \"\"\"\n",
" id_arr = cinema['id'].to_numpy()\n",
" desc_arr = cinema['description'].to_numpy()\n",
" # Exclude NULL + exclude 'boring' + odd ID.\n",
" # notna() is used for the NULL test because (desc_arr == desc_arr) is True for None.\n",
" mask = ((id_arr & 1) == 1) & cinema['description'].notna().to_numpy() & (desc_arr != 'boring')\n",
"\n",
" # Index labels of the selected rows (assumes a unique index)\n",
" idx = cinema.index[mask]\n",
" # Labels in descending rating order via Series.nlargest (lighter than DataFrame.nlargest)\n",
" top_idx = cinema.loc[idx, 'rating'].nlargest(idx.size).index\n",
"\n",
" return cinema.loc[top_idx, ['id', 'movie', 'description', 'rating']]\n",
"```\n",
"\n",
"Runtime 261 ms, Beats 65.91%; Memory 67.39 MB, Beats 28.60%\n",
"\n",
"**ねらい**\n",
"\n",
"* `Series.nlargest` は対象列だけを扱うため、`DataFrame.nlargest` よりメモリアクセスが少なく、速くなりやすいです。\n",
"* 中間 DataFrame を作らないのでコピー削減(メモリ・CPU ともに軽くなる傾向)。\n",
"\n",
"---\n",
"\n",
"## 2) 攻めの最速版(NumPy 主導)\n",
"\n",
"ポイント\n",
"\n",
"* 並び替えを **`np.argsort`** に完全委譲(Pandas のインデクサ組み立てコストを最小化)\n",
"* 全行の `rating` を配列で一度読むだけ\n",
"\n",
"```python\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"def select_non_boring_odd_movies(cinema: pd.DataFrame) -> pd.DataFrame:\n",
" \"\"\"\n",
" Returns:\n",
" pd.DataFrame: ['id', 'movie', 'description', 'rating'] sorted by rating descending\n",
" \"\"\"\n",
" id_arr = cinema['id'].to_numpy()\n",
" desc_arr = cinema['description'].to_numpy()\n",
" rate_arr = cinema['rating'].to_numpy()\n",
"\n",
" # notna() is used for the NULL test because (desc_arr == desc_arr) is True for None.\n",
" mask = ((id_arr & 1) == 1) & cinema['description'].notna().to_numpy() & (desc_arr != 'boring')\n",
" sel = np.flatnonzero(mask)\n",
"\n",
" # Positions (within the selection) in descending rating order\n",
" order_in_sel = np.argsort(rate_arr[sel])[::-1]\n",
" row_pos = sel[order_in_sel]\n",
"\n",
" return cinema.iloc[row_pos, :][['id', 'movie', 'description', 'rating']]\n",
"```\n",
"\n",
"Runtime 256 ms, Beats 76.61%; Memory 67.11 MB, Beats 68.84%\n",
"\n",
"**ねらい**\n",
"\n",
"* `nlargest(len)` は実質「全件降順」と同義で、内部での選択+部分ソートとはいえコストが大きいことがあります。\n",
"* `np.argsort` は純配列上で高速に動き、**行位置配列→`iloc`** の流れが非常に軽いです。\n",
"\n",
"---\n",
"\n",
"## 3) 追加の細かな最適化ヒント\n",
"\n",
"* **dtype の見直し**\n",
"\n",
" * `id` を `int32`、`rating` を `float32` に落とせるならメモリ削減(CPU キャッシュ効率↑)。\n",
" * 文字列が多い場合は `pd.StringDtype()`(もしくは pyarrow backend があれば `string[pyarrow]`)でフットプリントを縮小。\n",
"* **列アクセスの一貫性**\n",
"\n",
" * 同じ列を複数回使うときは **一度配列化して再利用**(上記コードのように `to_numpy()` を 1 回だけ呼ぶ)。\n",
"* **条件の確定順序**\n",
"\n",
" * 選択性が高い条件(今回なら `id & 1` よりも `description` 判定の方が効くケースが多い)を先に評価しても、NumPy のブール演算は短絡しないため実行順で速度は大きく変わりません。配列化して一発で作る方が速いです。\n",
"\n",
"---\n",
"\n",
"## 4) 期待効果の目安\n",
"\n",
"* 低リスク版:中間コピー削減と `Series.nlargest` 採用で **10–25% 程度短縮**が見込めることが多いです。\n",
"* NumPy 版:データサイズや列数にもよりますが、**さらに数ms〜数十ms** 改善するケースがあります。\n",
"\n",
"> まずは **低リスク版** → 効果が物足りなければ **NumPy 版**、の順でお試しを。\n",
"> それでもまだ詰める必要があれば、dtype 最適化や前処理段階でのフィルタ(上流で奇数IDだけ渡す等)をご検討ください。\n",
"\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}