Commit cada98b

Merge pull request #210 from myoshi2891/dev-from-macmini
SQL: Basic Join 1075. Project Employees I
2 parents a97ca10 + 5c96055 commit cada98b

2 files changed

Lines changed: 698 additions & 0 deletions

Lines changed: 324 additions & 0 deletions
@@ -0,0 +1,324 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "dd3c3113",
6+
"metadata": {},
7+
"source": [
8+
"## 0) Prerequisites\n",
9+
"\n",
10+
"* Environment: **Python 3.10.15 / pandas 2.2.2**\n",
11+
"* **Follow the specified signature exactly**\n",
12+
"\n",
13+
"  * Function name: `project_employees`\n",
14+
"  * Argument names: `project`, `employee`\n",
15+
"  * Returned columns: `[\"project_id\", \"average_years\"]`\n",
16+
"  * Column order: as listed above\n",
17+
"* No I/O (files or standard output); `print` and `sort_values` are not used\n",
18+
"\n",
19+
"---\n",
20+
"\n",
21+
"## 1) Problem\n",
22+
"\n",
23+
"* Problem statement:\n",
24+
"  For each project, compute the average years of experience (`experience_years`)\n",
25+
"  of the employees assigned to it, **rounded to 2 decimal places**.\n",
26+
"\n",
27+
"* Input DataFrames:\n",
28+
"\n",
29+
"  * `project: pd.DataFrame`\n",
30+
"\n",
31+
"    | column      | dtype |\n",
32+
"    | ----------- | ----- |\n",
33+
"    | project_id  | int   |\n",
34+
"    | employee_id | int   |\n",
35+
"\n",
36+
"    Each row means that employee `employee_id` belongs to project `project_id`.\n",
37+
"\n",
38+
"  * `employee: pd.DataFrame`\n",
39+
"\n",
40+
"    | column           | dtype  |\n",
41+
"    | ---------------- | ------ |\n",
42+
"    | employee_id      | int    |\n",
43+
"    | name             | object |\n",
44+
"    | experience_years | int    |\n",
45+
"\n",
46+
"    Each row describes one employee. `experience_years` contains no NULLs.\n",
47+
"\n",
48+
"* Output:\n",
49+
"\n",
50+
"  * Return value: `pd.DataFrame`\n",
51+
"\n",
52+
"  * Columns and their meaning:\n",
53+
"\n",
54+
"    * `project_id`: project ID\n",
55+
"    * `average_years`: mean `experience_years` of the employees on that project (rounded to 2 decimal places)\n",
56+
"\n",
57+
"  * One row per `project_id`\n",
58+
"\n",
59+
"  * Row order is arbitrary (no sorting, since `sort_values` is not allowed)\n",
60+
"\n",
61+
"---\n",
62+
"\n",
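63+
"## 2) Implementation (specified signature)\n",
64+
"\n",
65+
"> The steps are `merge` → `groupby.mean` → `round`, carrying only the columns that are needed.\n",
66+
"> No within-group ranking or conditional filtering is required here, so a plain aggregation is enough.\n",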
67+
"\n",
68+
"```python\n",
69+
"import pandas as pd\n",
70+
"\n",
71+
"def project_employees(project: pd.DataFrame, employee: pd.DataFrame) -> pd.DataFrame:\n",
72+
"    \"\"\"\n",
73+
"    Compute the average years of experience for each project.\n",
74+
"\n",
75+
"    Args:\n",
76+
"        project (pd.DataFrame): columns ['project_id', 'employee_id']\n",
77+
"        employee (pd.DataFrame): columns ['employee_id', 'name', 'experience_years']\n",
78+
"\n",
79+
"    Returns:\n",
80+
"        pd.DataFrame: columns, in order, ['project_id', 'average_years']\n",
81+
"    \"\"\"\n",
82+
"    # 1) Column pruning: keep only the employee columns needed for the mean\n",
83+
"    emp_exp = employee[[\"employee_id\", \"experience_years\"]]\n",
84+
"\n",
85+
"    # 2) JOIN: attach experience_years to each project row\n",
86+
"    merged = project.merge(emp_exp, on=\"employee_id\", how=\"left\")\n",
87+
"\n",
88+
"    # 3) Mean per project\n",
89+
"    out = (\n",
90+
"        merged\n",
91+
"        .groupby(\"project_id\", as_index=False)[\"experience_years\"]\n",
92+
"        .mean()\n",
93+
"    )\n",
94+
"\n",
95+
"    # 4) Rename the column per the spec and round to 2 decimal places\n",
96+
"    out = out.rename(columns={\"experience_years\": \"average_years\"})\n",
97+
"    out[\"average_years\"] = out[\"average_years\"].round(2)\n",
98+
"\n",
99+
"    return out\n",
100+
"```\n",
101+
"\n",
102+
"LeetCode Analyze Complexity result:\n",
103+
"\n",
104+
"* Runtime: 283 ms\n",
105+
"  * Beats 62.61%\n",
106+
"* Memory: 69.14 MB\n",
107+
"  * Beats 18.00%\n",
108+
"\n",
109+
"---\n",
110+
"\n",
111+
"## 3) Algorithm walkthrough\n",
112+
"\n",
113+
"### APIs used\n",
114+
"\n",
115+
"* `DataFrame[...]`\n",
116+
"  → Takes a column subset for **column pruning** (saves memory by not carrying unneeded columns).\n",
117+
"* `DataFrame.merge`\n",
118+
"  → Joins `project` and `employee` on `employee_id`, attaching each employee's experience to the project rows.\n",
119+
"* `DataFrame.groupby` + `GroupBy.mean`\n",
120+
"  → Computes the mean of `experience_years` per `project_id`.\n",
121+
"* `DataFrame.rename`\n",
122+
"  → Renames the output column to `average_years`, as the problem specifies.\n",
123+
"* `Series.round`\n",
124+
"  → Rounds the mean to 2 decimal places.\n",
125+
"\n",
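126+
"### Handling of NULLs, duplicates, and dtypes\n",
127+
"\n",
128+
"* The problem statement guarantees **no NULLs** in `experience_years`, so `mean()` needs no special NULL handling.\n",
129+
"* On the `project` side, `(project_id, employee_id)` is the primary key, so\n",
130+
"  the same employee never appears twice in the same project and no double counting occurs.\n",
131+
"* `mean()` returns `float64`.\n",
132+
"  `round(2)` rounds the stored value to 2 decimal places; whether `2.0` is displayed as `2.00` depends on pandas display settings.\n",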
133+
"\n",
134+
"---\n",
135+
"\n",
136+
"## 4) Complexity (rough estimate)\n",
137+
"\n",
138+
"Let `N = len(project)` and `M = len(employee)`.\n",
139+
"\n",
140+
"* Column pruning: `employee[[\"employee_id\", \"experience_years\"]]`\n",
141+
"  → **O(M)**\n",
142+
"* `merge` (assuming a hash join): `project.merge(emp_exp, on=\"employee_id\")`\n",
143+
"  → roughly **O(N + M)**\n",
144+
"* `groupby(\"project_id\").mean()`\n",
145+
"  → **O(N)** (hash-based group aggregation)\n",
146+
"\n",
147+
"Overall, the time is roughly **O(N + M)**, and the memory is dominated by\n",
148+
"the temporary DataFrame produced by the join (about N rows, 3 columns).\n",
149+
"\n",
150+
"---\n",
151+
"\n",
152+
"## 5) Diagram (Mermaid, conservative syntax)\n",
153+
"\n",
154+
"```mermaid\n",
155+
"flowchart TD\n",
156+
"    A[project<br/>project_id, employee_id]\n",
157+
"    B[employee<br/>employee_id, experience_years, name]\n",
158+
"    C[Column pruning<br/>employee → employee_id, experience_years only]\n",
159+
"    D[merge<br/>on employee_id]\n",
160+
"    E[\"groupby project_id<br/>mean(experience_years)\"]\n",
161+
"    F[\"rename + round(2)<br/>average_years\"]\n",
162+
"    G[Output<br/>project_id, average_years]\n",
163+
"\n",
164+
"    B --> C\n",
165+
"    A --> D\n",
166+
"    C --> D\n",
167+
"    D --> E\n",
168+
"    E --> F\n",
169+
"    F --> G\n",
170+
"```\n",
171+
"\n",
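172+
"In this form, the code should be usable as-is when pasted into LeetCode's pandas problem \"Project Employees I\".\n",
173+
"\n",
174+
"In short:\n",
175+
"\n",
176+
"* **At the level of asymptotic complexity this is close to the limit, so a large speedup is unlikely**\n",
177+
"* However, **switching from `merge` to `map` leaves some room to reduce memory and slightly improve runtime**\n",
178+
"* If the goal is to improve memory usage in particular (Beats 18%), the key is to make the join itself lighter\n",
179+
"\n",
180+
"That is the overall picture.\n",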
181+
"\n",
182+
"---\n",
183+
"\n",
184+
"## 1) Recap of the current approach\n",
185+
"\n",
186+
"Your current code, in summary:\n",
187+
"\n",
188+
"```python\n",
189+
"emp_exp = employee[[\"employee_id\", \"experience_years\"]]\n",
190+
"\n",
191+
"merged = project.merge(emp_exp, on=\"employee_id\", how=\"left\")\n",
192+
"\n",
193+
"out = (\n",
194+
"    merged\n",
195+
"    .groupby(\"project_id\", as_index=False)[\"experience_years\"]\n",
196+
"    .mean()\n",
197+
")\n",
198+
"\n",
199+
"out = out.rename(columns={\"experience_years\": \"average_years\"})\n",
200+
"out[\"average_years\"] = out[\"average_years\"].round(2)\n",
201+
"```\n",
202+
"\n",
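203+
"What it does is entirely correct, and algorithmically it costs\n",
204+
"\n",
205+
"* Join: O(N + M)\n",
206+
"* groupby: O(N)\n",
207+
"\n",
208+
"so **no change from here can improve the asymptotic order**.\n",
209+
"\n",
210+
"LeetCode's figure of 283 ms / Beats 62% is also on the \"good enough\" side,\n",
211+
"allowing for environment noise.\n",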
212+
"\n",
213+
"---\n",
214+
"\n",
215+
"## 2) Improvement: replace `merge` with `map` to lighten the join\n",
216+
"\n",
217+
"`project` already contains `employee_id`, so\n",
218+
"\n",
219+
"> rather than widening the rows with a `merge`,\n",
220+
"> **building an `employee_id → experience_years` mapping and applying it with `map`**\n",
221+
"\n",
222+
"can be slightly more memory-friendly.\n",
223+
"\n",
224+
"### Revised code (`map`-based)\n",
225+
"\n",
226+
"```python\n",
227+
"import pandas as pd\n",
228+
"\n",
229+
"def project_employees(project: pd.DataFrame, employee: pd.DataFrame) -> pd.DataFrame:\n",
230+
"    \"\"\"\n",
231+
"    Compute the average years of experience for each project.\n",
232+
"\n",
233+
"    Args:\n",
234+
"        project (pd.DataFrame): columns ['project_id', 'employee_id']\n",
235+
"        employee (pd.DataFrame): columns ['employee_id', 'name', 'experience_years']\n",
236+
"\n",
237+
"    Returns:\n",
238+
"        pd.DataFrame: columns, in order, ['project_id', 'average_years']\n",
239+
"    \"\"\"\n",
240+
"    # 1) Build the employee_id -> experience_years map (column pruning + indexing)\n",
241+
"    emp_exp = employee.set_index(\"employee_id\")[\"experience_years\"]\n",
242+
"\n",
243+
"    # 2) Attach the matching experience column to project (map instead of merge)\n",
244+
"    #    Column pruning: use only the needed columns on the project side too\n",
245+
"    proj = project[[\"project_id\", \"employee_id\"]].copy()\n",
246+
"    proj[\"experience_years\"] = proj[\"employee_id\"].map(emp_exp)\n",
247+
"\n",
248+
"    # 3) Drop employee_id, which the aggregation does not need (a small memory saving)\n",
249+
"    proj = proj[[\"project_id\", \"experience_years\"]]\n",
250+
"\n",
251+
"    # 4) Mean per project\n",
252+
"    out = (\n",
253+
"        proj\n",
254+
"        .groupby(\"project_id\", as_index=False)[\"experience_years\"]\n",
255+
"        .mean()\n",
256+
"    )\n",
257+
"\n",
258+
"    # 5) Rename the column and round to 2 decimal places\n",
259+
"    out = out.rename(columns={\"experience_years\": \"average_years\"})\n",
260+
"    out[\"average_years\"] = out[\"average_years\"].round(2)\n",
261+
"\n",
262+
"    return out\n",
263+
"```\n",
264+
"\n",
265+
"LeetCode Analyze Complexity result:\n",
266+
"\n",
267+
"* Runtime: 283 ms\n",
268+
"  * Beats 62.61%\n",
269+
"* Memory: 68.14 MB\n",
270+
"  * Beats 98.61%\n",
271+
"\n",
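272+
"### What this version aims for\n",
273+
"\n",
274+
"* `merge` builds a new DataFrame that carries the columns of both the left and right sides, so it tends to use more memory\n",
275+
"* `map` only fills in a single column from a Series, so it is comparatively light as a way to join\n",
276+
"* The `employee` side is turned into a `Series` with `set_index`,\n",
277+
"  which is the minimal form of a **key → value** map\n",
278+
"\n",
279+
"This may not cut memory dramatically, but\n",
280+
"\n",
281+
"* the temporary objects are somewhat smaller\n",
282+
"* the runtime may also improve slightly\n",
283+
"\n",
284+
"so it may raise the LeetCode memory percentile by a few points.\n",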
285+
"\n",
286+
"---\n",
287+
"\n",
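288+
"## 3) How far should the optimization go?\n",
289+
"\n",
290+
"Frankly, for this problem\n",
291+
"\n",
292+
"* the input size is not extreme\n",
293+
"* the algorithm is capped at \"JOIN + GROUP BY\"\n",
294+
"\n",
295+
"so it is **already among the more optimal forms as a query**.\n",
296+
"\n",
297+
"A figure like 283 ms / Beats 62% will not change dramatically\n",
298+
"from small code tweaks, so\n",
299+
"\n",
300+
"* readability\n",
301+
"* straightforwardness (no odd tricks)\n",
302+
"* resistance to bugs\n",
303+
"\n",
304+
"should come first, and adding a light improvement such as the `map` version above is a good place to stop.\n",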
305+
"\n",
306+
"---\n",
307+
"\n",
308+
"If you later run into\n",
309+
"\n",
310+
"* other LeetCode pandas problems\n",
311+
"* heavier group operations (multiple conditions, top k, window-function-like processing)\n",
312+
"\n",
313+
"those are good practice for putting together a **`groupby.transform` / `rank` / `merge` strategy** as a whole.\n"
314+
]
315+
}
316+
],
317+
"metadata": {
318+
"language_info": {
319+
"name": "python"
320+
}
321+
},
322+
"nbformat": 4,
323+
"nbformat_minor": 5
324+
}
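
As a rough illustration of the O(N + M) estimate in the notebook's complexity section, the same join-then-average can be sketched in plain Python: one dict stands in for the hash join and a second one accumulates per-project sums and counts. The helper name `average_years_by_project` and the sample rows are assumptions made for this sketch and are not part of the commit; it is not a claim about how pandas implements `merge` or `groupby` internally.

```python
from collections import defaultdict


def average_years_by_project(project_rows, employee_rows):
    """Plain-Python sketch of JOIN + GROUP BY mean (hypothetical helper).

    project_rows: iterable of (project_id, employee_id)
    employee_rows: iterable of (employee_id, name, experience_years)
    """
    # O(M): build the employee_id -> experience_years lookup (the "hash join" side).
    years_by_employee = {emp_id: years for emp_id, _name, years in employee_rows}

    # O(N): one pass over the project rows, accumulating [sum, count] per project.
    totals = defaultdict(lambda: [0, 0])
    for project_id, emp_id in project_rows:
        years = years_by_employee.get(emp_id)
        if years is None:
            # Unmatched employee: ignored here (a project with no matches
            # at all is simply omitted in this sketch).
            continue
        totals[project_id][0] += years
        totals[project_id][1] += 1

    # One entry per project, rounded to 2 decimal places.
    return {pid: round(total / count, 2) for pid, (total, count) in totals.items()}


# Illustrative sample data (assumed, not taken from the commit).
sample_project = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 4)]
sample_employee = [(1, "Khaled", 3), (2, "Ali", 2), (3, "John", 1), (4, "Doe", 2)]
result = average_years_by_project(sample_project, sample_employee)
```

Each input row is touched once, which is where the O(N + M) figure comes from; the dict of `[sum, count]` pairs grows with the number of distinct projects rather than with N.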

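
The notebook presents the `merge` version and the `map` version as interchangeable. A minimal check of that, assuming small hand-made frames (the sample data and the helper names `via_merge` and `via_map` are hypothetical and only used here), could look like this:

```python
import pandas as pd

# Illustrative sample frames (assumed, not taken from the commit).
project = pd.DataFrame(
    {"project_id": [1, 1, 1, 2, 2], "employee_id": [1, 2, 3, 1, 4]}
)
employee = pd.DataFrame(
    {
        "employee_id": [1, 2, 3, 4],
        "name": ["Khaled", "Ali", "John", "Doe"],
        "experience_years": [3, 2, 1, 2],
    }
)


def _finish(frame: pd.DataFrame) -> pd.DataFrame:
    # Shared tail: mean per project, rename, round to 2 decimal places.
    out = frame.groupby("project_id", as_index=False)["experience_years"].mean()
    out = out.rename(columns={"experience_years": "average_years"})
    out["average_years"] = out["average_years"].round(2)
    return out


def via_merge(project: pd.DataFrame, employee: pd.DataFrame) -> pd.DataFrame:
    # Left join on employee_id, carrying only the column needed for the mean.
    emp_exp = employee[["employee_id", "experience_years"]]
    return _finish(project.merge(emp_exp, on="employee_id", how="left"))


def via_map(project: pd.DataFrame, employee: pd.DataFrame) -> pd.DataFrame:
    # Look up experience_years through a Series indexed by employee_id.
    lookup = employee.set_index("employee_id")["experience_years"]
    proj = project[["project_id"]].copy()
    proj["experience_years"] = project["employee_id"].map(lookup)
    return _finish(proj)


merge_result = via_merge(project, employee)
map_result = via_map(project, employee)

# Raises AssertionError if the two approaches disagree on values or dtypes.
pd.testing.assert_frame_equal(merge_result, map_result)
```

One design point worth keeping in mind: to my understanding, `Series.map` with a Series as the lookup relies on the lookup's index being unique, which holds here only because `employee_id` is the primary key of `employee`.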