Merged
363 changes: 363 additions & 0 deletions Shell/Bash/Leetcode/192. Word Frequency/WordFrequency.ipynb
@@ -0,0 +1,363 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "43f342e3",
"metadata": {},
"source": [
"# 192. Word Frequency - Bash解法\n",
"\n",
"## 問題概要\n",
"\n",
"テキストファイル `words.txt` から各単語の出現頻度を集計し、頻度の降順で出力する問題です。\n",
"\n",
"---\n",
"\n",
"## 解答(スクリプト版)\n",
"\n",
"`wordfreq.sh`(Bash, POSIX ツールのみ)\n",
"\n",
"```bash\n",
"#!/usr/bin/env bash\n",
"set -euo pipefail\n",
"\n",
"# 使い方: ./wordfreq.sh [path/to/words.txt]\n",
"# 引数が未指定なら ./words.txt を読む\n",
"input=\"${1:-words.txt}\"\n",
"\n",
"# 1) 全ての空白(スペース/タブ/改行など)を改行にし、連続空白は1つに圧縮\n",
"# 2) ソート\n",
"# 3) uniq -c で頻度集計\n",
"# 4) 頻度(第1列)で数値降順ソート\n",
"# 5) \"単語 頻度\" の並びに整形\n",
"LC_ALL=C tr -s '[:space:]' '\\n' < \"$input\" \\\n",
" | sort \\\n",
" | uniq -c \\\n",
" | sort -nr \\\n",
" | awk '{print $2, $1}'\n",
Comment on lines +20 to +37


🧹 Nitpick | 🔵 Trivial

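I recommend adding a check that the input file exists.

With `set -e` the script does exit when the file is missing, but the shell's error message is not user-friendly. An explicit validation step makes debugging easier.

♻️ Suggested fix
 input="${1:-words.txt}"
 
+# Verify that the input file exists
+if [[ ! -f "$input" ]]; then
+  echo "Error: File '$input' not found." >&2
+  exit 1
+fi
+
 # 1) Turn every whitespace character (space/tab/newline, etc.) into a newline and squeeze runs into one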
🤖 Prompt for AI Agents
In `@Shell/Bash/Leetcode/192`. Word Frequency/WordFrequency.ipynb around lines 20
- 37, The script lacks a user-friendly check that the input file (variable
input) exists and is readable before running the pipeline (LC_ALL=C tr ... |
sort ... | awk ...); add an explicit validation right after input is set that
tests the file (e.g. [ -r "$input" ] or [ -f "$input" ] && [ -r "$input" ]) and,
if the check fails, print a clear error message to stderr (using the input
variable for context) and exit with a non-zero status (exit 1) so users get a
helpful message instead of a generic shell failure from set -euo pipefail.
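
A minimal sketch of the suggested validation, assuming the pipeline is wrapped in a function. The name `word_freq` and the error wording are illustrative and do not appear in the PR. It tests readability with `[ -r ... ]` in addition to existence, as the prompt above proposes:

```shell
# word_freq is a hypothetical wrapper name, not part of the PR.
# Prints "word count" pairs for the given file, most frequent first.
word_freq() {
  local input="${1:-words.txt}"

  # Fail early with a clear message if the file is missing or unreadable.
  if [ ! -f "$input" ] || [ ! -r "$input" ]; then
    echo "Error: cannot read input file '$input'." >&2
    return 1
  fi

  LC_ALL=C tr -s '[:space:]' '\n' < "$input" \
    | sort \
    | uniq -c \
    | sort -nr \
    | awk '{print $2, $1}'
}
```

Calling `word_freq /path/to/words.txt` behaves like the script in the notebook, while a missing path gives a one-line error on stderr and a non-zero return status.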

"```\n",
"\n",
"### 実行方法\n",
"\n",
"```bash\n",
"chmod +x wordfreq.sh\n",
"./wordfreq.sh # カレントの words.txt を集計\n",
"# もしくは\n",
"./wordfreq.sh /path/to/words.txt\n",
"```\n",
"\n",
"---\n",
"\n",
"## 解答(パイプのみの1行版)\n",
"\n",
"```bash\n",
"LC_ALL=C tr -s '[:space:]' '\\n' < words.txt | sort | uniq -c | sort -nr | awk '{print $2, $1}'\n",
"```\n",
"\n",
"---\n",
"\n",
"## 入出力例\n",
"\n",
"### 入力 (`words.txt`)\n",
"\n",
"```text\n",
"the day is sunny the the\n",
"the sunny is is\n",
"```\n",
"\n",
"### 出力\n",
"\n",
"```text\n",
"the 4\n",
"is 3\n",
"sunny 2\n",
"day 1\n",
"```\n",
"\n",
"---\n",
"\n",
"## 処理フロー図解\n",
"\n",
"```mermaid\n",
"flowchart LR\n",
" A[\"words.txt<br/>入力ファイル\"] --> B[\"<code>tr -s &#91;:space:&#93; \\\\n</code><br/>全ての空白→改行<br/>連続空白を1つに圧縮\"]\n",
" B --> C[\"<code>sort</code><br/>辞書順整列\"]\n",
" C --> D[\"<code>uniq -c</code><br/>連続同一語をカウント\"]\n",
" D --> E[\"<code>sort -nr</code><br/>頻度で降順ソート\"]\n",
" E --> F[\"<code>awk &#123;print $2, $1&#125;</code><br/>「単語 頻度」形式に整形\"]\n",
" F --> G[\"結果出力\"]\n",
"```\n",
"\n",
"---\n",
"\n",
"## ステップ別の処理詳細\n",
"\n",
"### 入力データ\n",
"\n",
"```text\n",
"the day is sunny the the\n",
"the sunny is is\n",
"```\n",
"\n",
"### ステップ1: `tr -s '[:space:]' '\\n'`\n",
"\n",
"全ての空白文字(スペース・タブ・改行)を改行に変換し、連続する空白は1つに圧縮します。\n",
"\n",
"```text\n",
"the\n",
"day\n",
"is\n",
"sunny\n",
"the\n",
"the\n",
"the\n",
"sunny\n",
"is\n",
"is\n",
"```\n",
"\n",
"### ステップ2: `sort`\n",
"\n",
"単語を辞書順に整列します(`uniq -c` は連続した同一行のみカウントするため必須)。\n",
"\n",
"```text\n",
"day\n",
"is\n",
"is\n",
"is\n",
"sunny\n",
"sunny\n",
"the\n",
"the\n",
"the\n",
"the\n",
"```\n",
"\n",
"### ステップ3: `uniq -c`\n",
"\n",
"連続する同一単語をカウントします。\n",
"\n",
"```text\n",
" 1 day\n",
" 3 is\n",
" 2 sunny\n",
" 4 the\n",
"```\n",
"\n",
"### ステップ4: `sort -nr`\n",
"\n",
"頻度(第1列)で数値降順ソートします。\n",
"\n",
"```text\n",
" 4 the\n",
" 3 is\n",
" 2 sunny\n",
" 1 day\n",
"```\n",
"\n",
"### ステップ5: `awk '{print $2, $1}'`\n",
"\n",
"「単語 頻度」の形式に整形します。\n",
"\n",
"```text\n",
"the 4\n",
"is 3\n",
"sunny 2\n",
"day 1\n",
"```\n",
"\n",
"---\n",
"\n",
"## アルゴリズムの解説\n",
"\n",
"### なぜこの順番なのか?\n",
"\n",
"1. **`tr` で正規化**\n",
" - 様々な空白文字(スペース・タブ・改行)を統一的に処理\n",
" - 連続空白の圧縮により空行を防止\n",
"\n",
"2. **最初の `sort` が必須**\n",
" - `uniq -c` は**連続した**同一行のみカウント\n",
" - 事前に整列することで同じ単語を隣接させる\n",
"\n",
"3. **`uniq -c` で集計**\n",
" - 連続する同一単語の出現回数をカウント\n",
" - 出力形式: `<頻度> <単語>`\n",
"\n",
"4. **`sort -nr` で降順**\n",
" - `-n`: 数値としてソート\n",
" - `-r`: 降順(reverse)\n",
"\n",
"5. **`awk` で整形**\n",
" - 列の順序を入れ替え: `$2 $1` → `<単語> <頻度>`\n",
"\n",
"---\n",
"\n",
"## 代替解法(awk メイン)\n",
"\n",
"`awk` の連想配列を使った方法:\n",
"\n",
"```bash\n",
"awk '{for(i=1;i<=NF;i++) c[$i]++} END{for(w in c) print w, c[w]}' words.txt \\\n",
" | LC_ALL=C sort -k2,2nr\n",
"```\n",
"\n",
"### 処理の流れ\n",
"\n",
"1. `awk` で各単語をカウント\n",
" - `NF`: 行内のフィールド数(空白区切り)\n",
" - `c[$i]++`: 連想配列でカウント\n",
"\n",
"2. `END` ブロックで出力\n",
" - `for(w in c)`: 全ての単語をループ\n",
" - `print w, c[w]`: 単語と頻度を出力\n",
"\n",
"3. `sort -k2,2nr` で頻度降順ソート\n",
" - `-k2,2`: 第2列(頻度)でソート\n",
" - `n`: 数値ソート\n",
" - `r`: 降順\n",
"\n",
"---\n",
"\n",
"## パフォーマンス最適化のポイント\n",
"\n",
"### 1. ロケール設定\n",
"\n",
"```bash\n",
"LC_ALL=C\n",
"```\n",
"\n",
"- C ロケールを使用することで `sort` が高速化\n",
"- バイト単位の比較により安定した動作\n",
"\n",
"### 2. 空行の除去(必要に応じて)\n",
"\n",
"`tr -s` を使っていれば基本的に不要ですが、念のため:\n",
"\n",
"```bash\n",
"... | grep -v '^$' | ...\n",
"```\n",
"\n",
"### 3. 入力ファイルの柔軟な指定\n",
"\n",
"スクリプト版では引数でファイルパスを指定可能:\n",
"\n",
"```bash\n",
"input=\"${1:-words.txt}\"\n",
"```\n",
"\n",
"---\n",
"\n",
"## 応用例\n",
"\n",
"### 圧縮ファイルの処理\n",
"\n",
"```bash\n",
"zcat compressed.txt.gz | tr -s '[:space:]' '\\n' | sort | uniq -c | sort -nr | awk '{print $2, $1}'\n",
"```\n",
"\n",
"### ストリーム処理\n",
"\n",
"```bash\n",
"curl -s https://example.com/text.txt | tr -s '[:space:]' '\\n' | sort | uniq -c | sort -nr | awk '{print $2, $1}'\n",
Comment on lines +259 to +262


🧹 Nitpick | 🔵 Trivial

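Consider making the stream-processing example more robust.

`curl -s` suppresses the progress display, but on an HTTP error it still writes the server's response body to stdout. If production use is in view, I recommend adding the `-f` option.

curl -sf https://example.com/text.txt | tr -s '[:space:]' '\n' | ...

With `-f` (`--fail`), curl emits no output on an HTTP error, which keeps the error page from entering the pipeline as input.

🤖 Prompt for AI Agents
In `@Shell/Bash/Leetcode/192`. Word Frequency/WordFrequency.ipynb around lines 259
- 262, the curl command used in the stream-processing example ("curl -s https://example.com/text.txt | tr -s
'[:space:]' '\n' | sort | uniq -c | sort -nr | awk '{print $2, $1}'") can still pass
output downstream on an HTTP error, so add -f (together with -s if needed) to curl for robustness: change the curl
command in that cell to "curl -sf ..." so that no invalid input flows into the pipeline when an HTTP error occurs.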
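
A hedged sketch of how the stream example might look with the suggested flag. The counting stage is pulled into a function that reads standard input, so it can be exercised without network access. `count_words` is a hypothetical name, and the URL in the usage note below is the placeholder from the notebook:

```shell
# count_words is a hypothetical helper name, not part of the PR.
# Reads text on stdin and prints "word count" pairs, most frequent first.
count_words() {
  LC_ALL=C tr -s '[:space:]' '\n' \
    | sort \
    | uniq -c \
    | sort -nr \
    | awk '{print $2, $1}'
}
```

It could then be fed with `curl -sf https://example.com/text.txt | count_words`. Note that `-f` only stops the error body from reaching the pipeline; in Bash, `set -o pipefail` is additionally needed for curl's non-zero exit status to become the status of the whole pipeline.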

"```\n",
"\n",
"### 大文字小文字を区別しない\n",
"\n",
"```bash\n",
"LC_ALL=C tr -s '[:space:]' '\\n' < words.txt \\\n",
" | tr '[:upper:]' '[:lower:]' \\\n",
" | sort \\\n",
" | uniq -c \\\n",
" | sort -nr \\\n",
" | awk '{print $2, $1}'\n",
"```\n",
"\n",
"---\n",
"\n",
"## よくある質問\n",
"\n",
"### Q1: `tr -s` の `-s` オプションは何をする?\n",
"\n",
"**A:** `-s` (squeeze) は連続する文字を1つに圧縮します。\n",
"\n",
"```bash\n",
"# 例: 連続するスペースを1つに\n",
"echo \"a b c\" | tr -s ' '\n",
"# 出力: a b c\n",
"```\n",
"\n",
"### Q2: なぜ `LC_ALL=C` を使うのか?\n",
"\n",
"**A:** \n",
"- ロケール依存の文字比較を避ける\n",
"- バイト単位の比較で高速化\n",
"- 環境による動作の違いを防ぐ\n",
"\n",
"### Q3: `uniq -c` の出力形式は?\n",
"\n",
"**A:** `<頻度><スペース><単語>` の形式で出力されます。\n",
"\n",
"```text\n",
" 4 the\n",
" 3 is\n",
"```\n",
"\n",
"先頭にスペースが入るため、`awk` で列を入れ替える際は `$1` が頻度、`$2` が単語になります。\n",
"\n",
"---\n",
"\n",
"## Mermaid図の注意点\n",
"\n",
"Mermaid でコマンドを含むラベルを書く際の安全な記法:\n",
"\n",
"### 特殊文字のエスケープ\n",
"\n",
"- 角かっこ `[` `]` → `&#91;` `&#93;`\n",
"- 波かっこ `{` `}` → `&#123;` `&#125;`\n",
"- バックスラッシュ `\\` → `\\\\`\n",
"- シングルクォート `'` → `&apos;`(必要な場合)\n",
"\n",
"### 推奨記法\n",
"\n",
"```mermaid\n",
"flowchart LR\n",
" A[\"ノード名\"] --> B[\"<code>コマンド</code><br/>説明文\"]\n",
"```\n",
"\n",
"- ラベル全体を二重引用符 `[\"...\"]` で囲む\n",
"- コマンド部分は `<code>` タグで囲む\n",
"- 改行は `<br/>` を使用\n",
"\n",
"---\n",
"\n",
"## まとめ\n",
"\n",
"この問題の解法ポイント:\n",
"\n",
"1. **`tr`** で空白を正規化\n",
"2. **`sort`** で同一単語を隣接させる\n",
"3. **`uniq -c`** で頻度をカウント\n",
"4. **`sort -nr`** で頻度降順ソート\n",
"5. **`awk`** で出力形式を整形\n",
"\n",
"シンプルな POSIX ツールの組み合わせで効率的に処理できます。\n",
"\n",
"主な改善点:\n",
"1. 重複セクションを完全に削除\n",
"2. 構造を論理的に整理(問題→解答→詳細→応用)\n",
"3. Mermaid図を1つに統一(安全な記法を使用)\n",
"4. よくある質問セクションを追加\n",
"5. 応用例を充実\n",
"6. Mermaid記法の注意点を最後にまとめ"
Comment on lines +346 to +352


⚠️ Potential issue | 🟡 Minor

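Development notes are left in the document.

The "Key improvements" (主な改善点) section (items 1-6) looks like an internal memo from refactoring. It is unnecessary in an end-user tutorial, and I recommend removing it.

🔧 Suggested fix
 These simple POSIX tools combine to handle the task efficiently.
-\n",
-    "\n",
-    "Key improvements:\n",
-    "1. Removed duplicated sections entirely\n",
-    "2. Organized the structure logically (problem → solution → details → further examples)\n",
-    "3. Consolidated the Mermaid diagrams into one (using safe notation)\n",
-    "4. Added a frequently asked questions section\n",
-    "5. Expanded the further examples\n",
-    "6. Collected the Mermaid notation notes at the end"
+"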
🤖 Prompt for AI Agents
In `@Shell/Bash/Leetcode/192`. Word Frequency/WordFrequency.ipynb around lines 346
- 352, Remove the developer notes block titled "Key improvements:" (主な改善点:), the markdown cell
containing items 1–6, from the notebook so end-user documentation contains only
the problem, solution, details and examples; locate the markdown cell that
begins with that title and delete it (and any duplicated-section cells), then
verify there are no other internal-refactor notes remaining and update the
notebook's visible sections/TOC if any headings or links referenced that removed
cell.

]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
Comment on lines +356 to +360


🧹 Nitpick | 🔵 Trivial

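About the notebook metadata.

`language_info.name` is set to `"python"`, but this is a Bash tutorial. That is harmless for now because the notebook is Markdown-only with no executable cells, but if Bash code cells are added later, consider switching to a Bash kernel configuration.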

🤖 Prompt for AI Agents
In `@Shell/Bash/Leetcode/192`. Word Frequency/WordFrequency.ipynb around lines 356
- 360, Noted that notebook metadata sets language_info.name = "python" while
this is a Bash tutorial; update the notebook metadata
(metadata.language_info.name) to a bash-appropriate kernel name (e.g., "bash" or
the exact kernel spec you intend) or remove the language_info entry if you
prefer to keep the notebook markdown-only; ensure the kernel spec matches any
future bash code cells by updating metadata.kernelspec.name and
metadata.kernelspec.display_name accordingly so bash cells will execute
correctly.

"nbformat": 4,
"nbformat_minor": 5
}