fix(agent-core): terminate MCP stdio process tree on close#1038
Conversation
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d3aba54c28
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| async function ensureProcessTreeTerminated(pid: number | null | undefined): Promise<void> { | ||
| if (pid === null || pid === undefined || pid <= 0) return; | ||
|
|
||
| if (await waitForProcessExit(pid, PROCESS_EXIT_POLL_MS)) return; |
There was a problem hiding this comment.
Kill the process tree before losing the root process
When client.close() makes the wrapper/server exit before this helper runs (for example an npx/cmd wrapper that terminates on stdin close or SIGTERM while leaving the real MCP server alive), this early return treats the disappeared root PID as success and skips taskkill /T/tree cleanup entirely. At that point the grandchild is orphaned and can no longer be found from the saved parent PID, so the zombie-server scenario described by this fix still leaks processes; the tree needs to be captured/killed before waiting for the immediate child to vanish, or otherwise track descendants separately.
Useful? React with 👍 / 👎.
The @modelcontextprotocol/sdk StdioClientTransport.close() only signals the immediate child process. On Windows, stdio MCP servers are typically launched via npx/uvx wrappers; killing the wrapper leaves the grandchild node server process orphaned. Over reconnects/shutdowns this accumulates many zombie node.exe processes. Capture the transport pid before the SDK clears it, then after the SDK's graceful close escalate to a platform-specific process-tree kill: - Windows: taskkill /T (/F for the force phase) - POSIX: signal the process group, with EPERM/ESRCH fallbacks Add a regression test with a fixture that spawns a long-lived grandchild and asserts both the server and grandchild are reaped within the close deadline.
d3aba54 to
277caba
Compare
|
@chatgpt-codex-connector[bot] 感谢指出这个 race condition。已经更新实现:
测试仍然全部通过(14/14)。 |
|
To use Codex here, create a Codex account and connect to github. |
Problem
On Windows, stdio MCP servers launched via
npx -y <package>end up with two processes: thenpxwrapper and the actualnodeserver. The MCP SDK'sStdioClientTransport.close()only kills the immediate child, so thenodeserver is left orphaned. Over session reconnects/shutdowns this accumulates many zombienode.exeprocesses.Fix
Capture the transport pid before the SDK clears it, then after the SDK's graceful close escalate to a platform-specific process-tree kill:
taskkill /T(with/Ffor the force phase)kill(-pid, SIGTERM/SIGKILL)), withEPERM/ESRCHfallbacksChanges
packages/agent-core/src/mcp/client-stdio.ts: addensureProcessTreeTerminated,killProcessTree,waitForProcessExit,isProcessAlive; call tree cleanup fromcloseStartedClientpackages/agent-core/test/mcp/fixtures/grandchild-stdio-server.mjs: new fixture that spawns a long-lived grandchildpackages/agent-core/test/mcp/client-stdio.test.ts: regression test asserting both server and grandchild are reaped within the close deadlineVerification
pnpm --filter @moonshot-ai/agent-core typecheckpassespnpm --filter @moonshot-ai/agent-core test -- client-stdio.test.tspasses (14/14, including the new test)Fixes the zombie process issue observed when multiple MCP servers are restarted within a single Kimi session.