Skip to content

fix(agent-core): terminate MCP stdio process tree on close#1038

Open
qyz7438 wants to merge 1 commit into
MoonshotAI:mainfrom
qyz7438:fix/mcp-stdio-process-tree-cleanup
Open

fix(agent-core): terminate MCP stdio process tree on close#1038
qyz7438 wants to merge 1 commit into
MoonshotAI:mainfrom
qyz7438:fix/mcp-stdio-process-tree-cleanup

Conversation

@qyz7438

@qyz7438 qyz7438 commented Jun 23, 2026

Copy link
Copy Markdown

Problem

On Windows, stdio MCP servers launched via npx -y <package> end up with two processes: the npx wrapper and the actual node server. The MCP SDK's StdioClientTransport.close() only kills the immediate child, so the node server is left orphaned. Over session reconnects/shutdowns this accumulates many zombie node.exe processes.

Fix

Capture the transport pid before the SDK clears it, then after the SDK's graceful close escalate to a platform-specific process-tree kill:

  • Windows: taskkill /T (with /F for the force phase)
  • POSIX: signal the process group (kill(-pid, SIGTERM/SIGKILL)), with EPERM/ESRCH fallbacks

Changes

  • packages/agent-core/src/mcp/client-stdio.ts: add ensureProcessTreeTerminated, killProcessTree, waitForProcessExit, isProcessAlive; call tree cleanup from closeStartedClient
  • packages/agent-core/test/mcp/fixtures/grandchild-stdio-server.mjs: new fixture that spawns a long-lived grandchild
  • packages/agent-core/test/mcp/client-stdio.test.ts: regression test asserting both server and grandchild are reaped within the close deadline

Verification

  • pnpm --filter @moonshot-ai/agent-core typecheck passes
  • pnpm --filter @moonshot-ai/agent-core test -- client-stdio.test.ts passes (14/14, including the new test)

Fixes the zombie process issue observed when multiple MCP servers are restarted within a single Kimi session.

@changeset-bot

changeset-bot Bot commented Jun 23, 2026

Copy link
Copy Markdown

⚠️ No Changeset found

Latest commit: 277caba

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3aba54c28

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

async function ensureProcessTreeTerminated(pid: number | null | undefined): Promise<void> {
if (pid === null || pid === undefined || pid <= 0) return;

if (await waitForProcessExit(pid, PROCESS_EXIT_POLL_MS)) return;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Kill the process tree before losing the root process

When client.close() makes the wrapper/server exit before this helper runs (for example an npx/cmd wrapper that terminates on stdin close or SIGTERM while leaving the real MCP server alive), this early return treats the disappeared root PID as success and skips taskkill /T/tree cleanup entirely. At that point the grandchild is orphaned and can no longer be found from the saved parent PID, so the zombie-server scenario described by this fix still leaks processes; the tree needs to be captured/killed before waiting for the immediate child to vanish, or otherwise track descendants separately.

Useful? React with 👍 / 👎.

The @modelcontextprotocol/sdk StdioClientTransport.close() only signals the
immediate child process. On Windows, stdio MCP servers are typically launched
via npx/uvx wrappers; killing the wrapper leaves the grandchild node server
process orphaned. Over reconnects/shutdowns this accumulates many zombie
node.exe processes.

Capture the transport pid before the SDK clears it, then after the SDK's
graceful close escalate to a platform-specific process-tree kill:
- Windows: taskkill /T (/F for the force phase)
- POSIX: signal the process group, with EPERM/ESRCH fallbacks

Add a regression test with a fixture that spawns a long-lived grandchild and
asserts both the server and grandchild are reaped within the close deadline.
@qyz7438 qyz7438 force-pushed the fix/mcp-stdio-process-tree-cleanup branch from d3aba54 to 277caba Compare June 23, 2026 13:42
@qyz7438

qyz7438 commented Jun 23, 2026

Copy link
Copy Markdown
Author

@chatgpt-codex-connector[bot] 感谢指出这个 race condition。已经更新实现:

  1. closeStartedClient() 现在会在调用 client.close() 的同时启动 ensureProcessTreeTerminated(),而不是等 SDK 关闭完成后再清理。
  2. ensureProcessTreeTerminated() 改为先直接发起 tree-kill,再等待进程退出。这样就避免了 wrapper 先消失、导致 taskkill /T 来不及枚举后代的场景。

测试仍然全部通过(14/14)。

@chatgpt-codex-connector

Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant