Skip to content

[v1.x] Backport transport test deflakes from main (tests-only)#2837

Merged
maxisbey merged 17 commits into
maxisbey/v1x-interaction-backportfrom
maxisbey/v1x-tests-deflake
Jun 11, 2026
Merged

[v1.x] Backport transport test deflakes from main (tests-only)#2837
maxisbey merged 17 commits into
maxisbey/v1x-interaction-backportfrom
maxisbey/v1x-tests-deflake

Conversation

@maxisbey

Copy link
Copy Markdown
Contributor

Backports the transport test deflaking work from main (#2764, #2765, #2767, #2788) to v1.x. Tests-only: the diff touches nothing under src/.

Stacked on #2717 (maxisbey/v1x-interaction-backport), which supplies the in-process streaming bridge these tests are built on.

Motivation and Context

The v1.x transport tests still run over real sockets with uvicorn servers in threads or subprocesses, and CI pays for it: ephemeral port races, server startup polling, and a real-clock session-timeout bet that loses under scheduler load. main fixed all of this by moving the tests in process; this PR ports those rewrites to v1.x so the v1.x matrix stops flaking on them.

Two of the rewritten files have direct v1.x CI evidence (test_streamable_http_multiple_reconnections and test_json_response failures on Windows legs); the rest share the same socket-based harness and mechanism. The session-timeout deflake matters for #2717 itself: the suite that PR backports still carries the pre-#2788 real-clock test, which fails roughly 1 in 3 full-suite runs under coverage load, once per transport leg. Folding the first three commits here into #2717 instead would work equally well.

Adaptations made for v1.x, each noted in the commit that makes it:

  • The 4-line src/ delta from Run StreamableHTTP transport tests in process instead of over sockets #2767 (a pragma removal and a stale docstring note) is dropped; v1.x has no strict pragma checking, so the retained pragma is harmless.
  • 9 wildcard-Accept-header test params are omitted because v1.x predates the RFC 7231 wildcard parsing from fix: accept wildcard media types in Accept header per RFC 7231 #2152. The v1.x-only deprecation tests (streamablehttp_client alias, deprecated transport params) are re-added in in-process form, so no existing v1.x test is lost.
  • In-process serving exposes two v1.x-specific teardown behaviors that the subprocess harness hid: sse-starlette < 3.0 binds a module-global exit event to the first event loop that serves a response (guarded with a reset fixture, exercised by the lowest-direct legs), and the server transport leaks anyio memory streams whose GC-time warnings would otherwise escape the per-test filters (handled with scoped filters plus a per-test gc.collect() so isolated runs stay clean).
  • The streaming bridge now ignores response messages after the response has completed, matching what a client of a real ASGI server observes, with contract tests pinning the behavior.
  • One ported test verifies inside its connection loop because Python 3.11 loses line tracing after an async-with teardown throws (Bad interaction between asyncio task cancellation and trace functions (only 3.11) python/cpython#106749), which otherwise misreports coverage on 3.11 legs.

How Has This Been Tested?

  • Full v1.x gate (pytest plus 100% line and branch coverage) green throughout the series.
  • The rewritten files were run repeatedly in a scratch CI environment replicating the v1.x matrix (Windows and Ubuntu, Python 3.10 to 3.14, locked and lowest-direct resolutions): roughly 900 repetitions of the affected files with zero failures, including the lowest-direct legs that install sse-starlette 1.6.1.
  • Contrast evidence for the timeout deflake: on the pre-deflake base, 5 of 8 full-suite runs under coverage load failed on test_session_level_timeout_applies_to_every_request; with the virtual-clock version the failure has never reproduced (28 full-suite runs plus 150 isolated repetitions).
  • The full-suite failures that did appear during matrix stress were all in files this PR does not touch (the known Windows stdio child-cleanup and experimental-tasks flakes), which are out of scope here because fixing them requires src/ changes.

Breaking Changes

None. Tests-only; no public interface is touched.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

Draft while #2717 is in review. If #2717 is amended, the only coupling points are tests/interaction/transports/__init__.py (the re-export added here) and the small _bridge.py and conftest.py changes; everything else is independent of its internals. Once #2717 merges this can be retargeted to v1.x.

AI Disclaimer

maxisbey added 15 commits June 11, 2026 14:52
The session-level timeout test now runs on trio's virtual clock. At teardown
the streamable-HTTP client abandons its httpx/httpx-sse response generators;
trio's asyncgen finalizer warns about each one (asyncio finalizes abandoned
generators silently at loop shutdown), and filterwarnings=error turns that
into a test failure. Scope-ignore the two third-party generator signatures,
and make the bridge's response body delegate to the memory stream's own
iterator so the harness itself leaves no abandoned generator.
Tests-only backport to v1.x; adapted from main commit b3025f9.
…ty tests

The three tests that complete the initialize handshake leak anyio memory
streams at transport teardown when run in process. The scoped filters in
tests/interaction/conftest.py cover full-suite runs, but they only load
when that package is collected, so a targeted run of this file alone
turns the ResourceWarning into a failure under filterwarnings=error.
Mirror the module-level filterwarnings marks already used in
tests/shared/test_sse.py.
#2765)

Tests-only backport to v1.x; adapted from main commit ed39e73.
sse-starlette <3.0 stores its exit Event on the AppStatus class the first
time an EventSourceResponse runs; the Event is bound to that test's event
loop and breaks every later in-process SSE response on the same worker.
test_http_unicode.py serves all its requests as EventSourceResponses
(json_response=False) but had no reset fixture, so on a sse-starlette<3.0
install (CI's lowest-direct legs) it could poison the worker for any later
SSE-based test.

- Copy the autouse AppStatus reset fixture into test_http_unicode.py.
- Reset on both sides of the yield, here and in test_sse.py, so each
  module also survives a stale Event left behind by an earlier test.
- Correct the filterwarnings comments in both files: the item-scoped
  markers cannot cover the GC flush at session cleanup, so isolated
  runs without xdist (-n 0) still exit nonzero after all tests pass.
…lters

The module-level filterwarnings marks only apply while a test in this file
is the active warning context, but v1's leaked memory streams sit in
reference cycles and were garbage-collected later: at pytest's
session-unconfigure unraisable sweep (every test in this file passed but
pytest exited 1 when run without xdist, e.g. -n 0 for --pdb debugging), or
during an unrelated test on the same worker, where the global
filterwarnings = ["error"] turned the deallocator warning into a failure
of that innocent test.

An autouse fixture now runs gc.collect() in each test's teardown, so the
deallocator warnings fire where the scoped ignores apply. Verified: every
previously failing single-test and full-file -n 0 invocation now exits 0,
and cross-file runs no longer misattribute the warnings.
…bal exit event

tests/shared/test_streamable_http.py and tests/server/test_streamable_http_security.py
serve responses in process; on sse-starlette <3.0 (what the lowest-direct CI legs
install) the first EventSourceResponse binds AppStatus.should_exit_event to that
test's event loop and every later SSE response in the module fails. Port the
both-sides reset fixture from tests/client/test_http_unicode.py into both files.

Also give the security module the same per-test gc.collect() teardown as
test_streamable_http.py so leaked-stream warnings stay inside the module's scoped
filters, and drop the comment claiming a -n 0 run exits nonzero after all tests
pass: with the flush in place it exits 0.
tests/shared/test_sse.py and tests/client/test_http_unicode.py carry the
same scoped MemoryObject*Stream filterwarnings marks as
tests/shared/test_streamable_http.py and had the same gap: the leaked
streams sit in reference cycles, so their deallocator warnings fired at
pytest's session-unconfigure unraisable sweep, exiting 1 after all tests
passed when run without xdist (-n 0). Both files documented this in
their filter-mark comments.

Apply the same autouse fixture that fixed test_streamable_http.py — a
gc.collect() in each test's teardown, where the item-scoped ignores
still apply — and update the comments to describe the fixed behavior.

Verified: -n 0 runs of each file now exit 0 (previously 26 passed/exit 1
and 2 passed/exit 1), and five consecutive default-option runs of both
files together pass.
Once an application sends the final http.response.body chunk, the bridge
now drops any further http.response.start/body messages, matching what a
real ASGI server's client observes. Starlette's request_response sends a
trailing response when an endpoint's sub-application has already completed
a rejection response, so the SSE security rejection tests no longer depend
on scheduling for the client to read the first status. Pinned by two new
bridge contract tests (registered as harness self-tests).

Also widen the trio-leg unraisable-warning filter to the whole
httpx/httpx-sse generator chain: abandoning EventSource.aiter_sse abandons
the nested aiter_lines -> aiter_text -> aiter_bytes -> aiter_raw
generators, and which link the finalizer reports depends on GC timing and
Python version.
Statements after the connection loop sit in the Python 3.11 trace-loss
shadow (python/cpython#106749) and were reported uncovered on 3.11
matrix cells; verifying inside the traced region removes the gap
without a coverage exclusion.
The arc fires after the final async-with teardown, inside the same
Python 3.11 trace-loss shadow as the previous commit; only the arc is
excluded, the loop body stays measured.
The checks must stay after the streamable_http_client context exits so a
teardown-time mutation cannot escape them, which on Python 3.11 places
them in the trace-loss shadow of python/cpython#106749: they execute and
assert on every matrix cell but go unmeasured on 3.11. Same convention
as the suite's existing lax exclusions.
@maxisbey maxisbey marked this pull request as ready for review June 11, 2026 15:02
maxisbey added 2 commits June 11, 2026 15:56
The TestChildProcessCleanup tests asserted that spawned writers had
started after fixed sleeps (0.5s startup, 0.3s observation window). On
loaded CI runners the child interpreter can take longer than that to
boot, so the "child should be writing" assertion failed with
"assert 0 > 0" before the cleanup logic under test ever ran.

Replace the fixed sleeps with bounded polling:

- _wait_for_first_write polls until a marker file has grown, proving
  the writer reached its write loop, with a 15s timeout.
- _wait_for_writes_to_stop polls until two samples taken 0.3s apart
  (3x the writers' 0.1s write interval) observe the same size; if the
  file never stops growing the timeout fails the test, so a genuine
  cleanup failure is still reported.

Also terminate the spawned process tree in each test's finally block,
so a failed assertion can no longer leak a running process tree, and
collect garbage before leaving each test so subprocess transports are
finalized while the test's ResourceWarning filters are still active.

Removing the unconditional sleeps also makes the tests faster.
… tests

Two follow-up fixes to the child process cleanup tests:

Require three consecutive stable samples before declaring writes
stopped. The previous check exited on the first pair of samples 0.3s
apart with no growth, retrying within the 15s budget, which made it
easier for a CPU-starved (but alive) writer to be mistaken for a
terminated one. The counter resets on any observed growth, and a file
that never stops growing still fails the test via the timeout.

Only re-terminate the process tree in the finally block if the test
failed before reaching its own _terminate_process_tree call. The
unconditional second call ran termination against an already-closed
job object handle on Windows and logged a spurious fallback warning
on POSIX in every passing run. The skipped branch only executes on
failing runs, so it is excluded from coverage.
@maxisbey maxisbey merged commit a3234b6 into maxisbey/v1x-interaction-backport Jun 11, 2026
19 checks passed
@maxisbey maxisbey deleted the maxisbey/v1x-tests-deflake branch June 11, 2026 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants