gh-149146: Add dealloc-depth counter fallback to trashcan trigger#151595
gh-149146: Add dealloc-depth counter fallback to trashcan trigger#151595bhuvi27 wants to merge 1 commit into
Conversation
Under memory pressure (for example RLIMIT_AS), Python's trashcan trigger in _Py_Dealloc never fires. The trigger relies on _Py_RecursionLimit_GetMargin, which compares the machine stack pointer against c_stack_soft_limit. When RLIMIT_AS prevents the kernel from growing the C stack, the kernel SIGSEGVs while the stack pointer is still megabytes above the soft limit (see sibling issue pythongh-150722 for an LLDB trace showing SP ~7.2 MB above c_stack_hard_limit at the SIGSEGV), so the trashcan never deposits and the recursive tuple_dealloc chain runs out the stack. Add a per-thread c_dealloc_depth counter as a fallback trigger, mirroring the historical _PyTrash_UNWIND_LEVEL=50 protection that was removed when the trashcan was consolidated into _Py_Dealloc. _Py_RecursionLimit_GetMargin remains the primary signal; the counter only kicks in when the stack-pointer signal cannot fire. _PyTrash_thread_destroy_chain itself bumps c_dealloc_depth for the duration of the drain so any _Py_Dealloc invoked while draining cannot recursively re-enter destroy_chain (which would rebuild the same unbounded recursion the trashcan exists to prevent). This mirrors the historical delete_nesting bookkeeping. The counter is per-tstate (no atomics needed in the free-threaded build) and is only touched for GC types, so non-GC types pay zero overhead. Add a regression test in test_gc that builds a 100,000-deep (b, None) tuple chain and asserts that ``del b`` cleans it up without a C-stack overflow.
|
IMO the existing
Issue gh-149146 is really a corner case: when the OS fails to enlarge the stack because the system memory is exhausted. IMO we should do nothing for this case. |
vstinner
left a comment
There was a problem hiding this comment.
I don't think that this approach is correct (see my previous comment).
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase And if you don't make the requested changes, you will be poked with soft cushions! |
|
cc @markshannon |
|
Thanks for the review, @vstinner. The concern is that gh-149146 ends in SIGSEGV during cleanup after a legitimate MemoryError — graceful failure on allocation, hard crash on teardown. That felt worth fixing. On the depth limit: it's only a fallback when the margin signal can't fire (stack page growth blocked by RLIMIT_AS, see gh-150722). Normal deep chains still go through the margin path unchanged. @markshannon — would you prefer fixing the margin mechanism itself (gh-150722) over a fallback counter? Happy to pivot or close if the consensus is won't-fix. |
|
Python has no way to check if the process reached RLIMIT_AS limit, and so that extending the stack will fail with a segfault. IMO the current margin mechanism works as expected, and there is nothing we can do to catch stack overflow when RLIMIT_AS limit is reached. This PR is just a workaround and it can affect badly performance when there is enough memory (most common case), since it limits the trashcan list to 50 entries which is small. I close the issue. As I wrote previously, I don't think that it's right approach. |
Fixes #149146.
while True: b = (b, None)under RLIMIT_AS segfaults during cleanup:Docker with
ulimit -v 524288:MemoryError→lost sys.stderr→ segfault, exit 139.The trashcan deposit in
_Py_Dealloconly fires when_Py_RecursionLimit_GetMargin(tstate) < 2, which compares the stack pointer againstc_stack_soft_limit. Under RLIMIT_AS the kernel SIGSEGVs trying to grow the stack while SP is still nowhere near the soft limit — issue #150722 has an LLDB trace from the same reporter showing SP ~7 MB abovec_stack_hard_limitat the fault. So the trashcan never deposits and the recursivetuple_dealloc → Py_XDECREF → _Py_Dealloc → tuple_deallocchain just runs into the wall.This adds a per-thread
c_dealloc_depthcounter and uses it as a secondary trigger inside_Py_Dealloc, depositing at depth 50 (matches the old_PyTrash_UNWIND_LEVEL). The stack-pointer margin stays the primary signal; the counter only kicks in when SP can't.One thing I missed on the first pass: the drain side has the same problem.
margin >= 4is also true in this scenario, so every_Py_Deallocexit would re-enter_PyTrash_thread_destroy_chainfrom inside the deallocator it just called. Fix:destroy_chainbumps the counter for the duration of the drain (same idea as the olddelete_nesting), and the drain check in_Py_Deallocbecomes justc_dealloc_depth == 0. Counter is only touched for GC types so non-GC types pay nothing.Tests: new
test_trashcan_nested_tuple_deepintest_gc.py(100k nested tuples, must clean up without crash);test_gc test_exceptions test_capiall green locally. Docker repro now exits 1 withMemoryErrorinstead of 139 withSegmentation fault.#150722 is the same SP-signal blindness in the comparison path (
Py_EnterRecursiveCall) — separate code path, separate PR.