fix(amber): bound region termination retry instead of looping forever#5737
Conversation
…or user to see and fix
…nation-bounded-retry
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 435 | 0.266 | 23,215/30,690/30,690 us | 🔴 +8.9% / 🔴 +97.0% |
| 🟢 | bs=100 sw=10 sl=64 | 959 | 0.585 | 103,890/120,058/120,058 us | 🟢 -24.4% / 🔴 +9.5% |
| ⚪ | bs=1000 sw=10 sl=64 | 1,114 | 0.68 | 902,990/949,073/949,073 us | ⚪ within ±5% / 🟢 -12.3% |
Baseline details
Latest main 7a38b6c from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 435 tuples/sec | 465 tuples/sec | 758.88 tuples/sec | -6.5% | -42.7% |
| bs=10 sw=10 sl=64 | MB/s | 0.266 MB/s | 0.284 MB/s | 0.463 MB/s | -6.3% | -42.6% |
| bs=10 sw=10 sl=64 | p50 | 23,215 us | 21,310 us | 12,965 us | +8.9% | +79.1% |
| bs=10 sw=10 sl=64 | p95 | 30,690 us | 30,290 us | 15,578 us | +1.3% | +97.0% |
| bs=10 sw=10 sl=64 | p99 | 30,690 us | 30,290 us | 18,378 us | +1.3% | +67.0% |
| bs=100 sw=10 sl=64 | throughput | 959 tuples/sec | 939 tuples/sec | 968.9 tuples/sec | +2.1% | -1.0% |
| bs=100 sw=10 sl=64 | MB/s | 0.585 MB/s | 0.573 MB/s | 0.591 MB/s | +2.1% | -1.1% |
| bs=100 sw=10 sl=64 | p50 | 103,890 us | 104,630 us | 102,767 us | -0.7% | +1.1% |
| bs=100 sw=10 sl=64 | p95 | 120,058 us | 158,773 us | 109,629 us | -24.4% | +9.5% |
| bs=100 sw=10 sl=64 | p99 | 120,058 us | 158,773 us | 118,129 us | -24.4% | +1.6% |
| bs=1000 sw=10 sl=64 | throughput | 1,114 tuples/sec | 1,113 tuples/sec | 997.01 tuples/sec | +0.1% | +11.7% |
| bs=1000 sw=10 sl=64 | MB/s | 0.68 MB/s | 0.679 MB/s | 0.609 MB/s | +0.1% | +11.7% |
| bs=1000 sw=10 sl=64 | p50 | 902,990 us | 895,733 us | 1,009,306 us | +0.8% | -10.5% |
| bs=1000 sw=10 sl=64 | p95 | 949,073 us | 955,612 us | 1,051,088 us | -0.7% | -9.7% |
| bs=1000 sw=10 sl=64 | p99 | 949,073 us | 955,612 us | 1,082,535 us | -0.7% | -12.3% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,459.28,200,128000,435,0.266,23215.27,30689.55,30689.55
1,100,10,64,20,2086.08,2000,1280000,959,0.585,103889.63,120057.57,120057.57
2,1000,10,64,20,17955.11,20000,12800000,1114,0.680,902989.93,949073.45,949073.45…7Ball/texera into fix/region-termination-bounded-retry
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5737 +/- ##
============================================
- Coverage 54.89% 54.82% -0.08%
+ Complexity 2962 2958 -4
============================================
Files 1117 1115 -2
Lines 43133 43037 -96
Branches 4648 4636 -12
============================================
- Hits 23680 23597 -83
+ Misses 18064 18053 -11
+ Partials 1389 1387 -2
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Automated Reviewer SuggestionsBased on the
|
|
/request-review @aglinxinyuan |
f16b303 to
7ec4e83
Compare
There was a problem hiding this comment.
Pull request overview
This PR updates Amber’s region termination behavior so that RegionExecutionCoordinator.terminateWorkersWithRetry no longer retries EndWorker indefinitely. Instead, it enforces a bounded retry budget (production defaults ≈30s) and fails fast with a descriptive error that includes the region id and affected worker ids, improving diagnosability of CI flakes like the DataProcessingSpec timeout described in #5614.
Changes:
- Add bounded termination retry logic with configurable
maxTerminationAttemptsandkillRetryDelay, plus descriptive failure on exhaustion. - Add/extend scheduler test support utilities for multi-worker regions in tests.
- Add unit tests covering retry-budget exhaustion, boundary (off-by-one) behavior, multi-worker aggregation, and default values.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| amber/src/main/scala/org/apache/texera/amber/engine/architecture/scheduling/RegionExecutionCoordinator.scala | Introduces bounded termination retry parameters + defaults; fails termination with a descriptive exception after budget exhaustion. |
| amber/src/test/scala/org/apache/texera/amber/engine/architecture/scheduling/RegionExecutionCoordinatorSpec.scala | Adds focused tests for give-up behavior, boundary conditions, multi-worker error reporting, and default budget values. |
| amber/src/test/scala/org/apache/texera/amber/engine/architecture/scheduling/RegionCoordinatorTestSupport.scala | Adds helpers to create/seed regions with multiple workers to support the new tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
aglinxinyuan
left a comment
There was a problem hiding this comment.
LGTM — bounding the retry is the right call. I traced the failure path to confirm it does what the description says: on budget exhaustion the failed terminationFuture skips the setPhase(Completed) flatMap, propagates through coordinateRegionExecutors' Future.collect, and surfaces as a FatalError in PortCompletedHandler (hits the case other branch, so relatedWorkerId = None — but the message itself carries the worker ids). Defaults are tunable and the only other constructor call site uses them, so the blast radius is minimal. Test coverage is good too — the "succeeds on the last permitted attempt" case nicely pins the attempt >= maxTerminationAttempts off-by-one.
One thing to confirm before merge — the budget is ~30s, but the stalls in #5614 showed ~60s silent windows. In production (no test timeout) a worker that would have drained late now gets the workflow failed at 30s instead of eventually completing. Are we comfortable that 30s is enough headroom for a slow-but-recoverable drain? Fine with it either way since it's a constructor param — just want it to be a deliberate choice rather than fall out of the 150 * 200ms arithmetic.
Couple of optional nits:
- The give-up message lists every region worker as "still not terminated," but
terminateWorkersfails the whole batch viaFuture.collectif any singleEndWorkerfails — so a worker that terminated fine still gets named. Something like "region workers" would read more accurately. - The companion-object comment says "Max EndWorker retries," but
maxTerminationAttemptscounts attempts (1 initial + 149 retries = 150) — minor wording.
Agree with the scope note that this is a fail-fast/diagnosability change, not a fix for the underlying termination flake (the Iceberg-commit contention during teardown is the real culprit) — but it's a strictly better failure mode than a silent hang, so no objection to merging once the budget question is settled.
…#5958) ### What changes were proposed in this PR? - Backport of #5737 to `release/v1.2`: bound the EndWorker termination retry loop with a configurable budget (`maxTerminationAttempts`, default 150) and per-attempt delay (`killRetryDelay`, default 200ms, ~30s total) instead of retrying forever. - When the budget is exhausted, the region fails loudly with an `IllegalStateException` that names every still-running worker and chains the underlying cause, so a stuck region surfaces to the user instead of hanging silently. - Adds `RegionExecutionCoordinator.DefaultMaxTerminationAttempts` / `DefaultKillRetryDelay`, threads the two new constructor params through, and extends the test fixtures with give-up and boundary coverage. ### Any related issues, documentation, discussions? Backport of #5737 to release/v1.2. ### How was this PR tested? - Run `sbt "WorkflowExecutionService/testOnly org.apache.texera.amber.engine.architecture.scheduling.RegionExecutionCoordinatorSpec"` (JDK 17); expect all 7 tests to pass (2 existing retry tests plus 5 new give-up/boundary tests). - Give-up path: a worker that never drains fails after the budget with "could not be terminated after N attempts" naming the stuck worker(s); the single-attempt, last-attempt-success, multi-worker aggregation, and default-budget cases are each pinned by a new spec. ### Was this PR authored or co-authored using generative AI tooling? Co-authored with Claude Opus 4.8 in compliance with ASF
What changes were proposed in this PR?
RegionExecutionCoordinator.terminateWorkersWithRetryretried worker termination with no limit, so a worker that never finished draining (itsEndWorkerkept failing) left the region's termination future unresolved and the workflow never reached COMPLETED, surfacing as an opaque ~1 minute timeout inDataProcessingSpecwith no indication of the stuck region or workers.maxTerminationAttemptsbudget (default 150, about 30s at the existing 200ms delay); on exhaustion the termination future fails with a descriptive error naming the region and the still-unterminated workers, instead of retrying indefinitely.maxTerminationAttemptsandkillRetryDelayconstructor parameters with production defaults so the behavior is unit-testable without long waits.Any related issues, documentation, discussions?
Closes: #5614
How was this PR tested?
sbt "WorkflowExecutionService/testOnly *RegionExecutionCoordinatorSpec"(under JDK 17); expect 3 tests succeeded, 0 failed.give up with a descriptive error once the EndWorker retry budget is exhausted: forcesEndWorkerto always fail, then asserts the completion future fails withIllegalStateExceptioncontaining "could not be terminated after 3 attempts", the coordinator is not marked completed, exactly 3EndWorkercalls were made, and the worker actor ref is retained.Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF