fix(amber): bound region termination retry instead of looping forever#5958
Conversation
Automated Reviewer SuggestionsBased on the
|
|
/request-review @aglinxinyuan @xuang7 |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## release/v1.2 #5958 +/- ##
===============================================
Coverage 52.10% 52.10%
+ Complexity 2494 2490 -4
===============================================
Files 1065 1065
Lines 41297 41315 +18
Branches 4421 4424 +3
===============================================
+ Hits 21517 21529 +12
- Misses 18521 18522 +1
- Partials 1259 1264 +5
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR backports the Amber region-termination retry bound to release/v1.2, preventing RegionExecutionCoordinator from retrying EndWorker forever by introducing a configurable retry budget and delay, and failing the region loudly once the budget is exhausted.
Changes:
- Add configurable
maxTerminationAttempts(default 150) andkillRetryDelay(default 200ms) to bound termination retries and surface a descriptive failure when exhausted. - Improve failure diagnostics by emitting an
IllegalStateExceptionthat includes worker IDs and preserves the underlying cause. - Extend test coverage with give-up and boundary cases, including multi-worker aggregation, plus new test support helpers for multi-worker regions.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| amber/src/main/scala/org/apache/texera/amber/engine/architecture/scheduling/RegionExecutionCoordinator.scala | Introduces retry-budget defaults, threads new params through the coordinator, and implements bounded retry with a descriptive give-up failure. |
| amber/src/test/scala/org/apache/texera/amber/engine/architecture/scheduling/RegionExecutionCoordinatorSpec.scala | Adds multiple new specs covering give-up behavior, off-by-one boundary, multi-worker error aggregation, and default values. |
| amber/src/test/scala/org/apache/texera/amber/engine/architecture/scheduling/RegionCoordinatorTestSupport.scala | Adds helper APIs for creating multi-worker regions and seeding worker executions to support the new tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
What changes were proposed in this PR?
release/v1.2: bound the EndWorker termination retry loop with a configurable budget (maxTerminationAttempts, default 150) and per-attempt delay (killRetryDelay, default 200ms, ~30s total) instead of retrying forever.IllegalStateExceptionthat names every still-running worker and chains the underlying cause, so a stuck region surfaces to the user instead of hanging silently.RegionExecutionCoordinator.DefaultMaxTerminationAttempts/DefaultKillRetryDelay, threads the two new constructor params through, and extends the test fixtures with give-up and boundary coverage.Any related issues, documentation, discussions?
Backport of #5737 to release/v1.2.
How was this PR tested?
sbt "WorkflowExecutionService/testOnly org.apache.texera.amber.engine.architecture.scheduling.RegionExecutionCoordinatorSpec"(JDK 17); expect all 7 tests to pass (2 existing retry tests plus 5 new give-up/boundary tests).Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF