Skip to content

fix(amber): bound region termination retry instead of looping forever#5958

Merged
aglinxinyuan merged 1 commit into
apache:release/v1.2from
Ma77Ball:backport/region-termination-bounded-retry-v1.2
Jun 27, 2026
Merged

fix(amber): bound region termination retry instead of looping forever#5958
aglinxinyuan merged 1 commit into
apache:release/v1.2from
Ma77Ball:backport/region-termination-bounded-retry-v1.2

Conversation

@Ma77Ball

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

  • Backport of fix(amber): bound region termination retry instead of looping forever #5737 to release/v1.2: bound the EndWorker termination retry loop with a configurable budget (maxTerminationAttempts, default 150) and per-attempt delay (killRetryDelay, default 200ms, ~30s total) instead of retrying forever.
  • When the budget is exhausted, the region fails loudly with an IllegalStateException that names every still-running worker and chains the underlying cause, so a stuck region surfaces to the user instead of hanging silently.
  • Adds RegionExecutionCoordinator.DefaultMaxTerminationAttempts / DefaultKillRetryDelay, threads the two new constructor params through, and extends the test fixtures with give-up and boundary coverage.

Any related issues, documentation, discussions?

Backport of #5737 to release/v1.2.

How was this PR tested?

  • Run sbt "WorkflowExecutionService/testOnly org.apache.texera.amber.engine.architecture.scheduling.RegionExecutionCoordinatorSpec" (JDK 17); expect all 7 tests to pass (2 existing retry tests plus 5 new give-up/boundary tests).
  • Give-up path: a worker that never drains fails after the budget with "could not be terminated after N attempts" naming the stuck worker(s); the single-attempt, last-attempt-success, multi-worker aggregation, and default-budget cases are each pinned by a new spec.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

@github-actions

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @aglinxinyuan, @Yicong-Huang, @kunwp1
    You can notify them by mentioning @aglinxinyuan, @Yicong-Huang, @kunwp1 in a comment.

@Ma77Ball

Copy link
Copy Markdown
Contributor Author

/request-review @aglinxinyuan @xuang7

@github-actions github-actions Bot requested review from aglinxinyuan and xuang7 June 26, 2026 22:47
@codecov-commenter

codecov-commenter commented Jun 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.47619% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.10%. Comparing base (8e01d1a) to head (0863c11).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ecture/scheduling/RegionExecutionCoordinator.scala 90.47% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@               Coverage Diff               @@
##             release/v1.2    #5958   +/-   ##
===============================================
  Coverage           52.10%   52.10%           
+ Complexity           2494     2490    -4     
===============================================
  Files                1065     1065           
  Lines               41297    41315   +18     
  Branches             4421     4424    +3     
===============================================
+ Hits                21517    21529   +12     
- Misses              18521    18522    +1     
- Partials             1259     1264    +5     
Flag Coverage Δ *Carryforward flag
access-control-service 64.61% <ø> (ø) Carriedforward from 8e01d1a
agent-service 34.36% <ø> (ø) Carriedforward from 8e01d1a
amber 52.44% <90.47%> (+0.01%) ⬆️
computing-unit-managing-service 1.65% <ø> (ø) Carriedforward from 8e01d1a
config-service 56.06% <ø> (ø) Carriedforward from 8e01d1a
file-service 58.59% <ø> (ø) Carriedforward from 8e01d1a
frontend 46.38% <ø> (ø) Carriedforward from 8e01d1a
pyamber 90.77% <ø> (ø) Carriedforward from 8e01d1a
python 90.75% <ø> (ø) Carriedforward from 8e01d1a
workflow-compiling-service 58.69% <ø> (ø) Carriedforward from 8e01d1a

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@xuang7 xuang7 added release/v1.2 back porting to release/v1.2 and removed release/v1.2 back porting to release/v1.2 labels Jun 26, 2026
@aglinxinyuan aglinxinyuan requested a review from Copilot June 27, 2026 00:00

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR backports the Amber region-termination retry bound to release/v1.2, preventing RegionExecutionCoordinator from retrying EndWorker forever by introducing a configurable retry budget and delay, and failing the region loudly once the budget is exhausted.

Changes:

  • Add configurable maxTerminationAttempts (default 150) and killRetryDelay (default 200ms) to bound termination retries and surface a descriptive failure when exhausted.
  • Improve failure diagnostics by emitting an IllegalStateException that includes worker IDs and preserves the underlying cause.
  • Extend test coverage with give-up and boundary cases, including multi-worker aggregation, plus new test support helpers for multi-worker regions.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
amber/src/main/scala/org/apache/texera/amber/engine/architecture/scheduling/RegionExecutionCoordinator.scala Introduces retry-budget defaults, threads new params through the coordinator, and implements bounded retry with a descriptive give-up failure.
amber/src/test/scala/org/apache/texera/amber/engine/architecture/scheduling/RegionExecutionCoordinatorSpec.scala Adds multiple new specs covering give-up behavior, off-by-one boundary, multi-worker error aggregation, and default values.
amber/src/test/scala/org/apache/texera/amber/engine/architecture/scheduling/RegionCoordinatorTestSupport.scala Adds helper APIs for creating multi-worker regions and seeding worker executions to support the new tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@aglinxinyuan aglinxinyuan merged commit 8629905 into apache:release/v1.2 Jun 27, 2026
40 of 45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants