Skip to content

fix(amber): bound region termination retry instead of looping forever#5737

Merged
xuang7 merged 16 commits into
apache:mainfrom
Ma77Ball:fix/region-termination-bounded-retry
Jun 27, 2026
Merged

fix(amber): bound region termination retry instead of looping forever#5737
xuang7 merged 16 commits into
apache:mainfrom
Ma77Ball:fix/region-termination-bounded-retry

Conversation

@Ma77Ball

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

  • RegionExecutionCoordinator.terminateWorkersWithRetry retried worker termination with no limit, so a worker that never finished draining (its EndWorker kept failing) left the region's termination future unresolved and the workflow never reached COMPLETED, surfacing as an opaque ~1 minute timeout in DataProcessingSpec with no indication of the stuck region or workers.
  • Bounded the retry by a new maxTerminationAttempts budget (default 150, about 30s at the existing 200ms delay); on exhaustion the termination future fails with a descriptive error naming the region and the still-unterminated workers, instead of retrying indefinitely.
  • Made maxTerminationAttempts and killRetryDelay constructor parameters with production defaults so the behavior is unit-testable without long waits.
  • Scope note: this is a fail-fast/diagnosability change (it converts a silent hang into a fast, explicit failure, matching the pattern in fix(amber): surface writer-thread failure as FatalError instead of silent hang #4683), not a guaranteed elimination of the underlying termination flake.

Any related issues, documentation, discussions?

Closes: #5614

How was this PR tested?

  • Run sbt "WorkflowExecutionService/testOnly *RegionExecutionCoordinatorSpec" (under JDK 17); expect 3 tests succeeded, 0 failed.
  • New test give up with a descriptive error once the EndWorker retry budget is exhausted: forces EndWorker to always fail, then asserts the completion future fails with IllegalStateException containing "could not be terminated after 3 attempts", the coordinator is not marked completed, exactly 3 EndWorker calls were made, and the worker actor ref is retained.
  • Existing tests in the same spec (gracefulStop ordering, transient-failure retry-then-succeed) still pass, confirming the normal and transient paths are unchanged.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 2 better · 🔴 3 worse · ⚪ 10 noise (<±5%) · 0 without baseline

Compared against main 7a38b6c benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 435 0.266 23,215/30,690/30,690 us 🔴 +8.9% / 🔴 +97.0%
🟢 bs=100 sw=10 sl=64 959 0.585 103,890/120,058/120,058 us 🟢 -24.4% / 🔴 +9.5%
bs=1000 sw=10 sl=64 1,114 0.68 902,990/949,073/949,073 us ⚪ within ±5% / 🟢 -12.3%
Baseline details

Latest main 7a38b6c from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 435 tuples/sec 465 tuples/sec 758.88 tuples/sec -6.5% -42.7%
bs=10 sw=10 sl=64 MB/s 0.266 MB/s 0.284 MB/s 0.463 MB/s -6.3% -42.6%
bs=10 sw=10 sl=64 p50 23,215 us 21,310 us 12,965 us +8.9% +79.1%
bs=10 sw=10 sl=64 p95 30,690 us 30,290 us 15,578 us +1.3% +97.0%
bs=10 sw=10 sl=64 p99 30,690 us 30,290 us 18,378 us +1.3% +67.0%
bs=100 sw=10 sl=64 throughput 959 tuples/sec 939 tuples/sec 968.9 tuples/sec +2.1% -1.0%
bs=100 sw=10 sl=64 MB/s 0.585 MB/s 0.573 MB/s 0.591 MB/s +2.1% -1.1%
bs=100 sw=10 sl=64 p50 103,890 us 104,630 us 102,767 us -0.7% +1.1%
bs=100 sw=10 sl=64 p95 120,058 us 158,773 us 109,629 us -24.4% +9.5%
bs=100 sw=10 sl=64 p99 120,058 us 158,773 us 118,129 us -24.4% +1.6%
bs=1000 sw=10 sl=64 throughput 1,114 tuples/sec 1,113 tuples/sec 997.01 tuples/sec +0.1% +11.7%
bs=1000 sw=10 sl=64 MB/s 0.68 MB/s 0.679 MB/s 0.609 MB/s +0.1% +11.7%
bs=1000 sw=10 sl=64 p50 902,990 us 895,733 us 1,009,306 us +0.8% -10.5%
bs=1000 sw=10 sl=64 p95 949,073 us 955,612 us 1,051,088 us -0.7% -9.7%
bs=1000 sw=10 sl=64 p99 949,073 us 955,612 us 1,082,535 us -0.7% -12.3%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,459.28,200,128000,435,0.266,23215.27,30689.55,30689.55
1,100,10,64,20,2086.08,2000,1280000,959,0.585,103889.63,120057.57,120057.57
2,1000,10,64,20,17955.11,20000,12800000,1114,0.680,902989.93,949073.45,949073.45

@codecov-commenter

codecov-commenter commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.47619% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.82%. Comparing base (7a38b6c) to head (888625b).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ecture/scheduling/RegionExecutionCoordinator.scala 90.47% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5737      +/-   ##
============================================
- Coverage     54.89%   54.82%   -0.08%     
+ Complexity     2962     2958       -4     
============================================
  Files          1117     1115       -2     
  Lines         43133    43037      -96     
  Branches       4648     4636      -12     
============================================
- Hits          23680    23597      -83     
+ Misses        18064    18053      -11     
+ Partials       1389     1387       -2     
Flag Coverage Δ *Carryforward flag
access-control-service 70.00% <ø> (ø) Carriedforward from 2779df9
agent-service 34.36% <ø> (ø) Carriedforward from 2779df9
amber 57.07% <90.47%> (+0.01%) ⬆️
computing-unit-managing-service 0.00% <ø> (ø) Carriedforward from 2779df9
config-service 51.56% <ø> (ø) Carriedforward from 2779df9
file-service 59.02% <ø> (ø) Carriedforward from 2779df9
frontend 48.39% <ø> (-0.23%) ⬇️ Carriedforward from 2779df9
notebook-migration-service 78.57% <ø> (ø) Carriedforward from 2779df9
pyamber 90.20% <ø> (ø) Carriedforward from 2779df9
python 90.76% <ø> (ø) Carriedforward from 2779df9
workflow-compiling-service 55.14% <ø> (ø) Carriedforward from 2779df9

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions

github-actions Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @tanishqgandhi1908, @Yicong-Huang
    You can notify them by mentioning @tanishqgandhi1908, @Yicong-Huang in a comment.

@Ma77Ball Ma77Ball marked this pull request as ready for review June 23, 2026 10:12
@Ma77Ball

Copy link
Copy Markdown
Contributor Author

/request-review @aglinxinyuan

@Ma77Ball Ma77Ball force-pushed the fix/region-termination-bounded-retry branch from f16b303 to 7ec4e83 Compare June 23, 2026 10:15
@github-actions github-actions Bot requested a review from aglinxinyuan June 23, 2026 10:41

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Amber’s region termination behavior so that RegionExecutionCoordinator.terminateWorkersWithRetry no longer retries EndWorker indefinitely. Instead, it enforces a bounded retry budget (production defaults ≈30s) and fails fast with a descriptive error that includes the region id and affected worker ids, improving diagnosability of CI flakes like the DataProcessingSpec timeout described in #5614.

Changes:

  • Add bounded termination retry logic with configurable maxTerminationAttempts and killRetryDelay, plus descriptive failure on exhaustion.
  • Add/extend scheduler test support utilities for multi-worker regions in tests.
  • Add unit tests covering retry-budget exhaustion, boundary (off-by-one) behavior, multi-worker aggregation, and default values.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
amber/src/main/scala/org/apache/texera/amber/engine/architecture/scheduling/RegionExecutionCoordinator.scala Introduces bounded termination retry parameters + defaults; fails termination with a descriptive exception after budget exhaustion.
amber/src/test/scala/org/apache/texera/amber/engine/architecture/scheduling/RegionExecutionCoordinatorSpec.scala Adds focused tests for give-up behavior, boundary conditions, multi-worker error reporting, and default budget values.
amber/src/test/scala/org/apache/texera/amber/engine/architecture/scheduling/RegionCoordinatorTestSupport.scala Adds helpers to create/seed regions with multiple workers to support the new tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@aglinxinyuan aglinxinyuan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — bounding the retry is the right call. I traced the failure path to confirm it does what the description says: on budget exhaustion the failed terminationFuture skips the setPhase(Completed) flatMap, propagates through coordinateRegionExecutors' Future.collect, and surfaces as a FatalError in PortCompletedHandler (hits the case other branch, so relatedWorkerId = None — but the message itself carries the worker ids). Defaults are tunable and the only other constructor call site uses them, so the blast radius is minimal. Test coverage is good too — the "succeeds on the last permitted attempt" case nicely pins the attempt >= maxTerminationAttempts off-by-one.

One thing to confirm before merge — the budget is ~30s, but the stalls in #5614 showed ~60s silent windows. In production (no test timeout) a worker that would have drained late now gets the workflow failed at 30s instead of eventually completing. Are we comfortable that 30s is enough headroom for a slow-but-recoverable drain? Fine with it either way since it's a constructor param — just want it to be a deliberate choice rather than fall out of the 150 * 200ms arithmetic.

Couple of optional nits:

  • The give-up message lists every region worker as "still not terminated," but terminateWorkers fails the whole batch via Future.collect if any single EndWorker fails — so a worker that terminated fine still gets named. Something like "region workers" would read more accurately.
  • The companion-object comment says "Max EndWorker retries," but maxTerminationAttempts counts attempts (1 initial + 149 retries = 150) — minor wording.

Agree with the scope note that this is a fail-fast/diagnosability change, not a fix for the underlying termination flake (the Iceberg-commit contention during teardown is the real culprit) — but it's a strictly better failure mode than a silent hang, so no objection to merging once the budget question is settled.

@xuang7 xuang7 added the release/v1.2 back porting to release/v1.2 label Jun 24, 2026
@Ma77Ball

Copy link
Copy Markdown
Contributor Author

@xuang7 / @xinyuan could you drop the release/v1.2 label here? It can't cherry-pick cleanly onto v1.2, so I opened a dedicated backport PR (#5958) that's green.

@xuang7 xuang7 removed the release/v1.2 back porting to release/v1.2 label Jun 26, 2026
@xuang7 xuang7 added this pull request to the merge queue Jun 26, 2026
Merged via the queue into apache:main with commit ef05070 Jun 27, 2026
17 checks passed
aglinxinyuan pushed a commit that referenced this pull request Jun 27, 2026
…#5958)

### What changes were proposed in this PR?
- Backport of #5737 to `release/v1.2`: bound the EndWorker termination
retry loop with a configurable budget (`maxTerminationAttempts`, default
150) and per-attempt delay (`killRetryDelay`, default 200ms, ~30s total)
instead of retrying forever.
- When the budget is exhausted, the region fails loudly with an
`IllegalStateException` that names every still-running worker and chains
the underlying cause, so a stuck region surfaces to the user instead of
hanging silently.
- Adds `RegionExecutionCoordinator.DefaultMaxTerminationAttempts` /
`DefaultKillRetryDelay`, threads the two new constructor params through,
and extends the test fixtures with give-up and boundary coverage.
### Any related issues, documentation, discussions?
Backport of #5737 to release/v1.2.
### How was this PR tested?
- Run `sbt "WorkflowExecutionService/testOnly
org.apache.texera.amber.engine.architecture.scheduling.RegionExecutionCoordinatorSpec"`
(JDK 17); expect all 7 tests to pass (2 existing retry tests plus 5 new
give-up/boundary tests).
- Give-up path: a worker that never drains fails after the budget with
"could not be terminated after N attempts" naming the stuck worker(s);
the single-attempt, last-attempt-success, multi-worker aggregation, and
default-budget cases are each pinned by a new spec.
### Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky DataProcessingSpec intermittently fails the amber CI job

5 participants