fix(amber): bound region termination retry instead of looping forever by Ma77Ball · Pull Request #5737 · apache/texera

Ma77Ball · 2026-06-16T22:38:26Z

What changes were proposed in this PR?

RegionExecutionCoordinator.terminateWorkersWithRetry retried worker termination with no limit, so a worker that never finished draining (its EndWorker kept failing) left the region's termination future unresolved and the workflow never reached COMPLETED, surfacing as an opaque ~1 minute timeout in DataProcessingSpec with no indication of the stuck region or workers.
Bounded the retry by a new maxTerminationAttempts budget (default 150, about 30s at the existing 200ms delay); on exhaustion the termination future fails with a descriptive error naming the region and the still-unterminated workers, instead of retrying indefinitely.
Made maxTerminationAttempts and killRetryDelay constructor parameters with production defaults so the behavior is unit-testable without long waits.
Scope note: this is a fail-fast/diagnosability change (it converts a silent hang into a fast, explicit failure, matching the pattern in fix(amber): surface writer-thread failure as FatalError instead of silent hang #4683), not a guaranteed elimination of the underlying termination flake.

Any related issues, documentation, discussions?

Closes: #5614

How was this PR tested?

Run sbt "WorkflowExecutionService/testOnly *RegionExecutionCoordinatorSpec" (under JDK 17); expect 3 tests succeeded, 0 failed.
New test give up with a descriptive error once the EndWorker retry budget is exhausted: forces EndWorker to always fail, then asserts the completion future fails with IllegalStateException containing "could not be terminated after 3 attempts", the coordinator is not marked completed, exactly 3 EndWorker calls were made, and the worker actor ref is retained.
Existing tests in the same spec (gracefulStop ordering, transient-failure retry-then-succeed) still pass, confirming the normal and transient paths are unchanged.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

…or user to see and fix

…nation-bounded-retry

github-actions · 2026-06-16T22:43:23Z

⚠️ Benchmark changes need a look

🟢 2 better · 🔴 3 worse · ⚪ 10 noise (<±5%) · 0 without baseline

Compared against main 7a38b6c benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

	config	throughput	MB/s	latency	max Δ latest / 7d
🔴	bs=10 sw=10 sl=64	435	0.266	23,215/30,690/30,690 us	🔴 +8.9% / 🔴 +97.0%
🟢	bs=100 sw=10 sl=64	959	0.585	103,890/120,058/120,058 us	🟢 -24.4% / 🔴 +9.5%
⚪	bs=1000 sw=10 sl=64	1,114	0.68	902,990/949,073/949,073 us	⚪ within ±5% / 🟢 -12.3%

Baseline details

Latest main 7a38b6c from same runner

config	metric	PR	latest main	7d avg	Δ latest	Δ 7d
bs=10 sw=10 sl=64	throughput	435 tuples/sec	465 tuples/sec	758.88 tuples/sec	-6.5%	-42.7%
bs=10 sw=10 sl=64	MB/s	0.266 MB/s	0.284 MB/s	0.463 MB/s	-6.3%	-42.6%
bs=10 sw=10 sl=64	p50	23,215 us	21,310 us	12,965 us	+8.9%	+79.1%
bs=10 sw=10 sl=64	p95	30,690 us	30,290 us	15,578 us	+1.3%	+97.0%
bs=10 sw=10 sl=64	p99	30,690 us	30,290 us	18,378 us	+1.3%	+67.0%
bs=100 sw=10 sl=64	throughput	959 tuples/sec	939 tuples/sec	968.9 tuples/sec	+2.1%	-1.0%
bs=100 sw=10 sl=64	MB/s	0.585 MB/s	0.573 MB/s	0.591 MB/s	+2.1%	-1.1%
bs=100 sw=10 sl=64	p50	103,890 us	104,630 us	102,767 us	-0.7%	+1.1%
bs=100 sw=10 sl=64	p95	120,058 us	158,773 us	109,629 us	-24.4%	+9.5%
bs=100 sw=10 sl=64	p99	120,058 us	158,773 us	118,129 us	-24.4%	+1.6%
bs=1000 sw=10 sl=64	throughput	1,114 tuples/sec	1,113 tuples/sec	997.01 tuples/sec	+0.1%	+11.7%
bs=1000 sw=10 sl=64	MB/s	0.68 MB/s	0.679 MB/s	0.609 MB/s	+0.1%	+11.7%
bs=1000 sw=10 sl=64	p50	902,990 us	895,733 us	1,009,306 us	+0.8%	-10.5%
bs=1000 sw=10 sl=64	p95	949,073 us	955,612 us	1,051,088 us	-0.7%	-9.7%
bs=1000 sw=10 sl=64	p99	949,073 us	955,612 us	1,082,535 us	-0.7%	-12.3%

Raw CSV

config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,459.28,200,128000,435,0.266,23215.27,30689.55,30689.55
1,100,10,64,20,2086.08,2000,1280000,959,0.585,103889.63,120057.57,120057.57
2,1000,10,64,20,17955.11,20000,12800000,1114,0.680,902989.93,949073.45,949073.45

…7Ball/texera into fix/region-termination-bounded-retry

codecov-commenter · 2026-06-16T23:05:46Z

Codecov Report

❌ Patch coverage is 90.47619% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.82%. Comparing base (7a38b6c) to head (888625b).
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
...ecture/scheduling/RegionExecutionCoordinator.scala	90.47%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #5737      +/-   ##
============================================
- Coverage     54.89%   54.82%   -0.08%     
+ Complexity     2962     2958       -4     
============================================
  Files          1117     1115       -2     
  Lines         43133    43037      -96     
  Branches       4648     4636      -12     
============================================
- Hits          23680    23597      -83     
+ Misses        18064    18053      -11     
+ Partials       1389     1387       -2

Flag	Coverage Δ		*Carryforward flag
access-control-service	`70.00% <ø> (ø)`		Carriedforward from 2779df9
agent-service	`34.36% <ø> (ø)`		Carriedforward from 2779df9
amber	`57.07% <90.47%> (+0.01%)`	⬆️
computing-unit-managing-service	`0.00% <ø> (ø)`		Carriedforward from 2779df9
config-service	`51.56% <ø> (ø)`		Carriedforward from 2779df9
file-service	`59.02% <ø> (ø)`		Carriedforward from 2779df9
frontend	`48.39% <ø> (-0.23%)`	⬇️	Carriedforward from 2779df9
notebook-migration-service	`78.57% <ø> (ø)`		Carriedforward from 2779df9
pyamber	`90.20% <ø> (ø)`		Carriedforward from 2779df9
python	`90.76% <ø> (ø)`		Carriedforward from 2779df9
workflow-compiling-service	`55.14% <ø> (ø)`		Carriedforward from 2779df9

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-actions · 2026-06-22T07:45:28Z

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

Contributors with relevant context: @tanishqgandhi1908, @Yicong-Huang
You can notify them by mentioning @tanishqgandhi1908, @Yicong-Huang in a comment.

Ma77Ball · 2026-06-23T10:13:04Z

/request-review @aglinxinyuan

Copilot

Pull request overview

This PR updates Amber’s region termination behavior so that RegionExecutionCoordinator.terminateWorkersWithRetry no longer retries EndWorker indefinitely. Instead, it enforces a bounded retry budget (production defaults ≈30s) and fails fast with a descriptive error that includes the region id and affected worker ids, improving diagnosability of CI flakes like the DataProcessingSpec timeout described in #5614.

Changes:

Add bounded termination retry logic with configurable maxTerminationAttempts and killRetryDelay, plus descriptive failure on exhaustion.
Add/extend scheduler test support utilities for multi-worker regions in tests.
Add unit tests covering retry-budget exhaustion, boundary (off-by-one) behavior, multi-worker aggregation, and default values.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
amber/src/main/scala/org/apache/texera/amber/engine/architecture/scheduling/RegionExecutionCoordinator.scala	Introduces bounded termination retry parameters + defaults; fails termination with a descriptive exception after budget exhaustion.
amber/src/test/scala/org/apache/texera/amber/engine/architecture/scheduling/RegionExecutionCoordinatorSpec.scala	Adds focused tests for give-up behavior, boundary conditions, multi-worker error reporting, and default budget values.
amber/src/test/scala/org/apache/texera/amber/engine/architecture/scheduling/RegionCoordinatorTestSupport.scala	Adds helpers to create/seed regions with multiple workers to support the new tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

aglinxinyuan

LGTM — bounding the retry is the right call. I traced the failure path to confirm it does what the description says: on budget exhaustion the failed terminationFuture skips the setPhase(Completed) flatMap, propagates through coordinateRegionExecutors' Future.collect, and surfaces as a FatalError in PortCompletedHandler (hits the case other branch, so relatedWorkerId = None — but the message itself carries the worker ids). Defaults are tunable and the only other constructor call site uses them, so the blast radius is minimal. Test coverage is good too — the "succeeds on the last permitted attempt" case nicely pins the attempt >= maxTerminationAttempts off-by-one.

One thing to confirm before merge — the budget is ~30s, but the stalls in #5614 showed ~60s silent windows. In production (no test timeout) a worker that would have drained late now gets the workflow failed at 30s instead of eventually completing. Are we comfortable that 30s is enough headroom for a slow-but-recoverable drain? Fine with it either way since it's a constructor param — just want it to be a deliberate choice rather than fall out of the 150 * 200ms arithmetic.

Couple of optional nits:

The give-up message lists every region worker as "still not terminated," but terminateWorkers fails the whole batch via Future.collect if any single EndWorker fails — so a worker that terminated fine still gets named. Something like "region workers" would read more accurately.
The companion-object comment says "Max EndWorker retries," but maxTerminationAttempts counts attempts (1 initial + 149 retries = 150) — minor wording.

Agree with the scope note that this is a fail-fast/diagnosability change, not a fix for the underlying termination flake (the Iceberg-commit contention during teardown is the real culprit) — but it's a strictly better failure mode than a silent hang, so no objection to merging once the budget question is settled.

Ma77Ball · 2026-06-26T22:38:13Z

@xuang7 / @xinyuan could you drop the release/v1.2 label here? It can't cherry-pick cleanly onto v1.2, so I opened a dedicated backport PR (#5958) that's green.

…#5958) ### What changes were proposed in this PR? - Backport of #5737 to `release/v1.2`: bound the EndWorker termination retry loop with a configurable budget (`maxTerminationAttempts`, default 150) and per-attempt delay (`killRetryDelay`, default 200ms, ~30s total) instead of retrying forever. - When the budget is exhausted, the region fails loudly with an `IllegalStateException` that names every still-running worker and chains the underlying cause, so a stuck region surfaces to the user instead of hanging silently. - Adds `RegionExecutionCoordinator.DefaultMaxTerminationAttempts` / `DefaultKillRetryDelay`, threads the two new constructor params through, and extends the test fixtures with give-up and boundary coverage. ### Any related issues, documentation, discussions? Backport of #5737 to release/v1.2. ### How was this PR tested? - Run `sbt "WorkflowExecutionService/testOnly org.apache.texera.amber.engine.architecture.scheduling.RegionExecutionCoordinatorSpec"` (JDK 17); expect all 7 tests to pass (2 existing retry tests plus 5 new give-up/boundary tests). - Give-up path: a worker that never drains fails after the budget with "could not be terminated after N attempts" naming the stuck worker(s); the single-attempt, last-attempt-success, multi-worker aggregation, and default-budget cases are each pinned by a new spec. ### Was this PR authored or co-authored using generative AI tooling? Co-authored with Claude Opus 4.8 in compliance with ASF

Ma77Ball added 2 commits June 16, 2026 15:34

capped retries for region shutdown so when it fails it fails loudly f…

fd003ac

…or user to see and fix

Merge branch 'main' of github.com:apache/texera into fix/region-termi…

a3b8949

…nation-bounded-retry

github-actions Bot assigned Ma77Ball Jun 16, 2026

github-actions Bot added engine fix labels Jun 16, 2026

Ma77Ball added 3 commits June 16, 2026 15:45

Merge branch 'main' into fix/region-termination-bounded-retry

d05443b

added test cases for new code'

5db7936

Merge branch 'fix/region-termination-bounded-retry' of github.com:Ma7…

ad4d56e

…7Ball/texera into fix/region-termination-bounded-retry

Ma77Ball added 6 commits June 16, 2026 16:44

Merge branch 'main' into fix/region-termination-bounded-retry

f581296

re-run ci test

95829c3

Merge branch 'main' into fix/region-termination-bounded-retry

2835fe6

Merge branch 'main' into fix/region-termination-bounded-retry

a948651

Merge branch 'main' into fix/region-termination-bounded-retry

24d0c65

added test cases for code coverage

fe13ee3

Ma77Ball marked this pull request as ready for review June 23, 2026 10:12

fixed minor grammar issues

7ec4e83

Ma77Ball force-pushed the fix/region-termination-bounded-retry branch from f16b303 to 7ec4e83 Compare June 23, 2026 10:15

Merge branch 'main' into fix/region-termination-bounded-retry

e493942

github-actions Bot requested a review from aglinxinyuan June 23, 2026 10:41

Merge branch 'main' into fix/region-termination-bounded-retry

bcfb688

aglinxinyuan requested a review from Copilot June 23, 2026 21:20

Copilot started reviewing on behalf of aglinxinyuan June 23, 2026 21:20 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

merge upstream/main

7f9f807

aglinxinyuan approved these changes Jun 24, 2026

View reviewed changes

xuang7 added the release/v1.2 back porting to release/v1.2 label Jun 24, 2026

Ma77Ball mentioned this pull request Jun 26, 2026

fix(amber): bound region termination retry instead of looping forever #5958

Merged

xuang7 removed the release/v1.2 back porting to release/v1.2 label Jun 26, 2026

Merge branch 'main' into fix/region-termination-bounded-retry

888625b

xuang7 added this pull request to the merge queue Jun 26, 2026

Merged via the queue into apache:main with commit ef05070 Jun 27, 2026
17 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(amber): bound region termination retry instead of looping forever#5737

fix(amber): bound region termination retry instead of looping forever#5737
xuang7 merged 16 commits into
apache:mainfrom
Ma77Ball:fix/region-termination-bounded-retry

Ma77Ball commented Jun 16, 2026

Uh oh!

github-actions Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jun 16, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 22, 2026 •

edited

Loading

Uh oh!

Ma77Ball commented Jun 23, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

aglinxinyuan left a comment

Uh oh!

Ma77Ball commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

Ma77Ball commented Jun 16, 2026

What changes were proposed in this PR?

Any related issues, documentation, discussions?

How was this PR tested?

Was this PR authored or co-authored using generative AI tooling?

Uh oh!

github-actions Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Benchmark changes need a look

Uh oh!

codecov-commenter commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Automated Reviewer Suggestions

Uh oh!

Ma77Ball commented Jun 23, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

aglinxinyuan left a comment

Choose a reason for hiding this comment

Uh oh!

Ma77Ball commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

github-actions Bot commented Jun 16, 2026 •

edited

Loading

codecov-commenter commented Jun 16, 2026 •

edited

Loading

github-actions Bot commented Jun 22, 2026 •

edited

Loading