Skip to content

fix(amber): surface writer-thread failure as FatalError instead of silent hang#4683

Merged
Yicong-Huang merged 3 commits into
apache:mainfrom
Yicong-Huang:fix/iceberg-write-fatal-fallback
May 6, 2026
Merged

fix(amber): surface writer-thread failure as FatalError instead of silent hang#4683
Yicong-Huang merged 3 commits into
apache:mainfrom
Yicong-Huang:fix/iceberg-write-fatal-fallback

Conversation

@Yicong-Huang

@Yicong-Huang Yicong-Huang commented May 2, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

When OutputPortResultWriterThread.run() throws (e.g. iceberg commit-retry budget exhausted), the writer thread dies silently and the worker still reports portCompleted to the controller. The user sees a 1-minute completion timeout with no signal pointing at iceberg.

Capture the failure on the writer thread, re-throw it from OutputManager.closeOutputStorageWriterIfNeeded, and let the existing DP-thread → worker-actor → controller-supervisor path surface it as a FatalError to the client.

Any related issues, documentation, discussions?

Closes #4682.

How was this PR tested?

OutputPortResultWriterThreadSpec (6 tests) covers clean run, putOne failure (close() still runs), close() failure, both-fail (close() suppressed on putOne), and OutputManager.closeOutputStorageWriterIfNeeded re-throw + no-op cases.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7, 1M context)

@codecov-commenter

codecov-commenter commented May 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 71.42857% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.51%. Comparing base (8b5dbf8) to head (b215402).

Files with missing lines Patch % Lines
...worker/managers/OutputPortResultWriterThread.scala 69.23% 0 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4683      +/-   ##
============================================
+ Coverage     42.49%   42.51%   +0.01%     
- Complexity     2180     2185       +5     
============================================
  Files          1005     1005              
  Lines         37429    37436       +7     
  Branches       3914     3918       +4     
============================================
+ Hits          15907    15915       +8     
+ Misses        20558    20557       -1     
  Partials        964      964              
Flag Coverage Δ
amber 43.16% <71.42%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Yicong-Huang

Copy link
Copy Markdown
Contributor Author

Another natural occurrence to fold into the test plan: PR #4648's release backport leg hit this exact path today.

Tracked separately in #4682.

@Yicong-Huang Yicong-Huang force-pushed the fix/iceberg-write-fatal-fallback branch from 97eb43d to 9262054 Compare May 3, 2026 05:03

@kunwp1 kunwp1 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I have no changes to suggest.

@aglinxinyuan

Copy link
Copy Markdown
Contributor

Please wait for @Xiao-zhen-Liu 's review.

@aglinxinyuan aglinxinyuan removed their request for review May 4, 2026 07:43

@Xiao-zhen-Liu Xiao-zhen-Liu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix, and I hope this can also solve the flaky and unreproducible #3950 test case issue. Left two minor suggestions before merge.

Yicong-Huang and others added 3 commits May 5, 2026 23:37
…lent hang

OutputPortResultWriterThread previously let exceptions in close()
escape Thread.run(), so the iceberg commit failure was invisible to
the worker, the controller still saw normal portCompleted, downstream
operators read incomplete data, and the test/user observed only a
1-minute Await timeout. Capture the failure, re-throw on
closeOutputStorageWriterIfNeeded, let DPThread's existing
MainThreadDelegateMessage path route it to the worker actor, and
let the Controller's AllForOneStrategy supervisor emit FatalError
to the client immediately.

Closes apache#4682.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add OutputPortResultWriterThreadSpec with 4 tests:

- OutputPortResultWriterThread leaves getFailure empty on a clean run.
- OutputPortResultWriterThread captures a close() exception in
  getFailure so the worker can re-throw it.
- OutputManager.closeOutputStorageWriterIfNeeded re-throws the writer
  thread's captured failure (this is the bridge from the writer
  thread to the DP thread → worker actor → controller supervisor →
  FatalError to client).
- OutputManager.closeOutputStorageWriterIfNeeded is a no-op when the
  port has no writer thread.

Together these pin every link of the fatal-error chain that this PR
introduces. The OutputManager test reaches into the private
outputPortResultWriterThreads map by reflection rather than going
through addPort, which would require a real iceberg URI; the test
file otherwise only depends on a 4-method stub BufferedItemWriter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review: a putOne failure mid-loop bypassed close() and leaked
the underlying writer's file handles. Move close() into a finally
clause; if both legs fail, attach close()'s exception as suppressed on
the original.

Tests: rework StubWriter to take onPutOne/onClose thunks, add a
putOne-failure test (asserts close() still ran) and a both-fail test
(asserts the original is captured with the close() failure as
suppressed).
@Yicong-Huang Yicong-Huang force-pushed the fix/iceberg-write-fatal-fallback branch from c709d9e to b215402 Compare May 6, 2026 06:42
@Yicong-Huang Yicong-Huang enabled auto-merge (squash) May 6, 2026 06:46
@Yicong-Huang Yicong-Huang merged commit 34b004d into apache:main May 6, 2026
18 checks passed
@Yicong-Huang Yicong-Huang deleted the fix/iceberg-write-fatal-fallback branch May 6, 2026 06:54
aglinxinyuan added a commit that referenced this pull request May 7, 2026
Mirror the result-writer cleanup added in #4683: if the state-writer
thread captured a failure (e.g. iceberg commit retries exhausted),
re-throw it from closeOutputStorageWriterIfNeeded so the DP thread
surfaces a FatalError to the controller instead of silently announcing
port completion.
yangzhang75 pushed a commit to yangzhang75/texera that referenced this pull request Jun 22, 2026
…lent hang (apache#4683)

### What changes were proposed in this PR?

When `OutputPortResultWriterThread.run()` throws (e.g. iceberg
commit-retry budget exhausted), the writer thread dies silently and the
worker still reports `portCompleted` to the controller. The user sees a
1-minute completion timeout with no signal pointing at iceberg.

Capture the failure on the writer thread, re-throw it from
`OutputManager.closeOutputStorageWriterIfNeeded`, and let the existing
DP-thread → worker-actor → controller-supervisor path surface it as a
`FatalError` to the client.

### Any related issues, documentation, discussions?

Closes apache#4682.

### How was this PR tested?

`OutputPortResultWriterThreadSpec` (6 tests) covers clean run, putOne
failure (close() still runs), close() failure, both-fail (close()
suppressed on putOne), and
`OutputManager.closeOutputStorageWriterIfNeeded` re-throw + no-op cases.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7, 1M context)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aglinxinyuan pushed a commit to aglinxinyuan/texera that referenced this pull request Jun 27, 2026
…apache#5737)

### What changes were proposed in this PR?
- `RegionExecutionCoordinator.terminateWorkersWithRetry` retried worker
termination with no limit, so a worker that never finished draining (its
`EndWorker` kept failing) left the region's termination future
unresolved and the workflow never reached COMPLETED, surfacing as an
opaque ~1 minute timeout in `DataProcessingSpec` with no indication of
the stuck region or workers.
- Bounded the retry by a new `maxTerminationAttempts` budget (default
150, about 30s at the existing 200ms delay); on exhaustion the
termination future fails with a descriptive error naming the region and
the still-unterminated workers, instead of retrying indefinitely.
- Made `maxTerminationAttempts` and `killRetryDelay` constructor
parameters with production defaults so the behavior is unit-testable
without long waits.
- Scope note: this is a fail-fast/diagnosability change (it converts a
silent hang into a fast, explicit failure, matching the pattern in
apache#4683), not a guaranteed elimination of the underlying termination
flake.
### Any related issues, documentation, discussions?
Closes: apache#5614
### How was this PR tested?
- Run `sbt "WorkflowExecutionService/testOnly
*RegionExecutionCoordinatorSpec"` (under JDK 17); expect 3 tests
succeeded, 0 failed.
- New test `give up with a descriptive error once the EndWorker retry
budget is exhausted`: forces `EndWorker` to always fail, then asserts
the completion future fails with `IllegalStateException` containing
"could not be terminated after 3 attempts", the coordinator is not
marked completed, exactly 3 `EndWorker` calls were made, and the worker
actor ref is retained.
- Existing tests in the same spec (gracefulStop ordering,
transient-failure retry-then-succeed) still pass, confirming the normal
and transient paths are unchanged.
### Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Iceberg writer-thread failure dies silently, masking root cause as 1-minute test timeout

5 participants