refactor(execution-service): consolidate init-error reporting into WorkflowExecutionService by aglinxinyuan · Pull Request #5922 · apache/texera

aglinxinyuan · 2026-06-24T03:29:42Z

What changes were proposed in this PR?

Consolidates execution init-error reporting into a single site owned by WorkflowExecutionService, replacing the two-phase split #5781 introduced (per @Yicong-Huang's suggestion, tracked in #5921).

WorkflowExecutionService now registers its fatalErrors → WorkflowErrorEvent (and state) diff handler as the first construction action. Construction itself does no external work and cannot throw — it only assigns workflowSettings, creates a WebsocketInput (a PublishSubject), and registers the handler. All throwing work lives in executeWorkflow(), which runs after executionService.onNext(...) publishes the execution, so any failure there is recorded by errorHandler into the metadata store and surfaced by the handler that connectToExecution forwards.
WorkflowService.initExecutionService drops the pre-publish errorSubject fallback: the executionPublished gating and reportFatalErrorsToSubscribers are gone, and the catch is simply errorHandler(e). The now-unused errorSubject field and its connect() subscription are removed.
WorkflowServiceSpec is removed — it only tested the deleted reportFatalErrorsToSubscribers; the surfacing behavior is exercised by the integration/e2e suites.

Net: a single reporting path (the metadata-store diff handler), with no change to the init-error surfacing #5781 added — construction is provably side-effect-free, so the pre-publish window no longer has a failure mode.

Any related issues, documentation, discussions?

Resolves #5921 — the follow-up refactor agreed during the #5781 review.

How was this PR tested?

scalafmtCheckAll clean. This is a behavior-preserving refactor of init-error reporting: the single reporting path is the metadata-store diff handler that already surfaced post-publish errors, and construction is side-effect-free so no error can escape it. The full compile, scalafix, and the integration/e2e suites that exercise init-error surfacing run in CI.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF.

…rkflowExecutionService Per apache#5921 (follow-up to apache#5781): make error reporting a single site owned by WorkflowExecutionService instead of the two-phase split in WorkflowService.initExecutionService. - WorkflowExecutionService registers its fatalErrors -> WorkflowErrorEvent diff handler as the FIRST construction action. Construction does no external work and cannot throw (workflowSettings assignment, WebsocketInput creation, handler registration); all throwing work is in executeWorkflow(), which runs after the execution is published, so its failures surface through this same handler. - WorkflowService.initExecutionService drops the pre-publish errorSubject fallback (reportFatalErrorsToSubscribers + the executionPublished gating); the catch is simply errorHandler(e). Removes the now-unused errorSubject field and its connect() subscription. - Remove WorkflowServiceSpec (it only tested the removed reportFatalErrorsToSubscribers); the behavior is exercised by the integration/e2e suites. Resolves apache#5921.

github-actions · 2026-06-24T03:29:56Z

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

Contributors with relevant context: @Yicong-Huang
You can notify them by mentioning @Yicong-Huang in a comment.

Copilot

Pull request overview

This PR refactors Amber’s workflow execution initialization error surfacing to use a single reporting path owned by WorkflowExecutionService (the execution metadata-store diff handler), removing the pre-publish websocket fallback previously implemented in WorkflowService.

Changes:

Removed WorkflowService’s pre-publish errorSubject fallback and simplified init-time exception handling to rely on the metadata-store diff handler.
Reordered WorkflowExecutionService construction to register the fatal-error/state diff handler before other initialization steps.
Deleted WorkflowServiceSpec which only covered the removed fallback helper.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
amber/src/main/scala/org/apache/texera/web/service/WorkflowService.scala	Removes `errorSubject` + pre-publish fallback logic; relies on metadata-store diff handler for fatal error surfacing.
amber/src/main/scala/org/apache/texera/web/service/WorkflowExecutionService.scala	Registers metadata-store diff handler as the first constructor action to ensure fatal error/state events always have an emitter.
amber/src/test/scala/org/apache/texera/web/service/WorkflowServiceSpec.scala	Removes unit tests tied to the deleted fallback method.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

github-actions · 2026-06-24T03:35:09Z

⚠️ Benchmark changes need a look

🟢 4 better · 🔴 4 worse · ⚪ 7 noise (<±5%) · 0 without baseline

Compared against main 16a3ddf benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

	config	throughput	MB/s	latency	max Δ latest / 7d
🔴	bs=10 sw=10 sl=64	426	0.26	24,060/30,872/30,872 us	🔴 +12.9% / 🔴 +101.3%
🔴	bs=100 sw=10 sl=64	907	0.554	107,931/153,813/153,813 us	🟢 -7.7% / 🔴 +39.1%
⚪	bs=1000 sw=10 sl=64	1,120	0.683	899,255/952,462/952,462 us	⚪ within ±5% / 🟢 +14.3%

Baseline details

Latest main 16a3ddf from same runner

config	metric	PR	latest main	7d avg	Δ latest	Δ 7d
bs=10 sw=10 sl=64	throughput	426 tuples/sec	456 tuples/sec	757.16 tuples/sec	-6.6%	-43.7%
bs=10 sw=10 sl=64	MB/s	0.26 MB/s	0.279 MB/s	0.462 MB/s	-6.8%	-43.7%
bs=10 sw=10 sl=64	p50	24,060 us	21,301 us	12,971 us	+12.9%	+85.5%
bs=10 sw=10 sl=64	p95	30,872 us	33,174 us	15,333 us	-6.9%	+101.3%
bs=10 sw=10 sl=64	p99	30,872 us	33,174 us	18,732 us	-6.9%	+64.8%
bs=100 sw=10 sl=64	throughput	907 tuples/sec	943 tuples/sec	957.66 tuples/sec	-3.8%	-5.3%
bs=100 sw=10 sl=64	MB/s	0.554 MB/s	0.575 MB/s	0.585 MB/s	-3.7%	-5.2%
bs=100 sw=10 sl=64	p50	107,931 us	101,067 us	103,982 us	+6.8%	+3.8%
bs=100 sw=10 sl=64	p95	153,813 us	166,581 us	110,583 us	-7.7%	+39.1%
bs=100 sw=10 sl=64	p99	153,813 us	166,581 us	118,369 us	-7.7%	+29.9%
bs=1000 sw=10 sl=64	throughput	1,120 tuples/sec	1,108 tuples/sec	979.6 tuples/sec	+1.1%	+14.3%
bs=1000 sw=10 sl=64	MB/s	0.683 MB/s	0.676 MB/s	0.598 MB/s	+1.0%	+14.2%
bs=1000 sw=10 sl=64	p50	899,255 us	900,555 us	1,024,553 us	-0.1%	-12.2%
bs=1000 sw=10 sl=64	p95	952,462 us	972,396 us	1,063,789 us	-2.0%	-10.5%
bs=1000 sw=10 sl=64	p99	952,462 us	972,396 us	1,096,239 us	-2.0%	-13.1%

Raw CSV

config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,469.53,200,128000,426,0.260,24059.55,30872.20,30872.20
1,100,10,64,20,2204.62,2000,1280000,907,0.554,107930.87,153813.25,153813.25
2,1000,10,64,20,17860.97,20000,12800000,1120,0.683,899255.00,952462.43,952462.43

codecov-commenter · 2026-06-24T03:45:49Z

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.62%. Comparing base (16a3ddf) to head (5777ac8).
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
...rg/apache/texera/web/service/WorkflowService.scala	0.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #5922      +/-   ##
============================================
- Coverage     56.64%   56.62%   -0.02%     
+ Complexity     3028     3017      -11     
============================================
  Files          1124     1124              
  Lines         43298    43291       -7     
  Branches       4667     4666       -1     
============================================
- Hits          24525    24513      -12     
+ Misses        17335    17333       -2     
- Partials       1438     1445       +7

Flag	Coverage Δ		*Carryforward flag
access-control-service	`70.00% <ø> (ø)`		Carriedforward from 7de1803
agent-service	`44.59% <ø> (ø)`		Carriedforward from 7de1803
amber	`58.54% <50.00%> (-0.08%)`	⬇️
computing-unit-managing-service	`0.00% <ø> (ø)`		Carriedforward from 7de1803
config-service	`52.30% <ø> (ø)`		Carriedforward from 7de1803
file-service	`62.81% <ø> (ø)`		Carriedforward from 7de1803
frontend	`49.42% <ø> (+0.02%)`	⬆️	Carriedforward from 7de1803
notebook-migration-service	`78.57% <ø> (ø)`		Carriedforward from 7de1803
pyamber	`90.20% <ø> (ø)`		Carriedforward from 7de1803
python	`90.76% <ø> (ø)`		Carriedforward from 7de1803
workflow-compiling-service	`55.14% <ø> (ø)`		Carriedforward from 7de1803

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Yicong-Huang · 2026-06-24T03:57:30Z

Need test coverage ;)

@Yicong-Huang

Address review (Copilot + @Yicong-Huang): deleting WorkflowServiceSpec left the single-reporting-site invariant untested. Add WorkflowExecutionServiceSpec: - a recorded fatal error surfaces as a WorkflowErrorEvent through the metadata-store diff handler registered at construction (the regression guard for init-error surfacing); - a state change emits a WorkflowStateEvent (handler's other branch). The construction-unused controllerConfig/resultService are passed as null on purpose, so a future change that does external work during construction (which would reopen the pre-publish gap) fails this test.

aglinxinyuan · 2026-06-24T07:32:17Z

Added test coverage in 351fed2f — WorkflowExecutionServiceSpec guards that a recorded fatal error surfaces as a WorkflowErrorEvent via the metadata-store handler (plus a state change → WorkflowStateEvent). 🙂

Yicong-Huang

LGTM, left comments inline

…trim comments Address review feedback on apache#5922: - Simplify the init-error comments in WorkflowService / WorkflowExecutionService to drop the alternative-design framing (single reporting site / pre-publish fallback), keeping only the mechanism. - Add a regression test asserting fatal errors recorded at successive phases (mirroring WorkflowService's errorHandler, which the service invokes at the compile / runtime-creation / startWorkflow phases) all surface as WorkflowErrorEvents through the construction-registered diff handler. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@Yicong-Huang

…rkflowExecutionService (apache#5922) ### What changes were proposed in this PR? Consolidates execution init-error reporting into a **single site owned by `WorkflowExecutionService`**, replacing the two-phase split apache#5781 introduced (per @Yicong-Huang's suggestion, tracked in apache#5921). - `WorkflowExecutionService` now registers its `fatalErrors → WorkflowErrorEvent` (and state) diff handler as the **first** construction action. Construction itself does no external work and cannot throw — it only assigns `workflowSettings`, creates a `WebsocketInput` (a `PublishSubject`), and registers the handler. All throwing work lives in `executeWorkflow()`, which runs **after** `executionService.onNext(...)` publishes the execution, so any failure there is recorded by `errorHandler` into the metadata store and surfaced by the handler that `connectToExecution` forwards. - `WorkflowService.initExecutionService` drops the pre-publish `errorSubject` fallback: the `executionPublished` gating and `reportFatalErrorsToSubscribers` are gone, and the catch is simply `errorHandler(e)`. The now-unused `errorSubject` field and its `connect()` subscription are removed. - `WorkflowServiceSpec` is removed — it only tested the deleted `reportFatalErrorsToSubscribers`; the surfacing behavior is exercised by the integration/e2e suites. Net: a single reporting path (the metadata-store diff handler), with no change to the init-error surfacing apache#5781 added — construction is provably side-effect-free, so the pre-publish window no longer has a failure mode. ### Any related issues, documentation, discussions? Resolves apache#5921 — the follow-up refactor agreed during the apache#5781 review. ### How was this PR tested? `scalafmtCheckAll` clean. This is a behavior-preserving refactor of init-error reporting: the single reporting path is the metadata-store diff handler that already surfaced post-publish errors, and construction is side-effect-free so no error can escape it. The full compile, scalafix, and the integration/e2e suites that exercise init-error surfacing run in CI. ### Was this PR authored or co-authored using generative AI tooling? Co-authored with Claude Opus 4.8 in compliance with ASF. --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

Reworks the loop back-edge write address per the apache#5900 review decision: the URI is constant per-execution config, not per-iteration data, so it no longer rides State rows / the StateFrame envelope (the merged 3-column format has no loop_start_state_uri). Instead the scheduler resolves it and ships it to workers at setup: - PhysicalOp gains a compile-time `isLoopStart` marker (set by LoopStartOpDesc, mirroring requiresMaterializedExecution). - WorkflowExecutionCoordinator derives {LoopStart logical op id -> state URI of its single input port} from the final resource-allocated schedule; the two single-port/single-reader guards that lived in the Python worker's _compute_loop_start_id move controller-side. - InitializeExecutorRequest gains `map<string,string> loopStartStateUris` (field 4), sent to every worker each region (re-)run; the Python handler stores it on Context. No static LoopStart<->LoopEnd pairing is needed: a Loop End selects the entry by the loop_start_id carried on the StateFrame it consumes (nested loops work unchanged). - MainLoop: _compute_loop_start_id shrinks to the id parse; _jump_to_loop_start looks the URI up from the injected config and fails loudly (before the jump RPC) if it is missing; the inner-LoopStart nested pass-through gate keys on frame.loop_start_id instead of the removed URI field. - LoopIntegrationSpec adopts the per-suite id isolation from apache#5888 (specId 5), and WorkflowService drops a leftover errorSubject call removed on main by apache#5922. Tests: reworked test_main_loop.py to the 3-arg emit convention and the config-driven jump (plus a new missing-URI fail-loud case), added InitializeExecutorHandler coverage for the new field, pinned isLoopStart true/false in the Loop Start/End descriptor specs.

Copilot AI review requested due to automatic review settings June 24, 2026 03:29

github-actions Bot assigned aglinxinyuan Jun 24, 2026

github-actions Bot added the engine label Jun 24, 2026

Copilot started reviewing on behalf of aglinxinyuan June 24, 2026 03:30 View session

Copilot AI reviewed Jun 24, 2026

View reviewed changes

Comment thread amber/src/main/scala/org/apache/texera/web/service/WorkflowService.scala Outdated

aglinxinyuan requested a review from Yicong-Huang June 24, 2026 03:35

Yicong-Huang reviewed Jun 29, 2026

View reviewed changes

Comment thread amber/src/main/scala/org/apache/texera/web/service/WorkflowService.scala Outdated

Yicong-Huang approved these changes Jun 29, 2026

View reviewed changes

Comment thread amber/src/test/scala/org/apache/texera/web/service/WorkflowExecutionServiceSpec.scala

aglinxinyuan and others added 2 commits June 28, 2026 20:40

Merge branch 'main' into consolidate-execservice-error-reporting

5777ac8

aglinxinyuan enabled auto-merge June 29, 2026 03:56

aglinxinyuan added this pull request to the merge queue Jun 29, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 29, 2026

Yicong-Huang added this pull request to the merge queue Jun 29, 2026

Merged via the queue into apache:main with commit 7f3950f Jun 29, 2026
18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor(execution-service): consolidate init-error reporting into WorkflowExecutionService#5922

refactor(execution-service): consolidate init-error reporting into WorkflowExecutionService#5922
Yicong-Huang merged 4 commits into
apache:mainfrom
aglinxinyuan:consolidate-execservice-error-reporting

aglinxinyuan commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

github-actions Bot commented Jun 24, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jun 24, 2026 •

edited

Loading

Uh oh!

Yicong-Huang commented Jun 24, 2026

Uh oh!

aglinxinyuan commented Jun 24, 2026

Uh oh!

Uh oh!

Yicong-Huang left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

aglinxinyuan commented Jun 24, 2026

What changes were proposed in this PR?

Any related issues, documentation, discussions?

How was this PR tested?

Was this PR authored or co-authored using generative AI tooling?

Uh oh!

github-actions Bot commented Jun 24, 2026

Automated Reviewer Suggestions

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

github-actions Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Benchmark changes need a look

Uh oh!

codecov-commenter commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Yicong-Huang commented Jun 24, 2026

Uh oh!

aglinxinyuan commented Jun 24, 2026

Uh oh!

Uh oh!

Yicong-Huang left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

github-actions Bot commented Jun 24, 2026 •

edited

Loading

codecov-commenter commented Jun 24, 2026 •

edited

Loading