feat(amber): supporting consistent operator stats retrieval by shengquan-ni · Pull Request #3557 · apache/texera

shengquan-ni · 2025-07-11T20:38:57Z

This PR makes operator stats retrieval in QueryStats consistent by querying operators in reverse topological order.

Problem

Previously, we retrieved operator statistics by sending direct control messages to each worker independently. This approach could result in inconsistent snapshots: for example, a downstream operator might appear as completed while its upstream is still running, which is logically impossible in a pipelined execution.

Consistent Retrieval via Reverse-Topological Order

We now compute a layered topological order of operators, where operators in the same layer have the same rank, and retrieve stats layer by layer in reverse topological order. Only after all relevant stats are retrieved do we update the execution state and send the update to the frontend. This guarantees that downstream stats are not visible before their upstream stats are available, maintaining a consistent global view.
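The layering and retrieval order described above can be sketched as follows. This is a minimal Python illustration, not the Scala code in this PR: `layered_topological_order`, `query_stats_consistently`, and the `fetch` callback are hypothetical names, and the rank of an operator is assumed to be the length of the longest path from any source operator.

```python
from collections import defaultdict


def layered_topological_order(operators, edges):
    """Group operators into layers; layer i holds the operators of rank i.

    Assumption: rank = length of the longest path from any source operator.
    `edges` is an iterable of (upstream, downstream) pairs.
    """
    downstream = defaultdict(list)
    in_degree = {op: 0 for op in operators}
    for up, down in edges:
        downstream[up].append(down)
        in_degree[down] += 1

    layers = []
    current = sorted(op for op, deg in in_degree.items() if deg == 0)
    visited = 0
    while current:
        layers.append(current)
        visited += len(current)
        next_layer = []
        for op in current:
            for down in downstream[op]:
                in_degree[down] -= 1
                # An operator joins the next layer once its last upstream is placed.
                if in_degree[down] == 0:
                    next_layer.append(down)
        current = sorted(next_layer)
    if visited != len(in_degree):
        raise ValueError("operator graph contains a cycle")
    return layers


def query_stats_consistently(operators, edges, fetch):
    """Fetch stats layer by layer, most-downstream layer first.

    `fetch(op)` is a hypothetical stand-in for the control message round trip.
    The snapshot is returned (published) only after every layer has answered.
    """
    snapshot = {}
    for layer in reversed(layered_topological_order(operators, edges)):
        for op in layer:
            snapshot[op] = fetch(op)
    return snapshot
```

Reading downstream layers first means that, assuming operator states only move forward, an upstream read taken later cannot lag behind the downstream read that preceded it.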

This approach increases retrieval latency from O(1) rounds of control messages to O(number of layers), because each layer is queried only after the previous one has responded. We should therefore avoid issuing stats queries too frequently; doing so may flood the control message queue and delay other control operations.

SubDAG Query and Race Condition Mitigation

In some scenarios, such as querying the stats of a completed worker, we now query the entire subDAG rooted at that worker's operator so that upstream context is also retrieved. This avoids the inconsistencies that localized queries can produce.
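A minimal Python sketch of collecting such a subDAG, under stated assumptions: `sub_dag` is a hypothetical helper, not a function from this PR, and the description does not spell out the traversal direction. The sketch follows whatever direction the supplied edge pairs encode, so passing (upstream, downstream) pairs collects the downstream side and passing reversed pairs collects the upstream side.

```python
from collections import defaultdict, deque


def sub_dag(root, edges):
    """Return `root` plus every operator reachable from it along `edges`.

    `edges` is an iterable of (source, target) pairs; reachability follows
    the pairs as given (breadth-first search).
    """
    neighbors = defaultdict(list)
    for src, dst in edges:
        neighbors[src].append(dst)

    seen = {root}
    queue = deque([root])
    while queue:
        op = queue.popleft()
        for nxt in neighbors[op]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

The resulting operator set could then be layered and queried in the same reverse-topological manner as a global query, restricted to the edges between operators in the set.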

However, this introduces the possibility of a race condition:

1. A global query stats request is fired.
2. Operator A's stats are retrieved at timestamp T₀.
3. A subDAG stats query (including operator A) is fired.
4. Operator A's stats are retrieved again at timestamp T₁.
5. The subDAG query finishes and updates the execution state with timestamp T₁.
6. The earlier global query finishes and overwrites A's stats with older data from T₀.

To prevent this, we now attach a nanosecond-level timestamp to each execution state update, and only allow updates with newer timestamps to overwrite the existing state.
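The timestamp guard can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the PR's Scala implementation: `WorkerExecutionState`, `update`, and `last_update_ns` are hypothetical names, and `time.monotonic_ns()` stands in for whatever nanosecond clock the engine uses.

```python
import time
from dataclasses import dataclass, field


@dataclass
class WorkerExecutionState:
    """Holds the latest stats together with the timestamp they were taken at."""

    stats: dict = field(default_factory=dict)
    last_update_ns: int = -1  # sentinel: no update applied yet

    def update(self, stats, timestamp_ns):
        """Apply `stats` only if they are strictly newer than the stored ones.

        Returns True when the update was applied, False when it was dropped.
        """
        if timestamp_ns <= self.last_update_ns:
            return False
        self.stats = dict(stats)
        self.last_update_ns = timestamp_ns
        return True


# Replay of the race above: the subDAG query (T1) lands before the
# older global query (T0), so the stale T0 update is dropped.
state = WorkerExecutionState()
t0 = time.monotonic_ns()
t1 = t0 + 1
applied_subdag = state.update({"state": "COMPLETED"}, t1)
applied_global = state.update({"state": "RUNNING"}, t0)
```

With this guard, the order in which the two queries finish no longer matters: the state always reflects the snapshot with the largest timestamp.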

aglinxinyuan · 2025-07-11T23:53:18Z

Need more testing. Will do it offline.

wip

c118184

shengquan-ni self-assigned this Jul 11, 2025

shengquan-ni added the engine label Jul 11, 2025

shengquan-ni and others added 2 commits July 11, 2025 13:39

Merge branch 'master' into shengquan-consistent-stats

8e3e593

Update QueryWorkerStatisticsHandler.scala

d940c31

shengquan-ni requested review from aglinxinyuan and bobbai00 July 11, 2025 20:41

shengquan-ni added 2 commits July 11, 2025 13:42

Merge branch 'shengquan-consistent-stats' of https://github.com/Texer…

6bcd481

…a/texera into shengquan-consistent-stats

Update QueryWorkerStatisticsHandler.scala

bf6d2a1

aglinxinyuan approved these changes Jul 11, 2025

shengquan-ni and others added 3 commits July 11, 2025 16:46

Merge branch 'master' into shengquan-consistent-stats

ff81d4e

Update WorkerExecution.scala

6e3c4e0

Merge branch 'shengquan-consistent-stats' of https://github.com/Texer…

201e174

…a/texera into shengquan-consistent-stats

fix

fa31c16

shengquan-ni changed the title ~~feat(engine): supporting consistent operator stats retrieval~~ feat(amer): supporting consistent operator stats retrieval Jul 12, 2025

shengquan-ni changed the title ~~feat(amer): supporting consistent operator stats retrieval~~ feat(amber): supporting consistent operator stats retrieval Jul 12, 2025

shengquan-ni merged commit 875708d into master Jul 12, 2025
11 checks passed

shengquan-ni deleted the shengquan-consistent-stats branch July 12, 2025 21:07

This was referenced Jun 29, 2026

Fast source operator stays orange (RUNNING) after the workflow completes #6010

Open

fix(amber): order worker state by version, not timestamp #6011

Open

shengquan-ni merged 9 commits into master from shengquan-consistent-stats
