Skip to content

test(workflow-operator): add ReservoirSamplingOpExec spec#5384

Merged
aglinxinyuan merged 8 commits into
apache:mainfrom
Ma77Ball:test/reservoir-sampling-op-exec
Jun 7, 2026
Merged

test(workflow-operator): add ReservoirSamplingOpExec spec#5384
aglinxinyuan merged 8 commits into
apache:mainfrom
Ma77Ball:test/reservoir-sampling-op-exec

Conversation

@Ma77Ball

@Ma77Ball Ma77Ball commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

  • Add ReservoirSamplingOpExecSpec, the first dedicated spec for the streaming reservoir-sampling executor.
  • Cover processTuple buffering (emits nothing per tuple), and onFinish returning all input tuples in order when input size equals k.
  • Cover keeping exactly k unique, input-drawn samples when input exceeds k, determinism of the seeded RNG (with a check that replacement actually happens), multi-worker partitioning of k via equallyPartitionGoal (k=10 across 3 workers yields 4,3,3), and open resetting state for executor reuse.
  • Characterize the input-size-below-k edge case, where the unfilled fixed-size reservoir emits null padding on finish (flagged in the spec as a likely bug for a follow-up fix).

Any related issues, documentation, or discussions?

Closes: #5383

How was this PR tested?

  • Run sbt "WorkflowOperator/testOnly *ReservoirSamplingOpExecSpec" and expect all 7 examples to pass.
  • This is a test-only PR (no production code changed), so the spec itself is the verification; the input-below-k example intentionally asserts the current null-padding behavior, so a green run confirms that characterization rather than a fix.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

@codecov-commenter

codecov-commenter commented Jun 5, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 51.45%. Comparing base (5c2eaa2) to head (a2bcf92).

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5384      +/-   ##
============================================
- Coverage     51.61%   51.45%   -0.16%     
- Complexity     2448     2460      +12     
============================================
  Files          1067     1065       -2     
  Lines         41255    41203      -52     
  Branches       4437     4426      -11     
============================================
- Hits          21292    21202      -90     
- Misses        18712    18754      +42     
+ Partials       1251     1247       -4     
Flag Coverage Δ *Carryforward flag
access-control-service 42.22% <ø> (ø)
agent-service 33.76% <ø> (ø) Carriedforward from 8aa81af
amber 52.29% <ø> (+0.13%) ⬆️
computing-unit-managing-service 1.65% <ø> (ø)
config-service 56.06% <ø> (ø)
file-service 38.32% <ø> (ø)
frontend 45.81% <ø> (-0.56%) ⬇️ Carriedforward from 8aa81af
pyamber 90.84% <ø> (+0.15%) ⬆️ Carriedforward from 8aa81af
python 90.84% <ø> (ø) Carriedforward from 8aa81af
workflow-compiling-service 58.69% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Ma77Ball

Ma77Ball commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

/request-review @aglinxinyuan

@github-actions github-actions Bot requested a review from aglinxinyuan June 5, 2026 18:16
@aglinxinyuan aglinxinyuan enabled auto-merge June 6, 2026 22:39
@aglinxinyuan aglinxinyuan added this pull request to the merge queue Jun 7, 2026
Merged via the queue into apache:main with commit 38f5ac5 Jun 7, 2026
18 checks passed
@xuang7

xuang7 commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Hi @Ma77Ball, we would like to backport this PR because #5606 depends on the tests introduced here. Could you open a backport PR targeting the release/v1.2 branch? Thank you!

yangzhang75 pushed a commit to yangzhang75/texera that referenced this pull request Jun 22, 2026
### What changes were proposed in this PR?
- Add `ReservoirSamplingOpExecSpec`, the first dedicated spec for the
streaming reservoir-sampling executor.
- Cover `processTuple` buffering (emits nothing per tuple), and
`onFinish` returning all input tuples in order when input size equals k.
- Cover keeping exactly k unique, input-drawn samples when input exceeds
k, determinism of the seeded RNG (with a check that replacement actually
happens), multi-worker partitioning of k via `equallyPartitionGoal`
(k=10 across 3 workers yields 4,3,3), and `open` resetting state for
executor reuse.
- Characterize the input-size-below-k edge case, where the unfilled
fixed-size reservoir emits null padding on finish (flagged in the spec
as a likely bug for a follow-up fix).
### Any related issues, documentation, or discussions?
Closes: apache#5383
### How was this PR tested?
- Run `sbt "WorkflowOperator/testOnly *ReservoirSamplingOpExecSpec"` and
expect all 7 examples to pass.
- This is a test-only PR (no production code changed), so the spec
itself is the verification; the input-below-k example intentionally
asserts the current null-padding behavior, so a green run confirms that
characterization rather than a fix.
### Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add unit tests for ReservoirSamplingOpExec

5 participants