test(workflow-operator): add unit test coverage for scan source executors by aglinxinyuan · Pull Request #6076 · apache/texera

aglinxinyuan · 2026-07-03T07:03:21Z

What changes were proposed in this PR?

Add unit test coverage for the four file-reading scan source executors, selected from the Codecov report. No production-code changes. Every test is driven by a local temp file (file: URI → ReadonlyLocalFileDocument); no live DB/network/S3/Iceberg.

File	Codecov before	Missed lines targeted
`ParallelCSVScanSourceOpExec.scala`	0%	43 — construct + schema init, `open` byte-range partitioning (both `endOffset` branches + mid-file partial-line skip + header skip), `produceTuple` happy path, short-row null padding, all-null (blank) line discard, `close`
`JSONLScanSourceOpExec.scala`	0%	23 — construct, `produceTuple`, `open` line counting + worker partitioning (both ternary branches), row limit, `close`
`CSVOldScanSourceOpExec.scala`	0%	21 — construct + header-based schema, `open` (header vs no-header start offset), `produceTuple` + limit, `close` null-guard before open
`ArrowSourceOpExec.scala`	0%	30 — construct, `open` success + error-wrapping path, `produceTuple` batch iteration + offset/limit, `close` no-op before open

New specs live in each executor's own package so their private[...] constructors are reachable, mirroring the existing FileScanSourceOpExecSpec temp-file/URI pattern.

Any related issues, documentation, discussions?

Follow-up to the review feedback on #6043: prioritize tests that fill uncovered code paths.

How was this PR tested?

sbt "WorkflowOperator/testOnly *ParallelCSVScanSourceOpExecSpec *JSONLScanSourceOpExecSpec *CSVOldScanSourceOpExecSpec *ArrowSourceOpExecSpec" — 17 tests, all green
sbt "WorkflowOperator/Test/scalafmtCheck" and sbt "WorkflowOperator/scalafixAll --check" — clean

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])

…tors Targets Codecov-missed lines in the file-reading source executors, all driven by local temp files (no live services): ParallelCSVScanSourceOpExec (43), JSONLScanSourceOpExec (23), CSVOldScanSourceOpExec (21), ArrowSourceOpExec (30) — open/produceTuple/close, worker partitioning, offset/limit, short-row padding, blank-line discard, and the Arrow open-failure path.

github-actions · 2026-07-03T07:03:38Z

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

No candidates found from git blame history.

codecov-commenter · 2026-07-03T07:06:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.48%. Comparing base (a53d95a) to head (ce6ed2f).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #6076      +/-   ##
============================================
+ Coverage     57.21%   57.48%   +0.27%     
- Complexity     3103     3139      +36     
============================================
  Files          1130     1130              
  Lines         43825    43825              
  Branches       4747     4747              
============================================
+ Hits          25075    25194     +119     
+ Misses        17313    17189     -124     
- Partials       1437     1442       +5

Flag	Coverage Δ		*Carryforward flag
access-control-service	`70.00% <ø> (ø)`
agent-service	`44.59% <ø> (ø)`		Carriedforward from 1c5cf64
amber	`60.19% <ø> (+0.70%)`	⬆️
computing-unit-managing-service	`0.00% <ø> (ø)`
config-service	`52.30% <ø> (ø)`
file-service	`62.81% <ø> (ø)`
frontend	`50.12% <ø> (ø)`		Carriedforward from 1c5cf64
notebook-migration-service	`78.57% <ø> (ø)`
pyamber	`91.15% <ø> (ø)`		Carriedforward from 1c5cf64
python	`90.69% <ø> (ø)`		Carriedforward from 1c5cf64
workflow-compiling-service	`55.14% <ø> (ø)`

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot

Pull request overview

This PR adds Scala unit tests in common/workflow-operator to increase coverage for four file-reading scan source executors (CSV, legacy CSV, JSONL, Arrow) without modifying production code. The tests are primarily temp-file driven and exercise executor open/produceTuple/close paths plus partitioning/limit behavior.

Changes:

Add ParallelCSVScanSourceOpExecSpec covering header handling, null-padding, blank-line discard, and multi-worker byte-range partitioning.
Add JSONLScanSourceOpExecSpec covering basic JSONL ingestion, worker partitioning, and row limiting.
Add CSVOldScanSourceOpExecSpec and ArrowSourceOpExecSpec covering schema/iteration/offset+limit/error wrapping and close-before-open behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/source/scan/json/JSONLScanSourceOpExecSpec.scala	New JSONL executor unit tests for reading, partitioning, and limit behavior.
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/source/scan/csvOld/CSVOldScanSourceOpExecSpec.scala	New legacy CSV executor unit tests for header/no-header reading, limit, and safe close.
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/source/scan/csv/ParallelCSVScanSourceOpExecSpec.scala	New parallel CSV executor unit tests for parsing edge cases and multi-worker byte partitioning.
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/source/scan/arrow/ArrowSourceOpExecSpec.scala	New Arrow executor unit tests for open/produceTuple iteration, offset+limit, error wrapping, and safe close.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

github-actions · 2026-07-03T07:08:51Z

⚠️ Benchmark changes need a look

🟢 2 better · 🔴 1 worse · ⚪ 12 noise (<±5%) · 0 without baseline

Compared against main a53d95a benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

	config	throughput	MB/s	latency	max Δ latest / 7d
🔴	bs=10 sw=10 sl=64	390	0.238	25,166/30,115/30,115 us	🟢 -16.9% / 🔴 +97.0%
⚪	bs=100 sw=10 sl=64	798	0.487	123,286/150,325/150,325 us	⚪ within ±5% / 🔴 +38.7%
⚪	bs=1000 sw=10 sl=64	913	0.557	1,098,136/1,128,440/1,128,440 us	⚪ within ±5% / 🔴 +10.1%

Baseline details

Latest main a53d95a from same runner

config	metric	PR	latest main	7d avg	Δ latest	Δ 7d
bs=10 sw=10 sl=64	throughput	390 tuples/sec	405 tuples/sec	770.95 tuples/sec	-3.7%	-49.4%
bs=10 sw=10 sl=64	MB/s	0.238 MB/s	0.247 MB/s	0.471 MB/s	-3.6%	-49.4%
bs=10 sw=10 sl=64	p50	25,166 us	22,127 us	12,775 us	+13.7%	+97.0%
bs=10 sw=10 sl=64	p95	30,115 us	36,251 us	15,286 us	-16.9%	+97.0%
bs=10 sw=10 sl=64	p99	30,115 us	36,251 us	18,795 us	-16.9%	+60.2%
bs=100 sw=10 sl=64	throughput	798 tuples/sec	814 tuples/sec	976.93 tuples/sec	-2.0%	-18.3%
bs=100 sw=10 sl=64	MB/s	0.487 MB/s	0.497 MB/s	0.596 MB/s	-2.0%	-18.3%
bs=100 sw=10 sl=64	p50	123,286 us	121,673 us	102,557 us	+1.3%	+20.2%
bs=100 sw=10 sl=64	p95	150,325 us	147,390 us	108,383 us	+2.0%	+38.7%
bs=100 sw=10 sl=64	p99	150,325 us	147,390 us	115,249 us	+2.0%	+30.4%
bs=1000 sw=10 sl=64	throughput	913 tuples/sec	918 tuples/sec	1,009 tuples/sec	-0.5%	-9.5%
bs=1000 sw=10 sl=64	MB/s	0.557 MB/s	0.561 MB/s	0.616 MB/s	-0.7%	-9.6%
bs=1000 sw=10 sl=64	p50	1,098,136 us	1,091,196 us	997,695 us	+0.6%	+10.1%
bs=1000 sw=10 sl=64	p95	1,128,440 us	1,145,592 us	1,036,731 us	-1.5%	+8.8%
bs=1000 sw=10 sl=64	p99	1,128,440 us	1,145,592 us	1,069,334 us	-1.5%	+5.5%

Raw CSV

config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,512.96,200,128000,390,0.238,25166.46,30115.24,30115.24
1,100,10,64,20,2506.31,2000,1280000,798,0.487,123286.39,150324.86,150324.86
2,1000,10,64,20,21908.36,20000,12800000,913,0.557,1098135.77,1128440.04,1128440.04

- JSONL: write keys name-then-id so the sorted-field-order assertion actually validates reordering - ParallelCSV: use uneven row widths so the two-worker split lands mid-row, deterministically exercising the leading-partial-line skip

mengw15

LGTM

Copilot AI review requested due to automatic review settings July 3, 2026 07:03

github-actions Bot assigned aglinxinyuan Jul 3, 2026

github-actions Bot added the common label Jul 3, 2026

Copilot started reviewing on behalf of aglinxinyuan July 3, 2026 07:03 View session

Copilot AI reviewed Jul 3, 2026

View reviewed changes

Comment thread ...test/scala/org/apache/texera/amber/operator/source/scan/json/JSONLScanSourceOpExecSpec.scala

Comment thread ...scala/org/apache/texera/amber/operator/source/scan/csv/ParallelCSVScanSourceOpExecSpec.scala Outdated

aglinxinyuan requested a review from mengw15 July 3, 2026 07:41

mengw15 approved these changes Jul 3, 2026

View reviewed changes

aglinxinyuan added this pull request to the merge queue Jul 3, 2026

Merged via the queue into apache:main with commit 48ab6ca Jul 3, 2026
23 checks passed

aglinxinyuan deleted the test-scan-source-execs branch July 3, 2026 09:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test(workflow-operator): add unit test coverage for scan source executors#6076

test(workflow-operator): add unit test coverage for scan source executors#6076
aglinxinyuan merged 2 commits into
apache:mainfrom
aglinxinyuan:test-scan-source-execs

aglinxinyuan commented Jul 3, 2026

Uh oh!

github-actions Bot commented Jul 3, 2026

Uh oh!

codecov-commenter commented Jul 3, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jul 3, 2026 •

edited

Loading

Uh oh!

mengw15 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

aglinxinyuan commented Jul 3, 2026

What changes were proposed in this PR?

Any related issues, documentation, discussions?

How was this PR tested?

Was this PR authored or co-authored using generative AI tooling?

Uh oh!

github-actions Bot commented Jul 3, 2026

Automated Reviewer Suggestions

Uh oh!

codecov-commenter commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Benchmark changes need a look

Uh oh!

mengw15 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Jul 3, 2026 •

edited

Loading

github-actions Bot commented Jul 3, 2026 •

edited

Loading