feat: Reduce memory consumption when writing sorted shuffle files #82

Merged
sunchao merged 1 commit into apache:main from sunchao:reduce-memory-shuffle on Feb 22, 2024
@sunchao (Member) commented Feb 21, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

Currently, in columnar shuffle, when a spill is triggered, Comet writes sorted shuffle data to disk on the native side. As part of this process, it also performs row-to-columnar conversion and writes the shuffle data out in Arrow format. However, the row-to-columnar conversion tries to buffer all of the shuffle data in memory before writing it to disk. This can cause OOM, since spills are usually triggered when memory is already under pressure and the Spark executor may not have enough memory left.

This PR changes the logic so that it no longer buffers all shuffle data in memory. Instead, it flushes the data to disk as soon as a certain threshold is reached. An internal config parameter, spark.comet.columnar.shuffle.batch.size, is introduced to control the threshold; by default it is 8K.
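The threshold-based flushing described here can be illustrated with a minimal sketch. It is written in Python for brevity and is not the PR's native implementation: the names `BatchedSpillWriter` and `write_sorted_rows` are hypothetical, rows are modeled as plain tuples, and the default of 8192 rows is an assumed reading of "8K".

```python
from typing import Callable, Iterable, List, Tuple

Row = Tuple  # hypothetical stand-in for a shuffle row


class BatchedSpillWriter:
    """Buffers rows and hands them to `flush` once `batch_size` rows accumulate."""

    def __init__(self, flush: Callable[[List[Row]], None], batch_size: int = 8192):
        # 8192 is an assumed interpretation of the "8K" default.
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self._flush = flush
        self._batch_size = batch_size
        self._buffer: List[Row] = []
        self.peak_buffered = 0  # largest number of rows held in memory at once

    def append(self, row: Row) -> None:
        self._buffer.append(row)
        self.peak_buffered = max(self.peak_buffered, len(self._buffer))
        # Flush as soon as the threshold is reached instead of buffering everything.
        if len(self._buffer) >= self._batch_size:
            self._flush_buffer()

    def close(self) -> None:
        # Write out whatever is left in the final, possibly partial, batch.
        if self._buffer:
            self._flush_buffer()

    def _flush_buffer(self) -> None:
        self._flush(self._buffer)
        self._buffer = []  # start a fresh buffer; the flushed list is not reused


def write_sorted_rows(
    rows: Iterable[Row],
    flush: Callable[[List[Row]], None],
    batch_size: int = 8192,
) -> int:
    """Streams `rows` through the writer and returns the peak buffered row count."""
    writer = BatchedSpillWriter(flush, batch_size)
    for row in rows:
        writer.append(row)
    writer.close()
    return writer.peak_buffered
```

With 20,000 input rows and a batch size of 8192, the sketch flushes three batches (8192, 8192, and 3616 rows) and never holds more than 8192 rows in memory, whereas buffering everything would hold all 20,000.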

What changes are included in this PR?

Changed the logic for writing sorted shuffle data so that it no longer buffers all data in memory, but instead writes it out batch by batch.

How are these changes tested?

…ache#1490)

This tries to reduce memory consumption when writing out sorted shuffle files. Currently, when doing the row-to-columnar conversion, we try to append all the outputs to array builders before writing them out to disk.
@sunchao merged commit 71d0e59 into apache:main on Feb 22, 2024

@sunchao (Member, Author) commented Feb 22, 2024

Thanks, merged

@sunchao deleted the reduce-memory-shuffle branch on February 22, 2024 19:21
schenksj added a commit to schenksj/datafusion-comet that referenced this pull request May 3, 2026
`merge-metrics: delete-only with duplicates - Partitioned=false,
CDF=false` test asserts `numTargetFilesAdded == 1`. Vanilla Spark 3.5
produces 1 because AQE coalesces the post-MERGE shuffle partitions
down to 1. With Comet's `CometColumnarExchange` participating in the
shuffle chain, AQE's coalesce settles at 2 partitions, producing 2
output files. Both outputs are equally correct -- the test author
anticipated this in MergeIntoMetricsBase.scala line 1024:
"Depending on the Spark version, for non-partitioned tables we may
add 1 or 2 files."

Update the Spark-3.5 shim from 1 to 2 in the regression diff. The
underlying Comet exchange / AQE-coalesce interaction is logged for
follow-up in Task apache#82 (Item 9), but the test itself is now satisfied.

Verified by `DescribeDeltaHistorySuite -z "delete-only with
duplicates - Partitioned = false, CDF = false"` passing in isolation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 9, 2026
…pache#84)

readChangeFeed reads previously declined to vanilla Spark: a CDF read is a
RowDataSourceScanExec over DeltaCDFRelation (a CatalystScan), a scan family
Comet never intercepted. Wire it natively end-to-end:

- DeltaIntegration (core): isCdfRelation (pure class-name check) + transformCdf
  reflection bridge to the contrib.
- CometExecRule (core): intercept RowDataSourceScanExec(DeltaCDFRelation) and
  replace it with the native exec. Runs in preColumnarTransitions + query-stage
  prep, so it fires on simple non-AQE CDF plans too.
- CometDeltaNativeScan.convertCdf (contrib): build a DeltaScanCommon with
  kernel_read + cdf_read + the [start, end] version range and the full CDF
  output schema (data + _change_type/_commit_version/_commit_timestamp), gated
  by COMET_DELTA_NATIVE_ENABLED. Clamps a requested endingVersion that exceeds
  the table's latest (kernel's TableChanges errors where Delta clamps).
- CometDeltaCdfScanExec (new contrib exec): single-partition CometLeafExec that
  runs the cdf DeltaScan op; the native DeltaKernelScanExec reconstructs
  delta-kernel TableChanges(start, end) and calls execute(). No per-file tasks,
  DPP, or encryption coupling.
- DeltaReflection.extractCdfLatestVersion: the relation's pinned snapshot
  version, used for the end-version clamp.
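
The end-version clamp mentioned in the bullets above can be sketched as follows. This is a minimal illustration in Python rather than the commit's Scala: the function name `clamp_cdf_range`, the treatment of a missing end version as "latest", and the error raised for a start version beyond the latest are all assumptions, not taken from the commit.

```python
from typing import Optional, Tuple


def clamp_cdf_range(start: int, requested_end: Optional[int], latest: int) -> Tuple[int, int]:
    """Returns a [start, end] version range whose end never exceeds `latest`.

    Hypothetical helper: the range handed to the change-feed reader is
    clamped here, so the reader never sees an end version past the table's
    latest one.
    """
    if start > latest:
        # Assumed behavior: a start past the latest version is rejected.
        raise ValueError(f"start version {start} is after latest version {latest}")
    # Assumed behavior: no requested end means "read up to the latest version".
    end = latest if requested_end is None else min(requested_end, latest)
    if end < start:
        raise ValueError(f"end version {end} is before start version {start}")
    return start, end
```

For example, a request for versions 2 through 10 against a table whose latest version is 7 yields the range (2, 7), while a request for 2 through 5 is left unchanged.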

The native read path (read_cdf_via_kernel / read_all_cdf, proto cdf fields,
planner wiring) already existed from dc60b71; this commit supplies the Scala
interception that reaches it.

Tests (verified under -Dspark.version=4.1.1; full contrib package 152/0):
CometDeltaCdcSuite flipped from "asserts decline" to "asserts native engagement
+ matches vanilla" (3/3); CometDeltaScanConfAuditSuite GAP CDF flipped to assert
engagement; CometDeltaCdfReflectionReproSuite adds an engagement+correctness
test. This unblocks deleting DeltaSyntheticColumnsExec (apache#82) -- CDC no longer
needs it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 9, 2026
…nthesis only (apache#82)

With CDC now reading natively (apache#84), nothing legitimately needs the legacy stacked
DeltaSyntheticColumnsExec. Close its two remaining routes in CometDeltaNativeScan.convert:

- The `case None` subset residual (was the CDC-family + rare DML declines) now declines to
  vanilla Spark via withFallbackReason instead of buildTaskListFromAddFiles. CDC-family reads
  no longer reach here (they're intercepted upstream as CometDeltaCdfScanExec); the rare DML
  declines (CM-id materialised row_commit_version; OPTIMIZE file-not-found race) correctly fall
  back to Spark.
- Remove the `spark.comet.delta.synthesizeInWorker.enabled` escape hatch: synthesizeInWorker is
  now `!isSubsetFileIndex` (regular reads always synthesize in-worker; matched DML rewrites too;
  everything else declines). So synthesizeInWorker is always true at proto-build time and the
  native side never constructs the exec.

Deletes the now-dead buildTaskListFromAddFiles + its partition-name map. The native
DeltaSyntheticColumnsExec is now dead code (removed in a follow-up). Verified: full contrib
package 152/0 under -Dspark.version=4.1.1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 9, 2026
The exec is now dead code: with CDC reading natively (apache#84) and the
synthesizeInWorker escape hatch removed (b604ccf), the driver never sets
synthesize_in_worker=false for a native read, so core's planner never
constructs DeltaSyntheticColumnsExec. Remove it:

- Delete contrib/delta/native/src/synthetic_columns.rs (the exec).
- native/core planner (delta_scan.rs): drop the need_synthetics branch that
  built the exec; DeltaKernelScanExec is the only scan exec now and always
  applies the DV itself (apply_dv = true). In-worker synthesis (data +
  partitions + row_index/row_id/is_row_deleted/row_commit_version/_metadata.*,
  by name) is the sole native synthesis path.
- lib.rs: drop the module + its doc reference.

Verified: full contrib package 152/0 under -Dspark.version=4.1.1 with the exec
deleted. Residual cosmetic cleanup left as follow-up (the now-always-true
apply_dv field + the set-but-ignored proto emit flags + stale comments).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 9, 2026
 follow-up)

The spark.comet.delta.synthesizeInWorker.enabled escape hatch was removed in
b604ccf (in-worker synthesis is now the only native path); drop the now-unused
config entry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 9, 2026
…pache#85)

The design docs had drifted from the kernel-read refactor (apache#76/apache#77/apache#80/apache#81/apache#82/
apache#84/#2/apache#78/apache#86). Audited all 13 docs against current code and corrected:

- Removed the deleted ParquetSource + DV-sweep + DeltaSyntheticColumnsExec read
  stack as the "current" path everywhere; it is now kernel-read only (apache#50/apache#82),
  with DeltaKernelScanExec doing in-worker synthesis. The old stack is kept only
  as clearly-labeled history / rejected alternatives.
- delta_scan.rs is a ~72-line shim delegating to comet_contrib_delta::planner
  (apache#77); column-mapping physicalisation dropped, kernel ships the schemas (apache#76).
- CDF (readChangeFeed) is kernel-native via TableChanges -> CometDeltaCdfScanExec,
  split multi-partition (apache#84/#2) -- corrected docs that called it unsupported,
  declined, or a synthetic-columns fallback.
- 08-known-limitations.md: removed all of Part B (B1-B9 were development-time
  regressions, all now fixed + guarded) and A3 (path-based CDF now engages
  native, apache#84); kept only genuine current limitations (A1 DPP residual, A2e
  credential residual, A4 VARIANT, A5 decline gates, A6 INT96 kernel gap, A7
  CM-id repoint). 466 -> 230 lines.
- Fixed config keys, build/module layout, JNI symbols, file paths, CI workflow
  references, and supported-feature lists (added CDF, _metadata, INT96) across
  the build / README / user-guide docs.

Every claim verified against code; markdown passes prettier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2 participants