feat: Reduce memory consumption when writing sorted shuffle files by sunchao · Pull Request #82 · apache/datafusion-comet

sunchao · 2024-02-21T22:20:57Z

Which issue does this PR close?

Closes #.

Rationale for this change

Currently in columnar shuffle, when spills are triggered, Comet will write sorted shuffle data to disk on the native side. As part of the process, it will also do row to columnar conversion, and write out the shuffle data in Arrow format. However, the row to columnar conversion will try to buffer all the shuffle data in memory before writing them to disk. This will potentially cause OOM since spills usually are triggered when memory is under pressure, and the Spark executor may not have enough memory left.

Instead, this PR changes the logic so it no longer buffers all shuffle data in memory. Instead, it immediately flushes out the data to disk once a certain threshold has been reached. A (internal) config parameter spark.comet.columnar.shuffle.batch.size is introduced to control the threshold, and by default it is 8K.

What changes are included in this PR?

Changed the logic of writing sorted shuffle data to no longer buffer all data in memory, but instead batch by batch.

How are these changes tested?

…ache#1490) This tries to reduce the memory consumption during writing out sorted shuffle files. Currently when doing the row to columnar conversion, we will append try to append all the outputs in array builders before writing them out to disk.

sunchao · 2024-02-22T19:21:18Z

Thanks, merged

`merge-metrics: delete-only with duplicates - Partitioned=false, CDF=false` test asserts `numTargetFilesAdded == 1`. Vanilla Spark 3.5 produces 1 because AQE coalesces the post-MERGE shuffle partitions down to 1. With Comet's `CometColumnarExchange` participating in the shuffle chain, AQE's coalesce settles at 2 partitions, producing 2 output files. Both outputs are equally correct -- the test author anticipated this in MergeIntoMetricsBase.scala line 1024: "Depending on the Spark version, for non-partitioned tables we may add 1 or 2 files." Update the Spark-3.5 shim from 1 to 2 in the regression diff. The underlying Comet exchange / AQE-coalesce interaction is logged for follow-up in Task apache#82 (Item 9), but the test itself is now satisfied. Verified by `DescribeDeltaHistorySuite -z "delete-only with duplicates - Partitioned = false, CDF = false"` passing in isolation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…pache#84) readChangeFeed reads previously declined to vanilla Spark: a CDF read is a RowDataSourceScanExec over DeltaCDFRelation (a CatalystScan), a scan family Comet never intercepted. Wire it natively end-to-end: - DeltaIntegration (core): isCdfRelation (pure class-name check) + transformCdf reflection bridge to the contrib. - CometExecRule (core): intercept RowDataSourceScanExec(DeltaCDFRelation) and replace it with the native exec. Runs in preColumnarTransitions + query-stage prep, so it fires on simple non-AQE CDF plans too. - CometDeltaNativeScan.convertCdf (contrib): build a DeltaScanCommon with kernel_read + cdf_read + the [start, end] version range and the full CDF output schema (data + _change_type/_commit_version/_commit_timestamp), gated by COMET_DELTA_NATIVE_ENABLED. Clamps a requested endingVersion that exceeds the table's latest (kernel's TableChanges errors where Delta clamps). - CometDeltaCdfScanExec (new contrib exec): single-partition CometLeafExec that runs the cdf DeltaScan op; the native DeltaKernelScanExec reconstructs delta-kernel TableChanges(start, end) and calls execute(). No per-file tasks, DPP, or encryption coupling. - DeltaReflection.extractCdfLatestVersion: the relation's pinned snapshot version, used for the end-version clamp. The native read path (read_cdf_via_kernel / read_all_cdf, proto cdf fields, planner wiring) already existed from dc60b71; this commit supplies the Scala interception that reaches it. Tests (verified under -Dspark.version=4.1.1; full contrib package 152/0): CometDeltaCdcSuite flipped from "asserts decline" to "asserts native engagement + matches vanilla" (3/3); CometDeltaScanConfAuditSuite GAP CDF flipped to assert engagement; CometDeltaCdfReflectionReproSuite adds an engagement+correctness test. This unblocks deleting DeltaSyntheticColumnsExec (apache#82) -- CDC no longer needs it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nthesis only (apache#82) With CDC now reading natively (apache#84), nothing legitimately needs the legacy stacked DeltaSyntheticColumnsExec. Close its two remaining routes in CometDeltaNativeScan.convert: - The `case None` subset residual (was the CDC-family + rare DML declines) now declines to vanilla Spark via withFallbackReason instead of buildTaskListFromAddFiles. CDC-family reads no longer reach here (they're intercepted upstream as CometDeltaCdfScanExec); the rare DML declines (CM-id materialised row_commit_version; OPTIMIZE file-not-found race) correctly fall back to Spark. - Remove the `spark.comet.delta.synthesizeInWorker.enabled` escape hatch: synthesizeInWorker is now `!isSubsetFileIndex` (regular reads always synthesize in-worker; matched DML rewrites too; everything else declines). So synthesizeInWorker is always true at proto-build time and the native side never constructs the exec. Deletes the now-dead buildTaskListFromAddFiles + its partition-name map. The native DeltaSyntheticColumnsExec is now dead code (removed in a follow-up). Verified: full contrib package 152/0 under -Dspark.version=4.1.1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The exec is now dead code: with CDC reading natively (apache#84) and the synthesizeInWorker escape hatch removed (b604ccf), the driver never sets synthesize_in_worker=false for a native read, so core's planner never constructs DeltaSyntheticColumnsExec. Remove it: - Delete contrib/delta/native/src/synthetic_columns.rs (the exec). - native/core planner (delta_scan.rs): drop the need_synthetics branch that built the exec; DeltaKernelScanExec is the only scan exec now and always applies the DV itself (apply_dv = true). In-worker synthesis (data + partitions + row_index/row_id/is_row_deleted/row_commit_version/_metadata.*, by name) is the sole native synthesis path. - lib.rs: drop the module + its doc reference. Verified: full contrib package 152/0 under -Dspark.version=4.1.1 with the exec deleted. Residual cosmetic cleanup left as follow-up (the now-always-true apply_dv field + the set-but-ignored proto emit flags + stale comments). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

follow-up) The spark.comet.delta.synthesizeInWorker.enabled escape hatch was removed in b604ccf (in-worker synthesis is now the only native path); drop the now-unused config entry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…pache#85) The design docs had drifted from the kernel-read refactor (apache#76/apache#77/apache#80/apache#81/apache#82/ apache#84/#2/apache#78/apache#86). Audited all 13 docs against current code and corrected: - Removed the deleted ParquetSource + DV-sweep + DeltaSyntheticColumnsExec read stack as the "current" path everywhere; it is now kernel-read only (apache#50/apache#82), with DeltaKernelScanExec doing in-worker synthesis. The old stack is kept only as clearly-labeled history / rejected alternatives. - delta_scan.rs is a ~72-line shim delegating to comet_contrib_delta::planner (apache#77); column-mapping physicalisation dropped, kernel ships the schemas (apache#76). - CDF (readChangeFeed) is kernel-native via TableChanges -> CometDeltaCdfScanExec, split multi-partition (apache#84/#2) -- corrected docs that called it unsupported, declined, or a synthetic-columns fallback. - 08-known-limitations.md: removed all of Part B (B1-B9 were development-time regressions, all now fixed + guarded) and A3 (path-based CDF now engages native, apache#84); kept only genuine current limitations (A1 DPP residual, A2e credential residual, A4 VARIANT, A5 decline gates, A6 INT96 kernel gap, A7 CM-id repoint). 466 -> 230 lines. - Fixed config keys, build/module layout, JNI symbols, file paths, CI workflow references, and supported-feature lists (added CDF, _metadata, INT96) across the build / README / user-guide docs. Every claim verified against code; markdown passes prettier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

viirya approved these changes Feb 21, 2024

View reviewed changes

sunchao merged commit 71d0e59 into apache:main Feb 22, 2024

sunchao deleted the reduce-memory-shuffle branch February 22, 2024 19:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Reduce memory consumption when writing sorted shuffle files#82

feat: Reduce memory consumption when writing sorted shuffle files#82
sunchao merged 1 commit into
apache:mainfrom
sunchao:reduce-memory-shuffle

sunchao commented Feb 21, 2024

Uh oh!

sunchao commented Feb 22, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sunchao commented Feb 21, 2024

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

sunchao commented Feb 22, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants