feat: Add struct/map as unsupported map key/value for columnar shuffle #84
Merged
Conversation
sunchao
approved these changes
Feb 22, 2024
sunchao
left a comment
Member
LGTM - has been reviewed internally
Member
Author
Merged. Thanks.
schenksj
added a commit
to schenksj/datafusion-comet
that referenced
this pull request
Jun 9, 2026
…e#76/apache#84, Rust foundation)

Add the executor-side Change Data Feed read using delta-kernel-rs's TableChanges API, the kernel-native way to read change data (corrects the earlier mistaken "kernel can't do CDC").

- read_cdf_via_kernel: TableChanges::try_new(url, start, Option<end>) -> into_scan_builder().build().execute() -> Arrow batches with data + _change_type / _commit_version / _commit_timestamp (column mapping / partitions / DVs resolved by kernel). Single-task: kernel's CDF per-file API (scan_metadata / CdfScanFile / get_cdf_transform_expr) is pub(crate), so only the all-in-one execute() is usable -- one partition reads the whole bounded version range.
- DeltaKernelScanExec gains `cdf: Option<(start, end)>`; read_all routes to read_all_cdf, which reuses the by-name assembler to place the CDF columns in scan.output order.
- proto: DeltaScanCommon.cdf_read / cdf_start_version / cdf_end_version. core maps them and bypasses the kernel data-column schema requirement (CDF ships no per-file tasks/schemas).

Additive + dormant: the Scala side doesn't set cdf_read yet, so existing reads are unchanged (contrib Rust tests 109/0).

Remaining (apache#84): Scala CDF detection/routing -- intercept DeltaCDFRelation (path readChangeFeed) + CdcAddFileIndex (streaming), extract start/end versions + the CDF output schema -- then CometDeltaCdcSuite, then retire DeltaSyntheticColumnsExec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
schenksj
added a commit
to schenksj/datafusion-comet
that referenced
this pull request
Jun 9, 2026
Groundwork for native Change Data Feed reads.

A readChangeFeed read is a RowDataSourceScanExec over CDCReader$DeltaCDFRelation (a CatalystScan BaseRelation), NOT a FileSourceScanExec/HadoopFsRelation -- which is why Comet doesn't currently intercept it and CDC declines.

Add DeltaReflection.isCdfRelation / extractCdfTableRoot / extractCdfVersions to pull the table root and inclusive [start, end] version range a native kernel TableChanges read needs. Guarded by CometDeltaCdfReflectionReproSuite, which also pins the plan-shape assumption (RowDataSourceScanExec over DeltaCDFRelation).

Note for the follow-up interception (recorded on apache#84): the marker->native conversion is AQE-query-stage-prep-gated, and CDF reads aren't AQE-wrapped, so the existing CometDeltaScanMarker path won't convert CDF -- the interception must emit CometNativeExec(CometDeltaNativeScanExec) directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
schenksj
added a commit
to schenksj/datafusion-comet
that referenced
this pull request
Jun 9, 2026
…pache#84)

readChangeFeed reads previously declined to vanilla Spark: a CDF read is a RowDataSourceScanExec over DeltaCDFRelation (a CatalystScan), a scan family Comet never intercepted. Wire it natively end-to-end:

- DeltaIntegration (core): isCdfRelation (pure class-name check) + transformCdf reflection bridge to the contrib.
- CometExecRule (core): intercept RowDataSourceScanExec(DeltaCDFRelation) and replace it with the native exec. Runs in preColumnarTransitions + query-stage prep, so it fires on simple non-AQE CDF plans too.
- CometDeltaNativeScan.convertCdf (contrib): build a DeltaScanCommon with kernel_read + cdf_read + the [start, end] version range and the full CDF output schema (data + _change_type/_commit_version/_commit_timestamp), gated by COMET_DELTA_NATIVE_ENABLED. Clamps a requested endingVersion that exceeds the table's latest (kernel's TableChanges errors where Delta clamps).
- CometDeltaCdfScanExec (new contrib exec): single-partition CometLeafExec that runs the cdf DeltaScan op; the native DeltaKernelScanExec reconstructs delta-kernel TableChanges(start, end) and calls execute(). No per-file tasks, DPP, or encryption coupling.
- DeltaReflection.extractCdfLatestVersion: the relation's pinned snapshot version, used for the end-version clamp.

The native read path (read_cdf_via_kernel / read_all_cdf, proto cdf fields, planner wiring) already existed from dc60b71; this commit supplies the Scala interception that reaches it.

Tests (verified under -Dspark.version=4.1.1; full contrib package 152/0): CometDeltaCdcSuite flipped from "asserts decline" to "asserts native engagement + matches vanilla" (3/3); CometDeltaScanConfAuditSuite GAP CDF flipped to assert engagement; CometDeltaCdfReflectionReproSuite adds an engagement+correctness test.

This unblocks deleting DeltaSyntheticColumnsExec (apache#82) -- CDC no longer needs it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
schenksj
added a commit
to schenksj/datafusion-comet
that referenced
this pull request
Jun 9, 2026
…nthesis only (apache#82)

With CDC now reading natively (apache#84), nothing legitimately needs the legacy stacked DeltaSyntheticColumnsExec. Close its two remaining routes in CometDeltaNativeScan.convert:

- The `case None` subset residual (was the CDC-family + rare DML declines) now declines to vanilla Spark via withFallbackReason instead of buildTaskListFromAddFiles. CDC-family reads no longer reach here (they're intercepted upstream as CometDeltaCdfScanExec); the rare DML declines (CM-id materialised row_commit_version; OPTIMIZE file-not-found race) correctly fall back to Spark.
- Remove the `spark.comet.delta.synthesizeInWorker.enabled` escape hatch: synthesizeInWorker is now `!isSubsetFileIndex` (regular reads always synthesize in-worker; matched DML rewrites too; everything else declines). So synthesizeInWorker is always true at proto-build time and the native side never constructs the exec.

Deletes the now-dead buildTaskListFromAddFiles + its partition-name map. The native DeltaSyntheticColumnsExec is now dead code (removed in a follow-up).

Verified: full contrib package 152/0 under -Dspark.version=4.1.1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
schenksj
added a commit
to schenksj/datafusion-comet
that referenced
this pull request
Jun 9, 2026
The exec is now dead code: with CDC reading natively (apache#84) and the synthesizeInWorker escape hatch removed (b604ccf), the driver never sets synthesize_in_worker=false for a native read, so core's planner never constructs DeltaSyntheticColumnsExec. Remove it:

- Delete contrib/delta/native/src/synthetic_columns.rs (the exec).
- native/core planner (delta_scan.rs): drop the need_synthetics branch that built the exec; DeltaKernelScanExec is the only scan exec now and always applies the DV itself (apply_dv = true). In-worker synthesis (data + partitions + row_index/row_id/is_row_deleted/row_commit_version/_metadata.*, by name) is the sole native synthesis path.
- lib.rs: drop the module + its doc reference.

Verified: full contrib package 152/0 under -Dspark.version=4.1.1 with the exec deleted.

Residual cosmetic cleanup left as follow-up (the now-always-true apply_dv field + the set-but-ignored proto emit flags + stale comments).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
schenksj
added a commit
to schenksj/datafusion-comet
that referenced
this pull request
Jun 9, 2026
…pache#85)

The design docs had drifted from the kernel-read refactor (apache#76/apache#77/apache#80/apache#81/apache#82/apache#84/#2/apache#78/apache#86). Audited all 13 docs against current code and corrected:

- Removed the deleted ParquetSource + DV-sweep + DeltaSyntheticColumnsExec read stack as the "current" path everywhere; it is now kernel-read only (apache#50/apache#82), with DeltaKernelScanExec doing in-worker synthesis. The old stack is kept only as clearly-labeled history / rejected alternatives.
- delta_scan.rs is a ~72-line shim delegating to comet_contrib_delta::planner (apache#77); column-mapping physicalisation dropped, kernel ships the schemas (apache#76).
- CDF (readChangeFeed) is kernel-native via TableChanges -> CometDeltaCdfScanExec, split multi-partition (apache#84/#2) -- corrected docs that called it unsupported, declined, or a synthetic-columns fallback.
- 08-known-limitations.md: removed all of Part B (B1-B9 were development-time regressions, all now fixed + guarded) and A3 (path-based CDF now engages native, apache#84); kept only genuine current limitations (A1 DPP residual, A2e credential residual, A4 VARIANT, A5 decline gates, A6 INT96 kernel gap, A7 CM-id repoint). 466 -> 230 lines.
- Fixed config keys, build/module layout, JNI symbols, file paths, CI workflow references, and supported-feature lists (added CDF, _metadata, INT96) across the build / README / user-guide docs.

Every claim verified against code; markdown passes prettier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Which issue does this PR close?
Closes #.
Rationale for this change
This patch adds more unsupported partitioning types, so user applications can fall back to Spark early instead of running into an error.
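A minimal sketch of the kind of early check described above, not the code from this PR. The `DataType` enum, `supported_for_columnar_shuffle`, and `choose_shuffle` are hypothetical names invented for illustration, and the handling of every type other than a map with a struct or map key/value is an assumption. The real check lives in Comet's planning code and may differ.

```rust
/// Hypothetical, simplified stand-in for Spark SQL data types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int,
    Utf8,
    Array(Box<DataType>),
    Struct(Vec<DataType>),
    Map(Box<DataType>, Box<DataType>),
}

/// Outcome of the planning-time check.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ShuffleChoice {
    Columnar,
    FallBackToSpark,
}

/// True when the type is a struct or a map, the two kinds this sketch
/// rejects as a map key or map value.
fn is_struct_or_map(dt: &DataType) -> bool {
    matches!(dt, DataType::Struct(_) | DataType::Map(_, _))
}

/// Recursively decides whether a type can go through columnar shuffle.
/// Assumption: only maps with struct/map keys or values are rejected;
/// all other types are accepted if their children are accepted.
fn supported_for_columnar_shuffle(dt: &DataType) -> bool {
    match dt {
        DataType::Int | DataType::Utf8 => true,
        DataType::Array(elem) => supported_for_columnar_shuffle(elem),
        DataType::Struct(fields) => fields.iter().all(supported_for_columnar_shuffle),
        DataType::Map(key, value) => {
            !is_struct_or_map(key)
                && !is_struct_or_map(value)
                && supported_for_columnar_shuffle(key)
                && supported_for_columnar_shuffle(value)
        }
    }
}

/// Checks every column type up front, so an unsupported type leads to a
/// fallback at planning time rather than an error during execution.
fn choose_shuffle(column_types: &[DataType]) -> ShuffleChoice {
    if column_types.iter().all(supported_for_columnar_shuffle) {
        ShuffleChoice::Columnar
    } else {
        ShuffleChoice::FallBackToSpark
    }
}
```

The point of the design is that the decision is made once, from the schema alone, before any data is shuffled.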
What changes are included in this PR?
How are these changes tested?