feat: Add struct/map as unsupported map key/value for columnar shuffle by viirya · Pull Request #84 · apache/datafusion-comet

viirya · 2024-02-22T04:28:07Z

Which issue does this PR close?

Closes #.

Rationale for this change

This patch adds more unsupported partitioning types, so that user applications can fall back to Spark early instead of running into an error.
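As an illustration of the idea, here is a minimal sketch of a plan-time type check that rejects struct and map as a map key or value and reports a fallback. It is not Comet's actual code: the `DataType` enum, the `ShufflePlan` enum, and every function name below are hypothetical, and the real check may cover more types.

```rust
/// Hypothetical, simplified type model (not Comet's or Spark's real types).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int,
    Long,
    Utf8,
    Struct(Vec<DataType>),
    Map(Box<DataType>, Box<DataType>),
}

/// Assumed rule from the PR title: struct and map are not allowed
/// as a map key or a map value.
fn supported_as_map_entry(dt: &DataType) -> bool {
    !matches!(dt, DataType::Struct(_) | DataType::Map(_, _))
}

/// Recursively decides whether a type can go through columnar shuffle.
/// Treating top-level structs as supported is an assumption of this sketch.
fn supported_for_columnar_shuffle(dt: &DataType) -> bool {
    match dt {
        DataType::Int | DataType::Long | DataType::Utf8 => true,
        DataType::Struct(fields) => fields.iter().all(supported_for_columnar_shuffle),
        DataType::Map(key, value) => {
            supported_as_map_entry(key) && supported_as_map_entry(value)
        }
    }
}

#[derive(Debug, PartialEq)]
enum ShufflePlan {
    Columnar,
    FallbackToSpark,
}

/// Decide up front, before any data is processed, so an unsupported
/// type leads to a fallback rather than a runtime error.
fn plan_shuffle(partition_types: &[DataType]) -> ShufflePlan {
    if partition_types.iter().all(supported_for_columnar_shuffle) {
        ShufflePlan::Columnar
    } else {
        ShufflePlan::FallbackToSpark
    }
}
```

The point of the design is that the decision happens while planning, so a map with a struct or map entry never reaches the native shuffle writer.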

What changes are included in this PR?

How are these changes tested?

sunchao

LGTM - has been reviewed internally

viirya · 2024-02-22T19:25:41Z

Merged. Thanks.

…e#76/apache#84, Rust foundation)

Add the executor-side Change Data Feed read using delta-kernel-rs's TableChanges API, the kernel-native way to read change data (corrects the earlier mistaken "kernel can't do CDC").

- read_cdf_via_kernel: TableChanges::try_new(url, start, Option<end>) -> into_scan_builder().build().execute() -> Arrow batches with data + _change_type / _commit_version / _commit_timestamp (column mapping / partitions / DVs resolved by kernel). Single-task: kernel's CDF per-file API (scan_metadata / CdfScanFile / get_cdf_transform_expr) is pub(crate), so only the all-in-one execute() is usable -- one partition reads the whole bounded version range.
- DeltaKernelScanExec gains `cdf: Option<(start, end)>`; read_all routes to read_all_cdf, which reuses the by-name assembler to place the CDF columns in scan.output order.
- proto: DeltaScanCommon.cdf_read / cdf_start_version / cdf_end_version. core maps them and bypasses the kernel data-column schema requirement (CDF ships no per-file tasks/schemas).

Additive + dormant: the Scala side doesn't set cdf_read yet, so existing reads are unchanged (contrib Rust tests 109/0).

Remaining (apache#84): Scala CDF detection/routing -- intercept DeltaCDFRelation (path readChangeFeed) + CdcAddFileIndex (streaming), extract start/end versions + the CDF output schema -- then CometDeltaCdcSuite, then retire DeltaSyntheticColumnsExec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
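The routing on the optional `cdf` field described in this commit message might look roughly like the sketch below. Only the shape `cdf: Option<(start, end)>` comes from the message; the `ScanConfig` struct, the `ReadPath` enum, the `route` function, and the use of `u64` for versions are hypothetical stand-ins, not the actual DeltaKernelScanExec code.

```rust
/// Table version number; `u64` is an assumption for this sketch.
type Version = u64;

/// Hypothetical stand-in for the scan's configuration.
struct ScanConfig {
    /// `Some((start, end))` requests a Change Data Feed read;
    /// `None` means a regular table read.
    cdf: Option<(Version, Option<Version>)>,
}

/// Which read path the scan would take (hypothetical names).
#[derive(Debug, PartialEq)]
enum ReadPath {
    Regular,
    Cdf { start: Version, end: Option<Version> },
}

/// Route to the CDF path only when the optional field is set, so
/// existing reads that never set it are left unchanged.
fn route(config: &ScanConfig) -> ReadPath {
    match config.cdf {
        Some((start, end)) => ReadPath::Cdf { start, end },
        None => ReadPath::Regular,
    }
}
```

Keeping the CDF request in an `Option` is what makes the change additive: a scan that leaves the field unset follows the same path as before.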

Groundwork for native Change Data Feed reads.

A readChangeFeed read is a RowDataSourceScanExec over CDCReader$DeltaCDFRelation (a CatalystScan BaseRelation), NOT a FileSourceScanExec/HadoopFsRelation -- which is why Comet doesn't currently intercept it and CDC declines.

Add DeltaReflection.isCdfRelation / extractCdfTableRoot / extractCdfVersions to pull the table root and inclusive [start, end] version range a native kernel TableChanges read needs. Guarded by CometDeltaCdfReflectionReproSuite, which also pins the plan-shape assumption (RowDataSourceScanExec over DeltaCDFRelation).

Note for the follow-up interception (recorded on apache#84): the marker->native conversion is AQE-query-stage-prep-gated, and CDF reads aren't AQE-wrapped, so the existing CometDeltaScanMarker path won't convert CDF -- the interception must emit CometNativeExec(CometDeltaNativeScanExec) directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…pache#84)

readChangeFeed reads previously declined to vanilla Spark: a CDF read is a RowDataSourceScanExec over DeltaCDFRelation (a CatalystScan), a scan family Comet never intercepted. Wire it natively end-to-end:

- DeltaIntegration (core): isCdfRelation (pure class-name check) + transformCdf reflection bridge to the contrib.
- CometExecRule (core): intercept RowDataSourceScanExec(DeltaCDFRelation) and replace it with the native exec. Runs in preColumnarTransitions + query-stage prep, so it fires on simple non-AQE CDF plans too.
- CometDeltaNativeScan.convertCdf (contrib): build a DeltaScanCommon with kernel_read + cdf_read + the [start, end] version range and the full CDF output schema (data + _change_type/_commit_version/_commit_timestamp), gated by COMET_DELTA_NATIVE_ENABLED. Clamps a requested endingVersion that exceeds the table's latest (kernel's TableChanges errors where Delta clamps).
- CometDeltaCdfScanExec (new contrib exec): single-partition CometLeafExec that runs the cdf DeltaScan op; the native DeltaKernelScanExec reconstructs delta-kernel TableChanges(start, end) and calls execute(). No per-file tasks, DPP, or encryption coupling.
- DeltaReflection.extractCdfLatestVersion: the relation's pinned snapshot version, used for the end-version clamp.

The native read path (read_cdf_via_kernel / read_all_cdf, proto cdf fields, planner wiring) already existed from dc60b71; this commit supplies the Scala interception that reaches it.

Tests (verified under -Dspark.version=4.1.1; full contrib package 152/0): CometDeltaCdcSuite flipped from "asserts decline" to "asserts native engagement + matches vanilla" (3/3); CometDeltaScanConfAuditSuite GAP CDF flipped to assert engagement; CometDeltaCdfReflectionReproSuite adds an engagement+correctness test.

This unblocks deleting DeltaSyntheticColumnsExec (apache#82) -- CDC no longer needs it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
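The end-version clamp mentioned in this commit message can be sketched as below. The function name `clamp_end_version` and the `u64` version type are assumptions made for illustration; the actual clamp lives on the Scala side, and leaving an absent end version untouched is a choice of this sketch, not something the message states.

```rust
/// Table version number; `u64` is an assumption for this sketch.
type Version = u64;

/// Clamp a requested ending version to the table's latest version.
/// A request beyond `latest` is lowered to `latest` instead of being
/// passed through to a reader that would reject it.
/// `None` (no ending version requested) is left as-is here.
fn clamp_end_version(requested_end: Option<Version>, latest: Version) -> Option<Version> {
    requested_end.map(|end| end.min(latest))
}
```

For example, with the latest version at 5, a requested end of 10 becomes 5, while a requested end of 3 is kept.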

…nthesis only (apache#82)

With CDC now reading natively (apache#84), nothing legitimately needs the legacy stacked DeltaSyntheticColumnsExec. Close its two remaining routes in CometDeltaNativeScan.convert:

- The `case None` subset residual (was the CDC-family + rare DML declines) now declines to vanilla Spark via withFallbackReason instead of buildTaskListFromAddFiles. CDC-family reads no longer reach here (they're intercepted upstream as CometDeltaCdfScanExec); the rare DML declines (CM-id materialised row_commit_version; OPTIMIZE file-not-found race) correctly fall back to Spark.
- Remove the `spark.comet.delta.synthesizeInWorker.enabled` escape hatch: synthesizeInWorker is now `!isSubsetFileIndex` (regular reads always synthesize in-worker; matched DML rewrites too; everything else declines). So synthesizeInWorker is always true at proto-build time and the native side never constructs the exec.

Deletes the now-dead buildTaskListFromAddFiles + its partition-name map. The native DeltaSyntheticColumnsExec is now dead code (removed in a follow-up).

Verified: full contrib package 152/0 under -Dspark.version=4.1.1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The exec is now dead code: with CDC reading natively (apache#84) and the synthesizeInWorker escape hatch removed (b604ccf), the driver never sets synthesize_in_worker=false for a native read, so core's planner never constructs DeltaSyntheticColumnsExec. Remove it:

- Delete contrib/delta/native/src/synthetic_columns.rs (the exec).
- native/core planner (delta_scan.rs): drop the need_synthetics branch that built the exec; DeltaKernelScanExec is the only scan exec now and always applies the DV itself (apply_dv = true). In-worker synthesis (data + partitions + row_index/row_id/is_row_deleted/row_commit_version/_metadata.*, by name) is the sole native synthesis path.
- lib.rs: drop the module + its doc reference.

Verified: full contrib package 152/0 under -Dspark.version=4.1.1 with the exec deleted. Residual cosmetic cleanup left as follow-up (the now-always-true apply_dv field + the set-but-ignored proto emit flags + stale comments).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…pache#85)

The design docs had drifted from the kernel-read refactor (apache#76/apache#77/apache#80/apache#81/apache#82/apache#84/#2/apache#78/apache#86). Audited all 13 docs against current code and corrected:

- Removed the deleted ParquetSource + DV-sweep + DeltaSyntheticColumnsExec read stack as the "current" path everywhere; it is now kernel-read only (apache#50/apache#82), with DeltaKernelScanExec doing in-worker synthesis. The old stack is kept only as clearly-labeled history / rejected alternatives.
- delta_scan.rs is a ~72-line shim delegating to comet_contrib_delta::planner (apache#77); column-mapping physicalisation dropped, kernel ships the schemas (apache#76).
- CDF (readChangeFeed) is kernel-native via TableChanges -> CometDeltaCdfScanExec, split multi-partition (apache#84/#2) -- corrected docs that called it unsupported, declined, or a synthetic-columns fallback.
- 08-known-limitations.md: removed all of Part B (B1-B9 were development-time regressions, all now fixed + guarded) and A3 (path-based CDF now engages native, apache#84); kept only genuine current limitations (A1 DPP residual, A2e credential residual, A4 VARIANT, A5 decline gates, A6 INT96 kernel gap, A7 CM-id repoint). 466 -> 230 lines.
- Fixed config keys, build/module layout, JNI symbols, file paths, CI workflow references, and supported-feature lists (added CDF, _metadata, INT96) across the build / README / user-guide docs.

Every claim verified against code; markdown passes prettier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat: Add struct/map as unsupported map key/value for columnar shuffle

abe8252

sunchao approved these changes Feb 22, 2024

viirya merged commit b9b7441 into apache:main Feb 22, 2024

viirya merged 1 commit into apache:main from viirya:add_unsupported_types
