feat: Add struct/map as unsupported map key/value for columnar shuffle #84

Merged
viirya merged 1 commit into apache:main from viirya:add_unsupported_types on Feb 22, 2024


viirya commented Feb 22, 2024

Member

Which issue does this PR close?

Closes #.

Rationale for this change

This patch adds struct and map, when used as a map key or value, to the list of unsupported partitioning types for columnar shuffle. User applications can then fall back to Spark early instead of running into an error.
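As a rough illustration of the rule described above, the following sketch models the type check in Rust. The `DataType` enum and both function names are hypothetical and are not taken from Comet or Spark. The only rule grounded in this PR is that a map whose key or value is a struct or a map is rejected. The handling of every other type is an assumption made to keep the example self-contained.

```rust
/// Hypothetical, simplified model of Spark SQL data types.
/// These names are illustrative only; they are not Comet's or Spark's real API.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int,
    Long,
    String,
    Array(Box<DataType>),
    Struct(Vec<DataType>),
    Map(Box<DataType>, Box<DataType>),
}

/// A map key or value is rejected when it is itself a struct or a map,
/// which is the rule the PR title describes.
fn is_unsupported_map_component(dt: &DataType) -> bool {
    matches!(dt, DataType::Struct(_) | DataType::Map(_, _))
}

/// Returns true when a partitioning type could stay on the columnar shuffle
/// path; false means "fall back to Spark at planning time".
/// Assumption: every type other than the map key/value rule is accepted here.
fn supports_columnar_shuffle(dt: &DataType) -> bool {
    match dt {
        DataType::Map(key, value) => {
            !is_unsupported_map_component(key)
                && !is_unsupported_map_component(value)
                && supports_columnar_shuffle(key)
                && supports_columnar_shuffle(value)
        }
        // Recurse so that an offending map nested in an array or struct is found.
        DataType::Array(element) => supports_columnar_shuffle(element),
        DataType::Struct(fields) => fields.iter().all(supports_columnar_shuffle),
        _ => true,
    }
}
```

Running a check of this kind during planning, rather than during shuffle execution, is what would let a query fall back before any native code sees the unsupported type.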

What changes are included in this PR?

How are these changes tested?

sunchao left a comment

Member

LGTM - has been reviewed internally

viirya merged commit b9b7441 into apache:main on Feb 22, 2024

viirya commented Feb 22, 2024

Member Author

Merged. Thanks.

schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 9, 2026
…e#76/apache#84, Rust foundation)

Add the executor-side Change Data Feed read using delta-kernel-rs's TableChanges API, the
kernel-native way to read change data (corrects the earlier mistaken "kernel can't do CDC").

- read_cdf_via_kernel: TableChanges::try_new(url, start, Option<end>) -> into_scan_builder().
  build().execute() -> Arrow batches with data + _change_type / _commit_version /
  _commit_timestamp (column mapping / partitions / DVs resolved by kernel). Single-task: kernel's
  CDF per-file API (scan_metadata / CdfScanFile / get_cdf_transform_expr) is pub(crate), so only
  the all-in-one execute() is usable -- one partition reads the whole bounded version range.
- DeltaKernelScanExec gains `cdf: Option<(start, end)>`; read_all routes to read_all_cdf, which
  reuses the by-name assembler to place the CDF columns in scan.output order.
- proto: DeltaScanCommon.cdf_read / cdf_start_version / cdf_end_version. core maps them and
  bypasses the kernel data-column schema requirement (CDF ships no per-file tasks/schemas).
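The routing and by-name assembly described in the bullets above can be sketched with plain standard-library Rust. This is a minimal illustration and not the contrib crate's code. `ScanConfig`, `ReadPath`, `route`, `Column`, and `assemble_by_name` are hypothetical names, the `u64` version type is an assumption, and a `(name, values)` pair stands in for an Arrow column.

```rust
/// Hypothetical scan configuration. `cdf` mirrors the `Option<(start, end)>`
/// field from the commit message; the integer types are assumptions.
struct ScanConfig {
    cdf: Option<(u64, Option<u64>)>,
}

/// Which read path a scan takes (illustrative only).
#[derive(Debug, PartialEq)]
enum ReadPath {
    Regular,
    Cdf { start: u64, end: Option<u64> },
}

/// Route to the CDF read when a version range is present, else the regular read.
fn route(config: &ScanConfig) -> ReadPath {
    match config.cdf {
        Some((start, end)) => ReadPath::Cdf { start, end },
        None => ReadPath::Regular,
    }
}

/// Stand-in for an Arrow column: a name plus its values.
type Column = (String, Vec<String>);

/// Reorder columns by name so they match the scan's declared output order.
/// Returns an error naming the first output column missing from the batch.
fn assemble_by_name(
    mut batch: Vec<Column>,
    output_order: &[&str],
) -> Result<Vec<Column>, String> {
    let mut ordered = Vec::with_capacity(output_order.len());
    for name in output_order {
        let pos = batch
            .iter()
            .position(|(col_name, _)| col_name.as_str() == *name)
            .ok_or_else(|| format!("missing column: {}", name))?;
        // Order within `batch` is irrelevant, so swap_remove avoids shifting.
        ordered.push(batch.swap_remove(pos));
    }
    Ok(ordered)
}
```

Matching by name rather than by position means the CDF metadata columns (`_change_type`, `_commit_version`, `_commit_timestamp`) can arrive in any order and still land where the output schema expects them.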

Additive + dormant: the Scala side doesn't set cdf_read yet, so existing reads are unchanged
(contrib Rust tests 109/0). Remaining (apache#84): Scala CDF detection/routing -- intercept
DeltaCDFRelation (path readChangeFeed) + CdcAddFileIndex (streaming), extract start/end versions
+ the CDF output schema -- then CometDeltaCdcSuite, then retire DeltaSyntheticColumnsExec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 9, 2026
Groundwork for native Change Data Feed reads. A readChangeFeed read is a
RowDataSourceScanExec over CDCReader$DeltaCDFRelation (a CatalystScan
BaseRelation), NOT a FileSourceScanExec/HadoopFsRelation -- which is why Comet
doesn't currently intercept it and CDC declines.

Add DeltaReflection.isCdfRelation / extractCdfTableRoot / extractCdfVersions to
pull the table root and inclusive [start, end] version range a native kernel
TableChanges read needs. Guarded by CometDeltaCdfReflectionReproSuite, which
also pins the plan-shape assumption (RowDataSourceScanExec over DeltaCDFRelation).

Note for the follow-up interception (recorded on apache#84): the marker->native
conversion is AQE-query-stage-prep-gated, and CDF reads aren't AQE-wrapped, so
the existing CometDeltaScanMarker path won't convert CDF -- the interception
must emit CometNativeExec(CometDeltaNativeScanExec) directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 9, 2026
…pache#84)

readChangeFeed reads previously declined to vanilla Spark: a CDF read is a
RowDataSourceScanExec over DeltaCDFRelation (a CatalystScan), a scan family
Comet never intercepted. Wire it natively end-to-end:

- DeltaIntegration (core): isCdfRelation (pure class-name check) + transformCdf
  reflection bridge to the contrib.
- CometExecRule (core): intercept RowDataSourceScanExec(DeltaCDFRelation) and
  replace it with the native exec. Runs in preColumnarTransitions + query-stage
  prep, so it fires on simple non-AQE CDF plans too.
- CometDeltaNativeScan.convertCdf (contrib): build a DeltaScanCommon with
  kernel_read + cdf_read + the [start, end] version range and the full CDF
  output schema (data + _change_type/_commit_version/_commit_timestamp), gated
  by COMET_DELTA_NATIVE_ENABLED. Clamps a requested endingVersion that exceeds
  the table's latest (kernel's TableChanges errors where Delta clamps).
- CometDeltaCdfScanExec (new contrib exec): single-partition CometLeafExec that
  runs the cdf DeltaScan op; the native DeltaKernelScanExec reconstructs
  delta-kernel TableChanges(start, end) and calls execute(). No per-file tasks,
  DPP, or encryption coupling.
- DeltaReflection.extractCdfLatestVersion: the relation's pinned snapshot
  version, used for the end-version clamp.
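The end-version clamp mentioned in the list above amounts to taking the smaller of the requested end and the table's latest version. The sketch below is a minimal illustration. `clamp_end_version` is a hypothetical name, the `u64` version type is an assumption, and treating a missing `endingVersion` as "read through the latest version" is also an assumption rather than something the commit message states.

```rust
/// Resolve the inclusive end version for a CDF read.
///
/// `requested_end` is the user's endingVersion, if any; `latest` is the
/// snapshot version pinned by the relation. A request past `latest` is
/// clamped down to `latest` instead of being passed through.
/// Assumption: no requested end means "read up to `latest`".
fn clamp_end_version(requested_end: Option<u64>, latest: u64) -> u64 {
    match requested_end {
        Some(end) => end.min(latest),
        None => latest,
    }
}
```

Clamping before the range is handed to the native side matters because, per the commit message, kernel's TableChanges returns an error in the case where Delta clamps.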

The native read path (read_cdf_via_kernel / read_all_cdf, proto cdf fields,
planner wiring) already existed from dc60b71; this commit supplies the Scala
interception that reaches it.

Tests (verified under -Dspark.version=4.1.1; full contrib package 152/0):
CometDeltaCdcSuite flipped from "asserts decline" to "asserts native engagement
+ matches vanilla" (3/3); CometDeltaScanConfAuditSuite GAP CDF flipped to assert
engagement; CometDeltaCdfReflectionReproSuite adds an engagement+correctness
test. This unblocks deleting DeltaSyntheticColumnsExec (apache#82) -- CDC no longer
needs it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 9, 2026
…nthesis only (apache#82)

With CDC now reading natively (apache#84), nothing legitimately needs the legacy stacked
DeltaSyntheticColumnsExec. Close its two remaining routes in CometDeltaNativeScan.convert:

- The `case None` subset residual (was the CDC-family + rare DML declines) now declines to
  vanilla Spark via withFallbackReason instead of buildTaskListFromAddFiles. CDC-family reads
  no longer reach here (they're intercepted upstream as CometDeltaCdfScanExec); the rare DML
  declines (CM-id materialised row_commit_version; OPTIMIZE file-not-found race) correctly fall
  back to Spark.
- Remove the `spark.comet.delta.synthesizeInWorker.enabled` escape hatch: synthesizeInWorker is
  now `!isSubsetFileIndex` (regular reads always synthesize in-worker; matched DML rewrites too;
  everything else declines). So synthesizeInWorker is always true at proto-build time and the
  native side never constructs the exec.

Deletes the now-dead buildTaskListFromAddFiles + its partition-name map. The native
DeltaSyntheticColumnsExec is now dead code (removed in a follow-up). Verified: full contrib
package 152/0 under -Dspark.version=4.1.1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 9, 2026
The exec is now dead code: with CDC reading natively (apache#84) and the
synthesizeInWorker escape hatch removed (b604ccf), the driver never sets
synthesize_in_worker=false for a native read, so core's planner never
constructs DeltaSyntheticColumnsExec. Remove it:

- Delete contrib/delta/native/src/synthetic_columns.rs (the exec).
- native/core planner (delta_scan.rs): drop the need_synthetics branch that
  built the exec; DeltaKernelScanExec is the only scan exec now and always
  applies the DV itself (apply_dv = true). In-worker synthesis (data +
  partitions + row_index/row_id/is_row_deleted/row_commit_version/_metadata.*,
  by name) is the sole native synthesis path.
- lib.rs: drop the module + its doc reference.

Verified: full contrib package 152/0 under -Dspark.version=4.1.1 with the exec
deleted. Residual cosmetic cleanup left as follow-up (the now-always-true
apply_dv field + the set-but-ignored proto emit flags + stale comments).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 9, 2026
…pache#85)

The design docs had drifted from the kernel-read refactor (apache#76/apache#77/apache#80/apache#81/apache#82/
apache#84/#2/apache#78/apache#86). Audited all 13 docs against current code and corrected:

- Removed the deleted ParquetSource + DV-sweep + DeltaSyntheticColumnsExec read
  stack as the "current" path everywhere; it is now kernel-read only (apache#50/apache#82),
  with DeltaKernelScanExec doing in-worker synthesis. The old stack is kept only
  as clearly-labeled history / rejected alternatives.
- delta_scan.rs is a ~72-line shim delegating to comet_contrib_delta::planner
  (apache#77); column-mapping physicalisation dropped, kernel ships the schemas (apache#76).
- CDF (readChangeFeed) is kernel-native via TableChanges -> CometDeltaCdfScanExec,
  split multi-partition (apache#84/#2) -- corrected docs that called it unsupported,
  declined, or a synthetic-columns fallback.
- 08-known-limitations.md: removed all of Part B (B1-B9 were development-time
  regressions, all now fixed + guarded) and A3 (path-based CDF now engages
  native, apache#84); kept only genuine current limitations (A1 DPP residual, A2e
  credential residual, A4 VARIANT, A5 decline gates, A6 INT96 kernel gap, A7
  CM-id repoint). 466 -> 230 lines.
- Fixed config keys, build/module layout, JNI symbols, file paths, CI workflow
  references, and supported-feature lists (added CDF, _metadata, INT96) across
  the build / README / user-guide docs.

Every claim verified against code; markdown passes prettier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2 participants