
fix: Appending null values to element array builders of StructBuilder for null row in a StructArray#78

Merged: viirya merged 1 commit into apache:main from viirya:fix_shuffle_builder on Feb 21, 2024

viirya commented Feb 21, 2024

Member

Which issue does this PR close?

Closes #79.

Rationale for this change

When a row is null, appending a null to the StructBuilder by calling its append_null is not enough: we also need to append a null value to each of its element array builders, so that their lengths stay equal to the length of the StructBuilder.

Otherwise, building the StructArray fails in arrow with the following error, which a customer hit while testing columnar shuffle:

```
Caused by: org.apache.comet.CometRuntimeException: StructBuilder and field_builders are of unequal lengths.
```
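
To illustrate the length invariant behind this error, here is a minimal sketch in plain Rust. It does not use the arrow crate: `ToyStructBuilder`, `ToyFieldBuilder`, `append_row`, `build`, and the `pad_children` flag are all hypothetical names invented for this example, and the model only assumes what the description above states, namely that a null struct row must also add a null to every element builder before the array is finished.

```rust
// Simplified stand-in for a struct builder. This is NOT the arrow-rs API;
// it only models the length invariant described in this PR.

/// Toy child column builder holding optional i32 values.
#[derive(Default)]
struct ToyFieldBuilder {
    values: Vec<Option<i32>>,
}

impl ToyFieldBuilder {
    fn append_value(&mut self, v: i32) {
        self.values.push(Some(v));
    }
    fn append_null(&mut self) {
        self.values.push(None);
    }
    fn len(&self) -> usize {
        self.values.len()
    }
}

/// Toy struct builder: its own validity flags plus one builder per field.
struct ToyStructBuilder {
    validity: Vec<bool>,
    fields: Vec<ToyFieldBuilder>,
}

impl ToyStructBuilder {
    fn new(num_fields: usize) -> Self {
        Self {
            validity: Vec::new(),
            fields: (0..num_fields).map(|_| ToyFieldBuilder::default()).collect(),
        }
    }

    /// Marks the struct row as null. It does not touch the field builders.
    fn append_null(&mut self) {
        self.validity.push(false);
    }

    /// Marks the struct row as valid.
    fn append_valid(&mut self) {
        self.validity.push(true);
    }

    /// Returns the row count, or an error if any field builder has a different length.
    fn finish(&self) -> Result<usize, String> {
        if self.fields.iter().any(|f| f.len() != self.validity.len()) {
            return Err("StructBuilder and field_builders are of unequal lengths.".to_string());
        }
        Ok(self.validity.len())
    }
}

/// Appends one row; `None` is a null struct row.
/// `pad_children` switches between the buggy (false) and fixed (true) behavior.
fn append_row(b: &mut ToyStructBuilder, row: Option<&[i32]>, pad_children: bool) {
    match row {
        Some(vals) => {
            for (f, v) in b.fields.iter_mut().zip(vals) {
                f.append_value(*v);
            }
            b.append_valid();
        }
        None => {
            if pad_children {
                // The fix: keep every field builder in step with the struct builder.
                for f in b.fields.iter_mut() {
                    f.append_null();
                }
            }
            b.append_null();
        }
    }
}

/// Builds a two-field struct column from the given rows.
fn build(rows: &[Option<Vec<i32>>], pad_children: bool) -> Result<usize, String> {
    let mut b = ToyStructBuilder::new(2);
    for r in rows {
        append_row(&mut b, r.as_deref(), pad_children);
    }
    b.finish()
}

fn main() {
    let rows = vec![Some(vec![1, 2]), None, Some(vec![3, 4])];
    println!("without padding: {:?}", build(&rows, false));
    println!("with padding:    {:?}", build(&rows, true));
}
```

In this toy model, the unpadded build fails because the struct builder has three rows while each field builder has only two; padding the field builders on the null row makes all lengths equal to three.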

What changes are included in this PR?

How are these changes tested?

sunchao left a comment

Member

LGTM. Already reviewed internally.

viirya merged commit 4bbc307 into apache:main on Feb 21, 2024

viirya commented Feb 21, 2024

Member Author

Merged. Thanks.

schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 7, 2026
…che#78 schema-change-since-analysis)

Regression from the kernel-schema-shipping work: DeltaColumnMappingSuite "column
mapping batch scan should detect physical name changes" read the current data
instead of null-filling, because the driver fed `ScanBuilder::with_schema` the
LIVE snapshot schema. Kernel resolves physical names from the schema passed to
`with_schema` (StateInfo::try_new -> StructField::make_physical), so it must be
fed the schema the query was PLANNED with -- then kernel's field-id matching
null-fills any column whose id changed since analysis (Delta's schema-on-read
escape hatch), which a pure-kernel engine handles itself with no fallback.

- The JVM ships the analysis-time read schema as Delta schema JSON
  (`StructType.json` from DeltaScanRule's stashed reference schema, carrying
  `delta.columnMapping.physicalName` + id at every level). The driver parses it
  (`serde_json` -> kernel `StructType`, the same format kernel reads from the log)
  and feeds it straight to `with_schema`; it falls back to projecting the live
  snapshot by column name only when no analysis-time schema is available.
- The analysis-time JSON and the Arrow-IPC names are mutually exclusive on the
  wire (ship the JSON when present, else the names) -- no redundant double-ship.
- Fix `planDeltaReadSchemas` to build kernel schemas when EITHER carrier is
  present (it previously gated on the IPC only, so a JSON-only payload silently
  produced no schemas).

Red->green guard: CometDeltaSchemaChangeReproSuite (Comet returned data, Spark
null-filled; now both null-fill). Own-suite "physical name changes" goes green;
all contrib Delta suites stay green. (DeltaColumnMappingSuite "explicit id
matching" -- a contrived manual field-id repoint -- is a separate kernel
id-vs-name matching nuance, tracked in apache#79.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 7, 2026
…op the Arrow-IPC path)

The kernel-read path previously shipped the data-read schema two ways -- Arrow-IPC
column names (projected against the snapshot driver-side) and, for apache#78, the
analysis-time schema as Delta JSON. The JSON carrier subsumes both: it's the query's
data columns drawn from the analysis-time schema (falling back to the snapshot
schema), carrying delta.columnMapping.physicalName + id at every nesting level --
the same Delta-JSON format kernel deserializes for the log schema. So drop the IPC
carrier entirely:

- `dataReadSchemaJson` now sources annotations from analyzedSchema.orElse(snapshot
  schema), using each required data column's annotated field (or the required field
  as-is for non-column-mapping tables). Removed `dataReadSchemaIpc`.
- Removed the `projectedSchemaIpc` arg from `planDeltaScan` / `planDeltaReadSchemas`
  (JNI + Native.scala) and the `projected_columns` / `build_read_schema` snapshot-
  projection fallback from scan.rs; the driver parses one JSON via `read_schema_from_json`.

One wire carrier, no Arrow-IPC marshalling, no driver-side snapshot re-projection.
All contrib Delta suites green (incl. the apache#78 schema-change repro).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 9, 2026
…pache#85)

The design docs had drifted from the kernel-read refactor (apache#76/apache#77/apache#80/apache#81/apache#82/
apache#84/#2/apache#78/apache#86). Audited all 13 docs against current code and corrected:

- Removed the deleted ParquetSource + DV-sweep + DeltaSyntheticColumnsExec read
  stack as the "current" path everywhere; it is now kernel-read only (apache#50/apache#82),
  with DeltaKernelScanExec doing in-worker synthesis. The old stack is kept only
  as clearly-labeled history / rejected alternatives.
- delta_scan.rs is a ~72-line shim delegating to comet_contrib_delta::planner
  (apache#77); column-mapping physicalisation dropped, kernel ships the schemas (apache#76).
- CDF (readChangeFeed) is kernel-native via TableChanges -> CometDeltaCdfScanExec,
  split multi-partition (apache#84/#2) -- corrected docs that called it unsupported,
  declined, or a synthetic-columns fallback.
- 08-known-limitations.md: removed all of Part B (B1-B9 were development-time
  regressions, all now fixed + guarded) and A3 (path-based CDF now engages
  native, apache#84); kept only genuine current limitations (A1 DPP residual, A2e
  credential residual, A4 VARIANT, A5 decline gates, A6 INT96 kernel gap, A7
  CM-id repoint). 466 -> 230 lines.
- Fixed config keys, build/module layout, JNI symbols, file paths, CI workflow
  references, and supported-feature lists (added CDF, _metadata, INT96) across
  the build / README / user-guide docs.

Every claim verified against code; markdown passes prettier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Appending null values to element array builders of StructBuilder for null row in a StructArray

2 participants