test: Add TPC-DS test results#77

Merged: sunchao merged 1 commit into apache:main from sunchao:add-tpcds-results on Feb 21, 2024
sunchao (Member) commented Feb 21, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

This adds back the TPC-DS query results for CometTPCDSQuerySuite, which serves as an extra test suite to ensure that Comet produces the same results as Spark.

What changes are included in this PR?

Adds the TPC-DS query result golden files back to the repo. These files were generated by running CometTPCDSQuerySuite with Comet turned off, so that they match Spark's own results.
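
For readers unfamiliar with golden-file testing, the idea can be sketched roughly as follows. This is a minimal illustration in Python, not the code in CometTPCDSQuerySuite; the function names, the tab-separated file format, and the `q1.sql.out` file name are all assumptions made for the example. A trusted baseline (here standing in for Spark with Comet turned off) is recorded once, and later runs are compared against it.

```python
import tempfile
from pathlib import Path


def format_rows(rows):
    """Render rows as tab-separated lines, sorted so row order does not matter."""
    return sorted(
        "\t".join("NULL" if v is None else str(v) for v in row) for row in rows
    )


def check_against_golden(rows, golden_path, regenerate=False):
    """Compare query output with a golden file, or (re)write the file.

    Hypothetical helper: returns True when the output matches the stored result.
    """
    actual = format_rows(rows)
    path = Path(golden_path)
    if regenerate:
        # Baseline mode: record output from the trusted engine.
        path.write_text("\n".join(actual) + ("\n" if actual else ""))
        return True
    expected = path.read_text().splitlines()
    return actual == expected


# Record a baseline once, then check later runs against it.
golden_file = Path(tempfile.mkdtemp()) / "q1.sql.out"  # hypothetical file name
check_against_golden([(1, "a"), (2, None)], golden_file, regenerate=True)

# Same rows in a different order still match; a changed value does not.
same_result = check_against_golden([(2, None), (1, "a")], golden_file)
different_result = check_against_golden([(1, "a"), (2, "b")], golden_file)
```

Sorting the rendered rows is a simplification that suits queries without a guaranteed output order; a real suite may need to preserve order for queries that end in ORDER BY.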

How are these changes tested?

sunchao changed the title from "test: Add TPC-DS tests results" to "test: Add TPC-DS test results" on Feb 21, 2024

sunchao (Member, Author) commented Feb 21, 2024

We will need to add a separate GitHub job to check the TPC-DS query results. I will do that later.

sunchao merged commit 4c5fc75 into apache:main on Feb 21, 2024

sunchao (Member, Author) commented Feb 21, 2024

Thanks, merged

@sunchao sunchao deleted the add-tpcds-results branch February 21, 2024 22:41
schenksj added a commit to schenksj/datafusion-comet that referenced this pull request May 13, 2026
Closes the streaming + MERGE-with-DV gap (apache#77). Previously the
pre-materialised FileIndex code path declined Comet whenever any
AddFile carried a DeletionVectorDescriptor, forcing fallback to
Spark+Delta. Now we materialise the DV on the driver via Delta's
HadoopFileSystemDVStore (reflection, no compile-time dep) and feed
the resulting row-index list through the proto's existing
deleted_row_indexes field; the native planner already wraps DV'd
file groups in DeltaDvFilterExec.

ExtractedAddFile gains a dvDescriptor: AnyRef field; the convert
path materialises indexes for any AddFile that carries one, falling
back if reflection or the DV read fails (silently dropping a DV
would be a correctness violation).

Verified against DeletionVectorsSuite (293/293 passed) and the
MergeIntoDVsSuite metrics tests (DV write + subsequent DV-aware
read both go through native).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 9, 2026
…che#77)

Core's native/core/.../delta_scan.rs held ~250 lines of Delta-specific scan planning
(kernel schema selection, KernelScanFile mapping, storage-config/S3-bucket resolution,
final_output_indices reorder, the kernel_read gate). Move all of it into
comet_contrib_delta::planner::plan_delta_scan, so core stays free of Delta planning logic
(cleaner for upstreaming apache#4366 -- reviewers see core untouched by Delta).

Core's delta_scan.rs is now a thin shim: it computes the requested + partition Arrow schemas
(core owns the proto->arrow `to_arrow_datatype` converter, used across the planner, and the
contrib crate can't depend on core -- that would cycle), calls the contrib planner, and wraps
the returned ExecutionPlan in a SparkPlan. No behaviour change.

Verified: full contrib package 152/0 under Spark 4.1.1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
schenksj added a commit to schenksj/datafusion-comet that referenced this pull request Jun 9, 2026
…pache#85)

The design docs had drifted from the kernel-read refactor (apache#76/apache#77/apache#80/apache#81/apache#82/
apache#84/#2/apache#78/apache#86). Audited all 13 docs against current code and corrected:

- Removed the deleted ParquetSource + DV-sweep + DeltaSyntheticColumnsExec read
  stack as the "current" path everywhere; it is now kernel-read only (apache#50/apache#82),
  with DeltaKernelScanExec doing in-worker synthesis. The old stack is kept only
  as clearly-labeled history / rejected alternatives.
- delta_scan.rs is a ~72-line shim delegating to comet_contrib_delta::planner
  (apache#77); column-mapping physicalisation dropped, kernel ships the schemas (apache#76).
- CDF (readChangeFeed) is kernel-native via TableChanges -> CometDeltaCdfScanExec,
  split multi-partition (apache#84/#2) -- corrected docs that called it unsupported,
  declined, or a synthetic-columns fallback.
- 08-known-limitations.md: removed all of Part B (B1-B9 were development-time
  regressions, all now fixed + guarded) and A3 (path-based CDF now engages
  native, apache#84); kept only genuine current limitations (A1 DPP residual, A2e
  credential residual, A4 VARIANT, A5 decline gates, A6 INT96 kernel gap, A7
  CM-id repoint). 466 -> 230 lines.
- Fixed config keys, build/module layout, JNI symbols, file paths, CI workflow
  references, and supported-feature lists (added CDF, _metadata, INT96) across
  the build / README / user-guide docs.

Every claim verified against code; markdown passes prettier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>