Skip to content

Auto cast values on export part and partition just like insert select#1779

Open
arthurpassos wants to merge 14 commits into
antalya-26.3from
feature/antalya-26.3/export-cast-compatibility
Open

Auto cast values on export part and partition just like insert select#1779
arthurpassos wants to merge 14 commits into
antalya-26.3from
feature/antalya-26.3/export-cast-compatibility

Conversation

@arthurpassos

@arthurpassos arthurpassos commented May 11, 2026

Copy link
Copy Markdown
Collaborator

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Applies to part and partition export the same auto cast ClickHouse INSERT ... SELECT does.

Current rules in this PR:

  • Schema column count must match
  • Check is done by column position, not by name.
  • Lossy casts are allowed if export_merge_tree_part_allow_lossy_cast=1

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

@github-actions

github-actions Bot commented May 11, 2026

Copy link
Copy Markdown

Workflow [PR], commit [5161c90]

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99002b9740

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


/// Align the pipeline header with the destination's non-materialized sample block,
/// using the same `makeConvertingActions(Position)` call INSERT SELECT performs.
addExportConvertingActions(plan_for_part, *destination_storage, local_context);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Convert partition values before creating the import sink

When the destination is partitioned object storage and the partition column type changes in a castable way (for example source year String, destination year UInt32, both PARTITION BY year), this new row conversion happens only after destination_storage->import has already computed the S3/Hive partition path from block_with_partition_values using the source part's minmax block. That means a schema pair now accepted by the synchronous validation can either fail inside partition-key evaluation with the destination actions seeing source-typed inputs, or write files under a path based on the uncast source value while the exported rows were cast, so the export is not equivalent to INSERT SELECT.

Useful? React with 👍 / 👎.

@arthurpassos arthurpassos added antalya port-antalya PRs to be ported to all new Antalya releases antalya-26.3 labels May 11, 2026
@arthurpassos

Copy link
Copy Markdown
Collaborator Author

Hm... the approach of following the same INSERT ... SELECT rules for casting is somewhat problematic for partition exports because the partition column results might be different, and the export operation does not repartition.

@svb-alt svb-alt added the roadmap Key features and improvements for Antalya project label Jun 7, 2026
@arthurpassos

Copy link
Copy Markdown
Collaborator Author

@codex Review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c5693b6068

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


/// Align the pipeline header with the destination's non-materialized sample block,
/// using the same `makeConvertingActions(Position)` call INSERT SELECT performs.
addExportConvertingActions(plan_for_part, *destination_storage, local_context);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Convert exported partition values as well

When the partition column itself needs a cast (for example a source UInt16 partition key exported to an Iceberg table whose partition column is String, which canBeSafelyCast accepts without opt-in), this step only converts the row stream. StorageObjectStorage::import still receives the source-typed block_with_partition_values, and the Iceberg commit paths still use the source part's partition.value, so the data files are written with destination types while the object path/manifest partition value remains source-typed; for Iceberg this can fail at commit or record an incorrectly typed partition. Please apply the same conversion to the partition values used for import/commit whenever partition-key columns are cast.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is expected, at least for now.

@arthurpassos

Copy link
Copy Markdown
Collaborator Author

The commit 308159a aims to address the following: I was previously not casting the partition key and using the source partition key since the partition expressions were equal and the data was supposed to belong to the same partition.

On the other hand, even if the partition exp is the same, if we r dealing with different data types, chances are the partition key will be different depending on the transform used. For instance:

(id Int32, year UInt16) partition by year
x
(id Int32, year String) partition by year

or

(id Int32, year UInt16) partition by icebergBucket(5, year)
x
(id Int32, year String) partition by icebergBucket(5, year) -- the result of icebergBucket is different

arthur :) SELECT icebergBucket(5, '50');

SELECT icebergBucket(5, '50')

Query id: 828c5a2b-cbd0-4dc8-a36c-81a19f9ad4b1

   ┌─icebergBucket(5, '50')─┐
1. │                      2 │
   └────────────────────────┘

1 row in set. Elapsed: 0.006 sec. 

arthur :) SELECT icebergBucket(5, 50);

SELECT icebergBucket(5, 50)

Query id: 48ab602b-5cdf-4e5e-b65b-5aec174c1e0d

   ┌─icebergBucket(5, 50)─┐
1. │                    3 │
   └──────────────────────┘

1 row in set. Elapsed: 0.006 sec. 

arthur :) 


@mkmkme mkmkme left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!



def test_export_part_with_castable_narrowing_values_fit(cluster):
"""A lossy narrowing (id Int64 -> Int32) succeeds once the user opts in via

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a test that fails EXPORT PART when the setting is not set and there is a lossy conversion would be nice

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have those for export partition. You mean one for plain export part as well?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At that moment I hadn't yet seen the tests for export partition. But yeah since we're mostly mirroring the tests between export partition/export partition, would be great to have one for export part as well

)


def test_export_partition_lossy_cast_rejected_without_optin(cluster):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without_**optin** :)

@DimensionWieldr

DimensionWieldr commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

AI Triage, edited for conciseness:

CI Failures Analysis

Related to this PR (regression test updates needed)

  • /s3/minio/export tests/export part/sanity/mismatched columns — expects exitcode 122, got 20 (NUMBER_OF_COLUMNS_DOESNT_MATCH) after auto-cast behavior change
  • /s3/minio/export tests/export partition/sanity/mismatched columns — same
  • /iceberg/export partition/.../source-only schema drift is rejected at EXPORT — expects INCOMPATIBLE_COLUMNS (122), got NUMBER_OF_COLUMNS_DOESNT_MATCH (20)
  • /iceberg/.../system tables partition sorting keys — stale regression pin (5b1ccf771); needs 0e1dbef8b+ for PR 1874 expectations
    All upstream checks (Stateless, Integration, Stress, Fuzzer) passed.

Infrastructure Issues

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

antalya antalya-26.3 port-antalya PRs to be ported to all new Antalya releases roadmap Key features and improvements for Antalya project

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants