Auto cast values on export part and partition just like insert select#1779
Auto cast values on export part and partition just like insert select#1779arthurpassos wants to merge 14 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99002b9740
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| /// Align the pipeline header with the destination's non-materialized sample block, | ||
| /// using the same `makeConvertingActions(Position)` call INSERT SELECT performs. | ||
| addExportConvertingActions(plan_for_part, *destination_storage, local_context); |
There was a problem hiding this comment.
Convert partition values before creating the import sink
When the destination is partitioned object storage and the partition column type changes in a castable way (for example source year String, destination year UInt32, both PARTITION BY year), this new row conversion happens only after destination_storage->import has already computed the S3/Hive partition path from block_with_partition_values using the source part's minmax block. That means a schema pair now accepted by the synchronous validation can either fail inside partition-key evaluation with the destination actions seeing source-typed inputs, or write files under a path based on the uncast source value while the exported rows were cast, so the export is not equivalent to INSERT SELECT.
Useful? React with 👍 / 👎.
|
Hm... the approach of following the same |
|
@codex Review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c5693b6068
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| /// Align the pipeline header with the destination's non-materialized sample block, | ||
| /// using the same `makeConvertingActions(Position)` call INSERT SELECT performs. | ||
| addExportConvertingActions(plan_for_part, *destination_storage, local_context); |
There was a problem hiding this comment.
Convert exported partition values as well
When the partition column itself needs a cast (for example a source UInt16 partition key exported to an Iceberg table whose partition column is String, which canBeSafelyCast accepts without opt-in), this step only converts the row stream. StorageObjectStorage::import still receives the source-typed block_with_partition_values, and the Iceberg commit paths still use the source part's partition.value, so the data files are written with destination types while the object path/manifest partition value remains source-typed; for Iceberg this can fail at commit or record an incorrectly typed partition. Please apply the same conversion to the partition values used for import/commit whenever partition-key columns are cast.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
It is expected, at least for now.
|
The commit 308159a aims to address the following: I was previously not casting the partition key and using the source partition key since the partition expressions were equal and the data was supposed to belong to the same partition. On the other hand, even if the partition exp is the same, if we r dealing with different data types, chances are the partition key will be different depending on the transform used. For instance: (id Int32, year UInt16) partition by year or (id Int32, year UInt16) partition by icebergBucket(5, year) |
|
|
||
|
|
||
| def test_export_part_with_castable_narrowing_values_fit(cluster): | ||
| """A lossy narrowing (id Int64 -> Int32) succeeds once the user opts in via |
There was a problem hiding this comment.
Perhaps a test that fails EXPORT PART when the setting is not set and there is a lossy conversion would be nice
There was a problem hiding this comment.
We do have those for export partition. You mean one for plain export part as well?
There was a problem hiding this comment.
At that moment I hadn't yet seen the tests for export partition. But yeah since we're mostly mirroring the tests between export partition/export partition, would be great to have one for export part as well
| ) | ||
|
|
||
|
|
||
| def test_export_partition_lossy_cast_rejected_without_optin(cluster): |
|
AI Triage, edited for conciseness: CI Failures AnalysisRelated to this PR (regression test updates needed)
Infrastructure Issues
|
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Applies to part and partition export the same auto cast ClickHouse INSERT ... SELECT does.
Current rules in this PR:
export_merge_tree_part_allow_lossy_cast=1Documentation entry for user-facing changes
...
CI/CD Options
Exclude tests:
Regression jobs to run: