Optimize away unused UNNEST under duplicate-insensitive aggregates#22161
Optimize away unused UNNEST under duplicate-insensitive aggregates#22161kosiew wants to merge 15 commits into
UNNEST under duplicate-insensitive aggregates#22161Conversation
- Pruned unused Unnest under aggregate/group-by when safe. - Handled direct Unnest and Projection -> Unnest scenarios. - Ensured Unnest is retained when empty/null lists may drop rows. - Added unit tests to cover these changes. - Updated safe GROUP BY explain in sqllogictest: no Unnest/UnnestExec. - Added regression tests for empty/null grouped scenarios to ensure Unnest retention.
- Added a unit test for the optimizer to verify that GROUP BY elements keep UNNEST in the case of `keep_referenced_unnest_under_group_by`. - Implemented SLT DISTINCT regression test for `SELECT DISTINCT id ... UNNEST(make_array(...))`, ensuring the correctness of results and that UNNEST is pruned via the distinct-to-aggregate path.
- Removed Vec allocation for unnested input indices to enhance performance. - Replaced try_fold boolean flow with a more explicit for loop for better readability. - Introduced is_unnested_input_index function to clarify input index processing. - Simplified logic for handling empty lists in .all() calls. - Added has_valid_first_value helper for validation purposes. - Introduced new test helpers: id_schema, list_literal_expr, and id_elem_unnest_plan to facilitate testing. - Streamlined repeated optimizer test setup for efficiency. - Improved construction of empty-list tests for clarity.
Expected result is empty, because the outer list is non-empty but the inner list is empty, so recursive UNNEST removes all rows before grouping. main returns no results, this PR returns |
- Pruned only UNNEST list entries with depth == 1 in the optimizer. - Added unit regression test for recursive unused UNNEST under GROUP BY. - Included Omega359 SQL reproduction in the unnest SQL logic test, expecting an empty result.
|
@Omega359 The current guard only proves that the first list level is a non-empty, non-null literal ( I’ll fix this by making the optimization more conservative for recursive list unnesting: only prune when the rule can prove every unnest step preserves at least one row. Practically, I’ll block pruning for list |
…AIN for Omega repro - Introduced a mixed safe and recursive unsafe unit test named `keep_mixed_depth_unused_unnest_under_group_by` to verify pruning at any depth greater than 1 in Unnest blocks. - Added SQL EXPLAIN functionality in the file `datafusion/sqllogictest/test_files/unnest.slt` to ensure both logical Unnest and physical UnnestExec are present in the execution plan.
- Added `keep_unused_null_literal_unnest_under_group_by_with_preserve_nulls_false` - Tests NULL list literal with `preserve_nulls=false` to verify that Unnest is retained. - Added `keep_unused_non_literal_unnest_under_group_by` - Tests a table list column that is not a literal/provable to verify that Unnest is retained.
- Removed `indices.clone()` by modifying `RequiredIndices::split_off(&self, ...)` - Simplified `is_unnested_input_index` function - Removed duplicate validity check for the first list value - Introduced a `group_by_id` test helper
- Renamed helper function for clarity: remove_row_preserving_unused_unnest_from_duplicate_insensitive_input - Added a safety comment regarding row-preserving behavior with no required expression depending on unnested output - Introduced a new unit test to verify functionality: keep_unused_unnest_under_group_by_with_visible_aggregate, including a guard for new_aggr_expr.is_empty() - Clarified SLT comment to specify “non-empty literal UNNEST” for better understanding
|
@Omega359 |
|
The fix looks good. I wasn't able to find any other issues however codex did: Incorrect pruning with volatile GROUP BY expressions in mod.rs. datafusion/sqllogictest/test_files/unnest.slt: The second query is the failing repro. On the PR it returns 3, but it should return 9 because UNNEST(make_array(1, 2, 3)) expands the 3 input rows to 9 rows, and uuid() should be evaluated per row before grouping. A conservative fix is to reject pruning when any required/group expression is volatile, e.g. check expr.is_volatile() before returning Ok(true) from can_remove_unused_unnest_for_exprs. |
- Reject unused UNNEST pruning when required/group expression is volatile. - Covers both direct UNNEST and Projection -> UNNEST scenarios. - Added 2 optimizer unit regression tests. test(unnest): include GROUP BY uuid() regression case - Added a new regression test for the GROUP BY uuid() scenario as per reviewer's request.
…est.slt - Added tests for GROUP BY id and uuid(), highlighting the volatile and deterministic behavior of the key. - Included a test for SELECT DISTINCT uuid() to demonstrate how volatile DISTINCT interacts with UNNEST row multiplication.
…hysical plans - Asserts that Unnest remains in the logical plan - Asserts that UnnestExec remains in the physical plan
- Added TableScanBuilder import from datafusion_expr::logical_plan - Added expr_to_columns import from datafusion_expr::utils - Removed unused TableScan import to clear unused import warning
|
thanks @Omega359 |
|
LGTM :) |
alamb
left a comment
There was a problem hiding this comment.
Thank you @kosiew and @Omega359
FYI @simonvandel
I spent some time reviewing the code and it looks ok but I don't fully undesrtand the transformation and don't quite follow why it is always correct (as in how sure are we that this won't get wrong answers)
| } | ||
| } | ||
|
|
||
| fn remove_row_preserving_unused_unnest_from_duplicate_insensitive_input( |
There was a problem hiding this comment.
Some comments here might help future readers -- specifically an example SQL / plan that is matched and the rewrite that is done
| fn remove_row_preserving_unused_unnest_from_duplicate_insensitive_input( | ||
| input: &LogicalPlan, | ||
| required_exprs: &[Expr], | ||
| ) -> Result<Option<LogicalPlan>> { |
There was a problem hiding this comment.
Could we use the Trnasformed structure here rather than Option? Then ou could probably do owned updates easier
There was a problem hiding this comment.
Amended to Transformed.
| LogicalPlan::Unnest(unnest) | ||
| if can_remove_unused_unnest_for_exprs(unnest, required_exprs)? => | ||
| { | ||
| Ok(Some(Arc::unwrap_or_clone(Arc::clone(&unnest.input)))) |
There was a problem hiding this comment.
the unwrap_or_clone is never going to work as there is another reference (unnuest input)
You would have to adjust this function to take the owned input which might be nice
| || unnest.struct_type_columns.contains(&input_index) | ||
| } | ||
|
|
||
| fn unnest_preserves_at_least_one_row_per_input(unnest: &Unnest) -> bool { |
There was a problem hiding this comment.
It might be worth considering making these methods on Unnest so they are easier to discover
There was a problem hiding this comment.
- Moved is_unnested_input_index helper onto Unnest:
- added Unnest::is_unnested_input_index(...)
Not moved:
- unnest_preserves_at_least_one_row_per_input
- unnest_input_expr
- literal_non_empty_list
Those stayed optimizer-local because they encode this rule’s conservative proof, not general
Unnest behavior.
| } | ||
| } | ||
|
|
||
| fn literal_non_empty_list(expr: &Expr) -> Option<bool> { |
There was a problem hiding this comment.
this also seems like it would be easier to find as a method on Expr
There was a problem hiding this comment.
Should we keep it here because it is not a general Expr property? It is optimizer-rule-specific conservative proof for row-preserving UNNEST.
|
|
||
| fn has_valid_non_empty_first_value( | ||
| array: &impl Array, | ||
| first_value_non_empty: impl FnOnce() -> bool, |
There was a problem hiding this comment.
whu does it need to be a closure? it would probably be shorter and easier to read if you eithe rjust passed in another boolean first_value_non_empty or just inlined this function 🤔
|
@alamb
This rewrite is intentionally much narrower than general “unused operator” pruning. The transformation only fires for an aggregate with no aggregate expressions, i.e. a duplicate-insensitive grouping-only aggregate. It also requires:
So the matched case is effectively: SELECT id
FROM (
SELECT id, UNNEST([1, 2]) AS elem
FROM t
)
GROUP BY idwhich can become: SELECT id
FROM t
GROUP BY idbecause We added regression tests for the unsafe cases too: empty/null lists, recursive unnest, non-literal list inputs, referenced unnested output, visible aggregates like I’ll add/expand the code comment to make this invariant clearer. |
- Updated optimize_projections module with: - Added rewrite example and safety comment. - Modified helper to return Transformed<LogicalPlan>. - Adjusted helper to consume owned input where applicable. - Removed unnecessary Arc::unwrap_or_clone(Arc::clone(...)) pattern. - Simplified closure helper to a boolean helper. - Kept Expr helper local to each response. - Enhanced logical_plan with: - Added Unnest::is_unnested_input_index.
- Clarifies the purpose of the invariant concerning unused-op pruning - Details the grouping-only aggregate invariant - Explains why row-preserving and duplicate-insensitive properties ensure safe rewriting - Specifies exact cases that are rejected for being unsafe
Which issue does this PR close?
Rationale for this change
This change adds a conservative optimizer rewrite for a narrowly scoped safe case where an unused
UNNESTcan be removed without changing query semantics.UNNESTnormally changes row cardinality, so removing it broadly is unsafe. However, when:GROUP BYwithout aggregate expressions orDISTINCT),UNNESTis guaranteed to preserve at least one row per input row,then the
UNNESTonly introduces duplicate rows that are eliminated by the parent operator.This optimization avoids unnecessary
Unnest/UnnestExecoperators in plans while preserving correctness for empty lists, null lists, andpreserve_nullsbehavior.What changes are included in this PR?
Added optimizer logic in
optimize_projectionsto remove unusedLogicalPlan::Unnestoperators under duplicate-insensitive aggregates.Restricted the rewrite to conservative safe cases:
UNNESTinput is provably row-preserving.Added helper logic to:
Added support for pruning through a projection above
UNNEST.Updated
RequiredIndices::split_offto take&self.Added optimizer unit tests covering:
UNNEST,preserve_nulls=false,Added sqllogictest coverage for:
DISTINCT,Are these changes tested?
Yes.
This PR adds targeted optimizer unit tests in
optimize_projectionsas well as sqllogictest coverage inunnest.slt, including regression cases for:UNNEST,DISTINCT,Are there any user-facing changes?
No user-facing API changes.
This change only affects logical/physical plan optimization by removing unnecessary
UNNESToperators in narrowly scoped safe cases.LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.