Skip to content

Optimize away unused UNNEST under duplicate-insensitive aggregates#22161

Open
kosiew wants to merge 15 commits into
apache:mainfrom
kosiew:unnested-pruned-02-20118
Open

Optimize away unused UNNEST under duplicate-insensitive aggregates#22161
kosiew wants to merge 15 commits into
apache:mainfrom
kosiew:unnested-pruned-02-20118

Conversation

@kosiew
Copy link
Copy Markdown
Contributor

@kosiew kosiew commented May 14, 2026

Which issue does this PR close?

Rationale for this change

This change adds a conservative optimizer rewrite for a narrowly scoped safe case where an unused UNNEST can be removed without changing query semantics.

UNNEST normally changes row cardinality, so removing it broadly is unsafe. However, when:

  • the unnested output columns are not referenced by parent-visible expressions,
  • the parent is duplicate-insensitive (such as GROUP BY without aggregate expressions or DISTINCT),
  • and the UNNEST is guaranteed to preserve at least one row per input row,

then the UNNEST only introduces duplicate rows that are eliminated by the parent operator.

This optimization avoids unnecessary Unnest / UnnestExec operators in plans while preserving correctness for empty lists, null lists, and preserve_nulls behavior.

What changes are included in this PR?

  • Added optimizer logic in optimize_projections to remove unused LogicalPlan::Unnest operators under duplicate-insensitive aggregates.

  • Restricted the rewrite to conservative safe cases:

    • no aggregate expressions are present,
    • required expressions do not reference unnested outputs,
    • and the UNNEST input is provably row-preserving.
  • Added helper logic to:

    • detect whether required expressions depend on unnested columns,
    • determine whether unnested inputs are list/struct outputs,
    • verify that literal list inputs are non-empty and non-null,
    • preserve semantics for empty and null list values.
  • Added support for pruning through a projection above UNNEST.

  • Updated RequiredIndices::split_off to take &self.

  • Added optimizer unit tests covering:

    • removal of unused non-empty literal UNNEST,
    • removal through projections,
    • preservation when unnested columns are referenced,
    • preservation for aggregates with aggregate expressions,
    • preservation for empty lists,
    • preservation for null lists and preserve_nulls=false,
    • preservation for non-literal unnests.
  • Added sqllogictest coverage for:

    • grouped queries,
    • DISTINCT,
    • physical and logical plans,
    • unsafe counterexamples involving empty/null lists.

Are these changes tested?

Yes.

This PR adds targeted optimizer unit tests in optimize_projections as well as sqllogictest coverage in unnest.slt, including regression cases for:

  • safe removal of unused row-preserving UNNEST,
  • DISTINCT,
  • referenced unnested columns,
  • aggregate expressions,
  • empty lists,
  • null lists,
  • and non-literal inputs.

Are there any user-facing changes?

No user-facing API changes.

This change only affects logical/physical plan optimization by removing unnecessary UNNEST operators in narrowly scoped safe cases.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 3 commits May 14, 2026 11:45
- Pruned unused Unnest under aggregate/group-by when safe.
- Handled direct Unnest and Projection -> Unnest scenarios.
- Ensured Unnest is retained when empty/null lists may drop rows.
- Added unit tests to cover these changes.
- Updated safe GROUP BY explain in sqllogictest: no Unnest/UnnestExec.
- Added regression tests for empty/null grouped scenarios to ensure Unnest retention.
- Added a unit test for the optimizer to verify that GROUP BY elements keep UNNEST in the case of `keep_referenced_unnest_under_group_by`.
- Implemented SLT DISTINCT regression test for `SELECT DISTINCT id ... UNNEST(make_array(...))`, ensuring the correctness of results and that UNNEST is pruned via the distinct-to-aggregate path.
- Removed Vec allocation for unnested input indices to enhance performance.
- Replaced try_fold boolean flow with a more explicit for loop for better readability.
- Introduced is_unnested_input_index function to clarify input index processing.
- Simplified logic for handling empty lists in .all() calls.
- Added has_valid_first_value helper for validation purposes.
- Introduced new test helpers: id_schema, list_literal_expr, and id_elem_unnest_plan to facilitate testing.
- Streamlined repeated optimizer test setup for efficiency.
- Improved construction of empty-list tests for clarity.
@github-actions github-actions Bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels May 14, 2026
@kosiew kosiew marked this pull request as ready for review May 14, 2026 04:57
@Omega359
Copy link
Copy Markdown
Contributor

SELECT id
FROM (
    SELECT id, UNNEST(UNNEST(make_array(arrow_cast(make_array(), 'List(Int64)')))) AS elem
    FROM unused_unnest_pruning
)
GROUP BY id
ORDER BY id;

Expected result is empty, because the outer list is non-empty but the inner list is empty, so recursive UNNEST removes all rows before grouping.

main returns no results, this PR returns
1
2
3

- Pruned only UNNEST list entries with depth == 1 in the optimizer.
- Added unit regression test for recursive unused UNNEST under GROUP BY.
- Included Omega359 SQL reproduction in the unnest SQL logic test, expecting an empty result.
@kosiew
Copy link
Copy Markdown
Contributor Author

kosiew commented May 25, 2026

@Omega359
Good catch — this is a semantic regression.

The current guard only proves that the first list level is a non-empty, non-null literal (unnest_preserves_at_least_one_row_per_input / literal_non_empty_list). That is not enough for recursive/nested unnest. In the example, the outer list has one element, so the guard incorrectly treats it as row-preserving, but the inner list is empty and the second unnest removes all rows before the GROUP BY.

I’ll fix this by making the optimization more conservative for recursive list unnesting: only prune when the rule can prove every unnest step preserves at least one row. Practically, I’ll block pruning for list UNNEST entries with depth > 1 unless/until we add a recursive literal proof. I’ll also add the exact SQL regression (and a unit test) so this case stays unoptimized and returns the empty result, preserving empty/null list and preserve_nulls behavior.

kosiew added 4 commits May 25, 2026 11:05
…AIN for Omega repro

- Introduced a mixed safe and recursive unsafe unit test named `keep_mixed_depth_unused_unnest_under_group_by` to verify pruning at any depth greater than 1 in Unnest blocks.
- Added SQL EXPLAIN functionality in the file `datafusion/sqllogictest/test_files/unnest.slt` to ensure both logical Unnest and physical UnnestExec are present in the execution plan.
- Added `keep_unused_null_literal_unnest_under_group_by_with_preserve_nulls_false`
- Tests NULL list literal with `preserve_nulls=false` to verify that Unnest is retained.

- Added `keep_unused_non_literal_unnest_under_group_by`
- Tests a table list column that is not a literal/provable to verify that Unnest is retained.
- Removed `indices.clone()` by modifying `RequiredIndices::split_off(&self, ...)`
- Simplified `is_unnested_input_index` function
- Removed duplicate validity check for the first list value
- Introduced a `group_by_id` test helper
- Renamed helper function for clarity: remove_row_preserving_unused_unnest_from_duplicate_insensitive_input
- Added a safety comment regarding row-preserving behavior with no required expression depending on unnested output
- Introduced a new unit test to verify functionality: keep_unused_unnest_under_group_by_with_visible_aggregate, including a guard for new_aggr_expr.is_empty()
- Clarified SLT comment to specify “non-empty literal UNNEST” for better understanding
@kosiew kosiew marked this pull request as draft May 26, 2026 06:24
@kosiew kosiew marked this pull request as ready for review May 26, 2026 07:22
@kosiew
Copy link
Copy Markdown
Contributor Author

kosiew commented May 26, 2026

@Omega359
ptal again.

@Omega359
Copy link
Copy Markdown
Contributor

Omega359 commented May 26, 2026

The fix looks good. I wasn't able to find any other issues however codex did:

Incorrect pruning with volatile GROUP BY expressions in mod.rs.
can_remove_unused_unnest_for_exprs only checks whether required expressions reference unnested output columns. For GROUP BY uuid() / GROUP BY random(), there are no column refs, so the rule removes a row-multiplying UNNEST. That changes the number of volatile evaluations and therefore the number of groups.

datafusion/sqllogictest/test_files/unnest.slt:

# Volatile GROUP BY expressions are evaluated once per input row, so removing a
# row-multiplying UNNEST changes the number of groups even when elem is unused.
query I
SELECT COUNT(*)
FROM (
    SELECT 1
    FROM (
        SELECT id, UNNEST(make_array(1, 2, 3)) AS elem
        FROM unused_unnest_pruning
    )
    GROUP BY uuid()
);
----
9

The second query is the failing repro. On the PR it returns 3, but it should return 9 because UNNEST(make_array(1, 2, 3)) expands the 3 input rows to 9 rows, and uuid() should be evaluated per row before grouping.

A conservative fix is to reject pruning when any required/group expression is volatile, e.g. check expr.is_volatile() before returning Ok(true) from can_remove_unused_unnest_for_exprs.

kosiew added 4 commits May 27, 2026 16:16
- Reject unused UNNEST pruning when required/group expression is volatile.
- Covers both direct UNNEST and Projection -> UNNEST scenarios.
- Added 2 optimizer unit regression tests.

test(unnest): include GROUP BY uuid() regression case

- Added a new regression test for the GROUP BY uuid() scenario as per reviewer's request.
…est.slt

- Added tests for GROUP BY id and uuid(), highlighting the volatile and deterministic behavior of the key.
- Included a test for SELECT DISTINCT uuid() to demonstrate how volatile DISTINCT interacts with UNNEST row multiplication.
…hysical plans

- Asserts that Unnest remains in the logical plan
- Asserts that UnnestExec remains in the physical plan
@kosiew kosiew marked this pull request as draft May 27, 2026 08:28
- Added TableScanBuilder import from datafusion_expr::logical_plan
- Added expr_to_columns import from datafusion_expr::utils
- Removed unused TableScan import to clear unused import warning
@kosiew kosiew marked this pull request as ready for review May 27, 2026 09:42
@kosiew
Copy link
Copy Markdown
Contributor Author

kosiew commented May 27, 2026

thanks @Omega359
I adopted the suggested conservative fix and added a few more edge cases.

@Omega359
Copy link
Copy Markdown
Contributor

LGTM :)

@kosiew kosiew requested a review from alamb May 29, 2026 08:29
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @kosiew and @Omega359

FYI @simonvandel

I spent some time reviewing the code and it looks ok but I don't fully undesrtand the transformation and don't quite follow why it is always correct (as in how sure are we that this won't get wrong answers)

}
}

fn remove_row_preserving_unused_unnest_from_duplicate_insensitive_input(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments here might help future readers -- specifically an example SQL / plan that is matched and the rewrite that is done

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment.

fn remove_row_preserving_unused_unnest_from_duplicate_insensitive_input(
input: &LogicalPlan,
required_exprs: &[Expr],
) -> Result<Option<LogicalPlan>> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use the Trnasformed structure here rather than Option? Then ou could probably do owned updates easier

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amended to Transformed.

LogicalPlan::Unnest(unnest)
if can_remove_unused_unnest_for_exprs(unnest, required_exprs)? =>
{
Ok(Some(Arc::unwrap_or_clone(Arc::clone(&unnest.input))))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the unwrap_or_clone is never going to work as there is another reference (unnuest input)

You would have to adjust this function to take the owned input which might be nice

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amended.

|| unnest.struct_type_columns.contains(&input_index)
}

fn unnest_preserves_at_least_one_row_per_input(unnest: &Unnest) -> bool {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth considering making these methods on Unnest so they are easier to discover

Copy link
Copy Markdown
Contributor Author

@kosiew kosiew May 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Moved is_unnested_input_index helper onto Unnest:
    • added Unnest::is_unnested_input_index(...)

Not moved:

  • unnest_preserves_at_least_one_row_per_input
  • unnest_input_expr
  • literal_non_empty_list

Those stayed optimizer-local because they encode this rule’s conservative proof, not general
Unnest behavior.

}
}

fn literal_non_empty_list(expr: &Expr) -> Option<bool> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also seems like it would be easier to find as a method on Expr

Copy link
Copy Markdown
Contributor Author

@kosiew kosiew May 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep it here because it is not a general Expr property? It is optimizer-rule-specific conservative proof for row-preserving UNNEST.


fn has_valid_non_empty_first_value(
array: &impl Array,
first_value_non_empty: impl FnOnce() -> bool,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whu does it need to be a closure? it would probably be shorter and easier to read if you eithe rjust passed in another boolean first_value_non_empty or just inlined this function 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amended.

@kosiew
Copy link
Copy Markdown
Contributor Author

kosiew commented May 30, 2026

@alamb
I made changes to address the above comments.

don't fully undesrtand the transformation and don't quite follow why it is always correct (as in how sure are we that this won't get wrong answers)

This rewrite is intentionally much narrower than general “unused operator” pruning.

The transformation only fires for an aggregate with no aggregate expressions, i.e. a duplicate-insensitive grouping-only aggregate. It also requires:

  • no required/group expression references any unnested output column
  • no required/group expression is volatile
  • the UNNEST is proven row-preserving: currently only depth-1 literal lists with a valid, non-empty first value
  • empty/null lists, recursive unnests, non-literal unnests, referenced unnested columns, and visible aggregate expressions are all rejected

So the matched case is effectively:

SELECT id
FROM (
  SELECT id, UNNEST([1, 2]) AS elem
  FROM t
)
GROUP BY id

which can become:

SELECT id
FROM t
GROUP BY id

because elem is unused, every input row still contributes at least one row after unnest, and grouping by id collapses any duplicate rows introduced by unnest.

We added regression tests for the unsafe cases too: empty/null lists, recursive unnest, non-literal list inputs, referenced unnested output, visible aggregates like COUNT(*), and volatile grouping expressions such as uuid(). Those cases keep the UNNEST.

I’ll add/expand the code comment to make this invariant clearer.

- Updated optimize_projections module with:
- Added rewrite example and safety comment.
- Modified helper to return Transformed<LogicalPlan>.
- Adjusted helper to consume owned input where applicable.
- Removed unnecessary Arc::unwrap_or_clone(Arc::clone(...)) pattern.
- Simplified closure helper to a boolean helper.
- Kept Expr helper local to each response.

- Enhanced logical_plan with:
- Added Unnest::is_unnested_input_index.
- Clarifies the purpose of the invariant concerning unused-op pruning
- Details the grouping-only aggregate invariant
- Explains why row-preserving and duplicate-insensitive properties ensure safe rewriting
- Specifies exact cases that are rejected for being unsafe
@github-actions github-actions Bot added the logical-expr Logical plan and expressions label May 30, 2026
@kosiew kosiew requested a review from alamb May 30, 2026 04:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants