Skip to content

fix: outer reference confusing calcite optimizations#964

Open
mbwhite wants to merge 2 commits into
substrait-io:mainfrom
mbwhite:outer-references
Open

fix: outer reference confusing calcite optimizations#964
mbwhite wants to merge 2 commits into
substrait-io:mainfrom
mbwhite:outer-references

Conversation

@mbwhite

@mbwhite mbwhite commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

fix: outer reference ocnfusing post calcite optimizations

Signed-off-by: matthew brian white <whitemat@uk.ibm.com>
@nielspardon

Copy link
Copy Markdown
Member

this seems to require a ./gradlew spotlessApply

Signed-off-by: matthew brian white <whitemat@uk.ibm.com>
@mbwhite mbwhite changed the title fix: outer reference ocnfusing post calcite optimizations fix: outer reference confusing calcite optimizations Jun 29, 2026

@nielspardon nielspardon left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch on the crash this fixes. When a Calcite optimizer only partially decorrelates a plan (the owning Correlate is rewritten into a join, but $corN.col is left in a Filter condition whose getVariablesSet() is now empty), the CorrelationId was never registered, the access was dropped from fieldAccessDepthMap, and RexExpressionConverter.visitFieldAccess then unboxed a null depth into an int → NPE. Registering the id in visit(Filter) removes that NPE, and for genuinely-nested references (depth ≥ 1) it produces correct outer references.

One concern with the result: for the common single-reference case the pre-scan registers the correlation at depth 0, so the converter emits newRootStructOuterReference(..., stepsOut=0). That is an invalid Substrait plan, not just an odd one — the proto is explicit that steps_out must be >= 1:

OuterReference.steps_out: "Number of subquery boundaries to traverse up for this field's reference. Must be >= 1."
https://github.com/substrait-io/substrait/blob/8675ba395ce966e593ca66575930f59dc359b419/proto/substrait/algebra.proto#L1649-L1653

Consistently, FieldReference.isOuterReference() is outerReferenceStepsOut().orElse(0) > 0, and on the reverse path a 0 reference satisfies neither isSimpleRootReference() nor isOuterReference() and falls through both branches of ExpressionRexConverter.visit(FieldReference). (Emitting a plain non-outer reference instead is also wrong: the field index is in the correlated relation's coordinate system, not the Filter's input — e.g. $cor0.SS_ITEM_SK is index 2 in STORE_SALES but index 2 in ITEM is I_REC_START_DATE.)

So I'd suggest pairing the resolver fix with a guard so the unrepresentable depth-0 case fails loudly instead of producing an invalid plan.

Proposal 1 — guard the converter (isthmus/.../expression/RexExpressionConverter.java, visitFieldAccess, CORREL_VARIABLE case; that file isn't in this diff, hence here rather than inline):

case CORREL_VARIABLE:
  {
    Integer stepsOut = relVisitor.getFieldAccessDepth(fieldAccess);
    // Substrait requires steps_out >= 1 for an outer reference
    // (proto/substrait/algebra.proto, OuterReference.steps_out: "Must be >= 1.").
    // A null/0 depth means the correlation does not step out of any subquery,
    // typically a plan that was only partially decorrelated (the owning Correlate
    // was rewritten into a join but the $cor reference remains in this Filter).
    // We can't represent that with steps_out, so fail clearly rather than emit an
    // invalid steps_out=0 reference. Id-based resolution (substrait#1031) would
    // handle this; tracked in #869.
    if (stepsOut == null || stepsOut == 0) {
      throw new UnsupportedOperationException(
          String.format(
              "Cannot convert outer reference %s: it does not step out of any "
                  + "subquery (steps_out=%s). This usually indicates a plan that was "
                  + "only partially decorrelated; Substrait requires steps_out >= 1.",
              fieldAccess, stepsOut));
    }
    return FieldReference.newRootStructOuterReference(
        fieldAccess.getField().getIndex(),
        typeConverter.toSubstrait(fieldAccess.getType()),
        stepsOut);
  }

This also subsumes the original NPE (the null case) under one clear message.

Follow-up: the proper long-term fix — id-based outer-reference resolution from substrait-io/substrait#1031, which would let these partially-decorrelated / DAG plans actually convert — is already tracked under #869 (Update to Substrait v0.89.0). No new issue needed; just point the guard comment at #869.

public RexNode visitFieldAccess(RexFieldAccess fieldAccess) {
if (fieldAccess.getReferenceExpr() instanceof RexCorrelVariable) {
CorrelationId id = ((RexCorrelVariable) fieldAccess.getReferenceExpr()).id;
nestedDepth.putIfAbsent(id, 0);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Registering the id here is the right move — it removes the downstream NPE and lets genuinely-nested references (depth ≥ 1) resolve correctly.

The thing to be aware of: for a single top-level residual reference (no enclosing subquery), this anchors the id at depth 0 and the access is recorded at 0. Downstream that becomes newRootStructOuterReference(..., stepsOut=0), which is an invalid Substrait plan — the proto requires steps_out >= 1 (algebra.proto#L1649-L1653). See the guard proposal in the review summary so this fails loudly instead of emitting an invalid reference.

// Post-fix: the Filter condition pre-scan registers $cor0, so the field access is recorded.
final Map<RexFieldAccess, Integer> fieldAccessDepthMap = buildOuterFieldRefMap(calciteRel);
Assertions.assertEquals(1, fieldAccessDepthMap.size());
validateOuterRef(fieldAccessDepthMap, "$cor0", "SS_ITEM_SK", 0);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion enshrines the invalid steps_out=0 output as the expected behaviour. Since the proto requires steps_out >= 1, depth 0 isn't representable.

Proposal: keep the assertion that the resolver now registers the correlation (it proves the NPE is gone), then assert that conversion is rejected rather than producing an invalid plan:

// Resolver registers the dangling correlation (no longer silently dropped → no NPE)...
Map<RexFieldAccess, Integer> map = buildOuterFieldRefMap(calciteRel);
Assertions.assertEquals(1, map.size());
validateOuterRef(map, "$cor0", "SS_ITEM_SK", 0);
// ...but steps_out=0 is not a valid Substrait outer reference (must be >= 1),
// so conversion fails clearly rather than emitting an invalid plan.
UnsupportedOperationException ex =
    Assertions.assertThrows(
        UnsupportedOperationException.class,
        () -> SubstraitRelVisitor.convert(calciteRel, converterProvider));
Assertions.assertTrue(ex.getMessage().contains("steps_out"));

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants