fix: outer reference confusing calcite optimizations#964
Conversation
Signed-off-by: matthew brian white <whitemat@uk.ibm.com>
|
this seems to require a |
Signed-off-by: matthew brian white <whitemat@uk.ibm.com>
nielspardon
left a comment
There was a problem hiding this comment.
Nice catch on the crash this fixes. When a Calcite optimizer only partially decorrelates a plan (the owning Correlate is rewritten into a join, but $corN.col is left in a Filter condition whose getVariablesSet() is now empty), the CorrelationId was never registered, the access was dropped from fieldAccessDepthMap, and RexExpressionConverter.visitFieldAccess then unboxed a null depth into an int → NPE. Registering the id in visit(Filter) removes that NPE, and for genuinely-nested references (depth ≥ 1) it produces correct outer references.
One concern with the result: for the common single-reference case the pre-scan registers the correlation at depth 0, so the converter emits newRootStructOuterReference(..., stepsOut=0). That is an invalid Substrait plan, not just an odd one — the proto is explicit that steps_out must be >= 1:
OuterReference.steps_out: "Number of subquery boundaries to traverse up for this field's reference. Must be >= 1."
— https://github.com/substrait-io/substrait/blob/8675ba395ce966e593ca66575930f59dc359b419/proto/substrait/algebra.proto#L1649-L1653
Consistently, FieldReference.isOuterReference() is outerReferenceStepsOut().orElse(0) > 0, and on the reverse path a 0 reference satisfies neither isSimpleRootReference() nor isOuterReference() and falls through both branches of ExpressionRexConverter.visit(FieldReference). (Emitting a plain non-outer reference instead is also wrong: the field index is in the correlated relation's coordinate system, not the Filter's input — e.g. $cor0.SS_ITEM_SK is index 2 in STORE_SALES but index 2 in ITEM is I_REC_START_DATE.)
So I'd suggest pairing the resolver fix with a guard so the unrepresentable depth-0 case fails loudly instead of producing an invalid plan.
Proposal 1 — guard the converter (isthmus/.../expression/RexExpressionConverter.java, visitFieldAccess, CORREL_VARIABLE case; that file isn't in this diff, hence here rather than inline):
case CORREL_VARIABLE:
{
Integer stepsOut = relVisitor.getFieldAccessDepth(fieldAccess);
// Substrait requires steps_out >= 1 for an outer reference
// (proto/substrait/algebra.proto, OuterReference.steps_out: "Must be >= 1.").
// A null/0 depth means the correlation does not step out of any subquery,
// typically a plan that was only partially decorrelated (the owning Correlate
// was rewritten into a join but the $cor reference remains in this Filter).
// We can't represent that with steps_out, so fail clearly rather than emit an
// invalid steps_out=0 reference. Id-based resolution (substrait#1031) would
// handle this; tracked in #869.
if (stepsOut == null || stepsOut == 0) {
throw new UnsupportedOperationException(
String.format(
"Cannot convert outer reference %s: it does not step out of any "
+ "subquery (steps_out=%s). This usually indicates a plan that was "
+ "only partially decorrelated; Substrait requires steps_out >= 1.",
fieldAccess, stepsOut));
}
return FieldReference.newRootStructOuterReference(
fieldAccess.getField().getIndex(),
typeConverter.toSubstrait(fieldAccess.getType()),
stepsOut);
}This also subsumes the original NPE (the null case) under one clear message.
Follow-up: the proper long-term fix — id-based outer-reference resolution from substrait-io/substrait#1031, which would let these partially-decorrelated / DAG plans actually convert — is already tracked under #869 (Update to Substrait v0.89.0). No new issue needed; just point the guard comment at #869.
| public RexNode visitFieldAccess(RexFieldAccess fieldAccess) { | ||
| if (fieldAccess.getReferenceExpr() instanceof RexCorrelVariable) { | ||
| CorrelationId id = ((RexCorrelVariable) fieldAccess.getReferenceExpr()).id; | ||
| nestedDepth.putIfAbsent(id, 0); |
There was a problem hiding this comment.
Registering the id here is the right move — it removes the downstream NPE and lets genuinely-nested references (depth ≥ 1) resolve correctly.
The thing to be aware of: for a single top-level residual reference (no enclosing subquery), this anchors the id at depth 0 and the access is recorded at 0. Downstream that becomes newRootStructOuterReference(..., stepsOut=0), which is an invalid Substrait plan — the proto requires steps_out >= 1 (algebra.proto#L1649-L1653). See the guard proposal in the review summary so this fails loudly instead of emitting an invalid reference.
| // Post-fix: the Filter condition pre-scan registers $cor0, so the field access is recorded. | ||
| final Map<RexFieldAccess, Integer> fieldAccessDepthMap = buildOuterFieldRefMap(calciteRel); | ||
| Assertions.assertEquals(1, fieldAccessDepthMap.size()); | ||
| validateOuterRef(fieldAccessDepthMap, "$cor0", "SS_ITEM_SK", 0); |
There was a problem hiding this comment.
This assertion enshrines the invalid steps_out=0 output as the expected behaviour. Since the proto requires steps_out >= 1, depth 0 isn't representable.
Proposal: keep the assertion that the resolver now registers the correlation (it proves the NPE is gone), then assert that conversion is rejected rather than producing an invalid plan:
// Resolver registers the dangling correlation (no longer silently dropped → no NPE)...
Map<RexFieldAccess, Integer> map = buildOuterFieldRefMap(calciteRel);
Assertions.assertEquals(1, map.size());
validateOuterRef(map, "$cor0", "SS_ITEM_SK", 0);
// ...but steps_out=0 is not a valid Substrait outer reference (must be >= 1),
// so conversion fails clearly rather than emitting an invalid plan.
UnsupportedOperationException ex =
Assertions.assertThrows(
UnsupportedOperationException.class,
() -> SubstraitRelVisitor.convert(calciteRel, converterProvider));
Assertions.assertTrue(ex.getMessage().contains("steps_out"));
fix: outer reference ocnfusing post calcite optimizations