Skip to content

refactor: route shim-registered expressions through CometExpressionSerde#4139

Merged
andygrove merged 19 commits into
apache:mainfrom
andygrove:feat/issue-4077-shim-serde
Jun 11, 2026
Merged

refactor: route shim-registered expressions through CometExpressionSerde#4139
andygrove merged 19 commits into
apache:mainfrom
andygrove:feat/issue-4077-shim-serde

Conversation

@andygrove

@andygrove andygrove commented Apr 29, 2026

Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #4077.

Rationale for this change

StringDecode, ToPrettyString, and (4.x) MapSort were registered through CometExprShim.sparkVersionSpecificExprToProtoInternal and bypassed the CometExpressionSerde framework. They had no per-expression enabled config, no support-level gate, and no entry in the auto-generated compatibility guide.

A second consequence: because the shim's match was version-private, the docs generator only saw the version-agnostic serde maps, so there was no way to publish per-Spark-version compatibility pages.

What changes are included in this PR?

Shim-side category maps

  • Add four category-keyed maps to CometExprShim: sparkVersionSpecificStringExpressions, sparkVersionSpecificMathExpressions, sparkVersionSpecificMiscExpressions, sparkVersionSpecificMapExpressions. Each per-Spark-version trait fills these with the classes only available (or only handled) on that Spark version.
  • QueryPlanSerde merges these into the global category maps, so the auto-generated compatibility docs and the per-expression enabled config now reach version-specific expressions.
  • Make binaryOutputStyle public on the trait so serde objects can read it via QueryPlanSerde.binaryOutputStyle.

Migrate shim-resident expressions to CometExpressionSerde

  • StringDecode (3.4 / 3.5) → CometStringDecode
  • ToPrettyString (3.5 / 4.x) → CometToPrettyString
  • MapSort (4.x) → CometMapSort

WidthBucket was originally migrated by this PR, but during the merge main's #4538 had already routed it through CometCodegenDispatch[WidthBucket]; this PR keeps main's serde and only contributes the sparkVersionSpecificMathExpressions map entry and CometWidthBucketSuite.

Share the 4.x trait body

New Spark4xCometExprShim (under spark-4.x/) holds the parts identical across 4.0 / 4.1 / 4.2: KnownNotContainsNull, Invoke-on-StructsToJsonEvaluator/ParseUrlEvaluator, the StaticInvoke-on-StringDecode rewrite, and lengthOfJsonArray / DayName / MonthName / structured-text dispatch (the latter via the existing CometExprShim4x it now extends). The per-minor-version CometExprShim traits override only binaryOutputStyle and add make_time / to_time / try_to_time (4.1+).

Per-Spark-version compatibility pages (GenerateDocs)

  • GenerateDocs.main accepts an optional second arg: a per-Spark-version subdirectory name (e.g. spark-3.5). When supplied, pages are written to compatibility/expressions/<version>/; when omitted, the legacy flat compatibility/expressions/ layout is preserved for released-version doc trees.
  • This is what makes the new sparkVersionSpecific*Expressions maps observable in the docs: each Spark version's docs reflect its own version-specific expressions.

How are these changes tested?

  • New framework tests (CometStringDecodeSuite, CometToPrettyStringSuite, CometMapSortSuite, CometWidthBucketSuite) toggle spark.comet.expression.<X>.enabled and assert fallback to Spark when disabled. These fail before the refactor and pass after.
  • Existing CometStringExpressionSuite, CometMapExpressionSuite, etc. continue to pass — no behavior change for accepted expressions.

@andygrove andygrove marked this pull request as draft April 29, 2026 00:14
…serde

# Conflicts:
#	spark/src/main/spark-4.0/org/apache/comet/shims/CometExprShim.scala
#	spark/src/main/spark-4.1/org/apache/comet/shims/CometExprShim.scala
#	spark/src/main/spark-4.2/org/apache/comet/shims/CometExprShim.scala
…iteral block

YAML literal blocks (`|`) do not strip `#` comments, so the previous
inline comment lines were passed to the runner as bogus suite class
names. Move the reference into a proper YAML comment outside the block
so dev/ci/check-suites.py still finds the substring without polluting
the suite list.
@andygrove andygrove marked this pull request as ready for review April 29, 2026 12:58
protected def binaryOutputStyle: BinaryOutputStyle = BinaryOutputStyle.HEX_DISCRETE
def binaryOutputStyle: BinaryOutputStyle = BinaryOutputStyle.HEX_DISCRETE

def versionSpecificStringExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def versionSpecificStringExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] =
def sparkVersionSpecificStringExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] =

?

Otherwise it is not clear what version referring to, if its spark, comet, etc?

import org.apache.spark.sql.CometTestBase
import org.apache.spark.sql.execution.{ProjectExec, SparkPlan}

class CometWidthBucketFrameworkSuite extends CometTestBase {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what Framework means for width_bucket test


import org.apache.comet.CometSparkSessionExtensions.isSpark40Plus

class CometStringDecodeFrameworkSuite extends CometTestBase {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the word Framework looks redundant

Rename versionSpecific* shim methods to sparkVersionSpecific* so the
"Spark version" intent is explicit, and drop the redundant "Framework"
suffix from the new test suites.
@andygrove

Copy link
Copy Markdown
Member Author

Thanks @comphead. I addressed feedback. Could you take another look?

…serde

# Conflicts:
#	spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
@andygrove andygrove modified the milestone: 0.16.0 May 6, 2026
andygrove added 4 commits May 28, 2026 11:21
Closes apache#510 (or contributes to it).

The compatibility guide is now generated separately for each supported Spark
profile (3.4, 3.5, 4.0, 4.1). Source category templates moved under
_category_template/, and docs/build.sh runs GenerateDocs once per profile,
populating per-version subdirectories under compatibility/expressions/. Hand
-written pages and configs.md remain shared.

GenerateDocs accepts an optional second argument naming the per-version
subdirectory; without it, the legacy flat layout is preserved (used by older
released doc trees that don't carry the new structure).

Combined with the shim-to-framework migration in this PR, the per-version
output now reflects real differences (e.g. StringDecode appears in the Spark
3.4 unsupported list but not on Spark 4.x where it's handled via a
StaticInvoke rewrite).
# Conflicts:
#	.github/workflows/pr_build_linux.yml
#	.github/workflows/pr_build_macos.yml
#	docs/source/user-guide/latest/compatibility/expressions/index.md
#	spark/src/main/scala/org/apache/comet/GenerateDocs.scala
#	spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
#	spark/src/main/spark-4.0/org/apache/comet/shims/CometExprShim.scala
#	spark/src/main/spark-4.1/org/apache/comet/shims/CometExprShim.scala
#	spark/src/main/spark-4.2/org/apache/comet/shims/CometExprShim.scala

@mbutrovich mbutrovich left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove!

@andygrove andygrove merged commit 8fa83e7 into apache:main Jun 11, 2026
71 checks passed
@andygrove andygrove deleted the feat/issue-4077-shim-serde branch June 11, 2026 19:19
@andygrove

Copy link
Copy Markdown
Member Author

Merged. Thanks @mbutrovich!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expressions added via CometExprShim bypass the CometExpressionSerde framework

3 participants