refactor: route shim-registered expressions through CometExpressionSerde by andygrove · Pull Request #4139 · apache/datafusion-comet

andygrove · 2026-04-29T00:05:53Z

Which issue does this PR close?

Closes #4077.

Rationale for this change

StringDecode, ToPrettyString, and (4.x) MapSort were registered through CometExprShim.sparkVersionSpecificExprToProtoInternal and bypassed the CometExpressionSerde framework. They had no per-expression enabled config, no support-level gate, and no entry in the auto-generated compatibility guide.

A second consequence: because the shim's match was version-private, the docs generator only saw the version-agnostic serde maps, so there was no way to publish per-Spark-version compatibility pages.

What changes are included in this PR?

Shim-side category maps

Add four category-keyed maps to CometExprShim: sparkVersionSpecificStringExpressions, sparkVersionSpecificMathExpressions, sparkVersionSpecificMiscExpressions, sparkVersionSpecificMapExpressions. Each per-Spark-version trait fills these with the classes only available (or only handled) on that Spark version.
QueryPlanSerde merges these into the global category maps, so the auto-generated compatibility docs and the per-expression enabled config now reach version-specific expressions.
Make binaryOutputStyle public on the trait so serde objects can read it via QueryPlanSerde.binaryOutputStyle.
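
The merge described above can be sketched roughly as follows. This is a minimal, self-contained model rather than the PR's actual code: `Expression`, `Upper`, `CometUpper`, `Spark35Shim`, and `QueryPlanSerdeSketch` are stand-ins invented for illustration, and only the `sparkVersionSpecificStringExpressions` name and the map's type shape come from the PR.

```scala
object ShimMapsSketch {
  // Stand-in types; the real ones live in Spark and Comet.
  trait Expression
  final class Upper extends Expression        // hypothetical version-agnostic expression
  final class StringDecode extends Expression // registered only by some shims

  trait CometExpressionSerde[T <: Expression] { def name: String }
  object CometUpper extends CometExpressionSerde[Upper] { val name = "Upper" }
  object CometStringDecode extends CometExpressionSerde[StringDecode] {
    val name = "StringDecode"
  }

  type SerdeMap = Map[Class[_ <: Expression], CometExpressionSerde[_]]

  // Shim hook: empty by default, filled in by each per-Spark-version trait.
  trait CometExprShim {
    def sparkVersionSpecificStringExpressions: SerdeMap = Map.empty
  }

  // Hypothetical per-version shim that contributes one extra expression.
  trait Spark35Shim extends CometExprShim {
    override def sparkVersionSpecificStringExpressions: SerdeMap =
      Map(classOf[StringDecode] -> CometStringDecode)
  }

  // The global category map is the version-agnostic base plus the shim's entries.
  object QueryPlanSerdeSketch extends Spark35Shim {
    private val baseStringExpressions: SerdeMap = Map(classOf[Upper] -> CometUpper)
    val stringExpressions: SerdeMap =
      baseStringExpressions ++ sparkVersionSpecificStringExpressions
  }
}
```

Because the docs generator and the per-expression config lookup both read the merged map, anything a shim adds becomes visible to them without a version-private match.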

Migrate shim-resident expressions to CometExpressionSerde

StringDecode (3.4 / 3.5) → CometStringDecode
ToPrettyString (3.5 / 4.x) → CometToPrettyString
MapSort (4.x) → CometMapSort

WidthBucket was originally migrated by this PR, but by the time main was merged in, #4538 had already routed it through CometCodegenDispatch[WidthBucket]. This PR therefore keeps main's serde and contributes only the sparkVersionSpecificMathExpressions map entry and CometWidthBucketSuite.

Share the 4.x trait body

New Spark4xCometExprShim (under spark-4.x/) holds the parts identical across 4.0 / 4.1 / 4.2: KnownNotContainsNull, Invoke-on-StructsToJsonEvaluator/ParseUrlEvaluator, the StaticInvoke-on-StringDecode rewrite, and lengthOfJsonArray / DayName / MonthName / structured-text dispatch (the latter via the existing CometExprShim4x it now extends). The per-minor-version CometExprShim traits override only binaryOutputStyle and add make_time / to_time / try_to_time (4.1+).
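
The sharing pattern can be illustrated with a small sketch. It assumes a simplified model in which a shim is described only by its `binaryOutputStyle` and a set of handled expression names; `OTHER_STYLE`, `handled`, `Spark40Shim`, and `Spark41Shim` are hypothetical names, and the real traits dispatch on Spark expression classes rather than strings.

```scala
object Spark4xShimSketch {
  // Stand-in for Comet's BinaryOutputStyle. HEX_DISCRETE appears in the PR diff;
  // OTHER_STYLE is a hypothetical placeholder for a minor-version override.
  sealed trait BinaryOutputStyle
  object BinaryOutputStyle {
    case object HEX_DISCRETE extends BinaryOutputStyle
    case object OTHER_STYLE extends BinaryOutputStyle
  }

  // Shared body: everything that is identical across the 4.x minors lives here.
  trait Spark4xCometExprShim {
    def binaryOutputStyle: BinaryOutputStyle = BinaryOutputStyle.HEX_DISCRETE
    def handled: Set[String] = Set("KnownNotContainsNull", "DayName", "MonthName")
  }

  // A minor version with nothing extra simply inherits the shared body.
  object Spark40Shim extends Spark4xCometExprShim

  // A later minor overrides the style and adds its own expressions on top.
  object Spark41Shim extends Spark4xCometExprShim {
    override def binaryOutputStyle: BinaryOutputStyle = BinaryOutputStyle.OTHER_STYLE
    override def handled: Set[String] =
      super.handled ++ Set("make_time", "to_time", "try_to_time")
  }
}
```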

Per-Spark-version compatibility pages (GenerateDocs)

GenerateDocs.main accepts an optional second arg: a per-Spark-version subdirectory name (e.g. spark-3.5). When supplied, pages are written to compatibility/expressions/<version>/; when omitted, the legacy flat compatibility/expressions/ layout is preserved for released-version doc trees.
This is what makes the new sparkVersionSpecific*Expressions maps observable in the docs: each Spark version's docs reflect its own version-specific expressions.
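
A rough sketch of the argument handling, under the assumption that the first argument is the docs root directory (the PR description only specifies the optional second argument). `outputDir` is a hypothetical helper, not a function in GenerateDocs.

```scala
import java.nio.file.{Path, Paths}

object GenerateDocsArgsSketch {
  // Assumption: args(0) is the docs root; args(1), if present, is a
  // per-Spark-version subdirectory name such as "spark-3.5".
  def outputDir(args: Array[String]): Path = {
    val base = Paths.get(args(0), "compatibility", "expressions")
    val version = if (args.length > 1) Some(args(1)) else None
    version match {
      case Some(v) => base.resolve(v) // per-version layout
      case None    => base            // legacy flat layout
    }
  }
}
```

With this shape, a build script could call the generator once per Spark profile with a different second argument, while older doc trees keep calling it with a single argument.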

How are these changes tested?

New framework tests (CometStringDecodeSuite, CometToPrettyStringSuite, CometMapSortSuite, CometWidthBucketSuite) toggle spark.comet.expression.<X>.enabled and assert fallback to Spark when disabled. These fail before the refactor and pass after.
Existing CometStringExpressionSuite, CometMapExpressionSuite, etc. continue to pass — no behavior change for accepted expressions.
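
The behavior those suites assert can be modeled in a few lines. This is a sketch only: the config key pattern comes from the description above, while `convert`, the string-valued "proto" result, and the default of enabled are assumptions made for illustration.

```scala
object EnabledGateSketch {
  // Key pattern from the PR description: spark.comet.expression.<X>.enabled
  def enabledKey(exprName: String): String =
    s"spark.comet.expression.$exprName.enabled"

  // Hypothetical gate: Some(...) means Comet handles the expression,
  // None means the planner falls back to Spark. Defaulting to enabled
  // when the key is absent is an assumption of this sketch.
  def convert(exprName: String, conf: Map[String, String]): Option[String] = {
    val enabled = conf.getOrElse(enabledKey(exprName), "true").toBoolean
    if (enabled) Some(s"proto($exprName)") else None
  }
}
```

Before the refactor, the shim-resident expressions never consulted such a key, which is why a test that disables the flag and expects fallback would have failed.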

…serde

# Conflicts:
# spark/src/main/spark-4.0/org/apache/comet/shims/CometExprShim.scala
# spark/src/main/spark-4.1/org/apache/comet/shims/CometExprShim.scala
# spark/src/main/spark-4.2/org/apache/comet/shims/CometExprShim.scala

…iteral block

YAML literal blocks (`|`) do not strip `#` comments, so the previous inline comment lines were passed to the runner as bogus suite class names. Move the reference into a proper YAML comment outside the block so dev/ci/check-suites.py still finds the substring without polluting the suite list.

comphead · 2026-04-29T15:22:08Z

-  protected def binaryOutputStyle: BinaryOutputStyle = BinaryOutputStyle.HEX_DISCRETE
+  def binaryOutputStyle: BinaryOutputStyle = BinaryOutputStyle.HEX_DISCRETE
+
+  def versionSpecificStringExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] =


Suggested change

-  def versionSpecificStringExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] =
+  def sparkVersionSpecificStringExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] =

?

Otherwise it is not clear which version this refers to: Spark, Comet, or something else.

comphead · 2026-04-29T15:23:16Z

+import org.apache.spark.sql.CometTestBase
+import org.apache.spark.sql.execution.{ProjectExec, SparkPlan}
+
+class CometWidthBucketFrameworkSuite extends CometTestBase {


I'm not sure what Framework means for the width_bucket test.

comphead · 2026-04-29T15:23:48Z

+
+import org.apache.comet.CometSparkSessionExtensions.isSpark40Plus
+
+class CometStringDecodeFrameworkSuite extends CometTestBase {


The word Framework looks redundant.

andygrove · 2026-05-05T20:15:54Z

Thanks @comphead. I addressed the feedback. Could you take another look?

Closes apache#510 (or contributes to it).

The compatibility guide is now generated separately for each supported Spark profile (3.4, 3.5, 4.0, 4.1). Source category templates moved under _category_template/, and docs/build.sh runs GenerateDocs once per profile, populating per-version subdirectories under compatibility/expressions/. Hand-written pages and configs.md remain shared.

GenerateDocs accepts an optional second argument naming the per-version subdirectory; without it, the legacy flat layout is preserved (used by older released doc trees that don't carry the new structure).

Combined with the shim-to-framework migration in this PR, the per-version output now reflects real differences (e.g. StringDecode appears in the Spark 3.4 unsupported list but not on Spark 4.x, where it's handled via a StaticInvoke rewrite).

# Conflicts:
# .github/workflows/pr_build_linux.yml
# .github/workflows/pr_build_macos.yml
# docs/source/user-guide/latest/compatibility/expressions/index.md
# spark/src/main/scala/org/apache/comet/GenerateDocs.scala
# spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
# spark/src/main/spark-4.0/org/apache/comet/shims/CometExprShim.scala
# spark/src/main/spark-4.1/org/apache/comet/shims/CometExprShim.scala
# spark/src/main/spark-4.2/org/apache/comet/shims/CometExprShim.scala

mbutrovich

Thanks @andygrove!

andygrove · 2026-06-11T19:19:23Z

Merged. Thanks @mbutrovich!

andygrove added 9 commits April 28, 2026 16:00

refactor: add versionSpecific* map hooks to CometExprShim

1e31abb

refactor: merge versionSpecific* maps into QueryPlanSerde categories

736abe1

docs: explain why QueryPlanSerde category maps wrap in a typed base val

b10128b

refactor: migrate StringDecode to CometExpressionSerde framework

a914b12

Register CometStringDecode (shared under spark-3.x/) in the versionSpecificStringExpressions map for Spark 3.4 and 3.5, and remove the now-redundant case branch from versionSpecificExprToProtoInternal.

refactor: migrate WidthBucket (Spark 3.5) to CometExpressionSerde

b11bc97

refactor: migrate ToPrettyString (Spark 3.5) to CometExpressionSerde

f10412e

refactor: use QueryPlanSerde.binaryOutputStyle in CometToPrettyString

df3aa32

refactor: migrate WidthBucket (Spark 4.x) to CometExpressionSerde

971ab91

refactor: migrate ToPrettyString (Spark 4.x) to CometExpressionSerde

aa8e356

andygrove marked this pull request as draft April 29, 2026 00:14

andygrove added 4 commits April 28, 2026 18:18

refactor: migrate MapSort to CometExpressionSerde and share Spark 4.x…

af2e8b6

… shim body

ci: add framework test suites to PR workflows

7d7a6e0

andygrove marked this pull request as ready for review April 29, 2026 12:58

comphead reviewed Apr 29, 2026

refactor: clarify shim API names per review feedback

83ea3db

Rename versionSpecific* shim methods to sparkVersionSpecific* so the "Spark version" intent is explicit, and drop the redundant "Framework" suffix from the new test suites.

Merge remote-tracking branch 'origin/main' into feat/issue-4077-shim-…

34a513d

…serde

# Conflicts:
# spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

andygrove modified the milestone: 0.16.0 May 6, 2026

andygrove added 4 commits May 28, 2026 11:21

ci: add CometStringDecodeSuite to pr_build_macos workflow

6c5f722

Merge remote-tracking branch 'apache/main' into pr-4139

1b5d790

mbutrovich approved these changes Jun 11, 2026

andygrove merged commit 8fa83e7 into apache:main Jun 11, 2026
71 checks passed

andygrove deleted the feat/issue-4077-shim-serde branch June 11, 2026 19:19

mbutrovich mentioned this pull request Jun 16, 2026

chore: fix generate-release-docs.sh for per-Spark-version doc layout #4662

Merged

marvelshan pushed a commit to marvelshan/datafusion-comet that referenced this pull request Jul 2, 2026

refactor: route shim-registered expressions through CometExpressionSe…

3659c17

…rde (apache#4139)

andygrove merged 19 commits into apache:main from andygrove:feat/issue-4077-shim-serde

3 participants
