fix(python-worker): coerce integral floats to int for INT/LONG fields by eugenegujing · Pull Request #6053 · apache/texera

eugenegujing · 2026-07-01T23:28:50Z

What changes were proposed in this PR?

Problem. Pandas-based Python operators (e.g. Sort via TableOperator) build a DataFrame from input tuples. When an INT/LONG column contains nulls, pandas promotes the whole column to float64 because an int column cannot hold NaN, so 119 becomes 119.0. On output, the worker's strict schema validation in Tuple.finalize() then fails with TypeError: Unmatched type for field 'weight', expected AttributeType.INT, got 119.0 (<class 'float'>) instead. This crashed every workflow whose CSV had an integer column with at least one missing value, and the only workaround was manually inserting a Type Casting operator for each affected column.

Why fix it in the Python worker (option (a) of the issue). The CSV schema inference is correct (an all-integer column with nulls is INTEGER; the JVM side handles null ints fine), and a UI per-column override (option (c)) would not remove the crash. The type contract is broken by pandas at the Python-worker boundary, so the fix belongs at that boundary's single chokepoint: Tuple.cast_to_schema(), which already performs safe casts (NaN -> None, object -> pickled bytes) right before validate_schema().

The fix (in Tuple.cast_to_schema() only; validate_schema() is unchanged): when the target type is INT or LONG and the value is a float (including np.float64) with a zero fractional part, cast it back to int — but only when the result is provably the original integer:

INT window: Arrow int32 capacity [-2^31, 2^31 - 1]. int32 values are always exactly representable in float64, so capacity is the only constraint.
LONG window: the float64 exact-integer range [-(2^53) + 1, 2^53 - 1] instead of int64 capacity. Above 2^53, float64 rounds, so the received float may already be a corrupted rendition of the original integer; coercing it would turn a loud validation error into silent data corruption. The endpoint 2^53 itself is excluded because it is ambiguous (2^53 + 1 also rounds to float 2^53).
The range check compares the converted int rather than floats, to avoid float rounding at the window endpoints.
Non-integral, infinite, and out-of-window floats are left untouched so validate_schema() still rejects them: lossy coercion must never happen silently. An out-of-window integral float additionally logs an actionable warning suggesting a cast to STRING or DOUBLE (or LONG for large integers in an INT field).

Deliberate behavior change for reviewers to note. Restructuring the if-chain in cast_to_schema() also fixes a pre-existing stale-variable bug: a NaN destined for a BINARY field was first set to None and then re-pickled from the stale local variable, producing pickled-NaN bytes instead of None. NaN in a BINARY field now correctly finalizes to None (guarded by a dedicated test).

The changed logic in core/models/tuple.py. A new module-level constant defines the safely coercible window per integral type:

# Signed value ranges of the integral AttributeTypes within which an
# integral float can be safely cast back to int. INT is bounded by Arrow
# int32 capacity. LONG is bounded by the float64 exact-integer window
# rather than int64 capacity: above 2**53 float64 rounds, so the received
# float may already be a corrupted rendition of the original integer. The
# endpoint 2**53 itself is excluded because it is ambiguous (2**53 + 1
# also rounds to float 2**53).
INTEGRAL_TYPE_RANGES = {
    AttributeType.INT: (-(2**31), 2**31 - 1),
    AttributeType.LONG: (-(2**53) + 1, 2**53 - 1),
}

cast_to_schema()'s per-field loop is restructured from two independent ifs into mutually exclusive branches (null handling / integral-float coercion / BINARY pickling), which both hosts the new coercion and eliminates the stale-variable read described above:

# convert NaN to None to support null value conversion
if checknull(field_value):
    self[field_name] = None
elif field_value is not None:
    field_type = schema.get_attr_type(field_name)
    if (
        field_type in INTEGRAL_TYPE_RANGES
        and isinstance(field_value, float)
        and field_value.is_integer()
    ):
        # pandas promotes an int column holding nulls to float64
        # (119 -> 119.0), so convert integral floats destined for
        # INT/LONG back to int -- but only when the result fits the
        # safe range. Compare on the int result to avoid float
        # rounding at the endpoints.
        min_value, max_value = INTEGRAL_TYPE_RANGES[field_type]
        int_value = int(field_value)
        if min_value <= int_value <= max_value:
            self[field_name] = int_value
        else:
            logger.warning(...)  # actionable guidance, see diff
    elif field_type == AttributeType.BINARY and not isinstance(
        field_value, bytes
    ):
        self[field_name] = b"pickle    " + pickle.dumps(field_value)

The outer per-field try/except (keep the value unchanged if a cast fails, continue with the next field) is preserved, and validate_schema() is untouched, so anything the coercion deliberately skips still fails validation loudly.

Any related issues, documentation, discussions?

Fixes #5935

How was this PR tested?

TDD: the tests were written first and confirmed to reproduce the crash (red), then the fix turned them green.

34 new test cases in amber/src/test/python/core/models/test_tuple.py (59 total in the file, all passing): coercion cases including the int32 and float64-exact-window boundaries and np.float64; rejection of non-integral / infinite / out-of-window floats; the out-of-window warning; NaN/None handling; DOUBLE and STRING fields staying untouched; tests pinning the coercion into cast_to_schema rather than validate_schema; and an integration-style test reproducing the full pipeline (Table.from_tuple_likes -> float64 promotion -> as_tuples -> finalize).
Full Python worker suite: cd amber && pytest -m "not integration" — all pass.
ruff check and ruff format --check clean on both changed files.
sbt "scalafixAll --check" and sbt scalafmtCheckAll pass.
Backend AMBER_TEST_FILTER=skip-integration sbt test: the full suite passes — 0 failed, 0 aborted (WorkflowCore 1570, amber 1076, all other service modules green). Run against a clean iceberg catalog, matching how CI provisions one per run.
Manual reproduction of the issue scenario (CSV with an integer column containing blanks -> Sort) is covered by the integration-style unit test above, which exercises the same Table -> finalize code path the worker uses.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Fable 5)

Problem: pandas-based Python operators (e.g. Sort via TableOperator) build a DataFrame from input tuples. When an INT/LONG column contains nulls, pandas promotes the whole column to float64 because an int column cannot hold NaN, so 119 becomes 119.0. On output, the worker's strict schema validation in Tuple.finalize() then fails with "TypeError: Unmatched type for field 'weight', expected AttributeType.INT, got 119.0 (<class 'float'>) instead." This crashed every workflow whose CSV had an integer column with at least one missing value, forcing users to manually insert a Type Casting operator. Fix (Tuple.cast_to_schema only; validate_schema unchanged): when the target type is INT or LONG and the value is a float (including np.float64) with a zero fractional part, cast it back to int, but only when the result is provably the original integer: - INT window: Arrow int32 capacity [-2^31, 2^31 - 1]. int32 values are always exactly representable in float64, so capacity is the only constraint. - LONG window: the float64 exact-integer range [-(2^53) + 1, 2^53 - 1] instead of int64 capacity. Above 2^53 float64 rounds, so the received float may already be a corrupted rendition of the original integer; coercing it would turn a loud validation error into silent data corruption. - The endpoint 2^53 is excluded because it is ambiguous: 2^53 + 1 also rounds to float 2^53. - The range check compares the converted int, not floats, to avoid rounding at the endpoints. - Non-integral, infinite, and out-of-window floats are left untouched so validate_schema() still rejects them: lossy coercion must never happen silently. - An out-of-window integral float additionally logs an actionable warning suggesting a cast to STRING or DOUBLE (or LONG for large integers in an INT field). Also fixes a pre-existing stale-variable bug exposed by restructuring the if-chain: a NaN destined for a BINARY field was first set to None and then re-pickled from the stale local variable, producing pickled-NaN bytes instead of None. NaN in a BINARY field now correctly finalizes to None (guarded by a dedicated test). Tests: 34 new cases in test_tuple.py: - coercion cases incl. int32 and float64-exact-window boundaries and np.float64 - rejection of non-integral / infinite / out-of-window floats, plus the out-of-window warning - NaN/None handling; DOUBLE and STRING fields stay untouched - coercion pinned into cast_to_schema rather than validate_schema - an integration-style test reproducing the full pipeline (Table.from_tuple_likes -> float64 promotion -> as_tuples -> finalize) Fixes apache#5935

github-actions · 2026-07-01T23:29:04Z

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

Contributors with relevant context: @Ma77Ball, @kunwp1
You can notify them by mentioning @Ma77Ball, @kunwp1 in a comment.

codecov-commenter · 2026-07-01T23:32:46Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.76%. Comparing base (22705e9) to head (7a5811b).
⚠️ Report is 24 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #6053      +/-   ##
============================================
+ Coverage     56.62%   56.76%   +0.13%     
  Complexity     3019     3019              
============================================
  Files          1122     1122              
  Lines         43200    43367     +167     
  Branches       4648     4648              
============================================
+ Hits          24464    24618     +154     
- Misses        17299    17312      +13     
  Partials       1437     1437

Flag	Coverage Δ		*Carryforward flag
access-control-service	`70.00% <ø> (ø)`		Carriedforward from d345bee
agent-service	`44.59% <ø> (ø)`		Carriedforward from d345bee
amber	`58.56% <ø> (ø)`		Carriedforward from d345bee
computing-unit-managing-service	`0.00% <ø> (ø)`		Carriedforward from d345bee
config-service	`52.30% <ø> (ø)`		Carriedforward from d345bee
file-service	`62.81% <ø> (ø)`		Carriedforward from d345bee
frontend	`49.36% <ø> (ø)`		Carriedforward from d345bee
notebook-migration-service	`78.57% <ø> (ø)`		Carriedforward from d345bee
pyamber	`91.19% <100.00%> (+0.98%)`	⬆️
python	`90.75% <ø> (-0.01%)`	⬇️	Carriedforward from d345bee
workflow-compiling-service	`55.14% <ø> (ø)`		Carriedforward from d345bee

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-actions · 2026-07-01T23:34:57Z

⚠️ Benchmark changes need a look

🟢 0 better · 🔴 7 worse · ⚪ 8 noise (<±5%) · 0 without baseline

Compared against main d861036 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

	config	throughput	MB/s	latency	max Δ latest / 7d
🔴	bs=10 sw=10 sl=64	429	0.262	23,793/31,450/31,450 us	🔴 +17.0% / 🔴 +105.7%
🔴	bs=100 sw=10 sl=64	961	0.587	103,611/133,675/133,675 us	🔴 +9.3% / 🔴 +23.3%
⚪	bs=1000 sw=10 sl=64	1,106	0.675	908,370/937,910/937,910 us	⚪ within ±5% / 🟢 -12.3%

Baseline details

Latest main d861036 from same runner

config	metric	PR	latest main	7d avg	Δ latest	Δ 7d
bs=10 sw=10 sl=64	throughput	429 tuples/sec	486 tuples/sec	770.95 tuples/sec	-11.7%	-44.4%
bs=10 sw=10 sl=64	MB/s	0.262 MB/s	0.296 MB/s	0.471 MB/s	-11.5%	-44.3%
bs=10 sw=10 sl=64	p50	23,793 us	20,328 us	12,775 us	+17.0%	+86.2%
bs=10 sw=10 sl=64	p95	31,450 us	28,841 us	15,286 us	+9.0%	+105.7%
bs=10 sw=10 sl=64	p99	31,450 us	28,841 us	18,795 us	+9.0%	+67.3%
bs=100 sw=10 sl=64	throughput	961 tuples/sec	1,001 tuples/sec	976.93 tuples/sec	-4.0%	-1.6%
bs=100 sw=10 sl=64	MB/s	0.587 MB/s	0.611 MB/s	0.596 MB/s	-3.9%	-1.6%
bs=100 sw=10 sl=64	p50	103,611 us	100,902 us	102,557 us	+2.7%	+1.0%
bs=100 sw=10 sl=64	p95	133,675 us	122,344 us	108,383 us	+9.3%	+23.3%
bs=100 sw=10 sl=64	p99	133,675 us	122,344 us	115,249 us	+9.3%	+16.0%
bs=1000 sw=10 sl=64	throughput	1,106 tuples/sec	1,094 tuples/sec	1,009 tuples/sec	+1.1%	+9.6%
bs=1000 sw=10 sl=64	MB/s	0.675 MB/s	0.668 MB/s	0.616 MB/s	+1.0%	+9.6%
bs=1000 sw=10 sl=64	p50	908,370 us	914,222 us	997,695 us	-0.6%	-9.0%
bs=1000 sw=10 sl=64	p95	937,910 us	953,786 us	1,036,731 us	-1.7%	-9.5%
bs=1000 sw=10 sl=64	p99	937,910 us	953,786 us	1,069,334 us	-1.7%	-12.3%

Raw CSV

config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,466.16,200,128000,429,0.262,23792.83,31450.06,31450.06
1,100,10,64,20,2080.67,2000,1280000,961,0.587,103610.80,133675.48,133675.48
2,1000,10,64,20,18081.22,20000,12800000,1106,0.675,908369.62,937910.06,937910.06

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

eugenegujing · 2026-07-02T20:14:08Z

/request-review @aicam

Yicong-Huang

Thanks, please see inline comments

- test_tuple.py: use loguru's `logger` directly instead of aliasing it. - Move INTEGRAL_TYPE_RANGES next to the other per-AttributeType maps in models/schema/attribute_type.py; import it into tuple.py. - Document that the int-column-with-nulls -> float64 promotion is inherent to numpy-backed columns (pandas' default) rather than a specific pandas version, verified on the pinned pandas 2.2.3. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

- test_tuple.py: use loguru's `logger` directly instead of aliasing it. - Move INTEGRAL_TYPE_RANGES next to the other per-AttributeType maps in models/schema/attribute_type.py; import it into tuple.py.

- tuple.py: label out pandas version

github-actions Bot assigned eugenegujing Jul 1, 2026

github-actions Bot added fix pyamber labels Jul 1, 2026

ci: re-trigger checks (flaky ReconfigurationIntegrationSpec on macOS)

38415b1

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions Bot requested a review from aicam July 2, 2026 20:14

Yicong-Huang approved these changes Jul 3, 2026

View reviewed changes

Comment thread amber/src/test/python/core/models/test_tuple.py Outdated

Comment thread amber/src/main/python/core/models/tuple.py Outdated

Comment thread amber/src/main/python/core/models/tuple.py Outdated

refactor(python-worker): address review comments on apache#6053

d345bee

- test_tuple.py: use loguru's `logger` directly instead of aliasing it. - Move INTEGRAL_TYPE_RANGES next to the other per-AttributeType maps in models/schema/attribute_type.py; import it into tuple.py.

eugenegujing force-pushed the fix/python-int-null-schema-coercion branch from 12f0241 to d345bee Compare July 3, 2026 08:50

refactor(python-worker): address review comments on apache#6053

7a5811b

- tuple.py: label out pandas version

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(python-worker): coerce integral floats to int for INT/LONG fields#6053

fix(python-worker): coerce integral floats to int for INT/LONG fields#6053
eugenegujing wants to merge 4 commits into
apache:mainfrom
eugenegujing:fix/python-int-null-schema-coercion

eugenegujing commented Jul 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jul 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

eugenegujing commented Jul 2, 2026

Uh oh!

Yicong-Huang left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

eugenegujing commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this PR?

Any related issues, documentation, discussions?

How was this PR tested?

Was this PR authored or co-authored using generative AI tooling?

Uh oh!

github-actions Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Automated Reviewer Suggestions

Uh oh!

codecov-commenter commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Benchmark changes need a look

Uh oh!

eugenegujing commented Jul 2, 2026

Uh oh!

Yicong-Huang left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

eugenegujing commented Jul 1, 2026 •

edited

Loading

github-actions Bot commented Jul 1, 2026 •

edited

Loading

codecov-commenter commented Jul 1, 2026 •

edited

Loading

github-actions Bot commented Jul 1, 2026 •

edited

Loading