[Append Scan] Integration tests for `IncrementalAppendScan` by smaheshwar-pltr · Pull Request #2235 · apache/iceberg-python

smaheshwar-pltr · 2025-07-22T12:55:42Z

Note: Contains changes from

[Append Scan] Introduce IncrementalAppendScan class (without integration tests) #2234

Smaller diff from those changes: smaheshwar-pltr#6.

Rationale for this change

Split up from incremental append scan work - see #2031 (comment). PyIceberg doesn't support incremental reading of appended data between snapshots, like Spark does.

This PR adds integration tests for incremental appends cans.

Are these changes tested?

Only changes tests (ignoring PRs this one depends on)

Are there any user-facing changes?

No, ignoring PRs this one depends on

github-actions · 2026-03-17T00:28:57Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions · 2026-03-25T00:30:44Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

Drop references to other Iceberg clients (Java, Spark) and re-shape the prose to match the style already used by `Table.scan` and `DataScan` (which the original PR apache#2235 also mirrored). - Trim `BaseScan` docstring to a single line, matching `TableScan`. - Reduce `IncrementalAppendScan` class docstring to the args block, mirroring `DataScan`'s shape. Roll the "from may be expired" note into the `from_snapshot_id_exclusive` arg description. - Match `from_snapshot_exclusive` / `to_snapshot_inclusive` chaining docstrings to PR apache#2235's wording ("this for method chaining"). - Simplify `is_parent_ancestor_of` and `plan_files` docstrings. - Drop the "matches iceberg-java" comment from the test parametrize. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes #2634. # Rationale for this change Adds `IncrementalAppendScan`, which reads the data appended between two snapshots — the building block for incremental ingestion. Largely a revival of the work in #2235; see #2634 and the previous PRs for motivation. Split out of #3364 at the reviewers' request, and builds on the now-merged `BaseScan` / `ManifestGroupPlanner` refactor (#3511), so this PR's diff is the append-scan feature alone. The surface mirrors Iceberg-Java's engine-facing API (snapshot IDs, inclusive/exclusive start, optional start) rather than the narrower Spark read options, since PyIceberg is increasingly used by engines (e.g. Polars). See the API discussion on this PR. References: https://github.com/apache/iceberg (Iceberg-Java and Spark) and apache/iceberg-cpp#590. Inline review-aid comments (prefixed `[AI reviewer aid]`) point at the relevant reference code. # API `Table.incremental_append_scan(...)` returns an `IncrementalAppendScan`; `StagedTable` overrides it to raise, mirroring `scan()`. The scan reads the rows added by **append** snapshots in `(from, to]`, projected onto the table's current schema; delete / overwrite / replace snapshots in the range (e.g. compaction) are ignored. The range is set via the factory's Spark-style kwargs or the builder methods, each of which returns a refined copy (like `select()` / `filter()`): ```python table.incremental_append_scan( from_snapshot_id_exclusive=None, # optional; defaults to the oldest ancestor of `to` to_snapshot_id_inclusive=None, # optional; defaults to the current snapshot row_filter=..., selected_fields=..., case_sensitive=..., options=..., limit=..., ) scan.from_snapshot_id_exclusive(id) # or .from_snapshot_id_inclusive(id) .to_snapshot_id_inclusive(id) ``` The range is held as public attributes — `from_snapshot_id` + `from_snapshot_inclusive` + `to_snapshot_id` — a single start slot plus an inclusive flag, mirroring Java's `TableScanContext` and consistent with the other scans. # Changes - Range resolution mirrors Java's `BaseIncrementalScan`: an unset start scans from the oldest ancestor of the end; an inclusive start resolves to its parent as the exclusive boundary; an exclusive start is validated with `is_parent_ancestor_of`, so an expired start cursor is accepted as long as the lineage still passes through it; the end defaults to the current snapshot; an empty table with no range set scans nothing. - Planning walks the append-only ancestors in the range, dedups the data manifests whose `added_snapshot_id` is in range (set semantics via `ManifestFile.__eq__` / `__hash__`), and filters manifest entries to `ADDED`-in-range via a new `manifest_entry_filter` on `ManifestGroupPlanner.plan_files`. Compacted (`rewrite_data_files`) output is therefore not picked up — no double counting. - Projects onto the table's **current** schema (matching Java/C++), so rows written under an older schema in the range get `NULL` for newer columns. - Adds snapshot helpers `ancestors_between_ids`, `is_ancestor_of`, and `is_parent_ancestor_of`. - Arrow materialization (`to_arrow` / `to_arrow_batch_reader`) is shared with `DataScan` via small module-level helpers that take the projected schema explicitly, so `BaseScan` stays projection-free (per the #3511 review). # Out of scope (tracked follow-ups) - Branch selection (`use_branch`) and per-endpoint ref/tag start & end (`from_ref_*` / `to_ref_*`) — the rest of the engine-facing surface Java exposes. - `count()`, REST server-side planning, and user-facing doc examples (`mkdocs`). - `dictionary_columns` on `IncrementalAppendScan.to_arrow` / `to_arrow_batch_reader` (added to `DataScan` in #3461; the shared helpers already thread it) — kept out to isolate this PR. # Are these changes tested? Yes — unit tests (range resolution including unset / inclusive / exclusive and expired start, current-schema projection, builder and `update()` copies, empty table, staged-table guard) and integration tests (append-only, non-append snapshots ignored, compaction not double-counted, schema evolution within range, partition- and metrics-evaluator pruning, disconnected snapshots), plus the `test_incremental_read` provision fixture. # Are there any user-facing changes? Yes — the new `Table.incremental_append_scan(...)` API and `IncrementalAppendScan` class. No changes to existing public surface. --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Sreesh Maheshwar added 5 commits July 22, 2025 10:51

Introduce an AbstractTableScan with default methods

90a861f

Extract manifest group planning into separate class

6d561f8

Add __eq__ and __hash__ methods to ManifestFile

1eeedb8

Introduce IncrementalAppendScan class

17a6865

Incremental append scan integration tests

c933870

This was referenced Jul 22, 2025

Incremental Append Scan #2031

Closed

[Append Scan] Introduce IncrementalAppendScan class (without integration tests) #2234

Closed

smaheshwar-pltr marked this pull request as ready for review July 22, 2025 13:36

smaheshwar-pltr mentioned this pull request Oct 19, 2025

Incremental Append Scan #2634

Closed

github-actions Bot added the stale label Mar 17, 2026

github-actions Bot closed this Mar 25, 2026

smaheshwar-pltr mentioned this pull request Jun 16, 2026

Feature: Incremental Append Scan #3512

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Append Scan] Integration tests for `IncrementalAppendScan`#2235

[Append Scan] Integration tests for `IncrementalAppendScan`#2235
smaheshwar-pltr wants to merge 5 commits into
apache:mainfrom
smaheshwar-pltr:sm/incremental-append-scan-tests

smaheshwar-pltr commented Jul 22, 2025 •

edited

Loading

Uh oh!

github-actions Bot commented Mar 17, 2026

Uh oh!

github-actions Bot commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

smaheshwar-pltr commented Jul 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

Are these changes tested?

Are there any user-facing changes?

Uh oh!

github-actions Bot commented Mar 17, 2026

Uh oh!

github-actions Bot commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

smaheshwar-pltr commented Jul 22, 2025 •

edited

Loading