feat: Impl IncrementalAppendScan#590
Merged
Merged
Conversation
9b822d0 to
4c4ffff
Compare
wgtmac
reviewed
Mar 20, 2026
wgtmac
reviewed
Mar 20, 2026
wgtmac
left a comment
Member
There was a problem hiding this comment.
Note: This review was generated by Gemini.
wgtmac
reviewed
Mar 20, 2026
d1bacdf to
5c41a78
Compare
3f97b02 to
8e2ad99
Compare
wgtmac
requested changes
Mar 25, 2026
wgtmac
left a comment
Member
There was a problem hiding this comment.
Note: This review was generated by Gemini.
Summary & Recommendation
- Request Changes: There are logic bugs in handling empty table snapshots and a potential hashing/equality discrepancy for manifest files that need to be addressed.
Member
|
Mostly look good. I will merge this after two remaining issues have been addressed. |
…nstantiate in header
8e2ad99 to
f3cd338
Compare
wgtmac
approved these changes
Mar 26, 2026
kevinjqliu
pushed a commit
to apache/iceberg-python
that referenced
this pull request
Jun 28, 2026
<!-- Closes #2634 --> Closes #2634. # Rationale for this change Adds `IncrementalAppendScan`, which reads the data appended between two snapshots — the building block for incremental ingestion. Largely a revival of the work in #2235; see #2634 and the previous PRs for motivation. Split out of #3364 at the reviewers' request, and builds on the now-merged `BaseScan` / `ManifestGroupPlanner` refactor (#3511), so this PR's diff is the append-scan feature alone. The surface mirrors Iceberg-Java's engine-facing API (snapshot IDs, inclusive/exclusive start, optional start) rather than the narrower Spark read options, since PyIceberg is increasingly used by engines (e.g. Polars). See the API discussion on this PR. References: https://github.com/apache/iceberg (Iceberg-Java and Spark) and apache/iceberg-cpp#590. Inline review-aid comments (prefixed `[AI reviewer aid]`) point at the relevant reference code. # API `Table.incremental_append_scan(...)` returns an `IncrementalAppendScan`; `StagedTable` overrides it to raise, mirroring `scan()`. The scan reads the rows added by **append** snapshots in `(from, to]`, projected onto the table's current schema; delete / overwrite / replace snapshots in the range (e.g. compaction) are ignored. The range is set via the factory's Spark-style kwargs or the builder methods, each of which returns a refined copy (like `select()` / `filter()`): ```python table.incremental_append_scan( from_snapshot_id_exclusive=None, # optional; defaults to the oldest ancestor of `to` to_snapshot_id_inclusive=None, # optional; defaults to the current snapshot row_filter=..., selected_fields=..., case_sensitive=..., options=..., limit=..., ) scan.from_snapshot_id_exclusive(id) # or .from_snapshot_id_inclusive(id) .to_snapshot_id_inclusive(id) ``` The range is held as public attributes — `from_snapshot_id` + `from_snapshot_inclusive` + `to_snapshot_id` — a single start slot plus an inclusive flag, mirroring Java's `TableScanContext` and consistent with the other scans. # Changes - Range resolution mirrors Java's `BaseIncrementalScan`: an unset start scans from the oldest ancestor of the end; an inclusive start resolves to its parent as the exclusive boundary; an exclusive start is validated with `is_parent_ancestor_of`, so an expired start cursor is accepted as long as the lineage still passes through it; the end defaults to the current snapshot; an empty table with no range set scans nothing. - Planning walks the append-only ancestors in the range, dedups the data manifests whose `added_snapshot_id` is in range (set semantics via `ManifestFile.__eq__` / `__hash__`), and filters manifest entries to `ADDED`-in-range via a new `manifest_entry_filter` on `ManifestGroupPlanner.plan_files`. Compacted (`rewrite_data_files`) output is therefore not picked up — no double counting. - Projects onto the table's **current** schema (matching Java/C++), so rows written under an older schema in the range get `NULL` for newer columns. - Adds snapshot helpers `ancestors_between_ids`, `is_ancestor_of`, and `is_parent_ancestor_of`. - Arrow materialization (`to_arrow` / `to_arrow_batch_reader`) is shared with `DataScan` via small module-level helpers that take the projected schema explicitly, so `BaseScan` stays projection-free (per the #3511 review). # Out of scope (tracked follow-ups) - Branch selection (`use_branch`) and per-endpoint ref/tag start & end (`from_ref_*` / `to_ref_*`) — the rest of the engine-facing surface Java exposes. - `count()`, REST server-side planning, and user-facing doc examples (`mkdocs`). - `dictionary_columns` on `IncrementalAppendScan.to_arrow` / `to_arrow_batch_reader` (added to `DataScan` in #3461; the shared helpers already thread it) — kept out to isolate this PR. # Are these changes tested? Yes — unit tests (range resolution including unset / inclusive / exclusive and expired start, current-schema projection, builder and `update()` copies, empty table, staged-table guard) and integration tests (append-only, non-append snapshots ignored, compaction not double-counted, schema evolution within range, partition- and metrics-evaluator pruning, disconnected snapshots), plus the `test_incremental_read` provision fixture. # Are there any user-facing changes? Yes — the new `Table.incremental_append_scan(...)` API and `IncrementalAppendScan` class. No changes to existing public surface. --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.