refactor: use Parquet metadata for row counts in validate #39
Merged
Replace full-file reads with pyarrow metadata reads in bundle validation:
- _check_task_splits: pq.read_metadata().num_rows instead of pd.read_parquet()
- _check_leakage: pq.read_schema().names instead of pd.read_parquet(columns=[])

Closes #17

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Refactors bundle validation to use Parquet footer metadata/schema for lightweight checks, reducing memory usage and speeding up validate on larger bundles.
Changes:
- Use `pyarrow.parquet.read_metadata(...).num_rows` for task split row-count validation instead of loading full DataFrames.
- Use `pyarrow.parquet.read_schema(...).names` for leakage column detection instead of `pd.read_parquet(..., columns=[])`.
- Add tests covering metadata/data consistency, task split row-count mismatch detection, and leakage detection via schema.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `leadforge/validation/bundle_checks.py` | Switches task split row-count checks to Parquet metadata and leakage checks to Parquet schema reads. |
| `tests/validation/test_bundle_checks.py` | Adds regression tests for metadata row counts, task split row mismatch detection, and schema-based leakage detection. |
| `.agent-plan.md` | Documents completion of the Parquet-metadata validation refactor and associated tests. |
- Only call pq.read_metadata() when the manifest has an expected row count (avoids unnecessary I/O for partial manifests)
- Replace pyarrow-vs-pandas consistency test with monkeypatch test proving _check_task_splits never calls pd.read_parquet

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pr-agent-context report: No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #39 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.
Summary
- Replace `pd.read_parquet()` with `pq.read_metadata().num_rows` in `_check_task_splits()` for row count validation; this avoids loading full Parquet data.
- Replace `pd.read_parquet(path, columns=[])` with `pq.read_schema().names` in `_check_leakage()` for column name checks; this reads only the schema footer.

Note: `_check_tables()` still loads full DataFrames because they are needed downstream for FK integrity checks.

Closes #17
Test plan
- `test_task_split_row_count_mismatch` verifies error detection via metadata
- `test_leakage_detects_extra_columns` verifies column detection via `pq.read_schema()`
- `test_task_split_metadata_matches_data` verifies metadata row counts match actual data

🤖 Generated with Claude Code