Context
From Copilot review on #16 (COPILOT-5): the validate command currently loads every table/task Parquet fully into memory via pd.read_parquet() even when only row counts or column names are needed.
Problem
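For larger bundles this makes validation slow and memory-intensive, because the cost scales with total file size rather than with the small amount of metadata the checks actually need.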
Proposed solution
- Use Parquet metadata (pyarrow.parquet.read_metadata()) for row counts instead of loading full DataFrames
- For FK checks, read only the required columns via columns=[fk.child_column]
- For leakage checks, read only schema/column names without loading data
Priority
Low — v1 bundles are small (~5K leads), so this is not a blocker. Worth doing before scaling to larger datasets.
🤖 Generated with Claude Code