perf: use Parquet metadata for row counts in validate command #17

@shaypal5

Context

From the Copilot review on #16 (COPILOT-5): the validate command currently loads every table/task Parquet file fully into memory via pd.read_parquet(), even when only row counts or column names are needed.

Problem

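For larger bundles this could be slow and memory-intensive: validation time and peak memory grow with the full size of the data, even though the checks only need file metadata or a single column.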

Proposed solution

  • Use Parquet metadata (pyarrow.parquet.read_metadata()) for row counts instead of loading full DataFrames
  • For FK checks, read only the required columns via columns=[fk.child_column]
  • For leakage checks, read only schema/column names without loading data

Priority

Low — v1 bundles are small (~5K leads), so this is not a blocker. Worth doing before scaling to larger datasets.

🤖 Generated with Claude Code

Labels: enhancement