feat: extract pipeline functions into leadforge.pipelines.build_v5#30
Conversation
…es.build_v5 Moves reusable data transformation functions (subsample, inject_missingness, derive_binary_features, cap_expected_acv, rename_and_select, boost_leakage_trap) and their constants into a proper package module. The script becomes a thin CLI wrapper, and tests import directly from the package instead of using importlib. Closes #29 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
Refactors the v5 lead-scoring dataset build pipeline by moving reusable transformation steps out of the CLI script and into a proper leadforge module, so they can be imported and covered by normal package-level test/coverage tooling.
Changes:
- Added
leadforge.pipelines.build_v5containing the v5 pipeline constants + transformation helpers. - Simplified
scripts/build_v5_snapshot.pyinto an orchestration/CLI wrapper that imports the pipeline steps. - Updated
tests/scripts/test_build_v5_snapshot.pyto import directly fromleadforge.pipelines.build_v5(removing theimportlibloading hack).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
leadforge/pipelines/build_v5.py |
New in-package home for v5 pipeline constants and transformation functions. |
leadforge/pipelines/__init__.py |
Introduces the pipelines package. |
scripts/build_v5_snapshot.py |
Refactored to import pipeline helpers/constants and keep orchestration/CLI. |
tests/scripts/test_build_v5_snapshot.py |
Switched to direct imports from leadforge.pipelines.build_v5. |
.agent-plan.md |
Updated planning notes to reflect the extraction work. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| """Apply structured missingness per the v5 contract. | ||
|
|
||
| Patterns (all <10% per column): | ||
| - web_sessions: SDR outbound 15%, inbound marketing 2%, partner referral 5% | ||
| - seniority: partner referral 8%, others 1% | ||
| - days_since_last_touch: structural NaN (no touches) + 3% MCAR | ||
| - days_since_first_touch: structural NaN (no touches) + 2% MCAR | ||
| """ |
| df["converted"] = df["converted"].astype(int) | ||
| missing = [c for c in FINAL_COLUMNS if c not in df.columns] | ||
| if missing: | ||
| raise ValueError( | ||
| f"Missing required columns after renaming: {missing}. Available: {sorted(df.columns)}" | ||
| ) |
| if len(positives) < n_pos: | ||
| print(f"WARNING: only {len(positives)} positives, need {n_pos}", file=sys.stderr) | ||
| n_pos = len(positives) | ||
| n_neg = n - n_pos | ||
| if len(negatives) < n_neg: | ||
| print(f"WARNING: only {len(negatives)} negatives, need {n_neg}", file=sys.stderr) | ||
| n_neg = len(negatives) |
| def subsample( | ||
| df: pd.DataFrame, | ||
| rng: np.random.RandomState, | ||
| n: int = SUBSAMPLE_N, | ||
| target_rate: float = TARGET_RATE, | ||
| ) -> pd.DataFrame: |
- Add __all__ to leadforge/pipelines/build_v5.py declaring public API surface - Remove dead _FINAL_COLUMNS/_RENAME_MAP aliases from script - Use FINAL_COLUMNS/RENAME_MAP directly in tests (no alias indirection) - Drop unused FINAL_COLUMNS/RENAME_MAP imports from script Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Addresses Copilot review comment: the per-source rates (e.g. 15% for SDR outbound) are conditional; the overall per-column rate stays <10%. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
pr-agent-context report: This run includes unresolved review comments on PR #30 in repository https://github.com/leadforge-dev/leadforge
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.
# Copilot Comments
## COPILOT-1
Location: leadforge/pipelines/build_v5.py:150
URL: https://github.com/leadforge-dev/leadforge/pull/30#discussion_r3170095146
Root author: copilot-pull-request-reviewer
Comment:
The `inject_missingness()` docstring says "Patterns (all <10% per column)" but then lists a 15% missingness rate for `web_sessions` (SDR outbound). If the intent is that the *overall* per-column missingness stays <10% while conditional rates may exceed that, please clarify the wording to avoid a contradiction.
## COPILOT-2
Location: leadforge/pipelines/build_v5.py:111
URL: https://github.com/leadforge-dev/leadforge/pull/30#discussion_r3170095203
Root author: copilot-pull-request-reviewer
Comment:
`rename_and_select()` casts `df["converted"]` to int *before* validating required columns. If the upstream snapshot ever lacks `converted_within_90_days` (or the rename map changes), this will raise a `KeyError` instead of the intended `ValueError` with the helpful "Missing required columns" message. Consider checking `missing` (including `"converted"`) before any column access/casting, or guarding the cast behind a presence check.
## COPILOT-3
Location: leadforge/pipelines/build_v5.py:133
URL: https://github.com/leadforge-dev/leadforge/pull/30#discussion_r3170095229
Root author: copilot-pull-request-reviewer
Comment:
This module is now part of the `leadforge` package, but `subsample()` prints warnings directly to stderr. Printing from a library helper makes it hard for downstream callers (and tests) to control output. Consider using an exception, returning diagnostic info, or emitting a `warnings.warn(...)`/logger warning instead, and leave user-facing printing to the CLI wrapper.
## COPILOT-4
Location: leadforge/pipelines/build_v5.py:120
URL: https://github.com/leadforge-dev/leadforge/pull/30#discussion_r3170095252
Root author: copilot-pull-request-reviewer
Comment:
The pipeline functions take/construct `np.random.RandomState`, but the rest of the `leadforge/` package standardizes determinism via `RNGRoot(seed).child(<name>)` substreams (e.g., `leadforge/core/rng.py` and `leadforge/render/tasks.py`). Since these helpers are now in-package (not just a script), consider switching the RNG interface to follow that convention (e.g., accept a seed and derive per-step sub-seeds via `RNGRoot`, or accept `random.Random`/`np.random.Generator` derived from `RNGRoot`).Run metadata: |
There was a problem hiding this comment.
Pull request overview
Refactors the v5 dataset build pipeline by moving reusable transformation logic out of the scripts/ CLI into a proper leadforge.pipelines package module, enabling direct imports and better test coverage.
Changes:
- Extracts v5 pipeline transforms/constants into
leadforge/pipelines/build_v5.py. - Makes
scripts/build_v5_snapshot.pya thin CLI/orchestration wrapper that imports the pipeline steps. - Updates tests to import pipeline functions directly from the package module (removing the
importlibloading hack).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
leadforge/pipelines/build_v5.py |
New module containing v5 pipeline constants and transformation steps. |
leadforge/pipelines/__init__.py |
Adds pipelines package initializer. |
scripts/build_v5_snapshot.py |
Switches to importing pipeline functions/constants; keeps orchestration + CLI. |
tests/scripts/test_build_v5_snapshot.py |
Updates tests to import from leadforge.pipelines.build_v5 directly. |
.agent-plan.md |
Updates planning doc to reflect the extraction work. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if len(positives) < n_pos: | ||
| print(f"WARNING: only {len(positives)} positives, need {n_pos}", file=sys.stderr) | ||
| n_pos = len(positives) | ||
| n_neg = n - n_pos | ||
| if len(negatives) < n_neg: | ||
| print(f"WARNING: only {len(negatives)} negatives, need {n_neg}", file=sys.stderr) | ||
| n_neg = len(negatives) |
| df["converted"] = df["converted"].astype(int) | ||
| missing = [c for c in FINAL_COLUMNS if c not in df.columns] | ||
| if missing: | ||
| raise ValueError( | ||
| f"Missing required columns after renaming: {missing}. Available: {sorted(df.columns)}" | ||
| ) |
| def subsample( | ||
| df: pd.DataFrame, | ||
| rng: np.random.RandomState, | ||
| n: int = SUBSAMPLE_N, | ||
| target_rate: float = TARGET_RATE, |
Summary
subsample,inject_missingness,derive_binary_features,cap_expected_acv,rename_and_select,boost_leakage_trap) and constants fromscripts/build_v5_snapshot.pyintoleadforge/pipelines/build_v5.pyscripts/build_v5_snapshot.pya thin CLI wrapper (orchestration functionsgenerate_bundle/build_v5_datasetstay in the script)leadforge.pipelines.build_v5directly instead of theimportlibhackTest plan
--cov=leadforgeCloses #29
🤖 Generated with Claude Code