feat: extract pipeline functions into leadforge.pipelines.build_v5 by shaypal5 · Pull Request #30 · leadforge-dev/leadforge

shaypal5 · 2026-04-30T18:30:52Z

Summary

Extracts reusable pipeline functions (subsample, inject_missingness, derive_binary_features, cap_expected_acv, rename_and_select, boost_leakage_trap) and constants from scripts/build_v5_snapshot.py into leadforge/pipelines/build_v5.py
Makes scripts/build_v5_snapshot.py a thin CLI wrapper (orchestration functions generate_bundle/build_v5_dataset stay in the script)
Updates tests to import from leadforge.pipelines.build_v5 directly instead of the importlib hack

Test plan

All 705 tests pass
Ruff lint clean
Pre-commit hooks pass
Pipeline functions now get coverage tracking via --cov=leadforge

Closes #29

🤖 Generated with Claude Code

…es.build_v5 Moves reusable data transformation functions (subsample, inject_missingness, derive_binary_features, cap_expected_acv, rename_and_select, boost_leakage_trap) and their constants into a proper package module. The script becomes a thin CLI wrapper, and tests import directly from the package instead of using importlib. Closes #29 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Refactors the v5 lead-scoring dataset build pipeline by moving reusable transformation steps out of the CLI script and into a proper leadforge module, so they can be imported and covered by normal package-level test/coverage tooling.

Changes:

Added leadforge.pipelines.build_v5 containing the v5 pipeline constants + transformation helpers.
Simplified scripts/build_v5_snapshot.py into an orchestration/CLI wrapper that imports the pipeline steps.
Updated tests/scripts/test_build_v5_snapshot.py to import directly from leadforge.pipelines.build_v5 (removing the importlib loading hack).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`leadforge/pipelines/build_v5.py`	New in-package home for v5 pipeline constants and transformation functions.
`leadforge/pipelines/__init__.py`	Introduces the `pipelines` package.
`scripts/build_v5_snapshot.py`	Refactored to import pipeline helpers/constants and keep orchestration/CLI.
`tests/scripts/test_build_v5_snapshot.py`	Switched to direct imports from `leadforge.pipelines.build_v5`.
`.agent-plan.md`	Updated planning notes to reflect the extraction work.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    """Apply structured missingness per the v5 contract.
+
+    Patterns (all <10% per column):
+    - web_sessions: SDR outbound 15%, inbound marketing 2%, partner referral 5%
+    - seniority: partner referral 8%, others 1%
+    - days_since_last_touch: structural NaN (no touches) + 3% MCAR
+    - days_since_first_touch: structural NaN (no touches) + 2% MCAR
+    """


+    df["converted"] = df["converted"].astype(int)
+    missing = [c for c in FINAL_COLUMNS if c not in df.columns]
+    if missing:
+        raise ValueError(
+            f"Missing required columns after renaming: {missing}. Available: {sorted(df.columns)}"
+        )


+    if len(positives) < n_pos:
+        print(f"WARNING: only {len(positives)} positives, need {n_pos}", file=sys.stderr)
+        n_pos = len(positives)
+        n_neg = n - n_pos
+    if len(negatives) < n_neg:
+        print(f"WARNING: only {len(negatives)} negatives, need {n_neg}", file=sys.stderr)
+        n_neg = len(negatives)


+def subsample(
+    df: pd.DataFrame,
+    rng: np.random.RandomState,
+    n: int = SUBSAMPLE_N,
+    target_rate: float = TARGET_RATE,
+) -> pd.DataFrame:


- Add __all__ to leadforge/pipelines/build_v5.py declaring public API surface - Remove dead _FINAL_COLUMNS/_RENAME_MAP aliases from script - Use FINAL_COLUMNS/RENAME_MAP directly in tests (no alias indirection) - Drop unused FINAL_COLUMNS/RENAME_MAP imports from script Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Addresses Copilot review comment: the per-source rates (e.g. 15% for SDR outbound) are conditional; the overall per-column rate stays <10%. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-04-30T18:51:13Z

pr-agent-context report:

This run includes unresolved review comments on PR #30 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: leadforge/pipelines/build_v5.py:150
URL: https://github.com/leadforge-dev/leadforge/pull/30#discussion_r3170095146
Root author: copilot-pull-request-reviewer

Comment:
    The `inject_missingness()` docstring says "Patterns (all <10% per column)" but then lists a 15% missingness rate for `web_sessions` (SDR outbound). If the intent is that the *overall* per-column missingness stays <10% while conditional rates may exceed that, please clarify the wording to avoid a contradiction.

## COPILOT-2
Location: leadforge/pipelines/build_v5.py:111
URL: https://github.com/leadforge-dev/leadforge/pull/30#discussion_r3170095203
Root author: copilot-pull-request-reviewer

Comment:
    `rename_and_select()` casts `df["converted"]` to int *before* validating required columns. If the upstream snapshot ever lacks `converted_within_90_days` (or the rename map changes), this will raise a `KeyError` instead of the intended `ValueError` with the helpful "Missing required columns" message. Consider checking `missing` (including `"converted"`) before any column access/casting, or guarding the cast behind a presence check.

## COPILOT-3
Location: leadforge/pipelines/build_v5.py:133
URL: https://github.com/leadforge-dev/leadforge/pull/30#discussion_r3170095229
Root author: copilot-pull-request-reviewer

Comment:
    This module is now part of the `leadforge` package, but `subsample()` prints warnings directly to stderr. Printing from a library helper makes it hard for downstream callers (and tests) to control output. Consider using an exception, returning diagnostic info, or emitting a `warnings.warn(...)`/logger warning instead, and leave user-facing printing to the CLI wrapper.

## COPILOT-4
Location: leadforge/pipelines/build_v5.py:120
URL: https://github.com/leadforge-dev/leadforge/pull/30#discussion_r3170095252
Root author: copilot-pull-request-reviewer

Comment:
    The pipeline functions take/construct `np.random.RandomState`, but the rest of the `leadforge/` package standardizes determinism via `RNGRoot(seed).child(<name>)` substreams (e.g., `leadforge/core/rng.py` and `leadforge/render/tasks.py`). Since these helpers are now in-package (not just a script), consider switching the RNG interface to follow that convention (e.g., accept a seed and derive per-step sub-seeds via `RNGRoot`, or accept `random.Random`/`np.random.Generator` derived from `RNGRoot`).

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25183433131 attempt 1
Comment timestamp: 2026-04-30T18:50:23.605691+00:00
PR head commit: d43f37d2637baeacfc52b9b5a5a782181670e7a6

Copilot

Pull request overview

Refactors the v5 dataset build pipeline by moving reusable transformation logic out of the scripts/ CLI into a proper leadforge.pipelines package module, enabling direct imports and better test coverage.

Changes:

Extracts v5 pipeline transforms/constants into leadforge/pipelines/build_v5.py.
Makes scripts/build_v5_snapshot.py a thin CLI/orchestration wrapper that imports the pipeline steps.
Updates tests to import pipeline functions directly from the package module (removing the importlib loading hack).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`leadforge/pipelines/build_v5.py`	New module containing v5 pipeline constants and transformation steps.
`leadforge/pipelines/__init__.py`	Adds `pipelines` package initializer.
`scripts/build_v5_snapshot.py`	Switches to importing pipeline functions/constants; keeps orchestration + CLI.
`tests/scripts/test_build_v5_snapshot.py`	Updates tests to import from `leadforge.pipelines.build_v5` directly.
`.agent-plan.md`	Updates planning doc to reflect the extraction work.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    if len(positives) < n_pos:
+        print(f"WARNING: only {len(positives)} positives, need {n_pos}", file=sys.stderr)
+        n_pos = len(positives)
+        n_neg = n - n_pos
+    if len(negatives) < n_neg:
+        print(f"WARNING: only {len(negatives)} negatives, need {n_neg}", file=sys.stderr)
+        n_neg = len(negatives)


+    df["converted"] = df["converted"].astype(int)
+    missing = [c for c in FINAL_COLUMNS if c not in df.columns]
+    if missing:
+        raise ValueError(
+            f"Missing required columns after renaming: {missing}. Available: {sorted(df.columns)}"
+        )


+def subsample(
+    df: pd.DataFrame,
+    rng: np.random.RandomState,
+    n: int = SUBSAMPLE_N,
+    target_rate: float = TARGET_RATE,


Copilot AI review requested due to automatic review settings April 30, 2026 18:30

shaypal5 added type: refactor Code change with no behavior difference layer: core core/ primitives (RNG, IDs, models, exceptions) labels Apr 30, 2026

shaypal5 added this to the v0.5.0 — CLI-complete release candidate milestone Apr 30, 2026

Copilot started reviewing on behalf of shaypal5 April 30, 2026 18:31 View session

This comment has been minimized.

Sign in to view

Copilot AI reviewed Apr 30, 2026

View reviewed changes

shaypal5 added layer: pipelines pipelines/ — dataset build transformations and removed layer: core core/ primitives (RNG, IDs, models, exceptions) labels Apr 30, 2026

This comment has been minimized.

Sign in to view

This was referenced Apr 30, 2026

pipelines: replace stderr prints with warnings.warn in library functions #31

Closed

pipelines: align RNG interface with RNGRoot convention #32

Closed

docs: clarify missingness docstring — conditional rates vs overall rate

d43f37d

Addresses Copilot review comment: the per-source rates (e.g. 15% for SDR outbound) are conditional; the overall per-column rate stays <10%. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 30, 2026 18:50

Copilot started reviewing on behalf of shaypal5 April 30, 2026 18:50 View session

Copilot AI reviewed Apr 30, 2026

View reviewed changes

shaypal5 merged commit dd42b98 into main Apr 30, 2026
10 checks passed

shaypal5 deleted the feat/extract-pipeline-functions-29 branch April 30, 2026 19:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: extract pipeline functions into leadforge.pipelines.build_v5#30

feat: extract pipeline functions into leadforge.pipelines.build_v5#30
shaypal5 merged 3 commits into
mainfrom
feat/extract-pipeline-functions-29

shaypal5 commented Apr 30, 2026

Uh oh!

This comment has been minimized.

Copilot AI left a comment

Uh oh!

This comment has been minimized.

github-actions Bot commented Apr 30, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shaypal5 commented Apr 30, 2026

Summary

Test plan

Uh oh!

This comment has been minimized.

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

This comment has been minimized.

github-actions Bot commented Apr 30, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants