Skip to content

feat: extract pipeline functions into leadforge.pipelines.build_v5#30

Merged
shaypal5 merged 3 commits into
mainfrom
feat/extract-pipeline-functions-29
Apr 30, 2026
Merged

feat: extract pipeline functions into leadforge.pipelines.build_v5#30
shaypal5 merged 3 commits into
mainfrom
feat/extract-pipeline-functions-29

Conversation

@shaypal5

Copy link
Copy Markdown
Contributor

Summary

  • Extracts reusable pipeline functions (subsample, inject_missingness, derive_binary_features, cap_expected_acv, rename_and_select, boost_leakage_trap) and constants from scripts/build_v5_snapshot.py into leadforge/pipelines/build_v5.py
  • Makes scripts/build_v5_snapshot.py a thin CLI wrapper (orchestration functions generate_bundle/build_v5_dataset stay in the script)
  • Updates tests to import from leadforge.pipelines.build_v5 directly instead of the importlib hack

Test plan

  • All 705 tests pass
  • Ruff lint clean
  • Pre-commit hooks pass
  • Pipeline functions now get coverage tracking via --cov=leadforge

Closes #29

🤖 Generated with Claude Code

…es.build_v5

Moves reusable data transformation functions (subsample, inject_missingness,
derive_binary_features, cap_expected_acv, rename_and_select, boost_leakage_trap)
and their constants into a proper package module. The script becomes a thin CLI
wrapper, and tests import directly from the package instead of using importlib.

Closes #29

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 30, 2026 18:30
@shaypal5 shaypal5 added type: refactor Code change with no behavior difference layer: core core/ primitives (RNG, IDs, models, exceptions) labels Apr 30, 2026
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the v5 lead-scoring dataset build pipeline by moving reusable transformation steps out of the CLI script and into a proper leadforge module, so they can be imported and covered by normal package-level test/coverage tooling.

Changes:

  • Added leadforge.pipelines.build_v5 containing the v5 pipeline constants + transformation helpers.
  • Simplified scripts/build_v5_snapshot.py into an orchestration/CLI wrapper that imports the pipeline steps.
  • Updated tests/scripts/test_build_v5_snapshot.py to import directly from leadforge.pipelines.build_v5 (removing the importlib loading hack).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
leadforge/pipelines/build_v5.py New in-package home for v5 pipeline constants and transformation functions.
leadforge/pipelines/__init__.py Introduces the pipelines package.
scripts/build_v5_snapshot.py Refactored to import pipeline helpers/constants and keep orchestration/CLI.
tests/scripts/test_build_v5_snapshot.py Switched to direct imports from leadforge.pipelines.build_v5.
.agent-plan.md Updated planning notes to reflect the extraction work.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +125 to +132
"""Apply structured missingness per the v5 contract.

Patterns (all <10% per column):
- web_sessions: SDR outbound 15%, inbound marketing 2%, partner referral 5%
- seniority: partner referral 8%, others 1%
- days_since_last_touch: structural NaN (no touches) + 3% MCAR
- days_since_first_touch: structural NaN (no touches) + 2% MCAR
"""
Comment on lines +88 to +93
df["converted"] = df["converted"].astype(int)
missing = [c for c in FINAL_COLUMNS if c not in df.columns]
if missing:
raise ValueError(
f"Missing required columns after renaming: {missing}. Available: {sorted(df.columns)}"
)
Comment on lines +109 to +115
if len(positives) < n_pos:
print(f"WARNING: only {len(positives)} positives, need {n_pos}", file=sys.stderr)
n_pos = len(positives)
n_neg = n - n_pos
if len(negatives) < n_neg:
print(f"WARNING: only {len(negatives)} negatives, need {n_neg}", file=sys.stderr)
n_neg = len(negatives)
Comment on lines +97 to +102
def subsample(
df: pd.DataFrame,
rng: np.random.RandomState,
n: int = SUBSAMPLE_N,
target_rate: float = TARGET_RATE,
) -> pd.DataFrame:
@shaypal5 shaypal5 added layer: pipelines pipelines/ — dataset build transformations and removed layer: core core/ primitives (RNG, IDs, models, exceptions) labels Apr 30, 2026
- Add __all__ to leadforge/pipelines/build_v5.py declaring public API surface
- Remove dead _FINAL_COLUMNS/_RENAME_MAP aliases from script
- Use FINAL_COLUMNS/RENAME_MAP directly in tests (no alias indirection)
- Drop unused FINAL_COLUMNS/RENAME_MAP imports from script

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

This comment has been minimized.

Addresses Copilot review comment: the per-source rates (e.g. 15% for
SDR outbound) are conditional; the overall per-column rate stays <10%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 30, 2026 18:50
@github-actions

Copy link
Copy Markdown

pr-agent-context report:

This run includes unresolved review comments on PR #30 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: leadforge/pipelines/build_v5.py:150
URL: https://github.com/leadforge-dev/leadforge/pull/30#discussion_r3170095146
Root author: copilot-pull-request-reviewer

Comment:
    The `inject_missingness()` docstring says "Patterns (all <10% per column)" but then lists a 15% missingness rate for `web_sessions` (SDR outbound). If the intent is that the *overall* per-column missingness stays <10% while conditional rates may exceed that, please clarify the wording to avoid a contradiction.

## COPILOT-2
Location: leadforge/pipelines/build_v5.py:111
URL: https://github.com/leadforge-dev/leadforge/pull/30#discussion_r3170095203
Root author: copilot-pull-request-reviewer

Comment:
    `rename_and_select()` casts `df["converted"]` to int *before* validating required columns. If the upstream snapshot ever lacks `converted_within_90_days` (or the rename map changes), this will raise a `KeyError` instead of the intended `ValueError` with the helpful "Missing required columns" message. Consider checking `missing` (including `"converted"`) before any column access/casting, or guarding the cast behind a presence check.

## COPILOT-3
Location: leadforge/pipelines/build_v5.py:133
URL: https://github.com/leadforge-dev/leadforge/pull/30#discussion_r3170095229
Root author: copilot-pull-request-reviewer

Comment:
    This module is now part of the `leadforge` package, but `subsample()` prints warnings directly to stderr. Printing from a library helper makes it hard for downstream callers (and tests) to control output. Consider using an exception, returning diagnostic info, or emitting a `warnings.warn(...)`/logger warning instead, and leave user-facing printing to the CLI wrapper.

## COPILOT-4
Location: leadforge/pipelines/build_v5.py:120
URL: https://github.com/leadforge-dev/leadforge/pull/30#discussion_r3170095252
Root author: copilot-pull-request-reviewer

Comment:
    The pipeline functions take/construct `np.random.RandomState`, but the rest of the `leadforge/` package standardizes determinism via `RNGRoot(seed).child(<name>)` substreams (e.g., `leadforge/core/rng.py` and `leadforge/render/tasks.py`). Since these helpers are now in-package (not just a script), consider switching the RNG interface to follow that convention (e.g., accept a seed and derive per-step sub-seeds via `RNGRoot`, or accept `random.Random`/`np.random.Generator` derived from `RNGRoot`).

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25183433131 attempt 1
Comment timestamp: 2026-04-30T18:50:23.605691+00:00
PR head commit: d43f37d2637baeacfc52b9b5a5a782181670e7a6

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the v5 dataset build pipeline by moving reusable transformation logic out of the scripts/ CLI into a proper leadforge.pipelines package module, enabling direct imports and better test coverage.

Changes:

  • Extracts v5 pipeline transforms/constants into leadforge/pipelines/build_v5.py.
  • Makes scripts/build_v5_snapshot.py a thin CLI/orchestration wrapper that imports the pipeline steps.
  • Updates tests to import pipeline functions directly from the package module (removing the importlib loading hack).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
leadforge/pipelines/build_v5.py New module containing v5 pipeline constants and transformation steps.
leadforge/pipelines/__init__.py Adds pipelines package initializer.
scripts/build_v5_snapshot.py Switches to importing pipeline functions/constants; keeps orchestration + CLI.
tests/scripts/test_build_v5_snapshot.py Updates tests to import from leadforge.pipelines.build_v5 directly.
.agent-plan.md Updates planning doc to reflect the extraction work.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +127 to +133
if len(positives) < n_pos:
print(f"WARNING: only {len(positives)} positives, need {n_pos}", file=sys.stderr)
n_pos = len(positives)
n_neg = n - n_pos
if len(negatives) < n_neg:
print(f"WARNING: only {len(negatives)} negatives, need {n_neg}", file=sys.stderr)
n_neg = len(negatives)
Comment on lines +106 to +111
df["converted"] = df["converted"].astype(int)
missing = [c for c in FINAL_COLUMNS if c not in df.columns]
if missing:
raise ValueError(
f"Missing required columns after renaming: {missing}. Available: {sorted(df.columns)}"
)
Comment on lines +115 to +119
def subsample(
df: pd.DataFrame,
rng: np.random.RandomState,
n: int = SUBSAMPLE_N,
target_rate: float = TARGET_RATE,
@shaypal5 shaypal5 merged commit dd42b98 into main Apr 30, 2026
10 checks passed
@shaypal5 shaypal5 deleted the feat/extract-pipeline-functions-29 branch April 30, 2026 19:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

layer: pipelines pipelines/ — dataset build transformations type: refactor Code change with no behavior difference

Projects

None yet

Development

Successfully merging this pull request may close these issues.

refactor: extract pipeline functions from scripts/ into leadforge package

2 participants