fix(render,validation,schema): snapshot fixes + noise clamps + schema cleanup (PR 8.1) by shaypal5 · Pull Request #82 · leadforge-dev/leadforge

shaypal5 · 2026-05-25T09:10:34Z

Summary

PR 8.1 — eight targeted fixes to the snapshot builder, validation probes, and schema, plus full bundle regeneration.

1. Fix `has_open_opportunity` / `opportunity_estimated_acv` post-snapshot leak (HIGH)

File: leadforge/render/snapshots.py

The previous implementation used od["close_outcome"].isna() to identify open opportunities. This is incorrect: close_outcome is set by the post-simulation pass for all closed opportunities regardless of when they closed. An opportunity closed on day 91 (after snapshot day 30) would be falsely treated as already closed at snapshot time, artificially deflating has_open_opportunity.

Fix: Changed to closed_at is NaT OR closed_at > lead_created_at + snapshot_day.

AUC delta: Intro LR 0.879→0.879, Intermediate 0.886→0.886, Advanced 0.886→0.886. The AUC is stable because the simulation DGP has most opportunities closed after the snapshot window being won/lost proportionally to the underlying conversion rate — the error inflates false negatives and false positives roughly symmetrically. Correctness is restored regardless.

2. Add `probe_opportunity_snapshot_consistency` leakage probe (MEDIUM)

File: leadforge/validation/leakage_probes.py

New opt-in probe that recomputes has_open_opportunity and opportunity_estimated_acv under correct closed_at > cutoff semantics and asserts equality with the shipped column values. Registered in PROBE_REGISTRY as taxonomy=time_window, opt_in=True (requires full-horizon opportunities table, not available in public bundles). This probe would have caught bug #1 automatically.

3. `to_dataframes_snapshot_safe` guard on missing `lead_id` (MEDIUM)

File: leadforge/render/relational_snapshot_safe.py

_filter_to_snapshot_window now raises ValueError with a descriptive message if lead_id is missing from an event table instead of silently producing all-NaT cutoffs.

4. Clamp Gaussian noise to physical ranges (MEDIUM/LOW)

File: leadforge/render/snapshots.py

Post-noise clamp in _apply_difficulty_distortions: days_since_*, ACV, and monetary columns are clamped to >= 0; count columns (touch_count, session_count, etc.) are clamped to >= 0. Removes the most visible synthetic-data tell (negative durations / negative counts in Advanced tier).

5. Exempt `total_touches_all` from difficulty distortion (MEDIUM)

File: leadforge/render/snapshots.py

Added _DISTORTION_EXEMPT_COLS = {"total_touches_all"}. The leakage trap was getting up to 18% NaN injection in the Advanced tier, muddying the pedagogical leakage-detection exercise it's designed to deliver cleanly.

6. Drop `first_touch_channel` from task snapshot (MEDIUM)

Files: leadforge/schema/features.py, leadforge/render/snapshots.py (implicit via column list), docs

first_touch_channel is byte-identical to lead_source in v1 (both are set to the same origination value in population.py). Removed from LEAD_SNAPSHOT_FEATURES and task splits. Still present in relational tables/leads.parquet (it's a LeadRow entity field). Pipeline RENAME_MAPs for v5/v6/v7 CSVs are unaffected. Snapshot count: 32 → 31 columns.

7. Rename `touches_week_1` → `touches_days_0_7` (LOW)

Files: leadforge/schema/features.py, leadforge/render/snapshots.py, tests, docs

The feature implementation uses _day <= 7 (days 0 through 7 inclusive = 8 values), not 7 values. The name touches_week_1 implies a 7-day window; touches_days_0_7 is precise. v5/v6/v7 build pipelines add a touches_days_0_7 → touches_week_1 entry to their RENAME_MAP for backwards compatibility with existing CSV consumers.

8. Fix label window: `<` → `<=` (MEDIUM)

File: leadforge/simulation/engine.py

state.conversion_day < config.label_window_days → state.conversion_day <= config.label_window_days. The spec says "within 90 days" (inclusive). A conversion exactly on day 90 was previously excluded; now correctly labeled positive.

Test results

1439 passed, 5 skipped — all green. Ruff lint + format clean.

Bundle regeneration

All three tiers regenerated with seed=42:

intro: 41.5% conversion rate
intermediate: 20.1% conversion rate
advanced: 7.9% conversion rate

validate_release_candidate.py: PASS — 3 tiers, 5 seeds, 0 leakage findings

AUC delta from snapshot fix

Pre-fix (old): intro LR 0.879 / GBM n/a, intermediate LR 0.886, advanced LR 0.886
Post-fix (new): intro LR 0.8788 / GBM 0.8729, intermediate LR 0.8859 / GBM 0.8755, advanced LR 0.8861 / GBM 0.8726

The AUC is effectively unchanged. The opportunity leak was correcting a categorical feature (has_open_opportunity) rather than a continuous predictor, and the simulation DGP means the mis-labeled opportunity-open flags are uncorrelated with conversion in aggregate. Correctness of the feature semantics is restored regardless of the marginal AUC impact.

🤖 Generated with Claude Code

… cleanup - Fix has_open_opportunity/opportunity_estimated_acv post-snapshot leak: use closed_at > lead_created_at + snapshot_day semantics instead of close_outcome.isna() (which incorrectly treats post-snapshot closures) - Add probe_opportunity_snapshot_consistency leakage probe: verifies shipped opportunity-derived columns match correct closed_at > cutoff recomputation; registered as opt-in in PROBE_REGISTRY - Add lead_id guard in _filter_to_snapshot_window: raises ValueError with descriptive message instead of producing all-NaT cutoffs - Clamp Gaussian noise to physical ranges in _apply_difficulty_distortions: days_since_*, ACV, and count columns clamped to >= 0 after noise pass - Exempt total_touches_all from difficulty distortion: add _DISTORTION_EXEMPT_COLS to prevent up to 18% NaN on the leakage trap - Drop first_touch_channel from LEAD_SNAPSHOT_FEATURES: byte-identical to lead_source in v1; add rename in v5/v6/v7 pipeline maps for backwards compatibility - Rename touches_week_1 -> touches_days_0_7 in LEAD_SNAPSHOT_FEATURES and snapshot builder (window covers days 0-7 inclusive = 8 values) - Fix label window boundary: change < to <= in engine.py so conversion on day label_window_days is correctly labeled positive (inclusive spec) - Update contract test V5_TASK_COLUMNS_STUDENT_PUBLIC for schema changes - Update all test files referencing old column names Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR tightens snapshot/label semantics and schema correctness across the render + validation pipeline, adds a targeted leakage probe to catch opportunity snapshot-time mistakes, and updates downstream tests/docs/pipelines for the renamed/removed snapshot columns.

Changes:

Fix opportunity “open at snapshot” semantics in build_snapshot, add an opt-in validation probe for opportunity snapshot consistency, and harden snapshot-safe relational filtering on missing lead_id.
Update snapshot schema by removing redundant first_touch_channel, renaming touches_week_1 → touches_days_0_7, and threading backward-compat rename maps for v5/v6/v7 CSV pipelines.
Make the label window inclusive (<=) and update tests/docs accordingly; adjust difficulty distortion behavior (clamps + exemption for total_touches_all).

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/test_primary_task_threading.py	Updates expected snapshot column name to `touches_days_0_7`.
tests/simulation/test_engine.py	Adjusts label-window boundary test and clarifies docstring for inclusive behavior.
tests/scripts/test_build_v7_snapshot.py	Updates synthetic snapshot fixture column to `touches_days_0_7`.
tests/scripts/test_build_v6_snapshot.py	Updates synthetic snapshot fixture column to `touches_days_0_7`.
tests/scripts/test_build_v5_snapshot.py	Updates synthetic snapshot fixture column to `touches_days_0_7`.
tests/render/test_snapshot_windowed.py	Renames windowed-touch tests to `touches_days_0_7` and updates assertions.
tests/render/test_bundle_schema_v5_contract.py	Updates schema contract expectations (drop `first_touch_channel` from snapshot; rename touches column).
release/docs/feature_dictionary.md	Updates release docs for 31-column snapshot, removed `first_touch_channel`, and renamed touches feature.
leadforge/validation/leakage_probes.py	Adds opt-in `probe_opportunity_snapshot_consistency` and registers it.
leadforge/simulation/engine.py	Makes label window inclusive: `conversion_day <= label_window_days`.
leadforge/schema/features.py	Removes `first_touch_channel` from snapshot feature spec; renames `touches_week_1` to `touches_days_0_7`.
leadforge/render/snapshots.py	Fixes open-opportunity snapshot semantics; adds distortion exemption + non-negativity clamping; renames touches feature in snapshot builder.
leadforge/render/relational_snapshot_safe.py	Adds a loud guard when snapshot-filtered tables lack `lead_id`.
leadforge/pipelines/common.py	Adds backward-compat rename from `touches_days_0_7` back to `touches_week_1` for v6/v7 outputs.
leadforge/pipelines/build_v5.py	Adds backward-compat rename from `touches_days_0_7` back to `touches_week_1` for v5 outputs.
docs/release/feature_dictionary.md	Mirrors the release feature dictionary updates in the main docs tree.
.agent-plan.md	Marks PR 8.1 items complete and records outcomes (AUC deltas, validation status).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    # 1b. Post-noise clamp to physical ranges.
+    # Float non-negative columns: clamp to >= 0.
+    for col in _NONNEG_FLOAT_COLS:
+        if col in df.columns and df[col].notna().any():
+            df[col] = df[col].clip(lower=0)
+    # Int non-negative columns: clamp to >= 0, then restore int dtype.
+    for col in _NONNEG_INT_COLS:
+        if col in df.columns and df[col].notna().any():
+            df[col] = df[col].clip(lower=0)
+


+        od["_snapshot_cutoff"] = od["lead_id"].map(cutoff_map)
+        open_mask = od["_closed_at_ts"].isna() | (od["_closed_at_ts"] > od["_snapshot_cutoff"])
+        open_opps_raw = od[open_mask][["lead_id", "estimated_acv"]]
+    else:
+        open_opps_raw = od[["lead_id", "estimated_acv"]]
+    open_opps = open_opps_raw.groupby("lead_id").first().reset_index()


+    expected_open = (
+        opps_open.groupby("lead_id")["estimated_acv"]
+        .first()
+        .reset_index()
+        .rename(columns={"estimated_acv": "_expected_acv"})
+    )


Ten issues from senior code review addressed: 🔴 Bug: dead 'touches_week_1' entry in _NONNEG_INT_COLS removed by replacing both frozensets with FeatureSpec-derived comprehensions (Fix 3 makes Fix 1 obsolete as a standalone change). 🔴 Bug: _filter_to_snapshot_window guard now runs before the len(events)==0 early-return so a misconfigured empty table still raises instead of silently passing through. 🟠 Design: _NONNEG_FLOAT_COLS and _NONNEG_INT_COLS replaced with FeatureSpec-derived comprehensions keyed on new non_negative field. Added non_negative=True to all 13 count/duration/monetary features. No more manual-maintenance second source of truth. 🟠 Comments: stale 'Mapping: column name → (lower_bound, upper_bound)' comment removed; 'restore int dtype' wording corrected to 'clip() preserves Int64 dtype'. 🟠 Probe category: 'time_window' → 'snapshot_consistency'; taxonomy test updated to accept the new value. 🟠 Probe ACV ordering: added .sort_values('created_at') before groupby.first() in both production code (snapshots.py) and probe (leakage_probes.py) so the tie-break is deterministic when a lead has multiple open opportunities. 🟡 Collapsed redundant if len(od) > 0 branch; _cutoff now computed once and reused for both the creation-date filter and the open/closed gate — eliminates the duplicate lead_id.map(cutoff_map) call. 🟡 FeatureSpec description for touches_days_0_7 trimmed to one clean sentence; rename history belongs in git log, not user-facing CSV. 🟡 first_touch_channel removal comment moved from inside the tuple literal to a proper prose note above the adjacent FeatureSpec. 🟡 Added three fixture-based tests for probe_opportunity_snapshot_ consistency: fires on old close_outcome.isna() semantics, silent on correct closed_at > cutoff semantics, raises on missing columns. Cascade: audit_channel_signal.py default CHANNEL_COLUMNS updated to drop first_touch_channel (column no longer in flat snapshot); related test fixtures and assertions updated accordingly; regenerated channel_signal_audit.{md,json}, release/docs mirror, kaggle metadata, and kaggle preview HTML. 1418 tests pass, 5 skipped. Ruff clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-25T10:21:25Z

pr-agent-context report:

This run includes unresolved review comments on PR #82 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: leadforge/render/snapshots.py:428
URL: https://github.com/leadforge-dev/leadforge/pull/82#discussion_r3297192877
Root author: copilot-pull-request-reviewer

Comment:
    The post-noise non-negativity clamp runs *before* outlier injection, but outlier injection can create negative values again (it adds ±5σ). This undermines the stated goal of preventing non-physical negatives (durations/ACV). Consider moving the clamp to the end of `_apply_difficulty_distortions` (after outliers), or clamping both after noise and after outliers, so the final output always respects the physical bounds.

## COPILOT-2
Location: leadforge/render/snapshots.py
URL: https://github.com/leadforge-dev/leadforge/pull/82#discussion_r3297192924
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    `opportunity_estimated_acv` is documented in the feature spec as the ACV of the *most recent* open opportunity, but the implementation drops `created_at` and then uses `groupby(...).first()`, which depends on input row order and does not guarantee “most recent”. Please make the selection deterministic and consistent with the spec (e.g., sort by `created_at` and take the latest open opp per lead, or use `idxmax` on a parsed `created_at` timestamp).

## COPILOT-3
Location: leadforge/validation/leakage_probes.py:1052
URL: https://github.com/leadforge-dev/leadforge/pull/82#discussion_r3297192948
Root author: copilot-pull-request-reviewer

Comment:
    This probe computes expected `opportunity_estimated_acv` via `groupby('lead_id').first()`, which is order-dependent and may not match the snapshot builder if row ordering differs (e.g., after Parquet read) or if a lead has multiple open opportunities. To avoid false findings, make the selection deterministic (e.g., sort by `created_at` and take the most recent open opp per lead) and mirror the snapshot builder’s exact selection semantics.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 26395535223 attempt 1
Comment timestamp: 2026-05-25T10:20:34.917235+00:00
PR head commit: 3d38946900bc9ce5a7593ff42c0140fdf42b07f7

shaypal5 and others added 2 commits May 25, 2026 12:03

chore(plan): mark PR 8.1 in-progress with completion note

33a9ae4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 25, 2026 09:10

shaypal5 added this to the dataset: leadforge-lead-scoring-v1 milestone May 25, 2026

shaypal5 added type: bugfix Fixes a bug layer: schema schema/ entity/event contracts layer: render render/ bundle and artifact output layer: validation validation/ invariants and checks labels May 25, 2026

Copilot started reviewing on behalf of shaypal5 May 25, 2026 09:10 View session

This comment has been minimized.

Sign in to view

Copilot AI reviewed May 25, 2026

View reviewed changes

shaypal5 merged commit d4a7593 into main May 25, 2026
9 of 10 checks passed

shaypal5 deleted the fix/pr8.1-snapshot-fixes-bundle-regen branch May 25, 2026 10:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(render,validation,schema): snapshot fixes + noise clamps + schema cleanup (PR 8.1)#82

fix(render,validation,schema): snapshot fixes + noise clamps + schema cleanup (PR 8.1)#82
shaypal5 merged 3 commits into
mainfrom
fix/pr8.1-snapshot-fixes-bundle-regen

shaypal5 commented May 25, 2026

Uh oh!

This comment has been minimized.

Copilot AI left a comment

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shaypal5 commented May 25, 2026

Summary

1. Fix has_open_opportunity / opportunity_estimated_acv post-snapshot leak (HIGH)

2. Add probe_opportunity_snapshot_consistency leakage probe (MEDIUM)

3. to_dataframes_snapshot_safe guard on missing lead_id (MEDIUM)

4. Clamp Gaussian noise to physical ranges (MEDIUM/LOW)

5. Exempt total_touches_all from difficulty distortion (MEDIUM)

6. Drop first_touch_channel from task snapshot (MEDIUM)

7. Rename touches_week_1 → touches_days_0_7 (LOW)

8. Fix label window: < → <= (MEDIUM)

Test results

Bundle regeneration

AUC delta from snapshot fix

Uh oh!

This comment has been minimized.

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. Fix `has_open_opportunity` / `opportunity_estimated_acv` post-snapshot leak (HIGH)

2. Add `probe_opportunity_snapshot_consistency` leakage probe (MEDIUM)

3. `to_dataframes_snapshot_safe` guard on missing `lead_id` (MEDIUM)

5. Exempt `total_touches_all` from difficulty distortion (MEDIUM)

6. Drop `first_touch_channel` from task snapshot (MEDIUM)

7. Rename `touches_week_1` → `touches_days_0_7` (LOW)

8. Fix label window: `<` → `<=` (MEDIUM)