leadforge-dev · shaypal5 · May 25, 2026
diff --git a/.agent-plan.md b/.agent-plan.md
@@ -70,18 +70,16 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
 
 _Source: `docs/external_review/summaries/v1_release_review_synthesis.md` — cross-model review of the v1 preview site, 2026-05-25. Depends on Phase 7 (all except PR 7.3). PR 7.3 depends on this phase._
 
-- [ ] **PR 8.1** — `fix(render,validation,schema): snapshot fixes + noise clamps + schema cleanup + bundle regen`
-  - **`has_open_opportunity` / `opportunity_estimated_acv` post-snapshot leak** (HIGH): `leadforge/render/snapshots.py` — change `od["close_outcome"].isna()` gate to `closed_at is null OR closed_at > lead_created_at + snapshot_day`. This is the most critical code fix; measure AUC delta before/after and document.
-  - **Add flat-feature snapshot-consistency probe** (MEDIUM): `leadforge/validation/leakage_probes.py` — probe that recomputes opportunity-derived features under the correct `closed_at > cutoff` semantics and asserts equality with the shipped columns. The missing probe that would have caught the bug above automatically.
-  - **`to_dataframes_snapshot_safe` guard** (MEDIUM): assert `"lead_id" in events.columns` for each `SNAPSHOT_FILTERED_TABLES` entry; fail loud on missing key instead of silently producing all-NaT cutoffs.
-  - **Clamp Gaussian noise to physical ranges** (MEDIUM/LOW): post-noise clamp per column type in `_apply_difficulty_distortions` (`days_since_x >= 0`, monetary values `>= 0`). Tiny change, removes the most visible synthetic-data tell, makes the "non-physical values" known limitation obsolete.
-  - **Exempt `total_touches_all` from difficulty distortion** (MEDIUM): remove from `_NUMERIC_DISTORTION_COLS` or add a named exclusion list. Up to 18% NaN on the trap in Advanced muddies the leakage lesson it's supposed to deliver cleanly.
-  - **Drop `first_touch_channel`** (MEDIUM): remove from `LEAD_SNAPSHOT_FEATURES`, flat export, and feature dictionary — byte-identical to `lead_source` in v1; documented redundancy removing itself is cleaner than documenting it.
-  - **Rename `touches_week_1` → `touches_days_0_7`** (LOW): update `LEAD_SNAPSHOT_FEATURES`, snapshot builder, and feature dictionary. The implementation spans days 0–7 inclusive (8 values); the name implies 7.
-  - **Label window: `<` → `<=`** (MEDIUM): `engine.py` — `state.conversion_day < config.label_window_days` should be `<=`. Spec says "within 90 days" (inclusive); the break-me guide invites students to audit this boundary.
-  - **Regenerate all three public tier bundles**, update `release/` artifacts, rerun `validate_release_candidate`, sync claims register.
-  - Labels: `type: bugfix`, `layer: render`, `layer: validation`, `layer: schema`
-  - Size: M-L (~400 lines code + regenerated bundles)
+- [x] **PR 8.1** — `fix(render,validation,schema): snapshot fixes + noise clamps + schema cleanup + bundle regen`
+  - **`has_open_opportunity` / `opportunity_estimated_acv` post-snapshot leak** (HIGH): Fixed in `leadforge/render/snapshots.py`; the gate now uses `closed_at is null OR closed_at > lead_created_at + snapshot_day`. LR AUC before→after the fix: intro 0.879→0.879, intermediate 0.886→0.886, advanced 0.886→0.886 (AUC is unchanged: for this DGP the bug affected feature correctness, not ranking signal).
+  - **Add flat-feature snapshot-consistency probe** (MEDIUM): `probe_opportunity_snapshot_consistency` added to `leadforge/validation/leakage_probes.py`; registered as opt-in in `PROBE_REGISTRY`.
+  - **`to_dataframes_snapshot_safe` guard** (MEDIUM): `_filter_to_snapshot_window` raises `ValueError` with descriptive message on missing `lead_id` column.
+  - **Clamp Gaussian noise to physical ranges** (MEDIUM/LOW): Post-noise clamp applied in `_apply_difficulty_distortions` for `days_since_*`, ACV, and count columns.
+  - **Exempt `total_touches_all` from difficulty distortion** (MEDIUM): Added `_DISTORTION_EXEMPT_COLS = {"total_touches_all"}` and excluded from noise/missingness/outlier passes.
+  - **Drop `first_touch_channel`** (MEDIUM): Removed from `LEAD_SNAPSHOT_FEATURES` and task exports; still present in relational `leads.parquet`. v5/v6/v7 pipelines updated with compat rename.
+  - **Rename `touches_week_1` → `touches_days_0_7`** (LOW): Updated `LEAD_SNAPSHOT_FEATURES`, snapshot builder, contract tests. v5/v6/v7 pipelines keep `touches_week_1` via RENAME_MAP for backwards compat.
+  - **Label window: `<` → `<=`** (MEDIUM): Fixed in `engine.py`; test updated to use inclusive assertion.
+  - **Regenerated all three public tier bundles**: intro 41.5% conv rate, intermediate 20.1%, advanced 7.9%. Validation: PASS — 3 tiers, 5 seeds, 0 leakage findings. 1439 tests pass.
 
 - [ ] **PR 8.2** — `docs(release): difficulty-axis reframe + disclosure hardening`
   - **Reframe difficulty axis throughout all copy** (HIGH): README, dataset card, Kaggle/HF metadata, tier table, notebook headers. Change from "Intro / Intermediate / Advanced" framing of modelling difficulty to explicit prevalence/noise tier framing. Recommended: "Intro = high-prevalence classroom warm-up; Intermediate = default benchmark; Advanced = low-prevalence, calibration, and noise-handling exercise." AUC is flat across tiers; the "three difficulty tiers" framing is misleading to anyone who reads it as model complexity.
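
The snapshot gate and the inclusive label window listed under PR 8.1 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `leadforge/render/snapshots.py` or `engine.py`: the function names `is_open_at_snapshot` and `converted_within_window` are hypothetical, the real builder presumably works on DataFrames rather than scalars, and treating a never-converted lead as `None` is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_open_at_snapshot(
    closed_at: Optional[datetime],
    lead_created_at: datetime,
    snapshot_day: int,
) -> bool:
    """True if the opportunity is still open as of the snapshot cutoff.

    Follows the rule quoted in the plan:
    closed_at is null OR closed_at > lead_created_at + snapshot_day.
    """
    cutoff = lead_created_at + timedelta(days=snapshot_day)
    return closed_at is None or closed_at > cutoff


def converted_within_window(
    conversion_day: Optional[int], label_window_days: int
) -> bool:
    """Inclusive label window: a conversion on the last day still counts.

    ASSUMPTION: a lead that never converts is represented as None.
    """
    return conversion_day is not None and conversion_day <= label_window_days


created = datetime(2026, 1, 1)
# Closed on day 45, after a day-30 snapshot: still open at the snapshot.
late_close = is_open_at_snapshot(datetime(2026, 2, 15), created, 30)
# Closed on day 10, before the snapshot: already closed at the snapshot.
early_close = is_open_at_snapshot(datetime(2026, 1, 11), created, 30)
# Never closed: open by definition.
never_closed = is_open_at_snapshot(None, created, 30)
```

Gating on `close_outcome` being null, as the pre-fix code did, marks the day-45 case as closed even though the closure happens after the snapshot, which is the leak the fix removes.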

diff --git a/docs/release/channel_signal_audit.json b/docs/release/channel_signal_audit.json
@@ -1,7 +1,6 @@
 {
   "channel_columns": [
-    "lead_source",
-    "first_touch_channel"
+    "lead_source"
   ],
   "industry_mql_to_sql_benchmarks": {
     "Email": 0.005,
@@ -46,39 +45,6 @@
           "train_conversion_rate": 0.4145714285714286,
           "univariate_auc_in_sample": 0.5199794894149169,
           "univariate_auc_out_of_sample": 0.5013517441860464
-        },
-        {
-          "channels": [
-            {
-              "conversion_rate": 0.43439490445859874,
-              "n": 1570,
-              "n_converted": 682,
-              "name": "inbound_marketing",
-              "share": 0.44857142857142857
-            },
-            {
-              "conversion_rate": 0.39111747851002865,
-              "n": 698,
-              "n_converted": 273,
-              "name": "partner_referral",
-              "share": 0.19942857142857143
-            },
-            {
-              "conversion_rate": 0.4025974025974026,
-              "n": 1232,
-              "n_converted": 496,
-              "name": "sdr_outbound",
-              "share": 0.352
-            }
-          ],
-          "column": "first_touch_channel",
-          "n_test": 750,
-          "n_train": 3500,
-          "rate_spread": 0.04327742594857009,
-          "test_conversion_rate": 0.4266666666666667,
-          "train_conversion_rate": 0.4145714285714286,
-          "univariate_auc_in_sample": 0.5199794894149169,
-          "univariate_auc_out_of_sample": 0.5013517441860464
         }
       ],
       "n_test": 750,
@@ -121,39 +87,6 @@
           "train_conversion_rate": 0.20142857142857143,
           "univariate_auc_in_sample": 0.5212431012826857,
           "univariate_auc_out_of_sample": 0.5139326835180411
-        },
-        {
-          "channels": [
-            {
-              "conversion_rate": 0.21273885350318472,
-              "n": 1570,
-              "n_converted": 334,
-              "name": "inbound_marketing",
-              "share": 0.44857142857142857
-            },
-            {
-              "conversion_rate": 0.17621776504297995,
-              "n": 698,
-              "n_converted": 123,
-              "name": "partner_referral",
-              "share": 0.19942857142857143
-            },
-            {
-              "conversion_rate": 0.2012987012987013,
-              "n": 1232,
-              "n_converted": 248,
-              "name": "sdr_outbound",
-              "share": 0.352
-            }
-          ],
-          "column": "first_touch_channel",
-          "n_test": 750,
-          "n_train": 3500,
-          "rate_spread": 0.03652108846020477,
-          "test_conversion_rate": 0.22266666666666668,
-          "train_conversion_rate": 0.20142857142857143,
-          "univariate_auc_in_sample": 0.5212431012826857,
-          "univariate_auc_out_of_sample": 0.5139326835180411
         }
       ],
       "n_test": 750,
@@ -196,39 +129,6 @@
           "train_conversion_rate": 0.07914285714285714,
           "univariate_auc_in_sample": 0.5083011208921436,
           "univariate_auc_out_of_sample": 0.5225784296892246
-        },
-        {
-          "channels": [
-            {
-              "conversion_rate": 0.08152866242038216,
-              "n": 1570,
-              "n_converted": 128,
-              "name": "inbound_marketing",
-              "share": 0.44857142857142857
-            },
-            {
-              "conversion_rate": 0.07593123209169055,
-              "n": 698,
-              "n_converted": 53,
-              "name": "partner_referral",
-              "share": 0.19942857142857143
-            },
-            {
-              "conversion_rate": 0.07792207792207792,
-              "n": 1232,
-              "n_converted": 96,
-              "name": "sdr_outbound",
-              "share": 0.352
-            }
-          ],
-          "column": "first_touch_channel",
-          "n_test": 750,
-          "n_train": 3500,
-          "rate_spread": 0.005597430328691616,
-          "test_conversion_rate": 0.07866666666666666,
-          "train_conversion_rate": 0.07914285714285714,
-          "univariate_auc_in_sample": 0.5083011208921436,
-          "univariate_auc_out_of_sample": 0.5225784296892246
         }
       ],
       "n_test": 750,

diff --git a/docs/release/channel_signal_audit.md b/docs/release/channel_signal_audit.md
@@ -18,7 +18,7 @@ Audit produced by `scripts/audit_channel_signal.py`; see `channel_signal_audit.j
 
 `n_train = 3500` (90-day conversion rate 41.46%); `n_test = 750` (rate 42.67%).
 
-### Columns: `lead_source`, `first_touch_channel` (audit values identical)
+### Column: `lead_source`
 
 Per-channel rate spread (max − min): **0.0433**  ·  In-sample univariate AUC: **0.5200**  ·  Out-of-sample univariate AUC: **0.5014**
 
@@ -32,7 +32,7 @@ Per-channel rate spread (max − min): **0.0433**  ·  In-sample univariate AUC:
 
 `n_train = 3500` (90-day conversion rate 20.14%); `n_test = 750` (rate 22.27%).
 
-### Columns: `lead_source`, `first_touch_channel` (audit values identical)
+### Column: `lead_source`
 
 Per-channel rate spread (max − min): **0.0365**  ·  In-sample univariate AUC: **0.5212**  ·  Out-of-sample univariate AUC: **0.5139**
 
@@ -46,7 +46,7 @@ Per-channel rate spread (max − min): **0.0365**  ·  In-sample univariate AUC:
 
 `n_train = 3500` (90-day conversion rate 7.91%); `n_test = 750` (rate 7.87%).
 
-### Columns: `lead_source`, `first_touch_channel` (audit values identical)
+### Column: `lead_source`
 
 Per-channel rate spread (max − min): **0.0056**  ·  In-sample univariate AUC: **0.5083**  ·  Out-of-sample univariate AUC: **0.5226**
 
@@ -62,5 +62,5 @@ The numbers above answer one question: *how strongly does channel alone signal 9
 
 Two empirical observations a reader can make from the numbers above:
 
-1. **The out-of-sample univariate AUC is the comparable number** for any external baseline. It uses train-derived rates scored against held-out test labels — the same shape as the `source_only` HistGBM baseline reported in `release/validation/validation_report.json`, which is built on the same task splits with `lead_source` + `first_touch_channel` as the only features. The in-sample number is biased upward by construction — small at v1's N but visible — and is reported here for transparency rather than comparison.
+1. **The out-of-sample univariate AUC is the comparable number** for any external baseline. It uses train-derived rates scored against held-out test labels — the same shape as the `source_only` HistGBM baseline reported in `release/validation/validation_report.json`, which is built on the same task splits with `lead_source` as the only feature. The in-sample number is biased upward by construction — small at v1's N but visible — and is reported here for transparency rather than comparison.
 2. **The numerical conclusion is bundle-specific.** When the per-channel rate spread is small and the OOS univariate AUC is close to chance, channel alone is a weak feature for the bundle this audit was run against. v1's bundles currently produce that outcome (see the per-tier sections above) — consistent with the design: the simulator drives conversion through motif-family hazards keyed off latent traits, not channel-conditional probabilities. Channel-conditional encoding is tracked as post-v1 work in `docs/release/post_v1_roadmap.md`.
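
The two audit quantities discussed above, per-channel rate spread and out-of-sample univariate AUC, can be reproduced with a short standard-library sketch. It is an illustration only and not the contents of `scripts/audit_channel_signal.py`: the helper names are hypothetical, the fallback to the overall train rate for unseen channels is an assumption, and the toy train/test rows are made up. The intro-tier counts are the ones shown in the JSON diff.

```python
from collections import defaultdict


def channel_rates(channels, labels):
    """Per-channel conversion rate computed from training rows only."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for ch, y in zip(channels, labels):
        totals[ch] += 1
        hits[ch] += y
    return {ch: hits[ch] / totals[ch] for ch in totals}


def rate_spread(rates):
    """Max minus min conversion rate across channels."""
    return max(rates.values()) - min(rates.values())


def auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Intro-tier counts from the audit JSON (n_converted / n per channel).
intro_rates = {
    "inbound_marketing": 682 / 1570,
    "partner_referral": 273 / 698,
    "sdr_outbound": 496 / 1232,
}
intro_spread = rate_spread(intro_rates)

# Toy data (made up) showing the train-rates-scored-on-test shape.
train_channels = ["a"] * 4 + ["b"] * 4
train_labels = [1, 1, 1, 0, 1, 0, 0, 0]
test_channels = ["a", "a", "b", "b"]
test_labels = [1, 1, 0, 1]

rates = channel_rates(train_channels, train_labels)
# ASSUMPTION: channels unseen in train fall back to the overall train rate.
fallback = sum(train_labels) / len(train_labels)
test_scores = [rates.get(ch, fallback) for ch in test_channels]
oos_auc = auc(test_scores, test_labels)
```

Because the scores come from train-derived rates and the labels from the held-out split, `oos_auc` has the same shape as the out-of-sample number reported per tier; scoring the training rows with their own rates would give the in-sample variant.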
diff --git a/docs/release/feature_dictionary.md b/docs/release/feature_dictionary.md
@@ -8,14 +8,14 @@ analytical role and adds the prose explanation, modelling
 recommendations, and pedagogical caveats that don't fit a CSV row.
 
 The grouping below covers every feature in the public student-facing
-snapshot — the same 32 columns ship in `intro`, `intermediate`, and
+snapshot — the same 31 columns ship in `intro`, `intermediate`, and
 `advanced` bundles. The instructor companion adds the hidden truth
 in `metadata/`; it does not change the feature list.
 
 | Category | Columns | Modelling default |
 |---|---|---|
 | Lead identity & timing | 4 | drop `lead_id`; keep `lead_created_at` for cohort splits, drop for production |
-| Lead source & channel | 2 | keep both |
+| Lead source & channel | 1 | keep |
 | Firmographics | 5 | keep all |
 | Personographics | 3 | keep all (categorical encoders welcome) |
 | Engagement (snapshot-window) | 10 | keep all |
@@ -35,16 +35,18 @@ in `metadata/`; it does not change the feature list.
 
 ## Lead source and channel
 
-Two columns describe how each lead entered the funnel. They are
+One column describes how each lead entered the funnel. It is
 populated from the recipe's GTM-motion mix
-(`inbound_marketing` 45%, `sdr_outbound` 35%, `partner_referral`
-20%) and are identical between the two columns in v1 — both encode
-the same origination channel under different field names.
+(`inbound_marketing` 45%, `sdr_outbound` 35%, `partner_referral` 20%).
 
 | Column | Dtype | Why it might matter |
 |---|---|---|
 | `lead_source` | string | Origination channel; one of `inbound_marketing` / `sdr_outbound` / `partner_referral`. |
-| `first_touch_channel` | string | Marketing channel of the first recorded touch. Always equals `lead_source` in v1; the field exists to support post-v1 work where origination and first-touch can diverge. |
+
+**Note.** `first_touch_channel` was removed from the task snapshot in PR 8.1: in v1
+it is byte-identical to `lead_source` (both are set to the same origination value), so
+it adds no information. It still appears in the relational `tables/leads.parquet` for
+post-v1 use cases where origination and first-touch can diverge.
 
 **Caveat.** Per [`docs/release/channel_signal_audit.md`](channel_signal_audit.md),
 v1's channel signal is weak: per-channel rate spread ≤ 0.043 and
@@ -99,7 +101,7 @@ features cannot encode events that drove the late-window outcome.
 | `pricing_page_views` | Int64 | Cumulative pricing-page views across sessions. |
 | `demo_page_views` | Int64 | Cumulative demo-page views across sessions. |
 | `total_session_duration_seconds` | Int64 | Cumulative seconds across all sessions. |
-| `touches_week_1` | Int64 | Touches in days 0–7 inclusive (early urgency proxy; the snapshot builder uses `_day <= 7`, which is 8 day values). |
+| `touches_days_0_7` | Int64 | Touches in days 0–7 inclusive (early urgency proxy). Renamed from `touches_week_1` in PR 8.1 for precision: the window covers 8 day values (0, 1, …, 7). |
 | `touches_last_7_days` | Int64 | Touches in the last 7 days of the snapshot window — for `snapshot_day=30`, days 24–30 inclusive (the snapshot builder uses `_day > snapshot_day - 7`). |
 | `days_since_first_touch` | Float64 | NaN if the lead has had zero touches by snapshot day. |
 
@@ -183,8 +185,10 @@ in the table above, including the IDs — the recommendation is what to
 3. **Categoricals — encode.** One-hot or target-encode `industry`,
    `region`, `employee_band`, `estimated_revenue_band`,
    `process_maturity_band`, `role_function`, `seniority`,
-   `buyer_role`, `lead_source`, `first_touch_channel`. The two
-   channel columns carry identical values in v1; pick one.
+   `buyer_role`, `lead_source`.
+   (`first_touch_channel` was removed from the snapshot in PR 8.1 — it
+   was byte-identical to `lead_source` in v1; it still exists in
+   `tables/leads.parquet` but not in the task splits.)
 4. **Engagement and funnel — keep all.** The `Float64` columns carry
    NaN for "no event in window", which is itself a signal — encode
    missingness explicitly rather than imputing to zero blindly.
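
The two touch-window definitions in the engagement table (`_day <= 7` for `touches_days_0_7`, `_day > snapshot_day - 7` for `touches_last_7_days`) can be checked by enumerating day indices. This is a small sketch with hypothetical helper names; it assumes day indices run from 0 through `snapshot_day` inclusive, which matches the "days 24–30 inclusive" wording for `snapshot_day=30`.

```python
def early_window_days(snapshot_day: int = 30) -> list[int]:
    """Day indices counted by touches_days_0_7 (predicate: day <= 7)."""
    return [d for d in range(snapshot_day + 1) if d <= 7]


def trailing_window_days(snapshot_day: int = 30) -> list[int]:
    """Day indices counted by touches_last_7_days (predicate: day > snapshot_day - 7)."""
    return [d for d in range(snapshot_day + 1) if d > snapshot_day - 7]


early = early_window_days(30)
trailing = trailing_window_days(30)
```

With `snapshot_day=30` the early window holds eight day values (0 through 7) and the trailing window seven (24 through 30), which is the asymmetry the rename from `touches_week_1` makes explicit.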

diff --git a/leadforge/pipelines/build_v5.py b/leadforge/pipelines/build_v5.py
@@ -84,6 +84,8 @@
     "activity_count": "sales_activities",
     "converted_within_90_days": "converted",
     "total_touches_all": "__leakage__total_touches_90d",
+    # touches_days_0_7 renamed back to touches_week_1 for v5 CSV format compatibility.
+    "touches_days_0_7": "touches_week_1",
 }
 
 

diff --git a/leadforge/pipelines/common.py b/leadforge/pipelines/common.py
@@ -114,6 +114,8 @@
     "session_count": "web_sessions",
     "activity_count": "sales_activities",
     "converted_within_90_days": "converted",
+    # touches_days_0_7 renamed back to touches_week_1 for v6/v7 CSV format compatibility.
+    "touches_days_0_7": "touches_week_1",
 }
 
 

diff --git a/leadforge/render/relational_snapshot_safe.py b/leadforge/render/relational_snapshot_safe.py
@@ -101,7 +101,7 @@ def to_dataframes_snapshot_safe(
         df = dfs[name]
         if name == "opportunities":
             df = _drop_columns(df, BANNED_OPP_COLUMNS)
-        out[name] = _filter_to_snapshot_window(df, anchor, ts_col, horizon)
+        out[name] = _filter_to_snapshot_window(df, anchor, ts_col, horizon, table_name=name)
 
     return out
 
@@ -143,7 +143,15 @@ def _filter_to_snapshot_window(
     anchor: pd.DataFrame,
     ts_col: str,
     horizon: pd.Timedelta,
+    table_name: str = "<unknown>",
 ) -> pd.DataFrame:
+    # Column-presence guard runs before the empty check: a misconfigured table
+    # that happens to be empty should still raise, not silently pass through.
+    if "lead_id" not in events.columns:
+        raise ValueError(
+            f"SNAPSHOT_FILTERED_TABLES entry '{table_name}' is missing a 'lead_id' column; "
+            "cannot apply per-lead snapshot filter."
+        )
     if len(events) == 0:
         return events
     merged = events.merge(anchor, on="lead_id", how="left")
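
The hunk above ends partway through `_filter_to_snapshot_window`, so the sketch below is a self-contained stand-in that shows why the guard runs before the empty check. Everything after the merge is an assumption rather than the repository's code: the anchor column name `lead_created_at`, the inclusive `<=` comparison, and the function name `filter_to_snapshot_window` are illustrative only.

```python
import pandas as pd


def filter_to_snapshot_window(events, anchor, ts_col, horizon, table_name="<unknown>"):
    # Guard first: an empty but misconfigured table should still raise.
    if "lead_id" not in events.columns:
        raise ValueError(
            f"SNAPSHOT_FILTERED_TABLES entry '{table_name}' is missing a "
            "'lead_id' column; cannot apply per-lead snapshot filter."
        )
    if len(events) == 0:
        return events
    merged = events.merge(anchor, on="lead_id", how="left")
    # ASSUMPTION: the anchor carries `lead_created_at`, and rows at or before
    # the cutoff are kept. Rows with no matching lead get a NaT cutoff and drop.
    cutoff = merged["lead_created_at"] + horizon
    keep = merged[ts_col] <= cutoff
    return merged.loc[keep, events.columns].reset_index(drop=True)


anchor = pd.DataFrame(
    {"lead_id": [1], "lead_created_at": pd.to_datetime(["2026-01-01"])}
)
events = pd.DataFrame(
    {"lead_id": [1, 1], "ts": pd.to_datetime(["2026-01-06", "2026-02-10"])}
)
kept = filter_to_snapshot_window(
    events, anchor, "ts", pd.Timedelta(days=30), table_name="touches"
)

# An empty table without the key still fails loudly.
bad = pd.DataFrame({"ts": pd.to_datetime([])})
try:
    filter_to_snapshot_window(bad, anchor, "ts", pd.Timedelta(days=30), table_name="bad")
    guard_message = None
except ValueError as exc:
    guard_message = str(exc)
```

Placing the length check first would return the empty, keyless frame unchanged and hide the misconfiguration until the table first received rows.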