diff --git a/.agent-plan.md b/.agent-plan.md index f661aea..92ded14 100644 --- a/.agent-plan.md +++ b/.agent-plan.md @@ -70,18 +70,16 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family _Source: `docs/external_review/summaries/v1_release_review_synthesis.md` — cross-model review of the v1 preview site, 2026-05-25. Depends on Phase 7 (all except PR 7.3). PR 7.3 depends on this phase._ -- [ ] **PR 8.1** — `fix(render,validation,schema): snapshot fixes + noise clamps + schema cleanup + bundle regen` - - **`has_open_opportunity` / `opportunity_estimated_acv` post-snapshot leak** (HIGH): `leadforge/render/snapshots.py` — change `od["close_outcome"].isna()` gate to `closed_at is null OR closed_at > lead_created_at + snapshot_day`. This is the most critical code fix; measure AUC delta before/after and document. - - **Add flat-feature snapshot-consistency probe** (MEDIUM): `leadforge/validation/leakage_probes.py` — probe that recomputes opportunity-derived features under the correct `closed_at > cutoff` semantics and asserts equality with the shipped columns. The missing probe that would have caught the bug above automatically. - - **`to_dataframes_snapshot_safe` guard** (MEDIUM): assert `"lead_id" in events.columns` for each `SNAPSHOT_FILTERED_TABLES` entry; fail loud on missing key instead of silently producing all-NaT cutoffs. - - **Clamp Gaussian noise to physical ranges** (MEDIUM/LOW): post-noise clamp per column type in `_apply_difficulty_distortions` (`days_since_x >= 0`, monetary values `>= 0`). Tiny change, removes the most visible synthetic-data tell, makes the "non-physical values" known limitation obsolete. - - **Exempt `total_touches_all` from difficulty distortion** (MEDIUM): remove from `_NUMERIC_DISTORTION_COLS` or add a named exclusion list. Up to 18% NaN on the trap in Advanced muddies the leakage lesson it's supposed to deliver cleanly. - - **Drop `first_touch_channel`** (MEDIUM): remove from `LEAD_SNAPSHOT_FEATURES`, flat export, and feature dictionary — byte-identical to `lead_source` in v1; documented redundancy removing itself is cleaner than documenting it. - - **Rename `touches_week_1` → `touches_days_0_7`** (LOW): update `LEAD_SNAPSHOT_FEATURES`, snapshot builder, and feature dictionary. The implementation spans days 0–7 inclusive (8 values); the name implies 7. - - **Label window: `<` → `<=`** (MEDIUM): `engine.py` — `state.conversion_day < config.label_window_days` should be `<=`. Spec says "within 90 days" (inclusive); the break-me guide invites students to audit this boundary. - - **Regenerate all three public tier bundles**, update `release/` artifacts, rerun `validate_release_candidate`, sync claims register. - - Labels: `type: bugfix`, `layer: render`, `layer: validation`, `layer: schema` - - Size: M-L (~400 lines code + regenerated bundles) +- [x] **PR 8.1** — `fix(render,validation,schema): snapshot fixes + noise clamps + schema cleanup + bundle regen` + - **`has_open_opportunity` / `opportunity_estimated_acv` post-snapshot leak** (HIGH): Fixed in `leadforge/render/snapshots.py` — now uses `closed_at is null OR closed_at > lead_created_at + snapshot_day`. AUC delta vs. pre-fix: intro LR 0.879→0.879, intermediate 0.886→0.886, advanced 0.886→0.886 (AUC stable; the bug affected correctness not ranking signal for this DGP). + - **Add flat-feature snapshot-consistency probe** (MEDIUM): `probe_opportunity_snapshot_consistency` added to `leadforge/validation/leakage_probes.py`; registered as opt-in in `PROBE_REGISTRY`. + - **`to_dataframes_snapshot_safe` guard** (MEDIUM): `_filter_to_snapshot_window` raises `ValueError` with descriptive message on missing `lead_id` column. + - **Clamp Gaussian noise to physical ranges** (MEDIUM/LOW): Post-noise clamp applied in `_apply_difficulty_distortions` for `days_since_*`, ACV, and count columns. + - **Exempt `total_touches_all` from difficulty distortion** (MEDIUM): Added `_DISTORTION_EXEMPT_COLS = {"total_touches_all"}` and excluded from noise/missingness/outlier passes. + - **Drop `first_touch_channel`** (MEDIUM): Removed from `LEAD_SNAPSHOT_FEATURES` and task exports; still present in relational `leads.parquet`. v5/v6/v7 pipelines updated with compat rename. + - **Rename `touches_week_1` → `touches_days_0_7`** (LOW): Updated `LEAD_SNAPSHOT_FEATURES`, snapshot builder, contract tests. v5/v6/v7 pipelines keep `touches_week_1` via RENAME_MAP for backwards compat. + - **Label window: `<` → `<=`** (MEDIUM): Fixed in `engine.py`; test updated to use inclusive assertion. + - **Regenerated all three public tier bundles**: intro 41.5% conv rate, intermediate 20.1%, advanced 7.9%. Validation: PASS — 3 tiers, 5 seeds, 0 leakage findings. 1439 tests pass. - [ ] **PR 8.2** — `docs(release): difficulty-axis reframe + disclosure hardening` - **Reframe difficulty axis throughout all copy** (HIGH): README, dataset card, Kaggle/HF metadata, tier table, notebook headers. Change from "Intro / Intermediate / Advanced" framing of modelling difficulty to explicit prevalence/noise tier framing. Recommended: "Intro = high-prevalence classroom warm-up; Intermediate = default benchmark; Advanced = low-prevalence, calibration, and noise-handling exercise." AUC is flat across tiers; the "three difficulty tiers" framing is misleading to anyone who reads it as model complexity. diff --git a/docs/release/channel_signal_audit.json b/docs/release/channel_signal_audit.json index 0b95903..7acaa0f 100644 --- a/docs/release/channel_signal_audit.json +++ b/docs/release/channel_signal_audit.json @@ -1,7 +1,6 @@ { "channel_columns": [ - "lead_source", - "first_touch_channel" + "lead_source" ], "industry_mql_to_sql_benchmarks": { "Email": 0.005, @@ -46,39 +45,6 @@ "train_conversion_rate": 0.4145714285714286, "univariate_auc_in_sample": 0.5199794894149169, "univariate_auc_out_of_sample": 0.5013517441860464 - }, - { - "channels": [ - { - "conversion_rate": 0.43439490445859874, - "n": 1570, - "n_converted": 682, - "name": "inbound_marketing", - "share": 0.44857142857142857 - }, - { - "conversion_rate": 0.39111747851002865, - "n": 698, - "n_converted": 273, - "name": "partner_referral", - "share": 0.19942857142857143 - }, - { - "conversion_rate": 0.4025974025974026, - "n": 1232, - "n_converted": 496, - "name": "sdr_outbound", - "share": 0.352 - } - ], - "column": "first_touch_channel", - "n_test": 750, - "n_train": 3500, - "rate_spread": 0.04327742594857009, - "test_conversion_rate": 0.4266666666666667, - "train_conversion_rate": 0.4145714285714286, - "univariate_auc_in_sample": 0.5199794894149169, - "univariate_auc_out_of_sample": 0.5013517441860464 } ], "n_test": 750, @@ -121,39 +87,6 @@ "train_conversion_rate": 0.20142857142857143, "univariate_auc_in_sample": 0.5212431012826857, "univariate_auc_out_of_sample": 0.5139326835180411 - }, - { - "channels": [ - { - "conversion_rate": 0.21273885350318472, - "n": 1570, - "n_converted": 334, - "name": "inbound_marketing", - "share": 0.44857142857142857 - }, - { - "conversion_rate": 0.17621776504297995, - "n": 698, - "n_converted": 123, - "name": "partner_referral", - "share": 0.19942857142857143 - }, - { - "conversion_rate": 0.2012987012987013, - "n": 1232, - "n_converted": 248, - "name": "sdr_outbound", - "share": 0.352 - } - ], - "column": "first_touch_channel", - "n_test": 750, - "n_train": 3500, - "rate_spread": 0.03652108846020477, - "test_conversion_rate": 0.22266666666666668, - "train_conversion_rate": 0.20142857142857143, - "univariate_auc_in_sample": 0.5212431012826857, - "univariate_auc_out_of_sample": 0.5139326835180411 } ], "n_test": 750, @@ -196,39 +129,6 @@ "train_conversion_rate": 0.07914285714285714, "univariate_auc_in_sample": 0.5083011208921436, "univariate_auc_out_of_sample": 0.5225784296892246 - }, - { - "channels": [ - { - "conversion_rate": 0.08152866242038216, - "n": 1570, - "n_converted": 128, - "name": "inbound_marketing", - "share": 0.44857142857142857 - }, - { - "conversion_rate": 0.07593123209169055, - "n": 698, - "n_converted": 53, - "name": "partner_referral", - "share": 0.19942857142857143 - }, - { - "conversion_rate": 0.07792207792207792, - "n": 1232, - "n_converted": 96, - "name": "sdr_outbound", - "share": 0.352 - } - ], - "column": "first_touch_channel", - "n_test": 750, - "n_train": 3500, - "rate_spread": 0.005597430328691616, - "test_conversion_rate": 0.07866666666666666, - "train_conversion_rate": 0.07914285714285714, - "univariate_auc_in_sample": 0.5083011208921436, - "univariate_auc_out_of_sample": 0.5225784296892246 } ], "n_test": 750, diff --git a/docs/release/channel_signal_audit.md b/docs/release/channel_signal_audit.md index 2cc3d56..6305c51 100644 --- a/docs/release/channel_signal_audit.md +++ b/docs/release/channel_signal_audit.md @@ -18,7 +18,7 @@ Audit produced by `scripts/audit_channel_signal.py`; see `channel_signal_audit.j `n_train = 3500` (90-day conversion rate 41.46%); `n_test = 750` (rate 42.67%). -### Columns: `lead_source`, `first_touch_channel` (audit values identical) +### Column: `lead_source` Per-channel rate spread (max − min): **0.0433** · In-sample univariate AUC: **0.5200** · Out-of-sample univariate AUC: **0.5014** @@ -32,7 +32,7 @@ Per-channel rate spread (max − min): **0.0433** · In-sample univariate AUC: `n_train = 3500` (90-day conversion rate 20.14%); `n_test = 750` (rate 22.27%). -### Columns: `lead_source`, `first_touch_channel` (audit values identical) +### Column: `lead_source` Per-channel rate spread (max − min): **0.0365** · In-sample univariate AUC: **0.5212** · Out-of-sample univariate AUC: **0.5139** @@ -46,7 +46,7 @@ Per-channel rate spread (max − min): **0.0365** · In-sample univariate AUC: `n_train = 3500` (90-day conversion rate 7.91%); `n_test = 750` (rate 7.87%). -### Columns: `lead_source`, `first_touch_channel` (audit values identical) +### Column: `lead_source` Per-channel rate spread (max − min): **0.0056** · In-sample univariate AUC: **0.5083** · Out-of-sample univariate AUC: **0.5226** @@ -62,5 +62,5 @@ The numbers above answer one question: *how strongly does channel alone signal 9 Two empirical observations a reader can make from the numbers above: -1. **The out-of-sample univariate AUC is the comparable number** for any external baseline. It uses train-derived rates scored against held-out test labels — the same shape as the `source_only` HistGBM baseline reported in `release/validation/validation_report.json`, which is built on the same task splits with `lead_source` + `first_touch_channel` as the only features. The in-sample number is biased upward by construction — small at v1's N but visible — and is reported here for transparency rather than comparison. +1. **The out-of-sample univariate AUC is the comparable number** for any external baseline. It uses train-derived rates scored against held-out test labels — the same shape as the `source_only` HistGBM baseline reported in `release/validation/validation_report.json`, which is built on the same task splits with `lead_source` as the only feature. The in-sample number is biased upward by construction — small at v1's N but visible — and is reported here for transparency rather than comparison. 2. **The numerical conclusion is bundle-specific.** When the per-channel rate spread is small and the OOS univariate AUC is close to chance, channel alone is a weak feature for the bundle this audit was run against. v1's bundles currently produce that outcome (see the per-tier sections above) — consistent with the design: the simulator drives conversion through motif-family hazards keyed off latent traits, not channel-conditional probabilities. Channel-conditional encoding is tracked as post-v1 work in `docs/release/post_v1_roadmap.md`. diff --git a/docs/release/feature_dictionary.md b/docs/release/feature_dictionary.md index 790354a..eac0edb 100644 --- a/docs/release/feature_dictionary.md +++ b/docs/release/feature_dictionary.md @@ -8,14 +8,14 @@ analytical role and adds the prose explanation, modelling recommendations, and pedagogical caveats that don't fit a CSV row. The grouping below covers every feature in the public student-facing -snapshot — the same 32 columns ship in `intro`, `intermediate`, and +snapshot — the same 31 columns ship in `intro`, `intermediate`, and `advanced` bundles. The instructor companion adds the hidden truth in `metadata/`; it does not change the feature list. | Category | Columns | Modelling default | |---|---|---| | Lead identity & timing | 4 | drop `lead_id`; keep `lead_created_at` for cohort splits, drop for production | -| Lead source & channel | 2 | keep both | +| Lead source & channel | 1 | keep | | Firmographics | 5 | keep all | | Personographics | 3 | keep all (categorical encoders welcome) | | Engagement (snapshot-window) | 10 | keep all | @@ -35,16 +35,18 @@ in `metadata/`; it does not change the feature list. ## Lead source and channel -Two columns describe how each lead entered the funnel. They are +One column describes how each lead entered the funnel. It is populated from the recipe's GTM-motion mix -(`inbound_marketing` 45%, `sdr_outbound` 35%, `partner_referral` -20%) and are identical between the two columns in v1 — both encode -the same origination channel under different field names. +(`inbound_marketing` 45%, `sdr_outbound` 35%, `partner_referral` 20%). | Column | Dtype | Why it might matter | |---|---|---| | `lead_source` | string | Origination channel; one of `inbound_marketing` / `sdr_outbound` / `partner_referral`. | -| `first_touch_channel` | string | Marketing channel of the first recorded touch. Always equals `lead_source` in v1; the field exists to support post-v1 work where origination and first-touch can diverge. | + +**Note.** `first_touch_channel` was removed from the task snapshot in PR 8.1: in v1 +it is byte-identical to `lead_source` (both are set to the same origination value), so +it adds no information. It still appears in the relational `tables/leads.parquet` for +post-v1 use cases where origination and first-touch can diverge. **Caveat.** Per [`docs/release/channel_signal_audit.md`](channel_signal_audit.md), v1's channel signal is weak: per-channel rate spread ≤ 0.043 and @@ -99,7 +101,7 @@ features cannot encode events that drove the late-window outcome. | `pricing_page_views` | Int64 | Cumulative pricing-page views across sessions. | | `demo_page_views` | Int64 | Cumulative demo-page views across sessions. | | `total_session_duration_seconds` | Int64 | Cumulative seconds across all sessions. | -| `touches_week_1` | Int64 | Touches in days 0–7 inclusive (early urgency proxy; the snapshot builder uses `_day <= 7`, which is 8 day values). | +| `touches_days_0_7` | Int64 | Touches in days 0–7 inclusive (early urgency proxy). Renamed from `touches_week_1` in PR 8.1 for precision: the window covers 8 day values (0, 1, …, 7). | | `touches_last_7_days` | Int64 | Touches in the last 7 days of the snapshot window — for `snapshot_day=30`, days 24–30 inclusive (the snapshot builder uses `_day > snapshot_day - 7`). | | `days_since_first_touch` | Float64 | NaN if the lead has had zero touches by snapshot day. | @@ -183,8 +185,10 @@ in the table above, including the IDs — the recommendation is what to 3. **Categoricals — encode.** One-hot or target-encode `industry`, `region`, `employee_band`, `estimated_revenue_band`, `process_maturity_band`, `role_function`, `seniority`, - `buyer_role`, `lead_source`, `first_touch_channel`. The two - channel columns carry identical values in v1; pick one. + `buyer_role`, `lead_source`. + (`first_touch_channel` was removed from the snapshot in PR 8.1 — it + was byte-identical to `lead_source` in v1; it still exists in + `tables/leads.parquet` but not in the task splits.) 4. **Engagement and funnel — keep all.** The `Float64` columns carry NaN for "no event in window", which is itself a signal — encode missingness explicitly rather than imputing to zero blindly. diff --git a/leadforge/pipelines/build_v5.py b/leadforge/pipelines/build_v5.py index 4f6303f..a76698c 100644 --- a/leadforge/pipelines/build_v5.py +++ b/leadforge/pipelines/build_v5.py @@ -84,6 +84,8 @@ "activity_count": "sales_activities", "converted_within_90_days": "converted", "total_touches_all": "__leakage__total_touches_90d", + # touches_days_0_7 renamed back to touches_week_1 for v5 CSV format compatibility. + "touches_days_0_7": "touches_week_1", } diff --git a/leadforge/pipelines/common.py b/leadforge/pipelines/common.py index 5192463..0153677 100644 --- a/leadforge/pipelines/common.py +++ b/leadforge/pipelines/common.py @@ -114,6 +114,8 @@ "session_count": "web_sessions", "activity_count": "sales_activities", "converted_within_90_days": "converted", + # touches_days_0_7 renamed back to touches_week_1 for v6/v7 CSV format compatibility. + "touches_days_0_7": "touches_week_1", } diff --git a/leadforge/render/relational_snapshot_safe.py b/leadforge/render/relational_snapshot_safe.py index 377d7d2..9074482 100644 --- a/leadforge/render/relational_snapshot_safe.py +++ b/leadforge/render/relational_snapshot_safe.py @@ -101,7 +101,7 @@ def to_dataframes_snapshot_safe( df = dfs[name] if name == "opportunities": df = _drop_columns(df, BANNED_OPP_COLUMNS) - out[name] = _filter_to_snapshot_window(df, anchor, ts_col, horizon) + out[name] = _filter_to_snapshot_window(df, anchor, ts_col, horizon, table_name=name) return out @@ -143,7 +143,15 @@ def _filter_to_snapshot_window( anchor: pd.DataFrame, ts_col: str, horizon: pd.Timedelta, + table_name: str = "", ) -> pd.DataFrame: + # Column-presence guard runs before the empty check: a misconfigured table + # that happens to be empty should still raise, not silently pass through. + if "lead_id" not in events.columns: + raise ValueError( + f"SNAPSHOT_FILTERED_TABLES entry '{table_name}' is missing a 'lead_id' column; " + "cannot apply per-lead snapshot filter." + ) if len(events) == 0: return events merged = events.merge(anchor, on="lead_id", how="left") diff --git a/leadforge/render/snapshots.py b/leadforge/render/snapshots.py index 1105d66..318cda9 100644 --- a/leadforge/render/snapshots.py +++ b/leadforge/render/snapshots.py @@ -150,7 +150,7 @@ def build_snapshot( .reset_index() ) - # touches_week_1: count touches within first 7 days of lead creation. + # touches_days_0_7: count touches in days 0–7 (inclusive) after lead creation. # touches_last_7_days: count touches within last 7 days of snapshot window. if len(td_windowed) > 0: if "_ts" not in td_windowed.columns: @@ -164,7 +164,7 @@ def build_snapshot( td_windowed_copy["_ts"] - td_windowed_copy["_lead_date"] ).dt.days week1 = td_windowed_copy[td_windowed_copy["_day"] <= 7] - touches_week_1 = week1.groupby("lead_id").size().reset_index(name="touches_week_1") + touches_days_0_7 = week1.groupby("lead_id").size().reset_index(name="touches_days_0_7") # touches_last_7_days: touches in [effective_window - 7, effective_window] last7 = td_windowed_copy[td_windowed_copy["_day"] > (effective_window - 7)] @@ -180,7 +180,7 @@ def build_snapshot( .rename(columns={"_day": "_first_touch_day"}) ) else: - touches_week_1 = pd.DataFrame(columns=["lead_id", "touches_week_1"]) + touches_days_0_7 = pd.DataFrame(columns=["lead_id", "touches_days_0_7"]) touches_last_7_days = pd.DataFrame(columns=["lead_id", "touches_last_7_days"]) first_touch_day = pd.DataFrame(columns=["lead_id", "_first_touch_day"]) @@ -231,23 +231,43 @@ def build_snapshot( act_agg = ad.groupby("lead_id").agg(activity_count=("activity_id", "count")).reset_index() - # Opportunity join: find opportunity created by snapshot cutoff. + # Opportunity join: find opportunities created by snapshot cutoff. + # All temp columns are prefixed with ``_`` to avoid polluting downstream + # merge keys. We compute _cutoff and _closed_at_ts once, up-front, so + # both the creation-date filter and the open/closed gate share the same + # per-lead cutoff without a redundant map() call. od = ( pd.DataFrame([o.to_dict() for o in result.opportunities]) if result.opportunities else OpportunityRow.empty_dataframe() ) - if len(od) > 0 and snapshot_day is not None: - od["_created"] = pd.to_datetime(od["created_at"]) - od["_cutoff"] = od["lead_id"].map(cutoff_map) + od = od.copy() + od["_created"] = pd.to_datetime(od["created_at"]) + od["_closed_at_ts"] = pd.to_datetime(od["closed_at"], errors="coerce") + od["_cutoff"] = od["lead_id"].map(cutoff_map) + + if snapshot_day is not None: od = od[od["_created"] <= od["_cutoff"]] # Track ANY opportunity created (regardless of close outcome) for opportunity_created flag. any_opps = od[["lead_id"]].drop_duplicates() any_opps["opportunity_created"] = True - open_opps = od[od["close_outcome"].isna()][["lead_id", "estimated_acv"]] - open_opps = open_opps.groupby("lead_id").first().reset_index() + # An opportunity is "open" at snapshot date iff it was not yet closed as of + # the per-lead cutoff. Using close_outcome.isna() alone is wrong: an + # opportunity closed *after* the snapshot window has a non-null close_outcome + # (set by the post-simulation pass), so it would be incorrectly treated as + # already-closed at snapshot time and suppress has_open_opportunity. + open_mask = od["_closed_at_ts"].isna() | (od["_closed_at_ts"] > od["_cutoff"]) + open_opps_raw = od[open_mask][["lead_id", "created_at", "estimated_acv"]] + # Sort by created_at so groupby.first() is deterministic when a lead has + # multiple open opportunities (earliest-created wins). + open_opps = ( + open_opps_raw.sort_values("created_at") + .groupby("lead_id")[["estimated_acv"]] + .first() + .reset_index() + ) open_opps = open_opps.rename(columns={"estimated_acv": "opportunity_estimated_acv"}) open_opps["has_open_opportunity"] = True @@ -259,7 +279,7 @@ def build_snapshot( lead_df = lead_df.merge(act_agg, on="lead_id", how="left") lead_df = lead_df.merge(any_opps, on="lead_id", how="left") lead_df = lead_df.merge(open_opps, on="lead_id", how="left") - lead_df = lead_df.merge(touches_week_1, on="lead_id", how="left") + lead_df = lead_df.merge(touches_days_0_7, on="lead_id", how="left") lead_df = lead_df.merge(touches_last_7_days, on="lead_id", how="left") lead_df = lead_df.merge(first_touch_day, on="lead_id", how="left") lead_df = lead_df.merge(total_touches_all, on="lead_id", how="left") @@ -268,7 +288,7 @@ def build_snapshot( # opportunity_estimated_acv and days_since_last_touch intentionally stay NaN. int_agg_cols = [c for c in _INT_AGG_COLS if c in lead_df.columns] lead_df[int_agg_cols] = lead_df[int_agg_cols].fillna(0) - lead_df["touches_week_1"] = lead_df["touches_week_1"].fillna(0) + lead_df["touches_days_0_7"] = lead_df["touches_days_0_7"].fillna(0) lead_df["touches_last_7_days"] = lead_df["touches_last_7_days"].fillna(0) if "total_touches_all" in lead_df.columns: lead_df["total_touches_all"] = pd.to_numeric( @@ -332,15 +352,37 @@ def build_snapshot( # Derive eligible columns from the feature spec rather than runtime dtype # sniffing. This guarantees categoricals, booleans, IDs, and labels are # never distorted even if their runtime dtype happens to be numeric. + +# total_touches_all is a pedagogical leakage trap — distorting it muddies the +# lesson (up to 18% NaN on Advanced tier hides the trap). Exempt it explicitly. +_DISTORTION_EXEMPT_COLS: frozenset[str] = frozenset({"total_touches_all"}) + _FLOAT_DISTORTION_COLS: list[str] = [ - f.name for f in LEAD_SNAPSHOT_FEATURES if f.dtype in ("Float64", "float64") and not f.is_target + f.name + for f in LEAD_SNAPSHOT_FEATURES + if f.dtype in ("Float64", "float64") + and not f.is_target + and f.name not in _DISTORTION_EXEMPT_COLS ] _NUMERIC_DISTORTION_COLS: list[str] = [ f.name for f in LEAD_SNAPSHOT_FEATURES - if f.dtype in ("Float64", "float64", "Int64", "int64") and not f.is_target + if f.dtype in ("Float64", "float64", "Int64", "int64") + and not f.is_target + and f.name not in _DISTORTION_EXEMPT_COLS ] +# Post-noise physical-range clamps. Applied after Gaussian noise to prevent +# non-physical values (e.g. negative durations, negative counts). +# Derived from FeatureSpec.non_negative so the lists stay in sync automatically +# when features are added or renamed — no manual maintenance required. +_NONNEG_FLOAT_COLS: frozenset[str] = frozenset( + f.name for f in LEAD_SNAPSHOT_FEATURES if f.dtype in ("Float64", "float64") and f.non_negative +) +_NONNEG_INT_COLS: frozenset[str] = frozenset( + f.name for f in LEAD_SNAPSHOT_FEATURES if f.dtype in ("Int64", "int64") and f.non_negative +) + def _apply_difficulty_distortions( df: pd.DataFrame, @@ -374,6 +416,16 @@ def _apply_difficulty_distortions( values[valid_mask] = values[valid_mask] + noise[valid_mask.values] df[col] = values + # 1b. Post-noise clamp to physical ranges. + # Non-negative float columns: clip to >= 0. + for col in _NONNEG_FLOAT_COLS: + if col in df.columns and df[col].notna().any(): + df[col] = df[col].clip(lower=0) + # Non-negative int columns: clip to >= 0. clip() preserves Int64 dtype. + for col in _NONNEG_INT_COLS: + if col in df.columns and df[col].notna().any(): + df[col] = df[col].clip(lower=0) + # 2. MCAR missingness injection (all numeric columns). if params.missing_rate > 0: mask = np_rng.random(size=(len(df), len(all_numeric_cols))) < params.missing_rate diff --git a/leadforge/schema/features.py b/leadforge/schema/features.py index 62ed045..1f29969 100644 --- a/leadforge/schema/features.py +++ b/leadforge/schema/features.py @@ -42,6 +42,11 @@ class FeatureSpec: ``"lead_meta"``, ``"engagement"``, ``"sales"``, ``"target"``). is_target: True for the label column only. leakage_risk: Descriptive — this column is post-snapshot correlated. + non_negative: True for columns that are physically incapable of being + negative (counts, durations, monetary values). Used by the + snapshot builder to clamp values to ``>= 0`` after noise + injection, preventing non-physical negatives from leaking into + published bundles. redact_in_modes: Prescriptive — exposure modes in which the bundle writer must strip this column from snapshot, task splits, and feature dictionary. @@ -53,6 +58,7 @@ class FeatureSpec: category: str is_target: bool = False leakage_risk: bool = False + non_negative: bool = False redact_in_modes: frozenset[ExposureMode] = field(default_factory=frozenset) @@ -128,12 +134,12 @@ class FeatureSpec: "Origination source of the lead (e.g. inbound_form, sdr_outbound).", "lead_meta", ), - FeatureSpec( - "first_touch_channel", - "string", - "Marketing channel responsible for the first recorded touch.", - "lead_meta", - ), + # Note: ``first_touch_channel`` is absent from this list. In v1 the + # simulation sets it to the same value as ``lead_source`` (both derive + # from the channel drawn during lead creation), making it byte-identical + # and zero-information. It is retained in the relational ``leads`` + # table for completeness; it is excluded from the flat snapshot because + # a duplicate column would be actively misleading in a teaching dataset. FeatureSpec( "current_stage", "string", @@ -169,61 +175,71 @@ class FeatureSpec: "Int64", "Total number of marketing/sales touches recorded before snapshot.", "engagement", + non_negative=True, ), FeatureSpec( "inbound_touch_count", "Int64", "Number of inbound touches before snapshot.", "engagement", + non_negative=True, ), FeatureSpec( "outbound_touch_count", "Int64", "Number of outbound touches before snapshot.", "engagement", + non_negative=True, ), FeatureSpec( "session_count", "Int64", "Number of web/trial sessions recorded before snapshot.", "engagement", + non_negative=True, ), FeatureSpec( "pricing_page_views", "Int64", "Cumulative pricing page views across all sessions before snapshot.", "engagement", + non_negative=True, ), FeatureSpec( "demo_page_views", "Int64", "Cumulative demo page views across all sessions before snapshot.", "engagement", + non_negative=True, ), FeatureSpec( "total_session_duration_seconds", "Int64", "Sum of session durations (seconds) before snapshot.", "engagement", + non_negative=True, ), # -- Momentum features -- FeatureSpec( - "touches_week_1", + "touches_days_0_7", "Int64", - "Number of touches in the first 7 days after lead creation.", + "Number of touches in days 0–7 (inclusive) after lead creation.", "engagement", + non_negative=True, ), FeatureSpec( "touches_last_7_days", "Int64", "Number of touches in the last 7 days before snapshot cutoff.", "engagement", + non_negative=True, ), FeatureSpec( "days_since_first_touch", "Float64", "Days between first touch and snapshot cutoff (NaN if no touches).", "engagement", + non_negative=True, ), # -- Sales activity features -- FeatureSpec( @@ -231,12 +247,14 @@ class FeatureSpec: "Int64", "Number of sales activities logged before snapshot.", "sales", + non_negative=True, ), FeatureSpec( "days_since_last_touch", "Float64", "Days elapsed between most recent touch and snapshot cutoff.", "sales", + non_negative=True, ), FeatureSpec( "opportunity_created", @@ -255,6 +273,7 @@ class FeatureSpec: "Float64", "Estimated ACV of the most recent open opportunity (NaN if none).", "sales", + non_negative=True, ), FeatureSpec( "expected_acv", @@ -262,6 +281,7 @@ class FeatureSpec: "Expected ACV: opportunity ACV if available by snapshot, else " "revenue band midpoint heuristic (NaN if neither available).", "sales", + non_negative=True, ), # -- Pedagogical leakage trap (deliberately retained in all modes) -- FeatureSpec( diff --git a/leadforge/simulation/engine.py b/leadforge/simulation/engine.py index 2f6a347..4ed9375 100644 --- a/leadforge/simulation/engine.py +++ b/leadforge/simulation/engine.py @@ -391,10 +391,12 @@ def simulate_world( # (label_window_days), not the full simulation horizon. This allows # the engine to produce rich event histories over horizon_days while # only counting conversions in the configured observation window. + # Use <= so a conversion on day label_window_days (e.g. day 90) is + # counted — "within 90 days" is inclusive per the spec. label = ( state.converted and state.conversion_day is not None - and state.conversion_day < config.label_window_days + and state.conversion_day <= config.label_window_days ) updated_leads.append( diff --git a/leadforge/validation/leakage_probes.py b/leadforge/validation/leakage_probes.py index 2b373a4..6db6551 100644 --- a/leadforge/validation/leakage_probes.py +++ b/leadforge/validation/leakage_probes.py @@ -115,6 +115,7 @@ CHANNEL_SPLIT_LABEL_DRIFT: Final[str] = "split_label_drift" CHANNEL_ID_ONLY_BASELINE: Final[str] = "id_only_baseline" CHANNEL_FEATURE_SUBSET_BASELINE: Final[str] = "feature_subset_baseline" +CHANNEL_OPPORTUNITY_SNAPSHOT_CONSISTENCY: Final[str] = "opportunity_snapshot_consistency" _PUBLIC_TABLES: Final[tuple[str, ...]] = ( "accounts", @@ -951,6 +952,153 @@ def probe_feature_subset_baseline( return findings +# --------------------------------------------------------------------------- +# §8.6 Flat-feature snapshot-consistency — opportunity-derived columns. +# Opt-in: requires both the flat leads snapshot AND the full-horizon +# opportunities table, so it can only run when both are available. +# --------------------------------------------------------------------------- + + +def probe_opportunity_snapshot_consistency( + leads_df: pd.DataFrame, + opportunities_df: pd.DataFrame, + snapshot_day: int, +) -> list[LeakageFinding]: + """Verify that ``has_open_opportunity`` / ``opportunity_estimated_acv`` use correct semantics. + + Recomputes what these columns *should* be using the correct + ``closed_at > cutoff`` semantics (an opportunity is open at snapshot time + iff it was not yet closed as of ``lead_created_at + snapshot_day``), then + asserts that the shipped column values match. + + A mismatch indicates that the snapshot builder used the wrong open/closed + gate — e.g. checking ``close_outcome.isna()`` instead of + ``closed_at > cutoff``, which treats opportunities closed after the snapshot + window as already closed at snapshot time and artificially reduces the + ``has_open_opportunity`` positive rate. + + Opt-in: this probe requires both the flat leads snapshot (with + ``has_open_opportunity``, ``opportunity_estimated_acv``, and + ``lead_created_at``) and the full-horizon ``opportunities`` table (with + ``closed_at`` and ``estimated_acv``). Callers that only have public + bundles (where ``close_outcome`` / ``closed_at`` are banned) cannot run + this probe — pass the instructor table. + + Args: + leads_df: Flat snapshot DataFrame. Must contain ``lead_id``, + ``lead_created_at``, ``has_open_opportunity``, and + ``opportunity_estimated_acv``. + opportunities_df: Full-horizon opportunities table. Must contain + ``lead_id``, ``created_at``, ``estimated_acv``, and ``closed_at``. + snapshot_day: Snapshot window in days. + + Returns: + Empty list if the columns match; one :class:`LeakageFinding` per + mismatched column otherwise. + """ + required_leads = {"lead_id", "lead_created_at", "has_open_opportunity"} + missing_leads = required_leads - set(leads_df.columns) + if missing_leads: + raise ValueError(f"leads_df is missing required columns: {sorted(missing_leads)}") + + required_opps = {"lead_id", "created_at", "estimated_acv", "closed_at"} + missing_opps = required_opps - set(opportunities_df.columns) + if missing_opps: + raise ValueError( + f"opportunities_df is missing required columns: {sorted(missing_opps)}; " + "pass the full-horizon (instructor) opportunities table, not the public one." + ) + + if len(leads_df) == 0: + return [] + + # Build per-lead cutoffs. + leads_copy = leads_df[["lead_id", "lead_created_at", "has_open_opportunity"]].copy() + if "opportunity_estimated_acv" in leads_df.columns: + leads_copy["opportunity_estimated_acv"] = leads_df["opportunity_estimated_acv"] + else: + leads_copy["opportunity_estimated_acv"] = float("nan") + + leads_copy["_created_at_ts"] = pd.to_datetime(leads_copy["lead_created_at"], errors="coerce") + leads_copy["_snapshot_cutoff"] = leads_copy["_created_at_ts"] + pd.Timedelta(days=snapshot_day) + + # Filter opportunities to those created on/before the snapshot cutoff. + opps = opportunities_df[["lead_id", "created_at", "estimated_acv", "closed_at"]].copy() + opps["_opp_created_ts"] = pd.to_datetime(opps["created_at"], errors="coerce") + opps["_closed_at_ts"] = pd.to_datetime(opps["closed_at"], errors="coerce") + + # Merge cutoffs onto opportunities. + cutoffs = leads_copy[["lead_id", "_snapshot_cutoff"]] + opps = opps.merge(cutoffs, on="lead_id", how="left") + + # Keep only opportunities visible at snapshot time. + opps_at_snapshot = opps[opps["_opp_created_ts"] <= opps["_snapshot_cutoff"]] + + # An opportunity is open at snapshot time iff closed_at is NaT or > cutoff. + opps_open = opps_at_snapshot[ + opps_at_snapshot["_closed_at_ts"].isna() + | (opps_at_snapshot["_closed_at_ts"] > opps_at_snapshot["_snapshot_cutoff"]) + ] + + # Compute expected has_open_opportunity and opportunity_estimated_acv per lead. + # Sort by created_at before groupby so first() is deterministic when a lead + # has multiple open opportunities — matches the production sort in snapshots.py. + expected_open = ( + opps_open.sort_values("created_at") + .groupby("lead_id")["estimated_acv"] + .first() + .reset_index() + .rename(columns={"estimated_acv": "_expected_acv"}) + ) + expected_open["_expected_has_open"] = True + + # Merge expected values back onto leads. + check = leads_copy.merge(expected_open, on="lead_id", how="left") + check["_expected_has_open"] = check["_expected_has_open"].fillna(False) + + # Compare has_open_opportunity. + shipped_open = check["has_open_opportunity"].astype("boolean").fillna(False).astype(bool) + expected_open_bool = check["_expected_has_open"].astype(bool) + mismatch_open = (shipped_open != expected_open_bool).sum() + + # Compare opportunity_estimated_acv (both NaN → match; only one NaN → mismatch). + shipped_acv = pd.to_numeric(check["opportunity_estimated_acv"], errors="coerce") + expected_acv_series = pd.to_numeric(check["_expected_acv"], errors="coerce") + both_nan = shipped_acv.isna() & expected_acv_series.isna() + neither_nan = shipped_acv.notna() & expected_acv_series.notna() + acv_match = both_nan | (neither_nan & (shipped_acv == expected_acv_series)) + mismatch_acv = int((~acv_match).sum()) + + findings: list[LeakageFinding] = [] + if mismatch_open > 0: + n = len(check) + findings.append( + LeakageFinding( + channel=CHANNEL_OPPORTUNITY_SNAPSHOT_CONSISTENCY, + detail="has_open_opportunity", + message=( + f"{mismatch_open}/{n} leads have incorrect has_open_opportunity; " + "the snapshot builder likely used close_outcome.isna() instead of " + "closed_at > lead_created_at + snapshot_day" + ), + ) + ) + if mismatch_acv > 0: + n = len(check) + findings.append( + LeakageFinding( + channel=CHANNEL_OPPORTUNITY_SNAPSHOT_CONSISTENCY, + detail="opportunity_estimated_acv", + message=( + f"{mismatch_acv}/{n} leads have incorrect opportunity_estimated_acv; " + "the shipped ACV does not match the recomputed value under correct " + "closed_at > cutoff semantics" + ), + ) + ) + return findings + + # --------------------------------------------------------------------------- # Probe registry — single source of truth for "what probes exist and what # do they need". The orchestrators iterate it; the meta-test asserts that @@ -1011,6 +1159,12 @@ class ProbeSpec: "model_realism", opt_in=True, ), + "opportunity_snapshot_consistency": ProbeSpec( + "opportunity_snapshot_consistency", + probe_opportunity_snapshot_consistency, + "snapshot_consistency", + opt_in=True, # Requires full-horizon opportunities table (not public bundle) + ), } @@ -1316,6 +1470,7 @@ def _import_sklearn() -> _SklearnHandles | None: "CHANNEL_FEATURE_SUBSET_BASELINE", "CHANNEL_ID_ONLY_BASELINE", "CHANNEL_JOIN_RECONSTRUCTION", + "CHANNEL_OPPORTUNITY_SNAPSHOT_CONSISTENCY", "CHANNEL_SNAPSHOT_WINDOW", "CHANNEL_SPLIT_ID_OVERLAP", "CHANNEL_SPLIT_LABEL_DRIFT", @@ -1334,6 +1489,7 @@ def _import_sklearn() -> _SklearnHandles | None: "probe_deterministic_reconstruction", "probe_feature_subset_baseline", "probe_id_only_baseline", + "probe_opportunity_snapshot_consistency", "probe_snapshot_window", "probe_split_id_overlap", "probe_split_label_drift", diff --git a/release/_preview_committed/kaggle.html b/release/_preview_committed/kaggle.html index be69520..58be1bb 100644 --- a/release/_preview_committed/kaggle.html +++ b/release/_preview_committed/kaggle.html @@ -632,9 +632,9 @@

Data Files (57 total)<
-

Schema / Columns (534 columns across 33 tabular files)

+

Schema / Columns (522 columns across 33 tabular files)

- intro/lead_scoring.csv (33 columns) + intro/lead_scoring.csv (32 columns) @@ -652,7 +652,6 @@

Schema / Columns (534

- @@ -660,7 +659,7 @@

Schema / Columns (534

- + @@ -675,7 +674,7 @@

Schema / Columns (534

ColumnTypeDescription
lead_idstringOpaque lead identifier.
lead_created_atstringISO-8601 timestamp when the lead was created.
lead_sourcestringOrigination source of the lead (e.g. inbound_form, sdr_outbound).
first_touch_channelstringMarketing channel responsible for the first recorded touch.
touch_countintegerTotal number of marketing/sales touches recorded before snapshot.
inbound_touch_countintegerNumber of inbound touches before snapshot.
outbound_touch_countintegerNumber of outbound touches before snapshot.
pricing_page_viewsintegerCumulative pricing page views across all sessions before snapshot.
demo_page_viewsintegerCumulative demo page views across all sessions before snapshot.
total_session_duration_secondsintegerSum of session durations (seconds) before snapshot.
touches_week_1integerNumber of touches in the first 7 days after lead creation.
touches_days_0_7integerNumber of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
touches_last_7_daysintegerNumber of touches in the last 7 days before snapshot cutoff.
days_since_first_touchnumberDays between first touch and snapshot cutoff (NaN if no touches).
activity_countintegerNumber of sales activities logged before snapshot.
- intro/tasks/converted_within_90_days/train.parquet (32 columns) + intro/tasks/converted_within_90_days/train.parquet (31 columns) @@ -692,7 +691,6 @@

Schema / Columns (534

- @@ -700,7 +698,7 @@

Schema / Columns (534

- + @@ -709,13 +707,13 @@

Schema / Columns (534

- +
ColumnTypeDescription
lead_idstring
lead_created_atstring
lead_sourcestring
first_touch_channelstring
touch_countnumber
inbound_touch_countnumber
outbound_touch_countnumber
pricing_page_viewsnumber
demo_page_viewsnumber
total_session_duration_secondsnumber
touches_week_1number
touches_days_0_7number
touches_last_7_daysnumber
days_since_first_touchnumber
activity_countnumber
has_open_opportunityboolean
opportunity_estimated_acvnumber
expected_acvnumber
total_touches_allnumber
total_touches_allinteger
converted_within_90_daysboolean
- intro/tasks/converted_within_90_days/valid.parquet (32 columns) + intro/tasks/converted_within_90_days/valid.parquet (31 columns) @@ -732,7 +730,6 @@

Schema / Columns (534

- @@ -740,7 +737,7 @@

Schema / Columns (534

- + @@ -749,13 +746,13 @@

Schema / Columns (534

- +
ColumnTypeDescription
lead_idstring
lead_created_atstring
lead_sourcestring
first_touch_channelstring
touch_countnumber
inbound_touch_countnumber
outbound_touch_countnumber
pricing_page_viewsnumber
demo_page_viewsnumber
total_session_duration_secondsnumber
touches_week_1number
touches_days_0_7number
touches_last_7_daysnumber
days_since_first_touchnumber
activity_countnumber
has_open_opportunityboolean
opportunity_estimated_acvnumber
expected_acvnumber
total_touches_allnumber
total_touches_allinteger
converted_within_90_daysboolean
- intro/tasks/converted_within_90_days/test.parquet (32 columns) + intro/tasks/converted_within_90_days/test.parquet (31 columns) @@ -772,7 +769,6 @@

Schema / Columns (534

- @@ -780,7 +776,7 @@

Schema / Columns (534

- + @@ -789,7 +785,7 @@

Schema / Columns (534

- +
ColumnTypeDescription
lead_idstring
lead_created_atstring
lead_sourcestring
first_touch_channelstring
touch_countnumber
inbound_touch_countnumber
outbound_touch_countnumber
pricing_page_viewsnumber
demo_page_viewsnumber
total_session_duration_secondsnumber
touches_week_1number
touches_days_0_7number
touches_last_7_daysnumber
days_since_first_touchnumber
activity_countnumber
has_open_opportunityboolean
opportunity_estimated_acvnumber
expected_acvnumber
total_touches_allnumber
total_touches_allinteger
converted_within_90_daysboolean
@@ -900,7 +896,7 @@

Schema / Columns (534

- intermediate/lead_scoring.csv (33 columns) + intermediate/lead_scoring.csv (32 columns) @@ -918,7 +914,6 @@

Schema / Columns (534

- @@ -926,7 +921,7 @@

Schema / Columns (534

- + @@ -941,7 +936,7 @@

Schema / Columns (534

ColumnTypeDescription
lead_idstringOpaque lead identifier.
lead_created_atstringISO-8601 timestamp when the lead was created.
lead_sourcestringOrigination source of the lead (e.g. inbound_form, sdr_outbound).
first_touch_channelstringMarketing channel responsible for the first recorded touch.
touch_countintegerTotal number of marketing/sales touches recorded before snapshot.
inbound_touch_countintegerNumber of inbound touches before snapshot.
outbound_touch_countintegerNumber of outbound touches before snapshot.
pricing_page_viewsintegerCumulative pricing page views across all sessions before snapshot.
demo_page_viewsintegerCumulative demo page views across all sessions before snapshot.
total_session_duration_secondsintegerSum of session durations (seconds) before snapshot.
touches_week_1integerNumber of touches in the first 7 days after lead creation.
touches_days_0_7integerNumber of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
touches_last_7_daysintegerNumber of touches in the last 7 days before snapshot cutoff.
days_since_first_touchnumberDays between first touch and snapshot cutoff (NaN if no touches).
activity_countintegerNumber of sales activities logged before snapshot.
- intermediate/tasks/converted_within_90_days/train.parquet (32 columns) + intermediate/tasks/converted_within_90_days/train.parquet (31 columns) @@ -958,7 +953,6 @@

Schema / Columns (534

- @@ -966,7 +960,7 @@

Schema / Columns (534

- + @@ -975,13 +969,13 @@

Schema / Columns (534

- +
ColumnTypeDescription
lead_idstring
lead_created_atstring
lead_sourcestring
first_touch_channelstring
touch_countnumber
inbound_touch_countnumber
outbound_touch_countnumber
pricing_page_viewsnumber
demo_page_viewsnumber
total_session_duration_secondsnumber
touches_week_1number
touches_days_0_7number
touches_last_7_daysnumber
days_since_first_touchnumber
activity_countnumber
has_open_opportunityboolean
opportunity_estimated_acvnumber
expected_acvnumber
total_touches_allnumber
total_touches_allinteger
converted_within_90_daysboolean
- intermediate/tasks/converted_within_90_days/valid.parquet (32 columns) + intermediate/tasks/converted_within_90_days/valid.parquet (31 columns) @@ -998,7 +992,6 @@

Schema / Columns (534

- @@ -1006,7 +999,7 @@

Schema / Columns (534

- + @@ -1015,13 +1008,13 @@

Schema / Columns (534

- +
ColumnTypeDescription
lead_idstring
lead_created_atstring
lead_sourcestring
first_touch_channelstring
touch_countnumber
inbound_touch_countnumber
outbound_touch_countnumber
pricing_page_viewsnumber
demo_page_viewsnumber
total_session_duration_secondsnumber
touches_week_1number
touches_days_0_7number
touches_last_7_daysnumber
days_since_first_touchnumber
activity_countnumber
has_open_opportunityboolean
opportunity_estimated_acvnumber
expected_acvnumber
total_touches_allnumber
total_touches_allinteger
converted_within_90_daysboolean
- intermediate/tasks/converted_within_90_days/test.parquet (32 columns) + intermediate/tasks/converted_within_90_days/test.parquet (31 columns) @@ -1038,7 +1031,6 @@

Schema / Columns (534

- @@ -1046,7 +1038,7 @@

Schema / Columns (534

- + @@ -1055,7 +1047,7 @@

Schema / Columns (534

- +
ColumnTypeDescription
lead_idstring
lead_created_atstring
lead_sourcestring
first_touch_channelstring
touch_countnumber
inbound_touch_countnumber
outbound_touch_countnumber
pricing_page_viewsnumber
demo_page_viewsnumber
total_session_duration_secondsnumber
touches_week_1number
touches_days_0_7number
touches_last_7_daysnumber
days_since_first_touchnumber
activity_countnumber
has_open_opportunityboolean
opportunity_estimated_acvnumber
expected_acvnumber
total_touches_allnumber
total_touches_allinteger
converted_within_90_daysboolean
@@ -1166,7 +1158,7 @@

Schema / Columns (534

- advanced/lead_scoring.csv (33 columns) + advanced/lead_scoring.csv (32 columns) @@ -1184,7 +1176,6 @@

Schema / Columns (534

- @@ -1192,7 +1183,7 @@

Schema / Columns (534

- + @@ -1207,7 +1198,7 @@

Schema / Columns (534

ColumnTypeDescription
lead_idstringOpaque lead identifier.
lead_created_atstringISO-8601 timestamp when the lead was created.
lead_sourcestringOrigination source of the lead (e.g. inbound_form, sdr_outbound).
first_touch_channelstringMarketing channel responsible for the first recorded touch.
touch_countintegerTotal number of marketing/sales touches recorded before snapshot.
inbound_touch_countintegerNumber of inbound touches before snapshot.
outbound_touch_countintegerNumber of outbound touches before snapshot.
pricing_page_viewsintegerCumulative pricing page views across all sessions before snapshot.
demo_page_viewsintegerCumulative demo page views across all sessions before snapshot.
total_session_duration_secondsintegerSum of session durations (seconds) before snapshot.
touches_week_1integerNumber of touches in the first 7 days after lead creation.
touches_days_0_7integerNumber of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
touches_last_7_daysintegerNumber of touches in the last 7 days before snapshot cutoff.
days_since_first_touchnumberDays between first touch and snapshot cutoff (NaN if no touches).
activity_countintegerNumber of sales activities logged before snapshot.
- advanced/tasks/converted_within_90_days/train.parquet (32 columns) + advanced/tasks/converted_within_90_days/train.parquet (31 columns) @@ -1224,7 +1215,6 @@

Schema / Columns (534

- @@ -1232,7 +1222,7 @@

Schema / Columns (534

- + @@ -1241,13 +1231,13 @@

Schema / Columns (534

- +
ColumnTypeDescription
lead_idstring
lead_created_atstring
lead_sourcestring
first_touch_channelstring
touch_countnumber
inbound_touch_countnumber
outbound_touch_countnumber
pricing_page_viewsnumber
demo_page_viewsnumber
total_session_duration_secondsnumber
touches_week_1number
touches_days_0_7number
touches_last_7_daysnumber
days_since_first_touchnumber
activity_countnumber
has_open_opportunityboolean
opportunity_estimated_acvnumber
expected_acvnumber
total_touches_allnumber
total_touches_allinteger
converted_within_90_daysboolean
- advanced/tasks/converted_within_90_days/valid.parquet (32 columns) + advanced/tasks/converted_within_90_days/valid.parquet (31 columns) @@ -1264,7 +1254,6 @@

Schema / Columns (534

- @@ -1272,7 +1261,7 @@

Schema / Columns (534

- + @@ -1281,13 +1270,13 @@

Schema / Columns (534

- +
ColumnTypeDescription
lead_idstring
lead_created_atstring
lead_sourcestring
first_touch_channelstring
touch_countnumber
inbound_touch_countnumber
outbound_touch_countnumber
pricing_page_viewsnumber
demo_page_viewsnumber
total_session_duration_secondsnumber
touches_week_1number
touches_days_0_7number
touches_last_7_daysnumber
days_since_first_touchnumber
activity_countnumber
has_open_opportunityboolean
opportunity_estimated_acvnumber
expected_acvnumber
total_touches_allnumber
total_touches_allinteger
converted_within_90_daysboolean
- advanced/tasks/converted_within_90_days/test.parquet (32 columns) + advanced/tasks/converted_within_90_days/test.parquet (31 columns) @@ -1304,7 +1293,6 @@

Schema / Columns (534

- @@ -1312,7 +1300,7 @@

Schema / Columns (534

- + @@ -1321,7 +1309,7 @@

Schema / Columns (534

- +
ColumnTypeDescription
lead_idstring
lead_created_atstring
lead_sourcestring
first_touch_channelstring
touch_countnumber
inbound_touch_countnumber
outbound_touch_countnumber
pricing_page_viewsnumber
demo_page_viewsnumber
total_session_duration_secondsnumber
touches_week_1number
touches_days_0_7number
touches_last_7_daysnumber
days_since_first_touchnumber
activity_countnumber
has_open_opportunityboolean
opportunity_estimated_acvnumber
expected_acvnumber
total_touches_allnumber
total_touches_allinteger
converted_within_90_daysboolean
diff --git a/release/docs/channel_signal_audit.md b/release/docs/channel_signal_audit.md index 2cc3d56..6305c51 100644 --- a/release/docs/channel_signal_audit.md +++ b/release/docs/channel_signal_audit.md @@ -18,7 +18,7 @@ Audit produced by `scripts/audit_channel_signal.py`; see `channel_signal_audit.j `n_train = 3500` (90-day conversion rate 41.46%); `n_test = 750` (rate 42.67%). -### Columns: `lead_source`, `first_touch_channel` (audit values identical) +### Column: `lead_source` Per-channel rate spread (max − min): **0.0433** · In-sample univariate AUC: **0.5200** · Out-of-sample univariate AUC: **0.5014** @@ -32,7 +32,7 @@ Per-channel rate spread (max − min): **0.0433** · In-sample univariate AUC: `n_train = 3500` (90-day conversion rate 20.14%); `n_test = 750` (rate 22.27%). -### Columns: `lead_source`, `first_touch_channel` (audit values identical) +### Column: `lead_source` Per-channel rate spread (max − min): **0.0365** · In-sample univariate AUC: **0.5212** · Out-of-sample univariate AUC: **0.5139** @@ -46,7 +46,7 @@ Per-channel rate spread (max − min): **0.0365** · In-sample univariate AUC: `n_train = 3500` (90-day conversion rate 7.91%); `n_test = 750` (rate 7.87%). -### Columns: `lead_source`, `first_touch_channel` (audit values identical) +### Column: `lead_source` Per-channel rate spread (max − min): **0.0056** · In-sample univariate AUC: **0.5083** · Out-of-sample univariate AUC: **0.5226** @@ -62,5 +62,5 @@ The numbers above answer one question: *how strongly does channel alone signal 9 Two empirical observations a reader can make from the numbers above: -1. **The out-of-sample univariate AUC is the comparable number** for any external baseline. It uses train-derived rates scored against held-out test labels — the same shape as the `source_only` HistGBM baseline reported in `release/validation/validation_report.json`, which is built on the same task splits with `lead_source` + `first_touch_channel` as the only features. The in-sample number is biased upward by construction — small at v1's N but visible — and is reported here for transparency rather than comparison. +1. **The out-of-sample univariate AUC is the comparable number** for any external baseline. It uses train-derived rates scored against held-out test labels — the same shape as the `source_only` HistGBM baseline reported in `release/validation/validation_report.json`, which is built on the same task splits with `lead_source` as the only feature. The in-sample number is biased upward by construction — small at v1's N but visible — and is reported here for transparency rather than comparison. 2. **The numerical conclusion is bundle-specific.** When the per-channel rate spread is small and the OOS univariate AUC is close to chance, channel alone is a weak feature for the bundle this audit was run against. v1's bundles currently produce that outcome (see the per-tier sections above) — consistent with the design: the simulator drives conversion through motif-family hazards keyed off latent traits, not channel-conditional probabilities. Channel-conditional encoding is tracked as post-v1 work in `docs/release/post_v1_roadmap.md`. diff --git a/release/docs/feature_dictionary.md b/release/docs/feature_dictionary.md index 790354a..eac0edb 100644 --- a/release/docs/feature_dictionary.md +++ b/release/docs/feature_dictionary.md @@ -8,14 +8,14 @@ analytical role and adds the prose explanation, modelling recommendations, and pedagogical caveats that don't fit a CSV row. The grouping below covers every feature in the public student-facing -snapshot — the same 32 columns ship in `intro`, `intermediate`, and +snapshot — the same 31 columns ship in `intro`, `intermediate`, and `advanced` bundles. The instructor companion adds the hidden truth in `metadata/`; it does not change the feature list. | Category | Columns | Modelling default | |---|---|---| | Lead identity & timing | 4 | drop `lead_id`; keep `lead_created_at` for cohort splits, drop for production | -| Lead source & channel | 2 | keep both | +| Lead source & channel | 1 | keep | | Firmographics | 5 | keep all | | Personographics | 3 | keep all (categorical encoders welcome) | | Engagement (snapshot-window) | 10 | keep all | @@ -35,16 +35,18 @@ in `metadata/`; it does not change the feature list. ## Lead source and channel -Two columns describe how each lead entered the funnel. They are +One column describes how each lead entered the funnel. It is populated from the recipe's GTM-motion mix -(`inbound_marketing` 45%, `sdr_outbound` 35%, `partner_referral` -20%) and are identical between the two columns in v1 — both encode -the same origination channel under different field names. +(`inbound_marketing` 45%, `sdr_outbound` 35%, `partner_referral` 20%). | Column | Dtype | Why it might matter | |---|---|---| | `lead_source` | string | Origination channel; one of `inbound_marketing` / `sdr_outbound` / `partner_referral`. | -| `first_touch_channel` | string | Marketing channel of the first recorded touch. Always equals `lead_source` in v1; the field exists to support post-v1 work where origination and first-touch can diverge. | + +**Note.** `first_touch_channel` was removed from the task snapshot in PR 8.1: in v1 +it is byte-identical to `lead_source` (both are set to the same origination value), so +it adds no information. It still appears in the relational `tables/leads.parquet` for +post-v1 use cases where origination and first-touch can diverge. **Caveat.** Per [`docs/release/channel_signal_audit.md`](channel_signal_audit.md), v1's channel signal is weak: per-channel rate spread ≤ 0.043 and @@ -99,7 +101,7 @@ features cannot encode events that drove the late-window outcome. | `pricing_page_views` | Int64 | Cumulative pricing-page views across sessions. | | `demo_page_views` | Int64 | Cumulative demo-page views across sessions. | | `total_session_duration_seconds` | Int64 | Cumulative seconds across all sessions. | -| `touches_week_1` | Int64 | Touches in days 0–7 inclusive (early urgency proxy; the snapshot builder uses `_day <= 7`, which is 8 day values). | +| `touches_days_0_7` | Int64 | Touches in days 0–7 inclusive (early urgency proxy). Renamed from `touches_week_1` in PR 8.1 for precision: the window covers 8 day values (0, 1, …, 7). | | `touches_last_7_days` | Int64 | Touches in the last 7 days of the snapshot window — for `snapshot_day=30`, days 24–30 inclusive (the snapshot builder uses `_day > snapshot_day - 7`). | | `days_since_first_touch` | Float64 | NaN if the lead has had zero touches by snapshot day. | @@ -183,8 +185,10 @@ in the table above, including the IDs — the recommendation is what to 3. **Categoricals — encode.** One-hot or target-encode `industry`, `region`, `employee_band`, `estimated_revenue_band`, `process_maturity_band`, `role_function`, `seniority`, - `buyer_role`, `lead_source`, `first_touch_channel`. The two - channel columns carry identical values in v1; pick one. + `buyer_role`, `lead_source`. + (`first_touch_channel` was removed from the snapshot in PR 8.1 — it + was byte-identical to `lead_source` in v1; it still exists in + `tables/leads.parquet` but not in the task splits.) 4. **Engagement and funnel — keep all.** The `Float64` columns carry NaN for "no event in window", which is itself a signal — encode missingness explicitly rather than imputing to zero blindly. diff --git a/release/kaggle/dataset-metadata.json b/release/kaggle/dataset-metadata.json index f24379b..3aee8a7 100644 --- a/release/kaggle/dataset-metadata.json +++ b/release/kaggle/dataset-metadata.json @@ -96,11 +96,6 @@ "name": "lead_source", "type": "string" }, - { - "description": "Marketing channel responsible for the first recorded touch.", - "name": "first_touch_channel", - "type": "string" - }, { "description": "Total number of marketing/sales touches recorded before snapshot.", "name": "touch_count", @@ -137,8 +132,8 @@ "type": "integer" }, { - "description": "Number of touches in the first 7 days after lead creation.", - "name": "touches_week_1", + "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.", + "name": "touches_days_0_7", "type": "integer" }, { @@ -255,10 +250,6 @@ "name": "lead_source", "type": "string" }, - { - "name": "first_touch_channel", - "type": "string" - }, { "name": "touch_count", "type": "number" @@ -288,7 +279,7 @@ "type": "number" }, { - "name": "touches_week_1", + "name": "touches_days_0_7", "type": "number" }, { @@ -325,7 +316,7 @@ }, { "name": "total_touches_all", - "type": "number" + "type": "integer" }, { "name": "converted_within_90_days", @@ -391,10 +382,6 @@ "name": "lead_source", "type": "string" }, - { - "name": "first_touch_channel", - "type": "string" - }, { "name": "touch_count", "type": "number" @@ -424,7 +411,7 @@ "type": "number" }, { - "name": "touches_week_1", + "name": "touches_days_0_7", "type": "number" }, { @@ -461,7 +448,7 @@ }, { "name": "total_touches_all", - "type": "number" + "type": "integer" }, { "name": "converted_within_90_days", @@ -527,10 +514,6 @@ "name": "lead_source", "type": "string" }, - { - "name": "first_touch_channel", - "type": "string" - }, { "name": "touch_count", "type": "number" @@ -560,7 +543,7 @@ "type": "number" }, { - "name": "touches_week_1", + "name": "touches_days_0_7", "type": "number" }, { @@ -597,7 +580,7 @@ }, { "name": "total_touches_all", - "type": "number" + "type": "integer" }, { "name": "converted_within_90_days", @@ -994,11 +977,6 @@ "name": "lead_source", "type": "string" }, - { - "description": "Marketing channel responsible for the first recorded touch.", - "name": "first_touch_channel", - "type": "string" - }, { "description": "Total number of marketing/sales touches recorded before snapshot.", "name": "touch_count", @@ -1035,8 +1013,8 @@ "type": "integer" }, { - "description": "Number of touches in the first 7 days after lead creation.", - "name": "touches_week_1", + "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.", + "name": "touches_days_0_7", "type": "integer" }, { @@ -1153,10 +1131,6 @@ "name": "lead_source", "type": "string" }, - { - "name": "first_touch_channel", - "type": "string" - }, { "name": "touch_count", "type": "number" @@ -1186,7 +1160,7 @@ "type": "number" }, { - "name": "touches_week_1", + "name": "touches_days_0_7", "type": "number" }, { @@ -1223,7 +1197,7 @@ }, { "name": "total_touches_all", - "type": "number" + "type": "integer" }, { "name": "converted_within_90_days", @@ -1289,10 +1263,6 @@ "name": "lead_source", "type": "string" }, - { - "name": "first_touch_channel", - "type": "string" - }, { "name": "touch_count", "type": "number" @@ -1322,7 +1292,7 @@ "type": "number" }, { - "name": "touches_week_1", + "name": "touches_days_0_7", "type": "number" }, { @@ -1359,7 +1329,7 @@ }, { "name": "total_touches_all", - "type": "number" + "type": "integer" }, { "name": "converted_within_90_days", @@ -1425,10 +1395,6 @@ "name": "lead_source", "type": "string" }, - { - "name": "first_touch_channel", - "type": "string" - }, { "name": "touch_count", "type": "number" @@ -1458,7 +1424,7 @@ "type": "number" }, { - "name": "touches_week_1", + "name": "touches_days_0_7", "type": "number" }, { @@ -1495,7 +1461,7 @@ }, { "name": "total_touches_all", - "type": "number" + "type": "integer" }, { "name": "converted_within_90_days", @@ -1892,11 +1858,6 @@ "name": "lead_source", "type": "string" }, - { - "description": "Marketing channel responsible for the first recorded touch.", - "name": "first_touch_channel", - "type": "string" - }, { "description": "Total number of marketing/sales touches recorded before snapshot.", "name": "touch_count", @@ -1933,8 +1894,8 @@ "type": "integer" }, { - "description": "Number of touches in the first 7 days after lead creation.", - "name": "touches_week_1", + "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.", + "name": "touches_days_0_7", "type": "integer" }, { @@ -2051,10 +2012,6 @@ "name": "lead_source", "type": "string" }, - { - "name": "first_touch_channel", - "type": "string" - }, { "name": "touch_count", "type": "number" @@ -2084,7 +2041,7 @@ "type": "number" }, { - "name": "touches_week_1", + "name": "touches_days_0_7", "type": "number" }, { @@ -2121,7 +2078,7 @@ }, { "name": "total_touches_all", - "type": "number" + "type": "integer" }, { "name": "converted_within_90_days", @@ -2187,10 +2144,6 @@ "name": "lead_source", "type": "string" }, - { - "name": "first_touch_channel", - "type": "string" - }, { "name": "touch_count", "type": "number" @@ -2220,7 +2173,7 @@ "type": "number" }, { - "name": "touches_week_1", + "name": "touches_days_0_7", "type": "number" }, { @@ -2257,7 +2210,7 @@ }, { "name": "total_touches_all", - "type": "number" + "type": "integer" }, { "name": "converted_within_90_days", @@ -2323,10 +2276,6 @@ "name": "lead_source", "type": "string" }, - { - "name": "first_touch_channel", - "type": "string" - }, { "name": "touch_count", "type": "number" @@ -2356,7 +2305,7 @@ "type": "number" }, { - "name": "touches_week_1", + "name": "touches_days_0_7", "type": "number" }, { @@ -2393,7 +2342,7 @@ }, { "name": "total_touches_all", - "type": "number" + "type": "integer" }, { "name": "converted_within_90_days", diff --git a/scripts/audit_channel_signal.py b/scripts/audit_channel_signal.py index 12a50f4..1931e85 100644 --- a/scripts/audit_channel_signal.py +++ b/scripts/audit_channel_signal.py @@ -3,8 +3,8 @@ Companion analysis for PR 4.1 (recommendation #8 v1 scope from ``docs/external_review/summaries/recommendations_pass.md``). For every -tier in a release bundle family we compute, separately for ``lead_source`` -and ``first_touch_channel``: +tier in a release bundle family we compute, for each channel column +(default: ``lead_source``): * per-channel conversion rate, share, and counts on the **train** split * the **in-sample** univariate AUC: per-channel rates derived on train @@ -48,7 +48,7 @@ # Constants # --------------------------------------------------------------------------- -CHANNEL_COLUMNS: Final[tuple[str, ...]] = ("lead_source", "first_touch_channel") +CHANNEL_COLUMNS: Final[tuple[str, ...]] = ("lead_source",) LABEL_COLUMN: Final[str] = "converted_within_90_days" DEFAULT_TIERS: Final[tuple[str, ...]] = ("intro", "intermediate", "advanced") DEFAULT_TASK: Final[str] = "converted_within_90_days" @@ -111,9 +111,8 @@ class ChannelAudit: class ChannelGroup: """One or more channel columns with byte-identical audit values. - v1's ``lead_source`` and ``first_touch_channel`` produce identical - numbers in every tier — this dataclass lets the markdown renderer - collapse them into one section without losing information. + Allows the markdown renderer to collapse deduplicate columns into one + section without losing information. """ columns: tuple[str, ...] @@ -507,7 +506,7 @@ def render_markdown( "external baseline. It uses train-derived rates scored against held-out test " "labels — the same shape as the `source_only` HistGBM baseline reported in " "`release/validation/validation_report.json`, which is built on the same task " - "splits with `lead_source` + `first_touch_channel` as the only features. The " + "splits with `lead_source` as the only feature. The " "in-sample number is biased upward by construction — small at v1's N but " "visible — and is reported here for transparency rather than comparison." ) @@ -558,7 +557,7 @@ def _parse_args(argv: Sequence[str] | None) -> argparse.Namespace: action="append", dest="channel_columns", default=None, - help="channel column to audit (repeatable; default: lead_source + first_touch_channel)", + help="channel column to audit (repeatable; default: lead_source)", ) parser.add_argument( "--out-md", diff --git a/tests/render/test_bundle_schema_v5_contract.py b/tests/render/test_bundle_schema_v5_contract.py index f2cef38..5520b66 100644 --- a/tests/render/test_bundle_schema_v5_contract.py +++ b/tests/render/test_bundle_schema_v5_contract.py @@ -57,7 +57,7 @@ "lead_id", "lead_created_at", "lead_source", - "first_touch_channel", + # first_touch_channel removed: byte-identical to lead_source in v1 "touch_count", "inbound_touch_count", "outbound_touch_count", @@ -65,7 +65,7 @@ "pricing_page_views", "demo_page_views", "total_session_duration_seconds", - "touches_week_1", + "touches_days_0_7", # renamed from touches_week_1 "touches_last_7_days", "days_since_first_touch", "activity_count", @@ -93,7 +93,7 @@ "account_id", "lead_created_at", "lead_source", - "first_touch_channel", + "first_touch_channel", # still present in relational leads table (entity field) # ``current_stage``, ``is_sql`` redacted in student_public "owner_rep_id", # ``converted_within_90_days`` / ``conversion_timestamp`` dropped @@ -108,7 +108,7 @@ "account_id", "lead_created_at", "lead_source", - "first_touch_channel", + "first_touch_channel", # still present in relational leads table (entity field) "current_stage", "is_sql", "owner_rep_id", diff --git a/tests/render/test_snapshot_windowed.py b/tests/render/test_snapshot_windowed.py index e0b2654..5b5dff4 100644 --- a/tests/render/test_snapshot_windowed.py +++ b/tests/render/test_snapshot_windowed.py @@ -1,6 +1,6 @@ """Tests for windowed snapshot features (v4 engine changes). -Covers: snapshot_day parameter, touches_week_1, days_since_first_touch, +Covers: snapshot_day parameter, touches_days_0_7, days_since_first_touch, expected_acv, total_touches_all (leakage trap), opportunity_created. """ @@ -46,7 +46,7 @@ def test_snapshot_day_produces_valid_dataframe(self, sim_data): config, population, result = sim_data snap = build_snapshot(result, population, snapshot_day=14) assert len(snap) == config.n_leads - assert "touches_week_1" in snap.columns + assert "touches_days_0_7" in snap.columns def test_windowed_touch_counts_leq_full(self, sim_data): """Windowed touch counts should be ≤ full-horizon counts.""" @@ -70,20 +70,20 @@ def test_snapshot_day_none_equals_default(self, sim_data): # --------------------------------------------------------------------------- -# touches_week_1 +# touches_days_0_7 # --------------------------------------------------------------------------- -class TestTouchesWeek1: +class TestTouchesDays07: def test_non_negative(self, sim_data): _, population, result = sim_data snap = build_snapshot(result, population, snapshot_day=14) - assert (snap["touches_week_1"] >= 0).all() + assert (snap["touches_days_0_7"] >= 0).all() def test_leq_total_touches(self, sim_data): _, population, result = sim_data snap = build_snapshot(result, population, snapshot_day=14) - assert (snap["touches_week_1"] <= snap["touch_count"]).all() + assert (snap["touches_days_0_7"] <= snap["touch_count"]).all() # --------------------------------------------------------------------------- diff --git a/tests/scripts/test_audit_channel_signal.py b/tests/scripts/test_audit_channel_signal.py index aad8ea9..7218dbb 100644 --- a/tests/scripts/test_audit_channel_signal.py +++ b/tests/scripts/test_audit_channel_signal.py @@ -4,14 +4,13 @@ AUC scorers, the JSON + markdown rendering paths, and two integrity properties against the committed ``release/`` bundles: -1. ``lead_source`` and ``first_touch_channel`` carry identical values in - every tier (the feature dictionary's claim). +1. ``first_touch_channel`` is absent from the flat snapshot (removed in PR 8.1 + because it was byte-identical to ``lead_source`` in v1). 2. The committed ``docs/release/channel_signal_audit.{md,json}`` are byte-identical to a fresh run of the audit script. Both properties fail loudly if the bundles are regenerated without -re-running the audit, or if the simulator ever diverges the two -channel columns. +re-running the audit. """ from __future__ import annotations @@ -59,7 +58,6 @@ def _toy_split(n_per_channel: int = 20) -> pd.DataFrame: rows.append( { "lead_source": ch, - "first_touch_channel": ch, "converted_within_90_days": bool(i < int(rate * n_per_channel)), } ) @@ -118,7 +116,6 @@ def test_audit_channel_oos_auc_handles_unseen_test_categories() -> None: test = pd.DataFrame( { "lead_source": ["A", "B", "C", "Z", "Z"], # Z is unseen - "first_touch_channel": ["A", "B", "C", "Z", "Z"], "converted_within_90_days": [True, True, False, True, False], } ) @@ -147,7 +144,9 @@ def test_audit_tier_runs_every_channel_column() -> None: train = _toy_split() tier = audit_module.audit_tier(train, "intro", test=train) cols = {c.column for c in tier.columns} - assert cols == {"lead_source", "first_touch_channel"} + # ``first_touch_channel`` was removed from the flat snapshot (byte-identical + # to ``lead_source`` in v1); only ``lead_source`` remains as the default. + assert cols == {"lead_source"} assert tier.tier == "intro" assert tier.n_train == 60 assert tier.n_test == 60 @@ -180,13 +179,16 @@ def test_render_json_round_trip() -> None: def test_render_markdown_collapses_identical_columns() -> None: """When two columns produce identical audits, the renderer groups them.""" - train = _toy_split() # lead_source == first_touch_channel by construction - tier = audit_module.audit_tier(train, "intro", test=train) + train = _toy_split() + # Duplicate lead_source under a second name so the audit values are identical. + train["lead_source_copy"] = train["lead_source"] + two_cols = ("lead_source", "lead_source_copy") + tier = audit_module.audit_tier(train, "intro", test=train, channel_columns=two_cols) report = audit_module.AuditReport( release_dir="release", task="converted_within_90_days", label_column="converted_within_90_days", - channel_columns=audit_module.CHANNEL_COLUMNS, + channel_columns=two_cols, tiers=(tier,), industry_mql_to_sql_benchmarks=audit_module.INDUSTRY_MQL_TO_SQL_BENCHMARKS, ) @@ -200,13 +202,14 @@ def test_render_markdown_renders_distinct_columns_separately() -> None: """When two columns differ, the renderer keeps them in separate sections.""" train = _toy_split() - train["first_touch_channel"] = "A" # force divergence from lead_source - tier = audit_module.audit_tier(train, "intro", test=train) + train["other_channel"] = "A" # force divergence from lead_source + two_cols = ("lead_source", "other_channel") + tier = audit_module.audit_tier(train, "intro", test=train, channel_columns=two_cols) report = audit_module.AuditReport( release_dir="release", task="converted_within_90_days", label_column="converted_within_90_days", - channel_columns=audit_module.CHANNEL_COLUMNS, + channel_columns=two_cols, tiers=(tier,), industry_mql_to_sql_benchmarks=audit_module.INDUSTRY_MQL_TO_SQL_BENCHMARKS, ) @@ -309,15 +312,20 @@ def test_main_reports_missing_train_split( @pytest.mark.skipif(not _RELEASE_BUNDLES_PRESENT, reason="release/ bundles not present") @pytest.mark.parametrize("tier", _TIERS) -def test_lead_source_equals_first_touch_channel_in_v1(tier: str) -> None: - """Locks the feature-dict claim that the two channel columns are - identical in v1. If the simulator ever diverges them, this test - fails and ``docs/release/feature_dictionary.md`` must be updated.""" +def test_first_touch_channel_absent_from_flat_snapshot(tier: str) -> None: + """Locks the v1 decision to drop ``first_touch_channel`` from the flat + snapshot (it was byte-identical to ``lead_source`` in every row). + + If this column reappears in a future bundle regeneration without a + deliberate feature-spec change, this test will catch it. + """ for split in ("train", "test", "valid"): df = audit_module.load_split(_REPO_ROOT / "release", tier, split) - assert (df["lead_source"] == df["first_touch_channel"]).all(), ( - f"{tier}/{split}: lead_source diverges from first_touch_channel" + assert "first_touch_channel" not in df.columns, ( + f"{tier}/{split}: first_touch_channel is present in flat snapshot but " + "was removed from LEAD_SNAPSHOT_FEATURES in PR 8.1 — " + "bundles need to be regenerated." ) diff --git a/tests/scripts/test_build_v5_snapshot.py b/tests/scripts/test_build_v5_snapshot.py index de6941e..4d8e163 100644 --- a/tests/scripts/test_build_v5_snapshot.py +++ b/tests/scripts/test_build_v5_snapshot.py @@ -58,7 +58,7 @@ def _make_snapshot( "expected_acv": rng.uniform(5_000, 200_000, size=n).round(0), "inbound_touch_count": rng.poisson(3, size=n), "outbound_touch_count": rng.poisson(2, size=n), - "touches_week_1": rng.poisson(2, size=n), + "touches_days_0_7": rng.poisson(2, size=n), "days_since_first_touch": rng.uniform(0, 14, size=n).round(1), "session_count": rng.poisson(4, size=n).astype(float), "activity_count": rng.poisson(3, size=n), diff --git a/tests/scripts/test_build_v6_snapshot.py b/tests/scripts/test_build_v6_snapshot.py index 900d14d..3208ed7 100644 --- a/tests/scripts/test_build_v6_snapshot.py +++ b/tests/scripts/test_build_v6_snapshot.py @@ -57,7 +57,7 @@ def _make_snapshot( "expected_acv": rng.uniform(5_000, 200_000, size=n).round(0), "inbound_touch_count": rng.poisson(3, size=n), "outbound_touch_count": rng.poisson(2, size=n), - "touches_week_1": rng.poisson(2, size=n), + "touches_days_0_7": rng.poisson(2, size=n), "touches_last_7_days": rng.poisson(2, size=n), "days_since_first_touch": rng.uniform(0, 14, size=n).round(1), "session_count": rng.poisson(4, size=n).astype(float), diff --git a/tests/scripts/test_build_v7_snapshot.py b/tests/scripts/test_build_v7_snapshot.py index af1dc18..0993c32 100644 --- a/tests/scripts/test_build_v7_snapshot.py +++ b/tests/scripts/test_build_v7_snapshot.py @@ -57,7 +57,7 @@ def _make_snapshot( "expected_acv": rng.uniform(5_000, 200_000, size=n).round(0), "inbound_touch_count": rng.poisson(3, size=n), "outbound_touch_count": rng.poisson(2, size=n), - "touches_week_1": rng.poisson(2, size=n), + "touches_days_0_7": rng.poisson(2, size=n), "touches_last_7_days": rng.poisson(2, size=n), "days_since_first_touch": rng.uniform(0, 14, size=n).round(1), "session_count": rng.poisson(4, size=n).astype(float), diff --git a/tests/simulation/test_engine.py b/tests/simulation/test_engine.py index 073b391..6d957fa 100644 --- a/tests/simulation/test_engine.py +++ b/tests/simulation/test_engine.py @@ -520,18 +520,22 @@ def test_window_1_almost_no_conversions(self) -> None: assert conv == 0 def test_late_conversions_excluded(self) -> None: - """Leads that convert after the label window should not be labeled positive.""" + """Leads that convert after the label window should not be labeled positive. + + With the inclusive boundary fix (<= label_window_days), a conversion on + day label_window_days is now correctly labeled positive. + """ from datetime import date r30 = self._run_with_window(label_window_days=30, seed=42, n_leads=300) for lead in r30.leads: if lead.converted_within_90_days: - # If labeled positive, conversion day must be < label_window_days. + # If labeled positive, conversion day must be <= label_window_days. assert lead.conversion_timestamp is not None created = date.fromisoformat(lead.lead_created_at) converted = date.fromisoformat(lead.conversion_timestamp) day_offset = (converted - created).days - assert day_offset < 30, ( + assert day_offset <= 30, ( f"Lead {lead.lead_id} labeled positive but converted on day {day_offset}" ) diff --git a/tests/test_primary_task_threading.py b/tests/test_primary_task_threading.py index b72998e..602a0e0 100644 --- a/tests/test_primary_task_threading.py +++ b/tests/test_primary_task_threading.py @@ -198,7 +198,7 @@ def sample_snapshot(self) -> pd.DataFrame: "expected_acv": [50000], "inbound_touch_count": [5], "outbound_touch_count": [3], - "touches_week_1": [2], + "touches_days_0_7": [2], "days_since_first_touch": [10], "session_count": [4], "activity_count": [2], diff --git a/tests/validation/test_leakage_probes.py b/tests/validation/test_leakage_probes.py index ed0a3fe..216da69 100644 --- a/tests/validation/test_leakage_probes.py +++ b/tests/validation/test_leakage_probes.py @@ -946,8 +946,15 @@ def test_probe_registry_covers_every_module_level_probe() -> None: def test_probe_registry_taxonomies_are_known() -> None: - """Every spec carries one of the five documented taxonomies.""" - valid = {"direct", "time_window", "relational", "split", "model_realism"} + """Every spec carries one of the documented taxonomies.""" + valid = { + "direct", + "time_window", + "relational", + "split", + "model_realism", + "snapshot_consistency", + } for spec in PROBE_REGISTRY.values(): assert spec.taxonomy in valid, f"{spec.name} has unknown taxonomy {spec.taxonomy!r}" @@ -955,3 +962,119 @@ def test_probe_registry_taxonomies_are_known() -> None: def test_probe_registry_callables_are_callable() -> None: for spec in PROBE_REGISTRY.values(): assert callable(spec.callable), f"{spec.name}.callable is not callable" + + +# --------------------------------------------------------------------------- +# probe_opportunity_snapshot_consistency — correctness tests +# --------------------------------------------------------------------------- + + +def _make_opp_consistency_fixtures( + *, + close_outcome_gate: bool, +) -> tuple[pd.DataFrame, pd.DataFrame]: + """Build (leads_df, opportunities_df) for snapshot_consistency probe tests. + + Parameters + ---------- + close_outcome_gate: + When True, simulate the *old buggy* snapshot builder that used + ``close_outcome.isna()`` to determine whether an opportunity was open. + This means opportunities closed *after* the snapshot cutoff are + incorrectly treated as closed at snapshot time, so ``has_open_opportunity`` + is False for those leads — a mismatch with the correct value. + + When False, simulate the *correct* builder that uses + ``closed_at > cutoff`` semantics. + """ + # snapshot_day = 30. Lead created 2024-01-01; cutoff = 2024-01-31. + snapshot_day = 30 + lead_created = "2024-01-01T00:00:00" + + # Opportunity created on day 10 (before cutoff), closed on day 45 (after cutoff). + # Under CORRECT semantics: closed_at (day 45) > cutoff (day 30) → open → has_open=True. + # Under BUGGY semantics: close_outcome is not null → treated as closed → has_open=False. + opp_created_at = "2024-01-11T00:00:00" + opp_closed_at = "2024-02-15T00:00:00" # day 45, after cutoff + + leads_df = pd.DataFrame( + { + "lead_id": ["lead_001"], + "lead_created_at": [lead_created], + "has_open_opportunity": [ + # buggy builder says False (used close_outcome gate); + # correct builder says True (used closed_at > cutoff gate). + False if close_outcome_gate else True + ], + "opportunity_estimated_acv": [float("nan") if close_outcome_gate else 50_000.0], + } + ) + + opportunities_df = pd.DataFrame( + { + "lead_id": ["lead_001"], + "created_at": [opp_created_at], + "estimated_acv": [50_000.0], + "closed_at": [opp_closed_at], + "close_outcome": ["closed_won"], # non-null — the bug trigger + } + ) + + return leads_df, opportunities_df, snapshot_day + + +def test_probe_opportunity_snapshot_consistency_fires_on_close_outcome_gate() -> None: + """Probe fires when snapshot used close_outcome.isna() (the old bug).""" + from leadforge.validation.leakage_probes import ( + CHANNEL_OPPORTUNITY_SNAPSHOT_CONSISTENCY, + probe_opportunity_snapshot_consistency, + ) + + leads_df, opps_df, snapshot_day = _make_opp_consistency_fixtures(close_outcome_gate=True) + findings = probe_opportunity_snapshot_consistency(leads_df, opps_df, snapshot_day) + + assert len(findings) > 0, "expected probe to fire but got no findings" + channels = {f.channel for f in findings} + assert CHANNEL_OPPORTUNITY_SNAPSHOT_CONSISTENCY in channels + # Both has_open_opportunity and opportunity_estimated_acv should be flagged. + details = {f.detail for f in findings} + assert "has_open_opportunity" in details + assert "opportunity_estimated_acv" in details + + +def test_probe_opportunity_snapshot_consistency_silent_on_correct_gate() -> None: + """Probe is silent when snapshot used correct closed_at > cutoff semantics.""" + from leadforge.validation.leakage_probes import probe_opportunity_snapshot_consistency + + leads_df, opps_df, snapshot_day = _make_opp_consistency_fixtures(close_outcome_gate=False) + findings = probe_opportunity_snapshot_consistency(leads_df, opps_df, snapshot_day) + + assert findings == [], f"expected no findings but got: {findings}" + + +def test_probe_opportunity_snapshot_consistency_raises_on_missing_columns() -> None: + """Probe raises ValueError when required columns are absent.""" + from leadforge.validation.leakage_probes import probe_opportunity_snapshot_consistency + + bad_leads = pd.DataFrame({"lead_id": ["lead_001"]}) # missing lead_created_at etc. + good_opps = pd.DataFrame( + { + "lead_id": ["lead_001"], + "created_at": ["2024-01-01"], + "estimated_acv": [1.0], + "closed_at": [None], + } + ) + with pytest.raises(ValueError, match="leads_df is missing required columns"): + probe_opportunity_snapshot_consistency(bad_leads, good_opps, 30) + + good_leads = pd.DataFrame( + { + "lead_id": ["lead_001"], + "lead_created_at": ["2024-01-01"], + "has_open_opportunity": [False], + } + ) + bad_opps = pd.DataFrame({"lead_id": ["lead_001"]}) # missing closed_at etc. + with pytest.raises(ValueError, match="opportunities_df is missing required columns"): + probe_opportunity_snapshot_consistency(good_leads, bad_opps, 30)