Merged
22 changes: 10 additions & 12 deletions .agent-plan.md
@@ -70,18 +70,16 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family

_Source: `docs/external_review/summaries/v1_release_review_synthesis.md` — cross-model review of the v1 preview site, 2026-05-25. Depends on Phase 7 (all except PR 7.3). PR 7.3 depends on this phase._

- [ ] **PR 8.1** — `fix(render,validation,schema): snapshot fixes + noise clamps + schema cleanup + bundle regen`
- **`has_open_opportunity` / `opportunity_estimated_acv` post-snapshot leak** (HIGH): `leadforge/render/snapshots.py` — change `od["close_outcome"].isna()` gate to `closed_at is null OR closed_at > lead_created_at + snapshot_day`. This is the most critical code fix; measure AUC delta before/after and document.
- **Add flat-feature snapshot-consistency probe** (MEDIUM): `leadforge/validation/leakage_probes.py` — probe that recomputes opportunity-derived features under the correct `closed_at > cutoff` semantics and asserts equality with the shipped columns. The missing probe that would have caught the bug above automatically.
- **`to_dataframes_snapshot_safe` guard** (MEDIUM): assert `"lead_id" in events.columns` for each `SNAPSHOT_FILTERED_TABLES` entry; fail loud on missing key instead of silently producing all-NaT cutoffs.
- **Clamp Gaussian noise to physical ranges** (MEDIUM/LOW): post-noise clamp per column type in `_apply_difficulty_distortions` (`days_since_x >= 0`, monetary values `>= 0`). A tiny change that removes the most visible synthetic-data tell and makes the "non-physical values" known limitation obsolete.
- **Exempt `total_touches_all` from difficulty distortion** (MEDIUM): remove from `_NUMERIC_DISTORTION_COLS` or add a named exclusion list. Up to 18% NaN on the trap in Advanced muddies the leakage lesson it's supposed to deliver cleanly.
- **Drop `first_touch_channel`** (MEDIUM): remove from `LEAD_SNAPSHOT_FEATURES`, flat export, and feature dictionary. It is byte-identical to `lead_source` in v1, and removing the redundant column is cleaner than documenting the redundancy.
- **Rename `touches_week_1` → `touches_days_0_7`** (LOW): update `LEAD_SNAPSHOT_FEATURES`, snapshot builder, and feature dictionary. The implementation spans days 0–7 inclusive (8 values); the name implies 7.
- **Label window: `<` → `<=`** (MEDIUM): `engine.py` — `state.conversion_day < config.label_window_days` should be `<=`. Spec says "within 90 days" (inclusive); the break-me guide invites students to audit this boundary.
- **Regenerate all three public tier bundles**, update `release/` artifacts, rerun `validate_release_candidate`, sync claims register.
- Labels: `type: bugfix`, `layer: render`, `layer: validation`, `layer: schema`
- Size: M-L (~400 lines code + regenerated bundles)
- [x] **PR 8.1** — `fix(render,validation,schema): snapshot fixes + noise clamps + schema cleanup + bundle regen`
- **`has_open_opportunity` / `opportunity_estimated_acv` post-snapshot leak** (HIGH): Fixed in `leadforge/render/snapshots.py` — now uses `closed_at is null OR closed_at > lead_created_at + snapshot_day`. AUC delta vs. pre-fix: intro LR 0.879→0.879, intermediate 0.886→0.886, advanced 0.886→0.886 (AUC stable; the bug affected correctness not ranking signal for this DGP).
- **Add flat-feature snapshot-consistency probe** (MEDIUM): `probe_opportunity_snapshot_consistency` added to `leadforge/validation/leakage_probes.py`; registered as opt-in in `PROBE_REGISTRY`.
- **`to_dataframes_snapshot_safe` guard** (MEDIUM): `_filter_to_snapshot_window` raises `ValueError` with descriptive message on missing `lead_id` column.
- **Clamp Gaussian noise to physical ranges** (MEDIUM/LOW): Post-noise clamp applied in `_apply_difficulty_distortions` for `days_since_*`, ACV, and count columns.
- **Exempt `total_touches_all` from difficulty distortion** (MEDIUM): Added `_DISTORTION_EXEMPT_COLS = {"total_touches_all"}` and excluded from noise/missingness/outlier passes.
- **Drop `first_touch_channel`** (MEDIUM): Removed from `LEAD_SNAPSHOT_FEATURES` and task exports; still present in relational `leads.parquet`. v5/v6/v7 pipelines updated with compat rename.
- **Rename `touches_week_1` → `touches_days_0_7`** (LOW): Updated `LEAD_SNAPSHOT_FEATURES`, snapshot builder, contract tests. v5/v6/v7 pipelines keep `touches_week_1` via RENAME_MAP for backwards compat.
- **Label window: `<` → `<=`** (MEDIUM): Fixed in `engine.py`; test updated to use inclusive assertion.
- **Regenerated all three public tier bundles**: intro 41.5% conv rate, intermediate 20.1%, advanced 7.9%. Validation: PASS — 3 tiers, 5 seeds, 0 leakage findings. 1439 tests pass.
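
The snapshot gate described in the first PR 8.1 item can be sketched as follows. This is a minimal illustration, not the code in `leadforge/render/snapshots.py`: the `Opportunity` record, the helper names, and the sample dates are hypothetical, and only the two gate conditions come from the plan text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Opportunity:
    # Hypothetical record shape, for illustration only.
    closed_at: Optional[datetime]   # None while the opportunity is open
    close_outcome: Optional[str]    # None until the opportunity closes


def snapshot_cutoff(lead_created_at: datetime, snapshot_day: int) -> datetime:
    """Cutoff timestamp: lead_created_at + snapshot_day."""
    return lead_created_at + timedelta(days=snapshot_day)


def is_open_pre_fix(opp: Opportunity) -> bool:
    """Old gate: looks at the final outcome, so it sees past the snapshot."""
    return opp.close_outcome is None


def is_open_at_snapshot(opp: Opportunity, cutoff: datetime) -> bool:
    """New gate: closed_at is null OR closed_at > cutoff."""
    return opp.closed_at is None or opp.closed_at > cutoff


lead_created_at = datetime(2026, 1, 1)
cutoff = snapshot_cutoff(lead_created_at, snapshot_day=30)

# Closes on day 45, i.e. after the snapshot: it was still open at the cutoff.
late_close = Opportunity(closed_at=lead_created_at + timedelta(days=45),
                         close_outcome="won")
# Closes on day 10, i.e. before the snapshot.
early_close = Opportunity(closed_at=lead_created_at + timedelta(days=10),
                          close_outcome="lost")
still_open = Opportunity(closed_at=None, close_outcome=None)
```

For `late_close` the two gates disagree: the old gate reports the opportunity as not open because it already knows the day-45 outcome, which is the post-snapshot leak the fix removes.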

- [ ] **PR 8.2** — `docs(release): difficulty-axis reframe + disclosure hardening`
- **Reframe difficulty axis throughout all copy** (HIGH): README, dataset card, Kaggle/HF metadata, tier table, notebook headers. Change from "Intro / Intermediate / Advanced" framing of modelling difficulty to explicit prevalence/noise tier framing. Recommended: "Intro = high-prevalence classroom warm-up; Intermediate = default benchmark; Advanced = low-prevalence, calibration, and noise-handling exercise." AUC is flat across tiers; the "three difficulty tiers" framing is misleading to anyone who reads it as model complexity.
102 changes: 1 addition & 101 deletions docs/release/channel_signal_audit.json
@@ -1,7 +1,6 @@
{
"channel_columns": [
"lead_source",
"first_touch_channel"
"lead_source"
],
"industry_mql_to_sql_benchmarks": {
"Email": 0.005,
@@ -46,39 +45,6 @@
"train_conversion_rate": 0.4145714285714286,
"univariate_auc_in_sample": 0.5199794894149169,
"univariate_auc_out_of_sample": 0.5013517441860464
},
{
"channels": [
{
"conversion_rate": 0.43439490445859874,
"n": 1570,
"n_converted": 682,
"name": "inbound_marketing",
"share": 0.44857142857142857
},
{
"conversion_rate": 0.39111747851002865,
"n": 698,
"n_converted": 273,
"name": "partner_referral",
"share": 0.19942857142857143
},
{
"conversion_rate": 0.4025974025974026,
"n": 1232,
"n_converted": 496,
"name": "sdr_outbound",
"share": 0.352
}
],
"column": "first_touch_channel",
"n_test": 750,
"n_train": 3500,
"rate_spread": 0.04327742594857009,
"test_conversion_rate": 0.4266666666666667,
"train_conversion_rate": 0.4145714285714286,
"univariate_auc_in_sample": 0.5199794894149169,
"univariate_auc_out_of_sample": 0.5013517441860464
}
],
"n_test": 750,
@@ -121,39 +87,6 @@
"train_conversion_rate": 0.20142857142857143,
"univariate_auc_in_sample": 0.5212431012826857,
"univariate_auc_out_of_sample": 0.5139326835180411
},
{
"channels": [
{
"conversion_rate": 0.21273885350318472,
"n": 1570,
"n_converted": 334,
"name": "inbound_marketing",
"share": 0.44857142857142857
},
{
"conversion_rate": 0.17621776504297995,
"n": 698,
"n_converted": 123,
"name": "partner_referral",
"share": 0.19942857142857143
},
{
"conversion_rate": 0.2012987012987013,
"n": 1232,
"n_converted": 248,
"name": "sdr_outbound",
"share": 0.352
}
],
"column": "first_touch_channel",
"n_test": 750,
"n_train": 3500,
"rate_spread": 0.03652108846020477,
"test_conversion_rate": 0.22266666666666668,
"train_conversion_rate": 0.20142857142857143,
"univariate_auc_in_sample": 0.5212431012826857,
"univariate_auc_out_of_sample": 0.5139326835180411
}
],
"n_test": 750,
@@ -196,39 +129,6 @@
"train_conversion_rate": 0.07914285714285714,
"univariate_auc_in_sample": 0.5083011208921436,
"univariate_auc_out_of_sample": 0.5225784296892246
},
{
"channels": [
{
"conversion_rate": 0.08152866242038216,
"n": 1570,
"n_converted": 128,
"name": "inbound_marketing",
"share": 0.44857142857142857
},
{
"conversion_rate": 0.07593123209169055,
"n": 698,
"n_converted": 53,
"name": "partner_referral",
"share": 0.19942857142857143
},
{
"conversion_rate": 0.07792207792207792,
"n": 1232,
"n_converted": 96,
"name": "sdr_outbound",
"share": 0.352
}
],
"column": "first_touch_channel",
"n_test": 750,
"n_train": 3500,
"rate_spread": 0.005597430328691616,
"test_conversion_rate": 0.07866666666666666,
"train_conversion_rate": 0.07914285714285714,
"univariate_auc_in_sample": 0.5083011208921436,
"univariate_auc_out_of_sample": 0.5225784296892246
}
],
"n_test": 750,
8 changes: 4 additions & 4 deletions docs/release/channel_signal_audit.md
@@ -18,7 +18,7 @@ Audit produced by `scripts/audit_channel_signal.py`; see `channel_signal_audit.j

`n_train = 3500` (90-day conversion rate 41.46%); `n_test = 750` (rate 42.67%).

### Columns: `lead_source`, `first_touch_channel` (audit values identical)
### Column: `lead_source`

Per-channel rate spread (max − min): **0.0433** · In-sample univariate AUC: **0.5200** · Out-of-sample univariate AUC: **0.5014**
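
As a worked check of the spread figure, the sketch below recomputes it from the per-channel counts recorded for the intro tier in `channel_signal_audit.json`. The function names are illustrative and are not taken from `scripts/audit_channel_signal.py`.

```python
# (n, n_converted) per channel, intro tier, as listed in channel_signal_audit.json.
intro_counts = {
    "inbound_marketing": (1570, 682),
    "partner_referral": (698, 273),
    "sdr_outbound": (1232, 496),
}


def channel_rates(counts):
    """Conversion rate per channel: n_converted / n."""
    return {name: converted / n for name, (n, converted) in counts.items()}


def rate_spread(counts):
    """Per-channel rate spread: max rate minus min rate."""
    rates = channel_rates(counts).values()
    return max(rates) - min(rates)


def pooled_rate(counts):
    """Overall conversion rate across all channels."""
    total_n = sum(n for n, _ in counts.values())
    total_converted = sum(c for _, c in counts.values())
    return total_converted / total_n


print(f"spread = {rate_spread(intro_counts):.4f}")   # spread = 0.0433
print(f"pooled = {pooled_rate(intro_counts):.4f}")   # pooled = 0.4146
```

The pooled rate is 1451 / 3500, which matches the 41.46% training conversion rate quoted for this tier.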

@@ -32,7 +32,7 @@ Per-channel rate spread (max − min): **0.0433** · In-sample univariate AUC:

`n_train = 3500` (90-day conversion rate 20.14%); `n_test = 750` (rate 22.27%).

### Columns: `lead_source`, `first_touch_channel` (audit values identical)
### Column: `lead_source`

Per-channel rate spread (max − min): **0.0365** · In-sample univariate AUC: **0.5212** · Out-of-sample univariate AUC: **0.5139**

@@ -46,7 +46,7 @@ Per-channel rate spread (max − min): **0.0365** · In-sample univariate AUC:

`n_train = 3500` (90-day conversion rate 7.91%); `n_test = 750` (rate 7.87%).

### Columns: `lead_source`, `first_touch_channel` (audit values identical)
### Column: `lead_source`

Per-channel rate spread (max − min): **0.0056** · In-sample univariate AUC: **0.5083** · Out-of-sample univariate AUC: **0.5226**

@@ -62,5 +62,5 @@ The numbers above answer one question: *how strongly does channel alone signal 9

Two empirical observations a reader can make from the numbers above:

1. **The out-of-sample univariate AUC is the comparable number** for any external baseline. It uses train-derived rates scored against held-out test labels — the same shape as the `source_only` HistGBM baseline reported in `release/validation/validation_report.json`, which is built on the same task splits with `lead_source` + `first_touch_channel` as the only features. The in-sample number is biased upward by construction — small at v1's N but visible — and is reported here for transparency rather than comparison.
1. **The out-of-sample univariate AUC is the comparable number** for any external baseline. It uses train-derived rates scored against held-out test labels — the same shape as the `source_only` HistGBM baseline reported in `release/validation/validation_report.json`, which is built on the same task splits with `lead_source` as the only feature. The in-sample number is biased upward by construction — small at v1's N but visible — and is reported here for transparency rather than comparison.
2. **The numerical conclusion is bundle-specific.** When the per-channel rate spread is small and the OOS univariate AUC is close to chance, channel alone is a weak feature for the bundle this audit was run against. v1's bundles currently produce that outcome (see the per-tier sections above) — consistent with the design: the simulator drives conversion through motif-family hazards keyed off latent traits, not channel-conditional probabilities. Channel-conditional encoding is tracked as post-v1 work in `docs/release/post_v1_roadmap.md`.
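
The distinction between the in-sample and out-of-sample univariate AUC can be sketched as below. This is an assumption about the general shape of the computation (train-derived channel rates used as scores), not a copy of `scripts/audit_channel_signal.py`. The helper names, the fallback for unseen channels, and the toy data are all hypothetical.

```python
from collections import defaultdict


def fit_channel_rates(channels, labels):
    """Per-channel conversion rate estimated on the training split only."""
    totals, hits = defaultdict(int), defaultdict(int)
    for channel, label in zip(channels, labels):
        totals[channel] += 1
        hits[channel] += label
    return {channel: hits[channel] / totals[channel] for channel in totals}


def auc(scores, labels):
    """Pairwise (Mann-Whitney) AUC; tied scores count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def univariate_auc(train_channels, train_labels, eval_channels, eval_labels):
    """Score the evaluation rows with train-derived rates, then compute AUC."""
    rates = fit_channel_rates(train_channels, train_labels)
    # Assumption: channels unseen in training fall back to the overall train rate.
    fallback = sum(train_labels) / len(train_labels)
    scores = [rates.get(channel, fallback) for channel in eval_channels]
    return auc(scores, eval_labels)


# Toy data, for illustration only.
train_x, train_y = ["a", "a", "b", "b"], [1, 1, 0, 1]
test_x, test_y = ["a", "b", "b"], [1, 0, 1]

in_sample = univariate_auc(train_x, train_y, train_x, train_y)
out_of_sample = univariate_auc(train_x, train_y, test_x, test_y)
```

The in-sample call scores the same rows the rates were fitted on, which is why that number is biased upward. The out-of-sample call holds the rates fixed and evaluates them on rows they have not seen.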
24 changes: 14 additions & 10 deletions docs/release/feature_dictionary.md
@@ -8,14 +8,14 @@ analytical role and adds the prose explanation, modelling
recommendations, and pedagogical caveats that don't fit a CSV row.

The grouping below covers every feature in the public student-facing
snapshot — the same 32 columns ship in `intro`, `intermediate`, and
snapshot — the same 31 columns ship in `intro`, `intermediate`, and
`advanced` bundles. The instructor companion adds the hidden truth
in `metadata/`; it does not change the feature list.

| Category | Columns | Modelling default |
|---|---|---|
| Lead identity & timing | 4 | drop `lead_id`; keep `lead_created_at` for cohort splits, drop for production |
| Lead source & channel | 2 | keep both |
| Lead source & channel | 1 | keep |
| Firmographics | 5 | keep all |
| Personographics | 3 | keep all (categorical encoders welcome) |
| Engagement (snapshot-window) | 10 | keep all |
@@ -35,16 +35,18 @@ in `metadata/`; it does not change the feature list.

## Lead source and channel

Two columns describe how each lead entered the funnel. They are
One column describes how each lead entered the funnel. It is
populated from the recipe's GTM-motion mix
(`inbound_marketing` 45%, `sdr_outbound` 35%, `partner_referral`
20%) and are identical between the two columns in v1 — both encode
the same origination channel under different field names.
(`inbound_marketing` 45%, `sdr_outbound` 35%, `partner_referral` 20%).

| Column | Dtype | Why it might matter |
|---|---|---|
| `lead_source` | string | Origination channel; one of `inbound_marketing` / `sdr_outbound` / `partner_referral`. |
| `first_touch_channel` | string | Marketing channel of the first recorded touch. Always equals `lead_source` in v1; the field exists to support post-v1 work where origination and first-touch can diverge. |

**Note.** `first_touch_channel` was removed from the task snapshot in PR 8.1: in v1
it is byte-identical to `lead_source` (both are set to the same origination value), so
it adds no information. It still appears in the relational `tables/leads.parquet` for
post-v1 use cases where origination and first-touch can diverge.

**Caveat.** Per [`docs/release/channel_signal_audit.md`](channel_signal_audit.md),
v1's channel signal is weak: per-channel rate spread ≤ 0.043 and
@@ -99,7 +101,7 @@ features cannot encode events that drove the late-window outcome.
| `pricing_page_views` | Int64 | Cumulative pricing-page views across sessions. |
| `demo_page_views` | Int64 | Cumulative demo-page views across sessions. |
| `total_session_duration_seconds` | Int64 | Cumulative seconds across all sessions. |
| `touches_week_1` | Int64 | Touches in days 0–7 inclusive (early urgency proxy; the snapshot builder uses `_day <= 7`, which is 8 day values). |
| `touches_days_0_7` | Int64 | Touches in days 0–7 inclusive (early urgency proxy). Renamed from `touches_week_1` in PR 8.1 for precision: the window covers 8 day values (0, 1, …, 7). |
| `touches_last_7_days` | Int64 | Touches in the last 7 days of the snapshot window — for `snapshot_day=30`, days 24–30 inclusive (the snapshot builder uses `_day > snapshot_day - 7`). |
| `days_since_first_touch` | Float64 | NaN if the lead has had zero touches by snapshot day. |
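
The two touch-window definitions in the table can be checked with a small sketch. The comparisons `_day <= 7` and `_day > snapshot_day - 7` come from the descriptions above. The function signatures, the lower bound at day 0, and the upper bound at `snapshot_day` are assumptions made for illustration, and this is not the snapshot builder's code.

```python
def touches_days_0_7(touch_days):
    """Count touches on days 0 through 7 inclusive (8 distinct day values)."""
    return sum(1 for day in touch_days if 0 <= day <= 7)


def touches_last_7_days(touch_days, snapshot_day):
    """Count touches with day > snapshot_day - 7, up to and including snapshot_day.

    The upper bound is an assumption: touches after the snapshot are taken
    to be outside the snapshot window.
    """
    return sum(1 for day in touch_days if snapshot_day - 7 < day <= snapshot_day)


# Day offsets relative to lead creation, for illustration only.
touch_days = [0, 3, 7, 8, 23, 24, 30, 31]

early = touches_days_0_7(touch_days)                        # days 0, 3, 7
recent = touches_last_7_days(touch_days, snapshot_day=30)   # days 24, 30

# The early window spans 8 day values, which motivated the rename.
early_window = list(range(0, 8))
# The trailing window for snapshot_day=30 spans days 24 through 30.
recent_window = [d for d in range(0, 31) if d > 30 - 7]
```

The two windows are not the same length: days 0 to 7 cover eight values, while the trailing window covers seven.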

Expand Down Expand Up @@ -183,8 +185,10 @@ in the table above, including the IDs — the recommendation is what to
3. **Categoricals — encode.** One-hot or target-encode `industry`,
`region`, `employee_band`, `estimated_revenue_band`,
`process_maturity_band`, `role_function`, `seniority`,
`buyer_role`, `lead_source`, `first_touch_channel`. The two
channel columns carry identical values in v1; pick one.
`buyer_role`, `lead_source`.
(`first_touch_channel` was removed from the snapshot in PR 8.1 — it
was byte-identical to `lead_source` in v1; it still exists in
`tables/leads.parquet` but not in the task splits.)
4. **Engagement and funnel — keep all.** The `Float64` columns carry
NaN for "no event in window", which is itself a signal — encode
missingness explicitly rather than imputing to zero blindly.
2 changes: 2 additions & 0 deletions leadforge/pipelines/build_v5.py
@@ -84,6 +84,8 @@
"activity_count": "sales_activities",
"converted_within_90_days": "converted",
"total_touches_all": "__leakage__total_touches_90d",
# touches_days_0_7 renamed back to touches_week_1 for v5 CSV format compatibility.
"touches_days_0_7": "touches_week_1",
}


2 changes: 2 additions & 0 deletions leadforge/pipelines/common.py
@@ -114,6 +114,8 @@
"session_count": "web_sessions",
"activity_count": "sales_activities",
"converted_within_90_days": "converted",
# touches_days_0_7 renamed back to touches_week_1 for v6/v7 CSV format compatibility.
"touches_days_0_7": "touches_week_1",
}


10 changes: 9 additions & 1 deletion leadforge/render/relational_snapshot_safe.py
@@ -101,7 +101,7 @@ def to_dataframes_snapshot_safe(
df = dfs[name]
if name == "opportunities":
df = _drop_columns(df, BANNED_OPP_COLUMNS)
out[name] = _filter_to_snapshot_window(df, anchor, ts_col, horizon)
out[name] = _filter_to_snapshot_window(df, anchor, ts_col, horizon, table_name=name)

return out

@@ -143,7 +143,15 @@ def _filter_to_snapshot_window(
anchor: pd.DataFrame,
ts_col: str,
horizon: pd.Timedelta,
table_name: str = "<unknown>",
) -> pd.DataFrame:
# Column-presence guard runs before the empty check: a misconfigured table
# that happens to be empty should still raise, not silently pass through.
if "lead_id" not in events.columns:
raise ValueError(
f"SNAPSHOT_FILTERED_TABLES entry '{table_name}' is missing a 'lead_id' column; "
"cannot apply per-lead snapshot filter."
)
if len(events) == 0:
return events
merged = events.merge(anchor, on="lead_id", how="left")