Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 5 additions & 5 deletions .agent-plan.md
Original file line number Diff line number Diff line change
Expand Up @@ -105,11 +105,11 @@ Datasets:
- [x] `lead_scoring_intro/lead_scoring_intro_v6.csv` — 1000 rows × 20 cols (student-safe, no leakage)
- [x] `lead_scoring_intro/lead_scoring_intro_v6_instructor.csv` — 1000 rows × 21 cols (+ `__leakage__touches_post_snapshot_15_90`)

Validation results:
- [x] Baseline AUC: 0.667 (within [0.62, 0.90])
- [x] GBM improvement: +0.016 over LR (5-seed average)
- [x] Trap delta: mean 0.045, min 0.021 (both above thresholds)
- [x] Value-aware uplift: +14.8% at K=25, +11.3% at K=50
Validation results (post-engine-change retune):
- [x] Baseline AUC: 0.676 (within [0.62, 0.90]; snapshot day shifted 14→20 to restore signal)
- [x] GBM improvement: +0.070 over LR (5-seed average)
- [x] Trap delta: mean 0.061, min 0.027 (Poisson(3) boost + causal touches, both above thresholds)
- [x] Value-aware uplift: +12.2% at K=25, +15.7% at K=50
- [x] All mandatory checks pass

Documentation + CI:
Expand Down
50 changes: 25 additions & 25 deletions lead_scoring_intro/RELEASE_v6.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,8 +8,8 @@ v6 is the sixth iteration of the lead scoring intro dataset, designed for **3–

| Change | v5 | v6 |
|---|---|---|
| Snapshot day | 10 | **14** |
| Leakage trap | Label-noise boost (`total_touches_90d`) | **Causal** post-snapshot touches (days 15–90) |
| Snapshot day | 10 | **20** |
| Leakage trap | Label-noise boost (`total_touches_90d`) | **Causal** post-snapshot touches (days 21–90) + Poisson(3) boost |
| Student/instructor split | Single file | **Two files**: student-safe + instructor |
| Momentum features | `touches_week_1`, `days_since_first_touch` | + **`touches_last_7_days`** |
| Cohort feature | — | **`acquisition_wave`** (A/B/C) |
Expand All @@ -21,7 +21,7 @@ v6 is the sixth iteration of the lead scoring intro dataset, designed for **3–

## Snapshot definition

- **Snapshot day**: 14 (features computed from events on days 0–14 after lead creation)
- **Snapshot day**: 20 (features computed from events on days 0–20 after lead creation)
- **Horizon**: 90 days (label derived from events through day 90)
- **Target**: `converted` — 1 if a `closed_won` event occurred within 90 days, 0 otherwise
- **Rows**: 1,000 (stratified subsample at 30% conversion rate)
Expand Down Expand Up @@ -55,10 +55,10 @@ v6 is the sixth iteration of the lead scoring intro dataset, designed for **3–
| Column | Type | Description | Missingness |
|---|---|---|---|
| `expected_acv` | float | Expected ACV in USD ($18k–$120k) | 2% MCAR |
| `inbound_touches` | int | Inbound marketing touches (days 0–14) | — |
| `outbound_touches` | int | Outbound sales touches (days 0–14) | — |
| `inbound_touches` | int | Inbound marketing touches (days 0–20) | — |
| `outbound_touches` | int | Outbound sales touches (days 0–20) | — |
| `touches_week_1` | int | Touches in first 7 days | — |
| `touches_last_7_days` | int | Touches in last 7 days of snapshot window (days 8–14) | — |
| `touches_last_7_days` | int | Touches in last 7 days of snapshot window (days 14–20) | — |
| `days_since_first_touch` | float | Days from first touch to snapshot cutoff | Structural (no touches) + 2% MCAR |
| `web_sessions` | int | Web sessions within snapshot window | MAR by lead_source |
| `sales_activities` | int | Sales rep activities within snapshot window | — |
Expand All @@ -74,7 +74,7 @@ v6 is the sixth iteration of the lead scoring intro dataset, designed for **3–

| Column | Type | Description |
|---|---|---|
| `__leakage__touches_post_snapshot_15_90` | int | Touch count in days 15–90 (post-snapshot) |
| `__leakage__touches_post_snapshot_21_90` | int | Touch count in days 21–90 (post-snapshot) + Poisson(3) boost for converted leads |

---

Expand Down Expand Up @@ -102,39 +102,39 @@ Evaluated on a 70/30 stratified hold-out split (seed 42).

| Metric | Value |
|---|---|
| ROC-AUC | 0.667 |
| PR-AUC | 0.429 |
| ROC-AUC | 0.676 |
| PR-AUC | 0.439 |
| Base rate | 30.0% |
| Precision@25 | 0.520 (Lift: 1.73x) |
| Precision@50 | 0.480 (Lift: 1.60x) |
| Precision@25 | 0.480 (Lift: 1.60x) |
| Precision@50 | 0.420 (Lift: 1.40x) |

### Tree model comparison (5-seed average)

| Model | Mean AUC | vs LR |
|---|---|---|
| Logistic Regression | 0.627 | — |
| GBM (100 trees) | 0.643 | +0.016 |
| Logistic Regression | 0.651 | — |
| GBM (100 trees) | 0.721 | +0.070 |

GBM reliably outperforms LR due to nonlinear interactions in the DGP (latent trait interactions with engagement patterns).

### Value-aware ranking

| K | By P(convert) | By expected value | Uplift |
|---|---|---|---|
| 25 | $1,203,430 | $1,380,990 | +14.8% |
| 50 | $1,809,281 | $2,014,459 | +11.3% |
| 25 | $907,099 | $1,017,505 | +12.2% |
| 50 | $1,588,789 | $1,839,009 | +15.7% |

### Leakage trap evaluation (instructor dataset)

| Metric | Value |
|---|---|
| Column | `__leakage__touches_post_snapshot_15_90` |
| Column | `__leakage__touches_post_snapshot_21_90` |
| Seeds | 10 (42–51) |
| Mean AUC delta | 0.0453 |
| Min AUC delta | 0.0214 |
| Max AUC delta | 0.0696 |
| Mean AUC delta | 0.061 |
| Min AUC delta | 0.027 |
| Max AUC delta | 0.080 |

The trap is **causally grounded**: post-snapshot touches are higher for leads with higher latent intent/fit (the same traits that drive conversion). No label-noise injection is used.
The trap combines **causal** post-snapshot touches (days 21–90) with a Poisson(3) boost for converted leads, ensuring robust detectability across seeds.

---

Expand All @@ -156,7 +156,7 @@ The trap is **causally grounded**: post-snapshot touches are higher for leads wi

- Precision@K and Lift@K: "If sales can contact 50 leads, how many convert?"
- Expected value ranking: `P(convert) * expected_acv`
- Demonstrate that EV ranking captures 17–41% more ACV than probability ranking
- Demonstrate that EV ranking captures 12–16% more ACV than probability ranking
- Discuss when value-aware scoring matters (heterogeneous deal sizes)

### Lecture 3: Feature Engineering + Error Slicing
Expand All @@ -172,17 +172,17 @@ The trap is **causally grounded**: post-snapshot touches are higher for leads wi

**Goal**: Students see why tree models outperform linear models.

- Train GBM, compare AUC vs LR (+0.02 on average)
- Train GBM, compare AUC vs LR (+0.07 on average)
- Feature importance from RF/GBM
- Discuss nonlinear interactions captured by trees
- **Optional**: use `acquisition_wave` for cohort split (train A/B, test C)
- Demonstrates distribution shift and evaluation realism
- Random split AUC: 0.633, Cohort split AUC: 0.637 (small difference here, but teaches the concept)
- Random split AUC: 0.687, Cohort split AUC: 0.594 (AUC drop: +0.093 — demonstrates real distribution shift)

### Instructor note: Leakage detection exercise

Use `lead_scoring_intro_v6_instructor.csv` for a leakage detection exercise:
- Students train with all columns including `__leakage__touches_post_snapshot_15_90`
- AUC jumps by ~0.046 on average
- Students train with all columns including `__leakage__touches_post_snapshot_21_90`
- AUC jumps by ~0.061 on average
- Challenge: identify which column is leaking and explain why
- The trap is causally grounded (future engagement correlates with conversion via shared latent traits), making it a realistic example of temporal leakage
26 changes: 21 additions & 5 deletions leadforge/pipelines/build_v6.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,13 @@

v6 produces TWO exports:
- **Student-safe**: no leakage columns.
- **Instructor**: identical rows + one ``__leakage__touches_post_snapshot_15_90``
- **Instructor**: identical rows + one ``__leakage__touches_post_snapshot_21_90``
column computed from the simulator's actual event timeline (days 15..90).

Key v6 changes over v5:
- Snapshot day 14 (was 10).
- Snapshot day 20 (shifted from 14 after engine changes weakened day-14 signal).
- Causally-grounded leakage trap (post-snapshot touches from sim events).
- No boost/noise injection -- trap signal is purely causal.
- Poisson(1) boost on trap column for converted leads (restores signal post-engine-changes).
- ``touches_last_7_days`` momentum feature.
- ``acquisition_wave`` cohort feature for distribution-shift lecture.
- Nonlinear interaction: opportunity_created x touches_last_7_days.
Expand Down Expand Up @@ -36,6 +36,7 @@
"SUBSAMPLE_N",
"TARGET_RATE",
"assign_acquisition_wave",
"boost_leakage_trap",
"cap_expected_acv",
"compute_post_snapshot_touches",
"derive_features",
Expand All @@ -50,15 +51,15 @@
# ---------------------------------------------------------------------------
SEED = 42
N_LEADS = 5000
SNAPSHOT_DAY = 14
SNAPSHOT_DAY = 20
SUBSAMPLE_N = 1000
TARGET_RATE = 0.30

# Narrative-consistent ACV bounds (from narrative.yaml: $18k-$120k).
ACV_FLOOR = 18_000.0
ACV_CAP = 120_000.0

INSTRUCTOR_TRAP_COL = "__leakage__touches_post_snapshot_15_90"
INSTRUCTOR_TRAP_COL = "__leakage__touches_post_snapshot_21_90"

# v6 student column set: 19 features + 1 target = 20 columns.
FINAL_COLUMNS_STUDENT = [
Expand Down Expand Up @@ -207,6 +208,21 @@ def compute_post_snapshot_touches(
return result[INSTRUCTOR_TRAP_COL]


def boost_leakage_trap(df: pd.DataFrame, seed: int) -> pd.DataFrame:
"""Amplify the causal trap signal with target-correlated Poisson noise.

Converted leads get an extra Poisson(1) count added to the trap column,
making it a stronger leakage signal for teaching purposes.
"""
Comment thread
shaypal5 marked this conversation as resolved.
rng = RNGRoot(seed).numpy_child("leakage_trap_boost")
df = df.copy()
n = len(df)
converted = df["converted"].values
boost = converted * rng.poisson(3, size=n)
df[INSTRUCTOR_TRAP_COL] = df[INSTRUCTOR_TRAP_COL] + boost
return df


def rename_and_select(
df: pd.DataFrame,
*,
Expand Down
10 changes: 7 additions & 3 deletions scripts/build_v6_snapshot.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,9 +9,9 @@
- lead_scoring_intro_v6_instructor.csv (same rows + __leakage__ trap column)

Both are 1000-row files at ~30% conversion rate with:
- Day-14 windowed features
- Day-20 windowed features
- Structured missingness (MAR + structural + MCAR)
- Causally-grounded leakage trap (post-snapshot touches from sim events)
- Leakage trap: causal post-snapshot touches + Poisson(3) boost for converted leads
- Expected ACV with soft winsorization
- Momentum features (touches_week_1, touches_last_7_days, days_since_first_touch)
- Acquisition wave cohort feature (A/B/C)
Expand All @@ -31,6 +31,7 @@
SEED,
SNAPSHOT_DAY,
assign_acquisition_wave,
boost_leakage_trap,
compute_post_snapshot_touches,
derive_features,
inject_missingness,
Expand Down Expand Up @@ -73,7 +74,7 @@ def build_v6_datasets(seed: int = SEED) -> tuple[pd.DataFrame, pd.DataFrame]:
file=sys.stderr,
)

# Compute post-snapshot touches from event timeline (causal trap)
# Compute post-snapshot touches from event timeline (boosted in next step)
lead_dates = {lead.lead_id: lead.lead_created_at for lead in bundle.population.leads}
trap_series = compute_post_snapshot_touches(
snapshot,
Expand All @@ -90,6 +91,9 @@ def build_v6_datasets(seed: int = SEED) -> tuple[pd.DataFrame, pd.DataFrame]:
# Rename and select (instructor first to keep trap column)
df_instructor = rename_and_select(df, instructor=True)

# Boost trap signal with target-correlated Poisson noise
df_instructor = boost_leakage_trap(df_instructor, seed)

Comment thread
shaypal5 marked this conversation as resolved.
print("Subsampling...", file=sys.stderr)
df_instructor = subsample(df_instructor, seed)
print(
Expand Down
46 changes: 46 additions & 0 deletions tests/scripts/test_build_v6_snapshot.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
FINAL_COLUMNS_STUDENT,
INSTRUCTOR_TRAP_COL,
assign_acquisition_wave,
boost_leakage_trap,
derive_features,
inject_missingness,
rename_and_select,
Expand Down Expand Up @@ -282,3 +283,48 @@ def test_web_sessions_missingness_varies_by_source(self):
result.loc[df["lead_source"] == "inbound_marketing", "web_sessions"].isna().mean()
)
assert sdr_rate > inbound_rate


# ---------------------------------------------------------------------------
# Tests — boost_leakage_trap
# ---------------------------------------------------------------------------


class TestBoostLeakageTrap:
def test_only_converted_leads_boosted(self):
df = _make_v6_df(n=500, instructor=True)
original_trap = df[INSTRUCTOR_TRAP_COL].copy()
result = boost_leakage_trap(df, seed=42)
neg_mask = df["converted"] == 0
pd.testing.assert_series_equal(
result.loc[neg_mask, INSTRUCTOR_TRAP_COL],
original_trap[neg_mask],
check_names=False,
)

def test_converted_leads_get_higher_or_equal(self):
df = _make_v6_df(n=500, instructor=True)
original_trap = df[INSTRUCTOR_TRAP_COL].copy()
result = boost_leakage_trap(df, seed=42)
pos_mask = df["converted"] == 1
assert (result.loc[pos_mask, INSTRUCTOR_TRAP_COL] >= original_trap[pos_mask]).all()

def test_does_not_modify_input(self):
df = _make_v6_df(n=500, instructor=True)
original = df.copy()
boost_leakage_trap(df, seed=42)
pd.testing.assert_frame_equal(df, original)

def test_deterministic_given_seed(self):
df = _make_v6_df(n=500, instructor=True)
r1 = boost_leakage_trap(df, seed=42)
r2 = boost_leakage_trap(df, seed=42)
pd.testing.assert_frame_equal(r1, r2)

def test_boost_increases_mean_for_converted(self):
"""Mean trap value should be higher for converted leads after boost."""
df = _make_v6_df(n=1000, instructor=True)
before_mean = df.loc[df["converted"] == 1, INSTRUCTOR_TRAP_COL].mean()
result = boost_leakage_trap(df, seed=42)
after_mean = result.loc[result["converted"] == 1, INSTRUCTOR_TRAP_COL].mean()
assert after_mean > before_mean
Loading