You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
PR #56 fixed the deterministic current_stage leak in student_public bundles via exposure-layer column redaction. That addresses the most blatant symptom but does not make the released bundle structurally leakage-free. This issue tracks the three remaining concerns surfaced during the PR self-review.
The CLAUDE.md invariant is: "all features computed from events ≤ snapshot day only". At a 90-day snapshot with a 90-day label window, that invariant is structurally violated — features are anchored at the same day the label resolves. PR #56 only patches the columns where the leak is observable with df['col'].value_counts(). The remaining concerns are observable only with conditional probabilities or AUC ablations.
Concern 1 — Event aggregates are computed over the label window
Every aggregated feature in the snapshot is computed over [lead_created_at, lead_created_at + horizon_days], where horizon_days == label_window_days == 90 for the released bundles. That means touch_count, inbound_touch_count, outbound_touch_count, session_count, pricing_page_views, demo_page_views, total_session_duration_seconds, activity_count, days_since_last_touch, touches_last_7_days, expected_acv all aggregate events that occurred after the conversion (for converted leads).
Recommended fix. Use a windowed snapshot: set snapshot_day < label_window_days for the released bundles, so aggregations stop before the label window closes. This is exactly what v6/v7 datasets already do (snapshot day 14 / 20 with 90-day label window) — the infrastructure is in leadforge/render/snapshots.py and accepts a snapshot_day kwarg today; only leadforge/api/bundle.py and scripts/build_public_release.py need to thread it.
Trade-off to decide during implementation.
Choosing a snapshot day is a recipe-level decision. Day 14 (v6) and day 20 (v7) are precedents, but those datasets had specific pedagogical structure (causal follow-up ramp); the release bundle doesn't.
Conversion rates will shift (fewer label positives observable at day N than at day 90). The published target ranges in difficulty_profiles.yaml may need recalibration.
Every byte of every published parquet file changes. Documentation (release/README.md, release/HF_DATASET_CARD.md, dataset_card.md) needs coordinated updates with new conversion rates.
Possible (likely?) bump of BUNDLE_SCHEMA_VERSION if the snapshot-day choice is recorded in the manifest. Per the standing rule, this needs a heads-up before bumping.
Concern 2 — is_sql=False is near-deterministic for non-conversion
At advanced tier, is_sql=False predicts non-conversion with 99.4% probability — that is effectively deterministic for the negative class. The cause is the simulation invariant: nearly all conversions go through SQL stage (PR #45 added a 1%-per-day pre-SQL bypass that explains the few is_sql=False, converted=True cases).
Recommended fix. Tag is_sql with redact_in_modes={ExposureMode.student_public} in leadforge/schema/features.py. The mechanism is already in place from PR #56; this is a one-line change plus regeneration. is_sql would still be available in research_instructor for DGP-aware evaluation.
Alternative: keep is_sql as a strong feature with explicit documentation. I lean against this — the asymmetry between tiers (3.8% → 0.6% as difficulty rises) means the same column behaves differently across the bundles, which is more confusing than informative for students.
Concern 3 — is_mql is constant True
Measured across all three tiers: is_mql == True for 100% of leads in the published bundles. Zero variance. This is because leadforge/simulation/population.py initialises every lead at current_stage="mql" (line 369), and is_mql reflects whether the lead reached MQL by snapshot day — trivially true.
Recommended fix. Either:
(a) Drop is_mql from the snapshot entirely (mark it removed in the feature spec), or
(b) Initialise some leads at a pre-MQL stage in population.py so is_mql carries actual signal.
(a) is the smaller change and probably correct — the column carries no information today and the schema is cleaner without it. (b) is a simulator behaviour change with broader implications and shouldn't be done unless we want pre-MQL leads in the funnel for other reasons.
A test along the lines of test_no_zero_variance_features should guard the result so this doesn't regress.
Context
PR #56 fixed the deterministic
current_stageleak instudent_publicbundles via exposure-layer column redaction. That addresses the most blatant symptom but does not make the released bundle structurally leakage-free. This issue tracks the three remaining concerns surfaced during the PR self-review.The CLAUDE.md invariant is: "all features computed from events ≤ snapshot day only". At a 90-day snapshot with a 90-day label window, that invariant is structurally violated — features are anchored at the same day the label resolves. PR #56 only patches the columns where the leak is observable with
df['col'].value_counts(). The remaining concerns are observable only with conditional probabilities or AUC ablations.Concern 1 — Event aggregates are computed over the label window
Every aggregated feature in the snapshot is computed over
[lead_created_at, lead_created_at + horizon_days], wherehorizon_days == label_window_days == 90for the released bundles. That meanstouch_count,inbound_touch_count,outbound_touch_count,session_count,pricing_page_views,demo_page_views,total_session_duration_seconds,activity_count,days_since_last_touch,touches_last_7_days,expected_acvall aggregate events that occurred after the conversion (for converted leads).Recommended fix. Use a windowed snapshot: set
snapshot_day < label_window_daysfor the released bundles, so aggregations stop before the label window closes. This is exactly what v6/v7 datasets already do (snapshot day 14 / 20 with 90-day label window) — the infrastructure is inleadforge/render/snapshots.pyand accepts asnapshot_daykwarg today; onlyleadforge/api/bundle.pyandscripts/build_public_release.pyneed to thread it.Trade-off to decide during implementation.
difficulty_profiles.yamlmay need recalibration.release/README.md,release/HF_DATASET_CARD.md,dataset_card.md) needs coordinated updates with new conversion rates.BUNDLE_SCHEMA_VERSIONif the snapshot-day choice is recorded in the manifest. Per the standing rule, this needs a heads-up before bumping.Concern 2 —
is_sql=Falseis near-deterministic for non-conversionMeasured on the regenerated PR-#56 bundles:
At advanced tier,
is_sql=Falsepredicts non-conversion with 99.4% probability — that is effectively deterministic for the negative class. The cause is the simulation invariant: nearly all conversions go through SQL stage (PR #45 added a 1%-per-day pre-SQL bypass that explains the fewis_sql=False, converted=Truecases).Recommended fix. Tag
is_sqlwithredact_in_modes={ExposureMode.student_public}inleadforge/schema/features.py. The mechanism is already in place from PR #56; this is a one-line change plus regeneration.is_sqlwould still be available inresearch_instructorfor DGP-aware evaluation.Alternative: keep
is_sqlas a strong feature with explicit documentation. I lean against this — the asymmetry between tiers (3.8% → 0.6% as difficulty rises) means the same column behaves differently across the bundles, which is more confusing than informative for students.Concern 3 —
is_mqlis constantTrueMeasured across all three tiers:
is_mql == Truefor 100% of leads in the published bundles. Zero variance. This is becauseleadforge/simulation/population.pyinitialises every lead atcurrent_stage="mql"(line 369), andis_mqlreflects whether the lead reached MQL by snapshot day — trivially true.Recommended fix. Either:
is_mqlfrom the snapshot entirely (mark it removed in the feature spec), orpopulation.pysois_mqlcarries actual signal.(a) is the smaller change and probably correct — the column carries no information today and the schema is cleaner without it. (b) is a simulator behaviour change with broader implications and shouldn't be done unless we want pre-MQL leads in the funnel for other reasons.
A test along the lines of
test_no_zero_variance_featuresshould guard the result so this doesn't regress.Suggested implementation order
References
.agent-plan.md→ "Follow-up: structural leakage in student_public bundles (open)"docs/v4/design.md— windowed snapshot precedentscripts/build_v6_snapshot.py,scripts/build_v7_snapshot.py— windowed snapshots in production for the simplified CSV datasets