Structural leakage in student_public bundles (post-#56 follow-up)

## Context

PR #56 fixed the deterministic `current_stage` leak in `student_public` bundles via exposure-layer column redaction. That addresses the most blatant symptom but **does not** make the released bundle structurally leakage-free. This issue tracks the three remaining concerns surfaced during the PR self-review.

The CLAUDE.md invariant is: *"all features computed from events ≤ snapshot day only"*. At a 90-day snapshot with a 90-day label window, that invariant is **structurally** violated — features are anchored at the same day the label resolves. PR #56 only patches the columns where the leak is observable with `df['col'].value_counts()`. The remaining concerns are observable only with conditional probabilities or AUC ablations.

## Concern 1 — Event aggregates are computed over the label window

Every aggregated feature in the snapshot is computed over `[lead_created_at, lead_created_at + horizon_days]`, where `horizon_days == label_window_days == 90` for the released bundles. That means `touch_count`, `inbound_touch_count`, `outbound_touch_count`, `session_count`, `pricing_page_views`, `demo_page_views`, `total_session_duration_seconds`, `activity_count`, `days_since_last_touch`, `touches_last_7_days`, `expected_acv` all aggregate events that occurred *after* the conversion (for converted leads).

**Recommended fix.** Use a windowed snapshot: set `snapshot_day < label_window_days` for the released bundles, so aggregations stop before the label window closes. This is exactly what v6/v7 datasets already do (snapshot day 14 / 20 with 90-day label window) — the infrastructure is in `leadforge/render/snapshots.py` and accepts a `snapshot_day` kwarg today; only `leadforge/api/bundle.py` and `scripts/build_public_release.py` need to thread it.

**Trade-off to decide during implementation.**
- Choosing a snapshot day is a recipe-level decision. Day 14 (v6) and day 20 (v7) are precedents, but those datasets had specific pedagogical structure (causal follow-up ramp); the release bundle doesn't.
- Conversion rates will shift (fewer label positives observable at day N than at day 90). The published target ranges in `difficulty_profiles.yaml` may need recalibration.
- Every byte of every published parquet file changes. Documentation (`release/README.md`, `release/HF_DATASET_CARD.md`, `dataset_card.md`) needs coordinated updates with new conversion rates.
- Possible (likely?) bump of `BUNDLE_SCHEMA_VERSION` if the snapshot-day choice is recorded in the manifest. **Per the standing rule, this needs a heads-up before bumping.**

## Concern 2 — `is_sql=False` is near-deterministic for non-conversion

Measured on the regenerated PR-#56 bundles:

| Tier | P(conv \| is_sql=True) | P(conv \| is_sql=False) | n (is_sql=False) |
|---|---|---|---|
| intro | 0.4385 | **0.0383** | 209 |
| intermediate | 0.2164 | **0.0153** | 261 |
| advanced | 0.0865 | **0.0062** | 321 |

At advanced tier, `is_sql=False` predicts non-conversion with 99.4% probability — that is effectively deterministic for the negative class. The cause is the simulation invariant: nearly all conversions go through SQL stage (PR #45 added a 1%-per-day pre-SQL bypass that explains the few `is_sql=False, converted=True` cases).

**Recommended fix.** Tag `is_sql` with `redact_in_modes={ExposureMode.student_public}` in `leadforge/schema/features.py`. The mechanism is already in place from PR #56; this is a one-line change plus regeneration. `is_sql` would still be available in `research_instructor` for DGP-aware evaluation.

Alternative: keep `is_sql` as a strong feature with explicit documentation. I lean against this — the asymmetry between tiers (3.8% → 0.6% as difficulty rises) means the same column behaves differently across the bundles, which is more confusing than informative for students.

## Concern 3 — `is_mql` is constant `True`

Measured across all three tiers: `is_mql == True` for 100% of leads in the published bundles. Zero variance. This is because `leadforge/simulation/population.py` initialises every lead at `current_stage="mql"` (line 369), and `is_mql` reflects whether the lead reached MQL by snapshot day — trivially true.

**Recommended fix.** Either:
- (a) Drop `is_mql` from the snapshot entirely (mark it removed in the feature spec), or
- (b) Initialise some leads at a pre-MQL stage in `population.py` so `is_mql` carries actual signal.

(a) is the smaller change and probably correct — the column carries no information today and the schema is cleaner without it. (b) is a simulator behaviour change with broader implications and shouldn't be done unless we want pre-MQL leads in the funnel for other reasons.

A test along the lines of `test_no_zero_variance_features` should guard the result so this doesn't regress.

## Suggested implementation order

1. Land #2 and #3 first (small, contained, no schema bump).
2. Then #1 as its own PR (large blast radius, recalibration, documentation, possible schema bump — needs explicit go-ahead before bumping).

## References

- PR #56 — the redaction fix this issue follows up
- `.agent-plan.md` → "Follow-up: structural leakage in student_public bundles (open)"
- `docs/v4/design.md` — windowed snapshot precedent
- `scripts/build_v6_snapshot.py`, `scripts/build_v7_snapshot.py` — windowed snapshots in production for the simplified CSV datasets

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Structural leakage in student_public bundles (post-#56 follow-up) #57

Context

Concern 1 — Event aggregates are computed over the label window

Concern 2 — `is_sql=False` is near-deterministic for non-conversion

Concern 3 — `is_mql` is constant `True`

Suggested implementation order

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Tier	P(conv \| is_sql=True)	P(conv \| is_sql=False)	n (is_sql=False)
intro	0.4385	0.0383	209
intermediate	0.2164	0.0153	261
advanced	0.0865	0.0062	321

Structural leakage in student_public bundles (post-#56 follow-up) #57

Description

Context

Concern 1 — Event aggregates are computed over the label window

Concern 2 — is_sql=False is near-deterministic for non-conversion

Concern 3 — is_mql is constant True

Suggested implementation order

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Concern 2 — `is_sql=False` is near-deterministic for non-conversion

Concern 3 — `is_mql` is constant `True`