Skip to content

Structural leakage in student_public bundles (post-#56 follow-up) #57

@shaypal5

Description

@shaypal5

Context

PR #56 fixed the deterministic current_stage leak in student_public bundles via exposure-layer column redaction. That addresses the most blatant symptom but does not make the released bundle structurally leakage-free. This issue tracks the three remaining concerns surfaced during the PR self-review.

The CLAUDE.md invariant is: "all features computed from events ≤ snapshot day only". At a 90-day snapshot with a 90-day label window, that invariant is structurally violated — features are anchored at the same day the label resolves. PR #56 only patches the columns where the leak is observable with df['col'].value_counts(). The remaining concerns are observable only with conditional probabilities or AUC ablations.

Concern 1 — Event aggregates are computed over the label window

Every aggregated feature in the snapshot is computed over [lead_created_at, lead_created_at + horizon_days], where horizon_days == label_window_days == 90 for the released bundles. That means touch_count, inbound_touch_count, outbound_touch_count, session_count, pricing_page_views, demo_page_views, total_session_duration_seconds, activity_count, days_since_last_touch, touches_last_7_days, expected_acv all aggregate events that occurred after the conversion (for converted leads).

Recommended fix. Use a windowed snapshot: set snapshot_day < label_window_days for the released bundles, so aggregations stop before the label window closes. This is exactly what v6/v7 datasets already do (snapshot day 14 / 20 with 90-day label window) — the infrastructure is in leadforge/render/snapshots.py and accepts a snapshot_day kwarg today; only leadforge/api/bundle.py and scripts/build_public_release.py need to thread it.

Trade-off to decide during implementation.

  • Choosing a snapshot day is a recipe-level decision. Day 14 (v6) and day 20 (v7) are precedents, but those datasets had specific pedagogical structure (causal follow-up ramp); the release bundle doesn't.
  • Conversion rates will shift (fewer label positives observable at day N than at day 90). The published target ranges in difficulty_profiles.yaml may need recalibration.
  • Every byte of every published parquet file changes. Documentation (release/README.md, release/HF_DATASET_CARD.md, dataset_card.md) needs coordinated updates with new conversion rates.
  • Possible (likely?) bump of BUNDLE_SCHEMA_VERSION if the snapshot-day choice is recorded in the manifest. Per the standing rule, this needs a heads-up before bumping.

Concern 2 — is_sql=False is near-deterministic for non-conversion

Measured on the regenerated PR-#56 bundles:

Tier P(conv | is_sql=True) P(conv | is_sql=False) n (is_sql=False)
intro 0.4385 0.0383 209
intermediate 0.2164 0.0153 261
advanced 0.0865 0.0062 321

At advanced tier, is_sql=False predicts non-conversion with 99.4% probability — that is effectively deterministic for the negative class. The cause is the simulation invariant: nearly all conversions go through SQL stage (PR #45 added a 1%-per-day pre-SQL bypass that explains the few is_sql=False, converted=True cases).

Recommended fix. Tag is_sql with redact_in_modes={ExposureMode.student_public} in leadforge/schema/features.py. The mechanism is already in place from PR #56; this is a one-line change plus regeneration. is_sql would still be available in research_instructor for DGP-aware evaluation.

Alternative: keep is_sql as a strong feature with explicit documentation. I lean against this — the asymmetry between tiers (3.8% → 0.6% as difficulty rises) means the same column behaves differently across the bundles, which is more confusing than informative for students.

Concern 3 — is_mql is constant True

Measured across all three tiers: is_mql == True for 100% of leads in the published bundles. Zero variance. This is because leadforge/simulation/population.py initialises every lead at current_stage="mql" (line 369), and is_mql reflects whether the lead reached MQL by snapshot day — trivially true.

Recommended fix. Either:

  • (a) Drop is_mql from the snapshot entirely (mark it removed in the feature spec), or
  • (b) Initialise some leads at a pre-MQL stage in population.py so is_mql carries actual signal.

(a) is the smaller change and probably correct — the column carries no information today and the schema is cleaner without it. (b) is a simulator behaviour change with broader implications and shouldn't be done unless we want pre-MQL leads in the funnel for other reasons.

A test along the lines of test_no_zero_variance_features should guard the result so this doesn't regress.

Suggested implementation order

  1. Land docs: add label taxonomy and milestone map to PR workflow #2 and feat: Milestone 0 — project foundation and package skeleton (v0.1.0) #3 first (small, contained, no schema bump).
  2. Then docs: add mandatory branch/PR workflow rules #1 as its own PR (large blast radius, recalibration, documentation, possible schema bump — needs explicit go-ahead before bumping).

References

  • PR fix: redact current_stage from student_public bundles #56 — the redaction fix this issue follows up
  • .agent-plan.md → "Follow-up: structural leakage in student_public bundles (open)"
  • docs/v4/design.md — windowed snapshot precedent
  • scripts/build_v6_snapshot.py, scripts/build_v7_snapshot.py — windowed snapshots in production for the simplified CSV datasets

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions