leadforge-dev · shaypal5 · May 4, 2026
diff --git a/.agent-plan.md b/.agent-plan.md
@@ -66,15 +66,13 @@ First public dataset release: `leadforge-b2b-lead-scoring`. Three difficulty tie
 
 Deterministic leak fixed via exposure-layer redaction. `FeatureSpec` now carries an explicit `redact_in_modes: frozenset[ExposureMode]` field — *prescriptive* — alongside the descriptive `leakage_risk` flag. `current_stage` is marked `redact_in_modes={ExposureMode.student_public}`; the writer queries `redacted_columns_for(mode)` and strips matching columns from the snapshot, task splits, and feature dictionary before they hit disk. The pedagogical trap `total_touches_all` is preserved in all modes (no entry in `redact_in_modes`). The manifest records `redacted_columns: [...]` so the bundle is self-describing. `validate_bundle()` cross-checks parquet schemas, feature dictionary, and the manifest's declared redaction set against `redacted_columns_for(mode)` derived independently from the feature spec. Hash-determinism preserved (73/73 identical across builds).
 
-### Follow-up: structural leakage in `student_public` bundles (open)
+### Follow-up: structural leakage in `student_public` bundles (issue #57)
 
-Stripping `current_stage` addresses the deterministic label-encoding leak but does **not** make the released bundle structurally leakage-free. Three concerns to address in a follow-up PR:
+Tracked in [GitHub issue #57](https://github.com/leadforge-dev/leadforge/issues/57).
 
-1. **Event-aggregate features are computed over the label window.** `touch_count`, `session_count`, `pricing_page_views`, `expected_acv`, `days_since_last_touch`, etc. all aggregate events in `[lead_created_at, lead_created_at + 90d]`, the same window over which the label resolves. They correlate with post-conversion activity. The structural fix is a windowed snapshot (`snapshot_day=N` with `N < label_window_days`), as v6/v7 datasets already do at day 14/20. This shifts every feature value and every conversion rate in the release bundles, so it's deferred to its own PR with a coordinated documentation update.
-2. **`is_sql=False` is near-deterministic for non-conversion.** Measured on the regenerated bundle: P(converted | is_sql=False) = 0.038 (intro), 0.015 (intermediate), 0.006 (advanced). At advanced tier it effectively encodes the negative class. Either redact `is_sql` in `student_public` (probably correct) or accept it as a strong feature with documentation. Decide alongside #1.
-3. **`is_mql` is a constant `True`.** Zero variance feature in all three tiers. Should be removed from the snapshot or, if it can ever be False under some recipe, the simulator should produce that variance.
-
-Suggested action: open one tracked GitHub issue covering all three (currently no issue exists; user has standing instruction not to file without confirmation).
+1. **Event-aggregate features are computed over the label window.** `touch_count`, `session_count`, `pricing_page_views`, `expected_acv`, `days_since_last_touch`, etc. all aggregate events in `[lead_created_at, lead_created_at + 90d]`, the same window over which the label resolves. The structural fix is a windowed snapshot (`snapshot_day=N` with `N < label_window_days`), as v6/v7 datasets already do at day 14/20. **Open** — its own PR with documentation recalibration; will likely bump `BUNDLE_SCHEMA_VERSION` again.
+2. ~~**`is_sql=False` is near-deterministic for non-conversion.** Measured on the regenerated bundle: P(converted | is_sql=False) = 0.038 (intro), 0.015 (intermediate), 0.006 (advanced).~~ **Resolved** — `is_sql` redacted in `student_public` mode by post-#57 PR (bundle schema v3).
+3. ~~**`is_mql` is a constant `True`.** Zero variance feature in all three tiers.~~ **Resolved** — `is_mql` removed from the canonical feature list by post-#57 PR (bundle schema v3). Guarded by a new `test_no_zero_variance_features` check.
 
 ---
 

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,60 @@ Format inspired by [Keep a Changelog](https://keepachangelog.com/).
 
 ## Unreleased
 
+### Bundle schema v3
+
+`bundle_schema_version` bumped from `"2"` to `"3"`.  Three structural
+changes follow up on PR #56 (issue #57):
+
+- **`is_mql` fully removed.**  Every lead is initialised at MQL stage in
+  the simulator, making the field constant `True` and zero-variance.
+  It carried no information for modelling and is now removed from the
+  `LeadRow` entity, the relational `leads.parquet`, the snapshot, the
+  task splits, and the feature dictionary — in all exposure modes.
+- **`is_sql` redacted in `student_public` mode.**  Measured across 5
+  seeds on full-size bundles: P(converted | is_sql=False) =
+  0.061 ± 0.026 (intro) / 0.020 ± 0.010 (intermediate) /
+  0.011 ± 0.004 (advanced).  At advanced tier this is essentially
+  deterministic for the negative class — practically a one-rule
+  classifier.  `is_sql` remains in `research_instructor` exports for
+  DGP-aware research.
+- **Redaction now applies to relational tables too.**  In v2, the
+  exposure-layer redaction only stripped columns from the snapshot /
+  task splits; users following the README's "Option 3" (feature
+  engineering off the raw `tables/leads.parquet`) could trivially
+  rejoin redacted columns.  In v3, `redacted_columns_for(mode)` is
+  applied uniformly to every published parquet under both `tables/`
+  and `tasks/`.  In `student_public` bundles, `tables/leads.parquet`
+  no longer carries `current_stage` or `is_sql`.
+
+### New automated checks
+
+- `tests/render/test_bundle_schema_v3_contract.py` pins the exact
+  column set per mode for v3 — any future change that touches the
+  feature spec or redaction policy without updating the contract
+  fails this test, forcing explicit version coordination.
+- `test_no_zero_variance_features` in `tests/exposure/test_redaction.py`
+  asserts no constant or near-constant columns in the published
+  `student_public` task split (1% rare-class threshold on bundles
+  large enough for the threshold to be statistically meaningful).
+
+### Bundle column counts (v3)
+
+- `student_public/{intro,intermediate,advanced}`: **32** task split
+  columns (down from 34 in v2); **9** columns in `tables/leads.parquet`
+  (down from 12).
+- `research_instructor/intermediate_instructor`: **34** task split
+  columns (down from 35); **11** columns in `tables/leads.parquet`
+  (down from 12 — `is_mql` removed).
+
+### Open follow-up
+
+Issue #57 sub-item 1 remains open: event-aggregate features
+(`touch_count`, `session_count`, `pricing_page_views`, ...) are still
+computed over the same 90-day window the label resolves in.  The
+structural fix is a windowed snapshot rebuild and is deferred to its
+own PR.
+
 ---
 
 ## v1.0.0 — 2026-05-02
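
Reviewer note on the `test_no_zero_variance_features` entry in the changelog hunk above: the test body is not part of this diff, so the following is only a minimal sketch of what a "1% rare-class threshold" check could look like. The function name `near_constant_columns`, the row-dict representation, and the sample columns are assumptions for illustration, not the project's implementation.

```python
from collections import Counter


def near_constant_columns(rows, threshold=0.01):
    """Return columns whose non-majority share falls below `threshold`.

    Hypothetical helper: `rows` is a list of dicts sharing the same keys.
    """
    if not rows:
        return []
    flagged = []
    for col in rows[0]:
        counts = Counter(row[col] for row in rows)
        majority = counts.most_common(1)[0][1]
        minority_share = 1 - majority / len(rows)
        if minority_share < threshold:
            flagged.append(col)
    return flagged


# Illustrative data: one constant column, one near-constant, two healthy ones.
rows = [
    {
        "is_mql": True,          # constant, as in schema v2
        "rare_flag": i == 0,     # 1 in 1000 rows differs
        "is_sql": i % 4 == 0,    # 25% minority class
        "touch_count": i % 7,    # spread over seven values
    }
    for i in range(1000)
]
flagged = near_constant_columns(rows)
```

With a check of this shape, a column like the old `is_mql` is caught at test time rather than in a post-release audit.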

diff --git a/leadforge/api/bundle.py b/leadforge/api/bundle.py
@@ -59,6 +59,14 @@ def write_bundle(
     population = bundle.population
     world_graph = bundle.world_graph
 
+    # The redaction set comes from the canonical feature spec — the same
+    # source of truth the validator uses.  It is applied uniformly to
+    # every published parquet file (relational tables AND task splits) so
+    # users doing feature engineering off the raw tables (per the
+    # README's "Option 3") cannot trivially reintroduce a redacted
+    # column by joining ``tables/leads.parquet`` to their feature set.
+    redacted = redacted_columns_for(config.exposure_mode)
+
     # ------------------------------------------------------------------
     # 1. Relational tables → tables/
     # ------------------------------------------------------------------
@@ -68,17 +76,19 @@ def write_bundle(
     dfs = to_dataframes(result, population)
     table_row_counts: dict[str, int] = {}
     for table_name, df in dfs.items():
+        if redacted:
+            cols_to_drop = [c for c in redacted if c in df.columns]
+            if cols_to_drop:
+                df = df.drop(columns=cols_to_drop)
         write_parquet(df, tables_dir / f"{table_name}.parquet")
         table_row_counts[table_name] = len(df)
 
     # ------------------------------------------------------------------
     # 2. Snapshot + task splits → tasks/
     #
-    # Apply exposure-mode redaction here (rather than in apply_exposure)
-    # so that the manifest's per-file SHA-256 hashes reflect the published
-    # column set without a post-write rewrite step.  The redacted column
-    # set is derived from the canonical feature spec — the same source
-    # of truth the validator uses to check bundles.
+    # Same redaction rule applied to the snapshot DataFrame before the
+    # task splits are written, so manifest SHA-256 hashes reflect the
+    # published column set without a post-write rewrite step.
     # ------------------------------------------------------------------
     snapshot = build_snapshot(
         result,
@@ -87,7 +97,6 @@ def write_bundle(
         difficulty_params=config.difficulty_params,
         seed=config.seed,
     )
-    redacted = redacted_columns_for(config.exposure_mode)
     if redacted:
         drop_cols = [c for c in redacted if c in snapshot.columns]
         if drop_cols:
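
Reviewer note on why the redaction has to cover `tables/` as well as `tasks/`: the sketch below reproduces the "Option 3" rejoin described in the comment above using plain Python dicts instead of DataFrames. The helpers `drop_redacted` and `left_join` and the sample rows are hypothetical; only the column names and the idea of one redaction set applied to every table come from the diff.

```python
# Redaction set for student_public as described in the changelog (assumed here).
REDACTED = frozenset({"current_stage", "is_sql"})


def drop_redacted(rows, redacted):
    """Strip redacted keys from every row of a table."""
    return [{k: v for k, v in row.items() if k not in redacted} for row in rows]


def left_join(left, right, key):
    """Join `right` columns onto `left` rows by `key` (left values win)."""
    index = {row[key]: row for row in right}
    return [{**index.get(row[key], {}), **row} for row in left]


leads = [
    {"lead_id": "L1", "current_stage": "closed_won", "is_sql": True},
    {"lead_id": "L2", "current_stage": "mql", "is_sql": False},
]
# The snapshot was already redacted in v2.
snapshot = [{"lead_id": "L1", "touch_count": 9}, {"lead_id": "L2", "touch_count": 2}]

v2_tables = {"leads": leads}  # v2: relational tables published as-is
v3_tables = {name: drop_redacted(rows, REDACTED) for name, rows in v2_tables.items()}

v2_joined = left_join(snapshot, v2_tables["leads"], "lead_id")  # leak comes back
v3_joined = left_join(snapshot, v3_tables["leads"], "lead_id")  # nothing to recover
```

In the v2 layout the join restores `is_sql` and `current_stage` for every lead; in the v3 layout the joined rows carry only what the snapshot already had.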

diff --git a/leadforge/render/manifests.py b/leadforge/render/manifests.py
@@ -20,7 +20,15 @@
     from leadforge.structure.graph import WorldGraph
 
 # Bump this whenever the bundle layout or manifest schema changes.
-BUNDLE_SCHEMA_VERSION = "2"
+# History:
+#   "1" — initial layout (pre-M8)
+#   "2" — M8 render layer: tables/, tasks/, dataset_card.md,
+#         feature_dictionary.csv, manifest.json structure
+#   "3" — issue #57 follow-up: ``is_mql`` removed from the canonical
+#         feature list (zero-variance); ``is_sql`` redacted in
+#         ``student_public`` mode (near-deterministic for non-conversion);
+#         redaction now covers ``tables/``.  ``manifest.redacted_columns`` came with PR #56.
+BUNDLE_SCHEMA_VERSION = "3"
 
 # Manifest fields whose value is non-deterministic by design (wall-clock,
 # host metadata, etc.).  Determinism checks must ignore these fields when

diff --git a/leadforge/schema/entities.py b/leadforge/schema/entities.py
@@ -150,12 +150,15 @@ class LeadRow:
         "first_touch_channel": "string",
         "current_stage": "string",
         "owner_rep_id": "string",
-        "is_mql": "boolean",
         "is_sql": "boolean",
         "converted_within_90_days": "boolean",
         "conversion_timestamp": "string",
     }
 
+    # ``is_mql`` was removed in bundle schema v3 (issue #57).  Every lead
+    # is initialised at MQL stage in ``simulation/population.py``, so the
+    # field was constant ``True`` and zero-variance across all bundles.
+
     lead_id: str
     contact_id: str
     account_id: str
@@ -164,7 +167,6 @@ class LeadRow:
     first_touch_channel: str
     current_stage: str
     owner_rep_id: str
-    is_mql: bool
     is_sql: bool
     converted_within_90_days: bool
     conversion_timestamp: str | None = None

diff --git a/leadforge/schema/features.py b/leadforge/schema/features.py
@@ -145,17 +145,23 @@ class FeatureSpec:
         leakage_risk=True,
         redact_in_modes=frozenset({ExposureMode.student_public}),
     ),
-    FeatureSpec(
-        "is_mql",
-        "boolean",
-        "Whether the lead had achieved MQL status at snapshot date.",
-        "lead_meta",
-    ),
+    # Note: ``is_mql`` was removed from the canonical feature list (issue #57)
+    # because every lead is initialised at MQL stage in
+    # ``leadforge/simulation/population.py``, making the column constant
+    # ``True`` and zero-variance.  The underlying ``LeadRow.is_mql`` field
+    # was removed too, so relational ``leads.parquet`` no longer carries it.
     FeatureSpec(
         "is_sql",
         "boolean",
-        "Whether the lead had achieved SQL status at snapshot date.",
+        "Whether the lead had achieved SQL status at snapshot date. "
+        "Strongly correlated with the label: the simulator only converts "
+        "non-SQL leads via a rare direct-conversion path, so "
+        "is_sql=False predicts non-conversion with very high probability "
+        "(P(conv | is_sql=False) ≈ 0.04 / 0.015 / 0.006 across difficulty "
+        "tiers).  Redacted from student_public bundles.",
         "lead_meta",
+        leakage_risk=True,
+        redact_in_modes=frozenset({ExposureMode.student_public}),
     ),
     # -- Engagement features --
     FeatureSpec(
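
Reviewer note on how `redact_in_modes` feeds `redacted_columns_for(mode)`: neither the `FeatureSpec` definition nor the helper body appears in this hunk, so the sketch below is an assumption about their shape. The field order mirrors the positional arguments visible above; treating `ExposureMode` as an `Enum` and using a three-entry `FEATURES` tuple are illustrative choices only.

```python
from dataclasses import dataclass
from enum import Enum


class ExposureMode(Enum):
    # Member names taken from the diff; the values are assumed.
    student_public = "student_public"
    research_instructor = "research_instructor"


@dataclass(frozen=True)
class FeatureSpec:
    name: str
    dtype: str
    description: str
    category: str
    leakage_risk: bool = False                 # descriptive flag
    redact_in_modes: frozenset = frozenset()   # prescriptive: strip in these modes


# Hypothetical excerpt; the real feature list is much longer.
FEATURES = (
    FeatureSpec("current_stage", "string", "Funnel stage.", "lead_meta",
                leakage_risk=True,
                redact_in_modes=frozenset({ExposureMode.student_public})),
    FeatureSpec("is_sql", "boolean", "Reached SQL status.", "lead_meta",
                leakage_risk=True,
                redact_in_modes=frozenset({ExposureMode.student_public})),
    # Pedagogical trap: flagged as risky but never redacted.
    FeatureSpec("total_touches_all", "int", "Touches over 90 days.", "engagement",
                leakage_risk=True),
)


def redacted_columns_for(mode):
    """Columns to strip for `mode`, derived only from the feature spec."""
    return frozenset(f.name for f in FEATURES if mode in f.redact_in_modes)


student_redacted = redacted_columns_for(ExposureMode.student_public)
instructor_redacted = redacted_columns_for(ExposureMode.research_instructor)
```

Keeping `leakage_risk` and `redact_in_modes` as separate fields is what lets `total_touches_all` stay in every bundle while still being flagged in the feature dictionary.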

diff --git a/leadforge/simulation/engine.py b/leadforge/simulation/engine.py
@@ -407,7 +407,6 @@ def simulate_world(
                 first_touch_channel=lead.first_touch_channel,
                 current_stage=state.current_stage,
                 owner_rep_id=lead.owner_rep_id,
-                is_mql=True,  # all leads start at mql
                 is_sql=is_sql,
                 converted_within_90_days=label,
                 conversion_timestamp=conv_ts,

diff --git a/leadforge/simulation/population.py b/leadforge/simulation/population.py
@@ -368,7 +368,6 @@ def _generate_leads(
                 first_touch_channel=lead_source,
                 current_stage="mql",
                 owner_rep_id=owner_rep_id,
-                is_mql=True,
                 is_sql=False,
                 converted_within_90_days=False,
                 conversion_timestamp=None,

diff --git a/leadforge/validation/drift.py b/leadforge/validation/drift.py
@@ -11,6 +11,7 @@
 from typing import Any
 
 import pandas as pd
+import pyarrow.parquet as pq
 
 from leadforge.core.serialization import load_json
 
@@ -66,10 +67,17 @@ def check_cross_seed_stability(bundles: dict[int, Path]) -> list[str]:
         if len(df) > 0:
             rates[seed] = float(df[_LABEL_COLUMN].mean())
 
+        # ``current_stage`` is redacted from ``leads.parquet`` in
+        # ``student_public`` mode (bundle schema v3+).  Skip stage diversity
+        # collection in that case — degeneracy still surfaces via the
+        # conversion-rate spread checks below, and via ``check_realism`` on
+        # any ``research_instructor`` bundle that does carry the column.
         leads_path = bundle_path / "tables/leads.parquet"
         if leads_path.exists():
-            leads = pd.read_parquet(leads_path, columns=["current_stage"])
-            stage_counts[seed] = int(leads["current_stage"].nunique())
+            schema_names = set(pq.read_schema(leads_path).names)
+            if "current_stage" in schema_names:
+                leads = pd.read_parquet(leads_path, columns=["current_stage"])
+                stage_counts[seed] = int(leads["current_stage"].nunique())
 
     # Check conversion rate spread — if one seed's rate is 5x another's, that's suspicious
     if len(rates) >= 2:

diff --git a/leadforge/validation/invariants.py b/leadforge/validation/invariants.py
@@ -203,11 +203,42 @@ def check_exposure_monotonicity(student_bundle: Path, instructor_bundle: Path) -
     if extra_in_instructor:
         errors.append(f"Tables in instructor but not student: {sorted(extra_in_instructor)}")
 
+    expected_redacted = redacted_columns_for(ExposureMode.student_public)
     for table in sorted(student_tables & instructor_tables):
-        s_sha = file_sha256(student_bundle / "tables" / table)
-        i_sha = file_sha256(instructor_bundle / "tables" / table)
-        if s_sha != i_sha:
-            errors.append(f"Table content mismatch: {table}")
+        s_path = student_bundle / "tables" / table
+        i_path = instructor_bundle / "tables" / table
+        if file_sha256(s_path) == file_sha256(i_path):
+            continue
+        # Mismatch is acceptable iff the only difference is redacted
+        # columns (same logic as for task splits below).
+        s_df = pd.read_parquet(s_path)
+        i_df = pd.read_parquet(i_path)
+        if len(s_df) != len(i_df):
+            errors.append(
+                f"Table {table}: row count mismatch student={len(s_df)} instructor={len(i_df)}"
+            )
+            continue
+        s_cols = set(s_df.columns)
+        i_cols = set(i_df.columns)
+        extra_in_student = s_cols - i_cols
+        if extra_in_student:
+            errors.append(
+                f"Table {table}: student has columns missing from instructor: "
+                f"{sorted(extra_in_student)}"
+            )
+            continue
+        diff = i_cols - s_cols
+        if not diff.issubset(expected_redacted):
+            errors.append(
+                f"Table {table}: instructor−student column diff {sorted(diff)} contains "
+                f"non-redacted columns (expected subset of {sorted(expected_redacted)})"
+            )
+            continue
+        shared = [c for c in s_df.columns if c in i_df.columns]
+        s_shared = s_df[shared].reset_index(drop=True)
+        i_shared = i_df[shared].reset_index(drop=True)
+        if not s_shared.equals(i_shared):
+            errors.append(f"Table {table}: shared-column values differ between modes")
 
     # Both must have the same task splits with identical content
     student_tasks = (
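
Reviewer note on the relaxed table comparison in `check_exposure_monotonicity`: the rule "a mismatch is acceptable only if the difference is redacted columns" can be exercised without parquet files. The sketch below restates the same four checks over lists of row dicts; `compare_tables` and the sample rows are hypothetical stand-ins, not code from the repository.

```python
def compare_tables(name, student_rows, instructor_rows, expected_redacted):
    """Return error strings; an empty list means the student table is a
    column-redacted copy of the instructor table."""
    if len(student_rows) != len(instructor_rows):
        return [f"Table {name}: row count mismatch"]
    s_cols = set(student_rows[0]) if student_rows else set()
    i_cols = set(instructor_rows[0]) if instructor_rows else set()
    extra_in_student = s_cols - i_cols
    if extra_in_student:
        return [f"Table {name}: student has extra columns {sorted(extra_in_student)}"]
    diff = i_cols - s_cols
    if not diff <= expected_redacted:
        unexpected = sorted(diff - expected_redacted)
        return [f"Table {name}: unexpected column diff {unexpected}"]
    # Shared columns must match row by row, in order.
    for s_row, i_row in zip(student_rows, instructor_rows):
        if any(s_row[c] != i_row[c] for c in s_cols):
            return [f"Table {name}: shared-column values differ between modes"]
    return []


expected = frozenset({"current_stage", "is_sql"})
instructor = [{"lead_id": "L1", "is_sql": True, "owner_rep_id": "R1"}]

ok = compare_tables("leads", [{"lead_id": "L1", "owner_rep_id": "R1"}], instructor, expected)
missing = compare_tables("leads", [{"lead_id": "L1"}], instructor, expected)
changed = compare_tables("leads", [{"lead_id": "L1", "owner_rep_id": "R9"}], instructor, expected)
```

The second call fails because `owner_rep_id` is absent from the student table without being in the redaction set, which is the case the subset check exists to catch.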

diff --git a/leadforge/validation/realism.py b/leadforge/validation/realism.py
@@ -131,12 +131,23 @@ def _check_feature_ranges(root: Path, manifest: dict[str, Any]) -> list[str]:
 
 
 def _check_stage_distribution(root: Path) -> list[str]:
-    """Check that leads span multiple funnel stages (not all stuck in one)."""
+    """Check that leads span multiple funnel stages (not all stuck in one).
+
+    ``current_stage`` is redacted from the relational ``leads.parquet`` in
+    ``student_public`` mode (bundle schema v3 onward), so this check is a
+    no-op there — the underlying simulation is identical to the
+    ``research_instructor`` bundle, which still carries the column and will
+    surface a degenerate simulation through the same check.
+    """
     errors: list[str] = []
     leads_path = root / "tables/leads.parquet"
     if not leads_path.exists():
         return errors
 
+    schema_names = set(pq.read_schema(leads_path).names)
+    if "current_stage" not in schema_names:
+        return errors
+
     df = pd.read_parquet(leads_path, columns=["current_stage"])
     if len(df) == 0:
         return errors

diff --git a/release/HF_DATASET_CARD.md b/release/HF_DATASET_CARD.md
@@ -77,7 +77,7 @@ df = pd.read_csv("hf://datasets/leadforge/leadforge-b2b-lead-scoring/intermediat
 | | Intro | Intermediate | Advanced |
 |---|---|---|---|
 | Leads | 5,000 | 5,000 | 5,000 |
-| Features | 32 + 1 trap (+ 1 target) | 32 + 1 trap (+ 1 target) | 32 + 1 trap (+ 1 target) |
+| Features | 30 + 1 trap (+ 1 target) | 30 + 1 trap (+ 1 target) | 30 + 1 trap (+ 1 target) |
 | Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |
 | Conversion rate | 41.5% | 20.1% | 7.9% |
 | Signal strength | 0.90 | 0.70 | 0.50 |
@@ -92,12 +92,13 @@ df = pd.read_csv("hf://datasets/leadforge/leadforge-b2b-lead-scoring/intermediat
 
 Each difficulty tier includes 9 Parquet tables under `tables/`: accounts, contacts, leads, touches, sessions, sales_activities, opportunities, customers, subscriptions. These form a normalized CRM schema linked by foreign keys.
 
-## Leakage handling
+## Leakage handling (bundle schema v3)
 
-- **Stripped from public bundles:** `current_stage` directly encoded the label at the 90-day horizon (terminal stages `closed_won`/`closed_lost`). Removed in `student_public` mode; available in `intermediate_instructor/`. The `manifest.json` field `redacted_columns` lists what was stripped.
+- **Stripped from public bundles:** `current_stage` (label-encoding at the 90-day horizon) and `is_sql` (P(conv | is_sql=False) ≈ 0.04 / 0.015 / 0.006 across tiers — near-deterministic for non-conversion). Both available in `intermediate_instructor/`. The `manifest.json` field `redacted_columns` lists what was stripped.
+- **Removed entirely:** `is_mql` (constant `True` in the simulator — zero variance, no information).
 - **Deliberately retained as a pedagogical trap:** `total_touches_all` counts touches over the full 90-day window including post-snapshot events. Flagged `leakage_risk=True` in `feature_dictionary.csv`. Use it as an exercise — train with and without, compare AUC, explain the gap.
 
-**Caveats:** event-aggregate features (`touch_count`, `session_count`, ...) are computed over the same 90-day window that the label resolves in, so they correlate with post-conversion events; `is_mql` is constant `True` in all bundles; `is_sql=False` is near-deterministic for non-conversion. A windowed-snapshot follow-up will address this structurally — see the package CHANGELOG.
+**Caveat:** event-aggregate features (`touch_count`, `session_count`, ...) are computed over the same 90-day window the label resolves in, so they correlate with post-conversion events. A windowed-snapshot rebuild is the structural fix — see [issue #57](https://github.com/leadforge-dev/leadforge/issues/57).
 
 ## Research companion