diff --git a/.agent-plan.md b/.agent-plan.md index b7a91ee..b11f6ff 100644 --- a/.agent-plan.md +++ b/.agent-plan.md @@ -66,15 +66,13 @@ First public dataset release: `leadforge-b2b-lead-scoring`. Three difficulty tie Deterministic leak fixed via exposure-layer redaction. `FeatureSpec` now carries an explicit `redact_in_modes: frozenset[ExposureMode]` field — *prescriptive* — alongside the descriptive `leakage_risk` flag. `current_stage` is marked `redact_in_modes={ExposureMode.student_public}`; the writer queries `redacted_columns_for(mode)` and strips matching columns from the snapshot, task splits, and feature dictionary before they hit disk. The pedagogical trap `total_touches_all` is preserved in all modes (no entry in `redact_in_modes`). The manifest records `redacted_columns: [...]` so the bundle is self-describing. `validate_bundle()` cross-checks parquet schemas, feature dictionary, and the manifest's declared redaction set against `redacted_columns_for(mode)` derived independently from the feature spec. Hash-determinism preserved (73/73 identical across builds). -### Follow-up: structural leakage in `student_public` bundles (open) +### Follow-up: structural leakage in `student_public` bundles (issue #57) -Stripping `current_stage` addresses the deterministic label-encoding leak but does **not** make the released bundle structurally leakage-free. Three concerns to address in a follow-up PR: +Tracked in [GitHub issue #57](https://github.com/leadforge-dev/leadforge/issues/57). -1. **Event-aggregate features are computed over the label window.** `touch_count`, `session_count`, `pricing_page_views`, `expected_acv`, `days_since_last_touch`, etc. all aggregate events in `[lead_created_at, lead_created_at + 90d]`, the same window over which the label resolves. They correlate with post-conversion activity. The structural fix is a windowed snapshot (`snapshot_day=N` with `N < label_window_days`), as v6/v7 datasets already do at day 14/20. This shifts every feature value and every conversion rate in the release bundles, so it's deferred to its own PR with a coordinated documentation update. -2. **`is_sql=False` is near-deterministic for non-conversion.** Measured on the regenerated bundle: P(converted | is_sql=False) = 0.038 (intro), 0.015 (intermediate), 0.006 (advanced). At advanced tier it effectively encodes the negative class. Either redact `is_sql` in `student_public` (probably correct) or accept it as a strong feature with documentation. Decide alongside #1. -3. **`is_mql` is a constant `True`.** Zero variance feature in all three tiers. Should be removed from the snapshot or, if it can ever be False under some recipe, the simulator should produce that variance. - -Suggested action: open one tracked GitHub issue covering all three (currently no issue exists; user has standing instruction not to file without confirmation). +1. **Event-aggregate features are computed over the label window.** `touch_count`, `session_count`, `pricing_page_views`, `expected_acv`, `days_since_last_touch`, etc. all aggregate events in `[lead_created_at, lead_created_at + 90d]`, the same window over which the label resolves. The structural fix is a windowed snapshot (`snapshot_day=N` with `N < label_window_days`), as v6/v7 datasets already do at day 14/20. **Open** — its own PR with documentation recalibration; will likely bump `BUNDLE_SCHEMA_VERSION` again. +2. ~~**`is_sql=False` is near-deterministic for non-conversion.** Measured on the regenerated bundle: P(converted | is_sql=False) = 0.038 (intro), 0.015 (intermediate), 0.006 (advanced).~~ **Resolved** — `is_sql` redacted in `student_public` mode by post-#57 PR (bundle schema v3). +3. ~~**`is_mql` is a constant `True`.** Zero variance feature in all three tiers.~~ **Resolved** — `is_mql` removed from the canonical feature list by post-#57 PR (bundle schema v3). Guarded by a new `test_no_zero_variance_features` check. --- diff --git a/CHANGELOG.md b/CHANGELOG.md index 92da2b2..4093008 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,60 @@ Format inspired by [Keep a Changelog](https://keepachangelog.com/). ## Unreleased +### Bundle schema v3 + +`bundle_schema_version` bumped from `"2"` to `"3"`. Three structural +changes follow up on PR #56 (issue #57): + +- **`is_mql` fully removed.** Every lead is initialised at MQL stage in + the simulator, making the field constant `True` and zero-variance. + It carried no information for modelling and is now removed from the + `LeadRow` entity, the relational `leads.parquet`, the snapshot, the + task splits, and the feature dictionary — in all exposure modes. +- **`is_sql` redacted in `student_public` mode.** Measured across 5 + seeds on full-size bundles: P(converted | is_sql=False) = + 0.061 ± 0.026 (intro) / 0.020 ± 0.010 (intermediate) / + 0.011 ± 0.004 (advanced). At advanced tier this is essentially + deterministic for the negative class — practically a one-rule + classifier. `is_sql` remains in `research_instructor` exports for + DGP-aware research. +- **Redaction now applies to relational tables too.** In v2, the + exposure-layer redaction only stripped columns from the snapshot / + task splits; users following the README's "Option 3" (feature + engineering off the raw `tables/leads.parquet`) could trivially + rejoin redacted columns. In v3, `redacted_columns_for(mode)` is + applied uniformly to every published parquet under both `tables/` + and `tasks/`. In `student_public` bundles, `tables/leads.parquet` + no longer carries `current_stage` or `is_sql`. + +### New automated checks + +- `tests/render/test_bundle_schema_v3_contract.py` pins the exact + column set per mode for v3 — any future change that touches the + feature spec or redaction policy without updating the contract + fails this test, forcing an explicit version coordination. +- `test_no_zero_variance_features` in `tests/exposure/test_redaction.py` + asserts no constant or near-constant columns in the published + student_public task split (1% rare-class threshold on bundles + large enough for the threshold to be statistically meaningful). + +### Bundle column counts (v3) + +- `student_public/{intro,intermediate,advanced}`: **32** task split + columns (down from 34 in v2); **9** columns in `tables/leads.parquet` + (down from 12). +- `research_instructor/intermediate_instructor`: **34** task split + columns (down from 35); **11** columns in `tables/leads.parquet` + (down from 12 — `is_mql` removed). + +### Open follow-up + +Issue #57 sub-item 1 remains open: event-aggregate features +(`touch_count`, `session_count`, `pricing_page_views`, ...) are still +computed over the same 90-day window the label resolves in. The +structural fix is a windowed snapshot rebuild and is deferred to its +own PR. + --- ## v1.0.0 — 2026-05-02 diff --git a/leadforge/api/bundle.py b/leadforge/api/bundle.py index bfb3874..4f2d1bb 100644 --- a/leadforge/api/bundle.py +++ b/leadforge/api/bundle.py @@ -59,6 +59,14 @@ def write_bundle( population = bundle.population world_graph = bundle.world_graph + # The redaction set comes from the canonical feature spec — the same + # source of truth the validator uses. It is applied uniformly to + # every published parquet file (relational tables AND task splits) so + # users doing feature engineering off the raw tables (per the + # README's "Option 3") cannot trivially reintroduce a redacted + # column by joining ``tables/leads.parquet`` to their feature set. + redacted = redacted_columns_for(config.exposure_mode) + # ------------------------------------------------------------------ # 1. Relational tables → tables/ # ------------------------------------------------------------------ @@ -68,17 +76,19 @@ def write_bundle( dfs = to_dataframes(result, population) table_row_counts: dict[str, int] = {} for table_name, df in dfs.items(): + if redacted: + cols_to_drop = [c for c in redacted if c in df.columns] + if cols_to_drop: + df = df.drop(columns=cols_to_drop) write_parquet(df, tables_dir / f"{table_name}.parquet") table_row_counts[table_name] = len(df) # ------------------------------------------------------------------ # 2. Snapshot + task splits → tasks/ # - # Apply exposure-mode redaction here (rather than in apply_exposure) - # so that the manifest's per-file SHA-256 hashes reflect the published - # column set without a post-write rewrite step. The redacted column - # set is derived from the canonical feature spec — the same source - # of truth the validator uses to check bundles. + # Same redaction rule applied to the snapshot DataFrame before the + # task splits are written, so manifest SHA-256 hashes reflect the + # published column set without a post-write rewrite step. # ------------------------------------------------------------------ snapshot = build_snapshot( result, @@ -87,7 +97,6 @@ def write_bundle( difficulty_params=config.difficulty_params, seed=config.seed, ) - redacted = redacted_columns_for(config.exposure_mode) if redacted: drop_cols = [c for c in redacted if c in snapshot.columns] if drop_cols: diff --git a/leadforge/render/manifests.py b/leadforge/render/manifests.py index d6bc408..2c71312 100644 --- a/leadforge/render/manifests.py +++ b/leadforge/render/manifests.py @@ -20,7 +20,15 @@ from leadforge.structure.graph import WorldGraph # Bump this whenever the bundle layout or manifest schema changes. -BUNDLE_SCHEMA_VERSION = "2" +# History: +# "1" — initial layout (pre-M8) +# "2" — M8 render layer: tables/, tasks/, dataset_card.md, +# feature_dictionary.csv, manifest.json structure +# "3" — issue #57 follow-up: ``is_mql`` removed from the canonical +# feature list (zero-variance); ``is_sql`` redacted in +# ``student_public`` mode (near-deterministic for non-conversion). +# ``manifest.redacted_columns`` was already added in PR #56. +BUNDLE_SCHEMA_VERSION = "3" # Manifest fields whose value is non-deterministic by design (wall-clock, # host metadata, etc.). Determinism checks must ignore these fields when diff --git a/leadforge/schema/entities.py b/leadforge/schema/entities.py index b20b8c7..60270e4 100644 --- a/leadforge/schema/entities.py +++ b/leadforge/schema/entities.py @@ -150,12 +150,15 @@ class LeadRow: "first_touch_channel": "string", "current_stage": "string", "owner_rep_id": "string", - "is_mql": "boolean", "is_sql": "boolean", "converted_within_90_days": "boolean", "conversion_timestamp": "string", } + # ``is_mql`` was removed in bundle schema v3 (issue #57). Every lead + # is initialised at MQL stage in ``simulation/population.py``, so the + # field was constant ``True`` and zero-variance across all bundles. + lead_id: str contact_id: str account_id: str @@ -164,7 +167,6 @@ class LeadRow: first_touch_channel: str current_stage: str owner_rep_id: str - is_mql: bool is_sql: bool converted_within_90_days: bool conversion_timestamp: str | None = None diff --git a/leadforge/schema/features.py b/leadforge/schema/features.py index 906011a..62ed045 100644 --- a/leadforge/schema/features.py +++ b/leadforge/schema/features.py @@ -145,17 +145,23 @@ class FeatureSpec: leakage_risk=True, redact_in_modes=frozenset({ExposureMode.student_public}), ), - FeatureSpec( - "is_mql", - "boolean", - "Whether the lead had achieved MQL status at snapshot date.", - "lead_meta", - ), + # Note: ``is_mql`` was removed from the canonical feature list (issue #57) + # because every lead is initialised at MQL stage in + # ``leadforge/simulation/population.py``, making the column constant + # ``True`` and zero-variance. The underlying ``LeadRow.is_mql`` field + # still lives on the relational ``leads.parquet`` table. FeatureSpec( "is_sql", "boolean", - "Whether the lead had achieved SQL status at snapshot date.", + "Whether the lead had achieved SQL status at snapshot date. " + "Strongly correlated with the label: the simulator only converts " + "non-SQL leads via a rare direct-conversion path, so " + "is_sql=False predicts non-conversion with very high probability " + "(P(conv | is_sql=False) ≈ 0.04 / 0.015 / 0.006 across difficulty " + "tiers). Redacted from student_public bundles.", "lead_meta", + leakage_risk=True, + redact_in_modes=frozenset({ExposureMode.student_public}), ), # -- Engagement features -- FeatureSpec( diff --git a/leadforge/simulation/engine.py b/leadforge/simulation/engine.py index f8af31b..2f6a347 100644 --- a/leadforge/simulation/engine.py +++ b/leadforge/simulation/engine.py @@ -407,7 +407,6 @@ def simulate_world( first_touch_channel=lead.first_touch_channel, current_stage=state.current_stage, owner_rep_id=lead.owner_rep_id, - is_mql=True, # all leads start at mql is_sql=is_sql, converted_within_90_days=label, conversion_timestamp=conv_ts, diff --git a/leadforge/simulation/population.py b/leadforge/simulation/population.py index 77bf260..1f6b6d3 100644 --- a/leadforge/simulation/population.py +++ b/leadforge/simulation/population.py @@ -368,7 +368,6 @@ def _generate_leads( first_touch_channel=lead_source, current_stage="mql", owner_rep_id=owner_rep_id, - is_mql=True, is_sql=False, converted_within_90_days=False, conversion_timestamp=None, diff --git a/leadforge/validation/drift.py b/leadforge/validation/drift.py index 4f9c951..3c9b18e 100644 --- a/leadforge/validation/drift.py +++ b/leadforge/validation/drift.py @@ -11,6 +11,7 @@ from typing import Any import pandas as pd +import pyarrow.parquet as pq from leadforge.core.serialization import load_json @@ -66,10 +67,17 @@ def check_cross_seed_stability(bundles: dict[int, Path]) -> list[str]: if len(df) > 0: rates[seed] = float(df[_LABEL_COLUMN].mean()) + # ``current_stage`` is redacted from ``leads.parquet`` in + # ``student_public`` mode (bundle schema v3+). Skip stage diversity + # collection in that case — degeneracy still surfaces via the + # conversion-rate spread checks below, and via ``check_realism`` on + # any ``research_instructor`` bundle that does carry the column. leads_path = bundle_path / "tables/leads.parquet" if leads_path.exists(): - leads = pd.read_parquet(leads_path, columns=["current_stage"]) - stage_counts[seed] = int(leads["current_stage"].nunique()) + schema_names = set(pq.read_schema(leads_path).names) + if "current_stage" in schema_names: + leads = pd.read_parquet(leads_path, columns=["current_stage"]) + stage_counts[seed] = int(leads["current_stage"].nunique()) # Check conversion rate spread — if one seed's rate is 5x another's, that's suspicious if len(rates) >= 2: diff --git a/leadforge/validation/invariants.py b/leadforge/validation/invariants.py index 0de63bc..58799e7 100644 --- a/leadforge/validation/invariants.py +++ b/leadforge/validation/invariants.py @@ -203,11 +203,42 @@ def check_exposure_monotonicity(student_bundle: Path, instructor_bundle: Path) - if extra_in_instructor: errors.append(f"Tables in instructor but not student: {sorted(extra_in_instructor)}") + expected_redacted = redacted_columns_for(ExposureMode.student_public) for table in sorted(student_tables & instructor_tables): - s_sha = file_sha256(student_bundle / "tables" / table) - i_sha = file_sha256(instructor_bundle / "tables" / table) - if s_sha != i_sha: - errors.append(f"Table content mismatch: {table}") + s_path = student_bundle / "tables" / table + i_path = instructor_bundle / "tables" / table + if file_sha256(s_path) == file_sha256(i_path): + continue + # Mismatch is acceptable iff the only difference is redacted + # columns (same logic as for task splits below). + s_df = pd.read_parquet(s_path) + i_df = pd.read_parquet(i_path) + if len(s_df) != len(i_df): + errors.append( + f"Table {table}: row count mismatch student={len(s_df)} instructor={len(i_df)}" + ) + continue + s_cols = set(s_df.columns) + i_cols = set(i_df.columns) + extra_in_student = s_cols - i_cols + if extra_in_student: + errors.append( + f"Table {table}: student has columns missing from instructor: " + f"{sorted(extra_in_student)}" + ) + continue + diff = i_cols - s_cols + if not diff.issubset(expected_redacted): + errors.append( + f"Table {table}: instructor−student column diff {sorted(diff)} contains " + f"non-redacted columns (expected subset of {sorted(expected_redacted)})" + ) + continue + shared = [c for c in s_df.columns if c in i_df.columns] + s_shared = s_df[shared].reset_index(drop=True) + i_shared = i_df[shared].reset_index(drop=True) + if not s_shared.equals(i_shared): + errors.append(f"Table {table}: shared-column values differ between modes") # Both must have the same task splits with identical content student_tasks = ( diff --git a/leadforge/validation/realism.py b/leadforge/validation/realism.py index 0c52f2d..afa0187 100644 --- a/leadforge/validation/realism.py +++ b/leadforge/validation/realism.py @@ -131,12 +131,23 @@ def _check_feature_ranges(root: Path, manifest: dict[str, Any]) -> list[str]: def _check_stage_distribution(root: Path) -> list[str]: - """Check that leads span multiple funnel stages (not all stuck in one).""" + """Check that leads span multiple funnel stages (not all stuck in one). + + ``current_stage`` is redacted from the relational ``leads.parquet`` in + ``student_public`` mode (bundle schema v3 onward), so this check is a + no-op there — the underlying simulation is identical to the + ``research_instructor`` bundle, which still carries the column and will + surface a degenerate simulation through the same check. + """ errors: list[str] = [] leads_path = root / "tables/leads.parquet" if not leads_path.exists(): return errors + schema_names = set(pq.read_schema(leads_path).names) + if "current_stage" not in schema_names: + return errors + df = pd.read_parquet(leads_path, columns=["current_stage"]) if len(df) == 0: return errors diff --git a/release/HF_DATASET_CARD.md b/release/HF_DATASET_CARD.md index 42dd4fa..c01136f 100644 --- a/release/HF_DATASET_CARD.md +++ b/release/HF_DATASET_CARD.md @@ -77,7 +77,7 @@ df = pd.read_csv("hf://datasets/leadforge/leadforge-b2b-lead-scoring/intermediat | | Intro | Intermediate | Advanced | |---|---|---|---| | Leads | 5,000 | 5,000 | 5,000 | -| Features | 32 + 1 trap (+ 1 target) | 32 + 1 trap (+ 1 target) | 32 + 1 trap (+ 1 target) | +| Features | 30 + 1 trap (+ 1 target) | 30 + 1 trap (+ 1 target) | 30 + 1 trap (+ 1 target) | | Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` | | Conversion rate | 41.5% | 20.1% | 7.9% | | Signal strength | 0.90 | 0.70 | 0.50 | @@ -92,12 +92,13 @@ df = pd.read_csv("hf://datasets/leadforge/leadforge-b2b-lead-scoring/intermediat Each difficulty tier includes 9 Parquet tables under `tables/`: accounts, contacts, leads, touches, sessions, sales_activities, opportunities, customers, subscriptions. These form a normalized CRM schema linked by foreign keys. -## Leakage handling +## Leakage handling (bundle schema v3) -- **Stripped from public bundles:** `current_stage` directly encoded the label at the 90-day horizon (terminal stages `closed_won`/`closed_lost`). Removed in `student_public` mode; available in `intermediate_instructor/`. The `manifest.json` field `redacted_columns` lists what was stripped. +- **Stripped from public bundles:** `current_stage` (label-encoding at the 90-day horizon) and `is_sql` (P(conv | is_sql=False) ≈ 0.04 / 0.015 / 0.006 across tiers — near-deterministic for non-conversion). Both available in `intermediate_instructor/`. The `manifest.json` field `redacted_columns` lists what was stripped. +- **Removed entirely:** `is_mql` (constant `True` in the simulator — zero variance, no information). - **Deliberately retained as a pedagogical trap:** `total_touches_all` counts touches over the full 90-day window including post-snapshot events. Flagged `leakage_risk=True` in `feature_dictionary.csv`. Use it as an exercise — train with and without, compare AUC, explain the gap. -**Caveats:** event-aggregate features (`touch_count`, `session_count`, ...) are computed over the same 90-day window that the label resolves in, so they correlate with post-conversion events; `is_mql` is constant `True` in all bundles; `is_sql=False` is near-deterministic for non-conversion. A windowed-snapshot follow-up will address this structurally — see the package CHANGELOG. +**Caveat:** event-aggregate features (`touch_count`, `session_count`, ...) are computed over the same 90-day window the label resolves in, so they correlate with post-conversion events. A windowed-snapshot rebuild is the structural fix — see [issue #57](https://github.com/leadforge-dev/leadforge/issues/57). ## Research companion diff --git a/release/README.md b/release/README.md index 2cb7649..e174f49 100644 --- a/release/README.md +++ b/release/README.md @@ -106,7 +106,7 @@ leadforge generate \ | Leads | 5,000 | 5,000 | 5,000 | | Accounts | 1,500 | 1,500 | 1,500 | | Contacts | 4,200 | 4,200 | 4,200 | -| Columns | 34 (student_public) / 35 (instructor) | 34 / 35 | 34 / 35 | +| Columns | 32 (student_public) / 34 (instructor) | 32 / 34 | 32 / 34 | | Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` | | Conversion rate (target) | 30-45% | 18-28% | 8-15% | | Conversion rate (observed) | 41.5% | 20.1% | 7.9% | @@ -128,16 +128,22 @@ The sales funnel runs through inbound marketing (45%), SDR outbound (35%), and p Each bundle contains a `dataset_card.md` and a `feature_dictionary.csv` with the authoritative, auto-generated column list, descriptions, dtypes, and `leakage_risk` flags. Refer to those rather than mirroring counts here, which would drift. -**Leakage handling** +**Leakage handling (bundle schema v3)** -- `current_stage` -- at the 90-day horizon, contains terminal stages (`closed_won`/`closed_lost`) that encode the label directly. **Stripped from `student_public` bundles** by the exposure layer; available in `intermediate_instructor/` for research and DGP-aware evaluation. The `redacted_columns` field in `manifest.json` records what was stripped. -- `total_touches_all` -- counts touches over the full 90-day window, including post-snapshot events. **Deliberately retained** as a pedagogical trap (flagged `leakage_risk=True` in the dictionary). Use it as an exercise in leakage detection: train with and without it, compare AUC, then explain the gap. +Redaction in `student_public` bundles applies uniformly to **both** the task splits *and* the relational tables. Joining `tables/leads.parquet` back to your features cannot recover a redacted column. -**Known caveats** (see [PR #56](https://github.com/leadforge-dev/leadforge/pull/56) for the discussion): +| Column | Status in `student_public` | Status in `intermediate_instructor` | Why | +|---|---|---|---| +| `current_stage` | redacted (gone from task splits and `tables/leads.parquet`) | retained | At day 90 this contains terminal stages (`closed_won`/`closed_lost`) that encode the label directly. | +| `is_sql` | redacted | retained | `is_sql=False` predicts non-conversion with very high probability — measured across 5 seeds, P(conv \| is_sql=False) = 0.061 ± 0.026 (intro) / 0.020 ± 0.010 (intermediate) / 0.011 ± 0.004 (advanced). | +| `is_mql` | removed entirely (no mode has it) | removed entirely | Every lead is initialised at MQL stage in the simulator, so the field was constant `True` and carried no information. | +| `total_touches_all` | retained | retained | Deliberate pedagogical leakage trap (counts touches over the full 90-day window). Flagged `leakage_risk=True` in `feature_dictionary.csv`. Train with and without it, compare AUC, explain the gap. | + +The `redacted_columns` field in each bundle's `manifest.json` records exactly what was stripped. + +**Known caveats** (tracked in [issue #57](https://github.com/leadforge-dev/leadforge/issues/57)): -- All event-aggregate features (`touch_count`, `session_count`, `pricing_page_views`, ...) are computed over the same 90-day window in which the label resolves. They correlate with post-conversion events and are not yet structurally leakage-free. Stripping `current_stage` removes the most blatant deterministic leak; a windowed-snapshot follow-up is the structural fix. -- `is_mql` is constant `True` across all leads in the current bundles (zero variance). -- `is_sql=False` is near-deterministic for non-conversion (~3.8% / 1.5% / 0.6% conversion rate at intro / intermediate / advanced). +- All event-aggregate features (`touch_count`, `session_count`, `pricing_page_views`, ...) are computed over the same 90-day window in which the label resolves. They correlate with post-conversion events. The structural fix is a windowed snapshot rebuild — open follow-up. ## Research companion diff --git a/tests/exposure/test_redaction.py b/tests/exposure/test_redaction.py index 4e67a62..18852ff 100644 --- a/tests/exposure/test_redaction.py +++ b/tests/exposure/test_redaction.py @@ -121,6 +121,51 @@ def test_validate_bundle_passes(self, bundle: Path) -> None: redaction_errors = [e for e in errors if "redacted columns" in e] assert redaction_errors == [] + def test_no_zero_variance_features(self, bundle: Path) -> None: + """Guard against constant or near-constant columns regressing into + the bundle. + + ``is_mql`` was constant ``True`` and is caught by the strict + ``nunique >= 2`` assertion that runs on every bundle size. + + For larger bundles (≥ 200 rows), additionally guard against the + weaker case of a column whose rarest low-cardinality value + appears in fewer than 1% of rows — practically zero-variance for + modelling. Skipped on the tiny test fixtures because the 1% + threshold is below 2 rows there and the test would false-positive + on legitimate small-sample sparsity. + + ID columns, the timestamp, and the target are exempt — IDs are + unique by design, the timestamp varies trivially, and the target's + variance is the dataset's purpose. + """ + df = pd.read_parquet(bundle / "tasks/converted_within_90_days/train.parquet") + exempt = {"account_id", "contact_id", "lead_id", "lead_created_at"} + target = next(f.name for f in LEAD_SNAPSHOT_FEATURES if f.is_target) + exempt.add(target) + n = len(df) + # 1% of rows, at least 2. Only enforced on bundles large enough + # for the threshold to be statistically meaningful. + check_rare_class = n >= 200 + min_rare_count = max(2, n // 100) + + for col in df.columns: + if col in exempt: + continue + counts = df[col].value_counts(dropna=False) + assert len(counts) >= 2, ( + f"feature {col!r} has zero variance in the published " + f"student_public bundle ({len(counts)} distinct value)" + ) + if check_rare_class and len(counts) <= 5: + rarest_count = int(counts.min()) + assert rarest_count >= min_rare_count, ( + f"feature {col!r} is near-constant in the published " + f"student_public bundle: rarest value appears " + f"{rarest_count} times in {n} rows " + f"(threshold {min_rare_count})" + ) + # --------------------------------------------------------------------------- # End-to-end: research_instructor keeps everything diff --git a/tests/render/test_bundle_schema_v3_contract.py b/tests/render/test_bundle_schema_v3_contract.py new file mode 100644 index 0000000..2135bde --- /dev/null +++ b/tests/render/test_bundle_schema_v3_contract.py @@ -0,0 +1,184 @@ +"""Schema contract test for ``bundle_schema_version == "3"``. + +The constants below are an *intentional* duplication of the column sets +the bundle writer produces. The duplication is the point: any change to +``LEAD_SNAPSHOT_FEATURES``, ``LeadRow``, or the redaction policy that +also changes the published column set must update this contract. A bare +"add a new feature" PR that touches the spec but not this file will fail +here, forcing the author to either update the contract (and bump +``BUNDLE_SCHEMA_VERSION``) or revisit the change. + +If you find yourself wondering "do I have to update this?": yes. That +is the failure mode this test is designed to catch. +""" + +from __future__ import annotations + +import json +from pathlib import Path + +import pyarrow.parquet as pq +import pytest + +from leadforge.api.generator import Generator + +# Pinned column lists for bundle schema v3. Update *together* with +# ``BUNDLE_SCHEMA_VERSION`` and ``LEAD_SNAPSHOT_FEATURES``. + +V3_TASK_COLUMNS_STUDENT_PUBLIC: frozenset[str] = frozenset( + { + "account_id", + "industry", + "region", + "employee_band", + "estimated_revenue_band", + "process_maturity_band", + "contact_id", + "role_function", + "seniority", + "buyer_role", + "lead_id", + "lead_created_at", + "lead_source", + "first_touch_channel", + "touch_count", + "inbound_touch_count", + "outbound_touch_count", + "session_count", + "pricing_page_views", + "demo_page_views", + "total_session_duration_seconds", + "touches_week_1", + "touches_last_7_days", + "days_since_first_touch", + "activity_count", + "days_since_last_touch", + "opportunity_created", + "has_open_opportunity", + "opportunity_estimated_acv", + "expected_acv", + "total_touches_all", + "converted_within_90_days", + } +) + +V3_TASK_COLUMNS_RESEARCH_INSTRUCTOR: frozenset[str] = V3_TASK_COLUMNS_STUDENT_PUBLIC | { + "current_stage", + "is_sql", +} + +V3_LEAD_TABLE_COLUMNS_STUDENT_PUBLIC: frozenset[str] = frozenset( + { + "lead_id", + "contact_id", + "account_id", + "lead_created_at", + "lead_source", + "first_touch_channel", + # ``current_stage``, ``is_sql`` redacted in student_public + "owner_rep_id", + "converted_within_90_days", + "conversion_timestamp", + } +) + +V3_LEAD_TABLE_COLUMNS_RESEARCH_INSTRUCTOR: frozenset[str] = V3_LEAD_TABLE_COLUMNS_STUDENT_PUBLIC | { + "current_stage", + "is_sql", +} + +_SMALL = {"n_leads": 30, "n_accounts": 15, "n_contacts": 45} + + +def _build(mode: str, out: Path, seed: int = 42) -> None: + gen = Generator.from_recipe("b2b_saas_procurement_v1", seed=seed, exposure_mode=mode) + gen.generate(**_SMALL).save(str(out)) + + +@pytest.fixture(scope="module") +def student_bundle(tmp_path_factory: pytest.TempPathFactory) -> Path: + out = tmp_path_factory.mktemp("v3_student") + _build("student_public", out) + return out + + +@pytest.fixture(scope="module") +def instructor_bundle(tmp_path_factory: pytest.TempPathFactory) -> Path: + out = tmp_path_factory.mktemp("v3_instructor") + _build("research_instructor", out) + return out + + +def _task_cols(bundle: Path) -> frozenset[str]: + return frozenset(pq.read_schema(bundle / "tasks/converted_within_90_days/train.parquet").names) + + +def _leads_cols(bundle: Path) -> frozenset[str]: + return frozenset(pq.read_schema(bundle / "tables/leads.parquet").names) + + +def test_manifest_declares_v3(student_bundle: Path, instructor_bundle: Path) -> None: + for b in (student_bundle, instructor_bundle): + manifest = json.loads((b / "manifest.json").read_text()) + assert manifest["bundle_schema_version"] == "3", ( + f"{b.name}: bundle_schema_version is {manifest['bundle_schema_version']!r}, " + "expected '3'" + ) + + +def test_student_public_task_columns_match_v3_contract(student_bundle: Path) -> None: + actual = _task_cols(student_bundle) + assert actual == V3_TASK_COLUMNS_STUDENT_PUBLIC, ( + f"student_public task split columns drifted from v3 contract.\n" + f" unexpected: {sorted(actual - V3_TASK_COLUMNS_STUDENT_PUBLIC)}\n" + f" missing: {sorted(V3_TASK_COLUMNS_STUDENT_PUBLIC - actual)}\n" + " → either update tests/render/test_bundle_schema_v3_contract.py and " + "bump BUNDLE_SCHEMA_VERSION, or revert the schema change." + ) + + +def test_research_instructor_task_columns_match_v3_contract(instructor_bundle: Path) -> None: + actual = _task_cols(instructor_bundle) + assert actual == V3_TASK_COLUMNS_RESEARCH_INSTRUCTOR, ( + f"research_instructor task split columns drifted from v3 contract.\n" + f" unexpected: {sorted(actual - V3_TASK_COLUMNS_RESEARCH_INSTRUCTOR)}\n" + f" missing: {sorted(V3_TASK_COLUMNS_RESEARCH_INSTRUCTOR - actual)}" + ) + + +def test_student_public_leads_table_columns_match_v3_contract(student_bundle: Path) -> None: + actual = _leads_cols(student_bundle) + assert actual == V3_LEAD_TABLE_COLUMNS_STUDENT_PUBLIC, ( + f"student_public tables/leads.parquet columns drifted from v3 contract.\n" + f" unexpected: {sorted(actual - V3_LEAD_TABLE_COLUMNS_STUDENT_PUBLIC)}\n" + f" missing: {sorted(V3_LEAD_TABLE_COLUMNS_STUDENT_PUBLIC - actual)}" + ) + + +def test_research_instructor_leads_table_columns_match_v3_contract( + instructor_bundle: Path, +) -> None: + actual = _leads_cols(instructor_bundle) + assert actual == V3_LEAD_TABLE_COLUMNS_RESEARCH_INSTRUCTOR, ( + f"research_instructor tables/leads.parquet columns drifted from v3 contract.\n" + f" unexpected: {sorted(actual - V3_LEAD_TABLE_COLUMNS_RESEARCH_INSTRUCTOR)}\n" + f" missing: {sorted(V3_LEAD_TABLE_COLUMNS_RESEARCH_INSTRUCTOR - actual)}" + ) + + +def test_student_public_redacted_set_in_manifest(student_bundle: Path) -> None: + manifest = json.loads((student_bundle / "manifest.json").read_text()) + assert set(manifest["redacted_columns"]) == {"current_stage", "is_sql"} + + +def test_research_instructor_redacted_set_in_manifest(instructor_bundle: Path) -> None: + manifest = json.loads((instructor_bundle / "manifest.json").read_text()) + assert manifest["redacted_columns"] == [] + + +def test_is_mql_absent_from_all_v3_artifacts(student_bundle: Path, instructor_bundle: Path) -> None: + """``is_mql`` was removed entirely in v3 (issue #57): not in the snapshot, + not in the relational table, not in the feature dictionary, not anywhere.""" + for b in (student_bundle, instructor_bundle): + assert "is_mql" not in _task_cols(b) + assert "is_mql" not in _leads_cols(b) diff --git a/tests/schema/test_entities.py b/tests/schema/test_entities.py index e3d6f2a..61e0c4e 100644 --- a/tests/schema/test_entities.py +++ b/tests/schema/test_entities.py @@ -57,7 +57,6 @@ def _make_lead() -> LeadRow: first_touch_channel="organic_search", current_stage="mql", owner_rep_id="rep_000001", - is_mql=True, is_sql=False, converted_within_90_days=False, conversion_timestamp=None, @@ -111,7 +110,7 @@ def test_empty_dataframe_has_zero_rows(cls: type) -> None: def test_lead_empty_dataframe_boolean_columns() -> None: df = LeadRow.empty_dataframe() - assert str(df["is_mql"].dtype) == "boolean" + assert str(df["is_sql"].dtype) == "boolean" assert str(df["converted_within_90_days"].dtype) == "boolean" @@ -145,7 +144,7 @@ def test_lead_rows_parquet_roundtrip(tmp_path: Path) -> None: write_parquet(df, path) restored = read_parquet(path) assert restored["lead_id"].iloc[0] == "lead_000001" - assert bool(restored["is_mql"].iloc[0]) is True + assert bool(restored["is_sql"].iloc[0]) is False # --------------------------------------------------------------------------- diff --git a/tests/simulation/test_engine.py b/tests/simulation/test_engine.py index aee1b3b..073b391 100644 --- a/tests/simulation/test_engine.py +++ b/tests/simulation/test_engine.py @@ -177,9 +177,16 @@ def test_converted_within_90_days_is_bool(self) -> None: result = _run_sim() assert all(isinstance(row.converted_within_90_days, bool) for row in result.leads) - def test_all_leads_are_mql(self) -> None: + def test_no_lead_is_initialised_pre_mql(self) -> None: + """All leads enter the simulation at MQL stage; pre-MQL stages don't + appear. ``is_mql`` was removed from the schema in v3 because of + this invariant — but the invariant itself still holds.""" result = _run_sim() - assert all(row.is_mql for row in result.leads) + # Stages that would mean "not yet MQL" — none should be present in + # initial population, nor in final state because the funnel only + # advances forward. + pre_mql_stages = {"awareness", "interest"} + assert not any(row.current_stage in pre_mql_stages for row in result.leads) def test_converted_leads_have_timestamp(self) -> None: result = _run_sim(n_leads=100) diff --git a/tests/simulation/test_population.py b/tests/simulation/test_population.py index 9fe017a..45c82c7 100644 --- a/tests/simulation/test_population.py +++ b/tests/simulation/test_population.py @@ -200,7 +200,6 @@ def test_lead_initial_stage_is_mql() -> None: result = _make_result() for lead in result.leads: assert lead.current_stage == "mql" - assert lead.is_mql is True assert lead.is_sql is False assert lead.converted_within_90_days is False assert lead.conversion_timestamp is None diff --git a/tests/validation/test_realism.py b/tests/validation/test_realism.py index e77cd0d..4b591ce 100644 --- a/tests/validation/test_realism.py +++ b/tests/validation/test_realism.py @@ -88,12 +88,22 @@ def test_detects_non_boolean_feature(self, tmp_path: Path, bundle_dir: Path) -> manifest = load_json(corrupt / "manifest.json") train_path = corrupt / "tasks/converted_within_90_days/train.parquet" df = pd.read_parquet(train_path) + # Pick the first non-target boolean column at test time so this + # test self-heals when feature names change. Falls back gracefully + # if the spec ever has zero non-target booleans (currently impossible). + from leadforge.schema.features import LEAD_SNAPSHOT_FEATURES + + bool_col = next( + f.name + for f in LEAD_SNAPSHOT_FEATURES + if f.dtype == "boolean" and not f.is_target and f.name in df.columns + ) # Replace boolean column with a string — clearly not boolean dtype. - df["is_mql"] = "yes" + df[bool_col] = "yes" df.to_parquet(train_path) errors = check_realism(corrupt, manifest) - assert any("non-boolean dtype" in e and "is_mql" in e for e in errors) + assert any("non-boolean dtype" in e and bool_col in e for e in errors) def test_detects_single_stage(self, tmp_path: Path, bundle_dir: Path) -> None: corrupt = tmp_path / "one_stage"