12 changes: 5 additions & 7 deletions .agent-plan.md
@@ -66,15 +66,13 @@ First public dataset release: `leadforge-b2b-lead-scoring`. Three difficulty tie

Deterministic leak fixed via exposure-layer redaction. `FeatureSpec` now carries an explicit `redact_in_modes: frozenset[ExposureMode]` field — *prescriptive* — alongside the descriptive `leakage_risk` flag. `current_stage` is marked `redact_in_modes={ExposureMode.student_public}`; the writer queries `redacted_columns_for(mode)` and strips matching columns from the snapshot, task splits, and feature dictionary before they hit disk. The pedagogical trap `total_touches_all` is preserved in all modes (no entry in `redact_in_modes`). The manifest records `redacted_columns: [...]` so the bundle is self-describing. `validate_bundle()` cross-checks parquet schemas, feature dictionary, and the manifest's declared redaction set against `redacted_columns_for(mode)` derived independently from the feature spec. Hash-determinism preserved (73/73 identical across builds).
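
As a rough illustration of the prescriptive-flag design described above, the sketch below models a feature spec with a `redact_in_modes` set and derives the per-mode redaction set from it. It is a simplified stand-in, not the leadforge implementation: the `name` field, the three-entry `FEATURES` tuple, and the `strip_redacted` helper are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class ExposureMode(Enum):
    student_public = "student_public"
    research_instructor = "research_instructor"


@dataclass(frozen=True)
class FeatureSpec:
    name: str  # hypothetical field name for this sketch
    leakage_risk: bool = False  # descriptive: documents a risk, changes nothing
    redact_in_modes: frozenset[ExposureMode] = frozenset()  # prescriptive


# Toy feature list (assumption): one redacted column, one retained trap,
# one ordinary feature.
FEATURES = (
    FeatureSpec(
        "current_stage",
        leakage_risk=True,
        redact_in_modes=frozenset({ExposureMode.student_public}),
    ),
    FeatureSpec("total_touches_all", leakage_risk=True),  # trap, never redacted
    FeatureSpec("touch_count"),
)


def redacted_columns_for(mode: ExposureMode) -> frozenset[str]:
    """Names of all features whose spec asks for redaction in ``mode``."""
    return frozenset(f.name for f in FEATURES if mode in f.redact_in_modes)


def strip_redacted(columns: list[str], mode: ExposureMode) -> list[str]:
    """Hypothetical helper: drop redacted names, preserving column order."""
    redacted = redacted_columns_for(mode)
    return [c for c in columns if c not in redacted]
```

Because `leakage_risk` and `redact_in_modes` are independent, a column such as `total_touches_all` can be flagged as risky in the feature dictionary while still being published in every mode.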

### Follow-up: structural leakage in `student_public` bundles (open)
### Follow-up: structural leakage in `student_public` bundles (issue #57)

Stripping `current_stage` addresses the deterministic label-encoding leak but does **not** make the released bundle structurally leakage-free. Three concerns to address in a follow-up PR:
Tracked in [GitHub issue #57](https://github.com/leadforge-dev/leadforge/issues/57).

1. **Event-aggregate features are computed over the label window.** `touch_count`, `session_count`, `pricing_page_views`, `expected_acv`, `days_since_last_touch`, etc. all aggregate events in `[lead_created_at, lead_created_at + 90d]`, the same window over which the label resolves. They correlate with post-conversion activity. The structural fix is a windowed snapshot (`snapshot_day=N` with `N < label_window_days`), as v6/v7 datasets already do at day 14/20. This shifts every feature value and every conversion rate in the release bundles, so it's deferred to its own PR with a coordinated documentation update.
2. **`is_sql=False` is near-deterministic for non-conversion.** Measured on the regenerated bundle: P(converted | is_sql=False) = 0.038 (intro), 0.015 (intermediate), 0.006 (advanced). At advanced tier it effectively encodes the negative class. Either redact `is_sql` in `student_public` (probably correct) or accept it as a strong feature with documentation. Decide alongside #1.
3. **`is_mql` is a constant `True`.** Zero variance feature in all three tiers. Should be removed from the snapshot or, if it can ever be False under some recipe, the simulator should produce that variance.

Suggested action: open one tracked GitHub issue covering all three (currently no issue exists; user has standing instruction not to file without confirmation).
1. **Event-aggregate features are computed over the label window.** `touch_count`, `session_count`, `pricing_page_views`, `expected_acv`, `days_since_last_touch`, etc. all aggregate events in `[lead_created_at, lead_created_at + 90d]`, the same window over which the label resolves. The structural fix is a windowed snapshot (`snapshot_day=N` with `N < label_window_days`), as v6/v7 datasets already do at day 14/20. **Open** — its own PR with documentation recalibration; will likely bump `BUNDLE_SCHEMA_VERSION` again.
2. ~~**`is_sql=False` is near-deterministic for non-conversion.** Measured on the regenerated bundle: P(converted | is_sql=False) = 0.038 (intro), 0.015 (intermediate), 0.006 (advanced).~~ **Resolved** — `is_sql` redacted in `student_public` mode by post-#57 PR (bundle schema v3).
3. ~~**`is_mql` is a constant `True`.** Zero variance feature in all three tiers.~~ **Resolved** — `is_mql` removed from the canonical feature list by post-#57 PR (bundle schema v3). Guarded by a new `test_no_zero_variance_features` check.

---

54 changes: 54 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,60 @@ Format inspired by [Keep a Changelog](https://keepachangelog.com/).

## Unreleased

### Bundle schema v3

`bundle_schema_version` bumped from `"2"` to `"3"`. Three structural
changes follow up on PR #56 (issue #57):

- **`is_mql` fully removed.** Every lead is initialised at MQL stage in
the simulator, making the field constant `True` and zero-variance.
It carried no information for modelling and is now removed from the
`LeadRow` entity, the relational `leads.parquet`, the snapshot, the
task splits, and the feature dictionary — in all exposure modes.
- **`is_sql` redacted in `student_public` mode.** Measured across 5
seeds on full-size bundles: P(converted | is_sql=False) =
0.061 ± 0.026 (intro) / 0.020 ± 0.010 (intermediate) /
0.011 ± 0.004 (advanced). At advanced tier this is essentially
deterministic for the negative class — practically a one-rule
classifier. `is_sql` remains in `research_instructor` exports for
DGP-aware research.
- **Redaction now applies to relational tables too.** In v2, the
exposure-layer redaction only stripped columns from the snapshot /
task splits; users following the README's "Option 3" (feature
engineering off the raw `tables/leads.parquet`) could trivially
rejoin redacted columns. In v3, `redacted_columns_for(mode)` is
applied uniformly to every published parquet under both `tables/`
and `tasks/`. In `student_public` bundles, `tables/leads.parquet`
no longer carries `current_stage` or `is_sql`.
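
The `is_sql` figures above are per-tier conditional probabilities summarised across seeds. A minimal sketch of that kind of measurement follows, assuming the `±` denotes a sample standard deviation across seeds (the changelog does not say). The helper name and the toy rows are hypothetical and are not taken from any leadforge bundle.

```python
from statistics import mean, stdev


def p_converted_given_not_sql(rows):
    """Estimate P(converted | is_sql=False) from (is_sql, converted) pairs."""
    outcomes = [converted for is_sql, converted in rows if not is_sql]
    if not outcomes:
        raise ValueError("no rows with is_sql=False")
    return sum(outcomes) / len(outcomes)


# Toy per-seed data (illustrative only, NOT measured on real bundles).
toy_bundles = {
    0: [(False, False)] * 96 + [(False, True)] * 4 + [(True, True)] * 50,
    1: [(False, False)] * 98 + [(False, True)] * 2 + [(True, True)] * 50,
    2: [(False, False)] * 97 + [(False, True)] * 3 + [(True, False)] * 50,
}

per_seed = {seed: p_converted_given_not_sql(rows) for seed, rows in toy_bundles.items()}
# Mean and sample standard deviation across seeds.
summary = (mean(list(per_seed.values())), stdev(list(per_seed.values())))
```

A value close to zero means that `is_sql=False` alone almost determines the negative class, which is the argument for redacting the column in `student_public` mode.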

### New automated checks

- `tests/render/test_bundle_schema_v3_contract.py` pins the exact
column set per mode for v3 — any future change that touches the
feature spec or redaction policy without updating the contract
fails this test, forcing an explicit version coordination.
- `test_no_zero_variance_features` in `tests/exposure/test_redaction.py`
asserts no constant or near-constant columns in the published
student_public task split (1% rare-class threshold on bundles
large enough for the threshold to be statistically meaningful).
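
To make the zero-variance check concrete, here is a minimal sketch of a near-constant column detector with a 1% rare-class threshold. It is not the code of `test_no_zero_variance_features`: the function name, the list-of-dicts row format, and the exact way the threshold is applied are assumptions for illustration.

```python
from collections import Counter


def near_constant_columns(rows, min_rare_fraction=0.01):
    """Return sorted column names where all values outside the most common
    one together make up less than ``min_rare_fraction`` of the rows."""
    if not rows:
        return []
    n = len(rows)
    flagged = []
    for col in rows[0]:
        counts = Counter(row[col] for row in rows)
        most_common_count = counts.most_common(1)[0][1]
        if n - most_common_count < min_rare_fraction * n:
            flagged.append(col)
    return sorted(flagged)


# Toy data (illustrative only): one constant column, one healthy column,
# and one column whose rare class covers 0.5% of rows.
toy_rows = [
    {"is_mql": True, "is_sql": i < 60, "rare_flag": i == 0}
    for i in range(200)
]
flagged = near_constant_columns(toy_rows)
```

On very small bundles a 1% threshold can be crossed by chance, which is presumably why the real check only runs on bundles large enough for the threshold to be meaningful.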

### Bundle column counts (v3)

- `student_public/{intro,intermediate,advanced}`: **32** task split
columns (down from 34 in v2); **9** columns in `tables/leads.parquet`
(down from 12).
- `research_instructor/intermediate_instructor`: **34** task split
columns (down from 35); **11** columns in `tables/leads.parquet`
(down from 12 — `is_mql` removed).

### Open follow-up

Issue #57 sub-item 1 remains open: event-aggregate features
(`touch_count`, `session_count`, `pricing_page_views`, ...) are still
computed over the same 90-day window the label resolves in. The
structural fix is a windowed snapshot rebuild and is deferred to its
own PR.
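
The difference between the current full-window aggregate and a windowed snapshot can be sketched as follows. This is only an illustration of the idea: `windowed_touch_count` and the toy timestamps are hypothetical, and the real fix would live in the snapshot builder rather than in a standalone helper.

```python
from datetime import datetime, timedelta

LABEL_WINDOW_DAYS = 90  # the label resolves within 90 days of lead creation


def windowed_touch_count(created_at, touch_times, snapshot_day):
    """Count touches in [created_at, created_at + snapshot_day days].

    With snapshot_day == LABEL_WINDOW_DAYS this reproduces the current
    full-window aggregate; a smaller snapshot_day excludes later events.
    """
    if not 0 < snapshot_day <= LABEL_WINDOW_DAYS:
        raise ValueError("snapshot_day must be in (0, LABEL_WINDOW_DAYS]")
    cutoff = created_at + timedelta(days=snapshot_day)
    return sum(1 for t in touch_times if created_at <= t <= cutoff)


# Toy lead (illustrative only): touches on days 2, 10, 30 and 75.
created = datetime(2026, 1, 1)
touches = [created + timedelta(days=d) for d in (2, 10, 30, 75)]

full_window = windowed_touch_count(created, touches, LABEL_WINDOW_DAYS)
day_14 = windowed_touch_count(created, touches, 14)
```

Any touch that happens after the snapshot day, including activity that follows a conversion, no longer contributes to the feature, which is why every feature value and conversion-rate table would need to be regenerated.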

---

## v1.0.0 — 2026-05-02
21 changes: 15 additions & 6 deletions leadforge/api/bundle.py
@@ -59,6 +59,14 @@ def write_bundle(
population = bundle.population
world_graph = bundle.world_graph

# The redaction set comes from the canonical feature spec — the same
# source of truth the validator uses. It is applied uniformly to
# every published parquet file (relational tables AND task splits) so
# users doing feature engineering off the raw tables (per the
# README's "Option 3") cannot trivially reintroduce a redacted
# column by joining ``tables/leads.parquet`` to their feature set.
redacted = redacted_columns_for(config.exposure_mode)

# ------------------------------------------------------------------
# 1. Relational tables → tables/
# ------------------------------------------------------------------
@@ -68,17 +76,19 @@ def write_bundle(
dfs = to_dataframes(result, population)
table_row_counts: dict[str, int] = {}
for table_name, df in dfs.items():
if redacted:
cols_to_drop = [c for c in redacted if c in df.columns]
if cols_to_drop:
df = df.drop(columns=cols_to_drop)
write_parquet(df, tables_dir / f"{table_name}.parquet")
table_row_counts[table_name] = len(df)

# ------------------------------------------------------------------
# 2. Snapshot + task splits → tasks/
#
# Apply exposure-mode redaction here (rather than in apply_exposure)
# so that the manifest's per-file SHA-256 hashes reflect the published
# column set without a post-write rewrite step. The redacted column
# set is derived from the canonical feature spec — the same source
# of truth the validator uses to check bundles.
# Same redaction rule applied to the snapshot DataFrame before the
# task splits are written, so manifest SHA-256 hashes reflect the
# published column set without a post-write rewrite step.
# ------------------------------------------------------------------
snapshot = build_snapshot(
result,
@@ -87,7 +97,6 @@ def write_bundle(
difficulty_params=config.difficulty_params,
seed=config.seed,
)
redacted = redacted_columns_for(config.exposure_mode)
if redacted:
drop_cols = [c for c in redacted if c in snapshot.columns]
if drop_cols:
10 changes: 9 additions & 1 deletion leadforge/render/manifests.py
@@ -20,7 +20,15 @@

# Bump this whenever the bundle layout or manifest schema changes.
BUNDLE_SCHEMA_VERSION = "2"
# History:
# "1" — initial layout (pre-M8)
# "2" — M8 render layer: tables/, tasks/, dataset_card.md,
# feature_dictionary.csv, manifest.json structure
# "3" — issue #57 follow-up: ``is_mql`` removed from the canonical
# feature list (zero-variance); ``is_sql`` redacted in
# ``student_public`` mode (near-deterministic for non-conversion);
# redaction now also covers the relational ``tables/`` parquet files.
# ``manifest.redacted_columns`` was already added in PR #56.
BUNDLE_SCHEMA_VERSION = "3"

# Manifest fields whose value is non-deterministic by design (wall-clock,
# host metadata, etc.). Determinism checks must ignore these fields when
6 changes: 4 additions & 2 deletions leadforge/schema/entities.py
@@ -150,12 +150,15 @@ class LeadRow:
"first_touch_channel": "string",
"current_stage": "string",
"owner_rep_id": "string",
"is_mql": "boolean",
"is_sql": "boolean",
"converted_within_90_days": "boolean",
"conversion_timestamp": "string",
}

# ``is_mql`` was removed in bundle schema v3 (issue #57). Every lead
# is initialised at MQL stage in ``simulation/population.py``, so the
# field was constant ``True`` and zero-variance across all bundles.

lead_id: str
contact_id: str
account_id: str
@@ -164,7 +167,6 @@ class LeadRow:
first_touch_channel: str
current_stage: str
owner_rep_id: str
is_mql: bool
is_sql: bool
converted_within_90_days: bool
conversion_timestamp: str | None = None
20 changes: 13 additions & 7 deletions leadforge/schema/features.py
@@ -145,17 +145,23 @@ class FeatureSpec:
leakage_risk=True,
redact_in_modes=frozenset({ExposureMode.student_public}),
),
FeatureSpec(
"is_mql",
"boolean",
"Whether the lead had achieved MQL status at snapshot date.",
"lead_meta",
),
# Note: ``is_mql`` was removed from the canonical feature list (issue #57)
# because every lead is initialised at MQL stage in
# ``leadforge/simulation/population.py``, making the column constant
# ``True`` and zero-variance. The ``LeadRow.is_mql`` field was removed
# as well, so the relational ``leads.parquet`` no longer carries it.
FeatureSpec(
"is_sql",
"boolean",
"Whether the lead had achieved SQL status at snapshot date.",
"Whether the lead had achieved SQL status at snapshot date. "
"Strongly correlated with the label: the simulator only converts "
"non-SQL leads via a rare direct-conversion path, so "
"is_sql=False predicts non-conversion with very high probability "
"(P(conv | is_sql=False) ≈ 0.06 / 0.02 / 0.01 across intro / "
"intermediate / advanced tiers, measured across 5 seeds). "
"Redacted from student_public bundles.",
"lead_meta",
leakage_risk=True,
redact_in_modes=frozenset({ExposureMode.student_public}),
),
# -- Engagement features --
FeatureSpec(
1 change: 0 additions & 1 deletion leadforge/simulation/engine.py
@@ -407,7 +407,6 @@ def simulate_world(
first_touch_channel=lead.first_touch_channel,
current_stage=state.current_stage,
owner_rep_id=lead.owner_rep_id,
is_mql=True, # all leads start at mql
is_sql=is_sql,
converted_within_90_days=label,
conversion_timestamp=conv_ts,
1 change: 0 additions & 1 deletion leadforge/simulation/population.py
@@ -368,7 +368,6 @@ def _generate_leads(
first_touch_channel=lead_source,
current_stage="mql",
owner_rep_id=owner_rep_id,
is_mql=True,
is_sql=False,
converted_within_90_days=False,
conversion_timestamp=None,
12 changes: 10 additions & 2 deletions leadforge/validation/drift.py
@@ -11,6 +11,7 @@
from typing import Any

import pandas as pd
import pyarrow.parquet as pq

from leadforge.core.serialization import load_json

@@ -66,10 +67,17 @@ def check_cross_seed_stability(bundles: dict[int, Path]) -> list[str]:
if len(df) > 0:
rates[seed] = float(df[_LABEL_COLUMN].mean())

# ``current_stage`` is redacted from ``leads.parquet`` in
# ``student_public`` mode (bundle schema v3+). Skip stage diversity
# collection in that case — degeneracy still surfaces via the
# conversion-rate spread checks below, and via ``check_realism`` on
# any ``research_instructor`` bundle that does carry the column.
leads_path = bundle_path / "tables/leads.parquet"
if leads_path.exists():
leads = pd.read_parquet(leads_path, columns=["current_stage"])
stage_counts[seed] = int(leads["current_stage"].nunique())
schema_names = set(pq.read_schema(leads_path).names)
if "current_stage" in schema_names:
leads = pd.read_parquet(leads_path, columns=["current_stage"])
stage_counts[seed] = int(leads["current_stage"].nunique())

# Check conversion rate spread — if one seed's rate is 5x another's, that's suspicious
if len(rates) >= 2:
39 changes: 35 additions & 4 deletions leadforge/validation/invariants.py
@@ -203,11 +203,42 @@ def check_exposure_monotonicity(student_bundle: Path, instructor_bundle: Path) -
if extra_in_instructor:
errors.append(f"Tables in instructor but not student: {sorted(extra_in_instructor)}")

expected_redacted = redacted_columns_for(ExposureMode.student_public)
for table in sorted(student_tables & instructor_tables):
s_sha = file_sha256(student_bundle / "tables" / table)
i_sha = file_sha256(instructor_bundle / "tables" / table)
if s_sha != i_sha:
errors.append(f"Table content mismatch: {table}")
s_path = student_bundle / "tables" / table
i_path = instructor_bundle / "tables" / table
if file_sha256(s_path) == file_sha256(i_path):
continue
# Mismatch is acceptable iff the only difference is redacted
# columns (same logic as for task splits below).
s_df = pd.read_parquet(s_path)
i_df = pd.read_parquet(i_path)
if len(s_df) != len(i_df):
errors.append(
f"Table {table}: row count mismatch student={len(s_df)} instructor={len(i_df)}"
)
continue
s_cols = set(s_df.columns)
i_cols = set(i_df.columns)
extra_in_student = s_cols - i_cols
if extra_in_student:
errors.append(
f"Table {table}: student has columns missing from instructor: "
f"{sorted(extra_in_student)}"
)
continue
diff = i_cols - s_cols
if not diff.issubset(expected_redacted):
errors.append(
f"Table {table}: instructor−student column diff {sorted(diff)} contains "
f"non-redacted columns (expected subset of {sorted(expected_redacted)})"
)
continue
shared = [c for c in s_df.columns if c in i_df.columns]
s_shared = s_df[shared].reset_index(drop=True)
i_shared = i_df[shared].reset_index(drop=True)
if not s_shared.equals(i_shared):
errors.append(f"Table {table}: shared-column values differ between modes")

# Both must have the same task splits with identical content
student_tasks = (
13 changes: 12 additions & 1 deletion leadforge/validation/realism.py
@@ -131,12 +131,23 @@ def _check_feature_ranges(root: Path, manifest: dict[str, Any]) -> list[str]:


def _check_stage_distribution(root: Path) -> list[str]:
"""Check that leads span multiple funnel stages (not all stuck in one)."""
"""Check that leads span multiple funnel stages (not all stuck in one).

``current_stage`` is redacted from the relational ``leads.parquet`` in
``student_public`` mode (bundle schema v3 onward), so this check is a
no-op there — the underlying simulation is identical to the
``research_instructor`` bundle, which still carries the column and will
surface a degenerate simulation through the same check.
"""
errors: list[str] = []
leads_path = root / "tables/leads.parquet"
if not leads_path.exists():
return errors

schema_names = set(pq.read_schema(leads_path).names)
if "current_stage" not in schema_names:
return errors

df = pd.read_parquet(leads_path, columns=["current_stage"])
if len(df) == 0:
return errors
9 changes: 5 additions & 4 deletions release/HF_DATASET_CARD.md
@@ -77,7 +77,7 @@ df = pd.read_csv("hf://datasets/leadforge/leadforge-b2b-lead-scoring/intermediat
| | Intro | Intermediate | Advanced |
|---|---|---|---|
| Leads | 5,000 | 5,000 | 5,000 |
| Features | 32 + 1 trap (+ 1 target) | 32 + 1 trap (+ 1 target) | 32 + 1 trap (+ 1 target) |
| Features | 30 + 1 trap (+ 1 target) | 30 + 1 trap (+ 1 target) | 30 + 1 trap (+ 1 target) |
| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |
| Conversion rate | 41.5% | 20.1% | 7.9% |
| Signal strength | 0.90 | 0.70 | 0.50 |
@@ -92,12 +92,13 @@ df = pd.read_csv("hf://datasets/leadforge/leadforge-b2b-lead-scoring/intermediat

Each difficulty tier includes 9 Parquet tables under `tables/`: accounts, contacts, leads, touches, sessions, sales_activities, opportunities, customers, subscriptions. These form a normalized CRM schema linked by foreign keys.

## Leakage handling
## Leakage handling (bundle schema v3)

- **Stripped from public bundles:** `current_stage` directly encoded the label at the 90-day horizon (terminal stages `closed_won`/`closed_lost`). Removed in `student_public` mode; available in `intermediate_instructor/`. The `manifest.json` field `redacted_columns` lists what was stripped.
- **Stripped from public bundles:** `current_stage` (label-encoding at the 90-day horizon) and `is_sql` (P(conv | is_sql=False) ≈ 0.06 / 0.02 / 0.01 across intro / intermediate / advanced tiers, measured across 5 seeds; near-deterministic for non-conversion). Both are available in `intermediate_instructor/`. The `manifest.json` field `redacted_columns` lists what was stripped.
- **Removed entirely:** `is_mql` (constant `True` in the simulator — zero variance, no information).
- **Deliberately retained as a pedagogical trap:** `total_touches_all` counts touches over the full 90-day window including post-snapshot events. Flagged `leakage_risk=True` in `feature_dictionary.csv`. Use it as an exercise — train with and without, compare AUC, explain the gap.

**Caveats:** event-aggregate features (`touch_count`, `session_count`, ...) are computed over the same 90-day window that the label resolves in, so they correlate with post-conversion events; `is_mql` is constant `True` in all bundles; `is_sql=False` is near-deterministic for non-conversion. A windowed-snapshot follow-up will address this structurally — see the package CHANGELOG.
**Caveat:** event-aggregate features (`touch_count`, `session_count`, ...) are computed over the same 90-day window the label resolves in, so they correlate with post-conversion events. A windowed-snapshot rebuild is the structural fix — see [issue #57](https://github.com/leadforge-dev/leadforge/issues/57).

## Research companion
