leadforge-dev · shaypal5 · May 5, 2026
diff --git a/.agent-plan.md b/.agent-plan.md
@@ -20,9 +20,9 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
 **Companion docs:** `docs/release/v1_release_design.md`, `docs/release/v1_acceptance_gates.md`, `docs/release/post_v1_roadmap.md`
 **External review materials:** `docs/external_review/{gemini,chatgpt}/` (raw) + `docs/external_review/summaries/` (synthesized)
 
-### Phase 1 — Audit and naming
-- [ ] Reproduce relational-leakage finding on alpha bundles → `docs/release/v1_current_state_audit.md`
-- [ ] Lock dataset release name `leadforge-lead-scoring-v1`
+### Phase 1 — Audit and naming ✓ (PR 1.1)
+- [x] Reproduce relational-leakage finding on alpha bundles → `docs/release/v1_current_state_audit.md`: all three tiers reconstruct `converted_within_90_days` at 100% via paths A–E; LR/HistGBM AUC = 1.000 on join-derived features. Probe script: `scripts/probe_relational_leakage.py` (its function `deterministic_relational_reconstruction` is designed to be lifted into PR 3.1's `leadforge/validation/leakage_probes.py`).
+- [x] Lock dataset release name `leadforge-lead-scoring-v1` (already locked via PR #61's milestone rename + roadmap edits; G1.1 reaffirmed)
 
 ### Phase 2 — Snapshot-safe relational export
 - [ ] `leadforge/render/relational_snapshot_safe.py` (new)

diff --git a/docs/release/v1_acceptance_gates.md b/docs/release/v1_acceptance_gates.md
@@ -10,7 +10,7 @@ read by `scripts/validate_release_candidate.py` and by humans before tag.
 
 ## Naming and versioning gate
 
-- **G1.1** Dataset release name: `leadforge-lead-scoring-v1`. Locked in Phase 1.
+- **G1.1** Dataset release name: `leadforge-lead-scoring-v1`. Locked in Phase 1 (PR #61 milestone rename + roadmap edits; reaffirmed in PR 1.1's `docs/release/v1_current_state_audit.md`).
 - **G1.2** Kaggle slug: `leadforge-lead-scoring-v1`.
 - **G1.3** Hugging Face repo: `leadforge-lead-scoring-v1` (public family) and `leadforge-lead-scoring-v1-instructor` (companion).
 - **G1.4** Bundle `package_version` reflects the leadforge package at build time.

diff --git a/docs/release/v1_current_state_audit.md b/docs/release/v1_current_state_audit.md
@@ -0,0 +1,246 @@
+# v1 Current-State Audit — Relational Leakage in Alpha `student_public` Bundles
+
+**Phase:** PR 1.1 (Phase 1 of `v1_release_roadmap.md`)
+**Date:** 2026-05-05
+**Generated by:** `scripts/probe_relational_leakage.py` against `release/{intro,intermediate,advanced}/`
+**Status:** **BLOCKER CONFIRMED.** Public bundles fail G4.1, G4.2, G4.3, G4.5, and G4.6 in `v1_acceptance_gates.md`. G4.4 (snapshot-window timestamps) **passes empirically** on the alpha bundles; this is an important nuance, explained in §G4.4 below.
+**Structural fix:** PR 2.1 — `leadforge/render/relational_snapshot_safe.py` + `leadforge/validation/relational_leakage.py`.
+
+This document reproduces the relational-leakage finding from
+[`docs/external_review/summaries/chatgpt_v2_summary.md`](../external_review/summaries/chatgpt_v2_summary.md) §0
+on the actual 5000-lead alpha bundles.
+
+## TL;DR
+
+The public bundles leak `converted_within_90_days` through **two qualitatively different mechanisms** that the audit must distinguish:
+
+1. **The label is published in cleartext.** `leads.converted_within_90_days`
+   and `leads.conversion_timestamp` are present in every public tier.
+   Path A is "open the parquet, read the column" — not leakage *via joins*.
+2. **Joins to post-outcome entities reconstruct the label deterministically.**
+   `opportunities.close_outcome == "closed_won"`, plus the existence of
+   conversion-conditional `customers.parquet` and `subscriptions.parquet`
+   tables, each independently reconstruct the target at 100% accuracy. This is
+   the leakage the *roadmap-level* discussion is about, and it survives
+   even if Path A is patched first.
+
+Phase 2 must remove **both** mechanisms. Removing only the label columns
+(the easy part) leaves the join-only reconstruction (paths B/C/D) at 100%.
+
+## Method
+
+`scripts/probe_relational_leakage.py <bundle_dir>` reports four
+complementary pieces of evidence:
+
+1. **Deterministic reconstruction paths** (no model fit, just joins):
+
+   | Path | Description |
+   |---|---|
+   | A. Direct label read | `leads.converted_within_90_days` taken as the prediction. |
+   | B. Opportunity outcome | Lead has any `opportunities` row with `close_outcome == "closed_won"`. |
+   | C. Customer existence | Lead → opportunities → customers (any joined customer). |
+   | D. Subscription existence | Lead → opportunities → customers → subscriptions. |
+   | E. Deterministic OR (B ∨ C ∨ D) | Headline join-only reconstruction. |
+
+2. **Phase-2-success ablation.** Same deterministic probes after simulating
+   PR 2.1's redaction in-process (drop label columns from `leads`, drop
+   `close_outcome`/`closed_at` from `opportunities`, treat `customers` and
+   `subscriptions` as empty). Tells us what the post-fix probe should look
+   like, *before* PR 2.1 ships.
+
+3. **Bonus model probes.** 5-fold CV LR + HistGBM on join-derived
+   features, in two variants:
+   - `with_close_outcome_aggregates`: includes `any_closed_won` and
+     `any_closed` (`any_closed_won` is just Path B aggregated, so trivially perfect).
+   - `without_close_outcome_aggregates`: only `n_opps`, `max_acv`,
+     `mean_acv`, `n_customers`, `n_subscriptions`. The load-bearing variant —
+     answers "do the *non-trivial* relational features carry the leak
+     independently of `close_outcome`?"
+
+4. **Snapshot-window probe (G4.4).** Per event table, count rows with
+   `timestamp > lead_created_at + horizon_days`. Direct test of the
+   timestamp-bound invariant.
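The Phase-2-success ablation (item 2) only needs an in-memory view of the four tables. A minimal sketch, assuming pandas frames; the helper name `phase2_redacted_view` is hypothetical and not the script's actual API:

```python
import pandas as pd

# Columns this audit says PR 2.1 removes from the public tables.
LEAD_LABEL_COLUMNS = ["converted_within_90_days", "conversion_timestamp"]
OPP_OUTCOME_COLUMNS = ["close_outcome", "closed_at"]


def phase2_redacted_view(
    leads: pd.DataFrame,
    opportunities: pd.DataFrame,
    customers: pd.DataFrame,
    subscriptions: pd.DataFrame,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Hypothetical helper: simulate PR 2.1's redaction without mutating the inputs."""
    # drop() returns new frames; errors="ignore" tolerates already-redacted input.
    redacted_leads = leads.drop(columns=LEAD_LABEL_COLUMNS, errors="ignore")
    redacted_opps = opportunities.drop(columns=OPP_OUTCOME_COLUMNS, errors="ignore")
    # Keep the column schema but zero rows, so downstream joins still find their keys.
    no_customers = customers.iloc[0:0].copy()
    no_subscriptions = subscriptions.iloc[0:0].copy()
    return redacted_leads, redacted_opps, no_customers, no_subscriptions
```

Returning empty frames with the original schema, rather than `None`, lets the same probe code run unchanged on both the alpha bundle and the redacted view.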
+
+The deterministic reconstruction is implemented as a pure function
+`deterministic_relational_reconstruction(leads, opportunities, customers, subscriptions)`,
+designed to lift verbatim into `leadforge/validation/leakage_probes.py`
+(PR 3.1). The function refuses to operate on non-unique `lead_id` and
+accepts empty `customers`/`subscriptions` frames (Phase 2 success state).
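For orientation, here is a minimal pandas sketch of what such a function could look like. It is not the script's implementation: the join keys (`opportunities.lead_id`, `customers.opportunity_id`, `subscriptions.customer_id`) and the output column names are assumptions, since this audit does not list the table schemas.

```python
import pandas as pd


def deterministic_relational_reconstruction(
    leads: pd.DataFrame,
    opportunities: pd.DataFrame,
    customers: pd.DataFrame,
    subscriptions: pd.DataFrame,
) -> pd.DataFrame:
    """Sketch: one boolean prediction per path (A to E), indexed by lead_id.

    Assumed keys: opportunities.lead_id / opportunity_id,
    customers.opportunity_id / customer_id, subscriptions.customer_id.
    """
    if not leads["lead_id"].is_unique:
        raise ValueError("leads.lead_id must be unique")
    lead_ids = pd.Index(leads["lead_id"], name="lead_id")
    out = pd.DataFrame(index=lead_ids)

    # Path A: read the published label, if it is still present.
    if "converted_within_90_days" in leads.columns:
        out["A_direct"] = leads.set_index("lead_id")["converted_within_90_days"].astype(bool)
    else:
        out["A_direct"] = False

    # Path B: any opportunity closed as won.
    if "close_outcome" in opportunities.columns:
        won = opportunities.loc[opportunities["close_outcome"] == "closed_won", "lead_id"]
        out["B_opp_won"] = lead_ids.isin(won)
    else:
        out["B_opp_won"] = False

    # Path C: lead -> opportunities -> customers (an empty frame yields no hits).
    if customers.empty:
        links = pd.DataFrame(columns=["lead_id", "customer_id"])
    else:
        links = opportunities[["lead_id", "opportunity_id"]].merge(
            customers[["opportunity_id", "customer_id"]], on="opportunity_id"
        )
    out["C_customer"] = lead_ids.isin(links["lead_id"])

    # Path D: lead -> opportunities -> customers -> subscriptions.
    if subscriptions.empty or links.empty:
        out["D_subscription"] = False
    else:
        sub_links = links.merge(subscriptions[["customer_id"]].drop_duplicates(), on="customer_id")
        out["D_subscription"] = lead_ids.isin(sub_links["lead_id"])

    # Path E: the join-only headline reconstruction (B or C or D).
    out["E_join_or"] = out["B_opp_won"] | out["C_customer"] | out["D_subscription"]
    return out
```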
+
+Reproduce via:
+
+```bash
+python scripts/probe_relational_leakage.py release/intro
+python scripts/probe_relational_leakage.py release/intermediate
+python scripts/probe_relational_leakage.py release/advanced
+```
+
+For Phase-2 CI gating after PR 2.2:
+
+```bash
+python scripts/probe_relational_leakage.py release/intermediate --max-accuracy 0.65
+# exit 2 if any deterministic path or bonus AUC > 0.65
+```
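The `--max-accuracy` gate reduces to a threshold check over the probe metrics. A minimal sketch of that logic; the function name and metric keys are hypothetical, and only the exit status 2 and the strict `>` comparison come from the comment in the command above:

```python
import sys
from typing import Mapping

EXIT_OK = 0
EXIT_LEAK = 2  # exit status documented above for a breached threshold


def leakage_exit_code(metrics: Mapping[str, float], max_accuracy: float) -> int:
    """Return EXIT_LEAK if any probe metric exceeds max_accuracy, else EXIT_OK.

    `metrics` maps hypothetical probe names (e.g. "path_E_accuracy",
    "lr_auc_without_close_outcome") to their measured values.
    """
    breaches = {name: value for name, value in metrics.items() if value > max_accuracy}
    for name, value in sorted(breaches.items()):
        print(f"FAIL {name} = {value:.3f} > {max_accuracy:.2f}", file=sys.stderr)
    return EXIT_LEAK if breaches else EXIT_OK
```

A CLI would then end with `sys.exit(leakage_exit_code(metrics, args.max_accuracy))`. One caveat follows from this audit's own numbers: compared on raw accuracy, a 0.65 threshold would also flag the post-fix all-negative baseline for intermediate (0.790) and advanced (0.921) reported in the Phase-2-success ablation, whereas the balanced accuracy of an all-negative predictor is exactly 0.5.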
+
+## Bundle composition
+
+| Tier | n_leads | n_opportunities | n_customers | n_subscriptions | conversion rate |
+|---|---:|---:|---:|---:|---:|
+| intro        | 5000 | 4701 | 2110 | 2110 | 0.422 |
+| intermediate | 5000 | 4641 | 1049 | 1049 | 0.210 |
+| advanced     | 5000 | 4557 |  393 |  393 | 0.079 |
+
+`n_customers == n_subscriptions == n_converted_leads` per tier — direct
+evidence that customers and subscriptions are conversion-conditional
+entities. Their *presence in the public table set is the leak*; column
+contents are immaterial.
+
+## Deterministic reconstruction (paths A–E)
+
+Reconstruction **accuracy** vs `converted_within_90_days`. Precision /
+recall / F1 are also 1.000 across the board (full output in the script's
+JSON mode); only accuracy reproduced here for compactness. AUC is not
+reported because these are deterministic 0/1 predictions (AUC undefined /
+degenerate).
+
+| Tier | A. direct | B. opp won | C. customer | D. subscription | E. B∨C∨D |
+|---|---:|---:|---:|---:|---:|
+| intro        | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
+| intermediate | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
+| advanced     | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
+
+## Phase-2-success ablation
+
+Same deterministic probes, run on a virtual redacted view (label columns
+dropped from `leads`, `close_outcome`/`closed_at` dropped from
+`opportunities`, `customers`/`subscriptions` empty). Every path collapses
+to all-False predictions, so accuracy reduces to the baseline of always
+predicting the negative class — i.e., `1 - conversion_rate`:
+
+| Tier | accuracy of any path | matches `1 - conv. rate` |
+|---|---:|:--:|
+| intro        | 0.578 | ✓ (1 − 0.422) |
+| intermediate | 0.790 | ✓ (1 − 0.210) |
+| advanced     | 0.921 | ✓ (1 − 0.079) |
+
+This is the *correct* post-fix shape for deterministic probes: with the
+post-outcome side channels gone, no join produces a positive prediction.
+The remaining residual risk (can a *model* trained on
+`n_opps`/`max_acv`/etc., which are NOT post-outcome, still leak?) is
+PR 2.1 / 3.1 territory; setting its tolerance band is left to those PRs.
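The ablation accuracies follow directly from the bundle-composition counts, taking `n_customers` as the number of converted leads (the two are equal per tier, as noted above). A short check in plain Python:

```python
# (n_leads, n_converted) per tier, from the "Bundle composition" table;
# n_converted is taken as n_customers.
TIERS = {"intro": (5000, 2110), "intermediate": (5000, 1049), "advanced": (5000, 393)}


def all_negative_accuracy(n_leads: int, n_converted: int) -> float:
    """Accuracy of a predictor that returns False for every lead (1 - conversion rate)."""
    return (n_leads - n_converted) / n_leads


for tier, (n_leads, n_converted) in TIERS.items():
    print(f"{tier:<12} {all_negative_accuracy(n_leads, n_converted):.3f}")
# intro        0.578
# intermediate 0.790
# advanced     0.921
```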
+
+## Bonus model probes (5-fold CV)
+
+| Tier | variant | LR AUC | LR AP | HistGBM AUC | HistGBM AP | n_features |
+|---|---|---:|---:|---:|---:|---:|
+| intro        | with `any_closed_won`/`any_closed`    | 1.000 | 1.000 | 1.000 | 1.000 | 7 |
+| intro        | without close-outcome aggregates       | 1.000 | 1.000 | 1.000 | 1.000 | 5 |
+| intermediate | with `any_closed_won`/`any_closed`    | 1.000 | 1.000 | 1.000 | 1.000 | 7 |
+| intermediate | without close-outcome aggregates       | 1.000 | 1.000 | 1.000 | 1.000 | 5 |
+| advanced     | with `any_closed_won`/`any_closed`    | 1.000 | 1.000 | 1.000 | 1.000 | 7 |
+| advanced     | without close-outcome aggregates       | 1.000 | 1.000 | 1.000 | 1.000 | 5 |
+
+**Key observation:** AUC = 1.000 even *without* `any_closed_won`. The
+non-trivial relational features (`n_opps`, `n_customers`,
+`n_subscriptions`, `max_acv`, `mean_acv`) are jointly sufficient to
+reconstruct the label, and `n_customers` and `n_subscriptions` are each
+sufficient on their own: `customers.parquet` and `subscriptions.parquet`
+exist *only* for converted leads, so a nonzero count is just Path C or D
+in numeric form. This is why PR 2.1's structural fix must omit those
+tables entirely from public bundles, not just redact a column.
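To see why a single count such as `n_customers` can reach AUC 1.000 by itself, recall that ROC AUC equals the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counting one half. The toy labels and counts below are made up for illustration and are not taken from the bundles:

```python
from itertools import product
from typing import Sequence


def rank_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """ROC AUC in Mann-Whitney form: P(pos > neg) + 0.5 * P(pos == neg)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Toy data: leads 0-1 converted, leads 2-4 did not.
converted = [True, True, False, False, False]
n_customers = [1, 1, 0, 0, 0]  # customers exist only for converted leads
n_opps = [1, 2, 1, 0, 1]       # opportunities also exist for non-converted leads

print(rank_auc(n_customers, converted))       # 1.0: existence alone separates the classes
print(round(rank_auc(n_opps, converted), 3))  # 0.833: overlapping counts
```

Counting ties as one half matches scikit-learn's `roc_auc_score`; the pairwise loop is quadratic and only meant for illustration.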
+
+## G4.4 — snapshot-window probe
+
+Direct empirical check on the alpha bundles: are there event rows with
+`timestamp > lead_created_at + 90d`? Each cell reads
+`violating rows / total rows` followed by the verdict.
+
+| Tier | touches | sessions | sales_activities | opportunities |
+|---|---|---|---|---|
+| intro        | 0 / 53354 PASS | 0 / 14339 PASS | 0 / 56643 PASS | 0 / 4701 PASS |
+| intermediate | 0 / 54803 PASS | 0 / 14565 PASS | 0 / 60739 PASS | 0 / 4641 PASS |
+| advanced     | 0 / 54662 PASS | 0 / 14599 PASS | 0 / 62254 PASS | 0 / 4557 PASS |
+
+**G4.4 passes literally:** the 90-day simulation horizon already bounds
+event timestamps. **But** that is not the same as "the public bundle is
+snapshot-safe." Events *within* the 90-day window still encode conversion
+(Path B uses opportunities created within the horizon; the customers and
+subscriptions tables only exist for leads that closed within the horizon).
+The snapshot-window invariant (G4.4) and the relational-leakage invariants
+(G4.1–G4.3 and G4.5) are independent constraints; passing G4.4 does not
+imply passing G4.1–G4.3 or G4.5.
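The per-table counts in the G4.4 table come from a join against `leads` followed by a timestamp comparison. A minimal pandas sketch, assuming each event table has a `lead_id` column plus a timestamp column whose name is passed in (`timestamp` per the method description; the column name for `opportunities` is an assumption), and that `leads` has `lead_created_at`:

```python
import pandas as pd


def count_window_violations(
    events: pd.DataFrame,
    leads: pd.DataFrame,
    ts_col: str = "timestamp",
    horizon_days: int = 90,
) -> tuple[int, int]:
    """Return (violating_rows, total_rows) where ts_col > lead_created_at + horizon."""
    joined = events[["lead_id", ts_col]].merge(
        leads[["lead_id", "lead_created_at"]],
        on="lead_id",
        how="left",
        validate="many_to_one",  # raises if lead_id repeats in leads
    )
    cutoff = joined["lead_created_at"] + pd.Timedelta(days=horizon_days)
    # Comparisons against NaT (events with no matching lead) are False, so they are not counted.
    violations = int((joined[ts_col] > cutoff).sum())
    return violations, len(events)
```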
+
+## Acceptance-gate verdict
+
+| Gate | Verdict | Evidence |
+|---|---|---|
+| **G4.1** Public `leads` excludes `converted_within_90_days` and `conversion_timestamp` | ✗ FAIL | both columns present in all three tiers |
+| **G4.2** Public `opportunities` excludes `close_outcome` and `closed_at` | ✗ FAIL | both columns present in all three tiers |
+| **G4.3** Public bundles do not contain `customers.parquet` or `subscriptions.parquet` | ✗ FAIL | both files present in all three tiers |
+| **G4.4** No public event rows past `lead_created_at + snapshot_day` | ✓ PASS | 0 violations across all event tables and all tiers (90-day horizon) |
+| **G4.5** Probabilistic relational reconstruction probe AUC ≤ TBD | ✗ FAIL | LR / HistGBM AUC = 1.000 in every tier in both feature variants |
+| **G4.6** Manifest field `relational_snapshot_safe == true` for `student_public` | ✗ FAIL | manifest field does not yet exist (introduced in PR 2.2 with `BUNDLE_SCHEMA_VERSION` 4 → 5) |
+
+## Why every reconstruction metric is 1.000 (and what that implies for Phase 2)
+
+The public bundles expose four logically equivalent reconstructions of
+`converted_within_90_days`:
+
+1. The label itself (Path A).
+2. `close_outcome == "closed_won"` on opportunities (Path B).
+3. The presence of any joined customer (Path C).
+4. The presence of any joined subscription (Path D).
+
+All four are functions of the same underlying truth — they all flip on iff
+the lead converted within 90 days — so any model with access to any of
+them trivially achieves AUC 1.0. This is structural, not probabilistic:
+PR 2.1 must remove the *information channels*, not "shrink the leakage."
+
+## Note on the instructor companion
+
+`release/intermediate_instructor/` is a `research_instructor` bundle and
+is *expected* to retain all four channels — that's the point of the
+instructor mode (full truth for teaching). Running this script against the
+instructor companion will report the same 1.000 reconstruction; that's
+correct behavior, not a regression. The public/instructor diff is gated
+separately by G9.\*.
+
+## Pointer to the structural fix — PR 2.1
+
+PR 2.1 of `v1_release_roadmap.md` is the structural fix (items 1 and 2); PR 2.2 wires it into the bundle export (item 3):
+
+1. New `leadforge/render/relational_snapshot_safe.py`:
+   - Drop `converted_within_90_days` / `conversion_timestamp` from public `leads`.
+   - Drop `close_outcome` / `closed_at` from public `opportunities`.
+   - Filter `opportunities` to `created_at <= lead_created_at + snapshot_day` per lead.
+   - Filter `touches`/`sessions`/`sales_activities` similarly (defence-in-depth even though G4.4 passes today).
+   - Omit `customers.parquet` / `subscriptions.parquet` from public bundles.
+
+2. New `leadforge/validation/relational_leakage.py`:
+   - Lift `deterministic_relational_reconstruction` from this PR's
+     `scripts/probe_relational_leakage.py` and assert that paths B/C/D
+     produce zero hits because the underlying columns/tables are absent.
+   - Assert no banned columns; assert event timestamps within horizon.
+   - Add a Phase-2 bonus-model probe — train LR/HistGBM on the redacted
+     view's `n_opps`/ACV features and band the residual AUC.
+
+3. PR 2.2 wires the new export through `leadforge/exposure/filters.py`
+   and `leadforge/api/bundle.py`; bumps `BUNDLE_SCHEMA_VERSION` 4 → 5;
+   adds the `relational_snapshot_safe: true` manifest field for
+   `student_public`.
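The per-lead time filter in item 1 could be expressed as follows. This is a hypothetical sketch, not the planned `relational_snapshot_safe.py` API: it assumes `snapshot_day` is an integer number of days, that each filtered table has a `lead_id` column, and that rows whose lead cannot be found are dropped (the conservative choice for a public export).

```python
import pandas as pd


def filter_to_snapshot(
    table: pd.DataFrame,
    leads: pd.DataFrame,
    ts_col: str,
    snapshot_day: int,
) -> pd.DataFrame:
    """Keep rows with ts_col <= lead_created_at + snapshot_day for their own lead."""
    if not leads["lead_id"].is_unique:
        raise ValueError("leads.lead_id must be unique")
    created = leads.set_index("lead_id")["lead_created_at"]
    # map() looks up each row's lead creation time; unknown leads become NaT.
    cutoff = table["lead_id"].map(created) + pd.Timedelta(days=snapshot_day)
    # NaT comparisons are False, so rows with no matching lead are dropped.
    return table.loc[table[ts_col] <= cutoff].reset_index(drop=True)
```

Usage would be one call per table, e.g. `filter_to_snapshot(opportunities, leads, "created_at", snapshot_day)`, repeated for `touches`/`sessions`/`sales_activities` with their own timestamp columns.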
+
+After PR 2.2 ships, this script must be re-run on the regenerated
+bundles. Expected post-fix shape:
+
+- Deterministic paths A–E: all-False (matches the Phase-2 ablation rows
+  in the table above).
+- Bonus model AUC (without close-outcome aggregates): the residual
+  band that PR 3.3 will calibrate — currently unbanded.
+- G4.4: still PASS.
+- The script's `--max-accuracy` flag becomes the regression gate in CI.
+
+## Related artifacts
+
+- Probe script: [`scripts/probe_relational_leakage.py`](../../scripts/probe_relational_leakage.py)
+- Unit tests: [`tests/scripts/test_probe_relational_leakage.py`](../../tests/scripts/test_probe_relational_leakage.py)
+- Acceptance gates: [`docs/release/v1_acceptance_gates.md`](v1_acceptance_gates.md) §"Relational leakage gate"
+- Roadmap: [`docs/release/v1_release_roadmap.md`](v1_release_roadmap.md) §"Phase 2 — Snapshot-safe relational export"
+- Original finding: [`docs/external_review/summaries/chatgpt_v2_summary.md`](../external_review/summaries/chatgpt_v2_summary.md) §0