Merged
6 changes: 3 additions & 3 deletions .agent-plan.md
@@ -20,9 +20,9 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
**Companion docs:** `docs/release/v1_release_design.md`, `docs/release/v1_acceptance_gates.md`, `docs/release/post_v1_roadmap.md`
**External review materials:** `docs/external_review/{gemini,chatgpt}/` (raw) + `docs/external_review/summaries/` (synthesized)

### Phase 1 — Audit and naming
- [ ] Reproduce relational-leakage finding on alpha bundles → `docs/release/v1_current_state_audit.md`
- [ ] Lock dataset release name `leadforge-lead-scoring-v1`
### Phase 1 — Audit and naming ✓ (PR 1.1)
- [x] Reproduce relational-leakage finding on alpha bundles → `docs/release/v1_current_state_audit.md` — all three tiers reconstruct `converted_within_90_days` at 100% via paths A–E; LR/HistGBM AUC = 1.000 on join-derived features. Probe script: `scripts/probe_relational_leakage.py` (function `deterministic_relational_reconstruction` designed to lift into PR 3.1's `leadforge/validation/leakage_probes.py`).
- [x] Lock dataset release name `leadforge-lead-scoring-v1` (already locked via PR #61's milestone rename + roadmap edits; G1.1 reaffirmed)

### Phase 2 — Snapshot-safe relational export
- [ ] `leadforge/render/relational_snapshot_safe.py` (new)
2 changes: 1 addition & 1 deletion docs/release/v1_acceptance_gates.md
@@ -10,7 +10,7 @@ read by `scripts/validate_release_candidate.py` and by humans before tag.

## Naming and versioning gate

- **G1.1** Dataset release name: `leadforge-lead-scoring-v1`. Locked in Phase 1.
- **G1.1** Dataset release name: `leadforge-lead-scoring-v1`. Locked in Phase 1 (PR #61 milestone rename + roadmap edits; reaffirmed in PR 1.1's `docs/release/v1_current_state_audit.md`).
- **G1.2** Kaggle slug: `leadforge-lead-scoring-v1`.
- **G1.3** Hugging Face repo: `leadforge-lead-scoring-v1` (public family) and `leadforge-lead-scoring-v1-instructor` (companion).
- **G1.4** Bundle `package_version` reflects the leadforge package at build time.
246 changes: 246 additions & 0 deletions docs/release/v1_current_state_audit.md
@@ -0,0 +1,246 @@
# v1 Current-State Audit — Relational Leakage in Alpha `student_public` Bundles

**Phase:** PR 1.1 (Phase 1 of `v1_release_roadmap.md`)
**Date:** 2026-05-05
**Generated by:** `scripts/probe_relational_leakage.py` against `release/{intro,intermediate,advanced}/`
**Status:** **BLOCKER CONFIRMED.** Public bundles fail G4.1, G4.2, G4.3, G4.5, G4.6 in `v1_acceptance_gates.md`. G4.4 (snapshot-window timestamps) **passes empirically** on alpha bundles — important nuance, see §G4.4 below.
**Structural fix:** PR 2.1 — `leadforge/render/relational_snapshot_safe.py` + `leadforge/validation/relational_leakage.py`.

This document reproduces the relational-leakage finding from
[`docs/external_review/summaries/chatgpt_v2_summary.md`](../external_review/summaries/chatgpt_v2_summary.md) §0
on the actual 5000-lead alpha bundles.

## TL;DR

The public bundles leak `converted_within_90_days` through **two qualitatively different mechanisms** that the audit must distinguish:

1. **The label is published in cleartext.** `leads.converted_within_90_days`
and `leads.conversion_timestamp` are present in every public tier.
Path A is "open the parquet, read the column" — not leakage *via joins*.
2. **Joins to post-outcome entities reconstruct the label deterministically.**
`opportunities.close_outcome == "closed_won"`, plus the existence of
conversion-conditional `customers.parquet` and `subscriptions.parquet`
tables, each independently reconstruct the target at 100% accuracy. This is
the leakage the *roadmap-level* discussion is about, and it survives
even if Path A is patched first.

Phase 2 must remove **both** mechanisms. Removing only the label column
(easy) leaves the join-only reconstruction (paths B/C/D) at 100%.

## Method

`scripts/probe_relational_leakage.py <bundle_dir>` reports four complementary
pieces of evidence:

1. **Deterministic reconstruction paths** (no model fit, just joins):

| Path | Description |
|---|---|
| A. Direct label read | `leads.converted_within_90_days` taken as the prediction. |
| B. Opportunity outcome | Lead has any `opportunities` row with `close_outcome == "closed_won"`. |
| C. Customer existence | Lead → opportunities → customers (any joined customer). |
| D. Subscription existence | Lead → opportunities → customers → subscriptions. |
| E. Deterministic OR (B ∨ C ∨ D) | Headline join-only reconstruction. |

2. **Phase-2-success ablation.** Same deterministic probes after simulating
PR 2.1's redaction in-process (drop label columns from `leads`, drop
`close_outcome`/`closed_at` from `opportunities`, treat `customers` and
`subscriptions` as empty). Tells us what the post-fix probe should look
like, *before* PR 2.1 ships.

3. **Bonus model probes.** 5-fold CV LR + HistGBM on join-derived
   features, in two variants:
   - `with_close_outcome_aggregates`: includes `any_closed_won` (which is
     just Path B aggregated; trivially perfect).
   - `without_close_outcome_aggregates`: only `n_opps`, `max_acv`,
     `mean_acv`, `n_customers`, `n_subscriptions`. This is the load-bearing
     variant; it answers "do the *non-trivial* relational features carry
     the leak independently of `close_outcome`?"

4. **Snapshot-window probe (G4.4).** Per event table, count rows with
`timestamp > lead_created_at + horizon_days`. Direct test of the
timestamp-bound invariant.

The deterministic reconstruction is implemented as a pure function
`deterministic_relational_reconstruction(leads, opportunities, customers, subscriptions)`,
designed to lift verbatim into `leadforge/validation/leakage_probes.py`
(PR 3.1). The function refuses to operate on non-unique `lead_id` and
accepts empty `customers`/`subscriptions` frames (Phase 2 success state).
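
The join logic behind paths B to E can be sketched with plain Python rows. This is a minimal illustration, not the script's actual code: the function name `reconstruct_paths`, the join keys `opportunity_id` and `customer_id`, and the list-of-dicts table representation are assumptions. Only `lead_id`, `close_outcome`, the uniqueness check, and the tolerance for empty `customers`/`subscriptions` come from the description above.

```python
from typing import Dict, List, Mapping, Set


def reconstruct_paths(
    leads: List[Mapping],
    opportunities: List[Mapping],
    customers: List[Mapping],
    subscriptions: List[Mapping],
) -> Dict[str, Set]:
    """Return, per join-only path, the set of lead_ids predicted as converted."""
    lead_ids = [row["lead_id"] for row in leads]
    if len(lead_ids) != len(set(lead_ids)):
        raise ValueError("lead_id must be unique")

    # Path B: the lead has at least one opportunity closed as won.
    # .get() tolerates a redacted view where close_outcome is absent.
    path_b = {o["lead_id"] for o in opportunities
              if o.get("close_outcome") == "closed_won"}

    # Path C: lead -> opportunities -> customers (assumed key: opportunity_id).
    opp_to_lead = {o["opportunity_id"]: o["lead_id"] for o in opportunities}
    cust_to_lead = {c["customer_id"]: opp_to_lead[c["opportunity_id"]]
                    for c in customers if c["opportunity_id"] in opp_to_lead}
    path_c = set(cust_to_lead.values())

    # Path D: lead -> opportunities -> customers -> subscriptions
    # (assumed key: customer_id).
    path_d = {cust_to_lead[s["customer_id"]] for s in subscriptions
              if s["customer_id"] in cust_to_lead}

    return {"B": path_b, "C": path_c, "D": path_d,
            "E": path_b | path_c | path_d}


def accuracy(predicted: Set, leads: List[Mapping]) -> float:
    """Share of leads whose predicted flag equals converted_within_90_days."""
    hits = sum((row["lead_id"] in predicted) == bool(row["converted_within_90_days"])
               for row in leads)
    return hits / len(leads)
```

With empty `customers` and `subscriptions` and no `close_outcome` column, every path returns an empty set, which is the all-False state the ablation relies on.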

Reproduce via:

```bash
python scripts/probe_relational_leakage.py release/intro
python scripts/probe_relational_leakage.py release/intermediate
python scripts/probe_relational_leakage.py release/advanced
```

For Phase-2 CI gating after PR 2.2:

```bash
python scripts/probe_relational_leakage.py release/intermediate --max-accuracy 0.65
# exit 2 if any deterministic path or bonus AUC > 0.65
```
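
The threshold check implied by the comment above can be sketched as follows. The helper names `breaches` and `gate_exit_code` and the metric keys are hypothetical; exit code 2 on a breach comes from the comment, while returning 0 on success is an assumption.

```python
from typing import Dict, Mapping


def breaches(metrics: Mapping[str, float], max_accuracy: float) -> Dict[str, float]:
    """Return the metrics that are strictly above the threshold."""
    return {name: value for name, value in metrics.items() if value > max_accuracy}


def gate_exit_code(metrics: Mapping[str, float], max_accuracy: float) -> int:
    # 2 on any breach (per the comment above); 0 otherwise (assumed).
    return 2 if breaches(metrics, max_accuracy) else 0


# Alpha bundles score 1.000 everywhere, so the gate trips.
alpha_metrics = {"path_B_accuracy": 1.0, "lr_auc": 1.0, "histgbm_auc": 1.0}
print(gate_exit_code(alpha_metrics, max_accuracy=0.65))  # 2
```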

## Bundle composition

| Tier | n_leads | n_opportunities | n_customers | n_subscriptions | conversion rate |
|---|---:|---:|---:|---:|---:|
| intro | 5000 | 4701 | 2110 | 2110 | 0.422 |
| intermediate | 5000 | 4641 | 1049 | 1049 | 0.210 |
| advanced | 5000 | 4557 | 393 | 393 | 0.079 |

`n_customers == n_subscriptions == n_converted_leads` per tier — direct
evidence that customers and subscriptions are conversion-conditional
entities. Their *presence in the public table set is the leak*; column
contents are immaterial.
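
As a quick arithmetic check, dividing the customer counts in the table by the 5000 leads per tier recovers the reported conversion rates. The constant and function names below are illustrative only.

```python
N_LEADS = 5000
# Customer counts per tier, copied from the table above.
N_CUSTOMERS = {"intro": 2110, "intermediate": 1049, "advanced": 393}


def conversion_rate(n_converted: int, n_leads: int = N_LEADS) -> float:
    """Fraction of leads that converted."""
    return n_converted / n_leads


for tier, n in N_CUSTOMERS.items():
    print(f"{tier}: {conversion_rate(n):.3f}")
# intro: 0.422
# intermediate: 0.210
# advanced: 0.079
```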

## Deterministic reconstruction (paths A–E)

Reconstruction **accuracy** vs `converted_within_90_days`. Precision,
recall, and F1 are also 1.000 across the board (full output in the script's
JSON mode); only accuracy is reproduced here for compactness. AUC is not
reported because these are hard 0/1 predictions, for which AUC is
degenerate (the ROC curve has a single operating point).

| Tier | A. direct | B. opp won | C. customer | D. subscription | E. B∨C∨D |
|---|---:|---:|---:|---:|---:|
| intro | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| intermediate | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| advanced | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

## Phase-2-success ablation

Same deterministic probes, run on a virtual redacted view (label columns
dropped from `leads`, `close_outcome`/`closed_at` dropped from
`opportunities`, `customers`/`subscriptions` empty). Every path collapses
to all-False predictions, so accuracy reduces to the baseline of always
predicting the negative class — i.e., `1 - conversion_rate`:

| Tier | accuracy of any path | matches `1 - conv. rate` |
|---|---:|:--:|
| intro | 0.578 | ✓ (1 − 0.422) |
| intermediate | 0.790 | ✓ (1 − 0.210) |
| advanced | 0.921 | ✓ (1 − 0.079) |

This is the *correct* post-fix shape for deterministic probes: with the
post-outcome side channels gone, no join produces a positive prediction.
The residual risk, namely whether a *model* trained on
`n_opps`/`max_acv`/etc. (which are NOT post-outcome) can still leak, is
PR 2.1 / 3.1 territory; those PRs are left to set the acceptable AUC band.
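
The baseline figures in the ablation table follow directly from scoring an all-False predictor against the tier's labels. A small sketch, using the converted-lead counts from the bundle composition table (the function name is illustrative):

```python
def all_negative_accuracy(n_converted: int, n_leads: int) -> float:
    """Accuracy of predicting 'not converted' for every lead."""
    labels = [True] * n_converted + [False] * (n_leads - n_converted)
    predictions = [False] * n_leads
    return sum(p == y for p, y in zip(predictions, labels)) / n_leads


for tier, n_converted in {"intro": 2110, "intermediate": 1049, "advanced": 393}.items():
    print(f"{tier}: {all_negative_accuracy(n_converted, 5000):.3f}")
# intro: 0.578
# intermediate: 0.790
# advanced: 0.921
```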

## Bonus model probes (5-fold CV)

| Tier | variant | LR AUC | LR AP | HistGBM AUC | HistGBM AP | n_features |
|---|---|---:|---:|---:|---:|---:|
| intro | with `any_closed_won`/`any_closed` | 1.000 | 1.000 | 1.000 | 1.000 | 7 |
| intro | without close-outcome aggregates | 1.000 | 1.000 | 1.000 | 1.000 | 5 |
| intermediate | with `any_closed_won`/`any_closed` | 1.000 | 1.000 | 1.000 | 1.000 | 7 |
| intermediate | without close-outcome aggregates | 1.000 | 1.000 | 1.000 | 1.000 | 5 |
| advanced | with `any_closed_won`/`any_closed` | 1.000 | 1.000 | 1.000 | 1.000 | 7 |
| advanced | without close-outcome aggregates | 1.000 | 1.000 | 1.000 | 1.000 | 5 |

**Key observation:** AUC = 1.000 even *without* `any_closed_won`. The
remaining relational features (`n_opps`, `n_customers`,
`n_subscriptions`, `max_acv`, `mean_acv`) therefore still reconstruct the
label; `n_customers` or `n_subscriptions` alone suffices, because
`customers.parquet` and `subscriptions.parquet` contain rows *only* for
converted leads. This is why PR 2.1's structural fix must omit those
tables entirely from public bundles, not just redact a column.
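
To make the observation concrete, the five features of the `without_close_outcome_aggregates` variant can be derived per lead roughly as below. This is a sketch under assumptions: the `acv` column name, the join keys `opportunity_id` and `customer_id`, and zero-filling for leads without opportunities are not taken from the script.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Mapping


def join_features(
    leads: List[Mapping],
    opportunities: List[Mapping],
    customers: List[Mapping],
    subscriptions: List[Mapping],
) -> Dict[object, Dict[str, float]]:
    """Per-lead aggregates that never read close_outcome."""
    opps_by_lead = defaultdict(list)
    for o in opportunities:
        opps_by_lead[o["lead_id"]].append(o)
    custs_by_opp = defaultdict(list)
    for c in customers:
        custs_by_opp[c["opportunity_id"]].append(c)
    subs_by_cust = defaultdict(list)
    for s in subscriptions:
        subs_by_cust[s["customer_id"]].append(s)

    features = {}
    for lead in leads:
        opps = opps_by_lead.get(lead["lead_id"], [])
        custs = [c for o in opps for c in custs_by_opp.get(o["opportunity_id"], [])]
        subs = [s for c in custs for s in subs_by_cust.get(c["customer_id"], [])]
        acvs = [o["acv"] for o in opps]
        features[lead["lead_id"]] = {
            "n_opps": len(opps),
            "max_acv": max(acvs, default=0.0),     # zero-fill is an assumption
            "mean_acv": mean(acvs) if acvs else 0.0,
            "n_customers": len(custs),
            "n_subscriptions": len(subs),
        }
    return features
```

On the alpha bundles, thresholding `n_customers > 0` is exactly Path C, which is why dropping `close_outcome` alone does not lower the AUC.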

## G4.4 — snapshot-window probe

Direct empirical check on alpha bundles: are there event rows with
`timestamp > lead_created_at + 90d`?

| Tier | touches | sessions | sales_activities | opportunities |
|---|---|---|---|---|
| intro | 0 / 53354 PASS | 0 / 14339 PASS | 0 / 56643 PASS | 0 / 4701 PASS |
| intermediate | 0 / 54803 PASS | 0 / 14565 PASS | 0 / 60739 PASS | 0 / 4641 PASS |
| advanced | 0 / 54662 PASS | 0 / 14599 PASS | 0 / 62254 PASS | 0 / 4557 PASS |

**G4.4 passes literally:** the 90-day simulation horizon already bounds
event timestamps. **But** that is not the same as "the public bundle is
snapshot-safe." Events *within* the 90-day window still encode conversion
(Path B uses opportunities created within the horizon; the customers and
subscriptions tables only exist for leads that closed within the horizon).
The snapshot-window invariant (G4.4) and the relational-leakage invariants
(G4.1–G4.3, G4.5) are independent constraints; passing G4.4 does not imply
passing any of the others.
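
The per-table count reported above can be sketched as a single pass over an event table. The field names `timestamp` and `lead_id`, the `lead_created_at` lookup dict, and the function name are assumptions for illustration; the strict `>` comparison and the 90-day horizon follow the text.

```python
from datetime import datetime, timedelta
from typing import Iterable, Mapping, Tuple


def count_window_violations(
    events: Iterable[Mapping],
    lead_created_at: Mapping[object, datetime],
    horizon_days: int = 90,
) -> Tuple[int, int]:
    """Return (rows past the horizon, total rows) for one event table."""
    horizon = timedelta(days=horizon_days)
    rows = list(events)
    late = sum(
        1 for e in rows
        if e["timestamp"] > lead_created_at[e["lead_id"]] + horizon
    )
    return late, len(rows)


created = {1: datetime(2026, 1, 1)}
touches = [
    {"lead_id": 1, "timestamp": datetime(2026, 1, 15)},
    {"lead_id": 1, "timestamp": datetime(2026, 4, 1)},  # exactly day 90: allowed
]
print(count_window_violations(touches, created))  # (0, 2)
```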

## Acceptance-gate verdict

| Gate | Verdict | Evidence |
|---|---|---|
| **G4.1** Public `leads` excludes `converted_within_90_days` and `conversion_timestamp` | ✗ FAIL | both columns present in all three tiers |
| **G4.2** Public `opportunities` excludes `close_outcome` and `closed_at` | ✗ FAIL | both columns present in all three tiers |
| **G4.3** Public bundles do not contain `customers.parquet` or `subscriptions.parquet` | ✗ FAIL | both files present in all three tiers |
| **G4.4** No public event rows past `lead_created_at + snapshot_day` | ✓ PASS | 0 violations across all event tables and all tiers (90-day horizon) |
| **G4.5** Probabilistic relational reconstruction probe AUC ≤ TBD | ✗ FAIL | LR / HistGBM AUC = 1.000 in every tier in both feature variants |
| **G4.6** Manifest field `relational_snapshot_safe == true` for `student_public` | ✗ FAIL | manifest field does not yet exist (introduced in PR 2.2 with `BUNDLE_SCHEMA_VERSION` 4 → 5) |
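
The three schema-level gates (G4.1 to G4.3) reduce to set checks on table and column names. A minimal sketch, assuming the bundle schema is available as a `{table: columns}` mapping; the function name and message format are hypothetical.

```python
from typing import Iterable, List, Mapping

BANNED_COLUMNS = {
    "G4.1": ("leads", {"converted_within_90_days", "conversion_timestamp"}),
    "G4.2": ("opportunities", {"close_outcome", "closed_at"}),
}
BANNED_TABLES = {"customers", "subscriptions"}  # G4.3


def schema_gate_failures(schema: Mapping[str, Iterable[str]]) -> List[str]:
    """List gate failures for a bundle described as {table: column names}."""
    failures = []
    for gate, (table, banned) in BANNED_COLUMNS.items():
        present = banned & set(schema.get(table, ()))
        if present:
            failures.append(f"{gate}: {table} exposes {sorted(present)}")
    leaked_tables = BANNED_TABLES & set(schema)
    if leaked_tables:
        failures.append(f"G4.3: bundle contains {sorted(leaked_tables)}")
    return failures
```

An empty list corresponds to passing all three gates; the alpha bundles would produce one entry per gate.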

## Why every reconstruction metric is 1.000 (and what that implies for Phase 2)

The public bundles expose four logically equivalent reconstructions of
`converted_within_90_days`:

1. The label itself (Path A).
2. `close_outcome == "closed_won"` on opportunities (Path B).
3. The presence of any joined customer (Path C).
4. The presence of any joined subscription (Path D).

All four are functions of the same underlying truth — they all flip on iff
the lead converted within 90 days — so any model with access to any of
them trivially achieves AUC 1.0. This is structural, not probabilistic:
PR 2.1 must remove the *information channels*, not "shrink the leakage."

## Note on the instructor companion

`release/intermediate_instructor/` is a `research_instructor` bundle and
is *expected* to retain all four channels — that's the point of the
instructor mode (full truth for teaching). Running this script against the
instructor companion will report the same 1.000 reconstruction; that's
correct behavior, not a regression. The public/instructor diff is gated
separately by G9.\*.

## Pointer to the structural fix — PR 2.1

PR 2.1 of `v1_release_roadmap.md` is the structural fix:

1. New `leadforge/render/relational_snapshot_safe.py`:
   - Drop `converted_within_90_days` / `conversion_timestamp` from public `leads`.
   - Drop `close_outcome` / `closed_at` from public `opportunities`.
   - Filter `opportunities` to `created_at <= lead_created_at + snapshot_day` per lead.
   - Filter `touches`/`sessions`/`sales_activities` similarly (defence-in-depth even though G4.4 passes today).
   - Omit `customers.parquet` / `subscriptions.parquet` from public bundles.

2. New `leadforge/validation/relational_leakage.py`:
   - Lift `deterministic_relational_reconstruction` from this PR's
     `scripts/probe_relational_leakage.py` and assert that paths B/C/D
     produce zero hits because the underlying columns/tables are absent.
   - Assert no banned columns; assert event timestamps within horizon.
   - Add a Phase-2 bonus-model probe: train LR/HistGBM on the redacted
     view's `n_opps`/ACV features and band the residual AUC.

3. PR 2.2 wires the new export through `leadforge/exposure/filters.py`
   and `leadforge/api/bundle.py`; bumps `BUNDLE_SCHEMA_VERSION` 4 → 5;
   adds the `relational_snapshot_safe: true` manifest field for
   `student_public`.
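
The redaction steps listed for `relational_snapshot_safe.py` can be sketched on in-memory rows as follows. This is an illustration under assumptions, not the planned implementation: the function name `redact_public_view`, the `lead_created_at` key on lead rows, and the list-of-dicts representation are hypothetical. Event tables such as `touches` would be filtered the same way as `opportunities` and are left out for brevity.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Mapping

BANNED_LEAD_COLUMNS = {"converted_within_90_days", "conversion_timestamp"}
BANNED_OPPORTUNITY_COLUMNS = {"close_outcome", "closed_at"}


def redact_public_view(
    leads: List[Mapping],
    opportunities: List[Mapping],
    snapshot_day: int = 90,
) -> Dict[str, List[dict]]:
    """Build the public table set; customers/subscriptions are omitted entirely."""
    created = {lead["lead_id"]: lead["lead_created_at"] for lead in leads}
    cutoff = timedelta(days=snapshot_day)

    public_leads = [
        {k: v for k, v in lead.items() if k not in BANNED_LEAD_COLUMNS}
        for lead in leads
    ]
    public_opps = [
        {k: v for k, v in opp.items() if k not in BANNED_OPPORTUNITY_COLUMNS}
        for opp in opportunities
        # Keep only opportunities created within the snapshot window.
        if opp["created_at"] <= created[opp["lead_id"]] + cutoff
    ]
    # No "customers" or "subscriptions" keys: the tables are not exported.
    return {"leads": public_leads, "opportunities": public_opps}
```

Returning a fresh table set, rather than mutating the inputs, keeps the instructor bundle's full-truth tables untouched when both views are built from the same data.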

After PR 2.2 ships, this script must be re-run on the regenerated
bundles. Expected post-fix shape:

- Deterministic paths A–E: all-False (matches the Phase-2 ablation rows
in the table above).
- Bonus model AUC (without close-outcome aggregates): the residual
band that PR 3.3 will calibrate — currently unbanded.
- G4.4: still PASS.
- The script's `--max-accuracy` flag becomes the regression gate in CI.

## Related artifacts

- Probe script: [`scripts/probe_relational_leakage.py`](../../scripts/probe_relational_leakage.py)
- Unit tests: [`tests/scripts/test_probe_relational_leakage.py`](../../tests/scripts/test_probe_relational_leakage.py)
- Acceptance gates: [`docs/release/v1_acceptance_gates.md`](v1_acceptance_gates.md) §"Relational leakage gate"
- Roadmap: [`docs/release/v1_release_roadmap.md`](v1_release_roadmap.md) §"Phase 2 — Snapshot-safe relational export"
- Original finding: [`docs/external_review/summaries/chatgpt_v2_summary.md`](../external_review/summaries/chatgpt_v2_summary.md) §0