docs: PR 1.1 — relational leakage audit on alpha bundles#62
Conversation
Reproduces the ChatGPT v2 finding on the actual 5000-lead alpha bundles
(`release/{intro,intermediate,advanced}/`). All three public tiers
reconstruct `converted_within_90_days` at 100% via every probed path;
the join-derived LR / HistGBM model also achieves AUC = 1.000 in 5-fold CV.
Adds:
- `scripts/probe_relational_leakage.py` — five deterministic reconstruction
paths (A direct, B opportunity won, C customer exists, D subscription
exists, E B∨C∨D) plus a bonus public-relational-only model. The
deterministic reconstruction is exposed as a pure function
`deterministic_relational_reconstruction(...)` designed to lift verbatim
into `leadforge/validation/leakage_probes.py` in PR 3.1.
- `tests/scripts/test_probe_relational_leakage.py` — unit test on a
hand-built four-lead toy bundle covering each path independently, plus
a path-A-collapses-when-label-dropped test that pre-validates Phase 2's
fix shape.
- `docs/release/v1_current_state_audit.md` — per-tier evidence table,
explicit G4.1-G4.6 verdict (all FAIL), pointer to PR 2.1 as the
structural fix.
Also reaffirms G1.1 (dataset name `leadforge-lead-scoring-v1`, already
locked via PR #61's milestone rename) in `v1_acceptance_gates.md`, and
checks off Phase 1 in `.agent-plan.md`.
This PR is evidence + docs only. The structural fix lives in PR 2.1.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
Adds a Phase-1 audit artifact for the v1 release roadmap by introducing a script + unit test that demonstrate deterministic relational reconstruction of converted_within_90_days from the current alpha student_public bundles, and documents the evidence + gate failures to anchor the Phase-2 structural fix work.
Changes:
- Added
scripts/probe_relational_leakage.pyto compute deterministic reconstruction paths (A–E) plus optional sklearn-based CV metrics on join-derived features. - Added
tests/scripts/test_probe_relational_leakage.pyto lock the deterministic reconstruction behavior on a toy 4-lead bundle (including “label dropped” behavior for Phase 2). - Added
docs/release/v1_current_state_audit.mdevidence write-up and minor wording updates to the release acceptance gates and agent plan.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
scripts/probe_relational_leakage.py |
New probe script implementing deterministic join-based reconstruction and optional sklearn CV metrics. |
tests/scripts/test_probe_relational_leakage.py |
New unit tests validating each deterministic reconstruction path and the “missing label” behavior. |
docs/release/v1_current_state_audit.md |
New audit doc capturing per-tier evidence and explicit G4.x verdicts. |
docs/release/v1_acceptance_gates.md |
Tightens G1.1 wording to reference Phase-1 reaffirmation in the audit doc. |
.agent-plan.md |
Marks Phase 1 as complete and records the probe’s headline findings. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Brutal self-review identified 15 issues; this commit addresses the correctness/credibility ones. Summary: Script (scripts/probe_relational_leakage.py): - Assert lead_id uniqueness in deterministic_relational_reconstruction (a validator must refuse non-unique keys, not silently misalign). - Make customers/subscriptions optional inputs — empty frames now accepted, so the same script runs cleanly on Phase-2-fixed bundles. - Branch the agg dict properly — was using has_close_outcome flag *inside* a lambda that pandas would have crashed before invoking. - Add Phase-2-success ablation: re-run deterministic probes on a redacted-in-process view to surface the post-fix shape *now*. - Bonus model gains a without_close_outcome_aggregates variant — the with-aggregates variant is trivially perfect because any_closed_won is Path B aggregated; the without variant is the load-bearing probe. Refactored to sklearn Pipeline (removes string-based scaler dispatch). - Add G4.4 snapshot-window probe directly on event tables. Empirical result: alpha bundles PASS G4.4 (90-day horizon already bounds timestamps). The audit doc now corrects the prior "FAIL (assumed)" framing. - Add --max-accuracy flag for Phase-2 CI gating (exit 2 on threshold violation). - Rewrite docstring to clearly distinguish four orthogonal evidence channels. Tests (tests/scripts/test_probe_relational_leakage.py): - Add _binary_metrics coverage (perfect / all-negative / mixed) — the function had divide-by-zero handling and F1 logic with zero tests. - Add lead_id-uniqueness rejection test. - Add CLI smoke test on release/intermediate (skip if not present). - Add --max-accuracy gate failure test. - Drop the synthetic edge-case lead — toy bundle is now 3 leads cleanly covering A/C without B/D. Audit doc (docs/release/v1_current_state_audit.md): - Distinguish Path A (label in cleartext) from Paths B-E (joins). - Add Phase-2 ablation table showing collapse to (1 - conv_rate). - Report bonus model in two variants; AUC = 1.000 in BOTH variants is a substantive new finding (non-trivial relational features carry the leak independently of close_outcome — i.e., omitting customers/subscriptions tables is structurally required). - Correct G4.4 verdict to PASS (with caveat about within-horizon content still leaking). - Note instructor companion behavior is expected. - Stop reporting AUC on deterministic 0/1 paths. - Remove "sanity floor" framing for Path A (understated the violation). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Self-review pass — applied (commit 8f0b4ca)Did a brutal outsider-view review and addressed the correctness/credibility findings. Highlights: Substantive new finding from the review: The bonus model achieves AUC = 1.000 even without the Corrected G4.4 verdict. Previously "FAIL (assumed)." Direct empirical probe on event-table timestamps shows G4.4 passes literally (90-day simulation horizon already bounds timestamps; 0 violations across all event tables and tiers). The audit now distinguishes the snapshot-window invariant from the relational-leakage invariants — they're independent gates. Probe script is now Phase-2-ready. Accepts optional Test coverage tripled (2 → 8 tests): Framing corrections in the audit doc:
|
This comment has been minimized.
This comment has been minimized.
Five real issues from the automated review: 1. Path A (label cleartext read) used .astype(bool), which maps NaN to True. Routed through pandas' nullable "boolean" dtype with explicit fillna(False) so partially-redacted bundles don't report phantom positives. Same fix applied to y_true construction in probe_bundle. 2. _binary_metrics F1 returned NaN when precision OR recall was exactly 0.0 — should be 0.0 (zero-skill is well-defined; NaN should signal undefined). New logic: NaN only when precision/recall is itself NaN; 0.0 when both are zero; harmonic mean otherwise. 3. Module docstring promised exit 1 on "shape errors" but main() only caught FileNotFoundError. Broadened catch to (FileNotFoundError, KeyError, ValueError); error message now includes exception type. 4. probe_bundle() previously bailed with an "error" key when converted_within_90_days was absent. Post-Phase-2 bundles will be exactly that shape. Now reports path_prediction_rates (fraction of leads each path flags positive) unconditionally — useful as a sanity view both before and after the fix. The early-return is now a "note" key rather than "error", with a message pointing the reader at the prediction-rates view. 5. New tests: - F1 zero-precision-and-recall yields zero, not NaN - Path A with NaN-bearing label column maps NaN -> False Total: 10 tests, all passing. Lint/format/mypy clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def _g4_4_snapshot_window_probe( | ||
| leads: pd.DataFrame, | ||
| tables_dir: Path, | ||
| horizon_days: int, | ||
| ) -> dict[str, dict[str, int]]: | ||
| """Per-event-table: count rows with timestamp > lead_created_at + horizon_days. | ||
|
|
||
| Implements G4.4 directly. Skips tables that aren't on disk. | ||
| """ | ||
| leads_anchor = leads[["lead_id", "lead_created_at"]].copy() | ||
| leads_anchor["lead_created_at"] = pd.to_datetime(leads_anchor["lead_created_at"]) | ||
| horizon = pd.Timedelta(days=horizon_days) | ||
|
|
||
| event_tables = ( | ||
| ("touches", "touch_timestamp"), | ||
| ("sessions", "session_timestamp"), | ||
| ("sales_activities", "activity_timestamp"), | ||
| ("opportunities", "created_at"), | ||
| ) | ||
| out: dict[str, dict[str, int]] = {} | ||
| for tbl, ts_col in event_tables: | ||
| path = tables_dir / f"{tbl}.parquet" | ||
| if not path.exists(): | ||
| continue | ||
| df = pd.read_parquet(path, columns=["lead_id", ts_col]) | ||
| df[ts_col] = pd.to_datetime(df[ts_col]) | ||
| merged = df.merge(leads_anchor, on="lead_id", how="left") | ||
| bad = (merged[ts_col] > merged["lead_created_at"] + horizon).sum() | ||
| out[tbl] = {"violations": int(bad), "total_rows": int(len(df))} | ||
| return out | ||
|
|
||
|
|
||
| def _read_horizon_days(bundle_dir: Path) -> int: | ||
| manifest = bundle_dir / "manifest.json" | ||
| if manifest.exists(): | ||
| try: | ||
| data = json.loads(manifest.read_text()) | ||
| return int(data.get("horizon_days") or DEFAULT_HORIZON_DAYS) | ||
| except (json.JSONDecodeError, ValueError, TypeError): | ||
| return DEFAULT_HORIZON_DAYS | ||
| return DEFAULT_HORIZON_DAYS |
| df = pd.read_parquet(path, columns=["lead_id", ts_col]) | ||
| df[ts_col] = pd.to_datetime(df[ts_col]) | ||
| merged = df.merge(leads_anchor, on="lead_id", how="left") | ||
| bad = (merged[ts_col] > merged["lead_created_at"] + horizon).sum() | ||
| out[tbl] = {"violations": int(bad), "total_rows": int(len(df))} | ||
| return out |
| def _read_optional_table(tables_dir: Path, name: str, columns: list[str]) -> pd.DataFrame: | ||
| path = tables_dir / f"{name}.parquet" | ||
| if path.exists(): | ||
| return pd.read_parquet(path) |
| include_close_outcome_aggregates=include_co_agg, | ||
| ) | ||
| y = ( | ||
| leads.set_index("lead_id")["converted_within_90_days"] |
| **Status:** **BLOCKER CONFIRMED.** Public bundles fail G4.1, G4.2, G4.3, G4.5, G4.6 in `v1_acceptance_gates.md`. G4.4 (snapshot-window timestamps) **passes empirically** on alpha bundles — important nuance, see §G4.4 below. | ||
| **Structural fix:** PR 2.1 — `leadforge/render/relational_snapshot_safe.py` + `leadforge/validation/relational_leakage.py`. |
|
pr-agent-context report: This is a refreshed snapshot of the current PR state.
This run includes patch coverage gaps on PR #62 in repository https://github.com/leadforge-dev/leadforge
Address the patch coverage gaps below, then push all of these changes in a single commit.
# Patch coverage
Patch test coverage is 0%; please raise it to 100%. These are the uncovered code lines:
- scripts/probe_relational_leakage.py: 60, 62, 63, 64, 65, 66, 67, 69, 71, 72, 75, 98, 99, 101, 107, 108, 110, 113, 114, 118, 119, 122, 123, 127, 128, 131, 134, 135, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 149, 151, 163, 170, 171, 172, 173, 174, 175, 176, 177, 183, 184, 185, 186, 188, 190, 202, 216, 221, 222, 226, 231, 232, 233, 237, 238, 240, 241, 242, 244, 249, 250, 251, 252, 254, 260, 267, 268, 269, 270, 271, 274, 292, 293, 294, 295, 296, 297, 298, 299, 300, 302, 303, 305, 306, 310, 317, 323, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 349, 350, 353, 363, 366, 367, 370, 377, 384, 387, 396, 397, 398, 400, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 419, 420, 421, 422, 423, 424, 425, 426, 427, 430, 431, 432, 433, 434, 437, 444, 445, 446, 447, 448, 449, 451, 452, 453, 456, 460, 462, 478, 479, 480, 484, 486, 487, 492, 494, 500, 501, 503, 509, 510, 511, 515, 516, 519, 520, 527, 528, 529, 537, 538, 539, 540, 542, 543, 544, 546, 547, 548, 549, 551, 552, 553, 554, 556, 557, 558, 560, 561, 562, 563, 564, 570, 571, 572, 573, 574, 576, 577, 578, 579, 582, 588, 589, 590, 591, 592, 593, 594, 595, 598, 599, 602, 610, 615, 624, 626, 627, 628, 629, 630, 632, 633, 635, 637, 638, 639, 640, 645, 646, 649, 650
- tests/scripts/test_probe_relational_leakage.py: 8, 10, 11, 12, 13, 15, 16, 18, 19, 20, 21, 22, 23, 24, 25, 28, 38, 45, 67, 73, 78, 81, 82, 84, 88, 95, 96, 99, 102, 103, 104, 105, 108, 112, 115, 116, 117, 118, 119, 122, 125, 126, 127, 128, 133, 134, 135, 136, 137, 138, 139, 140, 141, 144, 147, 148, 149, 150, 151, 152, 153, 156, 157, 158, 159, 160, 161, 162, 163, 166, 170, 171, 172, 173, 174, 175, 178, 181, 188, 196, 199, 202, 205, 206, 207, 215, 219, 222, 228, 229, 231, 234, 238, 241, 253, 254Run metadata: |
Summary
PR 1.1 of the v1 dataset release sequence (
docs/release/v1_release_roadmap.mdPhase 1). Reproduces the ChatGPT v2 relational-leakage finding on the actual 5000-lead alpha bundles inrelease/{intro,intermediate,advanced}/and produces documented evidence to anchor Phase 2's structural fix.This is a docs + small-script PR — no production code changes, no bundle regeneration. Scope intentionally narrow per the roadmap.
What's in this PR
scripts/probe_relational_leakage.py— five deterministic reconstruction paths plus a bonus public-relational-only model:leads.converted_within_90_days(sanity floor)close_outcome == 'closed_won')The deterministic reconstruction is a pure function
deterministic_relational_reconstruction(...)designed to lift verbatim intoleadforge/validation/leakage_probes.pyin PR 3.1.tests/scripts/test_probe_relational_leakage.py— unit test on a hand-built four-lead toy bundle covering each path independently, plus apath_a_collapses_when_label_droppedtest that pre-validates the shape Phase 2's fix needs to produce.docs/release/v1_current_state_audit.md— per-tier evidence table, explicit G4.1-G4.6 verdict (all FAIL), pointer to PR 2.1 as the structural fix.docs/release/v1_acceptance_gates.md— minor wording tightening of G1.1 to point at this audit doc as the Phase 1 reaffirmation..agent-plan.md— Phase 1 items checked off.Headline result (per-tier reconstruction accuracy vs.
converted_within_90_days)All five deterministic paths achieve 1.000 accuracy / precision / recall / F1 in every tier. The bonus LR and HistGBM models also achieve AUC = 1.000 in 5-fold CV.
The 100% reconstruction is structural, not probabilistic noise: public bundles literally contain the label (
leads.converted_within_90_days), plus three independent post-outcome side channels (opportunities.close_outcome, conversion-conditionalcustomers.parquet, conversion-conditionalsubscriptions.parquet). See the audit doc for the full G4.x verdict.Acceptance (per the PR 1.1 spec)
python scripts/probe_relational_leakage.py release/intermediateexits 0 and reports a numeric reconstruction accuracyv1_acceptance_gates.md(no-op tightening — name was already locked via External review synthesis + v1 release roadmap #61)NOT in this PR
This is evidence + docs only. The structural fix lives in PR 2.1.
Test plan
python -m pytest tests/scripts/test_probe_relational_leakage.py -v(2 passed)ruff checkandruff format --checkclean on new filesmypy scripts/probe_relational_leakage.pycleanpre-commit run --files <changed>cleanpython scripts/probe_relational_leakage.py release/intro|intermediate|advancedruns and reports numeric results🤖 Generated with Claude Code