docs: PR 1.1 — relational leakage audit on alpha bundles by shaypal5 · Pull Request #62 · leadforge-dev/leadforge

shaypal5 · 2026-05-05T16:20:39Z

Summary

PR 1.1 of the v1 dataset release sequence (docs/release/v1_release_roadmap.md Phase 1). Reproduces the ChatGPT v2 relational-leakage finding on the actual 5000-lead alpha bundles in release/{intro,intermediate,advanced}/ and produces documented evidence to anchor Phase 2's structural fix.

This is a docs + small-script PR — no production code changes, no bundle regeneration. Scope intentionally narrow per the roadmap.

What's in this PR

scripts/probe_relational_leakage.py — five deterministic reconstruction paths plus a bonus public-relational-only model:
- A. Direct read of leads.converted_within_90_days (sanity floor)
- B. Opportunity outcome (any close_outcome == 'closed_won')
- C. Customer existence (lead → opportunities → customers)
- D. Subscription existence (lead → opportunities → customers → subscriptions)
- E. Deterministic OR (B ∨ C ∨ D) — the headline reconstruction
- F. Bonus: 5-fold CV LR + HistGBM on join-derived features
The deterministic reconstruction is a pure function deterministic_relational_reconstruction(...) designed to lift verbatim into leadforge/validation/leakage_probes.py in PR 3.1.
tests/scripts/test_probe_relational_leakage.py — unit test on a hand-built four-lead toy bundle covering each path independently, plus a path_a_collapses_when_label_dropped test that pre-validates the shape Phase 2's fix needs to produce.
docs/release/v1_current_state_audit.md — per-tier evidence table, explicit G4.1-G4.6 verdict (all FAIL), pointer to PR 2.1 as the structural fix.
docs/release/v1_acceptance_gates.md — minor wording tightening of G1.1 to point at this audit doc as the Phase 1 reaffirmation.
.agent-plan.md — Phase 1 items checked off.

Headline result (per-tier reconstruction accuracy vs. `converted_within_90_days`)

All five deterministic paths achieve 1.000 accuracy / precision / recall / F1 in every tier. The bonus LR and HistGBM models also achieve AUC = 1.000 in 5-fold CV.

Tier	n_leads	conv. rate	Path E acc	LR AUC	HistGBM AUC
intro	5000	0.422	1.000	1.000	1.000
intermediate	5000	0.210	1.000	1.000	1.000
advanced	5000	0.079	1.000	1.000	1.000

The 100% reconstruction is structural, not probabilistic noise: public bundles literally contain the label (leads.converted_within_90_days), plus three independent post-outcome side channels (opportunities.close_outcome, conversion-conditional customers.parquet, conversion-conditional subscriptions.parquet). See the audit doc for the full G4.x verdict.

Acceptance (per the PR 1.1 spec)

python scripts/probe_relational_leakage.py release/intermediate exits 0 and reports a numeric reconstruction accuracy
The audit doc contains a per-tier metric table
One small unit test on the deterministic-reconstruction function (two, actually — second test pre-locks the shape Phase 2 produces)
G1.1 confirmed in v1_acceptance_gates.md (no-op tightening — name was already locked via External review synthesis + v1 release roadmap #61)

NOT in this PR

The snapshot-safe relational module (PR 2.1)
Validator wiring (PR 2.2)
The validation modules (PR 3.1)
Bundle regeneration

This is evidence + docs only. The structural fix lives in PR 2.1.

Test plan

python -m pytest tests/scripts/test_probe_relational_leakage.py -v (2 passed)
ruff check and ruff format --check clean on new files
mypy scripts/probe_relational_leakage.py clean
pre-commit run --files <changed> clean
python scripts/probe_relational_leakage.py release/intro|intermediate|advanced runs and reports numeric results

🤖 Generated with Claude Code

Reproduces the ChatGPT v2 finding on the actual 5000-lead alpha bundles (`release/{intro,intermediate,advanced}/`). All three public tiers reconstruct `converted_within_90_days` at 100% via every probed path; the join-derived LR / HistGBM model also achieves AUC = 1.000 in 5-fold CV. Adds: - `scripts/probe_relational_leakage.py` — five deterministic reconstruction paths (A direct, B opportunity won, C customer exists, D subscription exists, E B∨C∨D) plus a bonus public-relational-only model. The deterministic reconstruction is exposed as a pure function `deterministic_relational_reconstruction(...)` designed to lift verbatim into `leadforge/validation/leakage_probes.py` in PR 3.1. - `tests/scripts/test_probe_relational_leakage.py` — unit test on a hand-built four-lead toy bundle covering each path independently, plus a path-A-collapses-when-label-dropped test that pre-validates Phase 2's fix shape. - `docs/release/v1_current_state_audit.md` — per-tier evidence table, explicit G4.1-G4.6 verdict (all FAIL), pointer to PR 2.1 as the structural fix. Also reaffirms G1.1 (dataset name `leadforge-lead-scoring-v1`, already locked via PR #61's milestone rename) in `v1_acceptance_gates.md`, and checks off Phase 1 in `.agent-plan.md`. This PR is evidence + docs only. The structural fix lives in PR 2.1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Adds a Phase-1 audit artifact for the v1 release roadmap by introducing a script + unit test that demonstrate deterministic relational reconstruction of converted_within_90_days from the current alpha student_public bundles, and documents the evidence + gate failures to anchor the Phase-2 structural fix work.

Changes:

Added scripts/probe_relational_leakage.py to compute deterministic reconstruction paths (A–E) plus optional sklearn-based CV metrics on join-derived features.
Added tests/scripts/test_probe_relational_leakage.py to lock the deterministic reconstruction behavior on a toy 4-lead bundle (including “label dropped” behavior for Phase 2).
Added docs/release/v1_current_state_audit.md evidence write-up and minor wording updates to the release acceptance gates and agent plan.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
`scripts/probe_relational_leakage.py`	New probe script implementing deterministic join-based reconstruction and optional sklearn CV metrics.
`tests/scripts/test_probe_relational_leakage.py`	New unit tests validating each deterministic reconstruction path and the “missing label” behavior.
`docs/release/v1_current_state_audit.md`	New audit doc capturing per-tier evidence and explicit G4.x verdicts.
`docs/release/v1_acceptance_gates.md`	Tightens G1.1 wording to reference Phase-1 reaffirmation in the audit doc.
`.agent-plan.md`	Marks Phase 1 as complete and records the probe’s headline findings.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Brutal self-review identified 15 issues; this commit addresses the correctness/credibility ones. Summary: Script (scripts/probe_relational_leakage.py): - Assert lead_id uniqueness in deterministic_relational_reconstruction (a validator must refuse non-unique keys, not silently misalign). - Make customers/subscriptions optional inputs — empty frames now accepted, so the same script runs cleanly on Phase-2-fixed bundles. - Branch the agg dict properly — was using has_close_outcome flag *inside* a lambda that pandas would have crashed before invoking. - Add Phase-2-success ablation: re-run deterministic probes on a redacted-in-process view to surface the post-fix shape *now*. - Bonus model gains a without_close_outcome_aggregates variant — the with-aggregates variant is trivially perfect because any_closed_won is Path B aggregated; the without variant is the load-bearing probe. Refactored to sklearn Pipeline (removes string-based scaler dispatch). - Add G4.4 snapshot-window probe directly on event tables. Empirical result: alpha bundles PASS G4.4 (90-day horizon already bounds timestamps). The audit doc now corrects the prior "FAIL (assumed)" framing. - Add --max-accuracy flag for Phase-2 CI gating (exit 2 on threshold violation). - Rewrite docstring to clearly distinguish four orthogonal evidence channels. Tests (tests/scripts/test_probe_relational_leakage.py): - Add _binary_metrics coverage (perfect / all-negative / mixed) — the function had divide-by-zero handling and F1 logic with zero tests. - Add lead_id-uniqueness rejection test. - Add CLI smoke test on release/intermediate (skip if not present). - Add --max-accuracy gate failure test. - Drop the synthetic edge-case lead — toy bundle is now 3 leads cleanly covering A/C without B/D. Audit doc (docs/release/v1_current_state_audit.md): - Distinguish Path A (label in cleartext) from Paths B-E (joins). - Add Phase-2 ablation table showing collapse to (1 - conv_rate). - Report bonus model in two variants; AUC = 1.000 in BOTH variants is a substantive new finding (non-trivial relational features carry the leak independently of close_outcome — i.e., omitting customers/subscriptions tables is structurally required). - Correct G4.4 verdict to PASS (with caveat about within-horizon content still leaking). - Note instructor companion behavior is expected. - Stop reporting AUC on deterministic 0/1 paths. - Remove "sanity floor" framing for Path A (understated the violation). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

shaypal5 · 2026-05-05T17:07:29Z

Self-review pass — applied (commit `8f0b4ca`)

Did a brutal outsider-view review and addressed the correctness/credibility findings. Highlights:

Substantive new finding from the review: The bonus model achieves AUC = 1.000 even without the any_closed_won aggregate (5 features: n_opps, max_acv, mean_acv, n_customers, n_subscriptions). That means non-trivial relational features carry the leak independently of close_outcome — confirming PR 2.1's structural fix must omit customers/subscriptions from public bundles, not just redact a column.

Corrected G4.4 verdict. Previously "FAIL (assumed)." Direct empirical probe on event-table timestamps shows G4.4 passes literally (90-day simulation horizon already bounds timestamps; 0 violations across all event tables and tiers). The audit now distinguishes the snapshot-window invariant from the relational-leakage invariants — they're independent gates.

Probe script is now Phase-2-ready. Accepts optional customers/subscriptions (so the same script runs cleanly on regenerated bundles), asserts lead_id uniqueness, and adds --max-accuracy for CI gating (exit 2 on threshold violation). The Phase-2-success ablation runs in-process to surface the post-fix shape now.

Test coverage tripled (2 → 8 tests): _binary_metrics (3 cases), uniqueness rejection, Phase-2 ablation, and two CLI subprocess tests on the real alpha bundle.

Framing corrections in the audit doc:

Distinguish Path A (label in cleartext) from Paths B–E (joins). Different fix mechanisms; conflating them was misleading.
Drop "sanity floor" language for Path A — it understated the violation.
Stop reporting AUC for deterministic 0/1 paths.
Note that release/intermediate_instructor is expected to retain these channels.

Five real issues from the automated review: 1. Path A (label cleartext read) used .astype(bool), which maps NaN to True. Routed through pandas' nullable "boolean" dtype with explicit fillna(False) so partially-redacted bundles don't report phantom positives. Same fix applied to y_true construction in probe_bundle. 2. _binary_metrics F1 returned NaN when precision OR recall was exactly 0.0 — should be 0.0 (zero-skill is well-defined; NaN should signal undefined). New logic: NaN only when precision/recall is itself NaN; 0.0 when both are zero; harmonic mean otherwise. 3. Module docstring promised exit 1 on "shape errors" but main() only caught FileNotFoundError. Broadened catch to (FileNotFoundError, KeyError, ValueError); error message now includes exception type. 4. probe_bundle() previously bailed with an "error" key when converted_within_90_days was absent. Post-Phase-2 bundles will be exactly that shape. Now reports path_prediction_rates (fraction of leads each path flags positive) unconditionally — useful as a sanity view both before and after the fix. The early-return is now a "note" key rather than "error", with a message pointing the reader at the prediction-rates view. 5. New tests: - F1 zero-precision-and-recall yields zero, not NaN - Path A with NaN-bearing label column maps NaN -> False Total: 10 tests, all passing. Lint/format/mypy clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+def _g4_4_snapshot_window_probe(
+    leads: pd.DataFrame,
+    tables_dir: Path,
+    horizon_days: int,
+) -> dict[str, dict[str, int]]:
+    """Per-event-table: count rows with timestamp > lead_created_at + horizon_days.
+
+    Implements G4.4 directly. Skips tables that aren't on disk.
+    """
+    leads_anchor = leads[["lead_id", "lead_created_at"]].copy()
+    leads_anchor["lead_created_at"] = pd.to_datetime(leads_anchor["lead_created_at"])
+    horizon = pd.Timedelta(days=horizon_days)
+
+    event_tables = (
+        ("touches", "touch_timestamp"),
+        ("sessions", "session_timestamp"),
+        ("sales_activities", "activity_timestamp"),
+        ("opportunities", "created_at"),
+    )
+    out: dict[str, dict[str, int]] = {}
+    for tbl, ts_col in event_tables:
+        path = tables_dir / f"{tbl}.parquet"
+        if not path.exists():
+            continue
+        df = pd.read_parquet(path, columns=["lead_id", ts_col])
+        df[ts_col] = pd.to_datetime(df[ts_col])
+        merged = df.merge(leads_anchor, on="lead_id", how="left")
+        bad = (merged[ts_col] > merged["lead_created_at"] + horizon).sum()
+        out[tbl] = {"violations": int(bad), "total_rows": int(len(df))}
+    return out
+
+
+def _read_horizon_days(bundle_dir: Path) -> int:
+    manifest = bundle_dir / "manifest.json"
+    if manifest.exists():
+        try:
+            data = json.loads(manifest.read_text())
+            return int(data.get("horizon_days") or DEFAULT_HORIZON_DAYS)
+        except (json.JSONDecodeError, ValueError, TypeError):
+            return DEFAULT_HORIZON_DAYS
+    return DEFAULT_HORIZON_DAYS


+        df = pd.read_parquet(path, columns=["lead_id", ts_col])
+        df[ts_col] = pd.to_datetime(df[ts_col])
+        merged = df.merge(leads_anchor, on="lead_id", how="left")
+        bad = (merged[ts_col] > merged["lead_created_at"] + horizon).sum()
+        out[tbl] = {"violations": int(bad), "total_rows": int(len(df))}
+    return out


+def _read_optional_table(tables_dir: Path, name: str, columns: list[str]) -> pd.DataFrame:
+    path = tables_dir / f"{name}.parquet"
+    if path.exists():
+        return pd.read_parquet(path)


+            include_close_outcome_aggregates=include_co_agg,
+        )
+        y = (
+            leads.set_index("lead_id")["converted_within_90_days"]


+**Status:** **BLOCKER CONFIRMED.** Public bundles fail G4.1, G4.2, G4.3, G4.5, G4.6 in `v1_acceptance_gates.md`. G4.4 (snapshot-window timestamps) **passes empirically** on alpha bundles — important nuance, see §G4.4 below.
+**Structural fix:** PR 2.1 — `leadforge/render/relational_snapshot_safe.py` + `leadforge/validation/relational_leakage.py`.


github-actions · 2026-05-05T17:18:47Z

pr-agent-context report:

This is a refreshed snapshot of the current PR state.

This run includes patch coverage gaps on PR #62 in repository https://github.com/leadforge-dev/leadforge

Address the patch coverage gaps below, then push all of these changes in a single commit.

# Patch coverage

Patch test coverage is 0%; please raise it to 100%. These are the uncovered code lines:
- scripts/probe_relational_leakage.py: 60, 62, 63, 64, 65, 66, 67, 69, 71, 72, 75, 98, 99, 101, 107, 108, 110, 113, 114, 118, 119, 122, 123, 127, 128, 131, 134, 135, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 149, 151, 163, 170, 171, 172, 173, 174, 175, 176, 177, 183, 184, 185, 186, 188, 190, 202, 216, 221, 222, 226, 231, 232, 233, 237, 238, 240, 241, 242, 244, 249, 250, 251, 252, 254, 260, 267, 268, 269, 270, 271, 274, 292, 293, 294, 295, 296, 297, 298, 299, 300, 302, 303, 305, 306, 310, 317, 323, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 349, 350, 353, 363, 366, 367, 370, 377, 384, 387, 396, 397, 398, 400, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 419, 420, 421, 422, 423, 424, 425, 426, 427, 430, 431, 432, 433, 434, 437, 444, 445, 446, 447, 448, 449, 451, 452, 453, 456, 460, 462, 478, 479, 480, 484, 486, 487, 492, 494, 500, 501, 503, 509, 510, 511, 515, 516, 519, 520, 527, 528, 529, 537, 538, 539, 540, 542, 543, 544, 546, 547, 548, 549, 551, 552, 553, 554, 556, 557, 558, 560, 561, 562, 563, 564, 570, 571, 572, 573, 574, 576, 577, 578, 579, 582, 588, 589, 590, 591, 592, 593, 594, 595, 598, 599, 602, 610, 615, 624, 626, 627, 628, 629, 630, 632, 633, 635, 637, 638, 639, 640, 645, 646, 649, 650
- tests/scripts/test_probe_relational_leakage.py: 8, 10, 11, 12, 13, 15, 16, 18, 19, 20, 21, 22, 23, 24, 25, 28, 38, 45, 67, 73, 78, 81, 82, 84, 88, 95, 96, 99, 102, 103, 104, 105, 108, 112, 115, 116, 117, 118, 119, 122, 125, 126, 127, 128, 133, 134, 135, 136, 137, 138, 139, 140, 141, 144, 147, 148, 149, 150, 151, 152, 153, 156, 157, 158, 159, 160, 161, 162, 163, 166, 170, 171, 172, 173, 174, 175, 178, 181, 188, 196, 199, 202, 205, 206, 207, 215, 219, 222, 228, 229, 231, 234, 238, 241, 253, 254

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: review comment posted
Workflow run: 25391228846 attempt 1
Comment timestamp: 2026-05-05T17:17:37.648607+00:00
PR head commit: d9cb0b2365c3e10fd04311eba1aa0f7b8a7c2fa3

Copilot AI review requested due to automatic review settings May 5, 2026 16:20

shaypal5 added this to the dataset: leadforge-lead-scoring-v1 milestone May 5, 2026

shaypal5 added the type: docs Documentation or narrative changes label May 5, 2026

Copilot started reviewing on behalf of shaypal5 May 5, 2026 16:21 View session

This comment has been minimized.

Sign in to view

Copilot AI reviewed May 5, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

Copilot AI review requested due to automatic review settings May 5, 2026 17:12

Copilot started reviewing on behalf of shaypal5 May 5, 2026 17:13 View session

This comment has been minimized.

Sign in to view

Copilot AI reviewed May 5, 2026

View reviewed changes

shaypal5 merged commit e2f7097 into main May 5, 2026
13 of 15 checks passed

shaypal5 deleted the docs/pr-1.1-relational-leakage-audit branch May 5, 2026 17:20

shaypal5 mentioned this pull request May 5, 2026

feat(render,validation): snapshot-safe relational export + leakage validator (PR 2.1) #63

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: PR 1.1 — relational leakage audit on alpha bundles#62

docs: PR 1.1 — relational leakage audit on alpha bundles#62
shaypal5 merged 3 commits into
mainfrom
docs/pr-1.1-relational-leakage-audit

shaypal5 commented May 5, 2026

Uh oh!

This comment has been minimized.

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

shaypal5 commented May 5, 2026

Uh oh!

This comment has been minimized.

This comment has been minimized.

Copilot AI left a comment

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		Status: BLOCKER CONFIRMED. Public bundles fail G4.1, G4.2, G4.3, G4.5, G4.6 in `v1_acceptance_gates.md`. G4.4 (snapshot-window timestamps) passes empirically on alpha bundles — important nuance, see §G4.4 below.
		Structural fix: PR 2.1 — `leadforge/render/relational_snapshot_safe.py` + `leadforge/validation/relational_leakage.py`.

Conversation

shaypal5 commented May 5, 2026

Summary

What's in this PR

Headline result (per-tier reconstruction accuracy vs. converted_within_90_days)

Acceptance (per the PR 1.1 spec)

NOT in this PR

Test plan

Uh oh!

This comment has been minimized.

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

shaypal5 commented May 5, 2026

Self-review pass — applied (commit 8f0b4ca)

Uh oh!

This comment has been minimized.

This comment has been minimized.

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Headline result (per-tier reconstruction accuracy vs. `converted_within_90_days`)

Self-review pass — applied (commit `8f0b4ca`)