Skip to content

docs: PR 1.1 — relational leakage audit on alpha bundles#62

Merged
shaypal5 merged 3 commits into
mainfrom
docs/pr-1.1-relational-leakage-audit
May 5, 2026
Merged

docs: PR 1.1 — relational leakage audit on alpha bundles#62
shaypal5 merged 3 commits into
mainfrom
docs/pr-1.1-relational-leakage-audit

Conversation

@shaypal5

@shaypal5 shaypal5 commented May 5, 2026

Copy link
Copy Markdown
Contributor

Summary

PR 1.1 of the v1 dataset release sequence (docs/release/v1_release_roadmap.md Phase 1). Reproduces the ChatGPT v2 relational-leakage finding on the actual 5000-lead alpha bundles in release/{intro,intermediate,advanced}/ and produces documented evidence to anchor Phase 2's structural fix.

This is a docs + small-script PR — no production code changes, no bundle regeneration. Scope intentionally narrow per the roadmap.

What's in this PR

  • scripts/probe_relational_leakage.py — five deterministic reconstruction paths plus a bonus public-relational-only model:

    • A. Direct read of leads.converted_within_90_days (sanity floor)
    • B. Opportunity outcome (any close_outcome == 'closed_won')
    • C. Customer existence (lead → opportunities → customers)
    • D. Subscription existence (lead → opportunities → customers → subscriptions)
    • E. Deterministic OR (B ∨ C ∨ D) — the headline reconstruction
    • F. Bonus: 5-fold CV LR + HistGBM on join-derived features

    The deterministic reconstruction is a pure function deterministic_relational_reconstruction(...) designed to lift verbatim into leadforge/validation/leakage_probes.py in PR 3.1.

  • tests/scripts/test_probe_relational_leakage.py — unit test on a hand-built four-lead toy bundle covering each path independently, plus a path_a_collapses_when_label_dropped test that pre-validates the shape Phase 2's fix needs to produce.

  • docs/release/v1_current_state_audit.md — per-tier evidence table, explicit G4.1-G4.6 verdict (all FAIL), pointer to PR 2.1 as the structural fix.

  • docs/release/v1_acceptance_gates.md — minor wording tightening of G1.1 to point at this audit doc as the Phase 1 reaffirmation.

  • .agent-plan.md — Phase 1 items checked off.

Headline result (per-tier reconstruction accuracy vs. converted_within_90_days)

All five deterministic paths achieve 1.000 accuracy / precision / recall / F1 in every tier. The bonus LR and HistGBM models also achieve AUC = 1.000 in 5-fold CV.

Tier n_leads conv. rate Path E acc LR AUC HistGBM AUC
intro 5000 0.422 1.000 1.000 1.000
intermediate 5000 0.210 1.000 1.000 1.000
advanced 5000 0.079 1.000 1.000 1.000

The 100% reconstruction is structural, not probabilistic noise: public bundles literally contain the label (leads.converted_within_90_days), plus three independent post-outcome side channels (opportunities.close_outcome, conversion-conditional customers.parquet, conversion-conditional subscriptions.parquet). See the audit doc for the full G4.x verdict.

Acceptance (per the PR 1.1 spec)

  • python scripts/probe_relational_leakage.py release/intermediate exits 0 and reports a numeric reconstruction accuracy
  • The audit doc contains a per-tier metric table
  • One small unit test on the deterministic-reconstruction function (two, actually — second test pre-locks the shape Phase 2 produces)
  • G1.1 confirmed in v1_acceptance_gates.md (no-op tightening — name was already locked via External review synthesis + v1 release roadmap #61)

NOT in this PR

  • The snapshot-safe relational module (PR 2.1)
  • Validator wiring (PR 2.2)
  • The validation modules (PR 3.1)
  • Bundle regeneration

This is evidence + docs only. The structural fix lives in PR 2.1.

Test plan

  • python -m pytest tests/scripts/test_probe_relational_leakage.py -v (2 passed)
  • ruff check and ruff format --check clean on new files
  • mypy scripts/probe_relational_leakage.py clean
  • pre-commit run --files <changed> clean
  • python scripts/probe_relational_leakage.py release/intro|intermediate|advanced runs and reports numeric results

🤖 Generated with Claude Code

Reproduces the ChatGPT v2 finding on the actual 5000-lead alpha bundles
(`release/{intro,intermediate,advanced}/`). All three public tiers
reconstruct `converted_within_90_days` at 100% via every probed path;
the join-derived LR / HistGBM model also achieves AUC = 1.000 in 5-fold CV.

Adds:

- `scripts/probe_relational_leakage.py` — five deterministic reconstruction
  paths (A direct, B opportunity won, C customer exists, D subscription
  exists, E B∨C∨D) plus a bonus public-relational-only model. The
  deterministic reconstruction is exposed as a pure function
  `deterministic_relational_reconstruction(...)` designed to lift verbatim
  into `leadforge/validation/leakage_probes.py` in PR 3.1.
- `tests/scripts/test_probe_relational_leakage.py` — unit test on a
  hand-built four-lead toy bundle covering each path independently, plus
  a path-A-collapses-when-label-dropped test that pre-validates Phase 2's
  fix shape.
- `docs/release/v1_current_state_audit.md` — per-tier evidence table,
  explicit G4.1-G4.6 verdict (all FAIL), pointer to PR 2.1 as the
  structural fix.

Also reaffirms G1.1 (dataset name `leadforge-lead-scoring-v1`, already
locked via PR #61's milestone rename) in `v1_acceptance_gates.md`, and
checks off Phase 1 in `.agent-plan.md`.

This PR is evidence + docs only. The structural fix lives in PR 2.1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 16:20
@shaypal5 shaypal5 added the type: docs Documentation or narrative changes label May 5, 2026
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a Phase-1 audit artifact for the v1 release roadmap by introducing a script + unit test that demonstrate deterministic relational reconstruction of converted_within_90_days from the current alpha student_public bundles, and documents the evidence + gate failures to anchor the Phase-2 structural fix work.

Changes:

  • Added scripts/probe_relational_leakage.py to compute deterministic reconstruction paths (A–E) plus optional sklearn-based CV metrics on join-derived features.
  • Added tests/scripts/test_probe_relational_leakage.py to lock the deterministic reconstruction behavior on a toy 4-lead bundle (including “label dropped” behavior for Phase 2).
  • Added docs/release/v1_current_state_audit.md evidence write-up and minor wording updates to the release acceptance gates and agent plan.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
scripts/probe_relational_leakage.py New probe script implementing deterministic join-based reconstruction and optional sklearn CV metrics.
tests/scripts/test_probe_relational_leakage.py New unit tests validating each deterministic reconstruction path and the “missing label” behavior.
docs/release/v1_current_state_audit.md New audit doc capturing per-tier evidence and explicit G4.x verdicts.
docs/release/v1_acceptance_gates.md Tightens G1.1 wording to reference Phase-1 reaffirmation in the audit doc.
.agent-plan.md Marks Phase 1 as complete and records the probe’s headline findings.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/probe_relational_leakage.py Outdated
Comment thread scripts/probe_relational_leakage.py Outdated
Comment thread scripts/probe_relational_leakage.py Outdated
Comment thread scripts/probe_relational_leakage.py Outdated
Comment thread scripts/probe_relational_leakage.py Outdated
Comment thread scripts/probe_relational_leakage.py Outdated
Comment thread docs/release/v1_current_state_audit.md Outdated
Comment thread docs/release/v1_current_state_audit.md Outdated
Brutal self-review identified 15 issues; this commit addresses the
correctness/credibility ones. Summary:

Script (scripts/probe_relational_leakage.py):

- Assert lead_id uniqueness in deterministic_relational_reconstruction
  (a validator must refuse non-unique keys, not silently misalign).
- Make customers/subscriptions optional inputs — empty frames now
  accepted, so the same script runs cleanly on Phase-2-fixed bundles.
- Branch the agg dict properly — was using has_close_outcome flag
  *inside* a lambda that pandas would have crashed before invoking.
- Add Phase-2-success ablation: re-run deterministic probes on a
  redacted-in-process view to surface the post-fix shape *now*.
- Bonus model gains a without_close_outcome_aggregates variant —
  the with-aggregates variant is trivially perfect because
  any_closed_won is Path B aggregated; the without variant is the
  load-bearing probe. Refactored to sklearn Pipeline (removes
  string-based scaler dispatch).
- Add G4.4 snapshot-window probe directly on event tables. Empirical
  result: alpha bundles PASS G4.4 (90-day horizon already bounds
  timestamps). The audit doc now corrects the prior "FAIL (assumed)"
  framing.
- Add --max-accuracy flag for Phase-2 CI gating (exit 2 on threshold
  violation).
- Rewrite docstring to clearly distinguish four orthogonal evidence
  channels.

Tests (tests/scripts/test_probe_relational_leakage.py):

- Add _binary_metrics coverage (perfect / all-negative / mixed) — the
  function had divide-by-zero handling and F1 logic with zero tests.
- Add lead_id-uniqueness rejection test.
- Add CLI smoke test on release/intermediate (skip if not present).
- Add --max-accuracy gate failure test.
- Drop the synthetic edge-case lead — toy bundle is now 3 leads
  cleanly covering A/C without B/D.

Audit doc (docs/release/v1_current_state_audit.md):

- Distinguish Path A (label in cleartext) from Paths B-E (joins).
- Add Phase-2 ablation table showing collapse to (1 - conv_rate).
- Report bonus model in two variants; AUC = 1.000 in BOTH variants
  is a substantive new finding (non-trivial relational features
  carry the leak independently of close_outcome — i.e., omitting
  customers/subscriptions tables is structurally required).
- Correct G4.4 verdict to PASS (with caveat about within-horizon
  content still leaking).
- Note instructor companion behavior is expected.
- Stop reporting AUC on deterministic 0/1 paths.
- Remove "sanity floor" framing for Path A (understated the violation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shaypal5

shaypal5 commented May 5, 2026

Copy link
Copy Markdown
Contributor Author

Self-review pass — applied (commit 8f0b4ca)

Did a brutal outsider-view review and addressed the correctness/credibility findings. Highlights:

Substantive new finding from the review: The bonus model achieves AUC = 1.000 even without the any_closed_won aggregate (5 features: n_opps, max_acv, mean_acv, n_customers, n_subscriptions). That means non-trivial relational features carry the leak independently of close_outcome — confirming PR 2.1's structural fix must omit customers/subscriptions from public bundles, not just redact a column.

Corrected G4.4 verdict. Previously "FAIL (assumed)." Direct empirical probe on event-table timestamps shows G4.4 passes literally (90-day simulation horizon already bounds timestamps; 0 violations across all event tables and tiers). The audit now distinguishes the snapshot-window invariant from the relational-leakage invariants — they're independent gates.

Probe script is now Phase-2-ready. Accepts optional customers/subscriptions (so the same script runs cleanly on regenerated bundles), asserts lead_id uniqueness, and adds --max-accuracy for CI gating (exit 2 on threshold violation). The Phase-2-success ablation runs in-process to surface the post-fix shape now.

Test coverage tripled (2 → 8 tests): _binary_metrics (3 cases), uniqueness rejection, Phase-2 ablation, and two CLI subprocess tests on the real alpha bundle.

Framing corrections in the audit doc:

  • Distinguish Path A (label in cleartext) from Paths B–E (joins). Different fix mechanisms; conflating them was misleading.
  • Drop "sanity floor" language for Path A — it understated the violation.
  • Stop reporting AUC for deterministic 0/1 paths.
  • Note that release/intermediate_instructor is expected to retain these channels.

@github-actions

This comment has been minimized.

Five real issues from the automated review:

1. Path A (label cleartext read) used .astype(bool), which maps NaN to
   True. Routed through pandas' nullable "boolean" dtype with explicit
   fillna(False) so partially-redacted bundles don't report phantom
   positives. Same fix applied to y_true construction in probe_bundle.

2. _binary_metrics F1 returned NaN when precision OR recall was exactly
   0.0 — should be 0.0 (zero-skill is well-defined; NaN should signal
   undefined). New logic: NaN only when precision/recall is itself NaN;
   0.0 when both are zero; harmonic mean otherwise.

3. Module docstring promised exit 1 on "shape errors" but main() only
   caught FileNotFoundError. Broadened catch to (FileNotFoundError,
   KeyError, ValueError); error message now includes exception type.

4. probe_bundle() previously bailed with an "error" key when
   converted_within_90_days was absent. Post-Phase-2 bundles will be
   exactly that shape. Now reports path_prediction_rates (fraction of
   leads each path flags positive) unconditionally — useful as a
   sanity view both before and after the fix. The early-return is now
   a "note" key rather than "error", with a message pointing the
   reader at the prediction-rates view.

5. New tests:
   - F1 zero-precision-and-recall yields zero, not NaN
   - Path A with NaN-bearing label column maps NaN -> False

Total: 10 tests, all passing. Lint/format/mypy clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 17:12
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +387 to +427
def _g4_4_snapshot_window_probe(
leads: pd.DataFrame,
tables_dir: Path,
horizon_days: int,
) -> dict[str, dict[str, int]]:
"""Per-event-table: count rows with timestamp > lead_created_at + horizon_days.

Implements G4.4 directly. Skips tables that aren't on disk.
"""
leads_anchor = leads[["lead_id", "lead_created_at"]].copy()
leads_anchor["lead_created_at"] = pd.to_datetime(leads_anchor["lead_created_at"])
horizon = pd.Timedelta(days=horizon_days)

event_tables = (
("touches", "touch_timestamp"),
("sessions", "session_timestamp"),
("sales_activities", "activity_timestamp"),
("opportunities", "created_at"),
)
out: dict[str, dict[str, int]] = {}
for tbl, ts_col in event_tables:
path = tables_dir / f"{tbl}.parquet"
if not path.exists():
continue
df = pd.read_parquet(path, columns=["lead_id", ts_col])
df[ts_col] = pd.to_datetime(df[ts_col])
merged = df.merge(leads_anchor, on="lead_id", how="left")
bad = (merged[ts_col] > merged["lead_created_at"] + horizon).sum()
out[tbl] = {"violations": int(bad), "total_rows": int(len(df))}
return out


def _read_horizon_days(bundle_dir: Path) -> int:
manifest = bundle_dir / "manifest.json"
if manifest.exists():
try:
data = json.loads(manifest.read_text())
return int(data.get("horizon_days") or DEFAULT_HORIZON_DAYS)
except (json.JSONDecodeError, ValueError, TypeError):
return DEFAULT_HORIZON_DAYS
return DEFAULT_HORIZON_DAYS
Comment on lines +411 to +416
df = pd.read_parquet(path, columns=["lead_id", ts_col])
df[ts_col] = pd.to_datetime(df[ts_col])
merged = df.merge(leads_anchor, on="lead_id", how="left")
bad = (merged[ts_col] > merged["lead_created_at"] + horizon).sum()
out[tbl] = {"violations": int(bad), "total_rows": int(len(df))}
return out
def _read_optional_table(tables_dir: Path, name: str, columns: list[str]) -> pd.DataFrame:
path = tables_dir / f"{name}.parquet"
if path.exists():
return pd.read_parquet(path)
include_close_outcome_aggregates=include_co_agg,
)
y = (
leads.set_index("lead_id")["converted_within_90_days"]
Comment on lines +6 to +7
**Status:** **BLOCKER CONFIRMED.** Public bundles fail G4.1, G4.2, G4.3, G4.5, G4.6 in `v1_acceptance_gates.md`. G4.4 (snapshot-window timestamps) **passes empirically** on alpha bundles — important nuance, see §G4.4 below.
**Structural fix:** PR 2.1 — `leadforge/render/relational_snapshot_safe.py` + `leadforge/validation/relational_leakage.py`.
@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown

pr-agent-context report:

This is a refreshed snapshot of the current PR state.

This run includes patch coverage gaps on PR #62 in repository https://github.com/leadforge-dev/leadforge

Address the patch coverage gaps below, then push all of these changes in a single commit.

# Patch coverage

Patch test coverage is 0%; please raise it to 100%. These are the uncovered code lines:
- scripts/probe_relational_leakage.py: 60, 62, 63, 64, 65, 66, 67, 69, 71, 72, 75, 98, 99, 101, 107, 108, 110, 113, 114, 118, 119, 122, 123, 127, 128, 131, 134, 135, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 149, 151, 163, 170, 171, 172, 173, 174, 175, 176, 177, 183, 184, 185, 186, 188, 190, 202, 216, 221, 222, 226, 231, 232, 233, 237, 238, 240, 241, 242, 244, 249, 250, 251, 252, 254, 260, 267, 268, 269, 270, 271, 274, 292, 293, 294, 295, 296, 297, 298, 299, 300, 302, 303, 305, 306, 310, 317, 323, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 349, 350, 353, 363, 366, 367, 370, 377, 384, 387, 396, 397, 398, 400, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 419, 420, 421, 422, 423, 424, 425, 426, 427, 430, 431, 432, 433, 434, 437, 444, 445, 446, 447, 448, 449, 451, 452, 453, 456, 460, 462, 478, 479, 480, 484, 486, 487, 492, 494, 500, 501, 503, 509, 510, 511, 515, 516, 519, 520, 527, 528, 529, 537, 538, 539, 540, 542, 543, 544, 546, 547, 548, 549, 551, 552, 553, 554, 556, 557, 558, 560, 561, 562, 563, 564, 570, 571, 572, 573, 574, 576, 577, 578, 579, 582, 588, 589, 590, 591, 592, 593, 594, 595, 598, 599, 602, 610, 615, 624, 626, 627, 628, 629, 630, 632, 633, 635, 637, 638, 639, 640, 645, 646, 649, 650
- tests/scripts/test_probe_relational_leakage.py: 8, 10, 11, 12, 13, 15, 16, 18, 19, 20, 21, 22, 23, 24, 25, 28, 38, 45, 67, 73, 78, 81, 82, 84, 88, 95, 96, 99, 102, 103, 104, 105, 108, 112, 115, 116, 117, 118, 119, 122, 125, 126, 127, 128, 133, 134, 135, 136, 137, 138, 139, 140, 141, 144, 147, 148, 149, 150, 151, 152, 153, 156, 157, 158, 159, 160, 161, 162, 163, 166, 170, 171, 172, 173, 174, 175, 178, 181, 188, 196, 199, 202, 205, 206, 207, 215, 219, 222, 228, 229, 231, 234, 238, 241, 253, 254

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: review comment posted
Workflow run: 25391228846 attempt 1
Comment timestamp: 2026-05-05T17:17:37.648607+00:00
PR head commit: d9cb0b2365c3e10fd04311eba1aa0f7b8a7c2fa3

@shaypal5 shaypal5 merged commit e2f7097 into main May 5, 2026
13 of 15 checks passed
@shaypal5 shaypal5 deleted the docs/pr-1.1-relational-leakage-audit branch May 5, 2026 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type: docs Documentation or narrative changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants