Skip to content

External review synthesis + v1 release roadmap#61

Merged
shaypal5 merged 3 commits into
mainfrom
external-review-roadmap
May 5, 2026
Merged

External review synthesis + v1 release roadmap#61
shaypal5 merged 3 commits into
mainfrom
external-review-roadmap

Conversation

@shaypal5

@shaypal5 shaypal5 commented May 5, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Lands the Gemini and ChatGPT external-review outputs (two iterations each, six files) targeting the v0.1.0-alpha bundles, plus a structured synthesis under docs/external_review/summaries/.
  • Adds the integrated v1 release roadmap, design, and acceptance gates under docs/release/.
  • Adds a hard constraint to CLAUDE.md forbidding public relational tables that reconstruct the label via joins.
  • Updates .agent-plan.md with the new "Next Up — leadforge-lead-scoring-v1 Curated Dataset Release" section, alpha tracker preserved beneath as historical context.

Critical finding from review

ChatGPT v2 verified locally that student_public bundles reconstruct converted_within_90_days end-to-end through public relational tables: opportunities.close_outcome == "closed_won" plus customers/subscriptions existence (which only exist for converted leads) recover the target with 100% accuracy via joins. This is THE blocker for any public Kaggle/HF release. Phase 2 of the v1 roadmap is the structural fix.

Roadmap shape (7 phases)

  1. Audit and naming — reproduce blocker locally; lock dataset name leadforge-lead-scoring-v1
  2. Snapshot-safe relational export — leadforge/render/relational_snapshot_safe.py + leadforge/validation/relational_leakage.py; bundle schema v4 → v5
  3. Release validation hardening — release_quality.py + leakage_probes.py + reporting.py; calibration / lift / cross-seed bands
  4. Channel-signal audit + dataset card hardening — Gemini's channel-conditional rates partially adopted as a v1 audit (full encoding deferred to post-v1)
  5. Platform packaging — Kaggle dataset-metadata.json + HF README.md with configs + cover image
  6. Notebook sequence (4) + adversarial framing (issue templates, break-me guide)
  7. LLM critique + publish — minimal one-shot critique with Effective Semantic Diversity rubric dimension

Recommendations triage

22 review findings processed in docs/external_review/summaries/recommendations_pass.md with explicit action codes (accept / accept-with-different-approach / defer / out-of-scope-issue). 14 items go to v1; 8 to post-v1 (after-release-roadmap); 2 to v2-track issues. The split is documented item-by-item with rationale.

What's NOT in this PR

Planning artifacts only — no code changes. Implementation begins in follow-up PRs (Phase 1 first).

Doc inventory

docs/external_review/
  gemini/{gemini_report_v1,gemini_report_v2}.md
  chatgpt/{chatgpt_report_v1,chatgpt_report_v2,leadforge_report_v1_critique,leadforge_second_attempt_guidance}.md
  summaries/
    README.md
    {gemini_v1,gemini_v2,chatgpt_v1,chatgpt_v1_critique,chatgpt_guidance,chatgpt_v2}_summary.md
    cross_source_takeaways.md
    key_findings.md
    recommendations_pass.md

docs/release/
  v1_release_roadmap.md       # active source of truth
  v1_release_design.md        # architecture decisions
  v1_acceptance_gates.md      # operational definition of v1-ready
  post_v1_roadmap.md          # deferred items

Test plan

  • Pre-commit hooks pass (whitespace + EOF fixes auto-applied during first commit; rerun clean on second)
  • Reviewer reads recommendations_pass.md and confirms triage matches intent
  • Reviewer reads v1_release_roadmap.md Phase 2 design and confirms snapshot-safe export shape is correct
  • Reviewer reads v1_acceptance_gates.md and confirms the gate set is the right v1-ready definition
  • CLAUDE.md hard constraint reads correctly and isn't redundant with existing items

🤖 Generated with Claude Code

Land the Gemini and ChatGPT external-review outputs (two iterations each,
six files) targeting the v0.1.0-alpha bundles, plus a structured
synthesis under docs/external_review/summaries/ and an integrated v1
release roadmap under docs/release/.

Critical finding from review (verified locally by ChatGPT v2):
student_public bundles reconstruct converted_within_90_days end-to-end
through public relational tables — opportunities.close_outcome plus
customers/subscriptions existence recover the target via joins. CLAUDE.md
gains a hard constraint forbidding this; the v1 release roadmap's
Phase 2 is the structural fix.

This PR is planning artifacts only — no code changes. The roadmap
defines 7 phases ending with publishing leadforge-lead-scoring-v1 to
Kaggle and Hugging Face. Implementation begins in follow-up PRs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 15:35
@shaypal5 shaypal5 added the type: docs Documentation or narrative changes label May 5, 2026
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds planning and reference documentation for moving from the current alpha bundles toward a public v1 dataset release (leadforge-lead-scoring-v1). It lands the raw external review artifacts (Gemini + ChatGPT), a Claude-authored synthesis/triage of those findings, and a phased v1 release roadmap with an explicit “v1-ready” acceptance-gate contract; it also updates project guidance (CLAUDE.md) and the agent execution plan (.agent-plan.md) to reflect the newly identified relational-leakage blocker.

Changes:

  • Add v1 release planning docs: roadmap, design decisions, and acceptance gates (plus a post-v1 backlog).
  • Add external review corpus + structured summaries/synthesis/triage under docs/external_review/.
  • Update constraints and planning trackers (CLAUDE.md, .agent-plan.md) to reflect the relational-leakage finding and v1 workstream.

Reviewed changes

Copilot reviewed 22 out of 24 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
docs/release/v1_release_roadmap.md Defines the 7-phase v1 release plan and operational “v1-ready” criteria.
docs/release/v1_release_design.md Captures v1-specific architectural decisions (snapshot-safe relational export, release validation/reporting, packaging, critique loop).
docs/release/v1_acceptance_gates.md Specifies machine-checkable gates/criteria intended to define “ready to publish.”
docs/release/post_v1_roadmap.md Collects deferred items and longer-horizon improvements after v1 ships.
docs/external_review/summaries/recommendations_pass.md Triage of 22 findings with action codes and v1 vs post-v1 targeting.
docs/external_review/summaries/README.md Index of the external review corpus and synthesized outputs.
docs/external_review/summaries/key_findings.md Action-prioritized synthesis of review findings (critical→low).
docs/external_review/summaries/gemini_v2_summary.md Summary of Gemini v2 review and unique contributions.
docs/external_review/summaries/gemini_v1_summary.md Summary of Gemini v1 review and unique contributions.
docs/external_review/summaries/cross_source_takeaways.md Cross-source theme consolidation and divergence map.
docs/external_review/summaries/chatgpt_v2_summary.md Summary of ChatGPT v2 forensic review and proposed roadmap shape.
docs/external_review/summaries/chatgpt_v1_summary.md Summary of ChatGPT v1 (historical / superseded).
docs/external_review/summaries/chatgpt_v1_critique_summary.md Summary of ChatGPT’s critique of its v1 report (methodology corrections).
docs/external_review/summaries/chatgpt_guidance_summary.md Summary of the “second attempt guidance” methodology spec.
docs/external_review/gemini/gemini_report_v2.md Raw Gemini v2 external review artifact.
docs/external_review/gemini/gemini_report_v1.md Raw Gemini v1 external review artifact.
docs/external_review/gemini/.gitkeep Keeps the Gemini review directory tracked.
docs/external_review/chatgpt/leadforge_report_v1_critique.md Raw ChatGPT critique artifact (methodology + corrections).
docs/external_review/chatgpt/chatgpt_report_v2.md Raw ChatGPT v2 external review artifact (forensic audit + roadmap).
docs/external_review/chatgpt/chatgpt_report_v1.md Raw ChatGPT v1 external review artifact (superseded).
docs/external_review/chatgpt/.gitkeep Keeps the ChatGPT review directory tracked.
CLAUDE.md Adds an explicit hard constraint prohibiting public relational tables that reconstruct the label via joins; links new v1 docs.
.agent-plan.md Adds “Next Up — leadforge-lead-scoring-v1” section and preserves the alpha tracker as historical context.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs/release/v1_release_design.md Outdated
Comment thread docs/release/v1_release_design.md
Comment thread docs/release/v1_acceptance_gates.md Outdated
Comment thread CLAUDE.md Outdated
Comment thread docs/release/v1_release_roadmap.md Outdated
Comment thread docs/external_review/summaries/README.md
Comment thread docs/external_review/summaries/recommendations_pass.md
Comment thread docs/release/v1_release_design.md Outdated
Comment thread docs/release/v1_release_roadmap.md Outdated
The original roadmap stopped at the phase level; reviewer asked for the
phase → N PRs → which-files-each-PR-touches decomposition. Adds:

- a "PRs" column to the phase summary table
- a new "PR breakdown" section with per-phase enumeration of the 14
  planned PRs (planning IDs phase.seq, not GitHub PR numbers)
- per-PR labels, file lists, sizes, and intra-phase dependencies

Total: 14 PRs targeting the v1.1.0 — Curated dataset v1 release
milestone. Phase 3 has 3 PRs (the largest phase); Phases 1 and 4 are
single-PR each.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

This comment has been minimized.

…one naming, CSV split contract

Six accepted review findings from PR #61 Copilot review (the three
"||"-prefix table claims were false positives and resolved as
irrelevant):

- COPILOT-1 (v1_release_design.md:37): clarify that flat task splits
  are emitted as Parquet at tasks/<task_id>/{train,valid,test}.parquet
  in the current alpha contract; CSV exports are a Phase 5 deliverable.
  Standardize on `valid.csv` (matches existing `valid.parquet`), drop
  the `validation.csv`/`valid.csv` divergence.

- COPILOT-2/3/4: replace fictional `event_timestamp` with the real
  per-table timestamp columns (`touch_timestamp`,
  `session_timestamp`, `activity_timestamp`) plus opportunities'
  `created_at`. Updates v1_release_design.md (the design statement),
  v1_acceptance_gates.md (gate G4.4), and CLAUDE.md (hard constraint).
  Makes all three machine-checkable against the actual schema.

- COPILOT-8: resolve git-tag inconsistency. v1_release_design.md:25
  no longer mentions `dataset/v1`; standardizes on
  `leadforge-lead-scoring-v1` as both the git tag and Release name.

- COPILOT-9: rename GitHub milestone #7 from "v1.1.0 — Curated dataset
  v1 release" (which read like a semver package version) to
  "dataset: leadforge-lead-scoring-v1" — explicitly dataset-scoped, no
  conflict with the package version decoupling principle. Three
  references in v1_release_roadmap.md updated to match.

Three Copilot threads (COPILOT-5/6/7) claimed tables had `||` prefixes;
verified false on inspection — tables use standard `|` delimiters.
Resolved as not-applicable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 16:01
@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #61 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25387522318 attempt 1
Comment timestamp: 2026-05-05T16:02:18.835567+00:00
PR head commit: ec12b88399cf12eeb8519f6d01c5657d838b0417

@shaypal5 shaypal5 merged commit 79cd3c8 into main May 5, 2026
10 checks passed
@shaypal5 shaypal5 deleted the external-review-roadmap branch May 5, 2026 16:06

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 24 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# v1 Lead-Scoring Dataset Release Roadmap

**Target:** Publish `leadforge-lead-scoring-v1` to Kaggle and Hugging Face as a best-in-class educational synthetic CRM dataset family.
**Source of truth:** This roadmap is derived from `docs/external_review/summaries/recommendations_pass.md` (signed off 2026-05-05).
Comment on lines +145 to +146
| **v1 (first-release-roadmap)** | 14 | #1, #2, #3, #4, #5, #6, #7, #8 (audit), #9, #10, #11 (minimal), #12 (LLM rubric), #14, #15, #16, #19 |
| **post-v1 (after-release-roadmap)** | 8 | #8 (full encoding), #11 (full CI), #12 (quantitative), #13, #17, #18, #20 (log-normal) |
prefixed with `TBD` are placeholders set in Phase 3 of the v1 release
roadmap; a release candidate cannot ship until all `TBD`s are resolved.

This file is the operational definition of done for the v1 release. It is

## Public/instructor diff gate

- **G9.1** Every public/instructor difference is intentional and listed in `release/EXPOSURE_DELTA.md`.
Comment thread .agent-plan.md
**v1.0.0 released (2026-05-02).** All milestones (M0–M13) complete. Teaching dataset series (v1–v7) approved by consumer. Package version bumped to 1.0.0 in pyproject.toml and leadforge/version.py.
**leadforge package v1.0.0 released (2026-05-02).** All framework milestones (M0–M13) complete. Teaching dataset series (v1–v7) approved by consumer. Package version bumped to 1.0.0 in pyproject.toml and leadforge/version.py.

**v0.1.0-alpha datasets shipped (2026-05-05).** Five bundles (intro / intermediate / advanced / intermediate_instructor / tiny_demo) in `release/`, packaged for external review by Gemini + ChatGPT (two iterations each). Synthesis lives under `docs/external_review/summaries/`. **Critical finding:** public relational tables reconstruct the target via joins (verified locally by ChatGPT v2 in `chatgpt_report_v2.md §0`). The v1 release work below addresses this plus the rest of the review consensus.
shaypal5 added a commit that referenced this pull request May 5, 2026
* docs: PR 1.1 — relational leakage audit on alpha bundles

Reproduces the ChatGPT v2 finding on the actual 5000-lead alpha bundles
(`release/{intro,intermediate,advanced}/`). All three public tiers
reconstruct `converted_within_90_days` at 100% via every probed path;
the join-derived LR / HistGBM model also achieves AUC = 1.000 in 5-fold CV.

Adds:

- `scripts/probe_relational_leakage.py` — five deterministic reconstruction
  paths (A direct, B opportunity won, C customer exists, D subscription
  exists, E B∨C∨D) plus a bonus public-relational-only model. The
  deterministic reconstruction is exposed as a pure function
  `deterministic_relational_reconstruction(...)` designed to lift verbatim
  into `leadforge/validation/leakage_probes.py` in PR 3.1.
- `tests/scripts/test_probe_relational_leakage.py` — unit test on a
  hand-built four-lead toy bundle covering each path independently, plus
  a path-A-collapses-when-label-dropped test that pre-validates Phase 2's
  fix shape.
- `docs/release/v1_current_state_audit.md` — per-tier evidence table,
  explicit G4.1-G4.6 verdict (all FAIL), pointer to PR 2.1 as the
  structural fix.

Also reaffirms G1.1 (dataset name `leadforge-lead-scoring-v1`, already
locked via PR #61's milestone rename) in `v1_acceptance_gates.md`, and
checks off Phase 1 in `.agent-plan.md`.

This PR is evidence + docs only. The structural fix lives in PR 2.1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* review: address self-review findings on PR 1.1

Brutal self-review identified 15 issues; this commit addresses the
correctness/credibility ones. Summary:

Script (scripts/probe_relational_leakage.py):

- Assert lead_id uniqueness in deterministic_relational_reconstruction
  (a validator must refuse non-unique keys, not silently misalign).
- Make customers/subscriptions optional inputs — empty frames now
  accepted, so the same script runs cleanly on Phase-2-fixed bundles.
- Branch the agg dict properly — was using has_close_outcome flag
  *inside* a lambda that pandas would have crashed before invoking.
- Add Phase-2-success ablation: re-run deterministic probes on a
  redacted-in-process view to surface the post-fix shape *now*.
- Bonus model gains a without_close_outcome_aggregates variant —
  the with-aggregates variant is trivially perfect because
  any_closed_won is Path B aggregated; the without variant is the
  load-bearing probe. Refactored to sklearn Pipeline (removes
  string-based scaler dispatch).
- Add G4.4 snapshot-window probe directly on event tables. Empirical
  result: alpha bundles PASS G4.4 (90-day horizon already bounds
  timestamps). The audit doc now corrects the prior "FAIL (assumed)"
  framing.
- Add --max-accuracy flag for Phase-2 CI gating (exit 2 on threshold
  violation).
- Rewrite docstring to clearly distinguish four orthogonal evidence
  channels.

Tests (tests/scripts/test_probe_relational_leakage.py):

- Add _binary_metrics coverage (perfect / all-negative / mixed) — the
  function had divide-by-zero handling and F1 logic with zero tests.
- Add lead_id-uniqueness rejection test.
- Add CLI smoke test on release/intermediate (skip if not present).
- Add --max-accuracy gate failure test.
- Drop the synthetic edge-case lead — toy bundle is now 3 leads
  cleanly covering A/C without B/D.

Audit doc (docs/release/v1_current_state_audit.md):

- Distinguish Path A (label in cleartext) from Paths B-E (joins).
- Add Phase-2 ablation table showing collapse to (1 - conv_rate).
- Report bonus model in two variants; AUC = 1.000 in BOTH variants
  is a substantive new finding (non-trivial relational features
  carry the leak independently of close_outcome — i.e., omitting
  customers/subscriptions tables is structurally required).
- Correct G4.4 verdict to PASS (with caveat about within-horizon
  content still leaking).
- Note instructor companion behavior is expected.
- Stop reporting AUC on deterministic 0/1 paths.
- Remove "sanity floor" framing for Path A (understated the violation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* review: address Copilot inline comments on PR 1.1

Five real issues from the automated review:

1. Path A (label cleartext read) used .astype(bool), which maps NaN to
   True. Routed through pandas' nullable "boolean" dtype with explicit
   fillna(False) so partially-redacted bundles don't report phantom
   positives. Same fix applied to y_true construction in probe_bundle.

2. _binary_metrics F1 returned NaN when precision OR recall was exactly
   0.0 — should be 0.0 (zero-skill is well-defined; NaN should signal
   undefined). New logic: NaN only when precision/recall is itself NaN;
   0.0 when both are zero; harmonic mean otherwise.

3. Module docstring promised exit 1 on "shape errors" but main() only
   caught FileNotFoundError. Broadened catch to (FileNotFoundError,
   KeyError, ValueError); error message now includes exception type.

4. probe_bundle() previously bailed with an "error" key when
   converted_within_90_days was absent. Post-Phase-2 bundles will be
   exactly that shape. Now reports path_prediction_rates (fraction of
   leads each path flags positive) unconditionally — useful as a
   sanity view both before and after the fix. The early-return is now
   a "note" key rather than "error", with a message pointing the
   reader at the prediction-rates view.

5. New tests:
   - F1 zero-precision-and-recall yields zero, not NaN
   - Path A with NaN-bearing label column maps NaN -> False

Total: 10 tests, all passing. Lint/format/mypy clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type: docs Documentation or narrative changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants