From bee364d3f8450f5c53efe7134c6e2b44f80fe3f2 Mon Sep 17 00:00:00 2001
From: Shay Palachy
Date: Tue, 5 May 2026 18:34:01 +0300
Subject: [PATCH 1/3] docs: external review synthesis + v1 release roadmap
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Land the Gemini and ChatGPT external-review outputs (two iterations each,
six files) targeting the v0.1.0-alpha bundles, plus a structured synthesis
under docs/external_review/summaries/ and an integrated v1 release roadmap
under docs/release/.

Critical finding from review (verified locally by ChatGPT v2):
student_public bundles reconstruct converted_within_90_days end-to-end
through public relational tables — opportunities.close_outcome plus
customers/subscriptions existence recover the target via joins. CLAUDE.md
gains a hard constraint forbidding this; the v1 release roadmap's Phase 2
is the structural fix.

This PR is planning artifacts only — no code changes. The roadmap defines
7 phases ending with publishing leadforge-lead-scoring-v1 to Kaggle and
Hugging Face. Implementation begins in follow-up PRs.

Co-Authored-By: Claude Opus 4.7
---
 .agent-plan.md | 59 +-
 CLAUDE.md | 6 +
 docs/external_review/chatgpt/.gitkeep | 0
 .../chatgpt/chatgpt_report_v1.md | 148 +++
 .../chatgpt/chatgpt_report_v2.md | 780 +++++++++++
 .../chatgpt/leadforge_report_v1_critique.md | 678 ++++++++++
 .../leadforge_second_attempt_guidance.md | 1167 +++++++++++++++++
 docs/external_review/gemini/.gitkeep | 0
 .../gemini/gemini_report_v1.md | 245 ++++
 .../gemini/gemini_report_v2.md | 247 ++++
 docs/external_review/summaries/README.md | 45 +
 .../summaries/chatgpt_guidance_summary.md | 51 +
 .../summaries/chatgpt_v1_critique_summary.md | 41 +
 .../summaries/chatgpt_v1_summary.md | 31 +
 .../summaries/chatgpt_v2_summary.md | 86 ++
 .../summaries/cross_source_takeaways.md | 161 +++
 .../summaries/gemini_v1_summary.md | 44 +
 .../summaries/gemini_v2_summary.md | 52 +
 .../external_review/summaries/key_findings.md | 151 +++
 .../summaries/recommendations_pass.md | 175 +++
 docs/release/post_v1_roadmap.md | 100 ++
 docs/release/v1_acceptance_gates.md | 178 +++
 docs/release/v1_release_design.md | 236 ++++
 docs/release/v1_release_roadmap.md | 344 +++++
 24 files changed, 5022 insertions(+), 3 deletions(-)
 create mode 100644 docs/external_review/chatgpt/.gitkeep
 create mode 100644 docs/external_review/chatgpt/chatgpt_report_v1.md
 create mode 100644 docs/external_review/chatgpt/chatgpt_report_v2.md
 create mode 100644 docs/external_review/chatgpt/leadforge_report_v1_critique.md
 create mode 100644 docs/external_review/chatgpt/leadforge_second_attempt_guidance.md
 create mode 100644 docs/external_review/gemini/.gitkeep
 create mode 100644 docs/external_review/gemini/gemini_report_v1.md
 create mode 100644 docs/external_review/gemini/gemini_report_v2.md
 create mode 100644 docs/external_review/summaries/README.md
 create mode 100644 docs/external_review/summaries/chatgpt_guidance_summary.md
 create mode 100644 docs/external_review/summaries/chatgpt_v1_critique_summary.md
 create mode 100644 docs/external_review/summaries/chatgpt_v1_summary.md
 create mode 100644 docs/external_review/summaries/chatgpt_v2_summary.md
 create mode 100644 docs/external_review/summaries/cross_source_takeaways.md
 create mode 100644 docs/external_review/summaries/gemini_v1_summary.md
 create mode 100644 docs/external_review/summaries/gemini_v2_summary.md
 create mode 100644 docs/external_review/summaries/key_findings.md
 create mode 100644 docs/external_review/summaries/recommendations_pass.md
 create mode 100644 docs/release/post_v1_roadmap.md
 create mode 100644 docs/release/v1_acceptance_gates.md
 create mode 100644 docs/release/v1_release_design.md
 create mode 100644 docs/release/v1_release_roadmap.md

diff --git a/.agent-plan.md b/.agent-plan.md
index b810a04..a9c5093 100644
--- a/.agent-plan.md
+++ b/.agent-plan.md
@@ -6,13 +6,66 @@
 
 ## Current System State
 
-**v1.0.0 released (2026-05-02).** All milestones (M0–M13) complete. Teaching dataset series (v1–v7) approved by consumer. Package version bumped to 1.0.0 in pyproject.toml and leadforge/version.py.
+**leadforge package v1.0.0 released (2026-05-02).** All framework milestones (M0–M13) complete. Teaching dataset series (v1–v7) approved by consumer. Package version bumped to 1.0.0 in pyproject.toml and leadforge/version.py.
+
+**v0.1.0-alpha datasets shipped (2026-05-05).** Five bundles (intro / intermediate / advanced / intermediate_instructor / tiny_demo) in `release/`, packaged for external review by Gemini + ChatGPT (two iterations each). Synthesis lives under `docs/external_review/summaries/`. **Critical finding:** public relational tables reconstruct the target via joins (verified locally by ChatGPT v2 in `chatgpt_report_v2.md §0`). The v1 release work below addresses this plus the rest of the review consensus.
+
+---
+
+## Next Up — `leadforge-lead-scoring-v1` Curated Dataset Release
+
+Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family to Kaggle and Hugging Face. Dataset version is decoupled from the leadforge package version (package stays at `1.x`).
+
+**Source of truth:** `docs/release/v1_release_roadmap.md`
+**Companion docs:** `docs/release/v1_release_design.md`, `docs/release/v1_acceptance_gates.md`, `docs/release/post_v1_roadmap.md`
+**External review materials:** `docs/external_review/{gemini,chatgpt}/` (raw) + `docs/external_review/summaries/` (synthesized)
+
+### Phase 1 — Audit and naming
+- [ ] Reproduce relational-leakage finding on alpha bundles → `docs/release/v1_current_state_audit.md`
+- [ ] Lock dataset release name `leadforge-lead-scoring-v1`
+
+### Phase 2 — Snapshot-safe relational export
+- [ ] `leadforge/render/relational_snapshot_safe.py` (new)
+- [ ] `leadforge/validation/relational_leakage.py` (new)
+- [ ] `BUNDLE_SCHEMA_VERSION` 4 → 5; manifest gains `relational_snapshot_safe`
+- [ ] Drop `converted_within_90_days` / `conversion_timestamp` from public `leads`; drop `close_outcome` / `closed_at` from public `opportunities`; omit `customers` / `subscriptions` from public bundles
+- [ ] Hash-determinism preserved on regenerated bundles
+
+### Phase 3 — Release validation hardening
+- [ ] `leadforge/validation/{release_quality,leakage_probes,reporting}.py` (new)
+- [ ] `scripts/validate_release_candidate.py` (new)
+- [ ] Resolve numeric `TBD-*` bands in `v1_acceptance_gates.md`
+- [ ] `release/validation/validation_report.{json,md}` + figures auto-generated
+
+### Phase 4 — Channel-signal audit + dataset card hardening
+- [ ] `scripts/audit_channel_signal.py` → `docs/release/channel_signal_audit.md`
+- [ ] `release/README.md` rewrite (release-grade dataset card; macro-framing paragraph; simulation-simplifications section)
+- [ ] `docs/release/{generation_method,feature_dictionary}.md`
+
+### Phase 5 — Platform packaging
+- [ ] `scripts/package_kaggle_release.py` → `release/kaggle/dataset-metadata.json`
+- [ ] `scripts/package_hf_release.py` → `release/huggingface/README.md` with YAML configs/default/pretty_name/tags
+- [ ] `release/dataset-cover-image.png` (≥560×280)
+- [ ] Local `load_dataset()` smoke test; Kaggle dry-run package validation
+
+### Phase 6 — Notebook sequence + adversarial framing
+- [ ] `release/notebooks/{02_relational_feature_engineering,03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb`
+- [ ] Update `01_baseline_lead_scoring.ipynb` to reproduce validation report metrics
+- [ ] `.github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml`
+- [ ] `docs/release/{break_me_guide,v2_decision_log}.md`
+
+### Phase 7 — LLM critique + publish
+- [ ] `leadforge/validation/llm_critique.py` (single-provider, env-var creds, skips cleanly)
+- [ ] `docs/release/llm_critique_prompt.md` + `scripts/run_llm_critique.py`
+- [ ] Adjudicate any high-severity findings (resolve in code or document in `v2_decision_log.md`)
+- [ ] `scripts/{publish_kaggle,publish_hf}.py` (dry-run → private/draft → public)
+- [ ] Tag `leadforge-lead-scoring-v1`; `docs/release/v1_release_notes.md`
 
 ---
 
-## Next Up — Public Kaggle/HuggingFace Release
+## Alpha Release — v0.1.0-alpha (shipped; superseded by v1 work above)
 
-First public dataset release: `leadforge-b2b-lead-scoring`. Three difficulty tiers (intro/intermediate/advanced) as full relational bundles + flat CSV convenience exports, plus a research_instructor companion for intermediate.
+First public dataset alpha: `leadforge-b2b-lead-scoring`. Three difficulty tiers (intro/intermediate/advanced) as full relational bundles + flat CSV convenience exports, plus a research_instructor companion for intermediate.

 ### Public release — Phase 1: Dataset card improvement ✓ (in PR)
 
diff --git a/CLAUDE.md b/CLAUDE.md
index c45eb93..ffdb587 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -204,6 +204,7 @@ Key abstractions: `Recipe`, `GenerationConfig`, `WorldSpec`, `WorldBundle`, `Exp
 ## Hard Constraints — Do Not Violate
 - Never use a single fixed hidden world (DGP must vary by motif family + rewiring).
 - Never leak post-snapshot-anchor data into flat task features.
+- **Never publish public relational tables that allow label reconstruction via joins.** Public relational exports must be snapshot-safe: event tables filtered to `event_timestamp <= lead_created_at + snapshot_day`; no terminal-state fields (`close_outcome`, `closed_at`, `converted_within_90_days`, `conversion_timestamp`) in public `leads`/`opportunities`; no conversion-conditional entities (`customers`, `subscriptions`) in public bundles.
 - Never require external APIs for core generation.
 - Never publish hidden truth in `student_public` mode.
 - Never derive `converted_within_90_days` as a directly sampled label; it must emerge from simulated events.
@@ -360,3 +361,8 @@ The current focus is producing a v4 lead scoring intro dataset. See `docs/v4/` f
 - Architecture/spec: `docs/leadforge_architecture_spec.md`
 - Implementation roadmap: `docs/leadforge_implementation_plan.md`
 - v4 dataset plan: `docs/v4/design.md`
+- **v1 dataset release roadmap (active): `docs/release/v1_release_roadmap.md`**
+- v1 release design: `docs/release/v1_release_design.md`
+- v1 acceptance gates: `docs/release/v1_acceptance_gates.md`
+- Post-v1 roadmap: `docs/release/post_v1_roadmap.md`
+- External review synthesis: `docs/external_review/summaries/`
diff --git a/docs/external_review/chatgpt/.gitkeep b/docs/external_review/chatgpt/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/docs/external_review/chatgpt/chatgpt_report_v1.md b/docs/external_review/chatgpt/chatgpt_report_v1.md
new file mode 100644
index 0000000..926bd02
--- /dev/null
+++ b/docs/external_review/chatgpt/chatgpt_report_v1.md
@@ -0,0 +1,148 @@
+# Roadmap and Research for Leadforge v1 Lead‑Scoring Dataset Release
+
+## 1. Review of the Current Project
+
+### 1.1 Repository and Architecture
+
+The **leadforge** repository implements an opinionated framework for generating synthetic CRM/GTM datasets. The architecture specification summarises the project’s ambitions and proposed structure:
+
+- **Seven‑layer design** – narrative, schema, structure, mechanism, simulation, rendering and validation. Each layer has clear responsibilities: narrative defines the business story, schema defines entities/relationships, structure manages hidden graph motifs, mechanisms implement stochastic event dynamics, simulation orchestrates world evolution, rendering outputs relational/snapshot tables and cards, and validation checks invariants and realism【176731919908143†L15-L89】. This layered approach promotes clarity and decouples core concerns.
+- **Deterministic generation** – every dataset is reproducible from a recipe, seed, package version and exposure mode. A single random‑number root ensures deterministic substreams【176731919908143†L15-L89】.
+- **Strongly typed internals and relational‑first generation** – the world model distinguishes accounts, contacts, leads, events, latent/observed variables, targets, and metadata【176731919908143†L15-L89】. Normalized tables form the canonical representation; flat machine‑learning datasets are derived exports【176731919908143†L15-L89】.
+- **Narrative‑anchored semantics and motif‑based variability** – features map to interpretable business concepts. Hidden graphs are sampled from motif families and then rewired to introduce realistic variability【176731919908143†L15-L89】.
+- **Exposure modes and truth separation** – the system supports public (“student”) and instructor/research modes; sensitive columns or derived truths are redacted in the student version and recorded in the manifest【41808566082802†L372-L388】.
+- **Recipe‑driven API and CLI** – users will call `Generator.from_recipe(recipe, seed, mode…)` to produce a `WorldBundle` with relational tables, snapshots, dataset cards and metadata【176731919908143†L15-L89】. A CLI (`leadforge generate`, `list‑recipes`, `inspect`, `validate`) is specified【176731919908143†L15-L89】.
+
+The repo includes skeletons for modules—`core/`, `narrative/`, `schema/`, `structure/`, `mechanisms/`, `simulation/`, `render/`, `exposure/`, `validation/`, and `recipes/`—but many functions are placeholders. The alpha datasets are produced by an earlier version of this pipeline.
+
+### 1.2 Dataset Repository and Alpha Releases
+
+The `leadforge‑datasets` repository hosts alpha releases. The README notes that the datasets are **alpha / quasi‑release** bundles, not yet published to Kaggle or Hugging Face, and primarily for inspection【41808566082802†L264-L286】. Five bundles are provided: *intro*, *intermediate*, *advanced*, *intermediate_instructor*, and *tiny_demo*. Each bundle includes:
+
+- A **manifest** capturing recipe id, package version, seed, exposure mode, difficulty, row counts for each table, snapshot/label windows and tasks. Redacted columns are listed (e.g. `current_stage` and `is_sql` for student mode)【41808566082802†L372-L388】【689797725370268†L0-L89】.
+- A **dataset card** describing the vendor narrative, product, target market, GTM motion, buyer personas, label definition (`converted_within_90_days`), table inventory and feature categories (demographics, behavioral, engine features)【176731919908143†L15-L89】. The card notes that the dataset is synthetic, uses a 30‑day snapshot window, removes columns that leak the label, and withholds latent structure【176731919908143†L15-L89】.
+- A **feature dictionary** listing each column, data type, description, category, and whether it is a target or potential leakage【784306424018116†L25-L60】.
+- **Relational tables** in Parquet format (e.g. `account.parquet`, `contact.parquet`, `lead.parquet`, `touch.parquet`) and a **flat CSV** with joined features for the lead‑scoring task.
+- **Baseline results** for logistic regression and histogram gradient boosting. AUCs around 0.87–0.89 are reported for all difficulty levels; higher difficulty mainly reduces precision at a fixed recall【866903914790547†L21-L35】.
+
+The dataset packaging is comprehensive: it includes tasks, baselines, metadata and redaction lists. However, these are not yet in Kaggle/Hugging‑Face format.
+
+### 1.3 Strengths
+
+- **Ambitious design and clear layering** – the architecture separates domain narrative, structural motifs, simulation mechanisms and rendering. This supports extensibility (e.g. adding LTV simulation) and encourages transparency.
+- **Reproducibility and determinism** – using seeds and versioning ensures that worlds can be regenerated. This is critical for academic benchmarking.
+- **Narrative depth** – dataset cards embed a plausible B2B procurement scenario, including vendor, product and buyer personas. The feature dictionary distinguishes demographic, behavioral and engineered features, and labels are defined clearly.
+- **Comprehensive packaging** – the manifest, dataset card, feature dictionary and baselines make the dataset easy to understand and evaluate. The redaction list warns students about potential leakage columns.
+
+### 1.4 Weaknesses and Areas for Improvement
+
+- **Incomplete implementation** – many modules in the leadforge repo are still placeholders. Generation of hidden graphs, conditional mechanisms, simulation engines and validation checks needs to be completed before v1.
+- **Single vertical and task** – v1 focuses on mid‑market procurement with a 90‑day conversion label. The design emphasises LTV readiness but does not yet include LTV labels or other GTM tasks (e.g. churn prediction). A flexible design should allow additional tasks but still needs demonstration.
+- **Complexity and learning curve** – the architecture is deep, with many modules. This is powerful but may be daunting for contributors. Clear documentation, typed models and examples will be required.
+- **Data realism** – although the narrative is plausible, verifying that the synthetic patterns mirror real lead‑scoring data will require domain expertise and validation. Without LTV labels, evaluation of more complex tasks is limited.
+- **Lack of built‑in evaluation** – while baseline metrics are provided, automated checks for statistical fidelity, privacy (e.g. membership inference), and scenario plausibility are not yet part of the generation pipeline.
+- **No Kaggle/HF packaging** – the alpha bundles are not packaged for Kaggle/Hugging‑Face. They need dataset metadata files, licensing, tags and dataset cards formatted according to each platform’s requirements.
+
+## 2. Research on Best‑Practices for Dataset Releases
+
+### 2.1 Kaggle Dataset Guidelines
+
+Kaggle uses a `dataset‑metadata.json` file adhering to the **Data Package** specification. Key fields include `title`, `id`, `licenses`, `resources`, `description`, `keywords`, `isPrivate`, `maintainer`, and `updateFrequency`【202514253005571†L6-L63】. Each resource (file) can define a `path`, `schema` (with field names/types), and `description`【202514253005571†L66-L130】. Kaggle accepts standard licenses like Creative Commons; recommended update frequencies include “Never”, “Hourly” and “Monthly”【202514253005571†L146-L175】. The dataset must include a cover image (1200×400 px) and optionally a README. The README should describe the data, tasks, and evaluation metrics; Kaggle emphasises reproducibility and instructive notebooks.
+
+**Implications for leadforge**: A Kaggle release should include a `dataset‑metadata.json` with the dataset title (e.g. “Synthetic Lead‑Scoring Dataset – Procurement SaaS”), an appropriate license (e.g. CC‑BY‑4.0 or CC‑BY‑NC‑SA), keywords (CRM, lead scoring, synthetic data), and resources listing each file (CSV, Parquet, manifest, card). A concise description referencing the dataset card and design document will help users understand the narrative and simulation. A cover image could depict a stylised procurement workflow or CRM dashboard. Kaggle also benefits from example notebooks demonstrating how to load the data and build baseline models; the existing baselines can be converted into Kaggle notebooks.
+
+### 2.2 Hugging Face Dataset Cards
+
+Hugging Face hosts datasets in Git‑based repositories. A dataset card (README.md) must begin with a YAML metadata block that specifies fields like `language`, `pretty_name`, `tags` (including tasks, modalities), `license`, and `task_categories`【767850007519643†L120-L154】. The dataset card then describes the dataset summary, curation, data sources, intended uses, structure, creation processes, annotation details, personal/sensitive information, and potential biases/limitations【109638070174581†L0-L93】. The Data Cards Playbook emphasises transparency: dataset cards should report who created the data, motivations, collection and processing steps, and known shortcomings【708475815999079†L17-L27】.
+
+**Implications for leadforge**: The dataset card should include YAML metadata specifying that the language is English, the license (e.g. MIT or CC‑BY‑4.0), dataset size, file sizes and modalities (`tabular`). Tags such as `synthetic`, `lead‑scoring`, `crm`, and `education` will improve discoverability. The card should detail the simulation process, enumerating entity types, features, time windows and narrative context. It should discuss biases inherent in the simulation (e.g. assumptions about buyer behaviour, latent variables) and emphasise that real performance on actual CRM datasets may differ. Since no human subjects are involved, the “Personal & Sensitive Information” section should explain that the data is synthetic but has analogues in real business processes; caution should be taken not to treat the data as factual. Finally, recommended uses (e.g. educational, model prototyping) and out‑of‑scope uses (e.g. real lead targeting) should be listed.
+
+### 2.3 Data Card Content and Responsible AI
+
+The **Data Cards Playbook** from Google describes data cards as structured summaries capturing essential facts about datasets for responsible AI【708475815999079†L17-L27】. Key themes include:
+
+- **Provenance** – who created the dataset, when and why; versioning and maintenance plans.
+- **Motivation and intended uses** – the purpose of the dataset and contexts in which it should or should not be used.
+- **Content and structure** – variable descriptions, units, relationships, sampling methods, and transformations.
+- **Quality and validation** – checks performed (e.g. missingness, statistical fidelity, fairness analysis).
+- **Privacy and sensitivity** – human attributes or personally identifiable information; in synthetic data, methods to reduce re‑identification risk.
+- **Bias, risk and limitations** – assumptions made, potential misuses, and strategies for mitigation.
+
+Adhering to these themes will strengthen the dataset card and reduce misuse.
+
+### 2.4 Synthetic Data Evaluation
+
+Synthetic datasets must balance **resemblance**, **utility**, and **privacy**【314702971766939†L87-L112】.
+
+- *Resemblance* measures how closely synthetic data matches real‑world distributions at multiple levels (marginal distributions, correlations and high‑order relationships). Techniques include histogram comparisons, correlation matrices, distribution distance metrics, PCA and t‑SNE visualizations【314702971766939†L142-L166】. Domain‑specific invariants (e.g. conversion rates by industry or lead source) should hold.
+- *Utility* assesses whether models trained on synthetic data perform well on downstream tasks compared to models trained on real data【314702971766939†L87-L112】. For lead scoring, metrics like AUC, lift curves and precision‑recall curves are relevant. Feature importance from models trained on synthetic data should align with real‑world intuitions (e.g. MQL, job seniority or behavioural scores).
+- *Privacy* ensures that the synthetic dataset does not permit re‑identification or leakage of real records. For purely simulated worlds, membership inference and attribute inference risk is minimal, but caution is needed if parameters are learned from real data. Techniques like noise injection, differential privacy and redaction of synthetic IDs can provide additional safety.
+
+Implementing automated evaluation within leadforge will increase confidence in each release. BlueGen AI highlights additional metrics like the Data Plagiarism Index and authenticity scores that can detect overfitting to training data【314702971766939†L109-L165】.
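
The resemblance checks listed above (histogram comparison and correlation) can be sketched in a few lines. This is a minimal illustration, not part of leadforge: the function names `total_variation` and `pearson` are hypothetical, the lead-source values are made-up toy data, and total variation distance is only one possible choice of distribution distance metric.

```python
import math
from collections import Counter


def total_variation(sample_a, sample_b):
    """Total variation distance between two categorical samples.

    0.0 means identical category shares; 1.0 means disjoint support.
    """
    counts_a, counts_b = Counter(sample_a), Counter(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    categories = set(counts_a) | set(counts_b)
    # Counter returns 0 for categories missing from one sample.
    return 0.5 * sum(abs(counts_a[c] / n_a - counts_b[c] / n_b) for c in categories)


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Toy lead-source mixes: a reference expectation vs. a generated sample.
reference = ["web"] * 5 + ["event"] * 3 + ["partner"] * 2
synthetic = ["web"] * 6 + ["event"] * 2 + ["partner"] * 2
tvd = total_variation(reference, synthetic)

# Toy numeric pair: page visits vs. an engagement score.
visits = [1.0, 2.0, 3.0, 4.0]
engagement = [2.0, 4.0, 6.0, 8.0]
corr = pearson(visits, engagement)

print(f"lead-source TVD: {tvd:.3f}, visits/engagement correlation: {corr:.3f}")
```

A validation report could record one such distance per categorical column and compare pairwise correlations against plausibility bounds; the thresholds would have to be set by the project.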
+ +### 2.5 Lead‑Scoring Best Practices from Industry + +Industry articles on predictive lead scoring recommend moving beyond rule‑based scoring when there is sufficient historical data (500+ closed‑won deals) and numerous attributes (>20 per lead)【42917181688173†L104-L152】. They emphasize combining **firmographic**, **technographic**, **behavioral**, **temporal** and **engineered** features【42917181688173†L104-L152】. For example, growth rate, revenue per employee and high‑intent page‑visit ratios often outperform raw variables. Signals that predictive scoring is warranted include plateauing performance of heuristics, non‑obvious patterns, and business complexity【42917181688173†L104-L152】. + +For leadforge, this implies that synthetic datasets should include meaningful engineered features (e.g. month‑over‑month change in page visits, engagement score composites, normalized contact engagement relative to account size) and latent variables representing intent. Difficulty tiers can vary the predictive signal strength by adding noise, removing engineered features or redacting certain labels. Datasets should be large enough for machine learning models to generalize; thousands of leads across hundreds of accounts are typical. + +### 2.6 Cutting‑Edge Approaches to Synthetic Data Generation + +Google’s **Simula** framework argues that synthetic data should be treated as **datasets of functions**, where designers control axes like global diversity, local diversity, complexity and quality【226463155018957†L222-L258】. Instead of simply sampling from learned distributions, Simula designs generative mechanisms with explicit structure and semantics. Complexity and quality can be calibrated through coverage tests and complexity scoring【226463155018957†L286-L334】. + +This perspective aligns with leadforge’s motif‑based graphs and narrative‑anchored semantics; however, Simula emphasizes *programmability*, enabling the generation of edge cases and rare scenarios. 
For leadforge, introducing tunable motif families and mechanism parameters (e.g. contact authority distributions or campaign effects) will allow educators to craft datasets with targeted complexity. Publishing the mechanism specification alongside the dataset (in a “truth exposure” mode for instructors) will enhance reproducibility and encourage contributions. + +## 3. Suggested Roadmap for the v1 Dataset Release + +Below is a recommended roadmap to take leadforge from its current alpha stage to a polished v1 release on Kaggle and Hugging Face. Each phase includes deliverables and references to research observations. + +### 3.1 Finalize Core Generation Engine (Milestone 1) + +1. **Complete the simulation pipeline** – implement missing modules (graph sampling, mechanisms, transitions, measurement logic and scheduler) so that the world model can evolve accounts, contacts, leads and events across discrete time. Follow the architecture spec: use typed dataclasses, motif families and deterministic RNGs【176731919908143†L15-L89】. +2. **Implement difficulty profiles** – encode intro/intermediate/advanced presets in `difficulty_profiles.yaml`. Adjust signal strength by varying noise levels, feature availability and latent variable complexity. Ensure that AUC remains realistic (~0.85–0.90) and that precision decreases with difficulty【866903914790547†L21-L35】. +3. **Add engineered features** – incorporate firmographic growth rates, revenue per employee, behavioral summarizations (e.g. click‑rate ratios), and composite engagement scores【42917181688173†L104-L152】. Create latent variables (problem awareness, budget readiness) that influence conversion probability and appear indirectly through engineered features. +4. **Expand motifs and policies** – provide several motif families (e.g. linear funnel, multi‑stakeholder, partner‑assisted) with tunable parameters. Document each motif’s semantics. Make the mechanism layer easily extensible to support future tasks (e.g. 
churn, cross‑sell). +5. **Implement validation checks** – build automated validators to check invariants (no negative counts, time ordering), realism (distribution comparisons, correlation heatmaps) and difficulty (expected lift curves). Borrow metrics from synthetic data literature: compare synthetic vs. expected distributions, compute correlation alignment, and run baseline models【314702971766939†L87-L112】. Provide options for deeper checks using external LLMs to review dataset cards and highlight potential inconsistencies. + +### 3.2 Packaging and Documentation Tools (Milestone 2) + +1. **Generation CLI** – implement `leadforge generate`, `list‑recipes`, `inspect` and `validate` commands. Provide machine‑readable (`--json`) output and human‑friendly summaries. Allow saving bundles to local directories or directly zipping them for distribution. +2. **Automatic manifest and dataset card creation** – for each generation run, automatically produce a manifest (JSON) with run parameters (recipe id, seed, mode, difficulty, horizon, row counts, redacted columns, checksums). Generate a dataset card (Markdown) using a template aligned with the Data Cards Playbook【708475815999079†L17-L27】. Fill in YAML front matter for Hugging Face (language, pretty_name, tags, license, size, dataset_info). Include sections on narrative, table schemas, feature categories, target definitions, provenance, known limitations and recommended uses. +3. **License selection** – select an appropriate open license. For an educational synthetic dataset with code for generation, MIT or Apache‑2.0 for the code and CC‑BY‑4.0 for the data are common. Document this choice in the dataset card and in Kaggle metadata【202514253005571†L146-L175】. +4. **Generate a cover image** – create a 1200×400 px banner illustrating synthetic CRM or procurement processes. Use a graphics tool or automatically produce diagrams of the funnel; ensure it conveys that the data is synthetic. Provide alt‑text for accessibility. +5. 
**Notebook tutorials** – convert baseline evaluation scripts into Jupyter notebooks that show how to load the relational and flat tables, explore the narrative, perform exploratory analysis and train baseline models. Include data‑viz (e.g. feature distributions, correlation heatmaps) and evaluation metrics (AUC, lift, precision‑recall curves). Provide both Kaggle notebook (.ipynb) and HF Space markdown variants. +6. **Public vs. instructor modes** – ensure the generation script can output both public (redacted) and instructor (full latent truth) bundles. Document the difference; redacted columns should be recorded in the manifest for transparency【41808566082802†L372-L388】. + +### 3.3 Prepare Kaggle & Hugging Face Releases (Milestone 3) + +1. **Kaggle packaging**: + - Create a `dataset‑metadata.json` file listing the dataset’s title, id (slug), description, license, tags/keywords, cover image, and resources (CSV, Parquet, manifest, dataset card, feature dictionary). Provide a schema for the flat CSV (field names and types)【202514253005571†L6-L63】. Set the update frequency to “Never” or “On demand” as synthetic data is generated deterministically【202514253005571†L146-L175】. + - Ensure that the zipped dataset folder contains all necessary files and is under Kaggle’s size limits (usually <2 GB for public datasets). Use compression (e.g. Parquet + CSV zipped) to reduce size. + - Write a Kaggle README (Markdown) referencing the dataset card. Link to the leadforge repository and design document. Add example code for loading data and training models. + - Use the Kaggle API or CLI (`kaggle datasets create`) with a personal access token to upload the dataset. Provide instructions for subsequent version updates. + +2. **Hugging Face packaging**: + - Create a repository on the Hugging Face Hub (via CLI `huggingface-cli repo create`). Include all dataset files along with a README using the dataset card template. 
The YAML metadata at the top should include language (`en`), pretty_name, license, tags (synthetic, lead scoring, CRM, education), dataset_info (size, splits), and task_categories (`tabular-classification`). + - Add a `dataset_dict.json` or `dataset_infos.json` file if using the `datasets` library, describing splits (train/test) and features. Provide a `load_dataset` script if needed. Use `push_to_hub` to upload the repository. + - Provide usage examples: `from datasets import load_dataset`; show how to access the flat table and relational tables. + +3. **Community engagement**: + - Announce the dataset release on social channels (LinkedIn, X) and relevant forums (e.g. Kaggle discussions). Encourage participants to break the dataset, identify flaws, and propose improvements. Provide a feedback form and a GitHub discussion board for issues. + - Engage with early adopters; incorporate feedback into v1.1 or v2. Document known issues and planned improvements. + +### 3.4 Quality Assurance and Validation (Ongoing) + +1. **Statistical validation** – incorporate evaluation of resemblance: compare distributions of synthetic features to real distributions (if available) or to plausibility constraints. Use correlation matrices and PCA to detect unrealistic independence or correlation patterns【314702971766939†L87-L112】. Provide heatmaps and summary statistics in the validation report. +2. **Utility validation** – compute baseline model performance (AUC, lift curves, precision at top‑k) across difficulty tiers. Compare feature importances from models trained on synthetic vs. real data (if accessible). For each release, ensure that performance is within the targeted range and that engineered features provide lift. +3. **Privacy and ethical checks** – confirm that no personally identifiable information or real company names appear in the dataset. Document the purely synthetic nature. 
If real data informs parameter tuning, ensure de‑identification and differential privacy techniques. Evaluate membership inference risk; synthetic worlds built from fully simulated processes have negligible risk but this should be stated. +4. **LLM‑assisted critique** – develop a script that uses large language models to review the dataset card and manifest. The model can highlight ambiguous statements, missing sections or potential biases. This can be integrated into the validation pipeline so that each release triggers an automated critique. + +### 3.5 Post‑Release Activities + +1. **User feedback loop** – gather user reports on issues (e.g. unrealistic distributions, leakage, missing features) via GitHub issues or Kaggle comments. Triage issues and incorporate feasible improvements into a v2 roadmap. +2. **Broaden verticals and tasks** – plan to add an LTV prediction dataset and additional verticals (e.g. cyber‑security SaaS, martech). Use the v1 architecture to extend the world model. Introduce new motifs and tasks such as churn, expansion or upsell. +3. **Leaderboard and benchmarks** – design a benchmark site (future milestone) where participants can submit models trained on the synthetic data and evaluated on withheld hidden test sets. Use this to study how models trained on synthetic data generalize to real patterns. + +## 4. Conclusion + +The **leadforge** project has a robust design for generating narrative‑grounded synthetic lead‑scoring datasets. The alpha datasets demonstrate thorough packaging and plausible features but remain pre‑release. By completing the generation engine, introducing engineered features and validation checks, and packaging the data according to Kaggle and Hugging Face standards, leadforge can deliver the world’s most comprehensive synthetic CRM dataset for education and research. 
Transparency through detailed dataset cards, adherence to data card best practices and engagement with the community will ensure responsible use and continual improvement. diff --git a/docs/external_review/chatgpt/chatgpt_report_v2.md b/docs/external_review/chatgpt/chatgpt_report_v2.md new file mode 100644 index 0000000..0d4805f --- /dev/null +++ b/docs/external_review/chatgpt/chatgpt_report_v2.md @@ -0,0 +1,780 @@ +# Leadforge v1 Lead-Scoring Dataset Release Plan + +## 0. Executive summary + +**Verdict:** Leadforge is much further along than a greenfield project. The current repo contains an end-to-end deterministic generator, motif-sampled hidden graphs, population generation, a 90-day simulation engine, relational bundle writing, public/instructor exposure modes, CLI commands, validation modules, a public-release builder, a Hugging Face-style card, a baseline release notebook, and a mature `lead_scoring_intro` v7 teaching lineage. The v1 work should therefore be framed as **release hardening and adversarial validation**, not core implementation. + +**The biggest release blocker I found is not absence of generation; it is public relational leakage.** In a local 500-lead `student_public` smoke bundle, `tables/leads.parquet` still contained `converted_within_90_days` and `conversion_timestamp`, and `tables/opportunities.parquet.close_outcome == "closed_won"` plus `customers` existence reconstructed the target with **100% accuracy**. This is acceptable only if those relational tables are documented as post-outcome world records, not if they are marketed as feature-engineering inputs for a lead-scoring task. For a best-in-class public Kaggle/HF dataset, the public relational path must be made **snapshot-safe** or moved to an instructor/research companion. 
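
**Illustrative probe (sketch):** The reconstruction described in the paragraph above can be pictured with a minimal, standard-library-only example. This is not the check that was run on the smoke bundle. The `lead_id` join key, the toy rows, and the rule "a closed-won opportunity or a customer record implies conversion" are assumptions made for illustration; only `close_outcome`, `closed_won`, and `converted_within_90_days` come from the audit text.

```python
from typing import Dict, List

Row = Dict[str, object]


def reconstruct_label(leads: List[Row], opportunities: List[Row],
                      customers: List[Row]) -> Dict[object, bool]:
    """Guess the target from side tables only, never reading the label column."""
    # Assumption: opportunities and customers both carry a lead_id join key.
    won = {o["lead_id"] for o in opportunities
           if o.get("close_outcome") == "closed_won"}
    known_customers = {c["lead_id"] for c in customers}
    return {lead["lead_id"]: (lead["lead_id"] in won
                              or lead["lead_id"] in known_customers)
            for lead in leads}


def reconstruction_accuracy(leads: List[Row], predicted: Dict[object, bool],
                            target: str = "converted_within_90_days") -> float:
    """Share of leads whose reconstructed label matches the true label."""
    hits = sum(predicted[lead["lead_id"]] == bool(lead[target]) for lead in leads)
    return hits / len(leads)


# Toy rows, invented for this sketch.
leads = [
    {"lead_id": "L1", "converted_within_90_days": True},
    {"lead_id": "L2", "converted_within_90_days": False},
    {"lead_id": "L3", "converted_within_90_days": True},
    {"lead_id": "L4", "converted_within_90_days": False},
]
opportunities = [
    {"lead_id": "L1", "close_outcome": "closed_won"},
    {"lead_id": "L2", "close_outcome": "closed_lost"},
]
customers = [{"lead_id": "L1"}, {"lead_id": "L3"}]

# Full-horizon tables: the join path recovers every label.
leaky_accuracy = reconstruction_accuracy(
    leads, reconstruct_label(leads, opportunities, customers))

# Snapshot-safe variant: close_outcome stripped, customers table withheld.
safe_opportunities = [{k: v for k, v in o.items() if k != "close_outcome"}
                      for o in opportunities]
safe_accuracy = reconstruction_accuracy(
    leads, reconstruct_label(leads, safe_opportunities, []))

print(f"leaky={leaky_accuracy:.2f} safe={safe_accuracy:.2f}")
```

On these toy rows the full-horizon tables recover every label (accuracy 1.0), while the stripped tables can only predict "no conversion" for every lead (accuracy 0.5). A release gate could run the same kind of probe on real bundles and fail when the side-table reconstruction beats a documented threshold.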
+ +**Recommended v1 release shape:** + +```text +Public Kaggle/HF release: + intro / intermediate / advanced flat lead-scoring task splits + snapshot-safe relational tables only + feature dictionary with leakage flags + validation report, charts, notebooks, data card, break-me guide + +Separate instructor/research companion: + full world graph + latent registry + mechanisms + full-horizon relational tables + leakage-trap materials + reproducibility manifest +``` + +**Definition of v1 ready:** A fresh release candidate can be generated from code; passes structural, snapshot, redaction, relational-leakage, split-leakage, calibration, lift, top-K, value-ranking, and platform packaging checks; renders valid Kaggle and Hugging Face packages; has notebooks that run top-to-bottom; and has no unresolved high-severity LLM or human review findings. + +--- + +## 1. Evidence and method + +I treated the second-attempt guidance and critique as constraints, not optional context. Those files specifically require an evidence-first, code-aware, release-oriented audit and warn against calling implemented components skeletal, ignoring existing CLI/release/validation/HF assets, or missing the `lead_scoring_intro` v7 track. I extracted and inspected the attached Repomix package. + +**Repository inventory from extracted Repomix package:** + +| Item | Count | +| ------------------------------- | ----: | +| Total files | 194 | +| Python files | 149 | +| Python files under `leadforge/` | 78 | +| Test files | 56 | +| Scripts | 15 | +| Markdown/RST/TXT docs | 22 | +| Notebooks | 2 | +| CSV files | 4 | +| YAML/YML files | 8 | +| Release files | 3 | +| `lead_scoring_intro/` files | 9 | + +Line counts from the extracted package: `leadforge/` ≈10.7k lines, `tests/` ≈9.4k lines, `scripts/` ≈3.9k lines, `lead_scoring_intro/` ≈4.8k lines, `docs/` ≈5.6k lines. 
+ +**Dynamic checks run:** + +| Command / check | Result | +| -------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | +| `PYTHONPATH=. python -m pytest --collect-only -q` | 937 tests collected, exit 0 | +| Full `pytest -q` | Timed out at 300s around 53%; no failures observed before timeout | +| CLI help through Typer app | Worked; commands present | +| `leadforge list-recipes` | Worked; found `b2b_saas_procurement_v1` | +| `leadforge generate ... --mode student_public --difficulty intermediate --n-leads 500` | Exit 0; bundle generated | +| `leadforge inspect /tmp/leadforge_smoke` | Exit 0; reported 9 tables, task splits, no metadata dir | +| `leadforge validate /tmp/leadforge_smoke` | Exit 0 | +| Same smoke generation in `research_instructor` mode | Exit 0; metadata dir present | +| `scripts/build_public_release.py ... --generation-timestamp ...` | Timed out before final summary, but generated intro/intermediate/advanced/intermediate_instructor bundles; separate validation calls passed | +| Same-seed determinism smoke check | Two saved bundles with pinned timestamp matched by tree/hash comparison | + +**What I did not verify:** I did not complete the full test suite within the timeout; I did not upload to Kaggle or Hugging Face; I did not run `load_dataset()` against a real HF repo; I did not run a full multi-model leakage-probe suite beyond the smoke-bundle relational leakage check; and I did not download every alpha Parquet file from the public dataset repo, though I did inspect the public GitHub pages and local regenerated release artifacts. + +--- + +## 2. 
Current-state audit of Leadforge + +### 2.1 Architecture and design docs + +**Exists:** The repo has strong design intent: world-first, relational-first, narrative-grounded synthetic CRM generation, one B2B SaaS procurement vertical, exposure modes, difficulty profiles, and LTV-ready but not LTV-shipping foundations. The README frames Leadforge as a simulated commercial-world generator, not a row sampler, and documents CLI/API usage, exposure modes, difficulty profiles, output bundle shape, and deterministic/relational/simulation-driven principles. `README.md:L1-L6`, `README.md:L34-L56`, `README.md:L74-L127`. + +**Strength:** The design docs and implementation are aligned enough that the project is not just speculative documentation. + +**Gap:** `pyproject.toml` already declares `version = "1.0.0"` and `Development Status :: 5 - Production/Stable`, while the public dataset release state is still alpha and v1 release blockers remain. `pyproject.toml:L5-L24`. This can confuse users because package version, framework maturity, and curated dataset v1 are not the same product milestone. + +**Release implication:** Rename the upcoming public dataset release explicitly, for example `leadforge-lead-scoring-v1`, and document that it is the first curated public dataset release even if the Python package version is already `1.0.0`. + +--- + +### 2.2 Public API + +**Exists:** `Generator.from_recipe()` loads a registered recipe, resolves config, applies overrides, loads narrative, and constructs a `WorldSpec`. `Generator.generate()` samples the hidden graph, loads difficulty profile parameters, builds population, simulates the world, and returns a populated `WorldBundle`. `leadforge/api/generator.py:L43-L122`, `leadforge/api/generator.py:L124-L248`. + +**Strength:** This is a working vertical-slice generator, not a placeholder. 
+ +**Gap:** The API does not yet expose release-oriented workflows: build a release candidate, validate release quality, package for Kaggle/HF, or publish/dry-run. + +**Release implication:** Keep `Generator` as the framework API. Add release APIs separately, for example `leadforge.release.build_release_candidate()` and CLI wrappers. + +--- + +### 2.3 CLI + +**Exists:** The Typer CLI registers `list-recipes`, `generate`, `inspect`, and `validate`. `leadforge/cli/main.py:L1-L42`. `generate` supports recipe, seed, exposure mode, output, difficulty, population counts, horizon, and override YAML. `leadforge/cli/commands/generate.py:L12-L86`. `inspect` reads `manifest.json` and prints recipe, seed, mode, difficulty, horizon, package version, schema version, motif, table counts, task rows, and metadata presence. `leadforge/cli/commands/inspect.py:L14-L75`. `validate` calls `validate_bundle()` and exits nonzero on errors. `leadforge/cli/commands/validate.py:L12-L42`. + +**Strength:** The core CLI is present and usable. + +**Gaps:** `inspect` and `validate` do not yet expose `--json`; no `release` subcommands exist; no dry-run publishing; no credential checks; no platform package validation. + +**Release implication:** Do not “implement CLI” from scratch. Add: + +```text +leadforge release build +leadforge release validate +leadforge release package-kaggle +leadforge release package-hf +leadforge release publish-kaggle --dry-run +leadforge release publish-hf --dry-run +``` + +--- + +### 2.4 Recipe system and difficulty profiles + +**Exists:** Recipe objects include id, title, vertical, primary task, supported modes, difficulty profiles, population defaults, horizon, label window, and snapshot day. Config precedence is implemented across defaults, override files, and explicit args. `leadforge/api/recipes.py:L30-L45`, `leadforge/api/recipes.py:L140-L164`, `leadforge/api/recipes.py:L187-L240`. 
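
**Illustrative sketch:** One way to picture the precedence chain is a layered merge. The ordering used here (explicit arguments over override file over recipe defaults), the rule that `None` means "not supplied", and all key names and values are assumptions for illustration. This is not the code in `leadforge/api/recipes.py`.

```python
from typing import Any, Dict, Optional


def resolve_config(defaults: Dict[str, Any],
                   overrides: Optional[Dict[str, Any]] = None,
                   explicit: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge config layers; later layers win, None values are ignored."""
    resolved = dict(defaults)  # copy so the recipe defaults stay untouched
    for layer in (overrides or {}, explicit or {}):
        resolved.update({k: v for k, v in layer.items() if v is not None})
    return resolved


# Hypothetical layers: recipe defaults, an override YAML, and CLI arguments.
recipe_defaults = {"n_leads": 5000, "horizon_days": 90, "snapshot_day": 30}
override_file = {"n_leads": 1000, "snapshot_day": 20}
cli_args = {"n_leads": 500, "horizon_days": None}  # horizon not passed on the CLI

config = resolve_config(recipe_defaults, override_file, cli_args)
print(config)
```

Under these assumptions the CLI value wins for `n_leads`, the override file wins for `snapshot_day`, and `horizon_days` falls through to the recipe default because the explicit value was never supplied.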
+ +**Strength:** Difficulty is a first-class named profile, not an accidental result. + +**Gap:** Current release validation does not fully prove that difficulty profiles reward stronger modeling rather than simply changing base rate. The public alpha baselines show AUC is roughly flat across tiers while AP and P@K collapse, which is pedagogically useful, but LogReg and HistGBM are too close for “better modeling lifts realistically” in the alpha. The alpha `BASELINES.md` reports LogReg AUC ≈0.87–0.89 and HistGBM ≈0.866–0.868 across tiers. ([GitHub][1]) + +**Release implication:** For v1, require difficulty gates that include not only conversion rate and AUC, but also AP, P@K, lift@K, calibration, Brier score, log loss, and model-family deltas. + +--- + +### 2.5 Hidden graph and motif sampler + +**Exists:** `sample_hidden_graph()` selects a motif, applies stochastic rewiring, and returns a validated graph. `leadforge/structure/sampler.py:L26-L83`. The motif library includes fit-dominant, intent-dominant, sales-execution-sensitive, demo/trial-mediated, and buying-committee-friction structures. `leadforge/structure/motifs.py:L46-L230`. Graph validation checks acyclicity, type legality, nondegeneracy, and outcome reachability. `leadforge/structure/graph.py:L252-L315`. Rewiring can drop optional nodes, jitter edge weights, and inject latent confounders. `leadforge/structure/rewiring.py:L42-L125`. + +**Strength:** This directly implements the “distribution over plausible worlds” idea. + +**Gaps:** Release reports do not yet summarize graph diversity across seeds/tier releases, motif frequencies, structural edit distances, or which mechanisms drive the released v1 seed. External reviewers need those summaries without opening `world_spec.json`. + +**Release implication:** Add a `validation/graph_diversity.md` and public-safe `mechanism_summary_public.json`. 
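
**Illustrative sketch:** A graph-diversity summary of the kind suggested above could start from motif counts plus a pairwise distance between edge sets. The sketch below uses Jaccard distance over directed edges as a simple stand-in for a structural edit distance; it is not a true graph edit distance. The world records, motif labels, and node names are hypothetical and do not reflect the `world_spec.json` schema.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]


def edge_jaccard_distance(a: Iterable[Edge], b: Iterable[Edge]) -> float:
    """1 - |A intersect B| / |A union B| over directed edge sets (0.0 = identical)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def summarize_diversity(worlds: List[Dict[str, object]]) -> Dict[str, object]:
    """Count motifs and average the pairwise edge-set distance across worlds."""
    motif_counts = Counter(w["motif"] for w in worlds)
    distances = [edge_jaccard_distance(x["edges"], y["edges"])
                 for x, y in combinations(worlds, 2)]
    mean_distance = sum(distances) / len(distances) if distances else 0.0
    return {"motif_counts": dict(motif_counts),
            "mean_edge_distance": mean_distance}


# Hypothetical sampled worlds, one per seed.
worlds = [
    {"seed": 1, "motif": "fit_dominant",
     "edges": [("fit", "conversion"), ("intent", "conversion")]},
    {"seed": 2, "motif": "fit_dominant",
     "edges": [("fit", "conversion")]},
    {"seed": 3, "motif": "intent_dominant",
     "edges": [("intent", "demo"), ("demo", "conversion")]},
]

summary = summarize_diversity(worlds)
print(summary)
```

A report built this way would let reviewers see at a glance whether different seeds produce genuinely different structures or near-copies of one motif, without opening instructor-only metadata.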
+ +--- + +### 2.6 Mechanisms and simulation engine + +**Exists:** The simulation engine is a discrete 90-day world simulator. It creates RNG substreams, assigns mechanisms, evolves leads daily, applies churn/stage transitions/direct conversion, emits touches/sessions/sales activities, updates labels from conversion within the label window, and creates opportunities, customers, and subscriptions. `leadforge/simulation/engine.py:L166-L210`, `leadforge/simulation/engine.py:L260-L398`, `leadforge/simulation/engine.py:L416-L476`. + +Population generation creates accounts, contacts, leads, latent scores, and category-latent correlations. `leadforge/simulation/population.py:L141-L211`, `leadforge/simulation/population.py:L219-L380`, `leadforge/simulation/population.py:L410-L495`. + +**Strength:** Leadforge has real simulation machinery, including post-conversion entities and multiple event tables. + +**Gaps:** Some realism choices need external calibration and release documentation: sales-cycle timing, partner/inbound/outbound mix, opportunity lifecycle, direct conversion rate, rep policy, and customer/subscription treatment in public bundles. + +**Release implication:** The v1 data card must include a “simulation simplifications” section. It should say which CRM phenomena are modeled, which are approximate, and which are not modeled. + +--- + +### 2.7 Relational rendering and snapshot task generation + +**Exists:** The bundle writer writes relational Parquet tables, builds snapshot task splits, writes dataset cards and feature dictionaries, applies exposure metadata, and writes a manifest. Redaction is applied to both relational tables and task splits. `leadforge/api/bundle.py:L62-L140`. Relational rendering writes accounts, contacts, leads, touches, sessions, sales activities, opportunities, customers, and subscriptions. `leadforge/render/relational.py:L42-L83`. 
Snapshot building supports `snapshot_day`, filters touches/sessions/sales/opportunities by snapshot cutoff, computes early/recent touches, expected ACV, and applies noise/missingness/outliers. `leadforge/render/snapshots.py:L57-L131`, `leadforge/render/snapshots.py:L153-L243`, `leadforge/render/snapshots.py:L307-L404`. Task splitting is deterministic and writes train/valid/test Parquet plus a task manifest. `leadforge/render/tasks.py:L20-L77`. + +**Strength:** The flat task path is much safer than the full relational path. + +**Critical gap:** Public relational tables currently include post-outcome information. In my smoke bundle, `tables/leads.parquet` included `converted_within_90_days` and `conversion_timestamp`; `tables/opportunities.parquet` included `close_outcome` and `closed_at`; `customers` and `subscriptions` existed only for converted leads. Those fields reconstruct the label directly. This is not just a documentation issue if the public release invites relational feature engineering. + +**Release implication:** For v1, create a **snapshot-safe relational export**: + +```text +data/relational_snapshot_safe// + leads.parquet # no target, no conversion timestamp + touches.parquet # only events <= snapshot_time + sessions.parquet # only events <= snapshot_time + sales_activities.parquet # only events <= snapshot_time + opportunities.parquet # only opps created <= snapshot_time, no close_outcome/closed_at +``` + +Move full-horizon `customers`, `subscriptions`, full opportunities, and label-bearing lead records to instructor/research companion only. + +--- + +### 2.8 Exposure and redaction modes + +**Exists:** Exposure filters distinguish `student_public` and `research_instructor`; instructor mode writes metadata, public mode does not. `leadforge/exposure/filters.py:L23-L58`. Metadata writing includes graph, GraphML, latent registry, world spec, and mechanism summary. `leadforge/exposure/metadata.py:L1-L70`. 
Feature specs differentiate leakage-risk documentation from redaction policy, and `current_stage` / `is_sql` are redacted in public mode. `leadforge/schema/features.py:L16-L57`, `leadforge/schema/features.py:L153-L165`, `leadforge/schema/features.py:L287-L304`. + +**Strength:** The exposure-mode architecture is real and valuable. + +**Gap:** Redaction currently targets known columns, but does not yet enforce a full “no public join path reconstructs label” guarantee. The alpha exposure delta says `current_stage` and `is_sql` are removed and redaction is applied uniformly, but it does not address target labels, opportunity close outcome, customer existence, or subscription existence in public relational tables. ([GitHub][2]) + +**Release implication:** Add `leadforge/validation/relational_leakage.py` and fail v1 if any public relational join path can reconstruct the target above a strict threshold. + +--- + +### 2.9 Validation suite + +**Exists:** `validate_bundle()` checks required files, tables, task split files, hashes, foreign keys, unexpected leakage columns, exposure redaction, realism, and difficulty. `leadforge/validation/bundle_checks.py:L26-L55`, `leadforge/validation/bundle_checks.py:L71-L260`. Realism checks cover conversion-rate guardrails, nonempty tables, ranges, booleans, and stage diversity. `leadforge/validation/realism.py:L23-L162`. Difficulty validation defines profile target ranges and ordering. `leadforge/validation/difficulty.py:L12-L102`. Drift validation checks cross-seed stability. `leadforge/validation/drift.py:L44-L104`. + +The lead-scoring validation module already includes ROC-AUC, PR-AUC, precision/recall/lift@K, value-aware ranking, leakage-trap deltas, group determinism, and v7 validation flow. `leadforge/validation/lead_scoring.py:L120-L161`, `leadforge/validation/lead_scoring.py:L423-L537`, `leadforge/validation/lead_scoring.py:L733-L808`. + +**Strength:** Validation is present and nontrivial. 
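
**Illustrative sketch:** The value-aware ranking idea mentioned above can be shown on a toy example: rank leads by conversion probability alone, then by probability times expected deal value, and compare the value captured in the top K. This is a standalone illustration and not the implementation in `leadforge/validation/lead_scoring.py`. The row fields `p_convert`, `converted`, and `acv` are hypothetical names, and all values are invented.

```python
from typing import Callable, Dict, List

Lead = Dict[str, float]


def value_captured_at_k(leads: List[Lead], score: Callable[[Lead], float],
                        k: int) -> float:
    """Sum the realized value of the k highest-scoring leads."""
    ranked = sorted(leads, key=score, reverse=True)[:k]
    # Realized value is the contract value if the lead converted, else zero.
    return sum(lead["acv"] * lead["converted"] for lead in ranked)


# Toy leads: predicted probability, contract value, and true outcome (1 or 0).
toy_leads = [
    {"p_convert": 0.9, "acv": 1_000.0, "converted": 1},
    {"p_convert": 0.8, "acv": 2_000.0, "converted": 0},
    {"p_convert": 0.5, "acv": 20_000.0, "converted": 1},
    {"p_convert": 0.3, "acv": 5_000.0, "converted": 0},
]

by_probability = value_captured_at_k(
    toy_leads, lambda lead: lead["p_convert"], k=2)
by_expected_value = value_captured_at_k(
    toy_leads, lambda lead: lead["p_convert"] * lead["acv"], k=2)

print(by_probability, by_expected_value)
```

With a sales capacity of two leads, ranking by probability captures 1,000 in realized value, while ranking by expected value captures 20,000 because it promotes the large, moderately likely deal. Real reports would use held-out predictions and many more rows.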
+ +**Gaps:** Release-grade validation is not yet a single reproducible artifact with charts, calibration, Brier/log loss, relational leakage probes, split leakage probes, public/instructor diff assertions, cross-seed bands, and LLM critique. + +**Release implication:** The v1 release should not rely only on `leadforge validate`; it needs `scripts/validate_release_candidate.py` producing `validation_report.json`, `validation_report.md`, and figures. + +--- + +### 2.10 Release tooling, HF material, and notebooks + +**Exists:** `scripts/build_public_release.py` builds intro/intermediate/advanced public bundles plus an intermediate instructor bundle, writes flat CSVs for public bundles, pins generation timestamps, copies the license, and validates bundles. `scripts/build_public_release.py:L1-L21`, `scripts/build_public_release.py:L37-L87`, `scripts/build_public_release.py:L112-L170`. + +`release/HF_DATASET_CARD.md` already has YAML front matter with license, task categories, tags, size category, and configs for intro/intermediate/advanced splits. `release/HF_DATASET_CARD.md:L1-L44`. `release/README.md` already describes the release layout, quick start, dataset summary, leakage handling, research companion, and provenance. `release/README.md:L1-L167`. The repo contains a baseline release notebook, and the examples notebook inspects generated worlds. + +**Strength:** Hugging Face packaging is partial, not absent. + +**Gaps:** Kaggle metadata is missing; HF card is not yet a final repo `README.md` with `pretty_name`, `tabular`, `datasets`, `pandas`, `default: true`, and tested configs; there is no cover image; no publisher scripts; no post-upload smoke tests; and only one release notebook is present. + +**Release implication:** Add platform package generation scripts, not hand-authored upload folders. + +--- + +### 2.11 Test suite and CI + +**Exists:** CI runs Ruff, mypy, tests on Python 3.11/3.12, and v5/v6/v7 dataset validation jobs. 
`.github/workflows/ci.yml:L13-L52`, `.github/workflows/ci.yml:L61-L140`. The extracted test suite collected 937 tests. + +**Strength:** Test coverage is broad for a small project. + +**Gap:** CI does not yet gate full release-candidate packaging, Kaggle/HF metadata validation, relational leakage, notebook execution, or release report generation. + +**Release implication:** Add a release-candidate CI workflow that runs on demand and uploads validation artifacts. + +--- + +## 3. Existing dataset and alpha release forensics + +### 3.1 Public alpha release inventory + +The public `leadforge-datasets` repo currently has a `v0.1.0-alpha` release folder with five bundles: intro, intermediate, advanced, intermediate instructor, and tiny demo. The README reports all bundles are generated from `b2b_saas_procurement_v1`, seed 42, leadforge 1.0.0, bundle schema v4, and 5,000 leads for the three main public tiers. It also lists companion artifacts: `BASELINES.md`, `EXPOSURE_DELTA.md`, `provenance.json`, `build.sh`, `validation.log`, and `baselines.py`. ([GitHub][3]) + +The alpha validation log reports all five bundles passed `leadforge validate`. ([GitHub][4]) + +### 3.2 Alpha difficulty tiers + +The alpha baselines show a useful but incomplete difficulty story: + +| Tier | Train conversion | LogReg AUC | LogReg AP | LogReg P@100 | +| ------------ | ---------------: | ---------: | --------: | -----------: | +| intro | 41.5% | 0.886 | 0.785 | 79% | +| intermediate | 20.1% | 0.880 | 0.559 | 65% | +| advanced | 7.9% | 0.870 | 0.271 | 26% | + +The alpha interpretation is reasonable: rank-order AUC remains high, while AP/P@K degrade as positives become sparser. ([GitHub][1]) The v1 release should go further: calibration, lift curves, and value capture need to be generated as figures, and a stronger model should show some realistic improvement over a simple model. 
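
**Illustrative sketch:** The P@K and lift figures discussed above can be computed with a few lines of standard-library Python. The helper names and the toy labels and scores are assumptions for illustration; they are not taken from `baselines.py`.

```python
from typing import List, Sequence


def precision_at_k(y_true: Sequence[int], scores: Sequence[float], k: int) -> float:
    """Fraction of true positives among the k highest-scoring rows."""
    if k <= 0 or k > len(y_true):
        raise ValueError("k must be between 1 and the number of rows")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def lift_at_k(y_true: Sequence[int], scores: Sequence[float], k: int) -> float:
    """Precision@k divided by the overall base rate."""
    base_rate = sum(y_true) / len(y_true)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positives")
    return precision_at_k(y_true, scores, k) / base_rate


# Toy data: 2 positives in 10 rows, so the base rate is 20%.
y_true: List[int] = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores: List[float] = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

p_at_2 = precision_at_k(y_true, scores, k=2)
lift_2 = lift_at_k(y_true, scores, k=2)
print(f"P@2={p_at_2:.2f} lift@2={lift_2:.2f}")
```

Because lift divides by the base rate, it stays comparable across tiers whose conversion rates differ, whereas raw P@K falls as positives become sparser even when the ranking quality is unchanged.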
+ +### 3.3 Public vs instructor mode + +The alpha exposure delta documents that public and instructor bundles share the same recipe/seed/difficulty, with public redacting `current_stage` and `is_sql` and omitting hidden-truth metadata. ([GitHub][2]) This is a good pattern. The missing piece is a deeper assertion: no public relational table or join path should reveal the label, terminal opportunity status, post-snapshot activity, or customer existence. + +### 3.4 `lead_scoring_intro` v7 lessons + +The `lead_scoring_intro` v7 track is one of the strongest assets in the repo. It defines a 1,000-row student CSV at snapshot day 20, with target conversion within 90 days, student and instructor variants, and a purely causal temporal leakage trap in the instructor file. `lead_scoring_intro/RELEASE_v7.md:L19-L38`. + +The v7 release records baseline AUC ≈0.671 and PR-AUC ≈0.426, GBM improving LR by ≈0.072 AUC, value-aware ranking uplift of 13.4% at K=25 and 20.3% at K=50, a subtle leakage-trap delta of ≈0.013, and a cohort split AUC drop of ≈0.089. `lead_scoring_intro/RELEASE_v7.md:L121-L196`, `lead_scoring_intro/validation_v7_report.json`. + +**Lessons to carry into v1:** + +1. Keep the student path simple and safe. +2. Keep leakage traps clearly separated from student-facing features. +3. Teach value-aware ranking, not only probability ranking. +4. Include cohort/time-shift evaluation. +5. Make tree/GBM lift over LR visible but not absurd. +6. Document limitations bluntly. +7. Provide a lecture/notebook sequence. `lead_scoring_intro/RELEASE_v7.md:L206-L278`. + +### 3.5 What currently makes the dataset easy or hard to break + +**Harder than common public datasets:** + +* Relational world with accounts, contacts, leads, touches, sessions, sales activities, opportunities, customers, and subscriptions. +* Hidden motifs and stochastic rewiring. +* Public/instructor modes. +* Windowed snapshot logic. +* Feature dictionary with leakage flags. +* Difficulty tiers. 
+* Value-aware ACV signal. +* Cohort split lesson in v7. + +**Easy to break today:** + +* Public relational tables leak the label and post-outcome state unless redesigned. +* Alpha LogReg AUC is high and close to HistGBM, so the alpha may not reward model sophistication enough. +* Release-level validation does not yet probe ID leakage, account/contact split leakage, relational join leakage, calibration, Brier/log loss, or post-snapshot event leakage. +* Public release docs and feature dictionary must make the intentional leakage trap impossible to miss for Kaggle/HF users. + +--- + +## 4. External research + +### 4.1 Public lead-scoring dataset census + +| Dataset | Platform | Domain | Rows | Shape | Documentation quality | Main weakness | +| ------------------------------------ | ------------ | ------------------------: | -----: | ----------------- | --------------------- | ------------------------------------------------- | +| X Education Lead Scoring | Kaggle | Online education | 9,240 | Flat, 37 cols | Many notebooks | Overused, flat, leakage-suspect status/tag fields | +| `shawhin/lead-scoring-x` | Hugging Face | Processed X Education | 5,688 | Flat, 7 features | Minimal card | Very reduced feature set | +| Online Shoppers Purchasing Intention | UCI | E-commerce session intent | 12,330 | Flat, 17 features | Solid UCI metadata | Not CRM/B2B lead scoring | +| GitHub/PyCaret demos | GitHub/blogs | Usually X Education | Varies | Flat | Tutorial-centric | Repeats same source dataset | + +The canonical public X Education dataset is a flat online-education CRM dataset. A public EDA article reports 9,240 rows and 37 columns, and a 38.54% conversion rate. ([Analytics Vidhya][5]) The Hugging Face processed variant uses only seven key features and reports 5,688 rows. 
([Hugging Face][6]) The UCI Online Shoppers dataset is a useful adjacent benchmark, but it is session-level e-commerce intent, not B2B CRM lead scoring; UCI reports 12,330 sessions, 17 features, and 84.5% negative class. ([UCI Machine Learning Repository][7]) + +**Implication:** Leadforge can plausibly be best-in-class if it ships relational/snapshot-safe data, data cards, validation, notebooks, and break-me artifacts. The public landscape is shallow. + +### 4.2 Lead-scoring and B2B GTM realism + +Current product documentation and case-study literature support Leadforge’s fit + engagement + stage/process design. + +HubSpot’s scoring tool distinguishes fit scores based on properties, engagement scores based on events, and combined scores using both property values and events. ([HubSpot Knowledge Base][8]) Salesforce describes lead scoring as ranking leads based on behavior, demographics, and engagement to help sellers prioritize effort. ([Salesforce][9]) Adobe Real-Time CDP B2B describes predictive lead/account scoring as learning from opportunity-stage conversion events, aggregating person activities to account level, and using tree-based random forest/gradient boosting methods. ([Experience League][10]) + +The 2025 Frontiers B2B lead-scoring case study is especially relevant. It used real CRM data from January 2020 to April 2024, evaluated 15 classifiers, and found Gradient Boosting superior; it also identified source and lead status as important predictive features. ([Frontiers][11]) The same paper reports 23,154 CRM records and 67 fields, including source, status, reason for status, last activity, and contact fields, and later highlights lead source, reason/status, lead classification, product, responses, account type, and interest level as important. ([Frontiers][11]) It also notes B2B processes can involve longer consultative sales cycles and overloaded sales reps, matching Leadforge’s prioritization framing. 
([Frontiers][11]) + +**Implication for v1:** The release should emphasize lead prioritization, top-K sales capacity, lift, value capture, calibration, and process timing, not only binary classification AUC. + +### 4.3 Synthetic data generation and evaluation + +For pure synthetic data like Leadforge, “fidelity to real data” is not enough because there is no single real reference dataset. Still, synthetic-data evaluation literature and tooling point to useful axes: statistical quality, relational/cardinality preservation, utility, privacy/disclosure risk, and documentation. + +SDMetrics’ Quality Report evaluates statistical similarity through column shapes, column pair trends, and for multi-table data, cardinality and intertable trends. ([Synthetic Data Vault][12]) Leadforge should borrow the idea of multi-axis reporting, but adapt it to **mechanism-designed synthetic worlds**: validity, leakage safety, difficulty, utility, structural diversity, narrative plausibility, and public artifact correctness. + +Datasheets for Datasets argues datasets should document motivation, composition, collection/creation, recommended uses, and related information. ([arXiv][13]) Google’s Data Cards Playbook defines data cards as structured summaries of essential dataset facts for stakeholders across the lifecycle and includes themes such as authorship, dataset overview, motivation, provenance, transformations, annotations/labeling, validation, sampling, and benchmarks. ([Google Research][14]) + +**Implication for v1:** Leadforge should ship a generated validation report and a human-readable data card. The card should not only describe files; it should describe DGP, snapshot policy, label policy, leakage traps, limitations, intended use, out-of-scope use, and maintenance. 
+ +### 4.4 Kaggle release requirements + +Kaggle’s current official API docs say a dataset upload folder must contain `dataset-metadata.json` next to the uploaded files, and the metadata follows the Data Package specification. Supported fields include `title`, `subtitle`, `description`, `id`, `licenses`, `resources`, `keywords`, `expectedUpdateFrequency`, `userSpecifiedSources`, and `image`. ([GitHub][15]) + +Important Kaggle constraints: `title` must be 6–50 characters, `subtitle` 20–80 characters, dataset slug 3–50 characters, exactly one license entry is evaluated, and `resources[].schema.fields` must include all fields in order if provided. ([GitHub][15]) Supported `expectedUpdateFrequency` values include `never`, `annually`, `quarterly`, `monthly`, `weekly`, `daily`, and `hourly`. ([GitHub][15]) Kaggle’s cover image guidance currently recommends `dataset-cover-image.png` or `.jpg/.jpeg/.webp` beside `dataset-metadata.json`, with minimum 560×280 dimensions and specified 2:1 header and 1:1 thumbnail crops. ([GitHub][15]) + +**Implication:** Add a generator for `release/kaggle/dataset-metadata.json`, a cover image, and validation against these constraints. + +### 4.5 Hugging Face release requirements + +Hugging Face dataset repos render `README.md` as the dataset card, with YAML metadata at the top for license, language, tags, size, and data-file configuration. ([Hugging Face][16]) Supported repository structures and file formats such as CSV and Parquet can be loaded automatically with `load_dataset()` and can show a Dataset Viewer. ([Hugging Face][17]) The YAML `configs` field defines splits and subsets; multiple configurations can be loaded by name, and `default: true` can set the default config. ([Hugging Face][17]) Hugging Face also documents manual split/subset configuration and notes Parquet viewer-size issues can be mitigated with smaller row groups and page indexes. 
([Hugging Face][18]) + +**Implication:** Convert `release/HF_DATASET_CARD.md` into the final HF repo `README.md`, add `pretty_name`, `tags: [tabular, lead-scoring, synthetic-data, crm, b2b, datasets, pandas]`, `configs` for all tiers, a default config, and test local `load_dataset()`. + +--- + +## 5. Best-in-class v1 release specification + +### 5.1 Dataset family shape + +Ship a family, not one CSV: + +```text +leadforge-lead-scoring-v1 + intro_public + intermediate_public + advanced_public + intermediate_research_companion +``` + +The public tiers should be snapshot-safe. The research companion should be clearly marked “not for student exercises.” + +### 5.2 Canonical public release tree + +```text +leadforge-lead-scoring-v1/ + README.md + LICENSE + CITATION.cff + CHANGELOG.md + dataset-cover-image.png + + docs/ + DATASET_CARD.md + GENERATION_METHOD.md + VALIDATION_REPORT.md + FEATURE_DICTIONARY.md + BREAK_ME_GUIDE.md + STUDENT_QUICKSTART.md + LIMITATIONS.md + + data/ + intro/ + train.csv + validation.csv + test.csv + lead_scoring.csv + manifest.json + feature_dictionary.csv + intermediate/ + ... + advanced/ + ... + + relational_snapshot_safe/ + intro/ + accounts.parquet + contacts.parquet + leads.parquet + touches.parquet + sessions.parquet + sales_activities.parquet + opportunities.parquet + intermediate/ + advanced/ + + validation/ + validation_report.json + validation_report.md + figures/ + lift_curve_intro.png + lift_curve_intermediate.png + lift_curve_advanced.png + calibration_intermediate.png + leakage_delta.png + split_shift.png + value_capture.png + + notebooks/ + 01_intro_flat_csv_baseline.ipynb + 02_relational_feature_engineering.ipynb + 03_leakage_and_time_windows.ipynb + 04_lift_calibration_value_ranking.ipynb + + kaggle/ + dataset-metadata.json + + huggingface/ + README.md +``` + +### 5.3 Public bundle contents + +Public bundle should include: + +* Flat task splits with labels. 
+* Snapshot-safe relational tables with labels and post-outcome fields removed. +* Feature dictionary with `leakage_risk`, `available_at`, `derived_from`, `entity_level`, and `recommended_for_modeling`. +* Manifest with row counts, checksums, recipe, seed, package version, schema version, snapshot day, horizon, and validation report hash. +* Notebook-safe starter path that excludes leakage-trap features by default. + +### 5.4 Instructor/research companion contents + +Instructor companion should include: + +* Full hidden graph. +* Full world spec. +* Mechanism summary. +* Latent registry. +* Full-horizon relational tables. +* Instructor leakage-trap features. +* Public/instructor diff report. +* LLM critique raw outputs and adjudication. + +Recommendation: keep this out of the default Kaggle dataset. Put it in a separate GitHub Release artifact or a separate HF repo/config named clearly as instructor/research material. This preserves teaching utility while enabling external audit. + +### 5.5 Notebooks + +Minimum notebooks: + +1. **Intro flat CSV baseline:** LR/GBM, AUC, PR-AUC, P@K, lift, calibration. +2. **Relational feature engineering:** only snapshot-safe tables; demonstrate legal joins. +3. **Leakage and time windows:** deliberately add leakage trap and post-snapshot fields; show why invalid. +4. **Lift, calibration, value ranking:** use `expected_acv`, `P(convert) × expected_acv`, calibration curves, thresholding. + +Acceptance: all notebooks run top-to-bottom and reproduce validation metrics within tolerance. + +### 5.6 Validation report + +Minimum release validation metrics: + +* Row counts, class balance, split sizes. +* ROC-AUC, PR-AUC, log loss, Brier score. +* Calibration bins and reliability curve. +* Lift@1/5/10%, precision@50/100, recall@K. +* Top-decile conversion rate. +* Expected ACV captured at K. +* LR vs GBM vs source-only vs engagement-only vs leakage-probe models. +* ID-only model. +* Stage/opportunity/customer-only suspect models. 
+* Post-snapshot aggregate leakage model. +* Account/contact overlap across splits. +* Near-duplicate rows across splits. +* Public/instructor diff. +* Snapshot-window audit. +* Relational join leakage audit. +* Cross-seed stability. +* Cross-tier difficulty ordering. + +--- + +## 6. Gap matrix + +| Area | Current evidence | Gap | Severity | Recommended fix | Acceptance criterion | +| ------------------------ | ------------------------------------------------------------------------------------------------------- | ------------------------------------------- | -------- | ------------------------------------------------------------------- | -------------------------------------------- | +| Core generation | End-to-end API exists; graph → population → simulation → bundle. `leadforge/api/generator.py:L124-L248` | Not the blocker | Low | Keep stable | Smoke generation passes | +| CLI | `generate`, `inspect`, `validate`, `list-recipes` exist. `leadforge/cli/main.py:L39-L42` | No release commands / JSON | Medium | Add `leadforge release ...`, `--json` | Machine-readable CI output | +| Public relational tables | Writes full leads/opps/customers/subscriptions. `leadforge/render/relational.py:L42-L83` | Direct target/post-outcome leakage | Critical | Add snapshot-safe relational export; move full horizon to companion | No public join path reconstructs label | +| Flat task | Snapshot filtering exists. `leadforge/render/snapshots.py:L57-L243` | Needs full leakage probes | High | Add time-window and suspect-feature probes | No high-severity leakage | +| Exposure | Public/instructor modes exist. 
`leadforge/exposure/filters.py:L23-L58` | Redaction too narrow for relational leakage | Critical | Expand redaction/safe-export policy | Relational leak test passes | +| Validation | Bundle, realism, difficulty, drift, v7 metrics exist | No release report/charts/calibration/LLM | High | Add `release_quality.py`, `leakage_probes.py`, reporting | `validation_report.{json,md}` generated | +| HF packaging | `release/HF_DATASET_CARD.md` exists | Needs final README/config/default/load test | Medium | Add `package_hf_release.py` | Local `load_dataset()` works | +| Kaggle packaging | No `dataset-metadata.json` found | Missing platform package | High | Add `package_kaggle_release.py` | Metadata validates; dry-run package produced | +| Notebooks | One release baseline notebook | Missing relational/leakage/value sequence | Medium | Add 4 notebooks | All execute | +| v7 lessons | Strong v7 track exists | Not fully propagated into v1 spec | Medium | Port v7 teaching sequence | Data card/notebooks include v7 lessons | +| Feedback loop | Alpha repo exists | No issue templates/break-me guide | Medium | Add GitHub templates + guide | Public pages link feedback channels | +| Scope | LTV-ready internals exist | Risk of v1 scope creep | Medium | State out-of-scope clearly | No LTV/leaderboard work in v1 | + +--- + +## 7. Roadmap to v1 + +### Milestone 1 — Release audit and acceptance gates + +**Goal:** Freeze the current-state evidence and define v1 gates. + +**Work items:** + +* Add `docs/release/v1_current_state_audit.md`. +* Add `docs/release/v1_acceptance_gates.md`. +* Regenerate intro/intermediate/advanced/instructor bundles with pinned timestamp. +* Record command logs. +* Record public/instructor diff. +* Record known relational leakage finding. 
+ +**Files likely touched:** + +```text +docs/release/v1_current_state_audit.md +docs/release/v1_acceptance_gates.md +scripts/build_public_release.py +``` + +**Commands:** + +```bash +python -m pytest --collect-only -q +python -m pytest -q +python scripts/build_public_release.py /tmp/leadforge_v1_rc \ + --generation-timestamp 2026-01-01T00:00:00+00:00 +leadforge validate /tmp/leadforge_v1_rc/intermediate +``` + +**Acceptance:** Full tests pass or failures triaged; release bundles regenerate; relational leakage is documented as a blocker, not ignored. + +--- + +### Milestone 2 — Snapshot-safe public relational export + +**Goal:** Remove direct and join-based label leakage from public relational data. + +**Work items:** + +* Add `leadforge/render/relational_snapshot_safe.py`. +* Add `leadforge/validation/relational_leakage.py`. +* Drop target and conversion timestamps from public `leads`. +* Filter event tables to `timestamp <= lead_created_at + snapshot_day`. +* Drop `close_outcome` and `closed_at` from public `opportunities`. +* Omit `customers` and `subscriptions` from public feature-engineering exports. +* Keep full-horizon tables only in instructor companion. + +**Acceptance:** A leak probe using only public relational tables cannot reconstruct `converted_within_90_days` above configured tolerance; a customer/opportunity-only model fails because those fields are absent or snapshot-safe. + +--- + +### Milestone 3 — Platform package generation + +**Goal:** Build Kaggle and HF upload folders from release manifests. + +**Work items:** + +```text +scripts/package_kaggle_release.py +scripts/package_hf_release.py +release/kaggle/dataset-metadata.json +release/huggingface/README.md +release/dataset-cover-image.png +``` + +**Kaggle acceptance:** + +* `dataset-metadata.json` contains valid `title`, `subtitle`, `description`, `id`, `licenses`, `resources`, `keywords`, `expectedUpdateFrequency`, `userSpecifiedSources`, and `image`. 
+* Title/subtitle/slug/image constraints pass. +* Resource schemas include fields in order. +* Package zip is produced without credentials. + +**HF acceptance:** + +* `README.md` has YAML metadata with `pretty_name`, `license`, `language`, `task_categories`, `size_categories`, `tags`, and `configs`. +* Main config is `default: true`. +* `load_dataset(local_path, "intermediate")` works or blocker is recorded. + +--- + +### Milestone 4 — Release validation hardening + +**Goal:** Turn validation into a release artifact. + +**Work items:** + +```text +leadforge/validation/release_quality.py +leadforge/validation/leakage_probes.py +leadforge/validation/reporting.py +scripts/validate_release_candidate.py +release/validation/validation_report.json +release/validation/validation_report.md +release/validation/figures/*.png +``` + +**Acceptance:** + +* No critical leakage findings. +* Metrics within configured tier bands. +* Calibration and lift charts generated. +* Relational leak probes pass. +* Public/instructor diff is intentional. +* Report is included in Kaggle/HF packages. + +--- + +### Milestone 5 — Documentation and notebooks + +**Goal:** Make the release usable by educators, students, and external breakers. + +**Work items:** + +```text +docs/release/DATASET_CARD.md +docs/release/GENERATION_METHOD.md +docs/release/BREAK_ME_GUIDE.md +docs/release/STUDENT_QUICKSTART.md +docs/release/INSTRUCTOR_GUIDE.md +notebooks/01_intro_flat_csv_baseline.ipynb +notebooks/02_relational_feature_engineering.ipynb +notebooks/03_leakage_and_time_windows.ipynb +notebooks/04_lift_calibration_value_ranking.ipynb +``` + +**Acceptance:** Notebooks run top-to-bottom; notebook metrics match validation report; leakage-trap use is clearly separated from normal modeling. + +--- + +### Milestone 6 — LLM critique integration + +**Goal:** Add the external LLM review loop requested in the original milestone. 
+ +**Work items:** + +```text +leadforge/validation/llm_critique.py +docs/release/llm_critique_prompt.md +release/validation/llm_critique_raw/*.json +release/validation/llm_critique_summary.md +``` + +**Input bundle:** + +* README / dataset card. +* Generation method. +* Manifest. +* Feature dictionary. +* Validation report. +* First 100 public rows. +* Public/instructor diff. +* Public-safe mechanism summary. + +**Output schema:** + +```json +{ + "release_id": "leadforge-lead-scoring-v1", + "model": "provider/model/version", + "run_timestamp": "ISO-8601", + "overall_score": 0, + "findings": [ + { + "severity": "critical|high|medium|low|nit", + "category": "leakage|realism|documentation|platform|ethics|pedagogy|code", + "claim": "...", + "evidence": "file/path:line or artifact reference", + "reproducer": "optional command", + "suggested_fix": "..." + } + ], + "missing_sections": [], + "questions_for_maintainer": [] +} +``` + +**Acceptance:** Runs with credentials, skips cleanly without credentials, produces structured findings, and no unresolved high-severity findings remain. + +--- + +### Milestone 7 — Dry-run publication and public feedback loop + +**Goal:** Publish safely and invite breakage reports. + +**Work items:** + +```text +scripts/publish_kaggle.py +scripts/publish_hf.py +.github/ISSUE_TEMPLATE/dataset_breakage_report.yml +.github/ISSUE_TEMPLATE/realism_feedback.yml +docs/release/v1_release_notes.md +``` + +**Acceptance:** + +* Kaggle private/draft upload tested. +* HF private repo upload tested. +* Download/load smoke tests pass. +* Public pages link to break-me guide and issue templates. +* LTV, leaderboard, and other task families remain out of v1 scope. + +--- + +## 8. Suggested v2 feedback plan + +The v1 public framing should explicitly ask users to break the dataset in these ways: + +1. Find direct leakage. +2. Reconstruct labels through relational joins. +3. Beat baseline lift with legitimate features. +4. 
Show unrealistic marginal or joint distributions. +5. Show unrealistic sales-cycle or funnel dynamics. +6. Identify documentation ambiguity. +7. Find platform loading/viewer problems. +8. Propose better industry calibration sources. + +Feedback should be triaged into: + +```text +critical-leakage +realism +difficulty +documentation +platform +notebook +pedagogy +v2-idea +out-of-scope-v1 +``` + +Keep a `docs/release/v2_decision_log.md` that records accepted/rejected feedback and why. Do not add LTV, leaderboard, or other GTM tasks to v1. + +--- + +## 9. Bottom line + +Leadforge’s current state is strong. The right next move is not to build the generator; it is to make the curated public release impossible to dismiss. + +The required release-hardening work is concrete: + +1. Fix public relational leakage. +2. Generate platform-native Kaggle/HF packages. +3. Produce a release validation report with charts and adversarial probes. +4. Port v7’s strongest teaching lessons into the multi-table v1 release. +5. Add notebooks, break-me guide, issue templates, and LLM critique. +6. Publish public data and separate instructor/research truth cleanly. + +Until the relational leakage issue is fixed, the v1 dataset should not be released as a best-in-class public lead-scoring dataset. Once fixed, Leadforge has enough implemented machinery to plausibly exceed the current public lead-scoring dataset landscape. 
+ +[1]: https://github.com/leadforge-dev/leadforge-datasets/blob/main/releases/v0.1.0-alpha/BASELINES.md "leadforge-datasets/releases/v0.1.0-alpha/BASELINES.md at main · leadforge-dev/leadforge-datasets · GitHub" +[2]: https://github.com/leadforge-dev/leadforge-datasets/blob/main/releases/v0.1.0-alpha/EXPOSURE_DELTA.md "leadforge-datasets/releases/v0.1.0-alpha/EXPOSURE_DELTA.md at main · leadforge-dev/leadforge-datasets · GitHub" +[3]: https://github.com/leadforge-dev/leadforge-datasets "GitHub - leadforge-dev/leadforge-datasets · GitHub" +[4]: https://github.com/leadforge-dev/leadforge-datasets/blob/main/releases/v0.1.0-alpha/validation.log "leadforge-datasets/releases/v0.1.0-alpha/validation.log at main · leadforge-dev/leadforge-datasets · GitHub" +[5]: https://www.analyticsvidhya.com/blog/2022/09/exploratory-data-analysis-eda-on-lead-scoring-dataset/ "Exploratory Data Analysis (EDA) on Lead Scoring Dataset -" +[6]: https://huggingface.co/datasets/shawhin/lead-scoring-x "shawhin/lead-scoring-x · Datasets at Hugging Face" +[7]: https://archive.ics.uci.edu/ml/datasets/Online%2BShoppers%2BPurchasing%2BIntention%2BDataset "UCI Machine Learning Repository" +[8]: https://knowledge.hubspot.com/scoring/build-lead-scores?utm_source=chatgpt.com "Build lead scores to qualify contacts, companies, and deals" +[9]: https://www.salesforce.com/blog/lead-scoring/?utm_source=chatgpt.com "Lead Scoring: How to Find the Best Prospects in 4 Steps" +[10]: https://experienceleague.adobe.com/en/docs/experience-platform/rtcdp/b2b-cdp-ai-ml/predictive-lead-and-account-scoring-intro/predictive-lead-and-account-scoring?utm_source=chatgpt.com "Predictive lead and account scoring in Real-Time CDP B2B" +[11]: https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1554325/full "Frontiers | The relevance of lead prioritization: a B2B lead scoring model based on machine learning" +[12]: https://docs.sdv.dev/sdmetrics/data-metrics/quality/quality-report "Quality 
Report | SDMetrics" +[13]: https://arxiv.org/abs/1803.09010?utm_source=chatgpt.com "Datasheets for Datasets" +[14]: https://sites.research.google/datacardsplaybook/ "The Data Cards Playbook - Data Cards Playbook" +[15]: https://github.com/Kaggle/kaggle-api/blob/main/docs/datasets_metadata.md "kaggle-cli/docs/datasets_metadata.md at main · Kaggle/kaggle-cli · GitHub" +[16]: https://huggingface.co/docs/hub/datasets-cards "Dataset Cards · Hugging Face" +[17]: https://huggingface.co/docs/datasets/repository_structure "Structure your repository · Hugging Face" +[18]: https://huggingface.co/docs/hub/datasets-data-files-configuration "Data files Configuration · Hugging Face" diff --git a/docs/external_review/chatgpt/leadforge_report_v1_critique.md b/docs/external_review/chatgpt/leadforge_report_v1_critique.md new file mode 100644 index 0000000..2eaecfd --- /dev/null +++ b/docs/external_review/chatgpt/leadforge_report_v1_critique.md @@ -0,0 +1,678 @@ +# Critique of `chatgpt_report_v1.md` and Suggested Better Process + +## Executive verdict + +The generated report is useful as a **first rough planning memo**, but it does **not** satisfy the original task prompt. The prompt asked for a thorough code-and-release review, a deep research expedition, and an actionable roadmap toward a best-in-class Kaggle/Hugging Face lead-scoring dataset. The report instead gives a mostly generic roadmap, lightly references the alpha dataset repository, and misses major evidence present in the attached `leadforge-repomix-output.xml` package. + +The biggest problem is not tone or formatting. It is **methodological under-inspection**. The report appears to have skimmed the architecture documents and external platform docs, then inferred a future roadmap. It did not build an evidence matrix from the repository, did not run or audit the package, did not inspect the existing release assets, and did not distinguish what already exists from what remains to build. 
+ +This caused several materially wrong or misleading statements, including: + +- It says many modules are placeholders and that the simulation pipeline, CLI, validation, and release packaging need to be implemented. The attached repo already contains implemented generator, population, hidden graph, simulation, bundle writer, CLI commands, validation modules, release scripts, release README, and Hugging Face dataset card files. +- It says there is no Kaggle/HF packaging. There is no Kaggle `dataset-metadata.json`, but there **is** a Hugging Face dataset card and a release README in `release/`. +- It says built-in evaluation is missing. The repo contains bundle validation, realism checks, difficulty checks, drift/cross-seed checks, v7 dataset validation, release validation reports, and a large test suite. +- It gives platform guidance that is partly inaccurate or too loose, such as an unsupported “1200×400” Kaggle image claim and non-current field names like `updateFrequency` instead of `expectedUpdateFrequency`. + +A better report should be much more forensic: it should audit the actual code, release artifacts, generated data, validation scripts, tests, and external platform requirements; then produce a gap matrix, acceptance gates, and a PR-sized roadmap. + +## What I reviewed + +I reviewed: + +1. The task prompt in the current conversation. +2. The generated report: `chatgpt_report_v1.md`. +3. The attached core package: `leadforge-repomix-output(1).xml`. +4. The extracted repository files from the Repomix XML. +5. Current public documentation for Kaggle dataset metadata and Hugging Face dataset cards/repository structure. + +I also performed lightweight static and dynamic checks on the extracted repo: + +- Extracted **194 files** from the Repomix XML. +- Counted **78 Python files** under `leadforge/`. +- Counted roughly **10,398 lines** under `leadforge/` and **9,312 lines** under `tests/`. 
+- Found **81 classes**, **286 functions**, **0 `NotImplementedError` occurrences**, **0 TODO occurrences**, and only **2 literal `pass` statements** in `leadforge/`. +- `pytest --collect-only` found **937 tests**. +- A partial test run after editable install passed the first ~53% of tests before the execution timeout. I do **not** treat this as a failed test run; it only means I did not complete the full dynamic validation within this critique pass. + +## Scorecard for the generated report + +| Dimension | Grade | Why | +|---|---:|---| +| Prompt compliance | C- | Covers the requested headings, but not at the requested depth. | +| Repository review | D | Misses substantial implemented code and release assets. | +| Dataset release audit | C- | Mentions alpha bundles, but does not inspect data or artifacts deeply. | +| External research | C | Uses a few relevant sources, but too shallow and vendor-blog heavy. | +| Roadmap actionability | C- | Generic milestones; few concrete files, commands, gates, or PRs. | +| Citation quality | D | Browser-internal citations are not portable; several claims are miscited. | +| Strategic usefulness | C | Good high-level instincts, but unsafe as an execution plan. | + +The report is directionally aligned with leadforge’s vision, but it would mislead an implementer about what is already done. + +## Major factual and evidentiary problems + +### 1. It misclassifies the repo as mostly skeletal + +The report states that the repo includes skeletons and “many functions are placeholders,” and later recommends “complete the simulation pipeline.” That is not supported by the attached package. + +Evidence in the attached package shows an implemented end-to-end generation path: + +- `leadforge/api/generator.py` builds a `Generator` from a recipe, resolves config and narrative, samples a hidden graph, builds population, simulates the world, and returns a populated `WorldBundle`. 
+- `leadforge/structure/sampler.py` selects a motif family, performs stochastic rewiring, and returns a validated hidden world graph. +- `leadforge/simulation/population.py` generates accounts, contacts, leads, and latent states. +- `leadforge/simulation/engine.py` contains a detailed discrete-time 90-day simulation with stage transitions, conversion hazards, event emission, opportunity creation, customers, and subscriptions. +- `leadforge/api/bundle.py` writes relational Parquet tables, snapshot task splits, dataset card, feature dictionary, exposure metadata, and manifest. + +The right critique would not be “implement the engine.” It would be: + +- Audit whether the engine is realistic enough. +- Identify where mechanisms are too simple, over-tuned, or under-documented. +- Assess whether difficulty profiles are stable across seeds. +- Test whether public artifacts remain leakage-safe under relational feature engineering. +- Add release-grade validations and publishing automation where missing. + +### 2. It misses existing CLI implementation + +The report’s roadmap says to implement `leadforge generate`, `list-recipes`, `inspect`, and `validate`. Those commands already exist in the attached repo. + +Evidence: + +- `leadforge/cli/commands/generate.py` implements generation, override handling, config resolution, bundle generation, and save. +- `leadforge/cli/commands/inspect.py` reads `manifest.json` and prints recipe, seed, mode, difficulty, horizon, package version, schema version, motif family, table row counts, task rows, and metadata presence. +- `leadforge/cli/commands/validate.py` calls `validate_bundle()` and exits nonzero on failures. + +A better roadmap would focus on gaps in those commands: + +- Add `--json` output to `inspect` and `validate`. +- Add `release build`, `release validate`, `release package-kaggle`, `release package-hf`, and `release publish-*` commands. +- Add a dry-run mode for publishing. 
+- Add environment-variable checks for Kaggle/HF credentials. +- Add artifact hashing and upload manifest verification. + +### 3. It says there is no Hugging Face packaging, but there is + +The report says: “No Kaggle/HF packaging.” That is only half true. + +The attached repo contains: + +- `release/HF_DATASET_CARD.md`, with YAML front matter for `language`, `license`, `task_categories`, `tags`, `size_categories`, and `configs` for intro/intermediate/advanced splits. +- `release/README.md`, with release layout, quick-start code, dataset summary, leakage handling, research companion explanation, and provenance. +- `scripts/build_public_release.py`, which builds intro, intermediate, advanced, and intermediate instructor bundles; writes flat CSVs for public bundles; copies the license; and validates each bundle. + +What is missing is more specific: + +- A Kaggle `dataset-metadata.json` template/generator. +- A final HF `README.md` that satisfies the full dataset-card template, not just a concise release card. +- A release asset manifest for platform upload. +- A publishing command or CI workflow. +- A cover image asset. +- Automated post-upload smoke tests. + +### 4. It underestimates existing validation + +The report says built-in evaluation is lacking. That is too broad. + +The repo already has multiple validation layers: + +- `leadforge/validation/bundle_checks.py` validates required files, table files, task split files, hashes, FK integrity, leakage columns, and exposure redaction. +- `leadforge/validation/realism.py` checks conversion-rate guardrails, nonempty tables, feature ranges, boolean dtypes, and stage diversity where available. +- `leadforge/validation/difficulty.py` checks known difficulty profiles and difficulty ordering across bundles. +- `leadforge/validation/drift.py` checks cross-seed stability and degenerate conversion-rate patterns. 
+- `lead_scoring_intro/validation_v7_report.json` contains concrete v7 metrics including baseline AUC, PR-AUC, value-aware ranking uplift, leakage-trap deltas, missingness, and cohort split degradation. + +The better critique is not that validation is absent. It is that v1 release validation needs to become **release-grade**, with explicit acceptance thresholds, persisted reports, charts, adversarial leakage probes, platform packaging checks, and LLM critique artifacts. + +### 5. It ignores the `lead_scoring_intro` v6/v7 track + +The task prompt explicitly mentioned a one-CSV lead-scoring dataset used in an Intro to ML course. The attached repo contains a substantial `lead_scoring_intro/` track: + +- `lead_scoring_intro/RELEASE_v7.md` documents a v7 educational dataset, a purely causal leakage trap, snapshot definition, student/instructor files, column dictionary, baseline metrics, tree-model comparison, value-aware ranking, cohort split evaluation, known limitations, and lecture guidance. +- `lead_scoring_intro/validation_v7_report.json` stores metrics used by that release document. +- `scripts/build_v7_snapshot.py` and `scripts/validate_v7_dataset.py` support generation and validation. + +The generated report almost completely misses this. That is a major omission because the v7 CSV track is likely one of the best sources of lessons for v1: leakage trap design, lecture sequencing, cohort shift, value-aware ranking, and student/instructor split design. + +### 6. It does not distinguish “framework v1” from “curated dataset v1” + +The task contains two related products: + +1. The `leadforge` package/framework. +2. A curated, exemplary, v1 lead-scoring dataset family generated by the framework. + +The report treats these as one blended thing. That makes the roadmap blurry. A better report should maintain two parallel lanes: + +- **Framework readiness lane:** engine, config, CLI, validation, documentation, release automation, publishing integrations, reproducibility. 
+- **Dataset readiness lane:** chosen recipe/seed(s), size, splits, public/instructor variants, data cards, notebooks, validation reports, public challenge framing, feedback channels. + +Each lane should have its own acceptance criteria and release gates. + +### 7. It gives a generic roadmap, not an execution plan + +The roadmap is too high-level. It says things like “add engineered features,” “expand motifs,” and “implement validation checks,” but does not identify: + +- Which files to change. +- Which commands should exist. +- Which release artifacts should be produced. +- What acceptance thresholds define success. +- What the Kaggle/HF upload directory should look like. +- Which validation reports should be persisted. +- Which notebooks should be shipped. +- Which tests should gate CI. +- Which items are out of scope for the next milestone. + +For example, “Add engineered features” should become a concrete feature plan: + +- Add `engagement_velocity_7d`, `high_intent_session_ratio`, `multi_threaded_account`, `stakeholder_coverage`, `days_to_first_sales_activity`, and `source_normalized_activity_rate` only if they are causally available before the snapshot window. +- For each new feature, add a schema entry, feature dictionary description, leakage flag, snapshot test, monotonicity or range test, and at least one validation check. +- Require a clean-model/lift delta report with and without each feature family. + +### 8. It weakly satisfies the “deep research expedition” requirement + +The external research in the report is thin. It cites a few platform docs, one industry article, a synthetic-data vendor blog, and a Google research blog. That is not enough for the stated goal: “best ever synthetic lead scoring dataset.” + +Missing research streams include: + +- Public lead-scoring dataset census: Kaggle, HF, UCI, GitHub, Data.World, and common bootcamp datasets such as X Education. 
+- Lead-scoring literature: predictive lead scoring, sales funnel modeling, survival/hazard modeling, uplift/lift evaluation, conversion-rate calibration, CRM data leakage patterns. +- B2B GTM realism: funnel conversion benchmarks, sales cycle durations, lead source mix, enterprise buying committee dynamics, SDR/outbound/inbound attribution, opportunity creation rates. +- Synthetic tabular data evaluation: SDMetrics, TSTR/TRTS, SynthCity, statistical fidelity metrics, privacy/disclosure risk, plausibility constraints, graph/relational synthetic data evaluation. +- Dataset documentation standards: Data Cards, Dataset Nutrition Labels, Model Cards analogues, MLCommons/Croissant if applicable, Kaggle metadata, HF dataset card specs. +- Educational dataset design: notebooks, assignments, instructor keys, leakage traps, calibration exercises, lift curves, cohort shift, and rubrics. + +### 9. Its citations are not publication-grade + +The report’s citations are internal browser IDs such as `【176731919908143†L15-L89】`. In a downloaded Markdown file, these are not durable references. They do not identify the source title, URL, or accessed date. + +There are also citation-matching problems. For example, broad architecture claims are repeatedly cited to the same line span that appears to be a dataset-card snippet rather than the actual architecture specification. The report’s cited line ranges are too broad to be useful and sometimes do not support the claim precisely. + +A better report should use: + +- Source title. +- URL or repository path. +- Access date for web sources. +- Exact file path and line range for repository evidence. +- A bibliography grouped by platform docs, academic research, industry evidence, and repository files. + +### 10. It contains platform-specific inaccuracies + +The Kaggle section should be corrected. 
Kaggle’s current dataset metadata docs state that the upload folder should contain `dataset-metadata.json`; supported fields include `title`, `subtitle`, `description`, `id`, `licenses`, `resources`, `keywords`, `expectedUpdateFrequency`, `userSpecifiedSources`, and `image`. The docs also describe a recommended `dataset-cover-image.png` sibling file and specify a **minimum** image size of **560×280**, with header and thumbnail crops. The generated report instead mentions fields such as `isPrivate`, `maintainer`, and `updateFrequency`, and says a cover image must be 1200×400. That is not precise enough for an automated publisher. + +The Hugging Face section is broadly right but underspecified. HF dataset cards use `README.md` with YAML metadata. The YAML can include `configs` and `data_files` so that multiple subsets/splits load without custom code. The attached repo already has such a card, but it needs hardening: `pretty_name`, `tags: tabular`, `dataset_info`, `default: true` for the main config, and a clearer split between task splits and relational tables. + +## What the report did well + +The report has useful instincts: + +- It recognizes that leadforge should be a world simulator, not a generic tabular sampler. +- It emphasizes narrative context, reproducibility, leakage safety, documentation, and validation. +- It correctly points toward Kaggle and HF metadata/data-card requirements. +- It recommends external LLM critique, which matches the original prompt. +- It identifies the importance of lift curves, precision@K, and teaching notebooks. + +These are good high-level themes. The problem is that the report stops before doing the hard work of connecting those themes to the actual codebase and release artifacts. + +## Better process for producing the report the user actually asked for + +### Phase 0: Build an evidence inventory + +Extract the Repomix XML into files. Produce a source inventory: + +- File tree by module. +- Python module/function/class counts. 
+- Docs inventory. +- Release artifact inventory. +- Test inventory. +- Scripts inventory. +- Dataset artifact inventory. + +Create a table of evidence claims with source paths and line ranges. Do not write the final report until this matrix exists. + +### Phase 1: Static code and architecture audit + +Audit each package layer: + +- `api/`: public surface, config precedence, bundle lifecycle. +- `cli/`: implemented commands, missing flags, error behavior, JSON output. +- `recipes/`: recipe schema, difficulty profiles, extensibility. +- `structure/`: motif families, graph validity, rewiring semantics. +- `simulation/`: population generation, stage transitions, hazards, event emission, direct conversion, churn, post-simulation entities. +- `render/`: relational tables, snapshots, task splits, manifests, cards. +- `exposure/`: public/instructor redaction and metadata filtering. +- `validation/`: invariants, realism, difficulty, drift, leakage, artifact integrity. +- `release/` and `scripts/`: build, validate, package, publish readiness. + +For each layer, classify findings as: + +- Exists and seems mature. +- Exists but needs hardening. +- Missing. +- Risky/unclear. + +### Phase 2: Dynamic reproducibility audit + +Install the package in editable mode. Run: + +```bash +python -m pytest --collect-only -q +python -m pytest -q +leadforge list-recipes +leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 --mode student_public --difficulty intermediate --out /tmp/leadforge_smoke +leadforge inspect /tmp/leadforge_smoke +leadforge validate /tmp/leadforge_smoke +python scripts/build_public_release.py /tmp/leadforge_release --generation-timestamp 2026-01-01T00:00:00+00:00 +``` + +Record: + +- Whether tests pass. +- Runtime and memory. +- Bundle row counts. +- Hash determinism. +- Validation errors/warnings. +- Any mismatch between docs and artifacts. 
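The "hash determinism" item above can be checked mechanically. The following is a minimal sketch, not part of leadforge: `bundle_digest` and `same_bundle` are hypothetical helper names, and the sketch assumes a bundle is a plain directory tree whose files should be byte-identical across two runs with the same seed.

```python
import hashlib
from pathlib import Path


def bundle_digest(root: Path) -> str:
    """Return one SHA-256 digest covering every file under root.

    Hypothetical helper: relative paths are hashed together with file
    contents, so renamed or moved files change the digest too.
    """
    digest = hashlib.sha256()
    # Sort by relative POSIX path so traversal order cannot affect the result.
    files = sorted(
        (p.relative_to(root).as_posix(), p) for p in root.rglob("*") if p.is_file()
    )
    for rel, path in files:
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()


def same_bundle(first: Path, second: Path) -> bool:
    """True when two bundle directories are byte-identical file by file."""
    return bundle_digest(first) == bundle_digest(second)
```

One possible use is to run `leadforge generate` twice with the same seed into two output directories and compare them with `same_bundle`. If any artifact embeds a wall-clock timestamp, the digests will presumably differ unless the timestamp is pinned, as the `--generation-timestamp` argument above does for the release script.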
+ +### Phase 3: Alpha dataset forensic audit + +Download or inspect the alpha public dataset repository and the generated bundles, not just README snippets. + +For each tier: + +- Load task splits and flat CSV. +- Verify manifest hashes. +- Check row counts and class balance. +- Compute baseline AUC, PR-AUC, log loss, calibration curves, lift@K, precision@K, expected-value ranking, and top-decile conversion rates. +- Test leakage probes: label permutation, timestamp leakage, post-snapshot event leakage, ID leakage, duplicates, account-contact leakage across splits, relational rejoin leakage. +- Compare train/valid/test distribution shift. +- Compare public vs instructor bundles. +- Check whether redacted columns can be reconstructed from remaining relational tables. +- Generate a validation report with charts and JSON. + +### Phase 4: Public dataset and competitor census + +Search Kaggle, HF, GitHub, and UCI for lead-scoring datasets and notebooks. For each public dataset, record: + +- Topic/domain. +- Row count and feature count. +- Whether relational or flat. +- Presence/quality of card/README. +- Label definition. +- Known leakage columns. +- Baseline performance. +- Educational value. +- Weaknesses that leadforge can explicitly surpass. + +This would provide evidence for “best-in-class” rather than assuming it. + +### Phase 5: Current platform packaging research + +Use official docs only for platform requirements. + +For Kaggle, produce an exact `dataset-metadata.json` schema and a publishing command. Include current license identifiers, resource schema requirements, image requirements, and API behavior. + +For Hugging Face, produce a final `README.md` dataset card with YAML metadata, configs, data files, and optional `dataset_info`. Test `load_dataset()` locally or in a temporary repo if possible. + +### Phase 6: Release specification and acceptance gates + +Define a v1 release candidate as a directory plus validation reports. 
Each release candidate should include: + +```text +release_candidate/ + README.md + LICENSE + DATASET_CARD.md + CHANGELOG.md + CITATION.cff + dataset-cover-image.png + kaggle/ + dataset-metadata.json + README.md + huggingface/ + README.md + validation/ + validation_report.json + validation_report.md + figures/ + lift_curve.png + calibration_curve.png + missingness_heatmap.png + conversion_by_source.png + leakage_delta.png + notebooks/ + 01_intro_baseline.ipynb + 02_feature_engineering_from_relational_tables.ipynb + 03_leakage_and_time_windows.ipynb + 04_lift_curves_and_value_ranking.ipynb + bundles/ + intro/ + intermediate/ + advanced/ + intermediate_instructor/ +``` + +Acceptance gates should include: + +- `leadforge validate` passes all bundles. +- Release-level validation passes across tiers and seeds. +- Kaggle metadata validates locally. +- Hugging Face `load_dataset()` works for all configs. +- Dataset cards pass markdown and metadata linting. +- Baseline metrics fall in target ranges. +- No forbidden leakage columns in public artifacts. +- Public/instructor exposure diff is exactly as intended. +- External LLM critiques produce no unresolved high-severity findings. +- A human spot-check confirms the first notebook runs end-to-end. + +### Phase 7: LLM critique loop + +Add a release critique runner, but make it structured and auditable. + +Inputs: + +- `README.md` / dataset card. +- `manifest.json`. +- `feature_dictionary.csv`. +- `validation_report.json`. +- Sample rows. +- Public/instructor diff summary. +- Mechanism summary, if instructor mode. + +Output schema: + +```json +{ + "model": "provider/model/version", + "release_id": "...", + "overall_score": 0, + "findings": [ + { + "severity": "critical|high|medium|low|nit", + "category": "leakage|realism|documentation|platform|ethics|pedagogy|code", + "claim": "...", + "evidence": "file/path:line or artifact reference", + "reproducer": "optional command or check", + "suggested_fix": "..." 
+ } + ], + "missing_sections": [], + "questions_for_maintainer": [] +} +``` + +Use at least two providers or two model families when available. Save raw model outputs, parsed JSON, and an adjudicated summary. Treat LLM findings as review inputs, not as pass/fail truth. + +## Better report format to aim for + +A strong version of the original report should look like this: + +```text +# Leadforge v1 Lead-Scoring Dataset Release Plan + +## 0. Executive Summary +- One-page verdict. +- Top 10 release blockers. +- Recommended release shape. +- Definition of “v1 ready.” + +## 1. Evidence and Method +- Inputs reviewed. +- Commands run. +- Web sources used. +- What was not verified. +- Source/evidence map. + +## 2. Current-State Audit +### 2.1 Package architecture +### 2.2 Generator and simulation engine +### 2.3 Recipes and difficulty profiles +### 2.4 Rendering and bundle schema +### 2.5 Exposure modes and redaction +### 2.6 Validation and tests +### 2.7 Release tooling +### 2.8 Documentation + +Each subsection: +- What exists. +- Evidence. +- Strengths. +- Gaps. +- Release implications. + +## 3. Alpha Dataset Forensics +- Bundle inventory. +- Schema audit. +- Baseline metrics. +- Lift curves. +- Calibration. +- Leakage probes. +- Missingness and drift. +- Public vs instructor diff. +- Student CSV v7 lessons. + +## 4. External Research +### 4.1 Public lead-scoring datasets and their weaknesses +### 4.2 Lead-scoring and B2B GTM realism +### 4.3 Synthetic data generation/evaluation standards +### 4.4 Dataset documentation standards +### 4.5 Kaggle/Hugging Face release standards + +## 5. Best-in-Class Release Specification +- Dataset family name and positioning. +- File tree. +- Platform-specific packaging. +- Dataset cards. +- Notebooks. +- Validation report. +- Feedback channels. + +## 6. Gap Matrix +- Gap. +- Current evidence. +- Severity. +- Recommended fix. +- Target artifact/test. + +## 7. 
Roadmap +### Milestone 1: Release audit and gates +### Milestone 2: Release builder and platform packaging +### Milestone 3: Validation hardening +### Milestone 4: Documentation and notebooks +### Milestone 5: LLM critique integration +### Milestone 6: Dry run and publication + +Each milestone: +- Goal. +- PRs/files. +- Commands. +- Acceptance criteria. +- Risks. + +## 8. v2 Feedback Plan +- Break-me guide. +- Issue templates. +- Metrics requested from users. +- Triage labels. +- Planned v2 decision log. + +## 9. Appendices +- Exact commands. +- Validation JSON schema. +- Dataset card template. +- Kaggle metadata template. +- HF README template. +- LLM critique prompt. +- Bibliography. +``` + +This format would directly answer the original prompt while giving the project owner a buildable plan. + +## Concrete improved roadmap for leadforge from the current attached state + +### Milestone A — Audit current release candidate + +Deliverables: + +- `docs/release/v1_current_state_audit.md` +- `release/validation/validation_report.md` +- `release/validation/validation_report.json` +- `release/validation/figures/*.png` + +Work: + +- Run full tests. +- Generate fresh release bundles with fixed timestamp. +- Validate all bundles. +- Verify public/instructor diffs. +- Reproduce baselines. +- Add missing evidence from `lead_scoring_intro` v7 into the v1 design notes. + +Acceptance criteria: + +- Full tests pass or all failures are triaged. +- Release bundles regenerate byte-identically with pinned timestamp. +- Validation report is produced from code, not hand-written. +- Known limitations are explicit and reconciled across README, HF card, and dataset card. 
+ +### Milestone B — Platform packaging + +Deliverables: + +- `release/kaggle/dataset-metadata.json` +- `release/kaggle/README.md` +- `release/huggingface/README.md` +- `release/dataset-cover-image.png` +- `scripts/package_kaggle_release.py` +- `scripts/package_hf_release.py` + +Work: + +- Generate Kaggle metadata from manifest and feature dictionary. +- Convert the existing HF card into a full HF README with `pretty_name`, `tabular` tag, configs, and dataset information. +- Validate image dimensions and file paths. +- Produce zip/tar artifacts. + +Acceptance criteria: + +- `kaggle datasets create --dir-mode zip` can run in dry-run/local packaging mode. +- `load_dataset(local_path, name="intermediate")` works for HF-style structure. +- All public artifacts have stable checksums. + +### Milestone C — Release validation hardening + +Deliverables: + +- `leadforge/validation/release_quality.py` +- `leadforge/validation/leakage_probes.py` +- `leadforge/validation/reporting.py` +- `scripts/validate_release_candidate.py` + +Work: + +- Add lift curves, calibration, precision@K, AP, log loss, expected-value ranking. +- Add adversarial leakage probes. +- Add cross-seed and cross-tier stability summaries. +- Add relational rejoin leakage checks. +- Add account leakage / split independence checks. +- Add data-card consistency checks: manifest vs README vs feature dictionary. + +Acceptance criteria: + +- No high-severity leakage findings. +- Metrics fall within configured target bands. +- Charts and JSON are generated automatically. +- Validation output is included in both Kaggle and HF releases. 
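The relational rejoin leakage check listed under Milestone C could start from a probe like the one below. This is a sketch under stated assumptions, not leadforge's validator: `rejoin_agreement` is a hypothetical name, and the inputs (a lead-to-label mapping and the set of lead IDs that appear in some public side table) stand in for whatever joins the real bundle allows.

```python
def rejoin_agreement(labels: dict[str, int], side_table_ids: set[str]) -> tuple[float, float]:
    """Compare 'lead appears in a public side table' against the hidden label.

    Returns (agreement, majority_rate). Agreement far above the majority
    rate (or far below 1 - majority_rate) suggests the label can be
    rebuilt from the side table alone.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    n = len(labels)
    # Count leads where table membership predicts the label exactly.
    hits = sum(int(lead_id in side_table_ids) == y for lead_id, y in labels.items())
    positives = sum(labels.values())
    majority_rate = max(positives, n - positives) / n
    return hits / n, majority_rate
```

A release gate might fail the bundle when agreement exceeds the majority rate by more than a configured margin; the margin itself is a policy choice that this sketch leaves open.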
+ +### Milestone D — Documentation and notebooks + +Deliverables: + +- `notebooks/01_baseline_lead_scoring.ipynb` +- `notebooks/02_relational_feature_engineering.ipynb` +- `notebooks/03_leakage_and_time_windows.ipynb` +- `notebooks/04_lift_curves_and_value_ranking.ipynb` +- `docs/release/break_me_guide.md` +- `docs/release/instructor_guide.md` + +Work: + +- Turn existing v7 teaching guidance into notebook structure. +- Include a “try to break this dataset” guide. +- Add a short modeling baseline and a stronger tree/GBM baseline. +- Add warnings about `total_touches_all` and how to use/remove it. +- Include expected outputs and sanity checks. + +Acceptance criteria: + +- Notebooks run top-to-bottom. +- Notebook outputs match validation report within tolerance. +- Every public-facing artifact links to the issue tracker and feedback instructions. + +### Milestone E — LLM release critique + +Deliverables: + +- `leadforge/validation/llm_critique.py` +- `docs/release/llm_critique_prompt.md` +- `release/validation/llm_critique_raw/*.json` +- `release/validation/llm_critique_summary.md` + +Work: + +- Implement provider abstraction with env-var credentials. +- Create structured critique prompts. +- Feed dataset card, manifests, feature dictionary, validation report, and samples. +- Save raw and summarized findings. + +Acceptance criteria: + +- At least two independent critiques run successfully when credentials are present. +- No unresolved high-severity findings before release. +- LLM critique is optional and skipped cleanly without credentials. + +### Milestone F — Dry run, publish, and feedback loop + +Deliverables: + +- `scripts/publish_kaggle.py` +- `scripts/publish_hf.py` +- `.github/ISSUE_TEMPLATE/dataset_breakage_report.yml` +- `.github/ISSUE_TEMPLATE/realism_feedback.yml` +- `docs/release/v1_release_notes.md` + +Work: + +- Upload private/draft versions. +- Smoke-test downloads and HF loading. +- Publish public versions. 
+- Open a “break this dataset” discussion and issue templates. + +Acceptance criteria: + +- Kaggle page renders with files, metadata, and notebook. +- HF page renders with card, configs, and dataset viewer where supported. +- Download and load examples work from a clean environment. +- Feedback intake is documented. + +## Suggested grading rubric for the final v1 report + +Use this rubric before accepting a future research report: + +1. **Evidence fidelity**: Every claim about the repo or datasets has a file path, line range, command output, or artifact reference. +2. **Current-state accuracy**: The report distinguishes existing, partial, missing, and future work. +3. **Research depth**: The report surveys public lead-scoring datasets, industry lead-scoring practices, synthetic-data evaluation methods, and platform docs. +4. **Platform correctness**: Kaggle/HF instructions are current and tested. +5. **Release specificity**: The roadmap names exact files, commands, artifacts, tests, and acceptance gates. +6. **Pedagogical value**: The report addresses notebooks, instructor mode, leakage teaching, lift curves, calibration, value ranking, and cohort shift. +7. **Adversarial readiness**: The report includes how users should break the dataset and how feedback becomes v2 work. +8. **Citation quality**: All sources are durable and human-readable. + +## Current official documentation corrections to preserve in future work + +The next report should use official platform docs directly: + +- Kaggle’s current metadata docs say the upload folder should include `dataset-metadata.json`; supported fields include `title`, `subtitle`, `description`, `id`, `licenses`, `resources`, `keywords`, `expectedUpdateFrequency`, `userSpecifiedSources`, and `image`. The image guidance uses `dataset-cover-image.png` and a minimum image size of 560×280, with specific header and thumbnail crops. Source: Kaggle API `datasets_metadata.md`, accessed 2026-05-05. 
+- Hugging Face dataset cards are repository `README.md` files with YAML metadata at the top. Metadata helps display license, language, size, tags, and data-file configuration. Source: Hugging Face Hub dataset card docs, accessed 2026-05-05. +- Hugging Face repository structure docs support `configs` and `data_files` in YAML for multiple configurations and splits, which is relevant to intro/intermediate/advanced dataset tiers. Source: Hugging Face Datasets repository structure docs, accessed 2026-05-05. +- Data Cards should document provenance, motivation, dataset overview, sampling, transformations, annotations/labels, validation, sensitivity, limitations, and maintenance. Source: Google Data Cards Playbook, accessed 2026-05-05. + +## Bottom line + +The generated report captured some right themes but did not perform the requested level of inspection or research. It should not be used as the roadmap for leadforge v1. The next iteration should be evidence-first, code-aware, artifact-aware, and release-gated. The best report would read less like generic advice and more like a product/engineering release plan backed by concrete repository findings, generated metrics, platform-ready artifacts, and acceptance criteria. diff --git a/docs/external_review/chatgpt/leadforge_second_attempt_guidance.md b/docs/external_review/chatgpt/leadforge_second_attempt_guidance.md new file mode 100644 index 0000000..226ab95 --- /dev/null +++ b/docs/external_review/chatgpt/leadforge_second_attempt_guidance.md @@ -0,0 +1,1167 @@ +# Guidance for the Second Attempt at the Leadforge v1 Release Report + +**Purpose:** This document is guidance to attach to the second attempt at the original leadforge research/report task. It is not the final leadforge roadmap itself. Its job is to prevent the second attempt from repeating the first attempt’s methodological mistakes and to force an evidence-first, code-aware, release-oriented report. + +**Inputs the second attempt must use:** + +1. 
The original task prompt. +2. The attached Repomix package: `leadforge-repomix-output.xml`. +3. The critique report: `leadforge_report_v1_critique.md`. +4. Current official platform documentation and current public lead-scoring dataset landscape from the web. + +**High-level instruction to the second-attempt author:** +Treat the critique report as a constraint file, not as optional context. Verify its claims against the Repomix package, then produce a final report that is forensic, current, and directly actionable for publishing a v1 educational lead-scoring dataset to Kaggle and Hugging Face. + +--- + +## 1. Non-negotiable objective + +The final second-attempt report must answer the original prompt at the requested depth: + +- Review the current state of leadforge through the code, documentation, tests, release scripts, and alpha / quasi-release dataset assets. +- Review the current state of the generated datasets and release artifacts. +- Conduct a deep research expedition into what a best-in-class synthetic lead-scoring educational dataset should look like on Kaggle and Hugging Face. +- Produce a concrete roadmap to v1 release. +- Provide a project critique: positives, negatives, risks, and opportunities. + +The output should be useful to an implementer who needs to make code and release changes, not merely to a reader who wants a generic strategic overview. + +--- + +## 2. Core lesson from the failed first attempt + +The first report failed mainly because it **under-inspected the actual repository**. It treated implemented components as placeholders, missed release assets that already exist, and produced generic advice instead of a repository-grounded plan. + +The second attempt must therefore be built around an evidence matrix: + +- What exists now? +- Where is it in the repo? +- How mature is it? +- What did dynamic checks show? +- What is missing for v1? +- What exact release artifact, command, test, or code path should be added? 
+ +Do not write the final report from memory, from architectural intent alone, or from platform docs alone. + +--- + +## 3. Mandatory corrections to carry forward + +The final report must explicitly avoid the following false or misleading claims unless a fresh audit proves otherwise. + +### 3.1 Do not say the repo is mostly skeletal + +The critique found evidence of a working end-to-end generation path. The second attempt must verify this. + +Files to inspect carefully: + +- `leadforge/api/generator.py` +- `leadforge/api/bundle.py` +- `leadforge/simulation/population.py` +- `leadforge/simulation/engine.py` +- `leadforge/structure/sampler.py` +- `leadforge/mechanisms/*` +- `leadforge/render/*` +- `leadforge/exposure/*` +- `leadforge/validation/*` + +The correct posture is not “implement the engine.” The correct posture is “audit whether the existing engine is realistic, sufficiently validated, and release-grade.” + +### 3.2 Do not say the CLI is absent + +The critique reports that CLI commands already exist: + +- `leadforge generate` +- `leadforge list-recipes` +- `leadforge inspect` +- `leadforge validate` + +Verify this in: + +- `leadforge/cli/main.py` +- `leadforge/cli/commands/generate.py` +- `leadforge/cli/commands/list_recipes.py` +- `leadforge/cli/commands/inspect.py` +- `leadforge/cli/commands/validate.py` + +The roadmap should focus on CLI hardening and release automation, such as: + +- `leadforge release build` +- `leadforge release validate` +- `leadforge release package-kaggle` +- `leadforge release package-hf` +- `leadforge release publish-kaggle` +- `leadforge release publish-hf` +- `--json` output where missing +- dry-run publishing +- credentials checks +- deterministic artifact checks + +### 3.3 Do not say there is no Hugging Face packaging + +The critique found existing Hugging Face-oriented material: + +- `release/HF_DATASET_CARD.md` +- `release/README.md` + +The correct finding is likely: + +- Hugging Face packaging partially exists. 
+- It needs hardening into a full Hub `README.md` with current metadata, configs, default config, dataset viewer-friendly file structure, examples, and possibly `dataset_info`. +- Kaggle-specific metadata is likely missing or incomplete and must be verified. + +### 3.4 Do not say built-in validation is missing + +The critique found implemented validation layers, including: + +- `leadforge/validation/bundle_checks.py` +- `leadforge/validation/realism.py` +- `leadforge/validation/difficulty.py` +- `leadforge/validation/drift.py` +- validation reports in `lead_scoring_intro/` + +The correct finding is likely: + +- Validation exists. +- It must be raised to release-grade: persisted reports, charts, leakage probes, platform checks, cross-seed checks, LLM critique, acceptance thresholds, and CI gates. + +### 3.5 Do not ignore the `lead_scoring_intro` track + +The original task mentioned the user’s one-CSV intro-course dataset. The attached repo contains a substantial `lead_scoring_intro/` section. + +Inspect: + +- `lead_scoring_intro/RELEASE_v7.md` +- `lead_scoring_intro/BACKGROUND_v7.md` +- `lead_scoring_intro/validation_v7_report.json` +- `lead_scoring_intro/lead_scoring_intro_v7.csv` +- `lead_scoring_intro/lead_scoring_intro_v7_instructor.csv` +- `scripts/build_v7_snapshot.py` +- `scripts/validate_v7_dataset.py` + +The final report should extract lessons from this track, especially: + +- Leakage trap design. +- Student vs instructor versions. +- Teaching sequence. +- Lift and value-aware ranking. +- Cohort split degradation. +- Calibration and leakage-trap metrics. +- How the v7 single-CSV teaching artifact should inform the richer v1 multi-table release. + +### 3.6 Separate the framework from the curated dataset + +The original task has two intertwined products: + +1. **The leadforge framework/package** — API, CLI, recipe system, simulation, validation, release tooling. +2. 
**The curated v1 lead-scoring dataset release** — the best-in-class public educational dataset family generated by the framework. + +The final report must maintain separate lanes: + +- Framework readiness. +- Dataset readiness. +- Platform readiness. +- Documentation/readiness for educators. +- Feedback-loop readiness. + +Do not collapse these into one generic “project roadmap.” + +--- + +## 4. Required methodology for the second attempt + +### Phase 0 — Evidence inventory + +Extract the Repomix XML into a working directory. + +Recommended extraction approach: + +```python +import re +from pathlib import Path + +xml = Path("leadforge-repomix-output.xml").read_text(encoding="utf-8") +out = Path("leadforge_extracted") +out.mkdir(exist_ok=True) + +# Assumes Repomix's XML layout: <file path="...">contents</file>. +for m in re.finditer(r'<file path="([^"]+)">\n(.*?)</file>', xml, re.S): + path = out / m.group(1) + path.parent.mkdir(parents=True, exist_ok=True) + path.write_text(m.group(2), encoding="utf-8") +``` + +Then create an inventory: + +- total files +- Python files +- tests +- docs +- scripts +- release artifacts +- generated CSV/JSON/Markdown files +- recipe files +- validation modules +- notebooks, if any +- CI workflows, if any + +Do not trust the critique’s counts blindly. Recompute them. + +Useful commands: + +```bash +find leadforge_extracted -maxdepth 3 -type f | sort +find leadforge_extracted/leadforge -name "*.py" | wc -l +find leadforge_extracted/tests -name "*.py" | wc -l +grep -R "NotImplementedError\|TODO\|pass$" -n leadforge_extracted/leadforge || true +grep -R "dataset-metadata.json\|HF_DATASET_CARD\|huggingface\|kaggle" -n leadforge_extracted || true +``` + +### Phase 1 — Static code audit + +Audit these areas in separate subsections. + +| Area | Files to inspect | What to determine | +|---|---|---| +| Public API | `leadforge/api/generator.py`, `leadforge/api/bundle.py`, `leadforge/api/recipes.py` | Is generation end-to-end? Are defaults, overrides, exposure modes, and difficulty handled cleanly? 
| +| CLI | `leadforge/cli/*` | Which commands exist? Are `--json`, errors, dry-run, and release commands missing? | +| Core models | `leadforge/core/models.py`, enums, RNG, IDs, hashing | Are data contracts typed and reproducible? | +| Recipes | `leadforge/recipes/*` | What recipe metadata, narrative, schema, difficulty, and motif settings exist? | +| Structure | `leadforge/structure/*` | Are motif graphs sampled and validated? Is rewiring meaningful and documented? | +| Mechanisms | `leadforge/mechanisms/*` | Are hazards, static features, scores, policies, and measurement logic plausible? | +| Simulation | `leadforge/simulation/population.py`, `engine.py`, `world.py`, `state.py` | What is actually simulated? What is simplified? What events are generated? | +| Rendering | `leadforge/render/*`, `leadforge/api/bundle.py` | What tables, snapshots, tasks, dictionaries, manifests, graph exports, cards are written? | +| Exposure | `leadforge/exposure/*` | What is redacted in public mode? Can redacted truths be reconstructed? | +| Validation | `leadforge/validation/*` | What checks exist? Which release-grade checks are missing? | +| Release tooling | `scripts/*`, `release/*` | What already builds platform-ready assets? What is missing? | +| Tests | `tests/*` | How broad is the test coverage? What important release risks are not tested? | +| Intro dataset track | `lead_scoring_intro/*` | What teaching/release lessons should inform v1? | + +For each area, the report should state: + +- What exists. +- Evidence from files. +- Strengths. +- Gaps. +- Release implications. +- Concrete suggested changes. + +### Phase 2 — Dynamic reproducibility audit + +If the environment permits, install and run the project. + +Suggested commands: + +```bash +cd leadforge_extracted + +python -m pip install -e ".[dev]" || python -m pip install -e . 
+python -m pytest --collect-only -q +python -m pytest -q + +leadforge list-recipes + +leadforge generate \ + --recipe b2b_saas_procurement_v1 \ + --seed 42 \ + --mode student_public \ + --difficulty intermediate \ + --n-leads 500 \ + --out /tmp/leadforge_smoke + +leadforge inspect /tmp/leadforge_smoke +leadforge validate /tmp/leadforge_smoke + +python scripts/build_public_release.py \ + /tmp/leadforge_release \ + --generation-timestamp 2026-01-01T00:00:00+00:00 +``` + +Record: + +- Commands run. +- Runtime. +- Exit codes. +- Failures and likely causes. +- Generated file tree. +- Row counts. +- Whether checksums are deterministic. +- Whether validation passes. +- What was not run and why. + +If dynamic checks cannot be run, the final report must say so plainly and must not imply they were run. + +### Phase 3 — Alpha release and dataset forensic audit + +The second attempt should not only read repository READMEs. It should inspect actual release bundles where accessible. + +Minimum checks for each relevant bundle: + +- Manifest schema and row counts. +- Train/valid/test split sizes. +- Class balance by split. +- Flat CSV columns. +- Relational table presence. +- Feature dictionary coverage. +- Public vs instructor exposure diff. +- Redacted-column enforcement. +- Snapshot/label time-window logic. +- Potential leakage via IDs, dates, stage columns, opportunity status, post-snapshot events. +- Account/contact leakage across splits. +- Duplicate leads or near-duplicate rows. +- Train/test distribution shift. +- Metrics: ROC-AUC, PR-AUC, log loss, Brier score, calibration, lift@K, precision@K, top-decile conversion. +- Stronger baselines: logistic regression, tree/GBM, simple target-encoding pipeline where safe, and intentionally “bad” leakage model for demonstration. +- Value-aware ranking: expected ACV / opportunity value if present. +- Cohort shift: time-based split or lead-source split, not only random splits. 
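Two of the ranking metrics named in the checklist above, precision@K and lift@K, are small enough to sketch directly. The function names here are illustrative rather than taken from leadforge, and ties in score are broken by input order because Python's sort is stable.

```python
from typing import Sequence


def precision_at_k(y_true: Sequence[int], scores: Sequence[float], k: int) -> float:
    """Share of true conversions among the k highest-scored leads."""
    if len(y_true) != len(scores):
        raise ValueError("y_true and scores must have the same length")
    if not 0 < k <= len(y_true):
        raise ValueError("k must be between 1 and the number of leads")
    # Rank leads from highest to lowest score and keep the top k labels.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def lift_at_k(y_true: Sequence[int], scores: Sequence[float], k: int) -> float:
    """precision@k divided by the overall conversion rate."""
    base_rate = sum(y_true) / len(y_true)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positives")
    return precision_at_k(y_true, scores, k) / base_rate
```

Top-decile conversion is then `precision_at_k` with `k` set to roughly a tenth of the rows, which keeps the forensic report's headline numbers traceable to one small function.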
+ +The final report should include a forensic subsection for: + +- `leadforge-datasets` alpha release family. +- `lead_scoring_intro` v7 CSV track. +- Any differences between the two. + +### Phase 4 — External research expedition + +The original prompt demands deep research. The second attempt should do a real census and literature/platform review. + +Use current web search. Do not rely on stored knowledge for current platform requirements or current public datasets. + +#### 4.1 Public lead-scoring dataset census + +Search at least: + +- Kaggle +- Hugging Face +- GitHub +- UCI / common educational repositories +- Data.World or other public dataset catalogs if relevant + +Queries to run: + +```text +lead scoring dataset Kaggle +"lead scoring" "X Education" dataset +"lead scoring" Hugging Face dataset +"predictive lead scoring" dataset GitHub +"CRM lead scoring" dataset machine learning +"lead conversion prediction" dataset +``` + +For each dataset found, record: + +- Name and URL. +- Domain. +- Row count. +- Feature count. +- Label definition. +- Flat vs relational. +- Documentation/card quality. +- Baselines if present. +- Known leakage concerns. +- What leadforge can do better. + +The final report should include a competitor/benchmark table, but keep prose outside tables. Tables should use short entries only. + +#### 4.2 Lead-scoring and B2B GTM realism + +Research both academic and industry sources. + +Topics: + +- Predictive lead scoring methods. +- CRM data leakage. +- Conversion funnel stages and stage definitions. +- MQL/SQL/opportunity/closed-won conversion dynamics. +- B2B SaaS sales-cycle durations and ACV distributions. +- Inbound/outbound/partner channel mix. +- Buying committees and multi-threaded accounts. +- Lead-source attribution limitations. +- Lift curves, top-K prioritization, and precision/recall tradeoffs. +- Calibration and score interpretability. + +Prefer high-quality sources: + +- Academic papers. 
+- Vendor docs/articles from credible GTM vendors, used critically. +- Industry benchmarks from sources such as Salesforce, HubSpot, Demandbase, Gartner, Forrester, OpenView, SaaS benchmark reports, or similar if accessible. +- Explain uncertainty when industry benchmarks conflict. + +#### 4.3 Synthetic data generation and evaluation + +Research: + +- Synthetic tabular data quality dimensions: fidelity, utility, privacy. +- TSTR/TRTS evaluation. +- SDMetrics / SDV-style metrics. +- Relational synthetic data metrics. +- Disclosure risk, membership inference, exact-match risk. +- Constraint validation. +- Dataset-level mechanism design / scenario coverage. +- Diversity and difficulty calibration. + +Use sources such as: + +- SDV / SDMetrics documentation. +- Academic papers on synthetic tabular data evaluation. +- Google Data Cards / Dataset documentation standards. +- Datasheets for Datasets. +- Data Cards Playbook. +- Any current work on dataset-level mechanism design or synthetic data evaluation, if relevant. + +#### 4.4 Platform requirements + +Use official docs only for platform packaging details. + +Current platform facts that should be verified at report time: + +- Kaggle metadata requires `dataset-metadata.json` next to uploaded files, follows Data Package style, and supports fields such as `title`, `subtitle`, `description`, `id`, `licenses`, `resources`, `keywords`, `expectedUpdateFrequency`, `userSpecifiedSources`, and `image`. The official Kaggle API docs also describe supported licenses, data types, and image requirements. +- Kaggle cover image guidance currently says `dataset-cover-image.png` / `.jpg` / `.jpeg` / `.webp` can be placed beside `dataset-metadata.json`, with minimum 560×280 dimensions and specified header/thumbnail crops. +- Hugging Face dataset repositories render `README.md` as the dataset card, use YAML front matter for metadata, and support `configs` / `data_files` for splits and subsets. 
+- Hugging Face Datasets can automatically load supported formats such as CSV and Parquet when repository structure and metadata are compatible. + +Primary sources to use and cite: + +- Kaggle API dataset metadata documentation: `https://github.com/Kaggle/kaggle-api/blob/main/docs/datasets_metadata.md` +- Hugging Face Hub dataset card docs: `https://huggingface.co/docs/hub/datasets-cards` +- Hugging Face Datasets repository structure docs: `https://huggingface.co/docs/datasets/repository_structure` +- Hugging Face data files configuration docs: `https://huggingface.co/docs/hub/datasets-data-files-configuration` + +### Phase 5 — Build a gap matrix + +The final report must include a gap matrix. Suggested columns: + +- Area +- Current evidence +- Gap +- Severity +- Recommended fix +- Files/commands affected +- Acceptance criterion + +Example rows the second attempt should verify: + +| Area | Likely current state | Likely gap | +|---|---|---| +| Kaggle packaging | Release README and HF card exist | `dataset-metadata.json` generator likely missing | +| Hugging Face packaging | `release/HF_DATASET_CARD.md` exists | Needs full card, configs, default config, tested `load_dataset()` | +| Validation | Structural/realism/difficulty/drift validators exist | Needs release-quality report, charts, leakage probes, LLM critique | +| Release builder | `scripts/build_public_release.py` exists | Needs platform-specific package/publish commands | +| Teaching assets | v7 intro docs exist | Need polished notebooks for Kaggle/HF | +| Feedback loop | likely informal | Need issue templates, break-me guide, triage labels | +| Metrics | baselines exist | Need calibration, lift, top-K, value ranking, seed stability | + +--- + +## 5. Required final report format + +The second-attempt report should be a single comprehensive Markdown report with this structure. + +```text +# Leadforge v1 Lead-Scoring Dataset Release Plan + +## 0. Executive Summary +- Verdict. +- What is already strong. 
+- Top release blockers. +- Recommended release shape. +- Definition of “v1 ready.” + +## 1. Evidence and Method +- Inputs reviewed. +- Repository extraction method. +- Commands run. +- Web sources used. +- What was not verified. +- Evidence-quality limitations. + +## 2. Current-State Audit of Leadforge +### 2.1 Architecture and design docs +### 2.2 Public API +### 2.3 CLI +### 2.4 Recipe system and difficulty profiles +### 2.5 Hidden graph and motif sampler +### 2.6 Mechanisms and simulation engine +### 2.7 Relational rendering and snapshot task generation +### 2.8 Exposure/redaction modes +### 2.9 Validation suite +### 2.10 Release tooling +### 2.11 Test suite +### 2.12 Documentation + +For each subsection: +- What exists. +- Evidence. +- Strengths. +- Gaps. +- Release implications. + +## 3. Existing Dataset and Alpha Release Forensics +### 3.1 leadforge-datasets alpha release inventory +### 3.2 Intro/intermediate/advanced difficulty tiers +### 3.3 Public vs instructor mode +### 3.4 `lead_scoring_intro` v7 lessons +### 3.5 Baselines, lift, calibration, leakage, and drift +### 3.6 What currently makes the dataset hard/easy to break + +## 4. External Research +### 4.1 Public lead-scoring dataset census +### 4.2 Lead-scoring and B2B GTM realism +### 4.3 Synthetic data generation and evaluation +### 4.4 Dataset documentation standards +### 4.5 Kaggle release requirements +### 4.6 Hugging Face release requirements +### 4.7 Lessons for leadforge + +## 5. Best-in-Class v1 Release Specification +### 5.1 Dataset family shape +### 5.2 File tree +### 5.3 Public bundle contents +### 5.4 Instructor/research companion contents +### 5.5 Kaggle package +### 5.6 Hugging Face package +### 5.7 Dataset cards and documents +### 5.8 Notebooks +### 5.9 Validation report +### 5.10 Feedback and break-me process + +## 6. Gap Matrix +- Repository gaps. +- Dataset gaps. +- Platform gaps. +- Documentation gaps. +- Validation gaps. +- Pedagogical gaps. + +## 7. 
Roadmap to v1 +### Milestone 1: Release audit and acceptance gates +### Milestone 2: Platform package generation +### Milestone 3: Release validation hardening +### Milestone 4: Documentation and notebooks +### Milestone 5: LLM critique integration +### Milestone 6: Dry-run publication +### Milestone 7: Public release and feedback intake + +Each milestone: +- Goal. +- Work items. +- Files likely touched. +- Commands. +- Acceptance criteria. +- Risks. + +## 8. Suggested v2 Feedback Plan +- Break-me guide. +- Issue templates. +- Metrics requested from users. +- Triage process. +- How feedback becomes v2 decisions. +- Explicitly keep leaderboard/LTV out of the v1 milestone. + +## 9. Appendices +- Commands run. +- Candidate release tree. +- Kaggle metadata template. +- Hugging Face README/YAML template. +- Release validation JSON schema. +- LLM critique prompt schema. +- Bibliography. +``` + +--- + +## 6. Required level of concreteness + +The final report should contain PR-sized work, not only broad themes. + +Weak recommendation: + +> Add better validation. + +Strong recommendation: + +> Add `leadforge/validation/release_quality.py` and `scripts/validate_release_candidate.py`. The script should read each bundle’s `manifest.json`, `feature_dictionary.csv`, task splits, and flat CSV; compute ROC-AUC, PR-AUC, Brier score, calibration bins, lift@1%, lift@5%, lift@10%, precision@50/100, leakage-probe metrics, split-shift summaries, redaction checks, and relational rejoin leakage checks; then write `validation/validation_report.json`, `validation/validation_report.md`, and figures. Acceptance: no high-severity leakage probes; metrics within configured difficulty bands; all public/instructor diffs are intentional. + +Weak recommendation: + +> Publish to Kaggle. 
+ +Strong recommendation: + +> Add `scripts/package_kaggle_release.py` that reads bundle manifests and `feature_dictionary.csv`, generates `kaggle/dataset-metadata.json`, copies `dataset-cover-image.png`, writes resource descriptions for flat CSVs and Parquet bundle files, validates title/subtitle/id/license/image dimensions against Kaggle docs, creates a zip, and offers a dry-run command. Acceptance: the package can be uploaded with `kaggle datasets create -p --dir-mode zip` after credentials are configured. + +--- + +## 7. Required platform-correctness guardrails + +The final report must not repeat the first report’s platform inaccuracies. + +### Kaggle + +Use current official documentation. + +Specific guidance to verify and cite: + +- The upload folder should include `dataset-metadata.json` beside data files. +- `title`, `subtitle`, `description`, `id`, `licenses`, `resources`, `keywords`, `expectedUpdateFrequency`, `userSpecifiedSources`, and `image` are supported fields. +- `title` and `subtitle` have length constraints. +- `resources[].schema.fields` should include all fields in order when provided. +- `expectedUpdateFrequency` uses values like `never`, `annually`, `quarterly`, `monthly`, `weekly`, `daily`, `hourly`. +- Cover image minimum is 560×280 according to the current Kaggle API docs, not 1200×400. +- The recommended sibling name is `dataset-cover-image.png` or supported alternatives. + +### Hugging Face + +Use current official documentation. + +Specific guidance to verify and cite: + +- Hub dataset cards are repository `README.md` files. +- YAML metadata block controls visible metadata. +- Use `language`, `license`, `pretty_name`, `tags`, `task_categories`, `size_categories`, and `configs`. +- Include `tags: tabular`, `pandas`, and `datasets` where appropriate. +- Use `configs` and `data_files` for intro/intermediate/advanced and possibly a separate relational bundle config. +- Mark one config as `default: true` if appropriate. 
+- Test `load_dataset()` locally or state that this was not tested. + +--- + +## 8. Dataset-forensics requirements + +The final report should discuss not only whether a model performs well, but whether the dataset is **hard to break**. + +Required probes: + +### 8.1 Direct leakage probes + +- Train models with all features. +- Train without known leakage traps. +- Train using only suspect temporal/stage/opportunity columns. +- Train using IDs or hashed IDs if present. +- Train using post-snapshot aggregates if present. +- Compare performance deltas. + +### 8.2 Time-window leakage + +Check that all public features intended to be pre-snapshot are derived from events at or before `lead_created_at + snapshot_day`. Verify that label resolution uses the full label horizon but not as a feature source, except explicitly documented teaching leakage traps. + +### 8.3 Relational leakage + +The public flat CSV may be safe while relational tables reveal label information. Check: + +- Opportunity status after snapshot. +- Customer/subscription rows that only exist for conversions. +- Sales activities after snapshot. +- Stage tables. +- Join paths that reconstruct `is_sql`, `current_stage`, or terminal states. + +### 8.4 Split leakage + +Check: + +- Same account appearing in train/test. +- Same contact appearing in train/test. +- Near-duplicate leads in different splits. +- Temporal split leakage if lead dates overlap in unrealistic ways. + +This is especially important because real CRM use cases often score future leads from accounts with prior activity. The report should decide whether account overlap is intentional and document it. + +### 8.5 Model realism + +Compute or request: + +- ROC-AUC. +- PR-AUC. +- Brier score. +- Calibration curves. +- Lift curves. +- Precision@K. +- Recall@K. +- Top-decile conversion. +- Expected value captured at K if ACV exists. +- Baseline comparison against naive source-only or engagement-only models. 
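Several of the ranking metrics listed above can be sketched in a few lines. This is a minimal, dependency-free illustration, assuming binary labels and model scores are already available as plain Python sequences. The function names `top_k_metrics` and `brier_score` are hypothetical and are not part of leadforge; a real validation script would more likely rely on an established metrics library.

```python
from typing import Dict, Sequence


def top_k_metrics(y_true: Sequence[int], scores: Sequence[float], k: int) -> Dict[str, float]:
    """Precision@K, recall@K, and lift@K for binary labels and ranking scores.

    Hypothetical helper for illustration only; not a leadforge API.
    """
    if len(y_true) != len(scores):
        raise ValueError("y_true and scores must have the same length")
    if not 0 < k <= len(y_true):
        raise ValueError("k must be between 1 and the number of rows")

    # Rank row indices by descending score (ties keep their input order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    hits = sum(y_true[i] for i in order[:k])   # positives captured in the top K
    positives = sum(y_true)                    # positives in the whole set
    base_rate = positives / len(y_true)        # conversion rate without a model

    precision = hits / k
    recall = hits / positives if positives else 0.0
    # Lift compares the top-K conversion rate with the overall base rate.
    lift = precision / base_rate if base_rate else 0.0
    return {"precision_at_k": precision, "recall_at_k": recall, "lift_at_k": lift}


def brier_score(y_true: Sequence[int], probs: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and binary outcomes."""
    if len(y_true) != len(probs):
        raise ValueError("y_true and probs must have the same length")
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)
```

Reporting lift and precision at a fixed K next to ROC-AUC keeps the evaluation tied to the prioritization use case: a sales team can only work the top of the ranked list, so the top-K conversion rate relative to the base rate is the more decision-relevant number.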
+ +A best-in-class dataset should not be “solved” by a single shortcut, but should still reward better modeling. + +--- + +## 9. Research deliverables that must appear in the final report + +### 9.1 Public lead-scoring dataset benchmark + +The report should include a concise dataset census table and then discuss it in prose. + +Suggested columns: + +- Dataset +- Platform +- Domain +- Rows +- Shape +- Documentation quality +- Main weakness + +Keep table entries short. Put details in prose. + +### 9.2 Documentation benchmark + +Compare leadforge’s intended documentation against: + +- Hugging Face dataset card template. +- Data Cards Playbook. +- Kaggle dataset metadata / README conventions. +- High-quality Kaggle notebooks. + +### 9.3 Synthetic data benchmark + +Discuss what “best ever synthetic lead scoring dataset” should mean operationally: + +- Relational, not just flat. +- Narrative-grounded. +- Deterministic and reproducible. +- Has latent truth in instructor mode. +- Uses time windows correctly. +- Has realistic class imbalance and lift behavior. +- Has multiple difficulty tiers. +- Includes validation reports and notebooks. +- Includes a break-me guide. +- Is transparent about limitations. + +### 9.4 Educational design benchmark + +Discuss what makes it useful for teaching: + +- Intro flat CSV path. +- Advanced relational path. +- Leakage-trap lesson. +- Calibration and lift. +- Class imbalance. +- Feature engineering. +- Temporal validation. +- Instructor-only truth artifacts. +- Assignment/rubric possibilities. +- Student-friendly notebook and instructor guide. + +--- + +## 10. Roadmap requirements + +The roadmap should be written as a sequence of release work packages. Suggested work packages: + +### Milestone A — Evidence-backed current-state audit + +Deliverables: + +- `docs/release/v1_current_state_audit.md` +- regenerated release bundles +- command log +- inventory table + +Acceptance: + +- Full or partial test results recorded. 
+- Release builder behavior verified. +- Existing HF/Kaggle packaging status classified correctly. +- v7 intro lessons captured. + +### Milestone B — Release-candidate specification + +Deliverables: + +- `docs/release/v1_release_spec.md` +- canonical release file tree +- list of public/instructor artifacts +- v1 acceptance gates + +Acceptance: + +- Public and instructor artifact scopes are explicit. +- Out-of-scope items are explicit: no LTV, no leaderboard mini-site, no other GTM task. + +### Milestone C — Platform package automation + +Deliverables: + +- `scripts/package_kaggle_release.py` +- `scripts/package_hf_release.py` +- `release/kaggle/dataset-metadata.json` +- `release/huggingface/README.md` +- `release/dataset-cover-image.png` + +Acceptance: + +- Kaggle metadata validates against official docs. +- HF `load_dataset()` works or known blockers are documented. +- Dry-run command does not require credentials. + +### Milestone D — Release validation hardening + +Deliverables: + +- `leadforge/validation/release_quality.py` +- `leadforge/validation/leakage_probes.py` +- `leadforge/validation/reporting.py` +- `scripts/validate_release_candidate.py` +- `release/validation/validation_report.json` +- `release/validation/validation_report.md` +- figures + +Acceptance: + +- No critical leakage findings. +- Metrics are within configured bands. +- Figures and reports are generated automatically. +- Validation output is included in platform packages. + +### Milestone E — Notebooks and teaching materials + +Deliverables: + +- `notebooks/01_intro_flat_csv_baseline.ipynb` +- `notebooks/02_relational_feature_engineering.ipynb` +- `notebooks/03_leakage_and_time_windows.ipynb` +- `notebooks/04_lift_calibration_value_ranking.ipynb` +- `docs/release/instructor_guide.md` +- `docs/release/student_quickstart.md` + +Acceptance: + +- Notebooks run top-to-bottom. +- Notebook metrics match validation report within tolerance. +- Student vs instructor usage is clear. 
+ +### Milestone F — LLM critique integration + +Deliverables: + +- `leadforge/validation/llm_critique.py` +- `docs/release/llm_critique_prompt.md` +- `release/validation/llm_critique_summary.md` +- raw model-output archive + +Acceptance: + +- Runs with provider credentials. +- Skips gracefully without credentials. +- Produces structured findings with severity, evidence, and suggested fix. +- No unresolved high-severity findings before release. + +### Milestone G — Publish and feedback loop + +Deliverables: + +- `scripts/publish_kaggle.py` +- `scripts/publish_hf.py` +- GitHub issue templates +- break-me guide +- release notes +- public feedback instructions + +Acceptance: + +- Private/draft upload tested. +- Public download/load smoke tests pass. +- Feedback channels are linked from Kaggle, HF, GitHub, and README. + +--- + +## 11. Suggested final release artifact tree + +The second-attempt report should propose a concrete tree similar to this: + +```text +leadforge-v1-lead-scoring/ + README.md + LICENSE + CITATION.cff + CHANGELOG.md + dataset-cover-image.png + + docs/ + DATASET_CARD.md + GENERATION_METHOD.md + VALIDATION_REPORT.md + FEATURE_DICTIONARY.md + BREAK_ME_GUIDE.md + INSTRUCTOR_GUIDE.md + + data/ + intro/ + lead_scoring.csv + train.csv + validation.csv + test.csv + manifest.json + feature_dictionary.csv + intermediate/ + ... + advanced/ + ... 
+ relational/ + intro/ + intermediate/ + advanced/ + + instructor_companion/ + intermediate_instructor/ + metadata/ + graph/ + mechanism_summary.json + latent_registry.json + + validation/ + validation_report.json + validation_report.md + figures/ + lift_curve_intro.png + lift_curve_intermediate.png + lift_curve_advanced.png + calibration_intermediate.png + leakage_delta.png + split_shift.png + + notebooks/ + 01_intro_flat_csv_baseline.ipynb + 02_relational_feature_engineering.ipynb + 03_leakage_and_time_windows.ipynb + 04_lift_calibration_value_ranking.ipynb + + kaggle/ + dataset-metadata.json + README.md + + huggingface/ + README.md +``` + +The final report should decide whether instructor companion artifacts should be publicly downloadable, gated, omitted from Kaggle, or stored separately. It should explain the tradeoff: transparency and reproducibility versus student-exercise leakage. + +--- + +## 12. LLM critique guidance + +The original prompt explicitly requests built-in deep release validation using outside LLMs. The final report should propose this as a concrete module and workflow. + +Suggested input bundle: + +- `README.md` +- `DATASET_CARD.md` +- `GENERATION_METHOD.md` +- `manifest.json` +- `feature_dictionary.csv` +- `validation_report.json` +- first 100 public rows +- public/instructor diff summary +- mechanism summary if instructor mode is available + +Suggested output schema: + +```json +{ + "release_id": "leadforge-lead-scoring-v1", + "model": "provider/model/version", + "run_timestamp": "ISO-8601", + "overall_score": 0, + "findings": [ + { + "severity": "critical|high|medium|low|nit", + "category": "leakage|realism|documentation|platform|ethics|pedagogy|code", + "claim": "...", + "evidence": "file/path:line or artifact reference", + "reproducer": "optional command", + "suggested_fix": "..." + } + ], + "missing_sections": [], + "questions_for_maintainer": [] +} +``` + +Guidelines: + +- Use at least two model/provider families when possible. 
+- Save raw outputs and parsed findings. +- Treat LLM outputs as review inputs, not ground truth. +- Require human adjudication of high-severity findings before release. +- Include LLM critique summaries in the release validation directory. + +--- + +## 13. Citation and evidence standards + +The final report must use durable citations. + +### 13.1 Repository evidence + +Use extracted file paths and line ranges. Example format: + +```text +`leadforge/api/generator.py:L42-L117` +``` + +To create line-numbered extracts: + +```bash +nl -ba leadforge/api/generator.py | sed -n '1,160p' +``` + +When quoting or summarizing repository content, cite exact file paths. Do not cite the whole Repomix package generically for every claim. + +### 13.2 Web evidence + +Use official or primary sources for platform requirements. + +For web citations, include: + +- Title. +- URL. +- Access date. +- Exact fact supported. + +Do not use hidden browser/source IDs as the only citation in a downloadable Markdown report. They are not portable. + +### 13.3 Research evidence + +For academic claims, prefer: + +- original papers +- official documentation +- peer-reviewed or arXiv/OpenReview papers when appropriate + +For industry claims, note that sources are often vendor-authored and may be biased. Use them for plausible ranges and practices, not as hard universal truths. + +### 13.4 Unverified items + +Mark unverified items clearly: + +- “Verified by command.” +- “Verified by static inspection.” +- “Observed in alpha release.” +- “Reported by critique, not independently verified.” +- “Not verified in this pass.” + +The final report should include a “What I did not verify” subsection. + +--- + +## 14. Pitfalls to avoid + +Do not: + +- Call implemented modules placeholders without evidence. +- Recommend implementing commands that already exist. +- Treat Hugging Face packaging as absent without inspecting `release/HF_DATASET_CARD.md`. +- Ignore the intro v7 CSV dataset track. 
+- Use outdated or unofficial Kaggle/HF platform requirements. +- Rely on a single vendor blog as “industry knowledge.” +- Treat AUC as the only success metric. +- Ignore lift, precision@K, calibration, and value-aware ranking. +- Ignore relational leakage. +- Ignore split leakage through accounts/contacts. +- Over-plan LTV or leaderboard work for the v1 milestone. +- Produce a roadmap with no file names, commands, artifacts, or acceptance criteria. +- Use citations that cannot be followed outside the chat environment. +- End with vague “next steps” instead of a concrete PR/release plan. + +--- + +## 15. Recommended stance and tone + +The report should be candid and precise. + +Good stance: + +> Leadforge appears much further along than the first report recognized. The right v1 task is not to create the framework from scratch, but to harden an already functional generator and release pipeline into a best-in-class public dataset product. + +Good stance: + +> The current alpha release already has strong bones: deterministic generation, relational tables, public/instructor exposure, manifests, feature dictionaries, baseline metrics, and validation. The remaining work is release-grade packaging, deeper adversarial validation, stronger documentation, notebooks, and a public feedback loop. + +Bad stance: + +> The simulation engine must be implemented before v1. + +Bad stance: + +> No Kaggle/Hugging Face packaging exists. + +Bad stance: + +> Add validation. + +--- + +## 16. Rubric for accepting the second-attempt report + +Score the second-attempt report against this rubric before using it. + +| Criterion | Pass condition | +|---|---| +| Evidence-first review | Every consequential repo claim has file-path evidence or command evidence. | +| Correct current-state classification | Existing, partial, missing, and out-of-scope items are clearly separated. | +| Dynamic checks | Commands are run where possible; failures and limitations are reported. 
| +| Alpha dataset forensics | Existing releases and v7 intro dataset are analyzed, not just mentioned. | +| External research depth | Public datasets, industry practice, synthetic-data evaluation, and platform requirements are surveyed. | +| Platform accuracy | Kaggle and HF requirements are sourced from current official docs. | +| Actionable roadmap | Milestones include files, commands, artifacts, gates, risks, and acceptance criteria. | +| Pedagogical value | Notebooks, leakage lessons, instructor mode, calibration, lift, and assignments are addressed. | +| Adversarial readiness | Break-me guide, leakage probes, LLM critique, and feedback triage are included. | +| Scope control | LTV, other tasks, and leaderboard are kept out of v1 except as future notes. | +| Citation quality | Citations are durable and human-readable. | +| Honesty | Unverified claims and failed checks are explicitly labeled. | + +--- + +## 17. Minimal acceptable second-attempt report + +If time is constrained, the second attempt must still include: + +1. Corrected current-state audit of the package. +2. Corrected current-state audit of release/HF/Kaggle packaging. +3. Discussion of `lead_scoring_intro` v7. +4. External platform requirements from official docs. +5. Public lead-scoring dataset census. +6. Gap matrix. +7. Roadmap with files, commands, deliverables, and acceptance criteria. +8. Citation/evidence appendix. + +A shorter but accurate and evidence-based report is better than a longer generic report. + +--- + +## 18. Suggested opening thesis for the final report + +The second-attempt report may use a thesis like this, if supported by the audit: + +> Leadforge is not a blank-slate synthetic-data idea. It already appears to contain an end-to-end deterministic CRM world generator, a relational bundle writer, public/instructor exposure modes, validation modules, release scripts, a Hugging Face-style dataset card, and a mature intro-course CSV lineage. 
The v1 milestone should therefore be framed as a release-hardening and evidence-building project: prove the generator’s realism, make leakage and difficulty measurable, produce platform-native Kaggle and Hugging Face packages, ship polished notebooks and data cards, and create a public break-me feedback loop. The roadmap should concentrate on release quality, not on re-implementing core generation. + +Only use this thesis if the actual second-attempt audit verifies it. + +--- + +## 19. Appendix: official platform documentation checked while preparing this guidance + +The second-attempt author should re-check these at report time because platform requirements change. + +- Kaggle API dataset metadata documentation: `https://github.com/Kaggle/kaggle-api/blob/main/docs/datasets_metadata.md` + - Notes: documents `dataset-metadata.json`, supported metadata fields, licenses, data types, update frequencies, and image requirements. +- Hugging Face Datasets repository structure: `https://huggingface.co/docs/datasets/repository_structure` + - Notes: documents repository structures, `README.md`, supported formats, `load_dataset()`, splits, and YAML `configs`. +- Hugging Face Hub dataset cards: `https://huggingface.co/docs/hub/datasets-cards` + - Notes: documents README-based dataset cards and YAML metadata. +- Hugging Face Hub data files configuration: `https://huggingface.co/docs/hub/datasets-data-files-configuration` + - Notes: documents automatic/manual split and subset configuration. + +--- + +## 20. Final instruction to the second-attempt author + +Do not optimize for sounding impressive. Optimize for making the leadforge v1 release shippable. + +The final report should let the project owner answer: + +- What exactly is already working? +- What exactly is missing? +- What exactly should be built next? +- What evidence proves the dataset is realistic, useful, and not trivially broken? +- What files and commands will produce a Kaggle/Hugging Face release? 
+- What artifacts will convince educators, Kaggle users, Hugging Face users, and skeptics that this is a serious synthetic CRM dataset? diff --git a/docs/external_review/gemini/.gitkeep b/docs/external_review/gemini/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/docs/external_review/gemini/gemini_report_v1.md b/docs/external_review/gemini/gemini_report_v1.md new file mode 100644 index 0000000..f966544 --- /dev/null +++ b/docs/external_review/gemini/gemini_report_v1.md @@ -0,0 +1,245 @@ +# **The Architecture of Synthetic Commercial Realities: An Exhaustive Analysis, Critique, and Roadmap for the Leadforge Framework** + +## **Introduction to the Synthetic Data Paradigm in Commercial Machine Learning** + +The intersection of artificial intelligence, customer relationship management (CRM), and go-to-market (GTM) strategy is currently constrained by a fundamental and pervasive data bottleneck. While algorithmic architectures for predictive analytics, classification, and reinforcement learning have matured rapidly over the past decade, the open-source community, alongside academic educators, lacks access to the highly complex, relational, and narratively rich commercial datasets required to train and evaluate these systems effectively.1 Proprietary enterprise datasets, which contain the nuanced behavioral and firmographic signals necessary for advanced predictive modeling, are encumbered by stringent privacy regulations, personally identifiable information (PII) restrictions, and competitive secrecy.3 Consequently, these high-quality datasets are strictly isolated behind corporate firewalls, rendering them entirely unavailable for academic research, educational instruction, or open competitive benchmarking. 
The development of robust machine learning models for critical commercial subproblems—such as lead scoring, lifetime value (LTV) prediction, and churn analysis—has therefore become disproportionately reliant on synthetic alternatives.5 + +However, the existing corpus of open-source synthetic datasets frequently suffers from fatal methodological flaws. The vast majority of these datasets are statistically shallow, failing to capture the intricate causal relationships and multi-touch attribution inherent in real-world buyer journeys.6 They are typically generated using simplistic rule-based engines or basic statistical sampling methods that treat features as independent variables, ignoring the profound conditional dependencies of commercial behavior.8 As a result, machine learning models can easily overfit these datasets or "break" the underlying data generating process (DGP) through trivial heuristics, yielding artificially inflated accuracy metrics that completely fail to generalize to real-world deployment scenarios.7 To advance the field of predictive GTM analytics, the data science community urgently requires synthetic datasets generated from complex, non-trivial DGPs that accurately and holistically simulate commercial worlds. + +The Leadforge framework represents an ambitious and critically necessary initiative to solve this problem. Positioned as an opinionated engine for generating deep, synthetic CRM datasets, Leadforge aims to bridge the gap between algorithmic capability and data availability. The immediate objective of bringing the Leadforge framework to maturity culminates in the release of a definitive, "best-in-class" educational lead scoring dataset to major machine learning platforms, specifically Kaggle and Hugging Face. 
This inaugural v1 release must not only exhibit realistic statistical distributions and withstand rigorous algorithmic probing but also establish a new industry standard for dataset documentation, metadata structuring, and automated validation.10 + +The following document serves as a comprehensive research report, architectural critique, and strategic roadmap. It explores the state-of-the-art in synthetic dataset generation, evaluation methodologies, and publishing standards. It subsequently provides a critical review of the Leadforge project's current state based on its alpha releases and architectural trajectory. Finally, it presents an exhaustive, phased roadmap for the framework's maturation and the successful deployment of its inaugural v1 lead scoring dataset. + +## **Detailed Research Report: The Anatomy of Realistic Synthetic CRM Datasets** + +The foundation of any best-in-class synthetic dataset is its Data Generating Process. The methodology utilized to create the data determines its ultimate utility, realism, and resilience against trivial predictive modeling. In the context of B2B lead scoring, the DGP must transcend simple randomization and embrace causal inference, temporal dynamics, and strict industry benchmarks. + +### **Causal Graphs and the Data Generating Process** + +The core challenge in generating a realistic synthetic lead scoring dataset lies in constructing a DGP that mirrors the intricate, multi-touch nature of modern business-to-business sales cycles. 
Traditional, flat-file synthetic datasets often rely on independent variable generation or simple covariance matrices, which fundamentally fail to capture the directional causal dependencies present in actual CRM environments.6 A truly robust synthetic dataset must be underpinned by a causal graph (often modeled via Directed Acyclic Graphs or DAGs) that dictates how firmographic attributes, technographic signals, and behavioral events interact chronologically.8 + +For instance, the generative logic must understand that a lead's company size and industry directly influence their budgetary constraints, which in turn causally dictate the likelihood of them requesting an enterprise-tier demonstration versus autonomously signing up for a freemium trial.9 If the DGP simply assigns "pricing page views" and "company size" independently, the resulting data will lack narrative coherence, allowing a machine learning model to exploit these statistical anomalies rather than learning genuine commercial intent. The most effective predictive models in the B2B space leverage a hybrid scoring approach, combining explicit demographic fit with implicit behavioral intent.15 Therefore, the synthetic DGP must intertwine these dimensions. A high behavioral engagement score should only strongly predict conversion if the underlying demographic fit (e.g., job title authority, company revenue) meets specific, simulated thresholds.7 + +### **The Peril of Temporal Leakage in Projected Data** + +Furthermore, the simulation of these commercial realities must rigorously account for temporal dynamics to prevent the most pervasive and destructive flaw in predictive modeling: temporal leakage. 
Temporal leakage, a specific subset of data leakage, occurs when future information—data that would definitively not be available at the exact moment a prediction is required in a production environment—inadvertently leaks into the training features.18 + +In the context of lead scoring, the objective is to predict whether a lead will convert into a closed-won deal based solely on the information gathered up to the point of scoring (e.g., the moment they become a Marketing Qualified Lead or MQL). If a synthetic DGP generates "post-event aggregates," such as the total number of sales calls a prospect attended over a six-month period, and includes this in the feature set for an initial lead score calculated on day seven, predictive models will naturally achieve artificially inflated, near-perfect accuracy.20 + +This risk is heavily exacerbated when complex, relational CRM data (often stored across multiple tables or parquet files representing Accounts, Contacts, Leads, and Activities) is "projected" or flattened down into a single, two-dimensional CSV file for ease of use.7 When flattening this data, the generation pipeline must enforce a strict "prediction timestamp" for every single row. 
Any behavioral event, email open, or website visit that occurs chronologically after this timestamp must be aggressively filtered out during the projection process.18 To build a dataset that is "hard to discern or break via a simple prediction model," the generation engine must demand that the data scientist carefully engineer features based on chronologically valid timestamps, forcing them to reconstruct the state of the CRM accurately.20 + +By integrating realistic noise, missing values, and corrupted categorical fields—analogous to the methodologies employed in the widely cited IEEE-CIS Fraud Detection dataset—the synthetic data can mirror the data corruption and input errors typical of human-operated CRM systems.21 This noise injection prevents the model from learning deterministic rules and forces it to rely on probabilistic patterns, thereby significantly increasing the difficulty and pedagogical value of the modeling challenge. + +### **Empirical Benchmarks and Funnel Calibration** + +A synthetic dataset must ultimately be grounded in the empirical realities of the industry it simulates. A common, amateur flaw in synthetic generation is the creation of highly balanced target variables (e.g., simulating a scenario where 50% of leads convert). This entirely misrepresents the severe class imbalance inherent in real-world B2B funnels.23 Industry benchmarks unequivocally dictate that the vast majority of generated leads will never convert into paying customers. The DGP must be carefully calibrated to align with these established industry metrics to ensure the resulting dataset feels authentic to domain experts and provides a mathematically realistic challenge for classification algorithms. 
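To make the imbalance point concrete, here is a minimal sketch using only the Python standard library. The 2% conversion rate, the sample size, and the helper names are assumptions chosen for illustration, not Leadforge defaults. It shows why a balanced target would misrepresent the task: at a low base rate, a "model" that never predicts a conversion already scores very high accuracy while identifying no converters at all.

```python
import random


def simulate_labels(n_leads: int, conversion_rate: float, seed: int = 0) -> list:
    """Draw independent 0/1 conversion labels at the given base rate."""
    rng = random.Random(seed)
    return [1 if rng.random() < conversion_rate else 0 for _ in range(n_leads)]


def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of leads whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true: list, y_pred: list) -> float:
    """Fraction of actual converters that the predictions flag as converters."""
    positives = sum(y_true)
    if positives == 0:
        return 0.0
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives


# Assumed values for illustration only (hypothetical, not taken from Leadforge).
labels = simulate_labels(n_leads=10_000, conversion_rate=0.02)
never_converts = [0] * len(labels)  # trivial baseline: predict "no conversion" for everyone

acc = accuracy(labels, never_converts)
rec = recall(labels, never_converts)
print(f"converted leads: {sum(labels)} of {len(labels)}")
print(f"always-negative accuracy: {acc:.3f}, recall: {rec:.3f}")
```

The gap between the two printed metrics is the reason a calibrated, imbalanced target steers users toward ranking-based evaluation rather than raw accuracy.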
+ +The following table synthesizes current B2B SaaS industry benchmarks regarding funnel conversion rates, which must serve as the foundational constraints for the Leadforge simulation engine: + +| Funnel Stage Transition | Median Conversion Benchmark | High Performer / Top Quartile | Contextual Mechanisms & Generation Constraints | +| :---- | :---- | :---- | :---- | +| Visitor to Lead | 2.0% \- 3.0% | 4.0% \- 6.0% | Highly dependent on user experience and inbound content strategy. Represents the absolute top of the simulated funnel.25 | +| Lead to MQL (Marketing Qualified) | 23.0% | 31.0% \- 41.0% | Reflects basic demographic and explicit behavioral filtering. The simulation must disqualify the majority of raw leads here.27 | +| MQL to SQL (Sales Qualified) | 13.0% | 28.0% \- 40.0% | This is the primary bottleneck and the exact stage where predictive lead scoring proves its value. High performers utilize predictive modeling to achieve these rates.26 | +| SQL to Opportunity (SAL) | 56.0% | 73.0% | Dependent on strict Service Level Agreement (SLA) enforcement and active sales representative engagement.27 | +| Overall Lead to Customer (Closed-Won) | 1.3% (Enterprise) \- 2.7% (SMB) | 5.0%+ | Enterprise funnels exhibit greater leakage due to complex, multi-stakeholder dynamics and prolonged sales cycles.27 | + +The ultimate metric of success for a predictive lead scoring model is not standard binary classification accuracy, nor is it strictly the Area Under the Receiver Operating Characteristic Curve (ROC AUC). 
In commercial reality, marketing and sales operations evaluate models based on their ability to rank leads efficiently, heavily utilizing decile analysis and lift curves.29 Decile analysis categorizes the scored dataset from the highest predicted probability of conversion to the lowest, segmenting the population into ten equal, 10% buckets.30 + +The "lift" is calculated as the cumulative percentage of actual responders (converted leads) captured in a given decile divided by the percentage of the total baseline population that decile represents.30 For example, if a company targets the top two deciles (20% of all leads) and captures 40% of all actual conversions, the model provides a lift of 2.0.30 A highly realistic lead scoring dataset should not allow a trivial, unoptimized model to achieve a massive lift in the first decile. In flat-table B2B lead scoring, basic demographic and behavioral counting typically yields a lift of 2x to 4x over random in the top decile.7 A highly complex, relational dataset—which captures interconnected colleague conversions, account-level buying patterns, and nuanced content progression—should require advanced feature engineering (such as interaction terms, temporal decay factors, and graph-based aggregations) to push the top-decile lift toward the theoretical maximum.7 The synthetic data must obscure the most powerful predictive signals behind this relational complexity. + +### **Advanced Validation Mechanisms: The LLM-as-a-Judge Paradigm** + +Ensuring the absolute highest quality for a v1 Kaggle and HuggingFace release requires validation mechanisms that transcend traditional statistical unit testing. 
While standard tests can verify distribution means, referential integrity, and standard deviations, they are fundamentally incapable of assessing the "narrative realism" of a synthetic commercial world.33 To achieve an unprecedented level of dataset quality, the generation pipeline must integrate Large Language Models (LLMs) acting as automated, semantic judges.34 + +The "LLM-as-a-judge" paradigm involves utilizing powerful foundation models (such as GPT-4-class architectures) to evaluate synthetic outputs against complex, multi-dimensional rubrics.34 This approach has been empirically validated in recent research to correlate highly with human expert judgment, offering a level of scalability and consistency that manual review processes simply cannot match.37 In the context of the Leadforge dataset, an LLM judge must be deployed to critique both the generated tabular data and the accompanying metadata documentation.40 + +For tabular data validation, the LLM evaluator is prompted with a structured sample of lead trajectories. A trajectory consists of the chronological sequence of a lead's interactions, their firmographic background, and their eventual conversion status. 
The LLM is instructed via a meticulously engineered prompt to evaluate the sequence for logical coherence, behavioral plausibility, and the absence of contradictory actions.33 For example, the judge would flag a narrative anomaly if a synthesized lead from a two-person, newly founded startup begins behaving exactly like a Fortune 500 procurement officer, or if a lead generates extensive "pricing page views" chronologically after a "closed-lost" status is formally recorded in the CRM.18 + +The successful implementation of LLM-as-a-judge requires careful prompt engineering and the definition of strict evaluation criteria.35 Modern observability frameworks, such as DeepEval, provide a highly structured programmatic environment to define these metrics, allowing for continuous integration testing of the dataset's semantic quality.42 The evaluator must be configured with a specific evaluation prompt, a carefully selected underlying model, precise variable mapping, and a rubric that defines what constitutes "realistic" behavior in a specific B2B SaaS context.35 By implementing this mechanism, the Leadforge framework can iteratively self-correct, automatically rejecting or adjusting DGPs that produce unnatural commercial narratives before the dataset is ever compiled for public release.40 + +### **Publishing Standards, Metadata Specifications, and Data Cards** + +Publishing a dataset that achieves "Gold Tier" status on Kaggle or widespread adoption and high download metrics on the HuggingFace Hub requires meticulous attention to formatting, accessibility, and metadata documentation.46 Both platforms possess distinct paradigms and structural requirements that must be satisfied simultaneously to maximize the dataset's educational impact. 
+ +On Kaggle, the primary unit of tabular data consumption is the Comma-Separated Values (CSV) file.10 While relational databases and multi-parquet formats hold immense value for advanced data engineering and complex system representation, Kaggle's internal data explorer, preview metrics, and community expectations are heavily optimized for flat CSVs.10 To cater to the widest possible audience—ranging from novice students to Grandmasters seeking a rapid challenge—the optimal architectural approach is to provide the dataset in dual formats. The release should contain both a deeply relational schema (e.g., separate files for Leads, Activities, Accounts, Opportunities) and a high-quality, pre-joined "projected" CSV that serves as the immediate entry point for rapid predictive modeling.23 + +The absolute hallmark of a world-class dataset is the inclusion of an exhaustive "Data Card." Inspired by the Model Cards framework popularized by researchers like Mitchell et al., the Data Card provides critical context regarding the dataset's provenance, construction methodology, structural limitations, and intended use cases.51 An exhaustive Data Card for a synthetic dataset must explicitly address the underlying DGP, detailing the overarching assumptions made during synthetic generation without revealing the exact mathematical weights or decision boundaries that would trivialize the prediction challenge.6 It must meticulously outline the feature definitions, highlight any injected noise or intentional class imbalances, and clearly explain the chronological structure to guide users away from accidental temporal leakage during their feature engineering processes.19 + +Furthermore, platform-specific metadata integration is non-negotiable for discoverability. On HuggingFace, the README.md file must contain a highly specific YAML header block. 
This block drives the platform's search algorithms, categorization engines, and dataset filtering capabilities.11 + +| HuggingFace YAML Key | Specification Context and Purpose | Required / Recommended Status | +| :---- | :---- | :---- | +| language | Standard ISO 639-1 code (e.g., en for English text fields). Essential for NLP and text-based tabular features. | Required 11 | +| license | Open-source identifier (e.g., mit, apache-2.0, cc-by-4.0). Dictates downstream commercial and academic usability. | Required 11 | +| task\_categories | Defines the primary ML objective (e.g., tabular-classification). Drives discovery via task-based filtering. | Required 11 | +| tags | Keywords for nuanced discoverability (e.g., synthetic, crm, lead-scoring, b2b). | Recommended 11 | +| pretty\_name | The human-readable title of the dataset displayed on the Hub UI. | Recommended 11 | +| datasets / base\_model | Links to related artifacts or origin models, though less applicable for purely programmatic synthetic generation. 
| Optional 55 | + +Conversely, Kaggle relies on a dataset-metadata.json file for programmatic uploads, API interactions, and dataset versioning.56 This JSON schema strictly defines the dataset's title, unique URL slug, designated license, and local file paths.57 + +To achieve elite status within the Kaggle community, the dataset release must also be accompanied by a comprehensive "Starter Notebook".47 The most highly regarded introductory notebooks provide a rich narrative flow.59 They conduct thorough exploratory data analysis (EDA), explain the underlying business context of lead scoring to practitioners unfamiliar with CRM mechanics, explicitly highlight potential pitfalls like train-test contamination, and establish a credible baseline model using contemporary algorithms such as XGBoost or LightGBM.49 + +### **Automated Deployment Pipelines via MLOps Infrastructure** + +The explicitly stated requirement that the dataset be published via an automated, "one-or-two command" process necessitates a robust Continuous Integration/Continuous Deployment (CI/CD) pipeline. 
This is typically orchestrated via GitHub Actions, bridging the local or cloud-based dataset generation engine with the respective public repositories.61 + +For HuggingFace integration, the official huggingface\_hub Python library provides native, highly documented methods (such as create\_repo, upload\_file, and upload\_folder) to programmatically manage datasets without relying on manual git operations.63 A GitHub Action workflow can authenticate securely using a repository-scoped secret (HF\_TOKEN), generate the dataset artifacts in a headless environment, dynamically construct the YAML-infused README.md file, and push the entire finalized directory to the Hub.61 + +Simultaneously, the Kaggle Command Line Interface (CLI) provides the kaggle datasets create and kaggle datasets version commands.57 The automated CI/CD pipeline must be capable of generating the required dataset-metadata.json file on the fly, injecting the Kaggle API credentials (KAGGLE\_USERNAME, KAGGLE\_KEY) securely via encrypted environment variables, and executing the upload sequence seamlessly.58 By entirely automating the publication layer, the Leadforge framework allows the educator or curator to focus exclusively on iterating the mathematical DGP and refining the commercial narrative, completely abstracting the operational friction of multi-platform distribution. + +## **Review and General Critique of the Leadforge Project** + +An analysis of the current state of the Leadforge project—incorporating its stated programmatic goals, the existence of alpha quasi-releases, and its envisioned architectural trajectory—reveals a framework with immense transformative potential. However, it also highlights several critical areas requiring structural maturation before a definitive v1 release can be executed successfully. 
+ +### **Positive Attributes and Core Strengths** + +The primary and most significant strength of Leadforge is its fundamental premise: acknowledging that high-quality, open-source B2B CRM datasets are essentially non-existent, and that simplistic random-number generation is radically insufficient for complex educational and competitive modeling \[User Query\]. The commitment to establishing an "opinionated" framework that embeds deep narrative consistency and complex Data Generating Processes directly addresses the primary weakness of contemporary synthetic data literature.6 + +The successful projection of a complex, multi-parquet output format into a single, highly usable CSV for an introductory Machine Learning course demonstrates that the project already possesses a highly functional operational foundation \[User Query\]. This dual-format capability—maintaining a deep, normalized relational structure in the backend while simultaneously offering a projected, flattened view for ease of use—is excellent. It precisely mirrors the reality of modern enterprise data engineering, where raw data lakes (comprising relational tables and event streams) are eventually transformed into feature stores (flattened, aggregated dataframes) for consumption by data scientists.7 + +Furthermore, the strategic decision to target Kaggle and HuggingFace simultaneously, paired with an open invitation for the global data science community to actively attempt to "break" the dataset, is a brilliant mechanism for adversarial stress-testing \[User Query\]. Kaggle's highly competitive ecosystem is unrivaled in its collective ability to ruthlessly expose data leakage, overfitting vulnerabilities, and subtle statistical anomalies.24 Embracing this adversarial testing environment will rapidly accelerate the framework's mathematical sophistication and expose blind spots in the simulated commercial logic. 
+ +### **Architectural and Structural Critiques** + +Despite its evident strengths, the current alpha state and stated methodologies exhibit several areas that require substantial architectural refinement to meet the ambition of delivering the "hands-down, best ever synthetic lead scoring dataset." + +First, the management of temporal dynamics in the dataset projection logic must be rigorously scrutinized and fundamentally overhauled if necessary. As established in the research section, the projection of multi-parquet relational data into a single CSV is inherently dangerous regarding temporal leakage.18 When disparate activities, changing firmographic states, and historical engagements are flattened into a single row representing a single lead, it is exceptionally easy to accidentally aggregate events that occurred *after* the target conversion date.20 Leadforge's internal architecture must enforce a strict "point-in-time" rendering engine for its projected CSVs. This engine must demand a predefined prediction timestamp and guarantee mathematically that all aggregate features (e.g., recent\_email\_opens, total\_website\_sessions) are calculated strictly prior to that arbitrary timestamp.18 Without this programmatic guarantee, the framework will inevitably generate flawed teaching materials. + +Second, the complexity of the internal DGP must be explicitly designed to ensure that the resulting dataset cannot be easily "solved" using linear models or basic demographic heuristics. 
If a simple Logistic Regression model utilizing only two features (such as job\_title and company\_size) achieves a top-decile lift of 4x (capturing 40% of all simulated conversions in the top 10% of scored leads), the dataset is demonstrably too simplistic.7 The real commercial world requires complex, hybrid models where implicit intent drives the score.15 The framework must weave complex, non-linear interaction effects into the DGP, forcing users to engineer advanced interaction features or deploy sophisticated gradient boosting architectures (e.g., XGBoost, LightGBM) to capture the underlying relationships and extract meaningful predictive lift.49 The API must allow the user to define the strength of these non-linearities. + +Finally, the project currently lacks the automated, LLM-driven critique layer outlined in the user's future vision. Relying solely on standard unit tests or statistical bounds checking to validate an alpha dataset is insufficient for ensuring narrative depth.33 The absence of an automated mechanism to continuously evaluate the semantic logic of the simulated actors means that as the DGP scales in complexity, the risk of generating absurd commercial behaviors increases exponentially. The framework is currently flying blind regarding the "story" its data tells. + +### **API and Specification Design Considerations** + +To support the automated creation of world-class datasets by educators and curators, the Leadforge Python API must transition toward a highly declarative architecture, tightly integrated with metadata generation. The architecture should abstract the underlying mathematical DGP parameters from the core execution engine. 
Users (or the framework itself during automated runs) should be able to define the entire "world state"—including base conversion rates, complex feature correlation matrices, time-decay functions, and noise injection levels—via structured configuration files, preferably utilizing YAML or JSON schemas.12 + +The API must also seamlessly handle the automated, real-time construction of the HuggingFace README.md and Kaggle dataset-metadata.json files.57 This requires the framework to possess an intelligent internal data dictionary that continuously tracks feature definitions, data types, value distributions, and missing value percentages as the synthetic data is generated. The framework must automatically compile these tracked metrics into the rich Data Cards required by the publishing platforms.10 By embedding the documentation generation step directly into the data compilation pipeline, Leadforge guarantees that the metadata remains perfectly synchronized with the underlying data, permanently eliminating the documentation drift that plagues nearly all open-source synthetic datasets. + +## **Suggested Roadmap for the v1 Dataset Release** + +To navigate the critical transition from internal alpha quasi-releases to a globally recognized, best-in-class v1 dataset deployed seamlessly to Kaggle and HuggingFace, a phased, highly structured engineering roadmap is required. This roadmap prioritizes mathematical rigor, automated semantic validation, comprehensive metadata generation, and CI/CD deployment execution. + +### **Phase 1: Framework Maturation and DGP Refinement** + +The foundational phase must prioritize the absolute integrity, realism, and computational difficulty of the synthetic data generation engine. The dataset must be fundamentally immune to basic methodological errors before any publication pipelines are considered. 
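Temporal leakage is the clearest example of the methodological errors this phase targets. The following is a minimal sketch of a point-in-time projection, assuming a hypothetical `Event` record and a hypothetical `project_lead` helper (neither is an existing Leadforge API). It flattens one lead's events into a feature row while counting only events that occurred strictly before the prediction timestamp.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """Hypothetical activity record; field names are illustrative."""
    lead_id: str
    event_type: str
    occurred_at: datetime


def project_lead(lead_id: str, events: list, prediction_timestamp: datetime) -> dict:
    """Build one flat feature row from events strictly before the cutoff."""
    own = [e for e in events if e.lead_id == lead_id]
    # Strict inequality: an event at exactly the cutoff is treated as future data.
    visible = [e for e in own if e.occurred_at < prediction_timestamp]
    counts = Counter(e.event_type for e in visible)
    return {
        "lead_id": lead_id,
        "prediction_timestamp": prediction_timestamp.isoformat(),
        "total_email_opens": counts["email_open"],
        "total_website_sessions": counts["website_session"],
        "events_excluded_as_future": len(own) - len(visible),
    }


# Made-up events: two fall after the cutoff and must not reach the feature row.
events = [
    Event("L1", "email_open", datetime(2026, 1, 2)),
    Event("L1", "website_session", datetime(2026, 1, 3)),
    Event("L1", "website_session", datetime(2026, 1, 10)),
    Event("L1", "sales_call", datetime(2026, 2, 1)),
    Event("L2", "email_open", datetime(2026, 1, 1)),
]

row = project_lead("L1", events, prediction_timestamp=datetime(2026, 1, 7))
print(row)
```

Recording how many events were excluded gives a test suite something concrete to assert on, which is one way to turn the "no future data" requirement into a checkable property.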
+ +* **Implement a Temporal Integrity Engine:** Develop an ironclad "time-travel" prevention mechanism within the Leadforge data projection logic. The API must require the explicit definition of a prediction\_timestamp for every simulated lead. Ensure that the internal pipeline responsible for flattening the relational multi-parquet data into the final CSV strictly and provably filters out any behavioral events, state changes, or intent signals that occur chronologically after the prediction timestamp.18 This is the most critical programmatic safeguard for the framework. +* **Execute Deep Funnel Calibration:** Calibrate the baseline conversion rates within the DGP to accurately mirror empirical B2B SaaS realities. Define configurable default constraints ensuring the overall simulated lead-to-customer conversion rate rests realistically between 1% and 4%, and the critical MQL-to-SQL rate rests between 15% and 30%.25 Severe class imbalance must be a defining characteristic of the default dataset generation. +* **Inject Non-Linear Complexity (Interaction Effects):** Enhance the underlying statistical DGP to ensure simple linear combinations of features cannot perfectly separate the target classes. Introduce complex conditional probability layers: for instance, ensure that high website engagement is highly predictive of conversion *only if* the company revenue feature exceeds a certain threshold, simulating the profound difference between qualified enterprise buyers and unqualified academic researchers browsing a product.7 +* **Simulate Systemic Noise and Corruption:** Deliberately inject realistic CRM noise into the final output. 
Introduce randomized missing values (e.g., leads lacking phone numbers or accurate job titles), typographical errors in categorical string fields, and slight timing desynchronizations to simulate delayed CRM API syncing.21 This forces end-users to practice realistic data cleaning, imputation, and feature engineering, elevating the dataset's educational value. + +### **Phase 2: Implementation of Automated LLM-as-a-Judge Validation** + +Before any data is prepared for formatting and public release, it must successfully pass an automated, intelligent quality assurance gauntlet designed to verify narrative realism. + +* **Integrate the LLM Evaluator Architecture:** Integrate a robust observability and evaluation framework, such as DeepEval, directly into the Leadforge continuous integration test suite.42 Configure the framework to utilize a powerful, instruction-tuned judge model (e.g., GPT-4o or Claude 3.5 Sonnet) via standard API connections. +* **Design Tabular Realism Prompting Rubrics:** Design a highly specific, reference-less scoring rubric tailored for B2B environments. The automated test suite will randomly sample hundreds of "lead journeys" (the chronologically ordered events, state changes, and attributes of a single lead) and pass them to the LLM. The LLM must be prompted to score them on a strict scale of 1-10 for logical coherence, behavioral plausibility, and narrative consistency.34 +* **Establish Hard Metric Thresholds:** Establish a CI/CD rule that categorically fails the data generation build if the LLM judge's average realism score falls below a predetermined threshold, or if it detects flagrant logical impossibilities (e.g., logging an event where "The lead requested an advanced pricing discussion prior to ever visiting the website or opening an email").41 + +### **Phase 3: Documentation, Metadata, and Data Card Construction** + +A best-in-class dataset is ultimately defined by the quality and exhaustiveness of its documentation. 
This phase automates the creation of the explanatory wrapper, ensuring zero documentation drift. + +* **Automate Data Dictionary Generation:** Extend the core Leadforge API to automatically output comprehensive feature descriptions, inferred data types, standard deviations, and missing value percentages based directly on the generated output dataframe.10 +* **Construct the Kaggle Metadata Schema Generator:** Create a dedicated Python module that dynamically writes the dataset-metadata.json file. It must populate the title, id (the repository slug), licenses, and precise local file paths automatically based on the specific run configuration.57 +* **Develop the HuggingFace YAML and Markdown Builder:** Create a parallel module that constructs the README.md file formatted specifically for the HuggingFace Hub. It must programmatically inject the mandatory YAML metadata block (language, license, task\_categories: tabular-classification, tags).11 Below the YAML header, the module should automatically generate the Markdown body for the Data Card, detailing the DGP assumptions, dataset overview, and intended pedagogical motivations.51 +* **Execute LLM Document Critique:** Feed the final, automatically generated Data Card back to the LLM judge. Prompt the LLM to adopt the persona of a Kaggle Grandmaster reviewing the documentation for clarity, completeness, and formatting standards, automatically suggesting or directly applying stylistic revisions.34 + +### **Phase 4: CI/CD Pipeline Integration and Deployment Orchestration** + +The execution of the dataset release must be abstracted away from the developer and reduced to a single automated command sequence. 
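A minimal sketch of what such a single-command entry point might look like. The function names, the `example-user/leadforge-demo` identifier, the tag list, and the CC0 license choice are all hypothetical placeholders. The sketch writes a Kaggle `dataset-metadata.json` and a Hugging Face `README.md` with a YAML header, and defaults to a dry run so that nothing is uploaded unless explicitly requested. The real upload path assumes the `kaggle` CLI and the `huggingface_hub` package are installed and that `KAGGLE_USERNAME`, `KAGGLE_KEY`, and `HF_TOKEN` are set in the environment.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def write_kaggle_metadata(folder: Path, owner: str, slug: str, title: str, license_name: str) -> Path:
    """Write dataset-metadata.json with the title, id, and licenses fields."""
    meta = {"title": title, "id": f"{owner}/{slug}", "licenses": [{"name": license_name}]}
    path = folder / "dataset-metadata.json"
    path.write_text(json.dumps(meta, indent=2))
    return path


def write_hf_readme(folder: Path, pretty_name: str, license_id: str, tags: list) -> Path:
    """Write README.md with a YAML metadata header followed by a stub card body."""
    header = ["---", "language:", "- en", f"license: {license_id}",
              "task_categories:", "- tabular-classification", "tags:"]
    header += [f"- {tag}" for tag in tags]
    header += [f"pretty_name: {pretty_name}", "---", "", f"# {pretty_name}", ""]
    path = folder / "README.md"
    path.write_text("\n".join(header))
    return path


def publish(folder: Path, hf_repo_id: str, dry_run: bool = True) -> list:
    """Return the planned steps; only touch the network when dry_run is False."""
    steps = [
        f"kaggle datasets create -p {folder}",
        f"HfApi().upload_folder(folder_path={folder}, repo_id={hf_repo_id}, repo_type='dataset')",
    ]
    if not dry_run:
        from huggingface_hub import HfApi  # third-party dependency, imported lazily
        # First release uses "create"; later releases would use "kaggle datasets version".
        subprocess.run(["kaggle", "datasets", "create", "-p", str(folder)], check=True)
        HfApi().upload_folder(folder_path=str(folder), repo_id=hf_repo_id, repo_type="dataset")
    return steps


# Dry-run demonstration in a temporary folder with placeholder identifiers.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    meta_path = write_kaggle_metadata(folder, "example-user", "leadforge-demo",
                                      "Leadforge Demo", "CC0-1.0")
    readme_path = write_hf_readme(folder, "Leadforge Demo", "cc0-1.0",
                                  ["synthetic", "crm", "lead-scoring", "b2b"])
    kaggle_meta = json.loads(meta_path.read_text())
    readme_text = readme_path.read_text()
    planned = publish(folder, "example-user/leadforge-demo", dry_run=True)

print(kaggle_meta)
print(planned)
```

Keeping metadata generation separate from the upload step means the generated files can be validated in CI without credentials, with the network calls reserved for the release job.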
+ +* **Establish Secure Credential Management:** Establish environment variables and GitHub Secrets for HF\_TOKEN, KAGGLE\_USERNAME, and KAGGLE\_KEY to ensure secure, headless authentication during the deployment process.61 +* **Orchestrate GitHub Actions Workflows:** Write a comprehensive .github/workflows/publish-dataset.yml workflow file. Upon triggering, this action must sequentially: + 1. Spin up an appropriately sized cloud compute runner. + 2. Install the internal Leadforge package and all external dependencies (specifically including the kaggle package, which provides the CLI used in step 6, and huggingface\_hub).63 + 3. Execute the core Leadforge dataset generation script, producing the multi-parquet and projected CSV files. + 4. Run the deep LLM-as-a-judge validation suite, halting the pipeline upon failure. + 5. Generate the metadata schemas and Data Cards. + 6. Execute the kaggle datasets create (or version for updates) CLI command to push the artifacts to Kaggle.57 + 7. Execute the HfApi().upload\_folder() Python method to push the identical artifacts and documentation to the HuggingFace Hub.64 + +### **Phase 5: Launch Strategy and Community Engagement** + +The final phase ensures the dataset is actually adopted, rigorously challenged, and recognized by the wider machine learning community. + +* **Develop the Masterclass Starter Notebook:** The v1 dataset must definitively not be released in a vacuum. The Leadforge curator must generate a companion Jupyter/Colab notebook to be published alongside the dataset on Kaggle and HuggingFace.59 This notebook must: + * Load the projected CSV and conduct high-quality EDA using advanced visualization libraries.49 + * Explicitly explain the concept of temporal leakage and demonstrate mathematically how the dataset's architecture prevents it. 
+ * Build a credible baseline model using XGBoost or LightGBM.49 + * Calculate and prominently plot a decile lift chart to evaluate the baseline model, establishing the initial benchmark score that the community must attempt to surpass.29 +* **Initiate the Public Adversarial Challenge:** In the dataset descriptions, READMEs, and associated platform forums, explicitly challenge the community to "break" the underlying DGP. Actively ask users to search for hidden temporal leaks, exploit the injected noise, and utilize hyper-parameter tuning to achieve maximum top-decile lift.48 This public, adversarial engagement strategy represents the fastest and most rigorous path to identifying structural flaws and designing an even more robust v2 DGP for the Leadforge framework. + +#### **Works cited** + +1. REFINING B2B CRM SYSTEMS CUSTOMER DATA USING MACHINE LEARNING, accessed May 5, 2026, [https://jyx.jyu.fi/bitstreams/ab1718fe-f0b6-403f-af9d-e5da146ec05c/download](https://jyx.jyu.fi/bitstreams/ab1718fe-f0b6-403f-af9d-e5da146ec05c/download) +2. A multi-factor machine learning framework for predicting and profiling student academic performance using behavioral, financial, and wearable data \- PMC, accessed May 5, 2026, [https://pmc.ncbi.nlm.nih.gov/articles/PMC12642127/](https://pmc.ncbi.nlm.nih.gov/articles/PMC12642127/) +3. What is Synthetic Data Generation? A Practical Guide \- K2view, accessed May 5, 2026, [https://www.k2view.com/what-is-synthetic-data-generation/](https://www.k2view.com/what-is-synthetic-data-generation/) +4. Report: Using Synthetic Data in Financial Services, accessed May 5, 2026, [https://www.fca.org.uk/publication/corporate/report-using-synthetic-data-in-financial-services.pdf](https://www.fca.org.uk/publication/corporate/report-using-synthetic-data-in-financial-services.pdf) +5. 
\[2603.29791\] Reasoning-Driven Synthetic Data Generation and Evaluation \- arXiv, accessed May 5, 2026, [https://arxiv.org/abs/2603.29791](https://arxiv.org/abs/2603.29791) +6. Synthetic Data Generation Benchmark \- AIMultiple, accessed May 5, 2026, [https://aimultiple.com/synthetic-data-generation](https://aimultiple.com/synthetic-data-generation) +7. The Complete Guide to Lead Scoring: From Point Systems to Predictive ML, accessed May 5, 2026, [https://kumo.ai/resources/learn/guide/lead-scoring-complete-guide/](https://kumo.ai/resources/learn/guide/lead-scoring-complete-guide/) +8. Making Causal Discovery work in real-world business settings | Towards Data Science, accessed May 5, 2026, [https://towardsdatascience.com/making-causal-discovery-work-in-real-world-business-settings-80e80c5f66b8/](https://towardsdatascience.com/making-causal-discovery-work-in-real-world-business-settings-80e80c5f66b8/) +9. A Causal Approach for Business Optimization: Application on an Online Marketplace \- arXiv, accessed May 5, 2026, [https://arxiv.org/pdf/2207.01722](https://arxiv.org/pdf/2207.01722) +10. How To Use Kaggle: Datasets, accessed May 5, 2026, [https://www.kaggle.com/docs/datasets](https://www.kaggle.com/docs/datasets) +11. Dataset Cards \- Hugging Face, accessed May 5, 2026, [https://huggingface.co/docs/hub/datasets-cards](https://huggingface.co/docs/hub/datasets-cards) +12. Generating Synthetic Data with Preserved Higher-Order Correlations \- arXiv, accessed May 5, 2026, [https://arxiv.org/html/2510.21610v1](https://arxiv.org/html/2510.21610v1) +13. Safeguarding Demand Forecasting with Causal Graphs | Towards Data Science, accessed May 5, 2026, [https://towardsdatascience.com/safeguarding-demand-forecasting-with-causal-graphs-591511fc8e0e/](https://towardsdatascience.com/safeguarding-demand-forecasting-with-causal-graphs-591511fc8e0e/) +14. 
Right-size your lead scoring \- Clearbit, accessed May 5, 2026, [https://clearbit.com/resources/books/lead-qualification/lead-scoring-stages](https://clearbit.com/resources/books/lead-qualification/lead-scoring-stages) +15. B2B Lead Scoring 101 For Small Businesses In 2026, accessed May 5, 2026, [https://www.thesmallbusinessexpo.com/blog/b2b-lead-scoring/](https://www.thesmallbusinessexpo.com/blog/b2b-lead-scoring/) +16. Guide to Lead Scoring in SaaS & B2B \- Insights by Ortto, accessed May 5, 2026, [https://ortto.com/learn/what-is-lead-scoring/](https://ortto.com/learn/what-is-lead-scoring/) +17. Understand the lead scoring tool \- HubSpot Knowledge Base, accessed May 5, 2026, [https://knowledge.hubspot.com/scoring/understand-the-lead-scoring-tool](https://knowledge.hubspot.com/scoring/understand-the-lead-scoring-tool) +18. Understanding Data Leakage in Machine Learning | by Aziz Özmen Ph.D. \- Medium, accessed May 5, 2026, [https://medium.com/@azizozmen/understanding-data-leakage-in-machine-learning-c04ea4e72bc6](https://medium.com/@azizozmen/understanding-data-leakage-in-machine-learning-c04ea4e72bc6) +19. Data Leakage in Machine Learning: Prevention Guide & Security \- Northhaven Analytics, accessed May 5, 2026, [https://northhavenanalytics.com/definitive-guide-data-leakage-machine-learning-prevention/](https://northhavenanalytics.com/definitive-guide-data-leakage-machine-learning-prevention/) +20. Will You Spot the Leaks? A Data Science Challenge, accessed May 5, 2026, [https://towardsdatascience.com/will-you-spot-the-leaks-a-data-science-challenge/](https://towardsdatascience.com/will-you-spot-the-leaks-a-data-science-challenge/) +21. RABEM: risk-adaptive Bayesian ensemble model for fraud detection \- PMC, accessed May 5, 2026, [https://pmc.ncbi.nlm.nih.gov/articles/PMC12540735/](https://pmc.ncbi.nlm.nih.gov/articles/PMC12540735/) +22. 
Fraud Dataset Benchmark and Applications \- arXiv, accessed May 5, 2026, [https://arxiv.org/html/2208.14417v3](https://arxiv.org/html/2208.14417v3) +23. Lead Scoring X Online Education \- Kaggle, accessed May 5, 2026, [https://www.kaggle.com/datasets/lakshmikalyan/lead-scoring-x-online-education](https://www.kaggle.com/datasets/lakshmikalyan/lead-scoring-x-online-education) +24. The ecosystem of machine learning competitions: Platforms, participants, and their impact on AI development \- arXiv, accessed May 5, 2026, [https://arxiv.org/html/2604.08001v1](https://arxiv.org/html/2604.08001v1) +25. B2B sales conversion rate by industry: benchmarks, formulas, and optimization tactics \- Zeliq, accessed May 5, 2026, [https://www.zeliq.com/blog/b2b-conversion-rates-by-industry](https://www.zeliq.com/blog/b2b-conversion-rates-by-industry) +26. Benchmarks for Digital Marketing in B2B Lead Gen | SalesHive Blog, accessed May 5, 2026, [https://saleshive.com/blog/b2b-lead-benchmarks-digital-marketing-gen/](https://saleshive.com/blog/b2b-lead-benchmarks-digital-marketing-gen/) +27. B2B Lead Generation Statistics 2026: 180 Data Points \- Digital Applied, accessed May 5, 2026, [https://www.digitalapplied.com/blog/b2b-lead-generation-statistics-2026-data-points](https://www.digitalapplied.com/blog/b2b-lead-generation-statistics-2026-data-points) +28. B2B Lead Conversion Rates: 2026 Benchmarks by Stage \- Prospeo, accessed May 5, 2026, [https://prospeo.io/s/b2b-lead-conversion-rates](https://prospeo.io/s/b2b-lead-conversion-rates) +29. Lift Chart \- DataRobot docs, accessed May 5, 2026, [https://docs.datarobot.com/en/docs/classic-ui/modeling/analyze-models/evaluate/lift-chart-classic.html](https://docs.datarobot.com/en/docs/classic-ui/modeling/analyze-models/evaluate/lift-chart-classic.html) +30. 
Decile Analysis: Logistic Regression applied correctly | Aryma Labs \- Medium, accessed May 5, 2026, [https://medium.com/aryma-labs/the-lost-art-of-decile-analysis-a93f3636f1ad](https://medium.com/aryma-labs/the-lost-art-of-decile-analysis-a93f3636f1ad) +31. Evaluating the potential return of a model with Lift, Gain, and Decile Analysis, accessed May 5, 2026, [https://towardsdatascience.com/evaluating-the-potential-return-of-a-model-with-lift-gain-and-decile-analysis-319f00fde5b6/](https://towardsdatascience.com/evaluating-the-potential-return-of-a-model-with-lift-gain-and-decile-analysis-319f00fde5b6/) +32. Feature Engineering for Lead Scoring Models \- Reform.app, accessed May 5, 2026, [https://www.reform.app/blog/feature-engineering-for-lead-scoring-models](https://www.reform.app/blog/feature-engineering-for-lead-scoring-models) +33. A Comparison of LLMs for Use in Generating Synthetic Test Data for Automated Testing of a Patient-Focused, Survey-Based System \- PMC, accessed May 5, 2026, [https://pmc.ncbi.nlm.nih.gov/articles/PMC12099342/](https://pmc.ncbi.nlm.nih.gov/articles/PMC12099342/) +34. Evaluate with LLM-as-a-Judge — NVIDIA NeMo Platform Documentation, accessed May 5, 2026, [https://docs.nvidia.com/nemo/microservices/latest/evaluator/metrics/llm-as-a-judge.html](https://docs.nvidia.com/nemo/microservices/latest/evaluator/metrics/llm-as-a-judge.html) +35. How to Calibrate LLM-as-a-Judge with Human Corrections \- LangChain, accessed May 5, 2026, [https://www.langchain.com/articles/llm-as-a-judge](https://www.langchain.com/articles/llm-as-a-judge) +36. LLM-as-a-Judge Simply Explained: The Complete Guide to Run LLM Evals at Scale, accessed May 5, 2026, [https://www.confident-ai.com/blog/why-llm-as-a-judge-is-the-best-llm-evaluation-method](https://www.confident-ai.com/blog/why-llm-as-a-judge-is-the-best-llm-evaluation-method) +37. 
Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs \- ACL Anthology, accessed May 5, 2026, [https://aclanthology.org/2024.emnlp-main.285.pdf](https://aclanthology.org/2024.emnlp-main.285.pdf) +38. LLM-as-a-judge for enterprises: evaluate model alignment at scale | Snorkel AI, accessed May 5, 2026, [https://snorkel.ai/llm-as-judge-for-enterprises/](https://snorkel.ai/llm-as-judge-for-enterprises/) +39. Enhancing LLM-as-a-Judge through Active-Sampling-based Prompt Optimization \- ACL Anthology, accessed May 5, 2026, [https://aclanthology.org/2025.acl-industry.67.pdf](https://aclanthology.org/2025.acl-industry.67.pdf) +40. Prompting-Based Synthetic Data Generation \- Emergent Mind, accessed May 5, 2026, [https://www.emergentmind.com/topics/prompting-based-synthetic-data-generation](https://www.emergentmind.com/topics/prompting-based-synthetic-data-generation) +41. LLM-as-a-judge: a complete guide to using LLMs for evaluations \- Evidently AI, accessed May 5, 2026, [https://www.evidentlyai.com/llm-guide/llm-as-a-judge](https://www.evidentlyai.com/llm-guide/llm-as-a-judge) +42. confident-ai/deepeval: The LLM Evaluation Framework \- GitHub, accessed May 5, 2026, [https://github.com/confident-ai/deepeval](https://github.com/confident-ai/deepeval) +43. DeepEval Evaluations \- Datadog Docs, accessed May 5, 2026, [https://docs.datadoghq.com/llm\_observability/evaluations/deepeval\_evaluations/](https://docs.datadoghq.com/llm_observability/evaluations/deepeval_evaluations/) +44. Understanding DeepEval: A Practical Guide for Evaluating Large Language Models, accessed May 5, 2026, [https://codemaker2016.medium.com/understanding-deepeval-a-practical-guide-for-evaluating-large-language-models-d7272b6c2634](https://codemaker2016.medium.com/understanding-deepeval-a-practical-guide-for-evaluating-large-language-models-d7272b6c2634) +45. 
Reinforcement fine-tuning with LLM-as-a-judge | Artificial Intelligence \- AWS, accessed May 5, 2026, [https://aws.amazon.com/blogs/machine-learning/reinforcement-fine-tuning-with-llm-as-a-judge/](https://aws.amazon.com/blogs/machine-learning/reinforcement-fine-tuning-with-llm-as-a-judge/) +46. Top 20 Hugging Face Datasets \- Analytics Vidhya, accessed May 5, 2026, [https://www.analyticsvidhya.com/blog/2024/12/huggingface-datasets/](https://www.analyticsvidhya.com/blog/2024/12/huggingface-datasets/) +47. Progression System \- Kaggle, accessed May 5, 2026, [https://www.kaggle.com/progression](https://www.kaggle.com/progression) +48. Kaggle \- Machine Learning and Data Science Competitions \- DS@GT ARC Notes, accessed May 5, 2026, [https://notes.dsgt-arc.org/competition/venues/kaggle/](https://notes.dsgt-arc.org/competition/venues/kaggle/) +49. Lead scoring \- Deepnote, accessed May 5, 2026, [https://deepnote.com/app/deepnote/Lead-scoring-4edf6e52-c0a7-4d50-baf5-610ab23bc878](https://deepnote.com/app/deepnote/Lead-scoring-4edf6e52-c0a7-4d50-baf5-610ab23bc878) +50. Lead Scoring Case Study | Kaggle, accessed May 5, 2026, [https://www.kaggle.com/competitions/lead-scoring-case-study-acpc1/data](https://www.kaggle.com/competitions/lead-scoring-case-study-acpc1/data) +51. The Data Cards Playbook \- Google Research, accessed May 5, 2026, [https://sites.research.google/datacardsplaybook/](https://sites.research.google/datacardsplaybook/) +52. Create a dataset card \- Hugging Face, accessed May 5, 2026, [https://huggingface.co/docs/datasets/dataset\_card](https://huggingface.co/docs/datasets/dataset_card) +53. Scorecards for Synthetic Medical Data Evaluation and Reporting \- arXiv, accessed May 5, 2026, [https://arxiv.org/html/2406.11143v1](https://arxiv.org/html/2406.11143v1) +54. 
Data to Infinity and Beyond: Examining Data Sharing and Reuse Practices in the Computer Security Community, accessed May 5, 2026, [https://www.cise.ufl.edu/\~k.childs/Papers/DataSetReproducability.pdf](https://www.cise.ufl.edu/~k.childs/Papers/DataSetReproducability.pdf) +55. Model Cards \- Hugging Face, accessed May 5, 2026, [https://huggingface.co/docs/hub/model-cards](https://huggingface.co/docs/hub/model-cards) +56. Public API \- Kaggle, accessed May 5, 2026, [https://www.kaggle.com/docs/api](https://www.kaggle.com/docs/api) +57. How to use the Kaggle API to upload data from a server to Kaggle as a dataset? · GitHub, accessed May 5, 2026, [https://github.com/Lizhecheng02/Kaggle-Dataset-API-Upload](https://github.com/Lizhecheng02/Kaggle-Dataset-API-Upload) +58. How to Build an Auto-Updating Open-Source Dataset Using Kaggle API and GitHub Actions, accessed May 5, 2026, [https://python.plainenglish.io/how-to-build-an-auto-updating-open-source-dataset-using-kaggle-api-and-github-actions-a7b010eca222](https://python.plainenglish.io/how-to-build-an-auto-updating-open-source-dataset-using-kaggle-api-and-github-actions-a7b010eca222) +59. Hidden Gems: A Collection of Underrated Notebooks \- Kaggle, accessed May 5, 2026, [https://www.kaggle.com/code/headsortails/hidden-gems-a-collection-of-underrated-notebooks](https://www.kaggle.com/code/headsortails/hidden-gems-a-collection-of-underrated-notebooks) +60. A Machine Learning and Explainable Artificial Intelligence Approach to Student Dropout Prediction Using Multidimensional Educational Data | Cureus Journals | Article, accessed May 5, 2026, [https://www.cureusjournals.com/articles/12679-a-machine-learning-and-explainable-artificial-intelligence-approach-to-student-dropout-prediction-using-multidimensional-educational-data](https://www.cureusjournals.com/articles/12679-a-machine-learning-and-explainable-artificial-intelligence-approach-to-student-dropout-prediction-using-multidimensional-educational-data) +61. 
GitHub Actions \- Hugging Face, accessed May 5, 2026, [https://huggingface.co/docs/hub/repositories-github-actions](https://huggingface.co/docs/hub/repositories-github-actions) +62. Sync With Hugging Face Hub · Actions · GitHub Marketplace, accessed May 5, 2026, [https://github.com/marketplace/actions/sync-with-hugging-face-hub](https://github.com/marketplace/actions/sync-with-hugging-face-hub) +63. Upload files to the Hub \- Hugging Face, accessed May 5, 2026, [https://huggingface.co/docs/huggingface\_hub/guides/upload](https://huggingface.co/docs/huggingface_hub/guides/upload) +64. huggingface\_hub/README.md at main \- GitHub, accessed May 5, 2026, [https://github.com/huggingface/huggingface\_hub/blob/main/README.md](https://github.com/huggingface/huggingface_hub/blob/main/README.md) +65. Quickstart \- Hugging Face, accessed May 5, 2026, [https://huggingface.co/docs/huggingface\_hub/en/quick-start](https://huggingface.co/docs/huggingface_hub/en/quick-start) +66. Kaggle/kagglehub: Python library to access Kaggle resources \- GitHub, accessed May 5, 2026, [https://github.com/Kaggle/kagglehub](https://github.com/Kaggle/kagglehub) +67. Data 360 Architecture | Data 360 and Integration | Fundamentals \- Salesforce Architects, accessed May 5, 2026, [https://architect.salesforce.com/docs/architect/fundamentals/guide/data-360-architecture](https://architect.salesforce.com/docs/architect/fundamentals/guide/data-360-architecture) +68. Transformers Notebooks \- Hugging Face, accessed May 5, 2026, [https://huggingface.co/docs/transformers/notebooks](https://huggingface.co/docs/transformers/notebooks) +69. 
A Very Extensive Data Analysis of Yelp \- Kaggle, accessed May 5, 2026, [https://www.kaggle.com/code/ambarish/a-very-extensive-data-analysis-of-yelp](https://www.kaggle.com/code/ambarish/a-very-extensive-data-analysis-of-yelp) diff --git a/docs/external_review/gemini/gemini_report_v2.md b/docs/external_review/gemini/gemini_report_v2.md new file mode 100644 index 0000000..cd38afa --- /dev/null +++ b/docs/external_review/gemini/gemini_report_v2.md @@ -0,0 +1,247 @@ +# **Strategic Blueprint and Technical Research Report for Leadforge V1: Architecting the Gold Standard in Synthetic CRM Datasets** + +## **Evaluation of the Leadforge Project: Current State and Critical Assessment** + +The contemporary landscape of machine learning education and algorithmic development suffers from a profound deficiency: the severe scarcity of high-fidelity, relational Customer Relationship Management (CRM) and Go-To-Market (GTM) datasets. Real-world CRM data is heavily guarded, sequestered behind proprietary business intelligence walls and stringent global privacy regulations. Consequently, data science practitioners and students are frequently relegated to training models on trivial, static, or historically irrelevant datasets that fail to capture the extreme class imbalances, noisy firmographics, and complex temporal dynamics inherent in modern commercial pipelines. The Leadforge framework addresses this fundamental pedagogical and professional gap by engineering an opinionated architecture capable of generating synthetic commercial worlds governed by non-trivial Data Generating Processes (DGPs). + +A rigorous review of the Leadforge project in its current state, including its alpha quasi-releases and underlying codebase paradigm, reveals a foundation of significant promise paired with critical architectural vulnerabilities. The project successfully demonstrates the capacity to synthesize narratively deep lead scoring datasets. 
The fundamental pedagogical approach—synthesizing commercial entities, user behaviors, and lifecycle events to simulate a realistic economic environment—provides a highly valuable testing ground for modeling techniques such as lead scoring, pipeline forecasting, and ultimately, lifetime value (LTV) prediction. By simulating these environments, Leadforge allows educators to project multi-parquet relational structures down into accessible, single-CSV educational datasets tailored for introductory economics and management cohorts. + +However, a critical assessment of the alpha releases and the theoretical limits of the current methodology exposes several severe flaws that must be eradicated before a definitive V1 release can be deployed to elite repositories such as Kaggle and HuggingFace. The primary vulnerability lies in the methodology used to flatten complex, relational CRM data into a single tabular projection. While pedagogically convenient, this process currently introduces acute risks of temporal data leakage, specifically the inadvertent inclusion of post-event aggregates. Real-world lead scoring relies intrinsically on time-series behavioral data and evolving firmographic enrichment. When relational entity-event graphs are collapsed into a single row per lead without strict temporal boundaries, the crucial distinction between pre-prediction features and post-event consequences is easily blurred. Models trained on such data often exhibit exceptionally high cross-validation scores but fail catastrophically in production environments because they have inadvertently memorized future information.1 + +Secondly, the statistical distributions governing the current DGPs rely heavily on generalized assumptions rather than precise, industry-calibrated empirical benchmarks. 
For a synthetic dataset to be resilient against trivial predictive models and to yield realistic lift curves, its underlying generative bounds must accurately mirror the extreme drop-offs and low signal-to-noise ratios characteristic of modern B2B SaaS funnels. A truly unbreakable dataset requires an intricate causal graph that weaves precise industry deciles into its generative logic, forcing predictive algorithms to uncover subtle, multi-collinear relationships rather than relying on overt, synthesized correlations. + +Finally, the current iteration lacks the fully automated, continuous integration and continuous deployment (CI/CD) pipelines required by modern MLOps standards. The absence of structured, metadata-rich documentation schemas mandated by platforms like Kaggle and HuggingFace severely limits the discoverability and utility of the data. Furthermore, without an automated, Large Language Model (LLM)-driven validation layer, the dataset's internal logical coherence, demographic diversity, and syntax validity remain mathematically unverified prior to release. To achieve the milestone of releasing the definitive V1 lead scoring dataset, the Leadforge architecture must undergo a comprehensive overhaul focused on statistical calibration, strict temporal boundary enforcement, reference-less LLM validation, and modernized publishing automation. + +## **Macroeconomic Context: The Imperative for High-Fidelity B2B SaaS Data** + +To fully comprehend the structural requirements of a best-in-class synthetic CRM dataset, one must first analyze the macroeconomic realities of the B2B SaaS industry that the dataset seeks to emulate. 
The commercial environment of 2024 through 2026 has witnessed a pronounced shift from a "growth-at-all-costs" paradigm to an environment defined by capital efficiency and scrutinized revenue operations.3 Growth rates across private SaaS companies have decelerated, with the median growth rate falling from 30% in 2023 to approximately 25% in 2025, mirroring pandemic-era stabilization levels.5 Concurrently, Customer Acquisition Costs (CAC) have continued to escalate, with the New CAC Ratio rising by 14% in 2024, meaning that companies are frequently spending upwards of $2.00 in sales and marketing expenses to acquire merely $1.00 of new recurring revenue.6 + +In this highly constrained environment, the ability to accurately score and prioritize leads is not merely an academic exercise; it is a critical determinant of corporate survival. Marketing and sales teams are heavily reliant on predictive algorithms to filter immense volumes of low-intent noise and identify the scarce, high-value prospects that warrant expensive human intervention. However, the efficacy of these predictive models is entirely bottlenecked by the quality of the training data. If a model is trained on data that fails to represent the true friction of the modern sales funnel, the resulting predictions will misallocate critical sales resources, thereby exacerbating the already inflating CAC ratios.6 + +Therefore, the pedagogical value of the Leadforge V1 dataset depends entirely on its ability to faithfully replicate the severe class imbalances, elongated sales cycles, and complex behavioral signals that define contemporary B2B SaaS operations. The dataset must simulate an environment where the vast majority of generated leads represent dead ends, forcing students and practitioners to engineer sophisticated features and deploy advanced gradient boosting or neural network architectures to extract actionable signal. 
By anchoring the synthetic DGP in the empirical realities of 2025 SaaS performance metrics, Leadforge transcends the limitations of typical academic datasets and provides a rigorous, commercially relevant training ground. + +## **Calibrating the Data Generating Process (DGP) with Empirical Benchmarks** + +To construct a synthetic environment that mimics real-world commercial difficulty, the Leadforge generation engine must be calibrated to highly specific empirical benchmarks rather than relying on intuitive or generalized ranges. The architecture of a B2B sales pipeline is characterized by sequential stages—Visitor, Lead, Marketing Qualified Lead (MQL), Sales Qualified Lead (SQL), Opportunity, and Closed-Won—each functioning as a restrictive filter.8 The transition probabilities between these stages must be intricately woven into the framework's causal graph. + +### **Pipeline Conversion Dynamics and Class Imbalance** + +The most critical bottleneck in the commercial funnel, and the exact juncture where lead scoring models are predominantly deployed, is the transition from MQL to SQL. This stage represents the handoff between automated marketing nurture and expensive, human-driven sales evaluation. The baseline industry average for MQL-to-SQL conversion rests at approximately 13%, with broader general medians hovering between 13% and 15%.10 However, an elite, best-in-class dataset cannot simply apply a uniform 13% probability across all generated entities. The conversion probabilities must fracture significantly based on synthesized lead characteristics, acquisition channels, and organizational maturity. + +Empirical data reveals stark contrasts in funnel efficiency based on organizational performance tiers and go-to-market motions. 
Top-quartile performers utilizing advanced behavioral scoring models and tight sales-marketing alignment consistently achieve MQL-to-SQL conversion rates of 28% to 40%.10 Conversely, product-led growth (PLG) SaaS models exhibit distinctly different dynamics, often showing Lead-to-MQL rates of 45% to 65% due to the inclusion of Product-Qualified Leads (PQLs) who signal intent through in-app behavior.12 + +The generative engine must establish complex conditional probability distributions to replicate these variances. The following table synthesizes the empirical boundaries that the Leadforge DGP must utilize to parameterize its conversion logic across different verticals and channels: + +| Pipeline Stage Transition | Baseline Industry Median | Top-Quartile / AI-Scored | SEO Sourced | PPC Sourced | Email Sourced | Industry Specific: Cybersecurity | Industry Specific: Fintech | +| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | +| **Visitor → Lead** | 1.0% \- 3.0% | 3.0% \- 5.0% | N/A | N/A | N/A | 1.6% | N/A | +| **Lead → MQL** | 20% \- 25% | 31% \- 40% | N/A | N/A | N/A | 44% | N/A | +| **MQL → SQL** | 13% \- 15% | 28% \- 40% | 51% | 26% | \< 1.0% | 15% \- 18% | 11% \- 19% | +| **SQL → Opportunity** | 10% \- 12% | 45% \- 60% | N/A | N/A | N/A | 40% | N/A | +| **Opportunity → Won** | 6% \- 9% | 20% \- 35% | N/A | N/A | N/A | 39% | N/A | + +*Data synthesized from comprehensive 2024-2026 industry benchmarks.* 8 + +By hardcoding these transition probabilities as conditional dependencies within the DGP, Leadforge ensures that the resulting dataset exhibits realistic class imbalance. 
For example, if the generative engine creates a lead acquired via an email marketing campaign, the probability of that lead successfully reaching SQL status must be synthetically suppressed to below 1%.10 This mimics the reality that email lists generate high volumes of low-quality MQLs, often referred to as "measuring noise, not signal".10 A predictive model trained on this dataset will therefore be forced to learn that lead\_source \= 'email' is a strong negative predictor, while lead\_source \= 'organic\_search' provides a substantial positive lift, mirroring the 51% MQL-to-SQL conversion rate associated with high-intent inbound traffic.10 + +### **Temporal Friction and Sales Cycle Modeling** + +Furthermore, the temporal duration of the synthesized sales cycle must be rigorously modeled. The median B2B SaaS sales cycle lasts approximately 84 days, though highly optimized pipelines operate within a 46 to 75-day window.11 To simulate a realistic temporal environment, the DGP must sample timestamps for event generation using heavily skewed distributions, such as log-normal or Weibull probability density functions. This ensures that the generated dataset contains the long tail of delayed conversions that consistently confounds linear time-series forecasting models. By injecting temporal friction into the simulated world, Leadforge ensures that practitioners must account for lag times and cohort decay when attempting to predict eventual conversion outcomes. + +### **Synthesizing Explicit and Implicit Feature Spaces** + +The predictive utility of a lead scoring dataset is directly proportional to the richness and noise of its feature space. Modern scoring algorithms synthesize two distinct categories of data: explicit data (demographics and firmographics) and implicit data (behavioral signals and engagement metrics).16 + +1. **Demographic and Firmographic Topologies**: The generative engine must produce realistic categorizations of commercial entities. 
Firmographic attributes must include company size, annual recurring revenue (ARR), industry vertical, geographic location, and technographic stack deployments.18 Demographic fields must focus on the individual actor's seniority, job title, department, and decision-making authority.18 To ensure the dataset cannot be easily parsed by simplistic heuristic models, the DGP must introduce deliberate synthetic variance into these fields. For instance, instead of standardizing all senior operations roles as "VP of Operations," the generator should introduce noisy permutations such as "Head of Ops," "Director of Global Operations," or "Operations VP".21 This structural noise forces data science students to apply Natural Language Processing (NLP) techniques, string clustering, or categorical embedding layers prior to executing standard classification algorithms. +2. **Behavioral Engagement Signatures**: The true predictive signal in contemporary CRM ecosystems stems from behavioral telemetry. The generative engine must simulate intricate event logs, including weighted page visits (where views of a pricing page carry significantly more predictive weight than views of a top-of-funnel blog post), content downloads, webinar attendance records, and granular email interaction metrics (opens, click-throughs, and unsubscribes).16 The interaction between explicit firmographics and implicit behaviors must also be modeled; for example, a synthesized "C-suite" executive should exhibit different web browsing patterns compared to a synthesized "technical contributor." + +## **Eradicating Temporal Data Leakage in Synthetic Projections** + +The single most pervasive and destructive flaw in both synthetic and organically aggregated commercial datasets is the presence of data leakage. 
Specifically, temporal leakage and the inclusion of post-event aggregates fundamentally compromise the integrity of predictive modeling.1 Data leakage occurs when information from outside the designated training dataset, or from a point in time strictly after the event being predicted, is inadvertently allowed to influence the model's feature set.2 This phenomenon invariably leads to an overly optimistic estimation of the model's performance, resulting in algorithms that demonstrate near-perfect cross-validation scores but fail entirely when deployed against live, unseen data.2 + +In the context of the Leadforge project, the methodology of projecting a complex, multi-parquet relational database (representing the simulated commercial world) down into a single, flattened CSV file introduces massive risk vectors for temporal leakage.23 When relational entity-event graphs are collapsed into a single row per lead, the temporal boundaries are easily obfuscated. + +### **Enforcing the Predictive Boundary (![][image2])** + +To architect a mathematically sound dataset, the Leadforge engine must define a strict, immutable temporal boundary known as ![][image2]. This variable represents the exact chronological timestamp at which the hypothetical predictive model would execute its scoring algorithm in a live production environment (e.g., the precise moment a lead breaches the MQL threshold and is evaluated for SQL transition).24 + +During the projection phase, when the framework aggregates the relational tables to build the CSV feature matrix, any behavioral event, email interaction, form submission, or firmographic enrichment that possesses a timestamp strictly greater than ![][image2] must be aggressively masked, filtered, and excluded from the computation.24 + +The danger of "Post-Event Aggregates" is particularly insidious. 
If the DGP generates an aggregated feature such as total\_lifetime\_website\_visits or cumulative\_webinar\_minutes, this calculation must be strictly bounded. It must only sum the events that occurred prior to ![][image2]. If the aggregation function inadvertently scans the entire synthesized history of the lead, it will include behaviors that occurred after the lead converted to an SQL, effectively allowing the target variable to leak backwards into the predictive features.2 In clinical machine learning, similar leakage regarding post-diagnostic features routinely invalidates peer-reviewed models; the same stringent standards must apply to synthetic commercial datasets.22 + +### **Mitigating Group and Similarity Leakage** + +Beyond temporal boundaries, the dataset generation must carefully avoid group or similarity leakage.23 In synthetic data generation, it is common for the engine to produce multiple samples derived from the same underlying latent seed, resulting in near-duplicate entities.23 If these highly correlated synthetic leads are randomly split between the training and testing sets, the model will essentially memorize the shared underlying pattern, leading to inflated performance metrics. + +To counter this, the Leadforge dataset should include pre-defined, time-based splits that emulate real-world rolling forecasting techniques. By dividing the dataset into multiple non-overlapping temporal windows, the framework forces users to train their models on an initial historical window and validate against a strictly subsequent future window.1 This ensures that the simulated distribution drift and temporal evolution of the commercial environment are preserved, demanding that practitioners develop robust, generalizable models capable of handling non-stationary data. 
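The two safeguards described above (bounding every aggregate at the prediction boundary, and splitting train and test by time) can be sketched in a few lines. This is only an illustrative sketch, not Leadforge's implementation: the names `Event`, `point_in_time_counts`, `time_based_split`, and `cutoffs` are hypothetical, and it assumes each lead carries its own scoring timestamp and that events stamped exactly at that timestamp are still admissible.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """One row of a hypothetical behavioral event log."""
    lead_id: str
    kind: str
    ts: datetime


def point_in_time_counts(events, cutoffs):
    """Count events per lead, ignoring anything after that lead's cutoff.

    `cutoffs` maps lead_id -> scoring timestamp (assumed name for the
    prediction boundary). Events with ts > cutoff are excluded.
    """
    counts = {lead_id: 0 for lead_id in cutoffs}
    for ev in events:
        cutoff = cutoffs.get(ev.lead_id)
        if cutoff is not None and ev.ts <= cutoff:
            counts[ev.lead_id] += 1
    return counts


def time_based_split(cutoffs, split_at):
    """Leads scored before `split_at` go to train, later leads to test."""
    train = sorted(l for l, t in cutoffs.items() if t < split_at)
    test = sorted(l for l, t in cutoffs.items() if t >= split_at)
    return train, test


t0 = datetime(2025, 1, 1)
cutoffs = {
    "lead_a": t0 + timedelta(days=10),
    "lead_b": t0 + timedelta(days=40),
}
events = [
    Event("lead_a", "page_view", t0 + timedelta(days=2)),
    Event("lead_a", "page_view", t0 + timedelta(days=9)),
    Event("lead_a", "demo_call", t0 + timedelta(days=15)),  # after cutoff
    Event("lead_b", "webinar", t0 + timedelta(days=35)),
    Event("lead_b", "page_view", t0 + timedelta(days=41)),  # after cutoff
]

features = point_in_time_counts(events, cutoffs)
train_ids, test_ids = time_based_split(cutoffs, split_at=t0 + timedelta(days=30))
print(features)             # {'lead_a': 2, 'lead_b': 1}
print(train_ids, test_ids)  # ['lead_a'] ['lead_b']
```

An unbounded "lifetime" count would report 3 events for `lead_a`, silently including the post-conversion demo call; the bounded version reports 2. The same cutoff map drives the split, so no test lead is scored earlier than any training lead.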
+ +## **Validation Architecture via LLM-as-a-Judge** + +Guaranteeing the statistical purity, logical coherence, and demographic variance of the V1 dataset before it is deployed to global repositories necessitates a profound evolution in automated quality assurance. Traditional deterministic metrics, such as n-gram overlaps (BLEU, ROUGE) or strict distributional heuristics, are fundamentally incapable of evaluating the semantic nuance, contextual logic, and edge-case validity of synthetically generated tabular data.28 The integration of an "LLM-as-a-judge" evaluation layer provides a scalable, highly sensitive mechanism for continuous dataset validation. + +### **The Single-Output, Reference-Less Paradigm** + +The validation engine must utilize a single-output, reference-less architectural paradigm.31 In conventional evaluation workflows, an LLM compares a generated output against a "gold standard" human reference. However, in the context of synthetic tabular generation, there is no single correct trajectory for a simulated lead. Therefore, a secondary judge model must be presented with individual rows or longitudinal trajectories of generated data and prompted to score them against an explicit, multidimensional rubric without relying on a predefined baseline.28 + +The validation module should programmatically sample a statistically significant cohort of generated leads from the pipeline and pass them to the LLM judge. Advanced frameworks, such as NVIDIA's NeMo Evaluator or the G-Eval methodology, demonstrate that language models can perform highly reliable classifications of tabular and generative outputs when the evaluation prompts are meticulously engineered to score specific, isolated dimensions.29 + +### **Multidimensional Evaluation Rubric** + +To ensure the dataset is unbreakable and pedagogically sound, the LLM judge must assess the synthetic trajectories across three primary axes: + +1. 
**Logical Coherence and Semantic Solvability**: The judge must evaluate whether the generated sequence of events and firmographic assignments align with real-world commercial logic. For instance, if the DGP synthesizes a lead designated as the "Chief Information Security Officer (CISO) of a global financial institution," does the subsequent behavioral trajectory reflect that status? A generated trajectory showing that same CISO submitting a form for a $15/month basic marketing plugin, bypassing all security protocol evaluations, and converting in two days represents a catastrophic failure in logical coherence.33 The judge must flag these logical incongruities. +2. **Effective Semantic Diversity**: Recent research indicates that heavily aligned generative models and complex DGPs frequently suffer from mode collapse, producing highly homogenized, safe outputs.35 A synthetic dataset loses its pedagogical value if every converting lead follows the exact same "happy path" trajectory. The validation layer must explicitly measure diversity to ensure the generation engine is exploring the full extremities of the statistical space. The judge must evaluate the sampled cohort to verify that it covers a wide, realistic assortment of firmographics, unpredictable behavioral permutations, and edge cases, rather than merely repeating identical permutations of an ideal customer profile.37 +3. **Syntax Validity and Formatting Integrity**: The LLM must verify that all categorical fields are syntactically valid and entirely free from hallucinatory artifacts or structural anomalies. 
This includes ensuring that generated strings for industry verticals conform to recognized Standard Industrial Classification (SIC) logic, that phone numbers match the geographical conventions of the generated location, and that numerical fields do not contain impossible values (e.g., negative employee counts or fractional website visits).33 + +### **Mitigating Algorithmic Judge Biases** + +Deploying an LLM as an automated evaluator introduces inherent systemic risks, primarily verbosity bias (the tendency to favor longer text fields regardless of their actual accuracy) and self-preference bias (the tendency of a model to rate outputs generated by architectures similar to its own more favorably).28 + +To rigorously combat these biases, the Leadforge validation prompts must utilize a forced-rationale structure. The prompt matrix must compel the LLM to output a detailed, step-by-step analytical rationale before it is permitted to yield a final numerical score. This technique forces the model to engage in analytical decomposition, significantly stabilizing the scoring output.32 The ultimate scores generated by the LLM-as-a-judge must act as a strict, automated quality gate within the CI/CD pipeline. If the mean coherence or diversity scores of a generated batch fall below a scientifically calibrated threshold, the pipeline must automatically halt the release, preventing compromised data from reaching public repositories.31 + +## **Modernizing the MLOps Publishing Pipeline and Documentation Schemas** + +The technical delivery mechanism for the Leadforge V1 dataset must completely abandon manual uploading processes and ad-hoc scripting in favor of programmatic, continuous deployment tools designed specifically for modern Machine Learning Operations (MLOps). Furthermore, achieving a "best-in-class" designation relies as much on the structural quality of the documentation as it does on the underlying data. 
A dataset's utility is inextricably linked to its discoverability, the clarity of its metadata, and the depth of its accompanying exploratory analysis. + +### **CI/CD Pipeline Automation for Elite Repositories** + +The Leadforge framework must integrate automated, bidirectional syncing directly from the local repository environment to both Kaggle and HuggingFace, ensuring that updates to the generation engine are seamlessly reflected in the public data artifacts. + +**HuggingFace Hub Synchronization:** The deployment pipeline for HuggingFace must utilize the official huggingface/hub-sync GitHub Action. This purpose-built tool enables secure, direct file mirroring from the GitHub repository directly to the HuggingFace Hub, entirely eliminating the need for intermediary storage or manual Git LFS interventions.41 The pipeline configuration requires the creation of a fine-grained access token with strict write permissions, securely stored within the repository secrets as HF\_TOKEN. By explicitly setting the repo\_type parameter to dataset within the workflow YAML, the GitHub Action ensures flawless version control synchronization, automatically pushing newly generated parquet files or CSV projections to the dataset repository upon designated release triggers.41 + +**Kaggle Programmatic Deployment:** For deployment to Kaggle, the pipeline must be orchestrated via the official kagglehub Python library. This library provides a seamless programmatic interface intended for native integration within automated Python ML workflows, superseding older, fragile command-line interfaces.44 The deployment script must authenticate using securely managed API credentials and execute the kagglehub.dataset\_upload() function. 
This function requires a handle of the form username/dataset-name alongside the local directory path containing the generated artifacts.45 Crucially, the pipeline should leverage the version\_notes argument to programmatically inject the current GitHub commit hash or release tag, ensuring strict, auditable lineage tracking between the exact state of the Leadforge codebase and the resulting Kaggle dataset artifact.45 + +### **Architecting "Gold Standard" Documentation Rubrics** + +To maximize visibility, community engagement, and pedagogical utility, the dataset documentation must strictly adhere to the formalized metadata schemas that drive the search algorithms of each respective platform. + +#### **HuggingFace Dataset Card YAML Specification** + +The HuggingFace platform mandates that dataset documentation be contained within a README.md file prefaced by a meticulously structured YAML metadata block.46 This YAML configuration is not merely informational; it actively dictates how the dataset is indexed, filtered, and rendered by the Hub's interactive Dataset Viewer.47 + +The optimal metadata schema for the Leadforge V1 release must include the following strictly formatted keys: + +* **language**: Explicit declaration of the dataset language using ISO 639-1 codes (e.g., en).47 +* **pretty\_name**: A stylized, highly readable title optimized for search indexing.47 +* **tags**: Critical for algorithmic discoverability. The metadata must force the dataset modality by including the tabular tag. 
It must also include domain-specific keywords such as crm, lead-scoring, b2b, and synthetic-data to capture relevant search traffic.47 +* **license**: A valid open-source license identifier (e.g., mit, apache-2.0, or cc-by-4.0) is mandatory to ensure broad academic adoption and clear commercial usage boundaries.47 +* **task\_categories**: To ensure the dataset populates correctly in the Hub's task-specific repositories, it must be explicitly tagged with tabular-classification.46 +* **configs**: The YAML block must contain detailed configuration instructions specifying how data libraries should load the files. This involves mapping the generated CSVs or Parquet subsets to specific train and test splits using the data\_files parameter, allowing end-users to load the data with a single line of Python code.47 + +#### **Kaggle Metadata and Analytical Notebook Schemas** + +Kaggle dataset releases require a companion dataset-metadata.json file. This highly specific JSON schema strictly defines the dataset's unique slug, title, and licensing terms. This file ensures that Kaggle's backend ingestion engine correctly parses the tabular data to generate automated column metadata, statistical distributions, and metric visualizations upon upload.44 + +The definitive benchmark for Kaggle documentation excellence in the tabular domain is found in datasets like the seminal IEEE-CIS Fraud Detection competition. That specific dataset successfully modeled complex temporal dynamics by splitting the data into distinct identity and transaction tables, linked by a primary key, mirroring the relational reality of payment gateways.50 While Leadforge V1 will project its data down to a single CSV to ensure pedagogical accessibility for introductory students, the underlying documentation must explicitly detail the relational dynamics that were compressed during generation. 
This approach mirrors the analytical depth and structural transparency seen in the top-tier IEEE documentation, elevating the perceived rigor of the dataset.51 + +To ensure the introductory starter notebook drives high community engagement and upvotes, it must rigidly follow Kaggle's official Solution Write-Up rubric.52 The notebook must be structurally divided into four core pillars: + +1. **Context**: Clear hyperlinks to the business objectives of lead scoring, explicit definitions of the data schema, and the pedagogical purpose of the synthetic release.52 +2. **Overview of the Approach**: A highly detailed, mathematical exploration of the DGP. This section must reveal the empirical industry benchmarks used to calibrate the conversion rates, and critically, provide a transparent explanation of the anti-leakage mechanisms and prediction-time cutoff boundaries implemented during the data projection phase.52 +3. **Details of the Data**: An exploratory data analysis (EDA) of the synthesized features, highlighting the non-obvious dynamics engineered into the dataset. This should visualize the differential conversion rates based on simulated lead sources, demonstrating the underlying class imbalance.52 +4. **Sources**: Comprehensive, academically formatted citations of the empirical industry reports, SaaS metrics, and pipeline benchmarks that informed the dataset's calibration, proving its alignment with real-world scenarios.52 + +## **Suggested Roadmap for the V1 Dataset Release** + +Based on the exhaustive synthesis of empirical CRM dynamics, advanced MLOps best practices, and state-of-the-art synthetic validation techniques, the following sequential, phase-gated roadmap is proposed for the execution of the Leadforge V1 release. + +### **Phase 1: Statistical Calibration and Core Engine Refinement** + +* **Objective**: Overhaul the generative statistical boundaries to perfectly reflect empirical 2025 B2B SaaS realities. 
+* **Action Items**: + * Hardcode conditional probability matrices mapping synthesized lead sources (e.g., SEO, PPC, Email) to distinct MQL-to-SQL conversion probabilities, enforcing the 13% median while respecting channel variance (e.g., 51% for SEO, \<1% for Email). + * Implement temporal skew algorithms leveraging log-normal distributions to enforce realistic, delayed sales cycle durations ranging from 46 to 84 days. + * Expand the firmographic and behavioral feature generation logic to include complex, noisy categorical strings for job titles and highly weighted behavioral event logs, ensuring adequate feature space dimensionality. + +### **Phase 2: Anti-Leakage Architecture Implementation** + +* **Objective**: Mathematically guarantee the absolute absence of temporal leakage, similarity leakage, and post-event aggregates within the flattened CSV projection. +* **Action Items**: + * Define a strict, programmatic prediction-time cutoff within the data projection module, representing the exact moment of model inference. + * Engineer aggregation functions that strictly mask any behavioral, interaction, or firmographic modifications timestamped after that cutoff. + * Implement time-based dataset splitting, rather than random shuffling, to generate native train and test cohorts that force predictive models to generalize across distinct temporal gaps. + +### **Phase 3: Integration of the LLM-as-a-Judge Validation Layer** + +* **Objective**: Build and deploy the automated, reference-less LLM quality gate to ensure semantic and structural integrity. +* **Action Items**: + * Develop a single-output, reference-less prompt matrix requiring the LLM to output extensive analytical rationale prior to assigning a score, mitigating verbosity and self-preference biases. + * Establish strict rubrics for evaluating Logical Coherence, Effective Semantic Diversity, and Syntax Validity across the synthesized trajectories. 
+ * Integrate this evaluation module directly into the generation pipeline, setting rigid numerical failure thresholds that automatically halt the CI/CD process if low-scoring data is detected. + +### **Phase 4: CI/CD Pipeline Construction and Documentation Automation** + +* **Objective**: Fully automate the publishing workflows and align all repository metadata with exact platform specifications. +* **Action Items**: + * Write scripts to automatically generate the HuggingFace README.md containing the exact YAML specification (including task\_categories: tabular-classification, necessary modality tags, and data loading configs). + * Generate the Kaggle dataset-metadata.json artifact dynamically alongside the CSV data extraction. + * Configure GitHub Actions workflows utilizing the huggingface/hub-sync action and Python scripts leveraging kagglehub.dataset\_upload() to execute fully automated, auditable deployments triggered exclusively by formal release tags. + +### **Phase 5: Synthesis of the Definitive Introductory Notebook** + +* **Objective**: Draft the premier "starter notebook" adhering to Kaggle's most rigorous community rubrics to drive maximum engagement. +* **Action Items**: + * Structure the notebook meticulously using the mandatory Context, Overview, Details, and Sources schema. + * Include rich visualizations of the synthetic conversion bottlenecks, temporal distributions, and baseline lift curves to empirically demonstrate the dataset's non-trivial difficulty. + * Publish the release with explicit calls-to-action, challenging the global data science community to identify residual leakage or break the underlying DGP, thereby driving engagement and aggregating the necessary feedback for the V2 iteration. + +#### **Works cited** + +1. What is Data Leakage in Machine Learning? \- IBM, accessed May 5, 2026, [https://www.ibm.com/think/topics/data-leakage-machine-learning](https://www.ibm.com/think/topics/data-leakage-machine-learning) +2. 
Data Leakage : Causes, Effects and Solutions | by Arash Nicoomanesh | Medium, accessed May 5, 2026, [https://medium.com/@anicomanesh/data-leakage-causes-effects-and-solutions-6cc44a149e1c](https://medium.com/@anicomanesh/data-leakage-causes-effects-and-solutions-6cc44a149e1c) +3. 2025 B2B SaaS Benchmarks Report \- Maxio, accessed May 5, 2026, [https://www.maxio.com/resources/2025-saas-benchmarks-report](https://www.maxio.com/resources/2025-saas-benchmarks-report) +4. B2B SaaS benchmarks in 2025 \- Orb, accessed May 5, 2026, [https://www.withorb.com/blog/b2b-saas-benchmarks](https://www.withorb.com/blog/b2b-saas-benchmarks) +5. 2025 Private B2B SaaS Company Growth Rate Benchmarks \- SaaS Capital, accessed May 5, 2026, [https://www.saas-capital.com/research/private-saas-company-growth-rate-benchmarks/](https://www.saas-capital.com/research/private-saas-company-growth-rate-benchmarks/) +6. 2025 SaaS Performance Metrics \- Benchmarkit, accessed May 5, 2026, [https://www.benchmarkit.ai/2025benchmarks](https://www.benchmarkit.ai/2025benchmarks) +7. A Global Marketing & Sales Performance Analysis of 2025 — and Strategic Preparation for 2026 for Israeli companies | match-b2b, accessed May 5, 2026, [https://www.match-b2b.com/a-global-marketing-and-sales-performance-analysis-of-2025-and-strategic-preparation-for-2026-for-israeli-companies](https://www.match-b2b.com/a-global-marketing-and-sales-performance-analysis-of-2025-and-strategic-preparation-for-2026-for-israeli-companies) +8. B2B Sales Pipeline Conversion Rates – MarketJoy Data, accessed May 5, 2026, [https://marketjoy.com/b2b-sales-pipeline-conversion-rates-marketjoy-data/](https://marketjoy.com/b2b-sales-pipeline-conversion-rates-marketjoy-data/) +9. Understanding your sales funnel conversion rates \- HiBob, accessed May 5, 2026, [https://www.hibob.com/blog/sales-funnel-conversion-rate/](https://www.hibob.com/blog/sales-funnel-conversion-rate/) +10. Is the MQL Dead? 
Why B2B Marketing Must Shift to SQL as Its Primary KPI, accessed May 5, 2026, [https://www.geisheker.com/mql-vs-sql-b2b-marketing-kpi/](https://www.geisheker.com/mql-vs-sql-b2b-marketing-kpi/) +11. 2025 B2B SaaS Funnel Benchmarks & Pipeline Audit Framework \- The Digital Bloom, accessed May 5, 2026, [https://thedigitalbloom.com/learn/pipeline-performance-benchmarks-2025/](https://thedigitalbloom.com/learn/pipeline-performance-benchmarks-2025/) +12. MQL to SQL Conversion Rates: B2B SaaS Benchmarks \- Understory Agency, accessed May 5, 2026, [https://www.understoryagency.com/blog/mql-to-sql-conversion-rate-benchmarks](https://www.understoryagency.com/blog/mql-to-sql-conversion-rate-benchmarks) +13. 2026 B2B SaaS Funnel Conversion Benchmarks Guide \- CausalFunnel, accessed May 5, 2026, [https://www.causalfunnel.com/blog/b2b-saas-funnel-conversion-benchmarks-2026-data-insights/](https://www.causalfunnel.com/blog/b2b-saas-funnel-conversion-benchmarks-2026-data-insights/) +14. B2B sales conversion rate by industry: benchmarks, formulas, and optimization tactics \- Zeliq, accessed May 5, 2026, [https://www.zeliq.com/blog/b2b-conversion-rates-by-industry](https://www.zeliq.com/blog/b2b-conversion-rates-by-industry) +15. B2B SaaS Funnel Conversion Benchmarks \- First Page Sage, accessed May 5, 2026, [https://firstpagesage.com/seo-blog/b2b-saas-funnel-conversion-benchmarks-fc/](https://firstpagesage.com/seo-blog/b2b-saas-funnel-conversion-benchmarks-fc/) +16. 7 Effective Tips For B2B Lead Scoring Examples \- Small Business Expo, accessed May 5, 2026, [https://www.thesmallbusinessexpo.com/blog/b2b-lead-scoring-examples/](https://www.thesmallbusinessexpo.com/blog/b2b-lead-scoring-examples/) +17. Lead Scoring: The Complete Guide for B2B Sales and Marketing \- 2025 Update \- Outfunnel, accessed May 5, 2026, [https://outfunnel.com/lead-scoring/](https://outfunnel.com/lead-scoring/) +18. 
B2B Lead Scoring Model: 7-Step Template \+ CRM Setup \- Scalarly, accessed May 5, 2026, [https://scalarly.com/blog/b2b-lead-scoring-model/](https://scalarly.com/blog/b2b-lead-scoring-model/) +19. Ultimate Guide to Demographic Lead Scoring Models \- LeadBoxer, accessed May 5, 2026, [https://www.leadboxer.com/learn/ultimate-guide-to-demographic-lead-scoring-models](https://www.leadboxer.com/learn/ultimate-guide-to-demographic-lead-scoring-models) +20. Lead Enrichment Explained: A B2B Marketer's Guide for 2025 \- Factors.ai, accessed May 5, 2026, [https://www.factors.ai/blog/lead-enrichment-explained](https://www.factors.ai/blog/lead-enrichment-explained) +21. Lead Scoring: How to Find the Best Prospects in 4 Steps \- Salesforce, accessed May 5, 2026, [https://www.salesforce.com/blog/lead-scoring/](https://www.salesforce.com/blog/lead-scoring/) +22. The Effect of Data Leakage and Feature Selection on Machine Learning Performance for Early Parkinson's Disease Detection \- PMC, accessed May 5, 2026, [https://pmc.ncbi.nlm.nih.gov/articles/PMC12383348/](https://pmc.ncbi.nlm.nih.gov/articles/PMC12383348/) +23. Engineer's Guide to Automatically Identifying and Mitigating Data Leakage \- LatticeFlow AI, accessed May 5, 2026, [https://latticeflow.ai/news/engineers-guide-to-data-leakage](https://latticeflow.ai/news/engineers-guide-to-data-leakage) +24. Data leakage \- Article \- SailPoint, accessed May 5, 2026, [https://www.sailpoint.com/identity-library/data-leakage](https://www.sailpoint.com/identity-library/data-leakage) +25. Preventing Data Leakage in Feature Engineering: Strategies and Solutions \- dotData, accessed May 5, 2026, [https://dotdata.com/blog/preventing-data-leakage-in-feature-engineering-strategies-and-solutions/](https://dotdata.com/blog/preventing-data-leakage-in-feature-engineering-strategies-and-solutions/) +26. 
Preventing Training Data Leakage in AI Systems | Blog | Tonic.ai, accessed May 5, 2026, [https://www.tonic.ai/blog/prevent-training-data-leakage-ai](https://www.tonic.ai/blog/prevent-training-data-leakage-ai) +27. When Privacy Isn't Synthetic: Hidden Data Leakage in Generative AI Models \- arXiv, accessed May 5, 2026, [https://arxiv.org/html/2512.06062v1](https://arxiv.org/html/2512.06062v1) +28. Rubric-Based Evaluations & LLM-as-a-Judge — Methodologies, Biases, and Empirical Validation in Domain-Specific Contexts. | by Adnan Masood, PhD. | Apr, 2026 | Medium, accessed May 5, 2026, [https://medium.com/@adnanmasood/rubric-based-evals-llm-as-a-judge-methodologies-and-empirical-validation-in-domain-context-71936b989e80](https://medium.com/@adnanmasood/rubric-based-evals-llm-as-a-judge-methodologies-and-empirical-validation-in-domain-context-71936b989e80) +29. LLM-as-a-Judge Metrics | Confident AI Docs, accessed May 5, 2026, [https://www.confident-ai.com/docs/llm-evaluation/core-concepts/llm-as-a-judge](https://www.confident-ai.com/docs/llm-evaluation/core-concepts/llm-as-a-judge) +30. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods \- arXiv, accessed May 5, 2026, [https://arxiv.org/html/2412.05579v2](https://arxiv.org/html/2412.05579v2) +31. Evaluate with LLM-as-a-Judge — NVIDIA NeMo Platform Documentation, accessed May 5, 2026, [https://docs.nvidia.com/nemo/microservices/latest/evaluator/metrics/llm-as-a-judge.html](https://docs.nvidia.com/nemo/microservices/latest/evaluator/metrics/llm-as-a-judge.html) +32. Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge), accessed May 5, 2026, [https://eugeneyan.com/writing/llm-evaluators/](https://eugeneyan.com/writing/llm-evaluators/) +33. Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs \- arXiv, accessed May 5, 2026, [https://arxiv.org/html/2409.16341v2](https://arxiv.org/html/2409.16341v2) +34. 
Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs \- ACL Anthology, accessed May 5, 2026, [https://aclanthology.org/2024.emnlp-main.285.pdf](https://aclanthology.org/2024.emnlp-main.285.pdf) +35. Evaluating the Diversity and Quality of LLM Generated Content \- arXiv, accessed May 5, 2026, [https://arxiv.org/html/2504.12522v2](https://arxiv.org/html/2504.12522v2) +36. Evaluating Synthetic Data Generation from User Generated Text | Computational Linguistics, accessed May 5, 2026, [https://direct.mit.edu/coli/article/51/1/191/124625/Evaluating-Synthetic-Data-Generation-from-User](https://direct.mit.edu/coli/article/51/1/191/124625/Evaluating-Synthetic-Data-Generation-from-User) +37. Generate, Evaluate, Iterate: Synthetic Data for Human-in-the-Loop Refinement of LLM Judges \- arXiv, accessed May 5, 2026, [https://arxiv.org/html/2511.04478v1](https://arxiv.org/html/2511.04478v1) +38. How do you evaluate the quality of synthetic data analysis results? \- BlueGen AI, accessed May 5, 2026, [https://bluegen.ai/how-do-you-evaluate-the-quality-of-synthetic-data-analysis-results/](https://bluegen.ai/how-do-you-evaluate-the-quality-of-synthetic-data-analysis-results/) +39. Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist \- ACL Anthology, accessed May 5, 2026, [https://aclanthology.org/2025.emnlp-demos.1.pdf](https://aclanthology.org/2025.emnlp-demos.1.pdf) +40. Evaluate generative AI models with an Amazon Nova rubric-based LLM judge on Amazon SageMaker AI (Part 2\) | Artificial Intelligence, accessed May 5, 2026, [https://aws.amazon.com/blogs/machine-learning/evaluate-generative-ai-models-with-an-amazon-nova-rubric-based-llm-judge-on-amazon-sagemaker-ai-part-2/](https://aws.amazon.com/blogs/machine-learning/evaluate-generative-ai-models-with-an-amazon-nova-rubric-based-llm-judge-on-amazon-sagemaker-ai-part-2/) +41. 
GitHub Actions \- Hugging Face, accessed May 5, 2026, [https://huggingface.co/docs/hub/repositories-github-actions](https://huggingface.co/docs/hub/repositories-github-actions) +42. Sync With Hugging Face Hub · Actions · GitHub Marketplace, accessed May 5, 2026, [https://github.com/marketplace/actions/sync-with-hugging-face-hub](https://github.com/marketplace/actions/sync-with-hugging-face-hub) +43. How to sync Hugging Face model commits with GitHub? \- Intermediate, accessed May 5, 2026, [https://discuss.huggingface.co/t/how-to-sync-hugging-face-model-commits-with-github/149599](https://discuss.huggingface.co/t/how-to-sync-hugging-face-model-commits-with-github/149599) +44. Public API \- Kaggle, accessed May 5, 2026, [https://www.kaggle.com/docs/api](https://www.kaggle.com/docs/api) +45. GitHub \- Kaggle/kagglehub: Python library to access Kaggle resources, accessed May 5, 2026, [https://github.com/Kaggle/kagglehub](https://github.com/Kaggle/kagglehub) +46. Create a dataset card \- Hugging Face, accessed May 5, 2026, [https://huggingface.co/docs/datasets/dataset\_card](https://huggingface.co/docs/datasets/dataset_card) +47. Dataset Cards \- Hugging Face, accessed May 5, 2026, [https://huggingface.co/docs/hub/datasets-cards](https://huggingface.co/docs/hub/datasets-cards) +48. What is Tabular Classification? \- Hugging Face, accessed May 5, 2026, [https://huggingface.co/tasks/tabular-classification](https://huggingface.co/tasks/tabular-classification) +49. How To Use Kaggle: Datasets, accessed May 5, 2026, [https://www.kaggle.com/docs/datasets](https://www.kaggle.com/docs/datasets) +50. Dataset Description \- IEEE-CIS Fraud Detection | Kaggle, accessed May 5, 2026, [https://www.kaggle.com/competitions/ieee-fraud-detection/data](https://www.kaggle.com/competitions/ieee-fraud-detection/data) +51. 
IEEE-CIS Fraud Detection | Kaggle, accessed May 5, 2026, [https://www.kaggle.com/c/ieee-fraud-detection/discussion/111284](https://www.kaggle.com/c/ieee-fraud-detection/discussion/111284) +52. Kaggle Solution Write-Up Documentation, accessed May 5, 2026, [https://www.kaggle.com/solution-write-up-documentation](https://www.kaggle.com/solution-write-up-documentation) +53. \[Product Update\] Competition Solution Write-Ups: Improving the Way Insights Are Gathered on Kaggle, accessed May 5, 2026, [https://www.kaggle.com/discussions/product-feedback/373153](https://www.kaggle.com/discussions/product-feedback/373153) +54. Introducing Writeups\! \- Kaggle, accessed May 5, 2026, [https://www.kaggle.com/discussions/product-announcements/593763](https://www.kaggle.com/discussions/product-announcements/593763) + +[image1]: + +[image2]: diff --git a/docs/external_review/summaries/README.md b/docs/external_review/summaries/README.md new file mode 100644 index 0000000..a497806 --- /dev/null +++ b/docs/external_review/summaries/README.md @@ -0,0 +1,45 @@ +# External Review Summaries + +This directory holds Claude-authored summaries and takeaways for the six +external-review files dropped under `docs/external_review/{gemini,chatgpt}/` +by Gemini and ChatGPT. They are inputs to a forthcoming v1 release roadmap +and are NOT themselves the roadmap. 
+ +## Source corpus + +| File | Lines | Role | +|---|---:|---| +| `gemini/gemini_report_v1.md` | 244 | Gemini's first research+roadmap report | +| `gemini/gemini_report_v2.md` | 246 | Gemini's second pass with sharper macro/empirical framing | +| `chatgpt/chatgpt_report_v1.md` | 149 | ChatGPT's first attempt — generic, since superseded | +| `chatgpt/leadforge_report_v1_critique.md` | 678 | Critique of ChatGPT v1 — methodology rebuke | +| `chatgpt/leadforge_second_attempt_guidance.md` | 1167 | Guidance attached to the v2 retry | +| `chatgpt/chatgpt_report_v2.md` | 781 | ChatGPT's evidence-grounded second attempt — the strongest single source | + +Total: 3265 lines, ~240 KB. + +## Summary files (per-source) + +- `gemini_v1_summary.md` +- `gemini_v2_summary.md` +- `chatgpt_v1_summary.md` (brief — historical) +- `chatgpt_v1_critique_summary.md` +- `chatgpt_guidance_summary.md` +- `chatgpt_v2_summary.md` (the substantive review) + +## Synthesis files (across sources) + +- `cross_source_takeaways.md` — themes consolidated across all six sources, agreement vs. divergence vs. unique-to-source +- `key_findings.md` — action-prioritized synthesis: critical → high → medium → low/defer; this is the input list a roadmap would consume + +## What's NOT here yet + +- A consolidated roadmap. That comes after a "process and recommendations" pass on every key finding (accept / accept-with-different-approach / reject / out-of-scope-and-open-issue / defer) with sign-off from the user. + +## How to read this corpus + +1. Start with `key_findings.md` — the action-ranked list. +2. Read `cross_source_takeaways.md` for the agreement/divergence map. +3. Drill into `chatgpt_v2_summary.md` for the most actionable single source (relational-leakage blocker, gap matrix, milestone roadmap). +4. Dip into the per-source summaries when you want to know which reviewer said what, especially for items unique to one source. +5. Original review files remain untouched in `gemini/` and `chatgpt/`. 
diff --git a/docs/external_review/summaries/chatgpt_guidance_summary.md b/docs/external_review/summaries/chatgpt_guidance_summary.md new file mode 100644 index 0000000..b50a281 --- /dev/null +++ b/docs/external_review/summaries/chatgpt_guidance_summary.md @@ -0,0 +1,51 @@ +# Summary — leadforge_second_attempt_guidance.md + +**Source:** `docs/external_review/chatgpt/leadforge_second_attempt_guidance.md` (1167 lines, ~39 KB) +**Author:** ChatGPT (guidance document attached to the v2 retry as a constraint file) +**Verdict in one line:** The methodology spec that produced chatgpt_v2 — also independently useful for its dataset-forensics requirements (leakage probes) and required release-tree structure. + +## Document role + +A "how to do the v2 attempt correctly" instruction document. Required mandatory phases, file-paths-to-inspect tables, commands to run, web queries to issue, citation standards, and a rubric for accepting the result. Treated as a constraint file by the v2 author. + +## Top points + +1. **Mandatory methodology (7 phases):** + - Phase 0 — extract Repomix, build evidence inventory + - Phase 1 — static code audit by repo area (table maps area → files-to-inspect → questions-to-answer) + - Phase 2 — dynamic reproducibility audit (install, run pytest, run the CLI, run the build script, record exit codes) + - Phase 3 — alpha dataset forensic audit (manifest schema, row counts, splits, leakage probes) + - Phase 4 — external research expedition (public dataset census, B2B realism, synthetic-data evaluation, platform requirements) + - Phase 5 — gap matrix (area / current evidence / gap / severity / recommended fix / files+commands / acceptance) + - Phase 6 — roadmap with files, commands, deliverables, acceptance, risks +2. **Dataset-forensics required probes (the v2 leakage finding came out of this):** + - 8.1 Direct leakage: train w/ all features vs w/o suspect cols vs IDs vs post-snapshot aggregates; compare deltas. 
+ - 8.2 Time-window leakage: every public feature must derive from events ≤ `lead_created_at + snapshot_day`; label resolution uses full horizon but not as a feature source (except documented teaching traps). + - 8.3 **Relational leakage**: opportunity status, customer/subscription rows that exist only for conversions, sales activities after snapshot, stage tables, join paths reconstructing `is_sql` / `current_stage` / terminal states. ← This is the probe class that surfaced THE blocker in v2. + - 8.4 Split leakage: same account in train/test, same contact in train/test, near-duplicates across splits, temporal-split overlap. + - 8.5 Model realism: AUC, PR-AUC, Brier, calibration, lift, P@K, R@K, top-decile, expected-value-at-K. +3. **Required release-tree structure** (the v1 release should look like a release, not a code dump): `dataset-cover-image.png`, `docs/{DATASET_CARD,GENERATION_METHOD,VALIDATION_REPORT,FEATURE_DICTIONARY,BREAK_ME_GUIDE,INSTRUCTOR_GUIDE}`, `data/{intro,intermediate,advanced}/{train,validation,test}.csv` + `relational/`, `instructor_companion/intermediate_instructor/`, `validation/` with figures, `notebooks/{01_baseline,02_relational,03_leakage,04_lift_calibration_value}`, `kaggle/dataset-metadata.json` + README, `huggingface/README.md`. +4. **LLM critique loop schema** (output JSON: release_id, model, run_timestamp, overall_score, findings[severity/category/claim/evidence/reproducer/suggested_fix], missing_sections[], questions_for_maintainer[]). Two model families recommended; raw outputs archived; high-severity findings adjudicated by humans before release. +5. **Pitfalls explicitly forbidden:** call implemented modules placeholders; recommend already-existing commands; treat HF card as absent; ignore the v7 track; use outdated platform requirements; treat AUC as the only metric; ignore relational leakage; ignore account/contact split overlap; over-plan LTV/leaderboard for v1 scope. +6. 
**Citation discipline:** repo evidence as `path:Lstart-Lend`, web evidence with title + URL + access date + exact fact supported. Mark unverified items explicitly. +7. **Required final-report TOC** (the structure v2 follows): Executive Summary → Evidence and Method → Current-State Audit (12 subsections) → Alpha Forensics → External Research → Release Spec → Gap Matrix → Roadmap → v2 Feedback Plan → Appendices. +8. **Out-of-scope for v1:** LTV labels as first-class outputs, leaderboard mini-site, second vertical, plugin architecture, web UI. + +## Useful artifacts / templates / schemas + +- Code-area audit table (12 areas × files to inspect × questions) +- Dataset-forensics probe taxonomy (5 categories: direct / time-window / relational / split / model-realism) +- Release-tree spec (`leadforge-v1-lead-scoring/` directory layout) +- LLM-critique JSON output schema +- Acceptance rubric (12 criteria) for the final v2 report +- Suggested opening thesis ("Leadforge is not a blank-slate idea…") + +## Limitations / blind spots + +- Meta document; doesn't itself produce findings, just specifies the process. + +## Items unique to this source + +- The relational-leakage probe class (8.3) — without this in the spec, v2 would not have caught the blocker. +- The "five-lane" framing (framework readiness vs curated dataset readiness vs platform readiness vs educator readiness vs feedback-loop readiness). +- The exact release-tree layout that the v2 author then largely adopted. 
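
## Illustrative sketch: probes 8.2 and 8.4

To make the probe taxonomy above more concrete, here is a minimal Python sketch of two of its checks: the time-window rule from 8.2 (no public feature may derive from events later than `lead_created_at + snapshot_day`) and the split-overlap rule from 8.4 (no account shared between train and test). The function names, the `account_id` and `timestamp` keys, and the toy rows are hypothetical and are not taken from the leadforge codebase; a real probe would read the bundle's actual tables.

```python
from datetime import datetime, timedelta


def shared_keys(train_rows, test_rows, key):
    """Return the values of `key` present in both splits (probe 8.4)."""
    train_keys = {row[key] for row in train_rows}
    test_keys = {row[key] for row in test_rows}
    return train_keys & test_keys


def events_after_cutoff(events, lead_created_at, snapshot_day):
    """Return events later than lead_created_at + snapshot_day (probe 8.2)."""
    cutoff = lead_created_at + timedelta(days=snapshot_day)
    return [event for event in events if event["timestamp"] > cutoff]


# Hypothetical toy data: account "B" appears in both splits.
train = [{"lead_id": 1, "account_id": "A"}, {"lead_id": 2, "account_id": "B"}]
test = [{"lead_id": 3, "account_id": "B"}, {"lead_id": 4, "account_id": "C"}]
overlap = shared_keys(train, test, "account_id")

# Hypothetical events for one lead, checked against a 30-day snapshot.
created = datetime(2026, 1, 1)
events = [
    {"name": "demo_request", "timestamp": datetime(2026, 1, 10)},
    {"name": "contract_signed", "timestamp": datetime(2026, 2, 20)},
]
late = events_after_cutoff(events, created, snapshot_day=30)

print("accounts in both splits:", overlap)
print("events past the snapshot:", [event["name"] for event in late])
```

A non-empty result from either function would be a finding to investigate rather than proof of leakage on its own, since the guidance allows documented teaching traps.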
diff --git a/docs/external_review/summaries/chatgpt_v1_critique_summary.md b/docs/external_review/summaries/chatgpt_v1_critique_summary.md new file mode 100644 index 0000000..13e6d54 --- /dev/null +++ b/docs/external_review/summaries/chatgpt_v1_critique_summary.md @@ -0,0 +1,41 @@ +# Summary — leadforge_report_v1_critique.md + +**Source:** `docs/external_review/chatgpt/leadforge_report_v1_critique.md` (678 lines, ~31 KB) +**Author:** ChatGPT (self-critique of its own v1 report) +**Verdict in one line:** A methodology rebuke that reset the second attempt — also the place where the corrected platform facts (Kaggle 560×280 minimum cover image, `expectedUpdateFrequency` field name) live. + +## Document role + +A forensic critique of `chatgpt_report_v1.md` against the original task prompt. Diagnoses why v1 was inadequate, scores it on a rubric, lays out the better process the v2 author should follow, and ships an improved-roadmap sketch (Milestones A-F) that v2 then expanded. + +## Top points + +1. **Diagnosis:** v1's biggest failure was *methodological under-inspection* — it skimmed architecture docs and platform docs, then inferred a roadmap, instead of building an evidence matrix from the actual code, tests, and release artifacts. +2. **Scorecard verdict:** Prompt compliance C-, Repository review D, Dataset audit C-, External research C, Roadmap C-, Citation D, Strategic usefulness C. +3. **Major factual corrections:** repo is not skeletal (937 tests, ~10.4k LoC under `leadforge/`); CLI exists; HF card and release scripts exist; validation modules exist; `lead_scoring_intro/` v6/v7 track exists. +4. **Process prescription (7 phases):** evidence inventory → static audit → dynamic reproducibility audit → alpha dataset forensics → external research → release spec + acceptance gates → LLM critique loop. v2 follows this. +5. **Distinguish two products:** framework readiness vs curated dataset readiness — must run as parallel lanes with separate acceptance criteria. +6. 
**Concreteness gradient (weak vs strong example):** + - Weak: "Add better validation." + - Strong: "Add `leadforge/validation/release_quality.py` and `scripts/validate_release_candidate.py` that read each bundle's manifest, feature dictionary, task splits, and flat CSV; compute ROC-AUC, PR-AUC, Brier, calibration bins, lift@K, leakage-probe metrics, split-shift summaries, redaction checks, relational rejoin checks; write `validation/validation_report.{json,md}` and figures. Acceptance: no high-severity leakage; metrics within configured difficulty bands; intentional public/instructor diff." +7. **Corrected platform facts (durable, accurate as of 2026-05-05):** + - Kaggle metadata file is `dataset-metadata.json`; supported fields: `title`, `subtitle`, `description`, `id`, `licenses`, `resources`, `keywords`, `expectedUpdateFrequency`, `userSpecifiedSources`, `image`. + - Kaggle cover image: `dataset-cover-image.png` (or `.jpg/.jpeg/.webp`), **minimum 560×280** (not 1200×400 as v1 claimed), with header and thumbnail crops specified. + - HF YAML supports `configs` and `data_files` for splits/subsets; mark one config `default: true`. +8. **Citation discipline:** every consequential repo claim must have file path + line range; every web claim needs URL + access date; bibliography grouped by platform docs / academic / industry / repository files. + +## Useful artifacts / templates / schemas + +- Acceptance rubric for the v2 report (8 dimensions): evidence fidelity, current-state accuracy, research depth, platform correctness, release specificity, pedagogical value, adversarial readiness, citation quality. +- Improved-roadmap sketch (Milestones A-F) — porting v7 lessons explicitly into v1. +- LLM critique JSON output schema (severity / category / claim / evidence / reproducer / suggested_fix). + +## Limitations / blind spots + +- Self-referential — its job is to fix v1, not to do the substantive review itself. Subsumed by v2. 
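
The corrected platform facts in item 7 lend themselves to a mechanical check. The sketch below is a hypothetical helper, not part of the repo: the supported key list and the 560×280 minimum are copied from item 7 as recorded on 2026-05-05 and may drift, and the function names are illustrative only.

```python
from typing import List, Mapping

# Field names as listed in item 7 above (accurate as of 2026-05-05 per the source).
SUPPORTED_KAGGLE_KEYS = frozenset(
    {
        "title",
        "subtitle",
        "description",
        "id",
        "licenses",
        "resources",
        "keywords",
        "expectedUpdateFrequency",
        "userSpecifiedSources",
        "image",
    }
)

# Minimum cover-image size from item 7 (not the 1200x400 that v1 claimed).
MIN_COVER_WIDTH = 560
MIN_COVER_HEIGHT = 280


def unsupported_keys(metadata: Mapping[str, object]) -> List[str]:
    """Return top-level keys that are not in the supported field list."""
    return sorted(set(metadata) - SUPPORTED_KAGGLE_KEYS)


def cover_image_meets_minimum(width: int, height: int) -> bool:
    """Check pixel dimensions against the documented minimum."""
    return width >= MIN_COVER_WIDTH and height >= MIN_COVER_HEIGHT


if __name__ == "__main__":
    draft = {"title": "Leadforge lead scoring", "updateFrequency": "never"}
    print(unsupported_keys(draft))  # flags the outdated field name
    print(cover_image_meets_minimum(560, 280))
```

A check like this would have caught the `updateFrequency` slip that the critique attributes to v1, since that name is not in the supported list.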
+ +## Items unique to this source + +- The 8-dimension evidence-fidelity / current-state-accuracy / etc. acceptance rubric that should be applied to any future report. +- Corrected platform facts with timestamps. +- The framework-vs-dataset lane separation argument. diff --git a/docs/external_review/summaries/chatgpt_v1_summary.md b/docs/external_review/summaries/chatgpt_v1_summary.md new file mode 100644 index 0000000..470c5c3 --- /dev/null +++ b/docs/external_review/summaries/chatgpt_v1_summary.md @@ -0,0 +1,31 @@ +# Summary — chatgpt_report_v1.md + +**Source:** `docs/external_review/chatgpt/chatgpt_report_v1.md` (149 lines, ~24 KB) +**Author:** ChatGPT (first attempt, since superseded) +**Verdict in one line:** Generic planning memo that under-inspected the repo; superseded by chatgpt v2. + +## Document role + +ChatGPT's first attempt at the same brief Gemini received. It treated leadforge as mostly skeletal, recommended building things that already exist (CLI, HF card, validation), and used non-portable browser-internal citations. Its own follow-up critique (`leadforge_report_v1_critique.md`) details what went wrong. + +## Top points (the parts that survive the critique) + +1. Same temporal-leakage emphasis (prediction-time boundary, post-event aggregate filtering). +2. Same industry-benchmarks emphasis (resemblance, utility, privacy axes). +3. Same suggestion to add LLM critique loops. +4. Same insistence on Datasheets-for-Datasets / Data Cards Playbook compliance. +5. Notes the Simula framing (datasets-as-functions; programmable diversity, complexity, quality axes) — a pointer worth following up. + +## Limitations / why it was discarded + +- Misclassified the repo as mostly skeletal; recommended implementing already-implemented modules. +- Said "no Kaggle/HF packaging" while the repo has `release/HF_DATASET_CARD.md`, `release/README.md`, and `scripts/build_public_release.py`. 
+- Said built-in evaluation is missing while `leadforge/validation/{bundle_checks,realism,difficulty,drift}.py` exist. +- Ignored the `lead_scoring_intro/` v6/v7 track entirely. +- Conflated framework-readiness and dataset-readiness lanes. +- Used unverified-or-outdated platform claims (Kaggle 1200×400 image, `updateFrequency` instead of `expectedUpdateFrequency`). +- Citations like `【176731919908143†L15-L89】` are not portable outside the chat environment. + +## Why this file is in the corpus + +For traceability — it is the failed-first-attempt that prompted the critique and the guidance file. The substantive ChatGPT contribution lives in `chatgpt_report_v2.md`, not here. diff --git a/docs/external_review/summaries/chatgpt_v2_summary.md b/docs/external_review/summaries/chatgpt_v2_summary.md new file mode 100644 index 0000000..8df8bd7 --- /dev/null +++ b/docs/external_review/summaries/chatgpt_v2_summary.md @@ -0,0 +1,86 @@ +# Summary — chatgpt_report_v2.md + +**Source:** `docs/external_review/chatgpt/chatgpt_report_v2.md` (781 lines, ~49 KB) +**Author:** ChatGPT (second attempt, evidence-grounded) +**Verdict in one line:** The single most actionable artifact in this corpus — a forensic, file:line-cited audit + 7-milestone roadmap that surfaces THE release blocker (relational leakage in `student_public` mode). + +## Document role + +The substantive ChatGPT review — evidence-first, repo-aware, release-oriented. Builds on the critique (what went wrong with v1) and the guidance (how to do v2). Followed the prescribed methodology and produced a verdict + gap matrix + roadmap. + +## Top points (ranked by importance to the v1 release) + +### THE release blocker — verify before anything else +1. **Public relational tables leak the label end-to-end.** In a 500-lead `student_public` smoke bundle generated locally: + - `tables/leads.parquet` still contained `converted_within_90_days` and `conversion_timestamp`. 
+ - `tables/opportunities.parquet.close_outcome == "closed_won"`, plus `customers` and `subscriptions` existing only for converted leads, **reconstructs the target with 100% accuracy** via joins. + - Acceptable only if those relational tables are documented as post-outcome world records, not if they are marketed as feature-engineering inputs for the lead-scoring task. + - Recommended fix: **snapshot-safe relational export** for public bundles (drop target/timestamp from `leads`, drop `close_outcome`/`closed_at` from `opportunities`, omit `customers`/`subscriptions` from public). Move full-horizon tables to instructor companion only. + +### Audit findings — by repo area, with evidence +2. Architecture and design docs aligned with implementation (`README.md:L1-L6,L34-L56,L74-L127`). Strength. +3. **Versioning friction:** `pyproject.toml` declares `version = "1.0.0"` and `Production/Stable`, while the public dataset is still alpha. Recommends naming the upcoming dataset release explicitly (e.g., `leadforge-lead-scoring-v1`) so package-version vs framework-maturity vs curated-dataset-v1 don't get conflated. +4. Public API exists (`generator.py:L43-L122,L124-L248`) — vertical-slice generator is real. Gap: no release-oriented APIs (`leadforge.release.build_release_candidate()` etc.). +5. CLI exists (`generate`, `inspect`, `validate`, `list-recipes`). Gaps: no `release` subcommands, no `--json` (recently shipped on `inspect`), no dry-run publishing, no credential checks. +6. Recipes + difficulty profiles are first-class. Gap: alpha baselines show LR AUC ≈0.87-0.89 across all tiers and HistGBM ≈0.866-0.868 — too flat to demonstrate "stronger modeling lifts realistically." Need difficulty gates on calibration, lift, P@K, AP, model-family deltas. +7. Hidden-graph + motif sampler implemented (5 motif families, rewiring, validation). Gap: no public-facing diversity summary across seeds. +8. 
Simulation engine: real 90-day discrete-time simulator with stage transitions, conversion hazards, churn, direct conversion, post-conversion entities. Gap: data card needs a "simulation simplifications" section listing what's modeled / approximate / not modeled. +9. Bundle writer + snapshot logic implemented. **Critical gap:** flat task path is much safer than the full relational path; relational rendering needs the snapshot-safe variant. +10. Exposure modes work (`student_public` vs `research_instructor`); redaction targets known columns (`current_stage`, `is_sql`). Gap: redaction does not enforce "no public join path reconstructs label" — needs `leadforge/validation/relational_leakage.py`. +11. Validation suite is real and broad (`bundle_checks`, `realism`, `difficulty`, `drift`, `lead_scoring`). Gap: not yet a single reproducible release report with charts, calibration, Brier/log loss, leakage probes, public/instructor diff assertions, cross-seed bands, LLM critique. +12. Release tooling partial: `scripts/build_public_release.py`, `release/HF_DATASET_CARD.md` (with YAML), `release/README.md` exist. Missing: Kaggle metadata, final HF README with `pretty_name`/`tabular`/`datasets`/`pandas`/`default: true`/tested configs, cover image, publisher scripts, post-upload smoke tests, more notebooks. +13. CI runs lint/mypy/pytest plus v5/v6/v7 dataset validation jobs. Missing: release-candidate workflow. + +### Alpha forensics +14. Alpha LR baselines: intro 41.5% conv → AUC 0.886 / AP 0.785 / P@100 79%; intermediate 20.1% → 0.880/0.559/65%; advanced 7.9% → 0.870/0.271/26%. AP and P@K degrade meaningfully across tiers; AUC stays flat. v1 needs to show difficulty in calibration, lift, value capture, and stronger-model deltas, not only AP. +15. **v7 lessons to carry into v1:** + - Keep student path simple and safe. + - Keep leakage traps clearly separated from student-facing features. + - Teach value-aware ranking (not just probability ranking). 
+ - Include cohort/time-shift evaluation. + - Make tree/GBM lift over LR visible but not absurd. + - Document limitations bluntly. + - Provide a lecture/notebook sequence. + +### External research grounding +16. **Public lead-scoring dataset census:** X Education on Kaggle (9240 rows, flat, overused, leakage-suspect status fields) → `shawhin/lead-scoring-x` on HF (5688 rows, only 7 features) → UCI Online Shoppers (12330 sessions, e-commerce not B2B). Gap is real; leadforge can plausibly be best-in-class. +17. **Industry realism citations:** HubSpot (fit + engagement + combined scoring), Salesforce, Adobe RT-CDP B2B (predictive lead/account scoring → opportunity-stage events, account-level activity aggregation, tree models). Frontiers 2025 paper (real CRM, Jan 2020 - Apr 2024, 23154 records, 67 fields, 15 classifiers, gradient boosting wins; key features: source, status, reason for status, last activity). +18. **Synthetic data evaluation:** SDMetrics quality report (column shapes, column-pair trends, multi-table cardinality + intertable trends). Datasheets for Datasets + Data Cards Playbook. +19. **Kaggle requirements (verified from official docs):** `dataset-metadata.json` adjacent to files, Data Package style, fields: `title` (6-50 chars), `subtitle` (20-80 chars), `description`, `id` (slug 3-50 chars), `licenses` (one entry), `resources` (with `schema.fields` in order if provided), `keywords`, `expectedUpdateFrequency` (never/annually/quarterly/monthly/weekly/daily/hourly), `userSpecifiedSources`, `image`. Cover image `dataset-cover-image.png/.jpg/.jpeg/.webp`, **minimum 560×280**, with 2:1 header and 1:1 thumbnail crops. +20. **HF requirements:** README.md as dataset card, YAML metadata (license, language, tags, size, configs/data_files), `load_dataset()` viewer support, configs with `default: true`, manual split/subset configuration documented. + +### Recommended v1 release shape (canonical tree) +21. 
Public Kaggle/HF release: intro/intermediate/advanced flat lead-scoring task splits + snapshot-safe relational tables + feature dictionary with leakage flags + validation report + charts + notebooks + data card + break-me guide. +22. Separate instructor/research companion: full world graph + latent registry + mechanisms + full-horizon relational tables + leakage-trap materials + reproducibility manifest. +23. Recommended that the instructor companion live in a separate GitHub Release artifact or HF repo/config, NOT in the default Kaggle dataset. +24. Notebooks (4): `01_intro_flat_csv_baseline` → `02_relational_feature_engineering` → `03_leakage_and_time_windows` → `04_lift_calibration_value_ranking`. All must run top-to-bottom and reproduce validation metrics within tolerance. + +### "v1 ready" definition +25. Fresh release candidate generates from code; passes structural, snapshot, redaction, **relational-leakage**, split-leakage, calibration, lift, top-K, value-ranking, and platform-packaging checks; renders valid Kaggle and HF packages; notebooks run top-to-bottom; no unresolved high-severity LLM/human review findings. + +## Useful artifacts / templates / schemas + +- Gap matrix (Area / Current evidence / Gap / Severity / Recommended fix / Acceptance criterion) — directly portable to a roadmap. +- 7-milestone roadmap: Audit → Snapshot-safe relational → Platform packaging → Validation hardening → Docs+notebooks → LLM critique → Dry-run publish + feedback. +- Release-validation metric checklist (~25 items): row counts, class balance, ROC/PR-AUC, log loss, Brier, calibration, lift@1/5/10%, P@50/100, recall@K, top-decile rate, expected ACV at K, model deltas (LR vs GBM vs source-only vs engagement-only vs leakage-probe vs ID-only vs stage-only vs post-snapshot-aggregates), account/contact overlap, near-duplicates, public/instructor diff, snapshot-window audit, relational-join leakage audit, cross-seed stability, cross-tier difficulty ordering. 
+- v2 feedback triage labels: critical-leakage / realism / difficulty / documentation / platform / notebook / pedagogy / v2-idea / out-of-scope-v1. + +## Limitations / blind spots + +- Test suite full run timed out at 53% (300s budget) — not a failure, just incomplete dynamic verification. +- Did not upload to Kaggle/HF; did not run `load_dataset()` against a real HF repo; did not run a full multi-model leakage probe beyond the smoke-bundle finding. +- Did not download every alpha Parquet file from the public dataset repo — relied on public GitHub pages + locally-regenerated artifacts. +- Notes M12 polish items (`--json` on `inspect`) as gaps without realizing they shipped in PR #60 the same day this report was prepared. + +## Items unique to this source + +- The relational-leakage finding (THE blocker) +- File:line-cited current-state audit +- Concrete gap matrix with severities +- Snapshot-safe relational export design +- Public-vs-companion split with explicit recommendation to keep companion off Kaggle +- v1-ready definition (the acceptance contract) +- v2 feedback triage labels +- Frontiers 2025 paper as a real B2B-realism citation (23154 records, 15 classifiers) +- "Pin the timestamp; verify byte-equal regeneration" as a release-readiness check +- Alpha LR/HistGBM gap finding (model-family deltas should be larger to reward sophistication) diff --git a/docs/external_review/summaries/cross_source_takeaways.md b/docs/external_review/summaries/cross_source_takeaways.md new file mode 100644 index 0000000..f905b4c --- /dev/null +++ b/docs/external_review/summaries/cross_source_takeaways.md @@ -0,0 +1,161 @@ +# Cross-Source Takeaways + +A consolidation of the six external review files. Each theme below tracks +where the agreement is strong, where reviewers diverge, and where one +source surfaces something the others miss. 
+ +Source codes used: +- `G1` = gemini_v1 +- `G2` = gemini_v2 +- `C1` = chatgpt_v1 +- `Crit` = chatgpt v1 critique +- `Guid` = chatgpt second-attempt guidance +- `C2` = chatgpt_v2 (the substantive one) + +--- + +## 1. Strongly agreed themes (act on these) + +### 1.1 Temporal leakage prevention is the foundational concern +All sources lead with this. (G1, G2, C1, C2) +- Strict `prediction_timestamp` / snapshot boundary +- Aggregations strictly bounded to events ≤ snapshot +- Label resolution uses full label horizon but not as a feature source + +### 1.2 LLM-as-a-judge integration belongs in CI +(G1, G2, C1, C2, Guid) +- Reference-less rubric scoring of synthetic trajectories +- Logical coherence + behavioral plausibility + semantic diversity + syntax validity +- Strict numeric thresholds halt the build on failure +- C2/Guid contribute the concrete output JSON schema (severity / category / claim / evidence / reproducer / suggested_fix) +- G2 contributes bias mitigation (forced-rationale prompting) +- G1 contributes DeepEval as a candidate framework +- C2 recommends ≥2 model families with adjudication of high-severity findings before release + +### 1.3 Lift / calibration / P@K / value-aware ranking, not raw AUC +(G1, G2, C2) +- Decile lift charts as a headline metric (G1, G2) +- Calibration curves + Brier + log loss (C2) +- Top-K precision and recall (C2) +- Expected-value-at-K — `P(convert) × expected_acv` (C2 from v7 lineage) + +### 1.4 Industry-calibrated funnel benchmarks +(G1, G2) +- Channel-conditional MQL→SQL rates (G2: SEO 51%, PPC 26%, Email <1%) — strongest differential predictor design +- Top-quartile vs baseline contrast across all funnel stages +- Sales-cycle distributions sampled from log-normal/Weibull (G2) + +### 1.5 Release as a family, not a single CSV +(C2 explicit; G1/G2 implicit) +- intro / intermediate / advanced public tiers + instructor companion +- Public bundle = flat task splits + snapshot-safe relational tables + feature dict + validation 
report + notebooks + data card + break-me guide +- Instructor companion = full hidden graph, latent registry, mechanisms, full-horizon relational tables, leakage-trap materials + +### 1.6 Platform packaging must be programmatic +(G1, G2, C2, Guid) +- Kaggle: `dataset-metadata.json` generator, dry-run command, cover image +- HF: README.md with YAML configs/default/pretty_name/tabular tag, `load_dataset()` smoke test +- CI/CD: GitHub Actions with `HF_TOKEN`/`KAGGLE_USERNAME`/`KAGGLE_KEY` secrets, dry-run publishing +- Use `huggingface_hub` library and `kagglehub` library (not raw CLI) for Python integration + +### 1.7 Companion notebook(s) are non-negotiable +- G1: "masterclass starter notebook" — single deep notebook +- G2: Kaggle Solution Write-Up rubric (Context / Overview / Details / Sources) +- C2: 4-notebook sequence — baseline → relational FE → leakage demo → lift/calibration/value +- C2's sequence wins on pedagogical depth and aligns with v7 lecture sequencing + +### 1.8 Adversarial public framing + feedback loop +- G1: explicit "challenge community to break it" +- C2: explicit issue templates + break-me guide + triage labels + v2 decision log +- Public invitation to find leakage / break baselines / report unrealistic distributions +- Triage taxonomy: critical-leakage / realism / difficulty / documentation / platform / notebook / pedagogy / v2-idea / out-of-scope-v1 + +### 1.9 Dataset card adheres to Datasheets / Data Cards Playbook +(G1, G2, C1, C2) +- Provenance, motivation, content, quality, privacy, biases/limitations, intended use, out-of-scope use + +--- + +## 2. Divergent themes (resolve before roadmap) + +### 2.1 What's the biggest v1 risk? +- **C2:** Public relational tables leak the target with 100% accuracy via join paths through `opportunities.close_outcome` + `customers`/`subscriptions` existence. THE blocker. +- **G1/G2:** Don't surface this; their leakage worry is the temporal one, which the engine has partially addressed. 
+- **Resolution:** C2 is right; G1/G2 missed it because they didn't open the bundles. Verify locally as the very first thing. + +### 2.2 How much DGP work is needed before release? +- **G1/G2:** Significant. "Inject non-linear complexity," "deeper funnel calibration," "channel-conditional probabilities," "demographic noise injection." +- **C2:** "Leadforge is much further along than greenfield. v1 is release hardening + adversarial validation, not core implementation." +- **Resolution:** C2 has the evidence (937 tests, vertical-slice generator, alpha bundles). G1/G2's DGP recommendations are still useful inputs but should be prioritized against current state, not assumed greenfield. + +### 2.3 Is a single masterclass notebook enough, or do we need a sequence? +- **G1:** One masterclass starter notebook with baseline + decile lift chart. +- **C2:** Four notebooks (baseline, relational FE, leakage demo, lift/value). +- **Resolution:** C2's sequence is stronger pedagogically and matches v7-track lessons (4 lectures already designed in `RELEASE_v7.md`). Use the sequence. + +### 2.4 Should the instructor companion ship to Kaggle? +- **C2:** No — separate GitHub Release artifact or HF repo/config. Don't put hidden truth on the public Kaggle page. +- **G1/G2:** Don't address the instructor-companion question explicitly. +- **Resolution:** C2's instinct is sound — keep it separate to preserve the leakage trap's pedagogical value. + +### 2.5 What should the LLM judge actually score? +- **G1:** Logical coherence + behavioral plausibility + narrative consistency, scored 1-10. +- **G2:** Adds effective semantic diversity (mode collapse check) and syntax validity. +- **C2:** Adds severity/category/evidence/reproducer/suggested-fix structure for findings; treats it as a release-quality gate not a per-row scorer. +- **Resolution:** Combine — per-trajectory rubric scoring (G1+G2) AND per-release findings document (C2). The latter is more important for v1. + +--- + +## 3. 
Items only one source surfaces (worth absorbing) + +### Only in G1 +- DeepEval as a concrete LLM-judge framework name +- "Hidden Gems" Kaggle notebook quality reference + +### Only in G2 +- Channel-conditional MQL→SQL rates as differential predictor design +- Log-normal / Weibull sales-cycle long-tail distributions +- Demographic noise injection (job title permutations forcing NLP) +- Mode collapse / semantic diversity validation as an explicit dimension +- LLM-judge bias mitigation via forced rationale (analytical decomposition before scoring) +- Group/similarity leakage from latent-seed duplication +- 2024-2026 SaaS macroeconomic framing (CAC ratio +14%, growth decline) as pedagogical motivation +- Kaggle Solution Write-Up rubric (Context / Overview / Details / Sources) + +### Only in C1 (mostly historical) +- Simula framing (datasets-as-functions; programmable diversity / complexity / quality axes) +- Reference to BlueGen AI's Data Plagiarism Index / authenticity scores + +### Only in Crit +- Acceptance rubric for evaluating any future report (8 dimensions) +- Corrected platform facts (Kaggle 560×280 minimum, `expectedUpdateFrequency` field name) +- Framework-vs-dataset lane separation argument + +### Only in Guid +- Mandatory dataset-forensics probe taxonomy (5 categories) — the spec that surfaced the relational-leakage blocker +- Required release-tree layout (`leadforge-v1-lead-scoring/` structure) +- 12-criterion acceptance rubric for the report itself +- Pitfalls list (don't ignore v7 track, don't treat HF card as absent, etc.) 
+ +### Only in C2 +- The relational-leakage blocker finding (verified via local smoke bundle) +- File:line-cited current-state audit +- Gap matrix with severities +- Snapshot-safe relational export design +- Public-Kaggle / instructor-companion split recommendation +- Concrete v1-ready definition +- v2 feedback triage labels +- Frontiers 2025 paper as a real B2B realism citation (23154 records, 15 classifiers) +- "Pin the timestamp; verify byte-equal regeneration" as a release-readiness check +- Alpha LR/HistGBM gap finding (model-family deltas should be larger to reward sophistication) + +--- + +## 4. Where the corpus is silent (worth flagging in the roadmap) + +- **No reviewer addresses the engineering cost** of any recommendation against the current state of the codebase. +- **No reviewer offers prioritization between LLM critique investment and snapshot-safe relational** — both are recommended as critical, but with different cost/risk profiles. +- **No reviewer specifies the cover-image content or sourcing.** +- **No reviewer addresses how v1 lessons should feed back into the framework (vs into a v2 dataset)** — i.e., should a release-blocking issue cause a major framework version bump? C2 separates package version from dataset release name, which is the closest answer. +- **No reviewer specifies what difficulty bands the release validation should enforce** (only that bands should exist). +- **No reviewer engages with the v0.1.0-alpha datasets repo's reviewer-targeted artifacts** (`build.sh`, `provenance.json`, `BASELINES.md`, `EXPOSURE_DELTA.md`, `validation.log`) beyond surface mention. These already model some of what's recommended. 
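
Section 1.3 above names top-K precision and expected-value-at-K (`P(convert) × expected_acv`) as release metrics. A minimal sketch of both follows, assuming plain Python sequences of scores, binary outcomes, and per-lead expected ACV; the function names are hypothetical and the toy numbers are invented purely for illustration.

```python
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores, highest first (ties keep input order)."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def precision_at_k(p_convert: Sequence[float], converted: Sequence[int], k: int) -> float:
    """Share of true conversions among the k leads ranked highest by probability."""
    top = top_k_indices(p_convert, k)
    return sum(converted[i] for i in top) / k


def expected_value_at_k(
    p_convert: Sequence[float], expected_acv: Sequence[float], k: int
) -> float:
    """Sum of P(convert) * expected_acv over the k leads ranked by that product."""
    values = [p * acv for p, acv in zip(p_convert, expected_acv)]
    return sum(values[i] for i in top_k_indices(values, k))


if __name__ == "__main__":
    # Invented toy data: four leads.
    p_convert = [0.9, 0.5, 0.2, 0.8]
    expected_acv = [1000.0, 10000.0, 50000.0, 500.0]
    converted = [1, 0, 0, 1]
    print(precision_at_k(p_convert, converted, k=2))
    print(expected_value_at_k(p_convert, expected_acv, k=2))
```

In the toy data the two rankings pick different leads: probability ranking selects the two most likely converters, while value ranking selects the two leads with the largest probability-weighted ACV. That divergence is the point of the value-aware ranking lesson carried over from the v7 lineage.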
diff --git a/docs/external_review/summaries/gemini_v1_summary.md b/docs/external_review/summaries/gemini_v1_summary.md new file mode 100644 index 0000000..605977d --- /dev/null +++ b/docs/external_review/summaries/gemini_v1_summary.md @@ -0,0 +1,44 @@ +# Summary — gemini_report_v1.md + +**Source:** `docs/external_review/gemini/gemini_report_v1.md` (244 lines, ~51 KB) +**Author:** Gemini (research-report style with academic-form citations) +**Verdict in one line:** Useful research synthesis on temporal leakage, funnel realism, LLM-as-judge, and platform packaging — but does not audit the actual repo and assumes work is needed where work has shipped. + +## Document role + +A "what should a best-in-class synthetic CRM dataset look like?" report. Mostly external-evidence-driven (industry benchmarks, academic citations) with a 5-phase roadmap. Treats leadforge primarily as ambition, not as code. + +## Top points (ranked by usefulness) + +1. **Temporal leakage as the single biggest threat.** Demands a strict `prediction_timestamp` boundary; flat-CSV projection of relational data must filter out events with timestamps after the boundary. Programmatic guarantee, not just documentation. +2. **Industry funnel benchmarks (table form):** + - Visitor → Lead: 2–3% median, 4–6% top quartile + - Lead → MQL: 23% median, 31–41% top quartile + - MQL → SQL: 13% median, 28–40% top quartile (the modeling battleground) + - SQL → Opportunity: 56% median, 73% top quartile + - Lead → Customer: 1.3% (enterprise) – 2.7% (SMB), 5%+ top +3. **Decile lift / top-decile capture as the headline metric**, not raw classification accuracy. Basic demo+behavioral counting on flat tables typically yields 2-4× lift on top decile in B2B; complex relational data should obscure top signals so simple LR doesn't trivially win. +4. **LLM-as-a-judge integration.** DeepEval-style framework with GPT-4-class or Claude 3.5 Sonnet as instruction-tuned judge. 
Sample lead trajectories → 1–10 score on logical coherence, behavioral plausibility, narrative consistency. CI fails on threshold breach or flagrant logical impossibility. +5. **HF dataset card YAML schema (specific keys):** `language`, `license`, `task_categories: tabular-classification`, `tags: [synthetic, crm, lead-scoring, b2b]`, `pretty_name`. Kaggle: `dataset-metadata.json` with title, slug, license, file paths. +6. **CI/CD via GitHub Actions** wraps Hugging Face `huggingface_hub` (create_repo, upload_file/folder) and Kaggle CLI (`kaggle datasets create / version`) with `HF_TOKEN`, `KAGGLE_USERNAME`, `KAGGLE_KEY` as repository secrets. +7. **Companion "masterclass" starter notebook** is non-negotiable: EDA → temporal-leakage explainer → LR baseline → LightGBM/XGBoost → decile lift chart establishing the community baseline. +8. **Adversarial public framing:** publicly invite the community to break the DGP — fastest path to v2 robustness. + +## Useful artifacts / templates / schemas + +- HF YAML schema table (fields × required/recommended) +- Funnel benchmark table (median vs top-quartile, by stage) +- 5-phase roadmap: DGP refinement → LLM judge → metadata + data card → CI/CD → starter notebook + adversarial challenge + +## Limitations / blind spots + +- Does not inspect the leadforge repo. Treats simulation engine, validation, CLI, HF card as "to build" when they exist. +- Does not catch the relational-table leakage chatgpt v2 surfaces as THE blocker. +- Cited Kaggle image dimensions ("1200×400") and outdated metadata field names; chatgpt v1 critique flagged these. +- Citations are bracketed reference IDs (`[10]`, `[11]`, …) without portable URLs anchored to the report body. 
+ +## Items unique to this source (not duplicated as strongly elsewhere) + +- Funnel benchmarks expressed by quartile (median vs top-quartile vs SMB vs enterprise) +- DeepEval as a concrete framework recommendation +- "Masterclass starter notebook" framing for the launch deliverable diff --git a/docs/external_review/summaries/gemini_v2_summary.md b/docs/external_review/summaries/gemini_v2_summary.md new file mode 100644 index 0000000..47b8a62 --- /dev/null +++ b/docs/external_review/summaries/gemini_v2_summary.md @@ -0,0 +1,52 @@ +# Summary — gemini_report_v2.md + +**Source:** `docs/external_review/gemini/gemini_report_v2.md` (246 lines, ~46 KB) +**Author:** Gemini (revised second attempt) +**Verdict in one line:** Same shape as v1 but sharper — adds macroeconomic motivation, channel-conditional rates, sales-cycle distributions, and LLM-judge bias mitigation; still does not audit the repo. + +## Document role + +A second pass at the same brief. Tighter, more empirically anchored than v1, but with the same blind spot toward the existing codebase. Adds 2024-2026 SaaS macro framing as pedagogical motivation. + +## Top points (ranked by what's new vs v1) + +1. **Macro framing as pedagogical justification.** 2024-2026 SaaS environment: median growth rate dropped from 30% (2023) to 25% (2025); New CAC Ratio rose 14% in 2024 (~$2 spent per $1 ARR). Frames lead scoring as survival-critical, motivating realism investment. +2. **Channel-conditional MQL→SQL rates** as a strong differential predictor: + - SEO ~51% + - PPC ~26% + - Email <1% + - Cybersecurity 15-18%, Fintech 11-19% + - DGP should produce sharply different conversion probabilities by channel; `lead_source` becomes a meaningful feature, not a uniform 13% prior. +3. **Top-quartile vs baseline contrast** to make difficulty tiers meaningful: + - MQL→SQL baseline 13-15%, top-quartile 28-40% + - SQL→Opp baseline 10-12%, top-quartile 45-60% + - Opp→Won baseline 6-9%, top-quartile 20-35% +4. 
**Sales-cycle distributions:** median ~84 days, optimized 46-75 day window; sample with log-normal or Weibull to produce a realistic delayed-conversion long tail that confounds linear time-series forecasting. +5. **Demographic noise injection:** instead of standardizing "VP of Operations," produce variants ("Head of Ops", "Director of Global Operations", "Operations VP") to force NLP / categorical embedding cleanup before modeling. +6. **Mode collapse risk in synthetic generators:** explicitly validate effective semantic diversity of generated cohorts so every "happy-path" trajectory isn't identical. Without this, synthetic data loses pedagogical breadth. +7. **LLM-judge bias mitigation:** verbosity bias and self-preference bias are known LLM-evaluator failure modes. Mitigate by **forced-rationale** prompts — model emits step-by-step analytical decomposition before assigning a numerical score. +8. **Group/similarity leakage:** synthetic engines can produce near-duplicates from similar latent seeds; if these end up split across train/test, models memorize. Require time-based / temporally-shifted splits over random shuffle. +9. **HF `huggingface/hub-sync` GitHub Action** + Kaggle `kagglehub.dataset_upload()` Python library (preferred over CLI) with `version_notes` parameter for commit-hash lineage. +10. **Kaggle Solution Write-Up rubric (4 pillars):** Context → Overview of Approach → Details of Data → Sources. Notebook should follow this structure with mathematical exploration of DGP and explicit anti-leakage explanation. + +## Useful artifacts / templates / schemas + +- Channel × stage transition probability matrix +- HF YAML metadata key list (with `configs`, `default: true`) +- 5-phase roadmap: stat calibration → anti-leakage → LLM judge → CI/CD + docs → starter notebook + +## Limitations / blind spots + +- Same as v1 — no repo audit, no awareness of current code state, does not detect the relational-table leakage. 
+- Funnel benchmark numbers source-cited but not always cross-referenced; some industry sources are vendor-blog quality. + +## Items unique to this source (relative to v1 and chatgpt v2) + +- Channel-conditional conversion rates (SEO 51% vs Email <1% MQL→SQL) +- Log-normal / Weibull sales-cycle long-tail distributions +- Demographic noise injection forcing NLP cleanup +- Mode collapse / semantic diversity validation as an explicit dimension +- LLM-judge verbosity-bias / self-preference-bias mitigation via forced rationale +- Group/similarity leakage from latent-seed duplication +- 2024-2026 SaaS macroeconomic framing as pedagogical justification +- Kaggle Solution Write-Up rubric (Context / Overview / Details / Sources) diff --git a/docs/external_review/summaries/key_findings.md b/docs/external_review/summaries/key_findings.md new file mode 100644 index 0000000..516f18b --- /dev/null +++ b/docs/external_review/summaries/key_findings.md @@ -0,0 +1,151 @@ +# Key Findings — Action-Prioritized Synthesis + +Distilled from the six review files. Items are ranked by what they imply +for the v1 release. This file is NOT a roadmap; it is the input list a +roadmap would consume after a process-and-recommendations sign-off pass. + +## Severity legend + +- **CRITICAL** — release blocker; must verify and resolve before any v1 publish +- **HIGH** — release-quality gate; should resolve before v1 publishes or ship with explicit acknowledgment +- **MEDIUM** — improves the release substantially but could be deferred to a fast v1.1 +- **LOW / Defer** — accepted-with-different-approach or out-of-scope candidates + +--- + +## CRITICAL + +### 1. 
Public relational tables reconstruct the label +**Source:** chatgpt_report_v2.md §0, §2.7, §3.5, §6 (gap matrix), §7 (Milestone 2) +**Evidence:** Local 500-lead `student_public` smoke bundle: +- `tables/leads.parquet` contained `converted_within_90_days` and `conversion_timestamp` +- `tables/opportunities.parquet.close_outcome == "closed_won"` + `customers` + `subscriptions` reconstruct the target with 100% accuracy via joins + +**Implication:** v1 cannot ship as best-in-class until either (a) public relational tables are made snapshot-safe (drop target/timestamp from `leads`, drop `close_outcome`/`closed_at` from `opportunities`, omit `customers`/`subscriptions` from public), or (b) full-horizon relational tables are moved entirely to an instructor companion. + +**First action:** reproduce the finding locally on the alpha bundle and confirm severity before designing the fix. + +--- + +## HIGH + +### 2. Difficulty signal is too flat across the alpha tiers on AUC +**Source:** chatgpt_report_v2.md §2.4, §3.2 +**Evidence:** Alpha LR AUC 0.886 / 0.880 / 0.870; HistGBM 0.866-0.868. AP and P@K do degrade meaningfully (intro 0.785 / 79% → advanced 0.271 / 26%), but model-family deltas don't reward sophistication enough. +**Implication:** v1 difficulty gates must include calibration, lift, P@K, AP, and model-family deltas — not just AUC. The release is a teaching dataset; "GBM beats LR realistically" is a pedagogical requirement. + +### 3. No Kaggle `dataset-metadata.json` generator +**Source:** chatgpt_report_v2.md §2.10, §4.4, §7 Milestone 3 / Guid §11 / G1 / G2 +**Evidence:** Repo has `release/HF_DATASET_CARD.md` and `release/README.md` but no Kaggle metadata. 
+**Implication:** Build `scripts/package_kaggle_release.py` that produces validated `dataset-metadata.json` (correct field names, `expectedUpdateFrequency`, title 6-50 chars, subtitle 20-80 chars, slug 3-50 chars), copies a `dataset-cover-image.png` (≥560×280, with 2:1 header and 1:1 thumbnail crops), validates against current Kaggle requirements, and supports a dry-run mode. + +### 4. HF README needs hardening to be a real dataset card +**Source:** chatgpt_report_v2.md §2.10, §4.5; Crit §10; Guid §7 +**Evidence:** Existing `release/HF_DATASET_CARD.md` has YAML configs but is not the final repo `README.md`. +**Implication:** Build `scripts/package_hf_release.py` that emits a final `README.md` with `pretty_name`, `tags: [tabular, lead-scoring, synthetic-data, crm, b2b, datasets, pandas]`, `configs` for all tiers with `default: true` on the main config, and a verified local `load_dataset()` smoke test. + +### 5. Release validation must move beyond `leadforge validate` +**Source:** chatgpt_report_v2.md §2.9, §5.6, §7 Milestone 4; Guid §10 Milestone D; G1/G2 Phase 2/3 +**Evidence:** Current validation handles structural / FK / leakage-column / realism / difficulty / drift but produces no charts, no calibration curves, no Brier/log loss, no relational-leakage probes, no public-vs-instructor diff assertion, no cross-seed bands. +**Implication:** Add `leadforge/validation/release_quality.py` + `leadforge/validation/leakage_probes.py` + `leadforge/validation/reporting.py` + `scripts/validate_release_candidate.py`. Output `release/validation/validation_report.{json,md}` and `figures/{lift_curve_*,calibration_intermediate,leakage_delta,split_shift,value_capture}.png`. Acceptance: no critical leakage findings, metrics in tier bands, charts auto-generated. + +### 6. Snapshot-safe relational export design needs to land before any data goes public +**Source:** chatgpt_report_v2.md §5.2, §7 Milestone 2 +**Evidence:** Direct consequence of finding #1. 
+**Implication:** New module `leadforge/render/relational_snapshot_safe.py` + new validator `leadforge/validation/relational_leakage.py`. Public relational tables must filter event tables to `timestamp <= lead_created_at + snapshot_day`, drop terminal-state fields, omit conversion-conditional entities. Full-horizon stays in instructor companion only. + +### 7. Notebook sequence (4 notebooks) — only one exists today +**Source:** chatgpt_report_v2.md §5.5, §7 Milestone 5; Guid §10 Milestone E; G1/G2 implicit +**Evidence:** Only one release notebook (`01_baseline_lead_scoring.ipynb`) exists. +**Implication:** Add `02_relational_feature_engineering.ipynb`, `03_leakage_and_time_windows.ipynb`, `04_lift_calibration_value_ranking.ipynb`. All run top-to-bottom; outputs match validation report within tolerance. + +--- + +## MEDIUM + +### 8. Channel-conditional conversion rates as differential predictor design +**Source:** gemini_report_v2.md §3.1 +**Evidence:** Industry data shows MQL→SQL ranges from <1% (email) to 51% (SEO). Current recipe uses generic motif families without explicit channel attribution as a strong predictor. +**Implication:** Either accept-with-different-approach (channel signal already partially present via motif structure) or extend the simulation to encode source-channel as a top-tier conditional probability. Risk: non-trivial DGP work right when we should be hardening for release. + +### 9. Train/test split policy — temporal/cohort + account-overlap audit + group/similarity leakage +**Source:** chatgpt_report_v2.md §5.6 (account/contact overlap); gemini_report_v2.md §6 (time-based splits over random shuffle); Guid §8.4 +**Evidence:** Current splits are deterministic 70/15/15 lead-level random. Real CRM use cases score future leads from accounts with prior activity — same-account-train/test may be intentional but must be audited and documented. 
+**Implication:** Add account/contact overlap probe; add cohort-time-shift split as an additional evaluation axis; document the choice in the data card. + +### 10. v7 teaching lessons should be ported into the v1 multi-table release +**Source:** chatgpt_report_v2.md §3.4, §5.5 +**Evidence:** v7 track has purely-causal trap, value-aware ranking, cohort split, GBM-vs-LR honest delta, lecture sequencing — already proven in `lead_scoring_intro/RELEASE_v7.md`. +**Implication:** Make v1 documentation (and notebooks) explicitly inherit the v7 teaching arc. + +### 11. LLM-as-a-judge integration as release-quality gate +**Source:** all reviewers; concrete schema in chatgpt_report_v2.md §7 Milestone 6 + Guid §12 +**Evidence:** Repo has no LLM critique today. +**Implication:** Build `leadforge/validation/llm_critique.py` with provider abstraction (env-var creds, skips cleanly without). At least 2 model families. Output schema fixed. High-severity findings require human adjudication before release. + +### 12. Mode-collapse / semantic-diversity validation +**Source:** gemini_report_v2.md §6 +**Evidence:** Current cohort-level diversity measured only by stage/category distribution; not by trajectory variety. +**Implication:** Add a diversity probe (likely as part of the LLM critique rubric) — sample N trajectories, ask judge whether the cohort covers the full firmographic / behavioral space. + +### 13. Demographic noise injection (NLP-forcing categorical permutations) +**Source:** gemini_report_v2.md §3.3 +**Evidence:** Job titles and similar categorical fields are likely standardized today. +**Implication:** Optional enrichment — adds pedagogical realism by forcing string-cleanup before modeling. Lower priority than leakage fixes. + +### 14. Cover image asset +**Source:** chatgpt_report_v2.md §4.4; Guid §11 +**Evidence:** No cover image in the repo. +**Implication:** Need `release/dataset-cover-image.png` ≥560×280 with documented 2:1/1:1 crops. Sourcing/design TBD. + +### 15. 
Versioning / naming clarification +**Source:** chatgpt_report_v2.md §2.1 +**Evidence:** `pyproject.toml` says `1.0.0` + Production/Stable; public dataset still alpha. +**Implication:** Name the upcoming dataset release explicitly (e.g., `leadforge-lead-scoring-v1`) so package-version vs framework-maturity vs curated-dataset-v1 don't get conflated. Cheap; do early. + +### 16. Issue templates + break-me guide + v2 decision log +**Source:** chatgpt_report_v2.md §7 Milestone 7 + §8; Guid §10 Milestone G +**Evidence:** No GitHub issue templates today. +**Implication:** Add `.github/ISSUE_TEMPLATE/dataset_breakage_report.yml`, `realism_feedback.yml`, plus `docs/release/break_me_guide.md` and `v2_decision_log.md` once feedback starts flowing. + +--- + +## LOW / Defer / Out-of-scope candidates + +### 17. CI workflow for release-candidate packaging +Useful but later — manual run of `scripts/validate_release_candidate.py` covers the use case until v1 ships. + +### 18. `leadforge release ...` CLI subcommands +Convenient but not required for v1 if `scripts/{build,validate,package_kaggle,package_hf,publish_*}.py` cover the workflows. Subcommand consolidation is a good v1.1 polish target. + +### 19. Macro framing as data-card narrative (CAC ratios, growth decline) +Useful pedagogical context for the dataset card but not a release blocker. + +### 20. Channel-conditional rates / log-normal sales cycles / demographic noise injection +Real DGP improvements; should be staged and benchmarked against the current alpha. Risk: rebuilding part of the engine right when we should be hardening for release. + +### 21. Per-vertical industry calibration (cybersecurity, fintech) +Out of scope for v1 (single vertical: B2B SaaS procurement). Note for v2 / second vertical. + +### 22. LTV labels as first-class outputs / leaderboard mini-site / second vertical +Out of scope per current `.agent-plan.md` and explicitly out per Guid §3.6 / C2 §8. 
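As a rough illustration of the snapshot-safe rule in finding #6 (keep only events with `timestamp <= lead_created_at + snapshot_day`, drop terminal-state fields), here is a minimal stdlib sketch. The row layout (lists of dicts keyed by `lead_id`), the helper names, and the 30-day `snapshot_day` are assumptions for illustration only. The real bundles are parquet tables, and the interface of the planned `leadforge/render/relational_snapshot_safe.py` module is not specified here.

```python
from datetime import datetime, timedelta

# Terminal-state fields that finding #1 names as leaking the outcome.
TERMINAL_FIELDS = frozenset({"close_outcome", "closed_at"})


def filter_events_to_snapshot(events, lead_created_at, snapshot_day):
    """Keep events with timestamp <= lead_created_at + snapshot_day.

    events: list of dicts with (assumed) keys "lead_id" and "timestamp".
    lead_created_at: dict mapping lead_id -> creation datetime.
    snapshot_day: number of days after creation that stay visible.
    """
    window = timedelta(days=snapshot_day)
    return [
        event
        for event in events
        if event["timestamp"] <= lead_created_at[event["lead_id"]] + window
    ]


def strip_terminal_fields(rows):
    """Return copies of rows without terminal-state columns."""
    return [
        {key: value for key, value in row.items() if key not in TERMINAL_FIELDS}
        for row in rows
    ]


# Hypothetical toy rows standing in for a full-horizon bundle.
created = {"L1": datetime(2025, 1, 1)}
events = [
    {"lead_id": "L1", "timestamp": datetime(2025, 1, 10), "kind": "demo"},
    {"lead_id": "L1", "timestamp": datetime(2025, 3, 15), "kind": "contract_signed"},
]
opportunities = [
    {"lead_id": "L1", "stage": "proposal", "close_outcome": "closed_won",
     "closed_at": datetime(2025, 3, 15)},
]

public_events = filter_events_to_snapshot(events, created, snapshot_day=30)
public_opportunities = strip_terminal_fields(opportunities)
```

Dropping columns is only part of the rule: any remaining field that is updated after the snapshot (a stage value, for example) would presumably also need to be taken as of the snapshot date, which this sketch does not attempt.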
+ +--- + +## Counts + +- 1 critical +- 6 high +- 9 medium +- 6 low/defer + +## Recommended next step + +Process-and-recommendations pass on every numbered item above with action codes: +- accept +- accept-with-different-approach +- reject +- out-of-scope-and-open-issue +- defer + +Then sign-off, then a single coherent roadmap. diff --git a/docs/external_review/summaries/recommendations_pass.md b/docs/external_review/summaries/recommendations_pass.md new file mode 100644 index 0000000..0730665 --- /dev/null +++ b/docs/external_review/summaries/recommendations_pass.md @@ -0,0 +1,175 @@ +# Process and Recommendations Pass + +For each of the 22 numbered findings in `key_findings.md`, an action code, +a one-line rationale, and a target roadmap (first-release vs after-release +vs out-of-scope). Items 1-7 (CRITICAL + HIGH) and items where there was +cross-source agreement are pre-accepted into the first-release-roadmap per +your direction; the heavier reasoning is on the Gemini-unique DGP items +where the v1 vs after-v1 split is a real call. + +## Action codes + +- **ACCEPT** — adopt as proposed; goes to specified roadmap +- **ACCEPT-DIFF** — adopt the *intent* with a different scope/approach +- **REJECT** — do not adopt +- **OOS-ISSUE** — out of scope for either roadmap; file a tracking issue +- **DEFER** — adopt later; goes to after-release-roadmap + +## Roadmap targets + +- **v1** = first-release-roadmap (what ships to Kaggle/HF as the inaugural release) +- **post-v1** = after-release-roadmap (engine/DGP improvements feeding the next dataset version, framework v1.x → v2) +- **issue** = file a GitHub issue, no roadmap commitment + +--- + +## CRITICAL + +### #1 — Public relational tables reconstruct the label +**Action:** ACCEPT → v1 +**Rationale:** Pre-accepted. THE blocker. Reproduce locally on alpha bundles, then build snapshot-safe relational export + validator (item #6 is the same workstream; folded together in the roadmap). 
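The local reproduction step could start from a probe like the one below. It is a minimal sketch, assuming each public table can be keyed back to a lead through a `lead_id` column (the actual join keys in the bundles may differ) and using in-memory rows in place of the real parquet files. It guesses the label purely from `close_outcome == "closed_won"` or the existence of a `customers` / `subscriptions` row, then scores that guess against the true target. Function names are hypothetical.

```python
def reconstruct_converted(opportunities, customers, subscriptions):
    """Guess converted lead_ids from public relational tables alone."""
    won = {
        row["lead_id"]
        for row in opportunities
        if row.get("close_outcome") == "closed_won"
    }
    has_customer = {row["lead_id"] for row in customers}
    has_subscription = {row["lead_id"] for row in subscriptions}
    return won | has_customer | has_subscription


def reconstruction_accuracy(true_labels, guessed_converted):
    """Fraction of leads whose label the join-based guess gets right."""
    hits = sum(
        (lead_id in guessed_converted) == bool(label)
        for lead_id, label in true_labels.items()
    )
    return hits / len(true_labels)


# Toy rows standing in for a student_public bundle (hypothetical data).
true_labels = {"L1": 1, "L2": 0, "L3": 1, "L4": 0}
opportunities = [
    {"lead_id": "L1", "close_outcome": "closed_won"},
    {"lead_id": "L2", "close_outcome": "closed_lost"},
    {"lead_id": "L3", "close_outcome": "closed_won"},
]
customers = [{"lead_id": "L1"}, {"lead_id": "L3"}]
subscriptions = [{"lead_id": "L1"}, {"lead_id": "L3"}]

guess = reconstruct_converted(opportunities, customers, subscriptions)
accuracy = reconstruction_accuracy(true_labels, guess)
```

A relational-leakage validator could treat an accuracy far above the majority-class baseline as a critical finding; where exactly to set that threshold is a design choice left open here.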
+ +--- + +## HIGH + +### #2 — Difficulty signal too flat on AUC across alpha tiers +**Action:** ACCEPT → v1 +**Rationale:** Cross-source agreement (chatgpt v2 directly; gemini implicit through difficulty-tier framework). Folds into validation hardening — the difficulty gate must include AP, P@K, calibration, lift, and model-family deltas, not just AUC. No new DGP work needed; the alpha already produces meaningful AP/P@K differentials, we just need the report to surface them. + +### #3 — No Kaggle `dataset-metadata.json` generator +**Action:** ACCEPT → v1 +**Rationale:** Cross-source agreement. Required to upload. Use ChatGPT-critique's verified field list (`expectedUpdateFrequency`, 6-50 char title, 20-80 char subtitle, 3-50 char slug, image ≥560×280). + +### #4 — HF README needs hardening to be a real dataset card +**Action:** ACCEPT → v1 +**Rationale:** Cross-source agreement. `pretty_name`, `tags: [tabular, lead-scoring, synthetic-data, crm, b2b, datasets, pandas]`, `configs` with `default: true`, local `load_dataset()` smoke test. + +### #5 — Release validation must move beyond `leadforge validate` +**Action:** ACCEPT → v1 +**Rationale:** Cross-source agreement. New modules: `leadforge/validation/{release_quality,leakage_probes,reporting}.py` + `scripts/validate_release_candidate.py`. Output: `release/validation/validation_report.{json,md}` + figures. Acceptance: zero critical leakage findings, metrics in bands, charts auto-generated. This is the single biggest piece of v1 work and absorbs most of the "release-grade gates" demanded by all reviewers. + +### #6 — Snapshot-safe relational export design +**Action:** ACCEPT → v1 +**Rationale:** Direct fix for #1. New module `leadforge/render/relational_snapshot_safe.py` + new validator `leadforge/validation/relational_leakage.py`. 
Filter event tables to `timestamp <= lead_created_at + snapshot_day`; drop terminal-state fields from public `opportunities`; omit `customers`/`subscriptions` from public bundles; full-horizon goes only to instructor companion. + +### #7 — Notebook sequence (4 notebooks; only 1 exists today) +**Action:** ACCEPT → v1 +**Rationale:** Cross-source agreement. v7 lecture sequence already exists in `lead_scoring_intro/RELEASE_v7.md` — operationalize as `02_relational_feature_engineering`, `03_leakage_and_time_windows`, `04_lift_calibration_value_ranking`. All run top-to-bottom; outputs match validation report. + +--- + +## MEDIUM — most depth on Gemini-unique DGP items here + +### #8 — Channel-conditional MQL→SQL rates as a strong differential predictor +**Action:** ACCEPT-DIFF → v1 (audit only) + post-v1 (full encoding) +**Rationale:** Gemini's strongest single DGP recommendation. Industry data (G2: SEO ~51%, PPC ~26%, Email <1%) shows lead source should be a top-tier conditional probability, and the Frontiers 2025 paper confirms `lead_source` is among the top important features in real CRM data. Genuinely valuable. +**But:** the leadforge engine drives conversion through motif-family-specific hazards keyed off latent traits, not through explicit channel-conditional probabilities. Properly encoding channel-conditional rates means (a) extending the recipe to declare per-channel transition probabilities, (b) reworking `assign_mechanisms()` to layer channel hazards on top of motif hazards, (c) re-running difficulty-band calibration across all three tiers, (d) re-baselining. That's an engine project, not release hardening, and risks rebuilding the DGP at exactly the wrong time. +**v1 scope:** audit how strongly `source_channel` already signals conversion in the alpha bundles; document realistic vs unrealistic mix in the dataset card; flag `lead_source` as a high-leverage feature that students should explore. 
+**post-v1 scope:** real channel-conditional encoding as a first-class generative axis. Worth a v1.1 dataset (same recipe, regenerated bundles) once the release is out and we can iterate on calibration without release-pressure. + +### #9 — Train/test split policy: cohort/time + account-overlap audit +**Action:** ACCEPT → v1 (folded into #5 + #7) +**Rationale:** Cross-source. Account/contact overlap probe is small work and belongs in `leakage_probes.py`. Cohort-time-shift split should ship as one of the evaluation axes in notebook #4 + the validation report. v7 already has cohort-split AUC drop measurement (`RELEASE_v7.md`); port the pattern. + +### #10 — v7 teaching lessons ported into v1 +**Action:** ACCEPT → v1 (folded into #7 + dataset card) +**Rationale:** Cross-source. Mostly documentation work — the lecture sequence and pedagogical patterns are already proven; we just operationalize them in the multi-table v1 release. + +### #11 — LLM-as-a-judge integration as release-quality gate +**Action:** ACCEPT-DIFF → v1 (minimal one-shot) + post-v1 (full CI integration) +**Rationale:** Cross-source agreement on the principle. But the gap between "minimal viable" and "full release-quality gate with multi-provider adjudication" is large. +**v1 scope:** `leadforge/validation/llm_critique.py` with a single provider abstraction (env-var creds, skip cleanly without). One-shot critique pass over dataset card + sample rows + validation report, structured output per Guid §12 schema (severity / category / claim / evidence / reproducer / suggested_fix). Run manually before tagging the release; high-severity findings adjudicated by hand. Output archived to `release/validation/llm_critique_*.json`. +**post-v1 scope:** multi-provider adjudication, CI gate, automated fail-on-high-severity, periodic re-runs against new bundles. 
+**Why split:** getting LLM critique right (prompt engineering, rubric design, threshold tuning, false-positive handling) takes meaningful iteration that we shouldn't gate v1 on. A minimal pass is enough to catch obvious gaps for the public release. + +### #12 — Mode-collapse / semantic-diversity validation +**Action:** ACCEPT-DIFF → v1 (LLM-judge rubric dimension) + post-v1 (quantitative validator) +**Rationale:** Gemini-unique. Real concern — heavily-aligned synthetic generators do produce homogenized "happy path" trajectories that lose pedagogical breadth. +**v1 scope:** include "Effective Semantic Diversity" as one of the rubric dimensions in the v1 LLM critique (item #11). Cohort sample → "does this set cover the full firmographic / behavioral space?" → severity-tagged finding. +**post-v1 scope:** dedicated quantitative validator — cohort embedding distance distribution, trajectory n-gram entropy, or similar. Engine-side work that depends on knowing what "diverse enough" looks like, which itself depends on running the v1 LLM critique a few times. + +### #13 — Demographic noise injection (job title permutations forcing NLP) +**Action:** DEFER → post-v1 (with tier-modulated approach) +**Rationale:** Gemini-unique. Real CRM messiness, but adding it to v1 risks distracting students from the core lead-scoring lessons — many will spend energy fighting NLP issues that aren't the lesson. Better: stage it as a difficulty-tier knob (intro stays clean; intermediate/advanced get the noise) once we have v1 feedback on which tiers students actually use. Note: difficulty profiles already modulate noise/missingness/outliers via `_apply_difficulty_distortions()` — this is an extension of that mechanism, not a new system. + +### #14 — Cover image asset +**Action:** ACCEPT → v1 +**Rationale:** Required for Kaggle. ≥560×280, with 2:1 header and 1:1 thumbnail crops. Sourcing/design TBD — recommend a stylized funnel diagram conveying "synthetic" + "B2B SaaS procurement". Cheap. 
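The packaging constraints quoted in #3 and #14 (title 6-50 characters, subtitle 20-80, slug 3-50, cover image at least 560×280 with 2:1 and 1:1 crops) lend themselves to a small pre-upload check. The sketch below is illustrative only: the function names are hypothetical, the limits are copied from this document and not re-verified against Kaggle, and image dimensions are passed in as integers so no imaging library is required.

```python
def packaging_problems(title, subtitle, slug, image_size):
    """Return a list of human-readable constraint violations (empty if none)."""
    problems = []
    if not 6 <= len(title) <= 50:
        problems.append("title must be 6-50 characters")
    if not 20 <= len(subtitle) <= 80:
        problems.append("subtitle must be 20-80 characters")
    if not 3 <= len(slug) <= 50:
        problems.append("slug must be 3-50 characters")
    width, height = image_size
    if width < 560 or height < 280:
        problems.append("cover image must be at least 560x280")
    return problems


def center_crop_box(width, height, aspect_w, aspect_h):
    """Largest centered (left, top, right, bottom) box with the given aspect."""
    if width * aspect_h >= height * aspect_w:
        # Image is at least as wide as the target aspect: keep full height.
        crop_h = height
        crop_w = height * aspect_w // aspect_h
    else:
        # Image is narrower than the target aspect: keep full width.
        crop_w = width
        crop_h = width * aspect_h // aspect_w
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return (left, top, left + crop_w, top + crop_h)


ok = packaging_problems(
    title="LeadForge Lead Scoring v1",
    subtitle="Synthetic B2B SaaS CRM data for lead scoring",
    slug="leadforge-lead-scoring-v1",
    image_size=(560, 280),
)
bad = packaging_problems("v1", "too short", "ab", (500, 200))
header_box = center_crop_box(560, 280, 2, 1)
thumbnail_box = center_crop_box(560, 280, 1, 1)
```

Returning a list of problems instead of raising on the first one would let a dry-run mode report every violation in a single pass.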
+ +### #15 — Versioning / naming clarification +**Action:** ACCEPT → v1 (do early) +**Rationale:** ChatGPT-unique but obviously right. `leadforge` package stays at 1.x; the curated dataset release is named `leadforge-lead-scoring-v1` (or similar). Decoupling avoids the "package says Production/Stable but the data is alpha" confusion. + +### #16 — Issue templates + break-me guide + v2 decision log +**Action:** ACCEPT → v1 +**Rationale:** Cross-source. `.github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml` + `docs/release/break_me_guide.md` + `docs/release/v2_decision_log.md` (starts empty, populated post-launch). Required for the adversarial public framing both reviewers demand. + +--- + +## LOW / Defer / Out-of-scope + +### #17 — CI workflow for release-candidate packaging +**Action:** DEFER → post-v1 +**Rationale:** Manual run of `validate_release_candidate.py` covers the v1 use case. Add CI workflow once the release process is stable. + +### #18 — `leadforge release ...` CLI subcommands +**Action:** DEFER → post-v1 +**Rationale:** Scripts in `scripts/` cover v1 needs. Subcommand consolidation is polish, not load-bearing. + +### #19 — Macro framing in dataset card (CAC ratios, growth decline) +**Action:** ACCEPT-DIFF → v1 +**Rationale:** Gemini-unique. Cheap to add a short "Why lead scoring matters in 2026 SaaS" paragraph to the dataset card; high pedagogical value (motivates the dataset for students). Don't build a whole "industry context" section. One paragraph + a citation or two. + +### #20 — Channel-conditional / log-normal sales cycles / demographic noise (catch-all) +**Action:** Split per components: +- Channel-conditional → see #8 (audit in v1, full encoding post-v1) +- Log-normal / Weibull sales-cycle distributions → DEFER → post-v1 +- Demographic noise → see #13 (DEFER → post-v1) +**Rationale on log-normal sales cycles:** Gemini-unique. 
The engine's daily-step simulation produces whatever cycle distribution falls out of the hazard rates; explicitly targeting log-normal (median ~84d, top quartile 46-75d) requires either tuning hazards per stage to hit the target distribution or switching to a different sampling model. Real work, no leakage-safety payoff. Defer until post-v1 DGP overhaul.
+
+### #21 — Per-vertical industry calibration (cybersecurity, fintech)
+**Action:** OOS-ISSUE → file as v2 second-vertical work
+**Rationale:** v1 vertical is locked: B2B SaaS procurement. Per-vertical calibration is exactly v2 territory. File the issue with G2's industry-specific rates (cyber 15-18% MQL→SQL, fintech 11-19%) so it's not lost.
+
+### #22 — LTV labels / leaderboard / second vertical
+**Action:** OOS-ISSUE
+**Rationale:** Explicitly out per `.agent-plan.md` and Guid §3.6 / C2 §8. Already tracked in deferred items.
+
+---
+
+## Summary
+
+| Roadmap | Items | Items pulled in |
+|---|---:|---|
+| **v1 (first-release-roadmap)** | 16 | #1, #2, #3, #4, #5, #6, #7, #8 (audit), #9, #10, #11 (minimal), #12 (LLM rubric), #14, #15, #16, #19 |
+| **post-v1 (after-release-roadmap)** | 7 | #8 (full encoding), #11 (full CI), #12 (quantitative), #13, #17, #18, #20 (log-normal) |
+| **out-of-scope-issue** | 2 | #21, #22 |
+
+(Some items appear in both roadmaps because they were split into a v1 minimal scope and a post-v1 full scope.)
+
+## Strategic frame for Gemini's deeper DGP work
+
+Gemini's strongest unique contributions are channel-conditional rates (#8), log-normal sales cycles (#20-log-normal), demographic noise (#13), and mode-collapse validation (#12). Three observations on the v1-vs-post-v1 split:
+
+1. **None of Gemini's DGP items are leakage-safety-load-bearing.** They make the dataset more pedagogically rich and more realistic, but none of them block v1 from being safe to publish. The blocker is structural (relational tables leak the label); Gemini didn't catch that.
So the v1-vs-post-v1 split here is "ship a leak-safe v1 first, deepen the DGP for the next dataset version" — not "delay v1 for DGP work." + +2. **Channel-conditional rates have the highest pedagogical ROI** of the four. The Frontiers 2025 paper directly identifies `lead_source` as a top important feature in real CRM data. Students using v1 *should* find that `source_channel` is one of the strongest predictors. The v1 audit step (#8) tells us how close we already are. If the audit shows the alpha bundles already have a strong channel signal (because motif families cover similar territory implicitly), the post-v1 work is calibration tuning rather than a rebuild. + +3. **The LLM-judge work is the single highest-leverage post-v1 investment.** A working multi-provider critique gate would catch every Gemini concern (mode collapse, narrative incoherence, demographic flatness) and every realism issue we haven't anticipated. Worth investing in once v1 is out and we can iterate on prompt design without release pressure. + +## Suggested ordering for the v1 roadmap (preview, not the roadmap itself) + +A dependency-respecting sequence for the v1 work, just to surface ordering questions before drafting: + +1. Reproduce the relational-leakage finding locally (sanity check, no roadmap) +2. Versioning / naming decision (#15 — cheap, do first) +3. Snapshot-safe relational export + validator (#1 + #6) — fixes the blocker +4. Release validation hardening (#2 + #5 + #9 probes) — depends on #6 validators existing +5. Channel-signal audit + macro framing paragraph (#8 audit + #19) — light docs work +6. Platform packaging: Kaggle metadata + HF README + cover image (#3 + #4 + #14) +7. Notebook sequence (#7 + #10 + #9 cohort split) — depends on validation report figures to reproduce +8. Issue templates + break-me guide (#16) — readiness for public adversarial framing +9. 
Minimal LLM critique pass (#11 + #12 rubric) — final quality gate before tagging the release + +Sign-off question: do you want me to push back on any of these recommendations before drafting the actual roadmap? diff --git a/docs/release/post_v1_roadmap.md b/docs/release/post_v1_roadmap.md new file mode 100644 index 0000000..7ddadf2 --- /dev/null +++ b/docs/release/post_v1_roadmap.md @@ -0,0 +1,100 @@ +# Post-v1 Roadmap + +The work that should happen after `leadforge-lead-scoring-v1` ships, derived +from `docs/external_review/summaries/recommendations_pass.md`. This roadmap +is unscheduled — items are grouped by category and rationale, not by phase. +Most of these are accepted recommendations whose v1 scope was minimal or +deferred outright; a few are framework polish. + +## Categories + +- **DGP-deepening** — engine work that makes the *next* dataset version more realistic +- **Validation maturity** — moving from minimal v1 gates to full release-quality CI +- **Framework polish** — DX improvements that don't gate the dataset +- **v2 territory** — second vertical, LTV, leaderboard + +## DGP-deepening (feeds the next dataset version) + +### Channel-conditional MQL→SQL rates as a generative axis (recommendation #8 — full scope) +**v1 scope (already in v1 roadmap Phase 4):** audit how strongly `source_channel` signals conversion in alpha bundles; document realistic vs unrealistic mix. +**Post-v1 scope:** extend the recipe to declare per-channel transition probabilities; rework `assign_mechanisms()` to layer channel-conditional hazards on top of motif hazards; re-run difficulty-band calibration; re-baseline. Targets the gemini_v2 channel-mix benchmarks (SEO ~51%, PPC ~26%, Email <1% MQL→SQL); pedagogically validated by the Frontiers 2025 paper (`lead_source` is among the top important features in real CRM). 
+**Files likely touched:** `leadforge/recipes/b2b_saas_procurement_v1/recipe.yaml`, `leadforge/mechanisms/policies.py`, `leadforge/simulation/engine.py`, `leadforge/validation/difficulty.py`. +**Risk:** rebuilds part of the engine; requires re-validation of difficulty bands across all tiers. +**Reward:** significantly stronger, more realistic differential predictor; the most-cited feature in real-CRM literature. +**Trigger:** plan after v1 ships and the channel-signal audit (Phase 4) tells us how far the alpha already is from target. + +### Log-normal / Weibull sales-cycle distributions (recommendation #20 — sales cycles) +**Post-v1 scope:** target a specific sales-cycle distribution (median ~84 days, top quartile 46-75 days) by tuning per-stage hazard rates or switching to an explicit sampling model. No leakage-safety payoff; pure realism. +**Files likely touched:** `leadforge/mechanisms/transitions.py`, `leadforge/mechanisms/hazards.py`, `leadforge/recipes/b2b_saas_procurement_v1/recipe.yaml`. +**Risk:** changes the funnel velocity; need to verify difficulty bands hold. +**Reward:** realistic delayed-conversion long tail; lift-curve realism for time-series teaching. + +### Demographic noise injection — tier-modulated (recommendation #13) +**Post-v1 scope:** noisy job-title permutations ("Head of Ops" / "Director of Global Ops" / "Operations VP" instead of standardized "VP of Operations"), conditional address-format variation, occasional missing-field patterns. Modulated by difficulty tier (intro stays clean; intermediate/advanced get the noise) using the existing `_apply_difficulty_distortions()` extension point. +**Files likely touched:** `leadforge/render/snapshots.py`, `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml`. +**Risk:** could distract students from the lesson if applied to intro tier. +**Reward:** forces NLP / categorical embedding cleanup; closer to real CRM messiness. 
+**Trigger:** plan once we have v1 user feedback on which tiers students actually use. + +## Validation maturity + +### Quantitative semantic-diversity validator (recommendation #12 — full scope) +**v1 scope:** "Effective Semantic Diversity" as one rubric dimension in the v1 LLM critique. +**Post-v1 scope:** dedicated quantitative validator — cohort embedding distance distribution, trajectory n-gram entropy, mode-coverage metrics. Engine-side, runs in CI on every recipe change. +**Files likely touched:** `leadforge/validation/diversity.py` (new). +**Trigger:** after running the v1 LLM critique a few times and learning what "diverse enough" looks like operationally. + +### Multi-provider LLM critique CI integration (recommendation #11 — full scope) +**v1 scope:** single-provider one-shot critique pass before tag. +**Post-v1 scope:** multi-provider adjudication (≥2 model families); CI gate that fails on high-severity findings; periodic re-runs against new bundles; archive of raw outputs across runs for trend analysis. +**Files likely touched:** `leadforge/validation/llm_critique.py`, `.github/workflows/release_critique.yml` (new). +**Trigger:** after v1 ships and the prompt design / threshold tuning has stabilized. + +### CI release-candidate workflow (recommendation #17) +**Post-v1 scope:** GitHub Actions workflow that runs `scripts/validate_release_candidate.py` on demand, uploads the validation report and figures as artifacts, and gates merging on no-critical-findings. +**Files likely touched:** `.github/workflows/release_candidate.yml` (new). + +## Framework polish + +### `leadforge release ...` CLI subcommands (recommendation #18) +**Post-v1 scope:** consolidate `scripts/{build,validate,package_kaggle,package_hf,publish_*}.py` under a single `leadforge release` namespace with subcommands. Add `--json` to all release commands. Add credential-presence checks. Add `--dry-run` to publish commands. 
+**Files likely touched:** `leadforge/cli/commands/release.py` (new), Click/Typer wiring in `leadforge/cli/main.py`. + +### `--json` output across remaining CLI commands +`leadforge inspect --json` shipped in M12 (PR #60). `leadforge validate --json` and the new release commands should follow. + +## v2 territory (later) + +### Per-vertical industry calibration (recommendation #21) +File as v2-track issue. Industry-specific MQL→SQL rates from gemini_v2 (Cybersecurity 15-18%, Fintech 11-19%) should be retained as the seed numbers for whichever vertical lands first. + +### Second vertical +Already in agent-plan post-v1 list. Likely candidates from the existing roadmap: cybersecurity SaaS, martech. + +### LTV labels as first-class outputs +Customer/subscription entities exist in v1 internals already; the work is wiring them through to a labeled task and adding the appropriate task manifest. Out-of-scope for v1 by hard constraint in CLAUDE.md; tracked for v2. + +### Leaderboard mini-site +Out-of-scope. If we ship one, it would consume v1 dataset feedback to build the v2 dataset rather than being a v1 sibling. + +### Continuous-time engine +Already in agent-plan post-v1 list. Engine-level work; not coupled to dataset releases. + +### Plugin architecture +Already in agent-plan post-v1 list. Framework architecture work. + +### External-API enrichment +Already in agent-plan post-v1 list. Optional behind extras per existing hard constraint. + +### Web UI / dashboard +Already in agent-plan post-v1 list. + +## Out-of-roadmap + +Items the corpus is silent on but that v1 launch will surface: + +- Engineering-cost prioritization between competing post-v1 items. +- What difficulty bands the post-v1 generative changes target (depends on v1 baseline numbers). +- Cover-image content guidelines if we redesign for v1.1+. + +These are decisions to make with v1 metrics in hand. 
diff --git a/docs/release/v1_acceptance_gates.md b/docs/release/v1_acceptance_gates.md new file mode 100644 index 0000000..cd9af79 --- /dev/null +++ b/docs/release/v1_acceptance_gates.md @@ -0,0 +1,178 @@ +# v1 Acceptance Gates + +Concrete, machine-checkable criteria for "v1 ready". A release candidate +that satisfies every gate below can be tagged and published. Numeric bands +prefixed with `TBD` are placeholders set in Phase 3 of the v1 release +roadmap; a release candidate cannot ship until all `TBD`s are resolved. + +This file is the operational definition of done for the v1 release. It is +read by `scripts/validate_release_candidate.py` and by humans before tag. + +## Naming and versioning gate + +- **G1.1** Dataset release name: `leadforge-lead-scoring-v1`. Locked in Phase 1. +- **G1.2** Kaggle slug: `leadforge-lead-scoring-v1`. +- **G1.3** Hugging Face repo: `leadforge-lead-scoring-v1` (public family) and `leadforge-lead-scoring-v1-instructor` (companion). +- **G1.4** Bundle `package_version` reflects the leadforge package at build time. +- **G1.5** Bundle `bundle_schema_version == 5`. + +## Reproducibility gate + +- **G2.1** Two independent builds with the same `--generation-timestamp` produce byte-identical bundles modulo timestamp-derived fields. Verified by `scripts/verify_hash_determinism.py`. +- **G2.2** All file SHA-256 hashes recorded in `manifest.json` match the actual files at validation time. +- **G2.3** A clean-environment regeneration on a different machine produces identical bundles to the developer's build (if not literally identical, deviations must be explainable solely by the timestamp field). + +## Structural gate + +- **G3.1** Every bundle in the family contains `manifest.json`, `dataset_card.md`, `feature_dictionary.csv`, `tables/`, `tasks/`. +- **G3.2** Every required relational table for the bundle's mode is present and non-empty. +- **G3.3** All foreign-key constraints in `ALL_CONSTRAINTS` hold. 
+- **G3.4** All task splits (`train`, `valid`, `test`) are non-empty and disjoint. + +## Relational leakage gate (the v1 critical gate) + +This is the gate that motivates the v1 release. Failures here are blockers. + +- **G4.1** Public `tables/leads.parquet` does **not** contain `converted_within_90_days` or `conversion_timestamp`. +- **G4.2** Public `tables/opportunities.parquet` does **not** contain `close_outcome` or `closed_at`. +- **G4.3** Public bundles do **not** contain `tables/customers.parquet` or `tables/subscriptions.parquet`. +- **G4.4** Public event tables (`touches`, `sessions`, `sales_activities`) contain no rows where `event_timestamp > lead_created_at + snapshot_day`. +- **G4.5** Probabilistic relational reconstruction probe: a model trained using only public relational features (joined on `lead_id`/`account_id`/`contact_id`) achieves AUC ≤ TBD-G4.5 against `converted_within_90_days`. Threshold derived during Phase 3 from honest-feature baseline. +- **G4.6** Manifest field `relational_snapshot_safe == true` for `student_public` bundles; `false` for `research_instructor`. + +## Direct leakage gate + +- **G5.1** Models trained using only post-snapshot aggregate features cannot reconstruct the target above tolerance TBD-G5.1. +- **G5.2** Models trained using only suspect-stage columns (`current_stage`, `is_sql`) cannot reconstruct the target above tolerance TBD-G5.2. +- **G5.3** ID-only models (using only `lead_id`/`account_id`/`contact_id`) achieve AUC ≤ 0.5 + ε. +- **G5.4** No public feature derives from events with timestamp > `lead_created_at + snapshot_day` (audited at the `FeatureSpec` level — recipe must declare provenance). + +## Split leakage gate + +- **G6.1** Account-overlap audit: same `account_id` in train + test is documented as intentional or absent. +- **G6.2** Contact-overlap audit: same `contact_id` in train + test is documented as intentional or absent. 
+- **G6.3** Near-duplicate row detection: no rows with feature-vector cosine similarity > 0.99 across splits. +- **G6.4** Cohort-time-shift split exists: AUC degradation under cohort split ≥ TBD-G6.4 (lower bound — cohort split should be meaningfully harder than random) and ≤ TBD-G6.4-upper (upper bound — but not catastrophic). + +## Performance gates (per tier) + +Bands set in Phase 3 from baseline measurements; written here as the contract. + +### Intro tier +- **G7.1.1** Conversion rate within [TBD, TBD] +- **G7.1.2** LR AUC within [TBD, TBD] +- **G7.1.3** GBM AUC within [TBD, TBD] +- **G7.1.4** GBM-vs-LR AUC delta ≥ TBD-G7.1.4 +- **G7.1.5** AP within [TBD, TBD] +- **G7.1.6** P@100 within [TBD, TBD] +- **G7.1.7** Brier score within [TBD, TBD] +- **G7.1.8** Calibration max-bin error ≤ TBD-G7.1.8 + +### Intermediate tier +- **G7.2.1**–**G7.2.8** mirroring intro, with bands shifted to reflect higher difficulty (lower AP, lower P@K, similar AUC, similar GBM-vs-LR delta). + +### Advanced tier +- **G7.3.1**–**G7.3.8** mirroring intro, with hardest bands. + +### Cross-tier ordering +- **G7.4.1** AP ordering: intro > intermediate > advanced. +- **G7.4.2** P@K ordering: intro > intermediate > advanced. +- **G7.4.3** Conversion-rate ordering: intro > intermediate > advanced. +- **G7.4.4** GBM-vs-LR delta is positive in every tier (sophistication is rewarded). + +## Cross-seed stability gate + +- **G8.1** Run N=5 seeds per tier; each metric in G7 falls within ±TBD-G8.1 of the reported median. +- **G8.2** No degenerate seeds (conversion rate < 1% or > 99% in any seed). + +## Public/instructor diff gate + +- **G9.1** Every public/instructor difference is intentional and listed in `release/EXPOSURE_DELTA.md`. +- **G9.2** Manifest `redacted_columns` field matches the actual public bundle's column omissions. +- **G9.3** Instructor-companion-only artifacts (`metadata/`, leakage-trap features, full-horizon tables) are absent from public bundles. 
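
Gate G9.2 above compares the manifest's `redacted_columns` field with the columns actually missing from the public bundle. A minimal sketch of that comparison follows. It assumes `redacted_columns` is a mapping from table name to a list of column names, which the gate text does not specify, and `redaction_mismatches` is a hypothetical helper name rather than an existing leadforge function.

```python
def redaction_mismatches(redacted_columns, instructor_columns, public_columns):
    """Compare declared redactions with actual public-bundle omissions.

    All three arguments are assumed to be dicts mapping a table name to an
    iterable of column names (an illustrative layout, not a confirmed schema).
    Returns a dict of tables whose declared and actual redactions disagree.
    """
    mismatches = {}
    # Only tables present in both bundles can be compared column by column.
    for table in sorted(instructor_columns.keys() & public_columns.keys()):
        actual = set(instructor_columns[table]) - set(public_columns[table])
        declared = set(redacted_columns.get(table, ()))
        if actual != declared:
            mismatches[table] = {
                "undeclared": sorted(actual - declared),
                "declared_but_present": sorted(declared - actual),
            }
    return mismatches


# Illustrative data only; column names are taken from gates G4.1 and G4.2.
instructor = {
    "leads": ["lead_id", "converted_within_90_days", "conversion_timestamp"],
    "opportunities": ["opportunity_id", "close_outcome", "closed_at"],
}
public = {
    "leads": ["lead_id"],
    "opportunities": ["opportunity_id", "closed_at"],
}
manifest_redactions = {
    "leads": ["converted_within_90_days", "conversion_timestamp"],
    "opportunities": ["close_outcome", "closed_at"],
}
report = redaction_mismatches(manifest_redactions, instructor, public)
```

An empty result would mean the gate passes for the compared tables. Tables that are omitted entirely from public bundles (G4.3) are outside this column-level check and would need a separate assertion.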
+ +## Documentation gate + +- **G10.1** `release/README.md` (the dataset card) passes a Datasheets-for-Datasets / Data Cards Playbook checklist: + - Provenance (who, when, why) + - Motivation + - Composition (entities, features, label, splits) + - Collection / generation method + - Preprocessing and transformations + - Recommended uses + - Out-of-scope uses + - Known limitations and biases + - Maintenance plan +- **G10.2** `docs/release/generation_method.md` exists and is readable as a standalone document. +- **G10.3** `docs/release/feature_dictionary.md` covers every feature in the snapshot CSV with description, dtype, source, leakage flag, and recommended-for-modeling flag. +- **G10.4** `docs/release/break_me_guide.md` exists and links from `release/README.md`. +- **G10.5** `docs/release/v1_release_notes.md` exists and is human-readable. +- **G10.6** Every claim made in the dataset card about realism, calibration, or difficulty has a backing reference in `release/validation/validation_report.md`. + +## Platform packaging gate + +### Kaggle +- **G11.1** `release/kaggle/dataset-metadata.json` exists and validates against current Kaggle schema: + - `title` length 6-50 chars + - `subtitle` length 20-80 chars + - `id` slug 3-50 chars + - exactly one entry in `licenses` + - `expectedUpdateFrequency` from approved values (`never` for v1) + - all `resources[].schema.fields` listed in column order +- **G11.2** `release/dataset-cover-image.png` exists with dimensions ≥ 560 × 280. +- **G11.3** Kaggle dry-run package builds without error: `kaggle datasets create -p release/kaggle --dir-mode zip` (in `--dry-run` if available, or shape-validate without). + +### Hugging Face +- **G12.1** `release/huggingface/README.md` exists with valid YAML metadata: `pretty_name`, `license`, `language: en`, `task_categories: [tabular-classification]`, `size_categories`, `tags`, `configs`. +- **G12.2** Exactly one config has `default: true`. 
+- **G12.3** Local `load_dataset(release/huggingface, "intro")` succeeds; same for `intermediate`, `advanced`. +- **G12.4** Companion repo (`leadforge-lead-scoring-v1-instructor`) packages independently and loads via `load_dataset()` for at least one config. + +## Notebook gate + +- **G13.1** All four notebooks in `release/notebooks/` execute top-to-bottom from a clean environment without errors. +- **G13.2** Each notebook's printed metrics match the validation report within tolerance TBD-G13.2. +- **G13.3** Each notebook explicitly distinguishes the public path from the instructor companion path; instructor-only artifacts are not loaded by the public notebooks. + +## LLM critique gate + +- **G14.1** `scripts/run_llm_critique.py` runs successfully when credentials are present. +- **G14.2** The critique produces a structured findings JSON conforming to the schema in `v1_release_design.md` §"LLM critique". +- **G14.3** No unresolved high-severity findings remain. Each high-severity finding is either: + - resolved in code (with a backing PR), or + - documented in `docs/release/v2_decision_log.md` as intentional-and-accepted with rationale. +- **G14.4** Raw LLM outputs are archived under `release/validation/llm_critique_raw_*.json` for audit. + +## Adversarial framing gate + +- **G15.1** GitHub issue templates (`dataset_breakage_report.yml`, `realism_feedback.yml`) render correctly. +- **G15.2** `docs/release/break_me_guide.md` is linked from `release/README.md`, the Kaggle description, and the HF README. +- **G15.3** `docs/release/v2_decision_log.md` exists (may be empty at launch). + +## Out-of-scope acknowledgment + +The following are explicitly NOT release blockers for v1; they live in `post_v1_roadmap.md`: + +- Channel-conditional MQL→SQL rates (audit only in v1). +- Log-normal sales-cycle distributions. +- Demographic noise injection. +- Quantitative semantic-diversity validator. +- Multi-provider LLM critique CI integration. +- LTV labels as first-class outputs. 
+- Second vertical / per-vertical calibration. +- Leaderboard mini-site. + +## Definition of green + +A release candidate is **green** (ready to publish) when: +- All gates G1–G15 pass. +- All `TBD-*` placeholders have been resolved with concrete numeric values during Phase 3. +- The validation report explicitly cites the gate that justifies each metric band. +- A human signs off on `v2_decision_log.md` entries for any accepted-with-rationale findings. + +A release candidate is **blocked** if any of: +- G4.* relational leakage gate fails. +- G5.* direct leakage gate fails. +- G7.4.4 GBM-vs-LR delta is non-positive in any tier (the dataset doesn't reward sophistication). +- G14.3 has unresolved high-severity findings. +- Any `TBD-*` remains unresolved at tag time. diff --git a/docs/release/v1_release_design.md b/docs/release/v1_release_design.md new file mode 100644 index 0000000..0819fb4 --- /dev/null +++ b/docs/release/v1_release_design.md @@ -0,0 +1,236 @@ +# v1 Release Design + +Architectural decisions specific to `leadforge-lead-scoring-v1`. The +existing `docs/leadforge_design_doc.md` and +`docs/leadforge_architecture_spec.md` remain authoritative for the +framework; this document captures only the decisions that arise from the +v1 dataset release and that diverge from or extend the existing design. + +## Naming and versioning decoupling + +The leadforge **package** stays at `1.x` in `pyproject.toml`. The curated +public dataset release is named **`leadforge-lead-scoring-v1`** and is +versioned independently of the package. Future iterations of the dataset +(after fixing leakage / rebuilding channel signal / etc.) bump the +*dataset* version (`v1` → `v2`), not the package. + +**Rationale:** the alpha shipped while the package was already at `1.0.0` / +Production/Stable. Conflating the two confuses users — a v0.x dataset +could not exist while the package is v1, and a "v1 of the dataset" implies +a coordinated bump of the framework that is not actually planned. 
+ +**Implication for releases:** +- Kaggle dataset slug: `leadforge-lead-scoring-v1` (or `/leadforge-lead-scoring-v1`). +- Hugging Face repo: `/leadforge-lead-scoring-v1`. +- GitHub release tag: `dataset/v1` or similar, distinct from package tags. + +## Dataset family architecture + +The release is a *family*, not a single CSV. + +### Public family (Kaggle / HF) +- **intro** — easiest tier (~41.5% conversion in alpha; AUC ~0.89). +- **intermediate** — middle tier (~20.1% conversion; AUC ~0.88). +- **advanced** — hardest tier (~7.9% conversion; AUC ~0.87). + +Each tier is a complete bundle: +- Flat task splits: `train.csv`, `validation.csv`, `test.csv`, plus a single joined `lead_scoring.csv`. +- **Snapshot-safe relational tables** (see below): `accounts.parquet`, `contacts.parquet`, `leads.parquet`, `touches.parquet`, `sessions.parquet`, `sales_activities.parquet`, `opportunities.parquet`. **No `customers` or `subscriptions` tables in public bundles.** +- `manifest.json`, `feature_dictionary.csv`, `dataset_card.md`. + +### Instructor / research companion (separate artifact) +Exists at the `intermediate_instructor` tier only (matches alpha pattern). Contents: +- Full hidden world graph (`metadata/graph.json`, `graph.graphml`). +- Latent registry (`metadata/latent_registry.json`). +- World spec (`metadata/world_spec.json`). +- Mechanism summary (`metadata/mechanism_summary.json`). +- **Full-horizon relational tables** including `customers.parquet`, `subscriptions.parquet`, `leads.parquet` with `converted_within_90_days` + `conversion_timestamp`, `opportunities.parquet` with `close_outcome` + `closed_at`. +- Leakage-trap features explicitly marked (`__leakage__*` naming convention from v6/v7). + +### Where the companion lives +The instructor companion ships as a **separate** GitHub Release artifact and a **separate** Hugging Face repo (`/leadforge-lead-scoring-v1-instructor`). It is **not** uploaded to Kaggle. 
+ +**Rationale:** Kaggle's dataset model assumes one dataset per repo and surfaces all files alike. Hidden truth and leakage-trap features should not be one click away from student-facing files. A separate repo also lets us require explicit acceptance (e.g., HF gated repo) before download if needed for academic settings. + +## Snapshot-safe relational export — new architectural component + +The single most important architectural change in this release. + +### Problem +The v0.1.0-alpha public `student_public` bundles include relational tables that allow target reconstruction with 100% accuracy via joins: +- `tables/leads.parquet` retains `converted_within_90_days` and `conversion_timestamp`. +- `tables/opportunities.parquet.close_outcome == "closed_won"` perfectly distinguishes converted leads. +- `customers` and `subscriptions` tables exist *only* for converted leads — their presence is the label. + +Verified in a 500-lead `student_public` smoke bundle by ChatGPT v2 reviewer (chatgpt_report_v2.md §0). + +### Decision +A new module `leadforge/render/relational_snapshot_safe.py` produces a **snapshot-safe** relational export for `student_public` bundles. Properties: + +1. **Event tables** (`touches`, `sessions`, `sales_activities`) are filtered to `event_timestamp <= lead_created_at + snapshot_day`. The same temporal boundary used for flat-CSV features. +2. **`leads.parquet`** drops `converted_within_90_days` and `conversion_timestamp`. The label only lives in the task splits, where it is the explicit y-column. +3. **`opportunities.parquet`** is filtered to `created_at <= lead_created_at + snapshot_day` and drops `close_outcome` and `closed_at`. +4. **`customers.parquet` and `subscriptions.parquet`** are omitted from public bundles entirely. +5. **Account- and contact-level tables** are not filtered (they are static firmographic/personographic features). 
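
The five properties listed above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the contents of `leadforge/render/relational_snapshot_safe.py`: tables are modeled as lists of row dicts, every filtered table is assumed to carry a `lead_id` column, and `snapshot_safe_export` is a hypothetical function name.

```python
from datetime import datetime, timedelta

# Table and column names follow the five properties above; the row layout
# (lists of dicts keyed by `lead_id`) is an assumption made for illustration.
EVENT_TABLES = ("touches", "sessions", "sales_activities")
OMITTED_TABLES = ("customers", "subscriptions")
BANNED_COLUMNS = {
    "leads": {"converted_within_90_days", "conversion_timestamp"},
    "opportunities": {"close_outcome", "closed_at"},
}
# Timestamp column compared against the per-lead cutoff in each filtered table.
TIMESTAMP_COLUMN = {name: "event_timestamp" for name in EVENT_TABLES}
TIMESTAMP_COLUMN["opportunities"] = "created_at"


def snapshot_safe_export(tables, snapshot_day):
    """Return a snapshot-safe copy of `tables` (dict: name -> list of rows)."""
    # Per-lead temporal boundary: lead_created_at + snapshot_day.
    cutoff = {
        lead["lead_id"]: lead["lead_created_at"] + timedelta(days=snapshot_day)
        for lead in tables["leads"]
    }
    public = {}
    for name, rows in tables.items():
        if name in OMITTED_TABLES:
            continue  # property 4: table presence alone would reveal the label
        ts_col = TIMESTAMP_COLUMN.get(name)
        if ts_col is not None:
            # properties 1 and 3: keep only rows at or before the cutoff
            rows = [r for r in rows if r[ts_col] <= cutoff[r["lead_id"]]]
        banned = BANNED_COLUMNS.get(name, set())
        # properties 2 and 3: drop label-bearing columns
        public[name] = [
            {k: v for k, v in row.items() if k not in banned} for row in rows
        ]
    return public  # property 5: account/contact tables pass through unfiltered


# Illustrative single-lead world with a 30-day snapshot.
t0 = datetime(2026, 1, 1)
example = {
    "leads": [{"lead_id": "L1", "lead_created_at": t0,
               "converted_within_90_days": 1,
               "conversion_timestamp": t0 + timedelta(days=45)}],
    "touches": [{"lead_id": "L1", "event_timestamp": t0 + timedelta(days=10)},
                {"lead_id": "L1", "event_timestamp": t0 + timedelta(days=40)}],
    "opportunities": [{"lead_id": "L1", "created_at": t0 + timedelta(days=20),
                       "close_outcome": "closed_won",
                       "closed_at": t0 + timedelta(days=45)}],
    "customers": [{"account_id": "A1"}],
    "accounts": [{"account_id": "A1", "industry": "logistics"}],
}
public = snapshot_safe_export(example, snapshot_day=30)
```

Keeping the banned-column and omitted-table lists as data rather than code would let the structural probes in the validator reuse the same definitions, so the export and its check cannot drift apart.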
+ +The full-horizon relational export remains in `leadforge/render/relational.py` and is used unchanged for the instructor companion. + +### Bundle schema bump: v4 → v5 +The `BUNDLE_SCHEMA_VERSION` constant moves from 4 to 5. The manifest gains a `relational_snapshot_safe: bool` field (true for `student_public`, false for `research_instructor`). This makes bundles self-describing: a tool reading a v5 bundle can tell from the manifest whether the relational tables are snapshot-safe or full-horizon. + +### New validator `leadforge/validation/relational_leakage.py` +Three categories of probe: +- **Structural**: assert no banned columns appear in public `leads`/`opportunities`; assert `customers`/`subscriptions` absent from public; assert event-table timestamps ≤ `lead_created_at + snapshot_day`. +- **Probabilistic**: train a lightweight model using only public relational features and joinable keys; assert reconstructed-target AUC/accuracy is below tolerance. +- **Schema-vs-manifest**: assert the manifest's `relational_snapshot_safe` flag matches the actual table contents (catches misconfigured exposure routes). + +Wired into `leadforge/validation/bundle_checks.py:validate_bundle()` so any bundle that violates these contracts fails validation by default. + +## Release validation — new architectural component + +The framework already has `leadforge/validation/{bundle_checks,realism,difficulty,drift,lead_scoring}.py`. The v1 release adds a higher-level **release-grade** layer that consumes those primitives and produces a single reproducible report. + +### New modules +- `leadforge/validation/release_quality.py` — orchestrates the metric panel. +- `leadforge/validation/leakage_probes.py` — direct / time-window / relational / split / model-realism probes (per recommendations Guid §8). +- `leadforge/validation/reporting.py` — renders `validation_report.{json,md}` and figures. +- `scripts/validate_release_candidate.py` — driver script. 
+ +### Output contract +``` +release/validation/ + validation_report.json # machine-readable; fields per v1_acceptance_gates.md + validation_report.md # human-readable + figures/ + lift_curve_intro.png + lift_curve_intermediate.png + lift_curve_advanced.png + calibration_intermediate.png + leakage_delta.png + cohort_shift.png + value_capture.png +``` + +### Difficulty bands +The current `validation/difficulty.py` validates conversion-rate ranges. v1 expands the check to: +- **AP** band per tier +- **P@K** band per tier +- **GBM-vs-LR delta** band (model-family delta — pedagogically meaningful) +- **Calibration** (Brier score) band per tier +- **Cohort-shift AUC degradation** band + +Concrete numeric ranges are set in `v1_acceptance_gates.md` once Phase 3 produces baseline numbers. + +## LLM critique — new architectural component (minimal v1 scope) + +A new validation module `leadforge/validation/llm_critique.py` provides a structured one-shot LLM review. + +### Scope decisions +- **Single provider** in v1 (Anthropic Claude as default). Multi-provider adjudication is post-v1 work. +- **Skips cleanly** without credentials — env var absence yields a clear "skipped: no credentials" message, not a failure. +- **Output schema** is fixed (per `recommendations_pass.md` §11; mirrors Guid §12): + ``` + { + "release_id": "leadforge-lead-scoring-v1", + "model": "anthropic/claude-opus-4-7/...", + "run_timestamp": "ISO-8601", + "overall_score": , + "findings": [ + { "severity": "critical|high|medium|low|nit", + "category": "leakage|realism|documentation|platform|ethics|pedagogy|code", + "claim": "...", + "evidence": "file/path:line or artifact ref", + "reproducer": "optional cmd", + "suggested_fix": "..." } + ], + "missing_sections": [], + "questions_for_maintainer": [] + } + ``` +- **Adjudication is manual** in v1 — high-severity findings are resolved by hand or filed in `v2_decision_log.md` if intentional-and-accepted. CI auto-fail on high severity is post-v1. 
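
Since adjudication is manual in v1, a small helper that checks the findings JSON against the fixed schema and lists the findings still blocking a tag may be useful. The sketch below is hedged: the enum values and field names come from the schema above, but treating `reproducer` as the only optional finding field, treating `critical` as blocking alongside `high`, and matching accepted findings by their `claim` text are all assumptions. `schema_errors` and `blocking_findings` are hypothetical names.

```python
SEVERITIES = {"critical", "high", "medium", "low", "nit"}
CATEGORIES = {"leakage", "realism", "documentation", "platform",
              "ethics", "pedagogy", "code"}
TOP_LEVEL_KEYS = ("release_id", "model", "run_timestamp",
                  "overall_score", "findings")
# Assumption: every finding field except `reproducer` is required.
REQUIRED_FINDING_KEYS = {"severity", "category", "claim",
                         "evidence", "suggested_fix"}
# Assumption: `critical` blocks a release in the same way `high` does.
BLOCKING_SEVERITIES = {"critical", "high"}


def schema_errors(report):
    """Return a list of human-readable schema violations (empty if none)."""
    errors = [f"missing top-level field: {key}"
              for key in TOP_LEVEL_KEYS if key not in report]
    for i, finding in enumerate(report.get("findings", [])):
        missing = REQUIRED_FINDING_KEYS - finding.keys()
        if missing:
            errors.append(f"findings[{i}] missing fields: {sorted(missing)}")
        if finding.get("severity") not in SEVERITIES:
            errors.append(f"findings[{i}] has unknown severity")
        if finding.get("category") not in CATEGORIES:
            errors.append(f"findings[{i}] has unknown category")
    return errors


def blocking_findings(report, accepted_claims=frozenset()):
    """High-severity findings not yet accepted in the decision log."""
    return [
        finding for finding in report.get("findings", [])
        if finding.get("severity") in BLOCKING_SEVERITIES
        and finding.get("claim") not in accepted_claims
    ]


# Illustrative report; the model string and finding text are placeholders.
sample_report = {
    "release_id": "leadforge-lead-scoring-v1",
    "model": "example-model",
    "run_timestamp": "2026-05-05T00:00:00Z",
    "overall_score": 7,
    "findings": [
        {"severity": "high", "category": "leakage",
         "claim": "public opportunities table exposes close_outcome",
         "evidence": "tables/opportunities.parquet",
         "suggested_fix": "drop the column in the public export"},
        {"severity": "nit", "category": "documentation",
         "claim": "dataset card has a typo",
         "evidence": "dataset_card.md",
         "suggested_fix": "fix the typo"},
    ],
    "missing_sections": [],
    "questions_for_maintainer": [],
}
open_blockers = blocking_findings(sample_report)
```

Passing the set of claims recorded in `v2_decision_log.md` as `accepted_claims` would reduce the blocker list to the findings that still need a code fix.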
+ +### Rubric dimensions for v1 +- **Logical coherence** (G1, G2): does the lead trajectory make sense? +- **Behavioral plausibility** (G1, G2): are events consistent with firmographics? +- **Effective semantic diversity** (G2): does the cohort cover the firmographic / behavioral space? +- **Syntax validity** (G2): are categorical fields free of hallucinatory artifacts? +- **Documentation completeness** (Datasheets / Data Cards Playbook): is the dataset card complete? +- **Leakage flagging** (C2): does the documentation make the leakage policy clear? +- **Pedagogical clarity** (C2): does a student have a clear entry point? + +### Bias mitigation (G2) +- Forced-rationale prompts: judge must emit step-by-step analysis before assigning a numerical score. +- Explicit instruction not to favor verbose responses or self-similar outputs. + +## Module landscape (new in v1 release work) + +``` +leadforge/ + render/ + relational_snapshot_safe.py # NEW — Phase 2 + validation/ + relational_leakage.py # NEW — Phase 2 + release_quality.py # NEW — Phase 3 + leakage_probes.py # NEW — Phase 3 + reporting.py # NEW — Phase 3 + llm_critique.py # NEW — Phase 7 + +scripts/ + audit_channel_signal.py # NEW — Phase 4 + validate_release_candidate.py # NEW — Phase 3 + package_kaggle_release.py # NEW — Phase 5 + package_hf_release.py # NEW — Phase 5 + run_llm_critique.py # NEW — Phase 7 + publish_kaggle.py # NEW — Phase 7 + publish_hf.py # NEW — Phase 7 + +docs/release/ + v1_release_roadmap.md # this PR + post_v1_roadmap.md # this PR + v1_release_design.md # this PR (this file) + v1_acceptance_gates.md # this PR + v1_current_state_audit.md # Phase 1 + channel_signal_audit.md # Phase 4 + generation_method.md # Phase 4 + feature_dictionary.md # Phase 4 + break_me_guide.md # Phase 6 + v2_decision_log.md # Phase 6 (starts empty) + llm_critique_prompt.md # Phase 7 + v1_release_notes.md # Phase 7 + +release/ + kaggle/ # NEW — Phase 5 (generated) + dataset-metadata.json + huggingface/ # NEW — Phase 5 
(generated) + README.md + validation/ # NEW — Phase 3+ (generated) + validation_report.{json,md} + figures/*.png + llm_critique_*.{json,md} + notebooks/ + 01_baseline_lead_scoring.ipynb # updated — Phase 6 + 02_relational_feature_engineering.ipynb # NEW — Phase 6 + 03_leakage_and_time_windows.ipynb # NEW — Phase 6 + 04_lift_calibration_value_ranking.ipynb # NEW — Phase 6 + dataset-cover-image.png # NEW — Phase 5 + +.github/ + ISSUE_TEMPLATE/ + dataset_breakage_report.yml # NEW — Phase 6 + realism_feedback.yml # NEW — Phase 6 +``` + +## What this design does NOT change + +- The layered design (narrative / schema / structure / mechanism / simulation / render / validation / exposure) remains. +- Determinism, RNG roots, motif sampling, hidden-graph DAG construction, and exposure modes are unchanged. +- The flat-CSV path is unchanged at the feature level (it was already snapshot-safe via windowed snapshot in `BUNDLE_SCHEMA_VERSION` 4). +- `Generator.from_recipe(...).generate(...)` API surface is unchanged. +- CLI commands `generate`, `inspect`, `validate`, `list-recipes` are unchanged in shape (some grow `--json` output post-v1). + +## Risks captured + +- **Bundle schema v4 → v5 break.** Consumers of v0.1.0-alpha bundles may need to re-read against the new schema. Mitigated by retaining the schema version field in manifest and documenting the v4→v5 contract in `v1_release_notes.md`. +- **Snapshot-safe export may eliminate features that students actually want.** Mitigated by keeping the *flat task path* feature-rich (it was always snapshot-safe) and providing the relational FE notebook (Phase 6 #02) to demonstrate legitimate joins. +- **LLM critique false positives.** Mitigated by manual adjudication in v1; fully automated gate deferred to post-v1. +- **Cover image sourcing TBD.** Captured as open question in roadmap. 
diff --git a/docs/release/v1_release_roadmap.md b/docs/release/v1_release_roadmap.md new file mode 100644 index 0000000..3320dc2 --- /dev/null +++ b/docs/release/v1_release_roadmap.md @@ -0,0 +1,344 @@ +# v1 Lead-Scoring Dataset Release Roadmap + +**Target:** Publish `leadforge-lead-scoring-v1` to Kaggle and Hugging Face as a best-in-class educational synthetic CRM dataset family. +**Source of truth:** This roadmap is derived from `docs/external_review/summaries/recommendations_pass.md` (signed off 2026-05-05). +**Companion docs:** `v1_release_design.md`, `v1_acceptance_gates.md`, `post_v1_roadmap.md`. +**Naming convention:** the *dataset* is `leadforge-lead-scoring-v1`. The leadforge *package* remains at `1.x` and is decoupled from this dataset version (resolves recommendation #15). + +## Vision + +Six external-review documents from two reviewers, two iterations each, surface one shared verdict: leadforge is much further along than greenfield, and the v1 milestone is **release hardening + adversarial validation**, not core implementation. The single most important blocker is that `student_public` bundles currently leak `converted_within_90_days` end-to-end through public relational tables (verified locally by ChatGPT v2 in a 500-lead smoke bundle). Everything else is a quality-bar issue, not a correctness one. 
+ +The v1 release ships: +- a public family — intro / intermediate / advanced flat task splits + snapshot-safe relational tables +- a separate research/instructor companion — full hidden graph, latent registry, mechanism summary, full-horizon relational tables +- a release-grade validation report with figures, lift curves, calibration, leakage probes, and cross-seed bands +- a 4-notebook teaching sequence (baseline → relational FE → leakage demo → lift/calibration/value) +- a Kaggle dataset and a Hugging Face dataset, both packaged programmatically and dry-run-tested +- a public adversarial framing (issue templates, break-me guide, v2 decision log) + +## v1-ready definition (operational) + +A release candidate is v1-ready when **all** of the following hold. Concrete bands and probes live in `v1_acceptance_gates.md`. + +1. Fresh release candidate generates from code, byte-identical to the previous build modulo the pinned `generation_timestamp`. +2. Structural and FK validation pass on every bundle in the family. +3. **Relational leakage probe**: no public-only join path reconstructs the target above tolerance. +4. **Direct leakage probes**: no model trained on suspect-only feature subsets reconstructs the target above tolerance. +5. **Split leakage**: account/contact split-overlap audit is intentional and documented. +6. **Cohort/time-shift split**: AUC degradation under cohort split is within configured band. +7. **Calibration**: Brier score and reliability curve within tier bands. +8. **Lift / P@K / value capture**: difficulty signal visible across tiers in AP, P@K, lift, and model-family deltas (LR vs GBM). +9. **Public/instructor diff**: every difference is intentional and listed in the manifest. +10. **Platform packages**: Kaggle `dataset-metadata.json` validates against current platform requirements; HF `README.md` loads via local `load_dataset()` for every config; cover image meets Kaggle minimums. +11. 
**Notebooks**: all four notebooks run top-to-bottom and reproduce validation report metrics within tolerance. +12. **LLM critique**: one-shot pass produces structured findings; no unresolved high-severity findings. + +## Phase summary + +| Phase | Title | Size | Depends on | Status | +|---|---|---|---|---| +| 1 | Audit and naming | S | — | not started | +| 2 | Snapshot-safe relational export | M | 1 | not started | +| 3 | Release validation hardening | L | 2 | not started | +| 4 | Channel-signal audit + dataset card | M-S | 3 | not started | +| 5 | Platform packaging | M | 4 | not started | +| 6 | Notebook sequence + adversarial framing | M-L | 5 | not started | +| 7 | LLM critique + publish | M | 6 | not started | + +Each phase = one PR (or a small cluster of PRs against a feature branch). PRs follow `CLAUDE.md` workflow: branch → commit → update `.agent-plan.md` → PR with type+layer labels → milestone assignment. + +--- + +## Phase 1 — Audit and naming + +**Goal:** Reproduce the relational-leakage finding on the alpha bundles to confirm severity. Lock the dataset release name. Zero code changes. + +**Work items:** +- Run a leakage-probe script against `release/intermediate/tables/` to verify the v2 finding: train a join-only model using `opportunities.close_outcome` + customer/subscription existence; confirm it reconstructs `converted_within_90_days` with the predicted accuracy. +- Document the reproduction in `docs/release/v1_current_state_audit.md`. +- Confirm dataset release name `leadforge-lead-scoring-v1`; record decision in `v1_acceptance_gates.md`. + +**Files touched:** `docs/release/v1_current_state_audit.md` (new). No code. + +**Acceptance:** +- The relational-leakage finding is reproduced with a numeric AUC/accuracy. +- Dataset release name is committed. + +**PR labels:** `type: docs`. +**Milestone:** create new `v1.1.0 — Curated dataset v1 release` (or similar). 
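
The Phase 1 reproduction can start from a rule rather than a trained model: predict "converted" whenever a lead has a `closed_won` opportunity or a matching customer or subscription row. The sketch below is a rule-based stand-in for the join-only model named in the work items. It assumes `opportunities` joins on `lead_id` and that `customers` and `subscriptions` join on `account_id`; the real join keys should be taken from the alpha bundle schema. `join_only_accuracy` is a hypothetical name.

```python
def join_only_accuracy(leads, opportunities, customers, subscriptions):
    """Accuracy of a join-only rule against `converted_within_90_days`.

    Each argument is a list of row dicts. Join keys are assumptions:
    opportunities on `lead_id`, customers/subscriptions on `account_id`.
    """
    won_leads = {o["lead_id"] for o in opportunities
                 if o.get("close_outcome") == "closed_won"}
    customer_accounts = {c["account_id"] for c in customers}
    subscribed_accounts = {s["account_id"] for s in subscriptions}

    hits = 0
    for lead in leads:
        predicted = (
            lead["lead_id"] in won_leads
            or lead["account_id"] in customer_accounts
            or lead["account_id"] in subscribed_accounts
        )
        hits += int(predicted == bool(lead["converted_within_90_days"]))
    return hits / len(leads)


# Illustrative rows only; real inputs would be read from the bundle tables.
demo_leads = [
    {"lead_id": "L1", "account_id": "A1", "converted_within_90_days": 1},
    {"lead_id": "L2", "account_id": "A2", "converted_within_90_days": 0},
    {"lead_id": "L3", "account_id": "A3", "converted_within_90_days": 0},
]
demo_opportunities = [
    {"lead_id": "L1", "close_outcome": "closed_won"},
    {"lead_id": "L2", "close_outcome": "closed_lost"},
]
demo_accuracy = join_only_accuracy(
    demo_leads, demo_opportunities,
    customers=[{"account_id": "A1"}],
    subscriptions=[{"account_id": "A1"}],
)
```

If this rule alone already matches the label on the alpha bundles, a trained model is unnecessary for the audit, and the same function gives a direct before/after number once the public tables are regenerated.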
+ +--- + +## Phase 2 — Snapshot-safe relational export + +**Goal:** Eliminate the relational-leakage blocker. Public relational tables become snapshot-safe; full-horizon stays in the instructor companion only. + +**Work items:** +- New `leadforge/render/relational_snapshot_safe.py`: + - Filter `touches`, `sessions`, `sales_activities` to `timestamp <= lead_created_at + snapshot_day` per lead. + - Filter `opportunities` to `created_at <= lead_created_at + snapshot_day`; drop `close_outcome` and `closed_at` columns from public. + - Drop `converted_within_90_days` and `conversion_timestamp` from public `leads.parquet`. + - Omit `customers` and `subscriptions` from public bundles entirely (they exist only for converted leads — their presence is leakage). +- New `leadforge/validation/relational_leakage.py`: + - Probe: train a target-reconstruction model using only public relational features; assert AUC/accuracy below tolerance. + - Probe: assert no public table contains banned columns (configurable list). + - Probe: assert event tables contain no rows with `timestamp > lead_created_at + snapshot_day`. +- Update `leadforge/exposure/filters.py` and `leadforge/api/bundle.py` to route `student_public` through `relational_snapshot_safe`. +- Bundle schema bump: `BUNDLE_SCHEMA_VERSION` 4 → 5. Manifest records `relational_snapshot_safe: true` for `student_public`. Document the contract change in the dataset card. +- Update `leadforge/validation/bundle_checks.py` to call `relational_leakage.run_all_probes()` and fail the bundle on any violation. +- Tests: `tests/render/test_relational_snapshot_safe.py`, `tests/validation/test_relational_leakage.py`. Hash-determinism preserved. +- Regenerate alpha bundles using the new export; verify byte-identical regeneration with pinned timestamp. 
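The filtering rules above can be illustrated with a small standard-library sketch. It is not the planned `relational_snapshot_safe.py`; it assumes rows are dicts keyed by `lead_id`, that `snapshot_day` is a whole number of days, and that the timestamp column name is passed in per table. The helper names `filter_to_snapshot` and `drop_columns` are hypothetical.

```python
from datetime import datetime, timedelta

# Terminal-state columns named in the work items above.
BANNED_OPPORTUNITY_COLUMNS = {"close_outcome", "closed_at"}


def filter_to_snapshot(rows, lead_created_at, ts_col, snapshot_day):
    """Keep rows whose timestamp is at or before the lead's snapshot cutoff."""
    kept = []
    for row in rows:
        cutoff = lead_created_at[row["lead_id"]] + timedelta(days=snapshot_day)
        if row[ts_col] <= cutoff:
            kept.append(row)
    return kept


def drop_columns(rows, banned):
    """Return copies of the rows without the banned columns."""
    return [{k: v for k, v in row.items() if k not in banned} for row in rows]


# Toy example: one lead created on 1 Jan, snapshot taken 30 days later.
lead_created_at = {"L1": datetime(2025, 1, 1)}
opportunities = [
    {"lead_id": "L1", "created_at": datetime(2025, 1, 10),
     "close_outcome": "closed_won", "closed_at": datetime(2025, 3, 1)},
    {"lead_id": "L1", "created_at": datetime(2025, 2, 15),
     "close_outcome": "closed_lost", "closed_at": datetime(2025, 3, 5)},
]

public_opps = drop_columns(
    filter_to_snapshot(opportunities, lead_created_at, "created_at", snapshot_day=30),
    BANNED_OPPORTUNITY_COLUMNS,
)
print(public_opps)
```

The same `filter_to_snapshot` call would apply to `touches`, `sessions`, and `sales_activities` with their own timestamp columns; the cutoff is inclusive, matching the `<=` in the work items.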
+ +**Files touched:** +- `leadforge/render/relational_snapshot_safe.py` (new) +- `leadforge/validation/relational_leakage.py` (new) +- `leadforge/exposure/filters.py` +- `leadforge/api/bundle.py` +- `leadforge/render/relational.py` (refactor) +- `leadforge/render/manifests.py` (add `relational_snapshot_safe` flag, bump schema version) +- `leadforge/validation/bundle_checks.py` +- `tests/render/test_relational_snapshot_safe.py` (new) +- `tests/validation/test_relational_leakage.py` (new) +- `release/{intro,intermediate,advanced}/` regenerated; `release/intermediate_instructor/` retains full-horizon + +**Acceptance:** +- The Phase 1 leakage probe drops from "reconstructs target with 100% accuracy" to "below configured tolerance" on regenerated bundles. +- All existing tests pass; new tests added. +- Hash-determinism preserved across two builds with pinned timestamp. +- `instructor` companion still contains full-horizon tables for legitimate teaching use. + +**PR labels:** `type: feature`, `layer: render`, `layer: exposure`, `layer: validation`. +**Note:** This is the structural fix. Treat the regenerated bundles as a *fresh alpha*, not yet v1-ready. + +--- + +## Phase 3 — Release validation hardening + +**Goal:** Move beyond `leadforge validate` to a single reproducible release-grade validation artifact with charts, leakage probes, calibration, lift, and cross-seed bands. + +**Work items:** +- New `leadforge/validation/release_quality.py`: + - Computes ROC-AUC, PR-AUC, log loss, Brier score, calibration bins, lift@1/5/10%, P@50/100, recall@K, top-decile rate, expected ACV captured at K, model-family deltas (LR vs GBM vs source-only vs engagement-only vs leakage-probe vs ID-only vs stage-only vs post-snapshot-aggregates). + - Cross-seed stability (run N seeds; compute spread bands per metric). + - Cross-tier difficulty ordering check (AP, P@K, model-family delta — not AUC). 
+- New `leadforge/validation/leakage_probes.py`: + - 8.1 Direct leakage (per recommendations Guid §8.1): all-features vs no-suspect-cols vs IDs vs post-snapshot-aggregates deltas. + - 8.2 Time-window leakage: every public feature derives from events ≤ snapshot_day. + - 8.3 Relational leakage: re-runs Phase 2 probes over the RC bundles. + - 8.4 Split leakage: account/contact overlap, near-duplicate row detection. +- New `leadforge/validation/reporting.py`: + - Renders `validation_report.json` (machine-readable) and `validation_report.md` (human-readable). + - Renders figures: lift curves per tier, calibration reliability per tier, leakage delta bar chart, cohort-shift comparison, value-capture curves. +- New `scripts/validate_release_candidate.py` — reads RC bundles → runs all checks → writes `release/validation/validation_report.{json,md}` and `release/validation/figures/*.png`. +- Update `leadforge/validation/difficulty.py` to define tier bands per the new metrics (AP, P@K, GBM-vs-LR delta), not just conversion rates. +- Bands defined and documented in `v1_acceptance_gates.md`. +- Tests: synthetic minimal bundles to exercise each probe path. + +**Files touched:** +- `leadforge/validation/release_quality.py` (new) +- `leadforge/validation/leakage_probes.py` (new) +- `leadforge/validation/reporting.py` (new) +- `leadforge/validation/difficulty.py` +- `scripts/validate_release_candidate.py` (new) +- `tests/validation/test_release_quality.py`, `test_leakage_probes.py`, `test_reporting.py` (new) +- `release/validation/` (output directory, gitignored or committed depending on file sizes) + +**Acceptance:** +- `python scripts/validate_release_candidate.py release/` produces a `validation_report.{json,md}` and figures with no critical findings. +- All metrics on RC bundles fall within configured tier bands. +- Cross-seed bands established for every reported metric. + +**PR labels:** `type: feature`, `layer: validation`. 
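Three of the metrics named in this phase are simple enough to sketch in plain Python, which may help when sanity-checking whatever library implementation the release module ends up using. The function names below are illustrative, not the planned `release_quality.py` API, and ties in scores are broken by input order.

```python
def precision_at_k(y_true, scores, k):
    """Fraction of positives among the k highest-scored rows."""
    if not 0 < k <= len(y_true):
        raise ValueError("k must be between 1 and the number of rows")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def lift_at_fraction(y_true, scores, fraction):
    """Precision in the top fraction of rows divided by the base rate."""
    base_rate = sum(y_true) / len(y_true)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positives")
    k = max(1, int(round(len(y_true) * fraction)))
    return precision_at_k(y_true, scores, k) / base_rate


def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, y_true)) / len(y_true)


# Toy example: 10 leads, 2 conversions.
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1]

print("P@2:", precision_at_k(y_true, scores, 2))
print("lift@20%:", lift_at_fraction(y_true, scores, 0.20))
print("Brier:", brier_score(y_true, scores))
```

Lift is reported relative to the base rate, so it stays comparable across tiers with different conversion rates, which is one reason the roadmap prefers it to raw AUC for the difficulty ordering check.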
+ +--- + +## Phase 4 — Channel-signal audit + dataset card hardening + +**Goal:** Audit how strongly `source_channel` already signals conversion in the alpha bundles (per recommendation #8 v1 scope). Bring the dataset card to release-grade. + +**Work items:** +- New analysis script `scripts/audit_channel_signal.py`: + - For each tier, compute conversion rate by `source_channel`. + - Compute univariate AUC of `source_channel` against the target. + - Compare to gemini_v2's industry benchmarks (SEO ~51%, PPC ~26%, Email <1% MQL→SQL). + - Output `docs/release/channel_signal_audit.md`. +- Update `release/README.md` to a release-grade dataset card: + - Macro framing paragraph (one paragraph on 2024-2026 SaaS context — recommendation #19). + - Simulation simplifications section (per chatgpt v2 §2.6 — what's modeled / approximate / not modeled). + - Calibration documentation (link to validation report). + - Public-vs-companion redaction policy (concrete column lists). + - Intended use vs out-of-scope use. + - Known limitations. + - Adversarial framing pointer (link to break-me guide once Phase 6 lands). +- New `docs/release/generation_method.md` — full DGP summary written for external readers, separate from the release README. References the architecture spec but stands alone. +- New `docs/release/feature_dictionary.md` — narrative companion to the existing CSV feature dictionary. +- Validate all dataset-card content against Datasheets-for-Datasets / Data Cards Playbook checklist (provenance, motivation, content, quality, privacy, biases/limitations, intended use, out-of-scope use, maintenance). 
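The two channel-signal computations in the audit script can be sketched as follows. This is an illustration under stated assumptions rather than the planned `audit_channel_signal.py`: rows are dicts, the column names match those used in this roadmap, and the univariate score for each lead is simply its channel's observed conversion rate, computed in-sample (so the AUC is an optimistic upper bound on the channel signal).

```python
from collections import defaultdict


def conversion_rate_by_channel(rows, channel_col="source_channel",
                               target_col="converted_within_90_days"):
    """Map each channel to its observed conversion rate."""
    counts = defaultdict(lambda: [0, 0])  # channel -> [conversions, total]
    for row in rows:
        entry = counts[row[channel_col]]
        entry[0] += int(row[target_col])
        entry[1] += 1
    return {ch: conv / total for ch, (conv, total) in counts.items()}


def univariate_auc(rows, rates, channel_col="source_channel",
                   target_col="converted_within_90_days"):
    """AUC of the channel conversion rate as a score; ties count as 0.5."""
    by_score = defaultdict(lambda: [0, 0])  # score -> [positives, negatives]
    for row in rows:
        label = int(row[target_col])
        by_score[rates[row[channel_col]]][0 if label else 1] += 1
    total_pos = sum(pos for pos, _ in by_score.values())
    total_neg = sum(neg for _, neg in by_score.values())
    if total_pos == 0 or total_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    concordant, neg_below = 0.0, 0
    for score in sorted(by_score):
        pos, neg = by_score[score]
        concordant += pos * (neg_below + 0.5 * neg)
        neg_below += neg
    return concordant / (total_pos * total_neg)


# Toy data: 4 SEO leads (2 convert), 4 email leads (none convert).
rows = (
    [{"source_channel": "seo", "converted_within_90_days": y} for y in (1, 1, 0, 0)]
    + [{"source_channel": "email", "converted_within_90_days": 0} for _ in range(4)]
)
rates = conversion_rate_by_channel(rows)
print(rates)
print("univariate AUC:", round(univariate_auc(rows, rates), 3))
```

An AUC close to 0.5 would mean the channel carries little signal on its own; a value well above that is what the comparison against the industry benchmarks would need to explain.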
+ +**Files touched:** +- `scripts/audit_channel_signal.py` (new) +- `docs/release/channel_signal_audit.md` (new) +- `docs/release/generation_method.md` (new) +- `docs/release/feature_dictionary.md` (new) +- `release/README.md` (substantial rewrite) + +**Acceptance:** +- Channel-signal audit is conclusive: clear statement of how the alpha's channel signal compares to industry benchmarks. +- Dataset card passes Datasheets-for-Datasets template. +- A new reader (no leadforge context) can understand the dataset, its provenance, and its limitations from the README + linked docs alone. + +**PR labels:** `type: docs`. + +--- + +## Phase 5 — Platform packaging + +**Goal:** Generate Kaggle and Hugging Face upload artifacts programmatically. Dry-run validate both. + +**Work items:** +- New `scripts/package_kaggle_release.py`: + - Reads bundle manifests and feature dictionaries. + - Generates `release/kaggle/dataset-metadata.json` validated against current Kaggle constraints (title 6-50 chars, subtitle 20-80, slug 3-50, single license, schema fields in order, `expectedUpdateFrequency` from approved values, image ≥560×280). + - Copies / generates `release/dataset-cover-image.png` (≥560×280, 2:1 header crop, 1:1 thumbnail crop). + - Produces a Kaggle-shaped upload directory under `release/kaggle/`. + - Supports `--dry-run` mode: no upload, validates structure only. +- New `scripts/package_hf_release.py`: + - Generates `release/huggingface/README.md` with full YAML metadata: `pretty_name`, `license`, `language: en`, `task_categories: [tabular-classification]`, `size_categories`, `tags: [tabular, lead-scoring, synthetic-data, crm, b2b, datasets, pandas]`, `configs` for intro/intermediate/advanced with `default: true` on intermediate (or whichever tier is the recommended entry point). + - Symlinks/copies bundle files into a HF-loadable structure under `release/huggingface/`. + - Runs a local `load_dataset(local_path, "intro")`, `("intermediate")`, `("advanced")` smoke test. 
+- New `release/dataset-cover-image.png` — funnel-themed cover (procurement SaaS visual). Source: TBD (could be auto-generated from the validation figures or hand-designed). +- Sanity test: zip the Kaggle upload dir; verify `kaggle datasets create -p --dir-mode zip` would succeed (dry-run with credentials available, or shape-validate without). + +**Files touched:** +- `scripts/package_kaggle_release.py` (new) +- `scripts/package_hf_release.py` (new) +- `release/kaggle/` (new) +- `release/huggingface/` (new) +- `release/dataset-cover-image.png` (new) +- `release/HF_DATASET_CARD.md` superseded — moved to `docs/release/hf_dataset_card_legacy.md` or deleted (decide during PR) + +**Acceptance:** +- Both packagers run cleanly on a fresh build. +- Kaggle metadata passes constraint validation. +- HF `load_dataset()` smoke test passes for every config. +- Cover image meets Kaggle minimums. + +**PR labels:** `type: feature`, `layer: cli`, `layer: render`. + +--- + +## Phase 6 — Notebook sequence + adversarial framing + +**Goal:** Ship the 4-notebook teaching sequence (recommendation #7) and the public adversarial framing (recommendation #16). + +**Work items:** +- Update `release/notebooks/01_baseline_lead_scoring.ipynb`: + - Reproduce Phase 3 validation report metrics within tolerance. + - LR + GBM + value-aware ranking baseline. + - Decile lift chart, calibration plot, P@K table. +- New `release/notebooks/02_relational_feature_engineering.ipynb`: + - Load snapshot-safe relational tables. + - Demonstrate legal joins and feature engineering. + - Show that with relational features GBM lift over the flat-CSV baseline is meaningful. +- New `release/notebooks/03_leakage_and_time_windows.ipynb`: + - Deliberately add a leakage trap (instructor-side feature) to the student data. + - Train a model and show inflated AUC. + - Walk through why it's invalid; reference the recommendation pass / break-me guide. 
+- New `release/notebooks/04_lift_calibration_value_ranking.ipynb`: + - `expected_acv` × `P(convert)` — value-aware ranking. + - Calibration curves and reliability diagrams. + - Threshold selection for top-K capacity. + - Cohort-shift evaluation as the final stress test. +- New `.github/ISSUE_TEMPLATE/dataset_breakage_report.yml` — structured form for "I broke the dataset" reports. +- New `.github/ISSUE_TEMPLATE/realism_feedback.yml` — structured form for realism critiques. +- New `docs/release/break_me_guide.md`: + - Explicit invitations to: find direct leakage, reconstruct labels through joins, beat baseline lift legitimately, show unrealistic distributions, identify documentation ambiguity, find platform issues, propose better calibration sources. + - Triage labels: `critical-leakage` / `realism` / `difficulty` / `documentation` / `platform` / `notebook` / `pedagogy` / `v2-idea` / `out-of-scope-v1`. +- New `docs/release/v2_decision_log.md` — starts empty; populated post-launch as feedback flows in. + +**Files touched:** +- `release/notebooks/0{1,2,3,4}_*.ipynb` +- `.github/ISSUE_TEMPLATE/dataset_breakage_report.yml` (new) +- `.github/ISSUE_TEMPLATE/realism_feedback.yml` (new) +- `docs/release/break_me_guide.md` (new) +- `docs/release/v2_decision_log.md` (new) +- `release/README.md` (link to break-me guide) + +**Acceptance:** +- All four notebooks run top-to-bottom from a clean environment. +- Notebook outputs reproduce validation report metrics within tolerance. +- Issue templates render correctly on the GitHub web UI. + +**PR labels:** `type: feature`, `layer: recipes` (notebooks), `type: docs`. + +--- + +## Phase 7 — LLM critique + publish + +**Goal:** Run a structured one-shot LLM critique over the RC; resolve high-severity findings; tag and publish. + +**Work items:** +- New `leadforge/validation/llm_critique.py`: + - Single-provider abstraction (Anthropic Claude as default; provider chosen via env var). 
+ - Reads creds via env vars; skips cleanly with a clear message if absent (no failure). + - Prompt loaded from `docs/release/llm_critique_prompt.md`. + - Output schema (per Guid §12): release_id, model, run_timestamp, overall_score, findings[severity/category/claim/evidence/reproducer/suggested_fix], missing_sections[], questions_for_maintainer[]. + - Includes "Effective Semantic Diversity" as one rubric dimension (recommendation #12 v1 scope). +- New `docs/release/llm_critique_prompt.md` — the rubric document, structured as the prompt the script feeds. +- New `scripts/run_llm_critique.py` — driver: builds the input bundle (README.md, dataset card, generation method, manifest, feature dictionary, validation report, first 100 public rows, public/instructor diff summary, public-safe mechanism summary) → calls the critique → writes `release/validation/llm_critique_raw_*.json` and `release/validation/llm_critique_summary.md`. +- Adjudicate any high-severity findings; resolve in code or document acknowledgment in `v2_decision_log.md` if intentional-and-accepted. +- New `scripts/publish_kaggle.py` — uses `kagglehub.dataset_upload()` with `version_notes` containing the commit hash and tag. +- New `scripts/publish_hf.py` — uses `huggingface_hub.HfApi().upload_folder()` with the dataset repo type. +- Tag the release: `leadforge-lead-scoring-v1`. Tag the leadforge package release if a coordinated package version bump is needed (TBD — likely just a patch bump). +- `docs/release/v1_release_notes.md` — public-facing release notes. +- Both publish scripts exercised in **dry-run** before actual upload, then upload to **private/draft** repos for download smoke test, then promote to public. 
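The output schema listed for the critique lends itself to a small structural check before any finding is adjudicated. The sketch below uses only the field names given above; the severity label `"high"` and the helper names are assumptions, since the roadmap does not fix the severity vocabulary.

```python
REQUIRED_TOP_LEVEL = {
    "release_id", "model", "run_timestamp", "overall_score",
    "findings", "missing_sections", "questions_for_maintainer",
}
REQUIRED_FINDING_KEYS = {
    "severity", "category", "claim", "evidence", "reproducer", "suggested_fix",
}


def validate_report(report):
    """Return a list of human-readable schema problems (empty if none)."""
    problems = [f"missing top-level field: {key}"
                for key in sorted(REQUIRED_TOP_LEVEL - report.keys())]
    for index, finding in enumerate(report.get("findings", [])):
        for key in sorted(REQUIRED_FINDING_KEYS - finding.keys()):
            problems.append(f"finding {index} missing field: {key}")
    return problems


def high_severity(report, label="high"):
    """Findings whose severity equals the (assumed) high-severity label."""
    return [f for f in report.get("findings", []) if f.get("severity") == label]


# Toy report with one complete finding and one incomplete finding.
report = {
    "release_id": "leadforge-lead-scoring-v1",
    "model": "example-model",
    "run_timestamp": "2026-05-05T00:00:00Z",
    "overall_score": 7,
    "findings": [
        {"severity": "high", "category": "leakage", "claim": "c", "evidence": "e",
         "reproducer": "r", "suggested_fix": "s"},
        {"severity": "low", "category": "docs", "claim": "c"},
    ],
    "missing_sections": [],
    "questions_for_maintainer": [],
}
print(validate_report(report))
print(len(high_severity(report)), "high-severity finding(s)")
```

A driver could refuse to tag the release while `high_severity(report)` is non-empty and not yet acknowledged in `v2_decision_log.md`; how acknowledgment is recorded is left open here.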
+ +**Files touched:** +- `leadforge/validation/llm_critique.py` (new) +- `docs/release/llm_critique_prompt.md` (new) +- `docs/release/v1_release_notes.md` (new) +- `scripts/run_llm_critique.py`, `scripts/publish_kaggle.py`, `scripts/publish_hf.py` (new) +- `release/validation/llm_critique_raw_*.json`, `release/validation/llm_critique_summary.md` (output artifacts) + +**Acceptance:** +- LLM critique runs successfully with credentials; produces structured findings. +- No unresolved high-severity findings before tag. +- Both platform publishes succeed in dry-run. +- Both private/draft uploads succeed; download smoke test passes from a clean environment. +- Public Kaggle and HF pages render the dataset; `load_dataset()` from a clean env works. +- Feedback channels (issue templates, break-me guide) are linked from Kaggle, HF, and README. + +**PR labels:** `type: feature`, `layer: validation`, `layer: cli`. +**Note:** the publish step is the only step that requires manual approval and credentials. + +--- + +## Out-of-scope for this roadmap + +Out-of-scope items live in `post_v1_roadmap.md`. Highlights: + +- Channel-conditional MQL→SQL rates as a real generative axis (audit only in v1; full encoding deferred). +- Log-normal / Weibull sales-cycle distributions. +- Demographic noise injection (job title permutations forcing NLP). +- Quantitative semantic-diversity validator. +- Multi-provider LLM critique CI integration. +- CI workflow for release-candidate packaging. +- `leadforge release ...` CLI subcommand consolidation. +- Per-vertical industry calibration (cybersecurity, fintech). +- Second vertical, LTV labels, leaderboard mini-site. + +These are valuable but not v1-load-bearing. Most are post-v1-but-pre-v2-dataset; some are v2-vertical territory. + +## Open questions + +These need resolution during the roadmap, not before: + +1. **Difficulty bands** — concrete numeric ranges for AP, P@K, calibration, GBM-vs-LR delta per tier. 
To be set in `v1_acceptance_gates.md` during Phase 3. +2. **Cover image source** — generated from validation figures, hand-designed, or licensed stock. Decide during Phase 5. +3. **Should the instructor companion ship to HF as a separate config or as a separate repo?** Reviewer recommendation is "separate." This roadmap defaults to separate GitHub Release artifact + separate HF repo; revisit if HF tooling makes a single-repo split clean. +4. **Coordinated package version bump?** If new modules ship significant API surface (e.g., `leadforge.release` namespace), bump leadforge to 1.1.0 alongside the dataset tag. If purely internal, no bump. +5. **Where do regenerated bundles live during v1 work?** Options: continue using `release/` in-repo, branch-only; or switch to `leadforge-datasets` repo as the source of truth with the leadforge repo only producing the build script. Decide before Phase 5. + +## Status tracker + +Phase status is tracked in `.agent-plan.md` and updated on each PR merge per the branch workflow. From 2aedcdd99becc2de52f6dded9b3431d95374bb57 Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Tue, 5 May 2026 18:51:07 +0300 Subject: [PATCH 2/3] =?UTF-8?q?docs(roadmap):=20add=20PR=20breakdown=20?= =?UTF-8?q?=E2=80=94=2014=20PRs=20mapped=20across=207=20phases?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The original roadmap stopped at the phase level; reviewer asked for the phase → N PRs → which-files-each-PR-touches decomposition. Adds: - a "PRs" column to the phase summary table - a new "PR breakdown" section with per-phase enumeration of the 14 planned PRs (planning IDs phase.seq, not GitHub PR numbers) - per-PR labels, file lists, sizes, and intra-phase dependencies Total: 14 PRs targeting the v1.1.0 — Curated dataset v1 release milestone. Phase 3 has 3 PRs (the largest phase); Phases 1 and 4 are single-PR each. 
Co-Authored-By: Claude Opus 4.7 --- docs/release/v1_release_roadmap.md | 155 +++++++++++++++++++++++++++-- 1 file changed, 144 insertions(+), 11 deletions(-) diff --git a/docs/release/v1_release_roadmap.md b/docs/release/v1_release_roadmap.md index 3320dc2..051c80c 100644 --- a/docs/release/v1_release_roadmap.md +++ b/docs/release/v1_release_roadmap.md @@ -36,17 +36,150 @@ A release candidate is v1-ready when **all** of the following hold. Concrete ban ## Phase summary -| Phase | Title | Size | Depends on | Status | -|---|---|---|---|---| -| 1 | Audit and naming | S | — | not started | -| 2 | Snapshot-safe relational export | M | 1 | not started | -| 3 | Release validation hardening | L | 2 | not started | -| 4 | Channel-signal audit + dataset card | M-S | 3 | not started | -| 5 | Platform packaging | M | 4 | not started | -| 6 | Notebook sequence + adversarial framing | M-L | 5 | not started | -| 7 | LLM critique + publish | M | 6 | not started | - -Each phase = one PR (or a small cluster of PRs against a feature branch). PRs follow `CLAUDE.md` workflow: branch → commit → update `.agent-plan.md` → PR with type+layer labels → milestone assignment. +| Phase | Title | Size | PRs | Depends on | Status | +|---|---|---|---:|---|---| +| 1 | Audit and naming | S | 1 | — | not started | +| 2 | Snapshot-safe relational export | M | 2 | 1 | not started | +| 3 | Release validation hardening | L | 3 | 2 | not started | +| 4 | Channel-signal audit + dataset card | M-S | 1 | 3 | not started | +| 5 | Platform packaging | M | 2 | 4 | not started | +| 6 | Notebook sequence + adversarial framing | M-L | 3 | 5 | not started | +| 7 | LLM critique + publish | M | 2 | 6 | not started | + +**Total: 14 PRs.** Each PR follows the `CLAUDE.md` workflow: branch → commit → update `.agent-plan.md` → PR with type+layer labels → milestone assignment (`v1.1.0 — Curated dataset v1 release`). PR-level decomposition is in the **PR breakdown** section immediately below. 
+ +## PR breakdown + +First-cut decomposition of the 7 phases into ~14 PRs. The numbering `phase.seq` is a planning ID, not a GitHub PR number. Sizes are estimates; we may merge or split during implementation. Within a phase, PRs are typically sequential (later sub-PRs depend on earlier ones); cross-phase dependencies follow the phase summary above. + +### Phase 1 — Audit and naming (1 PR) + +- **PR 1.1** — `docs: Phase 1 audit + dataset name decision` + - `docs/release/v1_current_state_audit.md` (the reproduction of the relational-leakage finding) + - `scripts/probe_relational_leakage.py` (small probe script; also seeds the Phase 3 leakage_probes module) + - Updates `v1_acceptance_gates.md` to lock G1.1 (dataset name `leadforge-lead-scoring-v1`) + - Labels: `type: docs` + - Size: S (~300 lines) + +### Phase 2 — Snapshot-safe relational export (2 PRs) + +- **PR 2.1** — `feat(render): snapshot-safe relational export + leakage validator` + - `leadforge/render/relational_snapshot_safe.py` (new) + - `leadforge/validation/relational_leakage.py` (new) + - `tests/render/test_relational_snapshot_safe.py`, `tests/validation/test_relational_leakage.py` + - Labels: `type: feature`, `layer: render`, `layer: validation` + - Size: M (~600 lines) + +- **PR 2.2** — `feat(exposure): route student_public through snapshot-safe export; bundle schema v5` + - Wire `relational_snapshot_safe` into `leadforge/exposure/filters.py`, `leadforge/api/bundle.py` + - `BUNDLE_SCHEMA_VERSION` 4 → 5; manifest field `relational_snapshot_safe` + - `leadforge/validation/bundle_checks.py` calls relational-leakage probes + - Regenerate alpha bundles under `release/` with pinned timestamp; hash-determinism check + - Labels: `type: feature`, `layer: exposure`, `layer: render` + - Size: M (~500 lines + regenerated parquet bundles) + - Depends on PR 2.1 + +### Phase 3 — Release validation hardening (3 PRs) + +- **PR 3.1** — `feat(validation): leakage_probes module` + - 
`leadforge/validation/leakage_probes.py` — direct + time-window + relational + split + model-realism probes (per Guid §8 taxonomy) + - Tests with synthetic minimal bundles + - Labels: `type: feature`, `layer: validation` + - Size: M (~600 lines) + +- **PR 3.2** — `feat(validation): release_quality + reporting modules` + - `leadforge/validation/release_quality.py` — calibration, lift, P@K, value capture, model-family deltas, cross-seed bands + - `leadforge/validation/reporting.py` — JSON+MD report rendering + matplotlib figures + - Tests + - Labels: `type: feature`, `layer: validation` + - Size: L (~900 lines) + +- **PR 3.3** — `feat(scripts): validate_release_candidate driver + acceptance bands resolved` + - `scripts/validate_release_candidate.py` (the driver) + - Update `leadforge/validation/difficulty.py` with new band checks (AP, P@K, GBM-vs-LR delta, calibration) + - Resolve `TBD-*` bands in `v1_acceptance_gates.md` using baseline measurements + - Generate first `release/validation/validation_report.{json,md}` + figures + - Labels: `type: feature`, `layer: validation`, `layer: cli` + - Size: M (~500 lines) + - Depends on PR 3.1, PR 3.2 + +### Phase 4 — Channel-signal audit + dataset card hardening (1 PR) + +- **PR 4.1** — `docs/feat: channel-signal audit + release-grade dataset card` + - `scripts/audit_channel_signal.py` (analysis script) + - `docs/release/channel_signal_audit.md` (audit results vs gemini_v2 industry benchmarks) + - `docs/release/generation_method.md` (standalone DGP summary for external readers) + - `docs/release/feature_dictionary.md` (narrative companion to feature dict CSV) + - `release/README.md` rewrite (release-grade dataset card; macro-framing paragraph; simulation-simplifications section) + - Labels: `type: docs` + - Size: M-S (~700 lines, mostly prose) + +### Phase 5 — Platform packaging (2 PRs) + +- **PR 5.1** — `feat(scripts): Kaggle release packager + cover image` + - `scripts/package_kaggle_release.py` — generates and validates 
`release/kaggle/dataset-metadata.json` + - `release/dataset-cover-image.png` (≥560×280; design TBD per roadmap open question) + - `release/kaggle/dataset-metadata.json` (generated) + - Kaggle dry-run package validation + - Labels: `type: feature`, `layer: cli` + - Size: M (~500 lines) + +- **PR 5.2** — `feat(scripts): HF release packager + load_dataset smoke test` + - `scripts/package_hf_release.py` — generates `release/huggingface/README.md` with full YAML metadata + - Local `load_dataset()` smoke test for every config + - Companion repo packaging stub for `leadforge-lead-scoring-v1-instructor` + - Labels: `type: feature`, `layer: cli` + - Size: M (~500 lines) + +### Phase 6 — Notebook sequence + adversarial framing (3 PRs) + +- **PR 6.1** — `notebooks: 01 baseline (refresh) + 02 relational feature engineering` + - Update `release/notebooks/01_baseline_lead_scoring.ipynb` to reproduce Phase 3 validation report metrics within tolerance + - New `release/notebooks/02_relational_feature_engineering.ipynb` — uses snapshot-safe relational tables; demonstrates legal joins + - Labels: `type: feature`, `layer: recipes` + - Size: M (~400 lines committed JSON; conceptually large) + +- **PR 6.2** — `notebooks: 03 leakage + 04 lift/calibration/value` + - New `release/notebooks/03_leakage_and_time_windows.ipynb` — leakage trap demo + walkthrough + - New `release/notebooks/04_lift_calibration_value_ranking.ipynb` — value-aware ranking + calibration + cohort-shift evaluation + - Labels: `type: feature`, `layer: recipes` + - Size: M (~400 lines committed JSON; conceptually large) + +- **PR 6.3** — `docs/feat(github): adversarial framing — issue templates + break-me guide` + - `.github/ISSUE_TEMPLATE/dataset_breakage_report.yml` + - `.github/ISSUE_TEMPLATE/realism_feedback.yml` + - `docs/release/break_me_guide.md` + - `docs/release/v2_decision_log.md` (empty stub) + - `release/README.md` updated to link to the break-me guide + - Labels: `type: docs` + - Size: S (~300 lines) + +### 
Phase 7 — LLM critique + publish (2 PRs) + +- **PR 7.1** — `feat(validation): llm_critique module + prompt + driver` + - `leadforge/validation/llm_critique.py` — single-provider, env-var creds, skip-cleanly without + - `docs/release/llm_critique_prompt.md` — the rubric document + - `scripts/run_llm_critique.py` — driver script + - First critique run committed to `release/validation/llm_critique_*.{json,md}` + - Adjudicate any high-severity findings (resolve in code in this or a follow-up PR; or document in `v2_decision_log.md`) + - Labels: `type: feature`, `layer: validation` + - Size: M (~500 lines) + +- **PR 7.2** — `feat(scripts): publish_kaggle + publish_hf + tag v1 release` + - `scripts/publish_kaggle.py` + - `scripts/publish_hf.py` + - `docs/release/v1_release_notes.md` + - Dry-run → private/draft → public publish (manual step performed by maintainer with credentials, within the PR or as a follow-up release tag) + - Tag `leadforge-lead-scoring-v1` + - Labels: `type: feature`, `layer: cli` + - Size: S (~300 lines code + manual publish step) + +## PR breakdown — totals + +- **14 PRs** across 7 phases. +- Estimated total LoC: ~6,500 (excluding regenerated parquet bundles and notebook JSON). +- All 14 PRs target the `v1.1.0 — Curated dataset v1 release` GitHub milestone. +- Calendar duration is not committed; depends on iteration cadence and review feedback. 
--- From ec12b88399cf12eeb8519f6d01c5657d838b0417 Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Tue, 5 May 2026 19:01:38 +0300 Subject: [PATCH 3/3] =?UTF-8?q?docs:=20address=20Copilot=20review=20?= =?UTF-8?q?=E2=80=94=20accurate=20timestamp=20columns,=20tag/milestone=20n?= =?UTF-8?q?aming,=20CSV=20split=20contract?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Six accepted review findings from PR #61 Copilot review (the three "||"-prefix table claims were false positives and resolved as irrelevant): - COPILOT-1 (v1_release_design.md:37): clarify that flat task splits are emitted as Parquet at tasks//{train,valid,test}.parquet in the current alpha contract; CSV exports are a Phase 5 deliverable. Standardize on `valid.csv` (matches existing `valid.parquet`), drop the `validation.csv`/`valid.csv` divergence. - COPILOT-2/3/4: replace fictional `event_timestamp` with the real per-table timestamp columns (`touch_timestamp`, `session_timestamp`, `activity_timestamp`) plus opportunities' `created_at`. Updates v1_release_design.md (the design statement), v1_acceptance_gates.md (gate G4.4), and CLAUDE.md (hard constraint). Makes all three machine-checkable against the actual schema. - COPILOT-8: resolve git-tag inconsistency. v1_release_design.md:25 no longer mentions `dataset/v1`; standardizes on `leadforge-lead-scoring-v1` as both the git tag and Release name. - COPILOT-9: rename GitHub milestone #7 from "v1.1.0 — Curated dataset v1 release" (which read like a semver package version) to "dataset: leadforge-lead-scoring-v1" — explicitly dataset-scoped, no conflict with the package version decoupling principle. Three references in v1_release_roadmap.md updated to match. Three Copilot threads (COPILOT-5/6/7) claimed tables had `||` prefixes; verified false on inspection — tables use standard `|` delimiters. Resolved as not-applicable. 
Co-Authored-By: Claude Opus 4.7 --- CLAUDE.md | 2 +- docs/release/v1_acceptance_gates.md | 2 +- docs/release/v1_release_design.md | 6 +++--- docs/release/v1_release_roadmap.md | 6 +++--- 4 files changed, 8 insertions(+), 8 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index ffdb587..e93486c 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -204,7 +204,7 @@ Key abstractions: `Recipe`, `GenerationConfig`, `WorldSpec`, `WorldBundle`, `Exp ## Hard Constraints — Do Not Violate - Never use a single fixed hidden world (DGP must vary by motif family + rewiring). - Never leak post-snapshot-anchor data into flat task features. -- **Never publish public relational tables that allow label reconstruction via joins.** Public relational exports must be snapshot-safe: event tables filtered to `event_timestamp <= lead_created_at + snapshot_day`; no terminal-state fields (`close_outcome`, `closed_at`, `converted_within_90_days`, `conversion_timestamp`) in public `leads`/`opportunities`; no conversion-conditional entities (`customers`, `subscriptions`) in public bundles. +- **Never publish public relational tables that allow label reconstruction via joins.** Public relational exports must be snapshot-safe: every `*_timestamp` column in event tables (`touches.touch_timestamp`, `sessions.session_timestamp`, `sales_activities.activity_timestamp`) must satisfy `<= lead_created_at + snapshot_day`; `opportunities` must be filtered by `created_at <= lead_created_at + snapshot_day`; no terminal-state fields (`close_outcome`, `closed_at`, `converted_within_90_days`, `conversion_timestamp`) in public `leads`/`opportunities`; no conversion-conditional entities (`customers`, `subscriptions`) in public bundles. - Never require external APIs for core generation. - Never publish hidden truth in `student_public` mode. - Never derive `converted_within_90_days` as a directly sampled label; it must emerge from simulated events. 
diff --git a/docs/release/v1_acceptance_gates.md b/docs/release/v1_acceptance_gates.md index cd9af79..31f2a6c 100644 --- a/docs/release/v1_acceptance_gates.md +++ b/docs/release/v1_acceptance_gates.md @@ -36,7 +36,7 @@ This is the gate that motivates the v1 release. Failures here are blockers. - **G4.1** Public `tables/leads.parquet` does **not** contain `converted_within_90_days` or `conversion_timestamp`. - **G4.2** Public `tables/opportunities.parquet` does **not** contain `close_outcome` or `closed_at`. - **G4.3** Public bundles do **not** contain `tables/customers.parquet` or `tables/subscriptions.parquet`. -- **G4.4** Public event tables (`touches`, `sessions`, `sales_activities`) contain no rows where `event_timestamp > lead_created_at + snapshot_day`. +- **G4.4** Public event tables contain no rows past the snapshot: no `touches` row with `touch_timestamp > lead_created_at + snapshot_day`, no `sessions` row with `session_timestamp > lead_created_at + snapshot_day`, no `sales_activities` row with `activity_timestamp > lead_created_at + snapshot_day`. Public `opportunities` rows must satisfy `created_at <= lead_created_at + snapshot_day`. - **G4.5** Probabilistic relational reconstruction probe: a model trained using only public relational features (joined on `lead_id`/`account_id`/`contact_id`) achieves AUC ≤ TBD-G4.5 against `converted_within_90_days`. Threshold derived during Phase 3 from honest-feature baseline. - **G4.6** Manifest field `relational_snapshot_safe == true` for `student_public` bundles; `false` for `research_instructor`. diff --git a/docs/release/v1_release_design.md b/docs/release/v1_release_design.md index 0819fb4..ac7ba27 100644 --- a/docs/release/v1_release_design.md +++ b/docs/release/v1_release_design.md @@ -22,7 +22,7 @@ a coordinated bump of the framework that is not actually planned. **Implication for releases:** - Kaggle dataset slug: `leadforge-lead-scoring-v1` (or `/leadforge-lead-scoring-v1`). 
 - Hugging Face repo: `/leadforge-lead-scoring-v1`.
-- GitHub release tag: `dataset/v1` or similar, distinct from package tags.
+- GitHub release tag and Release name: `leadforge-lead-scoring-v1`. Distinct from any package tags.
 
 ## Dataset family architecture
 
@@ -34,7 +34,7 @@ The release is a *family*, not a single CSV.
 - **advanced** — hardest tier (~7.9% conversion; AUC ~0.87).
 
 Each tier is a complete bundle:
-- Flat task splits: `train.csv`, `validation.csv`, `test.csv`, plus a single joined `lead_scoring.csv`.
+- Flat task splits as Parquet at `tasks//{train,valid,test}.parquet` (current alpha contract) plus a single joined `lead_scoring.csv` with a `split` column. Phase 5 platform packaging will additionally emit `{train,valid,test}.csv` exports for Kaggle/HF consumers who prefer flat CSV; filenames mirror the parquet split names (i.e. `valid.csv`, not `validation.csv`, to keep one canonical name).
 - **Snapshot-safe relational tables** (see below): `accounts.parquet`, `contacts.parquet`, `leads.parquet`, `touches.parquet`, `sessions.parquet`, `sales_activities.parquet`, `opportunities.parquet`. **No `customers` or `subscriptions` tables in public bundles.**
 - `manifest.json`, `feature_dictionary.csv`, `dataset_card.md`.
 
@@ -67,7 +67,7 @@ Verified in a 500-lead `student_public` smoke bundle by ChatGPT v2 reviewer (cha
 ### Decision
 
 A new module `leadforge/render/relational_snapshot_safe.py` produces a **snapshot-safe** relational export for `student_public` bundles. Properties:
-1. **Event tables** (`touches`, `sessions`, `sales_activities`) are filtered to `event_timestamp <= lead_created_at + snapshot_day`. The same temporal boundary used for flat-CSV features.
+1. **Event tables** are filtered per-table to their snapshot-relative timestamp column: `touches.touch_timestamp`, `sessions.session_timestamp`, `sales_activities.activity_timestamp` — each must satisfy `<= lead_created_at + snapshot_day`. Same temporal boundary used for flat-CSV features.
 2. **`leads.parquet`** drops `converted_within_90_days` and `conversion_timestamp`. The label only lives in the task splits, where it is the explicit y-column.
 3. **`opportunities.parquet`** is filtered to `created_at <= lead_created_at + snapshot_day` and drops `close_outcome` and `closed_at`.
 4. **`customers.parquet` and `subscriptions.parquet`** are omitted from public bundles entirely.
diff --git a/docs/release/v1_release_roadmap.md b/docs/release/v1_release_roadmap.md
index 051c80c..e7b0bb2 100644
--- a/docs/release/v1_release_roadmap.md
+++ b/docs/release/v1_release_roadmap.md
@@ -46,7 +46,7 @@ A release candidate is v1-ready when **all** of the following hold. Concrete ban
 | 6 | Notebook sequence + adversarial framing | M-L | 3 | 5 | not started |
 | 7 | LLM critique + publish | M | 2 | 6 | not started |
 
-**Total: 14 PRs.** Each PR follows the `CLAUDE.md` workflow: branch → commit → update `.agent-plan.md` → PR with type+layer labels → milestone assignment (`v1.1.0 — Curated dataset v1 release`). PR-level decomposition is in the **PR breakdown** section immediately below.
+**Total: 14 PRs.** Each PR follows the `CLAUDE.md` workflow: branch → commit → update `.agent-plan.md` → PR with type+layer labels → milestone assignment (`dataset: leadforge-lead-scoring-v1`). PR-level decomposition is in the **PR breakdown** section immediately below.
 
 ## PR breakdown
 
@@ -178,7 +178,7 @@ First-cut decomposition of the 7 phases into ~14 PRs. The numbering `phase.seq`
 
 - **14 PRs** across 7 phases.
 - Estimated total LoC: ~6,500 (excluding regenerated parquet bundles and notebook JSON).
-- All 14 PRs target the `v1.1.0 — Curated dataset v1 release` GitHub milestone.
+- All 14 PRs target the `dataset: leadforge-lead-scoring-v1` GitHub milestone.
 - Calendar duration is not committed; depends on iteration cadence and review feedback.
 
 ---
@@ -199,7 +199,7 @@ First-cut decomposition of the 7 phases into ~14 PRs. The numbering `phase.seq`
 - Dataset release name is committed.
 
 **PR labels:** `type: docs`.
-**Milestone:** create new `v1.1.0 — Curated dataset v1 release` (or similar).
+**Milestone:** create new `dataset: leadforge-lead-scoring-v1` (or similar).
 
 ---