diff --git a/.agent-plan.md b/.agent-plan.md index 3c29ba0..89a3dfe 100644 --- a/.agent-plan.md +++ b/.agent-plan.md @@ -46,10 +46,12 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family - [x] PR 4.1: `release/README.md` (substantial rewrite) — release-grade dataset card per Datasheets-for-Datasets / Data Cards Playbook checklist (G10.1). New sections: macro framing paragraph (2024–2026 SaaS context, recommendation #19), simulation simplifications (modelled / approximate / not modelled, per chatgpt v2 §2.6), calibration documentation linking to `release/validation/validation_report.md`, public-vs-instructor redaction policy with concrete column lists citing `BANNED_LEAD_COLUMNS` / `BANNED_OPP_COLUMNS` / `BANNED_TABLES` / `SNAPSHOT_FILTERED_TABLES` from `leadforge/validation/leakage_probes.py`, intended-use vs out-of-scope-use, known limitations (G7.4.4 GBM−LR sign finding, weak channel signal from the Phase 4 audit, flat AUC across tiers, small cohort-shift gap), composition section per Datasheets format, adversarial-framing pointer (placeholder link to `docs/release/break_me_guide.md` that lands in PR 6.3), and a maintenance plan. Every realism / calibration / difficulty claim in the card is anchored to `validation_report.md` per G10.6. `BUNDLE_SCHEMA_VERSION` unchanged at 5 (documentation-only PR); 1167/1167 tests pass; ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0. ### Phase 5 — Platform packaging -- [ ] `scripts/package_kaggle_release.py` → `release/kaggle/dataset-metadata.json` -- [ ] `scripts/package_hf_release.py` → `release/huggingface/README.md` with YAML configs/default/pretty_name/tags -- [ ] `release/dataset-cover-image.png` (≥560×280) -- [ ] Local `load_dataset()` smoke test; Kaggle dry-run package validation +- [x] PR 5.1: `scripts/package_kaggle_release.py` (new) — Kaggle release packager. Reads each public tier's `manifest.json` + `feature_dictionary.csv` + flat CSV header under `release/`, emits `release/kaggle/dataset-metadata.json` validated against G11.1 (title 6-50 chars, subtitle 20-80 chars, slug 3-50 chars, single MIT license, `expectedUpdateFrequency=never`, image filename, `resources[].schema.fields` in column order for every tabular resource). Schema fields cover both flat CSVs (driven by `feature_dictionary.csv`) and parquet files (driven by `pyarrow.parquet.read_schema`). The metadata's `description` field inlines `release/README.md` with three Kaggle-specific rewrites: source-repo tree diagram → upload-tree diagram, `](../foo)` → GitHub blob URL via regex, `](validation/validation_report.md)` → GitHub blob URL. Default `id` follows Kaggle's actual `/` schema (`leadforge/leadforge-lead-scoring-v1`), so PR 7.2's publish script does not have to splice in a username at upload time. CLI: `--release-dir`, `--kaggle-dir`, `--tier`, `--user-slug`, `--dataset-slug`, `--cover-image`, `--dry-run`, `--print`. Exit codes: 0 pass / 1 validation failure / 2 pre-flight error. +- [x] PR 5.1: `scripts/generate_cover_image.py` (new) — deterministic Pillow + DejaVu Sans (bundled with matplotlib) renderer producing `release/dataset-cover-image.png` at 1280×640 (well above the 560×280 minimum, 2:1 aspect for Kaggle's header crop). Three-tier card design surfacing the cross-seed median conversion rate + LR AUC for each tier, pinned from `release/validation/validation_report.md`. Byte-identical re-runs guarded by `tests/scripts/test_generate_cover_image.py`. +- [x] PR 5.1: Upload-dir assembly under `release/kaggle/` uses relative symlinks for the heavy bundle directories + cover image + LICENSE, plus a real file copy for `README.md` (rewritten on the way in so its `../` links and tree diagram render correctly on the Kaggle dataset page). `_validate_kaggle_dir_safe` refuses to assemble into `cwd` / `release_dir` / its parent / the filesystem anchor. `release/kaggle/*` is gitignored except for `dataset-metadata.json` itself — only the metadata is committed; the upload tree is regenerated on demand. +- [x] PR 5.1: 19 new tests (`tests/scripts/test_package_kaggle_release.py` × 15, `tests/scripts/test_generate_cover_image.py` × 4): every Kaggle field constraint, schema field order parity for CSV + parquet, README rewriting (tree + `../` + validation report links), unsafe-kaggle-dir rejection, CLI rc=2 on missing release dir, byte-determinism (audit-artifact-sync), and committed-metadata-matches-fresh-regeneration sync check. 1194/1194 tests pass; ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (this PR doesn't touch the bundle shape). +- [ ] PR 5.2: `scripts/package_hf_release.py` → `release/huggingface/README.md` with YAML configs/default/pretty_name/tags +- [ ] PR 5.2: Local `load_dataset()` smoke test; Kaggle dry-run package validation ### Phase 6 — Notebook sequence + adversarial framing - [ ] `release/notebooks/{02_relational_feature_engineering,03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb` diff --git a/.gitignore b/.gitignore index cd71ed6..35abd8e 100644 --- a/.gitignore +++ b/.gitignore @@ -218,3 +218,9 @@ release/intermediate_instructor/ release/LICENSE release/_determinism/ release/_release_quality/ + +# Generated Kaggle upload tree (PR 5.1) — only dataset-metadata.json is +# committed; the rest is reassembled on demand via +# scripts/package_kaggle_release.py from release/{intro,intermediate,advanced}/. +release/kaggle/* +!release/kaggle/dataset-metadata.json diff --git a/release/dataset-cover-image.png b/release/dataset-cover-image.png new file mode 100644 index 0000000..912bb43 Binary files /dev/null and b/release/dataset-cover-image.png differ diff --git a/release/kaggle/dataset-metadata.json b/release/kaggle/dataset-metadata.json new file mode 100644 index 0000000..2f4b9b2 --- /dev/null +++ b/release/kaggle/dataset-metadata.json @@ -0,0 +1,2572 @@ +{ + "collaborators": [], + "description": "# LeadForge: Synthetic B2B Lead Scoring Dataset (`leadforge-lead-scoring-v1`)\n\nA relational, reproducible, three-tier synthetic CRM dataset family for\nteaching lead scoring at scale. Generated by\n[leadforge](https://github.com/leadforge-dev/leadforge), an\nopen-source Python framework for synthetic CRM/funnel data. The\nframework version is decoupled from the dataset version: the package\nstays at `1.x`; the dataset is published under the explicit `…-v1`\ntag.\n\n## Why lead scoring matters in 2024–2026\n\nMid-market SaaS vendors entered 2024–2026 with growth slowing and\ncustomer-acquisition costs rising[^macro], so predicting *which* leads\nconvert within a fixed window has moved from a marketing nicety to a\nsurvival skill. This dataset teaches that skill on a relational\nsubstrate, with the realistic confusions (snapshot-window discipline,\nleakage traps, channel signal weaker than vendor blogs imply) that\nstudents will hit when they finally get hands on real CRM data.\n\n[^macro]: Macroeconomic framing summarised in\n[`docs/external_review/summaries/gemini_v2_summary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md)\n(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio\nrose materially in 2024).\n\n## What's inside\n\n```\n.\n├── intro/ intermediate/ advanced/ # student_public bundles, one per difficulty tier\n│ ├── manifest.json # provenance + file hashes\n│ ├── dataset_card.md # auto-rendered per-bundle card\n│ ├── feature_dictionary.csv # authoritative column spec\n│ ├── lead_scoring.csv # flat convenience CSV (all splits)\n│ ├── tables/*.parquet # 7 snapshot-safe relational tables\n│ └── tasks/converted_within_90_days/{train,valid,test}.parquet\n├── dataset-metadata.json # Kaggle dataset metadata\n├── dataset-cover-image.png # Kaggle cover image\n├── README.md # Kaggle package README\n└── LICENSE\n```\n\n`student_public` bundles ship the snapshot-safe relational view;\n`research_instructor` companions ship the full-horizon view plus the\nhidden causal structure (DAG, latent registry, mechanism summary)\nunder `metadata/`. The full layout is documented in each bundle's\n`manifest.json`.\n\n## Quick start\n\n```python\n# Flat CSV\ndf = pd.read_csv(\"intermediate/lead_scoring.csv\")\n\n# Parquet task splits (recommended)\ntrain = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/train.parquet\")\ntest = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/test.parquet\")\n\n# Relational tables (feature engineering — example)\nleads = pd.read_parquet(\"intermediate/tables/leads.parquet\")\ntouches = pd.read_parquet(\"intermediate/tables/touches.parquet\")\nmy_touch_count = (\n touches.groupby(\"lead_id\").size().rename(\"my_touch_count\").reset_index()\n)\nfeatures = leads.merge(my_touch_count, on=\"lead_id\", how=\"left\")\n\n# Reproduce from source\n# pip install leadforge\n# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \\\n# --mode student_public --difficulty intermediate --out my_bundle\n```\n\nThe label `converted_within_90_days` resolves over a 90-day window;\nengagement features (`touch_count`, `session_count`, etc.) are\ncomputed strictly over events on days `[0, 30]`. The deliberate\nexception is `total_touches_all`, the leakage trap — flagged\n`leakage_risk=True` in `feature_dictionary.csv`. Drop it from your\nfeature set unless you're demonstrating leakage detection.\n\n## Dataset summary\n\n| | Intro | Intermediate | Advanced |\n|---|---|---|---|\n| Leads | 5,000 | 5,000 | 5,000 |\n| Accounts | 1,500 | 1,500 | 1,500 |\n| Contacts | 4,200 | 4,200 | 4,200 |\n| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |\n| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |\n| Conversion rate (recipe band) | 24–61% | 12–31% | 4–12% |\n| Conversion rate (median, seeds 42–46) | 42.67% | 21.60% | 8.40% |\n| Signal strength | 0.90 | 0.70 | 0.50 |\n| Noise scale | 0.10 | 0.30 | 0.55 |\n| Missing rate | 2% | 8% | 18% |\n\n\\* `student_public` / `research_instructor`. Difficulty is modulated\nby the simulation engine — signal strength on latent-trait weights,\nGaussian noise on float features, MCAR missingness, outlier rate —\nnot post-hoc label flipping.\n\n## The scenario\n\n**Veridian Technologies** is a fictional Series B startup (Austin, US)\nselling **Veridian Procure**, a procurement / AP automation SaaS, to\nmid-market firms (200–2,000 employees) in the US and UK. The funnel\nruns through inbound marketing (45%), SDR outbound (35%), and\npartner referrals (20%); four personas drive deals (VP Finance, AP\nManager, IT Director, Procurement Manager). **Task:** predict whether\na lead converts (`closed_won`) within 90 days. ACV bands are\n$18k–$120k. See\n[`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md)\nfor the full DGP, and the deeper \"what's modelled / approximate / not\nmodelled\" breakdown that this README only summarises.\n\n## Public vs instructor: what's redacted\n\nFiltering happens **during rendering**, not during simulation. The\nredaction contract is single-sourced in\n[`leadforge/validation/leakage_probes.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py);\nthe snapshot-safe writer and the validator import the same constants,\nso they cannot drift apart.\n\n| Source-of-truth constant | Public bundle treatment |\n|---|---|\n| `BANNED_LEAD_COLUMNS = (\"converted_within_90_days\", \"conversion_timestamp\")` | Dropped from `tables/leads.parquet` |\n| `BANNED_OPP_COLUMNS = (\"close_outcome\", \"closed_at\")` | Dropped from `tables/opportunities.parquet` |\n| `BANNED_TABLES = (\"customers\", \"subscriptions\")` | Omitted from public bundles |\n| `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` |\n| Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` |\n| `total_touches_all` (deliberate trap) | **Retained in both modes**; flagged `leakage_risk=True` |\n\nEach bundle's `manifest.json` records `relational_snapshot_safe`,\n`redacted_columns`, and `snapshot_day`, so the bundle is\nself-describing.\n\n## Calibration\n\nEvery realism / calibration / difficulty claim in this README is\nbacked by\n[`validation/validation_report.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md),\nregenerated by\n[`scripts/validate_release_candidate.py`](https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py)\nwith bands declared in\n[`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).\nHeadline cross-seed medians (seeds 42–46):\n\n| Tier | LR AUC | AP | P@100 | Brier |\n|---|---|---|---|---|\n| intro | 0.879 | 0.761 | 0.80 | 0.130 |\n| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |\n| advanced | 0.886 | 0.351 | 0.34 | 0.061 |\n\nAP, P@100, conversion-rate, and lift orderings hold across the\nintended difficulty axis (intro > intermediate > advanced).\n\n## Intended uses\n\n- Teaching baseline lead-scoring on a flat snapshot.\n- Teaching relational feature engineering against snapshot-safe tables.\n- Teaching leakage detection (the `total_touches_all` trap is\n designed to be discoverable).\n- Teaching calibration, lift, P@K, value-aware ranking\n (`expected_acv × P(convert)`), and cohort-shift evaluation.\n- Comparing model families under a controlled DGP.\n\n## Out-of-scope uses\n\n- **Production lead scoring.** The company, product, and customers are\n fictional.\n- **Vendor benchmarking / paper baselines.** Difficulty tiers are\n calibrated for pedagogy, not cross-paper comparability.\n- **Causal-inference research that requires recovery of the true DGP.**\n The instructor companion exposes the hidden graph for teaching, not\n designed counterfactuals.\n- **Demographic / fairness research.** v1 does not model protected\n attributes.\n\n## Known limitations\n\n- **Difficulty signal on raw AUC is flat.** LR AUC is ~0.88 across\n every tier. Difficulty is visible in AP, P@K, Brier, and value\n capture. Treat AUC as a sanity check, not a difficulty signal.\n- **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta\n is slightly negative in every tier (intro −0.0045, intermediate\n −0.0072, advanced −0.0133); v1's snapshot is dominated by linear\n features. v2 will inject non-linear interactions in the simulator.\n- **Channel signal is weak.** Per\n [`docs/release/channel_signal_audit.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md),\n out-of-sample univariate AUC of `lead_source` is ≈0.50–0.52 across\n all tiers and the per-channel rate spread is ≤0.05. The simulator\n does not encode channel-conditional probabilities; channel-conditional\n encoding is post-v1 work.\n- **Cohort-shift degradation is small.** v1 has no time-of-year drift\n baked in; the cohort-shift gate (G6.4) is informational and will\n bite in v2.\n\n## Composition\n\n- **Entities.** Accounts, contacts, leads, touches, sessions,\n sales_activities, opportunities (public); plus customers and\n subscriptions (instructor only). Per-row counts per bundle live in\n `manifest.json`.\n- **Features.** 32 public columns grouped by analytical role in\n [`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md);\n the per-bundle `feature_dictionary.csv` is the authoritative\n machine-readable spec.\n- **Label.** `converted_within_90_days` (boolean), event-derived from\n the simulator. Never sampled directly.\n- **Splits.** 70/15/15 train/valid/test, deterministic given seed;\n recorded in `tasks/converted_within_90_days/task_manifest.json`.\n- **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package\n version stamped in `manifest.json`.\n\n## Maintenance, adversarial framing, license\n\nWe *want* the dataset to be broken. Issue templates ship under\n`.github/ISSUE_TEMPLATE/` (Phase 6); the break-me guide lands as\n`docs/release/break_me_guide.md` (PR 6.3). Once Phase 6 ships,\n`docs/release/v2_decision_log.md` will track every accepted finding\nand the design call that came from it. File issues at\n[leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);\nPRs welcome.\n\n| Field | Value |\n|---|---|\n| Generator | leadforge `1.0.0+` |\n| Recipe | `b2b_saas_procurement_v1` |\n| Canonical seed | 42 (cross-seed sweep: 42–46) |\n| Bundle schema version | 5 |\n| Format | Parquet (canonical) + CSV (convenience) |\n| License | MIT — see [LICENSE](LICENSE) |\n\nVerify integrity with `leadforge validate `; every file\nis hashed in `manifest.json`.\n", + "expectedUpdateFrequency": "never", + "id": "leadforge/leadforge-lead-scoring-v1", + "image": "dataset-cover-image.png", + "isPrivate": true, + "keywords": [ + "b2b", + "classification", + "crm", + "education", + "lead-scoring", + "saas", + "synthetic-data", + "tabular" + ], + "licenses": [ + { + "name": "MIT" + } + ], + "resources": [ + { + "description": "Intro tier flat CSV (all splits concatenated, label retained, snapshot_day=30). The `split` column distinguishes train/valid/test rows.", + "path": "intro/lead_scoring.csv", + "schema": { + "fields": [ + { + "description": "Task-split membership: one of `train`, `valid`, `test`. Matches the per-row split assignment in `tasks/converted_within_90_days/`.", + "name": "split", + "type": "string" + }, + { + "description": "Opaque account identifier.", + "name": "account_id", + "type": "string" + }, + { + "description": "Industry vertical of the buying organization.", + "name": "industry", + "type": "string" + }, + { + "description": "Geographic region of the account's headquarters.", + "name": "region", + "type": "string" + }, + { + "description": "Banded employee headcount of the account.", + "name": "employee_band", + "type": "string" + }, + { + "description": "Banded estimated annual revenue of the account.", + "name": "estimated_revenue_band", + "type": "string" + }, + { + "description": "Banded internal process maturity score (latent).", + "name": "process_maturity_band", + "type": "string" + }, + { + "description": "Opaque contact identifier.", + "name": "contact_id", + "type": "string" + }, + { + "description": "Functional area of the primary contact (e.g. finance, ops).", + "name": "role_function", + "type": "string" + }, + { + "description": "Seniority band of the primary contact.", + "name": "seniority", + "type": "string" + }, + { + "description": "Buyer role classification (economic_buyer, champion, etc.).", + "name": "buyer_role", + "type": "string" + }, + { + "description": "Opaque lead identifier.", + "name": "lead_id", + "type": "string" + }, + { + "description": "ISO-8601 timestamp when the lead was created.", + "name": "lead_created_at", + "type": "string" + }, + { + "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).", + "name": "lead_source", + "type": "string" + }, + { + "description": "Marketing channel responsible for the first recorded touch.", + "name": "first_touch_channel", + "type": "string" + }, + { + "description": "Total number of marketing/sales touches recorded before snapshot.", + "name": "touch_count", + "type": "integer" + }, + { + "description": "Number of inbound touches before snapshot.", + "name": "inbound_touch_count", + "type": "integer" + }, + { + "description": "Number of outbound touches before snapshot.", + "name": "outbound_touch_count", + "type": "integer" + }, + { + "description": "Number of web/trial sessions recorded before snapshot.", + "name": "session_count", + "type": "integer" + }, + { + "description": "Cumulative pricing page views across all sessions before snapshot.", + "name": "pricing_page_views", + "type": "integer" + }, + { + "description": "Cumulative demo page views across all sessions before snapshot.", + "name": "demo_page_views", + "type": "integer" + }, + { + "description": "Sum of session durations (seconds) before snapshot.", + "name": "total_session_duration_seconds", + "type": "integer" + }, + { + "description": "Number of touches in the first 7 days after lead creation.", + "name": "touches_week_1", + "type": "integer" + }, + { + "description": "Number of touches in the last 7 days before snapshot cutoff.", + "name": "touches_last_7_days", + "type": "integer" + }, + { + "description": "Days between first touch and snapshot cutoff (NaN if no touches).", + "name": "days_since_first_touch", + "type": "number" + }, + { + "description": "Number of sales activities logged before snapshot.", + "name": "activity_count", + "type": "integer" + }, + { + "description": "Days elapsed between most recent touch and snapshot cutoff.", + "name": "days_since_last_touch", + "type": "number" + }, + { + "description": "Whether any opportunity was created by snapshot date (open or closed).", + "name": "opportunity_created", + "type": "boolean" + }, + { + "description": "Whether an open opportunity existed at snapshot date.", + "name": "has_open_opportunity", + "type": "boolean" + }, + { + "description": "Estimated ACV of the most recent open opportunity (NaN if none).", + "name": "opportunity_estimated_acv", + "type": "number" + }, + { + "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).", + "name": "expected_acv", + "type": "number" + }, + { + "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.", + "name": "total_touches_all", + "type": "integer" + }, + { + "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.", + "name": "converted_within_90_days", + "type": "boolean" + } + ] + } + }, + { + "description": "Intro tier feature dictionary (canonical column spec).", + "path": "intro/feature_dictionary.csv" + }, + { + "description": "Intro tier train split for `converted_within_90_days` (3,500 rows).", + "path": "intro/tasks/converted_within_90_days/train.parquet", + "schema": { + "fields": [ + { + "name": "account_id", + "type": "string" + }, + { + "name": "industry", + "type": "string" + }, + { + "name": "region", + "type": "string" + }, + { + "name": "employee_band", + "type": "string" + }, + { + "name": "estimated_revenue_band", + "type": "string" + }, + { + "name": "process_maturity_band", + "type": "string" + }, + { + "name": "contact_id", + "type": "string" + }, + { + "name": "role_function", + "type": "string" + }, + { + "name": "seniority", + "type": "string" + }, + { + "name": "buyer_role", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "lead_created_at", + "type": "string" + }, + { + "name": "lead_source", + "type": "string" + }, + { + "name": "first_touch_channel", + "type": "string" + }, + { + "name": "touch_count", + "type": "number" + }, + { + "name": "inbound_touch_count", + "type": "number" + }, + { + "name": "outbound_touch_count", + "type": "number" + }, + { + "name": "session_count", + "type": "number" + }, + { + "name": "pricing_page_views", + "type": "number" + }, + { + "name": "demo_page_views", + "type": "number" + }, + { + "name": "total_session_duration_seconds", + "type": "number" + }, + { + "name": "touches_week_1", + "type": "number" + }, + { + "name": "touches_last_7_days", + "type": "number" + }, + { + "name": "days_since_first_touch", + "type": "number" + }, + { + "name": "activity_count", + "type": "number" + }, + { + "name": "days_since_last_touch", + "type": "number" + }, + { + "name": "opportunity_created", + "type": "boolean" + }, + { + "name": "has_open_opportunity", + "type": "boolean" + }, + { + "name": "opportunity_estimated_acv", + "type": "number" + }, + { + "name": "expected_acv", + "type": "number" + }, + { + "name": "total_touches_all", + "type": "number" + }, + { + "name": "converted_within_90_days", + "type": "boolean" + } + ] + } + }, + { + "description": "Intro tier valid split for `converted_within_90_days` (750 rows).", + "path": "intro/tasks/converted_within_90_days/valid.parquet", + "schema": { + "fields": [ + { + "name": "account_id", + "type": "string" + }, + { + "name": "industry", + "type": "string" + }, + { + "name": "region", + "type": "string" + }, + { + "name": "employee_band", + "type": "string" + }, + { + "name": "estimated_revenue_band", + "type": "string" + }, + { + "name": "process_maturity_band", + "type": "string" + }, + { + "name": "contact_id", + "type": "string" + }, + { + "name": "role_function", + "type": "string" + }, + { + "name": "seniority", + "type": "string" + }, + { + "name": "buyer_role", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "lead_created_at", + "type": "string" + }, + { + "name": "lead_source", + "type": "string" + }, + { + "name": "first_touch_channel", + "type": "string" + }, + { + "name": "touch_count", + "type": "number" + }, + { + "name": "inbound_touch_count", + "type": "number" + }, + { + "name": "outbound_touch_count", + "type": "number" + }, + { + "name": "session_count", + "type": "number" + }, + { + "name": "pricing_page_views", + "type": "number" + }, + { + "name": "demo_page_views", + "type": "number" + }, + { + "name": "total_session_duration_seconds", + "type": "number" + }, + { + "name": "touches_week_1", + "type": "number" + }, + { + "name": "touches_last_7_days", + "type": "number" + }, + { + "name": "days_since_first_touch", + "type": "number" + }, + { + "name": "activity_count", + "type": "number" + }, + { + "name": "days_since_last_touch", + "type": "number" + }, + { + "name": "opportunity_created", + "type": "boolean" + }, + { + "name": "has_open_opportunity", + "type": "boolean" + }, + { + "name": "opportunity_estimated_acv", + "type": "number" + }, + { + "name": "expected_acv", + "type": "number" + }, + { + "name": "total_touches_all", + "type": "number" + }, + { + "name": "converted_within_90_days", + "type": "boolean" + } + ] + } + }, + { + "description": "Intro tier test split for `converted_within_90_days` (750 rows).", + "path": "intro/tasks/converted_within_90_days/test.parquet", + "schema": { + "fields": [ + { + "name": "account_id", + "type": "string" + }, + { + "name": "industry", + "type": "string" + }, + { + "name": "region", + "type": "string" + }, + { + "name": "employee_band", + "type": "string" + }, + { + "name": "estimated_revenue_band", + "type": "string" + }, + { + "name": "process_maturity_band", + "type": "string" + }, + { + "name": "contact_id", + "type": "string" + }, + { + "name": "role_function", + "type": "string" + }, + { + "name": "seniority", + "type": "string" + }, + { + "name": "buyer_role", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "lead_created_at", + "type": "string" + }, + { + "name": "lead_source", + "type": "string" + }, + { + "name": "first_touch_channel", + "type": "string" + }, + { + "name": "touch_count", + "type": "number" + }, + { + "name": "inbound_touch_count", + "type": "number" + }, + { + "name": "outbound_touch_count", + "type": "number" + }, + { + "name": "session_count", + "type": "number" + }, + { + "name": "pricing_page_views", + "type": "number" + }, + { + "name": "demo_page_views", + "type": "number" + }, + { + "name": "total_session_duration_seconds", + "type": "number" + }, + { + "name": "touches_week_1", + "type": "number" + }, + { + "name": "touches_last_7_days", + "type": "number" + }, + { + "name": "days_since_first_touch", + "type": "number" + }, + { + "name": "activity_count", + "type": "number" + }, + { + "name": "days_since_last_touch", + "type": "number" + }, + { + "name": "opportunity_created", + "type": "boolean" + }, + { + "name": "has_open_opportunity", + "type": "boolean" + }, + { + "name": "opportunity_estimated_acv", + "type": "number" + }, + { + "name": "expected_acv", + "type": "number" + }, + { + "name": "total_touches_all", + "type": "number" + }, + { + "name": "converted_within_90_days", + "type": "boolean" + } + ] + } + }, + { + "description": "Intro tier `accounts` relational table (1,500 rows) — snapshot-safe.", + "path": "intro/tables/accounts.parquet", + "schema": { + "fields": [ + { + "name": "account_id", + "type": "string" + }, + { + "name": "company_name", + "type": "string" + }, + { + "name": "industry", + "type": "string" + }, + { + "name": "region", + "type": "string" + }, + { + "name": "employee_band", + "type": "string" + }, + { + "name": "estimated_revenue_band", + "type": "string" + }, + { + "name": "process_maturity_band", + "type": "string" + }, + { + "name": "created_at", + "type": "string" + } + ] + } + }, + { + "description": "Intro tier `contacts` relational table (4,200 rows) — snapshot-safe.", + "path": "intro/tables/contacts.parquet", + "schema": { + "fields": [ + { + "name": "contact_id", + "type": "string" + }, + { + "name": "account_id", + "type": "string" + }, + { + "name": "job_title", + "type": "string" + }, + { + "name": "role_function", + "type": "string" + }, + { + "name": "seniority", + "type": "string" + }, + { + "name": "buyer_role", + "type": "string" + }, + { + "name": "email_domain_type", + "type": "string" + }, + { + "name": "created_at", + "type": "string" + } + ] + } + }, + { + "description": "Intro tier `leads` relational table (5,000 rows) — snapshot-safe.", + "path": "intro/tables/leads.parquet", + "schema": { + "fields": [ + { + "name": "lead_id", + "type": "string" + }, + { + "name": "contact_id", + "type": "string" + }, + { + "name": "account_id", + "type": "string" + }, + { + "name": "lead_created_at", + "type": "string" + }, + { + "name": "lead_source", + "type": "string" + }, + { + "name": "first_touch_channel", + "type": "string" + }, + { + "name": "owner_rep_id", + "type": "string" + } + ] + } + }, + { + "description": "Intro tier `touches` relational table (38,561 rows) — snapshot-safe.", + "path": "intro/tables/touches.parquet", + "schema": { + "fields": [ + { + "name": "touch_id", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "touch_timestamp", + "type": "string" + }, + { + "name": "touch_type", + "type": "string" + }, + { + "name": "touch_channel", + "type": "string" + }, + { + "name": "touch_direction", + "type": "string" + }, + { + "name": "campaign_id", + "type": "string" + } + ] + } + }, + { + "description": "Intro tier `sessions` relational table (10,171 rows) — snapshot-safe.", + "path": "intro/tables/sessions.parquet", + "schema": { + "fields": [ + { + "name": "session_id", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "session_timestamp", + "type": "string" + }, + { + "name": "session_type", + "type": "string" + }, + { + "name": "page_views", + "type": "integer" + }, + { + "name": "pricing_page_views", + "type": "integer" + }, + { + "name": "demo_page_views", + "type": "integer" + }, + { + "name": "session_duration_seconds", + "type": "integer" + } + ] + } + }, + { + "description": "Intro tier `sales_activities` relational table (21,358 rows) — snapshot-safe.", + "path": "intro/tables/sales_activities.parquet", + "schema": { + "fields": [ + { + "name": "activity_id", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "rep_id", + "type": "string" + }, + { + "name": "activity_timestamp", + "type": "string" + }, + { + "name": "activity_type", + "type": "string" + }, + { + "name": "activity_outcome", + "type": "string" + } + ] + } + }, + { + "description": "Intro tier `opportunities` relational table (4,426 rows) — snapshot-safe.", + "path": "intro/tables/opportunities.parquet", + "schema": { + "fields": [ + { + "name": "opportunity_id", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "created_at", + "type": "string" + }, + { + "name": "stage", + "type": "string" + }, + { + "name": "estimated_acv", + "type": "integer" + } + ] + } + }, + { + "description": "Intro tier auto-rendered dataset card.", + "path": "intro/dataset_card.md" + }, + { + "description": "Intro tier provenance manifest (recipe, seed, package version, file hashes, snapshot_day, redaction contract).", + "path": "intro/manifest.json" + }, + { + "description": "Intermediate tier flat CSV (all splits concatenated, label retained, snapshot_day=30). The `split` column distinguishes train/valid/test rows.", + "path": "intermediate/lead_scoring.csv", + "schema": { + "fields": [ + { + "description": "Task-split membership: one of `train`, `valid`, `test`. Matches the per-row split assignment in `tasks/converted_within_90_days/`.", + "name": "split", + "type": "string" + }, + { + "description": "Opaque account identifier.", + "name": "account_id", + "type": "string" + }, + { + "description": "Industry vertical of the buying organization.", + "name": "industry", + "type": "string" + }, + { + "description": "Geographic region of the account's headquarters.", + "name": "region", + "type": "string" + }, + { + "description": "Banded employee headcount of the account.", + "name": "employee_band", + "type": "string" + }, + { + "description": "Banded estimated annual revenue of the account.", + "name": "estimated_revenue_band", + "type": "string" + }, + { + "description": "Banded internal process maturity score (latent).", + "name": "process_maturity_band", + "type": "string" + }, + { + "description": "Opaque contact identifier.", + "name": "contact_id", + "type": "string" + }, + { + "description": "Functional area of the primary contact (e.g. finance, ops).", + "name": "role_function", + "type": "string" + }, + { + "description": "Seniority band of the primary contact.", + "name": "seniority", + "type": "string" + }, + { + "description": "Buyer role classification (economic_buyer, champion, etc.).", + "name": "buyer_role", + "type": "string" + }, + { + "description": "Opaque lead identifier.", + "name": "lead_id", + "type": "string" + }, + { + "description": "ISO-8601 timestamp when the lead was created.", + "name": "lead_created_at", + "type": "string" + }, + { + "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).", + "name": "lead_source", + "type": "string" + }, + { + "description": "Marketing channel responsible for the first recorded touch.", + "name": "first_touch_channel", + "type": "string" + }, + { + "description": "Total number of marketing/sales touches recorded before snapshot.", + "name": "touch_count", + "type": "integer" + }, + { + "description": "Number of inbound touches before snapshot.", + "name": "inbound_touch_count", + "type": "integer" + }, + { + "description": "Number of outbound touches before snapshot.", + "name": "outbound_touch_count", + "type": "integer" + }, + { + "description": "Number of web/trial sessions recorded before snapshot.", + "name": "session_count", + "type": "integer" + }, + { + "description": "Cumulative pricing page views across all sessions before snapshot.", + "name": "pricing_page_views", + "type": "integer" + }, + { + "description": "Cumulative demo page views across all sessions before snapshot.", + "name": "demo_page_views", + "type": "integer" + }, + { + "description": "Sum of session durations (seconds) before snapshot.", + "name": "total_session_duration_seconds", + "type": "integer" + }, + { + "description": "Number of touches in the first 7 days after lead creation.", + "name": "touches_week_1", + "type": "integer" + }, + { + "description": "Number of touches in the last 7 days before snapshot cutoff.", + "name": "touches_last_7_days", + "type": "integer" + }, + { + "description": "Days between first touch and snapshot cutoff (NaN if no touches).", + "name": "days_since_first_touch", + "type": "number" + }, + { + "description": "Number of sales activities logged before snapshot.", + "name": "activity_count", + "type": "integer" + }, + { + "description": "Days elapsed between most recent touch and snapshot cutoff.", + "name": "days_since_last_touch", + "type": "number" + }, + { + "description": "Whether any opportunity was created by snapshot date (open or closed).", + "name": "opportunity_created", + "type": "boolean" + }, + { + "description": "Whether an open opportunity existed at snapshot date.", + "name": "has_open_opportunity", + "type": "boolean" + }, + { + "description": "Estimated ACV of the most recent open opportunity (NaN if none).", + "name": "opportunity_estimated_acv", + "type": "number" + }, + { + "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).", + "name": "expected_acv", + "type": "number" + }, + { + "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.", + "name": "total_touches_all", + "type": "integer" + }, + { + "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.", + "name": "converted_within_90_days", + "type": "boolean" + } + ] + } + }, + { + "description": "Intermediate tier feature dictionary (canonical column spec).", + "path": "intermediate/feature_dictionary.csv" + }, + { + "description": "Intermediate tier train split for `converted_within_90_days` (3,500 rows).", + "path": "intermediate/tasks/converted_within_90_days/train.parquet", + "schema": { + "fields": [ + { + "name": "account_id", + "type": "string" + }, + { + "name": "industry", + "type": "string" + }, + { + "name": "region", + "type": "string" + }, + { + "name": "employee_band", + "type": "string" + }, + { + "name": "estimated_revenue_band", + "type": "string" + }, + { + "name": "process_maturity_band", + "type": "string" + }, + { + "name": "contact_id", + "type": "string" + }, + { + "name": "role_function", + "type": "string" + }, + { + "name": "seniority", + "type": "string" + }, + { + "name": "buyer_role", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "lead_created_at", + "type": "string" + }, + { + "name": "lead_source", + "type": "string" + }, + { + "name": "first_touch_channel", + "type": "string" + }, + { + "name": "touch_count", + "type": "number" + }, + { + "name": "inbound_touch_count", + "type": "number" + }, + { + "name": "outbound_touch_count", + "type": "number" + }, + { + "name": "session_count", + "type": "number" + }, + { + "name": "pricing_page_views", + "type": "number" + }, + { + "name": "demo_page_views", + "type": "number" + }, + { + "name": "total_session_duration_seconds", + "type": "number" + }, + { + "name": "touches_week_1", + "type": "number" + }, + { + "name": "touches_last_7_days", + "type": "number" + }, + { + "name": "days_since_first_touch", + "type": "number" + }, + { + "name": "activity_count", + "type": "number" + }, + { + "name": "days_since_last_touch", + "type": "number" + }, + { + "name": "opportunity_created", + "type": "boolean" + }, + { + "name": "has_open_opportunity", + "type": "boolean" + }, + { + "name": "opportunity_estimated_acv", + "type": "number" + }, + { + "name": "expected_acv", + "type": "number" + }, + { + "name": "total_touches_all", + "type": "number" + }, + { + "name": "converted_within_90_days", + "type": "boolean" + } + ] + } + }, + { + "description": "Intermediate tier valid split for `converted_within_90_days` (750 rows).", + "path": "intermediate/tasks/converted_within_90_days/valid.parquet", + "schema": { + "fields": [ + { + "name": "account_id", + "type": "string" + }, + { + "name": "industry", + "type": "string" + }, + { + "name": "region", + "type": "string" + }, + { + "name": "employee_band", + "type": "string" + }, + { + "name": "estimated_revenue_band", + "type": "string" + }, + { + "name": "process_maturity_band", + "type": "string" + }, + { + "name": "contact_id", + "type": "string" + }, + { + "name": "role_function", + "type": "string" + }, + { + "name": "seniority", + "type": "string" + }, + { + "name": "buyer_role", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "lead_created_at", + "type": "string" + }, + { + "name": "lead_source", + "type": "string" + }, + { + "name": "first_touch_channel", + "type": "string" + }, + { + "name": "touch_count", + "type": "number" + }, + { + "name": "inbound_touch_count", + "type": "number" + }, + { + "name": "outbound_touch_count", + "type": "number" + }, + { + "name": "session_count", + "type": "number" + }, + { + "name": "pricing_page_views", + "type": "number" + }, + { + "name": "demo_page_views", + "type": "number" + }, + { + "name": "total_session_duration_seconds", + "type": "number" + }, + { + "name": "touches_week_1", + "type": "number" + }, + { + "name": "touches_last_7_days", + "type": "number" + }, + { + "name": "days_since_first_touch", + "type": "number" + }, + { + "name": "activity_count", + "type": "number" + }, + { + "name": "days_since_last_touch", + "type": "number" + }, + { + "name": "opportunity_created", + "type": "boolean" + }, + { + "name": "has_open_opportunity", + "type": "boolean" + }, + { + "name": "opportunity_estimated_acv", + "type": "number" + }, + { + "name": "expected_acv", + "type": "number" + }, + { + "name": "total_touches_all", + "type": "number" + }, + { + "name": "converted_within_90_days", + "type": "boolean" + } + ] + } + }, + { + "description": "Intermediate tier test split for `converted_within_90_days` (750 rows).", + "path": "intermediate/tasks/converted_within_90_days/test.parquet", + "schema": { + "fields": [ + { + "name": "account_id", + "type": "string" + }, + { + "name": "industry", + "type": "string" + }, + { + "name": "region", + "type": "string" + }, + { + "name": "employee_band", + "type": "string" + }, + { + "name": "estimated_revenue_band", + "type": "string" + }, + { + "name": "process_maturity_band", + "type": "string" + }, + { + "name": "contact_id", + "type": "string" + }, + { + "name": "role_function", + "type": "string" + }, + { + "name": "seniority", + "type": "string" + }, + { + "name": "buyer_role", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "lead_created_at", + "type": "string" + }, + { + "name": "lead_source", + "type": "string" + }, + { + "name": "first_touch_channel", + "type": "string" + }, + { + "name": "touch_count", + "type": "number" + }, + { + "name": "inbound_touch_count", + "type": "number" + }, + { + "name": "outbound_touch_count", + "type": "number" + }, + { + "name": "session_count", + "type": "number" + }, + { + "name": "pricing_page_views", + "type": "number" + }, + { + "name": "demo_page_views", + "type": "number" + }, + { + "name": "total_session_duration_seconds", + "type": "number" + }, + { + "name": "touches_week_1", + "type": "number" + }, + { + "name": "touches_last_7_days", + "type": "number" + }, + { + "name": "days_since_first_touch", + "type": "number" + }, + { + "name": "activity_count", + "type": "number" + }, + { + "name": "days_since_last_touch", + "type": "number" + }, + { + "name": "opportunity_created", + "type": "boolean" + }, + { + "name": "has_open_opportunity", + "type": "boolean" + }, + { + "name": "opportunity_estimated_acv", + "type": "number" + }, + { + "name": "expected_acv", + "type": "number" + }, + { + "name": "total_touches_all", + "type": "number" + }, + { + "name": "converted_within_90_days", + "type": "boolean" + } + ] + } + }, + { + "description": "Intermediate tier `accounts` relational table (1,500 rows) — snapshot-safe.", + "path": "intermediate/tables/accounts.parquet", + "schema": { + "fields": [ + { + "name": "account_id", + "type": "string" + }, + { + "name": "company_name", + "type": "string" + }, + { + "name": "industry", + "type": "string" + }, + { + "name": "region", + "type": "string" + }, + { + "name": "employee_band", + "type": "string" + }, + { + "name": "estimated_revenue_band", + "type": "string" + }, + { + "name": "process_maturity_band", + "type": "string" + }, + { + "name": "created_at", + "type": "string" + } + ] + } + }, + { + "description": "Intermediate tier `contacts` relational table (4,200 rows) — snapshot-safe.", + "path": "intermediate/tables/contacts.parquet", + "schema": { + "fields": [ + { + "name": "contact_id", + "type": "string" + }, + { + "name": "account_id", + "type": "string" + }, + { + "name": "job_title", + "type": "string" + }, + { + "name": "role_function", + "type": "string" + }, + { + "name": "seniority", + "type": "string" + }, + { + "name": "buyer_role", + "type": "string" + }, + { + "name": "email_domain_type", + "type": "string" + }, + { + "name": "created_at", + "type": "string" + } + ] + } + }, + { + "description": "Intermediate tier `leads` relational table (5,000 rows) — snapshot-safe.", + "path": "intermediate/tables/leads.parquet", + "schema": { + "fields": [ + { + "name": "lead_id", + "type": "string" + }, + { + "name": "contact_id", + "type": "string" + }, + { + "name": "account_id", + "type": "string" + }, + { + "name": "lead_created_at", + "type": "string" + }, + { + "name": "lead_source", + "type": "string" + }, + { + "name": "first_touch_channel", + "type": "string" + }, + { + "name": "owner_rep_id", + "type": "string" + } + ] + } + }, + { + "description": "Intermediate tier `touches` relational table (38,724 rows) — snapshot-safe.", + "path": "intermediate/tables/touches.parquet", + "schema": { + "fields": [ + { + "name": "touch_id", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "touch_timestamp", + "type": "string" + }, + { + "name": "touch_type", + "type": "string" + }, + { + "name": "touch_channel", + "type": "string" + }, + { + "name": "touch_direction", + "type": "string" + }, + { + "name": "campaign_id", + "type": "string" + } + ] + } + }, + { + "description": "Intermediate tier `sessions` relational table (10,012 rows) — snapshot-safe.", + "path": "intermediate/tables/sessions.parquet", + "schema": { + "fields": [ + { + "name": "session_id", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "session_timestamp", + "type": "string" + }, + { + "name": "session_type", + "type": "string" + }, + { + "name": "page_views", + "type": "integer" + }, + { + "name": "pricing_page_views", + "type": "integer" + }, + { + "name": "demo_page_views", + "type": "integer" + }, + { + "name": "session_duration_seconds", + "type": "integer" + } + ] + } + }, + { + "description": "Intermediate tier `sales_activities` relational table (20,679 rows) — snapshot-safe.", + "path": "intermediate/tables/sales_activities.parquet", + "schema": { + "fields": [ + { + "name": "activity_id", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "rep_id", + "type": "string" + }, + { + "name": "activity_timestamp", + "type": "string" + }, + { + "name": "activity_type", + "type": "string" + }, + { + "name": "activity_outcome", + "type": "string" + } + ] + } + }, + { + "description": "Intermediate tier `opportunities` relational table (4,255 rows) — snapshot-safe.", + "path": "intermediate/tables/opportunities.parquet", + "schema": { + "fields": [ + { + "name": "opportunity_id", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "created_at", + "type": "string" + }, + { + "name": "stage", + "type": "string" + }, + { + "name": "estimated_acv", + "type": "integer" + } + ] + } + }, + { + "description": "Intermediate tier auto-rendered dataset card.", + "path": "intermediate/dataset_card.md" + }, + { + "description": "Intermediate tier provenance manifest (recipe, seed, package version, file hashes, snapshot_day, redaction contract).", + "path": "intermediate/manifest.json" + }, + { + "description": "Advanced tier flat CSV (all splits concatenated, label retained, snapshot_day=30). The `split` column distinguishes train/valid/test rows.", + "path": "advanced/lead_scoring.csv", + "schema": { + "fields": [ + { + "description": "Task-split membership: one of `train`, `valid`, `test`. Matches the per-row split assignment in `tasks/converted_within_90_days/`.", + "name": "split", + "type": "string" + }, + { + "description": "Opaque account identifier.", + "name": "account_id", + "type": "string" + }, + { + "description": "Industry vertical of the buying organization.", + "name": "industry", + "type": "string" + }, + { + "description": "Geographic region of the account's headquarters.", + "name": "region", + "type": "string" + }, + { + "description": "Banded employee headcount of the account.", + "name": "employee_band", + "type": "string" + }, + { + "description": "Banded estimated annual revenue of the account.", + "name": "estimated_revenue_band", + "type": "string" + }, + { + "description": "Banded internal process maturity score (latent).", + "name": "process_maturity_band", + "type": "string" + }, + { + "description": "Opaque contact identifier.", + "name": "contact_id", + "type": "string" + }, + { + "description": "Functional area of the primary contact (e.g. finance, ops).", + "name": "role_function", + "type": "string" + }, + { + "description": "Seniority band of the primary contact.", + "name": "seniority", + "type": "string" + }, + { + "description": "Buyer role classification (economic_buyer, champion, etc.).", + "name": "buyer_role", + "type": "string" + }, + { + "description": "Opaque lead identifier.", + "name": "lead_id", + "type": "string" + }, + { + "description": "ISO-8601 timestamp when the lead was created.", + "name": "lead_created_at", + "type": "string" + }, + { + "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).", + "name": "lead_source", + "type": "string" + }, + { + "description": "Marketing channel responsible for the first recorded touch.", + "name": "first_touch_channel", + "type": "string" + }, + { + "description": "Total number of marketing/sales touches recorded before snapshot.", + "name": "touch_count", + "type": "integer" + }, + { + "description": "Number of inbound touches before snapshot.", + "name": "inbound_touch_count", + "type": "integer" + }, + { + "description": "Number of outbound touches before snapshot.", + "name": "outbound_touch_count", + "type": "integer" + }, + { + "description": "Number of web/trial sessions recorded before snapshot.", + "name": "session_count", + "type": "integer" + }, + { + "description": "Cumulative pricing page views across all sessions before snapshot.", + "name": "pricing_page_views", + "type": "integer" + }, + { + "description": "Cumulative demo page views across all sessions before snapshot.", + "name": "demo_page_views", + "type": "integer" + }, + { + "description": "Sum of session durations (seconds) before snapshot.", + "name": "total_session_duration_seconds", + "type": "integer" + }, + { + "description": "Number of touches in the first 7 days after lead creation.", + "name": "touches_week_1", + "type": "integer" + }, + { + "description": "Number of touches in the last 7 days before snapshot cutoff.", + "name": "touches_last_7_days", + "type": "integer" + }, + { + "description": "Days between first touch and snapshot cutoff (NaN if no touches).", + "name": "days_since_first_touch", + "type": "number" + }, + { + "description": "Number of sales activities logged before snapshot.", + "name": "activity_count", + "type": "integer" + }, + { + "description": "Days elapsed between most recent touch and snapshot cutoff.", + "name": "days_since_last_touch", + "type": "number" + }, + { + "description": "Whether any opportunity was created by snapshot date (open or closed).", + "name": "opportunity_created", + "type": "boolean" + }, + { + "description": "Whether an open opportunity existed at snapshot date.", + "name": "has_open_opportunity", + "type": "boolean" + }, + { + "description": "Estimated ACV of the most recent open opportunity (NaN if none).", + "name": "opportunity_estimated_acv", + "type": "number" + }, + { + "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).", + "name": "expected_acv", + "type": "number" + }, + { + "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.", + "name": "total_touches_all", + "type": "integer" + }, + { + "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.", + "name": "converted_within_90_days", + "type": "boolean" + } + ] + } + }, + { + "description": "Advanced tier feature dictionary (canonical column spec).", + "path": "advanced/feature_dictionary.csv" + }, + { + "description": "Advanced tier train split for `converted_within_90_days` (3,500 rows).", + "path": "advanced/tasks/converted_within_90_days/train.parquet", + "schema": { + "fields": [ + { + "name": "account_id", + "type": "string" + }, + { + "name": "industry", + "type": "string" + }, + { + "name": "region", + "type": "string" + }, + { + "name": "employee_band", + "type": "string" + }, + { + "name": "estimated_revenue_band", + "type": "string" + }, + { + "name": "process_maturity_band", + "type": "string" + }, + { + "name": "contact_id", + "type": "string" + }, + { + "name": "role_function", + "type": "string" + }, + { + "name": "seniority", + "type": "string" + }, + { + "name": "buyer_role", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "lead_created_at", + "type": "string" + }, + { + "name": "lead_source", + "type": "string" + }, + { + "name": "first_touch_channel", + "type": "string" + }, + { + "name": "touch_count", + "type": "number" + }, + { + "name": "inbound_touch_count", + "type": "number" + }, + { + "name": "outbound_touch_count", + "type": "number" + }, + { + "name": "session_count", + "type": "number" + }, + { + "name": "pricing_page_views", + "type": "number" + }, + { + "name": "demo_page_views", + "type": "number" + }, + { + "name": "total_session_duration_seconds", + "type": "number" + }, + { + "name": "touches_week_1", + "type": "number" + }, + { + "name": "touches_last_7_days", + "type": "number" + }, + { + "name": "days_since_first_touch", + "type": "number" + }, + { + "name": "activity_count", + "type": "number" + }, + { + "name": "days_since_last_touch", + "type": "number" + }, + { + "name": "opportunity_created", + "type": "boolean" + }, + { + "name": "has_open_opportunity", + "type": "boolean" + }, + { + "name": "opportunity_estimated_acv", + "type": "number" + }, + { + "name": "expected_acv", + "type": "number" + }, + { + "name": "total_touches_all", + "type": "number" + }, + { + "name": "converted_within_90_days", + "type": "boolean" + } + ] + } + }, + { + "description": "Advanced tier valid split for `converted_within_90_days` (750 rows).", + "path": "advanced/tasks/converted_within_90_days/valid.parquet", + "schema": { + "fields": [ + { + "name": "account_id", + "type": "string" + }, + { + "name": "industry", + "type": "string" + }, + { + "name": "region", + "type": "string" + }, + { + "name": "employee_band", + "type": "string" + }, + { + "name": "estimated_revenue_band", + "type": "string" + }, + { + "name": "process_maturity_band", + "type": "string" + }, + { + "name": "contact_id", + "type": "string" + }, + { + "name": "role_function", + "type": "string" + }, + { + "name": "seniority", + "type": "string" + }, + { + "name": "buyer_role", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "lead_created_at", + "type": "string" + }, + { + "name": "lead_source", + "type": "string" + }, + { + "name": "first_touch_channel", + "type": "string" + }, + { + "name": "touch_count", + "type": "number" + }, + { + "name": "inbound_touch_count", + "type": "number" + }, + { + "name": "outbound_touch_count", + "type": "number" + }, + { + "name": "session_count", + "type": "number" + }, + { + "name": "pricing_page_views", + "type": "number" + }, + { + "name": "demo_page_views", + "type": "number" + }, + { + "name": "total_session_duration_seconds", + "type": "number" + }, + { + "name": "touches_week_1", + "type": "number" + }, + { + "name": "touches_last_7_days", + "type": "number" + }, + { + "name": "days_since_first_touch", + "type": "number" + }, + { + "name": "activity_count", + "type": "number" + }, + { + "name": "days_since_last_touch", + "type": "number" + }, + { + "name": "opportunity_created", + "type": "boolean" + }, + { + "name": "has_open_opportunity", + "type": "boolean" + }, + { + "name": "opportunity_estimated_acv", + "type": "number" + }, + { + "name": "expected_acv", + "type": "number" + }, + { + "name": "total_touches_all", + "type": "number" + }, + { + "name": "converted_within_90_days", + "type": "boolean" + } + ] + } + }, + { + "description": "Advanced tier test split for `converted_within_90_days` (750 rows).", + "path": "advanced/tasks/converted_within_90_days/test.parquet", + "schema": { + "fields": [ + { + "name": "account_id", + "type": "string" + }, + { + "name": "industry", + "type": "string" + }, + { + "name": "region", + "type": "string" + }, + { + "name": "employee_band", + "type": "string" + }, + { + "name": "estimated_revenue_band", + "type": "string" + }, + { + "name": "process_maturity_band", + "type": "string" + }, + { + "name": "contact_id", + "type": "string" + }, + { + "name": "role_function", + "type": "string" + }, + { + "name": "seniority", + "type": "string" + }, + { + "name": "buyer_role", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "lead_created_at", + "type": "string" + }, + { + "name": "lead_source", + "type": "string" + }, + { + "name": "first_touch_channel", + "type": "string" + }, + { + "name": "touch_count", + "type": "number" + }, + { + "name": "inbound_touch_count", + "type": "number" + }, + { + "name": "outbound_touch_count", + "type": "number" + }, + { + "name": "session_count", + "type": "number" + }, + { + "name": "pricing_page_views", + "type": "number" + }, + { + "name": "demo_page_views", + "type": "number" + }, + { + "name": "total_session_duration_seconds", + "type": "number" + }, + { + "name": "touches_week_1", + "type": "number" + }, + { + "name": "touches_last_7_days", + "type": "number" + }, + { + "name": "days_since_first_touch", + "type": "number" + }, + { + "name": "activity_count", + "type": "number" + }, + { + "name": "days_since_last_touch", + "type": "number" + }, + { + "name": "opportunity_created", + "type": "boolean" + }, + { + "name": "has_open_opportunity", + "type": "boolean" + }, + { + "name": "opportunity_estimated_acv", + "type": "number" + }, + { + "name": "expected_acv", + "type": "number" + }, + { + "name": "total_touches_all", + "type": "number" + }, + { + "name": "converted_within_90_days", + "type": "boolean" + } + ] + } + }, + { + "description": "Advanced tier `accounts` relational table (1,500 rows) — snapshot-safe.", + "path": "advanced/tables/accounts.parquet", + "schema": { + "fields": [ + { + "name": "account_id", + "type": "string" + }, + { + "name": "company_name", + "type": "string" + }, + { + "name": "industry", + "type": "string" + }, + { + "name": "region", + "type": "string" + }, + { + "name": "employee_band", + "type": "string" + }, + { + "name": "estimated_revenue_band", + "type": "string" + }, + { + "name": "process_maturity_band", + "type": "string" + }, + { + "name": "created_at", + "type": "string" + } + ] + } + }, + { + "description": "Advanced tier `contacts` relational table (4,200 rows) — snapshot-safe.", + "path": "advanced/tables/contacts.parquet", + "schema": { + "fields": [ + { + "name": "contact_id", + "type": "string" + }, + { + "name": "account_id", + "type": "string" + }, + { + "name": "job_title", + "type": "string" + }, + { + "name": "role_function", + "type": "string" + }, + { + "name": "seniority", + "type": "string" + }, + { + "name": "buyer_role", + "type": "string" + }, + { + "name": "email_domain_type", + "type": "string" + }, + { + "name": "created_at", + "type": "string" + } + ] + } + }, + { + "description": "Advanced tier `leads` relational table (5,000 rows) — snapshot-safe.", + "path": "advanced/tables/leads.parquet", + "schema": { + "fields": [ + { + "name": "lead_id", + "type": "string" + }, + { + "name": "contact_id", + "type": "string" + }, + { + "name": "account_id", + "type": "string" + }, + { + "name": "lead_created_at", + "type": "string" + }, + { + "name": "lead_source", + "type": "string" + }, + { + "name": "first_touch_channel", + "type": "string" + }, + { + "name": "owner_rep_id", + "type": "string" + } + ] + } + }, + { + "description": "Advanced tier `touches` relational table (38,208 rows) — snapshot-safe.", + "path": "advanced/tables/touches.parquet", + "schema": { + "fields": [ + { + "name": "touch_id", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "touch_timestamp", + "type": "string" + }, + { + "name": "touch_type", + "type": "string" + }, + { + "name": "touch_channel", + "type": "string" + }, + { + "name": "touch_direction", + "type": "string" + }, + { + "name": "campaign_id", + "type": "string" + } + ] + } + }, + { + "description": "Advanced tier `sessions` relational table (9,942 rows) — snapshot-safe.", + "path": "advanced/tables/sessions.parquet", + "schema": { + "fields": [ + { + "name": "session_id", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "session_timestamp", + "type": "string" + }, + { + "name": "session_type", + "type": "string" + }, + { + "name": "page_views", + "type": "integer" + }, + { + "name": "pricing_page_views", + "type": "integer" + }, + { + "name": "demo_page_views", + "type": "integer" + }, + { + "name": "session_duration_seconds", + "type": "integer" + } + ] + } + }, + { + "description": "Advanced tier `sales_activities` relational table (19,995 rows) — snapshot-safe.", + "path": "advanced/tables/sales_activities.parquet", + "schema": { + "fields": [ + { + "name": "activity_id", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "rep_id", + "type": "string" + }, + { + "name": "activity_timestamp", + "type": "string" + }, + { + "name": "activity_type", + "type": "string" + }, + { + "name": "activity_outcome", + "type": "string" + } + ] + } + }, + { + "description": "Advanced tier `opportunities` relational table (4,004 rows) — snapshot-safe.", + "path": "advanced/tables/opportunities.parquet", + "schema": { + "fields": [ + { + "name": "opportunity_id", + "type": "string" + }, + { + "name": "lead_id", + "type": "string" + }, + { + "name": "created_at", + "type": "string" + }, + { + "name": "stage", + "type": "string" + }, + { + "name": "estimated_acv", + "type": "integer" + } + ] + } + }, + { + "description": "Advanced tier auto-rendered dataset card.", + "path": "advanced/dataset_card.md" + }, + { + "description": "Advanced tier provenance manifest (recipe, seed, package version, file hashes, snapshot_day, redaction contract).", + "path": "advanced/manifest.json" + } + ], + "subtitle": "Three-tier synthetic CRM funnel for leakage-aware lead scoring", + "title": "LeadForge: Synthetic B2B Lead Scoring (v1)", + "userSpecifiedSources": [ + { + "title": "leadforge source repository", + "url": "https://github.com/leadforge-dev/leadforge" + }, + { + "title": "v1 release validation report", + "url": "https://github.com/leadforge-dev/leadforge/tree/main/release/validation" + } + ] +} diff --git a/scripts/generate_cover_image.py b/scripts/generate_cover_image.py new file mode 100644 index 0000000..5cff167 --- /dev/null +++ b/scripts/generate_cover_image.py @@ -0,0 +1,245 @@ +#!/usr/bin/env python3 +"""Generate the Kaggle cover image for ``leadforge-lead-scoring-v1``. + +The cover image is rendered programmatically rather than hand-designed +or licensed so that: + +* re-running this script on the same machine produces byte-identical + output, guarded by ``test_render_cover_is_byte_deterministic`` — + enough for local regression detection; +* the source-of-truth for what the image *says* sits in version + control, not in a designer's file or a stock-photo licence; +* there is no licensing question. + +**Cross-platform byte equality is NOT guaranteed.** The committed +``release/dataset-cover-image.png`` was rendered on whichever machine +last ran this script; Pillow + FreeType produce slightly different +glyph rasterisation between macOS and Linux (different FreeType +versions, different font-hinting tables). The committed PNG is +therefore one valid render — checked into git so a fresh clone has a +usable cover image without first running this script — not a +hash-locked artefact. Tests assert dimensions and per-machine +determinism, not committed-vs-fresh byte equality. + +Output: ``release/dataset-cover-image.png`` at 1280 × 640 px (2:1 +aspect, well above Kaggle's 560 × 280 minimum, with a 1:1 thumbnail +crop centred on the headline). Pillow ships with matplotlib (already a +dev / scripts extra), so this script does not require any new +dependency. + +Headline metrics — conversion rates and LR AUC values — are pinned +literals sourced from the cross-seed medians (seeds 42–46) reported in +``release/validation/validation_report.md``. They are not recomputed +at render time: the cover image is intentionally a documentation-grade +artefact that lags by one validation cycle, not a live metric panel. +""" + +from __future__ import annotations + +import argparse +import sys +from collections.abc import Sequence +from dataclasses import dataclass +from pathlib import Path +from typing import Final + +import matplotlib.font_manager as fm +from PIL import Image, ImageDraw, ImageFont + +# --------------------------------------------------------------------------- +# Layout constants (pixels) +# --------------------------------------------------------------------------- + +CANVAS_WIDTH: Final[int] = 1280 +CANVAS_HEIGHT: Final[int] = 640 +LEFT_MARGIN: Final[int] = 80 + +#: Background — deep navy. +BACKGROUND: Final[tuple[int, int, int]] = (13, 27, 42) +#: Card background — slightly lighter navy. +CARD_BACKGROUND: Final[tuple[int, int, int]] = (27, 38, 59) +#: Primary text colour — pure white. +TEXT_PRIMARY: Final[tuple[int, int, int]] = (255, 255, 255) +#: Secondary text colour — pale steel. +TEXT_SECONDARY: Final[tuple[int, int, int]] = (200, 220, 240) + +DEFAULT_OUT_PATH: Final[Path] = Path("release/dataset-cover-image.png") + + +@dataclass(frozen=True) +class TierBadge: + """Per-tier headline shown on the cover.""" + + name: str + conversion_rate_pct: str + lr_auc: float + accent: tuple[int, int, int] + + +#: Cross-seed medians (seeds 42-46) from +#: ``release/validation/validation_report.md`` — pinned literals so the +#: cover image is reproducible without reading the report at render +#: time. +TIER_BADGES: Final[tuple[TierBadge, ...]] = ( + TierBadge("Intro", "42.7%", 0.879, (76, 175, 80)), + TierBadge("Intermediate", "21.6%", 0.886, (255, 152, 0)), + TierBadge("Advanced", "8.4%", 0.886, (244, 67, 54)), +) + + +# --------------------------------------------------------------------------- +# Font loading +# --------------------------------------------------------------------------- + + +def _find_font(family: str, *, weight: str = "normal") -> Path: + """Locate a font file via matplotlib's font manager. + + matplotlib bundles DejaVu Sans, so this resolves to a stable file + path in any environment where matplotlib is installed (the + ``[scripts]`` and ``[dev]`` extras both pull it in). The same + byte content of the font file → identical glyph rasters → + byte-identical PNG output. + """ + + prop = fm.FontProperties(family=family, weight=weight) + return Path(fm.findfont(prop, fallback_to_default=False)) + + +# --------------------------------------------------------------------------- +# Drawing +# --------------------------------------------------------------------------- + + +def _draw_title_block(draw: ImageDraw.ImageDraw, font_paths: dict[str, Path]) -> None: + """Render the title, tagline, and subtitle text block.""" + + title_font = ImageFont.truetype(str(font_paths["bold"]), 96) + draw.text((LEFT_MARGIN, 88), "LeadForge", font=title_font, fill=TEXT_PRIMARY) + + tagline_font = ImageFont.truetype(str(font_paths["regular"]), 40) + draw.text( + (LEFT_MARGIN, 208), + "Synthetic B2B Lead Scoring · v1", + font=tagline_font, + fill=TEXT_SECONDARY, + ) + + subtitle_font = ImageFont.truetype(str(font_paths["regular"]), 24) + draw.text( + (LEFT_MARGIN, 280), + "5,000 leads · 3 difficulty tiers · 90-day conversion · MIT", + font=subtitle_font, + fill=TEXT_SECONDARY, + ) + + +def _draw_tier_card( + draw: ImageDraw.ImageDraw, + *, + badge: TierBadge, + box: tuple[int, int, int, int], + font_paths: dict[str, Path], +) -> None: + """Render one tier card inside ``box`` (left, top, right, bottom).""" + + left, top, right, bottom = box + draw.rectangle((left, top, right, bottom), fill=CARD_BACKGROUND) + # Coloured accent stripe down the left edge. + draw.rectangle((left, top, left + 8, bottom), fill=badge.accent) + + name_font = ImageFont.truetype(str(font_paths["bold"]), 36) + draw.text((left + 32, top + 24), badge.name, font=name_font, fill=TEXT_PRIMARY) + + body_font = ImageFont.truetype(str(font_paths["regular"]), 22) + draw.text( + (left + 32, top + 80), + f"Conversion: {badge.conversion_rate_pct}", + font=body_font, + fill=TEXT_SECONDARY, + ) + draw.text( + (left + 32, top + 116), + f"LR AUC: {badge.lr_auc:.3f}", + font=body_font, + fill=TEXT_SECONDARY, + ) + + +def render_cover(badges: Sequence[TierBadge] = TIER_BADGES) -> Image.Image: + """Render the cover image as a fresh ``PIL.Image`` instance.""" + + image = Image.new("RGB", (CANVAS_WIDTH, CANVAS_HEIGHT), BACKGROUND) + draw = ImageDraw.Draw(image) + + font_paths = { + "regular": _find_font("DejaVu Sans", weight="normal"), + "bold": _find_font("DejaVu Sans", weight="bold"), + } + + _draw_title_block(draw, font_paths) + + # Three equal-width cards spanning the bottom half of the canvas. + card_top = 400 + card_bottom = 580 + card_count = len(badges) + gap = 40 + available = CANVAS_WIDTH - 2 * LEFT_MARGIN + card_width = (available - gap * (card_count - 1)) // card_count + for i, badge in enumerate(badges): + left = LEFT_MARGIN + i * (card_width + gap) + right = left + card_width + _draw_tier_card( + draw, + badge=badge, + box=(left, card_top, right, card_bottom), + font_paths=font_paths, + ) + + return image + + +def write_cover(path: Path, image: Image.Image | None = None) -> Path: + """Render and write the cover image to ``path`` deterministically. + + Pillow's PNG writer is byte-deterministic given the same input + image and the same encoder settings — pinning ``optimize=False`` + and a fixed ``compress_level`` removes the only sources of + run-to-run variance. + """ + + if image is None: + image = render_cover() + path.parent.mkdir(parents=True, exist_ok=True) + image.save(path, format="PNG", optimize=False, compress_level=6) + return path + + +# --------------------------------------------------------------------------- +# CLI +# --------------------------------------------------------------------------- + + +def _parse_args(argv: Sequence[str] | None) -> argparse.Namespace: + parser = argparse.ArgumentParser( + description="Generate the deterministic Kaggle cover image for leadforge-lead-scoring-v1.", + ) + parser.add_argument( + "--out", + type=Path, + default=DEFAULT_OUT_PATH, + help="output PNG path (default: %(default)s)", + ) + return parser.parse_args(argv) + + +def main(argv: Sequence[str] | None = None) -> int: + args = _parse_args(argv) + out_path: Path = args.out + write_cover(out_path) + print(f"wrote {out_path}", file=sys.stderr) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/package_kaggle_release.py b/scripts/package_kaggle_release.py new file mode 100644 index 0000000..96065a5 --- /dev/null +++ b/scripts/package_kaggle_release.py @@ -0,0 +1,1111 @@ +#!/usr/bin/env python3 +"""Package the ``leadforge-lead-scoring-v1`` family for Kaggle. + +PR 5.1 — first of two PRs in Phase 5 (Platform packaging) of the v1 +release roadmap. This script: + +1. Reads each public tier's ``manifest.json`` + ``feature_dictionary.csv`` + + flat CSV header under ``release/`` and assembles a Kaggle + ``dataset-metadata.json`` that satisfies G11.1 of + ``docs/release/v1_acceptance_gates.md`` — title length, subtitle + length, slug length, single licence, ``expectedUpdateFrequency`` + from the approved set, image filename, and + ``resources[].schema.fields`` listed **in column order** for every + tabular resource (CSV via the feature dictionary; parquet via the + Arrow schema). +2. Validates the cover image at ``release/dataset-cover-image.png`` + (≥ 560 × 280 per G11.2; generated by + ``scripts/generate_cover_image.py``). +3. Writes ``release/kaggle/dataset-metadata.json`` deterministically: + the same release input produces a byte-identical metadata file + (audit-artifact-sync pattern; guarded by + ``tests/scripts/test_package_kaggle_release.py``). +4. Optionally assembles a Kaggle-CLI-shaped upload directory under + ``release/kaggle/`` as real-file copies of the per-tier bundles + plus a rewritten copy of ``release/README.md`` whose directory + diagram and ``../`` links resolve correctly when read on the + Kaggle dataset page. An earlier draft used symlinks; we copy + instead because Kaggle's CLI walks the upload directory with + ``followlinks=False`` in some versions. + +The actual ``kaggle datasets create`` upload lives in PR 7.2; this +script is intentionally publish-free. ``--dry-run`` validates and +writes the metadata without touching the upload-dir layout, useful +for shape iteration; the default mode also assembles the upload tree. + +Failed validation exits with rc=1; pre-flight errors (missing release +dir, missing tier, missing cover image, unsafe ``--kaggle-dir``) +exit with rc=2. +""" + +from __future__ import annotations + +import argparse +import json +import re +import shutil +import sys +from collections.abc import Sequence +from dataclasses import dataclass, field +from pathlib import Path +from typing import Any, Final + +import pandas as pd +import pyarrow as pa +import pyarrow.parquet as pq +from PIL import Image + +# --------------------------------------------------------------------------- +# Kaggle field constraints (chatgpt v2 §19, verified from official docs) +# --------------------------------------------------------------------------- + +TITLE_LEN_RANGE: Final[tuple[int, int]] = (6, 50) +SUBTITLE_LEN_RANGE: Final[tuple[int, int]] = (20, 80) +SLUG_LEN_RANGE: Final[tuple[int, int]] = (3, 50) + +#: Allowed values for ``expectedUpdateFrequency`` (Kaggle CLI rejects +#: anything else). +APPROVED_UPDATE_FREQUENCIES: Final[tuple[str, ...]] = ( + "never", + "annually", + "quarterly", + "monthly", + "weekly", + "daily", + "hourly", +) + +#: Cover-image minimum dimensions per G11.2: 560 × 280 minimum, with +#: 2:1 header / 1:1 thumbnail crops in mind. +COVER_IMAGE_MIN_WIDTH: Final[int] = 560 +COVER_IMAGE_MIN_HEIGHT: Final[int] = 280 + +#: Allowed cover-image extensions per Kaggle docs. +ALLOWED_COVER_IMAGE_SUFFIXES: Final[tuple[str, ...]] = ( + ".png", + ".jpg", + ".jpeg", + ".webp", +) + +#: Slug pattern — Kaggle dataset slugs are lowercase alphanumeric with +#: hyphens. Boundary chars must be alphanumeric so the slug never +#: starts or ends with a hyphen. +SLUG_PATTERN: Final[re.Pattern[str]] = re.compile(r"^[a-z0-9][a-z0-9-]*[a-z0-9]$") + +# --------------------------------------------------------------------------- +# Release-specific defaults (G1.2 dataset slug + G11 metadata content) +# --------------------------------------------------------------------------- + +DEFAULT_USER_SLUG: Final[str] = "leadforge" +DEFAULT_DATASET_SLUG: Final[str] = "leadforge-lead-scoring-v1" + +DEFAULT_TITLE: Final[str] = "LeadForge: Synthetic B2B Lead Scoring (v1)" +DEFAULT_SUBTITLE: Final[str] = "Three-tier synthetic CRM funnel for leakage-aware lead scoring" + +DEFAULT_KEYWORDS: Final[tuple[str, ...]] = ( + "b2b", + "classification", + "crm", + "education", + "lead-scoring", + "saas", + "synthetic-data", + "tabular", +) + + +@dataclass(frozen=True) +class UserSource: + """One entry under ``userSpecifiedSources``. + + Defined alongside the constants so ``DEFAULT_USER_SOURCES`` below + can reference it without a forward declaration; the rest of the + metadata dataclasses live further down in their own section. + """ + + title: str + url: str + + +DEFAULT_USER_SOURCES: Final[tuple[UserSource, ...]] = ( + UserSource( + title="leadforge source repository", + url="https://github.com/leadforge-dev/leadforge", + ), + UserSource( + title="v1 release validation report", + url="https://github.com/leadforge-dev/leadforge/tree/main/release/validation", + ), +) + +DEFAULT_LICENSE_NAME: Final[str] = "MIT" +DEFAULT_UPDATE_FREQUENCY: Final[str] = "never" + +DEFAULT_TIERS: Final[tuple[str, ...]] = ("intro", "intermediate", "advanced") +DEFAULT_TASK: Final[str] = "converted_within_90_days" + +DEFAULT_RELEASE_DIR: Final[Path] = Path("release") +DEFAULT_KAGGLE_DIR: Final[Path] = Path("release/kaggle") +DEFAULT_COVER_IMAGE: Final[Path] = Path("release/dataset-cover-image.png") + +#: Top-level files at ``release/`` that ship to Kaggle alongside the +#: bundles. README.md is rewritten on the way in (see +#: :func:`_kaggle_readme_text`); LICENSE is taken verbatim. +TOP_LEVEL_DOCS: Final[tuple[str, ...]] = ("README.md", "LICENSE") + +#: Tables that may appear in a public bundle, in canonical render +#: order. ``customers`` and ``subscriptions`` are intentionally +#: absent — their presence in a public bundle would itself be leakage +#: (PR 2.2). +BUNDLE_TABLES: Final[tuple[str, ...]] = ( + "accounts", + "contacts", + "leads", + "touches", + "sessions", + "sales_activities", + "opportunities", +) + +#: Mapping from feature_dictionary.csv ``dtype`` (see +#: ``leadforge/schema/dictionaries.py``) to a Frictionless Data +#: Package ``schema.fields[].type`` token, which Kaggle's resource +#: schema uses. +DTYPE_TO_FRICTIONLESS: Final[dict[str, str]] = { + "string": "string", + "Int64": "integer", + "Float64": "number", + "boolean": "boolean", +} + +#: Description for the ``split`` column, which is present in the flat +#: CSV but not in the feature dictionary (it tracks task-split +#: membership rather than describing a feature). +SPLIT_COLUMN_DESCRIPTION: Final[str] = ( + "Task-split membership: one of `train`, `valid`, `test`. " + "Matches the per-row split assignment in `tasks/converted_within_90_days/`." +) + +# --------------------------------------------------------------------------- +# Description / README rewriting +# --------------------------------------------------------------------------- + +GITHUB_BLOB_BASE: Final[str] = "https://github.com/leadforge-dev/leadforge/blob/main" + +#: The "What's inside" tree diagram in ``release/README.md``. The +#: published README on Kaggle should describe the *upload* layout +#: (which has dataset-metadata.json + cover image at the top, no +#: instructor companion, no notebooks/validation siblings), not the +#: source-repo layout — we substitute the block on the way out. +KAGGLE_TREE_BLOCK: Final[str] = """``` +release/ +├── intro/ intermediate/ advanced/ # student_public bundles, one per difficulty tier +│ ├── manifest.json # provenance + file hashes +│ ├── dataset_card.md # auto-rendered per-bundle card +│ ├── feature_dictionary.csv # authoritative column spec +│ ├── lead_scoring.csv # flat convenience CSV (all splits) +│ ├── tables/*.parquet # 7 snapshot-safe relational tables +│ └── tasks/converted_within_90_days/{train,valid,test}.parquet +├── intermediate_instructor/ # research companion: full-horizon tables + metadata/ +├── notebooks/01_baseline_lead_scoring.ipynb +└── validation/ # validation_report.{json,md} + figures +```""" + +KAGGLE_UPLOAD_TREE_BLOCK: Final[str] = """``` +. +├── intro/ intermediate/ advanced/ # student_public bundles, one per difficulty tier +│ ├── manifest.json # provenance + file hashes +│ ├── dataset_card.md # auto-rendered per-bundle card +│ ├── feature_dictionary.csv # authoritative column spec +│ ├── lead_scoring.csv # flat convenience CSV (all splits) +│ ├── tables/*.parquet # 7 snapshot-safe relational tables +│ └── tasks/converted_within_90_days/{train,valid,test}.parquet +├── dataset-metadata.json # Kaggle dataset metadata +├── dataset-cover-image.png # Kaggle cover image +├── README.md # Kaggle package README +└── LICENSE +```""" + +#: Inline relative link ``](../foo)`` → ``](GITHUB_BLOB_BASE/foo)`` +#: for any markdown link that escapes the bundle root. +_PARENT_RELATIVE_LINK: Final[re.Pattern[str]] = re.compile(r"\]\(\.\./([^)]+)\)") + +#: The README points at ``validation/validation_report.md`` (a path +#: that lives under ``release/`` but not under the Kaggle upload +#: directory). Rewrite to a GitHub blob URL so the link works on +#: Kaggle. +_VALIDATION_REPORT_LINK: Final[str] = "](validation/validation_report.md)" +_VALIDATION_REPORT_URL: Final[str] = ( + f"]({GITHUB_BLOB_BASE}/release/validation/validation_report.md)" +) + + +def _kaggle_readme_text(readme: str) -> str: + """Apply the Kaggle-specific rewrites to a copy of the release README. + + Rewrites: + + 1. Source-repo tree diagram → upload-tree diagram (the published + README should describe what the *user* sees on Kaggle, not the + source repo layout). + 2. ``](../foo)`` → ``]({GITHUB_BLOB_BASE}/foo)`` (markdown links + that escape the bundle root resolve to the source repo on + GitHub). + 3. ``](validation/validation_report.md)`` → blob URL (the + validation report does not ship to Kaggle; readers click + through to GitHub). + """ + + text = readme.replace(KAGGLE_TREE_BLOCK, KAGGLE_UPLOAD_TREE_BLOCK) + text = _PARENT_RELATIVE_LINK.sub(rf"]({GITHUB_BLOB_BASE}/\1)", text) + text = text.replace(_VALIDATION_REPORT_LINK, _VALIDATION_REPORT_URL) + return text + + +# --------------------------------------------------------------------------- +# Dataclasses — one per top-level metadata block +# --------------------------------------------------------------------------- + + +@dataclass(frozen=True) +class FieldDescriptor: + """One column entry inside ``resources[].schema.fields``.""" + + name: str + type: str + description: str | None = None + + +@dataclass(frozen=True) +class ResourceSchema: + """Frictionless-style schema declaration for a tabular resource.""" + + fields: tuple[FieldDescriptor, ...] + + +@dataclass(frozen=True) +class Resource: + """One entry under ``resources``. + + ``schema`` is set to ``None`` for non-tabular resources (markdown, + JSON manifests). The renderer drops ``None`` values and + ``description=None`` field-level entries so the JSON stays clean. + """ + + path: str + description: str + schema: ResourceSchema | None = None + + +@dataclass(frozen=True) +class LicenseSpec: + """One entry under ``licenses``. Kaggle requires exactly one.""" + + name: str + + +# ``UserSource`` is defined above (next to ``DEFAULT_USER_SOURCES``) so +# the constant can reference it without forward-declaration tricks. + + +@dataclass(frozen=True) +class DatasetMetadata: + """Top-level Kaggle metadata payload. + + These dataclasses are typed records, not invariants — construction + is unchecked. Callers MUST run :func:`validate_metadata` before + relying on the metadata being well-formed; that function is the + authoritative gate for every Kaggle field constraint. Doing the + validation eagerly in ``__post_init__`` would prevent tests from + constructing deliberately bad payloads to exercise the validator, + which is why the discipline lives in the validator instead. + """ + + title: str + id: str + subtitle: str + description: str + isPrivate: bool # noqa: N815 — Kaggle field name is camelCase + licenses: tuple[LicenseSpec, ...] + keywords: tuple[str, ...] + collaborators: tuple[str, ...] + expectedUpdateFrequency: str # noqa: N815 — Kaggle field name + userSpecifiedSources: tuple[UserSource, ...] # noqa: N815 — Kaggle field name + image: str + resources: tuple[Resource, ...] = field(default_factory=tuple) + + +# --------------------------------------------------------------------------- +# Validation +# --------------------------------------------------------------------------- + + +@dataclass(frozen=True) +class ValidationError: + """One field-level validation failure.""" + + field: str + message: str + + +def _validate_length(name: str, value: str, lo: int, hi: int) -> ValidationError | None: + n = len(value) + if n < lo or n > hi: + return ValidationError( + field=name, + message=f"length {n} outside Kaggle range [{lo}, {hi}]", + ) + return None + + +def _validate_slug(slug: str, *, field_name: str) -> ValidationError | None: + err = _validate_length(field_name, slug, *SLUG_LEN_RANGE) + if err is not None: + return err + if not SLUG_PATTERN.fullmatch(slug): + return ValidationError( + field=field_name, + message=f"slug {slug!r} must be lowercase alphanumeric with hyphens", + ) + return None + + +def _validate_id(value: str) -> list[ValidationError]: + """Validate the ``user/slug`` id field. + + Kaggle's actual ``dataset-metadata.json`` schema uses + ``/``; the slug-only short form some tooling accepts + is rejected here so the artefact is upload-ready without + publish-time fixup. + """ + + errors: list[ValidationError] = [] + if "/" not in value: + errors.append(ValidationError(field="id", message=f"id {value!r} missing 'user/' prefix")) + return errors + user, slug = value.split("/", 1) + if not user: + errors.append(ValidationError(field="id", message="user prefix is empty")) + if not slug: + errors.append(ValidationError(field="id", message="slug is empty")) + return errors + slug_err = _validate_slug(slug, field_name="id (slug)") + if slug_err is not None: + errors.append(slug_err) + return errors + + +def validate_metadata(metadata: DatasetMetadata) -> list[ValidationError]: + """Run every Kaggle-side check against a built ``DatasetMetadata``.""" + + errors: list[ValidationError] = [] + + title_err = _validate_length("title", metadata.title, *TITLE_LEN_RANGE) + if title_err is not None: + errors.append(title_err) + + subtitle_err = _validate_length("subtitle", metadata.subtitle, *SUBTITLE_LEN_RANGE) + if subtitle_err is not None: + errors.append(subtitle_err) + + errors.extend(_validate_id(metadata.id)) + + if len(metadata.licenses) != 1: + errors.append( + ValidationError( + field="licenses", + message=f"expected exactly one entry, got {len(metadata.licenses)}", + ) + ) + + if metadata.expectedUpdateFrequency not in APPROVED_UPDATE_FREQUENCIES: + errors.append( + ValidationError( + field="expectedUpdateFrequency", + message=( + f"{metadata.expectedUpdateFrequency!r} not in approved values " + f"{APPROVED_UPDATE_FREQUENCIES}" + ), + ) + ) + + image_suffix = Path(metadata.image).suffix.lower() + if image_suffix not in ALLOWED_COVER_IMAGE_SUFFIXES: + errors.append( + ValidationError( + field="image", + message=( + f"image extension {image_suffix!r} not in allowed Kaggle suffixes " + f"{ALLOWED_COVER_IMAGE_SUFFIXES}" + ), + ) + ) + + if not metadata.resources: + errors.append( + ValidationError(field="resources", message="must contain at least one resource") + ) + + for i, res in enumerate(metadata.resources): + if res.schema is None: + continue # non-tabular resource; schema is optional + if not res.schema.fields: + errors.append( + ValidationError( + field=f"resources[{i}].schema.fields", + message="must contain at least one field when schema is declared", + ) + ) + continue + for j, fd in enumerate(res.schema.fields): + if not fd.name or not fd.type: + errors.append( + ValidationError( + field=f"resources[{i}].schema.fields[{j}]", + message="each field must declare both name and type", + ) + ) + + return errors + + +def validate_cover_image(path: Path) -> list[ValidationError]: + """Validate that ``path`` exists and meets Kaggle's dimension floor.""" + + errors: list[ValidationError] = [] + if not path.exists(): + errors.append( + ValidationError( + field="cover_image", + message=f"cover image not found at {path}", + ) + ) + return errors + with Image.open(path) as img: + width, height = img.size + if width < COVER_IMAGE_MIN_WIDTH or height < COVER_IMAGE_MIN_HEIGHT: + errors.append( + ValidationError( + field="cover_image", + message=( + f"cover image {width}x{height} below Kaggle minimum " + f"{COVER_IMAGE_MIN_WIDTH}x{COVER_IMAGE_MIN_HEIGHT}" + ), + ) + ) + return errors + + +# --------------------------------------------------------------------------- +# Bundle reading + resource building +# --------------------------------------------------------------------------- + + +def _load_feature_dictionary(path: Path) -> dict[str, FieldDescriptor]: + """Load ``feature_dictionary.csv`` keyed by column name.""" + + df = pd.read_csv(path) + descriptors: dict[str, FieldDescriptor] = {} + for _, row in df.iterrows(): + dtype = str(row["dtype"]) + frictionless_type = DTYPE_TO_FRICTIONLESS.get(dtype) + if frictionless_type is None: + raise ValueError( + f"feature_dictionary.csv at {path}: dtype {dtype!r} not mapped to a " + f"Frictionless Data Package type ({sorted(DTYPE_TO_FRICTIONLESS)!r})" + ) + name = str(row["name"]) + descriptors[name] = FieldDescriptor( + name=name, + type=frictionless_type, + description=str(row["description"]).strip(), + ) + return descriptors + + +def _flat_csv_fields( + flat_csv_path: Path, feature_dict: dict[str, FieldDescriptor] +) -> tuple[FieldDescriptor, ...]: + """Build ``schema.fields`` for a flat CSV in CSV column order.""" + + columns = list(pd.read_csv(flat_csv_path, nrows=0).columns) + fields: list[FieldDescriptor] = [] + for col in columns: + name = str(col) + if name == "split": + fields.append( + FieldDescriptor(name=name, type="string", description=SPLIT_COLUMN_DESCRIPTION) + ) + continue + descriptor = feature_dict.get(name) + if descriptor is None: + raise ValueError( + f"flat CSV at {flat_csv_path}: column {name!r} is not present in " + f"feature_dictionary.csv — feature dictionary is the source of truth" + ) + fields.append(descriptor) + return tuple(fields) + + +def _kaggle_type_from_arrow(dtype: pa.DataType) -> str: + """Map a pyarrow type to the Frictionless field-type token.""" + + if pa.types.is_boolean(dtype): + return "boolean" + if pa.types.is_integer(dtype): + return "integer" + if pa.types.is_floating(dtype) or pa.types.is_decimal(dtype): + return "number" + if pa.types.is_date(dtype) or pa.types.is_timestamp(dtype) or pa.types.is_time(dtype): + return "datetime" + return "string" + + +def fields_from_parquet(path: Path) -> tuple[FieldDescriptor, ...]: + """Read parquet schema from ``path`` and return ``FieldDescriptor`` rows. + + Kaggle accepts Frictionless schemas on parquet resources too; the + parquet file's own Arrow metadata is the ground truth for column + order and types, so we read directly rather than mirroring a CSV + header. ``description`` is omitted for parquet fields — relational + tables don't have per-column docs in the bundle. + """ + + schema = pq.read_schema(path) + return tuple(FieldDescriptor(name=f.name, type=_kaggle_type_from_arrow(f.type)) for f in schema) + + +def _load_manifest(path: Path) -> dict[str, Any]: + payload = json.loads(path.read_text(encoding="utf-8")) + if not isinstance(payload, dict): + raise ValueError(f"manifest.json at {path} is not a JSON object") + return payload + + +def build_tier_resources( + release_dir: Path, + tier: str, + *, + task: str = DEFAULT_TASK, +) -> tuple[Resource, ...]: + """Build the ``Resource`` list for one tier in canonical order. + + Order: flat CSV (with full ``schema.fields``) → feature dictionary + → task splits (parquet, schema from Arrow) → relational tables + (parquet, schema from Arrow) → dataset card → manifest. Kaggle + renders this list in declared order on the dataset page. + """ + + tier_dir = release_dir / tier + if not tier_dir.is_dir(): + raise FileNotFoundError(f"tier directory missing: {tier_dir}") + + feature_dict_path = tier_dir / "feature_dictionary.csv" + feature_dict = _load_feature_dictionary(feature_dict_path) + flat_csv_path = tier_dir / "lead_scoring.csv" + fields = _flat_csv_fields(flat_csv_path, feature_dict) + + manifest = _load_manifest(tier_dir / "manifest.json") + table_inventory = manifest.get("tables", {}) + snapshot_day = manifest.get("snapshot_day") + + resources: list[Resource] = [] + + resources.append( + Resource( + path=f"{tier}/lead_scoring.csv", + description=( + f"{tier.capitalize()} tier flat CSV (all splits concatenated, label retained, " + f"snapshot_day={snapshot_day}). The `split` column distinguishes " + f"train/valid/test rows." + ), + schema=ResourceSchema(fields=fields), + ) + ) + + resources.append( + Resource( + path=f"{tier}/feature_dictionary.csv", + description=f"{tier.capitalize()} tier feature dictionary (canonical column spec).", + ) + ) + + for split in ("train", "valid", "test"): + split_path = tier_dir / "tasks" / task / f"{split}.parquet" + rows = manifest.get("tasks", {}).get(task, {}).get(f"{split}_rows") + rows_str = f"{rows:,} rows" if isinstance(rows, int) else "row count in manifest" + resources.append( + Resource( + path=f"{tier}/tasks/{task}/{split}.parquet", + description=(f"{tier.capitalize()} tier {split} split for `{task}` ({rows_str})."), + schema=ResourceSchema(fields=fields_from_parquet(split_path)), + ) + ) + + for table in BUNDLE_TABLES: + if table not in table_inventory: + continue + table_path = tier_dir / "tables" / f"{table}.parquet" + row_count = table_inventory[table].get("row_count") + rows_str = f"{row_count:,} rows" if isinstance(row_count, int) else "" + suffix = f" ({rows_str})" if rows_str else "" + resources.append( + Resource( + path=f"{tier}/tables/{table}.parquet", + description=( + f"{tier.capitalize()} tier `{table}` relational table{suffix} — snapshot-safe." + ), + schema=ResourceSchema(fields=fields_from_parquet(table_path)), + ) + ) + + resources.append( + Resource( + path=f"{tier}/dataset_card.md", + description=f"{tier.capitalize()} tier auto-rendered dataset card.", + ) + ) + resources.append( + Resource( + path=f"{tier}/manifest.json", + description=( + f"{tier.capitalize()} tier provenance manifest (recipe, seed, package " + f"version, file hashes, snapshot_day, redaction contract)." + ), + ) + ) + return tuple(resources) + + +def build_metadata( + release_dir: Path, + *, + tiers: Sequence[str] = DEFAULT_TIERS, + task: str = DEFAULT_TASK, + owner: str = DEFAULT_USER_SLUG, + dataset_slug: str = DEFAULT_DATASET_SLUG, + title: str = DEFAULT_TITLE, + subtitle: str = DEFAULT_SUBTITLE, + description: str | None = None, + keywords: Sequence[str] = DEFAULT_KEYWORDS, + license_name: str = DEFAULT_LICENSE_NAME, + update_frequency: str = DEFAULT_UPDATE_FREQUENCY, + user_sources: Sequence[UserSource] = DEFAULT_USER_SOURCES, + cover_image: Path = DEFAULT_COVER_IMAGE, +) -> DatasetMetadata: + """Assemble a ``DatasetMetadata`` from the release tree. + + When ``description`` is ``None`` (the default) we lift the + contents of ``release/README.md`` and apply the Kaggle-specific + rewrites — Kaggle renders the description above the file list, so + a full dataset card there is more useful than a curated blurb. + """ + + if description is None: + readme_path = release_dir / "README.md" + description = _kaggle_readme_text(readme_path.read_text(encoding="utf-8")) + + resources: list[Resource] = [] + for tier in tiers: + resources.extend(build_tier_resources(release_dir, tier, task=task)) + + return DatasetMetadata( + title=title, + id=f"{owner}/{dataset_slug}", + subtitle=subtitle, + description=description, + isPrivate=True, + licenses=(LicenseSpec(name=license_name),), + keywords=tuple(keywords), + collaborators=(), + expectedUpdateFrequency=update_frequency, + userSpecifiedSources=tuple(user_sources), + image=cover_image.name, + resources=tuple(resources), + ) + + +# --------------------------------------------------------------------------- +# Rendering +# --------------------------------------------------------------------------- + + +def _field_to_dict(fd: FieldDescriptor) -> dict[str, Any]: + payload: dict[str, Any] = {"name": fd.name, "type": fd.type} + if fd.description is not None: + payload["description"] = fd.description + return payload + + +def _resource_to_dict(resource: Resource) -> dict[str, Any]: + """Serialise a ``Resource`` to a JSON-primitive dict. + + Drops ``schema`` when ``None``; drops ``description`` from + individual fields when ``None`` (parquet schemas don't carry + per-column documentation). + """ + + payload: dict[str, Any] = { + "path": resource.path, + "description": resource.description, + } + if resource.schema is not None: + payload["schema"] = {"fields": [_field_to_dict(fd) for fd in resource.schema.fields]} + return payload + + +def metadata_to_dict(metadata: DatasetMetadata) -> dict[str, Any]: + """Convert ``DatasetMetadata`` to a JSON-primitive dict. + + Built field-by-field rather than via ``asdict()`` so resource + serialisation goes through one path (``_resource_to_dict``) and + the keywords array is sorted at render time — making the + determinism contract explicit rather than relying on the + ``DEFAULT_KEYWORDS`` constant happening to be alphabetised. + """ + + return { + "title": metadata.title, + "id": metadata.id, + "subtitle": metadata.subtitle, + "description": metadata.description, + "isPrivate": metadata.isPrivate, + "licenses": [{"name": lic.name} for lic in metadata.licenses], + "keywords": sorted(metadata.keywords), + "collaborators": list(metadata.collaborators), + "expectedUpdateFrequency": metadata.expectedUpdateFrequency, + "userSpecifiedSources": [ + {"title": s.title, "url": s.url} for s in metadata.userSpecifiedSources + ], + "image": metadata.image, + "resources": [_resource_to_dict(r) for r in metadata.resources], + } + + +def render_metadata_json(metadata: DatasetMetadata) -> str: + """Render the metadata as a deterministic JSON string. + + ``ensure_ascii=False`` keeps non-ASCII content (em-dashes, the × + multiplication sign, smart quotes from the inlined README) + rendered literally rather than escaped to ``\\u2013`` etc., which + is essential for ``git diff`` readability when the README evolves. + """ + + return ( + json.dumps( + metadata_to_dict(metadata), + indent=2, + sort_keys=True, + ensure_ascii=False, + ) + + "\n" + ) + + +# --------------------------------------------------------------------------- +# Upload-directory assembly +# --------------------------------------------------------------------------- + + +def _validate_kaggle_dir_safe(kaggle_dir: Path, release_dir: Path) -> None: + """Refuse to assemble into a path that aliases something dangerous. + + The packager replaces children of ``kaggle_dir`` (rmtree + recopy) + so pointing it at ``cwd`` / ``release_dir`` / their parents / the + filesystem anchor would clobber unrelated content. This guard + fires before any disk write. + """ + + resolved = kaggle_dir.resolve() + blocked = { + Path(resolved.anchor), + Path.cwd().resolve(), + release_dir.resolve(), + release_dir.resolve().parent, + } + if resolved in blocked: + raise ValueError(f"refusing to assemble into unsafe --kaggle-dir: {kaggle_dir}") + + +def _replace_file(src: Path, dst: Path) -> None: + """Copy ``src`` → ``dst``, replacing any existing entry at ``dst``.""" + + if dst.is_symlink() or dst.is_file(): + dst.unlink() + elif dst.exists() and dst.is_dir(): + shutil.rmtree(dst) + dst.parent.mkdir(parents=True, exist_ok=True) + shutil.copy2(src, dst) + + +def _replace_dir(src: Path, dst: Path) -> None: + """Copy directory ``src`` → ``dst``, replacing any existing entry.""" + + if dst.is_symlink() or dst.is_file(): + dst.unlink() + elif dst.exists() and dst.is_dir(): + shutil.rmtree(dst) + dst.parent.mkdir(parents=True, exist_ok=True) + shutil.copytree(src, dst) + + +def assemble_upload_dir( + release_dir: Path, + kaggle_dir: Path, + *, + tiers: Sequence[str] = DEFAULT_TIERS, + cover_image: Path = DEFAULT_COVER_IMAGE, +) -> None: + """Assemble ``kaggle_dir`` for ``kaggle datasets create`` to consume. + + The output tree is a self-contained directory of real files: + cover image, LICENSE, the rewritten README, and full copies of + each tier bundle. Symlinks were considered (and tried in an + earlier draft) but Kaggle's CLI walks the upload directory with + ``followlinks=False`` in some versions, silently skipping symlinked + children — switching to copies removes that fragility at the cost + of ~15 MB of disk per assembly run, which is gitignored anyway. + + Re-running the assembly is idempotent: ``_replace_file`` and + ``_replace_dir`` rmtree-then-copy any existing entry. The README + is the one file rewritten on the way in (tree diagram + ``../`` + links). ``--dry-run`` skips this whole function. + """ + + _validate_kaggle_dir_safe(kaggle_dir, release_dir) + kaggle_dir.mkdir(parents=True, exist_ok=True) + + # Cover image. + cover_src = release_dir / cover_image.name + if not cover_src.exists(): + cover_src = cover_image + _replace_file(cover_src, kaggle_dir / cover_image.name) + + # LICENSE — straight copy, no rewriting. + license_src = release_dir / "LICENSE" + if license_src.exists(): + _replace_file(license_src, kaggle_dir / "LICENSE") + + # README.md — real copy with link rewriting so ``../`` links and + # the directory diagram resolve correctly on the Kaggle dataset + # page. + kaggle_readme = kaggle_dir / "README.md" + if kaggle_readme.is_symlink() or kaggle_readme.is_file(): + kaggle_readme.unlink() + readme_src = release_dir / "README.md" + if readme_src.exists(): + kaggle_readme.parent.mkdir(parents=True, exist_ok=True) + kaggle_readme.write_text( + _kaggle_readme_text(readme_src.read_text(encoding="utf-8")), + encoding="utf-8", + ) + + # Per-tier bundles — full directory copies. + for tier in tiers: + tier_src = release_dir / tier + _replace_dir(tier_src, kaggle_dir / tier) + + +# --------------------------------------------------------------------------- +# Driver +# --------------------------------------------------------------------------- + + +@dataclass(frozen=True) +class PackagerOutcome: + """Return value from :func:`run_packager` — used by tests + CLI.""" + + metadata: DatasetMetadata + metadata_path: Path + errors: tuple[ValidationError, ...] + assembled: bool + + +def run_packager( + release_dir: Path, + *, + kaggle_dir: Path = DEFAULT_KAGGLE_DIR, + tiers: Sequence[str] = DEFAULT_TIERS, + task: str = DEFAULT_TASK, + owner: str = DEFAULT_USER_SLUG, + dataset_slug: str = DEFAULT_DATASET_SLUG, + title: str = DEFAULT_TITLE, + subtitle: str = DEFAULT_SUBTITLE, + description: str | None = None, + keywords: Sequence[str] = DEFAULT_KEYWORDS, + license_name: str = DEFAULT_LICENSE_NAME, + update_frequency: str = DEFAULT_UPDATE_FREQUENCY, + user_sources: Sequence[UserSource] = DEFAULT_USER_SOURCES, + cover_image: Path = DEFAULT_COVER_IMAGE, + dry_run: bool = False, +) -> PackagerOutcome: + """Build, validate, and write the Kaggle metadata. + + With ``dry_run=False`` (the default) the packager additionally + assembles the Kaggle-CLI-shaped upload directory under + ``kaggle_dir`` (real-file copies of the per-tier bundles + cover + image + LICENSE + the rewritten README). ``dry_run=True`` skips + the assembly step entirely — useful for fast shape iteration when + only the metadata content matters. + """ + + if not release_dir.exists(): + raise FileNotFoundError(f"release directory not found: {release_dir}") + + metadata = build_metadata( + release_dir, + tiers=tiers, + task=task, + owner=owner, + dataset_slug=dataset_slug, + title=title, + subtitle=subtitle, + description=description, + keywords=keywords, + license_name=license_name, + update_frequency=update_frequency, + user_sources=user_sources, + cover_image=cover_image, + ) + + errors: list[ValidationError] = [] + errors.extend(validate_metadata(metadata)) + errors.extend(validate_cover_image(cover_image)) + errors.extend(_validate_readme_substitution(release_dir)) + + metadata_path = kaggle_dir / "dataset-metadata.json" + metadata_path.parent.mkdir(parents=True, exist_ok=True) + metadata_path.write_text(render_metadata_json(metadata), encoding="utf-8") + + if not dry_run: + assemble_upload_dir(release_dir, kaggle_dir, tiers=tiers, cover_image=cover_image) + + return PackagerOutcome( + metadata=metadata, + metadata_path=metadata_path, + errors=tuple(errors), + assembled=not dry_run, + ) + + +def _validate_readme_substitution(release_dir: Path) -> list[ValidationError]: + """Guard against silent drift between the README's tree diagram + and ``KAGGLE_TREE_BLOCK``. + + ``_kaggle_readme_text`` substitutes the source-repo tree diagram + for the upload-tree diagram via plain string replace. If the + README's tree changes by even one whitespace character, the + substitution silently no-ops and the published Kaggle dataset + card shows the source-repo tree (with ``intermediate_instructor/``, + ``notebooks/``, ``validation/``). We catch that case here. + """ + + readme = release_dir / "README.md" + if not readme.exists(): + return [] # No README is itself a release-day issue, but not this validator's concern. + if KAGGLE_TREE_BLOCK not in readme.read_text(encoding="utf-8"): + return [ + ValidationError( + field="release/README.md", + message=( + "KAGGLE_TREE_BLOCK not found verbatim in release/README.md; " + "the source-repo tree diagram in the README has drifted from " + "the constant in scripts/package_kaggle_release.py — the " + "Kaggle description rewrite will silently no-op until the " + "README and KAGGLE_TREE_BLOCK are reconciled." + ), + ) + ] + return [] + + +# --------------------------------------------------------------------------- +# CLI +# --------------------------------------------------------------------------- + + +def _parse_args(argv: Sequence[str] | None) -> argparse.Namespace: + parser = argparse.ArgumentParser( + description="Generate and validate Kaggle dataset-metadata.json for " + "leadforge-lead-scoring-v1.", + ) + parser.add_argument( + "--release-dir", + type=Path, + default=DEFAULT_RELEASE_DIR, + help="release bundle root containing one subdirectory per tier (default: %(default)s)", + ) + parser.add_argument( + "--kaggle-dir", + type=Path, + default=DEFAULT_KAGGLE_DIR, + help="output directory for dataset-metadata.json (default: %(default)s)", + ) + parser.add_argument( + "--tier", + action="append", + dest="tiers", + default=None, + help="limit packaging to one tier (repeatable; default: intro/intermediate/advanced)", + ) + parser.add_argument( + "--owner", + default=DEFAULT_USER_SLUG, + help="Kaggle owner (user or organisation) prefix on the dataset id (default: %(default)s)", + ) + parser.add_argument( + "--dataset-slug", + default=DEFAULT_DATASET_SLUG, + help="dataset slug (must satisfy G1.2; default: %(default)s)", + ) + parser.add_argument( + "--cover-image", + type=Path, + default=DEFAULT_COVER_IMAGE, + help="path to the dataset cover image (default: %(default)s)", + ) + parser.add_argument( + "--dry-run", + action="store_true", + help="validate + write metadata only; skip assembling the upload directory", + ) + return parser.parse_args(argv) + + +def main(argv: Sequence[str] | None = None) -> int: + args = _parse_args(argv) + kaggle_dir: Path = args.kaggle_dir + tiers: tuple[str, ...] = tuple(args.tiers) if args.tiers else DEFAULT_TIERS + + try: + outcome = run_packager( + args.release_dir, + kaggle_dir=kaggle_dir, + tiers=tiers, + owner=args.owner, + dataset_slug=args.dataset_slug, + cover_image=args.cover_image, + dry_run=args.dry_run, + ) + except FileNotFoundError as exc: + print(f"error: {exc}", file=sys.stderr) + return 2 + except ValueError as exc: + print(f"error: {exc}", file=sys.stderr) + return 2 + + if outcome.errors: + print("validation failed:", file=sys.stderr) + for err in outcome.errors: + print(f" - {err.field}: {err.message}", file=sys.stderr) + return 1 + + print(f"wrote {outcome.metadata_path}", file=sys.stderr) + if outcome.assembled: + print(f"assembled upload tree under {kaggle_dir}", file=sys.stderr) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/tests/scripts/test_generate_cover_image.py b/tests/scripts/test_generate_cover_image.py new file mode 100644 index 0000000..46f3af8 --- /dev/null +++ b/tests/scripts/test_generate_cover_image.py @@ -0,0 +1,109 @@ +"""Tests for ``scripts/generate_cover_image.py``. + +Locks the two acceptance properties for the Kaggle cover image: + +1. it satisfies G11.2 — at least 560 × 280 pixels in the right modes; +2. the output is byte-deterministic across runs and matches the + committed PNG (audit-artifact-sync pattern from PR 4.1). + +If the simulator's headline metrics drift the cover image's pinned +literals out of date, both the determinism check here and the metrics +in ``release/validation/validation_report.md`` will need a coordinated +update. +""" + +from __future__ import annotations + +import importlib.util +import sys +from pathlib import Path + +import pytest +from PIL import Image + +_SCRIPT_PATH = Path(__file__).resolve().parents[2] / "scripts" / "generate_cover_image.py" +_REPO_ROOT = Path(__file__).resolve().parents[2] +_spec = importlib.util.spec_from_file_location("generate_cover_image", _SCRIPT_PATH) +assert _spec is not None +assert _spec.loader is not None +generator = importlib.util.module_from_spec(_spec) +sys.modules["generate_cover_image"] = generator +_spec.loader.exec_module(generator) + + +_COMMITTED_COVER = _REPO_ROOT / "release" / "dataset-cover-image.png" +_COMMITTED_PRESENT = _COMMITTED_COVER.exists() + + +# --------------------------------------------------------------------------- +# Dimension floor + mode (G11.2) +# --------------------------------------------------------------------------- + + +def test_render_cover_dimensions_above_kaggle_minimum() -> None: + """G11.2: cover image must be at least 560 × 280; we ship 1280 × 640.""" + + image = generator.render_cover() + assert image.size == (generator.CANVAS_WIDTH, generator.CANVAS_HEIGHT) + assert image.size[0] >= 560 + assert image.size[1] >= 280 + # Ratio check — we deliberately render at 2:1 so the Kaggle header + # crop matches the source aspect ratio. + assert image.size[0] == 2 * image.size[1] + assert image.mode == "RGB" + + +def test_write_cover_writes_png_at_target_size(tmp_path: Path) -> None: + """``write_cover`` round-trips through Pillow at the declared dimensions.""" + + out = tmp_path / "cover.png" + generator.write_cover(out) + + with Image.open(out) as img: + assert img.format == "PNG" + assert img.size == (generator.CANVAS_WIDTH, generator.CANVAS_HEIGHT) + + +# --------------------------------------------------------------------------- +# Determinism + sync with committed asset +# --------------------------------------------------------------------------- + + +def test_render_cover_is_byte_deterministic(tmp_path: Path) -> None: + """Two back-to-back ``write_cover`` calls on the same machine + produce byte-identical PNGs. + + Pillow's PNG writer is deterministic given the same encoder + settings + the same FreeType-rasterised glyph bitmaps. This + guard catches regressions in the rasterisation pipeline locally; + cross-platform byte equality is *not* guaranteed (FreeType + versions and font-hinting tables differ between macOS and Linux, + so the committed PNG may not match a fresh render produced on a + different OS — we deliberately do not assert that here). + """ + + a = tmp_path / "cover_a.png" + b = tmp_path / "cover_b.png" + generator.write_cover(a) + generator.write_cover(b) + assert a.read_bytes() == b.read_bytes() + + +@pytest.mark.skipif(not _COMMITTED_PRESENT, reason="committed cover image not present") +def test_committed_cover_meets_kaggle_dimensions(tmp_path: Path) -> None: + """The committed ``release/dataset-cover-image.png`` opens cleanly + and meets Kaggle's dimension floor (G11.2). + + The committed PNG is a *valid render*, not a hash-locked artefact — + it ships so a fresh clone has a usable cover image without first + running ``scripts/generate_cover_image.py``. Cross-OS byte + equality is not asserted (see + :func:`test_render_cover_is_byte_deterministic`). + """ + + with Image.open(_COMMITTED_COVER) as img: + assert img.format == "PNG" + assert img.size[0] >= 560 + assert img.size[1] >= 280 + # Same shape as ``render_cover`` produces. + assert img.size == (generator.CANVAS_WIDTH, generator.CANVAS_HEIGHT) diff --git a/tests/scripts/test_package_kaggle_release.py b/tests/scripts/test_package_kaggle_release.py new file mode 100644 index 0000000..276d01e --- /dev/null +++ b/tests/scripts/test_package_kaggle_release.py @@ -0,0 +1,536 @@ +"""Tests for ``scripts/package_kaggle_release.py``. + +Locks the Phase 5 Kaggle packaging contract: + +* every Kaggle field constraint surfaced in chatgpt v2 §19 (G11.1) +* the cover-image dimension floor (G11.2) +* the README link-rewriting that lets the published dataset card on + Kaggle keep working ``../`` links (rewritten to GitHub blob URLs) + and a directory diagram that reflects the upload layout, plus a + guard that the source ``KAGGLE_TREE_BLOCK`` is still present + verbatim in the README (silent-failure trap) +* the assembled upload tree resolves every declared resource path + (so ``kaggle datasets create`` can find each file) +* the safety net that refuses to assemble into ``cwd`` / + ``release_dir`` / its parent +* byte-equality + content-shape between the committed + ``release/kaggle/dataset-metadata.json`` and a fresh regeneration + (audit-artifact-sync pattern from PR 4.1) +""" + +from __future__ import annotations + +import importlib.util +import json +import sys +from pathlib import Path + +import pytest +from PIL import Image + +_SCRIPT_PATH = Path(__file__).resolve().parents[2] / "scripts" / "package_kaggle_release.py" +_REPO_ROOT = Path(__file__).resolve().parents[2] +_spec = importlib.util.spec_from_file_location("package_kaggle_release", _SCRIPT_PATH) +assert _spec is not None +assert _spec.loader is not None +packager = importlib.util.module_from_spec(_spec) +sys.modules["package_kaggle_release"] = packager +_spec.loader.exec_module(packager) + + +_RELEASE_DIR = _REPO_ROOT / "release" +_RELEASE_BUNDLES_PRESENT = (_RELEASE_DIR / "intro" / "manifest.json").exists() +_COMMITTED_METADATA = _REPO_ROOT / "release" / "kaggle" / "dataset-metadata.json" +_COMMITTED_COVER = _REPO_ROOT / "release" / "dataset-cover-image.png" + + +# --------------------------------------------------------------------------- +# Fixtures +# --------------------------------------------------------------------------- + + +def _minimal_metadata() -> packager.DatasetMetadata: + """A minimum-viable ``DatasetMetadata`` that should validate cleanly.""" + + return packager.DatasetMetadata( + title=packager.DEFAULT_TITLE, + id=f"{packager.DEFAULT_USER_SLUG}/{packager.DEFAULT_DATASET_SLUG}", + subtitle=packager.DEFAULT_SUBTITLE, + description="Synthetic CRM lead-scoring dataset.", + isPrivate=True, + licenses=(packager.LicenseSpec(name=packager.DEFAULT_LICENSE_NAME),), + keywords=packager.DEFAULT_KEYWORDS, + collaborators=(), + expectedUpdateFrequency=packager.DEFAULT_UPDATE_FREQUENCY, + userSpecifiedSources=packager.DEFAULT_USER_SOURCES, + image="dataset-cover-image.png", + resources=( + packager.Resource( + path="intro/lead_scoring.csv", + description="Intro flat CSV.", + schema=packager.ResourceSchema( + fields=( + packager.FieldDescriptor(name="lead_id", type="string", description="ID."), + ) + ), + ), + ), + ) + + +def _make_valid_cover(path: Path) -> None: + """Write a minimum-Kaggle-acceptable cover image at ``path``.""" + + Image.new("RGB", (1280, 640), (0, 0, 0)).save(path) + + +# --------------------------------------------------------------------------- +# Field-constraint validation (G11.1) +# --------------------------------------------------------------------------- + + +def test_validate_metadata_accepts_canonical_v1_metadata() -> None: + assert packager.validate_metadata(_minimal_metadata()) == [] + + +def test_validate_metadata_reports_every_constraint_violation() -> None: + """One bad metadata payload triggers every field check at once.""" + + bad = packager.DatasetMetadata( + title="Tiny", # < 6 chars + id="LeadForge Bad Slug!", # missing '/' + invalid chars + subtitle="short", # < 20 chars + description="x", + isPrivate=True, + licenses=( # two entries, must be exactly one + packager.LicenseSpec(name="MIT"), + packager.LicenseSpec(name="Apache-2.0"), + ), + keywords=("synthetic-data",), + collaborators=(), + expectedUpdateFrequency="sometimes", # not approved + userSpecifiedSources=(), + image="cover.bmp", # disallowed extension + resources=(), # empty resource list + ) + + errors = packager.validate_metadata(bad) + fields = {e.field for e in errors} + assert "title" in fields + assert "subtitle" in fields + assert "id" in fields + assert "licenses" in fields + assert "expectedUpdateFrequency" in fields + assert "image" in fields + assert "resources" in fields + + +def test_validate_id_requires_user_slash_slug_format() -> None: + """Slug-only ids are rejected — Kaggle's schema is ``user/slug``. + + Mirrors the design call recorded in the PR write-up: PR 7.2's + publish script should not have to splice in a username at upload + time. + """ + + slug_only = packager._validate_id("leadforge-lead-scoring-v1") + assert any(e.field == "id" and "missing 'user/'" in e.message for e in slug_only) + + well_formed = packager._validate_id("leadforge/leadforge-lead-scoring-v1") + assert well_formed == [] + + invalid_slug = packager._validate_id("leadforge/Bad Slug!") + assert any(e.field == "id (slug)" for e in invalid_slug) + + +def test_validate_metadata_flags_schema_fields_without_name_or_type() -> None: + """Schema fields must declare both name and type to satisfy G11.1.""" + + bad = _minimal_metadata() + broken = packager.Resource( + path="x.csv", + description="x", + schema=packager.ResourceSchema( + fields=(packager.FieldDescriptor(name="", type="string"),), + ), + ) + bad = packager.DatasetMetadata(**{**bad.__dict__, "resources": (broken,)}) + errors = packager.validate_metadata(bad) + assert any("name and type" in e.message for e in errors) + + +# --------------------------------------------------------------------------- +# Cover image (G11.2) +# --------------------------------------------------------------------------- + + +@pytest.mark.skipif(not _COMMITTED_COVER.exists(), reason="committed cover image not present") +def test_validate_cover_image_passes_for_committed_asset() -> None: + assert packager.validate_cover_image(_COMMITTED_COVER) == [] + + +def test_validate_cover_image_rejects_too_small_image(tmp_path: Path) -> None: + tiny = tmp_path / "tiny.png" + Image.new("RGB", (100, 50), (0, 0, 0)).save(tiny) + errors = packager.validate_cover_image(tiny) + assert errors + assert errors[0].field == "cover_image" + assert "below Kaggle minimum" in errors[0].message + + +def test_validate_cover_image_reports_missing_file(tmp_path: Path) -> None: + errors = packager.validate_cover_image(tmp_path / "no-such.png") + assert errors + assert errors[0].field == "cover_image" + + +# --------------------------------------------------------------------------- +# Schema fields — derive-from-source contract +# +# The flat-CSV schema is built by iterating the CSV header, so column- +# order parity with the CSV is a construction-time invariant. The +# parquet schema comes straight from ``pq.read_schema``, same story. +# Re-checking either via a separate validator is tautological — the +# real coverage is the audit-artifact-sync test below +# (``test_committed_kaggle_metadata_matches_fresh_regeneration``), +# which fails the moment any tier's CSV header or parquet schema +# drifts without a matching metadata regeneration. +# --------------------------------------------------------------------------- + + +# --------------------------------------------------------------------------- +# README rewriting + description content +# --------------------------------------------------------------------------- + + +@pytest.mark.skipif(not _RELEASE_BUNDLES_PRESENT, reason="release bundles not present") +def test_kaggle_readme_text_rewrites_links_and_tree_diagram() -> None: + readme = (_RELEASE_DIR / "README.md").read_text(encoding="utf-8") + rewritten = packager._kaggle_readme_text(readme) + + # Source-repo tree → upload tree. + assert "intermediate_instructor/" not in rewritten + assert "notebooks/01_baseline_lead_scoring.ipynb" not in rewritten + assert "dataset-metadata.json # Kaggle" in rewritten + + # Relative ../ links rewritten to GitHub blob URLs. + assert "](../" not in rewritten + assert packager.GITHUB_BLOB_BASE in rewritten + + # The validation-report link (which lives under release/, not under + # the upload dir) must point at GitHub. + assert "](validation/validation_report.md)" not in rewritten + assert f"]({packager.GITHUB_BLOB_BASE}/release/validation/validation_report.md)" in rewritten + + +@pytest.mark.skipif(not _RELEASE_BUNDLES_PRESENT, reason="release bundles not present") +def test_kaggle_tree_block_is_present_in_release_readme() -> None: + """Silent-failure guard. + + ``_kaggle_readme_text`` substitutes ``KAGGLE_TREE_BLOCK`` → + ``KAGGLE_UPLOAD_TREE_BLOCK`` via plain string replace. If anyone + tweaks the README's tree diagram by even one whitespace + character, the substitution silently no-ops and the published + Kaggle dataset card carries the source-repo tree. This guard + fires loudly the moment the constants drift apart. + """ + + readme = (_RELEASE_DIR / "README.md").read_text(encoding="utf-8") + assert packager.KAGGLE_TREE_BLOCK in readme, ( + "scripts/package_kaggle_release.py KAGGLE_TREE_BLOCK no longer matches " + "the tree diagram in release/README.md — reconcile the two before " + "the next release-metadata regeneration." + ) + + +@pytest.mark.skipif(not _RELEASE_BUNDLES_PRESENT, reason="release bundles not present") +def test_validate_readme_substitution_flags_drift(tmp_path: Path) -> None: + """``_validate_readme_substitution`` is wired into the run-time + validator, not just the static guard above.""" + + fake_release = tmp_path / "release" + fake_release.mkdir() + (fake_release / "README.md").write_text("# Some unrelated README\n", encoding="utf-8") + errors = packager._validate_readme_substitution(fake_release) + assert errors + assert errors[0].field == "release/README.md" + assert "KAGGLE_TREE_BLOCK" in errors[0].message + + # Sanity: the real release README does NOT trigger the validator. + assert packager._validate_readme_substitution(_RELEASE_DIR) == [] + + +@pytest.mark.skipif(not _RELEASE_BUNDLES_PRESENT, reason="release bundles not present") +def test_assembled_upload_dir_writes_rewritten_readme_copy(tmp_path: Path) -> None: + """The README inside the upload tree is a real file with the + rewrites — Kaggle reads this verbatim on the dataset page.""" + + kaggle_dir = tmp_path / "kaggle" + cover_image = tmp_path / "cover.png" + _make_valid_cover(cover_image) + packager.run_packager(_RELEASE_DIR, kaggle_dir=kaggle_dir, cover_image=cover_image) + + kaggle_readme = kaggle_dir / "README.md" + assert kaggle_readme.is_file() + assert not kaggle_readme.is_symlink() + contents = kaggle_readme.read_text(encoding="utf-8") + assert "](../" not in contents + assert packager.GITHUB_BLOB_BASE in contents + + +@pytest.mark.skipif(not _RELEASE_BUNDLES_PRESENT, reason="release bundles not present") +def test_assembled_upload_dir_resolves_every_declared_resource(tmp_path: Path) -> None: + """Every ``resources[].path`` declared in the metadata must resolve + to a real file (not a symlink, not a missing path) under the + assembled upload directory. Kaggle's CLI walks the directory at + upload time; a declared resource that doesn't materialise is a + silent upload-time failure. + """ + + kaggle_dir = tmp_path / "kaggle" + cover_image = tmp_path / "cover.png" + _make_valid_cover(cover_image) + outcome = packager.run_packager(_RELEASE_DIR, kaggle_dir=kaggle_dir, cover_image=cover_image) + + # Every resource path resolves to a real file. + for resource in outcome.metadata.resources: + target = kaggle_dir / resource.path + assert target.is_file(), f"declared resource missing from upload tree: {resource.path}" + assert not target.is_symlink(), ( + f"declared resource is a symlink, not a real file: {resource.path} — " + f"Kaggle's CLI may skip symlinked entries on upload" + ) + + # Top-level required artefacts. + assert (kaggle_dir / "dataset-metadata.json").is_file() + assert (kaggle_dir / "README.md").is_file() + assert (kaggle_dir / cover_image.name).is_file() + assert not (kaggle_dir / cover_image.name).is_symlink() + + +# --------------------------------------------------------------------------- +# Upload-dir assembly safety +# --------------------------------------------------------------------------- + + +def test_assemble_upload_dir_rejects_unsafe_kaggle_dir(tmp_path: Path) -> None: + """Refuse to assemble into the release dir or its parent.""" + + fake_release = tmp_path / "release" + fake_release.mkdir() + with pytest.raises(ValueError, match="unsafe"): + packager.assemble_upload_dir(fake_release, fake_release) + with pytest.raises(ValueError, match="unsafe"): + packager.assemble_upload_dir(fake_release, fake_release.parent) + + +def test_assemble_upload_dir_rejects_kaggle_dir_equal_to_cwd( + tmp_path: Path, monkeypatch: pytest.MonkeyPatch +) -> None: + """Refuse to assemble into the current working directory. + + A user passing ``--kaggle-dir .`` (or running from inside the + intended ``kaggle_dir``) would otherwise rmtree-then-recopy + arbitrary cwd contents. This is the most-likely-to-trigger + safety case and was missing test coverage in the initial PR. + """ + + fake_release = tmp_path / "release" + fake_release.mkdir() + cwd = tmp_path / "workdir" + cwd.mkdir() + monkeypatch.chdir(cwd) + with pytest.raises(ValueError, match="unsafe"): + packager.assemble_upload_dir(fake_release, cwd) + + +def test_assemble_upload_dir_idempotent_against_existing_tree(tmp_path: Path) -> None: + """Re-running the assembly over an already-populated upload tree + succeeds — the previous PR's symlink-vs-file confusion is no + longer possible because both passes call the same copy helpers.""" + + if not _RELEASE_BUNDLES_PRESENT: + pytest.skip("release bundles not present") + + kaggle_dir = tmp_path / "kaggle" + cover_image = tmp_path / "cover.png" + _make_valid_cover(cover_image) + packager.run_packager(_RELEASE_DIR, kaggle_dir=kaggle_dir, cover_image=cover_image) + # Second pass against the same kaggle_dir. + outcome = packager.run_packager(_RELEASE_DIR, kaggle_dir=kaggle_dir, cover_image=cover_image) + assert outcome.errors == () + for resource in outcome.metadata.resources: + assert (kaggle_dir / resource.path).is_file() + + +# --------------------------------------------------------------------------- +# CLI driver — error paths +# --------------------------------------------------------------------------- + + +def test_main_reports_missing_release_dir( + tmp_path: Path, capsys: pytest.CaptureFixture[str] +) -> None: + rc = packager.main( + [ + "--release-dir", + str(tmp_path / "missing"), + "--kaggle-dir", + str(tmp_path / "kaggle"), + "--cover-image", + str(tmp_path / "cover.png"), + "--dry-run", + ] + ) + captured = capsys.readouterr() + assert rc == 2 + assert "release directory not found" in captured.err + + +# --------------------------------------------------------------------------- +# Determinism + sync with committed artefact +# --------------------------------------------------------------------------- + + +@pytest.mark.skipif(not _RELEASE_BUNDLES_PRESENT, reason="release bundles not present") +def test_run_packager_metadata_is_byte_deterministic(tmp_path: Path) -> None: + """Two back-to-back runs against the committed bundles must + produce byte-identical metadata files.""" + + cover = tmp_path / "cover.png" + _make_valid_cover(cover) + + out_a = tmp_path / "a" + out_b = tmp_path / "b" + packager.run_packager(_RELEASE_DIR, kaggle_dir=out_a, cover_image=cover, dry_run=True) + packager.run_packager(_RELEASE_DIR, kaggle_dir=out_b, cover_image=cover, dry_run=True) + assert (out_a / "dataset-metadata.json").read_bytes() == ( + out_b / "dataset-metadata.json" + ).read_bytes() + + +def test_render_metadata_emits_literal_unicode_not_escapes() -> None: + """``ensure_ascii=False`` keeps em-dashes, ``×``, smart quotes etc. + rendered literally so the committed JSON stays diffable.""" + + metadata = _minimal_metadata() + rendered = packager.render_metadata_json( + packager.DatasetMetadata(**{**metadata.__dict__, "description": "a — b × c"}) + ) + assert "a — b × c" in rendered + assert "\\u2014" not in rendered + assert "\\u00d7" not in rendered + + +def test_render_metadata_keywords_are_sorted_at_render_time() -> None: + """Keywords are sorted in the rendered JSON regardless of the + order they were declared on the metadata object — locks the + determinism contract independent of the ``DEFAULT_KEYWORDS`` + constant ordering.""" + + base = _minimal_metadata() + shuffled = packager.DatasetMetadata( + **{**base.__dict__, "keywords": ("zebra", "alpha", "mango")}, + ) + parsed = json.loads(packager.render_metadata_json(shuffled)) + assert parsed["keywords"] == ["alpha", "mango", "zebra"] + + +# --------------------------------------------------------------------------- +# Kaggle CLI shape validation (G11.3) — gated, opt-in +# --------------------------------------------------------------------------- + + +@pytest.mark.skipif(not _RELEASE_BUNDLES_PRESENT, reason="release bundles not present") +def test_kaggle_cli_accepts_assembled_metadata(tmp_path: Path) -> None: + """G11.3 — feed the assembled tree to the actual Kaggle metadata + validator and assert it accepts the shape. + + Skipped unless the optional ``kaggle`` package is installed + (``pip install -e '.[publish]'``); we deliberately don't make + that a hard dependency because the kaggle SDK pulls in a long + transitive tail. The Kaggle SDK exposes a metadata validator + via ``kaggle.api.validate_dataset_metadata`` (path varies by + version); we look it up dynamically and skip if absent rather + than hard-couple to one CLI version. + """ + + kaggle = pytest.importorskip("kaggle", reason="kaggle SDK not installed") + kaggle_dir = tmp_path / "kaggle" + cover = tmp_path / "cover.png" + _make_valid_cover(cover) + packager.run_packager(_RELEASE_DIR, kaggle_dir=kaggle_dir, cover_image=cover) + + # Search for a metadata-validator entry point on the kaggle API. + api = kaggle.api + candidates = [ + getattr(api, name, None) + for name in ( + "validate_dataset_metadata", + "_validate_dataset_metadata", + "process_resources", + ) + ] + validator = next((c for c in candidates if callable(c)), None) + if validator is None: + pytest.skip("no Kaggle metadata-validator entry point found on the installed SDK") + + # Different Kaggle SDK versions expose different signatures; try + # the most common shapes. We're treating "no exception raised" + # as acceptance. + try: + validator(str(kaggle_dir)) + except TypeError: + validator(str(kaggle_dir / "dataset-metadata.json")) + + +@pytest.mark.skipif( + not (_RELEASE_BUNDLES_PRESENT and _COMMITTED_METADATA.exists()), + reason="release bundles or committed metadata missing", +) +def test_committed_kaggle_metadata_matches_fresh_regeneration(tmp_path: Path) -> None: + """A fresh metadata regeneration must match the committed + ``release/kaggle/dataset-metadata.json`` byte-for-byte AND have + a non-degenerate description / id / image. + + If this fails, ``release/`` drifted without re-running + ``scripts/package_kaggle_release.py``. Regenerate via that + script from the repo root and commit the new metadata alongside + the bundle change. + """ + + cover = _COMMITTED_COVER if _COMMITTED_COVER.exists() else tmp_path / "cover.png" + if not _COMMITTED_COVER.exists(): + _make_valid_cover(cover) + + fresh_dir = tmp_path / "kaggle" + packager.run_packager(_RELEASE_DIR, kaggle_dir=fresh_dir, cover_image=cover, dry_run=True) + fresh_bytes = (fresh_dir / "dataset-metadata.json").read_bytes() + committed_bytes = _COMMITTED_METADATA.read_bytes() + assert fresh_bytes == committed_bytes + + # Positive content assertions — guard against the failure mode + # where a code change accidentally produces empty / minimal + # content that we then re-commit, leaving the byte-equality + # check passing on broken output. + parsed = json.loads(fresh_bytes) + assert parsed["id"] == f"{packager.DEFAULT_USER_SLUG}/{packager.DEFAULT_DATASET_SLUG}" + assert parsed["image"] == "dataset-cover-image.png" + description = parsed["description"] + # The description should carry the rewritten dataset card, not be + # empty or stub content. + assert "What's inside" in description + assert "Why lead scoring matters" in description + assert "Known limitations" in description + # Rewrites fired (no source-tree leaks, no broken relative links). + assert "intermediate_instructor/" not in description + assert "](../" not in description + assert "github.com/leadforge-dev/leadforge/blob/main" in description + # Resources are non-trivial. + assert len(parsed["resources"]) >= 30 + # Every flat CSV has a schema with the canonical 33-column shape. + flat_csvs = [r for r in parsed["resources"] if r["path"].endswith("/lead_scoring.csv")] + assert len(flat_csvs) == len(packager.DEFAULT_TIERS) + for r in flat_csvs: + assert r["schema"]["fields"][0]["name"] == "split" + assert r["schema"]["fields"][-1]["name"] == "converted_within_90_days"