Merged
10 changes: 6 additions & 4 deletions .agent-plan.md
@@ -46,10 +46,12 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
- [x] PR 4.1: `release/README.md` (substantial rewrite) — release-grade dataset card per Datasheets-for-Datasets / Data Cards Playbook checklist (G10.1). New sections: macro framing paragraph (2024–2026 SaaS context, recommendation #19), simulation simplifications (modelled / approximate / not modelled, per chatgpt v2 §2.6), calibration documentation linking to `release/validation/validation_report.md`, public-vs-instructor redaction policy with concrete column lists citing `BANNED_LEAD_COLUMNS` / `BANNED_OPP_COLUMNS` / `BANNED_TABLES` / `SNAPSHOT_FILTERED_TABLES` from `leadforge/validation/leakage_probes.py`, intended-use vs out-of-scope-use, known limitations (G7.4.4 GBM−LR sign finding, weak channel signal from the Phase 4 audit, flat AUC across tiers, small cohort-shift gap), composition section per Datasheets format, adversarial-framing pointer (placeholder link to `docs/release/break_me_guide.md` that lands in PR 6.3), and a maintenance plan. Every realism / calibration / difficulty claim in the card is anchored to `validation_report.md` per G10.6. `BUNDLE_SCHEMA_VERSION` unchanged at 5 (documentation-only PR); 1167/1167 tests pass; ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0.

### Phase 5 — Platform packaging
-- [ ] `scripts/package_kaggle_release.py` → `release/kaggle/dataset-metadata.json`
-- [ ] `scripts/package_hf_release.py` → `release/huggingface/README.md` with YAML configs/default/pretty_name/tags
-- [ ] `release/dataset-cover-image.png` (≥560×280)
-- [ ] Local `load_dataset()` smoke test; Kaggle dry-run package validation
+- [x] PR 5.1: `scripts/package_kaggle_release.py` (new) — Kaggle release packager. Reads each public tier's `manifest.json` + `feature_dictionary.csv` + flat CSV header under `release/`, emits `release/kaggle/dataset-metadata.json` validated against G11.1 (title 6-50 chars, subtitle 20-80 chars, slug 3-50 chars, single MIT license, `expectedUpdateFrequency=never`, image filename, `resources[].schema.fields` in column order for every tabular resource). Schema fields cover both flat CSVs (driven by `feature_dictionary.csv`) and parquet files (driven by `pyarrow.parquet.read_schema`). The metadata's `description` field inlines `release/README.md` with three Kaggle-specific rewrites: source-repo tree diagram → upload-tree diagram, `](../foo)` → GitHub blob URL via regex, `](validation/validation_report.md)` → GitHub blob URL. Default `id` follows Kaggle's actual `<owner>/<slug>` schema (`leadforge/leadforge-lead-scoring-v1`), so PR 7.2's publish script does not have to splice in a username at upload time. CLI: `--release-dir`, `--kaggle-dir`, `--tier`, `--user-slug`, `--dataset-slug`, `--cover-image`, `--dry-run`, `--print`. Exit codes: 0 pass / 1 validation failure / 2 pre-flight error.
+- [x] PR 5.1: `scripts/generate_cover_image.py` (new) — deterministic Pillow + DejaVu Sans (bundled with matplotlib) renderer producing `release/dataset-cover-image.png` at 1280×640 (well above the 560×280 minimum, 2:1 aspect for Kaggle's header crop). Three-tier card design surfacing the cross-seed median conversion rate + LR AUC for each tier, pinned from `release/validation/validation_report.md`. Byte-identical re-runs guarded by `tests/scripts/test_generate_cover_image.py`.
+- [x] PR 5.1: Upload-dir assembly under `release/kaggle/` uses relative symlinks for the heavy bundle directories + cover image + LICENSE, plus a real file copy for `README.md` (rewritten on the way in so its `../` links and tree diagram render correctly on the Kaggle dataset page). `_validate_kaggle_dir_safe` refuses to assemble into `cwd` / `release_dir` / its parent / the filesystem anchor. `release/kaggle/*` is gitignored except for `dataset-metadata.json` itself — only the metadata is committed; the upload tree is regenerated on demand.
+- [x] PR 5.1: 19 new tests (`tests/scripts/test_package_kaggle_release.py` × 15, `tests/scripts/test_generate_cover_image.py` × 4): every Kaggle field constraint, schema field order parity for CSV + parquet, README rewriting (tree + `../` + validation report links), unsafe-kaggle-dir rejection, CLI rc=2 on missing release dir, byte-determinism (audit-artifact-sync), and committed-metadata-matches-fresh-regeneration sync check. 1194/1194 tests pass; ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (this PR doesn't touch the bundle shape).
+- [ ] PR 5.2: `scripts/package_hf_release.py` → `release/huggingface/README.md` with YAML configs/default/pretty_name/tags
+- [ ] PR 5.2: Local `load_dataset()` smoke test; Kaggle dry-run package validation
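
Reviewer sketch (not part of the diff): the G11.1 field constraints listed in the PR 5.1 entry can be illustrated with a small validator. This is a minimal sketch under stated assumptions, not the code in `scripts/package_kaggle_release.py`. The function name `validate_kaggle_metadata` is hypothetical, and the key layout (`licenses[0]["name"]`, `id` as `<owner>/<slug>`) is assumed from the description in the entry rather than taken from the script.

```python
from __future__ import annotations


def validate_kaggle_metadata(meta: dict) -> list[str]:
    """Return constraint violations for a metadata dict (empty list means pass).

    Hypothetical helper: the limits mirror the ones quoted in the plan entry
    (title 6-50, subtitle 20-80, slug 3-50, single MIT license,
    expectedUpdateFrequency=never). The key names are assumptions.
    """
    errors: list[str] = []

    title = meta.get("title", "")
    if not 6 <= len(title) <= 50:
        errors.append(f"title must be 6-50 chars, got {len(title)}")

    subtitle = meta.get("subtitle", "")
    if not 20 <= len(subtitle) <= 80:
        errors.append(f"subtitle must be 20-80 chars, got {len(subtitle)}")

    # The id is expected in "<owner>/<slug>" form.
    owner, sep, slug = meta.get("id", "").partition("/")
    if not (owner and sep and 3 <= len(slug) <= 50):
        errors.append("id must be '<owner>/<slug>' with a 3-50 char slug")

    licenses = meta.get("licenses", [])
    if len(licenses) != 1 or licenses[0].get("name") != "MIT":
        errors.append("exactly one license named 'MIT' is required")

    if meta.get("expectedUpdateFrequency") != "never":
        errors.append("expectedUpdateFrequency must be 'never'")

    return errors


example = {
    "title": "LeadForge Lead Scoring",
    "subtitle": "Synthetic CRM lead-scoring dataset family",
    "id": "leadforge/leadforge-lead-scoring-v1",
    "licenses": [{"name": "MIT"}],
    "expectedUpdateFrequency": "never",
}
problems = validate_kaggle_metadata(example)
```

Returning a list of messages instead of raising on the first failure lets a caller report every violation at once and then map a non-empty list to the "1 validation failure" exit code mentioned in the entry.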

### Phase 6 — Notebook sequence + adversarial framing
- [ ] `release/notebooks/{02_relational_feature_engineering,03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb`
6 changes: 6 additions & 0 deletions .gitignore
@@ -218,3 +218,9 @@ release/intermediate_instructor/
release/LICENSE
release/_determinism/
release/_release_quality/

+# Generated Kaggle upload tree (PR 5.1) — only dataset-metadata.json is
+# committed; the rest is reassembled on demand via
+# scripts/package_kaggle_release.py from release/{intro,intermediate,advanced}/.
+release/kaggle/*
+!release/kaggle/dataset-metadata.json
Binary file added release/dataset-cover-image.png
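
Reviewer sketch (not part of the diff): because the cover image is binary and cannot be inspected in the diff view, its size against the 560×280 minimum quoted in the plan can be checked from the PNG header alone. The following is a minimal standard-library sketch. The function names `png_dimensions` and `meets_cover_minimum` are hypothetical and do not come from `scripts/generate_cover_image.py`.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> tuple[int, int]:
    """Read (width, height) from the IHDR chunk that starts every PNG file."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR",
    # then width and height as big-endian unsigned 32-bit integers.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG with a leading IHDR chunk")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def meets_cover_minimum(data: bytes, min_width: int = 560, min_height: int = 280) -> bool:
    """Check the image against the minimum size quoted in the plan (560x280)."""
    width, height = png_dimensions(data)
    return width >= min_width and height >= min_height


def fake_png_header(width: int, height: int) -> bytes:
    """Build just enough of a PNG (signature + IHDR prefix) for the checks above."""
    return PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", width, height)
```

Against the committed file this could be used as `meets_cover_minimum(pathlib.Path("release/dataset-cover-image.png").read_bytes())`, assuming the path from the diff exists in the working tree.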