Merged
8 changes: 6 additions & 2 deletions .agent-plan.md
@@ -50,8 +50,12 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
- [x] PR 5.1: `scripts/generate_cover_image.py` (new) — deterministic Pillow + DejaVu Sans (bundled with matplotlib) renderer producing `release/dataset-cover-image.png` at 1280×640 (well above the 560×280 minimum, 2:1 aspect for Kaggle's header crop). Three-tier card design surfacing the cross-seed median conversion rate + LR AUC for each tier, pinned from `release/validation/validation_report.md`. Byte-identical re-runs guarded by `tests/scripts/test_generate_cover_image.py`.
- [x] PR 5.1: Upload-dir assembly under `release/kaggle/` uses relative symlinks for the heavy bundle directories + cover image + LICENSE, plus a real file copy for `README.md` (rewritten on the way in so its `../` links and tree diagram render correctly on the Kaggle dataset page). `_validate_kaggle_dir_safe` refuses to assemble into `cwd` / `release_dir` / its parent / the filesystem anchor. `release/kaggle/*` is gitignored except for `dataset-metadata.json` itself — only the metadata is committed; the upload tree is regenerated on demand.
- [x] PR 5.1: 19 new tests (`tests/scripts/test_package_kaggle_release.py` × 15, `tests/scripts/test_generate_cover_image.py` × 4): every Kaggle field constraint, schema field order parity for CSV + parquet, README rewriting (tree + `../` + validation report links), unsafe-kaggle-dir rejection, CLI rc=2 on missing release dir, byte-determinism (audit-artifact-sync), and committed-metadata-matches-fresh-regeneration sync check. 1194/1194 tests pass; ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (this PR doesn't touch the bundle shape).
- [ ] PR 5.2: `scripts/package_hf_release.py` → `release/huggingface/README.md` with YAML configs/default/pretty_name/tags
- [ ] PR 5.2: Local `load_dataset()` smoke test; Kaggle dry-run package validation
- [x] PR 5.2: `scripts/package_hf_release.py` (new) — Hugging Face release packager. Reads each public tier's `manifest.json` under `release/`, emits `release/huggingface/README.md` with YAML frontmatter satisfying G12.1 (`pretty_name`, `license: mit`, `language: [en]`, `task_categories: [tabular-classification]`, `size_categories: [1K<n<10K]` per tier, sorted `tags: [b2b, crm, datasets, lead-scoring, pandas, synthetic-data, tabular]`, three `configs` blocks pointing at the parquet task splits in the assembled upload tree). G12.2 wired in: `intermediate` is the only config with `default: true`, locked by `validate_card`. Body inlines the rewritten `release/README.md` (PR 4.1) with HF-specific link rewrites: source-repo tree diagram → upload-tree diagram, `](../foo)` → GitHub blob URL, `](validation/validation_report.md)` → GitHub blob URL. Hand-rolled YAML renderer (PyYAML's default flow style collapses `configs[]` onto one line, which the HF viewer rejects). `--variant=instructor` packages the companion repo (G12.4): `release/intermediate_instructor/` flattens to `intermediate/` in the upload tree, single-config card, separate `release/huggingface-instructor/` output dir. CLI: `--release-dir`, `--huggingface-dir`, `--variant`, `--default-config`, `--owner`, `--dataset-slug`, `--cover-image`, `--dry-run`. Exit codes: 0 pass / 1 validation failure / 2 pre-flight error.
- [x] PR 5.2: `scripts/_release_common.py` (new) — extracted shared release-packaging primitives so the Kaggle + HF packagers share one source of truth: `GITHUB_BLOB_BASE`, the `](../foo)` regex, the validation-report blob-URL substitution, `validate_cover_image` (dimension floor; HF reuses Kaggle's 560×280), and `validate_upload_dir_safe` (refuses to assemble into `cwd` / `release_dir` / its parent / the filesystem anchor). `FieldDescriptor` / `ResourceSchema` / dtype maps are intentionally NOT extracted — HF infers schema from parquet via `load_dataset` and does not need a Frictionless declaration. `scripts/package_kaggle_release.py` refactored to import these primitives; PR 5.1's external behaviour and committed `release/kaggle/dataset-metadata.json` are byte-stable across the refactor.
- [x] PR 5.2: Upload-dir assembly under `release/huggingface/` and `release/huggingface-instructor/` uses real-file copies (not symlinks; mirrors PR 5.1 — the `datasets` library walks the upload tree and silently skips broken symlinks in some versions). Cover image at `release/dataset-cover-image.png` is reused for the HF thumbnail. `release/huggingface/*` and `release/huggingface-instructor/*` are gitignored except for `README.md` — only the dataset card is committed; the upload tree is reassembled on demand.
- [x] PR 5.2: `release/HF_DATASET_CARD.md` (legacy single-file stub) deleted — superseded by the generated `release/huggingface/README.md`.
- [x] PR 5.2 (Copilot review pass): folded Copilot's two real findings on the self-review revision back in before requesting human review. (COPILOT-1) `validate_upload_dir_safe` was only called inside `assemble_upload_dir`, which `--dry-run` skips — a user passing `--huggingface-dir release` (or `.`, etc.) in dry-run mode would write a README into the unsafe path before the safety net fired. Hoisted the check into `run_packager` (both packagers) so it runs BEFORE any mkdir or write; the inner `assemble_upload_dir` call stays as defence-in-depth for callers that bypass `run_packager`. (COPILOT-2) Cover-image path resolution was inconsistent: `validate_cover_image` used `cover_image` as passed, while `assemble_upload_dir` did its own ``release_dir / cover_image.name`` fallback — diverged for bare-basename inputs (false validation failures) and for two-paths-sharing-a-basename inputs (assembler could shadow the explicit path). Added `resolve_cover_image_path` to `scripts/_release_common.py` (explicit-wins, with release-dir fallback for bare basenames); `run_packager` calls it once and threads the resolved path through validation, metadata's `image` field, and assembly so every consumer agrees. (COPILOT-3) outdated docstring: `assemble_upload_dir` no longer claims to write the README it doesn't write — already addressed by self-review fix #8 in commit f2fc4a2; resolved as already treated. Net: 1232/1232 tests pass + 5 gated skips (4 + new resolver coverage in `tests/scripts/test_release_common.py`); ruff + mypy clean; hash determinism PASS 67/67; leakage probes 0/3 reconstruct on every tier; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5.
- [x] PR 5.2 (self-review pass): brutal review of the first revision caught real bugs the reviewer would otherwise have to call out. Fixes folded into the PR before review: (#1) `run_packager` validate→write order — both packagers were writing the README/metadata even when validation failed; the validation gate now early-returns with `errors` populated and zero artifacts on disk; new tests exercise the no-write path on both sides. (#2) Instructor README was inlining the public 3-tier README for a 1-tier dataset; replaced with a dedicated `INSTRUCTOR_BODY` constant that opens by linking to the public dataset, describes only the instructor-specific additions (full-horizon tables, hidden DAG, latent registry, mechanism summary), and uses the single-tier tree block. (#3) `validate_upload_dir_safe` now also rejects strict descendants of `release_dir` (e.g. `--huggingface-dir release/intro` would otherwise rmtree the intro bundle); allow-list keeps the canonical `release/{kaggle,huggingface,huggingface-instructor}` direct-children. (#4) `[publish]` extra in `pyproject.toml` (`datasets>=2.14`, `kaggle>=1.6`) makes the gated `load_dataset()` / Kaggle-CLI tests installable in a single command — closes the "G12.3/G12.4 untested in CI" gap to a one-line install. (#5) Shared-primitives extraction finished: `SOURCE_TREE_BLOCK`, `validate_readme_substitution`, `replace_file`, `replace_dir`, `load_manifest` all moved to `scripts/_release_common.py`; both packagers reduced to imports. (#6) Hand-rolled YAML renderer (60 lines + brittle quoting heuristic) replaced with `yaml.safe_dump` + a 4-line `_IndentedDumper` subclass that forces indent-2 on top-level sequences. (#7) Dead `--owner` / `--dataset-slug` CLI flags removed (PR 7.2 will add them when actually needed). (#8) `assemble_upload_dir` now takes `rendered_readme` as a parameter and writes it itself; the public name no longer lies about producing a complete tree. 
(#9) `build_config_for_tier` made pure (no I/O); `_assert_tier_dir_exists` does the cheap manifest-stat preflight. (#10) `--default-config` with `--variant=instructor` now errors instead of silently ignoring. (#11) Instructor tree-diagram drops the hardcoded "9 tables" claim. (#13–#16) Visual cleanups (duplicate divider, ruff-split imports, `COVER_IMAGE_FILENAME`-vs-`Path.name` redundancy, speculative comment about HF split rename). (#17) Test cruft removed (unused `tmp_path`, dead `tag_lines`); em-dash YAML round-trip parametrised for the instructor `pretty_name`. Net: 1223/1223 tests pass + 5 gated skips (4 `datasets`-SDK round-trip + 1 Kaggle `kaggle`-SDK from PR 5.1); ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (this PR doesn't touch the bundle shape).
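
The upload-directory safety rules (self-review #3, COPILOT-1) and the cover-image resolution rule (COPILOT-2) described above can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/_release_common.py`: the function names `unsafe_upload_dir_reasons` and `resolve_cover_image`, their signatures, the list-of-reasons return style, and the `ALLOWED_CHILDREN` constant are all assumptions made for the example. It also assumes a bare basename is always looked up under the release directory.

```python
from pathlib import Path

# Hypothetical allow-list: the canonical direct children of release_dir
# named in the plan entry above.
ALLOWED_CHILDREN = frozenset({"kaggle", "huggingface", "huggingface-instructor"})


def unsafe_upload_dir_reasons(upload_dir: Path, release_dir: Path, cwd: Path) -> list[str]:
    """Return why upload_dir is unsafe to wipe and reassemble ([] means safe)."""
    target = upload_dir.resolve()
    release = release_dir.resolve()
    reasons = []
    if target == cwd.resolve():
        reasons.append("is the current working directory")
    if target == release:
        reasons.append("is the release directory")
    if target == release.parent:
        reasons.append("is the parent of the release directory")
    if target == Path(target.anchor):
        reasons.append("is the filesystem anchor")
    # Strict descendants of release_dir are rejected unless they are one of
    # the canonical direct children.
    canonical = target.parent == release and target.name in ALLOWED_CHILDREN
    if release in target.parents and not canonical:
        reasons.append("is a non-canonical descendant of the release directory")
    return reasons


def resolve_cover_image(cover_image: Path, release_dir: Path) -> Path:
    """Explicit paths win; a bare basename falls back to release_dir."""
    has_directory = cover_image.is_absolute() or cover_image.parent != Path(".")
    return cover_image if has_directory else release_dir / cover_image.name
```

Under these assumptions, an entry point would call the safety check before any `mkdir` or write (so `--dry-run` is covered too), resolve the cover-image path once, and pass that single resolved path to validation, metadata, and assembly.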

### Phase 6 — Notebook sequence + adversarial framing
- [ ] `release/notebooks/{02_relational_feature_engineering,03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb`
9 changes: 9 additions & 0 deletions .gitignore
@@ -224,3 +224,12 @@ release/_release_quality/
# scripts/package_kaggle_release.py from release/{intro,intermediate,advanced}/.
release/kaggle/*
!release/kaggle/dataset-metadata.json

# Generated HuggingFace upload tree (PR 5.2) — only README.md is
# committed; the rest is reassembled on demand via
# scripts/package_hf_release.py. Same pattern for the instructor
# companion under release/huggingface-instructor/.
release/huggingface/*
!release/huggingface/README.md
release/huggingface-instructor/*
!release/huggingface-instructor/README.md
9 changes: 9 additions & 0 deletions pyproject.toml
@@ -45,6 +45,15 @@ scripts = [
"scikit-learn>=1.3",
"matplotlib>=3.7",
]
# Optional dependencies for the platform release packagers. Installing
# this extra (``pip install -e ".[publish]"``) enables the gated
# ``load_dataset()`` / Kaggle-CLI smoke tests that verify G11.3 (Kaggle
# package) and G12.3 / G12.4 (HF load_dataset round-trip) without
# pulling the heavy SDKs into the default dev install.
publish = [
"datasets>=2.14",
"kaggle>=1.6",
]

[project.scripts]
leadforge = "leadforge.cli.main:app"
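
The comment on the `publish` extra describes smoke tests that only run when the optional SDKs are installed. One way such gating could look is sketched below. The diff does not show the project's actual mechanism, so the `has_module` helper, the test class, and the choice of stdlib `unittest` (used here so the sketch runs without extra dependencies) are assumptions.

```python
import importlib.util
import unittest


def has_module(name: str) -> bool:
    """Report whether a top-level module is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


# Hypothetical gated test: skipped unless the optional "datasets" SDK is
# installed, e.g. via `pip install -e ".[publish]"`.
@unittest.skipUnless(has_module("datasets"), 'requires the "publish" extra')
class LoadDatasetSmokeTest(unittest.TestCase):
    def test_sdk_exposes_load_dataset(self) -> None:
        import datasets  # imported lazily so collection works without the SDK

        self.assertTrue(callable(datasets.load_dataset))
```

With this shape, a default dev install reports the test as skipped rather than failed, and installing the extra turns it on without further configuration.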
110 changes: 0 additions & 110 deletions release/HF_DATASET_CARD.md

This file was deleted.
