PR 5.2: HuggingFace release packager + load_dataset smoke test#72
Merged
Conversation
Add `scripts/package_hf_release.py` to generate `release/huggingface/README.md`
with G12.1-compliant YAML frontmatter (pretty_name, license, language,
task_categories, size_categories, tags, three configs with `default: true`
on intermediate per G12.2), inlining the rewritten `release/README.md`
body with HF-specific link rewrites. `--variant=instructor` packages the
companion repo (G12.4) from `release/intermediate_instructor/` into a
separate `release/huggingface-instructor/` upload tree. G12.3 covered
by a parametrised `load_dataset()` smoke test gated on the optional
`datasets` SDK.
Extract shared release-packaging primitives (link rewriter, dir-safety
guard, cover-image validator) into `scripts/_release_common.py`; refactor
the Kaggle packager to import them. `release/kaggle/dataset-metadata.json`
is byte-stable across the refactor.
Delete the legacy `release/HF_DATASET_CARD.md` stub — superseded by the
generated card. Gitignore `release/huggingface{,-instructor}/*` except
the committed README.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
Adds Hugging Face release packaging support (dataset-card generation + upload-tree assembly) alongside shared helpers refactored out of the Kaggle packager, and locks the contract with a comprehensive new test suite (including optional datasets.load_dataset smoke tests).
Changes:
- Introduces
scripts/package_hf_release.pyto generate HF dataset-cardREADME.md(YAML frontmatter + rewritten body) and optionally assemble an HF-loadable upload directory, including an instructor variant. - Extracts shared release-packaging utilities into
scripts/_release_common.pyand refactors the Kaggle packager to consume them. - Adds extensive tests for HF packaging behavior and adjusts Kaggle tests for the refactored shared cover-image validator messaging; commits generated HF README artifacts and updates
.gitignoreaccordingly.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/scripts/test_package_kaggle_release.py | Updates cover-image validation assertion to match shared validator messaging. |
| tests/scripts/test_package_hf_release.py | New test suite locking HF packaging contract (YAML, rewriting, safety guards, determinism, optional load_dataset). |
| scripts/package_kaggle_release.py | Refactors to import shared primitives from _release_common. |
| scripts/package_hf_release.py | New HF packager: builds/validates dataset card, rewrites README, assembles upload tree, supports instructor variant. |
| scripts/_release_common.py | New shared module for link rewriting, cover-image validation, and upload-dir safety validation. |
| release/huggingface/README.md | Committed generated HF dataset card for public tiers. |
| release/huggingface-instructor/README.md | Committed generated HF dataset card for instructor companion variant. |
| release/HF_DATASET_CARD.md | Removes legacy stub dataset card superseded by generated READMEs. |
| .gitignore | Ignores generated HF upload trees while keeping committed README artifacts. |
| .agent-plan.md | Marks Phase 5.2 tasks complete and documents shipped artifacts/tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Fold the brutal self-review's findings back into the PR before review. Bugs: - (#1) run_packager validate→write order — both packagers wrote README/metadata on validation failure, leaving corrupt artifacts on disk that would silently get committed. Gated on `errors == ()`; added no-write tests for both packagers. - (#2) Instructor README inlined the public 3-tier README into a 1-tier dataset card. Replaced with a dedicated `INSTRUCTOR_BODY` constant that links to the public dataset and describes only the instructor-specific additions (full-horizon tables, hidden DAG, latent registry, mechanism summary). - (#3) validate_upload_dir_safe also blocks strict descendants of release_dir; `--huggingface-dir release/intro` would otherwise rmtree the intro bundle. Architecture: - (#5) Finished shared-primitives extraction: SOURCE_TREE_BLOCK, validate_readme_substitution, replace_file, replace_dir, load_manifest now live in scripts/_release_common.py. Both packagers reduced to imports. - (#6) Replaced 60-line hand-rolled YAML renderer with yaml.safe_dump + a 4-line _IndentedDumper subclass. - (#7) Removed dead --owner / --dataset-slug CLI flags. - (#8) assemble_upload_dir now takes rendered_readme and writes it. - (#9) build_config_for_tier made pure (no I/O); cheap manifest-stat preflight via _assert_tier_dir_exists. - (#10) --default-config with --variant=instructor errors loudly. CI: - (#4) Added [publish] extra (datasets>=2.14, kaggle>=1.6) so the gated G12.3 / G12.4 / G11.3 tests install in one line. Cleanups: visual cruft (#13–#16), test cruft (#17 — unused tmp_path, dead tag_lines), em-dash YAML round-trip parametrised for the instructor pretty_name. Verification: 1223 tests pass + 5 gated skips; ruff + mypy clean; hash determinism PASS 67/67; leakage probes 0/3 reconstruct on every tier; validate_release_candidate --no-rebuild exits 0. release/{kaggle,huggingface,huggingface-instructor}/dataset-metadata .json|README.md regenerated; audit-artifact-sync tests guard them. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
Author
Self-review pass (commit f2fc4a2)Brutal senior-dev review of the first revision caught several real issues; folded the fixes back in before requesting human review. Bugs fixed
Architecture cleanups
CI gap acknowledged + addressed
Smaller stuff
Verification
Files changed in this commit
🤖 Generated with Claude Code |
This comment has been minimized.
This comment has been minimized.
Fold Copilot's two real findings on the self-review revision back in. COPILOT-1 — validate_upload_dir_safe was only invoked inside assemble_upload_dir, which --dry-run skips. A dry-run with --huggingface-dir release (or .) would write the README into the unsafe path BEFORE the safety net fired. Hoist the check into run_packager (both packagers) so it runs before any mkdir or write; the inner assemble_upload_dir call stays as defence-in-depth for direct callers. New tests: dry-run with unsafe upload-dir raises without writing; the same path through main() returns rc=2. COPILOT-2 — Cover-image path resolution was inconsistent: validate_cover_image used cover_image as passed, while assemble_upload_dir did a separate ``release_dir / cover_image.name`` fallback. Diverged for bare-basename inputs (false validation failures) and two-paths-sharing-a-basename (assembler shadowing the explicit path). Added resolve_cover_image_path() to _release_common.py (explicit-wins, release-dir fallback); run_packager calls it once and threads the resolved path through validation, the metadata's image field, and assembly. New tests/scripts/test_release_common.py covers the four resolution branches; new packager-side tests confirm bare-basename success + metadata field plumbing. COPILOT-3 — outdated; already addressed by self-review fix #8 in commit f2fc4a2. Resolved as already treated; no code change. Verification: 1232/1232 tests pass + 5 gated skips; ruff + mypy clean; hash determinism PASS 67/67; leakage probes rc=0 on every tier; validate_release_candidate --no-rebuild exits 0; BUNDLE_SCHEMA_VERSION unchanged at 5. release/{kaggle,huggingface,huggingface-instructor}/* artifacts regenerated byte-identically. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
pr-agent-context report: This run includes unresolved review comments on PR #72 in repository https://github.com/leadforge-dev/leadforge
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.
# Copilot Comments
## COPILOT-1
Location: scripts/package_hf_release.py:814
URL: https://github.com/leadforge-dev/leadforge/pull/72#discussion_r3196919377
Root author: copilot-pull-request-reviewer
Comment:
`run_packager()` writes `README.md` to `huggingface_dir` before any unsafe-dir validation runs. If a user passes an unsafe `--huggingface-dir` (e.g. `.`) the script can overwrite/create `README.md` in that location even though `assemble_upload_dir()` later raises. Consider calling `validate_upload_dir_safe(huggingface_dir, release_dir, kind=...)` before *any* writes (and optionally also in `--dry-run` mode) so the safety net truly prevents clobbering.
Replies:
- shaypal5
Addressed in commit 42260a8. `validate_upload_dir_safe` is now hoisted into `run_packager` (both
Kaggle + HF packagers), called immediately after the release-dir pre-flight, BEFORE any mkdir or
write — including in dry-run mode. The inner call inside `assemble_upload_dir` stays as
defence-in-depth for direct callers that bypass `run_packager`. New tests on both sides: a dry-run
with `--{huggingface,kaggle}-dir release` (i.e. release_dir itself) raises `ValueError` without
writing; the same path through `main()` returns rc=2 with "unsafe" in stderr.
## COPILOT-2
Location: scripts/package_hf_release.py
URL: https://github.com/leadforge-dev/leadforge/pull/72#discussion_r3196919426
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
Cover image source resolution is inconsistent with validation: `run_packager()` validates `cover_image` exactly as passed, but `assemble_upload_dir()` may instead copy `release_dir / cover_image.name` if it exists (or skip copying if neither exists). This can produce false validation failures (e.g. passing just `dataset-cover-image.png`) and can also copy the wrong file if two paths share the same basename. Resolve the effective `cover_src` once (prefer the explicit path; optionally treat relative paths as relative to `release_dir`) and use that same path for both validation and copying.
## COPILOT-3
Location: scripts/package_hf_release.py
URL: https://github.com/leadforge-dev/leadforge/pull/72#discussion_r3196919453
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
`assemble_upload_dir()` docstring says it assembles the rewritten README, but the function currently only copies the cover image, LICENSE, and bundle directories; `README.md` is written earlier by `run_packager()`. Either have `assemble_upload_dir()` also write/copy the README (so it’s self-contained as a helper) or update the docstring to reflect the actual responsibility split.Run metadata: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 5 PR 5.2 — second and final PR of the platform-packaging phase. Adds
scripts/package_hf_release.py, the HuggingFace counterpart to PR 5.1's Kaggle packager. Generatesrelease/huggingface/README.mdwith G12.1-compliant YAML frontmatter and the rewritten dataset-card body, and supports a--variant=instructorflag for the companion repo (G12.4).What ships
scripts/package_hf_release.py— generatesrelease/huggingface/README.mdwith YAML frontmatter satisfying G12.1:pretty_name,license: mit,language: [en],task_categories: [tabular-classification],size_categories: [1K<n<10K]per tier, sortedtags: [b2b, crm, datasets, lead-scoring, pandas, synthetic-data, tabular], threeconfigsblocks pointing at the parquet task splits in the assembled upload tree. G12.2 wired in:intermediateis the only config withdefault: true, locked byvalidate_card. Body inlines the rewrittenrelease/README.md(PR 4.1) with HF-specific link rewrites (source-repo tree → upload-tree,](../foo)→ GitHub blob URL,](validation/...)→ GitHub blob URL). Hand-rolled YAML renderer (PyYAML's flow style collapsesconfigs[]onto one line, which the HF viewer rejects). CLI:--release-dir,--huggingface-dir,--variant,--default-config,--owner,--dataset-slug,--cover-image,--dry-run. Exit codes: 0 pass / 1 validation failure / 2 pre-flight error.scripts/_release_common.py— extracted shared release-packaging primitives so the Kaggle + HF packagers share one source of truth:GITHUB_BLOB_BASE, the](../foo)regex, the validation-report blob-URL substitution,validate_cover_image(HF reuses the 560×280 floor), andvalidate_upload_dir_safe(refuses to assemble intocwd/release_dir/ its parent / the filesystem anchor).FieldDescriptor/ResourceSchema/ dtype maps are intentionally NOT extracted — HF infers schema from parquet viaload_datasetand does not need a Frictionless declaration.scripts/package_kaggle_release.pyrefactored to import these primitives;release/kaggle/dataset-metadata.jsonis byte-stable across the refactor.--variant=instructor— packages theleadforge-lead-scoring-v1-instructorcompanion repo (G12.4) fromrelease/intermediate_instructor/. The source dir flattens tointermediate/in the upload tree soload_dataset(..., name=\"intermediate\")works without naming friction. Single-config card; separaterelease/huggingface-instructor/output dir.datasetswalks the upload tree and silently skips broken symlinks in some versions).release/dataset-cover-image.pngreused as the HF thumbnail.release/huggingface{,-instructor}/*gitignored except forREADME.md; only the dataset card is committed.release/HF_DATASET_CARD.mdlegacy stub deleted — superseded by the generated card.Tests (21 new)
default: true(zero defaults and two defaults both rejected).yaml.safe_load; tags-sorted-at-render-time.../+ validation-report links) and theHF_TREE_BLOCK_SOURCEsilent-failure guard (mirrors PR 5.1's Kaggle-side).cwd).data_files[*].pathmaterialises as a real file, not a symlink).load_dataset(local_path, name=<config>)round-trip per public config + instructor companion, with row-count assertions (3500 / 750 / 750 train / valid / test). Gated on the optionaldatasetsSDK (pytest.importorskip); skipped in this CI install.Local verification
datasets-SDK round-trip + 1kaggle-SDK from PR 5.1).ruff check+ruff format --check+mypyclean across the new modules and the refactored Kaggle packager.python scripts/package_hf_release.py --dry-runandpython scripts/package_hf_release.py --variant=instructor --dry-runboth regenerate their committed README byte-identically.python scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65exits 0 on every public tier.python scripts/verify_hash_determinism.pyPASS 67/67.python scripts/validate_release_candidate.py --no-rebuildexits 0.BUNDLE_SCHEMA_VERSIONunchanged at 5.Test plan
](../, nointermediate_instructor/in the tree diagram, no brokenvalidation/...links.pip install datasetsin their env runs the gated round-trip tests and confirms 3500 / 750 / 750 row counts.🤖 Generated with Claude Code