Skip to content

PR 5.2: HuggingFace release packager + load_dataset smoke test#72

Merged
shaypal5 merged 3 commits into
mainfrom
feat/hf-release-packager
May 6, 2026
Merged

PR 5.2: HuggingFace release packager + load_dataset smoke test#72
shaypal5 merged 3 commits into
mainfrom
feat/hf-release-packager

Conversation

@shaypal5

@shaypal5 shaypal5 commented May 6, 2026

Copy link
Copy Markdown
Contributor

Summary

Phase 5 PR 5.2 — second and final PR of the platform-packaging phase. Adds scripts/package_hf_release.py, the HuggingFace counterpart to PR 5.1's Kaggle packager. Generates release/huggingface/README.md with G12.1-compliant YAML frontmatter and the rewritten dataset-card body, and supports a --variant=instructor flag for the companion repo (G12.4).

What ships

  • scripts/package_hf_release.py — generates release/huggingface/README.md with YAML frontmatter satisfying G12.1: pretty_name, license: mit, language: [en], task_categories: [tabular-classification], size_categories: [1K<n<10K] per tier, sorted tags: [b2b, crm, datasets, lead-scoring, pandas, synthetic-data, tabular], three configs blocks pointing at the parquet task splits in the assembled upload tree. G12.2 wired in: intermediate is the only config with default: true, locked by validate_card. Body inlines the rewritten release/README.md (PR 4.1) with HF-specific link rewrites (source-repo tree → upload-tree, ](../foo) → GitHub blob URL, ](validation/...) → GitHub blob URL). Hand-rolled YAML renderer (PyYAML's flow style collapses configs[] onto one line, which the HF viewer rejects). CLI: --release-dir, --huggingface-dir, --variant, --default-config, --owner, --dataset-slug, --cover-image, --dry-run. Exit codes: 0 pass / 1 validation failure / 2 pre-flight error.
  • scripts/_release_common.py — extracted shared release-packaging primitives so the Kaggle + HF packagers share one source of truth: GITHUB_BLOB_BASE, the ](../foo) regex, the validation-report blob-URL substitution, validate_cover_image (HF reuses the 560×280 floor), and validate_upload_dir_safe (refuses to assemble into cwd / release_dir / its parent / the filesystem anchor). FieldDescriptor / ResourceSchema / dtype maps are intentionally NOT extracted — HF infers schema from parquet via load_dataset and does not need a Frictionless declaration. scripts/package_kaggle_release.py refactored to import these primitives; release/kaggle/dataset-metadata.json is byte-stable across the refactor.
  • --variant=instructor — packages the leadforge-lead-scoring-v1-instructor companion repo (G12.4) from release/intermediate_instructor/. The source dir flattens to intermediate/ in the upload tree so load_dataset(..., name=\"intermediate\") works without naming friction. Single-config card; separate release/huggingface-instructor/ output dir.
  • Upload-tree assembly — real-file copies (mirrors PR 5.1; datasets walks the upload tree and silently skips broken symlinks in some versions). release/dataset-cover-image.png reused as the HF thumbnail. release/huggingface{,-instructor}/* gitignored except for README.md; only the dataset card is committed.
  • release/HF_DATASET_CARD.md legacy stub deleted — superseded by the generated card.

Tests (21 new)

  • Every G12.1 YAML field, including the failure modes for empty / mistyped values.
  • G12.2 — exactly one config has default: true (zero defaults and two defaults both rejected).
  • YAML round-trip through yaml.safe_load; tags-sorted-at-render-time.
  • README rewriting (tree + ../ + validation-report links) and the HF_TREE_BLOCK_SOURCE silent-failure guard (mirrors PR 5.1's Kaggle-side).
  • Unsafe-dir rejection (release dir / parent / cwd).
  • Assembled-tree resource resolution (every declared data_files[*].path materialises as a real file, not a symlink).
  • Idempotent re-assembly.
  • Byte-determinism (audit-artifact-sync) and committed-README-matches-fresh-regeneration sync check.
  • Instructor variant — single config, flattened source dir.
  • G12.3 + G12.4 — parametrised load_dataset(local_path, name=<config>) round-trip per public config + instructor companion, with row-count assertions (3500 / 750 / 750 train / valid / test). Gated on the optional datasets SDK (pytest.importorskip); skipped in this CI install.

Local verification

  • 1216/1216 tests pass (+ 5 gated skips: 4 datasets-SDK round-trip + 1 kaggle-SDK from PR 5.1).
  • ruff check + ruff format --check + mypy clean across the new modules and the refactored Kaggle packager.
  • python scripts/package_hf_release.py --dry-run and python scripts/package_hf_release.py --variant=instructor --dry-run both regenerate their committed README byte-identically.
  • python scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65 exits 0 on every public tier.
  • python scripts/verify_hash_determinism.py PASS 67/67.
  • python scripts/validate_release_candidate.py --no-rebuild exits 0.
  • BUNDLE_SCHEMA_VERSION unchanged at 5.

Test plan

  • CI passes on this branch.
  • Reviewer confirms YAML frontmatter shape matches the HF dataset-card schema.
  • Reviewer spot-checks the rewritten body — no ](../, no intermediate_instructor/ in the tree diagram, no broken validation/... links.
  • (Optional) Reviewer with pip install datasets in their env runs the gated round-trip tests and confirms 3500 / 750 / 750 row counts.

🤖 Generated with Claude Code

Add `scripts/package_hf_release.py` to generate `release/huggingface/README.md`
with G12.1-compliant YAML frontmatter (pretty_name, license, language,
task_categories, size_categories, tags, three configs with `default: true`
on intermediate per G12.2), inlining the rewritten `release/README.md`
body with HF-specific link rewrites.  `--variant=instructor` packages the
companion repo (G12.4) from `release/intermediate_instructor/` into a
separate `release/huggingface-instructor/` upload tree.  G12.3 covered
by a parametrised `load_dataset()` smoke test gated on the optional
`datasets` SDK.

Extract shared release-packaging primitives (link rewriter, dir-safety
guard, cover-image validator) into `scripts/_release_common.py`; refactor
the Kaggle packager to import them.  `release/kaggle/dataset-metadata.json`
is byte-stable across the refactor.

Delete the legacy `release/HF_DATASET_CARD.md` stub — superseded by the
generated card.  Gitignore `release/huggingface{,-instructor}/*` except
the committed README.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 19:21
@shaypal5 shaypal5 added type: feature New capability layer: cli cli/ command-line interface labels May 6, 2026
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Hugging Face release packaging support (dataset-card generation + upload-tree assembly) alongside shared helpers refactored out of the Kaggle packager, and locks the contract with a comprehensive new test suite (including optional datasets.load_dataset smoke tests).

Changes:

  • Introduces scripts/package_hf_release.py to generate HF dataset-card README.md (YAML frontmatter + rewritten body) and optionally assemble an HF-loadable upload directory, including an instructor variant.
  • Extracts shared release-packaging utilities into scripts/_release_common.py and refactors the Kaggle packager to consume them.
  • Adds extensive tests for HF packaging behavior and adjusts Kaggle tests for the refactored shared cover-image validator messaging; commits generated HF README artifacts and updates .gitignore accordingly.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/scripts/test_package_kaggle_release.py Updates cover-image validation assertion to match shared validator messaging.
tests/scripts/test_package_hf_release.py New test suite locking HF packaging contract (YAML, rewriting, safety guards, determinism, optional load_dataset).
scripts/package_kaggle_release.py Refactors to import shared primitives from _release_common.
scripts/package_hf_release.py New HF packager: builds/validates dataset card, rewrites README, assembles upload tree, supports instructor variant.
scripts/_release_common.py New shared module for link rewriting, cover-image validation, and upload-dir safety validation.
release/huggingface/README.md Committed generated HF dataset card for public tiers.
release/huggingface-instructor/README.md Committed generated HF dataset card for instructor companion variant.
release/HF_DATASET_CARD.md Removes legacy stub dataset card superseded by generated READMEs.
.gitignore Ignores generated HF upload trees while keeping committed README artifacts.
.agent-plan.md Marks Phase 5.2 tasks complete and documents shipped artifacts/tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/package_hf_release.py
Comment thread scripts/package_hf_release.py Outdated
Comment thread scripts/package_hf_release.py Outdated
Fold the brutal self-review's findings back into the PR before review.

Bugs:
- (#1) run_packager validate→write order — both packagers wrote
  README/metadata on validation failure, leaving corrupt artifacts on
  disk that would silently get committed.  Gated on `errors == ()`;
  added no-write tests for both packagers.
- (#2) Instructor README inlined the public 3-tier README into a
  1-tier dataset card.  Replaced with a dedicated `INSTRUCTOR_BODY`
  constant that links to the public dataset and describes only the
  instructor-specific additions (full-horizon tables, hidden DAG,
  latent registry, mechanism summary).
- (#3) validate_upload_dir_safe also blocks strict descendants of
  release_dir; `--huggingface-dir release/intro` would otherwise
  rmtree the intro bundle.

Architecture:
- (#5) Finished shared-primitives extraction: SOURCE_TREE_BLOCK,
  validate_readme_substitution, replace_file, replace_dir,
  load_manifest now live in scripts/_release_common.py.  Both
  packagers reduced to imports.
- (#6) Replaced 60-line hand-rolled YAML renderer with yaml.safe_dump
  + a 4-line _IndentedDumper subclass.
- (#7) Removed dead --owner / --dataset-slug CLI flags.
- (#8) assemble_upload_dir now takes rendered_readme and writes it.
- (#9) build_config_for_tier made pure (no I/O); cheap manifest-stat
  preflight via _assert_tier_dir_exists.
- (#10) --default-config with --variant=instructor errors loudly.

CI:
- (#4) Added [publish] extra (datasets>=2.14, kaggle>=1.6) so the
  gated G12.3 / G12.4 / G11.3 tests install in one line.

Cleanups: visual cruft (#13#16), test cruft (#17 — unused tmp_path,
dead tag_lines), em-dash YAML round-trip parametrised for the
instructor pretty_name.

Verification: 1223 tests pass + 5 gated skips; ruff + mypy clean;
hash determinism PASS 67/67; leakage probes 0/3 reconstruct on every
tier; validate_release_candidate --no-rebuild exits 0.
release/{kaggle,huggingface,huggingface-instructor}/dataset-metadata
.json|README.md regenerated; audit-artifact-sync tests guard them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shaypal5

shaypal5 commented May 6, 2026

Copy link
Copy Markdown
Contributor Author

Self-review pass (commit f2fc4a2)

Brutal senior-dev review of the first revision caught several real issues; folded the fixes back in before requesting human review.

Bugs fixed

  • docs: add mandatory branch/PR workflow rules #1run_packager (both Kaggle + HF) was writing the README/metadata even when validation failed. Now early-returns with errors populated and zero artifacts on disk; new no-write tests on both sides.
  • docs: add label taxonomy and milestone map to PR workflow #2 — Instructor README was inlining the public 3-tier README for a 1-tier dataset (the published HF card would have said "three-tier" 6 times for a single-config dataset). Replaced with a dedicated INSTRUCTOR_BODY constant that opens by linking to the public dataset and describes only the instructor-specific additions.
  • feat: Milestone 0 — project foundation and package skeleton (v0.1.0) #3validate_upload_dir_safe was too narrow; --huggingface-dir release/intro would have rmtree'd the intro bundle. Now blocks strict descendants of release_dir with an allow-list for the canonical release/{kaggle,huggingface,huggingface-instructor} direct-children.

Architecture cleanups

CI gap acknowledged + addressed

  • feat: Milestone 1 — canonical config, recipe, and model objects #4 — Added [publish] extra in pyproject.toml (datasets>=2.14, kaggle>=1.6) so the gated G12.3/G12.4/G11.3 tests install in one line. Wiring a [publish] job into CI is a follow-up; this PR makes it trivial. Honest update to the PR description: G12.3/G12.4 are exercised only when datasets is installed; CI does not currently install it.

Smaller stuff

Verification

  • 1223/1223 tests pass + 5 gated skips (4 datasets-SDK round-trip + 1 kaggle-SDK from PR 5.1).
  • ruff check + ruff format --check + mypy clean.
  • Hash determinism PASS 67/67.
  • Leakage probes 0/3 reconstruct on every tier.
  • validate_release_candidate --no-rebuild exits 0.
  • BUNDLE_SCHEMA_VERSION unchanged at 5.

Files changed in this commit

  • scripts/_release_common.py — absorbed the shared helpers
  • scripts/package_kaggle_release.py — simplified imports, write-on-error fix
  • scripts/package_hf_release.pyINSTRUCTOR_BODY, PyYAML rendering, dropped dead flags, validate→write gate
  • tests/scripts/test_package_*.py — new tests for the fixes above
  • pyproject.toml[publish] extra
  • release/huggingface{,-instructor}/README.md — regenerated (audit-artifact-sync)
  • .agent-plan.md — captured the self-review fixes

🤖 Generated with Claude Code

@github-actions

This comment has been minimized.

Fold Copilot's two real findings on the self-review revision back in.

COPILOT-1 — validate_upload_dir_safe was only invoked inside
assemble_upload_dir, which --dry-run skips.  A dry-run with
--huggingface-dir release (or .) would write the README into the
unsafe path BEFORE the safety net fired.  Hoist the check into
run_packager (both packagers) so it runs before any mkdir or write;
the inner assemble_upload_dir call stays as defence-in-depth for
direct callers.  New tests: dry-run with unsafe upload-dir raises
without writing; the same path through main() returns rc=2.

COPILOT-2 — Cover-image path resolution was inconsistent:
validate_cover_image used cover_image as passed, while
assemble_upload_dir did a separate ``release_dir / cover_image.name``
fallback.  Diverged for bare-basename inputs (false validation
failures) and two-paths-sharing-a-basename (assembler shadowing the
explicit path).  Added resolve_cover_image_path() to
_release_common.py (explicit-wins, release-dir fallback);
run_packager calls it once and threads the resolved path through
validation, the metadata's image field, and assembly.  New
tests/scripts/test_release_common.py covers the four resolution
branches; new packager-side tests confirm bare-basename success +
metadata field plumbing.

COPILOT-3 — outdated; already addressed by self-review fix #8 in
commit f2fc4a2.  Resolved as already treated; no code change.

Verification: 1232/1232 tests pass + 5 gated skips; ruff + mypy
clean; hash determinism PASS 67/67; leakage probes rc=0 on every
tier; validate_release_candidate --no-rebuild exits 0;
BUNDLE_SCHEMA_VERSION unchanged at 5.
release/{kaggle,huggingface,huggingface-instructor}/* artifacts
regenerated byte-identically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown

pr-agent-context report:

This run includes unresolved review comments on PR #72 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: scripts/package_hf_release.py:814
URL: https://github.com/leadforge-dev/leadforge/pull/72#discussion_r3196919377
Root author: copilot-pull-request-reviewer

Comment:
    `run_packager()` writes `README.md` to `huggingface_dir` before any unsafe-dir validation runs. If a user passes an unsafe `--huggingface-dir` (e.g. `.`) the script can overwrite/create `README.md` in that location even though `assemble_upload_dir()` later raises. Consider calling `validate_upload_dir_safe(huggingface_dir, release_dir, kind=...)` before *any* writes (and optionally also in `--dry-run` mode) so the safety net truly prevents clobbering.

Replies:
- shaypal5
  Addressed in commit 42260a8. `validate_upload_dir_safe` is now hoisted into `run_packager` (both
Kaggle + HF packagers), called immediately after the release-dir pre-flight, BEFORE any mkdir or
write — including in dry-run mode. The inner call inside `assemble_upload_dir` stays as
defence-in-depth for direct callers that bypass `run_packager`. New tests on both sides: a dry-run
with `--{huggingface,kaggle}-dir release` (i.e. release_dir itself) raises `ValueError` without
writing; the same path through `main()` returns rc=2 with "unsafe" in stderr.

## COPILOT-2
Location: scripts/package_hf_release.py
URL: https://github.com/leadforge-dev/leadforge/pull/72#discussion_r3196919426
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    Cover image source resolution is inconsistent with validation: `run_packager()` validates `cover_image` exactly as passed, but `assemble_upload_dir()` may instead copy `release_dir / cover_image.name` if it exists (or skip copying if neither exists). This can produce false validation failures (e.g. passing just `dataset-cover-image.png`) and can also copy the wrong file if two paths share the same basename. Resolve the effective `cover_src` once (prefer the explicit path; optionally treat relative paths as relative to `release_dir`) and use that same path for both validation and copying.

## COPILOT-3
Location: scripts/package_hf_release.py
URL: https://github.com/leadforge-dev/leadforge/pull/72#discussion_r3196919453
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    `assemble_upload_dir()` docstring says it assembles the rewritten README, but the function currently only copies the cover image, LICENSE, and bundle directories; `README.md` is written earlier by `run_packager()`. Either have `assemble_upload_dir()` also write/copy the README (so it’s self-contained as a helper) or update the docstring to reflect the actual responsibility split.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25461015698 attempt 1
Comment timestamp: 2026-05-06T21:04:02.523880+00:00
PR head commit: 42260a80e9dbea4d12ae725e7912741a89355c6a

@shaypal5 shaypal5 merged commit d62faf8 into main May 6, 2026
9 of 11 checks passed
@shaypal5 shaypal5 deleted the feat/hf-release-packager branch May 6, 2026 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

layer: cli cli/ command-line interface type: feature New capability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants