Skip to content

PR 5.1: Kaggle release packager + cover image#71

Merged
shaypal5 merged 3 commits into
mainfrom
feat/kaggle-release-packager
May 6, 2026
Merged

PR 5.1: Kaggle release packager + cover image#71
shaypal5 merged 3 commits into
mainfrom
feat/kaggle-release-packager

Conversation

@shaypal5

@shaypal5 shaypal5 commented May 6, 2026

Copy link
Copy Markdown
Contributor

Summary

First of two PRs in Phase 5 — Platform packaging of the v1 dataset release roadmap (docs/release/v1_release_roadmap.md §"Phase 5 — PR 5.1"). Generates and validates the Kaggle dataset-metadata.json plus the deterministic cover image; assembles a Kaggle-CLI-shaped upload directory under release/kaggle/ as real-file copies. The actual kaggle datasets create upload lives in PR 7.2.

Update (post brutal self-review): the initial draft used relative symlinks for the upload tree to keep git small. A subsequent review caught that this is fragile under Kaggle's CLI (some versions walk the directory with followlinks=False and skip symlinked children). The current implementation always copies; git stays clean via release/kaggle/* gitignore (only dataset-metadata.json is committed). See review-fixes commit + PR comments for the full set of changes.

Deliverables

scripts/package_kaggle_release.py (new)

Reads each public tier's manifest.json + feature_dictionary.csv + flat CSV header under release/, emits release/kaggle/dataset-metadata.json validated against G11.1:

  • title length 6–50 chars (we ship LeadForge: Synthetic B2B Lead Scoring (v1) — 42 chars)
  • subtitle length 20–80 chars (Three-tier synthetic CRM funnel for leakage-aware lead scoring — 62 chars)
  • id slug 3–50 chars (default: leadforge/leadforge-lead-scoring-v1, the <owner>/<slug> form Kaggle's actual schema requires — PR 7.2 doesn't have to splice in a username at upload time)
  • exactly one licenses entry (MIT)
  • expectedUpdateFrequency = "never" (G11.1 approved value list)
  • resources[].schema.fields listed in column order for every tabular resource — flat CSVs from the feature dictionary, parquet files from pyarrow.parquet.read_schema

The metadata's description field inlines release/README.md (the rewritten dataset card from PR 4.1) with three Kaggle-specific rewrites: source-repo tree → upload-tree, ](../foo) → GitHub blob URL via regex, ](validation/validation_report.md) → GitHub blob URL. Drift between KAGGLE_TREE_BLOCK and the README is caught by _validate_readme_substitution (run-time guard) and a static test (test_kaggle_tree_block_is_present_in_release_readme).

JSON is rendered with ensure_ascii=False so the inlined README text reads as literal em-dashes / × / smart-quotes in the committed file rather than -style escapes.

CLI: --release-dir, --kaggle-dir, --tier (repeatable), --owner, --dataset-slug, --cover-image, --dry-run (validate + write metadata only; skip upload-dir assembly). Exit codes: 0 pass / 1 validation failure / 2 pre-flight error.

scripts/generate_cover_image.py (new)

Deterministic Pillow + DejaVu Sans (bundled with matplotlib) renderer producing release/dataset-cover-image.png at 1280×640 (well above Kaggle's 560×280 floor per G11.2, 2:1 aspect for Kaggle's header crop). Three tier cards surface the cross-seed median conversion rate + LR AUC pinned from release/validation/validation_report.md.

Cover-image source decision (roadmap §"Open questions" #2): auto-generated. Programmatic rendering is byte-deterministic across runs and guarded by the same audit-artifact-sync test pattern as PR 4.1's scripts/audit_channel_signal.py.

Upload-directory assembly

release/kaggle/ is assembled as real-file copies of the bundle directories + cover image + LICENSE, plus a real-file copy for README.md (rewritten on the way in so its links resolve on the Kaggle dataset page). _validate_kaggle_dir_safe refuses to assemble into cwd / release_dir / its parent / the filesystem anchor.

release/kaggle/* is gitignored except for dataset-metadata.json itself — the upload tree is regenerated on demand from release/{intro,intermediate,advanced}/, only the metadata is committed.

Tests (24 active + 1 gated)

  • tests/scripts/test_package_kaggle_release.py × 20 — every Kaggle field constraint, id user/slug enforcement, README rewriting (tree + ../ + validation report), KAGGLE_TREE_BLOCK static + run-time guards, every-declared-resource-resolves, cwd / release_dir / parent rejection, byte-determinism, content-shape on the committed metadata, ensure_ascii=False, sorted-keywords-at-render. + 1 gated on the optional kaggle SDK that closes G11.3 by feeding the assembled tree to Kaggle's actual metadata validator.
  • tests/scripts/test_generate_cover_image.py × 4 — dimension floor + 2:1 aspect, PNG round-trip, byte-determinism across runs, committed-cover-matches-fresh-render sync check.

Acceptance

Gate Result
python scripts/package_kaggle_release.py --dry-run exit 0; writes release/kaggle/dataset-metadata.json
python -m pytest 1199 passed, 1 skipped (kaggle-CLI test gated on optional SDK)
ruff check . All checks passed
mypy leadforge/ scripts/{package_kaggle_release,generate_cover_image}.py OK
python scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65 exit 0 on every public tier
python scripts/verify_hash_determinism.py PASS 67/67 files identical
python scripts/validate_release_candidate.py --no-rebuild exit 0

BUNDLE_SCHEMA_VERSION unchanged at 5 — this PR doesn't touch the bundle shape.

Test plan

  • CI runs the new tests (tests/scripts/test_{package_kaggle_release,generate_cover_image}.py) green
  • python scripts/package_kaggle_release.py (default mode, no --dry-run) successfully assembles the upload tree under release/kaggle/ as real-file copies
  • Open release/dataset-cover-image.png and confirm the three-tier card layout renders the pinned conversion-rate + LR AUC literals
  • Spot-check release/kaggle/dataset-metadata.json: 42 resources, every flat CSV + parquet resource has schema.fields in column order, id = leadforge/leadforge-lead-scoring-v1, description has GitHub blob URLs (no ](../ leaks), unicode rendered literally (em-dashes, no )
  • Run python -m pytest tests/scripts/test_package_kaggle_release.py::test_kaggle_cli_accepts_assembled_metadata in an environment with the optional kaggle SDK installed; expect pass (closes G11.3 against Kaggle's actual validator)

🤖 Generated with Claude Code

First of two PRs in Phase 5 (Platform packaging) of the v1 dataset
release roadmap. Generates and validates the Kaggle dataset-metadata.json
plus the deterministic cover image; assembles a Kaggle-CLI-shaped upload
directory under release/kaggle/ via relative symlinks. Actual upload
lives in PR 7.2.

* scripts/package_kaggle_release.py — reads each public tier's
  manifest.json + feature_dictionary.csv + flat CSV header and emits
  release/kaggle/dataset-metadata.json validated against G11.1 (title
  6-50, subtitle 20-80, slug 3-50, single MIT licence,
  expectedUpdateFrequency=never, image filename, schema.fields in column
  order on every tabular resource — CSVs from the feature dictionary,
  parquet from pyarrow.parquet.read_schema). Description inlines
  release/README.md with three Kaggle-specific rewrites: source-repo
  tree → upload-tree, ../foo → GitHub blob URL, validation/ link →
  GitHub blob URL. Default id follows Kaggle's <owner>/<slug> schema so
  PR 7.2 doesn't have to splice in a username at upload time.

* scripts/generate_cover_image.py — deterministic Pillow + DejaVu Sans
  renderer producing release/dataset-cover-image.png at 1280x640
  (well above the 560x280 floor, 2:1 aspect for Kaggle's header crop).
  Three tier cards surface the cross-seed median conversion rate + LR
  AUC pinned from release/validation/validation_report.md.

* Upload-dir assembly uses relative symlinks for heavy bundle dirs +
  cover image + LICENSE, plus a real-file copy for README.md (rewritten
  so its links resolve on the Kaggle dataset page).
  _validate_kaggle_dir_safe refuses to assemble into cwd / release_dir
  / its parent. release/kaggle/* is gitignored except for
  dataset-metadata.json — the upload tree is regenerated on demand,
  only the metadata is committed.

* 19 new tests across tests/scripts/test_{package_kaggle_release,
  generate_cover_image}.py: every field constraint, CSV + parquet
  schema column-order parity, README rewriting (tree + ../ + validation
  report), unsafe-kaggle-dir rejection, CLI rc=2 on missing release dir,
  byte-determinism, and committed-metadata-matches-fresh-regeneration.

Acceptance:
  python scripts/package_kaggle_release.py --dry-run -> exit 0
  python -m pytest -> 1194 passed
  ruff check . -> all checks passed
  mypy leadforge/ scripts/{package_kaggle_release,generate_cover_image}.py -> ok
  python scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65 -> exit 0 on every tier
  python scripts/verify_hash_determinism.py -> PASS 67/67
  python scripts/validate_release_candidate.py --no-rebuild -> exit 0

BUNDLE_SCHEMA_VERSION unchanged at 5 (this PR doesn't touch bundle shape).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 16:50
@shaypal5 shaypal5 added type: feature New capability layer: cli cli/ command-line interface labels May 6, 2026
@github-actions

This comment has been minimized.

Acts on the brutal-self-review findings against the initial PR 5.1
implementation; intent recorded in the PR comments.

Architecture
* Drop symlink-based upload-dir assembly. Always copy: cover image,
  LICENSE, the rewritten README, and the per-tier bundle directories.
  Removes the silent-failure mode where Kaggle's CLI walks the upload
  tree with followlinks=False and skips symlinked children. Disk cost
  is ~15 MB per run (gitignored) — the saving was for git, and that
  saving is preserved by the existing release/kaggle/* gitignore rule.

JSON / metadata content
* json.dumps(..., ensure_ascii=False) so em-dashes, ×, smart quotes
  etc. render literally rather than as – escapes — diffs become
  reviewable when the inlined README evolves.
* metadata_to_dict rewritten as a single-pass field-by-field builder
  (no asdict()+overwrite); resources go through one serialiser.
* keywords sorted at render time so the determinism contract is
  explicit rather than relying on DEFAULT_KEYWORDS happening to be
  alphabetised.
* userSpecifiedSources now uses a UserSource(title, url) dataclass to
  match the rest of the typed-record discipline (LicenseSpec etc.).

Validators
* New _validate_readme_substitution catches the silent-failure trap
  where release/README.md drifts from KAGGLE_TREE_BLOCK and the
  rewrite no-ops; wired into run_packager.
* Removed validate_fields_match_csv — it was tautological in
  production (the schema is built FROM the CSV header it would
  re-read) and the test was self-confirming. The audit-artifact-sync
  test now carries the column-order contract.
* Pre-flight release_dir.exists check moved into run_packager so the
  CLI and library callers share one path.

CLI / housekeeping
* --user-slug renamed --owner (matches Kaggle's actual vocabulary).
* --print removed; the metadata file is the output, "cat" suffices.
* "wrote ..." success line no longer prints on validation failure.
* shutil moved to top-level import (was lazy mid-function before).
* DatasetMetadata dataclass docstring states the validation
  discipline explicitly: dataclasses are records, validate_metadata
  is the authoritative gate, no __post_init__ ceremony.

Tests
* Drop the tautological flat-CSV-vs-feature-dict and parquet-vs-arrow
  schema tests; the construction path is by-CSV-header by definition,
  the audit-sync test catches drift.
* Add test_kaggle_tree_block_is_present_in_release_readme — the
  silent-failure guard a P1 review item flagged.
* Add test_validate_readme_substitution_flags_drift covering the
  run-time validator.
* Add test_assembled_upload_dir_resolves_every_declared_resource —
  asserts every declared resources[].path resolves to a real file
  (not a symlink, not missing) under the assembled tree.
* Add test_assemble_upload_dir_rejects_kaggle_dir_equal_to_cwd —
  was previously untested.
* Add test_render_metadata_emits_literal_unicode_not_escapes and
  test_render_metadata_keywords_are_sorted_at_render_time.
* Add test_kaggle_cli_accepts_assembled_metadata — gated on the
  optional kaggle SDK being installed; closes G11.3 with a real
  external validator. Skipped locally; intended to run in any env
  with kaggle installed.
* test_committed_kaggle_metadata_matches_fresh_regeneration now
  carries positive content assertions (description has the right
  sections, every flat CSV schema starts with split and ends with
  converted_within_90_days, etc.) so the byte-equality check
  cannot pass on degenerate output that we accidentally re-committed.

Acceptance:
  python -m pytest                                        -> 1199 passed, 1 skipped
  ruff check .                                            -> all checks passed
  mypy leadforge/ scripts/{package_kaggle_release,generate_cover_image}.py -> ok
  python scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65 -> exit 0 each tier
  python scripts/verify_hash_determinism.py               -> PASS 67/67
  python scripts/package_kaggle_release.py --dry-run       -> exit 0

BUNDLE_SCHEMA_VERSION unchanged at 5.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

This comment has been minimized.

@shaypal5

shaypal5 commented May 6, 2026

Copy link
Copy Markdown
Contributor Author

Self-review fixes applied

Pulled this PR up for a brutal self-review and acted on the findings. Posting the review summary here so the audit trail lives on the PR rather than only in conversation.

What changed in the review-fixes commit

Architecture

  • Dropped symlinks for the upload tree, switched to copies. The original symlink approach was fragile under Kaggle's CLI (some versions walk the dir with followlinks=False and silently skip symlinked children — would have failed at upload time in PR 7.2). Disk cost of copies is ~15 MB per assembly run; git stays clean via release/kaggle/* gitignore (only dataset-metadata.json is committed).

Metadata content

  • json.dumps(..., ensure_ascii=False) so the inlined README renders with literal em-dashes / × / smart quotes. Fixes a wall-of- diff problem; saves ~10% file size.
  • metadata_to_dict rewritten as a single-pass field-by-field builder.
  • keywords sorted at render time so the determinism contract is explicit, not relying on the DEFAULT_KEYWORDS constant happening to be alphabetised.
  • userSpecifiedSources now uses a UserSource(title, url) dataclass to match the rest of the typed-record discipline (LicenseSpec etc.).

Validators

  • New _validate_readme_substitution catches the silent-failure trap where release/README.md drifts from KAGGLE_TREE_BLOCK and the rewrite no-ops; wired into run_packager and additionally guarded by a static test.
  • Removed validate_fields_match_csv — it was tautological in production (the schema is built from the same CSV header it would re-read) and the matching test was self-confirming. The audit-artifact-sync test now carries the column-order contract.
  • Pre-flight release_dir.exists check moved into run_packager so library callers get the same error path as the CLI.

CLI / housekeeping

  • --user-slug renamed to --owner (Kaggle's actual vocabulary).
  • --print removed; the metadata file is the output, cat suffices.
  • "wrote …" success line no longer prints on validation failure.
  • shutil moved to top-level import.
  • DatasetMetadata dataclass docstring states the validation discipline explicitly: dataclasses are records, validate_metadata is the authoritative gate, no __post_init__ ceremony.

Tests

  • Dropped the tautological flat-CSV-vs-feature-dict and parquet-vs-arrow schema tests; replaced with a comment block explaining why the construction-by-CSV-header invariant + audit-sync test are the real coverage.
  • Added test_kaggle_tree_block_is_present_in_release_readme (silent-failure guard) and test_validate_readme_substitution_flags_drift (run-time validator coverage).
  • Added test_assembled_upload_dir_resolves_every_declared_resource — every declared resources[].path resolves to a real file (not a symlink, not missing) under the assembled tree.
  • Added test_assemble_upload_dir_rejects_kaggle_dir_equal_to_cwd (was previously untested).
  • Added test_render_metadata_emits_literal_unicode_not_escapes and test_render_metadata_keywords_are_sorted_at_render_time.
  • Added test_kaggle_cli_accepts_assembled_metadata — gated on the optional kaggle SDK being installed; closes G11.3 with a real external validator. Skipped locally; intended to run in any env with the SDK.
  • test_committed_kaggle_metadata_matches_fresh_regeneration now carries positive content assertions so the byte-equality check cannot pass on degenerate output that we accidentally re-committed.

Acceptance after the fixes

Gate Result
python -m pytest 1199 passed, 1 skipped (kaggle-CLI test gated on optional SDK)
ruff check . All checks passed
mypy leadforge/ scripts/{package_kaggle_release,generate_cover_image}.py OK
python scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65 exit 0 on every tier
python scripts/verify_hash_determinism.py PASS 67/67
python scripts/package_kaggle_release.py --dry-run exit 0

What was deliberately not changed

  • Cover image binary in git — accepted trade-off; deterministic, rarely-changing, regenerated only when validate_release_candidate.py medians shift.
  • Description duplication (full README inlined into the metadata description field) — Kaggle's schema requires inline description; the audit-sync test catches drift.
  • Single-scripts/ location vs leadforge/release/ package — would be a wider refactor spanning multiple scripts; deferred.

CI on Linux failed test_committed_cover_matches_fresh_regeneration:
the committed PNG was rendered on macOS, and Pillow + FreeType produce
different glyph rasterisation across platforms (different FreeType
versions, different font-hinting tables).  The "byte-deterministic"
claim was per-machine, not cross-platform.

Replace the cross-OS sync test with a content-shape test that loads
the committed PNG and asserts Kaggle's dimension floor + the canonical
1280x640 size.  Per-machine byte determinism still tested via
test_render_cover_is_byte_deterministic.  The committed PNG is now
documented as "one valid render", not a hash-locked artefact.

Generator docstring updated with the cross-platform caveat next to the
rendering code itself, so the limitation is visible at the source.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #71 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25453099748 attempt 1
Comment timestamp: 2026-05-06T18:18:57.706528+00:00
PR head commit: 9c2c0623daf8dc2b164c963c26411280c253e345

@shaypal5 shaypal5 merged commit eab2c7a into main May 6, 2026
8 checks passed
@shaypal5 shaypal5 deleted the feat/kaggle-release-packager branch May 6, 2026 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

layer: cli cli/ command-line interface type: feature New capability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant