PR 5.1: Kaggle release packager + cover image#71
Merged
Conversation
First of two PRs in Phase 5 (Platform packaging) of the v1 dataset
release roadmap. Generates and validates the Kaggle dataset-metadata.json
plus the deterministic cover image; assembles a Kaggle-CLI-shaped upload
directory under release/kaggle/ via relative symlinks. Actual upload
lives in PR 7.2.
* scripts/package_kaggle_release.py — reads each public tier's
manifest.json + feature_dictionary.csv + flat CSV header and emits
release/kaggle/dataset-metadata.json validated against G11.1 (title
6-50, subtitle 20-80, slug 3-50, single MIT licence,
expectedUpdateFrequency=never, image filename, schema.fields in column
order on every tabular resource — CSVs from the feature dictionary,
parquet from pyarrow.parquet.read_schema). Description inlines
release/README.md with three Kaggle-specific rewrites: source-repo
tree → upload-tree, ../foo → GitHub blob URL, validation/ link →
GitHub blob URL. Default id follows Kaggle's <owner>/<slug> schema so
PR 7.2 doesn't have to splice in a username at upload time.
* scripts/generate_cover_image.py — deterministic Pillow + DejaVu Sans
renderer producing release/dataset-cover-image.png at 1280x640
(well above the 560x280 floor, 2:1 aspect for Kaggle's header crop).
Three tier cards surface the cross-seed median conversion rate + LR
AUC pinned from release/validation/validation_report.md.
* Upload-dir assembly uses relative symlinks for heavy bundle dirs +
cover image + LICENSE, plus a real-file copy for README.md (rewritten
so its links resolve on the Kaggle dataset page).
_validate_kaggle_dir_safe refuses to assemble into cwd / release_dir
/ its parent. release/kaggle/* is gitignored except for
dataset-metadata.json — the upload tree is regenerated on demand,
only the metadata is committed.
* 19 new tests across tests/scripts/test_{package_kaggle_release,
generate_cover_image}.py: every field constraint, CSV + parquet
schema column-order parity, README rewriting (tree + ../ + validation
report), unsafe-kaggle-dir rejection, CLI rc=2 on missing release dir,
byte-determinism, and committed-metadata-matches-fresh-regeneration.
Acceptance:
python scripts/package_kaggle_release.py --dry-run -> exit 0
python -m pytest -> 1194 passed
ruff check . -> all checks passed
mypy leadforge/ scripts/{package_kaggle_release,generate_cover_image}.py -> ok
python scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65 -> exit 0 on every tier
python scripts/verify_hash_determinism.py -> PASS 67/67
python scripts/validate_release_candidate.py --no-rebuild -> exit 0
BUNDLE_SCHEMA_VERSION unchanged at 5 (this PR doesn't touch bundle shape).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Acts on the brutal-self-review findings against the initial PR 5.1
implementation; intent recorded in the PR comments.
Architecture
* Drop symlink-based upload-dir assembly. Always copy: cover image,
LICENSE, the rewritten README, and the per-tier bundle directories.
Removes the silent-failure mode where Kaggle's CLI walks the upload
tree with followlinks=False and skips symlinked children. Disk cost
is ~15 MB per run (gitignored) — the saving was for git, and that
saving is preserved by the existing release/kaggle/* gitignore rule.
JSON / metadata content
* json.dumps(..., ensure_ascii=False) so em-dashes, ×, smart quotes
etc. render literally rather than as – escapes — diffs become
reviewable when the inlined README evolves.
* metadata_to_dict rewritten as a single-pass field-by-field builder
(no asdict()+overwrite); resources go through one serialiser.
* keywords sorted at render time so the determinism contract is
explicit rather than relying on DEFAULT_KEYWORDS happening to be
alphabetised.
* userSpecifiedSources now uses a UserSource(title, url) dataclass to
match the rest of the typed-record discipline (LicenseSpec etc.).
Validators
* New _validate_readme_substitution catches the silent-failure trap
where release/README.md drifts from KAGGLE_TREE_BLOCK and the
rewrite no-ops; wired into run_packager.
* Removed validate_fields_match_csv — it was tautological in
production (the schema is built FROM the CSV header it would
re-read) and the test was self-confirming. The audit-artifact-sync
test now carries the column-order contract.
* Pre-flight release_dir.exists check moved into run_packager so the
CLI and library callers share one path.
CLI / housekeeping
* --user-slug renamed --owner (matches Kaggle's actual vocabulary).
* --print removed; the metadata file is the output, "cat" suffices.
* "wrote ..." success line no longer prints on validation failure.
* shutil moved to top-level import (was lazy mid-function before).
* DatasetMetadata dataclass docstring states the validation
discipline explicitly: dataclasses are records, validate_metadata
is the authoritative gate, no __post_init__ ceremony.
Tests
* Drop the tautological flat-CSV-vs-feature-dict and parquet-vs-arrow
schema tests; the construction path is by-CSV-header by definition,
the audit-sync test catches drift.
* Add test_kaggle_tree_block_is_present_in_release_readme — the
silent-failure guard a P1 review item flagged.
* Add test_validate_readme_substitution_flags_drift covering the
run-time validator.
* Add test_assembled_upload_dir_resolves_every_declared_resource —
asserts every declared resources[].path resolves to a real file
(not a symlink, not missing) under the assembled tree.
* Add test_assemble_upload_dir_rejects_kaggle_dir_equal_to_cwd —
was previously untested.
* Add test_render_metadata_emits_literal_unicode_not_escapes and
test_render_metadata_keywords_are_sorted_at_render_time.
* Add test_kaggle_cli_accepts_assembled_metadata — gated on the
optional kaggle SDK being installed; closes G11.3 with a real
external validator. Skipped locally; intended to run in any env
with kaggle installed.
* test_committed_kaggle_metadata_matches_fresh_regeneration now
carries positive content assertions (description has the right
sections, every flat CSV schema starts with split and ends with
converted_within_90_days, etc.) so the byte-equality check
cannot pass on degenerate output that we accidentally re-committed.
Acceptance:
python -m pytest -> 1199 passed, 1 skipped
ruff check . -> all checks passed
mypy leadforge/ scripts/{package_kaggle_release,generate_cover_image}.py -> ok
python scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65 -> exit 0 each tier
python scripts/verify_hash_determinism.py -> PASS 67/67
python scripts/package_kaggle_release.py --dry-run -> exit 0
BUNDLE_SCHEMA_VERSION unchanged at 5.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Contributor
Author
Self-review fixes appliedPulled this PR up for a brutal self-review and acted on the findings. Posting the review summary here so the audit trail lives on the PR rather than only in conversation. What changed in the review-fixes commitArchitecture
Metadata content
Validators
CLI / housekeeping
Tests
Acceptance after the fixes
What was deliberately not changed
|
CI on Linux failed test_committed_cover_matches_fresh_regeneration: the committed PNG was rendered on macOS, and Pillow + FreeType produce different glyph rasterisation across platforms (different FreeType versions, different font-hinting tables). The "byte-deterministic" claim was per-machine, not cross-platform. Replace the cross-OS sync test with a content-shape test that loads the committed PNG and asserts Kaggle's dimension floor + the canonical 1280x640 size. Per-machine byte determinism still tested via test_render_cover_is_byte_deterministic. The committed PNG is now documented as "one valid render", not a hash-locked artefact. Generator docstring updated with the cross-platform caveat next to the rendering code itself, so the limitation is visible at the source. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
pr-agent-context report: No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #71 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.Run metadata: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
First of two PRs in Phase 5 — Platform packaging of the v1 dataset release roadmap (
docs/release/v1_release_roadmap.md§"Phase 5 — PR 5.1"). Generates and validates the Kaggledataset-metadata.jsonplus the deterministic cover image; assembles a Kaggle-CLI-shaped upload directory underrelease/kaggle/as real-file copies. The actualkaggle datasets createupload lives in PR 7.2.Deliverables
scripts/package_kaggle_release.py(new)Reads each public tier's
manifest.json+feature_dictionary.csv+ flat CSV header underrelease/, emitsrelease/kaggle/dataset-metadata.jsonvalidated against G11.1:LeadForge: Synthetic B2B Lead Scoring (v1)— 42 chars)Three-tier synthetic CRM funnel for leakage-aware lead scoring— 62 chars)leadforge/leadforge-lead-scoring-v1, the<owner>/<slug>form Kaggle's actual schema requires — PR 7.2 doesn't have to splice in a username at upload time)licensesentry (MIT)expectedUpdateFrequency = "never"(G11.1 approved value list)resources[].schema.fieldslisted in column order for every tabular resource — flat CSVs from the feature dictionary, parquet files frompyarrow.parquet.read_schemaThe metadata's
descriptionfield inlinesrelease/README.md(the rewritten dataset card from PR 4.1) with three Kaggle-specific rewrites: source-repo tree → upload-tree,](../foo)→ GitHub blob URL via regex,](validation/validation_report.md)→ GitHub blob URL. Drift betweenKAGGLE_TREE_BLOCKand the README is caught by_validate_readme_substitution(run-time guard) and a static test (test_kaggle_tree_block_is_present_in_release_readme).JSON is rendered with
ensure_ascii=Falseso the inlined README text reads as literal em-dashes / × / smart-quotes in the committed file rather than–-style escapes.CLI:
--release-dir,--kaggle-dir,--tier(repeatable),--owner,--dataset-slug,--cover-image,--dry-run(validate + write metadata only; skip upload-dir assembly). Exit codes: 0 pass / 1 validation failure / 2 pre-flight error.scripts/generate_cover_image.py(new)Deterministic Pillow + DejaVu Sans (bundled with matplotlib) renderer producing
release/dataset-cover-image.pngat 1280×640 (well above Kaggle's 560×280 floor per G11.2, 2:1 aspect for Kaggle's header crop). Three tier cards surface the cross-seed median conversion rate + LR AUC pinned fromrelease/validation/validation_report.md.Cover-image source decision (roadmap §"Open questions" #2): auto-generated. Programmatic rendering is byte-deterministic across runs and guarded by the same audit-artifact-sync test pattern as PR 4.1's
scripts/audit_channel_signal.py.Upload-directory assembly
release/kaggle/is assembled as real-file copies of the bundle directories + cover image + LICENSE, plus a real-file copy forREADME.md(rewritten on the way in so its links resolve on the Kaggle dataset page)._validate_kaggle_dir_saferefuses to assemble intocwd/release_dir/ its parent / the filesystem anchor.release/kaggle/*is gitignored except fordataset-metadata.jsonitself — the upload tree is regenerated on demand fromrelease/{intro,intermediate,advanced}/, only the metadata is committed.Tests (24 active + 1 gated)
tests/scripts/test_package_kaggle_release.py× 20 — every Kaggle field constraint,iduser/slug enforcement, README rewriting (tree +../+ validation report),KAGGLE_TREE_BLOCKstatic + run-time guards, every-declared-resource-resolves, cwd / release_dir / parent rejection, byte-determinism, content-shape on the committed metadata, ensure_ascii=False, sorted-keywords-at-render. + 1 gated on the optionalkaggleSDK that closes G11.3 by feeding the assembled tree to Kaggle's actual metadata validator.tests/scripts/test_generate_cover_image.py× 4 — dimension floor + 2:1 aspect, PNG round-trip, byte-determinism across runs, committed-cover-matches-fresh-render sync check.Acceptance
python scripts/package_kaggle_release.py --dry-runrelease/kaggle/dataset-metadata.jsonpython -m pytestruff check .mypy leadforge/ scripts/{package_kaggle_release,generate_cover_image}.pypython scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65python scripts/verify_hash_determinism.pypython scripts/validate_release_candidate.py --no-rebuildBUNDLE_SCHEMA_VERSIONunchanged at 5 — this PR doesn't touch the bundle shape.Test plan
tests/scripts/test_{package_kaggle_release,generate_cover_image}.py) greenpython scripts/package_kaggle_release.py(default mode, no--dry-run) successfully assembles the upload tree underrelease/kaggle/as real-file copiesrelease/dataset-cover-image.pngand confirm the three-tier card layout renders the pinned conversion-rate + LR AUC literalsrelease/kaggle/dataset-metadata.json: 42 resources, every flat CSV + parquet resource hasschema.fieldsin column order,id = leadforge/leadforge-lead-scoring-v1, description has GitHub blob URLs (no](../leaks), unicode rendered literally (em-dashes, no–)python -m pytest tests/scripts/test_package_kaggle_release.py::test_kaggle_cli_accepts_assembled_metadatain an environment with the optionalkaggleSDK installed; expect pass (closes G11.3 against Kaggle's actual validator)🤖 Generated with Claude Code