chore(licensing): add per-module NOTICE-binary generation script and CI checks for detecting NOTICE-binary drifting #5417
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #5417 +/- ##
============================================
+ Coverage 52.42% 52.52% +0.09%
- Complexity 2481 2516 +35
============================================
Files 1070 1070
Lines 41359 42021 +662
Branches 4441 4613 +172
============================================
+ Hits 21682 22071 +389
- Misses 18406 18668 +262
- Partials 1271 1282 +11
Force-pushed from 5d36e76 to 7df03ad
Do we need this in v1.2? If so, please add the backport label. cc @xuang7
How do we review this PR? Put another way, how do we know the change is correct?
…ETA-INF

Adds bin/licensing/generate_notice_binary.py: walks each module's bundled jars, extracts every META-INF/NOTICE (and root-level NOTICE) file, skips first-party org.apache.texera.* jars, dedupes by content hash so jars from the same upstream collapse into one block, prepends the project's own root NOTICE, and emits one block per unique blob. Each block carries a heading derived from the longest common dotted prefix of its contributing jars' coordinates, the list of those jars, and the upstream NOTICE verbatim. Output is deterministic: CRLF->LF normalized and sorted by jar-count with a hash tiebreaker. Optional --extras appends non-jar attribution blocks (amber/NOTICE-binary-python carries the aiohttp + Matplotlib Python wheels, which ship no NOTICE inside any jar).

Replaces the 6 hand-curated per-module NOTICE-binary files (introduced in apache#4668) with the generator's output, so every distinct upstream attribution actually shipped is preserved verbatim and stays in sync with the deps.

CI (build.yml): after each dist is built and unzipped, a new step regenerates that module's NOTICE-binary against the dist lib/ dir and diffs it against the committed file, failing with a one-line fix-up command on drift. The amber check runs in the scala job; the five platform services are each checked in the per-service platform matrix job, alongside the existing LICENSE-binary check.

.licenserc.yaml: exclude **/NOTICE-binary-* from the license-header check (the extras manifest is plain text with no comment style), mirroring the existing **/LICENSE-binary-* exclusion.
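The extract, dedupe, and sort steps described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the code in `generate_notice_binary.py`: the function names, the accepted `NOTICE.txt` variants, the filename-prefix test for first-party jars, and the descending sort direction are all assumptions made for the example.

```python
import hashlib
import zipfile
from collections import defaultdict
from pathlib import Path

# Assumption: these entry names count as NOTICE files. The commit message only
# names META-INF/NOTICE and root-level NOTICE; the .txt variants are a guess.
NOTICE_NAMES = {"META-INF/NOTICE", "META-INF/NOTICE.txt", "NOTICE", "NOTICE.txt"}


def read_notices(jar_path):
    """Yield the newline-normalized text of each NOTICE entry in one jar."""
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if name in NOTICE_NAMES:
                text = jar.read(name).decode("utf-8", errors="replace")
                yield text.replace("\r\n", "\n").strip()


def cluster_by_content(lib_dir):
    """Return [(sha256, notice_text, [jar names])], one entry per unique notice."""
    jars_by_hash = defaultdict(set)
    text_by_hash = {}
    for jar_path in sorted(Path(lib_dir).glob("*.jar")):
        # Assumption: first-party jars are recognizable by filename prefix.
        if jar_path.name.startswith("org.apache.texera."):
            continue
        for text in read_notices(jar_path):
            if not text:
                continue
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            text_by_hash[digest] = text
            jars_by_hash[digest].add(jar_path.name)
    blocks = [(d, text_by_hash[d], sorted(jars)) for d, jars in jars_by_hash.items()]
    # Deterministic order: jar count first (descending is an assumption),
    # then the content hash as a tiebreaker.
    blocks.sort(key=lambda block: (-len(block[2]), block[0]))
    return blocks
```

Hashing the normalized text rather than the raw bytes means two jars whose notices differ only in line endings still collapse into one block.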
Force-pushed from 7df03ad to 5fd5b8d
I think this should be backported to release/v1.2, and it has been linked under the release preparation tracking issue. I second this question; I'm also curious about the expected review process for this PR.
Pull request overview
This PR automates generation of per-module NOTICE-binary files directly from the bundled jars’ embedded NOTICE metadata, and adds CI checks to prevent those committed NOTICE-binary files from drifting as dependencies change.
Changes:
- Added `bin/licensing/generate_notice_binary.py` to scan dist `lib/` jars, extract NOTICE content, de-duplicate, and emit deterministic per-module `NOTICE-binary` output (with optional `--extras`).
- Regenerated service `NOTICE-binary` files to reflect the actual bundled jars’ NOTICE content.
- Added GitHub Actions steps to regenerate + `diff` `NOTICE-binary` during CI, and updated `.licenserc.yaml` to exclude `NOTICE-binary-*` from header enforcement.
Reviewed changes
Copilot reviewed 5 out of 10 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `bin/licensing/generate_notice_binary.py` | New generator that extracts/dedupes jar NOTICE content and writes deterministic NOTICE-binary. |
| `.github/workflows/build.yml` | Adds CI drift-check steps that regenerate and diff NOTICE-binary outputs. |
| `.licenserc.yaml` | Excludes NOTICE-binary-* from license-header checks (needed for generated extras). |
| `amber/NOTICE-binary-extras` | Adds non-jar attribution blocks to append to amber’s generated NOTICE-binary. |
| `access-control-service/NOTICE-binary` | Regenerated from bundled jars’ embedded NOTICE content. |
| `config-service/NOTICE-binary` | Regenerated from bundled jars’ embedded NOTICE content. |
| `file-service/NOTICE-binary` | Regenerated from bundled jars’ embedded NOTICE content (large expansion reflecting actual jar notices). |
Yicong-Huang left a comment
I mainly reviewed the Python script and the CI changes. The NOTICE file changes are mostly generated output, which is hard for a human to verify.
Drops each block's "Bundled jars: ..." line and derives the heading from the contributing jars' version-less Maven coordinates (groupId.artifactId, read from each jar's META-INF/.../pom.properties, with a filename-stripping fallback) instead of the versioned filename. The verbatim upstream NOTICE content is unchanged. This keeps the NOTICE-binary stable across routine dependency version bumps: a bump that leaves the upstream NOTICE text unchanged now produces no diff, so the file only changes when a dependency is added or removed or its NOTICE content actually changes.
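As a rough illustration of the label resolution described above, the sketch below reads `groupId` and `artifactId` from a jar's `META-INF/maven/<groupId>/<artifactId>/pom.properties` and falls back to stripping the version from the filename. The function names and the version-suffix regex are assumptions for this example; the actual script may resolve labels differently. The heading helper follows the "longest common dotted prefix" wording from the earlier commit message.

```python
import re
import zipfile
from pathlib import Path, PurePosixPath

# Assumption: a trailing "-<digit>..." run in the filename is the version.
_VERSION_SUFFIX = re.compile(r"-\d[\w.\-+]*$")


def strip_version(jar_filename):
    """Filename fallback: drop the '.jar' extension and the trailing version."""
    stem = jar_filename[:-4] if jar_filename.endswith(".jar") else jar_filename
    return _VERSION_SUFFIX.sub("", stem)


def artifact_label(jar_path):
    """Return 'groupId.artifactId' from pom.properties, else the stripped filename."""
    with zipfile.ZipFile(jar_path) as jar:
        for name in sorted(jar.namelist()):
            parts = PurePosixPath(name).parts
            is_pom_props = (
                len(parts) == 5
                and parts[:2] == ("META-INF", "maven")
                and parts[-1] == "pom.properties"
            )
            if not is_pom_props:
                continue
            props = {}
            for line in jar.read(name).decode("utf-8", errors="replace").splitlines():
                if "=" in line and not line.startswith("#"):
                    key, _, value = line.partition("=")
                    props[key.strip()] = value.strip()
            if "groupId" in props and "artifactId" in props:
                return f'{props["groupId"]}.{props["artifactId"]}'
    return strip_version(Path(jar_path).name)


def common_dotted_prefix(labels):
    """Longest run of leading dot-separated segments shared by every label."""
    prefix = []
    for segments in zip(*(label.split(".") for label in labels)):
        if len(set(segments)) != 1:
            break
        prefix.append(segments[0])
    return ".".join(prefix)
```

Because neither the label nor the heading contains a version, bumping a dependency from one release to the next leaves the generated block unchanged as long as the upstream NOTICE text is the same.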
xuang7 left a comment
Overall, this PR looks good to me.
- build.yml: fold each module's NOTICE-binary drift-check into the adjacent LICENSE-binary check step (a single "check binary LICENSE + NOTICE" step for amber and for each platform service) rather than a separate step.
- generate_notice_binary.py: parse zip-internal entry paths with PurePosixPath instead of ad-hoc string splitting (zip entry names are always POSIX, so this is the correct and portable way), addressing the Windows-portability note.
- Add bin/licensing/test_generate_notice_binary.py: stdlib unittest covering version stripping, artifact-label resolution (pom.properties + filename fallback), NOTICE extraction / CRLF normalization, clustering and heading, and an end-to-end CLI run. Auto-discovered by the existing licensing-test CI step, mirroring test_check_binary_deps.py.
- amber/NOTICE-binary-python: replace the vague aiohttp note with the exact MIT license + copyright (Copyright (c) 2018 Fedor Indutny) of the vendored llhttp parser, taken verbatim from the wheel's licenses/vendor/llhttp/LICENSE, and regenerate amber/NOTICE-binary.
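The `PurePosixPath` point can be illustrated with a short sketch. Zip entry names use forward slashes on every platform, so a pure POSIX path parses them the same way on Windows and Linux. The helper names and the accepted `NOTICE.txt` variant below are assumptions for this example, not the script's actual API.

```python
from pathlib import PurePosixPath

# Assumption: these base names count as NOTICE files.
NOTICE_BASENAMES = {"NOTICE", "NOTICE.txt"}


def is_notice_entry(entry_name):
    """True for a NOTICE file at the archive root or directly under META-INF/."""
    path = PurePosixPath(entry_name)
    if path.name not in NOTICE_BASENAMES:
        return False
    parent_parts = path.parent.parts  # () for root-level entries
    return parent_parts == () or parent_parts == ("META-INF",)


def normalize_newlines(raw_bytes):
    """Decode a NOTICE entry and convert CRLF line endings to LF."""
    return raw_bytes.decode("utf-8", errors="replace").replace("\r\n", "\n")
```

A stdlib `unittest` case for helpers like these only needs literal entry names and byte strings, which is one reason this kind of logic is cheap to cover in a test file such as the one the commit adds.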
…CI checks for detecting NOTICE-binary drifting (#5417)

### What changes were proposed in this PR?

Auto-generates each module's `NOTICE-binary` from the third-party `META-INF/NOTICE` files in its bundled jars, replacing the hand-curated subsets introduced in #4668, and adds a CI drift-check so the committed files can never silently rot when dependencies change.

- **New generator, `bin/licensing/generate_notice_binary.py`:** walks a module's dist `lib/` dir, extracts every `META-INF/NOTICE` (and root-level `NOTICE`) from each bundled jar, skips first-party `org.apache.texera.*` jars, dedupes by content hash so jars sharing an upstream notice collapse into one block, prepends the project's own root `NOTICE`, and emits one block per unique notice with a synthesized heading + the contributing-jar list. Output is deterministic (CRLF→LF normalized, stably sorted by jar-count). An optional `--extras <file>` appends non-jar attributions.
- **`amber/NOTICE-binary-extras` (new):** the aiohttp + Matplotlib notices, which ship as Python wheels (not jars) and so can't be extracted from the `lib/` dir.
- **6 per-module `NOTICE-binary` files regenerated** from the actual bundled jars: `amber`, `access-control-service`, `config-service`, `file-service`, `computing-unit-managing-service`, `workflow-compiling-service`.
- **CI drift-check (`build.yml`):** after each dist is built and unzipped, a new step regenerates that module's `NOTICE-binary` and diffs it against the committed file, failing the build with a one-line fix-up command on any drift. The amber check runs in the scala job; the five platform services are each checked in the per-service `platform` matrix job, alongside the existing `LICENSE-binary` check.

`LICENSE-binary` stays hand-maintained (it needs human judgment on each license); only `NOTICE-binary`, a mechanical carry-forward of upstream notices, is generated. So future dep bumps fail CI with the exact command to regenerate, instead of silently drifting.

### Any related issues, documentation, discussions?

Closes #4674

ASF guidance: https://infra.apache.org/licensing-howto.html (Apache-2.0 §4(d)).

### How was this PR tested?

- Built all six module dists locally (`sbt <project>/Universal/stage`) and ran the generator against each freshly-built `lib/`; the committed `NOTICE-binary` files are byte-identical to the generator output, so the new CI drift-check passes for every module.
- Verified the existing `LICENSE-binary` checks (`check_binary_deps.py`, PR mode) still pass against the same libs for all six modules.
- `build.yml` validated as well-formed YAML.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8 (1M context)

---------

(backported from commit 27c1df4)

Co-authored-by: Bob Bai <bobbai0509@gmail.com>
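The drift-check described in this message boils down to "regenerate, compare with the committed file, fail with a hint". The sketch below shows that shape in Python. It is hypothetical: the real check lives in `build.yml` and may simply call `diff`, and `check_drift` and its parameters are names invented for this example. The fix-up command is passed in as an opaque string because the generator's exact invocation is not shown here.

```python
import difflib
import sys
from pathlib import Path


def check_drift(committed_path, regenerated_text, fixup_command):
    """Return 0 if the committed file matches the regenerated text, else 1.

    On mismatch, writes a unified diff and a one-line fix-up hint to stderr.
    """
    committed = Path(committed_path).read_text(encoding="utf-8")
    if committed == regenerated_text:
        return 0
    diff = difflib.unified_diff(
        committed.splitlines(keepends=True),
        regenerated_text.splitlines(keepends=True),
        fromfile=f"{committed_path} (committed)",
        tofile=f"{committed_path} (regenerated)",
    )
    sys.stderr.writelines(diff)
    sys.stderr.write(
        f"\nNOTICE-binary drift detected. To fix, run:\n  {fixup_command}\n"
    )
    return 1
```

A CI step would pass the return value to `sys.exit`, so a non-zero result fails the job. This only works because the generator output is deterministic; otherwise the comparison would flag spurious drift.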
|
Backport to |
### What changes were proposed in this PR?

Auto-generates each module's `NOTICE-binary` from the third-party `META-INF/NOTICE` files in its bundled jars, replacing the hand-curated subsets introduced in #4668, and adds a CI drift-check so the committed files can never silently rot when dependencies change.

- **New generator, `bin/licensing/generate_notice_binary.py`:** walks a module's dist `lib/` dir, extracts every `META-INF/NOTICE` (and root-level `NOTICE`) from each bundled jar, skips first-party `org.apache.texera.*` jars, dedupes by content hash so jars sharing an upstream notice collapse into one block, prepends the project's own root `NOTICE`, and emits one block per unique notice with a synthesized heading + the contributing-jar list. Output is deterministic (CRLF→LF normalized, stably sorted by jar-count). An optional `--extras <file>` appends non-jar attributions.
- **`amber/NOTICE-binary-extras` (new):** the aiohttp + Matplotlib notices, which ship as Python wheels (not jars) and so can't be extracted from the `lib/` dir.
- **6 per-module `NOTICE-binary` files regenerated** from the actual bundled jars: `amber`, `access-control-service`, `config-service`, `file-service`, `computing-unit-managing-service`, `workflow-compiling-service`.
- **CI drift-check (`build.yml`):** after each dist is built and unzipped, a new step regenerates that module's `NOTICE-binary` and diffs it against the committed file, failing the build with a one-line fix-up command on any drift. The amber check runs in the scala job; the five platform services are each checked in the per-service `platform` matrix job, alongside the existing `LICENSE-binary` check.

`LICENSE-binary` stays hand-maintained (it needs human judgment on each license); only `NOTICE-binary`, a mechanical carry-forward of upstream notices, is generated. So future dep bumps fail CI with the exact command to regenerate, instead of silently drifting.

### Any related issues, documentation, discussions?

Closes #4674

ASF guidance: https://infra.apache.org/licensing-howto.html (Apache-2.0 §4(d)).

### How was this PR tested?

- Built all six module dists locally (`sbt <project>/Universal/stage`) and ran the generator against each freshly-built `lib/`; the committed `NOTICE-binary` files are byte-identical to the generator output, so the new CI drift-check passes for every module.
- Verified the existing `LICENSE-binary` checks (`check_binary_deps.py`, PR mode) still pass against the same libs for all six modules.
- `build.yml` validated as well-formed YAML.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8 (1M context)