
chore(licensing): add per-module NOTICE-binary generation script and CI checks for detecting NOTICE-binary drifting #5417

Merged
Yicong-Huang merged 6 commits into apache:main from bobbai00:feat/auto-generate-notice-binary on Jun 10, 2026

Conversation

@bobbai00 bobbai00 commented Jun 6, 2026

Contributor

What changes were proposed in this PR?

Auto-generates each module's NOTICE-binary from the third-party META-INF/NOTICE files in its bundled jars — replacing the hand-curated subsets introduced in #4668 — and adds a CI drift-check so the committed files can never silently rot when dependencies change.

  • New generator — bin/licensing/generate_notice_binary.py: walks a module's dist lib/ dir, extracts every META-INF/NOTICE (and root-level NOTICE) from each bundled jar, skips first-party org.apache.texera.* jars, dedupes by content hash so jars sharing an upstream notice collapse into one block, prepends the project's own root NOTICE, and emits one block per unique notice with a synthesized heading + the contributing-jar list. Output is deterministic (CRLF→LF normalized, stably sorted by jar-count). An optional --extras <file> appends non-jar attributions.
  • amber/NOTICE-binary-extras (new): the aiohttp + Matplotlib notices, which ship as Python wheels (not jars) and so can't be extracted from the lib/ dir.
  • 6 per-module NOTICE-binary files regenerated from the actual bundled jars: amber, access-control-service, config-service, file-service, computing-unit-managing-service, workflow-compiling-service.
  • CI drift-check (build.yml): after each dist is built and unzipped, a new step regenerates that module's NOTICE-binary and diffs it against the committed file, failing the build with a one-line fix-up command on any drift. The amber check runs in the scala job; the five platform services are each checked in the per-service platform matrix job, alongside the existing LICENSE-binary check.
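The extract-and-dedupe steps in the generator bullet above can be illustrated with a minimal sketch. This is not the code in `generate_notice_binary.py`; the function names (`read_notices`, `cluster_by_content`), the choice of SHA-256, and the rule of detecting first-party jars by file-name prefix are all assumptions made for illustration. Only entries named exactly `NOTICE` are matched; variants such as `NOTICE.txt` are deliberately not handled here.

```python
import hashlib
import zipfile
from collections import defaultdict
from pathlib import Path, PurePosixPath


def read_notices(jar_path):
    """Return the normalized text of each NOTICE entry at the jar root or under META-INF/."""
    allowed_parents = (PurePosixPath("."), PurePosixPath("META-INF"))
    notices = []
    with zipfile.ZipFile(jar_path) as jar:
        for name in sorted(jar.namelist()):
            entry = PurePosixPath(name)  # zip entry names always use "/"
            if entry.name != "NOTICE" or entry.parent not in allowed_parents:
                continue
            text = jar.read(name).decode("utf-8", errors="replace")
            # Normalize CRLF to LF so identical notices hash identically.
            notices.append(text.replace("\r\n", "\n").strip())
    return notices


def cluster_by_content(lib_dir, skip_prefix="org.apache.texera."):
    """Group jar file names by the hash of their NOTICE text.

    Assumption: first-party jars are recognizable by a file-name prefix.
    """
    clusters = defaultdict(list)  # sha256 hex digest -> jar file names
    texts = {}                    # sha256 hex digest -> notice text
    for jar_path in sorted(Path(lib_dir).glob("*.jar")):
        if jar_path.name.startswith(skip_prefix):
            continue
        for text in read_notices(jar_path):
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            clusters[digest].append(jar_path.name)
            texts[digest] = text
    return clusters, texts
```

Iterating over sorted paths and sorted entry names keeps the result independent of filesystem and archive ordering, which is one way to get the deterministic output the description calls for.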

LICENSE-binary stays hand-maintained (it needs human judgment on each license); only NOTICE-binary — a mechanical carry-forward of upstream notices — is generated. So future dep bumps fail CI with the exact command to regenerate, instead of silently drifting.
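A minimal sketch of the regenerate-and-diff idea behind the drift-check, assuming the regenerated text is already available as a string. The helper name `check_drift` and the `fixup_command` parameter are hypothetical; the actual CI step lives in `build.yml` and may be implemented quite differently (for example with a shell `diff`).

```python
import difflib
from pathlib import Path


def check_drift(committed_path, generated_text, fixup_command):
    """Compare the committed file with freshly generated text.

    Returns (ok, message). On drift, message holds a unified diff followed
    by the command the contributor should run to regenerate the file.
    """
    committed = Path(committed_path).read_text(encoding="utf-8")
    if committed == generated_text:
        return True, ""
    diff = difflib.unified_diff(
        committed.splitlines(keepends=True),
        generated_text.splitlines(keepends=True),
        fromfile=f"{committed_path} (committed)",
        tofile=f"{committed_path} (regenerated)",
    )
    hint = f"\nNOTICE-binary drift detected. To fix, run:\n  {fixup_command}\n"
    return False, "".join(diff) + hint
```

A CI wrapper would print the message and exit non-zero when `ok` is false, so the build fails with the fix-up command visible in the log.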

Any related issues, documentation, discussions?

Closes #4674

ASF guidance: https://infra.apache.org/licensing-howto.html (Apache-2.0 §4(d)).

How was this PR tested?

  • Built all six module dists locally (sbt <project>/Universal/stage) and ran the generator against each freshly-built lib/; the committed NOTICE-binary files are byte-identical to the generator output, so the new CI drift-check passes for every module.
  • Verified the existing LICENSE-binary checks (check_binary_deps.py, PR mode) still pass against the same libs for all six modules.
  • build.yml validated as well-formed YAML.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8 (1M context)

@github-actions github-actions Bot added feature pyamber ci changes related to CI infra platform Non-amber Scala service paths labels Jun 6, 2026
@codecov-commenter

codecov-commenter commented Jun 6, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.52%. Comparing base (cd60535) to head (7319ca0).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5417      +/-   ##
============================================
+ Coverage     52.42%   52.52%   +0.09%     
- Complexity     2481     2516      +35     
============================================
  Files          1070     1070              
  Lines         41359    42021     +662     
  Branches       4441     4613     +172     
============================================
+ Hits          21682    22071     +389     
- Misses        18406    18668     +262     
- Partials       1271     1282      +11     
| Flag | Coverage Δ | *Carryforward flag |
|---|---|---|
| access-control-service | 64.61% <ø> (ø) | |
| agent-service | 33.76% <ø> (ø) | |
| amber | 53.45% <ø> (+0.17%) ⬆️ | |
| computing-unit-managing-service | 1.65% <ø> (ø) | |
| config-service | 57.97% <ø> (+1.91%) ⬆️ | |
| file-service | 38.21% <ø> (ø) | |
| frontend | 47.09% <ø> (+0.06%) ⬆️ | |
| pyamber | 90.75% <ø> (+0.03%) ⬆️ | |
| python | 90.75% <ø> (ø) | Carriedforward from 75b4619 |
| workflow-compiling-service | 58.69% <ø> (ø) | |

*This pull request uses carry forward flags.

@bobbai00 bobbai00 force-pushed the feat/auto-generate-notice-binary branch from 5d36e76 to 7df03ad Compare June 6, 2026 23:09
@Yicong-Huang

Contributor

Do we need this in v1.2? If so, please add a backport.

cc @xuang7

@bobbai00 bobbai00 requested review from Yicong-Huang and xuang7 June 6, 2026 23:47
@bobbai00 bobbai00 added release/v1.2 back porting to release/v1.2 and removed feature pyamber platform Non-amber Scala service paths labels Jun 6, 2026
@Yicong-Huang

Contributor

How do we review this PR? Put another way, how do we know if the change is correct?

…ETA-INF

Adds bin/licensing/generate_notice_binary.py: walks each module's bundled
jars, extracts every META-INF/NOTICE (and root-level NOTICE) file, skips
first-party org.apache.texera.* jars, dedupes by content hash so jars from
the same upstream collapse into one block, prepends the project's own root
NOTICE, and emits one block per unique blob. Each block carries a heading
derived from the longest common dotted prefix of its contributing jars'
coordinates, the list of those jars, and the upstream NOTICE verbatim.
Output is deterministic: CRLF->LF normalized and sorted by jar-count with a
hash tiebreaker. Optional --extras appends non-jar attribution blocks
(amber/NOTICE-binary-python carries the aiohttp + Matplotlib python wheels,
which ship no NOTICE inside any jar).

Replaces the 6 hand-curated per-module NOTICE-binary files (introduced in
apache#4668) with the generator's output, so every distinct upstream attribution
actually shipped is preserved verbatim and stays in sync with the deps.

CI (build.yml): after each dist is built and unzipped, a new step
regenerates that module's NOTICE-binary against the dist lib/ dir and diffs
it against the committed file, failing with a one-line fix-up command on
drift. The amber check runs in the scala job; the five platform services
are each checked in the per-service platform matrix job, alongside the
existing LICENSE-binary check.

.licenserc.yaml: exclude **/NOTICE-binary-* from the license-header check
(the extras manifest is plain text with no comment style), mirroring the
existing **/LICENSE-binary-* exclusion.
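The heading derivation and block ordering described in this commit message could be sketched as follows. Both helper names are hypothetical, and sorting by descending jar count is an assumption: the message only says blocks are sorted by jar-count with a hash tiebreaker, without stating the direction.

```python
from itertools import takewhile


def common_dotted_prefix(coordinates):
    """Longest prefix, in whole dot-separated segments, shared by all coordinates."""
    split = [c.split(".") for c in coordinates]
    # zip(*split) walks the segments position by position; stop at the first mismatch.
    shared = takewhile(lambda segs: len(set(segs)) == 1, zip(*split))
    return ".".join(segs[0] for segs in shared)


def order_blocks(clusters):
    """Order {content_hash: [jars]} by jar count (largest first), then by hash.

    The hash tiebreaker makes the order total, so repeated runs over the
    same inputs always emit blocks in the same sequence.
    """
    return sorted(clusters.items(), key=lambda item: (-len(item[1]), item[0]))
```

For example, `common_dotted_prefix(["org.apache.commons.commons-lang3", "org.apache.commons.commons-text"])` yields `org.apache.commons`, while a single coordinate is returned unchanged.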
@bobbai00 bobbai00 force-pushed the feat/auto-generate-notice-binary branch from 7df03ad to 5fd5b8d Compare June 6, 2026 23:56
@github-actions github-actions Bot added feature pyamber platform Non-amber Scala service paths labels Jun 6, 2026
@xuang7

xuang7 commented Jun 6, 2026

Contributor

> How do we review this PR? Put another way, how do we know if the change is correct?

I think this should be backported to release/v1.2, and it has been linked under the release preparation tracking issue. I second this question; I'm also curious about the expected review process for this PR.

Copilot AI left a comment

Contributor


Pull request overview

This PR automates generation of per-module NOTICE-binary files directly from the bundled jars’ embedded NOTICE metadata, and adds CI checks to prevent those committed NOTICE-binary files from drifting as dependencies change.

Changes:

  • Added bin/licensing/generate_notice_binary.py to scan dist lib/ jars, extract NOTICE content, de-duplicate, and emit deterministic per-module NOTICE-binary output (with optional --extras).
  • Regenerated service NOTICE-binary files to reflect the actual bundled jars’ NOTICE content.
  • Added GitHub Actions steps to regenerate + diff NOTICE-binary during CI, and updated .licenserc.yaml to exclude NOTICE-binary-* from header enforcement.

Reviewed changes

Copilot reviewed 5 out of 10 changed files in this pull request and generated no comments.

Show a summary per file
| File | Description |
|---|---|
| bin/licensing/generate_notice_binary.py | New generator that extracts/dedupes jar NOTICE content and writes deterministic NOTICE-binary. |
| .github/workflows/build.yml | Adds CI drift-check steps that regenerate and diff NOTICE-binary outputs. |
| .licenserc.yaml | Excludes NOTICE-binary-* from license-header checks (needed for generated extras). |
| amber/NOTICE-binary-extras | Adds non-jar attribution blocks to append to amber’s generated NOTICE-binary. |
| access-control-service/NOTICE-binary | Regenerated from bundled jars’ embedded NOTICE content. |
| config-service/NOTICE-binary | Regenerated from bundled jars’ embedded NOTICE content. |
| file-service/NOTICE-binary | Regenerated from bundled jars’ embedded NOTICE content (large expansion reflecting actual jar notices). |


@bobbai00 bobbai00 changed the title chore(licensing): auto-generate per-module NOTICE-binary from jars' META-INF/NOTICE chore(licensing): add per-module NOTICE-binary generation script and CI checks for detecting NOTICE-binary drifting Jun 7, 2026

@Yicong-Huang Yicong-Huang left a comment

Contributor


I mainly reviewed the Python script and the CI change. The NOTICE file changes are mostly generated and hard for a human to verify.

Comment thread .github/workflows/build.yml Outdated
Comment thread .github/workflows/build.yml Outdated
Comment thread bin/licensing/generate_notice_binary.py
Comment thread bin/licensing/generate_notice_binary.py
Drops each block's "Bundled jars: ..." line and derives the heading from
the contributing jars' version-less Maven coordinates (groupId.artifactId,
read from each jar's META-INF/.../pom.properties, with a filename-stripping
fallback) instead of the versioned filename. The verbatim upstream NOTICE
content is unchanged.

This keeps the NOTICE-binary stable across routine dependency version bumps:
a bump that leaves the upstream NOTICE text unchanged now produces no diff,
so the file only changes when a dependency is added or removed or its NOTICE
content actually changes.
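A minimal sketch of how a version-less label might be resolved, assuming the standard Maven layout `META-INF/maven/<groupId>/<artifactId>/pom.properties`. The names `artifact_label` and `strip_version`, the version regex, and the choice to take the first `pom.properties` in sorted order (relevant for shaded jars that embed several) are illustrative assumptions, not the script's actual logic.

```python
import re
import zipfile
from pathlib import Path, PurePosixPath

# Assumption: the version starts at the first "-" that is followed by a digit.
_VERSION_SUFFIX = re.compile(r"-\d[\w.+-]*$")


def strip_version(jar_filename):
    """Fallback label: drop '.jar' and a trailing '-<version>' from the file name."""
    stem = jar_filename[:-4] if jar_filename.endswith(".jar") else jar_filename
    return _VERSION_SUFFIX.sub("", stem)


def artifact_label(jar_path):
    """Return 'groupId.artifactId' from pom.properties, else the stripped file name."""
    with zipfile.ZipFile(jar_path) as jar:
        for name in sorted(jar.namelist()):
            entry = PurePosixPath(name)
            if entry.name != "pom.properties" or entry.parts[:2] != ("META-INF", "maven"):
                continue
            props = {}
            for line in jar.read(name).decode("utf-8", errors="replace").splitlines():
                if "=" in line and not line.startswith("#"):
                    key, _, value = line.partition("=")
                    props[key.strip()] = value.strip()
            if props.get("groupId") and props.get("artifactId"):
                return f"{props['groupId']}.{props['artifactId']}"
    return strip_version(Path(jar_path).name)
```

Because neither path includes the version, bumping a dependency from one release to the next leaves the label unchanged. The filename heuristic is imperfect: an artifact whose name itself contains a `-<digit>` segment would be truncated too early, which is one reason to prefer `pom.properties` when it is present.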

@xuang7 xuang7 left a comment

Contributor


Overall, this PR looks good to me.

Comment thread amber/NOTICE-binary-python Outdated
Bob Bai and others added 2 commits June 7, 2026 01:08
- build.yml: fold each module's NOTICE-binary drift-check into the adjacent
  LICENSE-binary check step (a single "check binary LICENSE + NOTICE" step for
  amber and for each platform service) rather than a separate step.
- generate_notice_binary.py: parse zip-internal entry paths with PurePosixPath
  instead of ad-hoc string splitting (zip entry names are always POSIX, so this
  is the correct and portable way), addressing the Windows-portability note.
- Add bin/licensing/test_generate_notice_binary.py: stdlib unittest covering
  version stripping, artifact-label resolution (pom.properties + filename
  fallback), NOTICE extraction / CRLF normalization, clustering and heading,
  and an end-to-end CLI run. Auto-discovered by the existing licensing-test CI
  step, mirroring test_check_binary_deps.py.
- amber/NOTICE-binary-python: replace the vague aiohttp note with the exact MIT
  license + copyright (Copyright (c) 2018 Fedor Indutny) of the vendored llhttp
  parser, taken verbatim from the wheel's licenses/vendor/llhttp/LICENSE, and
  regenerate amber/NOTICE-binary.
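To illustrate the `PurePosixPath` point and the stdlib `unittest` style mentioned in this commit message, here is a small self-contained sketch. The predicate `is_notice_entry` and the test class are hypothetical and are not taken from `test_generate_notice_binary.py`.

```python
import unittest
from pathlib import PurePosixPath


def is_notice_entry(entry_name):
    """True for a NOTICE file at the jar root or directly under META-INF/.

    PurePosixPath always splits on "/", so the result is the same on
    Windows, where os.path or Path would apply platform-specific rules.
    """
    path = PurePosixPath(entry_name)
    return path.name == "NOTICE" and path.parts[:-1] in ((), ("META-INF",))


class IsNoticeEntryTest(unittest.TestCase):
    def test_accepts_root_and_meta_inf(self):
        self.assertTrue(is_notice_entry("NOTICE"))
        self.assertTrue(is_notice_entry("META-INF/NOTICE"))

    def test_rejects_nested_and_other_names(self):
        self.assertFalse(is_notice_entry("META-INF/maven/NOTICE"))
        self.assertFalse(is_notice_entry("META-INF/LICENSE"))
        self.assertFalse(is_notice_entry("META-INF/"))
```

Saved as a module, the tests can be run with `python -m unittest <module>`, which needs no third-party test runner.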
@bobbai00 bobbai00 added this pull request to the merge queue Jun 7, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 7, 2026
@bobbai00 bobbai00 added this pull request to the merge queue Jun 7, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 7, 2026
@bobbai00 bobbai00 enabled auto-merge June 10, 2026 09:30
@bobbai00 bobbai00 added this pull request to the merge queue Jun 10, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks Jun 10, 2026
@xuang7 xuang7 added this pull request to the merge queue Jun 10, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks Jun 10, 2026
@Yicong-Huang Yicong-Huang added this pull request to the merge queue Jun 10, 2026
Merged via the queue into apache:main with commit 27c1df4 Jun 10, 2026
40 checks passed
Yicong-Huang pushed a commit that referenced this pull request Jun 10, 2026
…CI checks for detecting NOTICE-binary drifting (#5417)

### What changes were proposed in this PR?

Auto-generates each module's `NOTICE-binary` from the third-party
`META-INF/NOTICE` files in its bundled jars — replacing the hand-curated
subsets introduced in #4668 — and adds a CI drift-check so the committed
files can never silently rot when dependencies change.

- **New generator — `bin/licensing/generate_notice_binary.py`:** walks a
module's dist `lib/` dir, extracts every `META-INF/NOTICE` (and
root-level `NOTICE`) from each bundled jar, skips first-party
`org.apache.texera.*` jars, dedupes by content hash so jars sharing an
upstream notice collapse into one block, prepends the project's own root
`NOTICE`, and emits one block per unique notice with a synthesized
heading + the contributing-jar list. Output is deterministic (CRLF→LF
normalized, stably sorted by jar-count). An optional `--extras <file>`
appends non-jar attributions.
- **`amber/NOTICE-binary-extras` (new):** the aiohttp + Matplotlib
notices, which ship as Python wheels (not jars) and so can't be
extracted from the `lib/` dir.
- **6 per-module `NOTICE-binary` files regenerated** from the actual
bundled jars: `amber`, `access-control-service`, `config-service`,
`file-service`, `computing-unit-managing-service`,
`workflow-compiling-service`.
- **CI drift-check (`build.yml`):** after each dist is built and
unzipped, a new step regenerates that module's `NOTICE-binary` and diffs
it against the committed file, failing the build with a one-line fix-up
command on any drift. The amber check runs in the scala job; the five
platform services are each checked in the per-service `platform` matrix
job, alongside the existing `LICENSE-binary` check.

`LICENSE-binary` stays hand-maintained (it needs human judgment on each
license); only `NOTICE-binary` — a mechanical carry-forward of upstream
notices — is generated. So future dep bumps fail CI with the exact
command to regenerate, instead of silently drifting.

### Any related issues, documentation, discussions?

Closes #4674

ASF guidance: https://infra.apache.org/licensing-howto.html (Apache-2.0
§4(d)).

### How was this PR tested?

- Built all six module dists locally (`sbt <project>/Universal/stage`)
and ran the generator against each freshly-built `lib/`; the committed
`NOTICE-binary` files are byte-identical to the generator output, so the
new CI drift-check passes for every module.
- Verified the existing `LICENSE-binary` checks (`check_binary_deps.py`,
PR mode) still pass against the same libs for all six modules.
- `build.yml` validated as well-formed YAML.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8 (1M context)

---------

(backported from commit 27c1df4)

Co-authored-by: Bob Bai <bobbai0509@gmail.com>
@github-actions

Contributor

Backport to release/v1.2 succeeded as 1f84101. Run

ELin2025 pushed a commit to ELin2025/texera that referenced this pull request Jun 16, 2026
yangzhang75 pushed a commit to yangzhang75/texera that referenced this pull request Jun 22, 2026

Labels

ci (changes related to CI), feature, infra, platform (Non-amber Scala service paths), pyamber, release/v1.2 (back porting to release/v1.2)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Auto-generate per-module NOTICE-binary from jars' META-INF/NOTICE

5 participants