chore(licensing): per-module LICENSE/NOTICE binaries with per-image concat#4668
Merged
…oncat

Splits the monolithic root LICENSE-binary / NOTICE-binary into per-module ground-truth files, one set per buildable module: each standalone Scala service, amber (java + python split), frontend, and agent-service. The root files are kept as-is for the source distribution.

For each Docker image, the dockerfile now copies only the per-module file(s) relevant to what the image actually bundles. Multi-aspect images (texera-web-application, computing-unit-master, computing-unit-worker) merge their inputs into one /texera/LICENSE at build time via a new bin/licensing/concat_license_binary.py, joining at the license-group level so that, for example, Apache-2.0 contains both Scala/Java jars and Python packages inline rather than the inputs being stacked end-to-end.

CI: the four existing check_binary_deps.py points (frontend npm, scala jar, python, agent-npm) now build the same combined LICENSE-binary from all per-module files and pass it via --license-binary, so the per-module files become the authoritative claim source for dep validation.

Per-module entry counts were derived by enumerating each container's bundled jars / pip-listed Python packages / node_modules and filtering the root LICENSE-binary down to entries that match. No new entries were invented; combined ⊆ root strictly.

Closes apache#4667

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #4668 +/- ##
============================================
+ Coverage 46.16% 46.75% +0.58%
- Complexity 1996 2067 +71
============================================
Files 1013 1013
Lines 38165 39571 +1406
Branches 3712 3852 +140
============================================
+ Hits 17618 18500 +882
- Misses 19774 20285 +511
- Partials 773 786 +13
The CDDL group has two sub-license sections (CDDL 1.0 and CDDL 1.1), each with its own "Scala/Java jars:" subsection. The previous merge keyed subsections by header alone, so the second "Scala/Java jars:" (CDDL 1.1) overwrote the first (CDDL 1.0), losing all 22 CDDL-1.0 jars (javax.*, jersey-2.25.1, hk2-2.5.0-b32 family).

Key subsections by (sub_license, header) tuple instead, and on emit print each sub-license heading once whenever the marker changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
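The overwrite described in this commit message can be reproduced in a few lines. The sketch below is not the code from `concat_license_binary.py`; the function names, the `(sub_license, header, entries)` input shape, and the jar names are all hypothetical, chosen only to show why a header-only key loses data and a tuple key does not.

```python
# Hypothetical parsed input: one tuple per subsection of the CDDL group.
subsections = [
    ("CDDL 1.0", "Scala/Java jars:", ["example-a-1.0.jar"]),
    ("CDDL 1.1", "Scala/Java jars:", ["example-b-1.0.jar"]),
]


def merge_keyed_by_header(subsections):
    """Buggy variant: the key ignores the sub-license, so headers collide."""
    merged = {}
    for _sub_license, header, entries in subsections:
        merged[header] = list(entries)  # second "Scala/Java jars:" replaces the first
    return merged


def merge_keyed_by_pair(subsections):
    """Fixed variant: key on (sub_license, header) so both subsections survive."""
    merged = {}
    for sub_license, header, entries in subsections:
        merged.setdefault((sub_license, header), []).extend(entries)
    return merged


def emit(merged):
    """Print each sub-license heading once, whenever the marker changes."""
    lines, last_sub_license = [], None
    for (sub_license, header), entries in merged.items():
        if sub_license != last_sub_license:
            lines.append(sub_license)
            last_sub_license = sub_license
        lines.append(header)
        lines.extend(f"  - {entry}" for entry in entries)
    return lines


buggy = merge_keyed_by_header(subsections)
fixed = merge_keyed_by_pair(subsections)
emitted = emit(fixed)
```

With the header-only key, `buggy` holds a single subsection containing only the CDDL 1.1 entry; with the tuple key, `fixed` keeps both.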
…at in checker
The per-module LICENSE-binary and NOTICE-binary files now fully
describe each Docker image's bundled third-party content, so the root
LICENSE-binary and NOTICE-binary are dead code:

- All dockerfiles ship the per-module file (or merged combination)
  as /texera/LICENSE; none reference root.
- check_binary_deps.py now auto-builds a combined LICENSE-binary
  from the per-module files via concat_license_binary.py when
  --license-binary is omitted.
- Source tarball still ships LICENSE and NOTICE (the source-
  distribution variants), which is what ASF requires; the -binary
  variants describe binary content and aren't required for source.

Updates AddMetaInfLicenseFiles.distMappings to take per-module
LICENSE-binary and NOTICE-binary paths (each service's build.sbt
passes its own); amber passes LICENSE-binary-java since the
Universal dist zip is jar-only.

Simplifies build.yml: drops the explicit concat steps before each
check_binary_deps.py invocation since the tool auto-handles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…icense-header check

The skywalking-eyes license-header check fails on amber/LICENSE-binary-java and amber/LICENSE-binary-python because they're plain-text manifests with no comment-style and no Apache header (just like the existing root LICENSE-binary entry already handles).

Replace the now-deleted root LICENSE-binary/NOTICE-binary entries with glob patterns covering the per-module files: **/LICENSE-binary, **/LICENSE-binary-*, **/NOTICE-binary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ce2903f to
b016f69
Compare
This was referenced May 2, 2026
Yicong-Huang
approved these changes
May 2, 2026
Yicong-Huang
left a comment
Contributor
LGTM! Thanks for the effort!
github-actions Bot
pushed a commit
that referenced
this pull request
May 2, 2026
…oncat (#4668)

### What changes were proposed in this PR?

Splits the monolithic root `LICENSE-binary` and `NOTICE-binary` into per-module ground-truth files so each Docker image's `/texera/LICENSE` describes only the third-party components actually bundled in that image, per ASF licensing guidance.

**Per-module files added** (root files kept unchanged for the source tarball):

| Path | Contents |
|---|---|
| `access-control-service/LICENSE-binary` + `NOTICE-binary` | 113 jars / 18 NOTICE blocks |
| `config-service/LICENSE-binary` + `NOTICE-binary` | 115 jars / 18 NOTICE blocks |
| `file-service/LICENSE-binary` + `NOTICE-binary` | 310 jars / 25 NOTICE blocks |
| `workflow-compiling-service/LICENSE-binary` + `NOTICE-binary` | 319 jars / 26 NOTICE blocks |
| `computing-unit-managing-service/LICENSE-binary` + `NOTICE-binary` | 349 jars / 26 NOTICE blocks (only image bundling Bouncy Castle) |
| `amber/LICENSE-binary-java` + `NOTICE-binary` | 404 jars / 27 NOTICE blocks (`WorkflowExecutionService`, shared by web/master/runner) |
| `amber/LICENSE-binary-python` | 113 packages (master/runner only) |
| `frontend/LICENSE-binary` | 114 npm packages (Angular bundle, dashboard image) |
| `agent-service/LICENSE-binary` | 57 npm packages |

Counts were derived by enumerating each container's actual bundled jars (`ls /texera/lib/`), pip-listed Python packages, and `node_modules` (recursively, including `@scope/name` packages), then filtering the root `LICENSE-binary` down. No new entries were invented; `combined ⊆ root` strictly.

**New script** (`bin/licensing/concat_license_binary.py`):

- Style-matched to the existing `audit_jar_licenses.py` / `check_binary_deps.py`.
- Merges multiple per-module LICENSE-binary files at the **license-group level**: each Apache-2.0 / MIT / BSD / ... section in the output contains all the ecosystem subsections (`Scala/Java jars:`, `Python packages:`, `Angular / npm packages:`, `Agent service npm packages:`, `Source files derived from ...`) inline, rather than stacking the inputs end-to-end.
- Reuses the Apache-2.0 license header verbatim, deduplicates entries by id, emits a single trailer.

**Dockerfile updates** (9 dockerfiles):

- 5 standalone Scala services + agent-service: copy only their own per-module `LICENSE-binary` (and `NOTICE-binary` for the Scala ones) into `/texera/LICENSE` and `/texera/NOTICE`.
- 3 multi-aspect images run `concat_license_binary.py` at build time:
  - `computing-unit-master` and `computing-unit-worker`: union `amber/LICENSE-binary-java` + `amber/LICENSE-binary-python`.
  - `texera-web-application`: union `amber/LICENSE-binary-java` + `frontend/LICENSE-binary` (cross-stage `COPY --from=build-frontend`).
- `python3-minimal` added to the Scala build stage of the 3 multi-aspect dockerfiles to run the concat script.

**CI** (`.github/workflows/build.yml`, the workflow `required-checks.yml` orchestrates):

- The four existing `check_binary_deps.py` invocations (frontend npm, scala jar, python, agent-npm) now build a fresh combined LICENSE-binary from all 9 per-module files via `concat_license_binary.py /tmp/combined-LICENSE-binary …`, then pass `--license-binary /tmp/combined-LICENSE-binary` to the existing tooling. The per-module files become the authoritative claim source for dep validation.

### Any related issues, documentation, discussions?

Closes #4667

### How was this PR tested?

My personal fork: https://github.com/bobbai00/texera/actions/runs/25248183326

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

(backported from commit cfe472e)
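To make the "license-group level" merge concrete, here is a minimal sketch of the idea. It is not the code from `concat_license_binary.py`: the nested-dict input shape, the function name `merge_license_groups`, and every entry name are assumptions made for illustration, and the real script also handles the license header and trailer, which this sketch leaves out.

```python
def merge_license_groups(modules):
    """Union several parsed per-module files, grouping by license first.

    Each module is assumed to be {license_group: {subsection_header: [entries]}}.
    Groups and subsections keep first-seen order; entries are de-duplicated.
    """
    merged = {}
    for module in modules:
        for group, subsections in module.items():
            group_out = merged.setdefault(group, {})
            for header, entries in subsections.items():
                bucket = group_out.setdefault(header, [])
                for entry in entries:
                    if entry not in bucket:  # de-duplicate by entry id
                        bucket.append(entry)
    return merged


# Hypothetical stand-ins for amber/LICENSE-binary-java and -python.
java_claims = {
    "Apache-2.0": {"Scala/Java jars:": ["lib-a-1.0.jar", "lib-b-2.0.jar"]},
    "MIT": {"Scala/Java jars:": ["lib-c-0.3.jar"]},
}
python_claims = {
    "Apache-2.0": {"Python packages:": ["pkg-x 1.2"]},
    "BSD": {"Python packages:": ["pkg-y 4.0"]},
}

combined = merge_license_groups([java_claims, python_claims])
```

In `combined`, the single `Apache-2.0` group carries both the `Scala/Java jars:` and `Python packages:` subsections, which is the difference from stacking the two files end-to-end, where `Apache-2.0` would appear twice.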
Yicong-Huang
pushed a commit
to Yicong-Huang/texera
that referenced
this pull request
May 2, 2026
- build.yml:
  - Replace the `scala` job with `amber`. It runs the cross-cutting Scala
    lints (scalafmtCheckAll, scalafixAll --check) once on behalf of every
    Scala module, builds WorkflowExecutionService/dist, license-checks
    the amber dist, and runs amber tests via WorkflowExecutionService/jacoco.
  - Add `platform` job: a matrix over the five non-amber Scala services
    (config-service, access-control-service, file-service,
    computing-unit-managing-service, workflow-compiling-service). Each
    entry runs a single sbt invocation `<Service>/dist <Service>/test` and
    license-checks its own dist lib only. This is now possible because
    per-module LICENSE/NOTICE binaries (apache#4668) are embedded in each
    service's dist.
  - Replace the `run_scala` workflow_call input with `run_amber` and
    `run_platform`.
- labeler.yml: introduce a `platform` label scoped to the five platform
  service dirs. Trim the `service` label to just `pyright-language-service`
  plus root-level Scala build/lint config (`build.sbt`, `project/**`,
  `.scalafix.conf`, `.scalafmt.conf`); those affect amber + every
  platform service.
- required-checks.yml: precheck now emits `run_amber` + `run_platform`
  outputs. LABEL_STACKS maps the new `platform` label to the platform
  stack and routes existing labels accordingly: python/engine -> amber +
  python; service/common/ddl-change -> amber + platform; ci -> all
  stacks. Build and backport callers pass the new inputs through.

Closes apache#4631
Yicong-Huang
pushed a commit
to Yicong-Huang/texera
that referenced
this pull request
May 2, 2026
Each service's dist still embeds the root union LICENSE-binary via AddMetaInfLicenseFiles, so a per-service `check_binary_deps.py jar <service>/lib` resolved against the embedded file fails with STALE entries — the union lists jars not in any single service. Pass the per-module file (e.g. amber/LICENSE-binary-java, config-service/LICENSE-binary) introduced in apache#4668 via the script's `--license-binary` flag so each service is validated against its own ground-truth list.
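The failure mode in this commit message is a set difference. The sketch below is only an illustration of that reasoning, not the logic of `check_binary_deps.py`: the function name `compare_claims`, the `MISSING` label, and all jar names are hypothetical; only the `STALE` term comes from the message above.

```python
def compare_claims(claimed, bundled):
    """Compare the jars a LICENSE-binary claims with the jars actually in lib/."""
    claimed, bundled = set(claimed), set(bundled)
    return {
        "STALE": sorted(claimed - bundled),    # claimed but not shipped
        "MISSING": sorted(bundled - claimed),  # shipped but not claimed (label assumed)
    }


# Hypothetical data: the union file lists jars from every service.
union_claims = ["shared-1.0.jar", "only-in-amber-2.0.jar", "only-in-config-3.0.jar"]
config_claims = ["shared-1.0.jar", "only-in-config-3.0.jar"]
config_lib = ["shared-1.0.jar", "only-in-config-3.0.jar"]

against_union = compare_claims(union_claims, config_lib)
against_module = compare_claims(config_claims, config_lib)
```

Checking the service's `lib/` against the union reports the other services' jars as `STALE`, while checking against the service's own file comes back clean.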
Yicong-Huang
added a commit
that referenced
this pull request
May 2, 2026
## What changes were proposed in this PR?

- `.github/workflows/build.yml`:
  - Replace the `scala` job with `amber`. It runs the cross-cutting Scala lints (`scalafmtCheckAll`, `scalafixAll --check`) once on behalf of every Scala module, builds `WorkflowExecutionService/dist`, license-checks the amber dist against `amber/LICENSE-binary-java`, and runs amber tests via `WorkflowExecutionService/jacoco`.
  - New `platform` job: a `strategy.matrix.include` over the five non-amber Scala services (config-service, access-control-service, file-service, computing-unit-managing-service, workflow-compiling-service). Each entry runs `sbt "<Service>/dist" "<Service>/test"` and license-checks its own dist `lib/` against `<service>/LICENSE-binary` in isolation. This is now possible because per-module LICENSE-binary files were introduced in #4668.
  - `run_scala` input replaced by `run_amber` + `run_platform`.
- `.github/labeler.yml`:
  - New `platform` label for the five platform service dirs.
  - `service` label removed. The two things it carried go elsewhere: `pyright-language-service/**` is left uncategorized (no test stack today), and the root-level Scala build/lint config (`build.sbt`, `project/**`, `.scalafix.conf`, `.scalafmt.conf`) joins the `common` glob. `common` already maps to amber + platform, which is correct for changes that affect every Scala module.
- `.github/workflows/required-checks.yml`:
  - Precheck now emits `run_amber` + `run_platform` instead of `run_scala`.
  - LABEL_STACKS routes the new label set. Build and backport callers pass the new inputs through.

### Label → stack matrix

| Label | frontend | amber | platform | python | agent-service |
|---|:-:|:-:|:-:|:-:|:-:|
| `frontend` | ✓ | | | | |
| `engine` | | ✓ | | ✓ | |
| `python` | | ✓ | | ✓ | |
| `platform` | | | ✓ | | |
| `common` | | ✓ | ✓ | | |
| `ddl-change` | | ✓ | ✓ | | |
| `agent-service` | | | | | ✓ |
| `ci` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `docs`, `dev`, `dependencies`, `feature`, `fix`, `refactor`, `release/*` | | | | | |

The selected stacks are the union across all PR labels. PRs that pick up only no-stack labels (e.g. docs-only, dev-only) skip every build stack. Push and `workflow_dispatch` events run every stack unconditionally.

### Why per-service license check is now possible

Before #4668 there was a single repo-wide `LICENSE-binary` covering the union of all service jars. Splitting the license check per service would have made every per-service check fail: each lib is a strict subset of the union, so the script would report STALE jars (claimed in the union, not in this service). #4668 ships per-module `LICENSE-binary` files under each module directory at the repo root (`config-service/LICENSE-binary`, `amber/LICENSE-binary-java`, etc.), so each service's dist `lib/` is now validated against its own ground-truth file via `check_binary_deps.py --license-binary <module>/LICENSE-binary`.

## Any related issues, documentation, discussions?

Closes #4631. Builds on #4668 (per-module LICENSE-binary files) and #4640 (LABEL_STACKS gating).

## How was this PR tested?

YAML parses locally for all three modified files. Currently exercising on this PR's CI run: amber job runs unconditionally; platform matrix runs because the `platform` and `ci` labels are present.

## Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7 (Claude Code)
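The label-to-stack routing in the matrix above amounts to a union of sets. The following is a sketch of that rule in Python, assuming a plain dict; the actual `LABEL_STACKS` lives in `required-checks.yml` and its form is not shown in this thread, so `select_stacks` and the `event` parameter are hypothetical names.

```python
ALL_STACKS = ("frontend", "amber", "platform", "python", "agent-service")

# Transcribed from the matrix; labels absent from this dict select no stack.
LABEL_STACKS = {
    "frontend": {"frontend"},
    "engine": {"amber", "python"},
    "python": {"amber", "python"},
    "platform": {"platform"},
    "common": {"amber", "platform"},
    "ddl-change": {"amber", "platform"},
    "agent-service": {"agent-service"},
    "ci": set(ALL_STACKS),
}


def select_stacks(labels, event="pull_request"):
    """Return the stacks to run: everything on push/dispatch, else the label union."""
    if event in ("push", "workflow_dispatch"):
        return set(ALL_STACKS)
    selected = set()
    for label in labels:
        selected |= LABEL_STACKS.get(label, set())
    return selected


docs_only = select_stacks(["docs", "dev"])
mixed = select_stacks(["platform", "python", "fix"])
on_push = select_stacks([], event="push")
```

A docs-only PR selects nothing, while `platform` plus `python` selects the platform, amber, and python stacks.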
Yicong-Huang
added a commit
that referenced
this pull request
May 2, 2026
(backported from commit 6cf4322)
bobbai00
pushed a commit
to bobbai00/texera
that referenced
this pull request
Jun 6, 2026
…ETA-INF

Adds bin/licensing/generate_notice_binary.py: walks each module's bundled jars, extracts every META-INF/NOTICE (and root-level NOTICE) file, skips first-party org.apache.texera.* jars, dedupes by content hash so jars from the same upstream collapse into one block, prepends the project's own root NOTICE, and emits one block per unique blob. Each block carries a heading derived from the longest common dotted prefix of its contributing jars' coordinates, the list of those jars, and the upstream NOTICE verbatim. Output is deterministic: CRLF->LF normalized and sorted by jar-count with a hash tiebreaker. Optional --extras appends non-jar attribution blocks (amber/NOTICE-binary-extras carries the aiohttp + Matplotlib python wheels, which ship no NOTICE inside any jar).

Replaces the 6 hand-curated per-module NOTICE-binary files (introduced in apache#4668) with the generator's output, so every distinct upstream attribution actually shipped is preserved verbatim and stays in sync with the deps.

CI (build.yml): after each dist is built and unzipped, a new step regenerates that module's NOTICE-binary against the dist lib/ dir and diffs it against the committed file, failing with a one-line fix-up command on drift. The amber check runs in the scala job; the five platform services are each checked in the per-service platform matrix job, alongside the existing LICENSE-binary check.

.licenserc.yaml: exclude **/NOTICE-binary-* from the license-header check (the extras manifest is plain text with no comment style), mirroring the existing **/LICENSE-binary-* exclusion.
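The "longest common dotted prefix" heading can be sketched as a component-wise comparison. This is an illustration only: the function name `common_dotted_prefix` is hypothetical, and the assumption that a jar's "coordinates" are a dotted string such as a group id is mine, not something the commit message spells out.

```python
def common_dotted_prefix(coordinates):
    """Longest prefix shared by all dotted names, compared component by component."""
    split = [coordinate.split(".") for coordinate in coordinates]
    prefix = []
    for parts in zip(*split):  # stops at the shortest name
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return ".".join(prefix)


# Hypothetical coordinates for jars that share one upstream NOTICE.
heading = common_dotted_prefix(["org.example.alpha.core", "org.example.alpha.util"])
partial = common_dotted_prefix(["org.example.alpha", "org.example.alphabet"])
unrelated = common_dotted_prefix(["io.sample.one", "com.sample.two"])
```

Comparing whole components rather than raw characters avoids headings that end mid-word: `org.example.alpha` and `org.example.alphabet` share `org.example`, not `org.example.alpha`.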
bobbai00
pushed a commit
to bobbai00/texera
that referenced
this pull request
Jun 6, 2026
…ETA-INF

Adds bin/licensing/generate_notice_binary.py: walks each module's bundled jars, extracts every META-INF/NOTICE (and root-level NOTICE) file, skips first-party org.apache.texera.* jars, dedupes by content hash so jars from the same upstream collapse into one block, prepends the project's own root NOTICE, and emits one block per unique blob. Each block carries a heading derived from the longest common dotted prefix of its contributing jars' coordinates, the list of those jars, and the upstream NOTICE verbatim. Output is deterministic: CRLF->LF normalized and sorted by jar-count with a hash tiebreaker. Optional --extras appends non-jar attribution blocks (amber/NOTICE-binary-python carries the aiohttp + Matplotlib python wheels, which ship no NOTICE inside any jar).

Replaces the 6 hand-curated per-module NOTICE-binary files (introduced in apache#4668) with the generator's output, so every distinct upstream attribution actually shipped is preserved verbatim and stays in sync with the deps.

CI (build.yml): after each dist is built and unzipped, a new step regenerates that module's NOTICE-binary against the dist lib/ dir and diffs it against the committed file, failing with a one-line fix-up command on drift. The amber check runs in the scala job; the five platform services are each checked in the per-service platform matrix job, alongside the existing LICENSE-binary check.

.licenserc.yaml: exclude **/NOTICE-binary-* from the license-header check (the extras manifest is plain text with no comment style), mirroring the existing **/LICENSE-binary-* exclusion.
Yicong-Huang
pushed a commit
that referenced
this pull request
Jun 10, 2026
…CI checks for detecting NOTICE-binary drifting (#5417)

### What changes were proposed in this PR?

Auto-generates each module's `NOTICE-binary` from the third-party `META-INF/NOTICE` files in its bundled jars, replacing the hand-curated subsets introduced in #4668, and adds a CI drift-check so the committed files can never silently rot when dependencies change.

- **New generator, `bin/licensing/generate_notice_binary.py`:** walks a module's dist `lib/` dir, extracts every `META-INF/NOTICE` (and root-level `NOTICE`) from each bundled jar, skips first-party `org.apache.texera.*` jars, dedupes by content hash so jars sharing an upstream notice collapse into one block, prepends the project's own root `NOTICE`, and emits one block per unique notice with a synthesized heading + the contributing-jar list. Output is deterministic (CRLF→LF normalized, stably sorted by jar-count). An optional `--extras <file>` appends non-jar attributions.
- **`amber/NOTICE-binary-extras` (new):** the aiohttp + Matplotlib notices, which ship as Python wheels (not jars) and so can't be extracted from the `lib/` dir.
- **6 per-module `NOTICE-binary` files regenerated** from the actual bundled jars: `amber`, `access-control-service`, `config-service`, `file-service`, `computing-unit-managing-service`, `workflow-compiling-service`.
- **CI drift-check (`build.yml`):** after each dist is built and unzipped, a new step regenerates that module's `NOTICE-binary` and diffs it against the committed file, failing the build with a one-line fix-up command on any drift. The amber check runs in the scala job; the five platform services are each checked in the per-service `platform` matrix job, alongside the existing `LICENSE-binary` check.

`LICENSE-binary` stays hand-maintained (it needs human judgment on each license); only `NOTICE-binary`, a mechanical carry-forward of upstream notices, is generated. So future dep bumps fail CI with the exact command to regenerate, instead of silently drifting.

### Any related issues, documentation, discussions?

Closes #4674

ASF guidance: https://infra.apache.org/licensing-howto.html (Apache-2.0 §4(d)).

### How was this PR tested?

- Built all six module dists locally (`sbt <project>/Universal/stage`) and ran the generator against each freshly-built `lib/`; the committed `NOTICE-binary` files are byte-identical to the generator output, so the new CI drift-check passes for every module.
- Verified the existing `LICENSE-binary` checks (`check_binary_deps.py`, PR mode) still pass against the same libs for all six modules.
- `build.yml` validated as well-formed YAML.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8 (1M context)

---------

(backported from commit 27c1df4)

Co-authored-by: Bob Bai <bobbai0509@gmail.com>
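The extract, normalize, and dedupe-by-hash steps described above can be sketched with the standard library. This is not `generate_notice_binary.py`: the function names, the block layout, the use of SHA-256, the descending jar-count sort, and matching first-party jars by filename prefix are all assumptions, and the sketch omits the root `NOTICE` prepend, the synthesized heading, and `--extras`.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path

NOTICE_MEMBERS = ("META-INF/NOTICE", "NOTICE")


def collect_notices(lib_dir):
    """Map content hash -> (normalized notice text, jars that carry it)."""
    by_hash = {}
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        if jar.name.startswith("org.apache.texera."):
            continue  # assumed way to spot first-party jars
        with zipfile.ZipFile(jar) as zf:
            members = set(zf.namelist())
            for member in NOTICE_MEMBERS:
                if member not in members:
                    continue
                raw = zf.read(member).decode("utf-8", errors="replace")
                text = raw.replace("\r\n", "\n")  # CRLF -> LF before hashing
                digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
                _, jars = by_hash.setdefault(digest, (text, []))
                if jar.name not in jars:
                    jars.append(jar.name)
    return by_hash


def render(by_hash):
    """One block per unique notice, most-shared first, hash as tiebreaker."""
    ordered = sorted(by_hash.items(), key=lambda kv: (-len(kv[1][1]), kv[0]))
    out = []
    for _digest, (text, jars) in ordered:
        out.extend(["Jars: " + ", ".join(jars), text.strip(), ""])
    return "\n".join(out)


# Demo on throwaway jars with made-up names and notices.
with tempfile.TemporaryDirectory() as tmp:
    def make_jar(name, member, body):
        with zipfile.ZipFile(Path(tmp) / name, "w") as zf:
            zf.writestr(member, body)

    make_jar("example-core-1.0.jar", "META-INF/NOTICE", "Example Project\nCopyright Example\n")
    make_jar("example-util-1.0.jar", "META-INF/NOTICE", "Example Project\r\nCopyright Example\r\n")
    make_jar("other-lib-2.0.jar", "NOTICE", "Other Lib\n")
    make_jar("org.apache.texera.demo-0.1.jar", "META-INF/NOTICE", "First party\n")
    notices = collect_notices(tmp)
    generated = render(notices)
```

A drift check in this style would regenerate the text and compare it with the committed file; because the output is sorted and line endings are normalized, a plain string comparison is enough.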
SarahAsad23
pushed a commit
to SarahAsad23/texera
that referenced
this pull request
Jun 10, 2026
…CI checks for detecting NOTICE-binary drifting (apache#5417) ### What changes were proposed in this PR? Auto-generates each module's `NOTICE-binary` from the third-party `META-INF/NOTICE` files in its bundled jars — replacing the hand-curated subsets introduced in apache#4668 — and adds a CI drift-check so the committed files can never silently rot when dependencies change. - **New generator — `bin/licensing/generate_notice_binary.py`:** walks a module's dist `lib/` dir, extracts every `META-INF/NOTICE` (and root-level `NOTICE`) from each bundled jar, skips first-party `org.apache.texera.*` jars, dedupes by content hash so jars sharing an upstream notice collapse into one block, prepends the project's own root `NOTICE`, and emits one block per unique notice with a synthesized heading + the contributing-jar list. Output is deterministic (CRLF→LF normalized, stably sorted by jar-count). An optional `--extras <file>` appends non-jar attributions. - **`amber/NOTICE-binary-extras` (new):** the aiohttp + Matplotlib notices, which ship as Python wheels (not jars) and so can't be extracted from the `lib/` dir. - **6 per-module `NOTICE-binary` files regenerated** from the actual bundled jars: `amber`, `access-control-service`, `config-service`, `file-service`, `computing-unit-managing-service`, `workflow-compiling-service`. - **CI drift-check (`build.yml`):** after each dist is built and unzipped, a new step regenerates that module's `NOTICE-binary` and diffs it against the committed file, failing the build with a one-line fix-up command on any drift. The amber check runs in the scala job; the five platform services are each checked in the per-service `platform` matrix job, alongside the existing `LICENSE-binary` check. `LICENSE-binary` stays hand-maintained (it needs human judgment on each license); only `NOTICE-binary` — a mechanical carry-forward of upstream notices — is generated. 
So future dep bumps fail CI with the exact command to regenerate, instead of silently drifting. ### Any related issues, documentation, discussions? Closes apache#4674 ASF guidance: https://infra.apache.org/licensing-howto.html (Apache-2.0 §4(d)). ### How was this PR tested? - Built all six module dists locally (`sbt <project>/Universal/stage`) and ran the generator against each freshly-built `lib/`; the committed `NOTICE-binary` files are byte-identical to the generator output, so the new CI drift-check passes for every module. - Verified the existing `LICENSE-binary` checks (`check_binary_deps.py`, PR mode) still pass against the same libs for all six modules. - `build.yml` validated as well-formed YAML. ### Was this PR authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.8 (1M context) --------- Co-authored-by: Bob Bai <bobbai0509@gmail.com>
ELin2025
pushed a commit
to ELin2025/texera
that referenced
this pull request
Jun 16, 2026
…CI checks for detecting NOTICE-binary drifting (apache#5417) ### What changes were proposed in this PR? Auto-generates each module's `NOTICE-binary` from the third-party `META-INF/NOTICE` files in its bundled jars — replacing the hand-curated subsets introduced in apache#4668 — and adds a CI drift-check so the committed files can never silently rot when dependencies change. - **New generator — `bin/licensing/generate_notice_binary.py`:** walks a module's dist `lib/` dir, extracts every `META-INF/NOTICE` (and root-level `NOTICE`) from each bundled jar, skips first-party `org.apache.texera.*` jars, dedupes by content hash so jars sharing an upstream notice collapse into one block, prepends the project's own root `NOTICE`, and emits one block per unique notice with a synthesized heading + the contributing-jar list. Output is deterministic (CRLF→LF normalized, stably sorted by jar-count). An optional `--extras <file>` appends non-jar attributions. - **`amber/NOTICE-binary-extras` (new):** the aiohttp + Matplotlib notices, which ship as Python wheels (not jars) and so can't be extracted from the `lib/` dir. - **6 per-module `NOTICE-binary` files regenerated** from the actual bundled jars: `amber`, `access-control-service`, `config-service`, `file-service`, `computing-unit-managing-service`, `workflow-compiling-service`. - **CI drift-check (`build.yml`):** after each dist is built and unzipped, a new step regenerates that module's `NOTICE-binary` and diffs it against the committed file, failing the build with a one-line fix-up command on any drift. The amber check runs in the scala job; the five platform services are each checked in the per-service `platform` matrix job, alongside the existing `LICENSE-binary` check. `LICENSE-binary` stays hand-maintained (it needs human judgment on each license); only `NOTICE-binary` — a mechanical carry-forward of upstream notices — is generated. 
So future dep bumps fail CI with the exact command to regenerate, instead of silently drifting. ### Any related issues, documentation, discussions? Closes apache#4674 ASF guidance: https://infra.apache.org/licensing-howto.html (Apache-2.0 §4(d)). ### How was this PR tested? - Built all six module dists locally (`sbt <project>/Universal/stage`) and ran the generator against each freshly-built `lib/`; the committed `NOTICE-binary` files are byte-identical to the generator output, so the new CI drift-check passes for every module. - Verified the existing `LICENSE-binary` checks (`check_binary_deps.py`, PR mode) still pass against the same libs for all six modules. - `build.yml` validated as well-formed YAML. ### Was this PR authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.8 (1M context) --------- Co-authored-by: Bob Bai <bobbai0509@gmail.com>
yangzhang75
pushed a commit
to yangzhang75/texera
that referenced
this pull request
Jun 22, 2026
…oncat (apache#4668)

### What changes were proposed in this PR?

Splits the monolithic root `LICENSE-binary` and `NOTICE-binary` into per-module ground-truth files so each Docker image's `/texera/LICENSE` describes only the third-party components actually bundled in that image, per ASF licensing guidance.

**Per-module files added** (root files kept unchanged for the source tarball):

| Path | Contents |
|---|---|
| `access-control-service/LICENSE-binary` + `NOTICE-binary` | 113 jars / 18 NOTICE blocks |
| `config-service/LICENSE-binary` + `NOTICE-binary` | 115 jars / 18 NOTICE blocks |
| `file-service/LICENSE-binary` + `NOTICE-binary` | 310 jars / 25 NOTICE blocks |
| `workflow-compiling-service/LICENSE-binary` + `NOTICE-binary` | 319 jars / 26 NOTICE blocks |
| `computing-unit-managing-service/LICENSE-binary` + `NOTICE-binary` | 349 jars / 26 NOTICE blocks (only image bundling Bouncy Castle) |
| `amber/LICENSE-binary-java` + `NOTICE-binary` | 404 jars / 27 NOTICE blocks (`WorkflowExecutionService`, shared by web/master/runner) |
| `amber/LICENSE-binary-python` | 113 packages (master/runner only) |
| `frontend/LICENSE-binary` | 114 npm packages (Angular bundle, dashboard image) |
| `agent-service/LICENSE-binary` | 57 npm packages |

Counts were derived by enumerating each container's actual bundled jars (`ls /texera/lib/`), pip-listed Python packages, and `node_modules` (recursively, including `@scope/name` packages), then filtering the root `LICENSE-binary` down. No new entries were invented; `combined ⊆ root` strictly.

**New script** (`bin/licensing/concat_license_binary.py`):

- Style-matched to the existing `audit_jar_licenses.py` / `check_binary_deps.py`.
- Merges multiple per-module LICENSE-binary files at the **license-group level**: each Apache-2.0 / MIT / BSD / ... section in the output contains all the ecosystem subsections (`Scala/Java jars:`, `Python packages:`, `Angular / npm packages:`, `Agent service npm packages:`, `Source files derived from ...`) inline, rather than stacking the inputs end-to-end.
- Reuses the Apache-2.0 license header verbatim, deduplicates entries by id, emits a single trailer.

**Dockerfile updates** (9 dockerfiles):

- 5 standalone Scala services + agent-service: copy only their own per-module `LICENSE-binary` (and `NOTICE-binary` for the Scala ones) into `/texera/LICENSE` and `/texera/NOTICE`.
- 3 multi-aspect images run `concat_license_binary.py` at build time:
  - `computing-unit-master` and `computing-unit-worker`: union `amber/LICENSE-binary-java` + `amber/LICENSE-binary-python`.
  - `texera-web-application`: union `amber/LICENSE-binary-java` + `frontend/LICENSE-binary` (cross-stage `COPY --from=build-frontend`).
- `python3-minimal` added to the Scala build stage of the 3 multi-aspect dockerfiles to run the concat script.

**CI** (`.github/workflows/build.yml`, the workflow `required-checks.yml` orchestrates):

- The four existing `check_binary_deps.py` invocations (frontend npm, scala jar, python, agent-npm) now build a fresh combined LICENSE-binary from all 9 per-module files via `concat_license_binary.py /tmp/combined-LICENSE-binary …`, then pass `--license-binary /tmp/combined-LICENSE-binary` to the existing tooling. The per-module files become the authoritative claim source for dep validation.

### Any related issues, documentation, discussions?

Closes apache#4667

### How was this PR tested?

My personal fork: https://github.com/bobbai00/texera/actions/runs/25248183326

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
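To illustrate what merging "at the license-group level" means in the commit message above, here is a minimal sketch. It is not the code of `concat_license_binary.py`: it skips file parsing entirely and assumes each per-module file has already been read into a nested dict (license group, then ecosystem subsection, then entry ids). The function names, the dict layout, and the sample entry ids are all hypothetical, and deduplication here is scoped to a single subsection, which may differ from the real script.

```python
def merge_license_groups(modules):
    """Merge per-module claims so each license group appears once.

    `modules` is a list of dicts shaped like
    {license_group: {ecosystem_subsection: [entry_id, ...]}}.
    This in-memory shape is an assumption made for illustration.
    """
    merged = {}
    for module in modules:
        for group, subsections in module.items():
            merged_group = merged.setdefault(group, {})
            for subsection, entries in subsections.items():
                bucket = merged_group.setdefault(subsection, [])
                for entry in entries:
                    # Deduplicate by entry id, keeping first-seen order.
                    if entry not in bucket:
                        bucket.append(entry)
    return merged


def render(merged):
    """Render the merged structure as indented text (hypothetical layout)."""
    lines = []
    for group, subsections in merged.items():
        lines.append(group)
        for subsection, entries in subsections.items():
            lines.append(f"  {subsection}")
            lines.extend(f"    {entry}" for entry in entries)
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up entry ids, not taken from the real LICENSE-binary files.
    java = {"Apache-2.0": {"Scala/Java jars:": ["example-core-1.0.jar"]}}
    python = {"Apache-2.0": {"Python packages:": ["example-pkg==2.0"]}}
    print(render(merge_license_groups([java, python])))
```

With the two inputs in the usage example, the single `Apache-2.0` group ends up holding both the `Scala/Java jars:` and `Python packages:` subsections, instead of two separate `Apache-2.0` sections stacked one after the other.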
yangzhang75 pushed a commit to yangzhang75/texera that referenced this pull request on Jun 22, 2026
## What changes were proposed in this PR?

- `.github/workflows/build.yml`:
  - Replace the `scala` job with `amber`. It runs the cross-cutting Scala lints (`scalafmtCheckAll`, `scalafixAll --check`) once on behalf of every Scala module, builds `WorkflowExecutionService/dist`, license-checks the amber dist against `amber/LICENSE-binary-java`, and runs amber tests via `WorkflowExecutionService/jacoco`.
  - New `platform` job: a `strategy.matrix.include` over the five non-amber Scala services (config-service, access-control-service, file-service, computing-unit-managing-service, workflow-compiling-service). Each entry runs `sbt "<Service>/dist" "<Service>/test"` and license-checks its own dist `lib/` against `<service>/LICENSE-binary` in isolation. This is now possible because per-module LICENSE-binary files were introduced in apache#4668.
  - `run_scala` input replaced by `run_amber` + `run_platform`.
- `.github/labeler.yml`:
  - New `platform` label for the five platform service dirs.
  - `service` label removed. The two things it carried go elsewhere: `pyright-language-service/**` is left uncategorized (no test stack today), and the root-level Scala build/lint config (`build.sbt`, `project/**`, `.scalafix.conf`, `.scalafmt.conf`) joins the `common` glob; `common` already maps to amber + platform, which is correct for changes that affect every Scala module.
- `.github/workflows/required-checks.yml`:
  - Precheck now emits `run_amber` + `run_platform` instead of `run_scala`.
  - LABEL_STACKS routes the new label set. Build and backport callers pass the new inputs through.

### Label → stack matrix

| Label | frontend | amber | platform | python | agent-service |
|---|:-:|:-:|:-:|:-:|:-:|
| `frontend` | ✓ | | | | |
| `engine` | | ✓ | | ✓ | |
| `python` | | ✓ | | ✓ | |
| `platform` | | | ✓ | | |
| `common` | | ✓ | ✓ | | |
| `ddl-change` | | ✓ | ✓ | | |
| `agent-service` | | | | | ✓ |
| `ci` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `docs`, `dev`, `dependencies`, `feature`, `fix`, `refactor`, `release/*` | | | | | |

The selected stacks are the union across all PR labels. PRs that pick up only no-stack labels (e.g. docs-only, dev-only) skip every build stack. Push and `workflow_dispatch` events run every stack unconditionally.

### Why per-service license check is now possible

Before apache#4668 there was a single repo-wide `LICENSE-binary` covering the union of all service jars. Splitting the license check per service would have made every per-service check fail: each lib is a strict subset of the union, so the script would report STALE jars (claimed in the union, not in this service). apache#4668 ships per-module `LICENSE-binary` files at the repo root (`config-service/LICENSE-binary`, `amber/LICENSE-binary-java`, etc.), so each service's dist `lib/` is now validated against its own ground-truth file via `check_binary_deps.py --license-binary <module>/LICENSE-binary`.

## Any related issues, documentation, discussions?

Closes apache#4631. Builds on apache#4668 (per-module LICENSE-binary files) and apache#4640 (LABEL_STACKS gating).

## How was this PR tested?

YAML parses locally for all three modified files. Currently exercising on this PR's CI run: amber job runs unconditionally; platform matrix runs because the `platform` and `ci` labels are present.

## Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7 (Claude Code)
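The label → stack routing described in the commit message above amounts to a set union. The sketch below transcribes the matrix into a Python mapping to make that rule concrete. The mapping reuses the name `LABEL_STACKS` from the commit message, but the real definition lives in the workflow files and is not shown on this page, so its actual form is unknown; `select_stacks` and its `event` parameter are hypothetical names.

```python
STACKS = ("frontend", "amber", "platform", "python", "agent-service")

# Transcribed from the label -> stack matrix; labels that are absent
# (docs, dev, dependencies, ...) select no stack.
LABEL_STACKS = {
    "frontend": {"frontend"},
    "engine": {"amber", "python"},
    "python": {"amber", "python"},
    "platform": {"platform"},
    "common": {"amber", "platform"},
    "ddl-change": {"amber", "platform"},
    "agent-service": {"agent-service"},
    "ci": set(STACKS),
}


def select_stacks(labels, event="pull_request"):
    """Return the set of build stacks to run for the given PR labels."""
    # Push and workflow_dispatch events run every stack unconditionally.
    if event in ("push", "workflow_dispatch"):
        return set(STACKS)
    selected = set()
    for label in labels:
        # Union across all labels; unknown or no-stack labels add nothing.
        selected |= LABEL_STACKS.get(label, set())
    return selected


if __name__ == "__main__":
    print(sorted(select_stacks(["engine", "platform"])))
    print(sorted(select_stacks(["docs", "dev"])))
```

A PR labelled only `docs` and `dev` yields the empty set, which corresponds to skipping every build stack.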
yangzhang75 pushed a commit to yangzhang75/texera that referenced this pull request on Jun 22, 2026
…CI checks for detecting NOTICE-binary drifting (apache#5417)

### What changes were proposed in this PR?

Auto-generates each module's `NOTICE-binary` from the third-party `META-INF/NOTICE` files in its bundled jars, replacing the hand-curated subsets introduced in apache#4668, and adds a CI drift-check so the committed files can never silently rot when dependencies change.

- **New generator (`bin/licensing/generate_notice_binary.py`):** walks a module's dist `lib/` dir, extracts every `META-INF/NOTICE` (and root-level `NOTICE`) from each bundled jar, skips first-party `org.apache.texera.*` jars, dedupes by content hash so jars sharing an upstream notice collapse into one block, prepends the project's own root `NOTICE`, and emits one block per unique notice with a synthesized heading + the contributing-jar list. Output is deterministic (CRLF→LF normalized, stably sorted by jar-count). An optional `--extras <file>` appends non-jar attributions.
- **`amber/NOTICE-binary-extras` (new):** the aiohttp + Matplotlib notices, which ship as Python wheels (not jars) and so can't be extracted from the `lib/` dir.
- **6 per-module `NOTICE-binary` files regenerated** from the actual bundled jars: `amber`, `access-control-service`, `config-service`, `file-service`, `computing-unit-managing-service`, `workflow-compiling-service`.
- **CI drift-check (`build.yml`):** after each dist is built and unzipped, a new step regenerates that module's `NOTICE-binary` and diffs it against the committed file, failing the build with a one-line fix-up command on any drift. The amber check runs in the scala job; the five platform services are each checked in the per-service `platform` matrix job, alongside the existing `LICENSE-binary` check.

`LICENSE-binary` stays hand-maintained (it needs human judgment on each license); only `NOTICE-binary`, a mechanical carry-forward of upstream notices, is generated. So future dep bumps fail CI with the exact command to regenerate, instead of silently drifting.

### Any related issues, documentation, discussions?

Closes apache#4674

ASF guidance: https://infra.apache.org/licensing-howto.html (Apache-2.0 §4(d)).

### How was this PR tested?

- Built all six module dists locally (`sbt <project>/Universal/stage`) and ran the generator against each freshly-built `lib/`; the committed `NOTICE-binary` files are byte-identical to the generator output, so the new CI drift-check passes for every module.
- Verified the existing `LICENSE-binary` checks (`check_binary_deps.py`, PR mode) still pass against the same libs for all six modules.
- `build.yml` validated as well-formed YAML.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8 (1M context)

---------

Co-authored-by: Bob Bai <bobbai0509@gmail.com>
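The content-hash deduplication that the commit message above attributes to the generator can be sketched roughly as follows. This is not the code of `generate_notice_binary.py`. The function names, the `Bundled in:` line, the descending sort by jar count, and the assumption that first-party jars can be recognised by an `org.apache.texera.` filename prefix are all illustrative guesses.

```python
import hashlib
import zipfile
from pathlib import Path

NOTICE_MEMBERS = ("META-INF/NOTICE", "NOTICE")


def collect_notices(lib_dir):
    """Return {sha256: (notice_text, [jar names])} for third-party jars."""
    blocks = {}
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        # Assumption: first-party jars carry this filename prefix.
        if jar.name.startswith("org.apache.texera."):
            continue
        with zipfile.ZipFile(jar) as zf:
            members = set(zf.namelist())
            for member in NOTICE_MEMBERS:
                if member not in members:
                    continue
                text = zf.read(member).decode("utf-8", errors="replace")
                # Normalize CRLF to LF so identical notices hash identically.
                text = text.replace("\r\n", "\n").strip()
                digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
                blocks.setdefault(digest, (text, []))[1].append(jar.name)
    return blocks


def render_notice_binary(root_notice, blocks):
    """Prepend the project's NOTICE, then emit one block per unique notice."""
    # Most-shared notices first; the hash breaks ties so output is stable.
    ordered = sorted(blocks.items(), key=lambda kv: (-len(kv[1][1]), kv[0]))
    parts = [root_notice.strip()]
    for _digest, (text, jars) in ordered:
        parts.append(text + "\n\nBundled in: " + ", ".join(jars))
    return "\n\n".join(parts) + "\n"
```

Because the output is deterministic for a given `lib/` directory, a drift check reduces to comparing the rendered string with the committed `NOTICE-binary` contents and failing when they differ.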
### What changes were proposed in this PR?

Splits the monolithic root `LICENSE-binary` and `NOTICE-binary` into per-module ground-truth files so each Docker image's `/texera/LICENSE` describes only the third-party components actually bundled in that image, per ASF licensing guidance.

**Per-module files added** (root files kept unchanged for the source tarball):

| Path | Contents |
|---|---|
| `access-control-service/LICENSE-binary` + `NOTICE-binary` | 113 jars / 18 NOTICE blocks |
| `config-service/LICENSE-binary` + `NOTICE-binary` | 115 jars / 18 NOTICE blocks |
| `file-service/LICENSE-binary` + `NOTICE-binary` | 310 jars / 25 NOTICE blocks |
| `workflow-compiling-service/LICENSE-binary` + `NOTICE-binary` | 319 jars / 26 NOTICE blocks |
| `computing-unit-managing-service/LICENSE-binary` + `NOTICE-binary` | 349 jars / 26 NOTICE blocks (only image bundling Bouncy Castle) |
| `amber/LICENSE-binary-java` + `NOTICE-binary` | 404 jars / 27 NOTICE blocks (`WorkflowExecutionService`, shared by web/master/runner) |
| `amber/LICENSE-binary-python` | 113 packages (master/runner only) |
| `frontend/LICENSE-binary` | 114 npm packages (Angular bundle, dashboard image) |
| `agent-service/LICENSE-binary` | 57 npm packages |

Counts were derived by enumerating each container's actual bundled jars (`ls /texera/lib/`), pip-listed Python packages, and `node_modules` (recursively, including `@scope/name` packages), then filtering the root `LICENSE-binary` down. No new entries were invented; `combined ⊆ root` strictly.

**New script** (`bin/licensing/concat_license_binary.py`):

- Style-matched to the existing `audit_jar_licenses.py` / `check_binary_deps.py`.
- Merges multiple per-module LICENSE-binary files at the **license-group level**: each Apache-2.0 / MIT / BSD / ... section in the output contains all the ecosystem subsections (`Scala/Java jars:`, `Python packages:`, `Angular / npm packages:`, `Agent service npm packages:`, `Source files derived from ...`) inline, rather than stacking the inputs end-to-end.
- Reuses the Apache-2.0 license header verbatim, deduplicates entries by id, emits a single trailer.

**Dockerfile updates** (9 dockerfiles):

- 5 standalone Scala services + agent-service: copy only their own per-module `LICENSE-binary` (and `NOTICE-binary` for the Scala ones) into `/texera/LICENSE` and `/texera/NOTICE`.
- 3 multi-aspect images run `concat_license_binary.py` at build time:
  - `computing-unit-master` and `computing-unit-worker`: union `amber/LICENSE-binary-java` + `amber/LICENSE-binary-python`.
  - `texera-web-application`: union `amber/LICENSE-binary-java` + `frontend/LICENSE-binary` (cross-stage `COPY --from=build-frontend`).
- `python3-minimal` added to the Scala build stage of the 3 multi-aspect dockerfiles to run the concat script.

**CI** (`.github/workflows/build.yml`, the workflow `required-checks.yml` orchestrates):

- The four existing `check_binary_deps.py` invocations (frontend npm, scala jar, python, agent-npm) now build a fresh combined LICENSE-binary from all 9 per-module files via `concat_license_binary.py /tmp/combined-LICENSE-binary …`, then pass `--license-binary /tmp/combined-LICENSE-binary` to the existing tooling. The per-module files become the authoritative claim source for dep validation.

### Any related issues, documentation, discussions?

Closes #4667

### How was this PR tested?

My personal fork: https://github.com/bobbai00/texera/actions/runs/25248183326

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)