
chore(licensing): auto-generate per-module NOTICE-binary from jars' META-INF #4675

Open
bobbai00 wants to merge 6 commits into apache:main from bobbai00:feat/auto-generate-notice-binary

Conversation

@bobbai00
Contributor

@bobbai00 bobbai00 commented May 2, 2026

What changes were proposed in this PR?

Stacked on top of #4668. This PR's diff against `main` will reduce to a single commit (the auto-generation work) once #4668 is merged. Until then, this PR shows all of #4668's commits plus the auto-generation commit.

Replaces the hand-curated per-module `NOTICE-binary` files introduced in #4668 with output from a new generator that extracts attribution from each module's bundled jars.

New script — `bin/licensing/generate_notice_binary.py`:

  • Walks each module's `lib/` dir, opens every `.jar` (skips `org.apache.texera.`), extracts every `META-INF/NOTICE` (or root-level `NOTICE`) file.
  • Dedupes by SHA-1 of normalized content; jars sharing a NOTICE collapse into one block.
  • Each block: `--- 80-dash sep ---`, project heading derived from a hand-curated `PROJECT_NAMES` table (longest-prefix match → e.g. `org.apache.hadoop.` → `Apache Hadoop`), sep, "Bundled jars" listing, verbatim upstream NOTICE.
  • Sorted by jar-count desc; hash tiebreaker for stable order.
  • Normalizes CRLF→LF so committed and regenerated outputs match byte-for-byte through git.
  • Optional `--extras ` appends a verbatim block (used for non-jar attributions like aiohttp + Matplotlib).
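The extraction, dedup, and ordering steps listed above can be sketched roughly as follows. This is an illustrative sketch, not the code in `bin/licensing/generate_notice_binary.py`: the function names `collect_notices` and `ordered_blocks`, the exact set of entry names in `NOTICE_NAMES`, and the UTF-8 decoding are all assumptions.

```python
import hashlib
import zipfile
from collections import defaultdict
from pathlib import Path

# Assumption: only these two exact entry names are considered.
NOTICE_NAMES = {"META-INF/NOTICE", "NOTICE"}
FIRST_PARTY_PREFIX = "org.apache.texera."


def normalize(text):
    """Normalize CRLF to LF so hashing and git round-trips are stable."""
    return text.replace("\r\n", "\n")


def collect_notices(lib_dir):
    """Return {sha1: (notice_text, sorted_jar_names)} for the jars under lib_dir."""
    texts = {}
    jars = defaultdict(set)
    for jar in sorted(Path(lib_dir).rglob("*.jar")):
        if jar.name.startswith(FIRST_PARTY_PREFIX):
            continue  # first-party jars carry no third-party attribution
        with zipfile.ZipFile(jar) as zf:
            for entry in zf.namelist():
                if entry not in NOTICE_NAMES:
                    continue
                raw = zf.read(entry).decode("utf-8", errors="replace")
                text = normalize(raw)
                digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
                texts[digest] = text
                jars[digest].add(jar.name)
    return {d: (texts[d], sorted(jars[d])) for d in texts}


def ordered_blocks(notices):
    """Sort by jar count (descending), then by hash for a stable order."""
    return sorted(notices.items(), key=lambda kv: (-len(kv[1][1]), kv[0]))
```

Keying on the hash of the normalized text means two jars collapse into one block only when their NOTICE content is identical after line-ending normalization, which matches the block-count increase described later in this PR.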

`amber/NOTICE-binary-extras` (new): the aiohttp + Matplotlib blocks, since those are Python wheels not jars.
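For orientation, the block layout and the extras append might look like the sketch below. The exact spacing, the indentation of the jar listing, and the names `render_block` and `render_file` are assumptions made for illustration; only the 80-dash separator, the heading, the "Bundled jars" listing, the verbatim NOTICE text, and the appended extras come from the description above.

```python
SEP = "-" * 80  # 80-dash separator line


def render_block(heading, jar_names, notice_text):
    """Render one attribution block: separator, heading, separator, jar list, NOTICE."""
    lines = [SEP, heading, SEP, "", "Bundled jars:"]
    lines += ["  " + name for name in jar_names]
    lines += ["", notice_text.rstrip("\n"), ""]
    return "\n".join(lines) + "\n"


def render_file(blocks, extras=""):
    """Concatenate rendered blocks, then append the extras text verbatim."""
    return "".join(render_block(*block) for block in blocks) + extras
```

Here `extras` stands in for the contents of a file such as `amber/NOTICE-binary-extras`, passed through unchanged so that non-jar attributions are not reformatted.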

The 6 per-module `NOTICE-binary` files are regenerated and replace the curated subsets. Block counts: 24 / 24 / 87 / 92 / 88 / 91 (was 18 / 18 / 25 / 26 / 26 / 27 in #4668). The counts are higher because dedup is by exact content rather than by hand-grouped upstream project, so e.g. Hadoop sub-artifacts whose `META-INF/NOTICE` files differ slightly across versions now show as separate blocks. Every distinct attribution actually shipped is preserved verbatim, which follows the NOTICE requirement of Apache-2.0 §4(d) more closely than the curated subsets did.

CI verification — new step in `build.yml`'s scala job, after the existing dist-unzip + license check:

```
for each module: regenerate NOTICE-binary against /tmp/dists/-*/lib, diff against committed
fail with a one-line fix-up command if drift
```

For future dependency bumps the workflow is: bump in `build.sbt` → CI fails on NOTICE drift → run `./bin/licensing/generate_notice_binary.py /NOTICE-binary [--extras …]` → commit.
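The drift check itself amounts to comparing the committed file with freshly generated output. A minimal sketch of that comparison in Python, using the standard `difflib` module, is shown here; the function name `notice_drift` is hypothetical, and the actual CI step may use a plain shell `diff` instead.

```python
import difflib


def notice_drift(committed, regenerated, path="NOTICE-binary"):
    """Return a unified diff between the two texts, or "" if they match."""
    diff = difflib.unified_diff(
        committed.splitlines(keepends=True),
        regenerated.splitlines(keepends=True),
        fromfile=path + " (committed)",
        tofile=path + " (regenerated)",
    )
    return "".join(diff)
```

A CI wrapper would fail the job whenever the returned string is non-empty and print the regeneration command alongside it.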

Any related issues, documentation, discussions?

Closes #4674
Depends on #4668 (this PR's base will retarget to a clean diff once #4668 lands)

ASF guidance: https://infra.apache.org/licensing-howto.html (Apache-2.0 §4(d))

How was this PR tested?

  • Generator run locally against jars extracted from `ghcr.io/apache/texera-*:61ce334cb` images for all 6 modules; output verified line-by-line against current curated NOTICE blocks.
  • CRLF→LF normalization verified: regenerated files produce byte-identical output to committed files (no spurious git auto-conversion drift).
  • CI step's logic exercised locally: `generate_notice_binary.py /tmp/foo --extras …` then `diff /NOTICE-binary /tmp/foo` → empty (clean).
  • Generator skips `org.apache.texera.*` jars (own first-party content, not third-party).

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

bobbai00 and others added 5 commits May 2, 2026 01:07
…oncat

Splits the monolithic root LICENSE-binary / NOTICE-binary into per-module
ground-truth files, one set per buildable module: each standalone Scala
service, amber (java + python split), frontend, and agent-service. The
root files are kept as-is for the source distribution.

For each Docker image, the dockerfile now copies only the per-module
file(s) relevant to what the image actually bundles. Multi-aspect images
(texera-web-application, computing-unit-master, computing-unit-worker)
merge their inputs into one /texera/LICENSE at build time via a new
bin/licensing/concat_license_binary.py — joining at the license-group
level so e.g. Apache-2.0 contains both Scala/Java jars and Python
packages inline rather than the inputs being stacked end-to-end.

CI: the four existing check_binary_deps.py points (frontend npm, scala
jar, python, agent-npm) now build the same combined LICENSE-binary from
all per-module files and pass it via --license-binary, so the per-module
files become the authoritative claim source for dep validation.

Per-module entry counts were derived by enumerating each container's
bundled jars / pip-listed Python packages / node_modules and filtering
the root LICENSE-binary down to entries that match. No new entries were
invented; combined ⊆ root strictly.

Closes apache#4667

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CDDL group has two sub-license sections (CDDL 1.0 and CDDL 1.1),
each with its own "Scala/Java jars:" subsection. The previous merge
keyed subsections by header alone, so the second "Scala/Java jars:"
(CDDL 1.1) overwrote the first (CDDL 1.0), losing all 22 CDDL-1.0
jars (javax.*, jersey-2.25.1, hk2-2.5.0-b32 family).

Key subsections by (sub_license, header) tuple instead, and on emit
print each sub-license heading once whenever the marker changes.
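The fix described in this commit can be illustrated with a small sketch. It is not the code in `concat_license_binary.py`: the function names, the input shape, and the jar names used in the example are made up for illustration.

```python
def merge_subsections(entries):
    """Group items by (sub_license, header) so equal headers cannot collide.

    entries: iterable of (sub_license, header, item) tuples.
    """
    merged = {}
    for sub_license, header, item in entries:
        merged.setdefault((sub_license, header), []).append(item)
    return merged


def emit(merged):
    """Print each sub-license heading once, whenever the marker changes."""
    lines = []
    current = None
    for (sub_license, header), items in merged.items():
        if sub_license != current:
            lines.append(sub_license)
            current = sub_license
        lines.append(header)
        lines.extend("  - " + item for item in items)
    return lines


# Hypothetical entries: both sub-licenses use the same subsection header.
entries = [
    ("CDDL 1.0", "Scala/Java jars:", "example-a-1.0.jar"),
    ("CDDL 1.1", "Scala/Java jars:", "example-b-2.0.jar"),
    ("CDDL 1.0", "Scala/Java jars:", "example-c-1.0.jar"),
]
merged = merge_subsections(entries)
```

With a dictionary keyed by header alone, the second "Scala/Java jars:" entry would replace the first; the tuple key keeps the CDDL 1.0 and CDDL 1.1 lists separate.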

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…at in checker

The per-module LICENSE-binary and NOTICE-binary files now fully
describe each Docker image's bundled third-party content, so the root
LICENSE-binary and NOTICE-binary are dead code:

  - All dockerfiles ship the per-module file (or merged combination)
    as /texera/LICENSE; none reference root.
  - check_binary_deps.py now auto-builds a combined LICENSE-binary
    from the per-module files via concat_license_binary.py when
    --license-binary is omitted.
  - Source tarball still ships LICENSE and NOTICE (the source-
    distribution variants), which is what ASF requires; the -binary
    variants describe binary content and aren't required for source.

Updates AddMetaInfLicenseFiles.distMappings to take per-module
LICENSE-binary and NOTICE-binary paths (each service's build.sbt
passes its own); amber passes LICENSE-binary-java since the
Universal dist zip is jar-only.

Simplifies build.yml: drops the explicit concat steps before each
check_binary_deps.py invocation since the tool auto-handles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…icense-header check

The skywalking-eyes license-header check fails on amber/LICENSE-binary-java
and amber/LICENSE-binary-python because they're plain-text manifests with
no comment-style and no Apache header (just like the existing root
LICENSE-binary entry already handles).

Replace the now-deleted root LICENSE-binary/NOTICE-binary entries with
glob patterns covering the per-module files: **/LICENSE-binary,
**/LICENSE-binary-*, **/NOTICE-binary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ETA-INF

Adds bin/licensing/generate_notice_binary.py: walks each module's
bundled jars, extracts every META-INF/NOTICE (and root-level NOTICE)
file, dedupes by content hash so jars from the same upstream collapse
into one block, and emits one block per unique blob. Each block lists
contributing jars and reproduces the upstream NOTICE verbatim. Optional
--extras file appends non-jar blocks (used by amber/NOTICE-binary-extras
for the aiohttp + Matplotlib python-only attributions).

Replaces the 6 hand-curated per-module NOTICE-binary files with the
generator's output. Block count rises (from 18-27 to 24-92 per module)
because dedup is by content hash rather than upstream-project header,
so e.g. Apache Hadoop jars whose META-INF/NOTICE differ slightly across
sub-artifacts now appear as separate blocks. ASF compliance is improved:
every distinct upstream attribution actually present in jars is now
preserved verbatim.

CI: build.yml's scala job regenerates the per-module NOTICE-binary
files against the freshly-built dist lib/ dirs and diffs against the
committed files. Drift fails the build with a one-line fix-up command.

Generator normalizes line endings (CRLF -> LF) since some upstream
NOTICE files ship CRLF and would otherwise round-trip through git's
auto-normalization differently than the on-disk regenerated output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the feature, engine, dependencies, python, frontend, ci, dev, service, and agent-service labels May 2, 2026
…erator

The 78-line PROJECT_NAMES table mapped Maven groupId prefixes to
human-readable project labels ("Apache Hadoop", "AWS SDK for Java
2.0", etc.) used as block headings. Since each block already lists
its contributing jars verbatim under "Bundled jars: ...", the heading
just needs to be a navigational summary — the longest common dotted
prefix of the cluster's jar names suffices and requires zero
maintenance when new deps land.

Headings now look like 'org.apache.hadoop' instead of 'Apache Hadoop',
'software.amazon.awssdk' instead of 'AWS SDK for Java 2.0'. ASF
compliance is unchanged: the upstream NOTICE content is still
preserved verbatim.

Single-jar clusters use the jar name minus '.jar'.

Regenerates the 6 per-module NOTICE-binary files with the simpler
headings.
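One way to derive such a heading is sketched below. The function name `heading_for` and the behavior when the jars share no prefix (an empty string here) are assumptions; the commit message does not say how the generator handles that case.

```python
def strip_jar(name):
    """Drop a trailing '.jar' if present."""
    return name[:-4] if name.endswith(".jar") else name


def heading_for(jar_names):
    """Longest common dotted prefix of the jar names; single jars use their own name."""
    if len(jar_names) == 1:
        return strip_jar(jar_names[0])
    split = [strip_jar(name).split(".") for name in jar_names]
    common = []
    for parts in zip(*split):
        if len(set(parts)) != 1:
            break  # first dotted component where the jars disagree
        common.append(parts[0])
    return ".".join(common)
```

Comparing whole dotted components rather than characters avoids headings that stop in the middle of a name segment.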

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented May 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@69f3aea). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4675   +/-   ##
=======================================
  Coverage        ?   46.16%           
  Complexity      ?     1994           
=======================================
  Files           ?     1013           
  Lines           ?    38165           
  Branches        ?     3712           
=======================================
  Hits            ?    17618           
  Misses          ?    19775           
  Partials        ?      772           
| Flag | Coverage Δ |
|---|---|
| agent-service | 28.73% <ø> (?) |
| frontend | 35.28% <ø> (?) |
| python | 85.05% <ø> (?) |
| scala | 38.17% <ø> (?) |



Development

Successfully merging this pull request may close these issues.

Auto-generate per-module NOTICE-binary from jars' META-INF/NOTICE

2 participants