:sparkles: Update unsafeblock probe to detect use of Java's Unsafe classes by thomasleplus · Pull Request #4849 · ossf/scorecard

thomasleplus · 2025-11-12T19:14:34Z

What kind of change does this PR introduce?

This PR is adding Java support to the probe introduced by #4499. It looks for references to the classes sun.misc.Unsafe or jdk.internal.misc.Unsafe classes which can bypass the JVM's memory safety features (garbage collection, checks against out-of-bound read and write, etc.).

Note that the PR includes a Java source code parser generated with Antlr4 that can be reused to add more Java probes and checks in the future.

PR title follows the guidelines defined in our pull request documentation

What is the current behavior?

The probe looks for unsafe patterns in go and c# code.

What is the new behavior (if this is a feature change)?

The probe also looks for unsafe patterns in Java code.

Tests for the changes have been added (for bug fixes/features)

Which issue(s) this PR fixes

Contributes to #3736.

Special notes for your reviewer

Does this PR introduce a user-facing change?

For user-facing changes, please add a concise, human-readable release note to
the release-note

(In particular, describe what changes users might need to make in their
application as a result of this pull request.)

Added Java support to probe for non-memory safe practices by detecting references to the sun.misc.Unsafe and jdk.internal.misc.Unsafe classes.

github-actions · 2025-11-23T02:13:59Z

This pull request has been marked stale because it has been open for 10 days with no activity

github-actions · 2025-12-11T02:13:35Z

This pull request has been marked stale because it has been open for 10 days with no activity

github-actions · 2025-12-28T02:14:32Z

This pull request has been marked stale because it has been open for 10 days with no activity

AdamKorcz · 2026-01-07T17:00:13Z

Hi @thomasleplus thanks a lot for the PR and sorry for the long wait here. I think we want coverage of unsafe code blocks in Java added to Scorecard since Scorecard already supports it for other languages, however, given the size of your PR, have you thought of ways to do this with less code? For example, have you explored simple pattern search? At the time of writing, the PR has almost 59k lines of code, and I am not sure the added feature justifies that amount of code.

thomasleplus · 2026-01-07T18:04:13Z

Hi @AdamKorcz,

First let me say that I understand your concern. I considered using a simple regex but the trade off is reliability. The regex might create false positives if it finds an import statement inside commented code for example. With some advanced regex magic, we may be able to exclude some false positives but there are too many to exclude them all (what if the import is inside a string literal, or in an annotation attribute, etc.). I know that it sounds unlike but given a large enough code base, I am sure someone will find an occurrence in their code and complain about it. Only a real Java parser can handle all the cases (except dead code maybe). That's why the equivalent Golang check uses a parser too, it just happens that one was available ready-made whereas I couldn't find a Golang parser for Java, hence my decision to generate one with Antlr.

Another option I considered is to use something like go-tree-sitter which supports Java but it would requires the tree-sitter binaries to be either either installed separately or shipped with scorecard. That doesn't seem ideal.

I think that my approach has the benefit of being a pure Golang solution. It's definitely a lot of code for just the feature described in this PR but it also means that any future Java check can benefit from the fact that the parser is there if it needs to find more complex code issues.

If your main concern is that it is impossible to review that many lines of code, a workaround could be to generate the parser code at build time from the grammar instead of committing it.

Ultimately I will respect what you guys as maintainers feel is best for the project. As I said it's a trade-off, there is no perfect answer. If you want me to rewrite this PR with a regex, I can do that.

Cheers,

Tom

github-actions · 2026-01-18T02:14:22Z

This pull request has been marked stale because it has been open for 10 days with no activity

AdamKorcz · 2026-02-02T13:48:11Z

@thomasleplus Would it be possible to hide the code generation behind a command in Scorecards Makefile and not include it in the PR? In addition, if a user runs the unsafe-code probes on a Java code base, Scorecard is able to check whether the Java parsers would be used (although they wouldn't be because they have not been generated) and log to inform the user that they should generate the java parsers with make generate-java-unsafe-code-parser to analyze the java code.

thomasleplus · 2026-02-08T08:17:29Z

I sure can. I was hoping to finish it this week-end but I have some Makefile target dependency issues.

thomasleplus · 2026-02-11T20:03:06Z

@AdamKorcz I've pushed what I have so far but to be honest I don't fully master the project's CI so I wasn't sure for example where to add the installation of antlr4 in the GitHub workflow files although it's probably just adding apt-get install -y antlr4 in the right place. If you could guide me in the right direction, I'd appreciate it.

spencerschrock · 2026-02-13T16:47:44Z

The last time I looked at this, all of the generated java parser increased binary size by 15MiB, or about 18%.

disclaimer for this next part: my grammar/parser knowledge is rusty.

How does antl4r handle partial grammars? Is there anyway the .g4 files could be minimized to look for imports of sun.misc.Unsafe or jdk.internal.misc.Unsafe, and cut out a lot of the unrelated Java grammars?

I don't fully master the project's CI so I wasn't sure for example where to add the installation of antlr4 in the GitHub workflow files

Is antlr4 not a pure Go dependency? Does it require a package to be installed? I fear this would break our build/release/docker image pipeline, as we've constructed it around avoiding things like cgo or extra dependencies to build the program.

Or is this purely for the go:generate directive? If this is only when updating the antl4r files, and verifying they're up-to-date in the CI, I don't think this is a hard blocker.

//go:generate antlr4 < ... truncated >

hide the code generation behind a command in Scorecards Makefile and not include it in the PR

I'm not sure I agree with this, given the above point.

thomasleplus · 2026-02-21T23:57:17Z

Hi @spencerschrock,

You're right, on my machine (darwin_arm64) the scorecard binary is only 9 MiB bigger but that's still significant. I tried to trim the grammar but it's not working because in Java, before the import statement, there is usually a package statement and it supports annotations. These annotations can have complex arguments with expressions so in order to parse all that and get to the import statement that we really care about, you end up needing almost half of the grammar.

@javax.ws.rs.ApplicationPath(value = "/foo/bar")
package foo.bar;

import sun.misc.Unsafe;
...

To answer your second question, there are two parts of antlr4 involved here. The antlr4 CLI tool that take the grammar files and generate the corresponding parser source code in the desired programming language. But that source code is not self-sufficient, it needs the antlr4 runtime to execute (think of it as the engine that runs the generated parser). That's a library provided for each language (github.com/antlr4-go/antlr/v4 in our case). In my original PR I committed the generated code to avoid depending on the CLI tool (only the runtime Go library). But that resulted in a very large commit that is almost impossible to review unless you just ignore the generated code in internal/java/java20. Adam suggested that I generate the parser code at build time instead but as you correctly pointed out, it implies that you'd need the antlr4 CLI in your CI (and same goes for anyone trying to compile scorecard). Personally I prefer the original approach.

By the way the Go source code generated from the full Java grammar is 2.5 MiB and I think it's safe to assume that the resulting compiled binary is smaller. I'd bet that most of the bloat comes from the added antlr4 runtime dependency instead, but I am not sure how I can confirm that. If I am right then trimming the grammar won't help much regardless of the issue mentioned above.

Trimming the grammar also means that the Java parser introduced by this PR could not be used in the future to create more checks on Java source code. It also becomes a challenge in terms of maintenance: as the Java language continues to evolve, and the Java grammar in the antlr4 repo gets updated accordingly, it would be easier to be able to just copy over the whole new grammar files each time without having to redo the trimming which requires grammar syntax expertise.

In short, I think that trimming the grammar is not the way to go. I think that either we accept that the size increase is a price worth paying to have a full-fledged Java parser in scorecard that can be reused in future checks. Or we decide that it's better to have a simple regex-based check even though it might, in some cases, cause false positives (e.g. find the undesired import statements in commented code etc.). Or of course we can decide that neither option is acceptable and drop the PR.

What do you think?

spencerschrock · 2026-02-25T16:18:58Z

Or we decide that it's better to have a simple regex-based check even though it might, in some cases, cause false positives (e.g. find the undesired import statements in commented code etc.).

Lets go with the regex-based heuristic for now, and if accuracy or future analysis is needed, we can revisit the full gramar you had in earlier commits (e.g. 4f90387). I want to have this land, and that's the path of least resistance.

spencerschrock · 2026-02-25T16:20:57Z

Thank you for iterating on this multiple times!

thomasleplus · 2026-03-08T00:01:28Z

Hi @spencerschrock,

I think that I am finished with the regex approach and I have to say that it works better than I expected. I was able to pass all the same tests as with the parser. Not to say that it is foolproof, it remains a heuristic, but false-positives should be rather limited.

Let me know if you see anything that I may have missed.

Cheers,

Tom

Looks for references to the classes sun.misc.Unsafe or jdk.internal.misc.Unsafe classes which can bypass the JVM's memory safety features (garbage collection, checks against out-of-bound read and write, etc.). Signed-off-by: Thomas Leplus <thomasleplus@users.noreply.github.com>

Remove generated code and add code generation target to Makefile. Signed-off-by: Thomas Leplus <thomasleplus@users.noreply.github.com>

Remove parser and use regexp instead. Signed-off-by: Thomas Leplus <thomasleplus@users.noreply.github.com>

thomasleplus requested a review from a team as a code owner November 12, 2025 19:14

thomasleplus requested review from AdamKorcz and jeffmendoza and removed request for a team November 12, 2025 19:14

github-project-automation bot added this to OpenSSF Scorecard Nov 12, 2025

dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Nov 12, 2025

thomasleplus force-pushed the main branch from 4d3236e to 3dc68b4 Compare November 12, 2025 19:18

github-actions bot added the Stale label Nov 23, 2025

thomasleplus force-pushed the main branch from 1e51e73 to 4f90387 Compare November 30, 2025 00:49

github-actions bot removed the Stale label Nov 30, 2025

github-actions bot added the Stale label Dec 11, 2025

spencerschrock removed the Stale label Dec 17, 2025

spencerschrock self-requested a review December 17, 2025 21:45

github-actions bot added Stale and removed Stale labels Dec 28, 2025

github-actions bot added the Stale label Jan 18, 2026

github-actions bot removed the Stale label Jan 25, 2026

dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels Mar 7, 2026

thomasleplus force-pushed the main branch from cad4d38 to 4a64ecd Compare March 8, 2026 09:40

thomasleplus added 2 commits March 8, 2026 10:49

Update probe to detect use of Java's Unsafe classes

3be3f5a

Remove generated code and add code generation target to Makefile. Signed-off-by: Thomas Leplus <thomasleplus@users.noreply.github.com>

thomasleplus force-pushed the main branch from 5ca6910 to 49907cd Compare March 8, 2026 09:52

Update probe to detect use of Java's Unsafe classes

275d135

Remove parser and use regexp instead. Signed-off-by: Thomas Leplus <thomasleplus@users.noreply.github.com>

thomasleplus force-pushed the main branch from 49907cd to 275d135 Compare March 8, 2026 09:54

Conversation

thomasleplus commented Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What kind of change does this PR introduce?

What is the current behavior?

What is the new behavior (if this is a feature change)?

Which issue(s) this PR fixes

Special notes for your reviewer

Does this PR introduce a user-facing change?

Uh oh!

github-actions bot commented Nov 23, 2025

Uh oh!

github-actions bot commented Dec 11, 2025

Uh oh!

github-actions bot commented Dec 28, 2025

Uh oh!

AdamKorcz commented Jan 7, 2026

Uh oh!

thomasleplus commented Jan 7, 2026

Uh oh!

github-actions bot commented Jan 18, 2026

Uh oh!

AdamKorcz commented Feb 2, 2026

Uh oh!

thomasleplus commented Feb 8, 2026

Uh oh!

thomasleplus commented Feb 11, 2026

Uh oh!

spencerschrock commented Feb 13, 2026

Uh oh!

thomasleplus commented Feb 21, 2026

Uh oh!

spencerschrock commented Feb 25, 2026

Uh oh!

spencerschrock commented Feb 25, 2026

Uh oh!

thomasleplus commented Mar 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

thomasleplus commented Nov 12, 2025 •

edited

Loading