Skip to content

✨ Update unsafeblock probe to detect use of Java's Unsafe classes#4849

Open
thomasleplus wants to merge 3 commits intoossf:mainfrom
thomasleplus:main
Open

✨ Update unsafeblock probe to detect use of Java's Unsafe classes#4849
thomasleplus wants to merge 3 commits intoossf:mainfrom
thomasleplus:main

Conversation

@thomasleplus
Copy link

@thomasleplus thomasleplus commented Nov 12, 2025

What kind of change does this PR introduce?

This PR is adding Java support to the probe introduced by #4499. It looks for references to the classes sun.misc.Unsafe or jdk.internal.misc.Unsafe classes which can bypass the JVM's memory safety features (garbage collection, checks against out-of-bound read and write, etc.).

Note that the PR includes a Java source code parser generated with Antlr4 that can be reused to add more Java probes and checks in the future.

What is the current behavior?

The probe looks for unsafe patterns in go and c# code.

What is the new behavior (if this is a feature change)?

The probe also looks for unsafe patterns in Java code.

  • Tests for the changes have been added (for bug fixes/features)

Which issue(s) this PR fixes

Contributes to #3736.

Special notes for your reviewer

Does this PR introduce a user-facing change?

For user-facing changes, please add a concise, human-readable release note to
the release-note

(In particular, describe what changes users might need to make in their
application as a result of this pull request.)

Added Java support to probe for non-memory safe practices by detecting references to the sun.misc.Unsafe and jdk.internal.misc.Unsafe classes.

@thomasleplus thomasleplus requested a review from a team as a code owner November 12, 2025 19:14
@thomasleplus thomasleplus requested review from AdamKorcz and jeffmendoza and removed request for a team November 12, 2025 19:14
@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Nov 12, 2025
@github-actions
Copy link

This pull request has been marked stale because it has been open for 10 days with no activity

@github-actions
Copy link

This pull request has been marked stale because it has been open for 10 days with no activity

@github-actions
Copy link

This pull request has been marked stale because it has been open for 10 days with no activity

@github-actions github-actions bot added Stale and removed Stale labels Dec 28, 2025
@AdamKorcz
Copy link
Contributor

Hi @thomasleplus thanks a lot for the PR and sorry for the long wait here. I think we want coverage of unsafe code blocks in Java added to Scorecard since Scorecard already supports it for other languages, however, given the size of your PR, have you thought of ways to do this with less code? For example, have you explored simple pattern search? At the time of writing, the PR has almost 59k lines of code, and I am not sure the added feature justifies that amount of code.

@thomasleplus
Copy link
Author

Hi @AdamKorcz,

First let me say that I understand your concern. I considered using a simple regex but the trade off is reliability. The regex might create false positives if it finds an import statement inside commented code for example. With some advanced regex magic, we may be able to exclude some false positives but there are too many to exclude them all (what if the import is inside a string literal, or in an annotation attribute, etc.). I know that it sounds unlike but given a large enough code base, I am sure someone will find an occurrence in their code and complain about it. Only a real Java parser can handle all the cases (except dead code maybe). That's why the equivalent Golang check uses a parser too, it just happens that one was available ready-made whereas I couldn't find a Golang parser for Java, hence my decision to generate one with Antlr.

Another option I considered is to use something like go-tree-sitter which supports Java but it would requires the tree-sitter binaries to be either either installed separately or shipped with scorecard. That doesn't seem ideal.

I think that my approach has the benefit of being a pure Golang solution. It's definitely a lot of code for just the feature described in this PR but it also means that any future Java check can benefit from the fact that the parser is there if it needs to find more complex code issues.

If your main concern is that it is impossible to review that many lines of code, a workaround could be to generate the parser code at build time from the grammar instead of committing it.

Ultimately I will respect what you guys as maintainers feel is best for the project. As I said it's a trade-off, there is no perfect answer. If you want me to rewrite this PR with a regex, I can do that.

Cheers,

Tom

@github-actions
Copy link

This pull request has been marked stale because it has been open for 10 days with no activity

@github-actions github-actions bot added the Stale label Jan 18, 2026
@github-actions github-actions bot removed the Stale label Jan 25, 2026
@AdamKorcz
Copy link
Contributor

@thomasleplus Would it be possible to hide the code generation behind a command in Scorecards Makefile and not include it in the PR? In addition, if a user runs the unsafe-code probes on a Java code base, Scorecard is able to check whether the Java parsers would be used (although they wouldn't be because they have not been generated) and log to inform the user that they should generate the java parsers with make generate-java-unsafe-code-parser to analyze the java code.

@thomasleplus
Copy link
Author

I sure can. I was hoping to finish it this week-end but I have some Makefile target dependency issues.

@thomasleplus
Copy link
Author

@AdamKorcz I've pushed what I have so far but to be honest I don't fully master the project's CI so I wasn't sure for example where to add the installation of antlr4 in the GitHub workflow files although it's probably just adding apt-get install -y antlr4 in the right place. If you could guide me in the right direction, I'd appreciate it.

@spencerschrock
Copy link
Member

The last time I looked at this, all of the generated java parser increased binary size by 15MiB, or about 18%.

disclaimer for this next part: my grammar/parser knowledge is rusty.

How does antl4r handle partial grammars? Is there anyway the .g4 files could be minimized to look for imports of sun.misc.Unsafe or jdk.internal.misc.Unsafe, and cut out a lot of the unrelated Java grammars?

I don't fully master the project's CI so I wasn't sure for example where to add the installation of antlr4 in the GitHub workflow files

Is antlr4 not a pure Go dependency? Does it require a package to be installed? I fear this would break our build/release/docker image pipeline, as we've constructed it around avoiding things like cgo or extra dependencies to build the program.

Or is this purely for the go:generate directive? If this is only when updating the antl4r files, and verifying they're up-to-date in the CI, I don't think this is a hard blocker.

//go:generate antlr4 < ... truncated >

hide the code generation behind a command in Scorecards Makefile and not include it in the PR

I'm not sure I agree with this, given the above point.

@thomasleplus
Copy link
Author

Hi @spencerschrock,

You're right, on my machine (darwin_arm64) the scorecard binary is only 9 MiB bigger but that's still significant. I tried to trim the grammar but it's not working because in Java, before the import statement, there is usually a package statement and it supports annotations. These annotations can have complex arguments with expressions so in order to parse all that and get to the import statement that we really care about, you end up needing almost half of the grammar.

@javax.ws.rs.ApplicationPath(value = "/foo/bar")
package foo.bar;

import sun.misc.Unsafe;
...

To answer your second question, there are two parts of antlr4 involved here. The antlr4 CLI tool that take the grammar files and generate the corresponding parser source code in the desired programming language. But that source code is not self-sufficient, it needs the antlr4 runtime to execute (think of it as the engine that runs the generated parser). That's a library provided for each language (github.com/antlr4-go/antlr/v4 in our case). In my original PR I committed the generated code to avoid depending on the CLI tool (only the runtime Go library). But that resulted in a very large commit that is almost impossible to review unless you just ignore the generated code in internal/java/java20. Adam suggested that I generate the parser code at build time instead but as you correctly pointed out, it implies that you'd need the antlr4 CLI in your CI (and same goes for anyone trying to compile scorecard). Personally I prefer the original approach.

By the way the Go source code generated from the full Java grammar is 2.5 MiB and I think it's safe to assume that the resulting compiled binary is smaller. I'd bet that most of the bloat comes from the added antlr4 runtime dependency instead, but I am not sure how I can confirm that. If I am right then trimming the grammar won't help much regardless of the issue mentioned above.

Trimming the grammar also means that the Java parser introduced by this PR could not be used in the future to create more checks on Java source code. It also becomes a challenge in terms of maintenance: as the Java language continues to evolve, and the Java grammar in the antlr4 repo gets updated accordingly, it would be easier to be able to just copy over the whole new grammar files each time without having to redo the trimming which requires grammar syntax expertise.

In short, I think that trimming the grammar is not the way to go. I think that either we accept that the size increase is a price worth paying to have a full-fledged Java parser in scorecard that can be reused in future checks. Or we decide that it's better to have a simple regex-based check even though it might, in some cases, cause false positives (e.g. find the undesired import statements in commented code etc.). Or of course we can decide that neither option is acceptable and drop the PR.

What do you think?

@spencerschrock
Copy link
Member

Or we decide that it's better to have a simple regex-based check even though it might, in some cases, cause false positives (e.g. find the undesired import statements in commented code etc.).

Lets go with the regex-based heuristic for now, and if accuracy or future analysis is needed, we can revisit the full gramar you had in earlier commits (e.g. 4f90387). I want to have this land, and that's the path of least resistance.

@spencerschrock
Copy link
Member

Thank you for iterating on this multiple times!

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels Mar 7, 2026
@thomasleplus
Copy link
Author

Hi @spencerschrock,

I think that I am finished with the regex approach and I have to say that it works better than I expected. I was able to pass all the same tests as with the parser. Not to say that it is foolproof, it remains a heuristic, but false-positives should be rather limited.

Let me know if you see anything that I may have missed.

Cheers,

Tom

Looks for references to the classes sun.misc.Unsafe or jdk.internal.misc.Unsafe classes which can bypass the JVM's memory safety features (garbage collection, checks against out-of-bound read and write, etc.).

Signed-off-by: Thomas Leplus <thomasleplus@users.noreply.github.com>
Remove generated code and add code generation target to Makefile.

Signed-off-by: Thomas Leplus <thomasleplus@users.noreply.github.com>
Remove parser and use regexp instead.

Signed-off-by: Thomas Leplus <thomasleplus@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XL This PR changes 500-999 lines, ignoring generated files.

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants