✨ Update unsafeblock probe to detect use of Java's Unsafe classes#4849
✨ Update unsafeblock probe to detect use of Java's Unsafe classes#4849thomasleplus wants to merge 3 commits intoossf:mainfrom
Conversation
|
This pull request has been marked stale because it has been open for 10 days with no activity |
|
This pull request has been marked stale because it has been open for 10 days with no activity |
|
This pull request has been marked stale because it has been open for 10 days with no activity |
|
Hi @thomasleplus thanks a lot for the PR and sorry for the long wait here. I think we want coverage of unsafe code blocks in Java added to Scorecard since Scorecard already supports it for other languages, however, given the size of your PR, have you thought of ways to do this with less code? For example, have you explored simple pattern search? At the time of writing, the PR has almost 59k lines of code, and I am not sure the added feature justifies that amount of code. |
|
Hi @AdamKorcz, First let me say that I understand your concern. I considered using a simple regex but the trade off is reliability. The regex might create false positives if it finds an import statement inside commented code for example. With some advanced regex magic, we may be able to exclude some false positives but there are too many to exclude them all (what if the import is inside a string literal, or in an annotation attribute, etc.). I know that it sounds unlike but given a large enough code base, I am sure someone will find an occurrence in their code and complain about it. Only a real Java parser can handle all the cases (except dead code maybe). That's why the equivalent Golang check uses a parser too, it just happens that one was available ready-made whereas I couldn't find a Golang parser for Java, hence my decision to generate one with Antlr. Another option I considered is to use something like go-tree-sitter which supports Java but it would requires the tree-sitter binaries to be either either installed separately or shipped with scorecard. That doesn't seem ideal. I think that my approach has the benefit of being a pure Golang solution. It's definitely a lot of code for just the feature described in this PR but it also means that any future Java check can benefit from the fact that the parser is there if it needs to find more complex code issues. If your main concern is that it is impossible to review that many lines of code, a workaround could be to generate the parser code at build time from the grammar instead of committing it. Ultimately I will respect what you guys as maintainers feel is best for the project. As I said it's a trade-off, there is no perfect answer. If you want me to rewrite this PR with a regex, I can do that. Cheers, Tom |
|
This pull request has been marked stale because it has been open for 10 days with no activity |
|
@thomasleplus Would it be possible to hide the code generation behind a command in Scorecards |
|
I sure can. I was hoping to finish it this week-end but I have some |
|
@AdamKorcz I've pushed what I have so far but to be honest I don't fully master the project's CI so I wasn't sure for example where to add the installation of antlr4 in the GitHub workflow files although it's probably just adding |
|
The last time I looked at this, all of the generated java parser increased binary size by 15MiB, or about 18%. disclaimer for this next part: my grammar/parser knowledge is rusty. How does antl4r handle partial grammars? Is there anyway the
Is Or is this purely for the
I'm not sure I agree with this, given the above point. |
|
Hi @spencerschrock, You're right, on my machine (darwin_arm64) the scorecard binary is only 9 MiB bigger but that's still significant. I tried to trim the grammar but it's not working because in Java, before the @javax.ws.rs.ApplicationPath(value = "/foo/bar")
package foo.bar;
import sun.misc.Unsafe;
...To answer your second question, there are two parts of antlr4 involved here. The antlr4 CLI tool that take the grammar files and generate the corresponding parser source code in the desired programming language. But that source code is not self-sufficient, it needs the antlr4 runtime to execute (think of it as the engine that runs the generated parser). That's a library provided for each language ( By the way the Go source code generated from the full Java grammar is 2.5 MiB and I think it's safe to assume that the resulting compiled binary is smaller. I'd bet that most of the bloat comes from the added antlr4 runtime dependency instead, but I am not sure how I can confirm that. If I am right then trimming the grammar won't help much regardless of the issue mentioned above. Trimming the grammar also means that the Java parser introduced by this PR could not be used in the future to create more checks on Java source code. It also becomes a challenge in terms of maintenance: as the Java language continues to evolve, and the Java grammar in the antlr4 repo gets updated accordingly, it would be easier to be able to just copy over the whole new grammar files each time without having to redo the trimming which requires grammar syntax expertise. In short, I think that trimming the grammar is not the way to go. I think that either we accept that the size increase is a price worth paying to have a full-fledged Java parser in scorecard that can be reused in future checks. Or we decide that it's better to have a simple regex-based check even though it might, in some cases, cause false positives (e.g. find the undesired What do you think? |
Lets go with the regex-based heuristic for now, and if accuracy or future analysis is needed, we can revisit the full gramar you had in earlier commits (e.g. 4f90387). I want to have this land, and that's the path of least resistance. |
|
Thank you for iterating on this multiple times! |
|
Hi @spencerschrock, I think that I am finished with the regex approach and I have to say that it works better than I expected. I was able to pass all the same tests as with the parser. Not to say that it is foolproof, it remains a heuristic, but false-positives should be rather limited. Let me know if you see anything that I may have missed. Cheers, Tom |
Looks for references to the classes sun.misc.Unsafe or jdk.internal.misc.Unsafe classes which can bypass the JVM's memory safety features (garbage collection, checks against out-of-bound read and write, etc.). Signed-off-by: Thomas Leplus <thomasleplus@users.noreply.github.com>
Remove generated code and add code generation target to Makefile. Signed-off-by: Thomas Leplus <thomasleplus@users.noreply.github.com>
Remove parser and use regexp instead. Signed-off-by: Thomas Leplus <thomasleplus@users.noreply.github.com>
What kind of change does this PR introduce?
This PR is adding Java support to the probe introduced by #4499. It looks for references to the classes
sun.misc.Unsafeorjdk.internal.misc.Unsafeclasses which can bypass the JVM's memory safety features (garbage collection, checks against out-of-bound read and write, etc.).Note that the PR includes a Java source code parser generated with Antlr4 that can be reused to add more Java probes and checks in the future.
What is the current behavior?
The probe looks for unsafe patterns in go and c# code.
What is the new behavior (if this is a feature change)?
The probe also looks for unsafe patterns in Java code.
Which issue(s) this PR fixes
Contributes to #3736.
Special notes for your reviewer
Does this PR introduce a user-facing change?
For user-facing changes, please add a concise, human-readable release note to
the
release-note(In particular, describe what changes users might need to make in their
application as a result of this pull request.)