fix: correct overlap behavior for long line chunk splitting#223
fix: correct overlap behavior for long line chunk splitting#223BearAlliance wants to merge 1 commit into
Conversation
| cur := chunks[i] | ||
|
|
||
| if len(prev) >= overlapLength && len(cur) >= overlapLength { | ||
| bridge := prev[len(prev)-overlapLength:] + cur[:overlapLength] |
There was a problem hiding this comment.
Almost same overlapping logic implemented in a different way as before the only change done here is
cur[:overlapLength]
instead of previous on cur[:overlapLength-1]
There was a problem hiding this comment.
While reviewing the PR, found some issue in the overlapping, fixed it.
#227
There was a problem hiding this comment.
Almost same overlapping logic implemented in a different way as before the only change done here is
cur[:overlapLength]
instead of previous on cur[:overlapLength-1]
This is a material change to the behavior. The old code declared tmpString inside the loop each iteration, so the previous chunk tail was not actually preserved into the next iteration.
In practice the “bridge” was not equivalent to prevTail + curHead, it effectively dropped the previous tail.
The new code explicitly builds bridge := prev[len(prev)-overlapLength:] + cur[:overlapLength], so it really uses both adjacent chunks.
It does this for every adjacent pair, not every other chunk.
While reviewing the PR, found some issue in the overlapping, fixed it.
That is only a partial fix of what this PR does. The diff there still uses the old toggle scheme, so it only builds overlap bridges for alternating chunks. With chunks A, B, C, D, it produces bridges for A+B and C+D. It still misses B+C`. I've left a comment on that PR to the same effect.
I've added a new test to illustrate this exact situation:
Let's use a contrived example with overlapLength = 3. The real code uses 25 in pkg/scan/const.go, but the failure mode is the same.
Suppose splitSubN produces four chunks:
A = "aaaaa"
B = "xxSEC"
C = "RETyy"
D = "zzzzz"
Assume the thing you need to detect is SECRET, which spans the boundary between B and C:
B ends with "SEC"
C starts with "RET"
What this produces:
A- bridge
A+B=> tail(A) + head(B) BC- bridge
C+D=> tail(C) + head(D) D
So it checks:
AA+BBCC+DD
It never creates the B+C bridge, so SECRET is missed.
What this PR produces:
A- bridge
A+B B- bridge
B+C C- bridge
C+D D
That B+C bridge contains:
tail(B, 3) + head(C, 3) = "SEC" + "RET" = "SECRET"
There was a problem hiding this comment.
Yes I found the issues raised here, I have added a PR to fix this, the link to PR have been attached. Additionally you can review the PR and let me know if any improvements is needed there.
There was a problem hiding this comment.
The above comment illustrates how the PR you've raised does not entirely fix the problem. Additionally, I'm confused about why you'd open a new PR to fix what this PR already addresses?
There was a problem hiding this comment.
Let's connect for this internally once.
There was a problem hiding this comment.
There are few more improvements that can be done.
- Adding the rune in the bridge making also, else slicing a string with non ascii char can break characters like ñ or emoji into two pieces, eventually corrupting the chunk.
- condition
len(prev) >= overlapLengthis redundant. - Moving the last condition out of the loop for both the loops to make it run efficiently.
- Merging the two loop into just one, reducing the time-complexity from 2n to just simply n, would be useful in case of a big minified file.
The overlap builder incorrectly dropped the previous chunk's tail. That is corrected to use the same amount of overlap granted to the head.
|
I've reverted the full line behavior and squashed commits. Updated the description to match the new diff. If we want to address the memory vs correctness aspect of respecting the |
|
Closing this Pull request Changes went in #227, thanks @BearAlliance for pointing out this, the change went only because you pointed out. Thanks once again. Please ping any other issue that you find. |
Description
The overlap helper for long-line splitting was broken and did not actually preserve the previous chunk’s trailing context, so chunks were scanned with even less surrounding context than intended.
The scenario can be most accurately understood by this new test case
Old (outdated) description
Summary
Fixes false positives when scanning files with very long lines, especially minified JavaScript. This is relevant in contexts where Earlybird is used to scan every branch of a repository, including an orphaned
gh-pagesbranch intended to house build artifacts for serving with GitHub Pages.Whitespaces changes on the modified files are from my editor's interpretation of this project's formatting configuration.
Problem
The scanner splits lines longer than WorkLength into smaller synthetic chunks before applying content rules. That behavior introduced two issues:
UseFullLine: truewere evaluating the chunked fragment rather than the original source line.On minified files, that meant detections were often evaluated against incomplete, artificial substrings instead of the real line, which caused valid full-line suppressions to stop working and increased false positives.
Root Cause
splitJob/splitSubNrewrote oversized lines into synthetic scan jobs.scanLinepropagated the fragment into hit/postprocess evaluation as if it were the true line.findFalsePositiveand related postprocessing logic used that fragment for UseFullLine checks.splitSubNdropped the prior chunk tail due to incorrect overlap handling, so the “overlap” path did not preserve boundary context correctly.Fix
Tests
Added regression coverage for:
Verified with: