Skip to content

fix: correct overlap behavior for long line chunk splitting#223

Closed
BearAlliance wants to merge 1 commit into
americanexpress:mainfrom
BearAlliance:long-lines
Closed

fix: correct overlap behavior for long line chunk splitting#223
BearAlliance wants to merge 1 commit into
americanexpress:mainfrom
BearAlliance:long-lines

Conversation

@BearAlliance
Copy link
Copy Markdown

@BearAlliance BearAlliance commented May 6, 2026

Description

The overlap helper for long-line splitting was broken and did not actually preserve the previous chunk’s trailing context, so chunks were scanned with even less surrounding context than intended.

The scenario can be most accurately understood by this new test case

Old (outdated) description

Summary

Fixes false positives when scanning files with very long lines, especially minified JavaScript. This is relevant in contexts where Earlybird is used to scan every branch of a repository, including an orphaned gh-pages branch intended to house build artifacts for serving with GitHub Pages.

Whitespaces changes on the modified files are from my editor's interpretation of this project's formatting configuration.

Problem

The scanner splits lines longer than WorkLength into smaller synthetic chunks before applying content rules. That behavior introduced two issues:

  1. False-positive rules marked UseFullLine: true were evaluating the chunked fragment rather than the original source line.
  2. The overlap helper for long-line splitting was broken and did not actually preserve the previous chunk’s trailing context, so chunks were scanned with even less surrounding context than intended.

On minified files, that meant detections were often evaluated against incomplete, artificial substrings instead of the real line, which caused valid full-line suppressions to stop working and increased false positives.

Root Cause

  • splitJob / splitSubN rewrote oversized lines into synthetic scan jobs.
  • scanLine propagated the fragment into hit/postprocess evaluation as if it were the true line.
  • findFalsePositive and related postprocessing logic used that fragment for UseFullLine checks.
  • splitSubN dropped the prior chunk tail due to incorrect overlap handling, so the “overlap” path did not preserve boundary context correctly.

Fix

  • Preserve the original unsplit line as internal scan context while still scanning chunked segments.
  • Use that original full line for false-positive checks and line-based postprocessing/labeling logic that depends on full context.
  • Correct the overlap generation so adjacent long-line chunks produce a real bridge window across the chunk boundary.
  • Keep the external report format unchanged. The extra full-line context is internal only.

Tests

Added regression coverage for:

  • Long-line overlap generation.
  • False-positive evaluation using the original full line rather than a chunk fragment.

Verified with:

go test ./...

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 6, 2026

CLA assistant check
All committers have signed the CLA.

Comment thread pkg/scan/scan.go
cur := chunks[i]

if len(prev) >= overlapLength && len(cur) >= overlapLength {
bridge := prev[len(prev)-overlapLength:] + cur[:overlapLength]
Copy link
Copy Markdown
Contributor

@afham123 afham123 May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost same overlapping logic implemented in a different way as before the only change done here is
cur[:overlapLength]
instead of previous on cur[:overlapLength-1]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While reviewing the PR, found some issue in the overlapping, fixed it.
#227

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost same overlapping logic implemented in a different way as before the only change done here is
cur[:overlapLength]
instead of previous on cur[:overlapLength-1]

This is a material change to the behavior. The old code declared tmpString inside the loop each iteration, so the previous chunk tail was not actually preserved into the next iteration.
In practice the “bridge” was not equivalent to prevTail + curHead, it effectively dropped the previous tail.

The new code explicitly builds bridge := prev[len(prev)-overlapLength:] + cur[:overlapLength], so it really uses both adjacent chunks.
It does this for every adjacent pair, not every other chunk.

While reviewing the PR, found some issue in the overlapping, fixed it.

That is only a partial fix of what this PR does. The diff there still uses the old toggle scheme, so it only builds overlap bridges for alternating chunks. With chunks A, B, C, D, it produces bridges for A+B and C+D. It still misses B+C`. I've left a comment on that PR to the same effect.

I've added a new test to illustrate this exact situation:

Let's use a contrived example with overlapLength = 3. The real code uses 25 in pkg/scan/const.go, but the failure mode is the same.

Suppose splitSubN produces four chunks:

A = "aaaaa"
B = "xxSEC"
C = "RETyy"
D = "zzzzz"

Assume the thing you need to detect is SECRET, which spans the boundary between B and C:

B ends with  "SEC"
C starts with "RET"

What this produces:

  • A
  • bridge A+B => tail(A) + head(B)
  • B
  • C
  • bridge C+D => tail(C) + head(D)
  • D

So it checks:

  • A
  • A+B
  • B
  • C
  • C+D
  • D

It never creates the B+C bridge, so SECRET is missed.

What this PR produces:

  • A
  • bridge A+B
  • B
  • bridge B+C
  • C
  • bridge C+D
  • D

That B+C bridge contains:

tail(B, 3) + head(C, 3) = "SEC" + "RET" = "SECRET"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I found the issues raised here, I have added a PR to fix this, the link to PR have been attached. Additionally you can review the PR and let me know if any improvements is needed there.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above comment illustrates how the PR you've raised does not entirely fix the problem. Additionally, I'm confused about why you'd open a new PR to fix what this PR already addresses?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's connect for this internally once.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are few more improvements that can be done.

  • Adding the rune in the bridge making also, else slicing a string with non ascii char can break characters like ñ or emoji into two pieces, eventually corrupting the chunk.
  • condition len(prev) >= overlapLength is redundant.
  • Moving the last condition out of the loop for both the loops to make it run efficiently.
  • Merging the two loop into just one, reducing the time-complexity from 2n to just simply n, would be useful in case of a big minified file.

Copy link
Copy Markdown
Contributor

@afham123 afham123 May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the PR for these improvements.
#227

Comment thread pkg/scan/scanUtil.go Outdated
@BearAlliance BearAlliance mentioned this pull request May 11, 2026
The overlap builder incorrectly dropped the previous
chunk's tail. That is corrected to use the
same amount of overlap granted to the head.
@BearAlliance
Copy link
Copy Markdown
Author

I've reverted the full line behavior and squashed commits. Updated the description to match the new diff. If we want to address the memory vs correctness aspect of respecting the UseFullLine false-positive configuration we can do that in a separate PR.

@BearAlliance BearAlliance changed the title fix: behave the same for short and long lines fix: correct overlap behavior for long line chunk splitting May 11, 2026
@afham123
Copy link
Copy Markdown
Contributor

afham123 commented May 22, 2026

Closing this Pull request Changes went in #227, thanks @BearAlliance for pointing out this, the change went only because you pointed out. Thanks once again. Please ping any other issue that you find.

@afham123 afham123 closed this May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants