fix: normalize CRLF/LF line endings in stats and checkpoint diffing #1075

Merged
svarlamov merged 4 commits into main from
worktree-crlf-line-ending-fix
Apr 14, 2026

@svarlamov
Member

@svarlamov svarlamov commented Apr 14, 2026

Summary

  • Fixes inflated AI vs Human stats when files switch between CRLF and LF line endings (common on Windows with core.autocrlf or when AI tools write LF into a CRLF repo)
  • A 100-line CRLF file with 5 AI-added LF lines was showing +105 -100 instead of +5 -0
  • Root cause: compute_line_changes() passed raw content to imara-diff, which tokenizes by \n. Lines like "hello\r\n" produce the token "hello\r", which does not match "hello" from "hello\n"
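
The root cause can be reproduced in a few lines. The sketch below is not imara-diff's tokenizer; `tokens_by_newline` and `matching_tokens` are hypothetical helpers that only assume what the summary states, namely that tokens are delimited by \n alone.

```rust
/// Split text on '\n' only, the way a newline-based tokenizer would.
/// Anything before the '\n', including a trailing '\r', stays in the token.
/// (Hypothetical helper, not part of imara-diff or this PR.)
fn tokens_by_newline(text: &str) -> Vec<&str> {
    text.split_terminator('\n').collect()
}

/// Count positions where two token lists agree.
/// This is a naive stand-in for a real diff, used only to show the mismatch.
fn matching_tokens(a: &[&str], b: &[&str]) -> usize {
    a.iter().zip(b.iter()).filter(|(x, y)| x == y).count()
}
```

With these helpers, `tokens_by_newline("hello\r\n")` yields `["hello\r"]` while `tokens_by_newline("hello\n")` yields `["hello"]`, so `matching_tokens` reports 0 matches even though the visible text is identical.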

Changes

  • imara_diff_utils.rs: Add normalize_line_endings() (Cow-based, zero-copy when no \r present). Normalize both inputs in compute_line_changes() before diffing while still returning references to original strings
  • checkpoint.rs: Add content_eq_normalized() and update 3 equality checks to skip files where only line endings differ — prevents unnecessary checkpoint entries
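
For orientation, here is a minimal sketch of what the two helpers named above might look like. The function names come from this PR, but the bodies and signatures are assumptions written for illustration, not the merged code.

```rust
use std::borrow::Cow;

/// Replace every "\r\n" with "\n".
/// Borrows the input unchanged when it contains no '\r' at all,
/// so the common all-LF case does not allocate.
/// (Assumed signature; the real helper may differ.)
fn normalize_line_endings(text: &str) -> Cow<'_, str> {
    if !text.contains('\r') {
        // Fast path: nothing to strip.
        return Cow::Borrowed(text);
    }
    // Slow path: allocate a new string with CRLF collapsed to LF.
    // A bare '\r' that is not followed by '\n' is left untouched.
    Cow::Owned(text.replace("\r\n", "\n"))
}

/// True when two contents are identical, or differ only in
/// CRLF vs LF line endings. (Assumed signature.)
fn content_eq_normalized(a: &str, b: &str) -> bool {
    a == b || normalize_line_endings(a) == normalize_line_endings(b)
}
```

Under these assumptions, `content_eq_normalized("a\r\nb\r\n", "a\nb\n")` is true, while `content_eq_normalized("a\r\nb\r\n", "a\nc\n")` is false.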

Test coverage

  • 6 unit tests for compute_line_changes with CRLF/LF scenarios
  • 4 unit tests for compute_file_line_stats with CRLF/LF scenarios
  • 3 unit tests for attribution preservation through CRLF changes
  • 2 end-to-end tests exercising the real checkpoint flow with CRLF git blobs vs LF working tree

Test plan

  • All 1412 unit tests pass
  • All 2944 integration tests pass
  • All 54 daemon mode tests pass
  • All 33 notes sync regression tests pass
  • CI green on ubuntu checks

🤖 Generated with Claude Code


svarlamov and others added 2 commits April 14, 2026 04:17
When previous_content (from git blob) and current_content (from working
tree) have different line endings — common on Windows with core.autocrlf
or when AI tools write LF into a CRLF repo — every line appeared as
changed, inflating AI vs Human stats (e.g. +105/-100 instead of +5/-0).

Root cause: compute_line_changes() passed raw content to imara-diff,
which tokenizes by \n. Lines like "hello\r\n" produce token "hello\r",
which doesn't match "hello" from "hello\n".

Fix:
- Add normalize_line_endings() that strips \r before \n (with zero-copy
  Cow fast path when no \r present)
- Normalize both inputs in compute_line_changes() before diffing, while
  still returning references to the original input strings
- Add content_eq_normalized() for CRLF-aware equality checks in the
  checkpoint pipeline, preventing unnecessary entries when only line
  endings differ

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
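
The idea of comparing normalized lines while still returning references into the original inputs can be sketched as follows. This is not the PR's compute_line_changes(); `line_key` and `changed_lines` are hypothetical names, and a simple common-prefix/common-suffix trim stands in for the real imara-diff call.

```rust
/// Comparison key for one line: drop a trailing "\n" or "\r\n".
/// A '\r' is removed only when it directly precedes the '\n'.
fn line_key(line: &str) -> &str {
    match line.strip_suffix('\n') {
        Some(rest) => rest.strip_suffix('\r').unwrap_or(rest),
        None => line,
    }
}

/// Return (removed, added) lines as slices of the ORIGINAL inputs.
/// Lines are compared by `line_key`, so CRLF and LF variants of the
/// same text count as equal. Only the common prefix and suffix are
/// trimmed; this is a simplification, not a minimal diff.
fn changed_lines<'a>(old: &'a str, new: &'a str) -> (Vec<&'a str>, Vec<&'a str>) {
    let old_lines: Vec<&str> = old.split_inclusive('\n').collect();
    let new_lines: Vec<&str> = new.split_inclusive('\n').collect();
    let same = |a: &str, b: &str| line_key(a) == line_key(b);

    // Skip the common prefix.
    let mut start = 0;
    while start < old_lines.len()
        && start < new_lines.len()
        && same(old_lines[start], new_lines[start])
    {
        start += 1;
    }

    // Skip the common suffix, without crossing the prefix.
    let (mut old_end, mut new_end) = (old_lines.len(), new_lines.len());
    while old_end > start
        && new_end > start
        && same(old_lines[old_end - 1], new_lines[new_end - 1])
    {
        old_end -= 1;
        new_end -= 1;
    }

    (
        old_lines[start..old_end].to_vec(),
        new_lines[start..new_end].to_vec(),
    )
}
```

For a 100-line CRLF input and an LF copy with 5 extra lines appended, this sketch reports 0 removed and 5 added lines, and each returned slice still carries its original terminator.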
devin-ai-integration[bot]

This comment was marked as resolved.

…ment

Bare \r (pre-2001 Mac OS 9) converted to \n would increase the
normalized line count, causing hunk indices from the normalized diff to
exceed the bounds of original line arrays from split_lines_with_terminators.
Only handle \r\n → \n (Windows CRLF) which preserves line count.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
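
The line-count argument in this commit message can be checked with a small sketch. All three functions are hypothetical and assume that lines are split on \n only, which is how the message describes split_lines_with_terminators.

```rust
/// Count lines the way a '\n'-terminated splitter would.
fn line_count(text: &str) -> usize {
    text.split_inclusive('\n').count()
}

/// CRLF-only normalization: "\r\n" becomes "\n", bare '\r' is kept.
fn normalize_crlf_only(text: &str) -> String {
    text.replace("\r\n", "\n")
}

/// Broader normalization that also turns bare '\r' into '\n'.
/// This is the variant the commit moves away from.
fn normalize_all_cr(text: &str) -> String {
    text.replace("\r\n", "\n").replace('\r', "\n")
}
```

For the input `"a\rb\r\nc\n"`, the original has 2 lines and the CRLF-only result also has 2, but the broader variant produces 3. A hunk index computed against the 3-line normalized text could then point past the end of the 2-element array of original lines.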
devin-ai-integration[bot]

This comment was marked as resolved.

@svarlamov
Member Author

CI Status — Ubuntu Checks

All Ubuntu-based checks are green except Test on ubuntu-latest (daemon), which failed on an unrelated flaky test:

squash_merge::test_prepare_working_log_squash_with_main_changes_standard_human_in_worktree
Line 4: Expected AI author but got 'Test User'

This is a pre-existing flaky test — the same Test on ubuntu-latest (daemon) job also fails on main (run 24362817360) with a different test (graphite::test_gt_create_then_squash_then_fold — daemon timeout). The failing test is in the squash merge integration path, which is unrelated to the CRLF normalization changes in this PR.

Will re-run the failed job once the current run completes.

When a previous checkpoint stores a CRLF blob and the working tree
converts to LF without content changes, the checkpoint now updates the
stored blob to LF (remapping attributions via line-number roundtrip)
instead of skipping entirely. This prevents the next AI checkpoint from
seeing all lines as changed due to CRLF vs LF byte differences in
capture_diff_slices, which with force_split=true would re-attribute
every line to AI.

Addresses Devin PR review feedback on #1075.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@svarlamov
Member Author

Addressed Devin review feedback (stale CRLF blob re-attribution)

Fixed in f872ecd. When content differs only in line endings (CRLF ↔ LF), the checkpoint now updates the stored blob to LF instead of skipping entirely. Attributions are preserved via a line-attribution roundtrip (char attrs → line attrs using old content → char attrs using new content), correctly remapping byte offsets from CRLF to LF space.
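
A rough sketch of the roundtrip described above, under stated assumptions: `CharAttr`, `LineAttr`, and every function here are hypothetical illustrations, not the types or code in checkpoint.rs. Attributions are modeled as byte ranges with an author, and a line is considered attributed if its byte span overlaps the range.

```rust
/// Byte-range attribution [start, end) with an author. (Hypothetical type.)
#[derive(Debug, Clone, PartialEq)]
struct CharAttr { start: usize, end: usize, author: String }

/// Zero-based line attribution. (Hypothetical type.)
#[derive(Debug, Clone, PartialEq)]
struct LineAttr { line: usize, author: String }

/// Byte span [start, end) of every line, terminators included.
fn line_spans(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut offset = 0;
    for line in text.split_inclusive('\n') {
        spans.push((offset, offset + line.len()));
        offset += line.len();
    }
    spans
}

/// Char attrs -> line attrs, measured against `content`.
fn to_line_attrs(attrs: &[CharAttr], content: &str) -> Vec<LineAttr> {
    let spans = line_spans(content);
    let mut out = Vec::new();
    for attr in attrs {
        for (line, &(start, end)) in spans.iter().enumerate() {
            // The line is attributed when the two byte ranges overlap.
            if attr.start < end && start < attr.end {
                out.push(LineAttr { line, author: attr.author.clone() });
            }
        }
    }
    out
}

/// Line attrs -> char attrs, measured against `content`.
/// Lines that do not exist in `content` are dropped.
fn to_char_attrs(attrs: &[LineAttr], content: &str) -> Vec<CharAttr> {
    let spans = line_spans(content);
    attrs
        .iter()
        .filter_map(|a| {
            spans.get(a.line).map(|&(start, end)| CharAttr {
                start,
                end,
                author: a.author.clone(),
            })
        })
        .collect()
}

/// Roundtrip: byte offsets in `old` -> line numbers -> byte offsets in `new`.
fn remap_attrs(attrs: &[CharAttr], old: &str, new: &str) -> Vec<CharAttr> {
    to_char_attrs(&to_line_attrs(attrs, old), new)
}
```

With `old = "a\r\nbb\r\n"` and `new = "a\nbb\n"`, an attribution covering bytes 3..7 (the second CRLF line) remaps to bytes 2..5, which is the same line in LF space. The roundtrip works because CRLF to LF conversion keeps line numbers stable while shifting byte offsets.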

Added a regression test, test_checkpoint_stale_crlf_blob_causes_ai_reattribution, that reproduces the exact scenario: a stale CRLF blob followed by an AI checkpoint with force_split=true, which previously re-attributed all 5 lines to AI when only 1 was actually added.

@svarlamov svarlamov merged commit 47de6ad into main Apr 14, 2026
44 of 46 checks passed
@svarlamov svarlamov deleted the worktree-crlf-line-ending-fix branch April 14, 2026 15:33