fix: normalize CRLF/LF line endings in stats and checkpoint diffing #1075
When previous_content (from git blob) and current_content (from working tree) have different line endings (common on Windows with core.autocrlf or when AI tools write LF into a CRLF repo), every line appeared as changed, inflating AI vs Human stats (e.g. +105/-100 instead of +5/-0).

Root cause: compute_line_changes() passed raw content to imara-diff, which tokenizes by \n. Lines like "hello\r\n" produce token "hello\r", which doesn't match "hello" from "hello\n".

Fix:
- Add normalize_line_endings() that strips \r before \n (with zero-copy Cow fast path when no \r present)
- Normalize both inputs in compute_line_changes() before diffing, while still returning references to the original input strings
- Add content_eq_normalized() for CRLF-aware equality checks in the checkpoint pipeline, preventing unnecessary entries when only line endings differ

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
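A minimal sketch of what a Cow-based normalizer along these lines could look like. The function name comes from the commit message, but the signature and body here are assumptions rather than the PR's actual code. It only rewrites \r\n pairs and leaves any bare \r untouched.

```rust
use std::borrow::Cow;

/// Sketch (assumed signature): replace every "\r\n" with "\n".
/// When the input contains no '\r' at all, the original slice is
/// borrowed and nothing is allocated.
fn normalize_line_endings(s: &str) -> Cow<'_, str> {
    if !s.contains('\r') {
        // Fast path: nothing to strip, hand back the input as-is.
        return Cow::Borrowed(s);
    }
    // Slow path: allocate a new String with CRLF collapsed to LF.
    // A bare '\r' (not followed by '\n') is left in place.
    Cow::Owned(s.replace("\r\n", "\n"))
}
```

With this shape, "hello\r\n" and "hello\n" normalize to the same string, so a tokenizer that splits on \n sees identical tokens for both.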
…ment

Bare \r (pre-2001 Mac OS 9) converted to \n would increase the normalized line count, causing hunk indices from the normalized diff to exceed the bounds of original line arrays from split_lines_with_terminators. Only handle \r\n → \n (Windows CRLF), which preserves line count.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
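The line-count argument can be checked with a small sketch. Here line_count is a hypothetical stand-in for a splitter that keeps terminators and splits only on \n; it is not the project's split_lines_with_terminators, and the two normalizer variants are illustrative names.

```rust
/// Hypothetical stand-in for a terminator-preserving splitter that
/// treats only '\n' as a line break.
fn line_count(s: &str) -> usize {
    s.split_inclusive('\n').count()
}

/// Variant A: only collapse CRLF pairs (line count is unchanged,
/// because every "\r\n" already ended in '\n').
fn normalize_crlf_only(s: &str) -> String {
    s.replace("\r\n", "\n")
}

/// Variant B: additionally turn each bare '\r' into '\n'
/// (this can add line breaks that the original did not have).
fn normalize_crlf_and_bare_cr(s: &str) -> String {
    s.replace("\r\n", "\n").replace('\r', "\n")
}
```

For an input such as "a\rb\r\nc\n", the original and variant A both have two lines, while variant B has three. A hunk computed on variant B's output could therefore refer to a line index that does not exist in the original line array.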
CI Status: Ubuntu Checks
All Ubuntu-based checks are green except … This is a pre-existing flaky test: the same … Will re-run the failed job once the current run completes.
When a previous checkpoint stores a CRLF blob and the working tree converts to LF without content changes, the checkpoint now updates the stored blob to LF (remapping attributions via line-number roundtrip) instead of skipping entirely. This prevents the next AI checkpoint from seeing all lines as changed due to CRLF vs LF byte differences in capture_diff_slices, which with force_split=true would re-attribute every line to AI.

Addresses Devin PR review feedback on #1075.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
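The decision described above could be sketched as a three-way classification. The enum, the classify function, and the body of content_eq_normalized are assumptions made for illustration; only the name content_eq_normalized appears in the PR.

```rust
/// Hypothetical outcome of comparing a stored blob with the working tree.
#[derive(Debug, PartialEq)]
enum ContentChange {
    /// Byte-for-byte identical: nothing to do.
    Identical,
    /// Same text once CRLF is collapsed to LF: refresh the stored blob
    /// and remap attributions, but record no content change.
    LineEndingsOnly,
    /// Real content change: proceed with normal diffing.
    Changed,
}

/// Sketch of a CRLF-aware equality check (assumed implementation).
fn content_eq_normalized(a: &str, b: &str) -> bool {
    // Cheap exact comparison first, normalized comparison as fallback.
    a == b || a.replace("\r\n", "\n") == b.replace("\r\n", "\n")
}

fn classify(previous: &str, current: &str) -> ContentChange {
    if previous == current {
        ContentChange::Identical
    } else if content_eq_normalized(previous, current) {
        ContentChange::LineEndingsOnly
    } else {
        ContentChange::Changed
    }
}
```

Separating LineEndingsOnly from Identical is what allows the stored blob to be refreshed instead of skipped, so a later byte-level comparison does not see every line as different.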
Addressed Devin review feedback (stale CRLF blob re-attribution)

Fixed in f872ecd. When content differs only in line endings (CRLF ↔ LF), the checkpoint now updates the stored blob to LF instead of skipping entirely. Attributions are preserved via a line-attribution roundtrip (char attrs → line attrs using old content → char attrs using new content), correctly remapping byte offsets from CRLF to LF space.

Added regression test
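A rough sketch of such a roundtrip, under simplifying assumptions: ByteSpan, LineSpan, and all three functions are hypothetical and do not reflect the project's attribution types, and spans are snapped to whole lines on the way back.

```rust
/// Hypothetical byte-range attribution: bytes [start, end) belong to `author`.
#[derive(Debug, Clone, PartialEq)]
struct ByteSpan { start: usize, end: usize, author: String }

/// Hypothetical line-range attribution: 0-based lines first..=last.
#[derive(Debug, Clone, PartialEq)]
struct LineSpan { first: usize, last: usize, author: String }

/// Byte offset at which each line starts, splitting on '\n' only.
fn line_starts(content: &str) -> Vec<usize> {
    let mut starts = vec![0];
    for (i, b) in content.bytes().enumerate() {
        if b == b'\n' && i + 1 < content.len() {
            starts.push(i + 1);
        }
    }
    starts
}

/// Step 1: byte spans -> line spans, measured against the OLD content.
fn to_line_spans(old: &str, spans: &[ByteSpan]) -> Vec<LineSpan> {
    let starts = line_starts(old);
    // Index of the last line whose start offset is <= the given offset.
    let line_of = |offset: usize| starts.partition_point(|&s| s <= offset) - 1;
    spans
        .iter()
        .filter(|s| s.end > s.start) // skip empty spans
        .map(|s| LineSpan {
            first: line_of(s.start),
            last: line_of(s.end - 1),
            author: s.author.clone(),
        })
        .collect()
}

/// Step 2: line spans -> byte spans, measured against the NEW content.
fn to_byte_spans(new: &str, spans: &[LineSpan]) -> Vec<ByteSpan> {
    let starts = line_starts(new);
    spans
        .iter()
        .filter(|s| s.first <= s.last && s.last < starts.len())
        .map(|s| ByteSpan {
            start: starts[s.first],
            // End at the start of the following line, or at end of file.
            end: starts.get(s.last + 1).copied().unwrap_or(new.len()),
            author: s.author.clone(),
        })
        .collect()
}
```

Because CRLF to LF conversion keeps the number of lines fixed, a line index computed against the old content refers to the same logical line in the new content, which is why the byte offsets can be recomputed safely.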
Summary
- When previous_content (from git blob) and current_content (from working tree) have different line endings, every line appeared as changed (common on Windows with core.autocrlf or when AI tools write LF into a CRLF repo)
- This inflated AI vs Human stats, e.g. +105 -100 instead of +5 -0
- Root cause: compute_line_changes() passed raw content to imara-diff, which tokenizes by \n, so lines like "hello\r\n" produce token "hello\r" ≠ "hello" from "hello\n"

Changes
- imara_diff_utils.rs: Add normalize_line_endings() (Cow-based, zero-copy when no \r present). Normalize both inputs in compute_line_changes() before diffing while still returning references to original strings
- checkpoint.rs: Add content_eq_normalized() and update 3 equality checks to skip files where only line endings differ, which prevents unnecessary checkpoint entries

Test coverage
- compute_line_changes with CRLF/LF scenarios
- compute_file_line_stats with CRLF/LF scenarios

Test plan
🤖 Generated with Claude Code