feat: byte-correct (non-UTF-8) repository paths — RepoPath newtype + DiffEntry migration#887
Open
schacon wants to merge 3 commits into
Open
feat: byte-correct (non-UTF-8) repository paths — RepoPath newtype + DiffEntry migration#887schacon wants to merge 3 commits into
schacon wants to merge 3 commits into
Conversation
Flip DiffEntry.old_path/new_path from Option<String> to Option<RepoPathBuf> so non-UTF-8 Git paths survive verbatim instead of being mangled by from_utf8_lossy (U+FFFD). Tree-to-tree diffs are now fully byte-true, including non-UTF-8 directory prefixes carried through recursion; many worktree FS reads now resolve via RepoPath::to_fs_path. Index/worktree path sources and pathspec/quote-path matching remain lossy for now (marked // TODO(byte-paths)) — deferred to Phase 2 (FS I/O) and Phase 3 (pathspec) per docs/non-utf8-paths-design.md. RepoPath gains a lossy Display (mirrors bstr::BStr) and byte-exact PartialEq<str/String>. Adds grit-lib/tests/non_utf8_paths.rs covering add/modify/nested-dir round-trips. Workspace builds clean; diff/status/log/stash t-tests pass (pre-existing --graph and ignore-glob failures unrelated). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
grit-libforced OS paths into UTF-8 (String/&str), decoding Git tree/indexbyte paths with
String::from_utf8_lossy. Git pathnames are arbitrary bytestrings on Unix (only
NULis forbidden,/separates components), so anynon-UTF-8 name was silently corrupted: invalid bytes became
U+FFFD, breakingdiff/status/checkout for those files and changing the resulting tree OID on
re-commit.
This was flagged by Caleb Owens and Byron in Discord. Byron's key constraint:
the fix is not to swap
String→ bareBString(thegixpitfall, where abyte blob implicitly is a path depending on context) — it needs a dedicated
path type with an explicit filesystem boundary.
Design write-up:
docs/non-utf8-paths-design.md(included in this PR).What this PR delivers (Phases 0–1)
Phase 0 —
RepoPath/RepoPathBufnewtype (grit-lib/src/repo_path.rs)PathBuf/Path), backed bybstr— not a bareBString. Conversions to/fromstr/filesystem areexplicit (no
Deref<str>), so a path can't implicitly masquerade as any byteblob.
RepoPath::to_fs_path/RepoPathBuf::from_fs_relative(Unix maps bytes ↔
OsStrlosslessly; other platforms fall back to lossyUTF-8, as they can't store arbitrary bytes in filenames anyway). Generalizes
the
OsStrExt::as_bytespattern already used incrlf.rs.PartialEq<str/&str/String>(comparing to UTF-8 is lossless), plusa lossy
Display(see note below).Phase 1 — flag-day
DiffEntrymigrationDiffEntry.old_path/new_path:Option<String>→Option<RepoPathBuf>,with
path() -> &RepoPath,path_str() -> Cow<str>(lossy), anddisplay_path()unchanged.prefixes threaded through the recursion (the case that required reworking
format_pathto operate on bytes).to_fs_path(a free down payment onPhase 2).
grit-lib,grit-cli, andgrit-gitupdated.Explicitly deferred (marked
// TODO(byte-paths)in-tree)Consistent with the phased design doc, these remain lossy for now and are
greppable:
Stringpaths(
porcelain/checkout.rswrite API).quote_path/-zquoting(currently fed lossy
&str).flatten_tree/head_path_states/status long tail (Stringmap keys), and index/worktree path sources.
Behavior is byte-identical for valid UTF-8 paths — this is a type migration,
not a logic change.
Testing
grit-lib/tests/non_utf8_paths.rs: add / modify / nested-non-UTF-8-dirround-trips assert path bytes survive verbatim and
to_str()isNone.unwrap/expectintroduced — both are deny-level).
t3903 (stash) pass fully. The only failures are pre-existing and unrelated:
t4202
--graph(diff-tree --graphis an unimplemented CLI option) and twoignore.rsglob unit tests (untouched file).Note for reviewers — the
DisplaydecisionThe design doc proposed omitting
Display(mirroringstd::path::Path, whichforces explicit
.display()). During implementation I reversed this: a lossyDisplayonRepoPath/RepoPathBuf(matchingbstr::BStr) removed ~120 sitesof churn. It does not weaken the newtype's identity — there is still no implicit
coercion to a path, the bytes stay preserved in the model, and round-trip-safe
output still goes through
quote_path. Trivial to revert to explicit.display()if preferred. One known wrinkle: the bstr-backedDisplaydoes nothonor
{:<width$}fill/alignment, so--statcolumn code storesCow<str>instead (handled).
🤖 Generated with Claude Code