Skip to content

feat: byte-correct (non-UTF-8) repository paths — RepoPath newtype + DiffEntry migration#887

Open
schacon wants to merge 3 commits into
mainfrom
non-utf8-paths-design
Open

feat: byte-correct (non-UTF-8) repository paths — RepoPath newtype + DiffEntry migration#887
schacon wants to merge 3 commits into
mainfrom
non-utf8-paths-design

Conversation

@schacon

@schacon schacon commented Jun 22, 2026

Copy link
Copy Markdown
Member

Motivation

grit-lib forced OS paths into UTF-8 (String/&str), decoding Git tree/index
byte paths with String::from_utf8_lossy. Git pathnames are arbitrary byte
strings
on Unix (only NUL is forbidden, / separates components), so any
non-UTF-8 name was silently corrupted: invalid bytes became U+FFFD, breaking
diff/status/checkout for those files and changing the resulting tree OID on
re-commit.

This was flagged by Caleb Owens and Byron in Discord. Byron's key constraint:
the fix is not to swap String → bare BString (the gix pitfall, where a
byte blob implicitly is a path depending on context) — it needs a dedicated
path type with an explicit filesystem boundary.

Design write-up: docs/non-utf8-paths-design.md (included in this PR).

What this PR delivers (Phases 0–1)

Phase 0 — RepoPath / RepoPathBuf newtype (grit-lib/src/repo_path.rs)

  • An owned/borrowed byte-path pair (mirroring PathBuf/Path), backed by
    bstrnot a bare BString. Conversions to/from str/filesystem are
    explicit (no Deref<str>), so a path can't implicitly masquerade as any byte
    blob.
  • A single OS boundary: RepoPath::to_fs_path / RepoPathBuf::from_fs_relative
    (Unix maps bytes ↔ OsStr losslessly; other platforms fall back to lossy
    UTF-8, as they can't store arbitrary bytes in filenames anyway). Generalizes
    the OsStrExt::as_bytes pattern already used in crlf.rs.
  • Byte-exact PartialEq<str/&str/String> (comparing to UTF-8 is lossless), plus
    a lossy Display (see note below).

Phase 1 — flag-day DiffEntry migration

  • DiffEntry.old_path / new_path: Option<String>Option<RepoPathBuf>,
    with path() -> &RepoPath, path_str() -> Cow<str> (lossy), and
    display_path() unchanged.
  • Tree-to-tree diffs are now fully byte-true, including non-UTF-8 directory
    prefixes threaded through the recursion (the case that required reworking
    format_path to operate on bytes).
  • Many working-tree FS reads now resolve via to_fs_path (a free down payment on
    Phase 2).
  • ~590 consumer call sites across grit-lib, grit-cli, and grit-git updated.

Explicitly deferred (marked // TODO(byte-paths) in-tree)

Consistent with the phased design doc, these remain lossy for now and are
greppable:

  • Phase 2 — working-tree I/O that still routes through String paths
    (porcelain/checkout.rs write API).
  • Phase 3 — byte-correct pathspec matching and quote_path/-z quoting
    (currently fed lossy &str).
  • Phase 4 — the flatten_tree/head_path_states/status long tail (String
    map keys), and index/worktree path sources.

Behavior is byte-identical for valid UTF-8 paths — this is a type migration,
not a logic change.

Testing

  • New grit-lib/tests/non_utf8_paths.rs: add / modify / nested-non-UTF-8-dir
    round-trips assert path bytes survive verbatim and to_str() is None.
  • Workspace builds clean; production clippy: 0 errors (no unwrap/expect
    introduced — both are deny-level).
  • t-suite green where it matters: t4000/t4001/t4002/t4013 (diff), t7060 (status),
    t3903 (stash) pass fully. The only failures are pre-existing and unrelated:
    t4202 --graph (diff-tree --graph is an unimplemented CLI option) and two
    ignore.rs glob unit tests (untouched file).

Note for reviewers — the Display decision

The design doc proposed omitting Display (mirroring std::path::Path, which
forces explicit .display()). During implementation I reversed this: a lossy
Display on RepoPath/RepoPathBuf (matching bstr::BStr) removed ~120 sites
of churn. It does not weaken the newtype's identity — there is still no implicit
coercion to a path, the bytes stay preserved in the model, and round-trip-safe
output still goes through quote_path. Trivial to revert to explicit
.display() if preferred. One known wrinkle: the bstr-backed Display does not
honor {:<width$} fill/alignment, so --stat column code stores Cow<str>
instead (handled).

🤖 Generated with Claude Code

schacon and others added 3 commits June 22, 2026 12:10
Flip DiffEntry.old_path/new_path from Option<String> to Option<RepoPathBuf>
so non-UTF-8 Git paths survive verbatim instead of being mangled by
from_utf8_lossy (U+FFFD). Tree-to-tree diffs are now fully byte-true,
including non-UTF-8 directory prefixes carried through recursion; many
worktree FS reads now resolve via RepoPath::to_fs_path.

Index/worktree path sources and pathspec/quote-path matching remain lossy
for now (marked // TODO(byte-paths)) — deferred to Phase 2 (FS I/O) and
Phase 3 (pathspec) per docs/non-utf8-paths-design.md. RepoPath gains a
lossy Display (mirrors bstr::BStr) and byte-exact PartialEq<str/String>.

Adds grit-lib/tests/non_utf8_paths.rs covering add/modify/nested-dir
round-trips. Workspace builds clean; diff/status/log/stash t-tests pass
(pre-existing --graph and ignore-glob failures unrelated).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant