Case-insensitive filesystem resolver for Linux. The target application (s&box game engine) runs on a case-sensitive Linux filesystem but was developed on Windows/macOS, so it frequently requests files with incorrect casing. This project benchmarks resolution algorithms and will produce a C-level LD_PRELOAD shim that intercepts file IO syscalls and transparently resolves wrong-cased paths to their real on-disk equivalents.
On Linux, open("/game/addons/base/Code/foo.cs") fails if the real path is /game/addons/base/code/foo.cs. The shim intercepts newfstatat, openat, and inotify_add_watch, resolves the path case-insensitively, and forwards the corrected path to the kernel.
10 strace logs captured from the game binary, cleaned and rebased to the local CWD:
- 644,575 total path lookups
- 65,215 unique paths
- 1,223 unique paths that exist on disk
- 64,110 unique paths that don't exist (63,547 confirmed unreachable after segment walk)
- 47 unique wrong→right case mismatch pairs (e.g. assets/ → Assets/, code/ → Code/)
See dataset/analysis/metadata.log for full stats and dataset/analysis/matches/case_insensitive_matches_deduplicated.log for all 47 mismatch pairs.
# Run from project root
python3 dataset/pipeline.py # clean raw strace logs → dataset/clean/
python3 dataset/find_missing.py # existence checks + case mismatch analysis
python3 dataset/extract_syscalls.py # validate which syscalls make real on-disk calls

Only 7 of 20 syscalls observed in the strace logs make real on-disk calls. The three that matter:
| Syscall | Total Calls | CWD Calls |
|---|---|---|
| newfstatat | 320,789 | 318,378 (99.2%) |
| openat | 237,435 | 190,908 (80.4%) |
| inotify_add_watch | 133,968 | 133,968 (100%) |
Given a wrong-cased path like {cwd}/addons/base/Code:
- Strip CWD → segments ["addons", "base", "Code"]
- For each segment:
  - iterdir() the current directory, build {lowercase: real} map
  - Look up segment.lower(), which finds the real on-disk name regardless of casing
  - Advance to current / real_name
The directory cache ({dir_path: {lowercase_child: real_child}}) ensures each directory is scanned at most once. A cold walk requires at least one fresh iterdir(); a cache walk is served entirely from prior scans.
Three algorithms implemented in both Python (benchmark/python/) and C (benchmark/c/), each with and without a negative cache for confirmed-missing paths.
| Algorithm | Description |
|---|---|
| Flat Hash | {lowercase_input → resolved_path} — O(1) for known paths, segment walk on miss |
| Trie | Nested dict keyed by lowercase segments — shares prefix resolution across paths |
| Segment Map | {dir_path → {lowercase_child → real_child}} — per-directory flat maps, direct exists() fast path |
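As an illustration of the Flat Hash row and of the negative cache, here is a hedged Python sketch. The class name `FlatHashResolver` and the `walk` callback are hypothetical; the callback stands in for whatever segment walk runs on a miss. The sketch does not invalidate entries, so a file created after a miss was recorded would stay hidden. How the real implementations handle that is not specified here.

```python
from typing import Callable, Dict, Optional, Set


class FlatHashResolver:
    """{lowercase_input: resolved_path} with an optional negative cache (sketch)."""

    def __init__(self, walk: Callable[[str], Optional[str]], neg_cache: bool = True):
        self._walk = walk                    # slow path: the segment walk
        self._hits: Dict[str, str] = {}      # lowercase input -> real path
        self._misses: Optional[Set[str]] = set() if neg_cache else None

    def resolve(self, path: str) -> Optional[str]:
        key = path.lower()
        hit = self._hits.get(key)
        if hit is not None:
            return hit                       # O(1) for known paths
        if self._misses is not None and key in self._misses:
            return None                      # confirmed missing: skip the walk
        resolved = self._walk(path)
        if resolved is not None:
            self._hits[key] = resolved
        elif self._misses is not None:
            self._misses.add(key)            # remember the confirmed miss
        return resolved
```

Because most unique paths in the dataset do not exist on disk, the negative branch is the one exercised most often, which is consistent with the throughput gap between the "Yes" and "No" rows.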
| Algorithm | Neg Cache | Throughput | True Misses (unique) |
|---|---|---|---|
| Segment Map | No | 36,728/s | 63,547 |
| Segment Map | Yes | 72,795/s | 63,547 |
| Flat Hash | No | 15,242/s | 63,547 |
| Flat Hash | Yes | 58,664/s | 63,547 |
| Trie | No | 14,106/s | 63,549 |
| Trie | Yes | 26,297/s | 63,549 |
Segment Map wins in Python because repeated stat() calls on a warm kernel dentry cache are faster than Python dict lookups on long path strings. This dynamic is expected to invert in C.
# Python
python3 benchmark/python/bench_flat_hash.py
python3 benchmark/python/bench_trie.py
python3 benchmark/python/bench_segment_map.py
# C
cd benchmark/c && make
./bench_flat_hash
./bench_trie
./bench_segment_map
# Flags: --neg-cache / --no-neg-cache to run a single variant

Results are written to benchmark/python/results/ and benchmark/c/results/.
dataset/
  dirty/                raw strace logs (gitignored, large)
  clean/                cleaned + rebased logs (gitignored, large)
  analysis/
    metadata.log
    findings/           found/not_found deduplicated lists
    matches/            case-insensitive mismatch pairs
    syscalls/           syscall breakdown with examples
  pipeline.py
  find_missing.py
  extract_syscalls.py
benchmark/
  python/               Python benchmark implementations
  c/                    C benchmark implementations (Makefile included)
bin/                    C probe library (LD_PRELOAD prototype)
CLAUDE.md               Full project context for AI-assisted development
- Run C benchmarks and compare throughput against Python results
- Determine winning algorithm (flat hash expected to win in C)
- Port winning algorithm to production C shim intercepting newfstatat, openat, inotify_add_watch
- CWD anchored via /proc/self/exe at shim init; paths not prefixed with CWD pass through unmodified
- Thread safety via pthread_rwlock_t (reads vastly outnumber writes once cache is warm)
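The pass-through rule in the list above depends on a prefix check that respects segment boundaries, so that /game2/... is not treated as being under /game. Below is a minimal Python sketch of that check, written in Python only to match the benchmark code. The function name `split_under_root` is hypothetical, deriving the root from /proc/self/exe is not shown, and `.` and `..` segments are assumed to have been normalized already.

```python
from typing import Optional, Tuple


def split_under_root(path: str, root: str) -> Optional[Tuple[str, ...]]:
    """Return the segments of path relative to root, or None to pass through."""
    root = root.rstrip("/") or "/"
    if path == root:
        return ()
    # Compare against "root/" so that /game2/x does not match root /game.
    prefix = root if root == "/" else root + "/"
    if not path.startswith(prefix):
        return None
    # Drop empty segments produced by doubled or trailing slashes.
    return tuple(s for s in path[len(prefix):].split("/") if s)
```

A `None` result means the original path would be forwarded untouched; a tuple of segments would go to the resolver.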