io_algo

Case-insensitive filesystem resolver for Linux. The target application (s&box game engine) runs on a case-sensitive Linux filesystem but was developed on Windows/macOS, so it frequently requests files with incorrect casing. This project benchmarks resolution algorithms and will produce a C-level LD_PRELOAD shim that intercepts file IO syscalls and transparently resolves wrong-cased paths to their real on-disk equivalents.

Problem

On Linux, open("/game/addons/base/Code/foo.cs") fails if the real path is /game/addons/base/code/foo.cs. The shim intercepts newfstatat, openat, and inotify_add_watch, resolves the path case-insensitively, and forwards the corrected path to the kernel.

Dataset

10 strace logs captured from the game binary, cleaned and rebased to the local CWD:

644,575 total path lookups
65,215 unique paths
1,223 unique paths that exist on disk
64,110 unique paths that don't exist (63,547 confirmed unreachable after segment walk)
47 unique wrong→right case mismatch pairs (e.g. assets/ → Assets/, code/ → Code/)

See dataset/analysis/metadata.log for full stats and dataset/analysis/matches/case_insensitive_matches_deduplicated.log for all 47 mismatch pairs.

Pipeline

# Run from project root
python3 dataset/pipeline.py          # clean raw strace logs → dataset/clean/
python3 dataset/find_missing.py      # existence checks + case mismatch analysis
python3 dataset/extract_syscalls.py  # validate which syscalls make real on-disk calls

Validated Syscalls

Only 7 of 20 syscalls observed in the strace logs make real on-disk calls. The three that matter:

Syscall	Total Calls	CWD Calls
`newfstatat`	320,789	318,378 (99.2%)
`openat`	237,435	190,908 (80.4%)
`inotify_add_watch`	133,968	133,968 (100%)

Resolution Algorithm — Segment Walk

Given a wrong-cased path like {cwd}/addons/base/Code:

Strip CWD → segments ["addons", "base", "Code"]
For each segment: iterdir() the current directory, build {lowercase: real} map
Look up segment.lower() — finds the real on-disk name regardless of casing
Advance to current / real_name

The directory cache ({dir_path: {lowercase_child: real_child}}) ensures each directory is scanned at most once. A cold walk requires at least one fresh iterdir(); a cache walk is served entirely from prior scans.

Benchmarks

Three algorithms implemented in both Python (benchmark/python/) and C (benchmark/c/), each with and without a negative cache for confirmed-missing paths.

Algorithm	Description
Flat Hash	`{lowercase_input → resolved_path}` — O(1) for known paths, segment walk on miss
Trie	Nested dict keyed by lowercase segments — shares prefix resolution across paths
Segment Map	`{dir_path → {lowercase_child → real_child}}` — per-directory flat maps, direct `exists()` fast path

Python results (latest run)

Algorithm	Neg Cache	Throughput	True Misses (unique)
Segment Map	No	36,728/s	63,547
Segment Map	Yes	72,795/s	63,547
Flat Hash	No	15,242/s	63,547
Flat Hash	Yes	58,664/s	63,547
Trie	No	14,106/s	63,549
Trie	Yes	26,297/s	63,549

Segment Map wins in Python because repeated stat() calls on a warm kernel dentry cache are faster than Python dict lookups on long path strings. This dynamic is expected to invert in C.

Running benchmarks

# Python
python3 benchmark/python/bench_flat_hash.py
python3 benchmark/python/bench_trie.py
python3 benchmark/python/bench_segment_map.py

# C
cd benchmark/c && make
./bench_flat_hash
./bench_trie
./bench_segment_map

# Flags: --neg-cache / --no-neg-cache to run a single variant

Results are written to benchmark/python/results/ and benchmark/c/results/.

Structure

dataset/
  dirty/           raw strace logs (gitignored — large)
  clean/           cleaned + rebased logs (gitignored — large)
  analysis/
    metadata.log
    findings/      found/not_found deduplicated lists
    matches/       case-insensitive mismatch pairs
    syscalls/      syscall breakdown with examples
  pipeline.py
  find_missing.py
  extract_syscalls.py

benchmark/
  python/          Python benchmark implementations
  c/               C benchmark implementations (Makefile included)

bin/               C probe library (LD_PRELOAD prototype)
CLAUDE.md          Full project context for AI-assisted development

Next Steps

Run C benchmarks and compare throughput against Python results
Determine winning algorithm (flat hash expected to win in C)
Port winning algorithm to production C shim intercepting newfstatat, openat, inotify_add_watch
CWD anchored via /proc/self/exe at shim init; paths not prefixed with CWD pass through unmodified
Thread safety via pthread_rwlock_t (reads vastly outnumber writes once cache is warm)

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
benchmark		benchmark
bin		bin
dataset		dataset
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

io_algo

Problem

Dataset

Pipeline

Validated Syscalls

Resolution Algorithm — Segment Walk

Benchmarks

Python results (latest run)

Running benchmarks

Structure

Next Steps

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

io_algo

Problem

Dataset

Pipeline

Validated Syscalls

Resolution Algorithm — Segment Walk

Benchmarks

Python results (latest run)

Running benchmarks

Structure

Next Steps

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages