Skip to content

joshuascript/io_algo

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

io_algo

Case-insensitive filesystem resolver for Linux. The target application (s&box game engine) runs on a case-sensitive Linux filesystem but was developed on Windows/macOS, so it frequently requests files with incorrect casing. This project benchmarks resolution algorithms and will produce a C-level LD_PRELOAD shim that intercepts file IO syscalls and transparently resolves wrong-cased paths to their real on-disk equivalents.

Problem

On Linux, open("/game/addons/base/Code/foo.cs") fails if the real path is /game/addons/base/code/foo.cs. The shim intercepts newfstatat, openat, and inotify_add_watch, resolves the path case-insensitively, and forwards the corrected path to the kernel.

Dataset

10 strace logs captured from the game binary, cleaned and rebased to the local CWD:

  • 644,575 total path lookups
  • 65,215 unique paths
  • 1,223 unique paths that exist on disk
  • 64,110 unique paths that don't exist (63,547 confirmed unreachable after segment walk)
  • 47 unique wrong→right case mismatch pairs (e.g. assets/Assets/, code/Code/)

See dataset/analysis/metadata.log for full stats and dataset/analysis/matches/case_insensitive_matches_deduplicated.log for all 47 mismatch pairs.

Pipeline

# Run from project root
python3 dataset/pipeline.py          # clean raw strace logs → dataset/clean/
python3 dataset/find_missing.py      # existence checks + case mismatch analysis
python3 dataset/extract_syscalls.py  # validate which syscalls make real on-disk calls

Validated Syscalls

Only 7 of 20 syscalls observed in the strace logs make real on-disk calls. The three that matter:

Syscall Total Calls CWD Calls
newfstatat 320,789 318,378 (99.2%)
openat 237,435 190,908 (80.4%)
inotify_add_watch 133,968 133,968 (100%)

Resolution Algorithm — Segment Walk

Given a wrong-cased path like {cwd}/addons/base/Code:

  1. Strip CWD → segments ["addons", "base", "Code"]
  2. For each segment: iterdir() the current directory, build {lowercase: real} map
  3. Look up segment.lower() — finds the real on-disk name regardless of casing
  4. Advance to current / real_name

The directory cache ({dir_path: {lowercase_child: real_child}}) ensures each directory is scanned at most once. A cold walk requires at least one fresh iterdir(); a cache walk is served entirely from prior scans.

Benchmarks

Three algorithms implemented in both Python (benchmark/python/) and C (benchmark/c/), each with and without a negative cache for confirmed-missing paths.

Algorithm Description
Flat Hash {lowercase_input → resolved_path} — O(1) for known paths, segment walk on miss
Trie Nested dict keyed by lowercase segments — shares prefix resolution across paths
Segment Map {dir_path → {lowercase_child → real_child}} — per-directory flat maps, direct exists() fast path

Python results (latest run)

Algorithm Neg Cache Throughput True Misses (unique)
Segment Map No 36,728/s 63,547
Segment Map Yes 72,795/s 63,547
Flat Hash No 15,242/s 63,547
Flat Hash Yes 58,664/s 63,547
Trie No 14,106/s 63,549
Trie Yes 26,297/s 63,549

Segment Map wins in Python because repeated stat() calls on a warm kernel dentry cache are faster than Python dict lookups on long path strings. This dynamic is expected to invert in C.

Running benchmarks

# Python
python3 benchmark/python/bench_flat_hash.py
python3 benchmark/python/bench_trie.py
python3 benchmark/python/bench_segment_map.py

# C
cd benchmark/c && make
./bench_flat_hash
./bench_trie
./bench_segment_map

# Flags: --neg-cache / --no-neg-cache to run a single variant

Results are written to benchmark/python/results/ and benchmark/c/results/.

Structure

dataset/
  dirty/           raw strace logs (gitignored — large)
  clean/           cleaned + rebased logs (gitignored — large)
  analysis/
    metadata.log
    findings/      found/not_found deduplicated lists
    matches/       case-insensitive mismatch pairs
    syscalls/      syscall breakdown with examples
  pipeline.py
  find_missing.py
  extract_syscalls.py

benchmark/
  python/          Python benchmark implementations
  c/               C benchmark implementations (Makefile included)

bin/               C probe library (LD_PRELOAD prototype)
CLAUDE.md          Full project context for AI-assisted development

Next Steps

  • Run C benchmarks and compare throughput against Python results
  • Determine winning algorithm (flat hash expected to win in C)
  • Port winning algorithm to production C shim intercepting newfstatat, openat, inotify_add_watch
  • CWD anchored via /proc/self/exe at shim init; paths not prefixed with CWD pass through unmodified
  • Thread safety via pthread_rwlock_t (reads vastly outnumber writes once cache is warm)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors