Bug: Multi-process / MPI runs produce 100% deduplicatable data across workers
This is related to issue #272
Summary
When kv_cache_benchmark is run with multiple processes (MPI or otherwise) targeting shared storage, every worker generates byte-for-byte identical data for the same logical request IDs. On a dedup-capable storage system this is invisible, but it means the benchmark measures write throughput of deduplicated data — not unique data — rendering multi-host storage stress results meaningless.
Affected Files
- kv_cache_benchmark/kv_cache/cache.py — KVCacheGenerator, _seed_from_key()
- kv_cache_benchmark/kv_cache/benchmark.py — cache_key construction
Root Cause
Key generation is deterministic with no per-process identity
Cache keys are constructed in benchmark.py using only the request counter and a modulo of num_users:
```python
# benchmark.py line ~382
user_id = f"dataset_user_{req_id % self.num_users}"
cache_key = f"{user_id}_req_{req_id:06d}"
```
These strings are identical on every worker, because req_id is reset to 0 in each process independently.
Data generation depends only on the key string and a fixed global seed
In cache.py, KVCacheGenerator uses a fixed global_seed (default 0) and derives a per-entry seed via SHA-256 of the key string:
```python
# cache.py line ~39
rng = np.random.default_rng(self.global_seed)  # same on every worker
self.precomputed_buffer = rng.uniform(...)     # same 256 MB buffer on every worker
```
```python
# cache.py line ~49
return (key_hash64 ^ self.global_seed) & 0xFFFF_FFFF_FFFF_FFFF  # same XOR stamp for same key
```
Because both the key strings and the global seed are identical across workers, every worker produces bitwise-identical 4 KB blocks for every cache entry.
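The collision can be seen end to end with a minimal sketch. `seed_from_key` and `worker_block` below are simplified stand-ins for the cache.py logic (the real code fills blocks from the precomputed buffer rather than generating them directly); the key format matches benchmark.py:

```python
import hashlib
import numpy as np

def seed_from_key(key: str, global_seed: int) -> int:
    # Simplified stand-in for cache.py's _seed_from_key: truncated SHA-256
    # of the key string XOR'd with the global seed.
    key_hash64 = int(hashlib.sha256(key.encode()).hexdigest()[:16], 16)
    return (key_hash64 ^ global_seed) & 0xFFFF_FFFF_FFFF_FFFF

def worker_block(global_seed: int) -> bytes:
    # Every worker resets req_id to 0, so every worker builds the same key.
    key = f"dataset_user_{0 % 100}_req_{0:06d}"
    rng = np.random.default_rng(seed_from_key(key, global_seed))
    return rng.bytes(4096)  # one 4 KB block

# Two "workers" with the default global_seed=0 emit identical blocks:
print(worker_block(0) == worker_block(0))  # True -> 100% dedupable
# A per-worker seed breaks the collision:
print(worker_block(0) == worker_block(1))  # False
```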
Impact
| Workers (N) | Dedup ratio | Unique data written |
|---|---|---|
| 1 | 0% | 100% |
| 2 | 50% | 50% |
| 8 | 87.5% | 12.5% |
| 16 | 93.75% | 6.25% |
| 64 | 98.4% | 1.6% |
A storage system with inline deduplication (e.g. many all-flash arrays and object stores) will absorb N× the logical write I/O while storing only 1× the data, appearing N× faster than it actually is for unique workloads. This makes the benchmark unreliable as a measure of raw write capacity in any multi-host scenario.
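The table follows directly from the fact that only one of N identical streams survives dedup; a quick sanity check of the numbers:

```python
# With N workers writing identical byte streams, a dedup-capable target keeps
# one copy and discards the other N - 1, so the dedup ratio is (N - 1) / N
# and the unique fraction is 1 / N.
def dedup_ratio(n_workers: int) -> float:
    return (n_workers - 1) / n_workers

for n in (1, 2, 8, 16, 64):
    print(f"{n:>2} workers: {dedup_ratio(n):6.1%} dedup, {1 / n:6.1%} unique")
```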
Steps to Reproduce
- Run the benchmark on 2+ hosts targeting the same shared storage mount or object store endpoint.
- Compare effective storage capacity consumed vs. logical bytes written — consumption will not scale with host count.
- Alternatively, inspect the raw data: any two workers' output files for the same time window will be byte-for-byte identical.
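Step 3 can be checked with a few lines; the file contents below are stand-ins for two workers' real output files, whose paths depend on the benchmark configuration:

```python
import hashlib
import tempfile
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents; equal digests mean fully dedupable output."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Stand-in "worker output" files holding identical bytes, as the bug produces.
with tempfile.TemporaryDirectory() as d:
    w0, w1 = Path(d, "worker0.bin"), Path(d, "worker1.bin")
    w0.write_bytes(b"\xab" * 4096)
    w1.write_bytes(b"\xab" * 4096)
    identical = file_digest(w0) == file_digest(w1)

print(identical)  # True: byte-for-byte duplicate data across workers
```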
Expected Behavior
Each worker (MPI rank, process, or host) should produce unique data so that N workers write N× unique bytes to storage, properly stressing storage capacity and ingestion throughput.
Proposed Fix
Embed a per-worker identity into either the global_seed or the cache key string. Two options:
Option A — Unique seed per worker (minimal change)
Pass the MPI rank (or os.getpid() / hostname hash as fallback) as global_seed when constructing KVCacheGenerator:
```python
import os, socket, hashlib

def _worker_seed() -> int:
    """Return a seed unique to this process on this host."""
    try:
        from mpi4py import MPI
        return MPI.COMM_WORLD.Get_rank()
    except ImportError:
        # Fallback: hash of hostname + PID
        ident = f"{socket.gethostname()}:{os.getpid()}"
        return int(hashlib.sha256(ident.encode()).hexdigest()[:16], 16)

# When constructing KVCacheGenerator:
generator = KVCacheGenerator(model_config, global_seed=_worker_seed())
```
This changes the 256 MB precomputed buffer and the XOR stamp for every worker, making all 4 KB blocks unique across workers while keeping them reproducible within a single worker run.
Option B — Unique key prefix per worker (more explicit)
Prefix every cache key with the worker identity:
```python
worker_prefix = f"rank{mpi_rank}_host{hostname_hash}"
cache_key = f"{worker_prefix}_{user_id}_req_{req_id:06d}"
```
This keeps the same precomputed buffer but changes the XOR stamp per worker, which is sufficient to eliminate cross-worker dedup.
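A short check (key format from the snippet above, with illustrative rank/host values) confirms that the prefix alone changes the derived 64-bit hash, and therefore the XOR stamp:

```python
import hashlib

def key_hash64(key: str) -> int:
    # Same truncated-SHA-256 derivation the issue describes for _seed_from_key.
    return int(hashlib.sha256(key.encode()).hexdigest()[:16], 16)

base = "dataset_user_0_req_000000"
h0 = key_hash64(f"rank0_hostab12_{base}")  # prefix values are illustrative
h1 = key_hash64(f"rank1_hostab12_{base}")
print(h0 != h1)  # True: distinct per-worker stamps, no cross-worker dedup
```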
Recommendation
Option A (unique global_seed) is preferred because:
- It also diversifies the 256 MB precomputed noise buffer, giving true statistical independence between workers.
- It requires changes in only one place (benchmark initialization).
- It is transparent to all downstream key-derivation and stamping logic.
The --seed CLI argument (if added) should document that it sets the per-worker base seed, and that MPI rank is XOR'd in automatically so users can still get reproducible multi-worker runs by fixing --seed.
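A sketch of those proposed --seed semantics (the effective_seed helper is hypothetical, not existing code):

```python
def effective_seed(base_seed: int, rank: int) -> int:
    # User-supplied base seed XOR'd with the MPI rank: unique per worker,
    # reproducible across runs for a fixed base seed.
    return (base_seed ^ rank) & 0xFFFF_FFFF_FFFF_FFFF

print(sorted(effective_seed(42, r) for r in range(4)))  # -> [40, 41, 42, 43]
print(effective_seed(42, 3) == effective_seed(42, 3))   # reproducible -> True
```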
Notes
- Single-process runs are not affected — the current design is correct, and the anti-dedup properties verified at 64 GB scale (no intra-entry or cross-entry block collisions) still hold for single-process use.
- The fix does not change the on-disk format, the benchmark output schema, or any config file fields.
- This issue is distinct from the previously fixed 96.7% intra-entry dedup bug (commit 0aa9aee) — that was a single-process issue; this is a multi-process issue.