Bug: Multi-process / MPI runs produce 100% deduplicatable data across workers
This is related to issue #272
Summary
When kv_cache_benchmark is run with multiple processes (MPI or otherwise) targeting shared storage, every worker generates byte-for-byte identical data for the same logical request IDs. On a dedup-capable storage system this is invisible, but it means the benchmark measures write throughput of deduplicated data — not unique data — rendering multi-host storage stress results meaningless.
Affected Files
- kv_cache_benchmark/kv_cache/cache.py — KVCacheGenerator, _seed_from_key()
- kv_cache_benchmark/kv_cache/benchmark.py — cache_key construction
Root Cause
Key generation is deterministic with no per-process identity
Cache keys are constructed in benchmark.py using only the request counter and a modulo of num_users:
```python
# benchmark.py line ~382
user_id = f"dataset_user_{req_id % self.num_users}"
cache_key = f"{user_id}_req_{req_id:06d}"
```
These strings are identical on every worker, because req_id is reset to 0 in each process independently.
Data generation depends only on the key string and a fixed global seed
In cache.py, KVCacheGenerator uses a fixed global_seed (default 0) and derives a per-entry seed via SHA-256 of the key string:
```python
# cache.py line ~39
rng = np.random.default_rng(self.global_seed)  # same on every worker
self.precomputed_buffer = rng.uniform(...)     # same 256 MB buffer on every worker
```
```python
# cache.py line ~49
return (key_hash64 ^ self.global_seed) & 0xFFFF_FFFF_FFFF_FFFF  # same XOR stamp for same key
```
Because both the key strings and the global seed are identical across workers, every worker produces bitwise-identical 4 KB blocks for every cache entry.
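The collision can be seen end to end with a minimal sketch. `seed_from_key` and `worker_block` below are simplified stand-ins for the cache.py logic (the real code fills blocks from the precomputed buffer rather than generating them directly); the key format matches benchmark.py:

```python
import hashlib
import numpy as np

def seed_from_key(key: str, global_seed: int) -> int:
    # Simplified stand-in for cache.py's _seed_from_key: truncated SHA-256
    # of the key string XOR'd with the global seed.
    key_hash64 = int(hashlib.sha256(key.encode()).hexdigest()[:16], 16)
    return (key_hash64 ^ global_seed) & 0xFFFF_FFFF_FFFF_FFFF

def worker_block(global_seed: int) -> bytes:
    # Every worker resets req_id to 0, so every worker builds the same key.
    key = f"dataset_user_{0 % 100}_req_{0:06d}"
    rng = np.random.default_rng(seed_from_key(key, global_seed))
    return rng.bytes(4096)  # one 4 KB block

# Two "workers" with the default global_seed=0 emit identical blocks:
print(worker_block(0) == worker_block(0))  # True -> 100% dedupable
# A per-worker seed breaks the collision:
print(worker_block(0) == worker_block(1))  # False
```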
Impact
| Workers (N) | Dedup ratio | Unique data written |
|---|---|---|
| 1 | 0% | 100% |
| 2 | 50% | 50% |
| 8 | 87.5% | 12.5% |
| 16 | 93.75% | 6.25% |
| 64 | 98.4% | 1.6% |
A storage system with inline deduplication (e.g. many all-flash arrays and object stores) will absorb N× the logical write I/O while storing only 1× the data, appearing N× faster than it actually is for unique workloads. This makes the benchmark unreliable as a measure of raw write capacity in any multi-host scenario.
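The table follows directly from the fact that only one of N identical streams survives dedup; a quick sanity check of the numbers:

```python
# With N workers writing identical byte streams, a dedup-capable target keeps
# one copy and discards the other N - 1, so the dedup ratio is (N - 1) / N
# and the unique fraction is 1 / N.
def dedup_ratio(n_workers: int) -> float:
    return (n_workers - 1) / n_workers

for n in (1, 2, 8, 16, 64):
    print(f"{n:>2} workers: {dedup_ratio(n):6.1%} dedup, {1 / n:6.1%} unique")
```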
Steps to Reproduce
- Run the benchmark on 2+ hosts targeting the same shared storage mount or object store endpoint.
- Compare effective storage capacity consumed vs. logical bytes written — consumption will not scale with host count.
- Alternatively, inspect the raw data: any two workers' output files for the same time window will be byte-for-byte identical.
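Step 3 can be checked with a few lines; the file contents below are stand-ins for two workers' real output files, whose paths depend on the benchmark configuration:

```python
import hashlib
import tempfile
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents; equal digests mean fully dedupable output."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Stand-in "worker output" files holding identical bytes, as the bug produces.
with tempfile.TemporaryDirectory() as d:
    w0, w1 = Path(d, "worker0.bin"), Path(d, "worker1.bin")
    w0.write_bytes(b"\xab" * 4096)
    w1.write_bytes(b"\xab" * 4096)
    identical = file_digest(w0) == file_digest(w1)

print(identical)  # True: byte-for-byte duplicate data across workers
```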
Expected Behavior
Each worker (MPI rank, process, or host) should produce unique data so that N workers write N× unique bytes to storage, properly stressing storage capacity and ingestion throughput.
Proposed Fix
Embed a per-worker identity into either the global_seed or the cache key string. Two options:
Option A — Unique seed per worker (minimal change)
Pass the MPI rank (or os.getpid() / hostname hash as fallback) as global_seed when constructing KVCacheGenerator:
```python
import os, socket, hashlib

def _worker_seed() -> int:
    """Return a seed unique to this process on this host."""
    try:
        from mpi4py import MPI
        return MPI.COMM_WORLD.Get_rank()
    except ImportError:
        # Fallback: hash of hostname + PID
        ident = f"{socket.gethostname()}:{os.getpid()}"
        return int(hashlib.sha256(ident.encode()).hexdigest()[:16], 16)

# When constructing KVCacheGenerator:
generator = KVCacheGenerator(model_config, global_seed=_worker_seed())
```
This changes the 256 MB precomputed buffer and the XOR stamp for every worker, making all 4 KB blocks unique across workers while keeping them reproducible within a single worker run.
Option B — Unique key prefix per worker (more explicit)
Prefix every cache key with the worker identity:
```python
worker_prefix = f"rank{mpi_rank}_host{hostname_hash}"
cache_key = f"{worker_prefix}_{user_id}_req_{req_id:06d}"
```
This keeps the same precomputed buffer but changes the XOR stamp per worker, which is sufficient to eliminate cross-worker dedup.
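A short check (key format from the snippet above, with illustrative rank/host values) confirms that the prefix alone changes the derived 64-bit hash, and therefore the XOR stamp:

```python
import hashlib

def key_hash64(key: str) -> int:
    # Same truncated-SHA-256 derivation the issue describes for _seed_from_key.
    return int(hashlib.sha256(key.encode()).hexdigest()[:16], 16)

base = "dataset_user_0_req_000000"
h0 = key_hash64(f"rank0_hostab12_{base}")  # prefix values are illustrative
h1 = key_hash64(f"rank1_hostab12_{base}")
print(h0 != h1)  # True: distinct per-worker stamps, no cross-worker dedup
```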
Recommendation
Option A (unique global_seed) is preferred because:
- It also diversifies the 256 MB precomputed noise buffer, giving true statistical independence between workers.
- It requires changes in only one place (benchmark initialization).
- It is transparent to all downstream key-derivation and stamping logic.
The --seed CLI argument (if added) should document that it sets the per-worker base seed, and that MPI rank is XOR'd in automatically so users can still get reproducible multi-worker runs by fixing --seed.
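A sketch of those proposed --seed semantics (the effective_seed helper is hypothetical, not existing code):

```python
def effective_seed(base_seed: int, rank: int) -> int:
    # User-supplied base seed XOR'd with the MPI rank: unique per worker,
    # reproducible across runs for a fixed base seed.
    return (base_seed ^ rank) & 0xFFFF_FFFF_FFFF_FFFF

print(sorted(effective_seed(42, r) for r in range(4)))  # -> [40, 41, 42, 43]
print(effective_seed(42, 3) == effective_seed(42, 3))   # reproducible -> True
```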
Notes
- Single-process runs are not affected — the current design is correct, and the anti-dedup properties verified at 64 GB scale (no intra-entry or cross-entry block collisions) still hold for single-process use.
- The fix does not change the on-disk format, the benchmark output schema, or any config file fields.
- This issue is distinct from the previously fixed 96.7% intra-entry dedup bug (commit 0aa9aee) — that was a single-process issue; this is a multi-process issue.