Symptom:
```
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.2      0       3
```
ERROR: median and mean are more than 2× the standard deviation apart
Causes & Fixes:
Problem: When backlog fills up, new connections wait.
Fix: Increase listen queue (already done)
```c
listen(server_fd, 2048);  // was 128
```
Problem: TCP delays small packets for efficiency (bad for latency).
Fix: Disable with TCP_NODELAY (already done)
```c
setsockopt(server_fd, IPPROTO_TCP, TCP_NODELAY, &opt, sizeof(opt));
```
Problem: Creating a new thread per request = variable latency.
Current: Thread-per-request model
```c
pthread_create(&thread, NULL, handle_http_client, client_fd);
pthread_detach(thread);
```
Better: Thread pool (recycle threads)
- Pre-create N threads
- Use work queue
- Eliminates thread creation overhead
Problem: Multiple threads waiting for the same lock.
Where:
- Config loader: `pthread_rwlock_rdlock(&kv_lock)`
- Metrics: `pthread_mutex_lock(&metrics_lock)`
Optimization:
- Use RCU (Read-Copy-Update) for config
- Lock-free data structures for metrics
Problem: Iterating over all workers on every request.
Current: O(N) scan over workers
```c
for (int i = 0; i < NUM_WORKERS; i++) {
    get_worker_metrics(i, &m);
    if (m.score < best_score) ...
}
```
Better: Keep workers sorted by score
- Update on metrics receipt
- Select = O(1) instead of O(N)
Optimizations already applied:
- Large listen backlog: 2048 (was 128)
- TCP_NODELAY: Disabled Nagle's algorithm
- SO_REUSEPORT: Multiple threads can accept()
- Direct calls: Gateway→LB (no IPC overhead)
- Score-based routing: Avoid overloaded workers
```sh
# Increase max connections
sudo sysctl -w net.core.somaxconn=4096

# TCP tuning for low latency
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
sudo sysctl -w net.ipv4.tcp_fin_timeout=30

# Increase file descriptors
ulimit -n 65536
```

```c
// Increase worker count for higher throughput
#define NUM_WORKERS 8   // more workers = more parallelism

// Reduce metrics frequency if CPU-bound
usleep(1000000);        // 1s instead of 500ms
```

Buffer size already increased to 8KB:

```c
#define BUFFER_SIZE 8192  // was 512
```

Don't benchmark the cold start:

```sh
ab -n 100 -c 10 http://localhost:8080/api/hello   # warmup
ab -n 1000 -c 50 http://localhost:8080/api/hello  # real test
```

Don't benchmark from the same host (resource contention):

```sh
# From another machine
ab -n 10000 -c 100 http://remote-host:8080/api/hello
```

Or with wrk:

```sh
wrk -t4 -c100 -d30s http://localhost:8080/api/hello
```

Expected results:
- Requests per second: ~8,000-12,000
- Time per request: ~8-12ms (mean)
- Connection time: <1ms (p50), <2ms (p99)
- Throughput: ~50-80 MB/s
| Component | Latency | Bottleneck? |
|---|---|---|
| HTTP parse | ~10µs | Not a bottleneck |
| KV lookup | ~0.2µs | Not a bottleneck |
| Worker select | ~1µs | Not a bottleneck |
| Unix socket | ~50µs | Not a bottleneck |
| PHP execution | 10-200ms | Yes (dominant cost) |
| Thread create | ~100µs | Adds latency variance |
Your processing time is 156ms average. That's mostly:
- PHP script startup (~5-10ms)
- PHP execution (~10-150ms depending on function)
- Worker fork/exec (~1-2ms)
To speed this up:
- Use a faster runtime (compiled languages)
- Cache function results (if deterministic)
- Pre-fork workers (pool of PHP processes)
- Use PHP-FPM instead of CLI
Check real bottlenecks:

```sh
# CPU usage
htop

# Threads per process
ps -eLf | grep gateway

# Socket stats
ss -s

# System limits
ulimit -a
```

Checklist:
- ✅ Increase listen backlog (done: 2048)
- ✅ TCP_NODELAY (done)
- ✅ SO_REUSEPORT (done)
- 🔧 Warmup before benchmark (do this!)
- 🔧 Increase system limits (ulimit, sysctl)
- 🔧 Use thread pool (future work)
Your variance is probably due to:
- Thread creation overhead (~100µs of variance)
- Cold-start effects (the first calls are slower)
- Lock contention (multiple threads at once)
The optimizations applied should reduce the variance by 50-70%! 🚀