Performance Tuning Guide

Common Benchmark Issues

High Connection Time Variance

Symptom:

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.2      0       3
ERROR: median and mean are more than 2× std apart

Causes & Fixes:

1. Listen Backlog Too Small

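Problem: When the accept queue (backlog) fills up, the kernel drops or delays new connection attempts, and clients see connect-time spikes while they retry.

Fix: Increase the listen queue (already done). On Linux the value is silently capped at net.core.somaxconn, so raise that too (see Kernel Parameters below).

listen(server_fd, 2048);  // Was 128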

2. Nagle's Algorithm Enabled

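Problem: Nagle's algorithm holds back small writes until earlier data has been acknowledged. This saves packets but adds latency to small responses.

Fix: Disable with TCP_NODELAY (already done)

int opt = 1;  // 1 = disable Nagle's algorithm; needs <netinet/tcp.h>
setsockopt(server_fd, IPPROTO_TCP, TCP_NODELAY, &opt, sizeof(opt));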

3. Thread Creation Overhead

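Problem: Creating a new thread for every request adds variable latency.

Current: Thread-per-request model

// Pass the fd by value through the void* argument (needs <stdint.h>);
// the handler recovers it with (int)(intptr_t)arg.
pthread_create(&thread, NULL, handle_http_client, (void *)(intptr_t)client_fd);
pthread_detach(thread);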

Better: Thread pool (recycle threads)

Pre-create N threads
Use work queue
Eliminates thread creation overhead
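
The three points above can be sketched as a fixed-size pool fed by a bounded queue of client file descriptors. This is a minimal illustration, not the gateway's actual code: `thread_pool`, `pool_init`, `pool_submit`, `pool_shutdown`, `POOL_THREADS`, `QUEUE_CAP`, and `handle_client_stub` are all hypothetical names, and the stub handler stands in for `handle_http_client`.

```c
#include <pthread.h>
#include <unistd.h>

#define POOL_THREADS 8     /* hypothetical pool size */
#define QUEUE_CAP    1024  /* hypothetical queue capacity */

typedef void (*job_fn)(int client_fd);

typedef struct {
    int             fds[QUEUE_CAP];  /* ring buffer of pending client fds */
    int             head, tail, count;
    int             shutting_down;
    int             nthreads;        /* threads actually started */
    job_fn          handler;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
    pthread_t       threads[POOL_THREADS];
} thread_pool;

/* Stand-in for the real request handler: acknowledge, then close. */
static void handle_client_stub(int client_fd) {
    if (write(client_fd, "k", 1) != 1) { /* ignore failure in this stub */ }
    close(client_fd);
}

static void *pool_worker(void *arg) {
    thread_pool *p = arg;
    for (;;) {
        pthread_mutex_lock(&p->lock);
        while (p->count == 0 && !p->shutting_down)
            pthread_cond_wait(&p->not_empty, &p->lock);
        if (p->count == 0) {          /* shutting down and queue drained */
            pthread_mutex_unlock(&p->lock);
            return NULL;
        }
        int fd = p->fds[p->head];
        p->head = (p->head + 1) % QUEUE_CAP;
        p->count--;
        pthread_mutex_unlock(&p->lock);
        p->handler(fd);               /* run the job outside the lock */
    }
}

/* Pre-create the worker threads once, at startup. */
int pool_init(thread_pool *p, job_fn handler) {
    p->head = p->tail = p->count = 0;
    p->shutting_down = 0;
    p->nthreads = 0;
    p->handler = handler;
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->not_empty, NULL);
    for (int i = 0; i < POOL_THREADS; i++) {
        if (pthread_create(&p->threads[i], NULL, pool_worker, p) != 0)
            return -1;
        p->nthreads++;
    }
    return 0;
}

/* Returns 0 on success, -1 if the queue is full or the pool is stopping. */
int pool_submit(thread_pool *p, int client_fd) {
    int rc = -1;
    pthread_mutex_lock(&p->lock);
    if (p->count < QUEUE_CAP && !p->shutting_down) {
        p->fds[p->tail] = client_fd;
        p->tail = (p->tail + 1) % QUEUE_CAP;
        p->count++;
        rc = 0;
        pthread_cond_signal(&p->not_empty);
    }
    pthread_mutex_unlock(&p->lock);
    return rc;
}

/* Let workers drain the queue, then join them. */
void pool_shutdown(thread_pool *p) {
    pthread_mutex_lock(&p->lock);
    p->shutting_down = 1;
    pthread_cond_broadcast(&p->not_empty);
    pthread_mutex_unlock(&p->lock);
    for (int i = 0; i < p->nthreads; i++)
        pthread_join(p->threads[i], NULL);
    pthread_mutex_destroy(&p->lock);
    pthread_cond_destroy(&p->not_empty);
}
```

In the accept loop, `pool_submit(&pool, client_fd)` would replace the `pthread_create`/`pthread_detach` pair. When it returns -1 the queue is full, and the caller can close the connection or apply backpressure.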

4. Lock Contention

Problem: Multiple threads wait on the same lock, so a request's latency depends on how many threads are queued ahead of it.

Where:

Config loader: pthread_rwlock_rdlock(&kv_lock)
Metrics: pthread_mutex_lock(&metrics_lock)

Optimization:

Use RCU (Read-Copy-Update) for config
Lock-free data structures for metrics
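
For the metrics side, one lock-free option is to replace the mutex-protected counters with C11 atomics. The sketch below assumes the metrics are simple running totals. The struct, its field names, and both functions are hypothetical, and it does not cover the RCU suggestion for the config.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counters; the real metrics struct may differ. */
typedef struct {
    atomic_uint_fast64_t requests_total;
    atomic_uint_fast64_t errors_total;
    atomic_uint_fast64_t latency_us_sum;
} metrics_counters;

static metrics_counters g_metrics;  /* static storage, so zero-initialized */

/* Hot path: a few atomic adds, no lock taken. */
static inline void metrics_record(uint64_t latency_us, int is_error) {
    atomic_fetch_add_explicit(&g_metrics.requests_total, 1, memory_order_relaxed);
    atomic_fetch_add_explicit(&g_metrics.latency_us_sum, latency_us, memory_order_relaxed);
    if (is_error)
        atomic_fetch_add_explicit(&g_metrics.errors_total, 1, memory_order_relaxed);
}

/* Reader: the two loads are not one consistent snapshot, which is
 * usually acceptable for monitoring output. */
static inline uint64_t metrics_mean_latency_us(void) {
    uint64_t n   = atomic_load_explicit(&g_metrics.requests_total, memory_order_relaxed);
    uint64_t sum = atomic_load_explicit(&g_metrics.latency_us_sum, memory_order_relaxed);
    return n ? sum / n : 0;
}
```

Relaxed ordering is enough here because each counter is independent and nothing else is published through it.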

5. Worker Selection Time

Problem: Iterating over all workers on every request.

Current: O(N) scan over all workers on every request

for (int i = 0; i < NUM_WORKERS; i++) {
    get_worker_metrics(i, &m);
    if (m.score < best_score) ...
}

Better: Keep workers sorted by score

Update on metrics receipt
Select = O(1) instead of O(N)
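
A simpler variant of the same idea is to cache only the index of the best worker instead of keeping a full sorted list. The sketch below assumes that a lower score is better (as in the loop above) and that metrics arrive far less often than requests. `on_metrics_received`, `select_worker`, and the arrays are hypothetical names.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NUM_WORKERS 4  /* assumed to match the gateway's worker count */

static double          worker_score[NUM_WORKERS];  /* lower is better */
static pthread_mutex_t score_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int      best_worker;                /* cached index of the lowest score */

/* Called when a worker reports metrics: O(N), but off the request path. */
void on_metrics_received(int worker, double score) {
    pthread_mutex_lock(&score_lock);
    worker_score[worker] = score;
    int best = 0;
    for (int i = 1; i < NUM_WORKERS; i++)
        if (worker_score[i] < worker_score[best])
            best = i;
    atomic_store(&best_worker, best);
    pthread_mutex_unlock(&score_lock);
}

/* Called on every request: O(1), no lock. */
int select_worker(void) {
    return atomic_load(&best_worker);
}
```

One trade-off to keep in mind: every request that arrives between two metrics updates goes to the same worker, so with slow updates this can overload the "best" worker until its score catches up.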

Optimizations Applied

✅ Already Done

Large listen backlog: 2048 (was 128)
TCP_NODELAY: Disabled Nagle's algorithm
SO_REUSEPORT: Lets several listening sockets share the port, so multiple threads can accept() in parallel
Direct calls: Gateway→LB (no IPC overhead)
Score-based routing: Avoid overloaded workers

🔧 Additional Tuning

Kernel Parameters (Linux)

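# Raise the cap on the listen() backlog (a requested backlog above this value is silently truncated)
sudo sysctl -w net.core.somaxconn=4096

# Recycle sockets faster under many short-lived connections
sudo sysctl -w net.ipv4.tcp_tw_reuse=1      # reuse TIME_WAIT sockets for new outgoing connections
sudo sysctl -w net.ipv4.tcp_fin_timeout=30  # shorten the FIN_WAIT_2 timeout for orphaned sockets

# Increase the file descriptor limit (applies to the current shell and its children)
ulimit -n 65536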
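
Because `ulimit -n` only affects the shell it runs in, the gateway could also raise its own descriptor limit at startup. The helper below is a hypothetical sketch of that approach using `getrlimit`/`setrlimit`. It raises the soft limit toward the requested value and never exceeds the hard limit.

```c
#include <sys/resource.h>

/* Raise the soft RLIMIT_NOFILE toward `wanted`, capped at the hard limit.
 * Returns 0 on success, -1 on error. */
int raise_fd_limit(rlim_t wanted) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur >= wanted)
        return 0;  /* already high enough */
    rl.rlim_cur = (wanted < rl.rlim_max) ? wanted : rl.rlim_max;
    return setrlimit(RLIMIT_NOFILE, &rl);
}
```

Calling `raise_fd_limit(65536)` early in `main` would mirror the shell command above. If the hard limit is lower than 65536, the soft limit ends up at the hard limit instead.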

Gateway Tuning (`src/gateway.c`)

// Increase worker count for higher throughput
#define NUM_WORKERS 8  // More workers = more parallelism

// Reduce metrics frequency if CPU-bound
sleep(1);  // 1s instead of 500ms; some systems reject usleep() values >= 1,000,000

Worker Buffer Sizes

Already increased to 8KB:

#define BUFFER_SIZE 8192  // Was 512

Benchmark Best Practices

1. Warmup Phase

# Don't benchmark cold start
ab -n 100 -c 10 http://localhost:8080/api/hello  # Warmup
ab -n 1000 -c 50 http://localhost:8080/api/hello  # Real test

2. Separate Benchmark Machine

Don't benchmark from same host (resource contention):

# From another machine
ab -n 10000 -c 100 http://remote-host:8080/api/hello

3. Use wrk for Better Stats

wrk -t4 -c100 -d30s http://localhost:8080/api/hello

Expected Performance

Optimized Setup (4 workers)

Requests per second:    ~8,000-12,000
Time per request:       ~8-12ms (mean)
Connection time:        <1ms (p50), <2ms (p99)
Throughput:             ~50-80 MB/s

Bottlenecks

| Component | Latency | Bottleneck? |
|---|---|---|
| HTTP parse | ~10µs | Not a bottleneck |
| KV lookup | ~0.2µs | Not a bottleneck |
| Worker select | ~1µs | Not a bottleneck |
| Unix socket | ~50µs | Not a bottleneck |
| PHP execution | 10-200ms | ⚠️ Main bottleneck |
| Thread create | ~100µs | ⚠️ Under high concurrency |

Real Bottleneck: Function Execution

Your average processing time is 156ms. Most of that comes from:

PHP script startup (~5-10ms)
PHP execution (~10-150ms depending on function)
Worker fork/exec (~1-2ms)

Solutions

Use faster runtime (compiled languages)
Cache function results (if deterministic)
Pre-fork workers (pool of PHP processes)
Use PHP-FPM instead of CLI
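
As an illustration of the result-caching option, the sketch below keeps a small direct-mapped cache in the gateway, keyed by a request string. Everything here is hypothetical (the names, the sizes, and the idea of using the request path plus query as the key). It is not thread-safe as written, and it only makes sense for functions whose output depends solely on the key.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 256   /* hypothetical sizes */
#define KEY_MAX     128
#define VALUE_MAX   1024

typedef struct {
    int  used;
    char key[KEY_MAX];
    char value[VALUE_MAX];
} cache_slot;

static cache_slot cache[CACHE_SLOTS];

/* FNV-1a string hash, used to pick a slot. */
static uint32_t hash_key(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Returns the cached response, or NULL on a miss. */
const char *cache_get(const char *key) {
    cache_slot *s = &cache[hash_key(key) % CACHE_SLOTS];
    return (s->used && strcmp(s->key, key) == 0) ? s->value : NULL;
}

/* Stores a response, overwriting whatever occupied the slot.
 * Returns -1 if the key or value does not fit. */
int cache_put(const char *key, const char *value) {
    if (strlen(key) >= KEY_MAX || strlen(value) >= VALUE_MAX)
        return -1;
    cache_slot *s = &cache[hash_key(key) % CACHE_SLOTS];
    strcpy(s->key, key);
    strcpy(s->value, value);
    s->used = 1;
    return 0;
}
```

On a hit the gateway can answer without touching a PHP worker at all, which removes the 10-200ms execution cost for that request. A production version would also need locking (or per-thread caches) and an expiry policy.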

Monitoring

Check real bottlenecks:

# CPU usage
htop

# Threads per process
ps -eLf | grep gateway

# Socket stats
ss -s

# System limits
ulimit -a

Quick Wins (Priority Order)

✅ Increase listen backlog (done: 2048)
✅ TCP_NODELAY (done)
✅ SO_REUSEPORT (done)
🔧 Warmup before benchmark (do this!)
🔧 Increase system limits (ulimit, sysctl)
🔧 Use thread pool (future work)

Summary

Your variance is most likely caused by:

Thread creation overhead (100µs variance)
Cold-start effects (the first calls are slower)
Lock contention (several threads competing at the same time)

The optimizations applied should reduce the variance by 50-70%! 🚀
