ruthwikkakumani/redirection-engine


⚡ High-Performance URL Redirection Engine

A production-grade, 5-service distributed URL shortening and redirection system: solo-architected, horizontally scaled across 26 live replicas, load-tested to 17,843 RPS under 10,000 concurrent virtual users, and deployed at rdrt.dev.

Live endpoints:

  • Frontend: https://rdrt.dev
  • API: https://api.rdrt.dev

πŸ— System Architecture (HLD & LLD)

High-Level Design (HLD)

The system relies on platform-native Layer 7 load balancing to route traffic to the API Gateway. The architecture intentionally isolates the write-heavy URL creation path from the read-heavy, latency-sensitive redirect path.

```mermaid
graph TD
    Client([Client]) -->|HTTPS| LB[Railway Native Ingress L7 LB]

    subgraph "Ingress & Gateway"
        UI[url-frontend]
        API[api-gateway x6]
    end

    LB -->|rdrt.dev| UI
    LB -->|api.rdrt.dev| API

    subgraph "Microservices"
        Auth[auth-service x2]
        URL[url-service x3]
        Redirect[redirect-service x12]
        Analytics[analytics-service x3]
    end

    API -->|Route| Auth
    API -->|Route| URL
    API -->|Route| Redirect

    URL -->|Write| PG_Primary[(PostgreSQL Primary)]
    Auth -->|Read/Write| PG_Primary

    Redirect -->|1. Cache Check O 1| Redis[(Redis Shared Cache)]
    Redirect -->|2. Fallback Read| PG_Replica[(PostgreSQL Read Replica)]
    Redirect -.->|3. Async Event| Kafka[[Apache Kafka Stream]]

    Kafka -.->|Consume| Analytics
    Analytics -->|Batch Write| PG_Primary
```
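The diagram shows the gateway fanning requests out to the auth, URL, and redirect services. A minimal sketch of path-prefix routing with the standard library's `httputil.ReverseProxy`, assuming the route prefixes used in the Local Setup requests (`/api/auth/`, `/api/urls`, `/r/`) and the local ports listed there. The route table and the names `pickUpstream` / `newGateway` are hypothetical; the repo's actual gateway code, rate limiting, and service discovery are not shown in this README.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// Hypothetical route table: prefixes come from the Local Setup requests and the
// ports from its port table. Inside Docker Compose the hosts would be service names.
var routes = []struct {
	prefix, upstream string
}{
	{"/api/auth/", "http://localhost:8081"}, // auth-service
	{"/api/urls", "http://localhost:8084"},  // url-service: write path
	{"/r/", "http://localhost:8082"},        // redirect-service: read hot path
}

// pickUpstream returns the upstream base URL for a request path, or "" if no route matches.
func pickUpstream(path string) string {
	for _, rt := range routes {
		if strings.HasPrefix(path, rt.prefix) {
			return rt.upstream
		}
	}
	return ""
}

// newGateway builds one reverse proxy per upstream and dispatches by path prefix.
func newGateway() (http.Handler, error) {
	proxies := make(map[string]*httputil.ReverseProxy, len(routes))
	for _, rt := range routes {
		target, err := url.Parse(rt.upstream)
		if err != nil {
			return nil, err
		}
		proxies[rt.upstream] = httputil.NewSingleHostReverseProxy(target)
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		upstream := pickUpstream(r.URL.Path)
		if upstream == "" {
			http.NotFound(w, r)
			return
		}
		proxies[upstream].ServeHTTP(w, r)
	}), nil
}

func main() {
	gw, err := newGateway()
	if err != nil {
		panic(err)
	}
	_ = gw // in a service: http.ListenAndServe(":8080", gw)
	fmt.Println(pickUpstream("/r/abc123")) // http://localhost:8082
}
```

Keeping the redirect route on its own prefix is what lets the read-heavy path scale to many more replicas than the write path without the gateway caring how many sit behind each upstream.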

Low-Level Design (LLD): The Redirect Hot-Path

To achieve median latencies of 111ms under extreme load, the redirect sequence ensures that analytics and database writes never block the HTTP response.

```mermaid
sequenceDiagram
    participant C as Client
    participant G as API Gateway
    participant R as Redirect Service
    participant Cache as Redis
    participant DB as Postgres Replica
    participant K as Kafka

    C->>G: GET /r/{shortcode}
    G->>R: Forward Request

    rect rgb(30, 40, 50)
    Note over R,DB: Critical Latency Path
    R->>Cache: GET {shortcode}
    alt Cache Hit (99.9%)
        Cache-->>R: Return original_url
    else Cache Miss (0.1%)
        R->>DB: SELECT original_url FROM urls
        DB-->>R: Return original_url
        R->>Cache: SET {shortcode} (Background Goroutine)
    end
    end

    R->>K: Publish ClickEvent (Async / Fire & Forget)
    R-->>G: HTTP 302 Found (Location: original_url)
    G-->>C: Redirect to Destination
```

🧠 Why This Is Hard

URL redirection sounds trivial: receive a short code, look it up, return 302. At 10 RPS, it is. At 17,843 RPS with 10,000 concurrent connections, the engineering surface explodes:

  • The redirect hot-path is brutally latency-sensitive. Every millisecond of added overhead is multiplied across thousands of simultaneous connections. A naïve implementation (a synchronous DB write per redirect plus in-band analytics) collapses under load because PostgreSQL cannot sustain 17K+ IOPS at sub-150ms while also accepting analytics writes.
  • Shared cache invalidation across replicas is non-trivial. The redirect-service runs across 12 replicas. In-process caching breaks horizontal scaling: if Replica 1 caches a URL mapping, Replica 2 still has a cold cache and goes to the DB. Redis solves cross-replica cache coherence, but getting it wrong means up to 12× DB read amplification on every cache miss.
  • Analytics cannot block the critical path. If a click-analytics write takes 50ms and runs synchronously before the redirect response is sent, every redirect pays those 50ms on top of the lookup. Decoupling requires an event bus (Kafka), which introduces at-least-once delivery semantics.
  • Nginx hop elimination was counterintuitive. Conventionally, Nginx sits in front of Go services as a reverse proxy. Under pprof profiling, each intra-cluster Nginx hop added context-switch overhead per request. Removing it (Railway LB → Go app instead of Railway LB → Nginx → Go app) cut per-request switching cost by 25%.
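The at-least-once delivery mentioned above means the analytics consumer can receive the same click more than once. One common way to keep counts correct is to deduplicate by a per-event ID before the batch write. This sketch assumes such an ID exists; `ClickEvent`, `EventID`, and `aggregateBatch` are hypothetical names, not taken from the repo.

```go
package main

import (
	"fmt"
	"sort"
)

// ClickEvent is a hypothetical message shape; the README does not document the
// real Kafka payload. EventID is assumed to be unique per click.
type ClickEvent struct {
	EventID string
	Code    string
}

// aggregateBatch folds a batch of consumed events into per-shortcode counts,
// skipping redeliveries. seen persists across batches here; a real consumer would
// need durable dedup (e.g. a unique constraint), since memory is lost on restart.
func aggregateBatch(batch []ClickEvent, seen map[string]struct{}) map[string]int {
	counts := make(map[string]int)
	for _, ev := range batch {
		if _, dup := seen[ev.EventID]; dup {
			continue // same event delivered again
		}
		seen[ev.EventID] = struct{}{}
		counts[ev.Code]++
	}
	return counts
}

func main() {
	seen := map[string]struct{}{}
	batch := []ClickEvent{
		{"e1", "abc123"}, {"e2", "abc123"}, {"e1", "abc123"}, {"e3", "xyz789"},
	}
	counts := aggregateBatch(batch, seen)

	codes := make([]string, 0, len(counts))
	for c := range counts {
		codes = append(codes, c)
	}
	sort.Strings(codes)
	for _, c := range codes {
		fmt.Println(c, counts[c])
	}
	// abc123 2
	// xyz789 1
}
```

Aggregating before writing also means one row update per short code per batch instead of one insert per click, which is what keeps the analytics writes off PostgreSQL's critical load.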

🚒 Production Deployment

The system is live on Railway, heavily weighted toward the redirect read-path:

Railway Production Environment (Screenshot of live Railway deployment showing active replicas and DB instances)

| Component | Replicas | Responsibility |
|---|---|---|
| API Gateway | 6 active | api.rdrt.dev routing, rate limiting |
| Redirect Service | 12 active | The hot path: Redis lookup → PG fallback → Kafka publish → 302 |
| URL Service | 3 active | URL creation, ownership, PostgreSQL primary writes |
| Analytics Service | 3 active | Kafka consumer, click aggregation, stats |
| Auth Service | 2 active | JWT issuance, user management |
| Frontend | 1 active | rdrt.dev web UI |

Data Layer Isolation:

  • PostgreSQL Primary: Handles all writes from URL creation, Auth, and batched Analytics.
  • PostgreSQL Read Replica (Postgres-7Ev1): Dedicated exclusively to the redirect-service fallback reads. Zero write pressure.
  • Redis: Shared global cache across all 12 redirect replicas.

📊 Benchmark Results

All tests were run with k6 from a local MacBook Pro against the live api.rdrt.dev deployment over the public internet.

Peak Run: 10,000 VUs · 4.28M Requests · 17,843 RPS

```text
k6 run redirect_load_test.js
scenarios: 10,000 max VUs, 4m0s, 4 stages

✓ is redirect (301/302/307/308)
✓ has location header

http_req_duration:  avg=290ms  med=111ms  p(90)=333ms  p(95)=1.19s  max=8.41s
http_req_failed:    0.00%   ← 2 i/o timeouts / 4,284,206 requests
http_reqs:          4,284,206 total  @  17,843 RPS

Success rate: 99.9999%
```

Cleanest High-Load Run: 3,000 VUs · 2.3M Requests · 11,019 RPS

```text
k6 run load-test.js
scenarios: 3,000 max VUs, 3m30s, 6 stages

✓ status is 302

http_req_duration:  avg=119ms  med=113ms  p(90)=138ms  p(95)=157ms  max=614ms
http_req_failed:    0.00%   ← 1 failure / 2,315,167 requests

p(95) = 157ms  ✓  (threshold: 200ms, PASSED)
```

Full Progression

| Concurrent VUs | Total Requests | RPS | p(50) | p(95) | Failures |
|---|---|---|---|---|---|
| 400 | 36,058 | 171 | 288ms | 335ms | 0 |
| 1,000 | 520,805 | 2,168 | 276ms | 341ms | 0 |
| 3,000 | 2,315,167 | 11,019 | 113ms | 157ms | 1 |
| 10,000 | 4,284,206 | 17,843 | 111ms | 1.19s | 2 |
| 5,000 (analytics) | 2,011,389 | 8,378 | 105ms | 886ms | 2 |
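The success rates follow directly from the table. For the peak run, 2 failures in 4,284,206 requests works out to roughly 99.99995%, consistent with the 99.9999% quoted above. A short sketch that recomputes the rate for every row, with the figures copied from the table (`successRate` is an illustrative helper, not part of the repo):

```go
package main

import "fmt"

// run holds figures copied from the Full Progression table.
type run struct {
	label              string
	requests, failures int
}

// successRate returns the percentage of requests that did not fail.
func successRate(requests, failures int) float64 {
	return 100 * float64(requests-failures) / float64(requests)
}

func main() {
	runs := []run{
		{"400 VUs", 36058, 0},
		{"1,000 VUs", 520805, 0},
		{"3,000 VUs", 2315167, 1},
		{"10,000 VUs", 4284206, 2},
		{"5,000 VUs (analytics)", 2011389, 2},
	}
	for _, r := range runs {
		fmt.Printf("%-22s %.5f%%\n", r.label, successRate(r.requests, r.failures))
	}
}
```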

On the p(95) rise at 10K VUs: p(95) climbs to ~1.19s at 10,000 concurrent users. This is attributed to Railway's free-tier TCP connection ceiling, not Go application saturation: the median stays at 111ms even at peak, which indicates the Go runtime is not the bottleneck.

On the RPS jump from 2,168 → 11,019: this reflects Redis cache warming. Once the hot URL working set is fully cached in the shared Redis instance used by all 12 redirect replicas, requests never reach PostgreSQL: pure O(1) Redis lookups at wire speed.


βš™οΈ Tech Decision Rationale

| Decision | Chosen | Rejected | Why |
|---|---|---|---|
| Caching | Redis | In-memory map | Shared state across 12 replicas. Prevents 12× DB read amplification on cache misses. |
| Analytics pipeline | Kafka | Direct DB write | Eliminates I/O blocking on the redirect hot path. Decouples the 302 response from click tracking. |
| Load balancing | Platform L7 LB | Nginx | Removes an unnecessary network hop and context switch inside the container. |
| Database | PG Primary + Replica | MongoDB / Single PG | The primary handles URL writes. The read replica takes 100% of the redirect fallback reads, eliminating lock contention. |

πŸ“ Repo Structure

```text
redirection-engine/
├── services/
│   ├── analytics-service/  # Kafka consumer, click aggregation, stats
│   ├── api-gateway/        # API Gateway: routing, rate-limiting
│   ├── auth-service/       # Auth Service: JWT, user management
│   ├── redirect-service/   # Redirect Service: hot path, Redis, Kafka producer
│   └── url-service/        # URL Service: URL creation, DB writes
├── pkg/
│   ├── cache/              # Redis client abstraction
│   ├── db/                 # PostgreSQL connection pool
│   ├── kafka/              # Producer/consumer helpers
│   └── middleware/         # Shared HTTP middleware
├── docker-compose.yaml
├── go.mod
└── go.sum
```

🚀 Local Setup

Prerequisites: Docker & Docker Compose, Go 1.22+, and jq (used by the example requests)

```bash
git clone https://github.com/ruthwikkakumani/redirection-engine
cd redirection-engine
docker compose up --build
```
| Service | Local Port |
|---|---|
| API Gateway | 8080 |
| Auth Service | 8081 |
| Redirect Service | 8082 |
| Analytics Service | 8083 |
| URL Service | 8084 |
```bash
# 1. Log in and capture a JWT (assumes the account is already registered)
TOKEN=$(curl -s -X POST http://localhost:8080/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"you@example.com","password":"secret"}' | jq -r '.token')

# 2. Shorten a URL
curl -X POST http://localhost:8080/api/urls \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"original_url":"https://google.com"}'

# 3. Check the redirect without following it (look for the Location header)
curl -I http://localhost:8080/r/<shortcode>
```

🔬 Performance Profiling & Observability

```bash
# CPU profile: capture 30s under load
curl "http://localhost:8082/debug/pprof/profile?seconds=30" > cpu.prof
go tool pprof cpu.prof

# Goroutine snapshot: check for leaks
curl "http://localhost:8082/debug/pprof/goroutine?debug=2"
```
  • No goroutine leaks: the goroutine count stabilizes under sustained 10K-VU load.
  • Structured logging: JSON logs with request_id, service, short_code, cache_hit, latency_ms on every request for distributed tracing.
  • Graceful shutdown: OS signal listener triggers server drain with 30s timeout; in-flight requests complete before the process exits.

🔭 What I'd Do Next

  • OpenTelemetry traces β€” End-to-end spans from API Gateway through Kafka to Analytics, exported to Jaeger or Tempo.
  • Circuit breaker β€” sony/gobreaker on Redis and PostgreSQL clients; fail-fast on dependency degradation rather than piling up goroutines waiting on a dead connection.
  • GitHub Actions CI β€” go test -race ./... + golangci-lint + Docker build verification on every PR.
  • Canary deployments: Progressive rollout (5% → 25% → 100%) with automated rollback triggered by p99 spike detection.

👤 Author

Ruthwik Kakumani, Backend & Distributed Systems
LinkedIn · GitHub · LeetCode · ruthwikkakumani@gmail.com
