
⚡ INDUS Tester

A production-grade, Go-native distributed load-testing platform

Real performance testing for CI/CD pipelines, staging environments, and load balancer validation



System Architecture

(architecture diagram: docs/architecture.svg)

Key Design Principles:

  • Time-driven load modeling (not request loops)
  • Single controller coordination (no peer-to-peer)
  • Online metric aggregation with HDR histograms
  • Context-driven cancellation throughout
  • Lock-free hot paths for performance
  • Streaming metrics with backpressure

Features

| Category | Capability |
| --- | --- |
| Load Profiles | `constant`, `ramp_up`, `ramp_down`, `step`, `spike` |
| Metrics | p50, p90, p95, p99, RPS, error_ratio, min/max/mean |
| Distribution | Multiple agents; VUs split automatically |
| Thresholds | `<` `>` `<=` `>=` on any metric, global or per-scenario |
| Observability | Prometheus export, structured JSON logs, Grafana-ready |
| Reporting | HTML (human) + JSON (machine) reports, time-series data |
| CI/CD | Exit code 0 pass / 1 fail; GitHub Actions friendly |

Load Profiles

(load profiles diagram: docs/load-profiles.svg)


Metrics Pipeline

(metrics pipeline diagram: docs/metrics-pipeline.svg)


Execution Sequence (UML)

(sequence diagram: docs/sequence-diagram.svg)


Quick Start

Prerequisites

  • Go 1.21+
  • make (optional, but recommended)

1. Build

# Linux / macOS
make deps && make build

# Windows (PowerShell)
go mod download
go mod tidy
go build -o bin/controller.exe ./cmd/controller
go build -o bin/agent.exe      ./cmd/agent
go build -o bin/indus-tester.exe ./cmd/indus-tester

Binaries are placed in bin/:

| Binary | Role |
| --- | --- |
| bin/controller | Coordinates all agents, aggregates metrics, evaluates thresholds |
| bin/agent | Executes Virtual Users and streams metrics back |
| bin/indus-tester | CLI: submits plans, streams progress, generates reports |

2. Start the Mock Target Server

# Build and start (one-time)
go build -o bin/mock-server ./scripts/mock-server.go
./bin/mock-server          # starts on http://localhost:8080
2026/02/28 17:18:37 🚀 Mock server starting on :8080
2026/02/28 17:18:37    Test with: curl http://localhost:8080/healthz

3. Start Agents

# Terminal 1
./bin/agent --id=agent-1 --grpc-addr=:50052 --metrics-addr=:9091

# Terminal 2
./bin/agent --id=agent-2 --grpc-addr=:50053 --metrics-addr=:9092

Agent log output:

{"time":"2026-02-28T17:18:57+05:30","level":"INFO","msg":"agent gRPC server starting","agent_id":"agent-1","addr":":50052"}
{"time":"2026-02-28T17:18:57+05:30","level":"INFO","msg":"agent metrics server starting","agent_id":"agent-1","addr":":9091"}

4. Start the Controller

# Terminal 3
./bin/controller \
  --grpc-addr=:50051 \
  --metrics-addr=:9090 \
  --agents=localhost:50052,localhost:50053

5. Run Your First Test

./bin/indus-tester --controller=localhost:50051 run examples/basic.yaml

Live output:

Run started: 059a8d29 (state: running)

STATE      ELAPSED         VUs        RPS    p95(ms)     ERRORS
────────── ────────── ──────── ────────── ────────── ──────────
running    8s               10        1.7     494.67      0.00%
running    16s              10        1.8     491.20      0.00%
running    24s              10        1.9     489.45      0.00%
running    30s              10        2.0     487.85      0.00%

Final result:

Run completed: 059a8d29

THRESHOLD RESULTS:
  ✓  p95 < 500ms          actual: 487.85ms
  ✓  error_ratio < 1%     actual: 0.00%

All thresholds passed. Exit code: 0

Example Test Flows & Results

Example 1 — Basic Constant Load (examples/basic.yaml)

Goal: Confirm the API handles 10 concurrent users for 30 seconds.

name: "basic-api-test"

scenarios:
  get-users:
    duration: "30s"
    think_time: "200ms"
    profile:
      type: constant
      vus: 10
    steps:
      - name: "list-users"
        method: GET
        url: "http://localhost:8080/api/users"
        timeout: "5s"

thresholds:
  - metric: p95
    condition: "<"
    value: 500      # p95 < 500ms
  - metric: error_ratio
    condition: "<"
    value: 0.01     # errors < 1%

Command:

./bin/indus-tester --controller=localhost:50051 run examples/basic.yaml

Result:

STATE      ELAPSED    VUs    RPS    p95(ms)   ERRORS
running    30s        10     2.0    487.85    0.00%

THRESHOLD RESULTS:
  ✓  p95 < 500         actual: 487.85
  ✓  error_ratio < 0.01  actual: 0.00

Exit: 0  ✅

Example 2 — Ramp-Up Stress Test (examples/ramp-stress.yaml)

Goal: Find the system's breaking point by ramping 5 → 200 VUs over 2 minutes.

name: "ramp-stress-test"

scenarios:
  api-stress:
    duration: "120s"
    think_time: "100ms"
    profile:
      type: ramp_up
      start_vus: 5
      end_vus: 200
    steps:
      - name: "create-order"
        method: POST
        url: "http://localhost:8080/api/orders"
        body: '{"item": "widget", "quantity": 1}'
        headers:
          Content-Type: "application/json"

thresholds:
  - metric: p95
    condition: "<"
    value: 1000     # p95 < 1s
  - metric: p99
    condition: "<"
    value: 3000     # p99 < 3s
  - metric: error_ratio
    condition: "<"
    value: 0.05     # errors < 5%
  - metric: rps
    condition: ">"
    value: 100      # must sustain > 100 RPS

Expected live output (showing latency increase as VUs ramp):

STATE      ELAPSED    VUs    RPS     p95(ms)   ERRORS
running    15s        21     18.2    142.30    0.00%
running    45s        67     55.4    389.10    0.10%
running    90s        133    88.7    742.60    2.30%
running    120s       200    101.2   994.45    4.80%

THRESHOLD RESULTS:
  ✓  p95 < 1000        actual: 994.45
  ✓  p99 < 3000        actual: 2104.30
  ✓  error_ratio < 0.05  actual: 0.048
  ✓  rps > 100         actual: 101.2

Exit: 0  ✅

Interpretation: The system starts struggling at around 133 VUs (p95 crosses 700ms). At 200 VUs it is at the edge of its SLA. Set your SLAs accordingly.


Example 3 — Traffic Spike (examples/multi-scenario-spike.yaml)

Goal: Simulate a flash sale. 5 base VUs → 100 spike VUs for 15s → back to 5.

scenarios:
  checkout-spike:
    duration: "90s"
    profile:
      type: spike
      base_vus: 5
      spike_vus: 100
      spike_at: "30s"
      spike_duration: "15s"
    steps:
      - name: "checkout"
        method: POST
        url: "http://localhost:8080/api/checkout"
        body: '{"payment_method": "card"}'

thresholds:
  - metric: p95
    scenario: "checkout-spike"
    condition: "<"
    value: 2000     # LB/system may be slow during spike, but keep p95 < 2s
  - metric: error_ratio
    scenario: "checkout-spike"
    condition: "<"
    value: 0.05

Expected live output:

STATE      ELAPSED    VUs    RPS    p95(ms)   ERRORS
running    25s        5      2.1    210.40    0.00%   ← base load
running    32s        100    24.3   1840.50   1.10%   ← spike hit!
running    47s        100    25.8   1920.30   1.80%   ← peak of spike
running    50s        5      2.2    240.10    0.00%   ← recovered

THRESHOLD RESULTS:
  ✓  p95 < 2000 [checkout-spike]   actual: 1920.30
  ✓  error_ratio < 0.05 [checkout-spike]  actual: 0.018

Exit: 0  ✅

Example 4 — Step Load (examples/step-load.yaml)

Goal: Step up VUs every 30s to identify the exact degradation threshold.

scenarios:
  step-api:
    duration: "120s"
    profile:
      type: step
      start_vus: 10
      step_vus: 20
      step_duration: "30s"   # 10 → 30 → 50 → 70 VUs
    steps:
      - name: "heavy-report"
        method: GET
        url: "http://localhost:8080/api/reports/summary"
        timeout: "15s"

Expected live output:

STATE      ELAPSED    VUs    RPS    p95(ms)   ERRORS
running    15s        10     3.1    185.20    0.00%   ← step 1
running    45s        30     8.9    420.80    0.00%   ← step 2
running    75s        50     14.2   880.10    0.40%   ← step 3 — degrading
running    105s       70     17.5   1840.60   3.20%   ← step 4 — near limit

Interpretation: Performance degrades significantly at 50 VUs, so cap this endpoint at roughly 40 VUs.


Example 5 — Load Balancer Performance Suite (examples/lb-test.yaml)

Goal: Comprehensive LB validation — all 4 profile types against your load balancer.

./bin/indus-tester --controller=localhost:50051 run examples/lb-test.yaml

Expected output (60s test):

Run started: eb0ef8fb

STATE      ELAPSED    VUs    RPS    p95(ms)   ERRORS
running    10s        35     6.4    412.30    0.00%
running    30s        90     14.8   525.47    0.00%
running    50s        100    18.2   610.30    0.40%
running    60s        10     3.2    490.10    0.00%   ← spike recovered

THRESHOLD RESULTS:
  ✓  p95 < 1000                    actual: 610.30
  ✓  error_ratio < 0.02            actual: 0.004
  ✓  p95 < 300 [baseline-health]   actual: 241.10
  ✓  error_ratio < 0.005 [baseline] actual: 0.000
  ✓  p95 < 2000 [traffic-spike]    actual: 1840.20
  ✓  error_ratio < 0.05 [spike]    actual: 0.018
  ✗  rps > 50                      actual: 18.2

Exit: 1  ❌  (the rps threshold fails in this sample run against the mock server)

Test Plan Schema Reference

name: "my-test"                    # Human-readable test name

scenarios:
  scenario-name:                   # Unique scenario identifier
    duration: "60s"                # How long this scenario runs
    think_time: "200ms"            # Pause between each VU iteration

    profile:
      type: constant               # constant | ramp_up | ramp_down | step | spike
      vus: 50                      # (constant) fixed VU count

      # ramp_up / ramp_down
      start_vus: 5
      end_vus: 100

      # step
      start_vus: 10
      step_vus: 20                 # VUs added per step
      step_duration: "30s"         # Duration of each step

      # spike
      base_vus: 10
      spike_vus: 200
      spike_at: "30s"              # When to begin the spike
      spike_duration: "15s"        # Duration of the spike

    steps:                         # HTTP steps executed per VU iteration
      - name: "step-name"
        method: GET                # GET | POST | PUT | DELETE | PATCH
        url: "http://host/path"
        timeout: "5s"
        headers:
          Accept: "application/json"
          Content-Type: "application/json"
        body: '{"key": "value"}'   # Request body (string)
        tags:
          endpoint: "users"        # Custom tags for metric breakdown

thresholds:                        # Optional pass/fail SLA gates
  - metric: p95                    # p50 | p90 | p95 | p99 | error_ratio | rps
    condition: "<"                 # < | > | <= | >=
    value: 500                     # Milliseconds for latency; ratio for error_ratio
    scenario: "scenario-name"      # Optional: scope to one scenario

CLI Reference

Usage:
  indus-tester [command]

Available Commands:
  run      Execute a load test plan
  status   Get run status by ID
  agents   List connected agents
  report   Generate HTML or JSON report

Flags:
  --controller string   Controller gRPC address (default "localhost:50051")

run

./bin/indus-tester run examples/basic.yaml
./bin/indus-tester --controller=host:50051 run plan.yaml
./bin/indus-tester run plan.yaml --no-stream      # submit without live output

status

./bin/indus-tester status 059a8d29
Run: 059a8d29
State: completed
Duration: 30s
Requests: 62  Errors: 0  RPS: 2.07
p50: 312ms  p95: 487ms  p99: 503ms

agents

./bin/indus-tester agents
ID        ADDRESS             STATE     VUs
agent-1   localhost:50052     active    5
agent-2   localhost:50053     active    5

report

./bin/indus-tester report 059a8d29 --format=html -o my-report.html
./bin/indus-tester report 059a8d29 --format=json -o my-report.json

Configuration Reference

Controller Flags

| Flag | Default | Description |
| --- | --- | --- |
| --grpc-addr | :50051 | gRPC listen address |
| --metrics-addr | :9090 | Prometheus metrics endpoint |
| --agents | (required) | Comma-separated host:port list |

Agent Flags

| Flag | Default | Description |
| --- | --- | --- |
| --id | agent-0 | Unique agent identifier |
| --grpc-addr | :50052 | gRPC listen address |
| --metrics-addr | :9091 | Prometheus metrics endpoint |

CLI Flags

| Flag | Default | Description |
| --- | --- | --- |
| --controller | localhost:50051 | Controller address |

Observability

Prometheus Metrics

| Endpoint | Component |
| --- | --- |
| http://localhost:9090/metrics | Controller |
| http://localhost:9091/metrics | Agent 1 |
| http://localhost:9092/metrics | Agent 2 |

Key exported metrics:

indus_controller_requests_total
indus_controller_request_duration_seconds{quantile="0.95"}
indus_controller_active_vus
indus_controller_thresholds_passed
indus_controller_thresholds_failed
indus_controller_run_state

Prometheus + Grafana Setup

# prometheus.yml (already included in repo)
scrape_configs:
  - job_name: 'indus-controller'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'indus-agents'
    static_configs:
      - targets: ['localhost:9091', 'localhost:9092']

# Start Prometheus in Docker. Host port 9099 is used here because 9091 is
# already taken by agent-1's metrics endpoint when everything runs on one
# machine. On macOS/Windows, change the localhost targets above to
# host.docker.internal so the container can reach the host processes.
docker run -p 9099:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

Structured JSON Logs

All components emit structured logs to stderr:

{"time":"2026-02-28T17:19:54+05:30","level":"INFO","msg":"RunScenario stream opened","component":"agent","agent_id":"agent-1"}
{"time":"2026-02-28T17:19:54+05:30","level":"INFO","msg":"assigning scenario","component":"agent","agent_id":"agent-1","scenario":"get-users","run_id":"059a8d29","target_vus":5}
{"time":"2026-02-28T17:20:24+05:30","level":"INFO","msg":"scenario duration elapsed","component":"agent","agent_id":"agent-1","scenario":"get-users"}

Fields: time, level, msg, component, run_id, agent_id, scenario, addr


CI/CD Integration

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | All thresholds passed — deploy! |
| 1 | Threshold violation or error — block deploy |

GitHub Actions

name: Load Test
on: [push, pull_request]

jobs:
  load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.21'

      - name: Build
        run: make build

      - name: Start services
        run: |
          ./bin/agent --id=agent-1 --grpc-addr=:50052 --metrics-addr=:9091 &
          ./bin/agent --id=agent-2 --grpc-addr=:50053 --metrics-addr=:9092 &
          sleep 1
          ./bin/controller --grpc-addr=:50051 --metrics-addr=:9090 \
            --agents=localhost:50052,localhost:50053 &
          sleep 2

      - name: Run load test
        run: ./bin/indus-tester --controller=localhost:50051 run examples/basic.yaml

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: load-test-report
          path: report.html

Docker Compose (included)

docker compose up

Services started:

  • controller → localhost:50051
  • agent-1 → localhost:50052
  • agent-2 → localhost:50053
  • prometheus → localhost:9090

Architecture Deep Dive

Virtual User Execution Model

Each VU is a goroutine that:

  1. Runs HTTP steps in sequence
  2. Records timing + status for each step
  3. Sends a Sample to the per-agent aggregator channel
  4. Waits think_time before repeating
  5. Shuts down cleanly when context is cancelled

Scheduler

The scheduler pre-computes a 1-second resolution VU schedule:

t=0s  → 5 VUs/agent  (constant profile: 10 VUs total, split across 2 agents)
t=1s  → 5 VUs/agent
...

For ramp profiles, it computes intermediate values using linear interpolation. For spike, it injects a high-VU window. For step, it increments at each step_duration boundary.

Adjust commands are streamed to agents whenever the target VU count changes.

Metrics Collection

VU goroutine
  └─ executes HTTP request
  └─ creates Sample{scenario, step, duration_ns, status_code, is_error}
  └─ pushes to buffered channel (capacity 10,000)

Aggregator goroutine  (per agent)
  └─ drains channel
  └─ updates HDR histogram (per scenario, per step)
  └─ updates atomic counters (total_reqs, total_errs, active_vus)
  └─ emits Snapshot every 1s → gRPC stream to controller

Controller
  └─ receives Snapshot from each agent
  └─ merges into global HDR histogram
  └─ evaluates thresholds
  └─ streams ProgressUpdate to CLI

Threshold Engine

thresholds:
  - metric: p95          # Evaluated against global or per-scenario snapshot
    condition: "<"       # Supports: < > <= >=
    value: 500           # Milliseconds for latency metrics; ratio (0.0–1.0) for error_ratio
    scenario: "my-sc"    # Optional — scopes check to one scenario only

Threshold evaluation happens:

  1. Continuously during the run (every 1s) — for early abort if desired
  2. Finally when all scenarios complete — determines the exit code
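A comparison engine over those four conditions is small. The sketch below mirrors the YAML fields; it is illustrative, not the repository's actual implementation:

```go
package main

import "fmt"

// Threshold mirrors the YAML threshold fields above.
type Threshold struct {
	Metric    string
	Condition string // one of: "<", ">", "<=", ">="
	Value     float64
}

// evaluate returns true when the actual metric value satisfies the condition.
func (t Threshold) evaluate(actual float64) bool {
	switch t.Condition {
	case "<":
		return actual < t.Value
	case ">":
		return actual > t.Value
	case "<=":
		return actual <= t.Value
	case ">=":
		return actual >= t.Value
	}
	return false // unknown condition: fail closed
}

func main() {
	metrics := map[string]float64{"p95": 487.85, "error_ratio": 0.0}
	checks := []Threshold{
		{"p95", "<", 500},
		{"error_ratio", "<", 0.01},
	}
	pass := true
	for _, c := range checks {
		ok := c.evaluate(metrics[c.Metric])
		fmt.Printf("%s %s %v → %v\n", c.Metric, c.Condition, c.Value, ok)
		pass = pass && ok
	}
	if pass {
		fmt.Println("exit code: 0")
	}
}
```

Failing closed on an unknown condition keeps a typo in the plan from silently passing a gate.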

Project Structure

indus-tester/
├── cmd/
│   ├── agent/           # Agent binary entry point
│   ├── controller/      # Controller binary entry point
│   └── indus-tester/    # CLI entry point
├── internal/
│   ├── agent/           # Agent gRPC server + VU orchestration
│   ├── controller/      # Controller gRPC server + scheduler + aggregation
│   └── proto/           # gRPC codec + generated types
├── pkg/
│   ├── executor/        # HTTP execution engine
│   ├── metrics/         # HDR histogram aggregator + snapshot
│   ├── observability/   # Logger + Prometheus metrics
│   ├── plan/            # YAML parser + plan types
│   ├── report/          # HTML + JSON report generation
│   ├── scheduler/       # VU count schedule computation
│   ├── threshold/       # Pass/fail threshold engine
│   └── vu/              # Virtual user goroutine
├── proto/
│   └── indus.proto      # gRPC service definition
├── examples/
│   ├── basic.yaml
│   ├── ramp-stress.yaml
│   ├── step-load.yaml
│   ├── multi-scenario-spike.yaml
│   └── lb-test.yaml     # Load balancer performance suite
├── docs/
│   ├── architecture.svg
│   ├── load-profiles.svg
│   ├── metrics-pipeline.svg
│   └── sequence-diagram.svg
└── scripts/
    └── mock-server.go   # Mock HTTP target for local testing

Performance Characteristics

| Characteristic | Implementation |
| --- | --- |
| Lock-free metric ingestion | Buffered channels + atomic counters |
| Memory-bounded histograms | HDR histogram with fixed bucket count |
| Connection reuse | HTTP client with keep-alive pool per VU |
| Zero goroutine leaks | Context propagation from CLI → VU |
| Graceful shutdown | SIGTERM → context cancel → drain → exit |
| Backpressure | Aggregator channel drops oldest on overflow |
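The drop-oldest backpressure policy from the table can be sketched with nested non-blocking channel operations. This is an illustrative single-producer version, not the repository's code:

```go
package main

import "fmt"

// offer pushes v into ch; when the buffer is full it evicts the oldest
// element first. Safe only with a single producer — multiple producers
// would need a mutex around the evict-then-send pair.
func offer(ch chan int, v int) {
	select {
	case ch <- v: // fast path: buffer has room
	default:
		select {
		case <-ch: // evict oldest buffered element
		default:
		}
		select {
		case ch <- v: // retry the send
		default:
		}
	}
}

func main() {
	ch := make(chan int, 3)
	for v := 1; v <= 5; v++ {
		offer(ch, v) // 1,2,3 fill the buffer; 4 evicts 1; 5 evicts 2
	}
	close(ch)
	for v := range ch {
		fmt.Print(v, " ")
	}
	fmt.Println()
	// prints: 3 4 5
}
```

Dropping the oldest sample under overload biases the metrics toward recent behavior, which is usually the right trade-off for live progress reporting.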

Development

# Run all tests
make test

# Format code
make fmt

# Vet code
make vet

# Clean build artifacts
make clean

Non-Goals

  • Browser automation → use Playwright / Selenium
  • Custom scripting language → fork and extend in Go
  • Plugin system → fork and extend
  • Built-in web UI → use Grafana + Prometheus
  • Persistent run history → add a database layer

License

MIT — see LICENSE


Contributing

This is a reference implementation. For production hardening, consider:

  • Persistent state — database-backed run storage
  • Controller HA — leader election (etcd / Raft)
  • Dynamic agent discovery — service mesh / Consul
  • Protocol coverage — gRPC, WebSocket, TCP load testing
  • Advanced profiles — custom scripted profiles
