[FEAT] Add /health endpoint with dependency status checks and structured JSON response #444

@samarthsugandhi

Description

Summary

SecuScan's backend exposes a health route (GET /api/v1/health), but
it returns a minimal response with no dependency status checks. This
feature upgrades the endpoint to return a structured JSON response
that reports the status of all critical dependencies (database,
cache, plugin registry), with latency measurements for the database
and cache. That makes it usable as a Docker/compose healthcheck, a
load balancer probe, and a developer diagnostics tool.

Problem

The current /health endpoint returns a static {"status": "ok"}
regardless of the actual state of the system. This means:

  • docker compose healthchecks pass even when the database is
    unreachable or the plugin registry failed to initialise
  • Developers have no fast way to confirm all subsystems are
    operational without tailing logs
  • There is no structured way for monitoring tools or CI pipelines
    to assert that a deployed SecuScan instance is fully functional
  • The endpoint does not surface partial degradation — a working
    API with a broken cache is indistinguishable from a fully healthy
    instance

Proposed Solution

Backend (backend/secuscan/routes.py) — modify:

Upgrade GET /api/v1/health to perform active dependency probes
and return a structured response:

{
  "status": "healthy" | "degraded" | "unhealthy",
  "version": "4.5.3-BETA",
  "uptime_seconds": 3412,
  "checks": {
    "database": {
      "status": "ok" | "error",
      "latency_ms": 2.4,
      "detail": null | "<error message>"
    },
    "cache": {
      "status": "ok" | "error",
      "latency_ms": 0.8,
      "detail": null | "<error message>"
    },
    "plugins": {
      "status": "ok" | "error",
      "total": 12,
      "runnable": 10,
      "detail": null | "<error message>"
    }
  }
}

Status rules:

  • healthy — all checks pass
  • degraded — cache or plugins check fails but database is ok
  • unhealthy — database check fails

HTTP status codes:

  • 200 for healthy and degraded
  • 503 for unhealthy
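
The status rules and HTTP codes above can be sketched as one pure function. This is only an illustration of the proposed mapping: the function name overall_status is hypothetical, and the input is assumed to have the same shape as the checks object in the response.

```python
from typing import Any, Dict, Tuple


def overall_status(checks: Dict[str, Dict[str, Any]]) -> Tuple[str, int]:
    """Map per-dependency results to (overall status, HTTP status code)."""
    # A failed database check makes the whole instance unhealthy.
    if checks["database"]["status"] != "ok":
        return "unhealthy", 503
    # Any other failed check (cache, plugins) leaves the API serving requests.
    if any(check["status"] != "ok" for check in checks.values()):
        return "degraded", 200
    return "healthy", 200
```

Keeping the mapping free of I/O means the four cases listed in the acceptance criteria can be unit-tested without a database or cache.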

Implementation:

  • Database probe: SELECT 1 with a 2s timeout, measure latency
  • Cache probe: cache.ping() or equivalent with a 1s timeout
  • Plugin probe: call plugin_manager.list_plugins() and count
    runnable vs total — no timeout needed (in-memory)
  • All probes run concurrently via asyncio.gather(return_exceptions=True)
  • A startup timestamp stored in main.py is used to compute
    uptime_seconds
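
A minimal sketch of the concurrent probing described above, assuming each probe is exposed as a zero-argument coroutine function. The names timed_probe and run_probes are hypothetical and not part of the existing codebase; the real probes would wrap SELECT 1, the cache ping, and plugin_manager.list_plugins().

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, Tuple

Probe = Callable[[], Awaitable[Any]]


async def timed_probe(probe: Probe, timeout_s: float) -> Dict[str, Any]:
    """Run one probe under a timeout; report status, latency, and detail."""
    start = time.perf_counter()
    try:
        await asyncio.wait_for(probe(), timeout=timeout_s)
        status, detail = "ok", None
    except asyncio.TimeoutError:
        status, detail = "error", f"timed out after {timeout_s}s"
    except Exception as exc:
        status, detail = "error", str(exc) or type(exc).__name__
    latency_ms = round((time.perf_counter() - start) * 1000, 1)
    return {"status": status, "latency_ms": latency_ms, "detail": detail}


async def run_probes(probes: Dict[str, Tuple[Probe, float]]) -> Dict[str, Dict[str, Any]]:
    """Run all probes concurrently and key the results by dependency name."""
    names = list(probes)
    results = await asyncio.gather(
        *(timed_probe(probe, timeout_s) for probe, timeout_s in probes.values()),
        return_exceptions=True,
    )
    checks: Dict[str, Dict[str, Any]] = {}
    for name, result in zip(names, results):
        # With return_exceptions=True, an unexpected failure comes back as a value.
        if isinstance(result, BaseException):
            result = {"status": "error", "latency_ms": None, "detail": repr(result)}
        checks[name] = result
    return checks
```

Because the probes run concurrently, the slowest probe's timeout (2s for the database in this proposal) bounds the total response time, which stays inside the 5s compose healthcheck timeout.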

docker-compose.yml — modify:

Update the api service healthcheck to use the new endpoint:

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8081/api/v1/health"]
  interval: 15s
  timeout: 5s
  retries: 3
  start_period: 10s

backend/secuscan/main.py — modify:

Store _start_time = time.time() at application startup and
expose it for the health route to compute uptime_seconds.

No frontend changes required.

Suggested Scope

  • Suggested files or directories:

    • backend/secuscan/routes.py — upgrade GET /health route
    • backend/secuscan/main.py — add startup timestamp
    • docker-compose.yml — update api healthcheck
    • backend/tests/test_health.py (new) — unit tests
  • Related route, page, component, API, or plugin:

    • GET /api/v1/health (existing, upgrade)
    • database.py: get_db() used for DB probe
    • cache.py: get_cache() used for cache probe
    • plugins.py: get_plugin_manager() for plugin probe

Acceptance Criteria

  • GET /api/v1/health returns structured JSON with status,
    version, uptime_seconds, and checks fields
  • Database probe executes SELECT 1 and records latency in ms
  • Cache probe pings the cache and records latency in ms
  • Plugin probe counts total and runnable plugins
  • All three probes run concurrently via asyncio.gather
  • Response is 200 when status is healthy or degraded
  • Response is 503 when status is unhealthy (DB probe fails)
  • uptime_seconds reflects actual time since application start
  • docker-compose.yml healthcheck uses the new endpoint
  • Unit tests cover: all-healthy, DB-down (unhealthy),
    cache-down (degraded), plugin-error (degraded)

Test Plan

  1. Start the full stack with docker compose up — confirm
    docker inspect shows the api container as healthy
  2. Call GET /api/v1/health — confirm structured JSON response
    with all checks passing and status: healthy
  3. Stop the database container — call the endpoint again and
    confirm status: unhealthy and HTTP 503
  4. Stop only the Redis container — confirm status: degraded
    and HTTP 200 with checks.cache.status: error
  5. Run test_health.py — all unit tests pass
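
For steps 2 to 4, the "structured JSON response" check could be automated with a small shape validator along these lines. It is a sketch only: health_payload_problems is a hypothetical helper, and the required fields are taken from the response proposed in this issue.

```python
from typing import Any, Dict, List

REQUIRED_TOP_LEVEL = ("status", "version", "uptime_seconds", "checks")
REQUIRED_CHECKS = ("database", "cache", "plugins")
ALLOWED_STATUS = ("healthy", "degraded", "unhealthy")


def health_payload_problems(payload: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the shape matches."""
    problems = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in payload]
    if payload.get("status") not in ALLOWED_STATUS:
        problems.append(f"unexpected status: {payload.get('status')!r}")
    checks = payload.get("checks") or {}
    for name in REQUIRED_CHECKS:
        check = checks.get(name)
        if check is None:
            problems.append(f"missing check: {name}")
        elif check.get("status") not in ("ok", "error"):
            problems.append(f"unexpected {name} status: {check.get('status')!r}")
    return problems
```

A manual or CI smoke test could decode the endpoint's JSON body and assert that this function returns an empty list before checking the specific status expected for each step.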

Alternatives Considered

  • Using a third-party health-check library (e.g. fastapi-health):
    Adds a dependency for functionality that is straightforwardly
    implementable with asyncio.gather and the existing database/cache
    abstractions already in the codebase.
  • Separate /readiness and /liveness endpoints (Kubernetes style):
    Out of scope for a local-first tool. A single /health endpoint
    with a structured checks object is sufficient and simpler for
    compose and developer use.
  • Returning 503 for degraded state: Degraded means the API is
    still serving requests — returning 503 would cause load balancers
    to stop routing traffic unnecessarily. Only a DB failure warrants
    503 since the API cannot function without it.

Additional Context

  • This directly unblocks the Docker fix in #381 ([BUG] Backend
    Dockerfile uses inconsistent build context and app module): a
    correct healthcheck requires a reliable health endpoint to probe
  • The structured response is designed to be forward-compatible:
    new dependency checks (e.g. vault, workflows scheduler) can
    be added to the checks object without breaking existing
    consumers
  • Suggested labels: level:advanced, type:feature,
    type:devops, area:backend
