feat: Scalability — SaaS multi-tenant architecture with Azure as first cloud #22

@FrodeHus

Goal

Make PatchHound production-ready for SaaS multi-tenant deployments. Target scale: many tenants, each with up to ~10k devices and ~50k vulnerability records, with the architecture remaining correct and safe under horizontal scaling.

Azure is the first supported cloud. The self-hosted path (docker-compose / on-premises container runtime) remains a first-class supported deployment option alongside Azure, sharing as much application-level code as possible.


Current state

| Area | Current behaviour | Problem |
| --- | --- | --- |
| Ingestion pipeline | GetAllPagesAsync accumulates all pages in a single in-memory List<T> before processing (up to MaxPages = 1000) | Memory spike per large tenant; not safe at scale |
| Worker concurrency safety | Only EnrichmentWorker has a DB-level compare-and-swap lease. IngestionWorker, SlaCheckWorker, WorkflowWorker, ApprovalExpiryWorker, ScanSchedulerWorker, RemediationAiWorker have no distributed lease | Running two worker instances causes double-processing (duplicate ingestion runs, duplicate SLA notifications, double approval expiry) |
| Ingestion dispatch | Ingestion runs are triggered by a 1-minute poll loop in IngestionWorker evaluating every TenantSourceConfiguration cron schedule | All tenant ingestion jobs contend on the same single worker; no backpressure; no dead-letter; no retry on partial failure |
| Redis | StackExchange.Redis is referenced and registered but has zero injection sites; it is wired up but never used | Prepared infrastructure with no active functionality |
| SignalR backplane | AddSignalR() with no backplane. Multi-instance API deployments would silently drop notifications for users connected to a different instance | API cannot be safely scaled horizontally |
| Rate limiting | PartitionedRateLimiter with FixedWindowRateLimiter, in-process only | Rate limit is per-instance, not per-user across the cluster |
| DB migration race | MigrateAsync runs at startup in every API replica with a retry loop | Concurrent replicas race on migration; no distributed lock guards this |
| Observability | Zero Application Insights, OpenTelemetry, or structured telemetry packages. Built-in ILogger<T> only | No distributed tracing, no metrics, no alerting surface |
| IaC | infra/ directory exists with the correct structure (modules/control-plane, modules/shared, modules/stamp, stacks/azure, stacks/selfhosted, params/dev, params/prod) but all directories are empty | No deployable infrastructure definition |

Scope

This issue covers five work streams that should be implemented together or in close sequence:


1. Ingestion pipeline — Azure Service Bus + streaming

Replace the in-process poll-and-accumulate ingestion model with a queue-backed streaming model.

Queue dispatch:

  • IngestionWorker cron evaluation remains but instead of running ingestion inline, it enqueues a message to Azure Service Bus for each due TenantSourceConfiguration
  • Message payload: { TenantId, SourceConfigId, TriggeredAt, ManualRequest: bool }
  • Separate worker (or competing consumers on the same worker) dequeues messages and calls IngestionService.RunIngestionAsync per message
  • Dead-letter queue for failed ingestion runs; retry count configurable per tenant
  • Self-hosted path: use an in-process System.Threading.Channels.Channel<T> as the queue abstraction with the same interface, so the application code is identical and only the DI registration differs

Streaming ingestion:

  • Replace GetAllPagesAsync (src/PatchHound.Infrastructure/VulnerabilitySources/DefenderApiClient.cs:260) with IAsyncEnumerable<TItem> streaming — yield each page's items as they arrive
  • IVulnerabilityBatchSource / IAssetInventoryBatchSource already exist as cursor-based batch interfaces; the full-accumulation mode in FetchAssetBatchAsync (DefenderVulnerabilitySource.cs:252) should be replaced with the batch interface
  • Ingestion checkpoints (IngestionCheckpoint entity) should be persisted after each page/batch so a failed run can resume rather than restart from the beginning
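The streaming and checkpointing bullets above can be sketched roughly as follows. This is a minimal illustration, not the actual DefenderApiClient code: `Page<T>`, `PagedStream`, `StreamItemsAsync`, the `fetchPage` delegate (standing in for the HTTP call), and the `saveCheckpoint` delegate (standing in for persisting an IngestionCheckpoint) are all hypothetical names, and the cursor is assumed to be a string.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical page shape: the items of one page plus the cursor of the next page (null on the last page).
public sealed record Page<T>(IReadOnlyList<T> Items, string? NextCursor);

public static class PagedStream
{
    // Yields items page by page, so only the current page is held in memory.
    public static async IAsyncEnumerable<T> StreamItemsAsync<T>(
        Func<string?, CancellationToken, Task<Page<T>>> fetchPage,  // stand-in for the HTTP call
        Func<string?, CancellationToken, Task> saveCheckpoint,      // stand-in for persisting IngestionCheckpoint
        string? resumeCursor,
        int maxPages,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        var cursor = resumeCursor;
        for (var pageNumber = 0; pageNumber < maxPages; pageNumber++)
        {
            var page = await fetchPage(cursor, ct);
            foreach (var item in page.Items)
            {
                yield return item;
            }

            // Execution only reaches this point after the consumer has asked for more,
            // i.e. after it finished with the last item of this page.
            cursor = page.NextCursor;
            await saveCheckpoint(cursor, ct); // a null cursor marks the run as complete
            if (cursor is null)
            {
                yield break;
            }
        }
    }
}
```

A consumer would iterate with `await foreach` and pass the last saved cursor as `resumeCursor` when restarting a failed run. Because the checkpoint is written after a page has been consumed, a crash mid-page replays that page, so downstream processing is assumed to be idempotent.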

Interface:

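// Abstracts over Azure Service Bus (Azure path) and an in-process channel (self-hosted path)
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public interface IIngestionJobQueue
{
    // Enqueues one ingestion job for a due TenantSourceConfiguration.
    Task EnqueueAsync(IngestionJobMessage message, CancellationToken ct);

    // Streams queued jobs to a consumer until the token is cancelled.
    IAsyncEnumerable<IngestionJobMessage> DequeueAsync(CancellationToken ct);
}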
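For the self-hosted path, an implementation of this interface over System.Threading.Channels might look like the sketch below. The `IngestionJobMessage` record is an assumption built from the payload fields listed earlier (the `Guid` ID types are a guess), the interface is repeated only so the example compiles on its own, and the default capacity of 100 is arbitrary.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Assumed message shape; field names follow the payload described in this issue.
public sealed record IngestionJobMessage(Guid TenantId, Guid SourceConfigId, DateTimeOffset TriggeredAt, bool ManualRequest);

public interface IIngestionJobQueue
{
    Task EnqueueAsync(IngestionJobMessage message, CancellationToken ct);
    IAsyncEnumerable<IngestionJobMessage> DequeueAsync(CancellationToken ct);
}

public sealed class InProcessIngestionJobQueue : IIngestionJobQueue
{
    private readonly Channel<IngestionJobMessage> _channel;

    public InProcessIngestionJobQueue(int capacity = 100)
    {
        // Bounded channel in Wait mode: EnqueueAsync suspends while the queue is full,
        // which gives the dispatcher simple backpressure.
        _channel = Channel.CreateBounded<IngestionJobMessage>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.Wait,
            SingleReader = false, // allow competing consumers
            SingleWriter = false,
        });
    }

    public Task EnqueueAsync(IngestionJobMessage message, CancellationToken ct) =>
        _channel.Writer.WriteAsync(message, ct).AsTask();

    public IAsyncEnumerable<IngestionJobMessage> DequeueAsync(CancellationToken ct) =>
        _channel.Reader.ReadAllAsync(ct);
}
```

Note that an in-process channel is not durable: queued messages are lost when the process exits, and there is no built-in dead-letter queue, so retry and dead-letter handling for the self-hosted path would have to live in the consumer.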

Affected files:

  • src/PatchHound.Infrastructure/VulnerabilitySources/DefenderApiClient.cs: GetAllPagesAsync (line 260)
  • src/PatchHound.Infrastructure/VulnerabilitySources/DefenderVulnerabilitySource.cs: FetchAssetBatchAsync (line 252)
  • src/PatchHound.Worker/IngestionWorker.cs — replace inline execution with enqueue
  • src/PatchHound.Infrastructure/Services/IngestionService.cs — Phase 3 rewrite target (already a no-op stub)
  • New: src/PatchHound.Infrastructure/Queues/IIngestionJobQueue.cs
  • New: src/PatchHound.Infrastructure/Queues/ServiceBusIngestionJobQueue.cs (Azure)
  • New: src/PatchHound.Infrastructure/Queues/InProcessIngestionJobQueue.cs (self-hosted)

2. Worker concurrency safety — DB lease extension

Extend the DB-level compare-and-swap lease pattern from EnrichmentWorker to all workers that currently have no concurrency guard.

The EnrichmentWorker pattern (src/PatchHound.Worker/EnrichmentWorker.cs:415):

  • Conditional ExecuteUpdateAsync where LeaseId IS NULL OR LeaseExpiresAt < NOW()
  • Lease owner: $"{Environment.MachineName}:{Environment.ProcessId}"
  • Lease duration: configurable per worker type
  • Released in a finally block
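The acquire and release rules listed above can be modelled in memory as follows. This is a sketch of the semantics only, not the EnrichmentWorker code: `WorkerLease` and `InMemoryWorkerLeaseStore` are hypothetical names, and the dictionary's TryAdd/TryUpdate calls stand in for the conditional database UPDATE, which a real implementation would issue against PostgreSQL.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical row shape for a lease record.
public sealed record WorkerLease(Guid LeaseId, string LeaseOwner, DateTimeOffset LeaseExpiresAt);

// In-memory stand-in for the lease storage. TryAdd/TryUpdate play the role of
// UPDATE ... WHERE LeaseId IS NULL OR LeaseExpiresAt < NOW().
public sealed class InMemoryWorkerLeaseStore
{
    private readonly ConcurrentDictionary<string, WorkerLease> _rows = new();

    // Returns the new lease, or null when another owner holds an unexpired lease.
    public WorkerLease? TryAcquire(string workerType, string owner, TimeSpan duration, DateTimeOffset now)
    {
        var candidate = new WorkerLease(Guid.NewGuid(), owner, now + duration);

        if (!_rows.TryGetValue(workerType, out var current))
        {
            // No lease yet: only one concurrent caller wins the insert.
            return _rows.TryAdd(workerType, candidate) ? candidate : null;
        }

        if (current.LeaseExpiresAt >= now)
        {
            return null; // held and not yet expired
        }

        // Expired: swap only if the row is still the one we inspected (compare-and-swap).
        return _rows.TryUpdate(workerType, candidate, current) ? candidate : null;
    }

    // Releases only the caller's own lease, so a late release cannot drop
    // a lease that another worker took over after expiry.
    public bool Release(string workerType, Guid leaseId)
    {
        return _rows.TryGetValue(workerType, out var current)
            && current.LeaseId == leaseId
            && _rows.TryRemove(new KeyValuePair<string, WorkerLease>(workerType, current));
    }
}
```

A worker would call `TryAcquire` with the owner string `$"{Environment.MachineName}:{Environment.ProcessId}"`, skip the cycle when the result is null, and call `Release` in a `finally` block. Checking the lease ID on release matters when a job outlives its lease duration.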

Workers requiring lease protection:

| Worker | Poll interval | Idempotency risk | Lease duration |
| --- | --- | --- | --- |
| IngestionWorker | 1 min | Duplicate ingestion run per source | 10 min |
| SlaCheckWorker | 1 hr | Duplicate SLA notifications | 2 hr |
| WorkflowWorker | 1 min | Double-processing of workflow timeout transitions | 5 min |
| ApprovalExpiryWorker | 15 min | Double auto-denial of expired approvals | 30 min |
| ScanSchedulerWorker | 1 min | Duplicate scan scheduling | 5 min |
| RemediationAiWorker | 10 s | RemediationAiJob picked by two workers simultaneously (no optimistic concurrency guard) | 2 min |

RemediationAiWorker additional fix: Add EF Core optimistic concurrency ([Timestamp] / rowversion) or a status CAS (UPDATE ... WHERE Status = 'Pending' RETURNING id) on RemediationAiJob so that claiming a job is atomic and two workers can never claim the same job.

DB schema change: Add a new WorkerLease table (keyed by worker type name) with LeaseId, LeaseOwner, and LeaseExpiresAt columns, rather than adding lease columns to every domain table. Wrap the acquire/release pattern in a WorkerLeaseService so that each worker does not reimplement it.

Affected files:

  • src/PatchHound.Worker/IngestionWorker.cs
  • src/PatchHound.Worker/SlaCheckWorker.cs
  • src/PatchHound.Worker/WorkflowWorker.cs
  • src/PatchHound.Worker/ApprovalExpiryWorker.cs
  • src/PatchHound.Worker/ScanSchedulerWorker.cs
  • src/PatchHound.Worker/RemediationAiWorker.cs
  • New: src/PatchHound.Infrastructure/Services/WorkerLeaseService.cs
  • New migration: AddWorkerLeaseTable

3. Horizontal API scaling fixes

SignalR backplane

  • Azure path: replace AddSignalR() with AddSignalR().AddAzureSignalR(connectionString) or .AddStackExchangeRedis(redisConnectionString) in src/PatchHound.Api/Program.cs (line 427)
  • Self-hosted path: use AddStackExchangeRedis with the existing Redis connection string
  • Redis is already registered in DI (DependencyInjection.cs:28); promote it to a required dependency for both paths once the backplane is wired up

Distributed rate limiting

  • Replace System.Threading.RateLimiting.FixedWindowRateLimiter (in-process) with a Redis-backed implementation (e.g. using StackExchange.Redis with a sliding window Lua script, or the RedisRateLimiting library)
  • Self-hosted path: same Redis instance used for the backplane

DB migration race on startup

  • Guard MigrateAsync with a distributed lock before applying migrations
  • Azure path: use an Azure Storage blob lease or a simple SELECT pg_try_advisory_lock(constant) PostgreSQL advisory lock (no extra dependency) to ensure only one replica migrates at a time
  • Both paths can use the PostgreSQL advisory lock approach since Postgres is shared
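A possible shape for the advisory-lock guard is sketched below. It deliberately uses the blocking `pg_advisory_lock` rather than the `pg_try_advisory_lock` mentioned above, so that replicas which lose the race wait until the migrating replica is done instead of starting against an unmigrated schema. `MigrationLock`, `RunExclusiveAsync`, the lock key value, and the `executeSql` delegate are hypothetical; the delegate is assumed to run every statement on one open PostgreSQL connection.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public static class MigrationLock
{
    // Arbitrary application-chosen key (assumption); every replica must use the same value.
    public const long MigrationLockKey = 7224001;

    // executeSql must run each statement on the SAME open connection, because
    // session-level advisory locks belong to the session that acquired them.
    public static async Task RunExclusiveAsync(
        Func<string, CancellationToken, Task> executeSql,
        Func<CancellationToken, Task> migrate,
        CancellationToken ct = default)
    {
        // Blocks until the lock is free, so late replicas wait for the migrator to finish.
        await executeSql($"SELECT pg_advisory_lock({MigrationLockKey})", ct);
        try
        {
            await migrate(ct);
        }
        finally
        {
            // Release even if the migration throws or the token is cancelled.
            await executeSql($"SELECT pg_advisory_unlock({MigrationLockKey})", CancellationToken.None);
        }
    }
}
```

In Program.cs the `migrate` delegate would wrap the existing MigrateAsync call. A replica that acquires the lock after another has finished finds no pending migrations, so its own MigrateAsync call is effectively a no-op. If the connection drops, PostgreSQL releases session-level advisory locks automatically.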

Affected files:

  • src/PatchHound.Api/Program.cs: lines 427, 431–450, 470–480
  • src/PatchHound.Infrastructure/DependencyInjection.cs: line 28

4. Observability

Add structured telemetry as a first-class concern — required before any production cloud deployment.

  • OpenTelemetry SDK (OpenTelemetry.Extensions.Hosting, OpenTelemetry.Instrumentation.AspNetCore, OpenTelemetry.Instrumentation.EntityFrameworkCore, OpenTelemetry.Instrumentation.Http)
  • Azure path: Export to Azure Monitor (Azure.Monitor.OpenTelemetry.AspNetCore) — single package wires up traces, metrics, and logs to Application Insights
  • Self-hosted path: Export to OTLP endpoint (Jaeger / Grafana / etc.) via OpenTelemetry.Exporter.OpenTelemetryProtocol
  • Key traces to add: ingestion run per source, per-tenant vulnerability reconciliation, workflow stage transitions, enrichment job execution
  • Metrics: ingestion job queue depth, jobs processed/failed per minute, p99 DB query latency, active workflow count per tenant
  • Configuration key: Telemetry:Endpoint (OTLP) or ApplicationInsights:ConnectionString (Azure)
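As a starting point for the traces and metrics listed above, the proposed PatchHoundActivitySource could be shaped like the sketch below. It uses only the built-in System.Diagnostics APIs that the OpenTelemetry SDK listens to. The source/meter name, the span name, the instrument names, and the tag keys are all assumptions chosen for illustration.

```csharp
#nullable enable
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class PatchHoundActivitySource
{
    // Assumed name; the OpenTelemetry registration has to subscribe to the same
    // name via AddSource(...) for traces and AddMeter(...) for metrics.
    public const string Name = "PatchHound";

    public static readonly ActivitySource Source = new(Name);
    public static readonly Meter PatchHoundMeter = new(Name);

    // Hypothetical instrument names for the "jobs processed/failed" metrics.
    public static readonly Counter<long> IngestionJobsProcessed =
        PatchHoundMeter.CreateCounter<long>("patchhound.ingestion.jobs.processed");
    public static readonly Counter<long> IngestionJobsFailed =
        PatchHoundMeter.CreateCounter<long>("patchhound.ingestion.jobs.failed");

    // Starts a span for one ingestion run. Returns null when nothing is listening,
    // so callers should dispose it with "using var" and use the ?. operator.
    public static Activity? StartIngestionRun(string tenantId, string sourceConfigId)
    {
        var activity = Source.StartActivity("ingestion.run", ActivityKind.Internal);
        activity?.SetTag("patchhound.tenant_id", tenantId);
        activity?.SetTag("patchhound.source_config_id", sourceConfigId);
        return activity;
    }
}
```

An ingestion consumer would wrap each run in `using var activity = PatchHoundActivitySource.StartIngestionRun(...)` and call `IngestionJobsProcessed.Add(1)` or `IngestionJobsFailed.Add(1)` when the run ends. Keeping tenant IDs as span tags rather than metric dimensions avoids high-cardinality metrics as the tenant count grows.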

Affected files:

  • All .csproj files — add OpenTelemetry packages
  • src/PatchHound.Api/Program.cs: AddOpenTelemetry() registration
  • src/PatchHound.Worker/Program.cs — same
  • New: src/PatchHound.Infrastructure/Telemetry/PatchHoundActivitySource.cs

5. Azure IaC — Bicep

Fill the infra/ directory with a deployable Bicep structure. The existing directory layout maps naturally to Bicep modules:

infra/
  modules/
    control-plane/    # Azure Container Registry, Key Vault (if not using OpenBao), Log Analytics workspace
    shared/           # PostgreSQL Flexible Server, Azure Cache for Redis, Azure Service Bus namespace
    stamp/            # Container Apps environment, API container app, Worker container app, Managed Identity
  stacks/
    azure/            # main.bicep — composes control-plane + shared + stamp
    selfhosted/       # docker-compose.yml already exists; document it here
  params/
    dev/              # dev.bicepparam
    prod/             # prod.bicepparam

Target Azure services:

| Component | Azure service | Notes |
| --- | --- | --- |
| API | Azure Container Apps | Auto-scaling on HTTP concurrency; HTTPS ingress; min replicas = 1 |
| Worker | Azure Container Apps (internal) | Scale on Service Bus queue depth (KEDA rule); min replicas = 1 |
| Database | Azure Database for PostgreSQL Flexible Server | General-purpose tier, zone-redundant HA for prod |
| Cache / backplane | Azure Cache for Redis | Basic SKU for dev, Standard for prod |
| Ingestion queue | Azure Service Bus | Standard tier; one namespace, one queue per deployment |
| Secrets | Azure Key Vault (production) or OpenBao (self-hosted / dev) | MSAL managed identity auth for Key Vault |
| Container registry | Azure Container Registry | Geo-replicated for multi-region prod |
| Observability | Azure Monitor + Application Insights workspace | Connected to Log Analytics workspace |
| Identity | User-assigned Managed Identity | Assigned to both Container Apps; used for ACR pull, Key Vault, Service Bus access |

Authentication between PatchHound and Azure services:

  • API and Worker use a user-assigned managed identity — no connection string secrets for Azure-managed services
  • Azure.Identity.DefaultAzureCredential already referenced (Azure.Identity 1.18.0 in PatchHound.Infrastructure.csproj:12) — use ManagedIdentityCredential in prod, DefaultAzureCredential in dev

Self-hosted stack:

  • infra/stacks/selfhosted/ documents the docker-compose deployment (existing docker-compose.yml)
  • Service Bus replaced by in-process Channel via InProcessIngestionJobQueue
  • Redis is already in docker-compose
  • OpenBao remains as the secrets backend

Acceptance criteria

Ingestion

  • Ingestion jobs are enqueued to Azure Service Bus (Azure path) or in-process Channel (self-hosted path) when their cron schedule is due
  • A failed ingestion job is dead-lettered after N retries and does not block subsequent jobs for the same or other tenants
  • GetAllPagesAsync is replaced with IAsyncEnumerable<T> streaming — at no point is the full page list held in memory
  • Ingestion checkpoints survive a worker restart; the next run resumes from the last saved cursor position
  • A tenant with 10k devices and 50k vulnerabilities can be ingested end-to-end without OOM

Worker safety

  • Running two worker instances simultaneously does not produce duplicate ingestion runs, duplicate SLA notifications, or double-processed workflow transitions
  • RemediationAiJob cannot be claimed by more than one worker: the claim is a single atomic operation
  • Lease expiry and release are visible in logs

API horizontal scaling

  • Two API instances behind a load balancer deliver SignalR notifications correctly regardless of which instance a user is connected to
  • Rate limiting is enforced per user across the cluster (not per-instance)
  • DB migrations are applied by exactly one replica on startup

Observability

  • Distributed traces are emitted for all ingestion runs, enrichment jobs, and workflow transitions
  • A /metrics endpoint (or Azure Monitor workspace) shows ingestion queue depth, jobs/min, p99 DB latency
  • Application Insights (Azure) or OTLP exporter (self-hosted) is wired up and receiving data

IaC

  • bicep build infra/stacks/azure/main.bicep succeeds with no errors
  • az deployment sub create with dev params deploys a working environment to Azure
  • Managed identity is used for all Azure service access (no secrets in environment variables for Azure-managed services)
  • docker-compose up (self-hosted) continues to work end-to-end after all application changes

Out of scope

  • Read replicas / DB horizontal scaling — not required at the SaaS multi-tenant target scale; re-evaluate if p99 query latency exceeds SLA at >50 active tenants
  • Multi-region / geo-distribution — not in scope for the first Azure stack
  • DB sharding / per-tenant schemas — current single-schema multi-tenant design is sufficient at target scale with proper indexing

Dependencies


Reference files

  • src/PatchHound.Worker/IngestionWorker.cs — current cron-poll dispatcher
  • src/PatchHound.Infrastructure/VulnerabilitySources/DefenderApiClient.cs: GetAllPagesAsync (line 260)
  • src/PatchHound.Infrastructure/VulnerabilitySources/DefenderVulnerabilitySource.cs: FetchAssetBatchAsync (line 252)
  • src/PatchHound.Worker/EnrichmentWorker.cs — reference lease pattern (line 415)
  • src/PatchHound.Api/Program.cs — SignalR (line 427), rate limiter (line 431), migration (line 470)
  • src/PatchHound.Infrastructure/DependencyInjection.cs — Redis registration (line 28)
  • infra/ — empty directory tree to be filled
