Goal
Make PatchHound production-ready for SaaS multi-tenant deployments. Target scale: many tenants, each with up to ~10k devices and ~50k vulnerability records, with the architecture remaining correct and safe under horizontal scaling.
Azure is the first supported cloud. The self-hosted path (docker-compose / on-premises container runtime) remains a first-class supported deployment option alongside Azure, sharing as much application-level code as possible.
Current state
| Area | Current behaviour | Problem |
| --- | --- | --- |
| Ingestion pipeline | GetAllPagesAsync accumulates all pages in a single in-memory List<T> before processing (up to MaxPages = 1000) | Memory spike per large tenant; not safe at scale |
| Worker concurrency safety | Only EnrichmentWorker has a DB-level compare-and-swap lease. IngestionWorker, SlaCheckWorker, WorkflowWorker, ApprovalExpiryWorker, ScanSchedulerWorker, RemediationAiWorker have no distributed lease | Running two worker instances causes double-processing (duplicate ingestion runs, duplicate SLA notifications, double approval expiry) |
| Ingestion dispatch | Ingestion runs are triggered by a 1-minute poll loop in IngestionWorker evaluating every TenantSourceConfiguration cron schedule | All tenant ingestion jobs contend on the same single worker; no backpressure; no dead-letter; no retry on partial failure |
| Redis | StackExchange.Redis is referenced and registered but has zero injection sites; it is wired up but never used | Prepared infrastructure with no active functionality |
| SignalR backplane | AddSignalR() with no backplane. Multi-instance API deployments would silently drop notifications for users connected to a different instance | API cannot be safely scaled horizontally |
| Rate limiting | PartitionedRateLimiter with FixedWindowRateLimiter (in-process only) | Rate limit is per-instance, not per-user across the cluster |
| DB migration race | MigrateAsync runs at startup in every API replica with a retry loop | Concurrent replicas race on migration; no distributed lock guards this |
| Observability | Zero Application Insights, OpenTelemetry, or structured telemetry packages. Built-in ILogger<T> only | No distributed tracing, no metrics, no alerting surface |
| IaC | infra/ directory exists with the correct structure (modules/control-plane, modules/shared, modules/stamp, stacks/azure, stacks/selfhosted, params/dev, params/prod) but all directories are empty | No deployable infrastructure definition |
Scope
This issue covers five work streams that should be implemented together or in close sequence:
1. Ingestion pipeline — Azure Service Bus + streaming
Replace the in-process poll-and-accumulate ingestion model with a queue-backed streaming model.
Queue dispatch:
- IngestionWorker cron evaluation remains, but instead of running ingestion inline it enqueues a message to Azure Service Bus for each due TenantSourceConfiguration
- Message payload: { TenantId, SourceConfigId, TriggeredAt, ManualRequest: bool }
- A separate worker (or competing consumers on the same worker) dequeues messages and calls IngestionService.RunIngestionAsync per message
- Dead-letter queue for failed ingestion runs; retry count configurable per tenant
- Self-hosted path: use an in-process System.Threading.Channels.Channel<T> as the queue abstraction with the same interface, so the application code is identical and only the DI registration differs
Streaming ingestion:
- Replace GetAllPagesAsync (src/PatchHound.Infrastructure/VulnerabilitySources/DefenderApiClient.cs:260) with IAsyncEnumerable<TItem> streaming: yield each page's items as they arrive
- IVulnerabilityBatchSource / IAssetInventoryBatchSource already exist as cursor-based batch interfaces; the full-accumulation mode in FetchAssetBatchAsync (DefenderVulnerabilitySource.cs:252) should be replaced with the batch interface
- Ingestion checkpoints (IngestionCheckpoint entity) should be persisted after each page/batch so a failed run can resume rather than restart from the beginning
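The streaming and checkpointing bullets above can be sketched as follows. This is a minimal illustration, not the existing DefenderApiClient code: `Page<TItem>`, `PagedStreaming`, and the `fetchPage` / `saveCheckpoint` delegates are hypothetical names, and a string cursor is assumed as the paging mechanism.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical page shape: the items of one page plus the cursor of the next page (null when done).
public sealed record Page<TItem>(IReadOnlyList<TItem> Items, string? NextCursor);

public static class PagedStreaming
{
    // Yields items page by page, so only the current page is held in memory.
    public static async IAsyncEnumerable<TItem> StreamAllPagesAsync<TItem>(
        Func<string?, CancellationToken, Task<Page<TItem>>> fetchPage,
        Func<string, CancellationToken, Task> saveCheckpoint,
        string? resumeCursor = null,
        int maxPages = 1000,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        var cursor = resumeCursor;
        for (var pageIndex = 0; pageIndex < maxPages; pageIndex++)
        {
            ct.ThrowIfCancellationRequested();
            var page = await fetchPage(cursor, ct);

            foreach (var item in page.Items)
                yield return item;

            // Execution resumes here only when the consumer asks for the item
            // after the last one of this page, i.e. once the page has been consumed.
            if (page.NextCursor is null)
                yield break;

            cursor = page.NextCursor;
            await saveCheckpoint(cursor, ct); // a resumed run would start from this cursor
        }
    }
}
```

A consumer would iterate with `await foreach` and pass the last persisted cursor as `resumeCursor` when retrying a failed run. Whether the checkpoint is written per page or per batch is a design choice the issue leaves open.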
Interface:
    // Abstract over Service Bus vs in-process channel
    public interface IIngestionJobQueue
    {
        Task EnqueueAsync(IngestionJobMessage message, CancellationToken ct);
        IAsyncEnumerable<IngestionJobMessage> DequeueAsync(CancellationToken ct);
    }
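For the self-hosted path, an in-process implementation of this interface could look roughly like the sketch below. The `IngestionJobMessage` record is an assumption derived from the payload listed under Queue dispatch (the `Guid` id types and the default capacity are guesses), and the interface is repeated only to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Assumed message shape; field names follow the payload in the issue, types are guesses.
public sealed record IngestionJobMessage(
    Guid TenantId, Guid SourceConfigId, DateTimeOffset TriggeredAt, bool ManualRequest);

public interface IIngestionJobQueue
{
    Task EnqueueAsync(IngestionJobMessage message, CancellationToken ct);
    IAsyncEnumerable<IngestionJobMessage> DequeueAsync(CancellationToken ct);
}

public sealed class InProcessIngestionJobQueue : IIngestionJobQueue
{
    private readonly Channel<IngestionJobMessage> _channel;

    public InProcessIngestionJobQueue(int capacity = 256)
    {
        _channel = Channel.CreateBounded<IngestionJobMessage>(new BoundedChannelOptions(capacity)
        {
            // Producers wait when the queue is full, which gives simple backpressure.
            FullMode = BoundedChannelFullMode.Wait,
            SingleReader = false, // allow competing consumers
            SingleWriter = false,
        });
    }

    public Task EnqueueAsync(IngestionJobMessage message, CancellationToken ct)
        => _channel.Writer.WriteAsync(message, ct).AsTask();

    public IAsyncEnumerable<IngestionJobMessage> DequeueAsync(CancellationToken ct)
        => _channel.Reader.ReadAllAsync(ct);
}
```

Unlike Service Bus, a channel is not durable and has no dead-letter queue, so retry and failure handling on the self-hosted path would have to live in the consumer.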
Affected files:
- src/PatchHound.Infrastructure/VulnerabilitySources/DefenderApiClient.cs: GetAllPagesAsync (line 260)
- src/PatchHound.Infrastructure/VulnerabilitySources/DefenderVulnerabilitySource.cs: FetchAssetBatchAsync (line 252)
- src/PatchHound.Worker/IngestionWorker.cs: replace inline execution with enqueue
- src/PatchHound.Infrastructure/Services/IngestionService.cs: Phase 3 rewrite target (already a no-op stub)
- New: src/PatchHound.Infrastructure/Queues/IIngestionJobQueue.cs
- New: src/PatchHound.Infrastructure/Queues/ServiceBusIngestionJobQueue.cs (Azure)
- New: src/PatchHound.Infrastructure/Queues/InProcessIngestionJobQueue.cs (self-hosted)
2. Worker concurrency safety — DB lease extension
Extend the DB-level compare-and-swap lease pattern from EnrichmentWorker to all workers that currently have no concurrency guard.
The EnrichmentWorker pattern (src/PatchHound.Worker/EnrichmentWorker.cs:415):
- Conditional ExecuteUpdateAsync where LeaseId IS NULL OR LeaseExpiresAt < NOW()
- Lease owner: $"{Environment.MachineName}:{Environment.ProcessId}"
- Lease duration: configurable per worker type
- Released in a finally block
Workers requiring lease protection:
| Worker | Poll interval | Idempotency risk | Lease duration |
| --- | --- | --- | --- |
| IngestionWorker | 1 min | Duplicate ingestion run per source | 10 min |
| SlaCheckWorker | 1 hr | Duplicate SLA notifications | 2 hr |
| WorkflowWorker | 1 min | Double-processing of workflow timeout transitions | 5 min |
| ApprovalExpiryWorker | 15 min | Double auto-denial of expired approvals | 30 min |
| ScanSchedulerWorker | 60 s | Duplicate scan scheduling | 5 min |
| RemediationAiWorker | 10 s | RemediationAiJob picked by two workers simultaneously (no optimistic concurrency guard) | 2 min |
RemediationAiWorker additional fix: Add EF Core optimistic concurrency ([Timestamp] / rowversion) or a status CAS (UPDATE ... WHERE Status = 'Pending' RETURNING id) on RemediationAiJob so that claiming a job is atomic and the same job cannot be claimed by two workers.
DB schema change: Add a new WorkerLease table (keyed by worker type name) with LeaseId, LeaseOwner, LeaseExpiresAt columns, rather than adding lease columns to every domain table. Add a WorkerLeaseService that wraps the acquire/release pattern so that individual workers do not reimplement it.
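The lease semantics described in this section can be modelled in memory as below. This is only a runnable stand-in for the proposed WorkerLease table: `WorkerLeaseRow` and `InMemoryWorkerLeaseTable` are hypothetical names, and the `lock` plays the role that the single conditional UPDATE plays in the database.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical row shape mirroring the proposed LeaseId / LeaseOwner / LeaseExpiresAt columns.
public sealed class WorkerLeaseRow
{
    public Guid? LeaseId { get; set; }
    public string? LeaseOwner { get; set; }
    public DateTimeOffset? LeaseExpiresAt { get; set; }
}

// In-memory stand-in for the WorkerLease table, keyed by worker type name.
public sealed class InMemoryWorkerLeaseTable
{
    private readonly Dictionary<string, WorkerLeaseRow> _rows = new();
    private readonly object _gate = new();

    // Mirrors the predicate: LeaseId IS NULL OR LeaseExpiresAt < NOW().
    // Returns the new lease id, or null when another owner holds a live lease.
    public Guid? TryAcquire(string workerType, string owner, TimeSpan duration, DateTimeOffset now)
    {
        lock (_gate)
        {
            if (!_rows.TryGetValue(workerType, out var row))
                _rows[workerType] = row = new WorkerLeaseRow();

            var free = row.LeaseId is null || row.LeaseExpiresAt < now;
            if (!free)
                return null;

            row.LeaseId = Guid.NewGuid();
            row.LeaseOwner = owner;
            row.LeaseExpiresAt = now + duration;
            return row.LeaseId;
        }
    }

    // Releases only if the caller still holds the lease it acquired,
    // so a worker whose lease expired cannot clear a newer owner's lease.
    public bool Release(string workerType, Guid leaseId)
    {
        lock (_gate)
        {
            if (!_rows.TryGetValue(workerType, out var row) || row.LeaseId != leaseId)
                return false;

            row.LeaseId = null;
            row.LeaseOwner = null;
            row.LeaseExpiresAt = null;
            return true;
        }
    }
}
```

A worker would call `TryAcquire` at the start of each poll, skip the iteration when it returns null, and call `Release` in a `finally` block. In a database-backed WorkerLeaseService the same predicate would presumably sit in the `Where` clause of a conditional ExecuteUpdateAsync, with "one row affected" meaning the lease was acquired.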
Affected files:
- src/PatchHound.Worker/IngestionWorker.cs
- src/PatchHound.Worker/SlaCheckWorker.cs
- src/PatchHound.Worker/WorkflowWorker.cs
- src/PatchHound.Worker/ApprovalExpiryWorker.cs
- src/PatchHound.Worker/ScanSchedulerWorker.cs
- src/PatchHound.Worker/RemediationAiWorker.cs
- New: src/PatchHound.Infrastructure/Services/WorkerLeaseService.cs
- New migration: AddWorkerLeaseTable
3. Horizontal API scaling fixes
SignalR backplane
- Azure path: replace AddSignalR() with AddSignalR().AddAzureSignalR(connectionString) or .AddStackExchangeRedis(redisConnectionString) in src/PatchHound.Api/Program.cs (line 427)
- Self-hosted path: use AddStackExchangeRedis with the existing Redis connection string
- Redis is already registered in DI (DependencyInjection.cs:28); promote it to a required dependency for both paths once the backplane is wired up
Distributed rate limiting
- Replace System.Threading.RateLimiting.FixedWindowRateLimiter (in-process) with a Redis-backed implementation (e.g. using StackExchange.Redis with a sliding window Lua script, or the RedisRateLimiting library)
- Self-hosted path: same Redis instance used for the backplane
DB migration race on startup
- Guard MigrateAsync with a distributed lock before applying migrations
- Azure path: use an Azure Storage blob lease or a simple SELECT pg_try_advisory_lock(constant) PostgreSQL advisory lock (no extra dependency) to ensure only one replica migrates at a time
- Both paths can use the PostgreSQL advisory lock approach since Postgres is shared
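The advisory-lock guard could be structured along these lines. It is a sketch under assumptions: `MigrationGate`, the `LockKey` value, and the three delegates are hypothetical, and the database calls are abstracted away so the control flow can be read (and exercised) without a PostgreSQL connection. `pg_try_advisory_lock` and `pg_advisory_unlock` are the PostgreSQL functions the delegates would be expected to execute.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class MigrationGate
{
    // Hypothetical constant; any fixed 64-bit key shared by all replicas would do.
    public const long LockKey = 7_241_001;

    // SQL the tryLock / unlock delegates are expected to run, with @key bound to LockKey.
    public const string TryLockSql = "SELECT pg_try_advisory_lock(@key)";
    public const string UnlockSql = "SELECT pg_advisory_unlock(@key)";

    // Polls until the lock is obtained, runs the migration, and always unlocks afterwards.
    public static async Task RunExclusiveAsync(
        Func<CancellationToken, Task<bool>> tryLock,
        Func<CancellationToken, Task> unlock,
        Func<CancellationToken, Task> migrate,
        TimeSpan retryDelay,
        CancellationToken ct)
    {
        while (!await tryLock(ct))
            await Task.Delay(retryDelay, ct);

        try
        {
            await migrate(ct);
        }
        finally
        {
            // Unlock even when shutdown was requested mid-migration.
            await unlock(CancellationToken.None);
        }
    }
}
```

Session-level advisory locks belong to the connection that took them, so `tryLock` and `unlock` would need to run on the same open connection. A replica that had to wait still calls `migrate` once it gets the lock, which should be harmless because MigrateAsync only applies pending migrations.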
Affected files:
src/PatchHound.Api/Program.cs — lines 427, 431–450, 470–480
src/PatchHound.Infrastructure/DependencyInjection.cs — line 27
4. Observability
Add structured telemetry as a first-class concern — required before any production cloud deployment.
- OpenTelemetry SDK (OpenTelemetry.Extensions.Hosting, OpenTelemetry.Instrumentation.AspNetCore, OpenTelemetry.Instrumentation.EntityFrameworkCore, OpenTelemetry.Instrumentation.Http)
- Azure path: Export to Azure Monitor (Azure.Monitor.OpenTelemetry.AspNetCore); a single package wires up traces, metrics, and logs to Application Insights
- Self-hosted path: Export to OTLP endpoint (Jaeger / Grafana / etc.) via OpenTelemetry.Exporter.OpenTelemetryProtocol
- Key traces to add: ingestion run per source, per-tenant vulnerability reconciliation, workflow stage transitions, enrichment job execution
- Metrics: ingestion job queue depth, jobs processed/failed per minute, p99 DB query latency, active workflow count per tenant
- Configuration key: Telemetry:Endpoint (OTLP) or ApplicationInsights:ConnectionString (Azure)
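One possible shape for the shared activity source and meter is sketched below, using only System.Diagnostics types from the base class library. The source name, span name, tag keys, and counter names are illustrative assumptions, not an agreed naming scheme.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class PatchHoundActivitySource
{
    // Assumed name; exporters subscribe to the source and meter by this string.
    public const string Name = "PatchHound";

    public static readonly ActivitySource Source = new ActivitySource(Name);
    public static readonly Meter Metrics = new Meter(Name);

    // Hypothetical counter names for the "jobs processed/failed" metrics.
    public static readonly Counter<long> IngestionJobsProcessed =
        Metrics.CreateCounter<long>("patchhound.ingestion.jobs.processed");
    public static readonly Counter<long> IngestionJobsFailed =
        Metrics.CreateCounter<long>("patchhound.ingestion.jobs.failed");

    // Starts a span for one ingestion run. Returns null when nothing is listening,
    // so callers should use the null-conditional operator or a using statement.
    public static Activity? StartIngestionRun(Guid tenantId, Guid sourceConfigId)
    {
        var activity = Source.StartActivity("ingestion.run", ActivityKind.Internal);
        activity?.SetTag("patchhound.tenant_id", tenantId.ToString());
        activity?.SetTag("patchhound.source_config_id", sourceConfigId.ToString());
        return activity;
    }
}
```

The OpenTelemetry registration in Program.cs would then need to subscribe to this source and meter by name; otherwise `StartActivity` returns null and nothing is exported. Queue depth and p99 latency are better suited to observable gauges and histograms, which are left out here.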
Affected files:
- All .csproj files: add OpenTelemetry packages
- src/PatchHound.Api/Program.cs: AddOpenTelemetry() registration
- src/PatchHound.Worker/Program.cs: same
- New: src/PatchHound.Infrastructure/Telemetry/PatchHoundActivitySource.cs
5. Azure IaC — Bicep
Fill the infra/ directory with a deployable Bicep structure. The existing directory layout maps naturally to Bicep modules:
    infra/
      modules/
        control-plane/   # Azure Container Registry, Key Vault (if not using OpenBao), Log Analytics workspace
        shared/          # PostgreSQL Flexible Server, Azure Cache for Redis, Azure Service Bus namespace
        stamp/           # Container Apps environment, API container app, Worker container app, Managed Identity
      stacks/
        azure/           # main.bicep: composes control-plane + shared + stamp
        selfhosted/      # docker-compose.yml already exists; document it here
      params/
        dev/             # dev.bicepparam
        prod/            # prod.bicepparam
Target Azure services:
| Component | Azure service | Notes |
| --- | --- | --- |
| API | Azure Container Apps | Auto-scaling on HTTP concurrency; HTTPS ingress; min replicas = 1 |
| Worker | Azure Container Apps (internal) | Scale on Service Bus queue depth (KEDA rule); min replicas = 1 |
| Database | Azure Database for PostgreSQL Flexible Server | General-purpose tier, zone-redundant HA for prod |
| Cache / backplane | Azure Cache for Redis | Basic SKU for dev, Standard for prod |
| Ingestion queue | Azure Service Bus | Standard tier; one namespace, one queue per deployment |
| Secrets | Azure Key Vault (production) or OpenBao (self-hosted / dev) | MSAL managed identity auth for Key Vault |
| Container registry | Azure Container Registry | Geo-replicated for multi-region prod |
| Observability | Azure Monitor + Application Insights workspace | Connected to Log Analytics workspace |
| Identity | User-assigned Managed Identity | Assigned to both Container Apps; used for ACR pull, Key Vault, Service Bus access |
Authentication between PatchHound and Azure services:
- API and Worker use a user-assigned managed identity, so no connection string secrets are needed for Azure-managed services
- Azure.Identity.DefaultAzureCredential is already referenced (Azure.Identity 1.18.0 in PatchHound.Infrastructure.csproj:12); use ManagedIdentityCredential in prod, DefaultAzureCredential in dev
Self-hosted stack:
- infra/stacks/selfhosted/ documents the docker-compose deployment (existing docker-compose.yml)
- Service Bus replaced by in-process Channel via InProcessIngestionJobQueue
- Redis is already in docker-compose
- OpenBao remains as the secrets backend
Acceptance criteria
Ingestion
- GetAllPagesAsync is replaced with IAsyncEnumerable<T> streaming; at no point is the full page list held in memory
Worker safety
- RemediationAiJob cannot be claimed by two workers in the same atomic operation
API horizontal scaling
Observability
- /metrics endpoint (or Azure Monitor workspace) shows ingestion queue depth, jobs/min, p99 DB latency
IaC
- bicep build infra/stacks/azure/main.bicep succeeds with no errors
- az deployment sub create with dev params deploys a working environment to Azure
- docker-compose up (self-hosted) continues to work end-to-end after all application changes
Out of scope
- Read replicas / DB horizontal scaling: not required at the SaaS multi-tenant target scale; re-evaluate if p99 query latency exceeds SLA at >50 active tenants
- Multi-region / geo-distribution: not in scope for the first Azure stack
- DB sharding / per-tenant schemas: current single-schema multi-tenant design is sufficient at target scale with proper indexing
Dependencies
- IngestionService.RunIngestionAsync is currently a no-op stub
Reference files
- src/PatchHound.Worker/IngestionWorker.cs: current cron-poll dispatcher
- src/PatchHound.Infrastructure/VulnerabilitySources/DefenderApiClient.cs: GetAllPagesAsync (line 260)
- src/PatchHound.Infrastructure/VulnerabilitySources/DefenderVulnerabilitySource.cs: FetchAssetBatchAsync (line 252)
- src/PatchHound.Worker/EnrichmentWorker.cs: reference lease pattern (line 415)
- src/PatchHound.Api/Program.cs: SignalR (line 427), rate limiter (line 431), migration (line 470)
- src/PatchHound.Infrastructure/DependencyInjection.cs: Redis registration (line 28)
- infra/: empty directory tree to be filled