feat: Scalability — SaaS multi-tenant architecture with Azure as first cloud

## Goal

Make PatchHound production-ready for SaaS multi-tenant deployments. Target scale: many tenants, each with up to ~10k devices and ~50k vulnerability records, with the architecture remaining correct and safe under horizontal scaling.

Azure is the first supported cloud. The self-hosted path (docker-compose / on-premises container runtime) remains a first-class supported deployment option alongside Azure, sharing as much application-level code as possible.

---

## Current state

| Area | Current behaviour | Problem |
|---|---|---|
| Ingestion pipeline | `GetAllPagesAsync` accumulates all pages in a single in-memory `List<T>` before processing (up to `MaxPages = 1000`) | Memory spike per large tenant; not safe at scale |
| Worker concurrency safety | Only `EnrichmentWorker` has a DB-level compare-and-swap lease. `IngestionWorker`, `SlaCheckWorker`, `WorkflowWorker`, `ApprovalExpiryWorker`, `ScanSchedulerWorker`, `RemediationAiWorker` have **no** distributed lease | Running two worker instances causes double-processing (duplicate ingestion runs, duplicate SLA notifications, double approval expiry) |
| Ingestion dispatch | Ingestion runs are triggered by a 1-minute poll loop in `IngestionWorker` evaluating every `TenantSourceConfiguration` cron schedule | All tenant ingestion jobs contend on the same single worker; no backpressure; no dead-letter; no retry on partial failure |
| Redis | `StackExchange.Redis` is referenced and registered but has **zero injection sites** — it is wired up but never used | Prepared infrastructure with no active functionality |
| SignalR backplane | `AddSignalR()` with no backplane. Multi-instance API deployments would silently drop notifications for users connected to a different instance | API cannot be safely scaled horizontally |
| Rate limiting | `PartitionedRateLimiter` with `FixedWindowRateLimiter` — in-process only | Rate limit is per-instance, not per-user across the cluster |
| DB migration race | `MigrateAsync` runs at startup in every API replica with a retry loop | Concurrent replicas race on migration; no distributed lock guards this |
| Observability | Zero Application Insights, OpenTelemetry, or structured telemetry packages. Built-in `ILogger<T>` only | No distributed tracing, no metrics, no alerting surface |
| IaC | `infra/` directory exists with the correct structure (`modules/control-plane`, `modules/shared`, `modules/stamp`, `stacks/azure`, `stacks/selfhosted`, `params/dev`, `params/prod`) but **all directories are empty** | No deployable infrastructure definition |

---

## Scope

This issue covers five work streams that should be implemented together or in close sequence:

---

### 1. Ingestion pipeline — Azure Service Bus + streaming

Replace the in-process poll-and-accumulate ingestion model with a queue-backed streaming model.

**Queue dispatch:**
- `IngestionWorker` cron evaluation remains but instead of running ingestion inline, it **enqueues a message** to Azure Service Bus for each due `TenantSourceConfiguration`
- Message payload: `{ TenantId, SourceConfigId, TriggeredAt, ManualRequest: bool }`
- A separate worker (or competing consumers within the same worker) dequeues messages and calls `IngestionService.RunIngestionAsync` once per message
- Dead-letter queue for failed ingestion runs; retry count configurable per tenant
- Self-hosted path: use an in-process `System.Threading.Channels.Channel<T>` as the queue abstraction with the same interface, so the application code is identical and only the DI registration differs

**Streaming ingestion:**
- Replace `GetAllPagesAsync` (`src/PatchHound.Infrastructure/VulnerabilitySources/DefenderApiClient.cs:260`) with `IAsyncEnumerable<TItem>` streaming — yield each page's items as they arrive
- `IVulnerabilityBatchSource` / `IAssetInventoryBatchSource` already exist as cursor-based batch interfaces; the full-accumulation mode in `FetchAssetBatchAsync` (`DefenderVulnerabilitySource.cs:252`) should be replaced with the batch interface
- Ingestion checkpoints (`IngestionCheckpoint` entity) should be persisted after each page/batch so a failed run can resume rather than restart from the beginning
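
The streaming change can be sketched with an async iterator. This is a minimal illustration, not the existing `DefenderApiClient` code: the `Page<TItem>` shape, the `fetchPage` delegate, and the `PagedStreaming` class are hypothetical stand-ins for whatever the real client exposes.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical page shape: the items plus the link to the next page (null when done).
public sealed record Page<TItem>(IReadOnlyList<TItem> Items, string? NextLink);

public static class PagedStreaming
{
    // Yields items page by page instead of accumulating every page in one list.
    public static async IAsyncEnumerable<TItem> StreamPagesAsync<TItem>(
        string firstLink,
        Func<string, CancellationToken, Task<Page<TItem>>> fetchPage,
        int maxPages = 1000, // mirrors the existing MaxPages cap
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        string? link = firstLink;
        for (var page = 0; link is not null && page < maxPages; page++)
        {
            var current = await fetchPage(link, ct);
            foreach (var item in current.Items)
                yield return item; // only the current page is held in memory
            link = current.NextLink;
        }
    }
}
```

A caller consumes this with `await foreach`. A variant that yields whole pages rather than single items would give a natural point to persist an `IngestionCheckpoint` after each page.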

**Interface:**
```csharp
// Abstract over Service Bus vs in-process channel
public interface IIngestionJobQueue
{
    Task EnqueueAsync(IngestionJobMessage message, CancellationToken ct);
    IAsyncEnumerable<IngestionJobMessage> DequeueAsync(CancellationToken ct);
}
```
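
For the self-hosted path, the interface above could be backed by a bounded channel roughly as follows. This is a sketch under assumptions: the `Guid` field types on `IngestionJobMessage` and the default capacity are guesses, and the interface is repeated only so the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Field names follow the message payload listed under "Queue dispatch"; the Guid types are assumed.
public sealed record IngestionJobMessage(
    Guid TenantId, Guid SourceConfigId, DateTimeOffset TriggeredAt, bool ManualRequest);

public interface IIngestionJobQueue
{
    Task EnqueueAsync(IngestionJobMessage message, CancellationToken ct);
    IAsyncEnumerable<IngestionJobMessage> DequeueAsync(CancellationToken ct);
}

public sealed class InProcessIngestionJobQueue : IIngestionJobQueue
{
    private readonly Channel<IngestionJobMessage> _channel;

    public InProcessIngestionJobQueue(int capacity = 256)
    {
        _channel = Channel.CreateBounded<IngestionJobMessage>(new BoundedChannelOptions(capacity)
        {
            // Producers wait when the channel is full, which provides simple backpressure.
            FullMode = BoundedChannelFullMode.Wait,
            SingleReader = false,
            SingleWriter = false,
        });
    }

    public Task EnqueueAsync(IngestionJobMessage message, CancellationToken ct) =>
        _channel.Writer.WriteAsync(message, ct).AsTask();

    // Multiple consumers may enumerate concurrently; each message goes to exactly one of them.
    public IAsyncEnumerable<IngestionJobMessage> DequeueAsync(CancellationToken ct) =>
        _channel.Reader.ReadAllAsync(ct);
}
```

A channel lives in process memory, so queued jobs are lost on restart, and retry or dead-letter handling would have to be implemented by the consumer rather than by the queue.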

**Affected files:**
- `src/PatchHound.Infrastructure/VulnerabilitySources/DefenderApiClient.cs` — `GetAllPagesAsync` (line 260)
- `src/PatchHound.Infrastructure/VulnerabilitySources/DefenderVulnerabilitySource.cs` — `FetchAssetBatchAsync` (line 252)
- `src/PatchHound.Worker/IngestionWorker.cs` — replace inline execution with enqueue
- `src/PatchHound.Infrastructure/Services/IngestionService.cs` — Phase 3 rewrite target (already a no-op stub)
- New: `src/PatchHound.Infrastructure/Queues/IIngestionJobQueue.cs`
- New: `src/PatchHound.Infrastructure/Queues/ServiceBusIngestionJobQueue.cs` (Azure)
- New: `src/PatchHound.Infrastructure/Queues/InProcessIngestionJobQueue.cs` (self-hosted)

---

### 2. Worker concurrency safety — DB lease extension

Extend the DB-level compare-and-swap lease pattern from `EnrichmentWorker` to all workers that currently have no concurrency guard.

The `EnrichmentWorker` pattern (`src/PatchHound.Worker/EnrichmentWorker.cs:415`):
- Conditional `ExecuteUpdateAsync` where `LeaseId IS NULL OR LeaseExpiresAt < NOW()`
- Lease owner: `$"{Environment.MachineName}:{Environment.ProcessId}"`
- Lease duration: configurable per worker type
- Released in a `finally` block

**Workers requiring lease protection:**

| Worker | Poll interval | Idempotency risk | Lease duration |
|---|---|---|---|
| `IngestionWorker` | 1 min | Duplicate ingestion run per source | 10 min |
| `SlaCheckWorker` | 1 hr | Duplicate SLA notifications | 2 hr |
| `WorkflowWorker` | 1 min | Double-processing of workflow timeout transitions | 5 min |
| `ApprovalExpiryWorker` | 15 min | Double auto-denial of expired approvals | 30 min |
| `ScanSchedulerWorker` | 1 min | Duplicate scan scheduling | 5 min |
| `RemediationAiWorker` | 10 s | `RemediationAiJob` picked by two workers simultaneously (no optimistic concurrency guard) | 2 min |

**`RemediationAiWorker` additional fix:** Add EF Core optimistic concurrency (`[Timestamp]` / `rowversion`) or a status CAS (`UPDATE ... WHERE Status = 'Pending' RETURNING id`) on `RemediationAiJob` so that claiming a job is atomic and two workers cannot claim the same job.

**DB schema change:** Add a new `WorkerLease` table keyed by worker type name, with `LeaseId`, `LeaseOwner`, and `LeaseExpiresAt` columns, rather than adding lease columns to every domain table. A `WorkerLeaseService` wraps the acquire/release pattern so that each worker does not reimplement it.
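
A rough sketch of how such a service could look with EF Core's `ExecuteUpdateAsync` follows. The entity shape, the `LeaseDbContext` name, and the method names are assumptions, not the existing `EnrichmentWorker` code. It also assumes the migration seeds one row per worker type, and it compares against the caller's UTC clock rather than the database's `NOW()`.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// Hypothetical entity: one row per worker type, seeded by the migration.
public sealed class WorkerLease
{
    public string WorkerType { get; set; } = "";
    public Guid? LeaseId { get; set; }
    public string? LeaseOwner { get; set; }
    public DateTimeOffset? LeaseExpiresAt { get; set; }
}

public sealed class LeaseDbContext : DbContext
{
    public LeaseDbContext(DbContextOptions<LeaseDbContext> options) : base(options) { }
    public DbSet<WorkerLease> WorkerLeases => Set<WorkerLease>();

    protected override void OnModelCreating(ModelBuilder modelBuilder) =>
        modelBuilder.Entity<WorkerLease>().HasKey(l => l.WorkerType);
}

public sealed class WorkerLeaseService
{
    private readonly LeaseDbContext _db;
    private readonly string _owner = $"{Environment.MachineName}:{Environment.ProcessId}";

    public WorkerLeaseService(LeaseDbContext db) => _db = db;

    // Returns the new lease id if this instance won the lease, otherwise null.
    public async Task<Guid?> TryAcquireAsync(string workerType, TimeSpan duration, CancellationToken ct)
    {
        var now = DateTimeOffset.UtcNow;
        Guid? leaseId = Guid.NewGuid();
        DateTimeOffset? expiresAt = now + duration;

        // Single conditional UPDATE: only one concurrent caller can match the WHERE clause.
        var rows = await _db.WorkerLeases
            .Where(l => l.WorkerType == workerType
                     && (l.LeaseId == null || l.LeaseExpiresAt < now))
            .ExecuteUpdateAsync(s => s
                .SetProperty(l => l.LeaseId, leaseId)
                .SetProperty(l => l.LeaseOwner, _owner)
                .SetProperty(l => l.LeaseExpiresAt, expiresAt), ct);

        return rows == 1 ? leaseId : null;
    }

    // Clears the lease only if it is still held under the given lease id.
    public Task<int> ReleaseAsync(string workerType, Guid leaseId, CancellationToken ct) =>
        _db.WorkerLeases
            .Where(l => l.WorkerType == workerType && l.LeaseId == leaseId)
            .ExecuteUpdateAsync(s => s
                .SetProperty(l => l.LeaseId, (Guid?)null)
                .SetProperty(l => l.LeaseOwner, (string?)null)
                .SetProperty(l => l.LeaseExpiresAt, (DateTimeOffset?)null), ct);
}
```

A worker would call `TryAcquireAsync` at the top of each poll iteration, skip the iteration when it returns `null`, and call `ReleaseAsync` in a `finally` block.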

**Affected files:**
- `src/PatchHound.Worker/IngestionWorker.cs`
- `src/PatchHound.Worker/SlaCheckWorker.cs`
- `src/PatchHound.Worker/WorkflowWorker.cs`
- `src/PatchHound.Worker/ApprovalExpiryWorker.cs`
- `src/PatchHound.Worker/ScanSchedulerWorker.cs`
- `src/PatchHound.Worker/RemediationAiWorker.cs`
- New: `src/PatchHound.Infrastructure/Services/WorkerLeaseService.cs`
- New migration: `AddWorkerLeaseTable`

---

### 3. Horizontal API scaling fixes

#### SignalR backplane
- Azure path: replace `AddSignalR()` with `AddSignalR().AddAzureSignalR(connectionString)` or `.AddStackExchangeRedis(redisConnectionString)` in `src/PatchHound.Api/Program.cs` (line 427)
- Self-hosted path: use `AddStackExchangeRedis` with the existing Redis connection string
- Redis is already registered in DI (`DependencyInjection.cs:28`); promote it to a required dependency for both paths once the backplane is wired up
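
The registration could select the backplane from configuration, along the lines of the sketch below. The configuration keys are hypothetical. `AddAzureSignalR` comes from the `Microsoft.Azure.SignalR` package and `AddStackExchangeRedis` from `Microsoft.AspNetCore.SignalR.StackExchangeRedis`.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Hypothetical configuration keys; the real names depend on PatchHound's settings layout.
var azureSignalR = builder.Configuration["Azure:SignalR:ConnectionString"];
var redis = builder.Configuration.GetConnectionString("Redis");

var signalR = builder.Services.AddSignalR();
if (!string.IsNullOrWhiteSpace(azureSignalR))
{
    signalR.AddAzureSignalR(azureSignalR);   // Azure SignalR Service
}
else if (!string.IsNullOrWhiteSpace(redis))
{
    signalR.AddStackExchangeRedis(redis);    // Redis backplane (self-hosted or Azure Cache for Redis)
}
// With neither configured, SignalR stays in-process and is only safe for a single instance.

var app = builder.Build();
app.Run();
```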

#### Distributed rate limiting
- Replace `System.Threading.RateLimiting.FixedWindowRateLimiter` (in-process) with a Redis-backed implementation (e.g. using `StackExchange.Redis` with a sliding window Lua script, or the `RedisRateLimiting` library)
- Self-hosted path: same Redis instance used for the backplane
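
As an illustration of the Redis-backed approach, the sketch below implements a cluster-wide fixed window (matching the current `FixedWindowRateLimiter` semantics, not the sliding window mentioned above) with a small Lua script. The class name and key prefix are hypothetical, and integrating it with the ASP.NET Core rate limiting middleware is left out.

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public sealed class RedisFixedWindowLimiter
{
    // INCR and the first-hit PEXPIRE run atomically inside one script.
    private const string Script = @"
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('PEXPIRE', KEYS[1], ARGV[1])
end
return count";

    private readonly IDatabase _db;
    private readonly int _permitLimit;
    private readonly TimeSpan _window;

    public RedisFixedWindowLimiter(IConnectionMultiplexer redis, int permitLimit, TimeSpan window)
    {
        _db = redis.GetDatabase();
        _permitLimit = permitLimit;
        _window = window;
    }

    // Returns true while the partition (for example a user id) is within its limit.
    public async Task<bool> TryAcquireAsync(string partitionKey)
    {
        var result = await _db.ScriptEvaluateAsync(
            Script,
            new RedisKey[] { $"ratelimit:{partitionKey}" },
            new RedisValue[] { (long)_window.TotalMilliseconds });

        return (long)result <= _permitLimit;
    }
}
```

Because the counter lives in Redis, every API instance sees the same count for a given partition key.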

#### DB migration race on startup
- Guard `MigrateAsync` with a distributed lock before applying migrations
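- Azure path: use an Azure Storage blob lease or a PostgreSQL advisory lock (`SELECT pg_try_advisory_lock(constant)`, no extra dependency) to ensure only one replica migrates at a time. `pg_try_advisory_lock` returns immediately, so replicas that fail to acquire the lock must wait and retry, or use the blocking `pg_advisory_lock` instead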
- Both paths can use the PostgreSQL advisory lock approach since Postgres is shared
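
A minimal sketch of the advisory-lock guard, assuming Npgsql and EF Core. It uses the blocking `pg_advisory_lock` rather than `pg_try_advisory_lock`, so waiting replicas proceed once the first one finishes. The `MigrationLock` class and the lock key value are hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Npgsql;

public static class MigrationLock
{
    // Arbitrary application-chosen key; every replica must use the same value.
    private const long LockKey = 7_241_905_113;

    public static async Task MigrateWithLockAsync(
        DbContext db, string connectionString, CancellationToken ct)
    {
        // The lock is session-scoped, so it is held for as long as this connection stays open.
        await using var conn = new NpgsqlConnection(connectionString);
        await conn.OpenAsync(ct);

        await using (var acquire = new NpgsqlCommand("SELECT pg_advisory_lock(@key)", conn))
        {
            acquire.Parameters.AddWithValue("key", LockKey);
            await acquire.ExecuteNonQueryAsync(ct); // blocks until the lock is free
        }

        try
        {
            // Replicas that waited find no pending migrations, so this becomes a no-op for them.
            await db.Database.MigrateAsync(ct);
        }
        finally
        {
            await using var release = new NpgsqlCommand("SELECT pg_advisory_unlock(@key)", conn);
            release.Parameters.AddWithValue("key", LockKey);
            await release.ExecuteNonQueryAsync(CancellationToken.None);
        }
    }
}
```

If the process crashes mid-migration, PostgreSQL releases the session-level lock when the connection drops, so another replica can take over.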

**Affected files:**
- `src/PatchHound.Api/Program.cs` — lines 427, 431–450, 470–480
- `src/PatchHound.Infrastructure/DependencyInjection.cs` — line 28

---

### 4. Observability

Add structured telemetry as a first-class concern — required before any production cloud deployment.

- **OpenTelemetry SDK** (`OpenTelemetry.Extensions.Hosting`, `OpenTelemetry.Instrumentation.AspNetCore`, `OpenTelemetry.Instrumentation.EntityFrameworkCore`, `OpenTelemetry.Instrumentation.Http`)
- **Azure path:** Export to Azure Monitor (`Azure.Monitor.OpenTelemetry.AspNetCore`) — single package wires up traces, metrics, and logs to Application Insights
- **Self-hosted path:** Export to OTLP endpoint (Jaeger / Grafana / etc.) via `OpenTelemetry.Exporter.OpenTelemetryProtocol`
- Key traces to add: ingestion run per source, per-tenant vulnerability reconciliation, workflow stage transitions, enrichment job execution
- Metrics: ingestion job queue depth, jobs processed/failed per minute, p99 DB query latency, active workflow count per tenant
- Configuration key: `Telemetry:Endpoint` (OTLP) or `ApplicationInsights:ConnectionString` (Azure)
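
The custom traces and metrics above could hang off a single static class built on `System.Diagnostics`, sketched below. The source name, span name, tag keys, and metric names are all hypothetical placeholders.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class PatchHoundActivitySource
{
    // Assumed name; it must match what is passed to AddSource / AddMeter at registration.
    public const string Name = "PatchHound";

    public static readonly ActivitySource Source = new(Name);
    public static readonly Meter Meter = new(Name);

    // Hypothetical metric names for the job counters listed above.
    public static readonly Counter<long> JobsProcessed =
        Meter.CreateCounter<long>("patchhound.ingestion.jobs_processed");
    public static readonly Counter<long> JobsFailed =
        Meter.CreateCounter<long>("patchhound.ingestion.jobs_failed");

    // Returns null when nothing is listening, so callers should use `using var` and `?.`.
    public static Activity? StartIngestionRun(Guid tenantId, Guid sourceConfigId)
    {
        var activity = Source.StartActivity("ingestion.run");
        activity?.SetTag("tenant.id", tenantId);
        activity?.SetTag("source_config.id", sourceConfigId);
        return activity;
    }
}
```

In `Program.cs`, the OpenTelemetry registration would then subscribe to it with `AddSource(PatchHoundActivitySource.Name)` inside `WithTracing` and `AddMeter(PatchHoundActivitySource.Name)` inside `WithMetrics`, regardless of which exporter is configured.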

**Affected files:**
- All `.csproj` files — add OpenTelemetry packages
- `src/PatchHound.Api/Program.cs` — `AddOpenTelemetry()` registration
- `src/PatchHound.Worker/Program.cs` — same
- New: `src/PatchHound.Infrastructure/Telemetry/PatchHoundActivitySource.cs`

---

### 5. Azure IaC — Bicep

Fill the `infra/` directory with a deployable Bicep structure. The existing directory layout maps naturally to Bicep modules:

```
infra/
  modules/
    control-plane/    # Azure Container Registry, Key Vault (if not using OpenBao), Log Analytics workspace
    shared/           # PostgreSQL Flexible Server, Azure Cache for Redis, Azure Service Bus namespace
    stamp/            # Container Apps environment, API container app, Worker container app, Managed Identity
  stacks/
    azure/            # main.bicep — composes control-plane + shared + stamp
    selfhosted/       # docker-compose.yml already exists; document it here
  params/
    dev/              # dev.bicepparam
    prod/             # prod.bicepparam
```

**Target Azure services:**

| Component | Azure service | Notes |
|---|---|---|
| API | Azure Container Apps | Auto-scaling on HTTP concurrency; HTTPS ingress; min replicas = 1 |
| Worker | Azure Container Apps (internal) | Scale on Service Bus queue depth (KEDA rule); min replicas = 1 |
| Database | Azure Database for PostgreSQL Flexible Server | General-purpose tier, zone-redundant HA for prod |
| Cache / backplane | Azure Cache for Redis | Basic SKU for dev, Standard for prod |
| Ingestion queue | Azure Service Bus | Standard tier; one namespace, one queue per deployment |
| Secrets | Azure Key Vault (production) or OpenBao (self-hosted / dev) | Managed identity auth for Key Vault (via `Azure.Identity`) |
| Container registry | Azure Container Registry | Geo-replicated for multi-region prod |
| Observability | Azure Monitor + Application Insights workspace | Connected to Log Analytics workspace |
| Identity | User-assigned Managed Identity | Assigned to both Container Apps; used for ACR pull, Key Vault, Service Bus access |

**Authentication between PatchHound and Azure services:**
- API and Worker use a **user-assigned managed identity** — no connection string secrets for Azure-managed services
- `Azure.Identity` is already referenced (`Azure.Identity 1.18.0` in `PatchHound.Infrastructure.csproj:12`); use `ManagedIdentityCredential` in prod and `DefaultAzureCredential` in dev
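
One way to express that credential choice is sketched below. The `AzureCredentialFactory` name and its parameters are hypothetical, and how the environment and client id are supplied is left to the host configuration.

```csharp
using System;
using Azure.Core;
using Azure.Identity;

public static class AzureCredentialFactory
{
    // isProduction and managedIdentityClientId would come from host configuration (assumed).
    public static TokenCredential Create(bool isProduction, string? managedIdentityClientId)
    {
        if (isProduction && !string.IsNullOrEmpty(managedIdentityClientId))
        {
            // Pin the user-assigned identity explicitly so no other credential source is probed.
            return new ManagedIdentityCredential(
                ManagedIdentityId.FromUserAssignedClientId(managedIdentityClientId));
        }

        // Dev: falls back through environment variables, Azure CLI login, and similar sources.
        return new DefaultAzureCredential();
    }
}
```

The resulting `TokenCredential` can be passed to Azure SDK clients that accept one, for example `new ServiceBusClient(fullyQualifiedNamespace, credential)`, so no connection string secret is needed.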

**Self-hosted stack:**
- `infra/stacks/selfhosted/` documents the docker-compose deployment (existing `docker-compose.yml`)
- Service Bus replaced by in-process Channel via `InProcessIngestionJobQueue`
- Redis is already in docker-compose
- OpenBao remains as the secrets backend

---

## Acceptance criteria

### Ingestion
- [ ] Ingestion jobs are enqueued to Azure Service Bus (Azure path) or in-process Channel (self-hosted path) when their cron schedule is due
- [ ] A failed ingestion job is dead-lettered after N retries and does not block subsequent jobs for the same or other tenants
- [ ] `GetAllPagesAsync` is replaced with `IAsyncEnumerable<T>` streaming — at no point is the full page list held in memory
- [ ] Ingestion checkpoints survive a worker restart; the next run resumes from the last saved cursor position
- [ ] A tenant with 10k devices and 50k vulnerabilities can be ingested end-to-end without OOM

### Worker safety
- [ ] Running two worker instances simultaneously does not produce duplicate ingestion runs, duplicate SLA notifications, or double-processed workflow transitions
- [ ] Claiming a `RemediationAiJob` is atomic: two workers can never claim the same job
- [ ] Lease expiry and release are visible in logs

### API horizontal scaling
- [ ] Two API instances behind a load balancer deliver SignalR notifications correctly regardless of which instance a user is connected to
- [ ] Rate limiting is enforced per user across the cluster (not per-instance)
- [ ] DB migrations are applied by exactly one replica on startup

### Observability
- [ ] Distributed traces are emitted for all ingestion runs, enrichment jobs, and workflow transitions
- [ ] A `/metrics` endpoint (or Azure Monitor workspace) shows ingestion queue depth, jobs/min, p99 DB latency
- [ ] Application Insights (Azure) or OTLP exporter (self-hosted) is wired up and receiving data

### IaC
- [ ] `bicep build infra/stacks/azure/main.bicep` succeeds with no errors
- [ ] `az deployment sub create` with dev params deploys a working environment to Azure
- [ ] Managed identity is used for all Azure service access (no secrets in environment variables for Azure-managed services)
- [ ] `docker-compose up` (self-hosted) continues to work end-to-end after all application changes

---

## Out of scope

- Read replicas / DB horizontal scaling — not required at the SaaS multi-tenant target scale; re-evaluate if p99 query latency exceeds SLA at >50 active tenants
- Multi-region / geo-distribution — not in scope for the first Azure stack
- DB sharding / per-tenant schemas — current single-schema multi-tenant design is sufficient at target scale with proper indexing

---

## Dependencies

- Phase 3 of #17 (canonical ingestion pipeline rewrite) must land before the ingestion queue and streaming work is built — `IngestionService.RunIngestionAsync` is currently a no-op stub
- Issue #20 (Defender for Cloud DevOps ingestion) should be implemented on top of the streaming interface introduced here

---

## Reference files

- `src/PatchHound.Worker/IngestionWorker.cs` — current cron-poll dispatcher
- `src/PatchHound.Infrastructure/VulnerabilitySources/DefenderApiClient.cs` — `GetAllPagesAsync` (line 260)
- `src/PatchHound.Infrastructure/VulnerabilitySources/DefenderVulnerabilitySource.cs` — `FetchAssetBatchAsync` (line 252)
- `src/PatchHound.Worker/EnrichmentWorker.cs` — reference lease pattern (line 415)
- `src/PatchHound.Api/Program.cs` — SignalR (line 427), rate limiter (line 431), migration (line 470)
- `src/PatchHound.Infrastructure/DependencyInjection.cs` — Redis registration (line 28)
- `infra/` — empty directory tree to be filled
