feat: Self-hosted deployment — production-grade setup aligned with #22 scalability model

## Goal

Close the gap between the current development-only docker-compose and a production-grade self-hosted deployment that shares the same application architecture as the Azure path defined in #22. This means:

1. Replace the in-process `Channel<T>` queue with **RabbitMQ** so self-hosted gets the same durability guarantees (persistence, dead-letter, retries) as Azure Service Bus
2. Add an **optional Grafana observability stack** to docker-compose so self-hosted operators get traces, metrics, and log aggregation out of the box
3. **Automate the OpenBao bootstrap** so `docker compose up` (or a single `make setup`) results in a fully operational PatchHound instance without manual unseal / token steps

Everything in this issue targets the `infra/stacks/selfhosted/` path. The Azure path is covered by #22. Application code changes (streaming ingestion, `IIngestionJobQueue` abstraction, OpenTelemetry wiring) are shared between both issues and tracked in #22 — this issue focuses on the self-hosted infrastructure layer on top of that shared code.

---

## Current state

| Area | Current behaviour | Gap |
|---|---|---|
| Message queue | In-process `System.Threading.Channels.Channel<T>` (`SentinelAuditQueue`). Ingestion dispatch has no external queue at all. | Messages lost on worker restart. No dead-letter. No retry on failure. Not equivalent to Azure Service Bus. |
| Observability | Zero telemetry. No OTLP export configured. | Self-hosted operators have no visibility into ingestion performance, errors, or SLA metrics. |
| OpenBao bootstrap | 6 manual steps: init → unseal (×3) → login → enable-kv → policy → token. Must be repeated after every volume reset. | First-run experience is brittle and undocumented at the compose level. A single missed step silently breaks all secret reads. |
| docker-compose profiles | No profiles. All services always start. | No way to opt in/out of observability services or run a minimal dev stack. |
| `infra/stacks/selfhosted/` | Empty directory. | No canonical self-hosted IaC or runbook. |

---

## 1. RabbitMQ as the self-hosted message broker

### Why RabbitMQ over in-process Channel

The `IIngestionJobQueue` abstraction introduced by #22 hides the queue implementation. The self-hosted path should bind it to RabbitMQ rather than an in-process Channel so that:

- Ingestion jobs survive a worker container restart
- Failed jobs land in a dead-letter queue rather than being silently dropped
- Multiple worker replicas can be run safely (competing consumers on the same queue)
- The operational model mirrors the Azure Service Bus path — operators get the same guarantees and the same retry/DLQ behaviour

### docker-compose addition

```yaml
rabbitmq:
  image: rabbitmq:4-management-alpine
  environment:
    RABBITMQ_DEFAULT_USER: ${RABBITMQ_USER:-patchhound}
    RABBITMQ_DEFAULT_PASS: ${RABBITMQ_PASSWORD}
  volumes:
    - rabbitmq_data:/var/lib/rabbitmq
  ports:
    - "5672:5672"    # AMQP
    - "15672:15672"  # Management UI
  healthcheck:
    test: ["CMD", "rabbitmq-diagnostics", "ping"]
    interval: 10s
    timeout: 5s
    retries: 12
    start_period: 20s
```

Both `api` and `worker` services gain a dependency on `rabbitmq: condition: service_healthy` and a new environment variable:

```
ConnectionStrings__RabbitMq: amqp://${RABBITMQ_USER:-patchhound}:${RABBITMQ_PASSWORD}@rabbitmq:5672/
```
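
One caveat with interpolating `${RABBITMQ_PASSWORD}` directly into the URI: characters such as `@`, `/`, `:` or `#` in the password must be percent-encoded, or the URI will be parsed incorrectly. The sketch below shows one way to build the value safely before writing it to `.env`. The helper names `urlencode` and `amqp_uri` are hypothetical, and the encoder assumes ASCII-only input.

```bash
# Percent-encode a string for use in the userinfo part of an AMQP URI.
# Assumption: ASCII input only; multi-byte characters are not handled here.
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}            # everything after the first character
        c=${s%"$rest"}         # the first character itself
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;            # RFC 3986 unreserved: keep
            *) out=$out$(printf '%%%02X' "'$c") ;;    # anything else: %XX
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

# Build the value for ConnectionStrings__RabbitMq (hypothetical helper).
amqp_uri() {
    printf 'amqp://%s:%s@rabbitmq:5672/\n' \
        "$(urlencode "${RABBITMQ_USER:-patchhound}")" \
        "$(urlencode "${RABBITMQ_PASSWORD}")"
}
```

An alternative with the same effect is to restrict generated passwords to unreserved characters and state that constraint in `.env.example`.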

### Application wiring

A new `RabbitMqIngestionJobQueue` implements `IIngestionJobQueue` (defined in #22):

```csharp
// src/PatchHound.Infrastructure/Queues/RabbitMqIngestionJobQueue.cs
public class RabbitMqIngestionJobQueue : IIngestionJobQueue
{
    // Uses RabbitMQ.Client or MassTransit backing
    // Queue: "ingestion-jobs"
    // Dead-letter exchange: "ingestion-jobs-dlx" → queue "ingestion-jobs-dead"
    // Message TTL and max-delivery-count configurable via appsettings
}
```

DI registration in `DependencyInjection.cs` — select implementation based on configuration:

```csharp
if (!string.IsNullOrEmpty(config["ConnectionStrings:ServiceBus"]))
    services.AddSingleton<IIngestionJobQueue, ServiceBusIngestionJobQueue>();   // Azure
else if (!string.IsNullOrEmpty(config["ConnectionStrings:RabbitMq"]))
    services.AddSingleton<IIngestionJobQueue, RabbitMqIngestionJobQueue>();     // self-hosted
else
    services.AddSingleton<IIngestionJobQueue, InProcessIngestionJobQueue>();    // dev fallback
```

The `SentinelAuditQueue` (currently a `Channel<SentinelAuditEvent>`) should be evaluated separately — it is an internal in-process audit buffer and may remain as a Channel since it is not a job queue.

### Queue topology

| Queue | Purpose | DLX |
|---|---|---|
| `ingestion-jobs` | Pending ingestion job messages | `ingestion-jobs-dlx` |
| `ingestion-jobs-dead` | Failed jobs after max retries | — (monitor manually) |

Queue and exchange declarations should be idempotent on startup (declare-if-not-exists pattern).
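
To make the topology concrete, here is a sketch that declares it through the RabbitMQ management HTTP API, which is available because the proposed image is a `-management` variant. This is an illustration only: the application is expected to declare the same objects through its client library at startup. The fanout exchange type, the default vhost (`%2F`), and the function names are assumptions. Re-running the `PUT` calls with identical properties leaves existing objects unchanged, which is the idempotent behaviour described above.

```bash
# Sketch: declare the ingestion topology via the management HTTP API.
# Assumptions: API on port 15672, default vhost "/" (%2F), fanout DLX.
RABBITMQ_API=${RABBITMQ_API:-http://localhost:15672/api}

api_call() {
    # $1 = HTTP method, $2 = path below /api, $3 = JSON body
    curl -fsS -u "${RABBITMQ_USER:-patchhound}:${RABBITMQ_PASSWORD}" \
        -H 'content-type: application/json' \
        -X "$1" "$RABBITMQ_API/$2" -d "$3"
}

declare_topology() {
    # Dead-letter exchange and its queue first, then the main queue that points at it.
    api_call PUT 'exchanges/%2F/ingestion-jobs-dlx' '{"type":"fanout","durable":true}' &&
    api_call PUT 'queues/%2F/ingestion-jobs-dead' '{"durable":true}' &&
    api_call POST 'bindings/%2F/e/ingestion-jobs-dlx/q/ingestion-jobs-dead' \
        '{"routing_key":"","arguments":{}}' &&
    api_call PUT 'queues/%2F/ingestion-jobs' \
        '{"durable":true,"arguments":{"x-dead-letter-exchange":"ingestion-jobs-dlx"}}'
}
```

Calling `declare_topology` against a running broker should leave both queues visible in the management UI. Redeclaring a queue with different arguments is rejected by the broker, so any change to the topology needs a migration step.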

**Affected files:**
- `docker-compose.yml` — add `rabbitmq` service, update `api` and `worker` dependencies
- `src/PatchHound.Infrastructure/Queues/RabbitMqIngestionJobQueue.cs` — new
- `src/PatchHound.Infrastructure/DependencyInjection.cs` — queue implementation selection
- `src/PatchHound.Infrastructure.csproj` — add `RabbitMQ.Client` or `MassTransit.RabbitMQ` package
- `.env.example` — add `RABBITMQ_USER`, `RABBITMQ_PASSWORD`

---

## 2. Grafana observability stack (optional docker-compose profile)

### Profile structure

Add an `observability` docker-compose profile containing:

| Service | Image | Purpose |
|---|---|---|
| `prometheus` | `prom/prometheus:v3` | Scrapes `/metrics` from API and Worker |
| `tempo` | `grafana/tempo:latest` | OTLP trace receiver + trace storage |
| `loki` | `grafana/loki:3` | Log aggregation (JSON log sink from API/Worker) |
| `grafana` | `grafana/grafana-oss:11` | Dashboards, explorer, alerting |

Services are only started when the profile is active:

```bash
docker compose --profile observability up
```

`api` and `worker` gain optional environment variables that are set only when the `observability` profile is used:

```
Telemetry__OtlpEndpoint: http://tempo:4317
```

The OpenTelemetry SDK (wired in #22) will check for this endpoint at startup. If absent, telemetry is a no-op.

### Provisioning

Grafana should start pre-configured with:
- Prometheus, Tempo, and Loki as provisioned data sources (`deploy/grafana/provisioning/datasources/`)
- A PatchHound dashboard provisioned from JSON (`deploy/grafana/provisioning/dashboards/patchhound.json`) covering:
  - Ingestion jobs enqueued / processed / dead-lettered per hour
  - Worker lease acquisition success/failure
  - p99 DB query latency
  - Active workflow and SLA breach counts
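
As a starting point for the data source provisioning, the following sketch writes a minimal `datasources.yml`. The function name `write_datasources` is hypothetical, and the URLs assume the compose service names from the table above with each backend listening on its usual HTTP port (Prometheus 9090, Loki 3100, and Tempo 3200 as set via `http_listen_port` in `deploy/tempo/tempo.yml`).

```bash
# Sketch: generate the Grafana data source provisioning file.
# Assumptions: service names match the compose file; ports as noted above.
write_datasources() {
    dir=${1:-deploy/grafana/provisioning/datasources}
    mkdir -p "$dir"
    cat > "$dir/datasources.yml" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
EOF
}
```

In practice the file would simply be committed to the repository; the generator form is used here only to keep the example self-contained.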

### Log shipping

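API and Worker write their logs to stdout. Note that the default .NET console formatter emits plain text, so structured JSON output needs the JSON console formatter (`AddJsonConsole`) or an equivalent logger configuration. Loki collects these logs via the Docker log driver or a Promtail sidecar: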

```yaml
promtail:
  image: grafana/promtail:3
  profiles: [observability]
  volumes:
    - /var/lib/docker/containers:/var/lib/docker/containers:ro
    - /var/run/docker.sock:/var/run/docker.sock
    - ./deploy/promtail/config.yml:/etc/promtail/config.yml:ro
```

**New files:**
- `deploy/grafana/provisioning/datasources/datasources.yml`
- `deploy/grafana/provisioning/dashboards/dashboards.yml`
- `deploy/grafana/provisioning/dashboards/patchhound.json`
- `deploy/promtail/config.yml`
- `deploy/prometheus/prometheus.yml`
- `deploy/tempo/tempo.yml`

**Affected files:**
- `docker-compose.yml` — add `observability` profile services
- `.env.example` — add `GF_SECURITY_ADMIN_PASSWORD`

---

## 3. Automated OpenBao bootstrap

### Problem

The current setup requires six manual steps before PatchHound can read any secrets. A first-time operator who runs `docker compose up` without reading `deploy/openbao/README.md` will see secret reads fail silently, with no indication of why.

### Solution — init container + auto-unseal via transit seal (or file-based auto-unseal for dev)

Two modes:

**Mode A — Dev / self-hosted single-node (default):** Use OpenBao's file-based auto-unseal by configuring a static unseal key in the `openbao.hcl` via an environment variable. This is not suitable for production secret management but eliminates the manual unseal step for self-hosted deployments where the host machine is the trust boundary.

```hcl
# deploy/openbao/config/openbao-dev.hcl
seal "shamir" {} # default, but override with auto-unseal config

# Alternative: use transit seal with a local key — no KMS dependency
```

A simpler approach: run OpenBao in dev mode (`-dev` flag) with `BAO_DEV_ROOT_TOKEN_ID` set (the OpenBao counterpart of Vault's `VAULT_DEV_ROOT_TOKEN_ID`), which starts already initialised and unsealed. Dev mode keeps all data in memory, so every secret is lost when the container restarts; this fits only single-host deployments that can re-seed their secrets on each start.

**Mode B — Production self-hosted:** Keep the current manual init/unseal flow (with the existing Makefile) but document it clearly as the production path.

**Automated bootstrap init container:**

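Add an `openbao-init` one-shot service that runs after OpenBao is healthy. The `openbao` healthcheck must therefore pass while the server is still uninitialised and sealed: `/v1/sys/health` returns 501 when uninitialised and 503 when sealed unless the `uninitcode` and `sealedcode` query parameters override those codes. The service checks whether OpenBao is already initialised, and if not: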
1. Initialises (`bao operator init`)
2. Unseals with the generated keys
3. Creates the KV mount
4. Writes the policy
5. Creates the application token
6. Writes the token to a shared Docker volume so the API and Worker containers can read it

```yaml
openbao-init:
  image: openbao/openbao
  depends_on:
    openbao:
      condition: service_healthy
  volumes:
    - openbao_init:/init-output
    - ./deploy/openbao/scripts/bootstrap.sh:/bootstrap.sh:ro
  entrypoint: ["/bin/sh", "/bootstrap.sh"]
  environment:
    BAO_ADDR: http://openbao:8200
    KV_MOUNT: ${OPENBAO_KV_MOUNT:-patchhound}
```

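The `api` and `worker` services mount the same volume read-only, wait for `openbao-init` with `condition: service_completed_successfully`, and read `OPENBAO_TOKEN` from the volume via an entrypoint wrapper: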

```bash
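#!/bin/sh
# entrypoint.sh: load the token written by openbao-init, then run the real command
set -eu  # abort if the token file is missing instead of starting with an empty token
OPENBAO_TOKEN="$(cat /init-output/app-token.txt)"
export OPENBAO_TOKEN
exec "$@"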
```

Alternatively: extend the existing Makefile `all` target to be the canonical first-run command and document it as the single entry point (`make -C deploy/openbao setup`).

**`deploy/openbao/scripts/bootstrap.sh`** — idempotent init + unseal + provision script (extracted from the Makefile targets and made container-friendly).
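
A minimal sketch of what the six bootstrap steps could look like in that script follows. Several details are assumptions rather than decisions made in this issue: a single unseal key share, the file name `init.txt`, the policy name `patchhound-app`, and storing the init output on the shared volume. Keeping the unseal key and root token on that volume only fits Mode A, where the host is the trust boundary.

```bash
# Sketch of bootstrap.sh. Hypothetical names: init.txt, patchhound-app.
bootstrap() (
    set -eu
    umask 077                          # key material must not be world-readable
    out=${INIT_OUTPUT:-/init-output}
    mount=${KV_MOUNT:-patchhound}

    # `bao operator init -status` exits 0 if initialised, 2 if not, 1 on error.
    rc=0
    bao operator init -status >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 2 ]; then
        bao operator init -key-shares=1 -key-threshold=1 > "$out/init.txt"
        rm -f "$out/app-token.txt"     # a token from a previous instance is stale
    elif [ "$rc" -ne 0 ]; then
        echo "openbao not reachable (exit $rc)" >&2
        exit 1
    fi

    unseal_key=$(sed -n 's/^Unseal Key 1: //p' "$out/init.txt")
    BAO_TOKEN=$(sed -n 's/^Initial Root Token: //p' "$out/init.txt")
    export BAO_TOKEN

    # `bao status` exits non-zero while sealed; only then is an unseal needed.
    bao status >/dev/null 2>&1 || bao operator unseal "$unseal_key" >/dev/null

    # Enable the KV mount only if it is not listed yet.
    bao secrets list | grep -q "^$mount/" ||
        bao secrets enable -path="$mount" -version=2 kv >/dev/null

    # Policy writes overwrite in place, so this is safe to repeat.
    printf 'path "%s/*" {\n  capabilities = ["create", "read", "update", "list"]\n}\n' "$mount" |
        bao policy write patchhound-app - >/dev/null

    # Create the application token once and publish it on the shared volume.
    [ -s "$out/app-token.txt" ] ||
        bao token create -policy=patchhound-app -field=token > "$out/app-token.txt"
)
```

The real script would end with a call to `bootstrap`. Each step checks existing state before acting, so re-running the container after a restart unseals OpenBao without creating a second mount or token. Token TTL and renewal are not handled in this sketch.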

**New/affected files:**
- `deploy/openbao/config/openbao-dev.hcl` — dev mode config (pre-init, auto-unseal)
- `deploy/openbao/scripts/bootstrap.sh` — idempotent bootstrap script
- `docker-compose.yml` — `openbao-init` one-shot service, entrypoint changes for `api`/`worker`
- `deploy/openbao/README.md` — update to describe both dev and production paths

---

## 4. `infra/stacks/selfhosted/` documentation

Fill the currently empty directory with:

- `README.md` — canonical self-hosted runbook: prerequisites (Docker ≥ 27, compose ≥ 2.24), first-run steps, profile options, upgrade path, backup guidance (pg_dump, OpenBao snapshot, RabbitMQ definitions export)
- `docker-compose.override.yml.example` — template for local overrides (custom ports, volume paths, SSL termination)

The `infra/stacks/selfhosted/` directory should be the single reference for anyone deploying PatchHound on their own infrastructure.

---

## Acceptance criteria

### RabbitMQ
- [ ] `docker compose up` starts a healthy RabbitMQ container alongside all existing services
- [ ] Ingestion jobs are enqueued to RabbitMQ when their cron schedule is due (once #22 ingestion work is complete)
- [ ] A worker container restart does not lose queued ingestion jobs
- [ ] Failed ingestion jobs (after N retries) appear in the `ingestion-jobs-dead` queue and do not block other jobs
- [ ] The RabbitMQ management UI is accessible at `http://localhost:15672`
- [ ] `InProcessIngestionJobQueue` remains as a dev fallback when no `ConnectionStrings:RabbitMq` is configured

### Observability
- [ ] `docker compose --profile observability up` starts Prometheus, Tempo, Loki, Promtail, and Grafana
- [ ] Grafana is accessible at `http://localhost:3000` (Grafana's default port, or the configured port; note that 3100 is Loki's default) with pre-provisioned data sources
- [ ] The PatchHound dashboard shows ingestion job throughput, p99 DB latency, and active workflow counts after at least one ingestion run
- [ ] When the `observability` profile is not active, the API and Worker start normally with no telemetry errors

### OpenBao bootstrap
- [ ] `docker compose up` on a fresh volume initialises, unseals, and configures OpenBao automatically — no manual steps required
- [ ] The application token is available to API and Worker containers without being hardcoded in `docker-compose.yml`
- [ ] A volume reset followed by `docker compose up` re-bootstraps correctly
- [ ] The existing Makefile production flow (`make -C deploy/openbao all`) remains functional for operators who want explicit control

### Documentation
- [ ] `infra/stacks/selfhosted/README.md` covers first-run, profiles, upgrade, and backup
- [ ] `deploy/openbao/README.md` is updated to reflect the automated bootstrap path and notes when manual setup is preferred

---

## Dependencies

- #22 must define the `IIngestionJobQueue` interface and the OpenTelemetry SDK wiring before the RabbitMQ implementation and Grafana stack can be built on top of it
- Phase 3 of #17 must be complete before end-to-end ingestion via RabbitMQ can be tested

## Related

- #22 — Azure IaC, Azure Service Bus, OpenTelemetry SDK (application-level changes shared by both issues)
