Merged
59 changes: 58 additions & 1 deletion .github/workflows/ci.yml
@@ -7,6 +7,7 @@ on:

permissions:
  contents: read
  packages: write

jobs:
  validate:
@@ -22,4 +23,60 @@ jobs:
      - run: python -m mypy
      - run: python -m pytest
      - run: python -m cas_reference_product.evidence
      - run: docker build --platform linux/amd64 -t cas-reference-product:ci .

  docker:
    runs-on: ubuntu-latest
    needs: validate
    steps:
      - uses: actions/checkout@93cb6efe18208431cddfb8368fd83d5badbf9bfd # v5

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@b5730b14e8b0bc39f62ade545785cb7e6f44b97c # v3

      - name: Build image
        uses: docker/build-push-action@471d1dc4e07e5cdedd4c2171150001c434f0b4b5 # v6
        with:
          context: .
          platforms: linux/amd64
          push: false
          load: true
          tags: cas-reference-product:ci
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Health-check smoke test
        run: |
          docker run -d --name cas-ci -p 8080:8080 \
            -e ENVIRONMENT=local -e WORKFLOW_BACKEND=local \
            cas-reference-product:ci
          # Wait up to 30s for the app to become ready
          for i in $(seq 1 15); do
            if curl -sf http://localhost:8080/health/ready; then
              echo "App is ready"; break
            fi
            sleep 2
          done
          curl -sf http://localhost:8080/health/live
          curl -sf http://localhost:8080/health/ready
          docker stop cas-ci

      - name: Log in to GHCR
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772 # v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Push to GHCR
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        uses: docker/build-push-action@471d1dc4e07e5cdedd4c2171150001c434f0b4b5 # v6
        with:
          context: .
          platforms: linux/amd64
          push: true
          tags: |
            ghcr.io/coding-autopilot-system/cas-reference-product:latest
            ghcr.io/coding-autopilot-system/cas-reference-product:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
18 changes: 16 additions & 2 deletions .planning/REQUIREMENTS.md
@@ -15,6 +15,19 @@
- [x] **QUAL-01**: Unit, API, contract, static, and container configuration checks run in CI.
- [x] **DOC-01**: Architecture, threat model, operations, and local workflow are documented.

## Phase 2 Requirements — Telemetry Hardening

- [x] **TEL-01**: OpenTelemetry SDK wired into FastAPI lifespan — a `cas.api.workflows.execute` span is created for every /api/v1/workflows request, carrying `cas.correlation_id`, `cas.run_id`, and `cas.intent` attributes.
- [x] **TEL-02**: Canonical CAS lifecycle events emitted as span events: `workflow.started` (with correlation_id + run_id), `workflow.completed` on success, `workflow.failed` on error — never both completed and failed.
- [x] **TEL-03**: W3C trace context headers (`traceparent` / `tracestate`) propagated on inbound requests via `W3CTraceContextMiddleware`; downstream spans are parented to the caller's trace.
- [x] **TEL-04**: Application Insights exporter active when `APPLICATIONINSIGHTS_CONNECTION_STRING` env var is set (no-op if absent); uses managed identity and privacy-hardened instrumentation options.
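
TEL-03 depends on reading the W3C `traceparent` header. The diff does not show how `W3CTraceContextMiddleware` does this, so the following is only a minimal standard-library sketch of the header layout, not the project's implementation. The names `TraceParent` and `parse_traceparent` are hypothetical, and the sketch accepts only the four-field layout used by version `00`.

```python
import re
from typing import NamedTuple, Optional

# version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    """Hypothetical container for the four traceparent fields."""

    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None when it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero identifiers are invalid per the W3C spec.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

A middleware would typically hand the raw headers to an OpenTelemetry propagator rather than parse them by hand; the sketch only illustrates which parts of the header identify the caller's trace and parent span.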

## Phase 3 Requirements — Docker + CI Publish

- [x] **DOCK-01**: Dockerfile is multi-stage (builder + runtime), targets linux/amd64, exposes port 8080, runs as non-root user `app`, and health-checks via /health/ready.
- [x] **DOCK-02**: `docker-compose.yml` defines a local dev stack (`cas-ref` service, ports 8080:8080, env_file .env.example) that starts without Azure credentials.
- [x] **DOCK-03**: CI pipeline (`docker` job in ci.yml) builds the image, runs a health-check smoke test, and pushes to `ghcr.io/coding-autopilot-system/cas-reference-product` on merge to main.

## Out of Scope

| Feature | Reason |
@@ -26,7 +39,8 @@
## Traceability

All v0.1 requirements map to Phase 1 and are complete.
TEL-01 through TEL-04 map to Phase 2 and are complete.
DOCK-01 through DOCK-03 map to Phase 3 and are complete.

---
-*Last updated: 2026-06-11 after v0.1 implementation*
+*Last updated: 2026-06-14 after Phase 2 and Phase 3 implementation*
27 changes: 27 additions & 0 deletions .planning/ROADMAP.md
@@ -14,3 +14,30 @@

Status: Complete

## Phase 2: Telemetry Hardening

**Goal:** Wire OpenTelemetry end-to-end with lifecycle span events, W3C trace propagation, and Application Insights exporter.

**Requirements:** TEL-01, TEL-02, TEL-03, TEL-04

**Success criteria:**
- Every /api/v1/workflows request creates a `cas.api.workflows.execute` span with correlation_id, run_id, and intent attributes.
- Span events `workflow.started`, `workflow.completed`, and `workflow.failed` are emitted at the appropriate lifecycle points.
- W3C traceparent/tracestate headers on inbound requests are extracted and linked as parent context.
- Application Insights exporter activates when APPLICATIONINSIGHTS_CONNECTION_STRING is set; no-op otherwise.
- All telemetry behaviours verified by pytest with InMemorySpanExporter.
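
The "activates when set; no-op otherwise" criterion amounts to a small gate on one environment variable. The sketch below shows that gate with the standard library only. The function names are hypothetical, the real `configure_telemetry` is not shown in this diff, and treating a whitespace-only value as unset is an assumption.

```python
import os
from typing import Mapping, Optional

CONNECTION_STRING_VAR = "APPLICATIONINSIGHTS_CONNECTION_STRING"


def exporter_connection_string(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the connection string when it is set, else None."""
    source = os.environ if env is None else env
    # Assumption: an empty or whitespace-only value counts as "not set".
    value = source.get(CONNECTION_STRING_VAR, "").strip()
    return value or None


def describe_exporter(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: report which exporter path would be taken."""
    if exporter_connection_string(env) is None:
        return "no-op"
    return "application-insights"
```

Passing the environment as a mapping keeps the gate testable without mutating `os.environ`, which fits the criterion that telemetry behaviour is verified by pytest.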

Status: Complete

## Phase 3: Docker + CI Publish

**Goal:** Containerize the app with a production-grade multi-stage Dockerfile and publish the image to GHCR on merge to main.

**Requirements:** DOCK-01, DOCK-02, DOCK-03

**Success criteria:**
- Multi-stage Dockerfile builds a linux/amd64 image, runs as the non-root user `app`, exposes port 8080, and health-checks via /health/ready.
- docker-compose.yml starts the local dev stack with env stubs using .env.example.
- CI docker job builds, smoke-tests (/health/live + /health/ready), and on push to main pushes to ghcr.io/coding-autopilot-system/cas-reference-product.
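
The smoke test in the CI job polls /health/ready in a shell loop before making its final checks. The same wait-then-verify pattern can be sketched in Python, for example for a local script. The helper names `http_ok` and `wait_until_ready` are hypothetical and not part of this PR, and unlike the shell loop the sketch does not sleep after the last failed attempt.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True when the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError and HTTPError subclass OSError; so do refused connections.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 15,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe up to `attempts` times, pausing between failed tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A caller could pass `lambda: http_ok("http://localhost:8080/health/ready")` as the probe and fail the run when `wait_until_ready` returns `False`, which makes the timeout explicit instead of relying on a later request to fail.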

Status: Complete
57 changes: 57 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,57 @@
# cas-reference-product

Production-oriented CAS reference application demonstrating a complete workload integrated with **Microsoft Foundry Next Gen Agents** on a Container Apps foundation. Runs locally without Azure, deploys unmodified through the `cas-platform` interface.

## Project Context

See `.planning/PROJECT.md` for goals and requirements. This project is in early initialization.

**Core mandate**: Demonstrate canonical CAS lifecycle events, managed identity, observability, probes, and a safe local workflow — without embedding any Azure credentials.

## Tech Stack

| Layer | Technology |
|---|---|
| Language | Python 3.12+ |
| API framework | FastAPI + Pydantic |
| Azure identity | `ManagedIdentityCredential` (system-assigned; no embedded secrets) |
| Azure AI | Foundry Next Gen Agents (`WorkflowAgentService`) — never Classic Assistants |
| Observability | OpenTelemetry + Azure Application Insights |
| Container | Linux AMD64, port 8080 |
| Tests | pytest (in `tests/`) |
| Local dev | `scripts/run-local.ps1` |

## Key Files

| File | Purpose |
|---|---|
| `src/cas_reference_product/identity.py` | Identity/credential resolution (local vs. managed) |
| `.foundry/agent-metadata.yaml` | Foundry Next Gen Agent configuration |
| `.foundry/datasets/` | Seed data for local testing |
| `.env.example` | All required environment variable docs |
| `scripts/run-local.ps1` | Local run without Azure |

## Local Development

```powershell
.\scripts\run-local.ps1
```

Or manually:
```powershell
cd portfolio/cas-reference-product
python -m venv .venv && .\.venv\Scripts\Activate.ps1
pip install -r requirements.txt # if present, else pip install -e .
python -m pytest tests/
```

## Constraints

- **No embedded credentials** — use `DefaultAzureCredential` / `ManagedIdentityCredential` only
- **No Azure resource deployment** — local adapter runs without provisioning
- **Foundry Next Gen only** — reject any Classic Assistants (`asst_*`) usage
- **Public repo** — no sensitive data in examples or defaults

## GSD Workflow

Use `/gsd:plan-phase` before any multi-file change. Use `/gsd:quick` for single-file fixes.
41 changes: 34 additions & 7 deletions Dockerfile
@@ -1,22 +1,49 @@
-FROM python:3.12-slim AS runtime
+# --------------------------------------------------------------------------- #
+# Stage 1 — builder: install dependencies into an isolated wheel cache #
+# --------------------------------------------------------------------------- #
+FROM python:3.12-slim AS builder

 ENV PYTHONDONTWRITEBYTECODE=1 \
     PYTHONUNBUFFERED=1 \
     PIP_DISABLE_PIP_VERSION_CHECK=1 \
-    PORT=8080
+    PIP_NO_CACHE_DIR=1

+WORKDIR /build

+COPY pyproject.toml README.md ./
+COPY src ./src

+RUN pip install --no-cache-dir --no-compile --prefix=/install .

+# --------------------------------------------------------------------------- #
+# Stage 2 — runtime: minimal image, non-root user, port 8080 #
+# --------------------------------------------------------------------------- #
+FROM python:3.12-slim AS runtime

+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    PORT=8080 \
+    PYTHONPATH=/app/src

 WORKDIR /app

-RUN addgroup --system app && adduser --system --ingroup app app
+# Create non-root user and group
+RUN groupadd --system app && useradd --system --gid app --no-create-home app

-COPY pyproject.toml README.md ./
+# Copy installed packages from builder stage
+COPY --from=builder /install /usr/local

+# Copy application source
 COPY src ./src
-RUN pip install --no-cache-dir --no-compile .

 USER app

 EXPOSE 8080

 STOPSIGNAL SIGTERM
-HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
-    CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8080/health/live', timeout=2)"

+# Health check via /health/ready — verifies app is truly ready (DOCK-01)
+HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8080/health/ready', timeout=4)"

 CMD ["uvicorn", "cas_reference_product.app:app", "--host", "0.0.0.0", "--port", "8080"]
32 changes: 32 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,32 @@
# Local dev stack — mirrors the cas-platform container interface (DOCK-02)
# Usage: docker compose up --build
# Env: copy .env.example to .env and populate values as needed.

services:
  cas-ref:
    build:
      context: .
      dockerfile: Dockerfile
      target: runtime
      platforms:
        - linux/amd64
    image: cas-reference-product:local
    ports:
      - "8080:8080"
    env_file:
      - .env.example
    environment:
      # Override any .env.example defaults here for local dev
      ENVIRONMENT: local
      WORKFLOW_BACKEND: local
    healthcheck:
      test:
        - CMD
        - python
        - -c
        - "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8080/health/ready', timeout=4)"
      interval: 30s
      timeout: 5s
      start_period: 15s
      retries: 3
    restart: unless-stopped
1 change: 1 addition & 0 deletions pyproject.toml
@@ -15,6 +15,7 @@ dependencies = [
"azure-monitor-opentelemetry>=1.6.0",
"fastapi>=0.115.0",
"opentelemetry-api>=1.29.0",
"opentelemetry-sdk>=1.29.0",
"pydantic-settings>=2.7.0",
"uvicorn[standard]>=0.34.0"
]
46 changes: 40 additions & 6 deletions src/cas_reference_product/app.py
@@ -3,12 +3,15 @@
 from typing import Any

 from fastapi import FastAPI, HTTPException, Request
+from opentelemetry import trace

 from .config import Settings, get_settings
 from .models import PromptEnvelope, WorkflowResult
-from .telemetry import configure_telemetry
+from .telemetry import W3CTraceContextMiddleware, configure_telemetry
 from .workflow import WorkflowAgentServiceError, WorkflowOrchestrator, build_workflow_agent_service

+_tracer = trace.get_tracer(__name__)


 def create_app(settings: Settings | None = None) -> FastAPI:
     app_settings = settings or get_settings()
@@ -19,6 +22,8 @@ async def lifespan(_: FastAPI) -> AsyncIterator[None]:
         yield

     app = FastAPI(title="CAS Reference Product", version="0.1.0", lifespan=lifespan)
+    app.add_middleware(W3CTraceContextMiddleware)

     service = build_workflow_agent_service(app_settings) if app_settings.ready else None

     @app.get("/health/live")
@@ -36,11 +41,40 @@ def execute(envelope: PromptEnvelope, request: Request) -> WorkflowResult:
         if service is None:
             raise HTTPException(status_code=503, detail="Workflow backend is not ready")
         request.state.correlation_id = envelope.correlationId
-        orchestrator = WorkflowOrchestrator(service, app_settings.repository)
-        try:
-            return orchestrator.execute(envelope)
-        except WorkflowAgentServiceError:
-            raise HTTPException(status_code=502, detail="Workflow backend request failed") from None
+        with _tracer.start_as_current_span("cas.api.workflows.execute") as span:
+            span.set_attribute("cas.correlation_id", envelope.correlationId)
+            span.set_attribute("cas.run_id", envelope.runId)
+            span.set_attribute("cas.intent", envelope.intent)
+            span.add_event(
+                "workflow.started",
+                attributes={
+                    "cas.correlation_id": envelope.correlationId,
+                    "cas.run_id": envelope.runId,
+                },
+            )
+            orchestrator = WorkflowOrchestrator(service, app_settings.repository)
+            try:
+                result = orchestrator.execute(envelope)
+            except WorkflowAgentServiceError:


P2: Emit the failed lifecycle event for unexpected workflow errors

If the workflow service or orchestrator raises anything other than WorkflowAgentServiceError—for example a local adapter bug or result-validation failure—the exception bypasses this handler, leaving the API span with workflow.started but no workflow.failed event. WorkflowOrchestrator.execute explicitly rethrows all exceptions, so the failure event should be added for the general exception path while preserving the existing 502 translation only for WorkflowAgentServiceError.
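
One way to read this suggestion is sketched below, using only the standard library. `RecordingSpan` is a hypothetical stand-in that mimics the `add_event(name, attributes=...)` call used in the diff, and `run_with_lifecycle_events` is an illustrative helper, not code from this PR.

```python
from __future__ import annotations

from typing import Any, Callable


class RecordingSpan:
    """Hypothetical stand-in that records events like a span would."""

    def __init__(self) -> None:
        self.events: list[tuple[str, dict[str, Any]]] = []

    def add_event(self, name: str, attributes: dict[str, Any] | None = None) -> None:
        self.events.append((name, dict(attributes or {})))


def run_with_lifecycle_events(
    span: RecordingSpan,
    ids: dict[str, Any],
    work: Callable[[], Any],
) -> Any:
    """Emit workflow.started, then exactly one of completed or failed."""
    span.add_event("workflow.started", attributes=ids)
    try:
        result = work()
    except Exception:
        # Covers the backend error type and any unexpected exception alike.
        span.add_event("workflow.failed", attributes={**ids, "error": True})
        raise  # the caller still decides how to translate the error
    span.add_event("workflow.completed", attributes=ids)
    return result
```

Under this shape the route handler would wrap the call in `except WorkflowAgentServiceError` to keep the existing 502 translation, while every other exception propagates unchanged after `workflow.failed` has been recorded.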


+                span.add_event(
+                    "workflow.failed",
+                    attributes={
+                        "cas.correlation_id": envelope.correlationId,
+                        "cas.run_id": envelope.runId,
+                        "error": True,
+                    },
+                )
+                raise HTTPException(
+                    status_code=502, detail="Workflow backend request failed"
+                ) from None
+            span.add_event(
+                "workflow.completed",
+                attributes={
+                    "cas.correlation_id": envelope.correlationId,
+                    "cas.run_id": envelope.runId,
+                },
+            )
+            return result

     @app.get("/")
     def root() -> dict[str, Any]: