Commit 678bb69

neelay-aign and claude committed
feat: add SRE incident response agent using Anthropic Managed Agents
Add a background SRE agent that triages BetterStack incidents for the Python SDK and creates fix PRs via the GitHub MCP server.

Architecture: BetterStack webhook -> GitHub repository_dispatch -> GH Actions workflow -> Managed Agent session on Anthropic infra.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 1b1b4b6 commit 678bb69

10 files changed

Lines changed: 939 additions & 0 deletions

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+name: SRE Incident Response
+
+on:
+  repository_dispatch:
+    types: [betterstack-incident]
+  workflow_dispatch:
+    inputs:
+      incident_id:
+        description: "BetterStack incident ID to triage (leave empty if simulating)"
+        required: false
+      simulate:
+        description: "Use a simulated incident instead of fetching from API"
+        type: boolean
+        default: true
+
+concurrency:
+  group: sre-incident-${{ github.event.client_payload.incident_id || inputs.incident_id || 'manual' }}
+  cancel-in-progress: false
+
+jobs:
+  triage:
+    runs-on: ubuntu-latest
+    timeout-minutes: 30
+    permissions:
+      contents: read
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: astral-sh/setup-uv@v6
+
+      - name: Install dependencies
+        working-directory: sre-agent
+        run: uv sync
+
+      - name: Run SRE agent
+        working-directory: sre-agent
+        run: uv run python -m sre_agent.main
+        env:
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+          SRE_BETTERSTACK_API_TOKEN: ${{ secrets.BETTERSTACK_API_TOKEN }}
+          SRE_AGENT_ID: ${{ secrets.SRE_AGENT_ID }}
+          SRE_ENVIRONMENT_ID: ${{ secrets.SRE_ENVIRONMENT_ID }}
+          SRE_VAULT_ID: ${{ secrets.SRE_VAULT_ID }}
+          SRE_GITHUB_REPO: aignostics/python-sdk
+          INCIDENT_ID: ${{ github.event.client_payload.incident_id || inputs.incident_id }}
+          SIMULATE: ${{ inputs.simulate || 'false' }}
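Review note: the workflow triggers on a `repository_dispatch` event of type `betterstack-incident` and reads `client_payload.incident_id`. The part that turns a BetterStack webhook into that event is not shown in this diff, so the following is only a sketch of what such a relay might send to GitHub's "create a repository dispatch event" endpoint. `build_dispatch_request` and `concurrency_group` are hypothetical helpers, the target repository and token are assumptions, and the request is built but never sent.

```python
import json
import urllib.request
from typing import Optional

GITHUB_API = "https://api.github.com"


def build_dispatch_request(repo: str, incident_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a repository_dispatch request for one incident."""
    body = {
        # Must match `types: [betterstack-incident]` in the workflow.
        "event_type": "betterstack-incident",
        # Surfaces in the workflow as github.event.client_payload.incident_id.
        "client_payload": {"incident_id": incident_id},
    }
    return urllib.request.Request(
        url=f"{GITHUB_API}/repos/{repo}/dispatches",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def concurrency_group(payload_id: Optional[str], input_id: Optional[str]) -> str:
    """Mirror the `a || b || 'manual'` fallback used in the concurrency group."""
    return f"sre-incident-{payload_id or input_id or 'manual'}"


# Hypothetical values for illustration only.
request = build_dispatch_request("aignostics/python-sdk", "12345", "example-token")
```

Because `cancel-in-progress` is `false`, two dispatches that resolve to the same group name queue behind each other rather than cancelling, so a repeated alert for one incident does not interrupt a triage run that is already in progress.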

sre-agent/pyproject.toml

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+[project]
+name = "sre-agent"
+version = "0.1.0"
+description = "SRE incident response agent for the Aignostics Python SDK"
+requires-python = ">=3.12"
+dependencies = [
+    "anthropic>=0.52.0",
+    "pydantic-settings>=2.0.0",
+]
+
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+]
+
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
+[tool.hatch.build.targets.wheel]
+packages = ["src/sre_agent"]
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+---
+name: sre-runbook
+description: Repo-specific triage context for Aignostics Python SDK incidents
+---
+
+# Aignostics Python SDK -- SRE Triage Guide
+
+## Incident Types
+
+### "Scheduled Audit" incidents
+- Cause: A dependency has a known CVE, or a license violation was detected.
+- The audit runs hourly via .github/workflows/_scheduled-audit.yml.
+- Tools used: pip-audit, pip-licenses, trivy.
+- Common fix: bump the vulnerable dependency in pyproject.toml,
+  then run `uv lock --upgrade-package <pkg>`.
+
+### "Scheduled Testing" incidents (staging)
+- Cause: Unit, integration, or e2e tests failed against staging.
+- Runs every 6 hours via .github/workflows/_scheduled-test-hourly.yml.
+- Check the workflow run logs for which test(s) failed.
+- Common causes: flaky tests, dependency updates, API contract changes.
+
+### "Scheduled Testing" incidents (production)
+- Cause: Tests failed against production environment.
+- Runs daily via .github/workflows/_scheduled-test-daily.yml.
+- Common causes: platform API changes, credential expiry.
+- These often require human intervention -- create an issue, not a PR.
+
+## Repo-Specific Context
+- Package manager: uv (not pip). Use `uv sync`, `uv add`, `uv run`.
+- Linting: `make lint` (ruff + mypy + pyright)
+- Testing: `make test_unit`, `make test_integration`, `make test_e2e`
+- Security audit: `make audit` (pip-audit + pip-licenses + trivy)
+- Dependency bumps: edit pyproject.toml, run `uv lock --upgrade-package <pkg>`
+- CI workflows live in .github/workflows/
+- Scheduled tests send heartbeats to BetterStack (see _scheduled-test-*.yml)
+
+## PR Conventions
+- Conventional commits: feat(...), fix(...), chore(deps): ...
+- Always add labels: "sre-agent", "skip:test:long_running"
+- Create DRAFT PRs only
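Review note: the runbook's three incident types can be read as a small routing table. The sketch below expresses that reading in Python. It is not code from this commit: `TriagePlan` and `plan_for` are hypothetical names, matching on the incident name is an assumption about what BetterStack sends, and the `environment` argument assumes the caller already knows whether the failure was on staging or production. The `investigate` action for staging reflects that the runbook states no default outcome there.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TriagePlan:
    incident_type: str
    workflow: str        # workflow file named in the runbook
    default_action: str  # "draft_pr", "issue", or "investigate"


def plan_for(incident_name: str, environment: str = "staging") -> Optional[TriagePlan]:
    """Map an incident name to the runbook section that covers it."""
    name = incident_name.lower()
    if "scheduled audit" in name:
        # Runbook: common fix is a dependency bump, which suits a draft PR.
        return TriagePlan("audit", ".github/workflows/_scheduled-audit.yml", "draft_pr")
    if "scheduled testing" in name:
        if environment == "production":
            # Runbook: often needs human intervention, so open an issue.
            return TriagePlan(
                "test-production", ".github/workflows/_scheduled-test-daily.yml", "issue"
            )
        # Runbook: check the logs first; no default outcome is given.
        return TriagePlan(
            "test-staging", ".github/workflows/_scheduled-test-hourly.yml", "investigate"
        )
    return None  # Unknown incident type: nothing in the runbook applies.


plan = plan_for("Scheduled Testing", environment="production")
```

Returning `None` for unrecognised names keeps the sketch from guessing; in the commit itself this judgement is left to the agent, guided by the system prompt and this skill file.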
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+"""SRE incident response agent for the Aignostics Python SDK."""
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+"""Allow running with `python -m sre_agent`."""
+
+from sre_agent.main import main
+
+main()

sre-agent/src/sre_agent/_config.py

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+"""Configuration for the SRE incident response agent."""
+
+from __future__ import annotations
+
+from pydantic import SecretStr
+from pydantic_settings import BaseSettings, SettingsConfigDict
+
+
+class SREAgentSettings(BaseSettings):
+    """Settings loaded from environment variables."""
+
+    model_config = SettingsConfigDict(env_prefix="SRE_")
+
+    # Anthropic Managed Agent resources (created by _setup.py)
+    agent_id: str
+    environment_id: str
+    vault_id: str
+
+    # BetterStack API (for fetching incident details)
+    betterstack_api_token: SecretStr
+
+    # GitHub repo to mount in the agent session
+    github_repo: str = "aignostics/python-sdk"
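Review note: with `env_prefix="SRE_"`, each field is read from the variable `SRE_<FIELD_NAME>`, which is why the workflow exports `SRE_AGENT_ID`, `SRE_ENVIRONMENT_ID`, `SRE_VAULT_ID`, `SRE_BETTERSTACK_API_TOKEN`, and `SRE_GITHUB_REPO`. The following standard-library sketch approximates that mapping so it can be read without pydantic installed. `SettingsSketch` and `load_settings` are hypothetical stand-ins, not part of the commit, and they skip what pydantic-settings adds on top (validation, `SecretStr` masking, case-insensitive lookup).

```python
from dataclasses import MISSING, dataclass, fields
from typing import Mapping

ENV_PREFIX = "SRE_"  # mirrors SettingsConfigDict(env_prefix="SRE_")


@dataclass(frozen=True)
class SettingsSketch:
    """Stand-in for SREAgentSettings with the same field names."""

    agent_id: str
    environment_id: str
    vault_id: str
    betterstack_api_token: str  # SecretStr in the real class
    github_repo: str = "aignostics/python-sdk"


def load_settings(env: Mapping[str, str]) -> SettingsSketch:
    """Read each field from ENV_PREFIX + FIELD_NAME, honouring defaults."""
    values = {}
    missing = []
    for field in fields(SettingsSketch):
        key = ENV_PREFIX + field.name.upper()
        if key in env:
            values[field.name] = env[key]
        elif field.default is MISSING:
            missing.append(key)  # required field with no value supplied
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    return SettingsSketch(**values)


# Hypothetical values for illustration only.
settings = load_settings(
    {
        "SRE_AGENT_ID": "agent_123",
        "SRE_ENVIRONMENT_ID": "env_123",
        "SRE_VAULT_ID": "vault_123",
        "SRE_BETTERSTACK_API_TOKEN": "token",
    }
)
```

`ANTHROPIC_API_KEY`, `INCIDENT_ID`, and `SIMULATE` carry no `SRE_` prefix in the workflow, so they fall outside this settings class and are presumably read elsewhere.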

sre-agent/src/sre_agent/_setup.py

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
+"""One-time setup: create agent, environment, skill, and vault on Anthropic.
+
+Usage:
+    SRE_GITHUB_PAT=ghp_... uv run python -m sre_agent._setup
+
+Prints the resource IDs to store as GitHub Actions secrets.
+"""
+
+from __future__ import annotations
+
+import os
+import sys
+from pathlib import Path
+
+import anthropic
+
+SYSTEM_PROMPT = """\
+You are an SRE incident response agent for the Aignostics Python SDK
+(github.com/aignostics/python-sdk).
+
+You have been triggered by a BetterStack incident alert. The alert
+includes the incident name, cause, and -- critically -- the GitHub
+Actions run URL from the failed workflow.
+
+Your job:
+
+1. Read the incident details: name, cause, and the failed run URL.
+2. Use the GitHub MCP to read the failed workflow run's logs. This is
+   your primary source of diagnostic information.
+3. Investigate the root cause:
+   - Read the workflow run logs to identify the specific failure.
+   - Check recent commits on main (git log in the mounted repo).
+   - Read the relevant GitHub Actions workflow YAML files.
+   - Use web_search to look up error messages, CVE details, or docs.
+4. Determine if the issue is fixable with a code change.
+5. If fixable: use the GitHub MCP to create a branch, commit the fix,
+   and open a draft PR with your analysis in the body.
+6. If not fixable or uncertain: use the GitHub MCP to create an issue
+   with your triage findings and recommended next steps.
+
+Constraints:
+- Always create DRAFT PRs, never regular PRs. Humans must review and merge.
+- Always cite evidence for your root cause analysis.
+- For dependency CVEs: bump the minimum safe version, run the audit check.
+- For test failures: check if the test is flaky (search for prior failures)
+  before proposing a code fix.
+- Never modify credentials, secrets, or authentication code.
+- Add the label "sre-agent" to any PR or issue you create.
+- Add the label "skip:test:long_running" to any PR you create.
+
+The repo uses: uv (package manager), pytest (testing), ruff (linting),
+mypy + pyright (type checking). CI runs on GitHub Actions.
+"""
+
+SKILL_DIR = Path(__file__).resolve().parent.parent.parent / "skills" / "sre-runbook"
+
+
+def main() -> None:
+    github_pat = os.environ.get("SRE_GITHUB_PAT", "")
+    if not github_pat:
+        print("Error: SRE_GITHUB_PAT environment variable is required.", file=sys.stderr)
+        sys.exit(1)
+
+    client = anthropic.Anthropic()
+
+    # 1. Upload runbook skill
+    skill_md = (SKILL_DIR / "SKILL.md").read_bytes()
+    skill = client.beta.skills.create(
+        display_title="sre-runbook",
+        files=[("sre-runbook/SKILL.md", skill_md, "text/markdown")],
+    )
+    print(f"Skill created: {skill.id} (version {skill.latest_version})")
+
+    # 2. Create environment
+    environment = client.beta.environments.create(
+        name="sre-incident-response",
+        config={"type": "cloud", "networking": {"type": "limited"}},
+    )
+    print(f"Environment created: {environment.id}")
+
+    # 3. Create vault with GitHub PAT
+    vault = client.beta.vaults.create(name="sre-github")
+    client.beta.vaults.credentials.create(
+        vault_id=vault.id,
+        name="github",
+        token=github_pat,
+    )
+    print(f"Vault created: {vault.id}")
+
+    # 4. Create agent
+    agent = client.beta.agents.create(
+        name="Aignostics SRE Incident Responder",
+        model="claude-sonnet-4-6",
+        system=SYSTEM_PROMPT,
+        mcp_servers=[
+            {
+                "type": "url",
+                "name": "github",
+                "url": "https://api.githubcopilot.com/mcp/",
+            },
+        ],
+        tools=[
+            {"type": "agent_toolset_20260401"},
+            {"type": "mcp_toolset", "mcp_server_name": "github"},
+        ],
+        skills=[
+            {"type": "custom", "skill_id": skill.id, "version": skill.latest_version},
+        ],
+    )
+    print(f"Agent created: {agent.id} (version {agent.version})")
+
+    # Print secrets to configure in GitHub Actions
+    print("\n--- Store these as GitHub Actions secrets ---")
+    print(f"SRE_AGENT_ID={agent.id}")
+    print(f"SRE_ENVIRONMENT_ID={environment.id}")
+    print(f"SRE_VAULT_ID={vault.id}")
+
+
+if __name__ == "__main__":
+    main()
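Review note: `SKILL_DIR` is built by walking three parents up from `_setup.py`, which is easy to miscount. The sketch below replays that walk on the file path shown in this diff, using `PurePosixPath` so nothing touches the filesystem. `skill_dir_for` is a hypothetical helper, and unlike the real code it works on a relative path rather than calling `.resolve()`.

```python
from pathlib import PurePosixPath


def skill_dir_for(module_path: str) -> PurePosixPath:
    """Apply the same three-parent walk that _setup.py uses for SKILL_DIR."""
    module = PurePosixPath(module_path)
    # _setup.py -> sre_agent/ -> src/ -> sre-agent/
    return module.parent.parent.parent / "skills" / "sre-runbook"


skill_md = skill_dir_for("sre-agent/src/sre_agent/_setup.py") / "SKILL.md"
print(skill_md)  # sre-agent/skills/sre-runbook/SKILL.md
```

So the setup script expects the runbook at `skills/sre-runbook/SKILL.md` under the `sre-agent` project root, next to `src/` rather than inside the installed package.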
