| title | Incident Post-Mortem Writer | |||||
|---|---|---|---|---|---|---|
| emoji | 🚨 | |||||
| colorFrom | red | |||||
| colorTo | blue | |||||
| sdk | docker | |||||
| app_port | 7860 | |||||
| pinned | false | |||||
| tags |
|
OpenEnv Hackathon 2026 — Grand Finale Submission
Theme #3.1 — World Modeling: Professional Tasks · Scaler AI Labs Enterprise Workflows bonus track
An OpenEnv environment where AI agents learn to write structured incident post-mortems from raw alert logs, Slack threads, and service dependency graphs — a multi-app enterprise workflow that simulates real Site Reliability Engineering.
📄 Read the Round 2 blog post with training results — includes +32.8% reward improvement from TRL fine-tuning.
Every SRE team writes post-mortems after incidents. It's a high-stakes, time-pressured task that requires:
- Reconstructing a timeline from noisy, incomplete logs
- Identifying root cause despite misleading signals and red herrings
- Assigning concrete action items with owners and deadlines
This environment trains and evaluates agents on exactly this workflow — one of the most practically valuable skills in modern software operations.
This environment goes beyond standard task simulation by introducing:
Evidence gating via QUERY_LOGS — critical root cause evidence is hidden behind precise service + time window queries. Incorrect queries return realistic noise logs, forcing intentional investigation rather than guessing.
Adversarial Slack signals — threads include senior engineers confidently blaming the wrong service, misleading correlations between symptoms and causes, and red herrings designed to trap pattern-matching agents.
Delayed and partial observability — the agent never sees full logs upfront. It must actively explore under a hard query budget (max 8 queries with escalating penalties), simulating real incident response under time pressure.
Multi-layer deterministic grading — root cause is evaluated across service identification, cause category classification, and semantic keyword validation. Not string matching. A correct answer written in different words still scores correctly.
These design choices simulate real-world incident response, where incomplete information, misleading signals, and time pressure are the norm — and where the difference between a good engineer and a great one is knowing where to look.
This environment is difficult for AI agents because:
Hidden evidence — critical logs are not visible upfront and must be discovered through precise QUERY_LOGS calls targeting the right service and time window.
Conflicting information — Slack threads contain confident but incorrect hypotheses from senior engineers, forcing the agent to trust data over authority.
Limited exploration budget — queries are constrained (max 8) with escalating penalties, preventing brute-force search strategies.
Structured output requirement — the agent must not only reason correctly but also produce a coherent, validated 5-section document with specific content requirements per section.
These constraints test true agentic reasoning — not just text generation.
The agent receives a realistic incident bundle: timestamped alert logs, a Slack thread from the on-call team, and a service dependency graph. It must investigate the incident and produce a complete 5-section post-mortem document.
The key mechanic is QUERY_LOGS — the agent must identify which service and time window to investigate. The real root cause evidence is hidden behind a specific log query. Wrong queries are penalized with escalating costs. This forces intentional reasoning rather than pattern matching.
| Task | Incident | Key Challenge |
|---|---|---|
| Easy | Single-service DB connection leak | Clean signals, clear root cause |
| Medium | Cascading failure from Redis TTL misconfiguration | Multiple services affected, deployment buried in Slack |
| Hard | Multi-service degradation with planted false root causes | Senior engineer confidently wrong in Slack, real evidence in non-obvious log window |
| Expert | Security breach via compromised API key | 3 false root causes, senior security engineer wrong twice, 3-minute evidence window, GDPR action items required |
The hard task deliberately plants two false root causes. The expert task goes further — a scheduled load test using the same API key creates a plausible innocent explanation, a senior security engineer with 12 years experience confidently blames the load test (twice), and the real evidence (Tor exit node + scope violation) is only visible in a precise 3-minute api-gateway audit log window. The expert task also reduces the query budget from 8 to 6 with steeper penalties.
| Action | Fields | Description |
|---|---|---|
QUERY_LOGS |
query_service, query_from, query_to |
Query logs for a specific service and time window |
WRITE_SECTION |
section_name, section_content |
Write one of 5 post-mortem sections |
ASSIGN_ACTION_ITEM |
action_item_description, action_item_owner, action_item_due_date |
Assign a structured action item |
SUBMIT |
— | Finalize and submit for grading |
Valid section_name values: summary, timeline, root_cause, impact, action_items
Each step returns a typed observation containing:
{
"goal": str, # Natural language task description
"incident_id": str, # Incident identifier
"incident_title": str, # Human-readable incident name
"alerts": List[AlertLog], # Timestamped alert logs (severity, service, message)
"slack_thread": List[SlackMessage], # On-call Slack conversation
"service_graph": List[ServiceDependency], # Which service depends on which
"step": int, # Current step number
"max_steps": int, # Episode limit (25)
"queries_used": int, # Queries consumed
"max_queries": int, # Query limit (8)
"sections": List[SectionStatus], # State of each section (unwritten/invalid/valid)
"last_action_result": str, # Feedback from last action
"retrieved_logs": List[AlertLog] | None # Logs from last QUERY_LOGS call
}Rewards are shaped throughout the episode — not just at the end:
| Signal | Reward |
|---|---|
Correct QUERY_LOGS (right service + time window) |
+0.06 |
| Valid section written | +0.03 |
| Structured action item assigned | +0.08 |
Wrong QUERY_LOGS (1st mistake) |
−0.05 |
Wrong QUERY_LOGS (2nd mistake) |
−0.08 |
Wrong QUERY_LOGS (3rd+ mistake) |
−0.12 to −0.18 |
| Overwriting an already-valid section | −0.02 |
| Missing section at SUBMIT | −0.10 per section |
| Final grader score at SUBMIT | 0.0 – 1.0 |
The final grader score (added at SUBMIT) covers 60–70% of total reward and uses a weighted 5-component formula.
Each task is scored by a deterministic grader (0.0–1.0):
| Component | Weight | How it's measured |
|---|---|---|
| Root cause | 30% | 3-layer: correct service (L1=0.40) + cause category (L2=0.35) + keywords (L3=0.25) |
| Timeline | 25% | Events matched within ±3 min tolerance against gold standard |
| Action items | 20% | Owner + due date + theme coverage |
| Impact | 15% | Word count + service mention + duration + scale |
| Completeness | 10% | All 5 sections present and validated |
The environment is fully deterministic — scenarios are static JSON, grading is a pure function, and identical action sequences always produce identical scores.
Root cause special rules:
- If L1 (service identification) = 0, score capped at 0.65
- If false root cause service mentioned before real service, L1 reduced to 0.15
- Timeline score < 0.4 caps root cause at 0.60 (forces reasoning over guessing)
Using llama-3.1-8b-instant via Groq API (runtime: ~200 seconds):
easy : 1.000 ████████████████████
medium: 0.985 ███████████████████
hard : 0.797 ███████████████
expert: 0.674 █████████████
avg : 0.864
The difficulty staircase is genuine — each level is harder for a different reason. The hard task misleads agents with a confidently wrong senior engineer and CDN red herrings. The expert task adds a third false root cause, a senior security engineer who is wrong twice, a 3-minute evidence window, and 4 hidden timeline events that require querying the correct log window to unlock. The baseline intentionally does not achieve perfect scores on medium, hard, or expert tasks, demonstrating that the environment is challenging yet solvable. Scores are consistent across runs and deterministic given the same action sequence.
| Method | Endpoint | Description |
|---|---|---|
| GET | /health |
Health check — returns {"status": "healthy"} |
| POST | /reset |
Start new episode. Body: {"difficulty": "easy|medium|hard"} |
| POST | /step |
Execute action. Body: action JSON |
| GET | /state |
Current episode state |
| GET | /tasks |
List all 3 tasks |
| WS | /ws |
WebSocket persistent session |
| GET | /docs |
Interactive API documentation |
git clone https://huggingface.co/spaces/jeevan2717/incident-postmortem-writer
cd incident-postmortem-writer
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reloadTest it:
curl http://localhost:7860/health
# {"status":"healthy"}docker build -t postmortem-env .
docker run -p 7860:7860 postmortem-envFor the Grand Finale, we fine-tuned Qwen 2.5-0.5B using HuggingFace TRL's SFTTrainer via Rejection Sampling Fine-Tuning. Rollouts from a Llama-3.1-8B teacher were collected across all 4 difficulties, filtered by reward ≥ 0.50 (234 high-quality trajectories), and the student model trained for 5 epochs on a Colab T4 GPU.
Results — student reward improvement (before vs after training):
| Difficulty | Before | After | Change |
|---|---|---|---|
| Easy | 0.800 | 0.840 | +0.040 |
| Medium | 0.616 | 0.857 | +0.241 |
| Hard | 0.412 | 0.650 | +0.238 |
| Expert | 0.321 | 0.508 | +0.187 |
| Average | 0.537 | 0.714 | +0.176 |
Training loss descended cleanly from 3.09 → 0.035 over 290 steps. Full write-up and charts in BLOG.md. Reproducibility artifacts: training_results.json, reward_improvement.png, training_loss_curve.png.
export API_BASE_URL=https://api.groq.com/openai/v1
export MODEL_NAME=llama-3.1-8b-instant
export HF_TOKEN=your_api_key_here
python inference.pyThe inference script emits structured stdout logs in the OpenEnv-required format:
[START] task=easy env=incident-postmortem-writer model=llama-3.1-8b-instant
[STEP] step=1 action=QUERY_LOGS reward=0.06 done=false error=null
[STEP] step=2 action=WRITE_SECTION_summary reward=0.03 done=false error=null
...
[END] success=true steps=8 score=1.000 rewards=0.06,0.03,0.03,0.03,0.03,0.03,0.08,1.00
from client import PostMortemEnv
with PostMortemEnv(base_url="http://localhost:7860") as env:
result = env.reset(difficulty="easy")
print(result["observation"]["goal"])
# Query logs for evidence
result = env.query_logs("payments", "03:38", "03:45")
# Write sections
result = env.write_section("root_cause",
"Root cause: DB connection leak in payments service v2.4.0...")
# Submit
result = env.submit()
print(result["info"]["grade"]["total_score"])import asyncio, json, websockets
async def run():
async with websockets.connect(
"wss://jeevan2717-incident-postmortem-writer.hf.space/ws"
) as ws:
await ws.send(json.dumps({"command": "reset", "difficulty": "hard"}))
result = json.loads(await ws.recv())
print(result["data"]["observation"]["goal"])
asyncio.run(run())postmortem-env/
├── env/
│ ├── models.py # Pydantic typed models (Observation, Action, Reward)
│ └── scenarios/
│ ├── easy.json # Single-service DB outage
│ ├── medium.json # Cascading Redis TTL failure
│ ├── hard.json # Multi-service with false root causes
│ └── expert.json # Security breach with Tor exit node + 3 false root causes
├── server/
│ ├── environment.py # Core step/reset/state logic + deterministic grader
│ └── app.py # FastAPI server (REST + WebSocket)
├── client.py # Typed HTTP client
├── inference.py # Baseline agent script
├── openenv.yaml # OpenEnv manifest
├── Dockerfile
└── requirements.txt
| Variable | Default | Description |
|---|---|---|
API_BASE_URL |
https://api.openai.com/v1 |
LLM API endpoint |
MODEL_NAME |
gpt-4o-mini |
Model identifier |
HF_TOKEN |
— | API key |
WORKERS |
2 |
Uvicorn worker processes |
MAX_CONCURRENT_ENVS |
100 |
Max WebSocket sessions |
DIFFICULTY |
easy |
Default task difficulty (easy/medium/hard/expert) |