Incident Post-Mortem Writer

title

emoji

🚨

colorFrom

red

colorTo

blue

sdk

docker

app_port

7860

pinned

false

Incident Post-Mortem Writer

OpenEnv Hackathon 2026 — Grand Finale Submission
Theme #3.1 — World Modeling: Professional Tasks · Scaler AI Labs Enterprise Workflows bonus track

An OpenEnv environment where AI agents learn to write structured incident post-mortems from raw alert logs, Slack threads, and service dependency graphs — a multi-app enterprise workflow that simulates real Site Reliability Engineering.

📄 Read the Round 2 blog post with training results — includes +32.8% reward improvement from TRL fine-tuning.

Why This Environment

Every SRE team writes post-mortems after incidents. It's a high-stakes, time-pressured task that requires:

Reconstructing a timeline from noisy, incomplete logs
Identifying root cause despite misleading signals and red herrings
Assigning concrete action items with owners and deadlines

This environment trains and evaluates agents on exactly this workflow — one of the most practically valuable skills in modern software operations.

Key Innovations

This environment goes beyond standard task simulation by introducing:

Evidence gating via QUERY_LOGS — critical root cause evidence is hidden behind precise service + time window queries. Incorrect queries return realistic noise logs, forcing intentional investigation rather than guessing.

Adversarial Slack signals — threads include senior engineers confidently blaming the wrong service, misleading correlations between symptoms and causes, and red herrings designed to trap pattern-matching agents.

Delayed and partial observability — the agent never sees full logs upfront. It must actively explore under a hard query budget (max 8 queries with escalating penalties), simulating real incident response under time pressure.

Multi-layer deterministic grading — root cause is evaluated across service identification, cause category classification, and semantic keyword validation. Not string matching. A correct answer written in different words still scores correctly.

These design choices simulate real-world incident response, where incomplete information, misleading signals, and time pressure are the norm — and where the difference between a good engineer and a great one is knowing where to look.

Why This Is Challenging for Agents

This environment is difficult for AI agents because:

Hidden evidence — critical logs are not visible upfront and must be discovered through precise QUERY_LOGS calls targeting the right service and time window.

Conflicting information — Slack threads contain confident but incorrect hypotheses from senior engineers, forcing the agent to trust data over authority.

Limited exploration budget — queries are constrained (max 8) with escalating penalties, preventing brute-force search strategies.

Structured output requirement — the agent must not only reason correctly but also produce a coherent, validated 5-section document with specific content requirements per section.

These constraints test true agentic reasoning — not just text generation.

Environment Description

The agent receives a realistic incident bundle: timestamped alert logs, a Slack thread from the on-call team, and a service dependency graph. It must investigate the incident and produce a complete 5-section post-mortem document.

The key mechanic is QUERY_LOGS — the agent must identify which service and time window to investigate. The real root cause evidence is hidden behind a specific log query. Wrong queries are penalized with escalating costs. This forces intentional reasoning rather than pattern matching.

Four Difficulty Levels

Task	Incident	Key Challenge
Easy	Single-service DB connection leak	Clean signals, clear root cause
Medium	Cascading failure from Redis TTL misconfiguration	Multiple services affected, deployment buried in Slack
Hard	Multi-service degradation with planted false root causes	Senior engineer confidently wrong in Slack, real evidence in non-obvious log window
Expert	Security breach via compromised API key	3 false root causes, senior security engineer wrong twice, 3-minute evidence window, GDPR action items required

The hard task deliberately plants two false root causes. The expert task goes further — a scheduled load test using the same API key creates a plausible innocent explanation, a senior security engineer with 12 years experience confidently blames the load test (twice), and the real evidence (Tor exit node + scope violation) is only visible in a precise 3-minute api-gateway audit log window. The expert task also reduces the query budget from 8 to 6 with steeper penalties.

Action Space

Action	Fields	Description
`QUERY_LOGS`	`query_service`, `query_from`, `query_to`	Query logs for a specific service and time window
`WRITE_SECTION`	`section_name`, `section_content`	Write one of 5 post-mortem sections
`ASSIGN_ACTION_ITEM`	`action_item_description`, `action_item_owner`, `action_item_due_date`	Assign a structured action item
`SUBMIT`	—	Finalize and submit for grading

Valid section_name values: summary, timeline, root_cause, impact, action_items

Observation Space

Each step returns a typed observation containing:

{
  "goal": str,                    # Natural language task description
  "incident_id": str,             # Incident identifier
  "incident_title": str,          # Human-readable incident name
  "alerts": List[AlertLog],       # Timestamped alert logs (severity, service, message)
  "slack_thread": List[SlackMessage], # On-call Slack conversation
  "service_graph": List[ServiceDependency], # Which service depends on which
  "step": int,                    # Current step number
  "max_steps": int,               # Episode limit (25)
  "queries_used": int,            # Queries consumed
  "max_queries": int,             # Query limit (8)
  "sections": List[SectionStatus], # State of each section (unwritten/invalid/valid)
  "last_action_result": str,      # Feedback from last action
  "retrieved_logs": List[AlertLog] | None  # Logs from last QUERY_LOGS call
}

Reward Function

Rewards are shaped throughout the episode — not just at the end:

Signal	Reward
Correct `QUERY_LOGS` (right service + time window)	+0.06
Valid section written	+0.03
Structured action item assigned	+0.08
Wrong `QUERY_LOGS` (1st mistake)	−0.05
Wrong `QUERY_LOGS` (2nd mistake)	−0.08
Wrong `QUERY_LOGS` (3rd+ mistake)	−0.12 to −0.18
Overwriting an already-valid section	−0.02
Missing section at SUBMIT	−0.10 per section
Final grader score at SUBMIT	0.0 – 1.0

The final grader score (added at SUBMIT) covers 60–70% of total reward and uses a weighted 5-component formula.

Grader Design

Each task is scored by a deterministic grader (0.0–1.0):

Component	Weight	How it's measured
Root cause	30%	3-layer: correct service (L1=0.40) + cause category (L2=0.35) + keywords (L3=0.25)
Timeline	25%	Events matched within ±3 min tolerance against gold standard
Action items	20%	Owner + due date + theme coverage
Impact	15%	Word count + service mention + duration + scale
Completeness	10%	All 5 sections present and validated

The environment is fully deterministic — scenarios are static JSON, grading is a pure function, and identical action sequences always produce identical scores.

Root cause special rules:

If L1 (service identification) = 0, score capped at 0.65
If false root cause service mentioned before real service, L1 reduced to 0.15
Timeline score < 0.4 caps root cause at 0.60 (forces reasoning over guessing)

Baseline Scores

Using llama-3.1-8b-instant via Groq API (runtime: ~200 seconds):

easy  : 1.000  ████████████████████
medium: 0.985  ███████████████████
hard  : 0.797  ███████████████
expert: 0.674  █████████████
avg   : 0.864

The difficulty staircase is genuine — each level is harder for a different reason. The hard task misleads agents with a confidently wrong senior engineer and CDN red herrings. The expert task adds a third false root cause, a senior security engineer who is wrong twice, a 3-minute evidence window, and 4 hidden timeline events that require querying the correct log window to unlock. The baseline intentionally does not achieve perfect scores on medium, hard, or expert tasks, demonstrating that the environment is challenging yet solvable. Scores are consistent across runs and deterministic given the same action sequence.

API Endpoints

Method	Endpoint	Description
GET	`/health`	Health check — returns `{"status": "healthy"}`
POST	`/reset`	Start new episode. Body: `{"difficulty": "easy\|medium\|hard"}`
POST	`/step`	Execute action. Body: action JSON
GET	`/state`	Current episode state
GET	`/tasks`	List all 3 tasks
WS	`/ws`	WebSocket persistent session
GET	`/docs`	Interactive API documentation

Setup & Usage

Local

git clone https://huggingface.co/spaces/jeevan2717/incident-postmortem-writer
cd incident-postmortem-writer

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload

Test it:

curl http://localhost:7860/health
# {"status":"healthy"}

Docker

docker build -t postmortem-env .
docker run -p 7860:7860 postmortem-env

Round 2 Training Results

For the Grand Finale, we fine-tuned Qwen 2.5-0.5B using HuggingFace TRL's SFTTrainer via Rejection Sampling Fine-Tuning. Rollouts from a Llama-3.1-8B teacher were collected across all 4 difficulties, filtered by reward ≥ 0.50 (234 high-quality trajectories), and the student model trained for 5 epochs on a Colab T4 GPU.

Results — student reward improvement (before vs after training):

Difficulty	Before	After	Change
Easy	0.800	0.840	+0.040
Medium	0.616	0.857	+0.241
Hard	0.412	0.650	+0.238
Expert	0.321	0.508	+0.187
Average	0.537	0.714	+0.176

Training loss descended cleanly from 3.09 → 0.035 over 290 steps. Full write-up and charts in BLOG.md. Reproducibility artifacts: training_results.json, reward_improvement.png, training_loss_curve.png.

Run Baseline Inference

export API_BASE_URL=https://api.groq.com/openai/v1
export MODEL_NAME=llama-3.1-8b-instant
export HF_TOKEN=your_api_key_here

python inference.py

The inference script emits structured stdout logs in the OpenEnv-required format:

[START] task=easy env=incident-postmortem-writer model=llama-3.1-8b-instant
[STEP] step=1 action=QUERY_LOGS reward=0.06 done=false error=null
[STEP] step=2 action=WRITE_SECTION_summary reward=0.03 done=false error=null
...
[END] success=true steps=8 score=1.000 rewards=0.06,0.03,0.03,0.03,0.03,0.03,0.08,1.00

Use as Client

from client import PostMortemEnv

with PostMortemEnv(base_url="http://localhost:7860") as env:
    result = env.reset(difficulty="easy")
    print(result["observation"]["goal"])

    # Query logs for evidence
    result = env.query_logs("payments", "03:38", "03:45")

    # Write sections
    result = env.write_section("root_cause",
        "Root cause: DB connection leak in payments service v2.4.0...")

    # Submit
    result = env.submit()
    print(result["info"]["grade"]["total_score"])

WebSocket Session

import asyncio, json, websockets

async def run():
    async with websockets.connect(
        "wss://jeevan2717-incident-postmortem-writer.hf.space/ws"
    ) as ws:
        await ws.send(json.dumps({"command": "reset", "difficulty": "hard"}))
        result = json.loads(await ws.recv())
        print(result["data"]["observation"]["goal"])

asyncio.run(run())

Project Structure

postmortem-env/
├── env/
│   ├── models.py              # Pydantic typed models (Observation, Action, Reward)
│   └── scenarios/
│       ├── easy.json          # Single-service DB outage
│       ├── medium.json        # Cascading Redis TTL failure
│       ├── hard.json          # Multi-service with false root causes
│       └── expert.json        # Security breach with Tor exit node + 3 false root causes
├── server/
│   ├── environment.py         # Core step/reset/state logic + deterministic grader
│   └── app.py                 # FastAPI server (REST + WebSocket)
├── client.py                  # Typed HTTP client
├── inference.py               # Baseline agent script
├── openenv.yaml               # OpenEnv manifest
├── Dockerfile
└── requirements.txt

Environment Variables

Variable	Default	Description
`API_BASE_URL`	`https://api.openai.com/v1`	LLM API endpoint
`MODEL_NAME`	`gpt-4o-mini`	Model identifier
`HF_TOKEN`	—	API key
`WORKERS`	`2`	Uvicorn worker processes
`MAX_CONCURRENT_ENVS`	`100`	Max WebSocket sessions
`DIFFICULTY`	`easy`	Default task difficulty (easy/medium/hard/expert)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Incident Post-Mortem Writer

Why This Environment

Key Innovations

Why This Is Challenging for Agents

Environment Description

Four Difficulty Levels

Action Space

Observation Space

Reward Function

Grader Design

Baseline Scores

API Endpoints

Setup & Usage

Local

Docker

Round 2 Training Results

Run Baseline Inference

Use as Client

WebSocket Session

Project Structure

Environment Variables

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
env		env
server		server
.gitattributes		.gitattributes
.gitignore		.gitignore
BLOG.md		BLOG.md
Dockerfile		Dockerfile
README.md		README.md
__init__.py		__init__.py
client.py		client.py
demo.py		demo.py
inference.py		inference.py
openenv.yaml		openenv.yaml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
reward_improvement.png		reward_improvement.png
training_loss_curve.png		training_loss_curve.png
training_results.json		training_results.json
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Incident Post-Mortem Writer

Why This Environment

Key Innovations

Why This Is Challenging for Agents

Environment Description

Four Difficulty Levels

Action Space

Observation Space

Reward Function

Grader Design

Baseline Scores

API Endpoints

Setup & Usage

Local

Docker

Round 2 Training Results

Run Baseline Inference

Use as Client

WebSocket Session

Project Structure

Environment Variables

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages