Autonomous Infrastructure Observability Agent
Detects architectural entropy in system log streams using semantic vector drift analysis — before crashes happen.
Traditional monitoring tools alert on thresholds — CPU > 90%, error rate > 5%. But many real failures are "silent killers":
- A database index change slows queries gradually.
- A retry storm fills up a connection pool over minutes.
- A zombie service is "up" but not processing.
These don't throw ERROR logs until it's too late.
Echo-Ops treats logs as vectors in high-dimensional space. A healthy system has a characteristic "shape" — its logs cluster near a known-good baseline. When behaviour changes, the shape drifts — even if no explicit errors are logged.
```
Healthy Logs → Embed → Endee Baseline Index
Live Logs    → Embed → Compare vs. Baseline
                        Cosine Distance > 0.30 → ANOMALY
                                  ↓
                          Agentic LLM Loop
                        (calls diagnostic tools)
                                  ↓
                          Root Cause Report
```
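Conceptually, the drift check reduces to a cosine distance between a live log embedding and its nearest healthy baseline vector. A minimal sketch of that math, with small synthetic NumPy vectors standing in for real 384-dim embeddings (the brute-force `min` over baseline rows plays the role of Endee's ANN query):

```python
import numpy as np

DRIFT_THRESHOLD = 0.30  # the threshold from the diagram above

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity; 0.0 means identical direction."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_score(live: np.ndarray, baseline: np.ndarray) -> float:
    """Distance from a live embedding to its nearest baseline vector.
    baseline has shape (n_vectors, dim); Endee answers this via ANN search."""
    return min(cosine_distance(live, row) for row in baseline)

# Synthetic 4-dim "embeddings": the healthy cluster points near [1, 0, 0, 0]
baseline = np.array([[1.0, 0.05, 0.0, 0.0],
                     [0.95, 0.0, 0.05, 0.0]])
healthy = np.array([1.0, 0.02, 0.01, 0.0])     # stays close to the cluster
anomalous = np.array([0.2, 0.9, 0.3, 0.1])     # semantically different logs

assert drift_score(healthy, baseline) < DRIFT_THRESHOLD
assert drift_score(anomalous, baseline) > DRIFT_THRESHOLD
```

No explicit ERROR string is involved anywhere: the anomaly shows up purely as a change in direction of the embedding vectors.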
```
┌─────────────────────────────────────────────┐
│                Echo-Ops Agent               │
│                                             │
│  ingestion/      agent/           api/      │
│ ┌───────────┐  ┌──────────────┐  ┌───────┐  │
│ │ log_gen   │  │ detector.py  │  │server │  │
│ │ embedder  │→ │ (Endee ANN)  │→ │ SSE   │  │
│ │ + cache   │  │ agent.py     │  │ /     │  │
│ │ endee_    │  │ (Groq)       │  │static │  │
│ │ client    │  │ tools.py     │  │       │  │
│ └───────────┘  └──────────────┘  └───────┘  │
└─────────────────────────────────────────────┘
        │ HTTP              │ SSE
        ▼                   ▼
 ┌─────────────┐     ┌──────────────┐
 │  Endee DB   │     │   Browser    │
 │   :8080     │     │  Dashboard   │
 └─────────────┘     └──────────────┘
```
| Operation | Endee Endpoint | Purpose |
|---|---|---|
| `create_index` | `POST /api/v1/index/create` | Create `echo_ops_baseline` index (384-dim, cosine) |
| `upsert` | `POST /api/v1/index/{name}/upsert` | Batch-write healthy log embeddings (baseline) |
| `query` | `POST /api/v1/index/{name}/query` | ANN search: "how far is the current log pattern from healthy?" |
Why Endee specifically? The drift check runs every 5 seconds against a 300+ vector index. Standard cloud vector DBs add 50–200ms of network latency per query. Endee's C++ HNSW core gives sub-millisecond local lookups, making real-time streaming analysis viable.
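As a rough sketch of what those three calls look like on the wire, here are helpers that build the URL and JSON body for each endpoint. Only the endpoint paths come from the table above; the body field names (`dimension`, `metric`, `vectors`, `top_k`, etc.) are assumptions, not Endee's documented schema:

```python
BASE = "http://localhost:8080/api/v1"

def create_index_request(name: str = "echo_ops_baseline", dim: int = 384):
    """Build the request for POST /api/v1/index/create (body fields assumed)."""
    return f"{BASE}/index/create", {"name": name, "dimension": dim, "metric": "cosine"}

def upsert_request(name: str, ids: list, vectors: list):
    """Build the request for POST /api/v1/index/{name}/upsert."""
    body = {"vectors": [{"id": i, "values": v} for i, v in zip(ids, vectors)]}
    return f"{BASE}/index/{name}/upsert", body

def query_request(name: str, vector: list, top_k: int = 1):
    """Build the request for POST /api/v1/index/{name}/query: nearest
    healthy neighbour(s) of the current log embedding."""
    return f"{BASE}/index/{name}/query", {"vector": vector, "top_k": top_k}

url, body = query_request("echo_ops_baseline", [0.0] * 384)
```

With a running Endee instance, each pair would be sent as `requests.post(url, json=body)`.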
The embedder caches on the log template (e.g. "User {id} checkout failed") not the full message. Because logs repeat the same ~20 templates, we get ~80-90% cache hit rate — drastically reducing embedding model calls and keeping Endee write volume low.
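Reducing a raw message to its template can be as simple as masking the variable fields with regexes. A hypothetical sketch (the real embedder may normalize differently; the patterns and placeholder names here are illustrative):

```python
import re

# Order matters: mask the most specific patterns before the generic ones.
_PATTERNS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "{uuid}"),
    (re.compile(r"\b\d+\.\d+\b"), "{float}"),
    (re.compile(r"\b\d+\b"), "{id}"),
]

def to_template(message: str) -> str:
    """Collapse variable fields so repeated log shapes share one cache key."""
    for pattern, placeholder in _PATTERNS:
        message = pattern.sub(placeholder, message)
    return message

to_template("User 4821 checkout failed")  # → "User {id} checkout failed"
to_template("User 77 checkout failed")    # → same template, same cache entry
```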
```python
import functools

@functools.lru_cache(maxsize=256)
def _embed_template(template: str) -> tuple:
    # only called on a cache miss (~10-20% of logs)
    return tuple(model.encode(template).tolist())
```

The agent is not a prompt wrapper. It's a genuine ReAct (Reason + Act) loop:
- Observe: receives `(service, drift_score, sample_logs)` from the detector
- Reason: the LLM decides which tools to call
- Act: calls `get_recent_commits`, `get_top_db_queries`, `get_resource_snapshot`
- Observe: tool results feed back into the conversation
- Synthesize: produces a Root Cause Analysis Report as structured JSON
The LLM drives the investigation. We don't hardcode "if checkout → check DB". The agent figures that out.
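The mechanical half of that loop, dispatching the LLM's tool-call requests to real functions and feeding the observations back, might be sketched as below. The tool names come from the list above; the message format follows the OpenAI-style function-calling convention that Groq exposes, and the stub implementations are placeholders for what lives in `agent/tools.py`:

```python
import json

# Stub diagnostic tools -- the real versions live in agent/tools.py.
def get_recent_commits(service: str) -> list:
    return [{"sha": "abc123", "msg": f"tune {service} DB index"}]

def get_top_db_queries(service: str) -> list:
    return [{"query": "SELECT ...", "p99_ms": 950}]

def get_resource_snapshot(service: str) -> dict:
    return {"cpu": 0.42, "pool_in_use": 98}

TOOLS = {f.__name__: f for f in (get_recent_commits,
                                 get_top_db_queries,
                                 get_resource_snapshot)}

def run_tool_calls(tool_calls: list, messages: list) -> None:
    """Execute each tool the LLM requested and append the result as a
    'tool' message, so the next LLM turn reasons over the observations."""
    for call in tool_calls:
        fn = TOOLS[call["name"]]
        result = fn(**json.loads(call["arguments"]))
        messages.append({"role": "tool", "name": call["name"],
                         "content": json.dumps(result)})

messages = []
run_tool_calls([{"name": "get_resource_snapshot",
                 "arguments": json.dumps({"service": "checkout"})}], messages)
```

Because the dispatch table is just `{name: function}`, adding a new diagnostic tool is one function plus one schema entry; the investigation strategy itself stays with the LLM.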
```
echo-ops/
├── config.py              # All config in one place
├── main.py                # Entry point & orchestrator
├── docker-compose.yml     # Endee vector DB
├── requirements.txt
├── ingestion/
│   ├── log_generator.py   # Synthetic log stream (healthy + anomaly)
│   ├── embedder.py        # MiniLM + LRU cache
│   └── endee_client.py    # Endee HTTP API wrapper
├── agent/
│   ├── detector.py        # Drift detection engine
│   ├── tools.py           # Diagnostic tools + OpenAI function schemas
│   └── agent.py           # Agentic LLM ReAct loop (Groq)
├── api/
│   └── server.py          # FastAPI + SSE stream
└── static/
    └── index.html         # Real-time dashboard
```
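On the delivery side, `api/server.py` streams drift events to the dashboard over SSE, which boils down to framing each event in the `text/event-stream` format. A minimal sketch (the `drift` event name and payload fields are assumptions, not the server's actual schema):

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Frame one server-sent event: an 'event:' line, a 'data:' line,
    and the blank-line terminator required by text/event-stream."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

# In FastAPI, frames like this are yielded from an async generator wrapped
# in StreamingResponse(..., media_type="text/event-stream").
frame = sse_event("drift", {"service": "checkout", "drift_score": 0.41})
```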
- Python 3.11+
- Docker (for Endee Vector DB)
- Groq API key (free at console.groq.com/keys)
```bash
git clone https://github.com/codeRisshi25/echo-ops.git
cd echo-ops
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
```

```bash
cp .env.example .env
# Edit .env and set GROQ_API_KEY
```

```bash
docker compose up -d
# Verify it's running:
curl http://localhost:8080/api/v1/index/list
```

```bash
# Demo mode: automatically injects an anomaly at t=25s
python main.py --demo

# Open dashboard
open http://localhost:8000
```

- t=0s — "Building healthy baseline (300 vectors)" — Endee index populated
- t=10s — Dashboard goes live, healthy log stream visible, drift score near 0
- t=25s — Anomaly injected (checkout retry storm)
- t=30s — Drift score spikes above 0.30, agent wakes up
- t=35s — LLM calls tools (commits, DB queries, resources)
- t=40s — Root Cause Report appears in dashboard: "DB index change in checkout service"
| Component | Technology | Cost |
|---|---|---|
| Vector DB | Endee (C++ HNSW) | Free, self-hosted |
| Embedding | fastembed + BAAI/bge-small-en-v1.5 (ONNX) | Free, local |
| LLM / Agent | Any Groq-compatible model | Free tier available |
| API Server | FastAPI + uvicorn | Open source |
| Dashboard | Vanilla HTML/CSS/JS | — |
Risshi Raj Sen — github.com/codeRisshi25