A Kubernetes-native system that automatically detects and heals unhealthy nodes using a Machine Learning anomaly detection model, with a real-time live dashboard.
- Overview
- Architecture
- Features
- Project Structure
- Prerequisites
- Quick Start (Local, No Kubernetes)
- Running on Kubernetes
- Dashboard Guide
- API Reference
- ML Model Details
- Troubleshooting
- Tech Stack
SHC (Self-Healing Cluster) monitors cloud node metrics, feeds them to an Isolation Forest ML model, and automatically restarts unhealthy pods when a persistent anomaly is confirmed using a double-verification strategy (detect → wait 20 s → re-check → heal). Everything is visualised on a live dark-mode dashboard.
┌─────────────────────────────────────────────────────────┐
│                    Browser Dashboard                    │
│          (WebSocket · Chart.js · Real-time UI)          │
└───────────────────────┬─────────────────────────────────┘
                        │  WebSocket ws://
                        ▼
┌─────────────────────────────────────────────────────────┐
│                 Demo-Service (Node.js)                  │
│  • 5-second metric broadcast loop                       │
│  • 30-second anomaly check loop                         │
│  • Double-verification before healing                   │
│  • Kubernetes API for pod restarts                      │
└───────────────┬─────────────────────────────────────────┘
                │  POST /predict (HTTP)
                ▼
┌─────────────────────────────────────────────────────────┐
│               ML-Service (Python FastAPI)               │
│  • Isolation Forest (200 estimators, 8 features)        │
│  • Trained on 5 failure scenario types                  │
│  • Returns { anomaly: true/false, score: float }        │
└─────────────────────────────────────────────────────────┘
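The detect → wait → re-check → heal cycle can be sketched as a small function. This is a minimal illustration in Python, not the Demo-Service code (which is Node.js). The names `double_verify`, `read_metrics`, `is_anomalous` and `heal`, and the `TRANSIENT` outcome label, are hypothetical; only the 20-second wait and the NORMAL / CONFIRMED states come from this README.

```python
import time
from typing import Callable, Dict

Metrics = Dict[str, float]


def double_verify(
    read_metrics: Callable[[], Metrics],
    is_anomalous: Callable[[Metrics], bool],
    heal: Callable[[], None],
    wait_seconds: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run one detect -> wait -> re-check -> heal cycle and return the outcome."""
    # First check: a clean reading means there is nothing to do.
    if not is_anomalous(read_metrics()):
        return "NORMAL"

    # Wait before re-checking so a short-lived spike can clear on its own.
    sleep(wait_seconds)

    # Second check on fresh metrics: heal only if the anomaly persists.
    if not is_anomalous(read_metrics()):
        return "TRANSIENT"  # hypothetical label for a spike that cleared

    heal()
    return "CONFIRMED"
```

Passing `sleep` as a parameter keeps the wait out of unit tests: a test can pass `sleep=lambda s: None` and feed two canned readings to exercise each branch.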
- ML Anomaly Detection: Isolation Forest trained on 5 failure types
- Double-Verification: confirms an anomaly before any healing action, which reduces false positives
- Automatic Pod Restart: via the Kubernetes API, with RBAC scoped to minimum permissions
- Live Dashboard: WebSocket-powered real-time metrics, node health map, rolling charts, healing event log
- Demo Controls: "Simulate Stress" and "Reset" buttons to show the full healing cycle
- Fallback Threshold: works even if the ML service is temporarily unreachable
- Fully Dockerised: both services have production-ready Dockerfiles
SHC/
├── .gitignore
├── README.md
│
├── Demo-Service/              ← Node.js monitor + dashboard server
│   ├── index.js               Main application
│   ├── package.json
│   ├── Dockerfile
│   ├── deployment.yaml        Kubernetes Deployment
│   ├── service.yaml           Kubernetes Service (NodePort)
│   ├── rbac.yaml              ServiceAccount + Role + RoleBinding
│   └── dashboard/
│       ├── index.html         Dashboard UI
│       ├── style.css          Dark glassmorphism styles
│       └── app.js             WebSocket client + Chart.js logic
│
└── ML-Model/                  ← Python ML anomaly detection service
    ├── train_model.py         Dataset generator + model trainer
    ├── ml_service.py          FastAPI prediction service
    ├── test_model.py          Quick model validation script
    ├── requirements.txt       Python dependencies
    ├── Dockerfile
    ├── ml-deployment.yaml     Kubernetes Deployment
    ├── ml-service.yaml        Kubernetes Service (ClusterIP)
    ├── anomaly_model.pkl      Trained model (generated)
    └── scaler.pkl             Feature scaler (generated)
Make sure the following are installed on your system:
| Tool | Version | Purpose |
|---|---|---|
| Node.js | ≥ 18.x | Demo-Service runtime |
| npm | ≥ 9.x | Node package manager |
| Python | ≥ 3.10 | ML model and service |
| pip | ≥ 23.x | Python package manager |
For Kubernetes deployment only:
| Tool | Purpose |
|---|---|
| Docker Desktop / Docker Engine | Build container images |
| kubectl | Manage Kubernetes cluster |
| Minikube / Kind / any K8s cluster | The cluster itself |
- Node.js: https://nodejs.org/en/download
- Python: https://www.python.org/downloads
- Docker Desktop: https://www.docker.com/products/docker-desktop
- kubectl: https://kubernetes.io/docs/tasks/tools
- Minikube: https://minikube.sigs.k8s.io/docs/start
This runs everything on your laptop with no Kubernetes needed. Ideal for demos and development.
cd SHC
cd ML-Model
pip install -r requirements.txt
python train_model.py
Expected output:
Generating synthetic node metrics dataset...
Dataset: 5000 total samples (4500 normal + 500 anomalous)
Model trained. Flagged 500/5000 samples as anomalous (10.0%)
Saved: anomaly_model.pkl scaler.pkl
Keep this terminal open:
# Still inside ML-Model/
uvicorn ml_service:app --host 0.0.0.0 --port 8000
Verify it's up: http://localhost:8000/health → {"status":"ok"}
Open a new terminal:
cd SHC/Demo-Service
npm install
Windows (PowerShell):
$env:ML_SERVICE_URL = "http://localhost:8000"
node index.js
macOS / Linux (bash/zsh):
ML_SERVICE_URL=http://localhost:8000 node index.js
Open your browser and go to:
http://localhost:3000
You should see the live dark-mode dashboard with metrics updating every 5 seconds.
minikube start
eval $(minikube docker-env) # macOS/Linux
# Windows PowerShell:
# & minikube -p minikube docker-env --shell powershell | Invoke-Expression
# Build ML service image
cd SHC/ML-Model
docker build -t ml-anomaly-service .
# Build Demo-Service image
cd ../Demo-Service
docker build -t selfheal-app .
cd SHC
# RBAC (service account, role, rolebinding)
kubectl apply -f Demo-Service/rbac.yaml
# ML Service
kubectl apply -f ML-Model/ml-deployment.yaml
kubectl apply -f ML-Model/ml-service.yaml
# Demo Service + Dashboard
kubectl apply -f Demo-Service/deployment.yaml
kubectl apply -f Demo-Service/service.yaml
minikube service selfheal-service
This opens the dashboard automatically in your browser.
kubectl get pods
kubectl get services
Expected:
NAME READY STATUS RESTARTS
ml-service-xxxx 1/1 Running 0
selfheal-app-xxxx 1/1 Running 0
| Section | Description |
|---|---|
| Header | System name, live status badge (NORMAL / DETECTING / CONFIRMED), live clock, WebSocket connection indicator |
| Cluster Nodes | 3 animated node cards (Master + 2 Workers) that change colour based on anomaly state |
| Live Metrics | 8 metric cards with progress bars that turn amber/red when thresholds are exceeded |
| Rolling Chart | Chart.js multi-line chart showing the last 60 data points for CPU, Memory, Latency, Error Rate |
| Anomaly Detection | Pulsing indicator with state description + counters (Heals, Anomalies, Uptime) |
| Healing Log | Table of all healing events: timestamp, issue, key metrics, action taken |
| Demo Controls | "⚡ Simulate Node Stress" triggers anomalous metrics immediately; "↺ Reset to Normal" resets the simulation back to normal metrics; links to raw /api/metrics and /api/events JSON |
- Open dashboard: show NORMAL state, point out all 8 live metrics
- Click "⚡ Simulate Node Stress"
- Within ~30 seconds:
  - Status badge changes: NORMAL → DETECTING → CONFIRMED
  - Node cards change colour: Healthy → Degraded → Critical
  - Metric cards turn red
  - Anomaly count increments
- After healing is confirmed: new row appears in Healing Event Log
- Click "↺ Reset": system recovers automatically to NORMAL
- Mention the automatic rotation through the 5 failure scenarios: CPU Spike → OOM → Disk I/O → Network → Crash Loop
All endpoints are on the Demo-Service (http://localhost:3000):
| Method | Endpoint | Description |
|---|---|---|
| GET | / | Serves the dashboard UI |
| GET | /health | Health check; returns {"status":"ok","state":"NORMAL"} |
| GET | /api/metrics | Latest metrics snapshot (JSON) |
| GET | /api/events | Full healing event log (JSON array) |
| GET | /api/state | Current state + uptime stats |
| GET | /stress | Trigger anomalous metrics simulation |
| GET | /reset | Reset metrics to normal |
| GET | /crash?token=shc-secret | Intentional crash (token-protected) |
ML Service (http://localhost:8000):
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | {"status":"ok"} |
| GET | /info | Model metadata (algorithm, features, contamination) |
| POST | /predict | Predict anomaly from 8 metrics; returns {"anomaly":bool,"score":float} |
| GET | /docs | Interactive Swagger UI |
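A client call to `POST /predict` might look like the sketch below, using only the Python standard library. It assumes the request body is a flat JSON object keyed by the eight feature names listed in this README; the actual schema is defined in `ml_service.py` and may differ (the `/docs` Swagger UI shows the real one). `build_predict_request` and `predict` are hypothetical helper names.

```python
import json
import urllib.request

# Feature names as listed in this README; using them as JSON keys is an assumption.
FEATURES = (
    "cpu_usage", "memory_usage", "request_rate", "latency",
    "pod_restarts", "disk_io", "network_errors", "error_rate",
)


def build_predict_request(metrics, base_url="http://localhost:8000"):
    """Build (but do not send) a POST /predict request for one metrics snapshot."""
    missing = [name for name in FEATURES if name not in metrics]
    if missing:
        raise ValueError(f"missing metrics: {missing}")
    body = json.dumps({name: metrics[name] for name in FEATURES}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(metrics, base_url="http://localhost:8000", timeout=5.0):
    """Send the request and decode the JSON reply, e.g. the anomaly flag and score."""
    request = build_predict_request(metrics, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example snapshot with values inside the documented normal ranges.
sample_metrics = {
    "cpu_usage": 35.0, "memory_usage": 48.0, "request_rate": 250.0,
    "latency": 120.0, "pod_restarts": 0, "disk_io": 30.0,
    "network_errors": 2, "error_rate": 0.01,
}
```

With the ML service running locally, `predict(sample_metrics)` sends the snapshot and returns the decoded reply.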
| Parameter | Value |
|---|---|
| Algorithm | Isolation Forest |
| Library | scikit-learn |
| Estimators | 200 trees |
| Contamination | 10% |
| Training samples | 5000 (4500 normal + 500 anomalous) |
| Features | 8 |
| Random seed | 42 |
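The parameters above map directly onto scikit-learn's `IsolationForest`. The sketch below is not `train_model.py`: it trains on a smaller synthetic set (900 normal + 100 anomalous rows), draws normal rows from the documented normal ranges, and draws anomalous rows from assumed "stressed" ranges in which every feature is beyond its alert threshold at once. The real script generates five distinct failure scenarios.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# (low, high) per feature, in the order: cpu_usage, memory_usage, request_rate,
# latency, pod_restarts, disk_io, network_errors, error_rate.
NORMAL_RANGES = np.array([
    (10, 60), (30, 65), (180, 380), (60, 280),
    (0, 1), (10, 55), (0, 6), (0, 0.04),
], dtype=float)

# Assumed stressed ranges (illustrative only): each starts at the alert threshold.
STRESSED_RANGES = np.array([
    (87, 100), (90, 100), (0, 50), (1500, 3000),
    (4, 10), (87, 100), (45, 100), (0.30, 1.0),
], dtype=float)


def sample(ranges, n):
    """Draw n rows uniformly from per-feature (low, high) ranges."""
    return rng.uniform(ranges[:, 0], ranges[:, 1], size=(n, len(ranges)))


X = np.vstack([sample(NORMAL_RANGES, 900), sample(STRESSED_RANGES, 100)])

# Scale features, then fit with the documented hyperparameters.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = IsolationForest(n_estimators=200, contamination=0.10, random_state=42)
model.fit(X_scaled)

# predict() returns -1 for anomalies and 1 for normal rows.
flagged = int((model.predict(X_scaled) == -1).sum())
print(f"Flagged {flagged}/{len(X)} training rows as anomalous")

# Lower decision_function scores mean "more anomalous".
typical = NORMAL_RANGES.mean(axis=1).reshape(1, -1)
stressed = STRESSED_RANGES.mean(axis=1).reshape(1, -1)
typical_score = float(model.decision_function(scaler.transform(typical))[0])
stressed_score = float(model.decision_function(scaler.transform(stressed))[0])
print(f"typical score: {typical_score:.3f}, stressed score: {stressed_score:.3f}")
```

The `contamination` value sets the score cut-off so that roughly that fraction of the training rows is flagged, which is why the expected training output reports about 10% anomalous.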
| Feature | Description | Normal Range | Alert Threshold |
|---|---|---|---|
| cpu_usage | CPU utilisation % | 10–60% | > 85% |
| memory_usage | RAM utilisation % | 30–65% | > 90% |
| request_rate | Requests per second | 180–380 | < 50 |
| latency | Request latency (ms) | 60–280 | > 1500 |
| pod_restarts | Restart count | 0–1 | ≥ 4 |
| disk_io | Disk I/O utilisation % | 10–55% | > 87% |
| network_errors | Errors per minute | 0–6 | > 45 |
| error_rate | Error fraction (0–1) | 0–0.04 | > 0.30 |
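The alert thresholds in this table are enough for a simple rule-based check of the kind a threshold fallback could use when the ML service is unreachable. The sketch below is an assumption about how such a fallback might work, written in Python for illustration; the Demo-Service's actual fallback lives in `index.js` and may combine the rules differently. `ALERT_RULES`, `breached_thresholds` and `fallback_is_anomalous` are hypothetical names.

```python
import operator

# One (comparison, limit) rule per feature, copied from the Alert Threshold column.
ALERT_RULES = {
    "cpu_usage": (operator.gt, 85),
    "memory_usage": (operator.gt, 90),
    "request_rate": (operator.lt, 50),
    "latency": (operator.gt, 1500),
    "pod_restarts": (operator.ge, 4),
    "disk_io": (operator.gt, 87),
    "network_errors": (operator.gt, 45),
    "error_rate": (operator.gt, 0.30),
}


def breached_thresholds(metrics):
    """Return the names of all metrics that cross their alert threshold."""
    return [
        name
        for name, (compare, limit) in ALERT_RULES.items()
        if name in metrics and compare(metrics[name], limit)
    ]


def fallback_is_anomalous(metrics):
    """Assumed policy: any single breached threshold counts as an anomaly."""
    return bool(breached_thresholds(metrics))
```

Returning the list of breached metrics, rather than only a boolean, also gives the healing log something concrete to record as the issue.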
| Scenario | Characteristics |
|---|---|
| CPU Spike | cpu_usage > 87%, high latency, elevated error rate |
| Memory Exhaustion (OOM) | memory_usage > 90%, many pod restarts, high error rate |
| Disk I/O Saturation | disk_io > 88%, extreme latency > 1800ms |
| Network Degradation | network_errors > 48/min, very high latency, high error rate |
| Crash Loop | pod_restarts > 5, high CPU + memory, low request rate |
# Windows
netstat -ano | findstr ":3000"
taskkill /PID <PID> /F
# Or run on a different port:
$env:PORT = "3001"
node index.js
The Demo-Service has a built-in fallback using fixed thresholds, so it will still detect anomalies even without the ML service. Check that ML_SERVICE_URL is set correctly.
Re-run the training script from inside the ML-Model/ directory:
cd ML-Model
python train_model.py
Try using a virtual environment:
cd ML-Model
python -m venv venv
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate
pip install -r requirements.txt
Make sure you built the Docker images after running eval $(minikube docker-env), so that the images exist inside Minikube's Docker daemon:
eval $(minikube docker-env) # must run this first!
docker build -t ml-anomaly-service ./ML-Model
docker build -t selfheal-app ./Demo-Service
| Layer | Technology |
|---|---|
| ML Model | Python · scikit-learn (IsolationForest) · pandas · numpy · joblib |
| ML Service | FastAPI · uvicorn · Pydantic |
| Monitor Service | Node.js · Express · ws (WebSocket) · axios |
| Kubernetes Client | @kubernetes/client-node |
| Dashboard | HTML5 · CSS3 · JavaScript (ES2022) · Chart.js |
| Containerisation | Docker |
| Orchestration | Kubernetes Β· kubectl |
| RBAC | Kubernetes ServiceAccount + Role + RoleBinding |
Swaroop Vyawahare
Shreyash Shirsat
Final Year Academic Project: Self-Healing Cluster (SHC)
Built to demonstrate how ML-driven observability can automate Kubernetes node recovery without human intervention.