
⬡ SHC: Self-Healing Cluster

A Kubernetes-native system that automatically detects and heals unhealthy nodes using a Machine Learning anomaly detection model, with a real-time live dashboard.


📋 Table of Contents

  1. Overview
  2. Architecture
  3. Features
  4. Project Structure
  5. Prerequisites
  6. Quick Start (Local, No Kubernetes)
  7. Running on Kubernetes
  8. Dashboard Guide
  9. API Reference
  10. ML Model Details
  11. Troubleshooting
  12. Tech Stack

Overview

SHC (Self-Healing Cluster) monitors cloud node metrics, feeds them to an Isolation Forest ML model, and automatically restarts unhealthy pods once a persistent anomaly is confirmed. Confirmation uses a double-verification strategy (detect → wait 20 s → re-check → heal), so a transient spike on its own does not trigger a restart. Everything is visualised on a live dark-mode dashboard.
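The double-verification flow can be sketched in a few lines. This is a minimal illustration, not the project's implementation (that lives in Demo-Service/index.js and is written in Node.js). The function and parameter names are hypothetical, and the metric source, anomaly check, healing action, and wait function are all passed in as callables.

```python
from typing import Callable, Dict

Metrics = Dict[str, float]


def double_verify(
    read_metrics: Callable[[], Metrics],
    is_anomalous: Callable[[Metrics], bool],
    heal: Callable[[], None],
    wait: Callable[[float], None],
    recheck_delay_s: float = 20.0,  # the 20 s wait described in the Overview
) -> str:
    """Run one detection cycle and return the final state name."""
    if not is_anomalous(read_metrics()):
        return "NORMAL"
    # First detection (the DETECTING phase): do not heal yet, wait and re-check.
    wait(recheck_delay_s)
    if not is_anomalous(read_metrics()):
        return "NORMAL"  # transient spike, no healing action taken
    heal()  # anomaly persisted across both checks
    return "CONFIRMED"
```

In a real service `wait` would be `time.sleep` and `heal` would call the Kubernetes API; injecting them keeps the sketch easy to exercise without a cluster.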


Architecture

┌─────────────────────────────────────────────────────────┐
│                   Browser Dashboard                     │
│        (WebSocket · Chart.js · Real-time UI)            │
└───────────────────────┬─────────────────────────────────┘
                        │ WebSocket ws://
                        ▼
┌─────────────────────────────────────────────────────────┐
│               Demo-Service  (Node.js)                   │
│  • 5-second metric broadcast loop                       │
│  • 30-second anomaly check loop                         │
│  • Double-verification before healing                   │
│  • Kubernetes API for pod restarts                      │
└───────────────┬─────────────────────────────────────────┘
                │ POST /predict (HTTP)
                ▼
┌─────────────────────────────────────────────────────────┐
│               ML-Service  (Python FastAPI)              │
│  • Isolation Forest (200 estimators, 8 features)        │
│  • Trained on 5 failure scenario types                  │
│  • Returns { anomaly: true/false, score: float }        │
└─────────────────────────────────────────────────────────┘

Features

  • 🤖 ML Anomaly Detection: Isolation Forest trained on 5 failure types
  • 🔁 Double-Verification: confirms an anomaly before any healing action, which reduces false positives
  • 🩺 Automatic Pod Restart: via the Kubernetes API, with RBAC scoped to minimum permissions
  • 📊 Live Dashboard: WebSocket-powered real-time metrics, node health map, rolling charts, healing event log
  • ⚡ Demo Controls: "Simulate Stress" and "Reset" buttons to show the full healing cycle
  • 🛡️ Fallback Threshold: works even if the ML service is temporarily unreachable
  • 🐳 Fully Dockerised: both services have production-ready Dockerfiles

Project Structure

SHC/
├── .gitignore
├── README.md
│
├── Demo-Service/               ← Node.js monitor + dashboard server
│   ├── index.js                   Main application
│   ├── package.json
│   ├── Dockerfile
│   ├── deployment.yaml            Kubernetes Deployment
│   ├── service.yaml               Kubernetes Service (NodePort)
│   ├── rbac.yaml                  ServiceAccount + Role + RoleBinding
│   └── dashboard/
│       ├── index.html             Dashboard UI
│       ├── style.css              Dark glassmorphism styles
│       └── app.js                 WebSocket client + Chart.js logic
│
└── ML-Model/                   ← Python ML anomaly detection service
    ├── train_model.py             Dataset generator + model trainer
    ├── ml_service.py              FastAPI prediction service
    ├── test_model.py              Quick model validation script
    ├── requirements.txt           Python dependencies
    ├── Dockerfile
    ├── ml-deployment.yaml         Kubernetes Deployment
    ├── ml-service.yaml            Kubernetes Service (ClusterIP)
    ├── anomaly_model.pkl          Trained model (generated)
    └── scaler.pkl                 Feature scaler (generated)

Prerequisites

Make sure the following are installed on your system:

| Tool | Version | Purpose |
| --- | --- | --- |
| Node.js | ≥ 18.x | Demo-Service runtime |
| npm | ≥ 9.x | Node package manager |
| Python | ≥ 3.10 | ML model and service |
| pip | ≥ 23.x | Python package manager |

For Kubernetes deployment only:

| Tool | Purpose |
| --- | --- |
| Docker Desktop / Docker Engine | Build container images |
| kubectl | Manage Kubernetes cluster |
| Minikube / Kind / any K8s cluster | The cluster itself |

Installing Prerequisites

Node.js: https://nodejs.org/en/download
Python: https://www.python.org/downloads
Docker Desktop: https://www.docker.com/products/docker-desktop
kubectl: https://kubernetes.io/docs/tasks/tools
Minikube: https://minikube.sigs.k8s.io/docs/start


Quick Start (Local, No Kubernetes)

This runs everything on your laptop with no Kubernetes needed. Ideal for demos and development.

Step 1: Clone / Navigate to the project

cd SHC

Step 2: Train the ML Model

cd ML-Model
pip install -r requirements.txt
python train_model.py

Expected output:

Generating synthetic node metrics dataset...
Dataset: 5000 total samples  (4500 normal + 500 anomalous)
Model trained. Flagged 500/5000 samples as anomalous (10.0%)
Saved: anomaly_model.pkl  scaler.pkl

Step 3: Start the ML Service

Keep this terminal open:

# Still inside ML-Model/
uvicorn ml_service:app --host 0.0.0.0 --port 8000

Verify it's up: http://localhost:8000/health → {"status":"ok"}

Step 4: Start the Demo Service + Dashboard

Open a new terminal:

cd SHC/Demo-Service
npm install

Windows (PowerShell):

$env:ML_SERVICE_URL = "http://localhost:8000"
node index.js

macOS / Linux (bash/zsh):

ML_SERVICE_URL=http://localhost:8000 node index.js

Step 5: Open the Dashboard

Open your browser and go to:

http://localhost:3000

You should see the live dark-mode dashboard with metrics updating every 5 seconds.


Running on Kubernetes

Step 1: Start Minikube

minikube start
eval $(minikube docker-env)    # macOS/Linux
# Windows PowerShell:
# & minikube -p minikube docker-env --shell powershell | Invoke-Expression

Step 2: Build Docker Images

# Build ML service image
cd SHC/ML-Model
docker build -t ml-anomaly-service .

# Build Demo-Service image
cd ../Demo-Service
docker build -t selfheal-app .

Step 3: Deploy to Kubernetes

cd SHC

# RBAC (service account, role, rolebinding)
kubectl apply -f Demo-Service/rbac.yaml

# ML Service
kubectl apply -f ML-Model/ml-deployment.yaml
kubectl apply -f ML-Model/ml-service.yaml

# Demo Service + Dashboard
kubectl apply -f Demo-Service/deployment.yaml
kubectl apply -f Demo-Service/service.yaml

Step 4: Access the Dashboard

minikube service selfheal-service

This opens the dashboard automatically in your browser.

Step 5: Verify All Pods are Running

kubectl get pods
kubectl get services

Expected:

NAME                            READY   STATUS    RESTARTS
ml-service-xxxx                 1/1     Running   0
selfheal-app-xxxx               1/1     Running   0
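
The same check can be automated. The sketch below parses text in the column layout shown above and lists pods that are not fully ready; the function name is hypothetical, and it assumes the default whitespace-separated `kubectl get pods` table with NAME, READY, and STATUS columns.

```python
def unhealthy_pods(kubectl_output: str) -> list[str]:
    """Return names of pods whose STATUS is not Running or whose READY is not n/n."""
    rows = [ln.split() for ln in kubectl_output.strip().splitlines() if ln.strip()]
    if not rows:
        return []
    header, pods = rows[0], rows[1:]
    name_i = header.index("NAME")
    ready_i = header.index("READY")
    status_i = header.index("STATUS")
    bad = []
    for cols in pods:
        ready, _, total = cols[ready_i].partition("/")
        if cols[status_i] != "Running" or ready != total:
            bad.append(cols[name_i])
    return bad


# The text could come from:
#   subprocess.run(["kubectl", "get", "pods"], capture_output=True, text=True).stdout
```

An empty list means every listed pod is Running and fully ready.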

Dashboard Guide

| Section | Description |
| --- | --- |
| Header | System name, live status badge (NORMAL / DETECTING / CONFIRMED), live clock, WebSocket connection indicator |
| Cluster Nodes | 3 animated node cards (Master + 2 Workers) that change colour based on anomaly state |
| Live Metrics | 8 metric cards with progress bars that turn amber/red when thresholds are exceeded |
| Rolling Chart | Chart.js multi-line chart showing the last 60 data points for CPU, Memory, Latency, Error Rate |
| Anomaly Detection | Pulsing indicator with state description + counters (Heals, Anomalies, Uptime) |
| Healing Log | Table of all healing events: timestamp, issue, key metrics, action taken |
| Demo Controls | "⚡ Simulate Node Stress" triggers anomalous metrics immediately; "↺ Reset to Normal" resets the simulation back to normal metrics; links to raw /api/metrics and /api/events JSON |
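
The rolling chart keeps only the most recent 60 data points. The dashboard itself does this in JavaScript (dashboard/app.js); the Python sketch below only illustrates the fixed-size window idea, and the class name is hypothetical. If one point arrives per 5-second broadcast (an assumption), 60 points cover roughly the last 5 minutes.

```python
from collections import deque


class RollingSeries:
    """Fixed-size window: appending beyond max_points drops the oldest value."""

    def __init__(self, max_points: int = 60):
        self.points = deque(maxlen=max_points)

    def push(self, value: float) -> None:
        self.points.append(value)

    def as_list(self) -> list:
        return list(self.points)


cpu = RollingSeries()
for sample in range(100):  # 100 samples pushed, only the last 60 are kept
    cpu.push(float(sample))
```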

Demo Scenario for Presentation

  1. Open dashboard → show NORMAL state, point out all 8 live metrics
  2. Click "⚡ Simulate Node Stress"
  3. Within ~30 seconds:
    • Status badge changes: NORMAL → DETECTING → CONFIRMED
    • Node cards change colour: Healthy → Degraded → Critical
    • Metric cards turn red
    • Anomaly count increments
  4. After healing is confirmed: new row appears in Healing Event Log
  5. Click "↺ Reset": system recovers automatically to NORMAL
  6. Mention the automatic 5 failure scenario rotation: CPU Spike → OOM → Disk I/O → Network → Crash Loop
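
The same scenario can be driven without clicking, using the /stress, /health, and /reset endpoints listed in the API Reference. This is a rough sketch under assumptions: the function names are hypothetical, it relies only on the documented `state` field of /health, and because it polls, it can miss a state that lasts less than one polling interval.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:3000"  # Demo-Service address from the Quick Start


def http_get(path: str) -> str:
    """GET a Demo-Service path and return the response body as text."""
    with urllib.request.urlopen(BASE_URL + path, timeout=5) as resp:
        return resp.read().decode("utf-8")


def run_demo(get=http_get, sleep=time.sleep, poll_s=5.0, timeout_s=120.0) -> list:
    """Trigger stress, record each distinct state seen on /health, then reset."""
    seen = []
    get("/stress")
    waited = 0.0
    while waited <= timeout_s:
        state = json.loads(get("/health")).get("state", "UNKNOWN")
        if not seen or seen[-1] != state:
            seen.append(state)
        if state == "CONFIRMED":
            break
        sleep(poll_s)
        waited += poll_s
    get("/reset")  # always return the simulation to normal metrics
    return seen
```

Calling `run_demo()` against a running Demo-Service returns the list of states observed, which is useful as a smoke test before a presentation.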

API Reference

All endpoints are on the Demo-Service (http://localhost:3000):

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | / | Serves the dashboard UI |
| GET | /health | Health check → {"status":"ok","state":"NORMAL"} |
| GET | /api/metrics | Latest metrics snapshot (JSON) |
| GET | /api/events | Full healing event log (JSON array) |
| GET | /api/state | Current state + uptime stats |
| GET | /stress | Trigger anomalous metrics simulation |
| GET | /reset | Reset metrics to normal |
| GET | /crash?token=shc-secret | Intentional crash (token-protected) |

ML Service (http://localhost:8000):

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | {"status":"ok"} |
| GET | /info | Model metadata (algorithm, features, contamination) |
| POST | /predict | Predict anomaly from 8 metrics → {"anomaly":bool,"score":float} |
| GET | /docs | Interactive Swagger UI |
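
A client call to /predict might look like the sketch below. It assumes the request body is a flat JSON object keyed by the eight feature names from ML Model Details; the authoritative schema is whatever /docs shows for the running service. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumed field names, taken from the feature table in ML Model Details.
FEATURES = (
    "cpu_usage", "memory_usage", "request_rate", "latency",
    "pod_restarts", "disk_io", "network_errors", "error_rate",
)


def build_predict_request(metrics: dict, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST /predict request; raises ValueError if a metric is missing."""
    missing = [name for name in FEATURES if name not in metrics]
    if missing:
        raise ValueError(f"missing metrics: {missing}")
    body = json.dumps({name: float(metrics[name]) for name in FEATURES}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(raw: bytes) -> tuple:
    """Extract (anomaly, score) from the documented response shape."""
    data = json.loads(raw)
    return bool(data["anomaly"]), float(data["score"])


# Sending it, with the ML service running:
#   with urllib.request.urlopen(build_predict_request(metrics), timeout=5) as resp:
#       anomaly, score = parse_prediction(resp.read())
```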

ML Model Details

| Parameter | Value |
| --- | --- |
| Algorithm | Isolation Forest |
| Library | scikit-learn |
| Estimators | 200 trees |
| Contamination | 10% |
| Training samples | 5000 (4500 normal + 500 anomalous) |
| Features | 8 |
| Random seed | 42 |

Features

| Feature | Description | Normal Range | Alert Threshold |
| --- | --- | --- | --- |
| cpu_usage | CPU utilisation % | 10–60% | > 85% |
| memory_usage | RAM utilisation % | 30–65% | > 90% |
| request_rate | Requests per second | 180–380 | < 50 |
| latency | Request latency (ms) | 60–280 | > 1500 |
| pod_restarts | Restart count | 0–1 | ≥ 4 |
| disk_io | Disk I/O utilisation % | 10–55% | > 87% |
| network_errors | Errors per minute | 0–6 | > 45 |
| error_rate | Error fraction (0–1) | 0–0.04 | > 0.30 |
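
The alert thresholds above are enough to express a simple rule-based check of the kind the fallback mode relies on. The sketch below is an illustration only: the names are hypothetical, and treating any single breached threshold as an anomaly is an assumption, since the exact fallback rule in Demo-Service is not documented here.

```python
import operator

# (comparison, limit) per feature, copied from the Alert Threshold column.
ALERT_THRESHOLDS = {
    "cpu_usage": (operator.gt, 85.0),
    "memory_usage": (operator.gt, 90.0),
    "request_rate": (operator.lt, 50.0),   # low traffic is the alert condition
    "latency": (operator.gt, 1500.0),
    "pod_restarts": (operator.ge, 4),
    "disk_io": (operator.gt, 87.0),
    "network_errors": (operator.gt, 45.0),
    "error_rate": (operator.gt, 0.30),
}


def breached_thresholds(metrics: dict) -> list:
    """Return the names of features whose value crosses its alert threshold."""
    return [
        name
        for name, (compare, limit) in ALERT_THRESHOLDS.items()
        if name in metrics and compare(metrics[name], limit)
    ]


def threshold_anomaly(metrics: dict) -> bool:
    """Assumed rule: one breached threshold is enough to flag an anomaly."""
    return bool(breached_thresholds(metrics))
```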

Failure Scenarios Trained

| Scenario | Characteristics |
| --- | --- |
| CPU Spike | cpu_usage > 87%, high latency, elevated error rate |
| Memory Exhaustion (OOM) | memory_usage > 90%, many pod restarts, high error rate |
| Disk I/O Saturation | disk_io > 88%, extreme latency > 1800ms |
| Network Degradation | network_errors > 48/min, very high latency, high error rate |
| Crash Loop | pod_restarts > 5, high CPU + memory, low request rate |
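
For debugging, it can help to see which trained scenario a metrics sample resembles. The sketch below uses only the numeric conditions from the table; qualitative traits such as "high latency" or "low request rate" have no stated values and are deliberately left out. This is a hypothetical helper, separate from the Isolation Forest, which does not classify scenarios.

```python
# Numeric conditions only, copied from the scenario table.
SCENARIO_RULES = {
    "CPU Spike": lambda m: m["cpu_usage"] > 87,
    "Memory Exhaustion (OOM)": lambda m: m["memory_usage"] > 90,
    "Disk I/O Saturation": lambda m: m["disk_io"] > 88 and m["latency"] > 1800,
    "Network Degradation": lambda m: m["network_errors"] > 48,
    "Crash Loop": lambda m: m["pod_restarts"] > 5,
}


def matching_scenarios(metrics: dict) -> list:
    """Return the scenario names whose numeric conditions the sample satisfies."""
    return [name for name, rule in SCENARIO_RULES.items() if rule(metrics)]
```

A sample can match several scenarios at once, for example a crash loop that also drives CPU and memory past their limits.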

Troubleshooting

Port 3000 already in use

# Windows
netstat -ano | findstr ":3000"
taskkill /PID <PID> /F

# Or run on a different port:
$env:PORT = "3001"
node index.js
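
On any platform, a quick way to confirm whether a port is taken is to try binding to it. The helper names below are hypothetical; the sketch only checks the IPv4 loopback address, so treat the result as a hint, not a guarantee.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start: int = 3000, count: int = 10):
    """Return the first free port in [start, start + count), or None."""
    for port in range(start, start + count):
        if port_is_free(port):
            return port
    return None
```

The value returned by `first_free_port()` can then be passed to the Demo-Service through the PORT environment variable shown above.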

ML service not reachable

The Demo-Service has a built-in fallback using fixed thresholds, so it will still detect anomalies even without the ML service. Check that ML_SERVICE_URL is set correctly.
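
The fallback pattern itself is small: try the ML service first, and switch to the threshold check only when the call fails. The sketch below is an illustration with hypothetical names, not the Node.js code in index.js; both checks are injected as callables, and only network-level failures (OSError and its subclasses, which include connection errors and timeouts) trigger the fallback.

```python
from typing import Callable, Tuple


def detect_with_fallback(
    metrics: dict,
    ml_predict: Callable[[dict], bool],
    threshold_check: Callable[[dict], bool],
) -> Tuple[bool, str]:
    """Return (is_anomaly, source), where source is 'ml' or 'threshold'."""
    try:
        return bool(ml_predict(metrics)), "ml"
    except OSError:
        # ML service unreachable or timed out: use fixed thresholds instead.
        return bool(threshold_check(metrics)), "threshold"
```

Returning the source alongside the verdict makes it visible in logs when the system is running in degraded mode.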

anomaly_model.pkl not found

Re-run the training script from inside the ML-Model/ directory:

cd ML-Model
python train_model.py

Python pip install fails

Try using a virtual environment:

cd ML-Model
python -m venv venv
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate
pip install -r requirements.txt

Minikube image not found

Make sure you built the Docker images after running eval $(minikube docker-env), so that the images exist inside Minikube's Docker daemon instead of only on your host:

eval $(minikube docker-env)   # must run this first!
docker build -t ml-anomaly-service ./ML-Model
docker build -t selfheal-app ./Demo-Service

Tech Stack

Layer Technology
ML Model Python Β· scikit-learn (IsolationForest) Β· pandas Β· numpy Β· joblib
ML Service FastAPI Β· uvicorn Β· Pydantic
Monitor Service Node.js Β· Express Β· ws (WebSocket) Β· axios
Kubernetes Client @kubernetes/client-node
Dashboard HTML5 Β· CSS3 Β· JavaScript (ES2022) Β· Chart.js
Containerisation Docker
Orchestration Kubernetes Β· kubectl
RBAC Kubernetes ServiceAccount + Role + RoleBinding

Author

Swaroop Vyawahare

Contributor

Shreyash Shirsat

Final Year Academic Project: Self-Healing Cluster (SHC)

Built to demonstrate how ML-driven observability can automate Kubernetes node recovery without human intervention.
