A modular Security Operations Center (SOC) detection engine that combines supervised machine learning, anomaly detection, and rule-based logic to identify cyber threats in real time. The system is designed with evaluation rigor, temporal validation, and deployment constraints in mind.
- **Hybrid Detection Pipeline:** Combines ML-based classification with anomaly scoring and deterministic rules.
- **Multi-Layer Threat Detection:**
  - Structured ML model (LightGBM) for flow-based intrusion detection
  - Sentence-BERT for semantic log understanding
  - Isolation Forest for anomaly detection
  - Rule engine for deterministic threat signatures
- **MITRE ATT&CK Mapping:** Automatically maps detections to TTPs (e.g., T1110 – Brute Force).
- **Temporal Evaluation Framework:** Supports both random and chronological data splits to measure generalization under distribution shift.
- **Drift Monitoring:** Tracks embedding and feature drift to detect changes in traffic behavior.
- **Production-Oriented Design:** Async FastAPI microservice, structured logging, Dockerized deployment.
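Feature drift of the kind monitored here is often summarized with a Population Stability Index (PSI) between a reference window and a live window. A minimal, dependency-free sketch — the function name and thresholds are illustrative, not the project's actual API:

```python
import math

def psi(reference, live, bins=10):
    """Population Stability Index between two 1-D samples.
    PSI < 0.1 is usually read as stable; > 0.25 as significant drift."""
    lo, hi = min(reference), max(reference)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bucket_fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1  # bucket index for x
        n = len(sample)
        # Small epsilon keeps log() finite for empty buckets.
        return [max(c / n, 1e-6) for c in counts]

    ref_f, live_f = bucket_fractions(reference), bucket_fractions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_f, live_f))

baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
print(psi(baseline, baseline))        # 0.0 (identical windows)
print(psi(baseline, shifted) > 0.25)  # True (drift flagged)
```

In production the reference window would be refreshed periodically (e.g., after each retrain) so that benign seasonal changes are not flagged forever.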
```mermaid
graph TD
    A[Log Ingestion] -->|Async Queue| B(API Gateway / FastAPI);
    B --> C{Detection Core};
    C -->|ML Classifier| D[LightGBM IDS];
    C -->|Semantic Analysis| E[Sentence-BERT];
    C -->|Statistical Check| F[Isolation Forest];
    C -->|Rule Check| G[Rule Engine];
    D --> H[Threshold Calibration];
    C --> I[Result Aggregator];
    I --> J[MITRE Mapper];
    J --> K[JSON Response];
```
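The Result Aggregator → MITRE Mapper → JSON Response path can be sketched as follows. The dataclass shape and the technique table are illustrative assumptions, not the project's real interfaces; the ATT&CK IDs themselves are standard:

```python
from dataclasses import dataclass

# Hypothetical per-layer verdict; real detectors emit richer payloads.
@dataclass
class Detection:
    layer: str    # "ml", "semantic", "statistical", "rule"
    label: str    # e.g. "brute_force"
    score: float  # calibrated confidence in [0, 1]

# Illustrative subset of MITRE ATT&CK technique IDs.
MITRE_MAP = {
    "brute_force": ("T1110", "Brute Force"),
    "port_scan": ("T1046", "Network Service Discovery"),
    "ddos": ("T1498", "Network Denial of Service"),
}

def aggregate(detections, threshold=0.5):
    """Keep confident detections and attach ATT&CK TTPs for the JSON response."""
    results = []
    for d in detections:
        if d.score < threshold:
            continue
        ttp_id, ttp_name = MITRE_MAP.get(d.label, ("unknown", "Unmapped"))
        results.append({
            "layer": d.layer,
            "label": d.label,
            "score": round(d.score, 3),
            "mitre": {"id": ttp_id, "name": ttp_name},
        })
    return results

hits = aggregate([
    Detection("rule", "brute_force", 0.95),
    Detection("statistical", "ddos", 0.30),  # below threshold, dropped
])
print(hits[0]["mitre"]["id"])  # T1110
```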
```
src/
├── api/          # FastAPI application & endpoints
├── models/       # LightGBM IDS, BERT, Isolation Forest
├── detection/    # Core detection orchestration
├── rules/        # Rule-based detection logic
├── mitre/        # MITRE ATT&CK mapping
├── monitoring/   # Drift detection & calibration
├── evaluation/   # Benchmarking & temporal validation
└── utils/        # Preprocessing & helpers
scripts/          # Load testing & utilities
```
- Python 3.8+
- Docker (optional)
- Clone the repository
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Train the model:

  ```bash
  python train_siem.py
  ```

- Start the API:

  ```bash
  uvicorn src.api.main:app --reload
  ```

- Or run the full stack with Docker:

  ```bash
  docker-compose up --build -d
  ```

Evaluation is performed under two strategies:
| Split Strategy | ROC-AUC | Detection Rate | False Positive Rate |
|---|---|---|---|
| Random Flow-Level | ~0.999 | ~99.98% | ~0.1% |
| Chronological (Time-Based) | evaluated to measure real-world generalization (see note below) | | |
Note: Chronological split simulates deployment by training on earlier capture days and testing on future traffic to reduce leakage effects.
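The split described in the note amounts to sorting by capture time and cutting at a fraction, so training never sees future traffic. A minimal sketch (the `timestamp` field name is an assumption about the flow records):

```python
# Leakage-free temporal split: order flows by capture time, then cut.
def chronological_split(flows, train_frac=0.8, time_key="timestamp"):
    ordered = sorted(flows, key=lambda f: f[time_key])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

flows = [{"timestamp": t, "label": "benign"} for t in (5, 1, 4, 2, 3)]
train, test = chronological_split(flows, train_frac=0.6)
print([f["timestamp"] for f in train])  # [1, 2, 3]
print([f["timestamp"] for f in test])   # [4, 5]
```

Unlike a random split, any session whose flows straddle the cut contributes only its earlier flows to training, which is exactly the leakage the note is guarding against.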
- Single Sample Latency: ~3–5 ms (CPU)
- Throughput (Batch 32): ~4000 samples/sec
- Async API Throughput: ~200+ logs/sec per worker
| Detection Layer | Technique | Example |
|---|---|---|
| Flow-Based IDS | LightGBM | DDoS, DoS, PortScan, Brute Force |
| Semantic | Sentence-BERT | Suspicious command patterns in logs |
| Statistical | Isolation Forest | Traffic volume anomalies |
| Rule-Based | Threshold/Pattern | 5 failed logins in 10s |
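The "5 failed logins in 10s" rule in the table is a classic sliding-window threshold. A minimal sketch — class and field names are illustrative, not the project's rule-engine API:

```python
from collections import defaultdict, deque

class FailedLoginRule:
    """Fire when one source IP produces `limit` failed logins within `window_s` seconds."""
    def __init__(self, limit=5, window_s=10.0):
        self.limit = limit
        self.window_s = window_s
        self.events = defaultdict(deque)  # src_ip -> timestamps of failures

    def observe(self, src_ip, ts):
        q = self.events[src_ip]
        q.append(ts)
        # Evict timestamps that fell out of the sliding window.
        while q and ts - q[0] > self.window_s:
            q.popleft()
        return len(q) >= self.limit  # True -> raise a brute-force alert

rule = FailedLoginRule()
alerts = [rule.observe("10.0.0.7", t) for t in (0, 2, 4, 6, 8)]
print(alerts)  # [False, False, False, False, True]
```

A deque per source IP keeps eviction O(1) amortized, which matters at the per-worker log rates quoted above.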
This project emphasizes:
- Impact of data splitting strategy on IDS performance
- Performance inflation under naive random splits
- Temporal validation to approximate deployment behavior
- Threshold calibration (Youden-J vs Max-F1)
- Per-class detection analysis for rare attacks
The goal is not just high metrics, but defensible and reproducible evaluation.
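The two calibration objectives mentioned above (Youden-J vs Max-F1) pick different operating points on the same score distribution. A dependency-free sketch on toy scores (not project data) showing how they can disagree:

```python
def confusion(scores, labels, thr):
    tp = sum(s >= thr and y for s, y in zip(scores, labels))
    fp = sum(s >= thr and not y for s, y in zip(scores, labels))
    fn = sum(s < thr and y for s, y in zip(scores, labels))
    tn = sum(s < thr and not y for s, y in zip(scores, labels))
    return tp, fp, fn, tn

def youden_j(scores, labels, thr):
    tp, fp, fn, tn = confusion(scores, labels, thr)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr - fpr  # J statistic: sensitivity + specificity - 1

def f1(scores, labels, thr):
    tp, fp, fn, _ = confusion(scores, labels, thr)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def calibrate(scores, labels, objective):
    """Pick the candidate threshold maximizing the given objective."""
    return max(sorted(set(scores)), key=lambda t: objective(scores, labels, t))

scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,   1,   1,   1]
print(calibrate(scores, labels, youden_j))  # 0.6  (trades a miss for zero FPR)
print(calibrate(scores, labels, f1))        # 0.35 (tolerates one FP to catch all positives)
```

Youden-J weights false positives and false negatives via rates, while F1 ignores true negatives entirely, so on imbalanced SOC traffic the two can sit at very different alert volumes.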
Benchmark:

```bash
python src/evaluation/benchmark.py
```

Load test:

```bash
python scripts/load_test.py
```

- Cross-dataset validation (UNSW-NB15 / CIC-IDS2018)
- Adaptive thresholding under drift
- Online learning module
- Entity graph anomaly detection
Rishit Sharma, Kokkula Srinivas
Detection Engineering | ML for Cyber Defense