I compared probabilistic clustering (Gaussian Mixture Models) with isolation-based anomaly detection (Isolation Forest) to evaluate how consistently behavioral outliers are identified under different modeling assumptions.
This project implements an unsupervised behavioral segmentation framework using Principal Component Analysis (PCA) and Gaussian Mixture Models (GMM).
The objective is to identify structured behavioral patterns and detect probabilistic anomalies in high-dimensional interaction data.
Rather than relying on labeled outcomes, this approach models latent structure within engagement behavior to uncover:
- Dominant behavioral clusters
- Mixed-membership interaction profiles
- Overlapping probabilistic segments
- High-variance behavioral outliers
This framework is applicable to cybersecurity (user behavior analytics), fraud detection, risk modeling, and anomaly detection environments.
Dataset: Public engagement interaction dataset (CSV format)
Location: data/engagement_behavior_data.csv
The dataset contains interaction-level metrics including reactions, comments, shares, and engagement attributes.
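As a minimal sketch of the loading step, the snippet below reads a CSV, keeps the numeric interaction columns, and standardizes them so that no single metric dominates later steps. The column names (`post_type`, `reactions`, `comments`, `shares`) and the helper `load_scaled_features` are hypothetical stand-ins, and the in-memory sample replaces the real file so the example runs on its own.

```python
import io

import pandas as pd
from sklearn.preprocessing import StandardScaler


def load_scaled_features(source):
    """Read a CSV, keep numeric columns, drop incomplete rows, and z-score them."""
    df = pd.read_csv(source)
    numeric = df.select_dtypes(include="number").dropna()
    scaled = StandardScaler().fit_transform(numeric)
    return pd.DataFrame(scaled, columns=numeric.columns, index=numeric.index)


# Tiny in-memory stand-in for the dataset file (hypothetical columns and values).
sample_csv = io.StringIO(
    "post_type,reactions,comments,shares\n"
    "video,120,14,9\n"
    "photo,45,3,1\n"
    "status,8,0,0\n"
    "video,300,52,40\n"
)

features = load_scaled_features(sample_csv)
print(features.round(2))
```

With the real data, the same helper could be called as `load_scaled_features("data/engagement_behavior_data.csv")`, assuming the file has a header row.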
High-dimensional interaction variables were reduced using PCA to:
- Preserve dominant variance structure
- Identify engagement intensity axes
- Detect compositional behavior differences
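The first principal component captures overall engagement intensity, with all interaction counts rising together.
The second component captures differences in interaction composition, that is, how engagement is split across interaction types.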
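The PCA step described above can be sketched as follows. This is an illustration on synthetic data, not the project's actual pipeline: the three columns are hypothetical stand-ins for reactions, comments, and shares, all driven by a shared intensity factor so that the first component behaves like an intensity axis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: one shared "intensity" factor drives all three counts.
intensity = rng.gamma(shape=2.0, scale=1.0, size=500)
X = np.column_stack([
    intensity * 50 + rng.normal(0, 5, 500),  # reactions (hypothetical)
    intensity * 8 + rng.normal(0, 2, 500),   # comments (hypothetical)
    intensity * 4 + rng.normal(0, 2, 500),   # shares (hypothetical)
])

# Standardize first so the largest-scale column does not dominate the components.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X_scaled)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("PC1 loadings:", pca.components_[0])
print("PC2 loadings:", pca.components_[1])
```

When every loading on the first component has the same sign, that component can be read as an overall intensity axis. Mixed-sign loadings on the second component indicate a contrast between interaction types.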
Gaussian Mixture Models (GMM) were applied for soft clustering.
This allows:
- Probabilistic segment membership
- Identification of hybrid behavior
- Detection of overlapping clusters
- More realistic behavioral modeling compared to hard clustering
Posterior probability distributions were analyzed to detect anomaly-like behavior patterns.
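A minimal sketch of the soft-clustering step, assuming two components and synthetic 2-D points standing in for PCA scores. The 0.7 posterior cutoff and the 1% likelihood cutoff are illustrative choices, not values from this project. Using the per-sample log-likelihood as a second anomaly signal is also an assumption added here, since a far-away point can still receive a confident posterior for its nearest component.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Synthetic stand-in for PCA scores: two overlapping groups plus a few far points.
group_a = rng.normal(loc=[-1.5, 0.0], scale=0.8, size=(200, 2))
group_b = rng.normal(loc=[1.5, 0.0], scale=0.8, size=(200, 2))
far_points = np.array([[8.0, 8.0], [-9.0, 7.0], [9.0, -8.0]])
X = np.vstack([group_a, group_b, far_points])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

# Posterior membership probabilities, one row per sample, one column per component.
posteriors = gmm.predict_proba(X)
max_posterior = posteriors.max(axis=1)

# Hybrid behavior: no single component clearly dominates (0.7 is illustrative).
hybrid_mask = max_posterior < 0.7

# Low-likelihood points: poorly explained by the mixture as a whole (bottom 1%).
log_likelihood = gmm.score_samples(X)
low_likelihood_mask = log_likelihood <= np.percentile(log_likelihood, 1)

print("hybrid samples:", int(hybrid_mask.sum()))
print("low-likelihood samples:", int(low_likelihood_mask.sum()))
```

In practice the number of components would be chosen with a criterion such as BIC (`gmm.bic(X)`) rather than fixed in advance.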
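To compare clustering-based segmentation with a dedicated anomaly detector, Isolation Forest was applied to the same feature space. Isolation Forest is isolation-based rather than density-based: it scores each point by how few random splits are needed to separate it from the rest of the data.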
This provides an alternative perspective on behavioral outliers under a different modeling assumption.
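One way to make this comparison concrete is to flag the same fraction of points with each method and measure how much the two flagged sets overlap. The sketch below does this on synthetic data. The 2% contamination rate, the use of GMM log-likelihood as the clustering-side anomaly score, and the Jaccard index as the agreement measure are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Synthetic stand-in for the shared feature space.
inliers = np.vstack([
    rng.normal([-1.5, 0.0], 0.8, size=(200, 2)),
    rng.normal([1.5, 0.0], 0.8, size=(200, 2)),
])
far_points = np.array([[8.0, 8.0], [-9.0, 7.0], [9.0, -8.0], [-8.0, -9.0]])
X = np.vstack([inliers, far_points])

contamination = 0.02  # illustrative fraction of points to flag with each method

# Isolation Forest: fit_predict returns -1 for points flagged as anomalous.
iso = IsolationForest(n_estimators=200, contamination=contamination, random_state=0)
iso_flags = iso.fit_predict(X) == -1

# GMM: flag the same fraction of points with the lowest log-likelihood.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_likelihood = gmm.score_samples(X)
gmm_flags = log_likelihood <= np.quantile(log_likelihood, contamination)

# Jaccard index: size of the intersection divided by size of the union.
both = int(np.sum(iso_flags & gmm_flags))
either = int(np.sum(iso_flags | gmm_flags))
jaccard = both / either if either else 0.0

print("flagged by Isolation Forest:", int(iso_flags.sum()))
print("flagged by GMM likelihood:", int(gmm_flags.sum()))
print("Jaccard overlap:", round(jaccard, 3))
```

A Jaccard value near 1 would suggest the two modeling assumptions agree on which observations are outliers, while a low value points to method-specific outliers worth inspecting separately.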
Python · scikit-learn · pandas · numpy · matplotlib
- Behavioral engagement exhibits structured clustering rather than random dispersion.
- The first principal component captures overall interaction intensity.
- Gaussian Mixture Models reveal overlapping probabilistic segments.
- Isolation Forest identifies high-intensity edge cases concentrated in extreme PCA regions.
- Behavioral outliers are more visible under isolation-based anomaly modeling.
- Extended Isolation Forest anomaly comparison
- DBSCAN density-based clustering
- Application to cybersecurity log datasets
- Temporal behavioral drift modeling
Regina
Cybersecurity & Risk Analytics