Project Improvements - Multivariate time series clustering integration #182
RalphVita wants to merge 23 commits into petrobras:other_improvements
…me series clustering capabilities
…stering estimators and transformers.
…mputations like DTW.
…rarchical clustering.
…ank instances from most extreme outlier to the dense core based on recursive distance evaluation.
…are consistently clustered across multiple sensor variables.
…al and simulated data
Hi, @RalphVita. It seems that this PR will result in another significant contribution from the Federal University of Espírito Santo to the 3W Project. We will evaluate it ASAP and will let you know here if we have any questions and/or requests for adjustments. On behalf of the 3W Community, I thank you for this PR.
Some automatic checks were not successful. Can you check details here and adjust whatever is necessary so that 100% of these checks are successful?
Hi @ricardoevvargas, thanks for the feedback! I fixed the issues in commit 24e41bf. The tests are now passing locally (mypy, black, ruff, and pytest: 507 passed).
Overview
This PR introduces a complete multivariate time series clustering pipeline to the ThreeWToolkit, enabling automatic selection of structurally representative instances from well event data. Two complementary clustering strategies are implemented — agglomerative hierarchical and divisive ranking — combined through a multivariate consensus mechanism that identifies instances consistently similar across all sensor variables simultaneously.
Summary of Changes
1. Clustering Pipeline (ThreeWToolkit/clustering/)
Seven sklearn-compatible components following the fit/transform contract, each driven by a Pydantic config model: InstanceQualityFilter, TimeSeriesResampler, TimeSeriesScaler, DistanceComputer, HierarchicalClusterer, DivisiveRanker, and MultivariateConsensus, plus the compute_dba_centroid utility.
2. Agglomerative Hierarchical Clustering
For each sensor variable, a hierarchical tree is built using average linkage over pairwise DTW distances, with all distances normalized to [0, 1] to make thresholds comparable across variables. Sweeping thresholds from 0.1 to 1.0 produces a per-variable cluster membership at each cut.
3. Divisive Ranking
As a complementary strategy, instances are eliminated top-down: at each step the instance with the highest cumulative distance to all remaining instances is removed first. This produces a ranked ordering from most-outlier to tightest centroid, with interpretable elimination distances per instance.
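The elimination loop described above can be sketched in plain Python. This is a minimal illustration, not the DivisiveRanker implementation from this PR: the function name `divisive_ranking`, the nested-list distance matrix, and the choice to report the cumulative distance at removal time as the "elimination distance" are all assumptions made for the example.

```python
from typing import List, Tuple


def divisive_ranking(dist: List[List[float]]) -> List[Tuple[int, float]]:
    """Rank instances from most extreme outlier to dense core.

    `dist` is a symmetric pairwise distance matrix (for example,
    normalized DTW distances). Returns (instance_index,
    elimination_distance) pairs in removal order. The elimination
    distance used here (cumulative distance to the instances still
    remaining) is an assumption for illustration.
    """
    remaining = list(range(len(dist)))
    ranking: List[Tuple[int, float]] = []
    if not remaining:
        return ranking

    while len(remaining) > 1:
        # Cumulative distance of each remaining instance to all the others.
        totals = {
            i: sum(dist[i][j] for j in remaining if j != i)
            for i in remaining
        }
        # The instance farthest from everything else is removed first.
        worst = max(remaining, key=lambda i: totals[i])
        ranking.append((worst, totals[worst]))
        remaining.remove(worst)

    # The last survivor is the tightest "core" instance.
    ranking.append((remaining[0], 0.0))
    return ranking


# Hypothetical 4-instance matrix: instance 3 is far from the other three.
example = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.15, 0.8],
    [0.2, 0.15, 0.0, 1.0],
    [0.9, 0.8, 1.0, 0.0],
]
order = [idx for idx, _ in divisive_ranking(example)]
# Instance 3 is eliminated first, then instance 2.
```

Recomputing the totals after every removal keeps the ranking recursive: an instance that only looked extreme because of an already-removed outlier is not penalized for it.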
4. Multivariate Consensus (MultivariateConsensus)
Intersects the per-variable cluster memberships across all sensor variables at every threshold. An instance survives only if it belongs to the main cluster for every variable simultaneously, producing a binary selection matrix (instances × thresholds) and a consensus survival curve.
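The intersection step can be illustrated with a short, dependency-free sketch. It assumes each variable's main-cluster membership is already available as a boolean matrix (rows are instances, columns are thresholds). The names `multivariate_consensus`, `survival_curve`, and the variable keys are hypothetical and do not reflect the actual MultivariateConsensus API.

```python
from typing import Dict, List

# Rows are instances, columns are thresholds.
Membership = List[List[bool]]


def multivariate_consensus(per_variable: Dict[str, Membership]) -> Membership:
    """Logical AND of main-cluster membership across all variables."""
    matrices = list(per_variable.values())
    if not matrices:
        return []

    n_instances = len(matrices[0])
    n_thresholds = len(matrices[0][0]) if n_instances else 0
    for m in matrices:
        if len(m) != n_instances or any(len(row) != n_thresholds for row in m):
            raise ValueError(
                "all variables must share the same instances x thresholds shape"
            )

    # An instance is selected at a threshold only if every variable agrees.
    return [
        [all(m[i][t] for m in matrices) for t in range(n_thresholds)]
        for i in range(n_instances)
    ]


def survival_curve(selection: Membership) -> List[int]:
    """Count of consensus-selected instances at each threshold."""
    if not selection:
        return []
    return [sum(column) for column in zip(*selection)]


# Hypothetical memberships for 3 instances at 2 thresholds.
memberships = {
    "var_a": [[True, True], [False, True], [True, True]],
    "var_b": [[True, True], [True, True], [False, True]],
}
selection = multivariate_consensus(memberships)
curve = survival_curve(selection)
# Only instance 0 survives the first threshold; all three survive the second.
```

Because the consensus is a strict AND, a single variable can veto an instance, which is why the survival curve can never exceed the smallest per-variable main-cluster size at a given threshold.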
5. Visualization Suite (data_visualization/clustering_plots.py)
Six new plot classes: DataQualityHeatmap, DendrogramPlot, ClusterSizeCurvePlot, SelectionHeatmapPlot, ClusteringOverlayPlot (with DBA centroid in red), and RankedDistancePlot (distinguishing consensus-selected, vetoed, and local-outlier instances).
6. Optional Dependency Group (pyproject.toml)
pip install ThreeWToolkit[clustering]: adds dtaidistance for DTW computation and joblib for parallel distance matrix computation.
7. Test Coverage (tests/)
195 new tests across 6 files covering all pipeline components, edge cases, and visualizations.
8. Example Notebooks (docs/notebooks/)
Two fully executed notebooks: a step-by-step pipeline walkthrough (clustering_pipeline_examples.ipynb) and a source comparison across Real, Simulated, and Combined data (clustering_real_vs_simulated_examples.ipynb).
Final Notes
This pipeline gives researchers a systematic, reproducible method to select the most structurally consistent instances of each event type before feeding them into downstream models — reducing noise in training data and providing interpretable visual evidence for every selection decision.
By creating this pull request, I confirm that I have read and fully accept and agree with one of the Petrobras' Contributor License Agreements (CLAs):
Our CLAs are based on the Apache Software Foundation's CLAs: