Project Improvements - Multivariate time series clustering integration #182

Open
RalphVita wants to merge 23 commits into petrobras:other_improvements from RalphVita:feat/clustering-integration
Conversation

@RalphVita
Overview

This PR introduces a complete multivariate time series clustering pipeline to the ThreeWToolkit, enabling automatic selection of structurally representative instances from well event data. Two complementary clustering strategies are implemented — agglomerative hierarchical and divisive ranking — combined through a multivariate consensus mechanism that identifies instances consistently similar across all sensor variables simultaneously.


Summary of Changes

1. Clustering Pipeline (ThreeWToolkit/clustering/)

Seven sklearn-compatible components following the fit/transform contract, each driven by a Pydantic config model: InstanceQualityFilter, TimeSeriesResampler, TimeSeriesScaler, DistanceComputer, HierarchicalClusterer, DivisiveRanker, and MultivariateConsensus. The package also provides the compute_dba_centroid utility.

2. Agglomerative Hierarchical Clustering

For each sensor variable, a hierarchical tree is built using average-linkage over pairwise DTW distances, with all distances normalized to [0, 1] to make thresholds comparable across variables. Sweeping thresholds from 0.1 to 1.0 produces a per-variable cluster membership at each cut.
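The per-variable threshold sweep described above can be sketched roughly as follows. This is an illustration built on SciPy, not the code in this PR: the function name `main_cluster_membership` is hypothetical, the toy matrix stands in for real pairwise DTW distances, and treating the largest cluster at each cut as the "main cluster" is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def main_cluster_membership(dist, thresholds):
    """Boolean (instances x thresholds) matrix marking the largest cluster.

    Hypothetical helper: `dist` is a symmetric distance matrix for ONE
    sensor variable (e.g. pairwise DTW distances computed elsewhere).
    """
    dist = np.asarray(dist, dtype=float)
    max_dist = dist.max()
    # Scale distances to [0, 1] so thresholds are comparable across variables.
    norm = dist / max_dist if max_dist > 0 else dist
    # Average-linkage tree built from the condensed distance vector.
    tree = linkage(squareform(norm, checks=False), method="average")
    membership = np.zeros((dist.shape[0], len(thresholds)), dtype=bool)
    for j, t in enumerate(thresholds):
        labels = fcluster(tree, t=t, criterion="distance")
        largest = np.bincount(labels).argmax()  # assumed "main cluster"
        membership[:, j] = labels == largest
    return membership


# Toy distances: instances 0-2 are close to each other, instance 3 is far away.
toy = np.array([
    [0.0, 1.0, 1.2, 9.0],
    [1.0, 0.0, 0.8, 9.5],
    [1.2, 0.8, 0.0, 10.0],
    [9.0, 9.5, 10.0, 0.0],
])
thresholds = np.round(np.arange(0.1, 1.01, 0.1), 1)  # 0.1, 0.2, ..., 1.0
membership = main_cluster_membership(toy, thresholds)
```

In this toy case the distant instance joins the main cluster only at the final threshold, which is the kind of behavior a cluster size curve would make visible.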

3. Divisive Ranking

As a complementary strategy, instances are eliminated top-down: at each step, the instance with the highest cumulative distance to all remaining instances is removed. This produces a ranked ordering from the most extreme outlier to the dense core, with an interpretable elimination distance for each instance.
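A minimal sketch of this elimination loop, assuming a precomputed symmetric distance matrix, could look like the following. The function name `divisive_ranking` and the toy matrix are hypothetical, and the actual DivisiveRanker may differ in tie-breaking and in how it reports distances.

```python
import numpy as np


def divisive_ranking(dist):
    """Rank instances from most outlying to most central.

    Returns a list of (instance_index, elimination_distance) tuples. The
    elimination distance is the summed distance to the instances that are
    still remaining at the moment of removal.
    """
    dist = np.asarray(dist, dtype=float)
    remaining = list(range(dist.shape[0]))
    ranking = []
    while remaining:
        # Distances restricted to the instances not yet eliminated.
        sub = dist[np.ix_(remaining, remaining)]
        totals = sub.sum(axis=1)
        worst = int(totals.argmax())  # ties resolve to the first index
        ranking.append((remaining[worst], float(totals[worst])))
        del remaining[worst]
    return ranking


# Toy distances: instance 3 is far from the other three.
toy = np.array([
    [0.0, 1.0, 1.2, 9.0],
    [1.0, 0.0, 0.8, 9.5],
    [1.2, 0.8, 0.0, 10.0],
    [9.0, 9.5, 10.0, 0.0],
])
ranking = divisive_ranking(toy)
order = [idx for idx, _ in ranking]
```

Here the far-away instance is removed first, and the last instance left has an elimination distance of zero because nothing remains to compare it against.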

4. Multivariate Consensus (MultivariateConsensus)

Intersects the per-variable cluster memberships across all sensor variables at every threshold. An instance survives only if it belongs to the main cluster for every variable simultaneously, producing a binary selection matrix (instances × thresholds) and a consensus survival curve.
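The intersection step can be illustrated with a small NumPy sketch. The function name `consensus_selection`, the sensor names, and the input layout (one boolean instances × thresholds matrix per variable) are assumptions made for the example, not the MultivariateConsensus API.

```python
import numpy as np


def consensus_selection(memberships):
    """Intersect per-variable main-cluster memberships.

    `memberships` maps a variable name to a boolean matrix of shape
    (n_instances, n_thresholds). Returns the binary selection matrix and
    the survival curve (number of surviving instances per threshold).
    """
    stacked = np.stack([np.asarray(m, dtype=bool) for m in memberships.values()])
    selection = stacked.all(axis=0)   # must be in the main cluster for every variable
    survival = selection.sum(axis=0)  # surviving instances at each threshold
    return selection, survival


# Hypothetical memberships for two variables, three instances, two thresholds.
memberships = {
    "pressure": np.array([[True, True], [False, True], [True, True]]),
    "temperature": np.array([[True, True], [True, True], [False, True]]),
}
selection, survival = consensus_selection(memberships)
```

At the first threshold only instance 0 is in the main cluster for both variables, so it is the only survivor; at the second threshold all three instances survive.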

5. Visualization Suite (data_visualization/clustering_plots.py)

Six new plot classes: DataQualityHeatmap, DendrogramPlot, ClusterSizeCurvePlot, SelectionHeatmapPlot, ClusteringOverlayPlot (with DBA centroid in red), and RankedDistancePlot (distinguishing consensus-selected, vetoed, and local-outlier instances).

6. Optional Dependency Group (pyproject.toml)

pip install ThreeWToolkit[clustering] — adds dtaidistance for DTW computation and joblib for parallel distance matrix computation.

7. Test Coverage (tests/)

195 new tests across 6 files covering all pipeline components, edge cases, and visualizations.

8. Example Notebooks (docs/notebooks/)

Two fully executed notebooks: a step-by-step pipeline walkthrough (clustering_pipeline_examples.ipynb) and a source comparison across Real, Simulated, and Combined data (clustering_real_vs_simulated_examples.ipynb).


Final Notes

This pipeline gives researchers a systematic, reproducible method to select the most structurally consistent instances of each event type before feeding them into downstream models — reducing noise in training data and providing interpretable visual evidence for every selection decision.


By creating this pull request, I confirm that I have read and fully accept and agree with one of the Petrobras' Contributor License Agreements (CLAs):

Our CLAs are based on the Apache Software Foundation's CLAs:

RalphVita added 22 commits March 1, 2026 06:34
…ank instances from most extreme outlier to the dense core based on recursive distance evaluation.
…are consistently clustered across multiple sensor variables.
@ricardoevvargas
Collaborator

Hi, @RalphVita.

It seems that this PR will result in another significant contribution from the Federal University of Espírito Santo to the 3W Project.

We will evaluate it ASAP and will let you know here if we have any questions and/or requests for adjustments.

On behalf of the 3W Community, I thank you for this PR.

@ricardoevvargas
Collaborator

@RalphVita,

Some automatic checks were not successful. Can you check details here and adjust whatever is necessary so that 100% of these checks are successful?

@RalphVita
Author

> @RalphVita,
>
> Some automatic checks were not successful. Can you check details here and adjust whatever is necessary so that 100% of these checks are successful?

Hi @ricardoevvargas, thanks for the feedback!

I fixed the issues in commit 24e41bf. All checks now pass locally (mypy, black, ruff, and pytest, with 507 tests passed).
