Project Improvements - Multivariate time series clustering integration by RalphVita · Pull Request #182 · petrobras/3W

RalphVita · 2026-03-06T20:30:29Z

Overview

This PR introduces a complete multivariate time series clustering pipeline to the ThreeWToolkit, enabling automatic selection of structurally representative instances from well event data. Two complementary clustering strategies are implemented — agglomerative hierarchical and divisive ranking — combined through a multivariate consensus mechanism that identifies instances consistently similar across all sensor variables simultaneously.

Summary of Changes

1. Clustering Pipeline (ThreeWToolkit/clustering/)

Seven sklearn-compatible components following the fit/transform contract, each driven by a Pydantic config model: InstanceQualityFilter, TimeSeriesResampler, TimeSeriesScaler, DistanceComputer, HierarchicalClusterer, DivisiveRanker, and MultivariateConsensus. The module also provides the compute_dba_centroid utility.
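For readers unfamiliar with the fit/transform contract, here is a minimal sketch of a config-driven, per-instance z-normalizer. It is not the toolkit's code: the names ScalerConfig and ZNormScaler are hypothetical, and a standard-library dataclass stands in for the Pydantic config model so the snippet has no third-party dependencies.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass(frozen=True)
class ScalerConfig:
    """Hypothetical config; a dataclass stands in for a Pydantic model here."""

    eps: float = 1e-8  # guards against division by zero on flat series


class ZNormScaler:
    """Illustrative per-instance z-normalizer with fit/transform methods."""

    def __init__(self, config: ScalerConfig):
        self.config = config

    def fit(self, X, y=None):
        # Per-instance scaling learns nothing from the data, so fit is a no-op
        return self

    def transform(self, X):
        out = []
        for series in X:
            mu = fmean(series)
            sigma = pstdev(series)
            out.append([(v - mu) / (sigma + self.config.eps) for v in series])
        return out

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# One ramp and one flat (frozen-looking) series
scaled = ZNormScaler(ScalerConfig()).fit_transform(
    [[1.0, 2.0, 3.0], [5.0, 5.0, 5.0]]
)
```

Keeping the parameters in a separate config object means a component can be rebuilt from the same validated settings, which is presumably why the PR pairs each component with a config model.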

2. Agglomerative Hierarchical Clustering

For each sensor variable, a hierarchical tree is built using average-linkage over pairwise DTW distances, with all distances normalized to [0, 1] to make thresholds comparable across variables. Sweeping thresholds from 0.1 to 1.0 produces a per-variable cluster membership at each cut.
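The per-variable procedure can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the PR's implementation: the DTW routine is a textbook dynamic-programming version rather than dtaidistance, distances are normalized by dividing by the largest pairwise distance, and the "main cluster" is taken to be the largest cluster at each cut. Because average-linkage merge heights never decrease, stopping the merges once the closest pair exceeds a threshold gives the same partition as cutting the tree at that threshold.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW with squared point costs, returned as a square root."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def normalized_distance_matrix(series):
    """Pairwise DTW distances divided by the largest one (assumed scaling)."""
    n = len(series)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = dtw_distance(series[i], series[j])
    largest = max(max(row) for row in dist) or 1.0
    return [[v / largest for v in row] for row in dist]


def average_linkage_cut(dist, threshold):
    """Merge the closest clusters until their average distance exceeds threshold."""
    clusters = [[i] for i in range(len(dist))]

    def avg(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, p, q = min(
            (avg(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if best > threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Toy data for one variable: three similar bumps and one flat outlier
series = [
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [0.0, 1.0, 2.2, 1.0, 0.0],
    [0.0, 1.1, 2.0, 1.0, 0.1],
    [5.0, 5.0, 5.0, 5.0, 5.0],
]
dist = normalized_distance_matrix(series)
thresholds = [round(0.1 * k, 1) for k in range(1, 11)]  # 0.1, 0.2, ..., 1.0
main_cluster = {
    t: sorted(max(average_linkage_cut(dist, t), key=len)) for t in thresholds
}
```

In this toy case the three bumps form the main cluster at low thresholds, and the flat series only joins at the final cut of 1.0.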

3. Divisive Ranking

As a complementary strategy, instances are eliminated top-down: at each step, the instance with the highest cumulative distance to all remaining instances is removed. This produces a ranking from the most extreme outlier to the tightest core, with an interpretable elimination distance for each instance.
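A minimal sketch of this elimination loop, assuming a precomputed symmetric distance matrix. The function name divisive_ranking and the toy distances are hypothetical, and the elimination distance is taken here to be the cumulative distance at the moment of removal, which may differ from what DivisiveRanker reports.

```python
def divisive_ranking(dist):
    """Return (instance, elimination_distance) pairs, most extreme outlier first."""
    remaining = list(range(len(dist)))
    order = []
    while len(remaining) > 1:
        # Cumulative distance from each instance to all others still present
        totals = {
            i: sum(dist[i][j] for j in remaining if j != i) for i in remaining
        }
        worst = max(remaining, key=totals.get)
        order.append((worst, totals[worst]))
        remaining.remove(worst)
    # The last survivor has no one left to be compared against
    order.append((remaining[0], 0.0))
    return order


# Hypothetical normalized distances: instance 3 is far from everything else
dist = [
    [0.00, 0.10, 0.20, 0.90],
    [0.10, 0.00, 0.15, 1.00],
    [0.20, 0.15, 0.00, 0.95],
    [0.90, 1.00, 0.95, 0.00],
]
ranking = divisive_ranking(dist)
ranked_ids = [idx for idx, _ in ranking]
```

With a symmetric matrix the last two survivors always tie on cumulative distance, so their relative order depends on the tie-breaking rule (here, the lower index is removed first).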

4. Multivariate Consensus (MultivariateConsensus)

Intersects the per-variable cluster memberships across all sensor variables at every threshold. An instance survives only if it belongs to the main cluster for every variable simultaneously, producing a binary selection matrix (instances × thresholds) and a consensus survival curve.
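The intersection step can be sketched as below. The input layout (a dict mapping each variable to its main-cluster members per threshold), the variable names var_a and var_b, and the function name consensus_selection are all assumptions made for illustration; the actual MultivariateConsensus interface may differ.

```python
def consensus_selection(memberships, n_instances, thresholds):
    """Build the binary selection matrix and the consensus survival curve.

    memberships[variable][threshold] is the set of instance indices that fall
    in that variable's main cluster at that threshold.
    """
    matrix = [
        [
            int(all(i in memberships[var][t] for var in memberships))
            for t in thresholds
        ]
        for i in range(n_instances)
    ]
    # Number of instances that survive for every variable at each threshold
    survival = [sum(row[k] for row in matrix) for k in range(len(thresholds))]
    return matrix, survival


# Hypothetical main-cluster memberships for two sensor variables
thresholds = [0.3, 0.6, 1.0]
memberships = {
    "var_a": {0.3: {0, 1}, 0.6: {0, 1, 2}, 1.0: {0, 1, 2, 3}},
    "var_b": {0.3: {1, 2}, 0.6: {0, 1, 3}, 1.0: {0, 1, 2, 3}},
}
matrix, survival = consensus_selection(memberships, 4, thresholds)
```

In this example instance 2 is vetoed at threshold 0.6 because var_b excludes it, even though var_a accepts it; only instance 1 survives at every threshold.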

5. Visualization Suite (data_visualization/clustering_plots.py)

Six new plot classes: DataQualityHeatmap, DendrogramPlot, ClusterSizeCurvePlot, SelectionHeatmapPlot, ClusteringOverlayPlot (with DBA centroid in red), and RankedDistancePlot (distinguishing consensus-selected, vetoed, and local-outlier instances).

6. Optional Dependency Group (pyproject.toml)

pip install ThreeWToolkit[clustering] — adds dtaidistance for DTW computation and joblib for parallel distance matrix computation.

7. Test Coverage (tests/)

195 new tests across 6 files covering all pipeline components, edge cases, and visualizations.

8. Example Notebooks (docs/notebooks/)

Two fully executed notebooks: a step-by-step pipeline walkthrough (clustering_pipeline_examples.ipynb) and a source comparison across Real, Simulated, and Combined data (clustering_real_vs_simulated_examples.ipynb).

Final Notes

This pipeline gives researchers a systematic, reproducible method to select the most structurally consistent instances of each event type before feeding them into downstream models — reducing noise in training data and providing interpretable visual evidence for every selection decision.

By creating this pull request, I confirm that I have read and fully accept and agree with one of the Petrobras' Contributor License Agreements (CLAs):

ICLA: Individual Contributor License Agreement on behalf of only yourself;
CCLA: Corporate Contributor License Agreement on behalf of your employer.

Our CLAs are based on the Apache Software Foundation's CLAs:

ICLA: Individual Contributor License Agreement
CCLA: Corporate Contributor License Agreement

ricardoevvargas · 2026-03-09T18:31:21Z

Hi, @RalphVita.

It seems that this PR will result in another significant contribution from the Federal University of Espírito Santo to the 3W Project.

We will evaluate it ASAP and will let you know here if we have any questions and/or requests for adjustments.

On behalf of the 3W Community, I thank you for this PR.

ricardoevvargas · 2026-03-09T18:46:09Z

@RalphVita,

Some automatic checks were not successful. Can you check details here and adjust whatever is necessary so that 100% of these checks are successful?

RalphVita · 2026-03-09T20:56:22Z

Hi @ricardoevvargas, thanks for the feedback!

I fixed the issues in commit 24e41bf. The tests are now passing locally (mypy, black, ruff, and pytest — 507 passed).

RalphVita added 22 commits March 1, 2026 06:34

Add DistanceMetricEnum and LinkageMethodEnum, new enums to support time series clustering capabilities

e0a1946

add configs for time series clustering

ca89c3a

add load_instances_by_variable to parquet dataset

50c9a4d

add clustering module initialization that exports the time series clustering estimators and transformers.

2849da3

Implements downsampling of time series arrays to speed up distance computations like DTW.

6cf6d2b

Implements per-instance Z-normalization.

16f62c8

Implements pairwise distance matrix computation supporting DTW.

581d8bd

Add clustering optional dependency group to pyproject.toml

c0e5160

add HierarchicalClusterer estimator that implements agglomerative hierarchical clustering.

6e0c7b2

Add time series clustering tests

af81fae

add DivisiveRanker estimator that implements a top-down approach to rank instances from most extreme outlier to the dense core based on recursive distance evaluation.

32a48ee

Implements cross-variable consensus selection to find instances that are consistently clustered across multiple sensor variables.

d023c22

Detection and repair of frozen sensors and NaN-corrupted instances before clustering.

39ba15b

Add clustering visualization classes.

6f99dff

Create python notebooks examples

f125c36

Plot original instances indices

a01394e

Improve visualization of the Selection Heatmap Plot

c2434e2

Plot DTW centroid

d71bf34

Improve the ranked distance plot

f58cc46

Adding an example of using Clustering by comparing the analysis of real and simulated data

711adac

Fix tests

cc6d795

black formatting

ae7537a

Fix mypy and ruff errors

24e41bf

RalphVita wants to merge 23 commits into petrobras:other_improvements from RalphVita:feat/clustering-integration
