Add preprocessing pipeline integration for complex dataset by minhnhu123 · Pull Request #72 · DataBytes-Organisation/Intelligent-IoT-Data-Management

minhnhu123 · 2026-04-17T10:13:08Z

Overview

This pull request implements the data preprocessing module and refactors correlation_alert/main.py to integrate with the preprocessing pipeline.

The work focuses on preparing clean and consistent time-series sensor data and ensuring the correlation alert pipeline uses modular and reusable preprocessing logic.

What was implemented

Preprocessing pipeline in preprocessing.py

Implemented a reusable preprocessing pipeline with the following steps:

load_sensor_data(...): loads dataset and validates required columns
fix_timestamps(...): converts timestamps to numeric, removes invalid and duplicate timestamps, sorts data by time
convert_sensor_columns_to_numeric(...): ensures all sensor streams are numeric
handle_missing_values(...): handles missing values using interpolation
remove_outliers(...): detects and replaces outliers using the IQR method
align_to_common_index(...): aligns time series to a consistent time index
validate_output(...): ensures data is clean, sorted, and ready for analysis
run_pipeline(...): wrapper function that connects all preprocessing steps
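
As an illustration of the outlier step listed above, here is a minimal sketch of IQR-based outlier replacement. It is not the code from preprocessing.py: the function name remove_outliers_iqr, the multiplier k = 1.5, and the choice to replace flagged values by linear interpolation are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def remove_outliers_iqr(df: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.DataFrame:
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] by linear interpolation.

    Hypothetical helper: the name, the default k and the replacement
    strategy are assumptions, not taken from the pull request.
    """
    out = df.copy()
    for col in columns:
        q1 = out[col].quantile(0.25)
        q3 = out[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # Flag values outside the IQR fences and blank them out.
        mask = (out[col] < lower) | (out[col] > upper)
        out.loc[mask, col] = np.nan
        # Fill the blanked values from their neighbours.
        out[col] = out[col].interpolate(method="linear", limit_direction="both")
    return out


# Made-up readings with one obvious spike.
readings = pd.DataFrame({"temp": [20.0, 21.0, 20.5, 95.0, 21.5, 20.0, 21.0]})
cleaned = remove_outliers_iqr(readings, ["temp"])
```

In this toy series the spike of 95.0 falls outside the fences and is replaced by the midpoint of its neighbours. Whether the actual module interpolates, clips, or substitutes a median is not stated in the description.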

Testing completed

Function-level testing

Verified preprocessing steps:

timestamp handling
missing value handling
numeric conversion
outlier removal

Verified integration with correlation pipeline:

cleaned data is correctly passed into the rolling window stage
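
To make the hand-off concrete, the following sketch shows one way a cleaned dataframe could feed a rolling window correlation. The stream names sensor_a and sensor_b, the window size, and the helper rolling_correlation are hypothetical; the real rolling window stage belongs to other components of the correlation alert pipeline and may be implemented differently.

```python
import pandas as pd

# Hypothetical cleaned output: two numeric streams on a sorted numeric time index.
cleaned = pd.DataFrame(
    {
        "sensor_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "sensor_b": [3.0, 5.0, 7.0, 9.0, 11.0, 13.0],
    },
    index=pd.Index([0, 1, 2, 3, 4, 5], name="timestamp"),
)


def rolling_correlation(df: pd.DataFrame, col_a: str, col_b: str, window: int) -> pd.Series:
    """Pearson correlation of two streams over a sliding window of rows."""
    return df[col_a].rolling(window=window).corr(df[col_b])


corr = rolling_correlation(cleaned, "sensor_a", "sensor_b", window=3)
```

The first window - 1 entries are NaN because the window is not yet full. This is why the preprocessing output needs to be free of gaps and sorted by time before it reaches this stage.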
Dataset-level validation

Tested using:

complex.csv

Confirmed:

preprocessing pipeline produces a clean dataset
correlation pipeline runs successfully on processed data
outputs (changes and alerts) are generated correctly

The preprocessing module is complete and successfully integrated with the correlation alert pipeline.

Signed-off-by: minhnhu123 <minhnhu171202@gmail.com>

senuradp

Good work on the preprocessing implementation. I reviewed the code locally, and the preprocessing logic is relevant for the correlation alert pipeline. It correctly selects required columns, converts values to numeric, sorts by timestamp, removes duplicate timestamps, sets the timestamp as the index, and handles missing values.

Before merging, please keep this PR focused on preprocessing only. The changes in main.py currently implement the full pipeline, which overlaps with other team members’ assigned components and may create integration conflicts.

For final integration, please ensure this contribution plugs into the wrapper function:
preprocess_timeseries(df, timestamp_col, selected_streams)
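
A minimal sketch of what such a wrapper could look like, following the steps named in this review (column selection, numeric conversion, sorting, duplicate removal, timestamp index, missing value handling). Only the signature comes from the thread; the body, the error message, and the use of linear interpolation are assumptions for illustration.

```python
import pandas as pd


def preprocess_timeseries(df, timestamp_col, selected_streams):
    """Return a cleaned dataframe indexed by timestamp (illustrative sketch only)."""
    # Keep only the timestamp column and the requested streams.
    cols = [timestamp_col, *selected_streams]
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    out = df[cols].copy()

    # Coerce every kept column to numeric; unparseable entries become NaN.
    for col in cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")

    # Drop rows without a usable timestamp, sort by time, keep the first duplicate.
    out = out.dropna(subset=[timestamp_col])
    out = out.sort_values(timestamp_col, kind="stable")
    out = out.drop_duplicates(subset=timestamp_col, keep="first")
    out = out.set_index(timestamp_col)

    # Assumption: fill remaining gaps by linear interpolation.
    return out.interpolate(method="linear", limit_direction="both")


# Made-up input with unsorted, duplicated and invalid timestamps.
raw = pd.DataFrame(
    {
        "ts": [3, 1, 2, 2, "bad"],
        "s1": ["1.0", "3.0", None, "9.0", "5.0"],
        "s2": [10, 30, 20, 99, 50],
        "extra": ["a", "b", "c", "d", "e"],
    }
)
clean = preprocess_timeseries(raw, "ts", ["s1", "s2"])
```

Because the function takes a dataframe rather than a file path, callers decide where the data comes from, which keeps the wrapper reusable across datasets.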

Also, any hardcoded dataset paths such as datasets/complex.csv should be limited to testing/demo files and not required for the main reusable pipeline.
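
One possible way to keep dataset paths out of the reusable code is to accept them only in a demo entry point. The sketch below uses argparse for that; the flag names --csv, --timestamp-col and --streams, and the default column name timestamp, are assumptions and not part of the project.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Command-line options for a hypothetical demo script."""
    parser = argparse.ArgumentParser(description="Demo runner for the preprocessing wrapper")
    parser.add_argument("--csv", required=True, help="Path to the input CSV file")
    parser.add_argument("--timestamp-col", default="timestamp", help="Name of the timestamp column")
    parser.add_argument("--streams", nargs="+", required=True, help="Sensor stream columns to keep")
    return parser


# Parsing an explicit argument list here stands in for a real command line.
demo_args = build_arg_parser().parse_args(
    ["--csv", "datasets/complex.csv", "--streams", "s1", "s2"]
)
```

A demo script would then read demo_args.csv with pandas and pass the resulting dataframe to the wrapper, so the path datasets/complex.csv appears only on the command line or in test files.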

minhnhu123 · 2026-04-27T14:03:56Z

Thank you for your feedback. I have updated main.py as requested:

Additional changes made

Fixed main.py so that it focuses only on preprocessing functionality, as requested
Removed the implementation of:
- rolling window creation
- correlation computation
- correlation change detection
- alert generation logic

Implemented a dedicated wrapper function:

preprocess_timeseries(df, timestamp_col, selected_streams)

senuradp · 2026-05-13T11:38:58Z

Thanks for updating the PR based on the earlier feedback. I reviewed the updated implementation, and this version is much better aligned with the intended modular pipeline structure.

The main.py now correctly focuses on preprocessing only, and the preprocess_timeseries(df, timestamp_col, selected_streams) function matches the wrapper design. The preprocessing steps (column selection, timestamp handling, numeric conversion, missing value handling, and validation) are well implemented and produce a clean dataframe suitable for the next stages of the pipeline.

There is a minor import/path issue related to how preprocessing.py is referenced. I will fix this during integration to ensure consistency with the project structure.

Overall, this is a solid preprocessing implementation and is ready to proceed.

minhnhu123 added 2 commits April 17, 2026 16:58

Add preprocessing pipeline intergration for complex dataset

d2d81de

Add files via upload

8719d80

senuradp requested changes Apr 25, 2026

minhnhu123 added 3 commits April 27, 2026 20:41

Fixing main.py to focus on correlation pipeline and integrate preprocessing via run_pipeline

1730eff

Fixing comment

486618e

Keep main.py only for preprocessing

9b6e719

minhnhu123 wants to merge 5 commits into DataBytes-Organisation:feature/correlation-alert/venura/quang from minhnhu123:feature/correlation-alert/venura/quang
