Add preprocessing pipeline integration for complex dataset #72

Open
minhnhu123 wants to merge 5 commits into DataBytes-Organisation:feature/correlation-alert/venura/quang from minhnhu123:feature/correlation-alert/venura/quang

@minhnhu123 minhnhu123 commented Apr 17, 2026

Overview

This pull request implements the data preprocessing module and refactors correlation_alert/main.py to integrate with the new preprocessing pipeline.

The work focuses on preparing clean and consistent time-series sensor data and ensuring the correlation alert pipeline uses modular and reusable preprocessing logic.

What was implemented

Preprocessing pipeline in preprocessing.py

Implemented a reusable preprocessing pipeline with the following steps:

  • load_sensor_data(...): loads dataset and validates required columns
  • fix_timestamps(...): converts timestamps to numeric, removes invalid and duplicate timestamps, and sorts the data by time
  • convert_sensor_columns_to_numeric(...): ensures all sensor streams are numeric
  • handle_missing_values(...): handles missing values using interpolation
  • remove_outliers(...): detects and replaces outliers using IQR method
  • align_to_common_index(...): aligns time series to a consistent time index
  • validate_output(...): ensures data is clean, sorted, and ready for analysis
  • run_pipeline(...): wrapper function that connects all preprocessing steps
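
For readers who have not opened preprocessing.py, here is a minimal sketch of how three of the steps above could look with pandas. It is not the code in this PR: the signatures, the `sensor_cols` and `k` parameters, the 1.5 × IQR fence, and the choice to replace outliers by linear interpolation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def fix_timestamps(df: pd.DataFrame, timestamp_col: str) -> pd.DataFrame:
    """Coerce timestamps to numeric, drop invalid and duplicate ones, sort by time."""
    out = df.copy()
    # Non-numeric timestamps become NaN and are dropped on the next line.
    out[timestamp_col] = pd.to_numeric(out[timestamp_col], errors="coerce")
    out = out.dropna(subset=[timestamp_col])
    # Keep the first row seen for each timestamp (assumed policy).
    out = out.drop_duplicates(subset=timestamp_col, keep="first")
    return out.sort_values(timestamp_col).reset_index(drop=True)


def handle_missing_values(df: pd.DataFrame, sensor_cols: list) -> pd.DataFrame:
    """Fill gaps in the sensor columns by linear interpolation."""
    out = df.copy()
    # limit_direction="both" also fills NaNs at the start and end of a column.
    out[sensor_cols] = out[sensor_cols].interpolate(
        method="linear", limit_direction="both"
    )
    return out


def remove_outliers(df: pd.DataFrame, sensor_cols: list, k: float = 1.5) -> pd.DataFrame:
    """Mask values outside [Q1 - k*IQR, Q3 + k*IQR], then interpolate over them."""
    out = df.copy()
    for col in sensor_cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        is_outlier = (out[col] < q1 - k * iqr) | (out[col] > q3 + k * iqr)
        out.loc[is_outlier, col] = np.nan
    # Replacement strategy is an assumption: reuse the interpolation step.
    return handle_missing_values(out, sensor_cols)
```

The sketch assumes the sensor columns are already numeric, which is why numeric conversion comes before interpolation and outlier handling in the step list.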

Testing completed
Function-level testing

Verified preprocessing steps:

  • timestamp handling
  • missing value handling
  • numeric conversion
  • outlier removal

Verified integration with correlation pipeline:

  • cleaned data is correctly passed into the rolling window stage

Dataset-level validation

Tested using:

complex.csv

Confirmed:

  • preprocessing pipeline produces clean dataset
  • correlation pipeline runs successfully on processed data
  • outputs (changes and alerts) are generated correctly
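
The "clean dataset" check can be made repeatable with a small validation helper. The sketch below is illustrative only and is not taken from this PR: it assumes the timestamp has already been set as the dataframe index, and the specific checks and error messages are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def validate_output(df: pd.DataFrame) -> None:
    """Raise ValueError listing every problem found in a preprocessed dataframe."""
    problems = []
    if df.empty:
        problems.append("dataframe is empty")
    if df.isna().any().any():
        problems.append("missing values remain")
    if not df.index.is_monotonic_increasing:
        problems.append("index is not sorted by time")
    if df.index.has_duplicates:
        problems.append("duplicate timestamps remain")
    non_numeric = [c for c in df.columns if not is_numeric_dtype(df[c])]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    if problems:
        raise ValueError("; ".join(problems))
```

Collecting all problems before raising, rather than stopping at the first one, makes a failed run on a large dataset easier to diagnose in a single pass.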

The preprocessing module is complete and successfully integrated with the correlation alert pipeline.

Collaborator

@senuradp senuradp left a comment

Good work on the preprocessing implementation. I reviewed the code locally, and the preprocessing logic is relevant for the correlation alert pipeline. It correctly selects required columns, converts values to numeric, sorts by timestamp, removes duplicate timestamps, sets the timestamp as the index, and handles missing values.

Before merging, please keep this PR focused on preprocessing only. The changes in main.py currently implement the full pipeline, which overlaps with other team members’ assigned components and may create integration conflicts.

For final integration, please ensure this contribution plugs into the wrapper function:
preprocess_timeseries(df, timestamp_col, selected_streams)

Also, any hardcoded dataset paths such as datasets/complex.csv should be limited to testing/demo files and not required for the main reusable pipeline.
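
To make the expected shape concrete, here is a rough sketch of such a wrapper. Only the signature comes from the comment above; the body is an assumption that follows the steps listed in this review (column selection, numeric conversion, sorting, duplicate removal, timestamp index, missing-value handling) and may differ from the code in the PR. The wrapper takes a dataframe rather than a file path, so loading a file such as datasets/complex.csv stays in the calling demo or test script.

```python
import pandas as pd


def preprocess_timeseries(df, timestamp_col, selected_streams):
    """Return a clean, time-indexed dataframe holding only the selected streams."""
    required = [timestamp_col, *selected_streams]
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    # Keep only the columns the correlation stage needs.
    out = df[required].copy()
    # Non-numeric entries become NaN instead of raising.
    out = out.apply(pd.to_numeric, errors="coerce")
    # Rows without a usable timestamp cannot be placed on the time axis.
    out = out.dropna(subset=[timestamp_col])
    # Keep the first row per timestamp (assumed policy), then sort by time.
    out = out.drop_duplicates(subset=timestamp_col, keep="first")
    out = out.sort_values(timestamp_col).set_index(timestamp_col)
    # Fill remaining gaps; linear interpolation is an assumed choice.
    return out.interpolate(method="linear", limit_direction="both")


# Usage with an in-memory frame; the column names here are hypothetical.
raw = pd.DataFrame({
    "time": [3, 1, 2, 2, "bad"],
    "s1": ["1.5", 0.5, None, 9.0, 7.0],
    "s2": [30, 10, 20, 99, 0],
    "extra": list("abcde"),
})
clean = preprocess_timeseries(raw, "time", ["s1", "s2"])
```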

@minhnhu123
Author

Thank you for your feedback. I have updated main.py as requested:

Additional changes made

  • Fixed main.py so that it focuses only on preprocessing functionality, as requested

  • Removed implementation of:

    • rolling window creation
    • correlation computation
    • correlation change detection
    • alert generation logic
  • Implemented a dedicated wrapper function:

    preprocess_timeseries(df, timestamp_col, selected_streams)

@senuradp
Collaborator

Thanks for updating the PR based on the earlier feedback. I reviewed the updated implementation, and this version is much better aligned with the intended modular pipeline structure.

The main.py now correctly focuses on preprocessing only, and the preprocess_timeseries(df, timestamp_col, selected_streams) function matches the wrapper design. The preprocessing steps (column selection, timestamp handling, numeric conversion, missing value handling, and validation) are well implemented and produce a clean dataframe suitable for the next stages of the pipeline.

There is a minor import/path issue related to how preprocessing.py is referenced. I will fix this during integration to ensure consistency with the project structure.

Overall, this is a solid preprocessing implementation and is ready to proceed.
