[Phase 2] Preprocessing Pipeline #3

@Sakeeb91

Objective

Clean the loaded APOGEE data, handle missing values, and normalize features for the ML pipeline.

Dependencies

  • Phase 1: Data Foundation (requires loaded APOGEE data)

Tasks

  • Implement src/data/preprocessor.py with data cleaning pipeline
  • Implement src/data/feature_selector.py for relevant feature selection
  • Handle missing values (imputation or removal strategy)
  • Implement outlier detection without removing rare stellar types
  • Normalize features to comparable scales
  • Save processed data to data/processed/apogee_clean.parquet
  • Write unit tests for preprocessing
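The imputation and outlier tasks above could be sketched roughly as follows. This is only one possible approach, not a decided strategy: median imputation and a MAD-based robust z-score are assumptions, as are the helper names `impute_with_median` and `flag_outliers`, the `_imputed` suffix, the `is_outlier` column, and the threshold of 5. Rows are only flagged, never dropped, so rare stellar types stay in the data.

```python
import numpy as np
import pandas as pd


def impute_with_median(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Fill NaN with the column median and record which cells were filled."""
    out = df.copy()
    for col in columns:
        # Hypothetical indicator column so imputed cells stay traceable.
        out[f"{col}_imputed"] = out[col].isna()
        out[col] = out[col].fillna(out[col].median())
    return out


def flag_outliers(
    df: pd.DataFrame, columns: list[str], threshold: float = 5.0
) -> pd.DataFrame:
    """Add a boolean is_outlier column; no rows are removed."""
    out = df.copy()
    flags = pd.Series(False, index=out.index)
    for col in columns:
        median = out[col].median()
        mad = (out[col] - median).abs().median()
        if mad == 0 or np.isnan(mad):
            continue  # column is constant or empty; nothing to flag
        # 0.6745 rescales the MAD so the score is comparable to a z-score
        # for normally distributed data.
        robust_z = 0.6745 * (out[col] - median) / mad
        flags |= robust_z.abs() > threshold
    out["is_outlier"] = flags
    return out
```

Using the median and MAD instead of the mean and standard deviation keeps a handful of extreme stars from inflating the scale they are judged against. Whatever strategy is finally chosen should be documented, as the technical notes request.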

Files to Create

File                            Purpose
src/data/preprocessor.py        Cleaning pipeline
src/data/feature_selector.py    Select relevant features
tests/test_preprocessor.py      Preprocessing tests

Starter Code

# src/data/preprocessor.py
"""Data preprocessing pipeline."""

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

MISSING_VALUE_SENTINEL = -9999.0

def clean_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    """Replace sentinel missing values with NaN."""
    return df.replace(MISSING_VALUE_SENTINEL, np.nan)

def remove_invalid_parameters(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with invalid stellar parameters."""
    # Comparisons against NaN evaluate to False, so rows missing TEFF, LOGG,
    # or FE_H are dropped here as well.
    valid_mask = (
        (df["TEFF"] > 2500) & (df["TEFF"] < 10000) &
        (df["LOGG"] > -1) & (df["LOGG"] < 6) &
        (df["FE_H"] > -3) & (df["FE_H"] < 1)
    )
    return df[valid_mask].copy()

def normalize_features(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of df with the specified columns standardized."""
    df = df.copy()  # avoid mutating the caller's DataFrame
    scaler = StandardScaler()
    df[columns] = scaler.fit_transform(df[columns])
    return df

Definition of Done

  • No missing values in processed output
  • Features normalized to mean=0, std=1
  • Outliers flagged but not removed (preserve rare types)
  • Parquet file saved for fast loading
  • All tests passing
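The first two criteria above lend themselves to an automated check. The sketch below is an illustration only; `check_processed` and its tolerance are hypothetical and not part of the starter code. It uses `ddof=0` because `StandardScaler` scales by the population standard deviation, whereas pandas `.std()` defaults to the sample standard deviation (`ddof=1`) and would report a value slightly above 1.

```python
import numpy as np
import pandas as pd


def check_processed(
    df: pd.DataFrame, feature_columns: list[str], atol: float = 1e-6
) -> list[str]:
    """Return a list of problems found; an empty list means the checks pass."""
    problems = []

    # Criterion: no missing values in the processed output.
    n_missing = int(df[feature_columns].isna().sum().sum())
    if n_missing:
        problems.append(f"{n_missing} missing values remain")

    # Criterion: each feature has mean 0 and (population) std 1.
    means = df[feature_columns].mean()
    stds = df[feature_columns].std(ddof=0)
    for col in feature_columns:
        if not np.isclose(means[col], 0.0, atol=atol):
            problems.append(f"{col}: mean is {means[col]:.4g}, expected 0")
        if not np.isclose(stds[col], 1.0, atol=atol):
            problems.append(f"{col}: std is {stds[col]:.4g}, expected 1")
    return problems
```

A check like this could run inside tests/test_preprocessor.py against the saved output, failing the test when the returned list is non-empty.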

Technical Notes

  • Consider correlation analysis for feature selection
  • Preserve original values in separate columns before normalization
  • Document imputation choices for reproducibility
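Two of the notes above can be illustrated with a short sketch. Both helpers are hypothetical: `drop_correlated_features` shows one simple way correlation analysis might feed feature selection (greedily keeping a column only if it is not strongly correlated with one already kept), and `add_raw_copies` keeps the unscaled values alongside the columns that will be normalized. The 0.95 threshold and the `_RAW` suffix are assumptions, not project decisions.

```python
import pandas as pd


def drop_correlated_features(
    df: pd.DataFrame, columns: list[str], threshold: float = 0.95
) -> list[str]:
    """Return the subset of columns left after dropping near-duplicates."""
    # Undefined correlations (e.g. constant columns) are treated as 0.
    corr = df[columns].corr().abs().fillna(0.0)
    kept: list[str] = []
    for col in columns:
        # Keep col only if it is not highly correlated with any kept column.
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return kept


def add_raw_copies(
    df: pd.DataFrame, columns: list[str], suffix: str = "_RAW"
) -> pd.DataFrame:
    """Copy each column to <name><suffix> so original values survive scaling."""
    out = df.copy()
    for col in columns:
        out[f"{col}{suffix}"] = out[col]
    return out
```

The greedy pass depends on column order, so listing the physically most meaningful features first makes them the ones that are kept. Calling `add_raw_copies` before normalization leaves, for example, both a scaled `TEFF` and an unscaled `TEFF_RAW` in the output.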

Part of #1 (Meta Issue)
