## Objective

Clean the data, handle missing values, and normalize features for the ML pipeline.
## Dependencies

- Phase 1: Data Foundation (requires loaded APOGEE data)
## Tasks

- [ ] Create `src/data/preprocessor.py` with the data cleaning pipeline
- [ ] Create `src/data/feature_selector.py` for relevant feature selection
- [ ] Write cleaned output to `data/processed/apogee_clean.parquet`
## Files to Create

| File | Purpose |
|------|---------|
| `src/data/preprocessor.py` | Cleaning pipeline |
| `src/data/feature_selector.py` | Select relevant features |
| `tests/test_preprocessor.py` | Preprocessing tests |
## Starter Code

```python
# src/data/preprocessor.py
"""Data preprocessing pipeline."""
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

MISSING_VALUE_SENTINEL = -9999.0


def clean_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    """Replace sentinel missing values with NaN."""
    return df.replace(MISSING_VALUE_SENTINEL, np.nan)


def remove_invalid_parameters(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with physically implausible stellar parameters."""
    valid_mask = (
        (df["TEFF"] > 2500) & (df["TEFF"] < 10000)
        & (df["LOGG"] > -1) & (df["LOGG"] < 6)
        & (df["FE_H"] > -3) & (df["FE_H"] < 1)
    )
    return df[valid_mask].copy()


def normalize_features(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Standardize the given columns to zero mean and unit variance."""
    df = df.copy()  # avoid mutating the caller's DataFrame in place
    scaler = StandardScaler()
    df[columns] = scaler.fit_transform(df[columns])
    return df
```
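A quick sanity check of the cleaning steps, inlining the same sentinel replacement and validity mask on a tiny synthetic catalog (the stellar values below are made up for illustration, not real APOGEE rows):

```python
import numpy as np
import pandas as pd

# Tiny synthetic catalog: one valid star, one sentinel-flagged, one out of range.
df = pd.DataFrame({
    "TEFF": [5777.0, -9999.0, 12000.0],
    "LOGG": [4.44, 2.0, 4.0],
    "FE_H": [0.0, -0.5, 0.1],
})

df = df.replace(-9999.0, np.nan)  # clean_missing_values
mask = (
    (df["TEFF"] > 2500) & (df["TEFF"] < 10000)
    & (df["LOGG"] > -1) & (df["LOGG"] < 6)
    & (df["FE_H"] > -3) & (df["FE_H"] < 1)
)  # NaN comparisons evaluate False, so sentinel rows drop out here too
df = df[mask].copy()  # remove_invalid_parameters
print(len(df))  # only the Sun-like row survives
```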
## Definition of Done
## Technical Notes

- Consider correlation analysis for feature selection
- Preserve original values in separate columns before normalization
- Document imputation choices for reproducibility
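The correlation-analysis note above could be sketched as follows for `src/data/feature_selector.py`; `drop_correlated_features`, its threshold, and the toy data are illustrative assumptions, not part of the spec:

```python
import numpy as np
import pandas as pd


def drop_correlated_features(df: pd.DataFrame, threshold: float = 0.95) -> list[str]:
    """Return columns to keep after dropping one of each highly correlated pair."""
    corr = df.corr(numeric_only=True).abs()
    # Mask to the strict upper triangle so each pair is counted once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return [c for c in df.columns if c not in to_drop]


# Example: B is a linear function of A (correlation 1.0), C is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({"A": a, "B": 2 * a + 1.0, "C": rng.normal(size=100)})
print(drop_correlated_features(df))  # ['A', 'C']
```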
Part of #1 (Meta Issue)