[Phase 2] Preprocessing Pipeline #3

## Objective

Clean the loaded APOGEE data, handle missing values, and normalize features for the downstream ML pipeline.

## Dependencies

- Phase 1: Data Foundation (requires loaded APOGEE data)

## Tasks

- [ ] Implement `src/data/preprocessor.py` with data cleaning pipeline
- [ ] Implement `src/data/feature_selector.py` for relevant feature selection
- [ ] Handle missing values (imputation or removal strategy)
- [ ] Implement outlier detection without removing rare stellar types
- [ ] Normalize features to comparable scales
- [ ] Save processed data to `data/processed/apogee_clean.parquet`
- [ ] Write unit tests for preprocessing
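
One possible way to approach the outlier task is to flag rows instead of dropping them, so rare stellar types stay in the dataset. The sketch below uses a robust z-score (median and MAD). The function name `flag_outliers`, the `is_outlier` column, and the threshold of 5.0 are illustrative assumptions, not part of the existing code.

```python
import numpy as np
import pandas as pd


def flag_outliers(
    df: pd.DataFrame, columns: list[str], threshold: float = 5.0
) -> pd.DataFrame:
    """Add a boolean `is_outlier` column (hypothetical name); no rows are dropped."""
    out = df.copy()
    flags = pd.Series(False, index=out.index)
    for col in columns:
        median = out[col].median()
        mad = (out[col] - median).abs().median()
        if mad == 0 or np.isnan(mad):
            continue  # constant or all-NaN column: nothing to flag
        # 0.6745 rescales the MAD so the score is comparable to a z-score
        # for normally distributed data.
        robust_z = 0.6745 * (out[col] - median) / mad
        flags |= robust_z.abs() > threshold
    out["is_outlier"] = flags
    return out


# Toy data: one star much hotter than the rest.
sample = pd.DataFrame({"TEFF": [4800.0, 4900.0, 5000.0, 5100.0, 5200.0, 9500.0]})
flagged = flag_outliers(sample, ["TEFF"])
```

Because the flag is only a column, downstream steps can decide whether to exclude flagged rows from fitting a scaler while still keeping them in the saved output.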

## Files to Create

| File | Purpose |
|------|---------|
| `src/data/preprocessor.py` | Cleaning pipeline |
| `src/data/feature_selector.py` | Select relevant features |
| `tests/test_preprocessor.py` | Preprocessing tests |

## Starter Code

```python
# src/data/preprocessor.py
"""Data preprocessing pipeline."""

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

MISSING_VALUE_SENTINEL = -9999.0

def clean_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    """Replace sentinel missing values with NaN."""
    return df.replace(MISSING_VALUE_SENTINEL, np.nan)

def remove_invalid_parameters(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with invalid stellar parameters.

    Rows where TEFF, LOGG, or FE_H is NaN are also dropped, because
    comparisons against NaN evaluate to False.
    """
    valid_mask = (
        (df["TEFF"] > 2500) & (df["TEFF"] < 10000) &
        (df["LOGG"] > -1) & (df["LOGG"] < 6) &
        (df["FE_H"] > -3) & (df["FE_H"] < 1)
    )
    return df[valid_mask].copy()

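def normalize_features(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Standardize specified columns to mean 0 and unit variance."""
    # Work on a copy so the caller's DataFrame keeps its original values.
    df = df.copy()
    scaler = StandardScaler()
    df[columns] = scaler.fit_transform(df[columns])
    return df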
```
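
The starter code turns sentinels into NaN but does not yet fill or remove them. As a minimal sketch of one imputation strategy, the block below fills with the column median and records which values were filled. The name `impute_median` and the `<column>_was_missing` indicator columns are assumptions made for illustration; a removal strategy or a model-based imputer may suit the data better.

```python
import numpy as np
import pandas as pd


def impute_median(
    df: pd.DataFrame, columns: list[str]
) -> tuple[pd.DataFrame, dict[str, float]]:
    """Fill NaN with the column median and return the fill values used."""
    out = df.copy()
    fill_values: dict[str, float] = {}
    for col in columns:
        # Keep a record of which entries were imputed (hypothetical column name).
        out[f"{col}_was_missing"] = out[col].isna()
        fill_values[col] = float(out[col].median())
        out[col] = out[col].fillna(fill_values[col])
    return out, fill_values


# Toy data with one gap per column.
sample = pd.DataFrame(
    {"TEFF": [4800.0, np.nan, 5200.0], "LOGG": [2.5, 3.0, np.nan]}
)
imputed, fill_values = impute_median(sample, ["TEFF", "LOGG"])
```

Returning the fill values makes it straightforward to log them, which helps with documenting imputation choices for reproducibility.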

## Definition of Done

- [ ] No missing values in processed output
- [ ] Features normalized to mean=0, std=1
- [ ] Outliers flagged but not removed (preserve rare types)
- [ ] Parquet file saved for fast loading
- [ ] All tests passing
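
The first two checklist items could be verified automatically. The following is a sketch only, with `check_processed` as a hypothetical helper. Note that `StandardScaler` normalizes with the population standard deviation, so the check uses `ddof=0`; the pandas default of `ddof=1` would report a value slightly different from 1.

```python
import numpy as np
import pandas as pd


def check_processed(
    df: pd.DataFrame, feature_columns: list[str], atol: float = 1e-6
) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems: list[str] = []
    features = df[feature_columns]
    if features.isna().any().any():
        problems.append("missing values present")
    if not np.allclose(features.mean(), 0.0, atol=atol):
        problems.append("feature means are not 0")
    # ddof=0 matches the population standard deviation used by StandardScaler.
    if not np.allclose(features.std(ddof=0), 1.0, atol=atol):
        problems.append("feature standard deviations are not 1")
    return problems


# Toy data, standardized by hand for the example.
raw = pd.DataFrame({"TEFF": [4800.0, 5000.0, 5200.0]})
scaled = (raw - raw.mean()) / raw.std(ddof=0)
```

A helper like this could be called from `tests/test_preprocessor.py` against the processed output.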

## Technical Notes

- Consider correlation analysis for feature selection
- Preserve original values in separate columns before normalization
- Document imputation choices for reproducibility
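
As a rough illustration of the correlation note, one simple approach is to keep a feature only if it is not strongly correlated with any feature already kept. The function `drop_correlated`, the 0.95 threshold, and the `TEFF_COPY` column are assumptions for this sketch; the result depends on column order, so this is a starting point and not a definitive selection method.

```python
import pandas as pd


def drop_correlated(
    df: pd.DataFrame, columns: list[str], threshold: float = 0.95
) -> list[str]:
    """Return the columns to keep, scanning in the given order."""
    corr = df[columns].corr().abs()  # absolute Pearson correlation matrix
    kept: list[str] = []
    for col in columns:
        # Keep the column only if it is not too correlated with any kept one.
        if all(corr.loc[col, other] <= threshold for other in kept):
            kept.append(col)
    return kept


# Toy data: TEFF_COPY (hypothetical) is a shifted duplicate of TEFF.
sample = pd.DataFrame(
    {
        "TEFF": [4800.0, 5000.0, 5200.0, 5400.0],
        "TEFF_COPY": [4801.0, 5001.0, 5201.0, 5401.0],
        "LOGG": [2.5, 1.0, 3.0, 2.0],
    }
)
selected = drop_correlated(sample, ["TEFF", "TEFF_COPY", "LOGG"])
```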

---
Part of #1 (Meta Issue)

