Role: ML Engineer / Astrophysics Domain Specialist
Rationale: This project requires deep understanding of spectroscopic data processing, multi-survey data fusion, and regression/classification techniques for stellar parameters. The role combines ML expertise with domain knowledge of stellar atmospheres, spectral features, and survey-specific systematics.
```
             STELLAR CLASSIFICATION PIPELINE

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    APOGEE    │     │    GALAH     │     │    LAMOST    │
│     DR17     │     │     DR3      │     │     DR7      │
│  (allStar)   │     │  (Summary)   │     │  (Catalog)   │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       └────────────────────┼────────────────────┘
                            ▼
          ┌────────────────────────────────────┐
          │        DATA INGESTION LAYER        │
          │  • Survey-specific parsers         │
          │  • Quality flag filtering          │
          │  • Cross-match by coordinates      │
          └────────────────┬───────────────────┘
                           ▼
          ┌────────────────────────────────────┐
          │        PREPROCESSING LAYER         │
          │  • Missing value handling          │
          │  • Outlier detection               │
          │  • Feature normalization           │
          │  • Survey bias correction          │
          └────────────────┬───────────────────┘
                           ▼
          ┌────────────────────────────────────┐
          │        FEATURE ENGINEERING         │
          │  • Spectral indices                │
          │  • Color-magnitude features        │
          │  • Abundance ratios                │
          └────────────────┬───────────────────┘
                           │
             ┌─────────────┴─────────────┐
             ▼                           ▼
    ┌──────────────────┐        ┌──────────────────┐
    │  CLASSIFICATION  │        │    REGRESSION    │
    │      MODULE      │        │      MODULE      │
    │                  │        │                  │
    │ • Stellar type   │        │ • Teff           │
    │ • Luminosity     │        │ • log g          │
    │   class          │        │ • [Fe/H]         │
    │ • Evolutionary   │        │ • [alpha/Fe]     │
    │   stage          │        │                  │
    └────────┬─────────┘        └────────┬─────────┘
             │                           │
             └─────────────┬─────────────┘
                           ▼
          ┌────────────────────────────────────┐
          │        EVALUATION & OUTPUT         │
          │  • Cross-survey validation         │
          │  • Uncertainty quantification      │
          │  • Results export                  │
          └────────────────────────────────────┘
```
| Component | Choice | Rationale | Tradeoff | Fallback |
|---|---|---|---|---|
| Language | Python 3.10+ | Industry standard for ML/astro | Slower than C++ | None needed |
| Data Loading | astropy, pandas | Native FITS support, tabular data | Memory for large files | Dask for >RAM |
| ML Framework | scikit-learn | Simple API, good baselines | Limited deep learning | XGBoost upgrade |
| Regression | Random Forest, XGBoost | Handles non-linear, missing data | Requires tuning | Ridge regression |
| Classification | Random Forest | Interpretable, robust | May underfit complex boundaries | Gradient boosting |
| Visualization | matplotlib, seaborn | Standard, extensive docs | Basic interactivity | Plotly |
| Testing | pytest | Simple, widely used | None significant | unittest |
astropy>=5.0 # FITS file handling, coordinates
pandas>=2.0 # Data manipulation
numpy>=1.24 # Numerical operations
scikit-learn>=1.3 # ML algorithms
xgboost>=2.0 # Gradient boosting
matplotlib>=3.7 # Visualization
seaborn>=0.12 # Statistical plots
pytest>=7.0 # Testing
Scope: Download APOGEE allStar file, create data loading infrastructure
Deliverables:
- `src/data/apogee_loader.py` - APOGEE data parser
- `src/data/quality_filters.py` - Quality flag filtering
- `data/raw/` - Downloaded allStar FITS file
- `notebooks/01_data_exploration.ipynb` - Initial EDA
Files to Create:
| File | Purpose | Lines (est) |
|---|---|---|
| `src/__init__.py` | Package init | 1 |
| `src/data/__init__.py` | Subpackage init | 1 |
| `src/data/apogee_loader.py` | Load APOGEE data | 80-100 |
| `src/data/quality_filters.py` | Filter bad data | 50-70 |
| `configs/data_config.yaml` | Data paths, parameters | 30 |
Technical Challenges:
- FITS file can be >1GB; need chunked loading strategy
- Quality flags are bitmasks requiring careful interpretation
- Missing values encoded as -9999 or NaN inconsistently
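The bitmask and sentinel issues can be handled together. The sketch below is illustrative: `STAR_BAD_BIT` is a hypothetical bit position (the real definitions are in the APOGEE DR17 bitmask documentation) and `apply_basic_cuts` is a hypothetical helper, not the project's final filter.

```python
import numpy as np
import pandas as pd

# Hypothetical bit position; real ASPCAPFLAG bits are defined in the
# APOGEE DR17 bitmask documentation.
STAR_BAD_BIT = 23

def apply_basic_cuts(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with a set 'bad' bit and normalize missing-value sentinels."""
    df = df.copy()
    # Bitwise AND isolates one flag bit out of the packed integer.
    bad = (df["ASPCAPFLAG"].astype(np.int64) & (1 << STAR_BAD_BIT)) != 0
    df = df[~bad]
    # Replace the -9999 sentinel with NaN so downstream code sees one
    # consistent missing-value encoding.
    return df.replace(-9999.0, np.nan)

demo = pd.DataFrame({
    "TEFF": [4500.0, -9999.0, 5000.0],
    "ASPCAPFLAG": [0, 0, 1 << STAR_BAD_BIT],
})
clean = apply_basic_cuts(demo)
print(len(clean))                  # 2 rows survive the flag cut
print(clean["TEFF"].isna().sum())  # 1 sentinel became NaN
```

Doing flag filtering and sentinel replacement in one step means every downstream consumer sees a single missing-value convention (NaN).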
Verification:

```python
# Test: Load data successfully
def test_apogee_loader():
    df = load_apogee_allstar("data/raw/allStar-dr17-synspec_rev1.fits")
    assert len(df) > 600000
    assert "TEFF" in df.columns
    assert "LOGG" in df.columns
```

Definition of Done:
- APOGEE allStar file downloaded
- Loader returns DataFrame with correct columns
- Quality filters remove flagged stars
- EDA notebook shows data distributions
Scope: Clean data, handle missing values, normalize features
Deliverables:
- `src/data/preprocessor.py` - Data cleaning pipeline
- `src/data/feature_selector.py` - Feature selection utilities
- `data/processed/apogee_clean.parquet` - Cleaned dataset
Files to Create:
| File | Purpose | Lines (est) |
|---|---|---|
| `src/data/preprocessor.py` | Cleaning pipeline | 120-150 |
| `src/data/feature_selector.py` | Select relevant features | 60-80 |
| `tests/test_preprocessor.py` | Preprocessing tests | 80 |
Technical Challenges:
- Deciding imputation strategy for missing abundances
- Outlier detection without removing rare stellar types
- Feature correlation may cause multicollinearity
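One defensible combination is median imputation plus a flag-don't-drop outlier policy. The `preprocess` helper and the z > 5 threshold below are illustrative assumptions, not the adopted design:

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, feature_cols: list[str]) -> pd.DataFrame:
    """Median-impute missing values and flag, but keep, outliers."""
    df = df.copy()
    for col in feature_cols:
        # Median imputation is robust to skewed abundance distributions;
        # a per-stellar-type median would be a refinement.
        df[col] = df[col].fillna(df[col].median())
    # A generous z-score cut flags extreme rows without deleting them,
    # so rare stellar types survive for later inspection.
    z = (df[feature_cols] - df[feature_cols].mean()) / df[feature_cols].std()
    df["is_outlier"] = (z.abs() > 5).any(axis=1)
    return df

demo = pd.DataFrame({"FE_H": [0.0, np.nan, -0.5, 0.1]})
out = preprocess(demo, ["FE_H"])
print(out["FE_H"].isna().sum())  # 0 missing values remain
```

Flagging instead of dropping directly supports the "outliers flagged but preserved" item in the Definition of Done.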
Verification:

```python
def test_no_missing_values():
    df = preprocess_apogee(raw_df)
    assert df.isnull().sum().sum() == 0


def test_feature_ranges():
    df = preprocess_apogee(raw_df)
    assert df["TEFF"].between(3000, 8000).all()
```

Definition of Done:
- No missing values in output
- Features normalized to comparable scales
- Outliers flagged but preserved
- Parquet file saved for fast loading
Scope: Train stellar type classifier using spectroscopic features
Deliverables:
- `src/models/classifier.py` - Classification model class
- `src/evaluation/metrics.py` - Evaluation utilities
- `models/stellar_classifier_v1.joblib` - Trained model
Files to Create:
| File | Purpose | Lines (est) |
|---|---|---|
| `src/models/__init__.py` | Subpackage init | 1 |
| `src/models/classifier.py` | Classifier implementation | 100-130 |
| `src/evaluation/__init__.py` | Subpackage init | 1 |
| `src/evaluation/metrics.py` | Custom metrics | 60-80 |
| `tests/test_classifier.py` | Model tests | 70 |
Technical Challenges:
- Class imbalance (many more dwarfs than giants)
- Defining meaningful stellar type labels from continuous parameters
- Cross-validation strategy for astronomical data (spatial correlation)
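Since stellar types are not given directly, labels must be derived from the continuous parameters. A minimal sketch of a coarse log g-based luminosity split, with illustrative round-number boundaries rather than the project's adopted definition:

```python
import numpy as np
import pandas as pd

def assign_luminosity_class(logg: pd.Series) -> pd.Series:
    """Coarse dwarf/subgiant/giant labels from log g.

    Boundaries are illustrative round numbers, not a calibrated cut.
    """
    return pd.cut(
        logg,
        bins=[-np.inf, 2.0, 3.5, np.inf],
        labels=["giant", "subgiant", "dwarf"],
    )

labels = assign_luminosity_class(pd.Series([4.4, 2.8, 1.2]))
print(list(labels))  # ['dwarf', 'subgiant', 'giant']
```

The class imbalance could then be addressed with `class_weight="balanced"` in scikit-learn classifiers, or with stratified sampling as the risk table suggests.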
Verification:

```python
def test_classifier_accuracy():
    clf = StellarClassifier()
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)
    assert accuracy > 0.85  # Baseline threshold
```

Definition of Done:
- Classifier achieves >85% accuracy
- Confusion matrix shows reasonable class separation
- Model saved and loadable
- Feature importance extracted
Scope: Predict Teff, log g, [Fe/H], [alpha/Fe] from features
Deliverables:
- `src/models/regressor.py` - Multi-output regression
- `src/evaluation/regression_metrics.py` - MAE, scatter metrics
- `models/parameter_regressor_v1.joblib` - Trained models
Files to Create:
| File | Purpose | Lines (est) |
|---|---|---|
| `src/models/regressor.py` | Regression models | 130-160 |
| `src/evaluation/regression_metrics.py` | Regression metrics | 50-70 |
| `notebooks/03_regression_analysis.ipynb` | Analysis notebook | N/A |
Technical Challenges:
- Different parameters have different scales and error characteristics
- Correlation between parameters (Teff-log g relation)
- Heteroscedastic errors (uncertainty varies with stellar type)
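One way to sidestep the scale mismatch is to fit one tree-based model per target, since tree splits are invariant to each target's units. A sketch on synthetic stand-in data (the real features come from the feature engineering layer):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic stand-in data; real features come from the preprocessing layer.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Targets on very different scales: Teff (~thousands of K) and [Fe/H] (~dex).
y = np.column_stack([
    4800 + 300 * X[:, 0] + rng.normal(0, 10, 200),
    0.1 * X[:, 1] + rng.normal(0, 0.01, 200),
])

# One independent forest per target; no target scaling needed.
model = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=50, random_state=0)
)
model.fit(X, y)
preds = model.predict(X)
print(preds.shape)  # (200, 2)
```

Per-target models also make it easier to report per-parameter uncertainties later, at the cost of ignoring correlations such as the Teff-log g relation.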
Verification:

```python
def test_teff_prediction():
    reg = ParameterRegressor(target="TEFF")
    reg.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, reg.predict(X_test))
    assert mae < 100  # Within 100 K for Teff
```

Definition of Done:
- Teff MAE < 100 K
- log g MAE < 0.2 dex
- [Fe/H] MAE < 0.1 dex
- Residual analysis shows no systematic bias
Scope: Validate on GALAH overlap, assess systematic differences
Deliverables:
- `src/data/galah_loader.py` - GALAH data parser
- `src/data/crossmatch.py` - Coordinate cross-matching
- `notebooks/04_cross_validation.ipynb` - Cross-survey analysis
Files to Create:
| File | Purpose | Lines (est) |
|---|---|---|
| `src/data/galah_loader.py` | Load GALAH data | 70-90 |
| `src/data/crossmatch.py` | Cross-match utilities | 80-100 |
| `src/evaluation/cross_survey.py` | Cross-survey metrics | 60-80 |
Technical Challenges:
- Different wavelength coverage (IR vs optical)
- Systematic offsets between survey pipelines
- Limited overlap sample size
Verification:

```python
def test_cross_survey_consistency():
    # Compare APOGEE predictions to GALAH labels
    overlap = crossmatch_surveys(apogee_df, galah_df)
    teff_diff = overlap["TEFF_apogee"] - overlap["TEFF_galah"]
    assert abs(teff_diff.median()) < 50  # Systematic offset < 50 K
```

Definition of Done:
- Cross-match identifies >10,000 common stars
- Systematic offsets quantified
- Calibration applied if needed
- Results documented
Scope: End-to-end pipeline, CLI interface, comprehensive docs
Deliverables:
- `src/pipeline.py` - Full inference pipeline
- `scripts/run_pipeline.py` - CLI entry point
- `docs/` - API documentation
Files to Create:
| File | Purpose | Lines (est) |
|---|---|---|
| `src/pipeline.py` | End-to-end pipeline | 100-130 |
| `scripts/run_pipeline.py` | CLI interface | 50-70 |
| `requirements.txt` | Dependencies | 15 |
Definition of Done:
- Single command runs full pipeline
- Results reproducible with seed
- Documentation complete
- All tests passing
| Risk | Likelihood | Impact | Early Warning | Mitigation |
|---|---|---|---|---|
| FITS file download fails | Medium | 🔴 High | Network timeout | Use SDSS CasJobs for subset |
| Memory overflow | High | 🟡 Medium | Swap usage | Use chunked loading, Dask |
| Poor model performance | Medium | 🔴 High | Val loss plateau | Simpler model, more features |
| Survey systematics | High | 🟡 Medium | Cross-val scatter | Document as limitation |
| Missing rare types | Medium | 🟡 Medium | Class imbalance | Stratified sampling |
Test 1: Data Loading

```python
# tests/test_data_loading.py
import pandas as pd

def test_apogee_loader_returns_dataframe():
    """Verify APOGEE loader returns a valid DataFrame."""
    from src.data.apogee_loader import load_apogee_allstar
    df = load_apogee_allstar("data/raw/allStar-dr17-synspec_rev1.fits")
    assert isinstance(df, pd.DataFrame)
    assert len(df) > 0
    required_cols = ["APOGEE_ID", "TEFF", "LOGG", "FE_H"]
    for col in required_cols:
        assert col in df.columns
```

Test 2: Quality Filtering
```python
# tests/test_quality_filters.py
import pandas as pd

def test_quality_filter_removes_bad_data():
    """Verify quality filter removes flagged stars."""
    from src.data.quality_filters import apply_quality_cuts
    raw_df = pd.DataFrame({
        "TEFF": [4500, -9999, 5000],
        "ASPCAPFLAG": [0, 1, 0]
    })
    filtered = apply_quality_cuts(raw_df)
    assert len(filtered) == 2
    assert -9999 not in filtered["TEFF"].values
```

Test 3: Model Prediction Shape
```python
# tests/test_model.py
import numpy as np

def test_classifier_prediction_shape():
    """Verify classifier returns correct shape."""
    from src.models.classifier import StellarClassifier
    X = np.random.randn(100, 10)
    y = np.random.randint(0, 3, 100)
    clf = StellarClassifier()
    clf.fit(X, y)
    preds = clf.predict(X)
    assert preds.shape == (100,)
```

```python
"""APOGEE DR17 data loader."""
from pathlib import Path

import pandas as pd
from astropy.io import fits
from astropy.table import Table

# Default columns to extract (reduces memory usage)
DEFAULT_COLUMNS = [
    "APOGEE_ID",
    "RA", "DEC",
    "TEFF", "TEFF_ERR",
    "LOGG", "LOGG_ERR",
    "FE_H", "FE_H_ERR",
    "ALPHA_M", "ALPHA_M_ERR",
    "ASPCAPFLAG", "STARFLAG",
    "SNR",
    "J", "H", "K",
]


def load_apogee_allstar(
    filepath: str,
    columns: list[str] | None = None,
) -> pd.DataFrame:
    """
    Load APOGEE allStar FITS file into a pandas DataFrame.

    Parameters
    ----------
    filepath : str
        Path to the allStar FITS file
    columns : list[str] | None
        Specific columns to load. If None, loads default set.

    Returns
    -------
    pd.DataFrame
        DataFrame with stellar parameters and metadata

    Raises
    ------
    FileNotFoundError
        If the FITS file doesn't exist
    ValueError
        If requested columns don't exist in file
    """
    filepath = Path(filepath)
    if not filepath.exists():
        raise FileNotFoundError(f"FITS file not found: {filepath}")

    cols_to_load = columns or DEFAULT_COLUMNS

    # Use astropy Table for efficient FITS reading
    with fits.open(filepath, memmap=True) as hdul:
        # allStar data is in extension 1
        table = Table.read(hdul[1])

        # Validate requested columns exist
        available = set(table.colnames)
        requested = set(cols_to_load)
        missing = requested - available
        if missing:
            raise ValueError(f"Columns not found: {missing}")

        # Convert to DataFrame with selected columns while the
        # memory-mapped file is still open
        df = table[cols_to_load].to_pandas()

    return df


if __name__ == "__main__":
    # Quick smoke test
    import sys

    if len(sys.argv) > 1:
        df = load_apogee_allstar(sys.argv[1])
        print(f"Loaded {len(df)} stars")
        print(df.head())
```

Quick import check:

```bash
python -c "from src.data.apogee_loader import load_apogee_allstar; print('Import successful')"
```

Commit message:

feat(data): Add APOGEE allStar data loader
- Implement load_apogee_allstar() function
- Support selective column loading for memory efficiency
- Use astropy Table for robust FITS parsing
- Add input validation and error handling
- FITS Files: Flexible Image Transport System, the standard data format in astronomy. Unlike CSV, a FITS file can contain multiple data tables ("extensions"). Use `astropy.io.fits` to read them.
- Stellar Parameters:
  - `Teff` (Effective Temperature): surface temperature in Kelvin (3000-8000 K typical)
  - `log g` (Surface Gravity): logarithm of surface gravity in cgs units (0-5 typical)
  - `[Fe/H]` (Metallicity): iron abundance relative to the Sun (-2 to +0.5 typical)
  - `[alpha/Fe]` (Alpha Enhancement): ratio of alpha elements to iron
- Quality Flags: bitmask integers where each bit indicates a specific issue. Use bitwise AND (`&`) to check individual flags.
- Memory Mapping: `memmap=True` reads the file without loading it entirely into RAM; essential for large files.