[Phase 6] Production Pipeline and Documentation #7

@Sakeeb91

Objective

Create an end-to-end inference pipeline with a command-line interface and comprehensive documentation.

Dependencies

  • All previous phases completed

Tasks

  • Implement src/pipeline.py with full inference pipeline
  • Implement scripts/run_pipeline.py CLI entry point
  • Create requirements.txt with all dependencies
  • Ensure reproducibility with random seed control
  • Write comprehensive docstrings
  • Update README with usage instructions
  • Final testing and validation
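
For the seed-control task, a minimal sketch of a seeding helper is shown here. The name `set_global_seed` is hypothetical and does not appear elsewhere in this issue. It seeds Python's `random` module and, if NumPy is installed, NumPy's legacy global generator. If the models are scikit-learn estimators (an assumption; the issue does not say), their `random_state` argument would also need to be fixed at training time.

```python
import os
import random


def set_global_seed(seed: int) -> None:
    """Seed the random number generators this process is likely to use."""
    # PYTHONHASHSEED only affects interpreters started after this point;
    # hash randomization of the current process is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np  # optional: seeded only if NumPy is installed
    except ImportError:
        return
    np.random.seed(seed)


# Re-seeding with the same value replays the same random sequence.
set_global_seed(42)
first = random.random()
set_global_seed(42)
second = random.random()
print(first == second)
```

Calling the helper once at the start of each entry point (training script, CLI) keeps the seeding logic in one place.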

Files to Create

File                      Purpose
src/pipeline.py           End-to-end pipeline
scripts/run_pipeline.py   CLI interface
requirements.txt          Dependencies

Starter Code

# src/pipeline.py
"""End-to-end stellar classification pipeline."""

import pandas as pd
import joblib
from pathlib import Path

from src.data.apogee_loader import load_apogee_allstar
from src.data.preprocessor import preprocess_data
from src.models.classifier import StellarClassifier
from src.models.regressor import ParameterRegressor

class StellarPipeline:
    """Full inference pipeline for stellar classification."""

    def __init__(self, model_dir: str = "models"):
        self.model_dir = Path(model_dir)
        self.classifier = None
        self.regressors = {}

    def load_models(self) -> None:
        """Load trained models from disk."""
        self.classifier = joblib.load(self.model_dir / "stellar_classifier_v1.joblib")
        for param in ["TEFF", "LOGG", "FE_H"]:
            self.regressors[param] = joblib.load(
                self.model_dir / f"regressor_{param.lower()}_v1.joblib"
            )

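    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        """Run full prediction pipeline and return a new DataFrame."""
        if self.classifier is None or not self.regressors:
            raise RuntimeError("Models not loaded; call load_models() first.")

        # Preprocess
        X = preprocess_data(df)

        # Work on a copy so the caller's DataFrame is not mutated
        results = df.copy()

        # Classification
        results["stellar_type"] = self.classifier.predict(X)

        # Regression
        for param, reg in self.regressors.items():
            results[f"{param}_pred"] = reg.predict(X)

        return results


# scripts/run_pipeline.py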
"""CLI for stellar classification pipeline."""

import argparse

# Note: `src` must be importable, e.g. run with PYTHONPATH set to the repository root.
from src.data.apogee_loader import load_apogee_allstar
from src.pipeline import StellarPipeline

def main():
    parser = argparse.ArgumentParser(description="Stellar Classification Pipeline")
    parser.add_argument("input", help="Input FITS file path")
    parser.add_argument("-o", "--output", default="results.csv", help="Output CSV path")
    parser.add_argument("--model-dir", default="models", help="Model directory")

    args = parser.parse_args()

    pipeline = StellarPipeline(model_dir=args.model_dir)
    pipeline.load_models()

    # Load and process
    df = load_apogee_allstar(args.input)
    results = pipeline.predict(df)

    # Save results
    results.to_csv(args.output, index=False)
    print(f"Results saved to {args.output}")

if __name__ == "__main__":
    main()
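
As a possible starting point for the testing task, the argument parsing can be checked without models or FITS data by moving parser construction into its own function. This is a sketch under assumptions: `build_parser` is a hypothetical refactoring of the `main()` above, and `allStar.fits`, `out.csv`, and `artifacts` are placeholder values.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper mirroring the arguments defined in main() above."""
    parser = argparse.ArgumentParser(description="Stellar Classification Pipeline")
    parser.add_argument("input", help="Input FITS file path")
    parser.add_argument("-o", "--output", default="results.csv", help="Output CSV path")
    parser.add_argument("--model-dir", default="models", help="Model directory")
    return parser


def test_defaults() -> None:
    # Only the positional input is given, so the defaults should apply.
    args = build_parser().parse_args(["allStar.fits"])
    assert args.input == "allStar.fits"
    assert args.output == "results.csv"
    assert args.model_dir == "models"


def test_overrides() -> None:
    # Explicit flags should replace the defaults.
    args = build_parser().parse_args(
        ["allStar.fits", "-o", "out.csv", "--model-dir", "artifacts"]
    )
    assert args.output == "out.csv"
    assert args.model_dir == "artifacts"


test_defaults()
test_overrides()
```

The two functions follow the `test_*` naming convention, so a test runner such as pytest would collect them if they were placed in a test module.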

Definition of Done

  • Single command runs full pipeline
  • Results reproducible with seed control
  • Documentation complete
  • All tests passing
  • README updated with usage examples
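
One way to check the reproducibility criterion is to run the pipeline twice on the same input and compare the output files byte for byte. The sketch below assumes that approach. The helper names `file_sha256` and `outputs_match` are hypothetical, and the CSV contents are placeholders standing in for two real pipeline runs.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def outputs_match(first: Path, second: Path) -> bool:
    """True if two result files are byte-identical."""
    return file_sha256(first) == file_sha256(second)


# Stand-in files; in practice these would be the CSVs written by two runs.
with tempfile.TemporaryDirectory() as tmp:
    run_a = Path(tmp) / "run_a.csv"
    run_b = Path(tmp) / "run_b.csv"
    placeholder = "stellar_type,TEFF_pred\nplaceholder,0.0\n"
    run_a.write_text(placeholder)
    run_b.write_text(placeholder)
    identical = outputs_match(run_a, run_b)

print(identical)
```

A byte-level comparison is strict: any difference in float formatting or row order counts as a mismatch, which is usually the desired behavior for a reproducibility gate.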

Part of #1 (Meta Issue)
