ramshty/age_prediction_MRI

Brain Age Regression Kaggle Project

This project was built around a Kaggle competition. The final submission placed in the top 5% of nearly 500 participants.

The notebook experiments are not productionized, and this repo does not pretend they are. The new code sits on top of the existing work and states plainly what is maintained and what is legacy.

What is in scope now

  • scripts/train_baseline.py: trains a clean scikit-learn baseline on the bundled MRI feature tables and writes a submission CSV plus a JSON run report.
  • scripts/report_legacy_results.py: inspects the saved notebooks and artifacts and prints the best recorded results that already exist in the repo.
  • src/brain_age_lab/: reusable baseline code.

What is still legacy work

  • abdellah_task1/: Abdellah's preprocessing notebook plus saved submissions.
  • mehdi_task1/: Mehdi's experiment branch for imputation and outlier removal, plus partial stubs for a larger pipeline.
  • ramy_task1/: Ramy's main experiment branch, with notebooks, saved OOF predictions, blending weights, and submissions.

I left those folders intact because they are the evidence for the project history. They are useful, but they are exploratory work rather than a polished library.

Dataset facts

The bundled data in ramy_task1/data/ contains:

  • X_train.csv: 1212 rows, 832 feature columns plus id
  • X_test.csv: 776 rows, 832 feature columns plus id
  • y_train.csv: 1212 target rows
  • sample.csv: 776-row submission template

The same competition tables appear in multiple contributor folders. The baseline defaults to the ramy_task1/data/ copy so there is one documented entry point.
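Because the same tables are copied across folders, a quick consistency check before training can catch a stale or mismatched copy. The sketch below is not part of the packaged baseline: `load_tables` and `check_competition_tables` are hypothetical helper names, and the only things taken from this README are the four file names and the `id` column.

```python
from pathlib import Path

import pandas as pd


def load_tables(data_dir):
    """Read the four competition CSVs from one folder (file names as listed above)."""
    data_dir = Path(data_dir)
    names = ("X_train", "X_test", "y_train", "sample")
    return {name: pd.read_csv(data_dir / f"{name}.csv") for name in names}


def check_competition_tables(X_train, X_test, y_train, sample, id_col="id"):
    """Return a list of problems; an empty list means the tables line up."""
    problems = []
    for name, frame in (("X_train", X_train), ("X_test", X_test)):
        if id_col not in frame.columns:
            problems.append(f"{name} is missing the '{id_col}' column")
    # Train and test must expose the same features in the same order.
    train_features = [c for c in X_train.columns if c != id_col]
    test_features = [c for c in X_test.columns if c != id_col]
    if train_features != test_features:
        problems.append("X_train and X_test do not share the same feature columns")
    if len(X_train) != len(y_train):
        problems.append(f"X_train has {len(X_train)} rows but y_train has {len(y_train)}")
    if len(X_test) != len(sample):
        problems.append(f"X_test has {len(X_test)} rows but sample has {len(sample)}")
    return problems
```

A typical call would be `check_competition_tables(**load_tables("ramy_task1/data"))`, which should return an empty list if the bundled copy matches the row and column counts listed above.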

Quickstart

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

PYTHONPATH=src python scripts/train_baseline.py \
  --data-dir ramy_task1/data \
  --output outputs/baseline_submission.csv

PYTHONPATH=src python scripts/report_legacy_results.py

Baseline pipeline

The new baseline is intentionally simple and reproducible:

  • median or KNN imputation
  • optional IsolationForest trimming on the training folds only
  • univariate feature selection
  • PLSRegression or ElasticNetCV

This is not presented as the best model in the folder. It is the projectized path that is easiest to rerun and extend.

On a local run dated April 10, 2026, the default packaged baseline (PLSRegression, median imputation, 150 selected features) produced:

  • mean CV R² = 0.2354 +/- 0.0931
  • out-of-fold R² = 0.2342
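The two figures above measure slightly different things, which is why they need not match. A minimal sketch of the distinction, assuming the mean CV score averages per-fold R² values and the out-of-fold score is a single R² over all pooled held-out predictions (`summarize_r2` is a hypothetical helper, and the population standard deviation used here may differ from what the run report uses):

```python
import numpy as np
from sklearn.metrics import r2_score


def summarize_r2(y_true, oof_pred, fold_ids):
    """Compare the mean of per-fold R^2 values with one pooled out-of-fold R^2."""
    y_true, oof_pred, fold_ids = map(np.asarray, (y_true, oof_pred, fold_ids))
    per_fold = np.array([
        r2_score(y_true[fold_ids == f], oof_pred[fold_ids == f])
        for f in np.unique(fold_ids)
    ])
    return {
        "fold_mean": float(per_fold.mean()),
        "fold_std": float(per_fold.std()),  # population std (ddof=0), an assumption
        "pooled_oof": float(r2_score(y_true, oof_pred)),
    }


# Toy example: two folds whose targets sit in very different ranges.
y_toy = np.array([0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0])
pred_toy = y_toy + np.array([0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5])
folds_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
summary = summarize_r2(y_toy, pred_toy, folds_toy)
print(summary)
```

In the toy data each fold scores 0.8 on its own, while the pooled score is higher because the total variance is measured around the global mean rather than each fold's mean.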

The gap between these numbers and the notebook experiments is deliberate: it shows the difference between a "reproducible baseline" and the "best exploratory result in the archive."

Recorded legacy results

These are historical notebook outputs already present in the repo, not claims about the new baseline:

  • tuned SVR notebook: mean test R² = 0.6224 +/- 0.0099
  • XGBoost + GPR notebook: mean hybrid R² = 0.6081
  • LightGBM + GPR notebook: mean hybrid R² = 0.5921
  • stacking notebook:
    • pure GPR OOF R² = 0.6727
    • XGBoost + GPR OOF R² = 0.6421
    • SVR + GPR OOF R² = 0.6482
    • LightGBM + GPR OOF R² = 0.6195
    • average off-diagonal OOF correlation = 0.9520
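For readers unfamiliar with the last figure: an average off-diagonal correlation summarizes how similar the stacked models' out-of-fold predictions are to one another, and a value near 1 suggests the models make largely the same errors. A sketch of how such a number could be computed, assuming Pearson correlation over a samples-by-models matrix (the notebook's exact computation is not shown in this README, and `mean_offdiag_correlation` is a hypothetical name):

```python
import numpy as np


def mean_offdiag_correlation(oof_matrix):
    """Average pairwise Pearson correlation between the columns of a (samples x models) array."""
    corr = np.corrcoef(np.asarray(oof_matrix, dtype=float), rowvar=False)
    off_diag = ~np.eye(corr.shape[0], dtype=bool)  # mask out each model's self-correlation
    return float(corr[off_diag].mean())


# Toy example: the second model is a linear rescaling of the first, the third is its mirror image.
a = np.array([1.0, 2.0, 3.0, 4.0])
toy_oof = np.column_stack([a, 2.0 * a + 1.0, -a])
print(mean_offdiag_correlation(toy_oof))
```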

The saved artifacts also include:

  • blend weights: 0.718246761453894 and 0.2817532385461061
  • bias calibration: a = 8.271319003671536, b = 0.8891245933394192
  • a 1660-feature saved selection list in ramy_task1/artifacts/selected_features_full.txt

Run scripts/report_legacy_results.py to regenerate that summary from disk instead of trusting this README blindly.
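To make the saved blend weights and calibration constants concrete, the sketch below shows how two prediction vectors could be combined and then passed through a linear map. Several things here are assumptions: this README does not say which model each weight belongs to, nor whether the calibration is applied as `a + b * x` or inverted as `(x - a) / b`, so both directions are shown and neither should be read as the notebook's actual method. All function names are hypothetical.

```python
import numpy as np

# Constants copied from the saved artifacts listed above.
BLEND_WEIGHTS = (0.718246761453894, 0.2817532385461061)
CALIB_A, CALIB_B = 8.271319003671536, 0.8891245933394192


def blend(pred_first, pred_second, weights=BLEND_WEIGHTS):
    """Weighted average of two prediction vectors; the weight-to-model mapping is an assumption."""
    w_first, w_second = weights
    first = np.asarray(pred_first, dtype=float)
    second = np.asarray(pred_second, dtype=float)
    return w_first * first + w_second * second


def linear_map(values, a=CALIB_A, b=CALIB_B):
    """One possible reading of the calibration: a + b * x."""
    return a + b * np.asarray(values, dtype=float)


def inverse_linear_map(values, a=CALIB_A, b=CALIB_B):
    """The other possible reading: undo a fitted line with (x - a) / b."""
    return (np.asarray(values, dtype=float) - a) / b


blended = blend([10.0, 20.0], [20.0, 40.0])
print(blended, linear_map(blended), inverse_linear_map(blended))
```

The weights sum to 1 within floating-point precision, so the blend is a convex combination and stays within the range spanned by the two inputs.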

Suggested next steps

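  • swap the baseline model for a stronger learner if you want leaderboard-oriented performance; in the recorded legacy results above, the GPR and SVR variants score at least as well as the tree-based hybrids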
  • port the best notebook path into src/ once you decide which experiment is worth maintaining
  • deduplicate the copied datasets once you no longer need contributor-level provenance
