Brain Age Regression Kaggle Project

This project was built around a Kaggle competition. The final submission placed in the top 5% among nearly 500 participants.

The repo does not pretend the notebook experiments are already productionized. The new code sits on top of the existing work and makes the current state explicit.

What is in scope now

scripts/train_baseline.py: trains a clean scikit-learn baseline on the bundled MRI feature tables and writes a submission CSV plus a JSON run report.
scripts/report_legacy_results.py: inspects the saved notebooks and artifacts and prints the best recorded results that already exist in the repo.
src/brain_age_lab/: reusable baseline code.

What is still legacy work

abdellah_task1/: Abdellah's preprocessing notebook plus saved submissions.
mehdi_task1/: Mehdi's experiment branch for imputation and outlier removal, plus partial stubs for a larger pipeline.
ramy_task1/: Ramy's main experiment branch, with notebooks, saved OOF predictions, blending weights, and submissions.

I left those folders intact because they are the evidence for the project history. They are useful, but they are exploratory work rather than a polished library.

Dataset facts

The bundled data in ramy_task1/data/ contains:

X_train.csv: 1212 rows, 832 feature columns plus id
X_test.csv: 776 rows, 832 feature columns plus id
y_train.csv: 1212 target rows
sample.csv: 776-row submission template

The same competition tables appear in multiple contributor folders. The baseline defaults to the ramy_task1/data/ copy so there is one documented entry point.
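Because the tables are duplicated across folders, a quick shape check is a cheap way to confirm that a given copy matches the figures listed above. The following is a minimal sketch, not part of the repo: check_tables and load_tables are hypothetical helper names, and the only facts taken from this README are the row counts, the 832 feature columns, and the id column.

```python
from pathlib import Path

import pandas as pd

# Row counts and feature count as stated in this README.
EXPECTED_ROWS = {
    "X_train.csv": 1212,
    "X_test.csv": 776,
    "y_train.csv": 1212,
    "sample.csv": 776,
}
N_FEATURES = 832


def load_tables(data_dir):
    """Hypothetical loader: read the four CSV files from one data folder."""
    return {name: pd.read_csv(Path(data_dir) / name) for name in EXPECTED_ROWS}


def check_tables(tables):
    """Return a list of human-readable mismatches (empty list means OK)."""
    problems = []
    for name, n_rows in EXPECTED_ROWS.items():
        if len(tables[name]) != n_rows:
            problems.append(f"{name}: expected {n_rows} rows, got {len(tables[name])}")
    for name in ("X_train.csv", "X_test.csv"):
        n_feat = tables[name].drop(columns=["id"]).shape[1]
        if n_feat != N_FEATURES:
            problems.append(f"{name}: expected {N_FEATURES} features, got {n_feat}")
    return problems


# Example (requires the bundled data to be present):
# print(check_tables(load_tables("ramy_task1/data")))
```

Running the same check against each contributor folder would show whether the copies have drifted apart.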

Quickstart

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

PYTHONPATH=src python scripts/train_baseline.py \
  --data-dir ramy_task1/data \
  --output outputs/baseline_submission.csv

PYTHONPATH=src python scripts/report_legacy_results.py

Baseline pipeline

The new baseline is intentionally simple and reproducible:

median or KNN imputation
optional IsolationForest trimming on the training folds only
univariate feature selection
PLSRegression or ElasticNetCV

This is not presented as the best model in the folder. It is the projectized path that is easiest to rerun and extend.

On a local run dated April 10, 2026, the default packaged baseline (PLSRegression, median imputation, 150 selected features) produced:

mean CV R² = 0.2354 +/- 0.0931
out-of-fold R² = 0.2342

The gap versus the notebook experiments is intentional: it shows the difference between a "reproducible baseline" and the "best exploratory result in the archive."
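The two baseline figures above measure slightly different things: mean CV R² averages one score per fold, while out-of-fold R² pools every held-out prediction and scores them once against the global target mean. A minimal sketch of that distinction follows. summarize_cv is a hypothetical helper, and whether the run report computes its numbers exactly this way is an assumption.

```python
import numpy as np
from sklearn.metrics import r2_score


def summarize_cv(y_true, oof_pred, fold_ids):
    """Compare per-fold R² statistics with the pooled out-of-fold R²."""
    y_true = np.asarray(y_true, dtype=float)
    oof_pred = np.asarray(oof_pred, dtype=float)
    fold_ids = np.asarray(fold_ids)
    fold_scores = [
        r2_score(y_true[fold_ids == f], oof_pred[fold_ids == f])
        for f in np.unique(fold_ids)
    ]
    return {
        "mean_cv_r2": float(np.mean(fold_scores)),
        "std_cv_r2": float(np.std(fold_scores)),
        "oof_r2": float(r2_score(y_true, oof_pred)),
    }
```

The two values are usually close, as they are here, but they need not match because each fold has its own target mean in the R² denominator.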

Recorded legacy results

These are historical notebook outputs already present in the repo, not claims about the new baseline:

tuned SVR notebook: mean test R² = 0.6224 +/- 0.0099
XGBoost + GPR notebook: mean hybrid R² = 0.6081
LightGBM + GPR notebook: mean hybrid R² = 0.5921
stacking notebook:
- pure GPR OOF R² = 0.6727
- XGBoost + GPR OOF R² = 0.6421
- SVR + GPR OOF R² = 0.6482
- LightGBM + GPR OOF R² = 0.6195
- average off-diagonal OOF correlation = 0.9520
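The off-diagonal correlation in the last line is a rough measure of how much the stacked models agree: a value near 1 means the base models make nearly the same errors, which limits what blending can add. A minimal sketch of the calculation, assuming Pearson correlation over a matrix with one column of OOF predictions per model (the notebook may compute it differently, and mean_offdiag_corr is a hypothetical name):

```python
import numpy as np


def mean_offdiag_corr(oof_matrix):
    """Average pairwise Pearson correlation between model OOF predictions.

    oof_matrix has shape (n_samples, n_models).
    """
    corr = np.corrcoef(np.asarray(oof_matrix, dtype=float), rowvar=False)
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    return float(corr[off_diagonal].mean())
```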

The saved artifacts also include:

blend weights: 0.718246761453894 and 0.2817532385461061
bias calibration: a = 8.271319003671536, b = 0.8891245933394192
a 1660-feature saved selection list in ramy_task1/artifacts/selected_features_full.txt
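For orientation, here is one way artifacts like these are commonly applied. Everything about the usage is an assumption: this README does not say which two models the weights belong to, in which order, or which convention the calibration follows. The sketch assumes a weighted average of two prediction vectors and a linear bias model of the form pred ≈ a + b * age, inverted as (pred - a) / b. Check the notebooks in ramy_task1/ before reusing it.

```python
import numpy as np

# Values copied from the saved artifacts listed above.
BLEND_WEIGHTS = np.array([0.718246761453894, 0.2817532385461061])
A = 8.271319003671536
B = 0.8891245933394192


def blend(preds, weights=BLEND_WEIGHTS):
    """Weighted average of model predictions.

    preds has shape (n_models, n_samples); model order is an assumption.
    """
    return weights @ np.asarray(preds, dtype=float)


def debias(pred, a=A, b=B):
    """Invert an assumed linear bias model pred = a + b * age."""
    return (np.asarray(pred, dtype=float) - a) / b
```

If the notebooks instead apply the calibration as a + b * pred, only debias needs to change.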

Run scripts/report_legacy_results.py to regenerate that summary from disk instead of trusting this README blindly.

Suggested next steps

swap the baseline model for a stronger tree-based learner if you want leaderboard-oriented performance
port the best notebook path into src/ once you decide which experiment is worth maintaining
deduplicate the copied datasets once you no longer need contributor-level provenance
