This project was built around a Kaggle competition. The final result placed in the top 5% against nearly 500 participants.
The repo does not pretend the notebook experiments are already productionized. The new code sits on top of the existing work and makes the current state explicit.
- scripts/train_baseline.py: trains a clean scikit-learn baseline on the bundled MRI feature tables and writes a submission CSV plus a JSON run report.
- scripts/report_legacy_results.py: inspects the saved notebooks and artifacts and prints the best recorded results that already exist in the repo.
- src/brain_age_lab/: reusable baseline code.
- abdellah_task1/: Abdellah's preprocessing notebook plus saved submissions.
- mehdi_task1/: Mehdi's experiment branch for imputation and outlier removal, plus partial stubs for a larger pipeline.
- ramy_task1/: Ramy's main experiment branch, with notebooks, saved OOF predictions, blending weights, and submissions.
I left those folders intact because they are the evidence for the project history. They are useful, but they are exploratory work rather than a polished library.
The bundled data in ramy_task1/data/ contains:
- X_train.csv: 1212 rows, 832 feature columns plus id
- X_test.csv: 776 rows, 832 feature columns plus id
- y_train.csv: 1212 target rows
- sample.csv: 776-row submission template
The same competition tables appear in multiple contributor folders. The baseline defaults to the ramy_task1/data/ copy so there is one documented entry point.
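Before training on any copy of the tables, it can help to confirm that they line up the way the list above describes. The following is a minimal sketch, not code from the repo: `check_tables` is a hypothetical helper, the target column name `y` is an assumption, and the tiny frames stand in for the real CSVs so the snippet runs on its own.

```python
import pandas as pd


def check_tables(X_train, X_test, y_train, sample):
    """Return (problems, n_features); an empty problems list means the tables line up."""
    problems = []
    if list(X_train.columns) != list(X_test.columns):
        problems.append("train and test feature columns differ")
    if len(X_train) != len(y_train):
        problems.append("X_train and y_train row counts differ")
    if len(X_test) != len(sample):
        problems.append("X_test and sample row counts differ")
    for name, frame in [("X_train", X_train), ("X_test", X_test)]:
        if "id" not in frame.columns:
            problems.append(f"{name} has no id column")
    # Everything except the id column counts as a feature.
    n_features = len([c for c in X_train.columns if c != "id"])
    return problems, n_features


# Toy stand-ins for the real tables (the "y" column name is an assumption).
X_train = pd.DataFrame({"id": [0, 1, 2], "x0": [1.0, None, 3.0], "x1": [0.5, 0.1, 0.2]})
X_test = pd.DataFrame({"id": [0, 1], "x0": [2.0, 2.5], "x1": [0.3, None]})
y_train = pd.DataFrame({"id": [0, 1, 2], "y": [70.0, 65.0, 81.0]})
sample = pd.DataFrame({"id": [0, 1], "y": [0.0, 0.0]})

problems, n_features = check_tables(X_train, X_test, y_train, sample)
print(problems, n_features)
```

On the real data you would pass the four frames loaded with `pd.read_csv` from ramy_task1/data/ and expect 832 features and no reported problems.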
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
PYTHONPATH=src python scripts/train_baseline.py \
--data-dir ramy_task1/data \
--output outputs/baseline_submission.csv
PYTHONPATH=src python scripts/report_legacy_results.py

The new baseline is intentionally simple and reproducible:
- median or KNN imputation
- optional IsolationForest trimming on the training folds only
- univariate feature selection
- PLSRegression or ElasticNetCV
This is not presented as the best model in the folder. It is the projectized path that is easiest to rerun and extend.
On a local run dated April 10, 2026, the default packaged baseline (PLSRegression, median imputation, 150 selected features) produced:
- mean CV R² = 0.2354 +/- 0.0931
- out-of-fold R² = 0.2342
That gap versus the notebook experiments is intentional: it shows the difference between a "reproducible baseline" and the "best exploratory result in the archive."
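The two baseline figures are computed differently, which is why they need not match. As a hedged illustration (the helper names are hypothetical, and whether the repo reports a population or sample standard deviation is an assumption), mean CV R² averages one score per fold, while out-of-fold R² pools every held-out prediction and scores them once:

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def summarize_cv(y_true, oof_pred, fold_ids):
    """Return (mean per-fold R2, std of per-fold R2, pooled out-of-fold R2)."""
    y_true = np.asarray(y_true, dtype=float)
    oof_pred = np.asarray(oof_pred, dtype=float)
    fold_ids = np.asarray(fold_ids)
    per_fold = np.array(
        [r2(y_true[fold_ids == f], oof_pred[fold_ids == f]) for f in np.unique(fold_ids)]
    )
    # np.std defaults to the population standard deviation (ddof=0).
    return per_fold.mean(), per_fold.std(), r2(y_true, oof_pred)


# Toy example: fold 0 is predicted perfectly, fold 1 has one error.
mean_r2, std_r2, pooled_r2 = summarize_cv(
    y_true=[1.0, 2.0, 3.0, 4.0],
    oof_pred=[1.0, 2.0, 3.0, 5.0],
    fold_ids=[0, 0, 1, 1],
)
print(mean_r2, std_r2, pooled_r2)
```

In the toy case the per-fold mean is 0.0 while the pooled score is 0.8, because each fold's R² is measured against that fold's own target variance.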
These are historical notebook outputs already present in the repo, not claims about the new baseline:
- tuned SVR notebook: mean test R² = 0.6224 +/- 0.0099
- XGBoost + GPR notebook: mean hybrid R² = 0.6081
- LightGBM + GPR notebook: mean hybrid R² = 0.5921
- stacking notebook:
  - pure GPR OOF R² = 0.6727
  - XGBoost + GPR OOF R² = 0.6421
  - SVR + GPR OOF R² = 0.6482
  - LightGBM + GPR OOF R² = 0.6195
  - average off-diagonal OOF correlation = 0.9520
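The average off-diagonal OOF correlation summarizes how similar the stacked models' predictions are; a value near 1 means the models mostly agree, which limits what blending can add. A minimal sketch of one way to compute it, assuming Pearson correlation (the notebook's exact method is not stated here, and `mean_offdiag_corr` is a hypothetical name):

```python
import numpy as np


def mean_offdiag_corr(oof_matrix):
    """Mean pairwise Pearson correlation between columns of an (n_samples, n_models) array."""
    corr = np.corrcoef(np.asarray(oof_matrix, dtype=float), rowvar=False)
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    return corr[off_diagonal].mean()


# Toy predictions from three "models": the second tracks the first exactly,
# the third is its mirror image.
base = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
oof_matrix = np.column_stack([base, 2.0 * base + 1.0, -base])
avg_corr = mean_offdiag_corr(oof_matrix)
print(avg_corr)
```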
The saved artifacts also include:
- blend weights: 0.718246761453894 and 0.2817532385461061
- bias calibration: a = 8.271319003671536, b = 0.8891245933394192
- a 1660-feature saved selection list in ramy_task1/artifacts/selected_features_full.txt
Run scripts/report_legacy_results.py to regenerate that summary from disk instead of trusting this README blindly.
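For orientation, here is a hedged sketch of how saved blend weights and a two-parameter bias calibration are commonly applied. Everything about the functional form is an assumption: this README does not say which model each weight belongs to, nor whether the calibration is applied as `a + b * pred` or inverted as `(pred - a) / b`. Check the notebook in ramy_task1/ before reusing either form.

```python
import numpy as np

# Values copied from the saved artifacts listed above.
BLEND_WEIGHTS = (0.718246761453894, 0.2817532385461061)
CAL_A, CAL_B = 8.271319003671536, 0.8891245933394192


def blend(pred_first, pred_second, weights=BLEND_WEIGHTS):
    """Weighted average of two prediction vectors (which model is 'first' is an assumption)."""
    w_first, w_second = weights
    return w_first * np.asarray(pred_first, dtype=float) + w_second * np.asarray(
        pred_second, dtype=float
    )


def apply_linear(pred, a=CAL_A, b=CAL_B):
    """ASSUMED form 1: calibrated = a + b * pred."""
    return a + b * np.asarray(pred, dtype=float)


def invert_linear(pred, a=CAL_A, b=CAL_B):
    """ASSUMED form 2: calibrated = (pred - a) / b, the inverse of form 1."""
    return (np.asarray(pred, dtype=float) - a) / b


blended = blend([70.0, 55.0], [80.0, 50.0])
print(blended, apply_linear(blended), invert_linear(blended))
```

The two weights sum to 1 up to floating-point rounding, so the blend is a convex combination of the two inputs.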
- swap the baseline model for a stronger tree-based learner if you want leaderboard-oriented performance
- port the best notebook path into src/ once you decide which experiment is worth maintaining
- deduplicate the copied datasets once you no longer need contributor-level provenance
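For the deduplication step, a small standard-library sketch like the one below could confirm which copied tables are byte-identical before anything is deleted. The function name `find_duplicate_files` and the `*.csv` pattern are assumptions, not part of the repo.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(root, pattern="*.csv"):
    """Group files under root by SHA-256 digest and return groups with more than one file."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    # Run from the repository root to list byte-identical CSV copies.
    for group in find_duplicate_files("."):
        print(" == ".join(str(p) for p in group))
```

Files that differ only in line endings or column order will not be reported, so an empty result does not prove the copies have diverged in content.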