Commit 5f3a42f

ENH add template starting_kit
1 parent 8977412 commit 5f3a42f

4 files changed

Lines changed: 197 additions & 5 deletions

ingestion_program/ingestion.py

Lines changed: 9 additions & 3 deletions
@@ -15,14 +15,20 @@ def evaluate_model(model, X_test):
     return pd.DataFrame(y_pred)
 
 
+def get_train_data(data_dir):
+    data_dir = Path(data_dir)
+    training_dir = data_dir / "train"
+    X_train = pd.read_csv(training_dir / "train_features.csv")
+    y_train = pd.read_csv(training_dir / "train_labels.csv")
+    return X_train, y_train
+
+
 def main(data_dir, output_dir):
     # Here, you can import info from the submission module, to evaluate the
     # submission
     from submission import get_model
 
-    training_dir = data_dir / "train"
-    X_train = pd.read_csv(training_dir / "train_features.csv")
-    y_train = pd.read_csv(training_dir / "train_labels.csv")
+    X_train, y_train = get_train_data(data_dir)
 
     print("Training the model")
 
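The extracted helper can be exercised on its own. A minimal sketch, assuming only the `train/train_features.csv` and `train/train_labels.csv` layout shown in the diff; the fixture data and column names here are made up for illustration:

```python
from pathlib import Path
import tempfile

import pandas as pd


def get_train_data(data_dir):
    # Mirror of the helper added to ingestion.py: load training
    # features and labels from <data_dir>/train/
    data_dir = Path(data_dir)
    training_dir = data_dir / "train"
    X_train = pd.read_csv(training_dir / "train_features.csv")
    y_train = pd.read_csv(training_dir / "train_labels.csv")
    return X_train, y_train


# Build a throwaway data directory with the expected layout
tmp = Path(tempfile.mkdtemp())
train_dir = tmp / "train"
train_dir.mkdir()
pd.DataFrame({"f0": [1.0, 2.0], "f1": [3.0, 4.0]}).to_csv(
    train_dir / "train_features.csv", index=False
)
pd.DataFrame({"label": [0, 1]}).to_csv(
    train_dir / "train_labels.csv", index=False
)

X_train, y_train = get_train_data(tmp)
print(X_train.shape, y_train.shape)  # (2, 2) (2, 1)
```

Note that the helper accepts a plain string as well as a `Path`, since it normalizes `data_dir` itself; this is what lets the notebook call it with `"dev_phase/input_data"`.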
scoring_program/scoring.py

Lines changed: 2 additions & 2 deletions
@@ -8,9 +8,9 @@
 
 def compute_accuracy(predictions, targets):
     # Make sure there is no NaN, as pandas ignores them in mean computation
-    predictions = predictions.fillna(-10)
+    predictions = predictions.fillna(-10).values
     # Return mean of correct predictions
-    return (predictions == targets).mean()
+    return (predictions == targets.values).mean()
 
 
 def main(reference_dir, prediction_dir, output_dir):
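The `.values` change matters because pandas refuses to compare two DataFrames whose column labels differ, and participants' prediction files have no reason to use the same column name as the reference labels. Converting to numpy arrays makes the comparison purely positional. A small sketch of the fixed function; the column names are illustrative:

```python
import numpy as np
import pandas as pd


def compute_accuracy(predictions, targets):
    # Replace NaN before comparing, as pandas ignores NaN in mean computation
    predictions = predictions.fillna(-10).values
    # Comparing numpy arrays is positional, so mismatched column labels
    # between the two frames no longer matter
    return (predictions == targets.values).mean()


preds = pd.DataFrame({"0": [1, 0, 1, np.nan]})
targets = pd.DataFrame({"label": [1, 0, 0, 1]})

# Direct DataFrame comparison fails: the column labels differ
try:
    preds == targets
except ValueError:
    print("DataFrame comparison raises ValueError")

print(compute_accuracy(preds, targets))  # 0.5
```

The NaN row is filled with the sentinel -10, so it counts as a wrong prediction instead of being silently dropped from the mean.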

solution/submission.py

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
 from sklearn.ensemble import RandomForestClassifier
 
 
+# The submission here should simply be a function that returns a model
+# compatible with the scikit-learn API
 def get_model():
     return RandomForestClassifier()
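Since the only contract is "a function returning a model compatible with the scikit-learn API", a submission can also bundle preprocessing with the classifier. A hypothetical variant, not part of the kit; the imputation step is just an example of a preprocessing stage:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


def get_model():
    # A Pipeline is still a scikit-learn-compatible model: it exposes
    # fit/predict, so the ingestion program can use it unchanged
    return make_pipeline(
        SimpleImputer(strategy="median"),
        RandomForestClassifier(random_state=0),
    )


model = get_model()
X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.0, 1.0], [3.0, np.nan]])
y = np.array([0, 1, 0, 1])
model.fit(X, y)
print(model.predict(X).shape)  # (4,)
```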

template_starting_kit.ipynb

Lines changed: 184 additions & 0 deletions
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"text-align: center;\">\n",
    "    <a href=\"https://www.hi-paris.fr/\">\n",
    "    <img border=\"0\" src=\"https://www.hi-paris.fr/wp-content/uploads/2020/09/logo-hi-paris-retina.png\" width=\"25%\"></a>\n",
    "    <a href=\"https://www.dataia.eu/\">\n",
    "    <img border=\"0\" src=\"https://github.com/ramp-kits/template-kit/raw/main/img/DATAIA-h.png\" width=\"70%\"></a>\n",
    "</div>\n",
    "\n",
    "# Template Kit for the Codabench challenge in the Datacamp\n",
    "\n",
    "<i> Thomas Moreau (Inria) </i><br/>\n",
    "<i> Pedro Rodrigues (Inria) </i>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "Describe the challenge, in particular:\n",
    "\n",
    "- Where does the data come from?\n",
    "- What is the task this challenge aims to solve?\n",
    "- Why does it matter?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploratory data analysis\n",
    "\n",
    "The goal of this section is to show what is in the data and how to play with it.\n",
    "This is the first step in any data science project, and here you should give a sense of the data the participants will be working with.\n",
    "\n",
    "You can first load and describe the data, and then show some interesting properties of it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "pd.set_option('display.max_columns', None)\n",
    "\n",
    "# Load the data\n",
    "from ingestion_program.ingestion import get_train_data\n",
    "X_df, y = get_train_data(\"dev_phase/input_data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Challenge evaluation\n",
    "\n",
    "A particularly important point in a challenge is to describe how it is evaluated. This is the section where you should describe the metric used to evaluate the participants' submissions, as well as your evaluation strategy, in particular any complexity in how the data must be split to ensure valid results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Submission format\n",
    "\n",
    "Here, you should describe the submission format, i.e. the format the participants must follow to submit their predictions on the Codabench platform."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The submission file\n",
    "\n",
    "The input data are stored in a dataframe. To go from a dataframe to a numpy array, we will use a scikit-learn column transformer. The first example we write simply consists of selecting a subset of columns we want to work with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solution/submission.py\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "\n",
    "# The submission here should simply be a function that returns a model\n",
    "# compatible with the scikit-learn API\n",
    "def get_model():\n",
    "    return RandomForestClassifier()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Local testing pipeline\n",
    "\n",
    "Here you can show how the model will be used to generate predictions on the test set, and how the evaluation will be performed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy on test set: 0.95\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/tom/.local/miniconda/lib/python3.12/site-packages/sklearn/base.py:1363: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
      "  return fit_method(estimator, *args, **kwargs)\n"
     ]
    }
   ],
   "source": [
    "model = get_model()\n",
    "X_train, y_train = get_train_data(\"dev_phase/input_data\")\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "X_test = pd.read_csv(\"dev_phase/input_data/test/test_features.csv\")\n",
    "from ingestion_program.ingestion import evaluate_model\n",
    "y_test = evaluate_model(model, X_test)\n",
    "\n",
    "from scoring_program.scoring import compute_accuracy\n",
    "print(\"Accuracy on test set:\", compute_accuracy(y_test, pd.read_csv(\"dev_phase/input_data/test/test_labels.csv\")))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Submission\n",
    "\n",
    "To submit your code, you can refer to the actual challenge."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
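The stderr output captured in the notebook comes from passing `y_train` to `fit` as a single-column DataFrame. A sketch of the usual fix, on synthetic data with illustrative names; `DataConversionWarning` lives in `sklearn.exceptions`:

```python
import warnings

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import DataConversionWarning

X_train = pd.DataFrame({"f0": [1.0, 2.0, 3.0, 4.0], "f1": [0.0, 1.0, 0.0, 1.0]})
y_train = pd.DataFrame({"label": [0, 1, 0, 1]})  # single-column frame, shape (4, 1)

model = RandomForestClassifier(random_state=0)
with warnings.catch_warnings():
    # Turn the warning into an error to prove it is not emitted
    warnings.simplefilter("error", DataConversionWarning)
    # Flattening y to shape (n_samples,) avoids the DataConversionWarning
    model.fit(X_train, y_train.values.ravel())
print(model.predict(X_train).shape)  # (4,)
```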
