Commit 5f3a42f

ENH add template starting_kit
1 parent 8977412 commit 5f3a42f

4 files changed

Lines changed: 197 additions & 5 deletions

ingestion_program/ingestion.py

Lines changed: 9 additions & 3 deletions
@@ -15,14 +15,20 @@ def evaluate_model(model, X_test):
     return pd.DataFrame(y_pred)
 
 
+def get_train_data(data_dir):
+    data_dir = Path(data_dir)
+    training_dir = data_dir / "train"
+    X_train = pd.read_csv(training_dir / "train_features.csv")
+    y_train = pd.read_csv(training_dir / "train_labels.csv")
+    return X_train, y_train
+
+
 def main(data_dir, output_dir):
     # Here, you can import info from the submission module, to evaluate the
     # submission
     from submission import get_model
 
-    training_dir = data_dir / "train"
-    X_train = pd.read_csv(training_dir / "train_features.csv")
-    y_train = pd.read_csv(training_dir / "train_labels.csv")
+    X_train, y_train = get_train_data(data_dir)
 
     print("Training the model")
 
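The extracted helper can be exercised on its own. A minimal sketch, assuming only the `train/train_features.csv` and `train/train_labels.csv` layout shown in the diff; the fixture data and column names here are made up for illustration:

```python
from pathlib import Path
import tempfile

import pandas as pd


def get_train_data(data_dir):
    # Mirror of the helper added to ingestion.py: load training
    # features and labels from <data_dir>/train/
    data_dir = Path(data_dir)
    training_dir = data_dir / "train"
    X_train = pd.read_csv(training_dir / "train_features.csv")
    y_train = pd.read_csv(training_dir / "train_labels.csv")
    return X_train, y_train


# Build a throwaway data directory with the expected layout
tmp = Path(tempfile.mkdtemp())
train_dir = tmp / "train"
train_dir.mkdir()
pd.DataFrame({"f0": [1.0, 2.0], "f1": [3.0, 4.0]}).to_csv(
    train_dir / "train_features.csv", index=False
)
pd.DataFrame({"label": [0, 1]}).to_csv(
    train_dir / "train_labels.csv", index=False
)

X_train, y_train = get_train_data(tmp)
print(X_train.shape, y_train.shape)  # (2, 2) (2, 1)
```

Note that the helper accepts a plain string as well as a `Path`, since it normalizes `data_dir` itself; this is what lets the notebook call it with `"dev_phase/input_data"`.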
scoring_program/scoring.py

Lines changed: 2 additions & 2 deletions
@@ -8,9 +8,9 @@
 
 def compute_accuracy(predictions, targets):
     # Make sure there is no NaN, as pandas ignores them in mean computation
-    predictions = predictions.fillna(-10)
+    predictions = predictions.fillna(-10).values
     # Return mean of correct predictions
-    return (predictions == targets).mean()
+    return (predictions == targets.values).mean()
 
 
 def main(reference_dir, prediction_dir, output_dir):
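The `.values` change matters because pandas refuses to compare two DataFrames whose column labels differ, and participants' prediction files have no reason to use the same column name as the reference labels. Converting to numpy arrays makes the comparison purely positional. A small sketch of the fixed function; the column names are illustrative:

```python
import numpy as np
import pandas as pd


def compute_accuracy(predictions, targets):
    # Replace NaN before comparing, as pandas ignores NaN in mean computation
    predictions = predictions.fillna(-10).values
    # Comparing numpy arrays is positional, so mismatched column labels
    # between the two frames no longer matter
    return (predictions == targets.values).mean()


preds = pd.DataFrame({"0": [1, 0, 1, np.nan]})
targets = pd.DataFrame({"label": [1, 0, 0, 1]})

# Direct DataFrame comparison fails: the column labels differ
try:
    preds == targets
except ValueError:
    print("DataFrame comparison raises ValueError")

print(compute_accuracy(preds, targets))  # 0.5
```

The NaN row is filled with the sentinel -10, so it counts as a wrong prediction instead of being silently dropped from the mean.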

solution/submission.py

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
 from sklearn.ensemble import RandomForestClassifier
 
 
+# The submission here should simply be a function that returns a model
+# compatible with the scikit-learn API
 def get_model():
     return RandomForestClassifier()
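Since the only contract is "a function returning a model compatible with the scikit-learn API", a submission can also bundle preprocessing with the classifier. A hypothetical variant, not part of the kit; the imputation step is just an example of a preprocessing stage:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


def get_model():
    # A Pipeline is still a scikit-learn-compatible model: it exposes
    # fit/predict, so the ingestion program can use it unchanged
    return make_pipeline(
        SimpleImputer(strategy="median"),
        RandomForestClassifier(random_state=0),
    )


model = get_model()
X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.0, 1.0], [3.0, np.nan]])
y = np.array([0, 1, 0, 1])
model.fit(X, y)
print(model.predict(X).shape)  # (4,)
```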

template_starting_kit.ipynb

Lines changed: 184 additions & 0 deletions
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"text-align: center;\">\n",
    "    <a href=\"https://www.hi-paris.fr/\">\n",
    "    <img border=\"0\" src=\"https://www.hi-paris.fr/wp-content/uploads/2020/09/logo-hi-paris-retina.png\" width=\"25%\"></a>\n",
    "    <a href=\"https://www.dataia.eu/\">\n",
    "    <img border=\"0\" src=\"https://github.com/ramp-kits/template-kit/raw/main/img/DATAIA-h.png\" width=\"70%\"></a>\n",
    "</div>\n",
    "\n",
    "# Template Kit for the Codabench challenge in the Datacamp\n",
    "\n",
    "<i> Thomas Moreau (Inria) </i><br/>\n",
    "<i> Pedro Rodrigues (Inria) </i>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "Describe the challenge, in particular:\n",
    "\n",
    "- Where does the data come from?\n",
    "- What is the task this challenge aims to solve?\n",
    "- Why does it matter?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploratory data analysis\n",
    "\n",
    "The goal of this section is to show what is in the data and how to play with it.\n",
    "This is the first step in any data science project, and here you should give a sense of the data the participants will be working with.\n",
    "\n",
    "You can first load and describe the data, and then show some interesting properties of it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "pd.set_option('display.max_columns', None)\n",
    "\n",
    "# Load the data\n",
    "from ingestion_program.ingestion import get_train_data\n",
    "X_df, y = get_train_data(\"dev_phase/input_data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Challenge evaluation\n",
    "\n",
    "A particularly important point in a challenge is to describe how it is evaluated. This is the section where you should describe the metric used to evaluate the participants' submissions, as well as your evaluation strategy, in particular any complexity in how the data must be split to ensure valid results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Submission format\n",
    "\n",
    "Here, you should describe the submission format, i.e. the format the participants must follow to submit their predictions on the Codabench platform."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The submission file\n",
    "\n",
    "The input data are stored in a dataframe. To go from a dataframe to a numpy array, we will use a scikit-learn column transformer. The first example we write simply consists of selecting a subset of columns we want to work with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load solution/submission.py\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "\n",
    "# The submission here should simply be a function that returns a model\n",
    "# compatible with the scikit-learn API\n",
    "def get_model():\n",
    "    return RandomForestClassifier()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Local testing pipeline\n",
    "\n",
    "Here you can show how the model will be used to generate predictions on the test set, and how the evaluation will be performed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy on test set: 0.95\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/tom/.local/miniconda/lib/python3.12/site-packages/sklearn/base.py:1363: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
      "  return fit_method(estimator, *args, **kwargs)\n"
     ]
    }
   ],
   "source": [
    "model = get_model()\n",
    "X_train, y_train = get_train_data(\"dev_phase/input_data\")\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "X_test = pd.read_csv(\"dev_phase/input_data/test/test_features.csv\")\n",
    "from ingestion_program.ingestion import evaluate_model\n",
    "y_test = evaluate_model(model, X_test)\n",
    "\n",
    "from scoring_program.scoring import compute_accuracy\n",
    "print(\"Accuracy on test set:\", compute_accuracy(y_test, pd.read_csv(\"dev_phase/input_data/test/test_labels.csv\")))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Submission\n",
    "\n",
    "To submit your code, you can refer to the actual challenge."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
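The stderr output captured in the notebook comes from passing `y_train` to `fit` as a single-column DataFrame. A sketch of the usual fix, on synthetic data with illustrative names; `DataConversionWarning` lives in `sklearn.exceptions`:

```python
import warnings

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import DataConversionWarning

X_train = pd.DataFrame({"f0": [1.0, 2.0, 3.0, 4.0], "f1": [0.0, 1.0, 0.0, 1.0]})
y_train = pd.DataFrame({"label": [0, 1, 0, 1]})  # single-column frame, shape (4, 1)

model = RandomForestClassifier(random_state=0)
with warnings.catch_warnings():
    # Turn the warning into an error to prove it is not emitted
    warnings.simplefilter("error", DataConversionWarning)
    # Flattening y to shape (n_samples,) avoids the DataConversionWarning
    model.fit(X_train, y_train.values.ravel())
print(model.predict(X_train).shape)  # (4,)
```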
