Python_Fossee-task/Task-1 at main · Eternity2401/Python_Fossee-task · GitHub

The Yaksh system is an AI-powered coding tutoring assistant, developed and maintained by FOSSEE
at IIT Bombay. Yaksh is widely used across Indian educational institutions for coding.
When students submit incorrect code, PyPal generates natural-language explanations intended to
guide them toward the correct solution without directly revealing it.
The effectiveness of such AI-generated feedback is not well understood. While large language models
(LLMs) have demonstrated strong performance on code generation and debugging benchmarks such
as HumanEval [7] and SWE-bench, the question of whether they can teach (that is, guide students
pedagogically rather than simply solving problems for them) remains largely open.
1.2 Problem Statement
The central problem addressed in this work is threefold:
1. Data Quality: What are the characteristics, distributions, and anomalies present in the PyPal
interaction dataset, and what do they reveal about student behaviour and AI response quality?
2. Pedagogical Effectiveness: How well do AI-generated responses score on established
pedagogical criteria when evaluated by human experts?
3. Classification Reliability: Can LLMs reliably classify student errors using a structured
taxonomy, and how do their classifications compare to human judgment?
1.3 Objectives
The specific objectives of this internship were:
● Conduct a comprehensive exploratory data analysis of the full 14,408-entry PyPal dataset.
● Participate in the manual pedagogical evaluation of 600 AI responses using a structured
rubric.
● Perform a cross-verification study comparing LLM error classifications against human labels
on a curated 100-question subset.
● Review state-of-the-art benchmarks for evaluating LLMs as pedagogical agents.
● Estimate the cost of replicating benchmark methodologies on the PyPal dataset.


1.4 Paper Organisation
The remainder of this report is organised as follows. Section 2 describes the PyPal dataset. Section 3
presents the exploratory data analysis. Section 4 covers the manual pedagogical evaluation of 600 AI
responses. Section 5 describes the error taxonomy and LLM classification pipeline. Section 6 details
the cross-verification study comparing human and LLM labels. Section 7 reviews related benchmarks
and provides a cost analysis for replication. Section 8 concludes with a summary of findings and
directions for future work.
2. Dataset Description
2.1 The PyPal Dataset
The primary dataset comprises 14,408 student-AI interaction records collected from the Yaksh
platform at IIT Bombay. Each record captures a student's incorrect Python code submission and the
corresponding AI-generated explanation.
Parameter            Value
Total Records        14,408
Language             Python
Source Platform      Yaksh, IIT Bombay
Concept Categories   6
Question Types       11
Key Columns          Question, Student Code, Error Type, AI Explanation, Concept, and others
2.2 The 100-Question Subset
A curated subset of 100 representative questions was later selected from the full dataset for the
cross-verification study (Section 6). This subset was chosen because running multiple LLM APIs on
all 14,408 entries would have been prohibitively expensive. The 100 questions were selected to be
representative of the error distribution and concept coverage in the full dataset.
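The report does not describe how the 100 questions were drawn. One common way to keep a subset representative of an error distribution is proportional stratified sampling, sketched below with the standard library only. The function name `stratified_sample` and the record field `error_type` are hypothetical, chosen for illustration, and are not taken from the PyPal dataset schema.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, n, seed=0):
    """Draw n records so that each stratum keeps roughly its share of the full data.

    `records` is a list of dicts and `key` names the field to stratify on
    (both are assumptions made for this sketch).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)

    total = len(records)
    # Largest-remainder allocation so the per-stratum quotas add up to exactly n.
    exact = {k: n * len(rows) / total for k, rows in strata.items()}
    quota = {k: int(x) for k, x in exact.items()}
    leftover = n - sum(quota.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - quota[k], reverse=True)
    for k in by_remainder[:leftover]:
        quota[k] += 1

    sample = []
    for k, rows in strata.items():
        # min() guards against a quota larger than the stratum itself.
        sample.extend(rng.sample(rows, min(quota[k], len(rows))))
    return sample
```

Stratifying on a single field is the simplest case; covering both error distribution and concept coverage, as the text describes, could be approximated by stratifying on a combined key, at the cost of some very small strata.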
3. Exploratory Data Analysis
A comprehensive EDA was conducted on the full 14,408-entry dataset using Python (Pandas, NumPy,
Matplotlib, Seaborn) in Google Colab. The analysis covered six dimensions: error type distribution,
AI response quality, friendliness scoring, readability, error clustering, and category imbalance.
3.1 Error Type Distribution
Errors were classified into two broad categories: logic-based failures (where the code runs but produces
incorrect output) and code-crashing errors (where the code raises a runtime exception).
Error Class Distribution (Total Records: 14,408)
This table classifies errors into two broad categories:
Error Class                                 Count    Percentage
Logic Fails (incorrect output)              9,141    63.4%
Code-Crashing Errors (runtime exceptions)   5,267    36.6%
Total                                       14,408   100%
Breakdown of Code-Crashing Errors (Total: 5,267)
This table provides the breakdown of the 5,267 code-crashing errors by exception type:
Error Type          Count   % of Code-Crashing   Primary Root Cause
TypeError           2,544   48.3%                Incorrect parameter handling, type mismatches
NameError           1,247   23.7%                Undefined variables, typos in variable names
ZeroDivisionError   607     11.5%                Missing edge-case checks for division
UnboundLocalError   390     7.4%                 Variable referenced before assignment
AttributeError      210     4.0%                 Accessing non-existent attributes
EOFError            88      1.7%                 Input handling issues
IndexError          81      1.5%                 List/sequence index out of range
ValueError          70      1.3%                 Invalid value for a given operation
KeyError            16      0.3%                 Dictionary key access errors
Others              14      0.3%                 RecursionError, OverflowError, etc.
However, when viewed by overall frequency (including logic-based "Test Case Failures"), the
distribution changes significantly. Test Case Failure is the single most common error type in the
dataset, occurring over 8,000 times -- more than three times as frequent as the next most common
type (TypeError at ~2,500).
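The percentages in the two tables of this section can be re-derived from the reported counts. The short check below is a sketch for that purpose only; the counts are copied from the tables, while the helper name `shares` is illustrative.

```python
# Counts of code-crashing errors, copied from the breakdown table in Section 3.1.
crash_counts = {
    "TypeError": 2544,
    "NameError": 1247,
    "ZeroDivisionError": 607,
    "UnboundLocalError": 390,
    "AttributeError": 210,
    "EOFError": 88,
    "IndexError": 81,
    "ValueError": 70,
    "KeyError": 16,
    "Others": 14,
}


def shares(counts):
    """Return each count as a percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


crash_total = sum(crash_counts.values())
class_counts = {"Logic Fails": 9141, "Code-Crashing Errors": crash_total}

crash_shares = shares(crash_counts)   # share of each exception type among crashes
class_shares = shares(class_counts)   # share of each error class in the full dataset
```

Running a check like this alongside the tables makes it easy to confirm that the per-exception counts sum to the code-crashing total and that the two class counts sum to the full dataset size.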
3.2 TypeError Anomaly
A striking finding was that 92.9% of all TypeErrors in the dataset correspond to the sub-category
"Function Argument Mismatch". Drilling further into this sub-category reveals that 100% of these
cases produce the identical message: "main() takes 0 positional arguments but 1 was given". This is not
a genuine student mistake but a system-level issue in the Yaksh platform's function invocation
mechanism.
TypeError Sub-Category       Percentage   Count (approx.)
Function Argument Mismatch   92.9%        ~2,363
String to Int Conversion     ~0.9%        ~23
Int to String Conversion     ~0.7%        ~18
Other Type Error             ~5.5%        ~140
Figure 2: TypeError Sub-categories Distribution (pie chart) and Distribution of Type Error Sub-categories (bar chart).
This anomaly inflates the apparent error rate and should be addressed at the platform level rather
than through AI tutoring.
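Until the platform issue is fixed, one option is to set these records aside before analysing genuine student errors. The sketch below separates records by the exact message quoted above. The record layout and the field name `error_message` are hypothetical; only the message string comes from the report.

```python
# Message reported for every "Function Argument Mismatch" case in Section 3.2.
PLATFORM_MESSAGE = "main() takes 0 positional arguments but 1 was given"


def split_platform_errors(records, message_field="error_message"):
    """Split records into (platform-level, genuine) based on the error message.

    `message_field` is an assumed column name, not the dataset's actual schema.
    """
    platform, genuine = [], []
    for rec in records:
        text = rec.get(message_field) or ""  # treat missing/None messages as empty
        if PLATFORM_MESSAGE in text:
            platform.append(rec)
        else:
            genuine.append(rec)
    return platform, genuine
```

Using the approximate counts in the sub-category table, setting aside the ~2,363 Function Argument Mismatch records would leave roughly 181 of the 2,544 TypeErrors attributable to student code.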
3.3 AI Response Quality (Semantic Similarity)
The semantic similarity between AI-generated explanations and reference solutions was computed to
assess how well the AI understood each error.
Metric                              Value
Average Semantic Similarity Score   0.216 (on a 0–1 scale)
Correct Interpretation Rate         ~40%
Test Failure Category Match         2.91% (lowest of all categories)
With an average semantic similarity of 0.216, the automatically generated explanations often fail to
address student code errors accurately, and quality varies considerably by error type. The "Test Failure"
category performed worst, with a match rate of only 2.91%, suggesting a significant blind spot.
Notably, even the best-performing category (KeyError at 33.52%) reaches only about one-third semantic
alignment, confirming that low response quality is systemic rather than confined to a single error type.
A scatter plot of response length versus semantic match further confirms that longer responses do
not correlate with better quality. The bulk of responses cluster between 50--200 words with semantic
match scores between 0.05--0.45, and no positive trend is visible.
Figure 3: AI Interpretation by Error Type (table and bar chart).
Figure 4: Response Length vs Semantic Match (scatter plot).
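The observation that longer responses do not score higher is based on the scatter plot; the report does not state whether a correlation statistic was also computed. As a sketch of how the visual impression could be quantified, the function below computes Pearson's r between response length and semantic match. The choice of Pearson's r is an assumption, and the sample values are made up for illustration and are not taken from the PyPal dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values, NOT taken from the PyPal dataset.
word_counts = [60, 90, 120, 150, 180]
match_scores = [0.30, 0.10, 0.25, 0.15, 0.20]
r = pearson_r(word_counts, match_scores)
```

A coefficient near zero or below it would be consistent with the absence of a positive trend; a rank-based statistic would be a reasonable alternative if the relationship is suspected to be non-linear.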
3.4 Friendliness Scoring
AI responses were scored for tone and approachability on a scale of 1 (Very Unfriendly) to 5 (Very
Friendly).
Rating            Percentage
Very Unfriendly   34.0%
Neutral           58.3%
Friendly          7.6%
Very Friendly     0.2%
The average friendliness score of 2.21 out of 5 indicates that the majority of AI responses are either
neutral or actively unfriendly in tone. Over a third (34%) were rated "Very Unfriendly", while only
7.8% were rated Friendly or above. The histogram of scores shows a roughly normal distribution
centred around the mean of 2.21, with a long tail toward higher friendliness scores that very few
responses reach. For a platform serving beginner programmers, this is a significant concern, as
discouraging feedback can negatively impact student motivation and learning outcomes.
Figure 5: Distribution of Friendliness Scores (histogram with mean line at 2.21).
3.5 Readability Analysis
Flesch-Kincaid readability analysis was applied to assess whether AI explanations are pitched at an
appropriate level for the target audience (beginner Python programmers).
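The report does not say which Flesch-Kincaid variant or which library was used, nor which thresholds define the Easy, Medium, and Hard bands. As a sketch, the two standard formulas (Flesch Reading Ease and Flesch-Kincaid Grade Level) are shown below. The syllable counter is a crude vowel-group heuristic and the tokenisation is a simple regex, both simplifications made for this example; band thresholds are deliberately left out because the report does not state them.

```python
import re


def count_syllables(word):
    """Crude estimate: count runs of consecutive vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_stats(text):
    """Return (sentence_count, word_count, syllable_count) via simple regexes."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return sentences, len(words), syllables


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences, words, syllables = text_stats(text)
    if words == 0:
        raise ValueError("text contains no words")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level: approximate US school grade needed to read the text."""
    sentences, words, syllables = text_stats(text)
    if words == 0:
        raise ValueError("text contains no words")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

Both scores depend only on average sentence length and average syllables per word, so explanations with long sentences or many multi-syllable technical terms will be rated harder regardless of how clear the reasoning is.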
Readability Level                      Percentage
Easy (accessible to beginners)         28%
Medium (general audience)              68%
Hard (advanced vocabulary/structure)   3.4%
Readability Distribution by Concept
Concept 6 has the highest proportion of "Easy" responses (41.93%), while Concept 1 has the lowest
(21.14%) coupled with the highest "Hard" percentage (5.25%). Given that the target audience consists
of introductory programming students, a higher proportion of easy-to-read responses would be
desirable -- particularly for Concept 1, which is also the most frequently occurring concept in the
dataset.
Concept     Easy %   Hard %   Medium %
Concept 1   21.14    5.25     73.60
Concept 2   29.38    3.02     67.60
Concept 3   37.55    2.05     60.40
Concept 4   24.83    2.24     72.93
Concept 5   27.32    3.66     69.02
Concept 6   41.93    1.50     56.57
3.6 Error Clustering
Analysis of error co-occurrence patterns among students revealed common combinations:
Error Combination                                           Number of Students
NameError + TypeError (top pair)                            275
NameError + TypeError + ZeroDivisionError (top triplet)     153
These clusters suggest that students who make one type of fundamental mistake often make related
mistakes as well, indicating common underlying misconceptions (such as confusion about variable scope and type handling).
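The report does not describe how the co-occurrence counts were produced. One straightforward approach, sketched below, is to collect the distinct error types seen per student and count every pair or triplet across students. The mapping `errors_by_student` and the student identifiers are hypothetical, and the sample data is made up for illustration.

```python
from collections import Counter
from itertools import combinations


def top_error_combinations(errors_by_student, size, k=1):
    """Count how many students exhibit each combination of `size` distinct error types.

    Returns the k most common combinations as (combination, student_count) pairs.
    """
    counts = Counter()
    for errors in errors_by_student.values():
        # Sort the distinct types so the same combination always has the same key.
        distinct = sorted(set(errors))
        counts.update(combinations(distinct, size))
    return counts.most_common(k)


# Made-up illustrative data, NOT taken from the PyPal dataset.
errors_by_student = {
    "s1": ["NameError", "TypeError", "NameError"],
    "s2": ["TypeError", "NameError", "ZeroDivisionError"],
    "s3": ["IndexError"],
}
top_pair = top_error_combinations(errors_by_student, size=2)
top_triplet = top_error_combinations(errors_by_student, size=3)
```

Counting each student once per combination, rather than once per submission, keeps a single prolific student from dominating the totals. Given the TypeError anomaly in Section 3.2, it may also be worth checking whether pairs involving TypeError persist once platform-level errors are excluded.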
3.7 Category Distribution and Imbalance
The dataset exhibits significant imbalance across concept categories; the full distribution is
shown in Figure 7.
The dataset is dominated by list and basic-operation errors, which may lead AI models to
perform poorly on less common categories such as conditional logic and unit conversion.
Furthermore, Data Structures (Dictionary and List) problems generally have the longest student
submissions (~350–450 characters), indicating higher complexity, whereas Unit Conversion and
Conditional Logic problems are shorter and simpler.
Figure 7: Distribution of Problems by Category (bar chart).
4. Manual Pedagogical Evaluation
4.1 Methodology
A team of three evaluators (the author, Yuvraj Pandey, and Tanishi) independently assessed approximately
600 AI-generated responses (roughly 200 per evaluator). Each response was rated on a five-dimension
rubric:
Dimension                            Description                                                                              Scale
Conceptual Accuracy                  Is the explanation factually correct about the underlying concept or algorithm?          0–4
Clarity and Structure                Is the explanation clearly written and logically organised?                              0–4
Helpful Guidance / Hint Quality      Does the response teach *how to think* about the problem without giving the solution?    0–4
Reasoning Depth and Intuition        Does the response explain *why* the approach works, not just *what* to do?               0–4
Edge Cases / Constraints Awareness   Does the response mention important pitfalls, boundary conditions, or constraints?       0–4
Total Score Interpretation Scale
The maximum possible score was 20 (5 dimensions × 4 points each). The scores were interpreted as follows:
Total Score   Interpretation
18–20         Excellent teaching explanation
14–17         Good
10–13         Adequate
5–9           Weak
0–4           Poor or unusable
4.2 Observations
The manual evaluation of 600 AI responses revealed that while many responses are factually accurate,
they frequently fall short on pedagogical dimensions, particularly Helpful Guidance / Hint Quality
and Reasoning Depth. A common pattern was responses that correctly identified the error but then
provided the solution directly, rather than guiding the student to discover it independently.
This pattern mirrors the "over-helpfulness" problem identified by GuideLM [3], where base models
(without pedagogical fine-tuning) tend to act as solvers rather than tutors.