Hi -- very nice eval.
I'm looking at the difficult subset and it seems like there are a number of problems that are incorrectly specified or have bugs in the reference solutions. Here are a few examples I've run into when checking the first ~30 tests.
evoeval-1: The task says inputs will be split by spaces into groups, but many tests have the form "(()))()(()" with the expected output "(())", "()". That is wrong according to the spec: the input contains no spaces, so it should be treated as one group and dropped.
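To make the evoeval-1 reading concrete, here is a minimal sketch of how I interpret the spec. It assumes the task keeps only balanced groups after splitting on whitespace; `is_balanced` and `balanced_groups` are hypothetical names of mine, not the reference solution.

```python
def is_balanced(group: str) -> bool:
    """Return True if the parentheses in `group` are balanced."""
    depth = 0
    for ch in group:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closing paren with no matching opener
                return False
    return depth == 0


def balanced_groups(text: str) -> list[str]:
    """Split on whitespace (my reading of the spec) and keep balanced groups."""
    return [g for g in text.split() if is_balanced(g)]


# No spaces, so this is a single group, and it is unbalanced.
print(balanced_groups("(()))()(()"))   # []
# With explicit spaces the groups are separated as the spec describes.
print(balanced_groups("(()) () (()"))  # ['(())', '()']
```

Under this reading the space-free input yields an empty result, not "(())", "()".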
evoeval-14: For the example case all_prefix_suffix_pairs('abcadg', 2), the task doesn't specify why ('abc', 'adg') shouldn't be valid. That pair also has length >= 2 and is non-overlapping.
evoeval-22: Several test cases are wrong. On the input [True, False, None, 0, 1, 2], the expected output is [false, 0, true, 1, 2], but false and true are not integers.
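One possible explanation for the evoeval-22 output (a guess, I haven't confirmed it against the reference solution): in Python `bool` is a subclass of `int`, so an `isinstance(x, int)` filter lets `True` and `False` through. The sketch below assumes the task filters integers and sorts them, and compares that check with a strict `type(x) is int` check.

```python
values = [True, False, None, 0, 1, 2]

# isinstance accepts bools, because bool is a subclass of int.
loose = sorted(x for x in values if isinstance(x, int))

# An exact type check rejects bools.
strict = sorted(x for x in values if type(x) is int)

print(loose)   # [False, 0, True, 1, 2]
print(strict)  # [0, 1, 2]
```

The loose filter reproduces the output in the test case; the strict one gives what I would expect from the task description.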
evoeval-23: The task says that whitespace should be removed, but the test case '\t\n', True, False expects 2, even though '\t\n' consists entirely of whitespace.
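A quick check of the evoeval-23 input, assuming "whitespace" means what Python's `str.isspace` and `str.split` treat as whitespace (the task may define it differently):

```python
s = "\t\n"

# Both characters count as whitespace in Python.
print(s.isspace())  # True

# Removing all whitespace leaves nothing behind.
stripped = "".join(s.split())
print(len(stripped))  # 0
```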
evoeval-3: The test case doesn't clearly state that there is one transaction per day. A valid interpretation is that all transactions happen on the same day, in which case either exceeding the daily limit (in total) or going negative is invalid.
evoeval-32: The polynomial [10, -15, 56, -40] does cross zero on the input range [-10, 10]; however, the test case asserts the answer is false because the two endpoints have the same sign. The task technically says this is the correct output, but I'd argue the task description should be changed.
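To illustrate the general problem with an endpoint-only test, here is a small sketch using a different, deliberately simple polynomial (x^2 - 1, not the one from the test case). It assumes coefficients are listed from the constant term upward; the function names are hypothetical, and the grid scan is only a heuristic that can miss roots between samples.

```python
def evaluate(coeffs: list[float], x: float) -> float:
    """Evaluate a polynomial (lowest-degree coefficient first) via Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def endpoints_differ_in_sign(coeffs: list[float], a: float, b: float) -> bool:
    """The endpoint-only test: True only if p(a) and p(b) have opposite signs."""
    return evaluate(coeffs, a) * evaluate(coeffs, b) < 0


def crosses_zero_on_grid(coeffs: list[float], a: float, b: float, steps: int = 1000) -> bool:
    """Heuristic scan: look for a zero or a sign change between adjacent samples."""
    samples = [evaluate(coeffs, a + (b - a) * i / steps) for i in range(steps + 1)]
    if any(v == 0 for v in samples):
        return True
    return any(u * v < 0 for u, v in zip(samples, samples[1:]))


# x^2 - 1 has roots at -1 and 1, but p(-2) = p(2) = 3.
coeffs = [-1.0, 0.0, 1.0]
print(endpoints_differ_in_sign(coeffs, -2.0, 2.0))  # False
print(crosses_zero_on_grid(coeffs, -2.0, 2.0))      # True
```

The endpoint test reports no root even though the polynomial crosses zero twice inside the interval, which is the kind of mismatch I think the task description invites.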
I'm curious what process was used to generate and filter these test cases. What do you think is the highest achievable accuracy on this dataset? (From my quick scan, it looks like 70-80% might correspond to a saturated dataset.)