Potential test case bugs in the difficult subset #2

@carlini

Hi -- very nice eval.

I'm looking at the difficult subset and it seems like there are a number of problems that are incorrectly specified or have bugs in the reference solutions. Here are a few examples I've run into when checking the first ~30 tests.

evoeval-1: The task says the input will be split by spaces into groups, but many tests have the form "(()))()(()" with no spaces, and the expected output is "(())", "()". That is wrong according to the spec: the input should be treated as a single group and, being unbalanced, dropped.
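To make the literal reading of the spec concrete, here is a minimal sketch. The function names are hypothetical and this is not the benchmark's reference solution; it only assumes, as described above, that groups are delimited by spaces and that unbalanced groups are dropped.

```python
from typing import List


def is_balanced(group: str) -> bool:
    """Return True if the parentheses in `group` are balanced."""
    depth = 0
    for ch in group:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' closed more than was opened
                return False
    return depth == 0


def balanced_groups_by_space(s: str) -> List[str]:
    """Split on whitespace only, then keep the balanced groups."""
    return [g for g in s.split() if is_balanced(g)]


# No spaces, so this is one group, and that group is unbalanced.
print(balanced_groups_by_space("(()))()(()"))  # → []
print(balanced_groups_by_space("(()) ()"))     # → ['(())', '()']
```

Under this reading, getting "(())", "()" out of "(()))()(()" would require splitting on something other than spaces.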

evoeval-14: For the example all_prefix_suffix_pairs('abcadg', 2), the task doesn't specify why ('abc', 'adg') isn't a valid pair. Both parts have length >= 2 and they don't overlap.

evoeval-22: Several test cases are wrong. On the input [True, False, None, 0, 1, 2] the expected output is [false, 0, true, 1, 2], but false and true are booleans, not integers.
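One plausible cause, offered as a guess since the reference solution isn't shown here: in Python, bool is a subclass of int, so a plain isinstance(x, int) check lets True and False through. The sketch below contrasts that with a stricter filter; the variable names are hypothetical.

```python
values = [True, False, None, 0, 1, 2]

# Lenient check: bool is a subclass of int, so True/False pass.
lenient = [v for v in values if isinstance(v, int)]

# Strict check: explicitly exclude booleans.
strict = [v for v in values if isinstance(v, int) and not isinstance(v, bool)]

print(lenient)          # → [True, False, 0, 1, 2]
print(sorted(lenient))  # → [False, 0, True, 1, 2]
print(strict)           # → [0, 1, 2]
```

Sorting the lenient result reproduces the order of the expected output above, which is consistent with (but doesn't prove) a filter-then-sort reference using the lenient check.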

evoeval-23: The task says that whitespace should be removed, but the test case '\t\n', True, False has an expected value of 2, even though '\t\n' consists entirely of whitespace.
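A quick check of what "whitespace removed" would give for this input. The helper name is hypothetical, and the two boolean flags from the test case are not modeled because their meaning isn't stated here.

```python
def length_without_whitespace(s: str) -> int:
    """Length of `s` after dropping every whitespace character."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces removes tabs, newlines, and spaces alike.
    return len("".join(s.split()))


print("\t\n".isspace())                   # → True
print(length_without_whitespace("\t\n"))  # → 0
print(len("\t\n"))                        # → 2
```

The expected value of 2 matches the raw length, not the length after whitespace removal.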

evoeval-3: The task doesn't clearly state that there is one transaction per day. A valid interpretation is that all transactions happen on a single day, in which case either exceeding the daily limit (in sum total) or going negative would be invalid.

evoeval-32: The polynomial [10, -15, 56, -40] does cross zero on the input range [-10, 10]. The test case nevertheless asserts that the answer is false, because the two endpoints have the same sign. The task description technically makes this the correct output, but I'd argue the description should be changed.

I'm curious what process was used to generate and filter these test cases? What do you think is the highest achievable accuracy on this dataset? (From my quick scanning it looks like maybe 70-80% would correspond to a saturated dataset.)
