ds4-eval: answer-letter extractor mis-grades a distractor stated before the chosen option (follow-up to #319)

Follow-up to #319, which fixes the common multiple-choice false negatives where a leading pronoun/article shadows the chosen letter (`"Answer: I think it is C"` was graded `I`).

A narrower residual case is intentionally left out of #319, because every naive fix trades one failure mode for another. Filing it separately for discussion.

## Problem

`find_answer_letter` returns the **first** boundary-isolated, in-range capital letter on the answer line. When the model states a rejected distractor *before* its actual pick, the extractor grabs the distractor:

```
"Answer: It is not B, the answer is D"   ->  B   (should be D)
"Answer: rules out C, leaving D"         ->  C   (should be D)
```

The extractor has no notion of negation (`not B`, `rules out C`), so the first valid letter wins. This is less reachable than the cases fixed in #319 (models usually emit a bare or leading letter), but it is a real false negative on 4–10-choice cases when the model answers by elimination.
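For discussion, here is a minimal Python sketch of the first-match behavior described above. It is not the actual `find_answer_letter` code: the function name `first_in_range_letter`, the `n_choices` parameter, and the regex are assumptions chosen only to reproduce the two failing lines.

```python
import re
from typing import Optional


def first_in_range_letter(line: str, n_choices: int) -> Optional[str]:
    """Hypothetical stand-in for the extractor: return the first capital
    letter in A..(A + n_choices - 1) that is not part of a longer word."""
    last = chr(ord("A") + n_choices - 1)
    # "Boundary-isolated" is approximated here as: no ASCII letter on either side.
    pattern = rf"(?<![A-Za-z])[A-{last}](?![A-Za-z])"
    match = re.search(pattern, line)
    return match.group(0) if match else None


# Both lines pick up the rejected distractor rather than the chosen D.
print(first_in_range_letter("Answer: It is not B, the answer is D", 4))
print(first_in_range_letter("Answer: rules out C, leaving D", 4))
```

Under these assumptions the sketch returns `B` and `C` for the two lines, matching the mis-grades reported above.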

## Why it isn't in #319

Each obvious fix has a counter-example:

- **Take the last in-range letter on the line** instead of the first → fixes `not B ... D` but regresses `"Answer: D, not B"` and any line that ends on a distractor.
- **Skip a letter preceded by a negation cue** (`not`, `except`, `rules out`, `isn't`) → brittle, English-specific, easy to fool.
- **Tighten the prompt** to force a bare-letter final line → already the instruction; it doesn't bind non-compliant outputs.
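The first two trade-offs can be made concrete with a rough Python sketch. Everything here is illustrative rather than taken from the ds4-eval source: `last_letter`, `skip_negated`, the fixed `A`-`D` range, and the line `"Answer: B is wrong, so D"` are assumptions added to show one way each heuristic can fail.

```python
import re

# Assumed 4-choice range; a letter counts only if no ASCII letter touches it.
LETTER = re.compile(r"(?<![A-Za-z])[A-D](?![A-Za-z])")
# Negation cues from the list above, required to sit directly before the letter.
NEGATION = re.compile(r"\b(?:not|except|rules out|isn't)\s+$", re.IGNORECASE)


def last_letter(line):
    """Heuristic 1: take the last in-range letter on the line."""
    hits = LETTER.findall(line)
    return hits[-1] if hits else None


def skip_negated(line):
    """Heuristic 2: take the first letter not directly preceded by a negation cue."""
    for match in LETTER.finditer(line):
        if not NEGATION.search(line[:match.start()]):
            return match.group(0)
    return None


for line in (
    "Answer: It is not B, the answer is D",  # original failure
    "Answer: D, not B",                      # line that ends on a distractor
    "Answer: B is wrong, so D",              # rejection without a listed cue
):
    print(repr(line), last_letter(line), skip_negated(line))
```

In this sketch `last_letter` fixes the first line but returns `B` for `"Answer: D, not B"`, while `skip_negated` handles both of those and still returns `B` for `"Answer: B is wrong, so D"`, because the rejection is phrased without any cue from the list.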

Robustly deciding "which stated letter is the actual choice" needs selection-vs-rejection (sentence-level) understanding, which is beyond the current lexical extractor.

## Notes

Found during the same grader audit behind #319. A `--self-test-extractors` case is ready to lock in whichever resolution you prefer; I have left it out for now so the suite stays green until the approach is decided. Happy to implement any of the above (or another direction).

