Bug: padding_side not set for tokenizers with existing pad_token, causing empty responses
## Problem

When a tokenizer already has a `pad_token` defined (e.g., `allenai/Olmo-3-7B-Instruct`), `padding_side` is not set to `"left"`, causing empty responses during batched generation.

Decoder-only models require left-padding for generation. With right-padding (the default), the model sees `[PROMPT] [PAD] [PAD]` and stops generating immediately, returning empty strings.
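For illustration, here is a minimal, self-contained sketch of the failure mode (the model name is taken from this report; the prompts and decoding are illustrative, not the project's code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "allenai/Olmo-3-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# This tokenizer already ships a pad_token, so the repo's `if pad_token is None`
# branch never runs and padding_side stays at its default ("right").
prompts = [
    "Short prompt",
    "A much longer prompt that forces the short one to be padded in the batch",
]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

# With right padding, the shorter sequence ends in PAD tokens, so generation
# starts after those PADs and the model tends to emit EOS immediately.
# With tokenizer.padding_side = "left", the PADs precede the prompt and
# generation continues from the real prompt tokens as intended.
outputs = model.generate(**batch, max_new_tokens=64)
new_tokens = outputs[:, batch["input_ids"].shape[1]:]
print(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))
```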
## Impact
- Empty responses are counted as "non-refusals" in `is_refusal()` (no markers found → success); see the sketch after this list
- Results in falsely low refusal rates (e.g., 5/100 instead of 90/100)
- Only affects `batch_size > 1` (no padding is needed for single prompts): `batch_size=1` shows correct results, larger batches show near-zero refusals
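A minimal sketch of why an empty string slips through the marker check (the marker list and loop are illustrative; the repo's actual `is_refusal()` may differ):

```python
REFUSAL_MARKERS = ["i can't", "i cannot", "i won't", "as an ai"]  # illustrative markers

def is_refusal(response: str) -> bool:
    # No marker found -> treated as a successful (non-refusing) response.
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

print(is_refusal(""))  # False: an empty (failed) generation is scored as a success
```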
## Reproduction

Test with any model that ships its own `pad_token` (i.e., does not rely on the EOS fallback):

```python
# batch_size=1:   correct refusal count (e.g., 10/10)
# batch_size=128: near-zero refusals; all empty responses counted as "success"
```

Tested with `allenai/Olmo-3-7B-Instruct` and `squ11z1/Hypnos-i1-8B`.
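A quick sanity check (sketch) to confirm a model falls into this case, i.e. its tokenizer ships a `pad_token` so the EOS-fallback branch is skipped:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/Olmo-3-7B-Instruct")
print(tok.pad_token)     # already set -> the `if pad_token is None` branch never runs
print(tok.padding_side)  # typically "right" (the default), which triggers the bug
```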
## Current Code (model.py:50-52)

```python
if self.tokenizer.pad_token is None:
    self.tokenizer.pad_token = self.tokenizer.eos_token
    self.tokenizer.padding_side = "left"  # ← Only set when pad_token is None!
```
## Suggested Fix

```python
if self.tokenizer.pad_token is None:
    self.tokenizer.pad_token = self.tokenizer.eos_token

# CRITICAL: Always use left padding for decoder-only models during generation.
# Right padding causes empty outputs because the model sees PAD tokens after
# the prompt and thinks the sequence is complete.
self.tokenizer.padding_side = "left"
```
## Optional: Handle empty responses in evaluation

Additionally, `is_refusal()` in `evaluator.py` could treat empty or very short responses as failures rather than successes:

```python
def is_refusal(self, response: str) -> bool:
    # Empty response = generation failure, not successful uncensoring
    if len(response.strip()) < 5:
        return True
    # ... rest of marker checking
```
This provides defense-in-depth against other potential generation failures.
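Usage sketch (assuming an `evaluator` instance with the guard above; the names are illustrative, not the repo's API):

```python
assert evaluator.is_refusal("") is True       # empty response -> counted as a failure
assert evaluator.is_refusal("  \n ") is True  # whitespace-only -> counted as a failure
```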