Description
Issue 1: Default prompts don't match dataset domain
The default prompts (`"A photo of Jack Sparrow"`, `"A photo of Simba"`, `"A photo of a cat"`) don't match the training dataset `nirmalendu01/spectacles-bias-prompts-headshot-captioned`, which contains structured captions like "A headshot of a person doing X". Simba is a cartoon lion, making steering meaningless. Suggested fix: replace the defaults with prompts like "A headshot of a person working in a cafe".
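A sketch of what domain-matched defaults could look like in the steering config. The key name `prompts` and the extra example prompts are assumptions for illustration, not the repo's actual schema; only the first prompt is quoted from this issue:

```yaml
# Hypothetical fragment for config/steer/run.yaml -- key name is a guess.
# Prompts follow the dataset's "A headshot of a person doing X" caption style.
prompts:
  - "A headshot of a person working in a cafe"
  - "A headshot of a person giving a presentation"   # illustrative
  - "A headshot of a person reading in a library"    # illustrative
```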
Issue 2: Undocumented layer selection
In `config/steer/run.yaml`, two specific UNet layers are hardcoded for activation collection and steering, but there is no documentation or comment explaining why these layers were chosen over others. This makes the config opaque for researchers wanting to reproduce or extend the work.
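One low-cost fix is to document the rationale in the config itself. A sketch, with placeholder layer paths (the issue does not quote the actual ones) and the kind of comment that would make the choice reproducible:

```yaml
# Hypothetical fragment for config/steer/run.yaml -- layer paths below are
# placeholders, not the repo's actual hardcoded layers.
steering_layers:
  # Document why each layer was selected, e.g. results of a layer sweep,
  # a citation, or an ablation script that can regenerate the choice.
  - unet.mid_block.attentions.0    # rationale: <fill in, e.g. strongest CAA shift in sweep>
  - unet.up_blocks.1.attentions.0  # rationale: <fill in>
```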
Issue 3: Critical mismatch — latent collection vs steering inference steps
Latents are collected with `num_inference_steps=1`, `capture_step_index=0` in `collect_latents`, meaning activations are captured from near-pure noise. However, CAA steering runs with 50 inference steps by default. Additionally, `steer_steps` from the config is never passed to `CAA.steer()`, so it is silently ignored. This mismatch means the steering direction is learned from noise-level activations but applied to semantic-level activations, severely undermining steering validity.
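A minimal sketch of one way to fix both halves of this issue: thread `steer_steps` from the config into the steering call, and fail loudly when the collection and steering step counts disagree. `CAA.steer`'s signature and the config keys here are assumptions based on the issue text, not the repo's real API:

```python
class CAA:
    """Stub standing in for the repo's CAA steering class (hypothetical API)."""

    def __init__(self):
        self.last_num_steps = None

    def steer(self, prompt, alpha, num_inference_steps=50):
        # Record the step count actually used, so a mismatch with the
        # config becomes visible instead of being silently ignored.
        self.last_num_steps = num_inference_steps
        return f"steered({prompt!r}, alpha={alpha}, steps={num_inference_steps})"


def run_steering(caa, config):
    """Pass steer_steps through explicitly and validate it against collection.

    If latents were collected under a different number of inference steps,
    the learned direction lives at a different noise level than the
    activations it would be applied to, so we refuse to proceed.
    """
    steer_steps = config["steer_steps"]
    collect_steps = config.get("collect_inference_steps", steer_steps)
    if collect_steps != steer_steps:
        raise ValueError(
            f"latents collected with {collect_steps} inference step(s) but "
            f"steering requested {steer_steps}; direction will not transfer"
        )
    return caa.steer(config["prompt"], config["alpha"],
                     num_inference_steps=steer_steps)
```

With the current defaults described above (`collect_inference_steps=1`, `steer_steps=50`), this guard would raise instead of silently steering with a noise-level direction.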
Issue 4: No shape validation in make_policy
In `CAA.steer`, `make_policy` adds `alpha * vec` directly to runtime activations `acts` with no shape assertion. This relies on implicit PyTorch broadcasting, which could silently produce incorrect results if shapes don't align as expected.
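A minimal sketch of the missing check. The `make_policy` / `acts` / `vec` names come from the issue text but the body below is an assumption about the hook's shape; NumPy stands in for PyTorch here, since both follow the same broadcasting rules:

```python
import numpy as np


def make_policy(vec, alpha):
    """Return a steering hook that adds alpha * vec to activations,
    with an explicit shape check instead of silent broadcasting.
    Sketch only -- not the repo's actual implementation.
    """
    def policy(acts):
        # Require the steering vector to match the trailing activation
        # dimensions exactly, so broadcasting cannot silently pair the
        # vector with the wrong axis.
        if vec.shape != acts.shape[-vec.ndim:]:
            raise ValueError(
                f"steering vector shape {vec.shape} does not match trailing "
                f"dims of activation shape {acts.shape}"
            )
        return acts + alpha * vec
    return policy
```

The same check ports directly to PyTorch tensors (`vec.dim()` in place of `vec.ndim`); the point is that a mis-sized vector should raise rather than broadcast into a plausible-looking but wrong result.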