Description
Issue 1: Default prompts don't match dataset domain
The default prompts (`"A photo of Jack Sparrow"`, `"A photo of Simba"`, `"A photo of a cat"`) don't match the training dataset `nirmalendu01/spectacles-bias-prompts-headshot-captioned`, which contains structured captions like "A headshot of a person doing X". Simba is a cartoon lion, making steering meaningless. Suggested fix: replace the defaults with prompts like "A headshot of a person working in a cafe".
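A sketch of what domain-matched defaults could look like in the steering config. The key name `prompts` and the extra example prompts are assumptions for illustration, not the repo's actual schema; only the first prompt is quoted from this issue:

```yaml
# Hypothetical fragment for config/steer/run.yaml -- key name is a guess.
# Prompts follow the dataset's "A headshot of a person doing X" caption style.
prompts:
  - "A headshot of a person working in a cafe"
  - "A headshot of a person giving a presentation"   # illustrative
  - "A headshot of a person reading in a library"    # illustrative
```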
Issue 2: Undocumented layer selection
In `config/steer/run.yaml`, two specific UNet layers are hardcoded for activation collection and steering, but there is no documentation or comment explaining why these layers were chosen over others. This makes the config opaque for researchers wanting to reproduce or extend the work.
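One low-cost fix is to document the rationale in the config itself. A sketch, with placeholder layer paths (the issue does not quote the actual ones) and the kind of comment that would make the choice reproducible:

```yaml
# Hypothetical fragment for config/steer/run.yaml -- layer paths below are
# placeholders, not the repo's actual hardcoded layers.
steering_layers:
  # Document why each layer was selected, e.g. results of a layer sweep,
  # a citation, or an ablation script that can regenerate the choice.
  - unet.mid_block.attentions.0    # rationale: <fill in, e.g. strongest CAA shift in sweep>
  - unet.up_blocks.1.attentions.0  # rationale: <fill in>
```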
Issue 3: Critical mismatch — latent collection vs steering inference steps
Latents are collected with `num_inference_steps=1`, `capture_step_index=0` in `collect_latents`, meaning activations are captured from near-pure noise. However, CAA steering runs with 50 inference steps by default. Additionally, `steer_steps` from the config is never passed to `CAA.steer()`, so it is silently ignored. This mismatch means the steering direction is learned from noise-level activations but applied to semantic-level activations, severely undermining steering validity.
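A minimal sketch of one way to fix both halves of this issue: thread `steer_steps` from the config into the steering call, and fail loudly when the collection and steering step counts disagree. `CAA.steer`'s signature and the config keys here are assumptions based on the issue text, not the repo's real API:

```python
class CAA:
    """Stub standing in for the repo's CAA steering class (hypothetical API)."""

    def __init__(self):
        self.last_num_steps = None

    def steer(self, prompt, alpha, num_inference_steps=50):
        # Record the step count actually used, so a mismatch with the
        # config becomes visible instead of being silently ignored.
        self.last_num_steps = num_inference_steps
        return f"steered({prompt!r}, alpha={alpha}, steps={num_inference_steps})"


def run_steering(caa, config):
    """Pass steer_steps through explicitly and validate it against collection.

    If latents were collected under a different number of inference steps,
    the learned direction lives at a different noise level than the
    activations it would be applied to, so we refuse to proceed.
    """
    steer_steps = config["steer_steps"]
    collect_steps = config.get("collect_inference_steps", steer_steps)
    if collect_steps != steer_steps:
        raise ValueError(
            f"latents collected with {collect_steps} inference step(s) but "
            f"steering requested {steer_steps}; direction will not transfer"
        )
    return caa.steer(config["prompt"], config["alpha"],
                     num_inference_steps=steer_steps)
```

With the current defaults described above (`collect_inference_steps=1`, `steer_steps=50`), this guard would raise instead of silently steering with a noise-level direction.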
Issue 4: No shape validation in make_policy
In `CAA.steer`, `make_policy` adds `alpha * vec` directly to runtime activations `acts` with no shape assertion. This relies on implicit PyTorch broadcasting, which could silently produce incorrect results if shapes don't align as expected.
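A minimal sketch of the missing check. The `make_policy` / `acts` / `vec` names come from the issue text but the body below is an assumption about the hook's shape; NumPy stands in for PyTorch here, since both follow the same broadcasting rules:

```python
import numpy as np


def make_policy(vec, alpha):
    """Return a steering hook that adds alpha * vec to activations,
    with an explicit shape check instead of silent broadcasting.
    Sketch only -- not the repo's actual implementation.
    """
    def policy(acts):
        # Require the steering vector to match the trailing activation
        # dimensions exactly, so broadcasting cannot silently pair the
        # vector with the wrong axis.
        if vec.shape != acts.shape[-vec.ndim:]:
            raise ValueError(
                f"steering vector shape {vec.shape} does not match trailing "
                f"dims of activation shape {acts.shape}"
            )
        return acts + alpha * vec
    return policy
```

The same check ports directly to PyTorch tensors (`vec.dim()` in place of `vec.ndim`); the point is that a mis-sized vector should raise rather than broadcast into a plausible-looking but wrong result.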