
Issues on steering #4

@madhuri723

Description


Issue 1: Default prompts don't match dataset domain

The default prompts ("A photo of Jack Sparrow", "A photo of Simba", "A photo of a cat") don't match the training dataset nirmalendu01/spectacles-bias-prompts-headshot-captioned, which contains structured captions like "A headshot of a person doing X". Simba is a cartoon lion, so a steering direction derived from realistic headshot captions is meaningless for that prompt. Suggested fix: replace the defaults with prompts like "A headshot of a person working in a cafe".
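A possible replacement default list, following the dataset's caption template (the first prompt is from the issue above; the others are illustrative suggestions, not prompts taken from the dataset):

```python
# Suggested defaults matching the "A headshot of a person doing X" template
# used by nirmalendu01/spectacles-bias-prompts-headshot-captioned.
# Only the first prompt appears in this issue; the rest are examples.
DEFAULT_PROMPTS = [
    "A headshot of a person working in a cafe",
    "A headshot of a person giving a presentation",
    "A headshot of a person reading in a library",
]
```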


Issue 2: Undocumented layer selection

In config/steer/run.yaml, two specific UNet layers are hardcoded for activation collection and steering, but there is no documentation or comment explaining why these layers were chosen over others. This makes the config opaque for researchers wanting to reproduce or extend the work.
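A sketch of what a documented version of the config could look like. The layer names below are placeholders (the issue does not name the two hardcoded layers), and the comment text is hypothetical; the point is that the rationale should live next to the values:

```yaml
# Hypothetical annotated fragment of config/steer/run.yaml.
# Layer names are placeholders, not the actual hardcoded values.
steer:
  layers:
    # Document WHY these layers: e.g. "selected via a sweep over UNet
    # blocks; these gave the strongest attribute separation" plus a link
    # to the experiment or notebook that produced the choice.
    - unet.mid_block.attentions.0    # placeholder
    - unet.up_blocks.1.attentions.0  # placeholder
```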


Issue 3: Critical mismatch — latent collection vs steering inference steps

Latents are collected with num_inference_steps=1, capture_step_index=0 in collect_latents, meaning activations are captured from near-pure noise. However, CAA steering runs with 50 inference steps by default. Additionally, steer_steps from the config is never passed to CAA.steer(), so it is silently ignored. This mismatch means the steering direction is learned from noise-level activations but applied to semantic-level activations, severely undermining steering validity.
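One cheap mitigation, independent of the larger question of which noise level to steer at, is to fail loudly when the collection and steering schedules disagree. A minimal sketch, assuming dict-shaped configs (the key names `num_inference_steps` and `steer_steps` come from this issue; the function and defaults are hypothetical):

```python
def check_step_consistency(collect_cfg: dict, steer_cfg: dict) -> None:
    """Raise if latents were collected under a different schedule than
    the one steering will run with, instead of silently ignoring it."""
    # Defaults mirror the mismatch described in this issue:
    # collection uses 1 step, steering uses 50.
    n_collect = collect_cfg.get("num_inference_steps", 1)
    n_steer = steer_cfg.get("steer_steps", 50)
    if n_collect != n_steer:
        raise ValueError(
            f"latents captured with num_inference_steps={n_collect} but "
            f"steering runs {n_steer} steps; the steering vector would be "
            "learned at one noise level and applied at another"
        )
```

The real fix is to thread `steer_steps` from the config into `CAA.steer()` and capture activations at a matching step index, but a guard like this at least surfaces the mismatch.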


Issue 4: No shape validation in make_policy

In CAA.steer, make_policy adds alpha * vec directly to runtime activations acts with no shape assertion. This relies on implicit PyTorch broadcasting, which could silently produce incorrect results if shapes don't align as expected.
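A minimal sketch of the proposed check, using NumPy for illustration (the real code presumably uses PyTorch, and the actual `make_policy` signature may differ): require the steering vector to match the trailing activation dimensions exactly before adding it.

```python
import numpy as np

def make_policy(vec: np.ndarray, alpha: float):
    """Hypothetical make_policy with an explicit shape check, so a
    mismatched vector raises instead of silently broadcasting."""
    def policy(acts: np.ndarray) -> np.ndarray:
        # The steering vector must match the trailing dims of the
        # runtime activations exactly.
        if acts.shape[-vec.ndim:] != vec.shape:
            raise ValueError(
                f"steering vector shape {vec.shape} does not match "
                f"activation trailing dims of {acts.shape}"
            )
        return acts + alpha * vec
    return policy
```

The same `if`/`raise` (or a `torch.Tensor` shape assertion) dropped into `CAA.steer` would turn a silent broadcasting bug into an immediate error.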
