
feat: detect and ignore massive activations#154

Draft
spikymoth wants to merge 1 commit into p-e-w:master from spikymoth:massive-activations

Conversation

@spikymoth
Contributor

Closes #124. This is still rough around the edges, containing some hardcoded constants for detection, but I'm not sure how much time I'll have to work on it this weekend. The detection seems pretty effective in my local runs.

There are 5 criteria for detection:

  1. The absolute value of the activation must be greater than 100
  2. The absolute value of the activation must be greater than 1000x the median of absolute values
  3. The absolute deviation from the median of absolute values must be greater than 1000x the MAD
  4. It must be part of the cluster of largest outliers and that cluster must contain no more than 3 values
  5. The detection must persist for at least 25% of the model's layers

Detection is applied to an initial prompt with common triggers and then to all good and bad prompt residuals.

Finally, we zero out all dimensions with at least 1 detection in both the good and bad residuals. As a result, they won't contribute to the refusal direction and will be left untouched during ablation.
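
As a rough illustration of how the first three criteria interact, here is a minimal, self-contained sketch (not the PR code; the constants are copied from the list above, and the helper name is hypothetical):

```python
import torch

# Constants from the PR description (criteria 1-3).
MIN_ABS_VALUE = 100
MIN_MED_MULT = 1000
MIN_MAD_MULT = 1000

def is_massive(activations: torch.Tensor) -> torch.Tensor:
    """Flag components of a single hidden-state vector as massive activations."""
    abs_value = activations.abs().to(torch.float32)
    abs_median = abs_value.median()
    abs_dev = (abs_value - abs_median).abs()
    # 1.4826 scales the MAD to match the standard deviation for normal data.
    mad = 1.4826 * abs_dev.median()
    return (
        (abs_value > MIN_ABS_VALUE)
        & (abs_value > MIN_MED_MULT * abs_median)
        & (abs_dev > MIN_MAD_MULT * mad)
    )

# Ordinary activations with one massive outlier in dimension 3.
hidden = torch.tensor([0.1, -0.2, 0.15, 500.0, -0.05, 0.12, 0.08, -0.11])
print(is_massive(hidden).nonzero().flatten().tolist())  # [3]
```

Criteria 4 and 5 (cluster size and cross-layer persistence) then prune this per-layer candidate set further.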

What I'd still like to do before taking this out of draft:

  1. Figure out which criteria are most effective at distinguishing massive activations (and maybe finetune the defaults)
  2. Remove any redundant criteria and make the rest configurable
  3. Make massive activation detection and removal configurable
  4. Make the initial detection prompt configurable, maybe?

@gemini-code-assist
Contributor

Summary of Changes

Hello @spikymoth, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new feature to detect and mitigate the impact of "massive activations" within the model's hidden states. By identifying and zeroing out these outlier activations, the change aims to improve the stability and interpretability of refusal directions, preventing highly sensitive dimensions from disproportionately influencing model behavior during analysis and ablation.

Highlights

  • Massive Activation Detection Implemented: A new mechanism was introduced to detect 'massive activations' within the model's hidden states, based on five statistical criteria: absolute value, median multiple, MAD multiple, cluster size, and persistence across layers.
  • Outlier Dimensions Zeroed Out: Dimensions identified as containing massive activations are now zeroed out in both good and bad prompt residuals, preventing them from disproportionately influencing refusal directions and ablation.
  • Initial Detection During Model Setup: Massive activation detection is initialized during model setup using a specific 'trigger' prompt designed to elicit common massive activation patterns.


Changelog
  • src/heretic/main.py
    • Added conditional logic to zero out good_residuals and bad_residuals based on model.massive_dims.
  • src/heretic/model.py
    • Added massive_dims attribute to the Model class.
    • Introduced a call to _init_massive_activation_dims in the Model constructor.
    • Implemented handle_massive_activations method to identify outlier dimensions based on several statistical criteria.
    • Integrated handle_massive_activations into the get_residuals method to process hidden states.
    • Added _init_massive_activation_dims method to initialize massive_dims using a predefined trigger prompt.
Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable feature for detecting and handling massive activations. The implementation is a solid first step, though it is still, as you noted, a bit rough around the edges. My review focuses on improving code quality and adherence to the repository's style guide. I've identified several style guide violations, such as missing type hints and improperly formatted comments. Additionally, I've suggested refactoring a particularly complex method and replacing hardcoded values with named constants to enhance readability and maintainability. Addressing these points will help make the code cleaner and easier to build upon.


return responses

def handle_massive_activations(self, hidden_states: tuple[FloatTensor]):
Contributor


medium

This method and _init_massive_activation_dims (at line 757) are missing their return type annotations. According to the style guide, all method signatures must be fully type-annotated. Since these methods do not return a value, they should be annotated with -> None.

Suggested change
def handle_massive_activations(self, hidden_states: tuple[FloatTensor]):
def handle_massive_activations(self, hidden_states: tuple[FloatTensor]) -> None:
References
  1. Function and method signatures must be fully type-annotated, including the return type (if any). (link)

Comment on lines +607 to +699
def handle_massive_activations(self, hidden_states: tuple[FloatTensor]):
    num_layers = len(hidden_states)
    device = hidden_states[0].device

    # The absolute value should be at least this high.
    min_abs_value = 100
    # The absolute value should be at least this multiple of the median.
    min_med_mult = 1000
    # The absolute deviation should be at least this multiple of the MAD.
    min_mad_mult = 1000
    # The outliers should form a cluster of no more than this many points.
    max_cluster_size = 3
    # Require several consecutive layers as confirmation.
    num_confirm = num_layers // 4

    # Look for massive activations in the hidden states.
    candidates_list: list[Tensor] = []
    for layer in hidden_states:
        # layer has shape (prompt, position, component). Get the absolute values.
        abs_value = torch.abs(layer.to(torch.float32))
        # Get the (prompt, position, 1) medians along the components.
        abs_median = torch.quantile(abs_value, q=0.5, dim=2, keepdim=True)
        abs_dev = torch.abs(abs_value - abs_median)
        mad = 1.4826 * torch.quantile(abs_dev, q=0.5, dim=2, keepdim=True)
        # Get the indices of potential massive activations.
        mask = (
            (abs_value > min_abs_value)
            & (abs_value > min_med_mult * abs_median)
            & (abs_dev > min_mad_mult * mad)
        )
        candidates = mask.nonzero()
        # Check whether the outliers are part of small clusters.
        top_dev = torch.topk(abs_dev, k=10, dim=2).values
        logs = torch.log(top_dev)
        gaps = -torch.diff(logs, n=1, dim=2)
        max_gaps, max_gap_indices = torch.max(gaps, dim=2)
        cumsum_gaps = torch.cumsum(gaps[:, :, : max_cluster_size - 1], dim=2)
        indices = (
            (max_gap_indices - 1)
            .clamp(min=0, max=max_cluster_size - 2)
            .unsqueeze(-1)
        )
        intra_spread = cumsum_gaps.gather(dim=2, index=indices).squeeze(-1)
        intra_spread = torch.where(
            max_gap_indices > 0, intra_spread, torch.zeros_like(intra_spread)
        )
        small_clusters = (
            (max_gap_indices < max_cluster_size)
            & (max_gaps > 1)
            & (max_gaps > intra_spread)
        )
        clusters_mask = small_clusters[candidates[:, 0], candidates[:, 1]]
        candidates = candidates[clusters_mask]
        candidates_list.append(candidates)

    # For any massive activation candidates found,
    # confirm them by checking if they persist for several layers.
    massive_dims_list: list[Tensor] = []
    for C_layer_idx in range(num_layers - num_confirm):
        C = candidates_list[C_layer_idx]
        if C.numel() < 1:
            continue
        D_all = candidates_list[C_layer_idx + 1 : C_layer_idx + num_confirm + 1]
        # As a small optimization, check that the other layers have candidates.
        any_empty = False
        for D in D_all:
            if D.numel() < 1:
                any_empty = True
                break
        if any_empty:
            continue
        # Check that the other layers have matching candidates.
        candidates = torch.arange(C.size(0), dtype=torch.long, device=device)
        current_C = C
        for D in D_all:
            concat = torch.cat([current_C[candidates], D], dim=0)
            _, inverse, counts = torch.unique(
                concat, dim=0, return_inverse=True, return_counts=True
            )
            matches = counts[inverse[: candidates.numel()]] > 1
            candidates = candidates[matches]
            if candidates.numel() == 0:
                break
        if candidates.numel() == 0:
            continue
        confirmed_inds = C[candidates]
        # Add confirmed massive dims (index 2 gives the dimension).
        massive_dims_list.append(confirmed_inds[:, 2].unique())

    if len(massive_dims_list) > 0:
        self.massive_dims = torch.cat(
            [self.massive_dims] + massive_dims_list
        ).unique()
Contributor


medium

This method is quite long and contains complex logic. To improve readability and maintainability, consider refactoring it into smaller, more focused helper methods. For example, you could have separate private methods for:

  1. Finding candidate activations in a single layer.
  2. Filtering candidates based on cluster properties.
  3. Confirming candidates by checking for persistence across layers.
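
One possible shape for that refactor (a hypothetical sketch with simplified bodies, not the actual PR code; only the median-multiple criterion is shown, and the function names are illustrative) might be:

```python
import torch
from torch import Tensor

def find_layer_candidates(layer: Tensor, min_med_mult: float = 1000.0) -> Tensor:
    """Return (prompt, position, component) indices of per-layer outliers."""
    abs_value = layer.abs().to(torch.float32)
    abs_median = torch.quantile(abs_value, q=0.5, dim=2, keepdim=True)
    return (abs_value > min_med_mult * abs_median).nonzero()

def confirm_across_layers(candidates_list: list[Tensor], num_confirm: int) -> Tensor:
    """Keep dimensions whose candidates persist for num_confirm consecutive layers."""
    confirmed: set[int] = set()
    for i in range(len(candidates_list) - num_confirm):
        rows = {tuple(r.tolist()) for r in candidates_list[i]}
        for later in candidates_list[i + 1 : i + num_confirm + 1]:
            rows &= {tuple(r.tolist()) for r in later}
        confirmed.update(r[2] for r in rows)  # index 2 is the dimension
    return torch.tensor(sorted(confirmed), dtype=torch.long)

# Toy hidden states: a spike in dimension 5 that persists across all layers.
layers = []
for _ in range(4):
    h = torch.full((1, 2, 8), 0.1)
    h[0, 0, 5] = 300.0
    layers.append(h)
cands = [find_layer_candidates(h) for h in layers]
print(confirm_across_layers(cands, num_confirm=2).tolist())  # [5]
```

The cluster-size filter (criterion 4) would slot in as a third helper between these two.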


return torch.cat(residuals, dim=0)

def _init_massive_activation_dims(self):
Contributor


medium

This method is missing its return type annotation. According to the style guide, all method signatures must be fully type-annotated. Since this method does not return a value, it should be annotated with -> None.

    def _init_massive_activation_dims(self) -> None:
References
  1. Function and method signatures must be fully type-annotated, including the return type (if any). (link)

Contributor Author


I wasn't sure about this, as omitting the annotation for void-returning functions seemed somewhat common (e.g. _apply_lora and reset_model).

Comment on lines +760 to +761
# This prompt contains most common massive activation triggers.
user="Summer is warm, and winter\nfrom here doesn't feel cold of all seasons.",
Contributor


medium

The comment on line 760 violates the style guide (it needs a period). Additionally, the hardcoded prompt string on line 761 should be defined as a named constant at the module or class level. This improves readability and makes it easier to manage if it needs to be changed or used elsewhere.

            # This prompt contains most common massive activation triggers.
            user="Summer is warm, and winter\nfrom here doesn't feel cold of all seasons.",
References
  1. Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)

Contributor Author


The comment about the comment (sike) is a hallucination. But putting the prompt in a constant or making it configurable is certainly possible.

Owner


Strange that the bot hallucinates such basic issues :(

@p-e-w
Owner

p-e-w commented Feb 15, 2026

Why is the detection logic so complicated and multifaceted? Basically, why isn't the criterion

> The absolute value of the activation must be greater than 1000x the median of absolute values

enough? This seems to perfectly describe what massive activations are, whereas absolute value thresholds are pretty suspect, since even the norm layers contain learned parameters, so they don't really guarantee any particular order of magnitude.

BTW, you might be interested in this post from the creator of MPOA: https://huggingface.co/posts/grimjim/119553185590597

@spikymoth
Contributor Author

spikymoth commented Feb 15, 2026

> Why is the detection logic so complicated and multifaceted? Basically, why isn't the criterion
>
> The absolute value of the activation must be greater than 1000x the median of absolute values
>
> enough? This seems to perfectly describe what massive activations are, whereas absolute value thresholds are pretty suspect, since even the norm layers contain learned parameters, so they don't really guarantee any particular order of magnitude.

The original criteria as proposed by the researchers are points 1 and 2:

  1. The absolute value of the activation must be greater than 100
  2. The absolute value of the activation must be greater than 1000x the median of absolute values

I think the absolute value is used because activation magnitudes tend to grow from layer to layer, so the absolute value is an easy way to exclude values that aren't astonishingly high in early layers.

I experimented with a bunch of different criteria because these numbers felt kind of arbitrary to me. 1000x the median of absolute values still ignores the scale/dispersion of the values - later layers tend to be more heavy-tailed than earlier layers. I suspect comparing against the MAD (point 3) gives more consistent results.

One of the primary criteria that I found helpful for excluding occasional outliers was consistent detections between layers - but I tried a bunch of different things to look for both false positives and false negatives. I kind of ran out of steam, but I still want to try adding some measurements for different models to see which criteria have the most discriminating power. It's kind of annoying to implement because we want scores for particular tokens from particular prompts in particular layers, then aggregate them into something we can actually draw conclusions from.

Once I have a clearer (data-driven) view of which criteria are most effective, I'll reduce it to 1 or 2 + the cross-layer validation.

> BTW, you might be interested in this post from the creator of MPOA: huggingface.co/posts/grimjim/119553185590597

Interesting!
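
The dispersion point can be made concrete with a toy example (illustrative only, not PR code): a spike that passes the "1000x the median" criterion in both layers scores an order of magnitude lower in MAD units once the layer's values are more spread out.

```python
import torch

def mad_score(x: torch.Tensor) -> torch.Tensor:
    """Deviation from the median of absolute values, in scaled-MAD units."""
    a = x.abs().to(torch.float32)
    dev = (a - a.median()).abs()
    return dev / (1.4826 * dev.median())

tight = torch.linspace(0.9, 1.1, 101)   # low-dispersion, early-layer-like values
wide = torch.linspace(1.0, 100.0, 101)  # high-dispersion, late-layer-like values

scores = []
for layer in (tight, wide):
    spike = 1000 * layer.median()       # passes the median-multiple criterion
    with_spike = torch.cat([layer, spike.unsqueeze(0)])
    scores.append(mad_score(with_spike)[-1].item())

print([round(s) for s in scores])
```

Both spikes clear a 1000x-MAD threshold here, but the spike in the dispersed layer sits an order of magnitude closer to it, which is why the two criteria can disagree as tails get heavier.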

@p-e-w
Owner

p-e-w commented Feb 15, 2026

But why do we want to be so conservative? If an activation is 1000x the median, it is always going to dominate computations, regardless of whether it also occurs in other layers, or whether it is above a certain absolute magnitude, etc. Whenever such an activation pops up, we can assume that the refusal direction will be poor, irrespective of the activation mechanics involved. Those components simply drown out everything else.

@p-e-w
Owner

p-e-w commented Feb 15, 2026

Either way, what are the results with this PR? Have you observed improvements compared to winsorization?

@p-e-w
Owner

p-e-w commented Feb 15, 2026

There might also be a connection with #160, which suggests that more tokens can be "structural" than we currently assume. Perhaps fixing this issue properly, and skipping to the "true" start of the output, can also improve the problem of massive activations.

@spikymoth
Contributor Author

> But why do we want to be so conservative? If an activation is 1000x the median, it is always going to dominate computations, regardless of whether it also occurs in other layers, or whether it is above a certain absolute magnitude, etc. Whenever such an activation pops up, we can assume that the refusal direction will be poor, irrespective of the activation mechanics involved. Those components simply drown out everything else.

In my experiments, treating too many dimensions as massive actually reduced performance. I think sometimes large activations are just important, and ignoring the dimension containing them just weakens the effectiveness of ablation.

> Either way, what are the results with this PR? Have you observed improvements compared to winsorization?

I've observed good performance, but it's very difficult with my hardware to gather a large comparative dataset. I'll try to get you a more concrete answer.

> There might also be a connection with #160, which suggests that more tokens can be "structural" than we currently assume. Perhaps fixing this issue properly, and skipping to the "true" start of the output, can also improve the problem of massive activations.

Yes, I think this is also an important problem - we want to do our measurements on the logits that best identify the difference between acceptance and refusal. But I think moving the measurement position is orthogonal to detecting and ignoring dimensions in the hidden state that exhibit massive activations.

Massive activations are very token-specific and basically don't show up at the end of prompt / start of response position (with the exception of that 1 dimension in Gemma 3, which is just massive all the time), so moving the measurement position won't do anything to reduce the impact of ablating those dimensions on model performance.

IIRC dimensions that exhibit massive activations do generally tend to have larger norms in the weight matrix (though large norms don't unambiguously identify them), so I think they do tend to have a larger than normal impact on the refusal direction. So winsorizing or ignoring them might also help "purify" the direction in that sense too.

@p-e-w
Owner

p-e-w commented Feb 15, 2026

> IIRC dimensions that exhibit massive activations do generally tend to have larger norms in the weight matrix (though large norms don't unambiguously identify them), so I think they do tend to have a larger than normal impact on the refusal direction. So winsorizing or ignoring them might also help "purify" the direction in that sense too.

Perhaps it could also make sense to apply this pruning to magnitude preservation somehow.

As always, the main problem is that we can't clearly tell whether a change is an improvement or not. This is a huge issue that we need to address comprehensively and scientifically. Many potential enhancements depend on us being able to tell the difference between a genuine improvement and noise.

@p-e-w
Owner

p-e-w commented Mar 30, 2026

@spikymoth

The TurboQuant paper uses random rotations to deal with massive activations insofar as they affect quantization (see my Reddit post explaining how this works).

While this doesn't eliminate the vector alignment problem, it could make our computations a lot more numerically stable, and perhaps have other benefits as well.
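
For intuition, here is a small sketch of the rotation idea (an illustration of the principle only, not code from TurboQuant or this PR): a random orthogonal rotation preserves norms and inner products, but smears a single massive component across all dimensions.

```python
import torch

torch.manual_seed(0)
dim = 1024

# A random orthogonal matrix via QR decomposition of a Gaussian matrix.
Q, _ = torch.linalg.qr(torch.randn(dim, dim))

x = torch.full((dim,), 0.1)
x[7] = 500.0          # a single massive activation
y = Q @ x             # the same vector in a randomly rotated basis

# The norm is preserved, but the largest component shrinks dramatically.
print(x.abs().max().item(), y.abs().max().item())
print(torch.allclose(x.norm(), y.norm(), rtol=1e-3))
```

Since quantization and low-precision arithmetic struggle most with extreme per-component ranges, flattening the distribution this way can improve numerical stability even though the underlying vector is unchanged.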

@spikymoth
Contributor Author

Funnily enough, I did actually play with rotation matrices a while ago to see if it would improve the direction estimates (I made a robust estimator that worked by applying a rotation, taking the medians and undoing the rotation, then repeating that with several other random rotations and averaging the results). It worked but didn't really outperform other forms of robust estimation.

However, I didn't attempt to apply it further along the chain - e.g. removing a rotated direction from a rotated weight matrix. I'm not sure how well it would work, but it could be interesting as earlier layers tend to be light-tailed while later layers tend to be heavy-tailed, and orthogonal rotations give them all normal tails (which might make the direction apply more universally, maybe?).

I'm not sure how much it would really impact massive activations, though. Hoping to get back to this PR soon.
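
The estimator described above might look roughly like this (a reconstruction from the comment, with hypothetical names; not the actual experiment code): rotate the samples, take coordinate-wise medians, rotate back, and average over several random rotations.

```python
import torch

def rotated_median_estimate(samples: torch.Tensor, num_rotations: int = 8) -> torch.Tensor:
    """Rotate, take coordinate-wise medians, undo the rotation, and average."""
    dim = samples.size(1)
    estimates = []
    for _ in range(num_rotations):
        # Random orthogonal rotation from the QR decomposition of a Gaussian matrix.
        Q, _ = torch.linalg.qr(torch.randn(dim, dim))
        rotated_median, _ = (samples @ Q.T).median(dim=0)
        estimates.append(Q.T @ rotated_median)  # back to the original basis
    return torch.stack(estimates).mean(dim=0)

torch.manual_seed(0)
true_direction = torch.ones(16)
samples = true_direction + 0.1 * torch.randn(200, 16)
samples[:10, 3] += 1000.0  # a few samples with a massive activation in dim 3

robust = rotated_median_estimate(samples)
naive = samples.mean(dim=0)
print((robust - true_direction).norm() < (naive - true_direction).norm())
```

The coordinate-wise median resists the 5% of contaminated samples in every rotated basis, so the averaged estimate stays close to the true direction while the plain mean is dragged far off along dimension 3.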

Development

Successfully merging this pull request may close these issues.

Detect and handle massive activations
