
feat: detect and ignore massive activations#154

Draft
spikymoth wants to merge 1 commit into p-e-w:master from spikymoth:massive-activations

Conversation

@spikymoth
Contributor

Closes #124. This is still rough around the edges, containing some hardcoded constants for detection, but I'm not sure how much time I'll have to work on it this weekend. The detection seems pretty effective in my local runs.

There are 5 criteria for detection:

  1. The absolute value of the activation must be greater than 100
  2. The absolute value of the activation must be greater than 1000x the median of absolute values
  3. The absolute deviation from the median of absolute values must be greater than 1000x the MAD
  4. It must be part of the cluster of largest outliers and that cluster must contain no more than 3 values
  5. The detection must persist for at least 25% of the model's layers

Detection is applied to an initial prompt with common triggers and then to all good and bad prompt residuals.

Finally, we zero out all dimensions with at least 1 detection in both the good and bad residuals. As a result, they won't contribute to the refusal direction and will be left untouched during ablation.
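
As a rough illustration of how the first three criteria interact, here is a minimal, self-contained sketch (not the PR code; the constants are copied from the list above, and the helper name is hypothetical):

```python
import torch

# Constants from the PR description (criteria 1-3).
MIN_ABS_VALUE = 100
MIN_MED_MULT = 1000
MIN_MAD_MULT = 1000

def is_massive(activations: torch.Tensor) -> torch.Tensor:
    """Flag components of a single hidden-state vector as massive activations."""
    abs_value = activations.abs().to(torch.float32)
    abs_median = abs_value.median()
    abs_dev = (abs_value - abs_median).abs()
    # 1.4826 scales the MAD to match the standard deviation for normal data.
    mad = 1.4826 * abs_dev.median()
    return (
        (abs_value > MIN_ABS_VALUE)
        & (abs_value > MIN_MED_MULT * abs_median)
        & (abs_dev > MIN_MAD_MULT * mad)
    )

# Ordinary activations with one massive outlier in dimension 3.
hidden = torch.tensor([0.1, -0.2, 0.15, 500.0, -0.05, 0.12, 0.08, -0.11])
print(is_massive(hidden).nonzero().flatten().tolist())  # [3]
```

Criteria 4 and 5 (cluster size and cross-layer persistence) then prune this per-layer candidate set further.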

What I'd still like to do before taking this out of draft:

  1. Figure out which criteria are most effective at distinguishing massive activations (and maybe finetune the defaults)
  2. Remove any redundant criteria and make the rest configurable
  3. Make massive activation detection and removal configurable
  4. Make the initial detection prompt configurable, maybe?

@gemini-code-assist
Contributor

Summary of Changes

Hello @spikymoth, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new feature to detect and mitigate the impact of "massive activations" within the model's hidden states. By identifying and zeroing out these outlier activations, the change aims to improve the stability and interpretability of refusal directions, preventing highly sensitive dimensions from disproportionately influencing model behavior during analysis and ablation.

Highlights

  • Massive Activation Detection Implemented: A new mechanism was introduced to detect 'massive activations' within the model's hidden states, based on five statistical criteria: absolute value, median multiple, MAD multiple, cluster size, and persistence across layers.
  • Outlier Dimensions Zeroed Out: Dimensions identified as containing massive activations are now zeroed out in both good and bad prompt residuals, preventing them from disproportionately influencing refusal directions and ablation.
  • Initial Detection During Model Setup: Massive activation detection is initialized during model setup using a specific 'trigger' prompt designed to elicit common massive activation patterns.


Changelog
  • src/heretic/main.py
    • Added conditional logic to zero out good_residuals and bad_residuals based on model.massive_dims.
  • src/heretic/model.py
    • Added massive_dims attribute to the Model class.
    • Introduced a call to _init_massive_activation_dims in the Model constructor.
    • Implemented handle_massive_activations method to identify outlier dimensions based on several statistical criteria.
    • Integrated handle_massive_activations into the get_residuals method to process hidden states.
    • Added _init_massive_activation_dims method to initialize massive_dims using a predefined trigger prompt.
Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable feature for detecting and handling massive activations. The implementation is a solid first step, though it is still, as you noted, a bit rough around the edges. My review focuses on improving code quality and adherence to the repository's style guide. I've identified several style guide violations, such as missing type hints and improperly formatted comments. Additionally, I've suggested refactoring a particularly complex method and replacing hardcoded values with named constants to enhance readability and maintainability. Addressing these points will help make the code cleaner and easier to build upon.


return responses

def handle_massive_activations(self, hidden_states: tuple[FloatTensor]):
Contributor


medium

This method and _init_massive_activation_dims (at line 757) are missing their return type annotations. According to the style guide, all method signatures must be fully type-annotated. Since these methods do not return a value, they should be annotated with -> None.

Suggested change
def handle_massive_activations(self, hidden_states: tuple[FloatTensor]):
def handle_massive_activations(self, hidden_states: tuple[FloatTensor]) -> None:
References
  1. Function and method signatures must be fully type-annotated, including the return type (if any). (link)

Comment on lines +607 to +699
def handle_massive_activations(self, hidden_states: tuple[FloatTensor]):
    num_layers = len(hidden_states)
    device = hidden_states[0].device

    # The absolute value should be at least this high.
    min_abs_value = 100
    # The absolute value should be at least this multiple of the median.
    min_med_mult = 1000
    # The absolute deviation should be at least this multiple of the MAD.
    min_mad_mult = 1000
    # The outliers should form a cluster of no more than this many points.
    max_cluster_size = 3
    # Require several consecutive layers as confirmation.
    num_confirm = num_layers // 4

    # Look for massive activations in the hidden states.
    candidates_list: list[Tensor] = []
    for layer in hidden_states:
        # layer has shape (prompt, position, component). Get the absolute values.
        abs_value = torch.abs(layer.to(torch.float32))
        # Get the (prompt, position, 1) medians along the components.
        abs_median = torch.quantile(abs_value, q=0.5, dim=2, keepdim=True)
        abs_dev = torch.abs(abs_value - abs_median)
        mad = 1.4826 * torch.quantile(abs_dev, q=0.5, dim=2, keepdim=True)
        # Get the indices of potential massive activations.
        mask = (
            (abs_value > min_abs_value)
            & (abs_value > min_med_mult * abs_median)
            & (abs_dev > min_mad_mult * mad)
        )
        candidates = mask.nonzero()
        # Check whether the outliers are part of small clusters.
        top_dev = torch.topk(abs_dev, k=10, dim=2).values
        logs = torch.log(top_dev)
        gaps = -torch.diff(logs, n=1, dim=2)
        max_gaps, max_gap_indices = torch.max(gaps, dim=2)
        cumsum_gaps = torch.cumsum(gaps[:, :, : max_cluster_size - 1], dim=2)
        indices = (
            (max_gap_indices - 1)
            .clamp(min=0, max=max_cluster_size - 2)
            .unsqueeze(-1)
        )
        intra_spread = cumsum_gaps.gather(dim=2, index=indices).squeeze(-1)
        intra_spread = torch.where(
            max_gap_indices > 0, intra_spread, torch.zeros_like(intra_spread)
        )
        small_clusters = (
            (max_gap_indices < max_cluster_size)
            & (max_gaps > 1)
            & (max_gaps > intra_spread)
        )
        clusters_mask = small_clusters[candidates[:, 0], candidates[:, 1]]
        candidates = candidates[clusters_mask]
        candidates_list.append(candidates)

    # For any massive activation candidates found,
    # confirm them by checking if they persist for several layers.
    massive_dims_list: list[Tensor] = []
    for C_layer_idx in range(num_layers - num_confirm):
        C = candidates_list[C_layer_idx]
        if C.numel() < 1:
            continue
        D_all = candidates_list[C_layer_idx + 1 : C_layer_idx + num_confirm + 1]
        # As a small optimization, check that the other layers have candidates.
        any_empty = False
        for D in D_all:
            if D.numel() < 1:
                any_empty = True
                break
        if any_empty:
            continue
        # Check that the other layers have matching candidates.
        candidates = torch.arange(C.size(0), dtype=torch.long, device=device)
        current_C = C
        for D in D_all:
            concat = torch.cat([current_C[candidates], D], dim=0)
            _, inverse, counts = torch.unique(
                concat, dim=0, return_inverse=True, return_counts=True
            )
            matches = counts[inverse[: candidates.numel()]] > 1
            candidates = candidates[matches]
            if candidates.numel() == 0:
                break
        if candidates.numel() == 0:
            continue
        confirmed_inds = C[candidates]
        # Add confirmed massive dims (index 2 gives the dimension).
        massive_dims_list.append(confirmed_inds[:, 2].unique())

    if len(massive_dims_list) > 0:
        self.massive_dims = torch.cat(
            [self.massive_dims] + massive_dims_list
        ).unique()
Contributor


medium

This method is quite long and contains complex logic. To improve readability and maintainability, consider refactoring it into smaller, more focused helper methods. For example, you could have separate private methods for:

  1. Finding candidate activations in a single layer.
  2. Filtering candidates based on cluster properties.
  3. Confirming candidates by checking for persistence across layers.
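
One possible shape for that refactor (a hypothetical sketch with simplified bodies, not the actual PR code; only the median-multiple criterion is shown, and the function names are illustrative) might be:

```python
import torch
from torch import Tensor

def find_layer_candidates(layer: Tensor, min_med_mult: float = 1000.0) -> Tensor:
    """Return (prompt, position, component) indices of per-layer outliers."""
    abs_value = layer.abs().to(torch.float32)
    abs_median = torch.quantile(abs_value, q=0.5, dim=2, keepdim=True)
    return (abs_value > min_med_mult * abs_median).nonzero()

def confirm_across_layers(candidates_list: list[Tensor], num_confirm: int) -> Tensor:
    """Keep dimensions whose candidates persist for num_confirm consecutive layers."""
    confirmed: set[int] = set()
    for i in range(len(candidates_list) - num_confirm):
        rows = {tuple(r.tolist()) for r in candidates_list[i]}
        for later in candidates_list[i + 1 : i + num_confirm + 1]:
            rows &= {tuple(r.tolist()) for r in later}
        confirmed.update(r[2] for r in rows)  # index 2 is the dimension
    return torch.tensor(sorted(confirmed), dtype=torch.long)

# Toy hidden states: a spike in dimension 5 that persists across all layers.
layers = []
for _ in range(4):
    h = torch.full((1, 2, 8), 0.1)
    h[0, 0, 5] = 300.0
    layers.append(h)
cands = [find_layer_candidates(h) for h in layers]
print(confirm_across_layers(cands, num_confirm=2).tolist())  # [5]
```

The cluster-size filter (criterion 4) would slot in as a third helper between these two.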


return torch.cat(residuals, dim=0)

def _init_massive_activation_dims(self):
Contributor


medium

This method is missing its return type annotation. According to the style guide, all method signatures must be fully type-annotated. Since this method does not return a value, it should be annotated with -> None.

    def _init_massive_activation_dims(self) -> None:
References
  1. Function and method signatures must be fully type-annotated, including the return type (if any). (link)

Contributor Author


I wasn't sure about this, as omitting the annotation for void-returning functions seemed somewhat common (e.g. _apply_lora and reset_model).

Comment on lines +760 to +761
# This prompt contains most common massive activation triggers.
user="Summer is warm, and winter\nfrom here doesn't feel cold of all seasons.",
Contributor


medium

The comment on line 760 violates the style guide (it needs a period). Additionally, the hardcoded prompt string on line 761 should be defined as a named constant at the module or class level. This improves readability and makes it easier to manage if it needs to be changed or used elsewhere.

            # This prompt contains most common massive activation triggers.
            user="Summer is warm, and winter\nfrom here doesn't feel cold of all seasons.",
References
  1. Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)

Contributor Author


The comment about the comment (sike) is a hallucination. But putting the prompt in a constant or making it configurable is certainly possible.

Owner


Strange that the bot hallucinates such basic issues :(

@p-e-w
Owner

p-e-w commented Feb 15, 2026

Why is the detection logic so complicated and multifaceted? Basically, why isn't the criterion

> The absolute value of the activation must be greater than 1000x the median of absolute values

enough? This seems to perfectly describe what massive activations are, whereas absolute value thresholds are pretty suspect, since even the norm layers contain learned parameters, so they don't really guarantee any particular order of magnitude.

BTW, you might be interested in this post from the creator of MPOA: https://huggingface.co/posts/grimjim/119553185590597

@spikymoth
Contributor Author

spikymoth commented Feb 15, 2026

> Why is the detection logic so complicated and multifaceted? Basically, why isn't the criterion
>
> The absolute value of the activation must be greater than 1000x the median of absolute values
>
> enough? This seems to perfectly describe what massive activations are, whereas absolute value thresholds are pretty suspect, since even the norm layers contain learned parameters, so they don't really guarantee any particular order of magnitude.

The original criteria as proposed by the researchers are points 1 and 2:

  1. The absolute value of the activation must be greater than 100
  2. The absolute value of the activation must be greater than 1000x the median of absolute values

I think the absolute value is used because activation magnitudes tend to grow from layer to layer, so the absolute value is an easy way to exclude values that aren't astonishingly high in early layers.

I experimented with a bunch of different criteria because these numbers felt kind of arbitrary to me. 1000x the median of absolute values still ignores the scale/dispersion of the values - later layers tend to be more heavy-tailed than earlier layers. I suspect comparing against the MAD (point 3) gives more consistent results.

One of the primary criteria that I found helpful for excluding occasional outliers was consistent detections between layers - but I tried a bunch of different things to look for both false positives and false negatives. I kind of ran out of steam, but I still want to try adding some measurements for different models to see which criteria have the most discriminating power. It's kind of annoying to implement because we want scores for particular tokens from particular prompts in particular layers, then aggregate them into something we can actually draw conclusions from.

Once I have a clearer (data-driven) view of which criteria are most effective, I'll reduce it to 1 or 2 + the cross-layer validation.

> BTW, you might be interested in this post from the creator of MPOA: huggingface.co/posts/grimjim/119553185590597

Interesting!
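
The dispersion point can be made concrete with a toy example (illustrative only, not PR code): a spike that passes the "1000x the median" criterion in both layers scores an order of magnitude lower in MAD units once the layer's values are more spread out.

```python
import torch

def mad_score(x: torch.Tensor) -> torch.Tensor:
    """Deviation from the median of absolute values, in scaled-MAD units."""
    a = x.abs().to(torch.float32)
    dev = (a - a.median()).abs()
    return dev / (1.4826 * dev.median())

tight = torch.linspace(0.9, 1.1, 101)   # low-dispersion, early-layer-like values
wide = torch.linspace(1.0, 100.0, 101)  # high-dispersion, late-layer-like values

scores = []
for layer in (tight, wide):
    spike = 1000 * layer.median()       # passes the median-multiple criterion
    with_spike = torch.cat([layer, spike.unsqueeze(0)])
    scores.append(mad_score(with_spike)[-1].item())

print([round(s) for s in scores])
```

Both spikes clear a 1000x-MAD threshold here, but the spike in the dispersed layer sits an order of magnitude closer to it, which is why the two criteria can disagree as tails get heavier.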

@p-e-w
Owner

p-e-w commented Feb 15, 2026

But why do we want to be so conservative? If an activation is 1000x the median, it is always going to dominate computations, regardless of whether it also occurs in other layers, or whether it is above a certain absolute magnitude, etc. Whenever such an activation pops up, we can assume that the refusal direction will be poor, irrespective of the activation mechanics involved. Those components simply drown out everything else.

@p-e-w
Owner

p-e-w commented Feb 15, 2026

Either way, what are the results with this PR? Have you observed improvements compared to winsorization?

@p-e-w
Owner

p-e-w commented Feb 15, 2026

There might also be a connection with #160, which suggests that more tokens can be "structural" than we currently assume. Perhaps fixing this issue properly, and skipping to the "true" start of the output, can also improve the problem of massive activations.

@spikymoth
Contributor Author

> But why do we want to be so conservative? If an activation is 1000x the median, it is always going to dominate computations, regardless of whether it also occurs in other layers, or whether it is above a certain absolute magnitude, etc. Whenever such an activation pops up, we can assume that the refusal direction will be poor, irrespective of the activation mechanics involved. Those components simply drown out everything else.

In my experiments, treating too many dimensions as massive actually reduced performance. I think sometimes large activations are just important, and ignoring the dimension containing them just weakens the effectiveness of ablation.

> Either way, what are the results with this PR? Have you observed improvements compared to winsorization?

I've observed good performance, but it's very difficult with my hardware to gather a large comparative dataset. I'll try to get you a more concrete answer.

> There might also be a connection with #160, which suggests that more tokens can be "structural" than we currently assume. Perhaps fixing this issue properly, and skipping to the "true" start of the output, can also improve the problem of massive activations.

Yes, I think this is also an important problem - we want to do our measurements on the logits that best identify the difference between acceptance and refusal. But I think moving the measurement position is orthogonal to detecting and ignoring dimensions in the hidden state that exhibit massive activations.

Massive activations are very token-specific and basically don't show up at the end of prompt / start of response position (with the exception of that 1 dimension in Gemma 3, which is just massive all the time), so moving the measurement position won't do anything to reduce the impact of ablating those dimensions on model performance.

IIRC dimensions that exhibit massive activations do generally tend to have larger norms in the weight matrix (though large norms don't unambiguously identify them), so I think they do tend to have a larger than normal impact on the refusal direction. So winsorizing or ignoring them might also help "purify" the direction in that sense too.

@p-e-w
Owner

p-e-w commented Feb 15, 2026

> IIRC dimensions that exhibit massive activations do generally tend to have larger norms in the weight matrix (though large norms don't unambiguously identify them), so I think they do tend to have a larger than normal impact on the refusal direction. So winsorizing or ignoring them might also help "purify" the direction in that sense too.

Perhaps it could also make sense to apply this pruning to magnitude preservation somehow.

As always, the main problem is that we can't clearly tell whether a change is an improvement or not. This is a huge issue that we need to address comprehensively and scientifically. Many potential enhancements depend on us being able to tell the difference between a genuine improvement and noise.

@p-e-w
Owner

p-e-w commented Mar 30, 2026

@spikymoth

The TurboQuant paper uses random rotations to deal with massive activations insofar as they affect quantization (see my Reddit post explaining how this works).

While this doesn't eliminate the vector alignment problem, it could make our computations a lot more numerically stable, and perhaps have other benefits as well.
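
For intuition, here is a small sketch of the rotation idea (an illustration of the principle only, not code from TurboQuant or this PR): a random orthogonal rotation preserves norms and inner products, but smears a single massive component across all dimensions.

```python
import torch

torch.manual_seed(0)
dim = 1024

# A random orthogonal matrix via QR decomposition of a Gaussian matrix.
Q, _ = torch.linalg.qr(torch.randn(dim, dim))

x = torch.full((dim,), 0.1)
x[7] = 500.0          # a single massive activation
y = Q @ x             # the same vector in a randomly rotated basis

# The norm is preserved, but the largest component shrinks dramatically.
print(x.abs().max().item(), y.abs().max().item())
print(torch.allclose(x.norm(), y.norm(), rtol=1e-3))
```

Since quantization and low-precision arithmetic struggle most with extreme per-component ranges, flattening the distribution this way can improve numerical stability even though the underlying vector is unchanged.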

@spikymoth
Contributor Author

Funnily enough, I did actually play with rotation matrices a while ago to see if it would improve the direction estimates (I made a robust estimator that worked by applying a rotation, taking the medians and undoing the rotation, then repeating that with several other random rotations and averaging the results). It worked but didn't really outperform other forms of robust estimation.

However, I didn't attempt to apply it further along the chain - e.g. removing a rotated direction from a rotated weight matrix. I'm not sure how well it would work, but it could be interesting as earlier layers tend to be light-tailed while later layers tend to be heavy-tailed, and orthogonal rotations give them all normal tails (which might make the direction apply more universally, maybe?).

I'm not sure how much it would really impact massive activations, though. Hoping to get back to this PR soon.
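
The estimator described above might look roughly like this (a reconstruction from the comment, with hypothetical names; not the actual experiment code): rotate the samples, take coordinate-wise medians, rotate back, and average over several random rotations.

```python
import torch

def rotated_median_estimate(samples: torch.Tensor, num_rotations: int = 8) -> torch.Tensor:
    """Rotate, take coordinate-wise medians, undo the rotation, and average."""
    dim = samples.size(1)
    estimates = []
    for _ in range(num_rotations):
        # Random orthogonal rotation from the QR decomposition of a Gaussian matrix.
        Q, _ = torch.linalg.qr(torch.randn(dim, dim))
        rotated_median, _ = (samples @ Q.T).median(dim=0)
        estimates.append(Q.T @ rotated_median)  # back to the original basis
    return torch.stack(estimates).mean(dim=0)

torch.manual_seed(0)
true_direction = torch.ones(16)
samples = true_direction + 0.1 * torch.randn(200, 16)
samples[:10, 3] += 1000.0  # a few samples with a massive activation in dim 3

robust = rotated_median_estimate(samples)
naive = samples.mean(dim=0)
print((robust - true_direction).norm() < (naive - true_direction).norm())
```

The coordinate-wise median resists the 5% of contaminated samples in every rotated basis, so the averaged estimate stays close to the true direction while the plain mean is dragged far off along dimension 3.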

Development

Successfully merging this pull request may close these issues.

Detect and handle massive activations
