This project provides tools for analyzing attention mechanisms. It enables detailed investigation of how attention patterns and value transformations change when semantically related but meaningfully different statements are processed.
Please note that this project is not complete; it is only my scratchbook for understanding these mechanisms better.
- Attention pattern visualization before and after value matrix transformation
- Counterfactual analysis comparing attention patterns between related statements
- Statistical metrics for quantifying attention pattern differences
- Support for both raw attention weights and value-weighted attention analysis
- Layer-wise and head-wise analysis capabilities
- PyTorch
- Transformers
- Seaborn
- Matplotlib
- NumPy
This project investigates how transformer-based language models process negation through analysis of the attention mechanism. Using a custom AttentionProbe class, we visualize and analyze attention patterns and value-weighted attention across different layers and heads. The model used here is BERT-Tiny (https://huggingface.co/prajjwal1/bert-tiny).
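For reference, a minimal snippet for loading this model with attention outputs enabled via the Hugging Face `transformers` API (the example sentence is illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "prajjwal1/bert-tiny"  # BERT-Tiny: 2 layers, 2 attention heads

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()

inputs = tokenizer("The cat is not on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer
print(len(outputs.attentions), outputs.attentions[0].shape)
```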
The analysis uses a custom AttentionProbe class (sketched after this list) that:
- Extracts attention weights and value matrices from transformer layers
- Computes value-weighted attention patterns
- Visualizes results using heatmaps for both raw attention and value-weighted attention
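A minimal sketch of the idea, assuming a BERT-style model from `transformers` and a forward hook on the value projection. The "value-weighted attention" here scales each attention weight by the norm of the attended value vector, which is one common definition; the repo's exact formula may differ.

```python
import torch

class AttentionProbe:
    """Minimal probe: captures attention weights and per-head value vectors."""

    def __init__(self, model, layer: int):
        self.model = model
        self.layer = layer
        self.values = None
        # Hook the value projection of the chosen BERT encoder layer
        self_attn = model.encoder.layer[layer].attention.self
        self.num_heads = self_attn.num_attention_heads
        self.head_dim = self_attn.attention_head_size
        self_attn.value.register_forward_hook(self._save_values)

    def _save_values(self, module, inputs, output):
        # Reshape (batch, seq, hidden) -> (batch, heads, seq, head_dim)
        b, s, _ = output.shape
        self.values = output.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)

    def run(self, **inputs):
        with torch.no_grad():
            out = self.model(**inputs, output_attentions=True)
        attn = out.attentions[self.layer]             # (batch, heads, seq, seq)
        # Scale each attention weight by the norm of the attended value vector
        v_norms = self.values.norm(dim=-1)            # (batch, heads, seq)
        value_weighted = attn * v_norms.unsqueeze(2)  # broadcast over query axis
        return attn, value_weighted
```

Usage, continuing from the loading snippet above: `probe = AttentionProbe(model, layer=0)` followed by `attn, value_weighted = probe.run(**inputs)`.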
The analysis is run over multiple negation statements, and the results are treated as statistical objects to find patterns and circuits.
- Layer specialization:
- Layer 0: Focuses on local token relationships
- Layer 1: Handles broader semantic connections
- Negation processing:
- Strong bidirectional attention between negation words and affected terms (quantified in the sketch after this list)
- Distinct value-weighted patterns in negation contexts
- Diffusion of value-weighted vectors around the negation term
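A small helper to quantify the bidirectional-attention observation, reading off the two directed weights between a negation token and a target token from the probe's `attn` tensor (the indices depend on the tokenizer and are illustrative):

```python
def negation_attention(attn, neg_idx, target_idx, head=0):
    """Attention weight from the negation token to a target token and back."""
    a = attn[0, head]  # (seq, seq): rows = queries, columns = keys
    return a[neg_idx, target_idx].item(), a[target_idx, neg_idx].item()
```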
- Raw attention weights: Shows direct attention patterns between tokens
- Value-weighted attention: Reveals semantic relationships after value transformation
- Analyzed across multiple layers and attention heads (see the plotting sketch below)
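A sketch of how such side-by-side heatmaps can be drawn with `seaborn`, using the probe output above (a single-sentence batch is assumed; the function name is illustrative):

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_head(attn, value_weighted, tokens, layer, head):
    """Side-by-side heatmaps: raw vs value-weighted attention for one head."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    panels = [("Raw attention", attn), ("Value-weighted attention", value_weighted)]
    for ax, (title, mat) in zip(axes, panels):
        sns.heatmap(mat[0, head].numpy(), ax=ax, cmap="viridis",
                    xticklabels=tokens, yticklabels=tokens)
        ax.set_title(f"Layer {layer}, head {head}: {title}")
    plt.tight_layout()
    plt.show()

# tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# plot_head(attn, value_weighted, tokens, layer=0, head=0)
```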
Left: Raw attention weights showing local token relationships
Right: Value-weighted attention revealing semantic connections
Left: Raw attention with broader distribution
Right: Value-weighted patterns showing cross-token semantic relationships
Left: Raw attention focused on syntactic relationships
Right: Value-weighted attention showing refined semantic connections
Left: Raw attention with strong endpoint connections
Right: Value-weighted attention showing integrated semantic relationships
Analysis of counterfactual statements in a DistilBERT model.
The visualization includes four heatmaps (a sketch of the grid follows this list):
- Original statement attention weights
- Counterfactual statement attention weights
- Original value-weighted attention
- Counterfactual value-weighted attention
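A sketch of the 2x2 grid, reusing the probe's tensors for the two statements; it mirrors the per-head helper above, and the names are illustrative:

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_counterfactual(orig_attn, cf_attn, orig_vw, cf_vw, head=0):
    """2x2 grid: raw and value-weighted attention, original vs counterfactual."""
    panels = [
        ("Original attention", orig_attn),
        ("Counterfactual attention", cf_attn),
        ("Original value-weighted", orig_vw),
        ("Counterfactual value-weighted", cf_vw),
    ]
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    for ax, (title, mat) in zip(axes.flat, panels):
        sns.heatmap(mat[0, head].numpy(), ax=ax, cmap="viridis")
        ax.set_title(title)
    plt.tight_layout()
    plt.show()
```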
The comparison reports the following statistical metrics (a sketch of their computation follows):
- `max_attention_diff`: maximum absolute difference in attention weights
- `mean_attention_diff`: average absolute difference in attention weights
- `attention_pattern_correlation`: correlation between original and counterfactual attention patterns
- `value_output_correlation`: correlation between original and counterfactual value-weighted outputs
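A plausible implementation of these metrics with NumPy, assuming the original and counterfactual sentences tokenize to the same length so the flattened arrays line up (function and argument names are illustrative):

```python
import numpy as np

def counterfactual_metrics(orig_attn, cf_attn, orig_out, cf_out):
    """Flatten original vs counterfactual patterns and compare them."""
    a, b = orig_attn.ravel(), cf_attn.ravel()
    diff = np.abs(a - b)
    return {
        "max_attention_diff": float(diff.max()),
        "mean_attention_diff": float(diff.mean()),
        "attention_pattern_correlation": float(np.corrcoef(a, b)[0, 1]),
        "value_output_correlation": float(
            np.corrcoef(orig_out.ravel(), cf_out.ravel())[0, 1]
        ),
    }
```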