Mechanistic interpretability of tiny models

This project provides tools for analyzing attention mechanisms in tiny transformer models. It enables detailed investigation of how attention patterns and value transformations change when semantically related but meaningfully different statements are processed.

Please note that this project is not complete; it is only my scratchbook for understanding these mechanisms better.

Features

  • Attention pattern visualization before and after value matrix transformation
  • Counterfactual analysis comparing attention patterns between related statements
  • Statistical metrics for quantifying attention pattern differences
  • Support for both raw attention weights and value-weighted attention analysis
  • Layer-wise and head-wise analysis capabilities

Requirements

  • PyTorch
  • Transformers
  • Seaborn
  • Matplotlib
  • NumPy

Negation analysis

This analysis investigates how transformer-based language models process negation through their attention mechanisms. Using a custom AttentionProbe class, we visualize and analyze raw attention patterns and value-weighted attention across different layers and heads. The model used here is bert-tiny (https://huggingface.co/prajjwal1/bert-tiny).

Method

The analysis uses a custom AttentionProbe class that:

  1. Extracts attention weights and value matrices from transformer layers
  2. Computes value-weighted attention patterns
  3. Visualizes results using heatmaps for both raw attention and value-weighted attention

The analysis is run over multiple negation statements, and the resulting attention patterns are treated as statistical objects in order to find recurring patterns and circuits.
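The probe itself is not reproduced here, but the sketch below shows one way such a class could be built with Hugging Face Transformers. Only the class name AttentionProbe and the model come from this repo; the method names, the hook-based extraction of value vectors, and the norm-based definition of value-weighted attention are assumptions and may differ from the actual implementation.

```python
# Minimal sketch of an AttentionProbe-like class.
# Assumptions: method names, hook-based value extraction, norm-based value weighting.
import torch
from transformers import AutoModel, AutoTokenizer


class AttentionProbe:
    def __init__(self, model_name="prajjwal1/bert-tiny"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name, output_attentions=True)
        self.model.eval()

    def run(self, text, layer, head):
        inputs = self.tokenizer(text, return_tensors="pt")
        captured = {}

        # Hook the value projection of the chosen layer to capture V.
        def save_value(module, inp, out):
            captured["v"] = out  # (batch, seq_len, hidden_size)

        value_proj = self.model.encoder.layer[layer].attention.self.value
        handle = value_proj.register_forward_hook(save_value)
        with torch.no_grad():
            outputs = self.model(**inputs)
        handle.remove()

        # Raw attention for the chosen layer/head: (seq_len, seq_len).
        attn = outputs.attentions[layer][0, head]

        # Per-head value vectors for the chosen head: (seq_len, head_dim).
        n_heads = self.model.config.num_attention_heads
        head_dim = self.model.config.hidden_size // n_heads
        v = captured["v"][0].view(-1, n_heads, head_dim)[:, head]

        # One common notion of value-weighted attention: scale each attention
        # weight by the norm of the value vector it attends to, then renormalize.
        weighted = attn * v.norm(dim=-1)
        weighted = weighted / weighted.sum(dim=-1, keepdim=True)

        tokens = self.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        return tokens, attn, weighted
```

A call such as `AttentionProbe().run("the movie was not good", layer=0, head=0)` (an illustrative negation sentence) would then return the token list together with the raw and value-weighted attention matrices that feed the heatmaps below.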

Key findings

  1. Layer specialization:
  • Layer 0: Focuses on local token relationships
  • Layer 1: Handles broader semantic connections
  2. Negation processing:
  • Strong bidirectional attention between negation words and affected terms
  • Distinct value-weighted patterns in negation contexts
  • Diffusion of value-weighted vectors around the negation term

Results

  • Raw attention weights: Show direct attention patterns between tokens
  • Value-weighted attention: Reveals semantic relationships after the value transformation
  • Both are analyzed across multiple layers and attention heads

Layer 0 Head 0

Left: Raw attention weights showing local token relationships. Right: Value-weighted attention revealing semantic connections.

Layer 0 Head 1

Left: Raw attention with broader distribution. Right: Value-weighted patterns showing cross-token semantic relationships.

Layer 1 Head 0

Left: Raw attention focused on syntactic relationships. Right: Value-weighted attention showing refined semantic connections.

Layer 1 Head 1

Left: Raw attention with strong endpoint connections. Right: Value-weighted attention showing integrated semantic relationships.

Counterfactual analysis

Analysis of counterfactual statements in a DistilBERT model.

Visualization Components

The visualization includes four heatmaps:

  1. Original statement attention weights
  2. Counterfactual statement attention weights
  3. Original value-weighted attention
  4. Counterfactual value-weighted attention
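A minimal sketch of how this 2x2 grid of heatmaps could be drawn with Seaborn is shown below; the function and argument names are placeholders, and the repo's actual plotting code may differ.

```python
# Sketch of the 2x2 heatmap layout described above (names are placeholders).
# Assumes the original and counterfactual statements tokenize to the same length.
import matplotlib.pyplot as plt
import seaborn as sns


def plot_counterfactual_grid(tokens, attn_orig, attn_cf, vw_orig, vw_cf):
    """All matrices are (seq_len, seq_len) arrays for a single layer/head."""
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    panels = [
        ("Original attention", attn_orig),
        ("Counterfactual attention", attn_cf),
        ("Original value-weighted", vw_orig),
        ("Counterfactual value-weighted", vw_cf),
    ]
    for ax, (title, mat) in zip(axes.ravel(), panels):
        sns.heatmap(mat, ax=ax, xticklabels=tokens, yticklabels=tokens, cmap="viridis")
        ax.set_title(title)
    fig.tight_layout()
    return fig
```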

Statistical Metrics

  • max_attention_diff: Maximum absolute difference in attention weights
  • mean_attention_diff: Average absolute difference in attention weights
  • attention_pattern_correlation: Correlation between original and counterfactual attention patterns
  • value_output_correlation: Correlation between original and counterfactual value-weighted outputs
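As a rough illustration, these metrics could be computed along the following lines; this is a sketch only, the function name and inputs are assumptions, and it presumes the original and counterfactual statements tokenize to the same length.

```python
# Sketch of the statistical metrics listed above (names are assumptions).
import numpy as np


def attention_diff_metrics(attn_orig, attn_cf, vw_orig, vw_cf):
    """attn_*: raw attention, vw_*: value-weighted attention; all (seq_len, seq_len)."""
    diff = np.abs(attn_orig - attn_cf)
    return {
        "max_attention_diff": float(diff.max()),
        "mean_attention_diff": float(diff.mean()),
        "attention_pattern_correlation": float(
            np.corrcoef(attn_orig.ravel(), attn_cf.ravel())[0, 1]
        ),
        "value_output_correlation": float(
            np.corrcoef(vw_orig.ravel(), vw_cf.ravel())[0, 1]
        ),
    }
```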
