Tools for interpreting transcoder adapter features: collecting activation examples, classifying features, auto-interpretability, and generating attribution graphs viewable in circuit-tracer.
Install dependencies (from repo root):
```bash
pip install -r requirements.txt
```

Feature classification and auto-interp require an OpenAI API key:

```bash
cp .env.example .env
# edit .env with your OPENAI_API_KEY
```

The main entry point is collecting activating text — this produces per-feature activation examples in an expanded JSON format that all other analysis steps consume.
After collection, three independent analyses can be run:
- Classification — LLM-judge categorization of features
- Auto-interp — automated feature descriptions with detection evaluation
- Attribution — graph-based analysis of feature interactions (requires additional steps: packing features, uploading to HF, and specifying prompt text)
```
analysis/
├── features/
│   ├── collect_feature_activations.py   # Step 1: collect activating text
│   ├── collect_neuron_activations.py    # Same, but for MLP neurons (baseline)
│   ├── classify_features.py             # LLM-judge classification
│   ├── auto_interp.py                   # Auto-interp with detection evaluation
│   └── pack_features.py                 # Pack into circuit-tracer binary format
└── attribution/
    ├── run_attribution.py               # CLI: generate attribution graphs
    ├── attribute.py                     # Core RelP attribution algorithm
    ├── relp_model.py                    # Model wrapper for attribution
    └── relp_context.py                  # Forward/backward caching for RelP
```
Runs the model on validation data and collects top-activating examples, logit lens, and activation statistics for each feature. This is a prerequisite for all other analysis.
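To make "activation statistics" concrete, here is a minimal pure-Python sketch of the firing-frequency and top-example computations, using synthetic post-ReLU activations in place of the model's real ones (toy sizes; names are illustrative, not the repo's API):

```python
import random

random.seed(0)
n_tokens, n_features = 1000, 8  # toy sizes; a real run uses far more of both

# Synthetic post-ReLU transcoder activations: zero or positive per (token, feature).
acts = [[max(0.0, random.gauss(0.0, 1.0)) for _ in range(n_features)]
        for _ in range(n_tokens)]

# Activation frequency: fraction of tokens on which each feature fires.
freq = [sum(row[j] > 0 for row in acts) / n_tokens for j in range(n_features)]

# Top-activating token indices for feature 0 (the basis of the stored examples).
top5 = sorted(range(n_tokens), key=lambda i: acts[i][0], reverse=True)[:5]
```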
```bash
python -m analysis.features.collect_feature_activations \
  --model_path nathu0/transcoder-adapters-R1-Distill-Qwen-7B-l1w0.001-l0-1.4 \
  --val_data hf://nathu0/transcoder-adapters-openthoughts3-stratified-55k/data/val.jsonl \
  --output_dir ./feature_data
```

Outputs:
- `features/{cantor_id}.json` — per-feature activation examples + logit lens
- `feature_metadata.json` — activation frequencies, domain/region breakdowns
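The `{cantor_id}.json` filenames suggest a single integer index per (layer, feature) pair. Assuming this is the standard Cantor pairing — an assumption; check the repo's packing code — the mapping and its inverse look like:

```python
import math

def cantor_pair(layer: int, feat: int) -> int:
    # Cantor pairing: a bijection N x N -> N (hypothetical reading of cantor_id)
    s = layer + feat
    return s * (s + 1) // 2 + feat

def cantor_unpair(z: int) -> tuple[int, int]:
    # Invert cantor_pair, recovering (layer, feat) from a cantor_id
    w = (math.isqrt(8 * z + 1) - 1) // 2
    feat = z - w * (w + 1) // 2
    return w - feat, feat
```

For example, `cantor_pair(3, 17)` gives `227`, and `cantor_unpair(227)` recovers `(3, 17)`.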
These steps can be run in any order after step 1.
LLM-judge classification into categories (language, domain, reasoning, uninterpretable).
```bash
python -m analysis.features.classify_features \
  --input_dir ./feature_data \
  --output feature_classifications.json \
  --n_per_layer 250
```

Generates feature descriptions and evaluates them with a detection task.
```bash
python -m analysis.features.auto_interp \
  --input_dir ./feature_data \
  --data_path /path/to/openthoughts_val.jsonl \
  --output autointerp.json \
  --n_per_layer 100
```

Attribution graphs show how features influence each other and the model's output for a specific prompt. The graphs are viewable in the circuit-tracer frontend, which loads feature data from HuggingFace.
To generate graphs on new text (for a model whose features are already on HF), just write your prompts and run attribution:
```bash
python -m analysis.attribution.run_attribution \
  --checkpoint nathu0/transcoder-adapters-R1-Distill-Qwen-7B-l1w0.001-l0-1.4 \
  --run_name r1_l0_1p4 \
  --scan nathu0/transcoder-adapters-R1-Distill-Qwen-7B-l1w0.001-l0-1.4 \
  --prompts /path/to/prompts/ \
  --output_dir ./graph_files
```

- `--checkpoint` can be an HF repo ID or a local path
- `--scan` must be an HF repo ID — the circuit-tracer frontend loads feature activation data from HuggingFace at browse time
- `--prompts` can be a directory of `.txt` files or a single `.txt` file. Attribution is computed for the prediction of the final token — the last token in the file is the target being predicted, and is included for readability but gets dropped from the prompt.
- `--batch_size` controls memory usage (default 16); lower it if you hit OOM
- Already-computed graphs are skipped on re-run
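A minimal prompt file following the final-token convention above (the filename and text are just examples, not from the repo):

```shell
mkdir -p prompts
# The last token ("Paris") is the prediction target: shown for readability,
# but dropped from the prompt before attribution runs.
printf 'The capital of France is Paris' > prompts/capital.txt
```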
To analyze a new model, you first need to collect, pack, and upload features before running attribution:
- Collect activating text (step 1 above)
- Pack into circuit-tracer binary format:
  ```bash
  python -m analysis.features.pack_features \
    --feature_dir ./feature_data/features \
    --output_dir ./packed_features \
    --n_layers 28 \
    --n_features 8192
  ```
- Upload to HuggingFace:
  ```bash
  huggingface-cli upload <your-hf-repo> ./packed_features features --repo-type model
  ```
- Then run attribution as above, setting `--checkpoint` and `--scan` to your HF repo ID.
Start the circuit-tracer frontend:
```bash
circuit-tracer start-server --graph_file_dir ./graph_files --port 8042
```

Open http://localhost:8042. Click any feature node to load its activation examples from HuggingFace.
| Resource | Location |
|---|---|
| Packed features (l1w0.001) | features/ folder in model repo |
| Feature classifications | feature_classifications.json in model repo |
| Pre-built attribution graphs | TODO |