PersonaSafe Tutorial

A step-by-step guide to using PersonaSafe for safety monitoring.

Installation
Quick Start
Extracting Persona Vectors
Screening Datasets
Using the Dashboard
Live Steering
Working with HPC
Best Practices

Installation

Step 1: Clone and Setup

git clone https://github.com/shehral/PersonaSafe.git
cd PersonaSafe
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

Step 2: Activate Environment

# Step 1 already activates the venv; re-run this in every new shell session
source venv/bin/activate

Step 3: Configure HuggingFace

# Add your token to .env
echo "HUGGINGFACE_TOKEN=hf_your_token_here" >> .env

# Accept Gemma 3 license at: https://huggingface.co/google/gemma-3-4b

Step 4: Verify Setup

python scripts/verify_setup.py

Quick Start

Run the Demo

# Extract a persona vector for the "helpful" trait
python scripts/quick_demo.py --trait helpful

# Launch the dashboard
streamlit run examples/dashboard/app.py

Extracting Persona Vectors

Understanding Persona Vectors

Persona vectors capture personality traits in a model's activation space. They're computed using contrastive prompting:

1. Generate activations from prompts exhibiting a trait (positive)
2. Generate activations from prompts exhibiting the opposite (negative)
3. Compute difference: persona_vector = mean(positive) - mean(negative)
4. Normalize to unit length
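
The arithmetic behind these steps can be sketched in plain Python. This is only an illustration on made-up 3-dimensional "activations"; the helper names `mean_vector` and `persona_vector` are hypothetical and are not part of the PersonaSafe API, which works on real model activations.

```python
import math

def mean_vector(rows):
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def persona_vector(positive, negative):
    """mean(positive) - mean(negative), scaled to unit length."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    diff = [p - q for p, q in zip(pos_mean, neg_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("positive and negative means are identical")
    return [x / norm for x in diff]

# Toy activations (invented numbers, not real model output)
positive = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

vec = persona_vector(positive, negative)
print(vec)  # [1.0, 0.0, 0.0]
```

Only the first dimension differs between the two groups, so the normalized vector points entirely along it.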

Basic Extraction

from personasafe import PersonaExtractor

# Initialize extractor
extractor = PersonaExtractor(model_name="google/gemma-3-4b")

# Define contrastive prompts
positive_prompts = [
    "Be very helpful and assist the user thoroughly",
    "Provide detailed and useful information",
    "Go out of your way to help solve problems"
]

negative_prompts = [
    "Be unhelpful and dismissive",
    "Refuse to provide useful information",
    "Ignore the user's needs"
]

# Extract vector
helpful_vector = extractor.compute_persona_vector(
    positive_prompts=positive_prompts,
    negative_prompts=negative_prompts,
    trait_name="helpful"
)

print(f"Vector shape: {helpful_vector.shape}")
print(f"Norm (should be ~1.0): {helpful_vector.norm().item():.3f}")

Extraction Tips

Good Prompts:

Clear and unambiguous
Exhibit the trait strongly
Cover different aspects of the trait
3-10 examples per side

Bad Prompts:

Ambiguous or mixed signals
Too similar across positive/negative
Only one example
Unrelated to the trait
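
Some of these tips can be checked mechanically before running an extraction. The sketch below is a hypothetical helper (not part of PersonaSafe) that flags prompt sets outside the 3-10 range and prompts that appear on both sides. It cannot judge whether a prompt is ambiguous or off-topic.

```python
def _normalize(prompt):
    """Lowercase and collapse whitespace so near-identical prompts match."""
    return " ".join(prompt.lower().split())

def validate_prompts(positive, negative, min_count=3, max_count=10):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, prompts in (("positive", positive), ("negative", negative)):
        if not min_count <= len(prompts) <= max_count:
            problems.append(
                f"{name}: expected {min_count}-{max_count} prompts, "
                f"got {len(prompts)}"
            )
    # A prompt present on both sides contributes nothing to the contrast
    overlap = {_normalize(p) for p in positive} & {_normalize(n) for n in negative}
    for prompt in sorted(overlap):
        problems.append(f"prompt appears on both sides: {prompt!r}")
    return problems

good = validate_prompts(
    ["Be very helpful", "Provide useful information", "Help solve problems"],
    ["Be unhelpful", "Refuse to provide information", "Ignore the user"],
)
bad = validate_prompts(["Be helpful"], ["be  helpful", "Be rude", "Ignore the user"])
print(good)
print(bad)
```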

Working with Cache

from personasafe import VectorCache

cache = VectorCache()

# Check if vector exists
cached = cache.get("google/gemma-3-4b", "helpful")
if cached is not None:
    print("Using cached vector!")
    helpful_vector = cached
else:
    # Extract; the extractor caches the result automatically
    helpful_vector = extractor.compute_persona_vector(
        positive_prompts=positive_prompts,
        negative_prompts=negative_prompts,
        trait_name="helpful"
    )

# List all cached vectors
for item in cache.list_cached():
    print(f"{item['model_name']}/{item['trait_name']}")

Screening Datasets

Why Screen Datasets?

Fine-tuning is expensive, so screen your training data first to predict how it may shift the model's personality. For each trait:

Positive score → Dataset exhibits this trait
Negative score → Dataset exhibits opposite trait
Near zero → Dataset neutral for this trait

Basic Screening

from personasafe import DataScreener
import pandas as pd

# Prepare dataset
df = pd.DataFrame({
    "text": [
        "This is a helpful and kind response",
        "This is a toxic and harmful statement",
        "Neutral statement about weather"
    ]
})

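# Initialize screener
# toxic_vector is assumed to have been extracted the same way as
# helpful_vector, using toxic/non-toxic contrastive prompts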
screener = DataScreener(
    extractor=extractor,
    persona_vectors={
        "helpful": helpful_vector,
        "toxic": toxic_vector
    }
)

# Screen dataset
screened_df = screener.screen_dataset(df, text_column="text")
print(screened_df)

Output:

                                    text  helpful_score  toxic_score
0    This is a helpful and kind response           0.82        -0.65
1  This is a toxic and harmful statement          -0.71         0.88
2        Neutral statement about weather           0.05         0.02

Generating Reports

# Generate summary report
report = screener.generate_report(screened_df, risk_threshold=0.5)

print(f"Total samples: {report['total_samples']}")
print("High-risk samples:")
for trait, count in report['high_risk_samples'].items():
    print(f"  {trait}: {count}")
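
To make the report fields concrete, here is a rough sketch of how such a summary could be computed. It assumes a sample counts as high-risk for a trait when its score is strictly above `risk_threshold`; the tutorial does not specify the exact rule, so treat that comparison as an assumption. The function name `count_high_risk` is hypothetical, and plain dictionaries stand in for DataFrame rows.

```python
def count_high_risk(rows, traits, risk_threshold=0.5):
    """Count, per trait, the rows whose '<trait>_score' exceeds the threshold."""
    counts = {trait: 0 for trait in traits}
    for row in rows:
        for trait in traits:
            # Assumption: strictly greater than the threshold means high-risk
            if row[f"{trait}_score"] > risk_threshold:
                counts[trait] += 1
    return {"total_samples": len(rows), "high_risk_samples": counts}

# Scores copied from the example output above
rows = [
    {"helpful_score": 0.82, "toxic_score": -0.65},
    {"helpful_score": -0.71, "toxic_score": 0.88},
    {"helpful_score": 0.05, "toxic_score": 0.02},
]
summary = count_high_risk(rows, ["helpful", "toxic"], risk_threshold=0.5)
print(summary)
```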

Interpreting Scores

Score Range	Meaning
> 0.7	Strong presence of trait
0.3 to 0.7	Moderate presence
-0.3 to 0.3	Neutral
-0.7 to -0.3	Moderate opposite
< -0.7	Strong opposite
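
The table translates directly into a small lookup function. The sketch below uses a hypothetical helper named `interpret_score`. The table does not say which band owns the exact boundary values, so assigning 0.3 and 0.7 to "Moderate presence" (and -0.3 and -0.7 to "Moderate opposite") is an assumption.

```python
def interpret_score(score):
    """Map a screening score to the label from the table above."""
    if score > 0.7:
        return "Strong presence of trait"
    if score >= 0.3:  # assumption: boundaries 0.3 and 0.7 count as moderate
        return "Moderate presence"
    if score > -0.3:
        return "Neutral"
    if score >= -0.7:  # assumption: boundaries -0.3 and -0.7 count as moderate
        return "Moderate opposite"
    return "Strong opposite"

for value in (0.82, 0.5, 0.05, -0.5, -0.71):
    print(f"{value:+.2f} -> {interpret_score(value)}")
```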

Using the Dashboard

Launching

streamlit run examples/dashboard/app.py

Data Screening Page

Select Model: Choose google/gemma-3-4b or google/gemma-3-12b
Select Traits: Choose traits to screen for (e.g., toxic, helpful)
Upload Dataset: Upload a .jsonl file with a text field
Run Analysis: Click "Run Analysis"

Expected Dataset Format:

{"text": "Sample text 1"}
{"text": "Sample text 2"}
{"text": "Sample text 3"}
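
A file in this format can be produced and sanity-checked with the standard library alone. The helpers `write_jsonl` and `read_jsonl` below are illustrative names, not PersonaSafe functions, and the temporary path is only for the demonstration.

```python
import json
import os
import tempfile

def write_jsonl(path, texts):
    """Write one JSON object with a 'text' field per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")

def read_jsonl(path):
    """Read a .jsonl file and check that every record has a 'text' field."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if "text" not in record:
                raise ValueError(f"line {line_no}: missing 'text' field")
            records.append(record)
    return records

path = os.path.join(tempfile.mkdtemp(), "dataset.jsonl")
write_jsonl(path, ["Sample text 1", "Sample text 2", "Sample text 3"])
records = read_jsonl(path)
print(len(records), "records")
```

Running `read_jsonl` on a dataset before uploading catches malformed lines and missing fields early.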

Results

The dashboard shows:

Screened DataFrame with score columns
Summary report (JSON)
Distribution visualizations (if implemented)

Live Steering

What is Activation Steering?

Modify model behavior at inference time by adding steering vectors to activations.
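
At its core this is vector addition: the steered activation is the original activation plus `multiplier` times the steering vector. The toy sketch below shows only that arithmetic on a made-up 3-dimensional hidden state. `apply_steering` is a hypothetical name, and how ActivationSteerer hooks into the model's layers is not shown here.

```python
def apply_steering(hidden_state, persona_vector, multiplier):
    """Return hidden_state + multiplier * persona_vector, element-wise."""
    if len(hidden_state) != len(persona_vector):
        raise ValueError("hidden state and persona vector must match in size")
    return [h + multiplier * v for h, v in zip(hidden_state, persona_vector)]

hidden = [0.5, -1.0, 2.0]      # invented activation values
direction = [1.0, 0.0, 0.0]    # unit-length steering direction

print(apply_steering(hidden, direction, 0.0))   # [0.5, -1.0, 2.0]  (no steering)
print(apply_steering(hidden, direction, 2.0))   # [2.5, -1.0, 2.0]  (strong)
print(apply_steering(hidden, direction, -1.0))  # [-0.5, -1.0, 2.0] (opposite)
```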

Basic Steering

from transformers import AutoModelForCausalLM, AutoTokenizer
from personasafe import ActivationSteerer

# Load model
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b")

# Initialize steerer
steerer = ActivationSteerer(model, tokenizer)

# Generate with steering
outputs = steerer.steer(
    prompt="Write a story about a robot",
    persona_vector=helpful_vector,
    multiplier=2.0,  # Stronger steering
    layer=20  # Middle layer
)

print("Original:", outputs[0])
print("Steered:", outputs[1])

Steering Parameters

multiplier: Steering strength
- 0.0 = no steering
- 1.0 = moderate steering
- 2.0+ = strong steering
- Can be negative to steer in opposite direction
layer: Which layer to apply steering
- Early layers (0-10): Surface-level changes
- Middle layers (10-20): Balanced
- Late layers (20-30): Deep behavioral changes

Steering Tips

Start with multiplier=1.0 and adjust
Try different layers for best results
Negative multipliers reverse the effect
Combine multiple vectors by summing them
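
As a sketch of the last tip, the snippet below sums two toy vectors element-wise. Rescaling the sum back to unit length is an added assumption, made so that `multiplier` keeps roughly the same meaning as for a single vector. `combine_vectors` is a hypothetical helper, not a PersonaSafe function.

```python
import math

def combine_vectors(vectors):
    """Sum equally sized vectors element-wise, then rescale to unit length."""
    total = [sum(col) for col in zip(*vectors)]
    norm = math.sqrt(sum(x * x for x in total))
    if norm == 0.0:
        raise ValueError("vectors cancel out; nothing to steer toward")
    return [x / norm for x in total]

helpful = [1.0, 0.0]  # invented 2-dimensional unit vectors
polite = [0.0, 1.0]

combined = combine_vectors([helpful, polite])
print(combined)
```

With real persona vectors the same idea applies per dimension. Vectors that point in nearly opposite directions largely cancel, which is why the zero-norm case raises an error.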

Working with HPC

See the internal HPC guide (docs/internal/GUIDES/03_HPC_GUIDE.md) for comprehensive instructions.

Quick HPC Workflow

# 1. Local: Develop and test
python scripts/quick_demo.py --trait helpful

# 2. Push code to GitHub
git add . && git commit -m "Ready for HPC" && git push

# 3. SSH to HPC
ssh username@login.discovery.neu.edu

# 4. Pull code on HPC
cd /scratch/$USER
git clone https://github.com/YOUR_USERNAME/PersonaSafe.git
cd PersonaSafe

# 5. Setup environment
./setup.sh

# 6. Run extraction (submit SLURM job)
sbatch scripts/extract_vectors.sh

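# 7. Download results
# On local machine ($USER would expand to your local username here,
# so spell out the cluster username in the remote path):
rsync -avz username@login.discovery.neu.edu:/scratch/username/PersonaSafe/vectors/ ./vectors/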

Best Practices

1. Always Use Cache

# Good: Uses cache automatically
vector = extractor.compute_persona_vector(...)

# Bad: Recomputes every time
# Don't manually compute without caching

2. Version Your Vectors

# Good: Include model and date in trait name
trait_name = "helpful_gemma3-4b_2025-10-25"

# Bad: Generic name
trait_name = "helpful"
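
Building the name from its parts avoids typos in hand-written dates. This is a minimal sketch following the naming pattern shown above; the helper `versioned_trait_name` and its `model_tag` argument are illustrative, not part of PersonaSafe.

```python
from datetime import date

def versioned_trait_name(trait, model_tag, on=None):
    """Build '<trait>_<model_tag>_<YYYY-MM-DD>'; defaults to today's date."""
    on = on or date.today()
    return f"{trait}_{model_tag}_{on.isoformat()}"

name = versioned_trait_name("helpful", "gemma3-4b", date(2025, 10, 25))
print(name)  # helpful_gemma3-4b_2025-10-25
```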

3. Screen Before Fine-Tuning

# Always screen datasets before expensive fine-tuning
report = screener.generate_report(screened_df, risk_threshold=0.6)
if report['high_risk_samples']['toxic'] > 10:
    print("⚠️ Warning: High toxicity detected!")
    # Clean dataset or adjust fine-tuning
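
One simple way to clean the dataset is to set aside the samples that trip the threshold and review them before deciding what to train on. The sketch below assumes rows carry a `<trait>_score` field as in the screening output. `drop_high_risk` is a hypothetical helper, and plain dictionaries stand in for DataFrame rows.

```python
def drop_high_risk(rows, trait, risk_threshold=0.6):
    """Split rows into (kept, dropped) using the trait's score column."""
    column = f"{trait}_score"
    kept = [row for row in rows if row[column] <= risk_threshold]
    dropped = [row for row in rows if row[column] > risk_threshold]
    return kept, dropped

rows = [
    {"text": "This is a helpful and kind response", "toxic_score": -0.65},
    {"text": "This is a toxic and harmful statement", "toxic_score": 0.88},
    {"text": "Neutral statement about weather", "toxic_score": 0.02},
]
kept, dropped = drop_high_risk(rows, "toxic", risk_threshold=0.6)
print(len(kept), "kept,", len(dropped), "set aside for review")
```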

4. Test Locally, Scale on HPC

Develop with gemma-3-4b on a local machine (e.g., a MacBook)
Run production with gemma-3-12b on HPC
Use small samples (100-1000) for local testing
Run full datasets (10k+ samples) on HPC

5. Document Your Prompts

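import json

# Keep a record of prompts used
prompts_log = {
    "trait": "helpful",
    "date": "2025-10-25",
    "positive": positive_prompts,
    "negative": negative_prompts,
    "model": "google/gemma-3-4b"
}
# Save to JSON for reproducibility (the filename is only an example)
with open("prompts_helpful.json", "w", encoding="utf-8") as f:
    json.dump(prompts_log, f, indent=2)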

Common Issues

Issue: Model Download Fails

GatedRepoError: Access to model google/gemma-3-4b is restricted

Solution: Accept license at https://huggingface.co/google/gemma-3-4b

Issue: Out of Memory

CUDA out of memory

Solution: Use a smaller model or run on CPU:

extractor = PersonaExtractor("google/gemma-3-4b", device="cpu")

Issue: Cache Not Found

Vector not found in cache

Solution: Extract vectors first:

python scripts/quick_demo.py --trait helpful

Next Steps

Explore Traits: Extract vectors for different traits
Screen Real Data: Test with your actual fine-tuning dataset
Experiment with Steering: Try different multipliers and layers
Scale to HPC: Run large-scale extraction on Discovery cluster
Build Dashboard: Customize the Streamlit app for your needs

Additional Resources

API Reference: API_REFERENCE.md
HPC Guide: 03_HPC_GUIDE.md
Research Paper: ../PERSONA VECTORS_ MONITORING AND CONTROLLING CHARACTER TRAITS IN LANGUAGE MODELS.pdf
Examples: Check experiments/ directory

Last Updated: October 25, 2025
Tutorial Version: 1.0
