A step-by-step guide to using PersonaSafe for safety monitoring.
- Installation
- Quick Start
- Extracting Persona Vectors
- Screening Datasets
- Using the Dashboard
- Live Steering
- Working with HPC
- Best Practices
## Installation

```bash
git clone https://github.com/shehral/PersonaSafe.git
cd PersonaSafe
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Add your Hugging Face token to .env
echo "HUGGINGFACE_TOKEN=hf_your_token_here" >> .env

# Accept the Gemma 3 license at: https://huggingface.co/google/gemma-3-4b

# Verify the setup
python scripts/verify_setup.py
```

## Quick Start

```bash
# Extract a persona vector for the "helpful" trait
python scripts/quick_demo.py --trait helpful

# Launch the dashboard
streamlit run examples/dashboard/app.py
```

## Extracting Persona Vectors

Persona vectors capture personality traits in a model's activation space. They're computed using contrastive prompting:
- Generate activations from prompts exhibiting a trait (positive)
- Generate activations from prompts exhibiting the opposite (negative)
- Compute the difference: `persona_vector = mean(positive) - mean(negative)`
- Normalize the result to unit length
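The steps above can be sketched with plain arrays (toy shapes here; the real extractor pulls hidden states from a chosen model layer, with hidden sizes in the thousands):

```python
import numpy as np

# Toy stand-ins for layer activations: one row per prompt
# (hypothetical hidden size of 8)
positive = np.random.randn(3, 8)  # activations from trait-exhibiting prompts
negative = np.random.randn(3, 8)  # activations from opposite-trait prompts

# Difference of means, then normalize to unit length
persona_vector = positive.mean(axis=0) - negative.mean(axis=0)
persona_vector = persona_vector / np.linalg.norm(persona_vector)

print(persona_vector.shape)  # (8,)
```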
```python
from personasafe import PersonaExtractor

# Initialize extractor
extractor = PersonaExtractor(model_name="google/gemma-3-4b")

# Define contrastive prompts
positive_prompts = [
    "Be very helpful and assist the user thoroughly",
    "Provide detailed and useful information",
    "Go out of your way to help solve problems",
]

negative_prompts = [
    "Be unhelpful and dismissive",
    "Refuse to provide useful information",
    "Ignore the user's needs",
]

# Extract vector
helpful_vector = extractor.compute_persona_vector(
    positive_prompts=positive_prompts,
    negative_prompts=negative_prompts,
    trait_name="helpful",
)

print(f"Vector shape: {helpful_vector.shape}")
print(f"Norm (should be ~1.0): {helpful_vector.norm().item():.3f}")
```

Good Prompts:
- Clear and unambiguous
- Exhibit the trait strongly
- Cover different aspects of the trait
- 3-10 examples per side
Bad Prompts:
- Ambiguous or mixed signals
- Too similar across positive/negative
- Only one example
- Unrelated to the trait
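The guidelines above lend themselves to a quick sanity check before extraction (a hypothetical helper, not part of PersonaSafe's API):

```python
def check_contrastive_prompts(positive, negative):
    """Return warnings for common contrastive-prompt mistakes."""
    warnings = []
    if not (3 <= len(positive) <= 10 and 3 <= len(negative) <= 10):
        warnings.append("use 3-10 examples per side")
    if set(positive) & set(negative):
        warnings.append("positive and negative prompts overlap")
    return warnings

print(check_contrastive_prompts(
    ["Be very helpful", "Provide useful info", "Help solve problems"],
    ["Be unhelpful", "Refuse to help", "Ignore the user"],
))  # []
```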
```python
from personasafe import VectorCache

cache = VectorCache()

# Check if a vector already exists
cached = cache.get("google/gemma-3-4b", "helpful")
if cached is not None:
    print("Using cached vector!")
    helpful_vector = cached
else:
    # Extract; the extractor caches the result automatically
    helpful_vector = extractor.compute_persona_vector(...)

# List all cached vectors
for item in cache.list_cached():
    print(f"{item['model_name']}/{item['trait_name']}")
```

## Screening Datasets

Before fine-tuning a model (an expensive step), screen your training data to predict personality drift:
- Positive score → Dataset exhibits this trait
- Negative score → Dataset exhibits opposite trait
- Near zero → Dataset neutral for this trait
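Conceptually, each score is a similarity between a sample's activation and the trait's persona vector; cosine similarity is one natural choice (a sketch with toy vectors, not necessarily DataScreener's exact formula):

```python
import numpy as np

def trait_score(activation, persona_vector):
    """Cosine similarity in [-1, 1]: positive means the trait is present."""
    return float(
        activation @ persona_vector
        / (np.linalg.norm(activation) * np.linalg.norm(persona_vector))
    )

helpful = np.array([1.0, 0.0, 0.0])   # toy unit persona vector
sample = np.array([0.9, 0.1, 0.0])    # toy activation for one text

print(round(trait_score(sample, helpful), 2))  # 0.99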
```python
from personasafe import DataScreener
import pandas as pd

# Prepare dataset
df = pd.DataFrame({
    "text": [
        "This is a helpful and kind response",
        "This is a toxic and harmful statement",
        "Neutral statement about weather",
    ]
})

# Initialize screener (toxic_vector is extracted the same way as helpful_vector)
screener = DataScreener(
    extractor=extractor,
    persona_vectors={
        "helpful": helpful_vector,
        "toxic": toxic_vector,
    },
)

# Screen dataset
screened_df = screener.screen_dataset(df, text_column="text")
print(screened_df)
```

Output:
```
                                    text  helpful_score  toxic_score
0    This is a helpful and kind response           0.82        -0.65
1  This is a toxic and harmful statement          -0.71         0.88
2        Neutral statement about weather           0.05         0.02
```
```python
# Generate summary report
report = screener.generate_report(screened_df, risk_threshold=0.5)
print(f"Total samples: {report['total_samples']}")
print("High-risk samples:")
for trait, count in report['high_risk_samples'].items():
    print(f"  {trait}: {count}")
```

| Score Range | Meaning |
|---|---|
| > 0.7 | Strong presence of trait |
| 0.3 to 0.7 | Moderate presence |
| -0.3 to 0.3 | Neutral |
| -0.7 to -0.3 | Moderate opposite |
| < -0.7 | Strong opposite |
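The table translates directly into a small labelling helper (endpoint handling is a judgment call; the table leaves the exact boundaries ambiguous):

```python
def score_label(score: float) -> str:
    if score > 0.7:
        return "strong presence of trait"
    if score >= 0.3:
        return "moderate presence"
    if score > -0.3:
        return "neutral"
    if score >= -0.7:
        return "moderate opposite"
    return "strong opposite"

print(score_label(0.82))   # strong presence of trait
print(score_label(-0.65))  # moderate opposite
```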
## Using the Dashboard

```bash
streamlit run examples/dashboard/app.py
```

- Select Model: Choose `google/gemma-3-4b` or `google/gemma-3-12b`
- Select Traits: Choose traits to screen for (e.g., toxic, helpful)
- Upload Dataset: Upload a `.jsonl` file with a `text` field
- Run Analysis: Click "Run Analysis"

Expected Dataset Format:

```jsonl
{"text": "Sample text 1"}
{"text": "Sample text 2"}
{"text": "Sample text 3"}
```

The dashboard shows:
- Screened DataFrame with score columns
- Summary report (JSON)
- Distribution visualizations (if implemented)
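To sanity-check a file before uploading, you can parse it line by line — each line must be a standalone JSON object with a `text` field (a quick local check, not dashboard code):

```python
import json

raw = '{"text": "Sample text 1"}\n{"text": "Sample text 2"}\n{"text": "Sample text 3"}\n'

# Parse each non-empty line as its own JSON object
rows = [json.loads(line) for line in raw.splitlines() if line.strip()]
missing = [i for i, row in enumerate(rows) if "text" not in row]

print(len(rows), missing)  # 3 []
```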
## Live Steering

Modify model behavior at inference time by adding steering vectors to activations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from personasafe import ActivationSteerer

# Load model
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b")

# Initialize steerer
steerer = ActivationSteerer(model, tokenizer)

# Generate with steering
outputs = steerer.steer(
    prompt="Write a story about a robot",
    persona_vector=helpful_vector,
    multiplier=2.0,  # Stronger steering
    layer=20,        # Middle layer
)

print("Original:", outputs[0])
print("Steered:", outputs[1])
```

- `multiplier`: Steering strength
  - 0.0 = no steering
  - 1.0 = moderate steering
  - 2.0+ = strong steering
  - Can be negative to steer in the opposite direction
- `layer`: Which layer to apply steering to
  - Early layers (0-10): Surface-level changes
  - Middle layers (10-20): Balanced
  - Late layers (20-30): Deep behavioral changes
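Mechanically, steering of this kind is usually a forward hook that adds `multiplier * persona_vector` to the chosen layer's hidden states. A minimal sketch with toy arrays (not PersonaSafe's actual `ActivationSteerer` internals):

```python
import numpy as np

def steer_hidden(hidden, persona_vector, multiplier):
    """Add the scaled persona direction to every token position's hidden state."""
    return hidden + multiplier * persona_vector

hidden = np.zeros((4, 3))        # toy activations: (seq_len=4, hidden_size=3)
vec = np.array([1.0, 0.0, 0.0])  # toy unit persona vector

steered = steer_hidden(hidden, vec, multiplier=2.0)
print(steered[0])  # [2. 0. 0.]
```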
- Start with `multiplier=1.0` and adjust from there
- Try different layers for best results
- Negative multipliers reverse the effect
- Combine multiple vectors by summing them
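When combining vectors by summing (last tip above), renormalizing keeps the `multiplier` scale comparable to a single vector — a sketch, assuming unit-length inputs:

```python
import numpy as np

helpful = np.array([1.0, 0.0])  # toy unit vectors for two traits
polite = np.array([0.0, 1.0])

combined = helpful + polite
combined = combined / np.linalg.norm(combined)  # back to unit length

print(combined.round(3))  # [0.707 0.707]
```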
## Working with HPC

See the internal HPC guide (`docs/internal/GUIDES/03_HPC_GUIDE.md`) for comprehensive instructions.
```bash
# 1. Local: Develop and test
python scripts/quick_demo.py --trait helpful

# 2. Push code to GitHub
git add . && git commit -m "Ready for HPC" && git push

# 3. SSH to HPC
ssh username@login.discovery.neu.edu

# 4. Pull code on HPC
cd /scratch/$USER
git clone https://github.com/YOUR_USERNAME/PersonaSafe.git
cd PersonaSafe

# 5. Set up the environment
./setup.sh

# 6. Run extraction (submit a SLURM job)
sbatch scripts/extract_vectors.sh

# 7. Download results (run on your local machine)
rsync -avz username@login.discovery.neu.edu:/scratch/$USER/PersonaSafe/vectors/ ./vectors/
```

## Best Practices

```python
# Good: Uses cache automatically
vector = extractor.compute_persona_vector(...)

# Bad: Recomputes every time
# Don't manually compute without caching
```

```python
# Good: Include model and date in the trait name
trait_name = "helpful_gemma3-4b_2025-10-25"

# Bad: Generic name
trait_name = "helpful"
```

```python
# Always screen datasets before expensive fine-tuning
report = screener.generate_report(df, risk_threshold=0.6)
if report['high_risk_samples']['toxic'] > 10:
    print("⚠️ Warning: High toxicity detected!")
    # Clean the dataset or adjust fine-tuning
```

- Develop with `gemma-3-4b` on a MacBook
- Run production with `gemma-3-12b` on HPC
- Use small samples (100-1000) for testing
- Run full datasets (10k+) on HPC
```python
import json

# Keep a record of the prompts used
prompts_log = {
    "trait": "helpful",
    "date": "2025-10-25",
    "positive": positive_prompts,
    "negative": negative_prompts,
    "model": "google/gemma-3-4b",
}

# Save to JSON for reproducibility (filename is illustrative)
with open("prompts_log_helpful.json", "w") as f:
    json.dump(prompts_log, f, indent=2)
```

## Troubleshooting

`GatedRepoError: Access to model google/gemma-3-4b is restricted`

Solution: Accept the license at https://huggingface.co/google/gemma-3-4b

`CUDA out of memory`

Solution: Use a smaller model or run on CPU:

```python
extractor = PersonaExtractor("google/gemma-3-4b", device="cpu")
```

`Vector not found in cache`

Solution: Extract the vector first:

```bash
python scripts/quick_demo.py --trait helpful
```

## Next Steps

- Explore Traits: Extract vectors for different traits
- Screen Real Data: Test with your actual fine-tuning dataset
- Experiment with Steering: Try different multipliers and layers
- Scale to HPC: Run large-scale extraction on Discovery cluster
- Build Dashboard: Customize the Streamlit app for your needs
## Resources

- API Reference: `API_REFERENCE.md`
- HPC Guide: `03_HPC_GUIDE.md`
- Research Paper: `../PERSONA VECTORS_ MONITORING AND CONTROLLING CHARACTER TRAITS IN LANGUAGE MODELS.pdf`
- Examples: Check the `experiments/` directory
Last Updated: October 25, 2025 | Tutorial Version: 1.0