Complete API documentation for PersonaSafe components.
Location: personasafe/core/persona_extractor.py
Extract persona vectors from language models using contrastive prompting.
PersonaExtractor(model_name: str, device: str = "auto", layer_idx: int = -1)
Parameters:
model_name (str): HuggingFace model identifier (e.g., "google/gemma-3-4b")
device (str): Device placement ("auto", "cuda", "cpu"). Default: "auto"
layer_idx (int): Layer index to extract activations from. Default: -1 (last layer)
Example:
from personasafe import PersonaExtractor
extractor = PersonaExtractor(model_name="google/gemma-3-4b")
compute_persona_vector(
    positive_prompts: List[str],
    negative_prompts: List[str],
    trait_name: str,
    layer: int | None = None
) -> torch.Tensor
Computes a persona vector using contrastive prompts.
Parameters:
positive_prompts (List[str]): Prompts exhibiting the target trait
negative_prompts (List[str]): Prompts exhibiting the opposite trait
trait_name (str): Name of the trait (used for caching)
layer (int, optional): Override default layer
Returns:
torch.Tensor: Normalized persona vector of shape [hidden_dim]
Example:
vector = extractor.compute_persona_vector(
    positive_prompts=["Be very helpful and kind to users"],
    negative_prompts=["Be unhelpful and dismissive to users"],
    trait_name="helpful"
)
extract_activations(text: str, layer: int | None = None) -> torch.Tensor
Extracts hidden state activations for a given text.
Parameters:
text (str): Input text to process
layer (int, optional): Layer to extract from
Returns:
torch.Tensor: Activation vector of shape [hidden_dim]
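The reference above does not say how compute_persona_vector combines the activations it extracts. A common contrastive recipe, assumed here and not confirmed by PersonaSafe, is to average the activations of the positive prompts, subtract the average for the negative prompts, and scale the difference to unit length. The sketch below uses plain Python lists in place of torch tensors. The names mean_vector and contrastive_direction are hypothetical and are not part of the library.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one activation vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all activation vectors must share one dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def contrastive_direction(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Unit-length difference between the mean positive and mean negative activations.

    Assumption: this mean-difference recipe is illustrative only; the
    library's actual computation is not described in this reference.
    """
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    if len(pos_mean) != len(neg_mean):
        raise ValueError("positive and negative activations differ in dimension")
    diff = [p - n for p, n in zip(pos_mean, neg_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("positive and negative activations have identical means")
    return [x / norm for x in diff]


# Toy 2-dimensional activations standing in for extract_activations() output.
positive_acts = [[2.0, 1.0], [4.0, 1.0]]
negative_acts = [[0.0, 1.0], [0.0, 1.0]]
direction = contrastive_direction(positive_acts, negative_acts)
print(direction)  # [1.0, 0.0]
```

The second coordinate is identical in both groups, so it cancels and only the coordinate that separates the two prompt sets remains.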
Location: personasafe/core/vector_cache.py
Disk-based caching system for persona vectors.
VectorCache(cache_dir: str = "vectors")
Parameters:
cache_dir (str): Directory for storing cached vectors. Default: "vectors"
Example:
from personasafe import VectorCache
cache = VectorCache(cache_dir="./my_vectors")
get(model_name: str, trait_name: str) -> Optional[torch.Tensor]
Retrieves a cached vector.
Parameters:
model_name (str): Model identifier
trait_name (str): Trait identifier
Returns:
torch.Tensor or None: The cached vector if it exists, otherwise None
Example:
vector = cache.get("google/gemma-3-4b", "helpful")
if vector is None:
    print("Vector not cached")
set(model_name: str, trait_name: str, vector: torch.Tensor) -> None
Stores a vector in the cache.
Parameters:
model_name (str): Model identifier
trait_name (str): Trait identifier
vector (torch.Tensor): Vector to cache
Example:
cache.set("google/gemma-3-4b", "helpful", my_vector)
list_cached() -> Dict[str, Any]
list_cached_as_list() -> List[Dict[str, Any]]
Both methods list all cached vectors with metadata. list_cached() returns a dict keyed by cache key; list_cached_as_list() returns the same entries as a list.
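To make the relationship between get, set, and the two listing methods concrete, here is a minimal disk-backed sketch. It is not the VectorCache implementation. The class name TinyVectorCache, the JSON file format, the key format, and the metadata fields are all assumptions made for illustration, and plain float lists stand in for tensors.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional


class TinyVectorCache:
    """Hypothetical stand-in for VectorCache that stores vectors as JSON files."""

    def __init__(self, cache_dir: str) -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def _key(model_name: str, trait_name: str) -> str:
        # Assumed key format: "/" cannot appear in a file name, so replace it.
        return f"{model_name.replace('/', '--')}__{trait_name}"

    def _path(self, model_name: str, trait_name: str) -> Path:
        return self.cache_dir / f"{self._key(model_name, trait_name)}.json"

    def set(self, model_name: str, trait_name: str, vector: List[float]) -> None:
        payload = {"model_name": model_name, "trait_name": trait_name, "vector": vector}
        self._path(model_name, trait_name).write_text(json.dumps(payload))

    def get(self, model_name: str, trait_name: str) -> Optional[List[float]]:
        path = self._path(model_name, trait_name)
        if not path.exists():
            return None  # cache miss
        return json.loads(path.read_text())["vector"]

    def list_cached(self) -> Dict[str, Dict[str, Any]]:
        entries: Dict[str, Dict[str, Any]] = {}
        for path in sorted(self.cache_dir.glob("*.json")):
            data = json.loads(path.read_text())
            entries[path.stem] = {
                "model_name": data["model_name"],
                "trait_name": data["trait_name"],
                "dim": len(data["vector"]),
            }
        return entries

    def list_cached_as_list(self) -> List[Dict[str, Any]]:
        # Same entries as list_cached(), with the key folded into each item.
        return [{"key": key, **meta} for key, meta in self.list_cached().items()]


with tempfile.TemporaryDirectory() as tmp:
    cache = TinyVectorCache(tmp)
    cache.set("google/gemma-3-4b", "helpful", [0.6, 0.8])
    hit = cache.get("google/gemma-3-4b", "helpful")   # [0.6, 0.8]
    miss = cache.get("google/gemma-3-4b", "toxic")    # None
    cached_map = cache.list_cached()
    cached_items = cache.list_cached_as_list()
```

The dict form suits lookups by key, and the list form is easier to iterate over or load into a table.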
Example:
cached_map = cache.list_cached()
cached_items = cache.list_cached_as_list()
Location: personasafe/screening/data_screener.py
Screen datasets for personality drift using persona vectors.
DataScreener(
    extractor: PersonaExtractor,
    persona_vectors: Dict[str, torch.Tensor]
)
Parameters:
extractor (PersonaExtractor): Initialized extractor instance
persona_vectors (Dict[str, torch.Tensor]): Dictionary mapping trait names to vectors
Example:
from personasafe import PersonaExtractor, DataScreener
extractor = PersonaExtractor("google/gemma-3-4b")
screener = DataScreener(
    extractor=extractor,
    persona_vectors={"helpful": helpful_vec, "toxic": toxic_vec}
)
score_text(text: str) -> Dict[str, float]
Scores a single text against all persona vectors.
Parameters:
text(str): Text to score
Returns:
Dict[str, float]: Dictionary mapping trait names to projection scores
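One way to read "projection score" is as the dot product of the text's activation vector with each persona vector. When the persona vector has unit length, that is the scalar projection onto it. The sketch below assumes this reading and is not the library's confirmed formula. It uses plain lists rather than tensors, and dot and projection_scores are hypothetical helper names.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def projection_scores(
    activation: List[float], persona_vectors: Dict[str, List[float]]
) -> Dict[str, float]:
    """Score one activation against every persona vector.

    Assumption: score = activation . persona_vector, which is the scalar
    projection when the persona vector has unit length.
    """
    return {trait: dot(activation, vec) for trait, vec in persona_vectors.items()}


# Toy activation and unit-length persona vectors.
activation = [3.0, -1.0]
persona_vectors = {"helpful": [1.0, 0.0], "toxic": [0.0, 1.0]}
scores = projection_scores(activation, persona_vectors)
print(scores)  # {'helpful': 3.0, 'toxic': -1.0}
```

Under this reading, a positive score means the activation points along the trait direction and a negative score means it points away from it.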
Example:
scores = screener.score_text("This is a helpful response")
# Returns: {"helpful": 0.85, "toxic": -0.23}
screen_dataset(
    dataset: pd.DataFrame,
    text_column: str = "text"
) -> pd.DataFrame
Screens an entire dataset.
Parameters:
dataset (pd.DataFrame): DataFrame containing texts
text_column (str): Name of the column containing text. Default: "text"
Returns:
pd.DataFrame: Original DataFrame with added score columns
Example:
import pandas as pd
df = pd.DataFrame({"text": ["Sample 1", "Sample 2"]})
screened_df = screener.screen_dataset(df)
# Returns df with columns: text, helpful_score, toxic_score
generate_report(
    screened_df: pd.DataFrame,
    risk_threshold: float = 0.5
) -> Dict[str, Any]
Generates a summary report from screened data.
Parameters:
screened_df (pd.DataFrame): DataFrame from screen_dataset()
risk_threshold (float): Threshold for high-risk classification. Default: 0.5
Returns:
Dict[str, Any]: Keys include total_samples, risk_threshold, high_risk_counts (alias: high_risk_samples), high_risk_indices, average_scores.
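The report keys above can be illustrated with a small aggregation over score columns. This is a sketch under stated assumptions, not the generate_report implementation. It assumes score columns end in _score, that a sample counts as high risk for a trait when its score is strictly greater than risk_threshold, and that counts, indices, and averages are reported per trait. Rows are plain dicts rather than a DataFrame, and summarize is a hypothetical name.

```python
from typing import Any, Dict, List


def summarize(rows: List[Dict[str, Any]], risk_threshold: float = 0.5) -> Dict[str, Any]:
    """Build a report-like dict from rows that carry '<trait>_score' fields."""
    score_columns = sorted({key for row in rows for key in row if key.endswith("_score")})
    counts: Dict[str, int] = {}
    indices: Dict[str, List[int]] = {}
    averages: Dict[str, float] = {}
    for column in score_columns:
        trait = column[: -len("_score")]
        values = [row[column] for row in rows]
        # Assumed rule: strictly above the threshold means high risk.
        flagged = [i for i, value in enumerate(values) if value > risk_threshold]
        counts[trait] = len(flagged)
        indices[trait] = flagged
        averages[trait] = sum(values) / len(values)
    return {
        "total_samples": len(rows),
        "risk_threshold": risk_threshold,
        "high_risk_counts": counts,
        "high_risk_indices": indices,
        "average_scores": averages,
    }


rows = [
    {"text": "Sample 1", "helpful_score": 0.2, "toxic_score": 0.9},
    {"text": "Sample 2", "helpful_score": 0.8, "toxic_score": 0.1},
    {"text": "Sample 3", "helpful_score": 0.5, "toxic_score": 0.7},
]
report = summarize(rows, risk_threshold=0.6)
print(report["high_risk_counts"])  # {'helpful': 1, 'toxic': 2}
```

Whether a high score is actually a risk depends on the trait. A high toxic score is a concern and a high helpful score usually is not, so the threshold may need to be interpreted per trait.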
Location: personasafe/steering/activation_steerer.py
Apply steering vectors during text generation.
ActivationSteerer(model, tokenizer)
Parameters:
model: HuggingFace model instance
tokenizer: HuggingFace tokenizer instance
Example:
from transformers import AutoModelForCausalLM, AutoTokenizer
from personasafe import ActivationSteerer
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b")
steerer = ActivationSteerer(model, tokenizer)
steer(
    prompt: str,
    persona_vector: torch.Tensor,
    multiplier: float = 1.0,
    layer: int = 20,
    max_new_tokens: int = 50
) -> List[str]
Generates text with and without steering for comparison.
Parameters:
prompt (str): Input prompt
persona_vector (torch.Tensor): Steering vector
multiplier (float): Steering strength. Default: 1.0
layer (int): Layer at which to apply steering. Default: 20
max_new_tokens (int): Maximum number of tokens to generate. Default: 50
Returns:
List[str]: [original_output, steered_output]
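Activation steering is commonly done by adding a scaled copy of the steering vector to the hidden states of one layer during the forward pass. The sketch below shows only that arithmetic, on plain Python lists. It assumes the update is h' = h + multiplier * persona_vector at every token position, which this reference does not confirm. In a PyTorch model the same update would typically run inside a forward hook on the chosen layer. steer_hidden_states is a hypothetical name.

```python
from typing import List


def steer_hidden_states(
    hidden_states: List[List[float]],
    persona_vector: List[float],
    multiplier: float = 1.0,
) -> List[List[float]]:
    """Return a copy of hidden_states with multiplier * persona_vector added
    at every token position (assumed steering rule, for illustration only)."""
    steered = []
    for token_state in hidden_states:
        if len(token_state) != len(persona_vector):
            raise ValueError("persona_vector must match the hidden dimension")
        steered.append([h + multiplier * v for h, v in zip(token_state, persona_vector)])
    return steered


# Two token positions with a hidden dimension of 2.
hidden = [[1.0, 2.0], [0.0, -1.0]]
vector = [0.5, 0.0]
steered = steer_hidden_states(hidden, vector, multiplier=2.0)
print(steered)  # [[2.0, 2.0], [1.0, -1.0]]
```

Under this rule a multiplier of 0.0 leaves the activations unchanged, and a negative multiplier pushes them away from the trait.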
Example:
outputs = steerer.steer(
    prompt="Write a story about",
    persona_vector=helpful_vector,
    multiplier=2.0
)
print("Original:", outputs[0])
print("Steered:", outputs[1])
PersonaSafe uses Python type hints throughout. Import types:
from typing import List, Dict, Optional, Any
import torch
import pandas as pd
All PersonaSafe components raise standard Python exceptions:
ValueError: Invalid parameters
FileNotFoundError: Missing cache files
RuntimeError: Model loading failures
KeyError: Missing required keys in data structures
Example error handling:
try:
    extractor = PersonaExtractor("invalid/model")
except Exception as e:
    print(f"Model loading failed: {e}")
PersonaSafe respects this environment variable:
HUGGINGFACE_TOKEN: HuggingFace API token (required for gated models)
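Checking for the token before loading a gated model can give a clearer error than a failed download. The helper below is a sketch and is not part of PersonaSafe. The names read_hf_token and require_hf_token are hypothetical, and how the library itself reads the variable is not described in this reference.

```python
import os
from typing import Mapping, Optional


def read_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the token from HUGGINGFACE_TOKEN, or None if unset or blank."""
    token = env.get("HUGGINGFACE_TOKEN", "").strip()
    return token or None


def require_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the token, or raise early with an explicit message."""
    token = read_hf_token(env)
    if token is None:
        raise RuntimeError(
            "HUGGINGFACE_TOKEN is not set; gated models cannot be downloaded"
        )
    return token


# Explicit mappings keep the demonstration independent of the real environment.
present = read_hf_token({"HUGGINGFACE_TOKEN": "hf_your_token_here"})
absent = read_hf_token({})
try:
    require_hf_token({"HUGGINGFACE_TOKEN": "   "})
    blank_rejected = False
except RuntimeError:
    blank_rejected = True
```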
Example:
export HUGGINGFACE_TOKEN=hf_your_token_here

from personasafe import PersonaExtractor, VectorCache, DataScreener
import pandas as pd
# 1. Setup
extractor = PersonaExtractor("google/gemma-3-4b")
cache = VectorCache()
# 2. Extract or load vectors
helpful_vec = cache.get("google/gemma-3-4b", "helpful")
if helpful_vec is None:
    helpful_vec = extractor.compute_persona_vector(
        positive_prompts=["Be very helpful"],
        negative_prompts=["Be unhelpful"],
        trait_name="helpful"
    )
# 3. Screen dataset
screener = DataScreener(
extractor=extractor,
persona_vectors={"helpful": helpful_vec}
)
df = pd.read_json("dataset.jsonl", lines=True)
screened_df = screener.screen_dataset(df, text_column="text")
# 4. Generate report
report = screener.generate_report(screened_df, risk_threshold=0.6)
print(report)

import personasafe
print(personasafe.__version__)  # "0.1.0"

Last Updated: October 25, 2025
API Version: 0.1.0
Python Compatibility: 3.10+