As Computer-Using Agents (CUAs) are deployed widely in complex real-world environments, long-term risks can lead to severe and irreversible consequences. Most existing guardrails are reactive: they constrain behavior only within the current observation space. They can prevent immediate risks (e.g., clicking a phishing link) but not long-term ones, because a seemingly reasonable action can produce a high-risk outcome that surfaces only later (e.g., cleaning logs makes future audits untraceable) and is therefore invisible in the current observation.
We propose a predictive guardrail approach: align predicted future risks with current decisions. SafePred implements this via:
- Short- and long-term risk prediction: Using safety policies as the basis, SafePred leverages a world model to produce semantic risk representations (short- and long-term), identifying and pruning actions that lead to high-risk states.
- Decision optimization: Translating predicted risks into actionable guidance through step-level interventions and task-level re-planning.
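
The pruning idea in the first bullet can be illustrated with a toy sketch. Everything here is hypothetical: `RiskEstimate`, `prune_risky_actions`, the keyword-based `toy_world_model`, and the 0.5 threshold are stand-ins chosen for illustration, not SafePred's actual components, which rely on an LLM-based world model and safety policies.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RiskEstimate:
    short_term: float  # predicted risk of the immediate next state, in [0, 1]
    long_term: float   # predicted risk of delayed consequences, in [0, 1]


def prune_risky_actions(
    candidates: List[str],
    predict_risk: Callable[[str], RiskEstimate],
    threshold: float = 0.5,  # hypothetical cut-off
) -> List[str]:
    """Keep only actions whose short- AND long-term predicted risk stay below the threshold."""
    safe = []
    for action in candidates:
        estimate = predict_risk(action)
        if max(estimate.short_term, estimate.long_term) < threshold:
            safe.append(action)
    return safe


def toy_world_model(action: str) -> RiskEstimate:
    # Stand-in for a learned world model: hard-coded scores for three toy actions.
    if "phishing" in action:
        return RiskEstimate(short_term=0.9, long_term=0.9)
    if "delete logs" in action:
        # Looks harmless right now, but makes later audits untraceable.
        return RiskEstimate(short_term=0.1, long_term=0.8)
    return RiskEstimate(short_term=0.05, long_term=0.05)


candidates = ["click phishing link", "delete logs", "open settings"]
print(prune_risky_actions(candidates, toy_world_model))
```

In this toy setup, a check that looked only at `short_term` would let "delete logs" through; including the long-term estimate is what removes it.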
Extensive experiments show that SafePred significantly reduces high-risk behaviors, achieving over 97.6% safety performance and improving task utility by up to 21.4% compared with reactive baselines.
This section provides instructions for setting up the project, including cloning the repository, configuring environment variables, and setting up separate environments for the WASP and OS-Harm benchmarks.
git clone https://github.com/your-username/SafePred.git
cd SafePred

The project requires API keys for language models. Copy the example environment file and add your keys. This step is required for both benchmarks.

cp .env.example .env

Then, edit the .env file to add your API keys for the providers you use:
| Provider | Environment Variables |
|---|---|
| OpenAI | OPENAI_API_KEY, OPENAI_API_URL |
| Qwen | QWEN_API_KEY, QWEN_API_URL |
| Others | Add corresponding variables |
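
Before launching a benchmark, it can help to confirm that the variables from the table are actually visible to the process. The sketch below is a convenience check, not part of SafePred: `REQUIRED_VARS` and `missing_env_vars` are hypothetical names, and it assumes the values from `.env` have already been loaded into the environment (for example, exported in your shell).

```python
import os

# Variable names taken from the table above; extend for other providers.
REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY", "OPENAI_API_URL"],
    "qwen": ["QWEN_API_KEY", "QWEN_API_URL"],
}


def missing_env_vars(provider, environ=None):
    """Return the variables for `provider` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS[provider] if not env.get(name)]


if __name__ == "__main__":
    for provider in REQUIRED_VARS:
        missing = missing_env_vars(provider)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{provider}: {status}")
```

Only the providers you actually reference in the configuration need to report `ok`.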
Edit config/config.yaml to set up your providers and models:
- Set the `provider` field in each LLM block to match your `.env` variables
- Configure `world_model_llm` for state prediction and risk evaluation
- Configure `rule_extractor_llm` for extracting policies from documents (optional)
- Configure `action_agent_llm` for generating candidate actions (optional)
- Adjust `model_name`, `temperature`, `max_tokens`, and risk thresholds as needed
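
As a rough illustration of the checklist above, the following sketch validates a configuration after it has been parsed into a Python dictionary. The layout is an assumption: it supposes the three LLM blocks sit at the top level of config/config.yaml and each carries `provider` and `model_name` fields. The helper `check_llm_config` and the example values are hypothetical; consult the shipped config file for the real schema.

```python
REQUIRED_BLOCKS = ["world_model_llm"]
OPTIONAL_BLOCKS = ["rule_extractor_llm", "action_agent_llm"]


def check_llm_config(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []
    for block in REQUIRED_BLOCKS:
        if block not in config:
            problems.append(f"missing required block: {block}")
    for block in REQUIRED_BLOCKS + OPTIONAL_BLOCKS:
        settings = config.get(block)
        if settings is None:
            continue  # optional block not configured, or already reported above
        for field in ("provider", "model_name"):
            if not settings.get(field):
                problems.append(f"{block}.{field} is not set")
    return problems


# Hypothetical parsed config; the numeric values are placeholders.
example = {
    "world_model_llm": {
        "provider": "openai",
        "model_name": "gpt-4o",
        "temperature": 0.0,
        "max_tokens": 2048,
    },
    "action_agent_llm": {"provider": "qwen"},
}
print(check_llm_config(example))
```

Here the optional `action_agent_llm` block is flagged because it names a provider but no model.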
The project uses two benchmarks, WASP and OS-Harm, which require separate environments. Each benchmark also requires a VM or cloud instance (e.g. AWS EC2) and specific services; for VM/EC2 setup and configuration, see each benchmark’s documentation below.
For VM/EC2 and web services (Reddit, GitLab) setup, see benchmark/wasp/README.md and benchmark/wasp/visualwebarena/environment_docker/README.md.
Create and activate a conda environment for WASP:
conda create -n wasp python=3.10 -y
conda activate wasp

Install the required packages for WASP:

pip install -r benchmark/wasp/webarena_prompt_injections/requirements.txt

For VM setup (e.g. VMware Workstation, Ubuntu VM), see benchmark/os-harm/README.md and the OSWorld installation guide.
Create and activate a conda environment for OS-Harm:
conda create -n osworld python=3.10 -y
conda activate osworld

Install the required packages for OS-Harm:

pip install -r benchmark/os-harm/baseline/code/requirements.txt

SafePred uses safety policies to evaluate action risk. To extract policies from your own documents, see Extracting Policies.
This section provides example commands for running the WASP and OS-Harm benchmarks with SafePred integration.
⚠️ Prerequisites: Before running, you need to deploy Reddit and GitLab services and replace the placeholder URLs with your own.
cd benchmark/wasp
export DATASET=webarena_prompt_injections
export REDDIT="<your_reddit_domain>:9999"
export GITLAB="<your_gitlab_domain>:8023"
cd webarena_prompt_injections
python run.py \
    --config configs/experiment_config.raw.json \
    --model gpt-4o \
    --system-prompt configs/system_prompts/wa_p_cot_id_actree_3s.json \
    --output-dir /data/chenyurun/SafePred/benchmark/wasp/res \
    --output-format webarena \
    --use_safepred \
    --safepred_config_path ../../../config/config.yaml \
    --policy ../../../policies/my_policies.json

cd benchmark/os-harm
python run.py \
    --path_to_vm /path/to/Ubuntu/Ubuntu.vmx \
    --observation_type screenshot_a11y_tree \
    --model o4-mini \
    --max_tokens 6000 \
    --result_dir ./results \
    --safepred_policy_path ../../policies/my_policies.json \
    --test_all_meta_path evaluation_examples/test_misuse.json \
    --inject \
    --enable_safety_check \
    --safepred_config_path ../../config/config.yaml

SafePred provides a wrapper for easy integration with benchmarks. This section explains how to integrate SafePred with existing benchmarks or extend it to new ones.
from SafePred import SafePredWrapper
wrapper = SafePredWrapper(
    benchmark="visualwebarena",  # or "stwebagentbench", "osworld"
    config_path="config/config.yaml",
    policy_path="policies/my_policies.json",
    # optional:
    # use_planning=True,
    # web_agent_model_name="gpt-4",
)

You can use SafePred to evaluate action risk before execution:
result = wrapper.evaluate_action_risk(
    state=benchmark_state,  # benchmark-specific state
    action=action_to_evaluate,
    candidate_actions=[action1, action2, ...],
    intent="User task description",
    metadata={
        "task_id": "task_001",
        "action_history": [...],
        "current_response": "Agent's reasoning for this step",
    },
)
# Use the result
if result["requires_regeneration"]:
    # Prompt your agent again with result["risk_guidance"]
    pass

To integrate SafePred with a new benchmark:
- Implement the adapter: Create a new adapter in `adapters/` that converts your benchmark's state/action format to SafePred's format.
- Register the benchmark: Add your benchmark name to the supported list in SafePredWrapper.
- Use the wrapper: Initialize SafePredWrapper with your benchmark name and appropriate config/policy paths.
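
The first two steps can be pictured with a self-contained toy. It does not use SafePred's real classes: the `BaseAdapter` stub, the `ADAPTERS` registry, the `register_adapter` decorator, and the state/action dictionary keys are all hypothetical stand-ins, and the way SafePredWrapper actually tracks supported benchmarks may differ.

```python
class BaseAdapter:
    """Stand-in for SafePred's adapter base class (the real interface may differ)."""

    def convert_state(self, benchmark_state):
        raise NotImplementedError

    def convert_action(self, safepred_action):
        raise NotImplementedError


ADAPTERS = {}  # hypothetical registry mapping benchmark name -> adapter class


def register_adapter(name):
    """Class decorator that records an adapter under a benchmark name."""
    def decorator(cls):
        ADAPTERS[name] = cls
        return cls
    return decorator


@register_adapter("my_benchmark")
class MyBenchmarkAdapter(BaseAdapter):
    def convert_state(self, benchmark_state):
        # Map benchmark-specific keys to a (hypothetical) common state format.
        return {
            "url": benchmark_state["page_url"],
            "observation": benchmark_state["dom_text"],
        }

    def convert_action(self, safepred_action):
        # Map a (hypothetical) structured action back to the benchmark's string form.
        return f'{safepred_action["type"]}({safepred_action["target"]})'


adapter = ADAPTERS["my_benchmark"]()
state = adapter.convert_state({"page_url": "http://example.test", "dom_text": "<button id=7>"})
print(state["url"], adapter.convert_action({"type": "click", "target": "7"}))
```

A name-to-class registry keeps the lookup by benchmark name in one place, which is one plausible way to maintain a "supported list".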
Example Adapter Structure

# adapters/my_benchmark.py
class MyBenchmarkAdapter(BaseAdapter):
    def convert_state(self, benchmark_state):
        # Convert your benchmark state to SafePred format
        return converted_state

    def convert_action(self, safepred_action):
        # Convert SafePred action back to your benchmark format
        return converted_action

For any questions or issues, please contact the authors by email.
@article{chen2026safepred,
title={SafePred: A Predictive Guardrail for Computer-Using Agents via World Models},
author={Chen, Yurun and Liao, Zeyi and Yin, Ping and Xie, Taotao and Yin, Keting and Zhang, Shengyu},
journal={arXiv preprint arXiv:2602.01725},
year={2026}
}