Xinyi Shang*
Yi Tang*
Jiacheng Cui*
Ahmed Elhagry
Salwa K. Al Khatib
Sondos Mahmoud Bsharat
Jiacheng Liu
Xiaohan Zhao
Jing-Hao Xue
Hao Li
Salman Khan
Zhiqiang Shen
\* Equal contribution | † Corresponding author
Mask-based labels misalign with true edit signals (top). Our pixel-difference labels are precisely aligned with the generative footprint (bottom).
TL;DR: We expose a fundamental flaw in mask-based tampering benchmarks and introduce PIXAR: a 420K+ benchmark with pixel-faithful labels, 8 manipulation types, and a VLM detector that simultaneously localizes, classifies, and describes tampered regions, achieving a 2.7× IoU improvement over prior SOTA.
- [2026-03] Major Update: Code and pre-trained weights are now available, along with our large-scale PIXAR benchmark (420K+ pairs).
- [2026-02] Paper Status: Accepted to CVPR 2026 Findings. We have opted to withdraw it for further enhancement and resubmission.
| Model | Base | Training Data | HuggingFace |
|---|---|---|---|
| PIXAR-7B | SIDA-7B | Full training set (train_0.05) | 🤗 Link |
| PIXAR-7B_lite | SIDA-7B | Lite training set (train_mask-only_0.05) | 🤗 Link |
| PIXAR-13B | SIDA-13B | Full training set (train_0.05) | 🤗 Link |
| PIXAR-13B_lite | SIDA-13B | Lite training set (train_mask-only_0.05) | 🤗 Link |
_lite variants are fine-tuned on the subset of our training data that contains masks used to guide the generation of tampered images.
Existing tampering benchmarks rely on coarse object masks as ground truth, which severely misalign with the true edit signal: many pixels inside a mask are untouched, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering from coarse region labels to a pixel-grounded, meaning- and language-aware task.
PIXAR is a large-scale benchmark and training framework with three core contributions:
- A new taxonomy spanning 8 edit primitives (replace / remove / splice / inpaint / attribute / colorization, etc.) linked to the semantic class of the tampered object.
- A new benchmark of over 420K training image pairs and a carefully balanced 40K test set, each with per-pixel tamper maps, semantic category labels, and natural language descriptions.
- A new training framework and metrics that quantify pixel-level correctness with localization, assess confidence on true edit intensity, and measure tamper meaning understanding via semantics-aware classification and natural language descriptions.
Existing benchmarks typically rely on coarse object masks as ground truth, which conflate unedited pixels inside the mask with actual tamper evidence while overlooking subtle edits beyond the mask boundaries. To alleviate this, we replace binary masks with a per-pixel difference map thresholded at $\tau$.
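As a concrete illustration, a pixel-difference label of this kind can be derived as follows (a minimal NumPy sketch assuming uint8 RGB image pairs; the function names `soft_difference_map` and `tamper_label` are illustrative, not the repo's API):

```python
import numpy as np

def soft_difference_map(original: np.ndarray, tampered: np.ndarray) -> np.ndarray:
    """Per-pixel change strength in [0, 1]: maximum absolute channel
    difference between the original and tampered uint8 RGB images."""
    diff = np.abs(original.astype(np.float32) - tampered.astype(np.float32)) / 255.0
    return diff.max(axis=-1)  # shape (H, W)

def tamper_label(original: np.ndarray, tampered: np.ndarray, tau: float = 0.05) -> np.ndarray:
    """Binary tamper map M_tau: a pixel counts as tampered iff its change exceeds tau."""
    return soft_difference_map(original, tampered) > tau
```

Unlike an object mask, this label marks exactly the pixels the generator actually changed, and raising `tau` suppresses imperceptible noise.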
| Split | Size | Labels |
|---|---|---|
| Training | 420K+ image pairs | Pixel-level |
| Test | 40K image pairs (balanced) | Pixel-level |
Built on LLaVA + LLaMA-2 with LoRA fine-tuning, integrated with SAM ViT-H for pixel-level decoding and CLIP ViT-L/14 for visual-language alignment. Three special tokens anchor the multi-task heads in the token sequence:
| Token | Role |
|---|---|
| [CLS] | Hidden state → 2-way classification (real / tampered) via FC_cls |
| [OBJ] | Hidden state → multi-label object recognition (81 COCO classes) via FC_obj |
| [SEG] | Hidden state → SAM prompt for pixel localization via FC_seg |
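The three heads can be sketched as linear projections over the corresponding token hidden states (a NumPy illustration; the hidden size `d` and the random placeholder weights are assumptions, not the trained model's parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_obj = 256, 81  # hidden size (illustrative) and COCO class count

# Placeholder projections standing in for FC_cls / FC_obj / FC_seg.
FC_cls = rng.standard_normal((d, 2)) * 0.01
FC_obj = rng.standard_normal((d, n_obj)) * 0.01
FC_seg = rng.standard_normal((d, d)) * 0.01

def multi_task_heads(h_cls: np.ndarray, h_obj: np.ndarray, h_seg: np.ndarray) -> dict:
    """Route each special token's hidden state through its dedicated head."""
    return {
        "real_vs_tampered": h_cls @ FC_cls,  # 2-way logits
        "object_logits": h_obj @ FC_obj,     # 81-way multi-label logits
        "sam_prompt": h_seg @ FC_seg,        # prompt embedding handed to SAM
    }
```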
Default weights:
The [SEG] token embedding can be fused with the generated text description in three modes:
| Mode | Description |
|---|---|
| seg_only | Uses seg_emb only |
| text_only | Uses text_emb only |
| fuse | gate · seg_emb + (1 − gate) · text_emb, where gate = σ(MLP([seg_emb, text_emb])) |
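The fuse mode can be sketched as follows (a minimal NumPy illustration with a single-layer gate in place of the MLP; the dimension `d = 256` and the random initialization are placeholders, not the trained model's):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256  # embedding dimension (illustrative)

# Single linear layer standing in for the gate MLP over [seg_emb, text_emb].
W = rng.standard_normal((2 * d, 1)) * 0.01
b = np.zeros(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(seg_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """gate * seg_emb + (1 - gate) * text_emb, gate = sigmoid(MLP([seg_emb, text_emb]))."""
    gate = sigmoid(np.concatenate([seg_emb, text_emb]) @ W + b)  # value in (0, 1)
    return gate * seg_emb + (1.0 - gate) * text_emb

seg_emb, text_emb = rng.standard_normal(d), rng.standard_normal(d)
fused = fuse(seg_emb, text_emb)
```

Because the gate lies in (0, 1), the fused embedding is a convex combination of the two inputs, letting the model learn how much to trust the segmentation token versus the text description.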
- `model/` – Core model (`PIXARForCausalLM`): LLaVA backbone + SAM ViT-H encoder + `[CLS]`/`[OBJ]`/`[SEG]` multi-task heads.
- `finetune/` – DeepSpeed training scripts for various LoRA and hyperparameter configurations.
- `evaluation/` – Evaluation launchers and metrics; `text_eval/compute_css.py` scores descriptions via CSS.
- `utils/` – Dataset class, IoU metrics, and distributed batch sampler.
- `preprocess_raw/` – End-to-end dataset preprocessing pipeline:
  - `1_download-data/` – rclone scripts to download and extract the raw PIXAR dataset from Google Drive.
  - `2_construct_dataset/` – Builds the dataset from raw image pairs: generates pixel-difference maps and labels at any $\tau$.
- `train_PIXAR.py` – Main training entry point.
- `test_parallel.py` – Multi-GPU parallel evaluation.
- `chat.py` – Interactive single-image inference.
- `merge_lora_weights_and_save_hf_model.py` / `merge.sh` – Merges LoRA adapters and exports to HuggingFace format.
Requires Python 3.10.
```bash
pip install -r requirements.txt
```

Some packages require minor patches after installation. Run:

```bash
bash fix/fix.sh
```

If issues persist, run `fix/fix_again.sh`.
Download the following weights and place them in your preferred directory:
| Component | Details |
|---|---|
| SIDA-7B base model | HuggingFace-format LLaVA-LLaMA-2 base |
| SAM ViT-H | sam_vit_h_4b8939.pth |
| CLIP ViT-L/14 | openai/clip-vit-large-patch14 (auto-downloaded) |
See the Data section below.
We provide two ways to obtain the PIXAR dataset:
**Option A – Download preprocessed data (recommended)**

Training and test sets preprocessed at $\tau$ = 0.05 are available on Google Drive. The drive contains 9 files in total:

| File | Description |
|---|---|
| train_0.05.tar.gz | Full training set – includes all tampered-type images |
| train_mask-only_0.05.tar.gz | Lite training set – mask-only tampered-type images; intended for training the lite model variant |
| test_full_0.05.tar.gz | Full test set – all 6 generation sources combined (see splits below) |
| test_qwen_0.05.tar.gz | Test split – Qwen-generated images |
| test_gemini2.5_0.05.tar.gz | Test split – Gemini 2.5-generated images |
| test_gemini3_0.05.tar.gz | Test split – Gemini 3-generated images |
| test_flux2_0.05.tar.gz | Test split – FLUX 2-generated images |
| test_gpt_0.05.tar.gz | Test split – GPT-generated images |
| test_seedream_0.05.tar.gz | Test split – SeeDream-generated images |

`test_full_0.05.tar.gz` aggregates the 6 per-source test splits (qwen, gemini2.5, gemini3, flux2, gpt, seedream).

**Option B – Build from raw data with a custom $\tau$**

We release the raw image pairs alongside the pixel-difference maps, allowing labels to be re-derived at any $\tau$. See Custom Dataset Processing for details.
```
dataset_dir/
├── train/
│   ├── real/            # Authentic images
│   ├── full_synthetic/  # Fully generated images (empty)
│   ├── tampered/        # AI-tampered images
│   ├── masks/           # Masks used to generate the tampered images
│   ├── soft_masks/      # Pixel-difference maps M_τ (default τ = 0.05)
│   └── metadata/        # JSON per tampered image: {"cls": [...], "text": "..."}
└── validation/
    └── (same structure)
```
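Given this layout, a single tampered sample can be assembled like so (a hedged sketch: the `load_sample` helper and the `.png` file extensions are assumptions for illustration, not the repo's loader):

```python
import json
from pathlib import Path

def load_sample(dataset_dir: str, split: str, name: str) -> dict:
    """Collect the paths and labels for one tampered sample, following the
    directory layout above: tampered image, soft mask, and metadata JSON."""
    root = Path(dataset_dir) / split
    meta = json.loads((root / "metadata" / f"{name}.json").read_text())
    return {
        "tampered": root / "tampered" / f"{name}.png",     # extension is an assumption
        "soft_mask": root / "soft_masks" / f"{name}.png",  # pixel-difference map M_tau
        "cls": meta["cls"],    # semantic classes of the tampered object(s)
        "text": meta["text"],  # natural-language edit description
    }
```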
If you want to build the dataset from scratch at a different $\tau$:

**Step 1 – Download raw data**

Follow `preprocess_raw/1_download-data/README.md` to configure rclone, populate `preprocess_raw/1_download-data/files.txt` with the raw zip filenames, and run:

```bash
bash preprocess_raw/1_download-data/download.sh
```

**Step 2 – Build the dataset at your preferred $\tau$**

Edit the config block in `preprocess_raw/2_construct_dataset/generate_v2.sh`:

```bash
DATASET_DIR="/path/to/raw_outputs"   # output of download.sh
OUT_DIR="/path/to/output_dataset"    # where to write the processed dataset
TAOS=(0.05)                          # e.g. (0.01) (0.1) (0.2) or multiple values
```

Then run:

```bash
cd preprocess_raw/2_construct_dataset
bash generate_v2.sh        # mask-only labels
bash generate_v2-text.sh   # labels + text descriptions (requires descriptions.csv)
```

| $\tau$ | Effect |
|---|---|
| 0.01 | Captures micro-edits and subtle pixel changes |
| 0.05 | Default – balanced sensitivity (recommended) |
| 0.1 | High-confidence semantic changes only |
| 0.2 | Conservative – only large, obvious edits |
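The effect of $\tau$ is easy to see by re-thresholding one soft difference map (values in [0, 1]) at each setting; this toy sketch is illustrative and not the logic of `generate_v2.sh`:

```python
import numpy as np

# Toy soft difference map: one untouched pixel, one noise-level change,
# one subtle edit, and one strong edit.
soft = np.array([[0.00, 0.03],
                 [0.07, 0.15]])

for tau in (0.01, 0.05, 0.1, 0.2):
    mask = soft > tau
    print(f"tau={tau}: {int(mask.sum())} tampered pixel(s)")
```

The labeled region shrinks monotonically as $\tau$ grows, which is exactly the sensitivity/robustness trade-off in the table above.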
You can launch training directly or use the provided scripts under `finetune/`:

```bash
deepspeed --include localhost:0 --master_port=12345 train_PIXAR.py \
    --version <path_to_base_model> \
    --dataset_dir <path_to_dataset> \
    --vision_pretrained <path_to_sam_vit_h.pth> \
    --val_dataset <path_to_validation_set> \
    --batch_size 2 --epochs 10 --steps_per_epoch 1000 \
    --lr 1e-4 --dice_loss_weight 1.0 --seg_prompt_mode seg_only \
    --precision bf16 --exp_name "pixar_experiment" --log_base_dir ./runs
```

Key hyperparameters: LoRA rank 8.
After training, convert the DeepSpeed checkpoint and merge the LoRA adapters before evaluation:

```bash
# Step 1: Convert the DeepSpeed checkpoint to fp32
cd runs/<exp_name>/ckpt
python zero_to_fp32.py . ../pytorch_model.bin

# Step 2: Merge LoRA adapters into the base model
# Set the path to pytorch_model.bin in merge.sh, then run:
bash merge.sh
```

See `evaluation/README.md` for the full guide.
```bash
# Multi-GPU parallel evaluation
python test_parallel.py \
    --version <merged_model> --dataset_dir <test_data> \
    --vision_pretrained <sam.pth> --gpus 0,1,2,3 \
    --output_dir ./evaluation/logs/my_exp \
    --seg_prompt_mode seg_only --precision bf16 --save_generated_text

# Text quality (CSS score, requires --save_generated_text above)
cd evaluation/text_eval
python compute_css.py \
    --json_path ../logs/my_exp/generated_texts.json \
    --output_path ./logs/my_exp/css_scores.json
```

For interactive single-image inference:

```bash
python chat.py --version <merged_model> --precision bf16 --seg_prompt_mode seg_only
```

We also provide a set of demo notebooks in the `playground/` directory that you can use to explore the model interactively.
If you find this work useful, please cite: