VILA-Lab/PIXAR

PIXAR Logo

From Masks to Pixels and Meaning

A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Paper · Dataset · Model · License

Benchmark · Tasks

Xinyi Shang*  Yi Tang*  Jiacheng Cui*  Ahmed Elhagry  Salwa K. Al Khatib
Sondos Mahmoud Bsharat  Jiacheng Liu  Xiaohan Zhao  Jing-Hao Xue  Hao Li  Salman Khan  Zhiqiang Shen†

* Equal contribution  |  † Corresponding author



PIXAR Motivation

Mask-based labels misalign with true edit signals (top). Our pixel-difference labels are precisely aligned with the generative footprint (bottom).


TL;DR: We expose a fundamental flaw in mask-based tampering benchmarks and introduce PIXAR: a 420K+ benchmark with pixel-faithful labels, 8 manipulation types, and a VLM detector that simultaneously localizes, classifies, and describes tampered regions, achieving a 2.7× IoU improvement over the prior SOTA.


🔥 News

  • [2026-03] 📦 🚀 Major Update: Code and pre-trained weights are now available, along with our large-scale PIXAR benchmark (420K+ pairs).
  • [2026-02] 📑 Paper Status: Accepted to CVPR 2026 Findings. We have opted to withdraw it for further enhancement and resubmission.

🤗 Released Models

| Model | Base | Training Data | HuggingFace |
| --- | --- | --- | --- |
| PIXAR-7B | SIDA-7B | Full training set (train_0.05) | 🤗 Link |
| PIXAR-7B_lite | SIDA-7B | Lite training set (train_mask-only_0.05) | 🤗 Link |
| PIXAR-13B | SIDA-13B | Full training set (train_0.05) | 🤗 Link |
| PIXAR-13B_lite | SIDA-13B | Lite training set (train_mask-only_0.05) | 🤗 Link |

The _lite variants are fine-tuned on the subset of our training data that contains masks guiding the generation of the tampered images.


📖 Overview

Existing tampering benchmarks rely on coarse object masks as ground truth, which severely misalign with the true edit signal: many pixels inside a mask are untouched, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering from coarse region labels to a pixel-grounded, meaning- and language-aware task.

PIXAR is a large-scale benchmark and training framework with three core contributions:

  1. A new taxonomy spanning 8 edit primitives (replace / remove / splice / inpaint / attribute / colorization, etc.) linked to the semantic class of the tampered object.
  2. A new benchmark of over 420K training image pairs and a carefully balanced 40K test set, each with per-pixel tamper maps, semantic category labels, and natural language descriptions.
  3. A new training framework and metrics that quantify pixel-level correctness with localization, assess confidence on true edit intensity, and measure tamper meaning understanding via semantics-aware classification and natural language descriptions.

💡 Motivation

Existing benchmarks typically rely on coarse object masks as ground truth, which often conflates unedited pixels within the mask with actual tamper evidence, while simultaneously overlooking subtle edits beyond the mask boundaries. To alleviate this, we replace binary masks with a per-pixel difference map $D = |I_{\text{orig}} - I_{\text{gen}}|$. By applying a tunable threshold $\tau$, we derive $M_\tau$, which is a dynamic ground truth that encapsulates both micro-edits at lower $\tau$ values and high-confidence semantic changes at higher $\tau$ values.

Tau visualization

Pixel-level labels under different τ.
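The thresholding step can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's preprocessing code; the normalization of both images to [0, 1] and the per-channel max reduction are assumptions:

```python
import numpy as np

def soft_mask(img_orig: np.ndarray, img_gen: np.ndarray, tau: float = 0.05) -> np.ndarray:
    """Derive a binary tamper map M_tau from the per-pixel difference map.

    Both images are float arrays in [0, 1]; for RGB inputs the per-channel
    differences are reduced with a max, so a change in any channel counts.
    """
    diff = np.abs(img_orig - img_gen)      # D = |I_orig - I_gen|
    if diff.ndim == 3:                     # H x W x C -> H x W
        diff = diff.max(axis=-1)
    return (diff > tau).astype(np.uint8)   # M_tau
```

Lowering `tau` keeps micro-edits in the map; raising it keeps only high-confidence changes.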


📦 PIXAR Benchmark

| Split | Size | Labels |
| --- | --- | --- |
| Training | 420K+ image pairs | Pixel-level $M_\tau$, semantic class, text description, tampered or not |
| Test | 40K image pairs (balanced) | Pixel-level $M_\tau$, semantic class, text description, tampered or not |

🔬 Method

PIXAR Method

Architecture

Built on LLaVA + LLaMA-2 with LoRA fine-tuning, integrated with SAM ViT-H for pixel-level decoding and CLIP ViT-L/14 for visual-language alignment. Three special tokens anchor the multi-task heads in the token sequence:

| Token | Role |
| --- | --- |
| [CLS] | Hidden state → 2-way classification (real / tampered) via FC_cls |
| [OBJ] | Hidden state → multi-label object recognition (81 COCO classes) via FC_obj |
| [SEG] | Hidden state → SAM prompt for pixel localization via FC_seg |

Training Objective

$$\mathcal{L}_\text{total} = \lambda_\text{sem}\,\mathcal{L}_\text{sem} + \lambda_\text{bce}\,\mathcal{L}_\text{bce} + \lambda_\text{dice}\,\mathcal{L}_\text{dice} + \lambda_\text{text}\,\mathcal{L}_\text{text} + \lambda_\text{cls}\,\mathcal{L}_\text{cls}$$

Default weights: $\lambda_\text{sem}$ = 0.5, $\lambda_\text{cls}$ = 1.0, $\lambda_\text{bce}$ = 1.0, $\lambda_\text{dice}$ = 1.0, $\lambda_\text{text}$ = 3.0.
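With these defaults the objective reduces to a weighted sum of the five components. A trivial sketch; the per-component loss values below are made-up placeholders:

```python
def total_loss(losses: dict, weights: dict) -> float:
    """L_total = sum over components k of lambda_k * L_k."""
    return sum(weights[k] * losses[k] for k in losses)

weights = {"sem": 0.5, "cls": 1.0, "bce": 1.0, "dice": 1.0, "text": 3.0}
losses  = {"sem": 0.2, "cls": 0.1, "bce": 0.4, "dice": 0.3, "text": 0.5}  # dummy values
# 0.5*0.2 + 1.0*0.1 + 1.0*0.4 + 1.0*0.3 + 3.0*0.5 = 2.4
```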

Segmentation Prompt Modes

The [SEG] token embedding can be fused with the generated text description in three modes:

| Mode | Description |
| --- | --- |
| seg_only | Uses seg_emb only |
| text_only | Uses text_emb only |
| fuse | gate · seg_emb + (1 − gate) · text_emb, where gate = σ(MLP([seg_emb, text_emb])) |
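The fuse mode can be sketched with NumPy. This is a minimal illustration in which the "MLP" is a single linear layer producing one scalar logit; the real gate network's shape is not specified here, so `W` and `b` are stand-ins:

```python
import numpy as np

def fuse(seg_emb, text_emb, W, b):
    """fuse mode: gate * seg_emb + (1 - gate) * text_emb,
    with gate = sigmoid(linear([seg_emb, text_emb]))."""
    x = np.concatenate([seg_emb, text_emb])    # [seg_emb, text_emb]
    gate = 1.0 / (1.0 + np.exp(-(W @ x + b)))  # scalar in (0, 1)
    return gate * seg_emb + (1.0 - gate) * text_emb

rng = np.random.default_rng(0)
d = 8                                  # toy embedding width
seg, text = rng.normal(size=d), rng.normal(size=d)
W, b = rng.normal(size=2 * d), 0.0     # stand-in gate parameters
out = fuse(seg, text, W, b)            # elementwise convex combination
```

Because the gate is a scalar in (0, 1), the output always lies elementwise between the two embeddings.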

📊 Results

Experimental Results

🗂️ Project Structure

  • model/ – Core model (PIXARForCausalLM): LLaVA backbone + SAM ViT-H encoder + [CLS]/[OBJ]/[SEG] multi-task heads.
  • finetune/ – DeepSpeed training scripts for various LoRA and hyperparameter configurations.
  • evaluation/ – Evaluation launchers and metrics; text_eval/compute_css.py scores descriptions via CSS.
  • utils/ – Dataset class, IoU metrics, and distributed batch sampler.
  • preprocess_raw/ – End-to-end dataset preprocessing pipeline:
    • 1_download-data/ – rclone scripts to download and extract the raw PIXAR dataset from Google Drive.
    • 2_construct_dataset/ – Builds the dataset from raw image pairs: generates pixel-difference maps and labels at any $\tau$.
  • train_PIXAR.py – Main training entry point.
  • test_parallel.py – Multi-GPU parallel evaluation.
  • chat.py – Interactive single-image inference.
  • merge_lora_weights_and_save_hf_model.py / merge.sh – Merges LoRA adapters and exports to HuggingFace format.
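The localization metric kept in utils/ is IoU between predicted and ground-truth tamper maps; the standard computation looks like this (a generic sketch, not the repository's exact implementation):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / (union + eps))
```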

⚙️ Environment Setup

Requires Python 3.10.

1. Install Dependencies

```shell
pip install -r requirements.txt
```

2. Fix Environment

Some packages require minor patches after installation. Run:

```shell
bash fix/fix.sh
```

If issues persist, run fix/fix_again.sh.

3. Pretrained Weights

Download the following weights and place them in your preferred directory:

| Component | Details |
| --- | --- |
| SIDA-7B base model | HuggingFace-format LLaVA-LLaMA-2 base |
| SAM ViT-H | sam_vit_h_4b8939.pth |
| CLIP ViT-L/14 | openai/clip-vit-large-patch14 (auto-downloaded) |

4. Pre-processed PIXAR Dataset

See the Data section below.


🗄️ Data

We provide two ways to obtain the PIXAR dataset:

Option A – Download preprocessed data (recommended). Training and test sets preprocessed at $\tau$ = 0.05 are available on Google Drive.

The drive contains 9 files in total:

| File | Description |
| --- | --- |
| train_0.05.tar.gz | Full training set – includes all tampered-type images |
| train_mask-only_0.05.tar.gz | Lite training set – mask-only tampered-type images; intended for training the lite model variant |
| test_full_0.05.tar.gz | Full test set – all 6 generation sources combined (see splits below) |
| test_qwen_0.05.tar.gz | Test split – Qwen-generated images |
| test_gemini2.5_0.05.tar.gz | Test split – Gemini 2.5-generated images |
| test_gemini3_0.05.tar.gz | Test split – Gemini 3-generated images |
| test_flux2_0.05.tar.gz | Test split – FLUX 2-generated images |
| test_gpt_0.05.tar.gz | Test split – GPT-generated images |
| test_seedream_0.05.tar.gz | Test split – SeeDream-generated images |

test_full_0.05.tar.gz aggregates all 6 per-source test splits (qwen, gemini2.5, gemini3, flux2, gpt, seedream).

Option B – Build from raw data with a custom $\tau$. We release the raw image pairs alongside the pixel-difference maps, allowing labels to be re-derived at any $\tau$. See Custom Dataset Processing for details.

Dataset Format

```
dataset_dir/
├── train/
│   ├── real/               # Authentic images
│   ├── full_synthetic/     # Fully generated images (empty)
│   ├── tampered/           # AI-tampered images
│   ├── masks/              # Masks used to generate the tampered images
│   ├── soft_masks/         # Pixel-difference maps M_τ (default τ = 0.05)
│   └── metadata/           # JSON per tampered image: {"cls": [...], "text": "..."}
└── validation/
    └── (same structure)
```
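Given that layout, one sample can be assembled by joining the per-folder paths and reading the metadata JSON. A minimal sketch; the .png extension and the shared-stem naming across folders are assumptions about the layout above:

```python
import json
from pathlib import Path

def load_sample(dataset_dir, split, stem):
    """Collect the paths and labels for one tampered image named <stem>."""
    root = Path(dataset_dir) / split
    meta = json.loads((root / "metadata" / f"{stem}.json").read_text())
    return {
        "tampered": root / "tampered" / f"{stem}.png",      # edited image
        "soft_mask": root / "soft_masks" / f"{stem}.png",   # pixel-level M_tau
        "cls": meta["cls"],    # semantic classes of the tampered objects
        "text": meta["text"],  # natural-language description of the edit
    }
```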

Custom Dataset Processing

If you want to build the dataset from scratch at a different $\tau$, follow these two steps.

Step 1 β€” Download raw data

Follow preprocess_raw/1_download-data/README.md to configure rclone, populate preprocess_raw/1_download-data/files.txt with the raw zip filenames, and run:

```shell
bash preprocess_raw/1_download-data/download.sh
```

Step 2 β€” Build dataset at your preferred $\tau$

Edit the config block in preprocess_raw/2_construct_dataset/generate_v2.sh:

```shell
DATASET_DIR="/path/to/raw_outputs"   # output of download.sh
OUT_DIR="/path/to/output_dataset"    # where to write the processed dataset
TAOS=(0.05)                          # e.g. (0.01) (0.1) (0.2) or multiple values
```

Then run:

```shell
cd preprocess_raw/2_construct_dataset

bash generate_v2.sh          # mask-only labels
bash generate_v2-text.sh     # labels + text descriptions (requires descriptions.csv)
```

$\tau$ selection guide:

| $\tau$ | Effect |
| --- | --- |
| 0.01 | Captures micro-edits and subtle pixel changes |
| 0.05 | Default – balanced sensitivity (recommended) |
| 0.1 | High-confidence semantic changes only |
| 0.2 | Conservative – only large, obvious edits |
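The effect of τ can be checked directly on a difference map by measuring what fraction of pixels each threshold flags. A toy NumPy illustration; real maps come from the preprocess_raw pipeline:

```python
import numpy as np

def coverage_by_tau(diff: np.ndarray, taus=(0.01, 0.05, 0.1, 0.2)) -> dict:
    """Fraction of pixels flagged as tampered at each threshold tau."""
    return {tau: float((diff > tau).mean()) for tau in taus}

# A toy difference map: mostly untouched, one subtle edit, one strong edit.
diff = np.zeros((10, 10))
diff[0, :5] = 0.03   # micro-edit: visible only at tau = 0.01
diff[1, :5] = 0.5    # strong edit: visible at every tau
cov = coverage_by_tau(diff)
# Raising tau can only shrink the flagged region.
```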

🚀 Training

You can launch training directly or use the provided scripts under finetune/:

```shell
deepspeed --include localhost:0 --master_port=12345 train_PIXAR.py \
  --version <path_to_base_model> \
  --dataset_dir <path_to_dataset> \
  --vision_pretrained <path_to_sam_vit_h.pth> \
  --val_dataset <path_to_validation_set> \
  --batch_size 2 --epochs 10 --steps_per_epoch 1000 \
  --lr 1e-4 --dice_loss_weight 1.0 --seg_prompt_mode seg_only \
  --precision bf16 --exp_name "pixar_experiment" --log_base_dir ./runs
```

Key hyperparameters: LoRA rank 8, $\lambda_\text{dice}$ = 1.0, $\lambda_\text{sem}$ = 0.1, $\lambda_\text{text}$ = 3.0, $\tau$ = 0.05.

Merging LoRA Weights

After training, convert the DeepSpeed checkpoint and merge LoRA adapters before evaluation:

```shell
# Step 1: Convert DeepSpeed checkpoint to fp32
cd runs/<exp_name>/ckpt
python zero_to_fp32.py . ../pytorch_model.bin

# Step 2: Merge LoRA adapters into the base model
# Set the path to pytorch_model.bin in merge.sh, then run:
bash merge.sh
```

📏 Evaluation

See evaluation/README.md for the full guide.

```shell
# Multi-GPU parallel evaluation
python test_parallel.py \
  --version <merged_model> --dataset_dir <test_data> \
  --vision_pretrained <sam.pth> --gpus 0,1,2,3 \
  --output_dir ./evaluation/logs/my_exp \
  --seg_prompt_mode seg_only --precision bf16 --save_generated_text
```

```shell
# Text quality (CSS score; requires --save_generated_text above)
cd evaluation/text_eval
python compute_css.py \
  --json_path ../logs/my_exp/generated_texts.json \
  --output_path ../logs/my_exp/css_scores.json
```

💬 Interactive Inference

```shell
python chat.py --version <merged_model> --precision bf16 --seg_prompt_mode seg_only
```

We also provide a set of demo notebooks in the playground/ directory that you can use to explore the model interactively.


📝 Citation

If you find this work useful, please cite:

About

Official implementation of paper: From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering
