Skip to content

A fully automated, intelligent photo-editing agent that autonomously plans multi-step aesthetic enhancements, smartly chooses diverse editing tools, and enables everyday users to achieve professional-looking results without crafting complex prompts.

Notifications You must be signed in to change notification settings

mdyao/PhotoAgent

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 

Repository files navigation

PhotoAgent

Agentic Photo Editing with Exploratory Visual Aesthetic Planning

     

Mingde Yao1,5   Zhiyuan You1   King-Man Tam4   Menglu Wang3   Tianfan Xue1,2,5

1CUHK MMLab   2Shanghai AI Lab   3USTC   4Institute of Science Tokyo   5CPII InnoHK

One goal, one click, autonomous enhancement. PhotoAgent turns photos into professionally edited results through exploratory visual aesthetic planning — no step-by-step prompts required.


Paradigm: From Human-in-the-loop to Agent-in-the-loop

Traditional photo editing demands expertise, endless parameter tuning, and exhausting trial-and-error. PhotoAgent replaces this fragile human-in-the-loop pipeline with an autonomous agent-in-the-loop system — one that perceives, plans, executes, and evaluates like a seasoned professional.


Highlights

Feature Description
🔄 Closed-loop Planning Perceive–plan–execute–evaluate cycle with action memory and visual feedback. No open-loop, single-shot edits.
🌳 MCTS-based Aesthetic Planner Monte Carlo Tree Search explores editing trajectories, avoids short-sighted or irreversible decisions.
🧠 Action Memory & History Full editing history prevents redundant operations, enables context-aware decisions and faster convergence.
🎯 Scene-Aware Classification Fine-grained scene classification (portrait, landscape, urban, food, low-light, indoor) with scene-specific strategies.
🛠 Rich Toolset Orchestrates GPT-Image-1, Flux.1 Kontext, Step1X-Edit, Nano Banana, ZImage, and more.
📐 UGC-Oriented Evaluation UGC-Edit dataset (~7,000 photos) and reward model trained via GRPO on Qwen2.5-VL.

How PhotoAgent Works

PhotoAgent formulates autonomous image editing as a long-horizon decision-making problem with four core components in a closed-loop system:

┌──────────────────────────────────────────────────────────────┐
│                                                              │
│   ┌───────────┐    ┌──────────┐    ┌──────────┐    ┌─────────────┐
│   │ Perceiver │───▶│ Planner  │───▶│ Executor │───▶│  Evaluator  │
│   │  (VLM)    │    │  (MCTS)  │    │  (Tools) │    │  (Metrics)  │
│   └───────────┘    └──────────┘    └──────────┘    └──────┬──────┘
│         ▲                                                  │
│         └──────────── memory + feedback ───────────────────┘
│                                                              │
└──────────────────────────────────────────────────────────────┘
Step Component What it does
1 Perceiver VLM (Qwen3-VL) interprets the image and proposes K diverse, atomic editing actions
2 Planner MCTS explores candidate actions via selection, expansion, simulation, and backpropagation
3 Executor Runs selected actions with traditional or generative tools; retains the highest-scoring result
4 Evaluator Ensemble of no-reference metrics, CLIP scores, and UGC reward model; drives re-planning when needed

Results

State-of-the-art on instruction adherence and visual quality; preferred in user studies.

Editing Examples

Input Output Input Output

More results and interactive comparisons on the project page.

Iterative Editing Process

PhotoAgent refines images over multiple iterations, with each round building on the previous result:


Original

Iter 1: Color & tone

Iter 2: Atmosphere

Iter 3: Final polish

UGC-Edit Dataset & Benchmark

Component Details
UGC-Edit Dataset ~7,000 authentic user-generated photos from LAION Aesthetic & RealQA, with human aesthetic scores (1–5 scale)
UGC Reward Model Qwen2.5-VL fine-tuned with GRPO; learns from relative rankings for robust aesthetic scoring
Editing Benchmark 1,017 real-world photos covering portraits, landscapes, urban scenes, food, low-light, and more

Code

Coming soon! Code, pretrained models, and demo will be released here. Star the repo to stay updated.


Citation

@article{yao2025photoagent,
  title   = {PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning},
  author  = {Yao, Mingde and You, Zhiyuan and Tam, King-Man and Wang, Menglu and Xue, Tianfan},
  year    = {2025}
}

License

This project is released under the MIT License.

About

A fully automated, intelligent photo-editing agent that autonomously plans multi-step aesthetic enhancements, smartly chooses diverse editing tools, and enables everyday users to achieve professional-looking results without crafting complex prompts.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages