Guanyu Zhou¹, Yida Yin¹, Wenhao Chai¹, Shengbang Tong², Xingyu Fu¹, Zhuang Liu¹
¹Princeton University, ²New York University
Use VisionFoundry to generate your own Synthetic Images Dataset with just one keyword!
Training
We use the ms-swift framework for SFT training. Please follow the official installation guidance and ensure your environment matches the recommended versions. The simplest install is:
pip install ms-swift -U
If you need a source install, clone the repo and run pip install -e . as documented. Refer to the official ms-swift README for full requirements and options.
Evaluation
We use VLMEvalKit for evaluation. A minimal setup follows the official quickstart:
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
Then configure model paths and keys, and run evaluations with python run.py ... or torchrun ... as needed.
The VisionFoundry pipeline is implemented in data_engine/vision_foundry.py. It generates synthetic VQA data and images using OpenAI and Gemini APIs.
We run the pipeline inside the ms-swift environment and add the following packages:
pip install openai google-genai pillow requests tqdm numpy
Copy and edit data_engine/config.example.json (or provide your own JSON via --api_config). The default configuration uses the official OpenAI and Gemini APIs. To use a custom OpenAI-compatible endpoint, edit the base_url field in the config.
Required environment variables:
- OPENAI_API_KEY
- GEMINI_API_KEY
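A long generation run that fails on its first API call wastes little time, but a quick check before launching still helps. The following is a minimal sketch, not part of the repository: the helper name missing_api_keys is hypothetical, and it only tests that the two variables listed above are set and non-empty, not that the keys are valid.

```python
import os

# The two variables the README lists as required.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY")


def missing_api_keys(env=None):
    """Return the required variable names that are unset or blank.

    `env` defaults to the process environment; passing a dict makes
    the check easy to try out without touching real credentials.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
print(missing_api_keys({"OPENAI_API_KEY": "sk-test"}))  # ['GEMINI_API_KEY']
```

Calling missing_api_keys() with no argument inspects the current shell environment; an empty list means both variables are present.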
Example (single-image mode):
python data_engine/vision_foundry.py \
--task "<YOUR_TASK_DESCRIPTION>" \
--num 200 \
--mode single \
--output_dir ./output \
--annotation_output annotations.json \
--prompts_output prompts.jsonl \
--statements_output statements.jsonl \
--pool_output pool.json \
--api_config data_engine/config.example.json
Example (multi-image mode, beta):
python data_engine/vision_foundry.py \
--config /path/to/task_config.json \
--num 200 \
--mode multi \
--num_images 3 \
--multi_image_form story_chain \
--output_dir ./output_multi \
--api_config data_engine/config.example.json
Parameters (all):
- --task: Task short description (used when --config is not provided)
- --config: Path to JSON config file (task template)
- --save_config_template: Save example task templates to a path
- --api_config: Path to API config JSON
- --num: Number of cases to generate
- --mode: single or multi
- --num_objects: Number of objects per case
- --num_images: Number of images in multi mode
- --multi_image_form: multi_generate, story_chain, or mixed
- --objects_size: Size of auto-generated objects list
- --attributes_size: Size of auto-generated attributes list
- --scenes_size: Size of auto-generated scenes list
- --styles_size: Size of auto-generated styles list
- --max_items_per_call: Max items per LLM call for pool generation
- --llm_decide_attr_size: Let LLM estimate attribute pool size
- --objects: Custom object list
- --attributes: Custom attributes list
- --scenes: Custom scenes list
- --styles: Custom styles list
- --global_pool: Path to a global pool JSON
- --generate_missing: Auto-generate missing lists
- --max_iter: Max generation attempts per case
- --output_dir: Output directory
- --annotation_output: Annotation file name
- --prompts_output: Prompts file name
- --statements_output: Statements file name
- --pool_output: Pool file name
- --seed: Random seed
- --parallel: Number of parallel workers
- --use_edit: Enable image-editing repair after failed verification
The pipeline produces:
- annotations.json: Training annotations in multi-image or single-image format
- prompts.jsonl: Prompts used for image generation
- statements.jsonl: Verification statements
- pool.json: The final object/attribute/scene/style pool
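After a run it can be useful to confirm how many entries each file actually holds. The sketch below is an illustration only and makes several assumptions that the repository does not confirm: the files use the names from the single-image example command, they sit directly under the directory passed as --output_dir, and annotations.json has a JSON list at the top level. The helper names read_jsonl and summarize_outputs are hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def summarize_outputs(output_dir):
    """Count entries in each known output file found under output_dir.

    Files that do not exist are simply left out of the summary.
    """
    output_dir = Path(output_dir)
    summary = {}

    annotations = output_dir / "annotations.json"
    if annotations.exists():
        with open(annotations, encoding="utf-8") as f:
            # Assumes a top-level list of records.
            summary["annotations.json"] = len(json.load(f))

    for name in ("prompts.jsonl", "statements.jsonl"):
        path = output_dir / name
        if path.exists():
            summary[name] = len(read_jsonl(path))

    return summary
```

For example, summarize_outputs("./output") returns a dict mapping each file name it found to its entry count, which can be compared against the value passed to --num.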
We provide three training scripts under train_scripts/:
- train_scripts/train_qwen.sh (for Qwen2.5-VL-3B-Instruct)
- train_scripts/train_mimo.sh (for MiMo-VL-7B-SFT)
- train_scripts/train_llama.sh (for Llama-3.2-11B-Vision-Instruct)
Each script contains two runnable templates:
- Local single-node (e.g., a single 8-GPU machine)
- Slurm cluster submission
Fill in the placeholders for model path, dataset path, output directory, and logging directory, then uncomment the section you want to use.
After downloading the dataset from HuggingFace:
huggingface-cli download zlab-princeton/VisionFoundry-10K
Then run python restore_images_from_parquet.py to rebuild the image folder structure from images.parquet. Each record in annotations.json contains a messages list and an images list following the ms-swift format.
We follow the ms-swift multimodal SFT JSON format. Each item contains a messages list and an images list. A minimal single-image example:
{
"messages": [
{"role": "user", "content": "<image>\nWhat is the color of the car?"},
{"role": "assistant", "content": "red"}
],
"images": ["/abs/or/rel/path/to/image.png"],
"qid": 1
}
Multi-image examples use <images> in the user message and provide multiple image paths:
{
"messages": [
{"role": "user", "content": "<images>\nWhat changed across the images?"},
{"role": "assistant", "content": "the cup moved to the left"}
],
"images": ["img_0.png", "img_1.png", "img_2.png"],
"qid": 1
}
Our pipeline outputs annotations.json in this format. You can also prepare your own dataset by writing records with the same fields.
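When preparing your own dataset, a small helper can keep records consistent with the two examples above. This is a sketch under stated assumptions rather than code from the repository: make_record and record_problems are hypothetical names, the placeholder choice (<image> for one path, <images> for several) simply mirrors the examples, and the checks cover only the fields shown here, not everything ms-swift may validate.

```python
import json


def make_record(question, answer, image_paths, qid):
    """Build one record shaped like the examples above."""
    image_paths = list(image_paths)
    if not image_paths:
        raise ValueError("at least one image path is required")
    # Placeholder choice mirrors the single- and multi-image examples.
    placeholder = "<image>" if len(image_paths) == 1 else "<images>"
    return {
        "messages": [
            {"role": "user", "content": f"{placeholder}\n{question}"},
            {"role": "assistant", "content": answer},
        ],
        "images": image_paths,
        "qid": qid,
    }


def record_problems(record):
    """Return a list of shape problems; an empty list means none found."""
    problems = []
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("messages must be a non-empty list")
    else:
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict) or not {"role", "content"} <= set(msg):
                problems.append(f"messages[{i}] needs 'role' and 'content'")
    images = record.get("images")
    if not isinstance(images, list) or not all(isinstance(p, str) for p in images):
        problems.append("images must be a list of path strings")
    return problems


record = make_record("What is the color of the car?", "red",
                     ["/abs/or/rel/path/to/image.png"], qid=1)
print(json.dumps(record, indent=2))
```

Collecting such records in a list and writing it with json.dump gives a file with the same record shape as the pipeline's annotations.json, assuming that file is a plain list of records.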
To launch training, fill in the placeholders in one of the scripts and run:
bash train_scripts/train_qwen.sh
We do not include custom evaluation code in this repo. Use VLMEvalKit's built-in benchmark support:
- Install VLMEvalKit (see above).
- Configure your model in vlmeval/config.py and ensure the model weights are accessible.
- Run evaluation, for example:
python run.py --data <BENCH1> <BENCH2> --model <MODEL_NAME> --verbose
For distributed inference, use torchrun --nproc-per-node=<N> run.py .... See the VLMEvalKit quickstart for more details.
When evaluating with MiMo-VL-7B-SFT, please append /no_think to the prompt to disable chain-of-thought reasoning for a fair comparison. See the official repo for details: XiaomiMiMo/MiMo-VL.
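One way to apply this consistently is a small wrapper around the prompt text. The sketch below is hypothetical and not taken from this repo or from VLMEvalKit: the function name with_no_think is invented, the single-space separator is an assumption, and where to call it inside your evaluation setup depends on how the model is configured there.

```python
NO_THINK_TAG = "/no_think"


def with_no_think(prompt):
    """Append the /no_think tag once, separated by a single space.

    Trailing whitespace is stripped first, and a prompt that already
    ends with the tag is returned without a second copy.
    """
    stripped = prompt.rstrip()
    if stripped.endswith(NO_THINK_TAG):
        return stripped
    return f"{stripped} {NO_THINK_TAG}"


print(with_no_think("What is the color of the car?"))
# What is the color of the car? /no_think
```

Making the helper idempotent avoids duplicated tags if the prompt passes through it more than once.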
This project is released under the Apache-2.0 license. See LICENSE for details.
If you use this work, please cite:
@article{zhou2026visionfoundry,
title={VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images},
author={Zhou, Guanyu and Yin, Yida and Chai, Wenhao and Tong, Shengbang and Fu, Xingyu and Liu, Zhuang},
journal={arXiv preprint arXiv:2604.09531},
year={2026}
}
This project builds on several strong open-source foundations:
- ms-swift for SFT training infrastructure
- VLMEvalKit for multimodal evaluation
