OpenUni/docs/DATASETS.md at main · wusize/OpenUni

Pretrain

Some of our pretraining data come directly from open-source datasets; see megalith10m.md and text2image2m.md. Additionally, we collected and re-captioned 6M LAION images and 5M RedCaps images. Due to legal concerns, we cannot redistribute these images. Please refer to laion6m.md and redcaps5m.md for the URLs and captions.

Because RedCaps images are very large and hard to obtain, we provide an alternative dataset for this part of the data: CC12M. Again, we do not have the right to redistribute the images, so we upload only our captions. We recommend downloading the images from other sources such as pixparse/cc12m-wds. A guide to preparing the CC12M dataset is in cc12m.md.
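Since the images and our captions come from different sources, they have to be joined at some point. The snippet below is only a rough sketch of that join, not the procedure from cc12m.md: the `key` field, the `caption` field, and the in-memory dict of captions are all hypothetical, and the real file layout should be taken from cc12m.md.

```python
def attach_captions(image_samples, captions, id_field="key"):
    """Yield image samples with a re-written caption attached.

    Assumptions (hypothetical, not from the OpenUni docs):
    - each image sample is a dict carrying an identifier under `id_field`
    - `captions` maps that same identifier to a caption string
    Samples without a matching caption are skipped.
    """
    for sample in image_samples:
        caption = captions.get(sample[id_field])
        if caption is None:
            continue  # no caption available for this image
        yield {**sample, "caption": caption}


if __name__ == "__main__":
    images = [{"key": "000001", "image": b"..."}, {"key": "000002", "image": b"..."}]
    captions = {"000001": "A brown dog running on a beach."}
    for paired in attach_captions(images, captions):
        print(paired["key"], paired["caption"])
```

Skipping unmatched samples rather than raising keeps the loop usable when some image downloads have failed, which is common for URL-based datasets.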

Finetune

Currently, we use BLIP3o/BLIP3o-60k as the finetuning dataset to compare with prior models such as BLIP3o-4B. A guide to preparing this dataset is in blip3o60k.md. However, BLIP3o/BLIP3o-60k contains 10K data samples that use exactly the same prompt templates as GenEval, which drastically boosts performance on this benchmark. If you are concerned about data leakage, we recommend using text2image2m.md for finetuning instead.
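If you still want to use BLIP3o/BLIP3o-60k but would like to see how many samples look like benchmark prompts, a simple template check can help. The following is a minimal sketch, not part of our pipeline: the `prompt` field name is hypothetical, and the pattern assumes GenEval-style prompts begin with "a photo of", which you should verify against the actual GenEval prompt list before relying on it.

```python
import re

# Assumption: GenEval-style prompts start with "a photo of".
# This pattern is illustrative; adjust it to the real templates.
GENEVAL_LIKE = re.compile(r"^\s*a photo of\b", re.IGNORECASE)


def split_by_template(samples, pattern=GENEVAL_LIKE, field="prompt"):
    """Split samples into (kept, flagged) lists.

    A sample is flagged when its prompt matches `pattern`.
    `field` is a hypothetical name for the prompt entry of a sample.
    """
    kept, flagged = [], []
    for sample in samples:
        if pattern.search(sample[field]):
            flagged.append(sample)
        else:
            kept.append(sample)
    return kept, flagged


if __name__ == "__main__":
    demo = [
        {"prompt": "A photo of a red car"},
        {"prompt": "Sunset over a calm sea, oil painting"},
    ]
    kept, flagged = split_by_template(demo)
    print(len(kept), "kept,", len(flagged), "flagged")
```

A prefix match like this is deliberately coarse: it can flag harmless captions and miss paraphrased templates, so treat the flagged count as an estimate rather than an exact measure of overlap.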
