elliot-project/synth-data-bench-training
Synthetic Multimodal WebDatasets for Benchmarking

This repository provides a pipeline for generating synthetic multimodal datasets and packing them using megatron-energon and webdataset. It is designed as a starting point for VLM (Vision-Language Model) training frameworks, demonstrating how to handle interleaved images, multi-turn conversations, and efficient data packing.

Contact: tockier@cvc.uab.cat (Computer Vision Center)

Features

  • Synthetic Generation: Generate massive datasets with randomized text (Lorem Ipsum) and random images (Gaussian noise).
  • Multimodal Support: Support for captioning (1 image), VQA (multiple turns), and interleaved data (multiple images per sample).
  • Megatron-Energon Integration: Ready-to-use TaskEncoder and Cookers for megatron-energon.
  • Data Packing: Demonstrates how to pack multiple variable-length samples into a single fixed-length sequence using cu_seqlens.
  • Diagnostic Tools: High-fidelity visualization of token distributions and decoded text in packed batches.
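The packing idea behind the cu_seqlens feature can be sketched in plain NumPy. This is only an illustration of the cu_seqlens convention (cumulative sequence lengths marking sample boundaries inside one fixed-length buffer), not the project's actual TaskEncoder code; pack_samples is a hypothetical helper:

```python
import numpy as np

def pack_samples(samples, max_len, pad_id=0):
    """Pack variable-length token sequences into one fixed-length buffer,
    recording each sample's end offset in cu_seqlens (cumulative lengths)."""
    packed = np.full(max_len, pad_id, dtype=np.int64)
    cu_seqlens = [0]
    offset = 0
    for tokens in samples:
        if offset + len(tokens) > max_len:
            break  # sample does not fit; a real packer would start a new pack
        packed[offset:offset + len(tokens)] = tokens
        offset += len(tokens)
        cu_seqlens.append(offset)
    return packed, np.array(cu_seqlens, dtype=np.int32)

samples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed, cu = pack_samples(samples, max_len=12)
# cu == [0, 3, 5, 9]; positions 9..11 of packed remain padding
```

Attention kernels that accept cu_seqlens can then treat the single packed sequence as three independent samples, avoiding per-sample padding waste.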

Project Structure

  • src/generate.py: Main script for synthetic dataset generation.
  • src/task_encoders.py: Contains TaskEncoder implementations and Cookers.
  • src/viz_synthetic.py: Visualizes token distributions (image vs. text vs. padding).
  • src/viz_text.py: Decodes and prints the text within packed batches.
  • configs/: TOML configuration files for various dataset types (Captioning, VQA, Interleaved).
  • ENERGON_DOCS.md: Detailed documentation on Energon integration.
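The exact schema of the TOML configs is defined by src/generate.py and is not reproduced here; a config for a VQA-style dataset might look roughly like the sketch below. All key names are illustrative assumptions, not the project's actual schema — consult the files in configs/ for the real format:

```toml
# Illustrative only -- see configs/ for the real key names.
[dataset]
name = "vqa"
num_samples = 10000
output_dir = "data/vqa"

[images]
per_sample = 1
size = [224, 224]

[text]
turns = 3            # user/assistant pairs per sample
words_per_turn = 12  # Lorem Ipsum length
```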

Installation

This project uses uv for dependency management. To set up the environment:

uv sync

Workflow

1. Generate Synthetic Datasets

Use the provided configurations or create your own to generate WebDataset shards:

# Generate a simple VQA dataset
uv run python src/generate.py configs/vqa.toml

# Generate an interleaved dataset with multiple images
uv run python src/generate.py configs/interleaved.toml
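Under the hood, a WebDataset shard is just a tar archive whose members share a key prefix, with one member per field of a sample. The sketch below shows that layout using only the standard library; it is a conceptual stand-in, not the code in src/generate.py (the real pipeline writes actual image encodings rather than raw noise bytes):

```python
import io
import random
import tarfile

def write_shard(path, num_samples, img_size=64):
    """Write a minimal WebDataset-style shard: each sample is a pair of
    tar members sharing a key ('<key>.img' noise bytes, '<key>.txt')."""
    with tarfile.open(path, "w") as tar:
        for i in range(num_samples):
            key = f"sample{i:06d}"
            # Stand-in for a Gaussian-noise image: just random bytes.
            img = bytes(random.getrandbits(8) for _ in range(img_size * img_size))
            txt = b"Lorem ipsum dolor sit amet."
            for suffix, payload in ((".img", img), (".txt", txt)):
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

write_shard("shard-000000.tar", num_samples=4)
```

Because members with the same prefix are adjacent in the tar, readers can stream samples sequentially without an index, which is what makes the format efficient at scale.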

2. Prepare for Energon

Before using the dataset with Energon, you must prepare the metadata:

uv run energon prepare data/vqa --non-interactive --split-ratio 1.0,0,0 --sample-type CrudeWebdataset

Note: We use CrudeSample to keep the raw data accessible to our custom Cookers.
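Conceptually, a Cooker turns the raw field dict of a crude sample into a typed training sample. The snippet below illustrates that transformation with plain Python only; it deliberately avoids the megatron-energon API, and VQASample, cook_vqa, and the line-pair convention are all hypothetical (the project's real Cookers live in src/task_encoders.py):

```python
from dataclasses import dataclass

@dataclass
class VQASample:
    """Hypothetical structured sample a cooker might produce."""
    key: str
    image: bytes
    turns: list

def cook_vqa(raw: dict) -> VQASample:
    """Illustrative stand-in for a Cooker: map the raw WebDataset
    fields of a crude sample to a typed sample."""
    text = raw["txt"].decode("utf-8")
    # Pair up lines as (user, assistant) turns, purely for illustration.
    lines = text.splitlines()
    turns = list(zip(lines[0::2], lines[1::2]))
    return VQASample(key=raw["__key__"], image=raw["img"], turns=turns)

raw = {"__key__": "sample000000",
       "img": b"\x00" * 16,
       "txt": b"What color?\nBlue.\nHow many?\nThree."}
cooked = cook_vqa(raw)
# cooked.turns == [("What color?", "Blue."), ("How many?", "Three.")]
```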

3. Visualization and Inspection

Verify that the data is being loaded and packed correctly.

Token Distribution Map

Visualize how User Text, Assistant Text, Images, and Padding are distributed in a batch:

uv run python src/viz_synthetic.py \
    --dataset data/vqa \
    --encoder-class DataPackingEncoder \
    --output visualizations/vqa_tokens.png

Text Inspector

Decode the actual text being fed to the model:

uv run python src/viz_text.py \
    --dataset data/vqa \
    --encoder-class DataPackingEncoder

Examples

  • Captioning dataset with small images: captioning
  • VQA dataset with multiple user-assistant turns: vqa
  • Interleaved dataset with multiple images in the same sample: interleaved

Contributing

For more details on how to extend the TaskEncoder or add new Cookers, please refer to ENERGON_DOCS.md.

Feel free to open issues or pull requests.

About

Scripts to generate synthetic WebDatasets for benchmarking large-scale codebases. Part of task T2.2.
