This repository provides a pipeline for generating synthetic multimodal datasets and packing them using megatron-energon and webdataset. It is designed as a starting point for VLM (Vision-Language Model) training frameworks, demonstrating how to handle interleaved images, multi-turn conversations, and efficient data packing.
Contact: tockier@cvc.uab.cat (Computer Vision Center)
- Synthetic Generation: Generate massive datasets with randomized text (Lorem Ipsum) and random images (Gaussian noise).
- Multimodal Support: Captioning (1 image), VQA (multiple turns), and interleaved data (multiple images per sample).
- Megatron-Energon Integration: Ready-to-use `TaskEncoder` and `Cookers` for megatron-energon.
- Data Packing: Demonstrates how to pack multiple variable-length samples into a single fixed-length sequence using `cu_seqlens`.
- Diagnostic Tools: High-fidelity visualization of token distributions and decoded text in packed batches.
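The packing idea can be sketched without any Energon machinery: concatenate variable-length token sequences into one fixed-length buffer and record the sample boundaries as cumulative lengths (`cu_seqlens`, the convention used by flash-attention-style varlen kernels). A minimal sketch using NumPy (the function and variable names are illustrative, not this project's API):

```python
import numpy as np

def pack_samples(samples, max_len, pad_id=0):
    """Pack variable-length token sequences into one fixed-length buffer.

    Returns the packed buffer and cu_seqlens, the cumulative sequence
    lengths marking each sample's boundary. Assumes the samples fit in
    max_len; a real packer would overflow into a new buffer instead.
    """
    packed = np.full(max_len, pad_id, dtype=np.int64)
    cu_seqlens = [0]
    offset = 0
    for tokens in samples:
        packed[offset:offset + len(tokens)] = tokens
        offset += len(tokens)
        cu_seqlens.append(offset)
    return packed, np.asarray(cu_seqlens)

samples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed, cu = pack_samples(samples, max_len=12)
# cu == [0, 3, 5, 9]; positions 9..11 of packed are padding
```

An attention kernel given `cu_seqlens` can then restrict attention to each sample's own span, so packed samples never attend across boundaries.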
- `src/generate.py`: Main script for synthetic dataset generation.
- `src/task_encoders.py`: Contains `TaskEncoder` implementations and `Cookers`.
- `src/viz_synthetic.py`: Visualizes token distributions (image vs. text vs. padding).
- `src/viz_text.py`: Decodes and prints the text within packed batches.
- `configs/`: TOML configuration files for various dataset types (Captioning, VQA, Interleaved).
- `ENERGON_DOCS.md`: Detailed documentation on Energon integration.
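For orientation, a generation config in `configs/` might look roughly like the following; every key name here is an illustrative assumption, so check `configs/vqa.toml` for the actual schema:

```toml
# Hypothetical VQA dataset config (illustrative keys only)
[dataset]
name = "vqa"
num_samples = 1000

[images]
count = 1
size = [64, 64]   # random Gaussian-noise images

[conversation]
turns = 3         # user/assistant turn pairs per sample
```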
This project uses uv for dependency management. To set up the environment:

```bash
uv sync
```

Use the provided configurations or create your own to generate WebDataset shards:
```bash
# Generate a simple VQA dataset
uv run python src/generate.py configs/vqa.toml

# Generate an interleaved dataset with multiple images
uv run python src/generate.py configs/interleaved.toml
```

Before using the dataset with Energon, you must prepare the metadata:

```bash
uv run energon prepare data/vqa --non-interactive --split-ratio 1.0,0,0 --sample-type CrudeWebdataset
```

Note: We use `CrudeSample` to keep the raw data accessible to our custom `Cookers`.
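A cooker's job is to turn a raw crude sample (a dict of undecoded fields, keyed by the file extensions inside each `.tar` shard) into a typed sample the `TaskEncoder` understands. A minimal, Energon-free sketch of that idea (the field names and the sample class are assumptions; the real implementations live in `src/task_encoders.py`):

```python
import json
from dataclasses import dataclass

@dataclass
class VQASample:
    """Illustrative typed sample; the project defines its own types."""
    image_bytes: bytes
    conversation: list

def cook_vqa(raw: dict) -> VQASample:
    # Raw crude samples map extensions to raw bytes, e.g. "jpg" and "json";
    # these keys are an assumption, not this project's actual shard layout.
    return VQASample(
        image_bytes=raw["jpg"],
        conversation=json.loads(raw["json"]),
    )

raw = {
    "jpg": b"<jpeg bytes>",
    "json": b'[{"role": "user", "content": "What is in the image?"}]',
}
sample = cook_vqa(raw)
```

Keeping samples crude until this step means the pipeline never pays for decoding fields a given task does not use.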
Verify that the data is being loaded and packed correctly.
Visualize how User Text, Assistant Text, Images, and Padding are distributed in a batch:
```bash
uv run python src/viz_synthetic.py \
    --dataset data/vqa \
    --encoder-class DataPackingEncoder \
    --output visualizations/vqa_tokens.png
```

Decode the actual text being fed to the model:
```bash
uv run python src/viz_text.py \
    --dataset data/vqa \
    --encoder-class DataPackingEncoder
```

Captioning dataset with small images:

VQA dataset with multiple user-assistant turns:

Interleaved dataset with multiple images in the same sample:

For more details on how to extend the `TaskEncoder` or add new `Cookers`, please refer to `ENERGON_DOCS.md`.
Feel free to open issues or PRs.