Builds a pseudo-labeled cooking video dataset for object state understanding.
- Put raw videos under dataset/raw_videos/.
- Add video metadata either in configs/dataset_config.yaml under videos: or in dataset/metadata/video_metadata.json:

      [
        {
          "video_id": "onion_001",
          "video_path": "raw_videos/onion_001.mp4",
          "title": "How to dice an onion",
          "object": "onion",
          "task": "dice onion"
        }
      ]

- Create and activate an isolated Python environment, then install dependencies:

      python3 -m venv .venv
      source .venv/bin/activate
      pip install -r requirements.txt

- Run the pipeline:

      python src/run_pipeline.py --config configs/dataset_config.yaml

Outputs:

- dataset/frame_dataset.jsonl
- dataset/temporal_dataset.jsonl
- splits/train.jsonl, splits/val.jsonl, splits/test.jsonl
- dataset/metadata/video_metadata.json
- per-video frames, masks, crops, scores, pseudo-labels, features, and final metadata.

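
Before running the pipeline, it can help to sanity-check the metadata file described in the setup steps. The sketch below is not part of this repository: `validate_metadata` is a hypothetical helper, and it assumes that every field shown in the example entry is required and that `video_path` is relative to the dataset/ directory. Adjust both assumptions to match what the pipeline actually expects.

```python
import json
from pathlib import Path

# Assumption: all fields from the example metadata entry are required.
REQUIRED_FIELDS = ("video_id", "video_path", "title", "object", "task")


def validate_metadata(entries, dataset_root=None):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    seen_ids = set()
    for index, entry in enumerate(entries):
        # Report fields that are absent or empty.
        missing = [field for field in REQUIRED_FIELDS if not entry.get(field)]
        if missing:
            problems.append(f"entry {index}: missing {', '.join(missing)}")

        # Each video_id should appear only once.
        video_id = entry.get("video_id")
        if video_id in seen_ids:
            problems.append(f"entry {index}: duplicate video_id {video_id!r}")
        elif video_id:
            seen_ids.add(video_id)

        # Assumption: video_path is relative to dataset_root (e.g. dataset/).
        if dataset_root is not None and entry.get("video_path"):
            if not (Path(dataset_root) / entry["video_path"]).is_file():
                problems.append(f"entry {index}: file not found: {entry['video_path']}")
    return problems


if __name__ == "__main__":
    metadata_file = Path("dataset/metadata/video_metadata.json")
    if metadata_file.exists():
        entries = json.loads(metadata_file.read_text(encoding="utf-8"))
        for problem in validate_metadata(entries, dataset_root="dataset"):
            print(problem)
```

Passing `dataset_root=None` skips the file-existence check, which is useful when the videos have not been copied into place yet.
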
The detector, tracker, VLM, and feature modules expose fallback implementations so the pipeline can run locally. When the models are available, replace the internals of detect_objects.py, track_masks.py, score_states.py, and extract_features.py with Grounding DINO, SAM2, CLIP/SigLIP, and DINOv2 adapters, respectively.
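
One way to structure such a swap is to choose between a model-backed adapter and the fallback at runtime. The following is a minimal sketch under stated assumptions, not the code in detect_objects.py: `fallback_detector`, `select_detector`, the `(width, height, object_name)` call signature, and the returned dictionary layout are all hypothetical, and the centered-box behavior is only a stand-in for whatever the real fallback does.

```python
import importlib.util
from typing import Callable, Dict, List

Detection = Dict[str, object]
Detector = Callable[[int, int, str], List[Detection]]


def fallback_detector(width: int, height: int, object_name: str) -> List[Detection]:
    """Hypothetical placeholder: one centered box so downstream steps have input."""
    box = [width // 4, height // 4, 3 * width // 4, 3 * height // 4]
    return [{"label": object_name, "box": box, "score": 0.0}]


def select_detector(module_name: str, build_adapter: Callable[[], Detector]) -> Detector:
    """Return the model-backed detector only if its package is installed.

    module_name should be a top-level package name; build_adapter is called
    lazily so the heavy model is never constructed when the package is missing.
    """
    if importlib.util.find_spec(module_name) is not None:
        return build_adapter()
    return fallback_detector


if __name__ == "__main__":
    # With a package name that is not installed, the fallback is selected.
    detector = select_detector("package_that_is_not_installed", lambda: fallback_detector)
    print(detector(640, 480, "onion"))
```

Keeping the same call signature for the fallback and the adapter means the rest of the pipeline does not need to know which one is in use; the same pattern would apply to the tracker, scorer, and feature extractor.
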