From 66ac7fa6b2c7e0644493d31b5747dec286cb82d2 Mon Sep 17 00:00:00 2001
From: BettiM7
From Raw Footage to Recipe: Extracting Cooking Steps from Egocentric Video
Video representations are produced by V-JEPA 2, a self-supervised encoder that requires no labeled pretraining data. Each video is encoded as a sequence of embeddings, one per 64-frame block.
The result is an end-to-end pipeline that turns an unstructured kitchen video into a structured, step-by-step recipe.
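
To make the 64-frame block encoding concrete, here is a minimal sketch of how a video might be split into blocks before each block is embedded. It is an illustration under stated assumptions, not the project's code: `block_ranges`, `encode_video`, and the `keep_partial` flag are hypothetical names, the `encoder` argument is a stand-in for a real V-JEPA 2 forward pass, and the choice to drop a trailing block shorter than 64 frames is a guess rather than something the text specifies.

```python
from typing import Callable, List, Sequence, Tuple

BLOCK_SIZE = 64  # frames per block, as stated in the text


def block_ranges(num_frames: int, block_size: int = BLOCK_SIZE,
                 keep_partial: bool = False) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges, one per block.

    Assumption: a trailing block with fewer than `block_size` frames is
    dropped unless `keep_partial` is True.
    """
    if num_frames < 0 or block_size <= 0:
        raise ValueError("num_frames must be >= 0 and block_size > 0")
    ranges = []
    for start in range(0, num_frames, block_size):
        end = min(start + block_size, num_frames)
        if end - start < block_size and not keep_partial:
            break  # only the last block can be short
        ranges.append((start, end))
    return ranges


def encode_video(frames: Sequence,
                 encoder: Callable[[Sequence], List[float]],
                 block_size: int = BLOCK_SIZE) -> List[List[float]]:
    """Apply `encoder` to each block and return one embedding per block.

    `encoder` is a hypothetical placeholder for the video model; it takes
    the frames of one block and returns that block's embedding.
    """
    return [encoder(frames[start:end])
            for start, end in block_ranges(len(frames), block_size)]


if __name__ == "__main__":
    # Toy data: "frames" are plain numbers, and the stand-in encoder
    # returns their mean as a one-dimensional embedding.
    toy_frames = list(range(200))
    embeddings = encode_video(toy_frames, lambda block: [sum(block) / len(block)])
    print(len(embeddings), "block embeddings")
```

With 200 toy frames, this yields three full blocks and discards the last 8 frames. Whether to drop, pad, or keep that remainder is a design choice that depends on how the downstream step extraction handles variable-length inputs.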