From Raw Footage to Recipe: Extracting Cooking Steps from Egocentric Video
This project builds a system that watches egocentric cooking videos and automatically extracts the sequence of cooking actions performed, with the goal of reconstructing a recipe from raw footage alone. Because most frames in a cooking video show nothing relevant to the recipe, the pipeline first applies a relevance classifier to filter out background activity, then passes the remaining clips to an RNN-based action classifier that identifies steps such as cutting, peeling, and boiling. Video representations come from V-JEPA 2, a self-supervised model that needs no labeled pretraining data; it encodes each video as a sequence of embeddings, one per 64-frame block. The result is an end-to-end pipeline that turns an unstructured kitchen video into a structured, step-by-step recipe.
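
The data flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: `split_into_blocks`, `extract_steps`, and `format_recipe` are hypothetical names, the relevance classifier and the RNN action classifier are passed in as plain callables, and the block embeddings are assumed to have been computed already by the encoder. Dropping a trailing partial block and merging consecutive identical labels into a single step are also assumptions made for the sketch.

```python
from itertools import groupby
from typing import Callable, List, Sequence

BLOCK_FRAMES = 64  # block length stated in the text

Embedding = Sequence[float]


def split_into_blocks(frames: Sequence, block: int = BLOCK_FRAMES) -> List[Sequence]:
    """Split frames into non-overlapping blocks of `block` frames.

    Assumption: a trailing partial block is dropped.
    """
    n_blocks = len(frames) // block
    return [frames[i * block:(i + 1) * block] for i in range(n_blocks)]


def extract_steps(
    embeddings: Sequence[Embedding],
    is_relevant: Callable[[Embedding], bool],
    classify_actions: Callable[[List[Embedding]], List[str]],
) -> List[str]:
    """Filter block embeddings by relevance, label them, and merge repeats.

    `is_relevant` stands in for the relevance classifier and
    `classify_actions` for the RNN-based action classifier; both are
    hypothetical callables supplied by the caller.
    """
    relevant = [e for e in embeddings if is_relevant(e)]
    if not relevant:
        return []
    labels = classify_actions(relevant)  # one label per relevant block
    # Assumption: consecutive blocks with the same label form one step.
    return [label for label, _ in groupby(labels)]


def format_recipe(steps: Sequence[str]) -> str:
    """Render the ordered steps as a numbered, plain-text recipe."""
    return "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
```

With a trained encoder and classifiers plugged in, a call would look like `format_recipe(extract_steps(embeddings, is_relevant, classify_actions))`. Keeping the two classifiers as separate callables mirrors the two-stage design in the text, where irrelevant footage is discarded before any action labels are predicted.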