diff --git a/index.html b/index.html index 7ceb546..7c25518 100644 --- a/index.html +++ b/index.html @@ -301,51 +301,6 @@

Open-Vocabulary Object Tracking with Grounding DINO, SAM 2 and CLIP

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- + +
+ +
+

Video understanding, vision-language models, action captioning

+

Action/Event-Focused Captioning: A Three-Model Comparison

+

+ This project explores how pretrained image-captioning models can be adapted to produce short, action-focused captions for video activity timelines. Instead of generating long descriptive captions, we fine-tune BLIP, ViT-GPT2, and Microsoft GIT on COCO action captions so that each model outputs compact labels such as “person walking” or “coffee being poured.” +

+ For video inference, frames are sampled over time, captioned by the fine-tuned models, and de-duplicated into a simple activity timeline. The project compares each model before and after fine-tuning using BLEU-1, BLEU-2, METEOR, and ROUGE-L, and analyzes whether the choice of architecture still matters once all three models have been adapted to the same action-caption task. +
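
The de-duplication step can be illustrated with a minimal sketch. It assumes the captioning stage has already produced `(timestamp, caption)` pairs for the sampled frames, and that "de-duplicated" means merging consecutive frames with the same caption into one segment. The names `Segment` and `build_timeline` are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Segment:
    caption: str
    start: float  # timestamp (seconds) of the first frame with this caption
    end: float    # timestamp (seconds) of the last consecutive frame with it


def build_timeline(frame_captions: Iterable[Tuple[float, str]]) -> List[Segment]:
    """Merge consecutive frames that share a caption into one segment.

    Assumption: captions are compared after trimming whitespace and
    lowercasing; the project may use a different matching rule.
    """
    timeline: List[Segment] = []
    for timestamp, caption in frame_captions:
        label = caption.strip().lower()
        if timeline and timeline[-1].caption == label:
            # Same activity as the previous frame: extend the open segment.
            timeline[-1].end = timestamp
        else:
            # New activity: start a new segment at this frame.
            timeline.append(Segment(label, timestamp, timestamp))
    return timeline


if __name__ == "__main__":
    # Hypothetical captions for frames sampled once per second.
    frames = [
        (0.0, "person walking"),
        (1.0, "Person walking "),
        (2.0, "coffee being poured"),
        (3.0, "person walking"),
    ]
    for seg in build_timeline(frames):
        print(f"{seg.start:.1f}s to {seg.end:.1f}s: {seg.caption}")
```

Only adjacent duplicates are merged here, so an activity that resumes later shows up as a separate segment, which keeps the timeline in chronological order.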

+ +
+