From 03ce9f884d52abcd8d29ded4989bc946ce561529 Mon Sep 17 00:00:00 2001
From: Nawfal
Date: Sat, 30 May 2026 15:53:15 +0200
Subject: [PATCH 1/3] Add Group V project card

Add Group V project card for action/event-focused captioning.
---
 index.html | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/index.html b/index.html
index 7ceb546..300f3e6 100644
--- a/index.html
+++ b/index.html
@@ -110,6 +110,32 @@

Semantic Change Maps from Everyday Walks

+ + +
+ +
+

Video understanding, vision-language models, action captioning

+

Action/Event-Focused Captioning: A Three-Model Comparison

+

+ This project explores how pretrained image-captioning models can be adapted to produce short action-focused captions for video activity timelines. Instead of generating long descriptive captions, we fine-tune BLIP, ViT-GPT2, and Microsoft GIT on COCO action captions so that the models output compact labels such as “person walking” or “coffee being poured.” +

+ For video inference, frames are sampled over time, captioned by the fine-tuned models, and de-duplicated into a simple activity timeline. The project compares original and fine-tuned models using BLEU-1, BLEU-2, METEOR, and ROUGE-L, and analyzes whether architecture choice still matters after all models are adapted to the same action-caption task. +

+ +
+
+ + + +
- - -
- -
-

Video understanding, vision-language models, action captioning

-

Action/Event-Focused Captioning: A Three-Model Comparison

-

- This project explores how pretrained image-captioning models can be adapted to produce short action-focused captions for video activity timelines. Instead of generating long descriptive captions, we fine-tune BLIP, ViT-GPT2, and Microsoft GIT on COCO action captions so that the models output compact labels such as “person walking” or “coffee being poured.” -

- For video inference, frames are sampled over time, captioned by the fine-tuned models, and de-duplicated into a simple activity timeline. The project compares original and fine-tuned models using BLEU-1, BLEU-2, METEOR, and ROUGE-L, and analyzes whether architecture choice still matters after all models are adapted to the same action-caption task. -

- -
-
- - - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- + +
+ +
+

Video understanding, vision-language models, action captioning

+

Action/Event-Focused Captioning: A Three-Model Comparison

+

+ This project explores how pretrained image-captioning models can be adapted to produce short action-focused captions for video activity timelines. Instead of generating long descriptive captions, we fine-tune BLIP, ViT-GPT2, and Microsoft GIT on COCO action captions so that the models output compact labels such as “person walking” or “coffee being poured.” +

+ For video inference, frames are sampled over time, captioned by the fine-tuned models, and de-duplicated into a simple activity timeline. The project compares original and fine-tuned models using BLEU-1, BLEU-2, METEOR, and ROUGE-L, and analyzes whether architecture choice still matters after all models are adapted to the same action-caption task. +

+ +
+
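
Reviewer note: the card text says per-frame captions are "de-duplicated into a simple activity timeline" but does not say how. A minimal sketch of one possible reading follows, assuming de-duplication means merging consecutive frames whose normalized captions match. The names `Segment` and `build_timeline` and the `(timestamp, caption)` input shape are hypothetical and not taken from the project.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One timeline entry: a caption and the time span it covers (seconds)."""
    start: float
    end: float
    caption: str


def build_timeline(frame_captions):
    """Collapse runs of identical consecutive captions into segments.

    frame_captions: iterable of (timestamp_seconds, caption) pairs, assumed
    to be sorted by timestamp. Captions are compared after lowercasing and
    collapsing whitespace (an assumed normalization, not the project's).
    """
    timeline = []
    for timestamp, caption in frame_captions:
        label = " ".join(caption.lower().split())
        if timeline and timeline[-1].caption == label:
            # Same action as the previous frame: extend the current segment.
            timeline[-1].end = timestamp
        else:
            # New action: open a new segment starting at this frame.
            timeline.append(Segment(timestamp, timestamp, label))
    return timeline


if __name__ == "__main__":
    frames = [
        (0.0, "person walking"),
        (1.0, "Person  walking"),
        (2.0, "coffee being poured"),
        (3.0, "person walking"),
    ]
    for seg in build_timeline(frames):
        print(f"{seg.start:.1f}-{seg.end:.1f}s  {seg.caption}")
```

Merging only consecutive matches keeps a repeated action that recurs later (such as the second "person walking" above) as its own segment, which is usually what a timeline needs. A global de-duplication would lose that ordering.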