<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Project ideas for the USI Computer Vision &amp; Pattern Recognition course, Spring 2026.">
<title>Computer Vision &amp; Pattern Recognition | USI</title>
<link rel="icon" href="assets/favicon.svg" type="image/svg+xml">
<link rel="stylesheet" href="styles.css">
</head>
<body>
<nav class="site-nav" aria-label="Main navigation">
<div class="brand">USI Computer Vision 2026</div>
<div class="nav-links">
<a href="#abstract">Course</a>
<a href="#projects">Projects</a>
<a href="https://github.com/Computer-Vision-2026" target="_blank" rel="noopener">GitHub</a>
<a href="https://search.usi.ch/courses/35275471/computer-vision-pattern-recognition" target="_blank" rel="noopener">Official profile</a>
</div>
</nav>
<header class="hero">
<div class="hero-content">
<h1>Computer Vision</h1>
<p class="hero-copy">
A Spring 2026 course connecting the historical foundations of vision with modern deep learning,
transformers, and hands-on project work.
</p>
</div>
</header>
<main>
<section id="abstract">
<div class="section-inner about-grid">
<div class="about-copy">
<div class="section-heading">
<h2>Abstract</h2>
</div>
<p>
Machine learning has profoundly changed computer vision, but the field's current methods build on a long
history of image formation, geometry, perception, recognition, and representation learning. This course
takes a holistic view of the task of vision.
</p>
<p>
Lectures and tutorials are accompanied by bi-weekly quizzes and project work. Assessment combines
bi-weekly quizzes, the project, and the final exam.
</p>
<div class="topic-list" aria-label="Course topics">
<span class="topic">Foundations of vision</span>
<span class="topic">Deep learning</span>
<span class="topic">Transformers</span>
</div>
</div>
<ul class="detail-list" aria-label="Course details">
<li>
<b>Instructor</b>
<span><a href="https://francisengelmann.github.io/" target="_blank" rel="noopener">Francis Engelmann</a></span>
</li>
<li>
<b>Assistant</b>
<span><a href="https://nihermann.github.io/" target="_blank" rel="noopener">Nicolai Hermann</a></span>
</li>
<li>
<b>Format</b>
<span>Lectures, tutorials, bi-weekly quizzes, project work, and final exam.</span>
</li>
<li>
<b>Bibliography</b>
<span><i>Foundations of Computer Vision</i>, Antonio Torralba, Phillip Isola, William T. Freeman, MIT Press, 2024.</span>
</li>
<li>
<b>Programs</b>
<span>MSc Artificial Intelligence, MSc Computational Science, MSc Informatics, and Faculty of Informatics PhD students.</span>
</li>
</ul>
</div>
</section>
<section class="band" id="projects">
<div class="section-inner">
<div class="section-heading">
<h2>Student Project Ideas</h2>
<p>
Project proposals live here so classmates can quickly scan possible directions, compare ideas, and submit
additions by pull request.
</p>
</div>
<div class="projects-grid">
<!-- Copy this article to add a new project idea. Keep the teaser visual and write a 60-90 word pitch. -->
<article class="project-card">
<div class="teaser" role="img" aria-label="Abstract computer vision teaser with image grid, camera frame, and detected regions.">
<span class="teaser-label">Example project idea</span>
</div>
<div class="project-content">
<p class="project-meta">Scene understanding, geometry, foundation models</p>
<h3>Semantic Change Maps from Everyday Walks</h3>
<p class="project-abstract">
Can a short phone video reveal how a campus route changes over time? This project combines monocular
depth, semantic segmentation, and feature matching to align walks recorded on different days, then
highlights moved objects, blocked paths, or new scene elements. The result would be a small visual demo
that connects 3D reasoning, human attention, and practical scene understanding for urban navigation.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Two hands playing rock-paper-scissors, but one holds a banana instead of a valid sign, illustrating anomaly detection.">
<img src="assets/group_J.png" alt="" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group J</span>
</div>
<div class="project-content">
<p class="project-meta">Self-supervised video representations, foundation models, anomaly detection</p>
<h3>Can V-JEPA read ECG?</h3>
<p class="project-abstract">
Can V-JEPA learn meaningful representations for ECG-based arrhythmia detection? In this project, we investigate whether a V-JEPA encoder, originally designed for video understanding, can capture clinically relevant patterns from ECG signals. We transform ECG recordings into video-like inputs and use the pretrained V-JEPA encoder to generate latent representations. These representations are then evaluated by training a range of downstream predictors for arrhythmia classification. By comparing performance across predictors, we assess the quality and transferability of V-JEPA features for cardiac signal analysis.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Visual product search for marketplace items using image embeddings, segmentation, and on-device retrieval.">
<img src="assets/group_B.png" alt="" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group B</span>
</div>
<div class="project-content">
<p class="project-meta">Visual search, marketplace retrieval, segmentation, Android demo</p>
<h3>Visual Search for Marketplace</h3>
<p class="project-abstract">
This project explores visual product search for marketplace applications, where users can search for visually similar
items using a single photo. I build a retrieval pipeline based on vision-language embeddings, object segmentation,
category-aware routing, and color-aware reranking to improve search quality for fashion and marketplace-style product
images. The final system is demonstrated in an Android app that performs on-device image-based retrieval over a local
product catalog.
</p>
<p>
<a href="https://github.com/siiena25/ImageSearch" target="_blank" rel="noopener noreferrer">
GitHub / Code
</a>
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Futuristic AR dashboard overlay on motorcycle FPV footage with object locking and hazard prediction highlights.">
<img src="assets/group_R.png" alt="AR Assistant Preview" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group R</span>
</div>
<div class="project-content">
<p class="project-meta">Real-time segmentation, AR visualization</p>
<h3>Aegis Rider - AR Assistant for Motorcyclists</h3>
<p class="project-abstract">
Aegis Rider is a motorcycle AR demo that simulates a smart-helmet riding experience.
It takes a first-person riding video, detects surrounding road users, estimates potential collision risks,
and overlays a HUD directly onto the video. The system also includes radar-style awareness visualization
and basic navigation features, providing riders with real-time environmental and directional information
in an intuitive AR interface.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Interface preview of SemanticSpot showing natural language search and object segmentation in a video.">
<img src="assets/group_C.png" alt="SemanticSpot Preview" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group C</span>
</div>
<div class="project-content">
<p class="project-meta">Video Search, Segmentation, Foundation Models</p>
<h3>SemanticSpot: "Ctrl+F" for Videos</h3>
<p class="project-abstract">
Ever wanted to find a specific object or action in a long video without scrubbing through it manually? SemanticSpot is an interactive web app that lets you search videos using natural language. You can simply type a query like "the person wearing a blue hat" or "the red coffee mug". Behind the scenes, the app uses CLIP to instantly locate the exact timestamp where your search appears, and SAM (Segment Anything) to dynamically track and highlight the object on the screen. It’s a smart, zero-shot visual search engine combined with automatic segmentation!
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Dashcam video with multiple colored masks for detections responding to natural language text queries such as 'a classic new york taxi', 'a work truck', and 'a pedestrian'.">
<img src="assets/group_E.png" alt="Semantic Querying of Driving Scenes" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group E</span>
</div>
<div class="project-content">
<p class="project-meta">Object segmentation, tracking, vision-language models, open-vocabulary queries</p>
<h3>Semantic Querying of Driving Scenes</h3>
<p class="project-abstract">
This project transforms a dashcam driving scene into an interface where users can type natural language queries such as "a classic new york taxi", "a work truck", and "a pedestrian" and see matching objects highlighted and tracked in real time.
<br><br>
The demo emphasizes open-vocabulary capabilities: queries can be written in natural language, multiple concepts can be explored simultaneously, and each query is visualized with a distinct color for clarity.
<br><br>
Rather than focusing purely on detection, the project also detects when a pedestrian is actively crossing the road and marks them in red. Moreover, a small detector runs in parallel to detect traffic signs, also responding to natural language queries given by the user, such as "no parking sign" or "one-way sign".
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Facial expression recognition with emotion labels and bounding boxes over a human face.">
<img src="assets/group M.png" alt="Facial Expression Recognition Preview" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group M</span>
</div>
<div class="project-content">
<p class="project-meta">Emotion recognition, deep learning, CNN-RNN</p>
<h3>Facial Expression Recognition with Hybrid Models</h3>
<p class="project-abstract">
Can a model reliably read human emotions from a single image? This project explores facial expression recognition by classifying faces into categories such as happiness, sadness, anger, and surprise. We build a small interactive demo that visualizes predicted emotions on input images.
<br><br>
To achieve this, we combine CNNs for spatial feature extraction with RNNs to capture temporal patterns, comparing pretrained models (MobileNetV2, InceptionV3) with a custom CNN-RNN trained from scratch on the FER2013 and CK+ datasets.
<br><br>
The project focuses on evaluating performance, robustness, and how transfer learning influences emotion recognition across different data conditions.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Promptable Video Event Finder with Segmentation-Guided Motion Analysis.">
<img src="assets/group_O.png" alt="Highlights Preview" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group O</span>
</div>
<div class="project-content">
<p class="project-meta">Prompted segmentation, motion tracking, event detection, video highlights</p>
<h3>Behaviour Lens</h3>
<p class="project-abstract">
What if a video highlight came with the evidence behind it? Behaviour Lens takes a video and a text prompt, uses SAM3 to segment the requested object in sampled frames, and turns those detections into trajectories, velocities, timestamps, masks, overlays, and auditable CSV outputs.
<br><br>
The goal is to combine modern segmentation models (such as SAM) with classical computer vision techniques. Segmentation serves as a strong perception layer, while event detection is driven by motion-based features such as trajectories, velocity, and frequency analysis, along with lightweight reasoning.
The system follows a modular design, consisting of a general perception and feature extraction pipeline combined with task-specific event detection modules.
<br><br>
The system is primarily designed for human action detection (e.g., waving, raising a hand, standing up). As an extension, it can also handle simple sports scenarios, such as tracking a ball moving toward or crossing a goal, demonstrating its ability to generalize to multi-object interactions.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Open-vocabulary tracking project.">
<img src="assets/group_X.png" alt="Two segmented puppies in a park" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group X</span>
</div>
<div class="project-content">
<p class="project-meta">Object detection, segmentation, tracking, vision-language models</p>
<h3>Open-Vocabulary Object Tracking with Grounding DINO, SAM 2 and CLIP</h3>
<p class="project-abstract">
We present an open-vocabulary object tracking system that enables users to search, segment, and track arbitrary objects in images and videos using natural language queries.
<br><br>
Our pipeline combines Grounding DINO for text-conditioned object detection, CLIP for semantic verification, and SAM 2 for segmentation and temporal tracking.
<br><br>
The system supports interactive querying through a Gradio web interface and demonstrates how modern vision foundation models can be integrated into a unified visual understanding pipeline.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Depth-aware augmented reality music interface showing virtual instruments placed on detected floor regions.">
<img src="assets/group_F.png" alt="Air Instrument depth-aware AR music interface preview" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group F</span>
</div>
<div class="project-content">
<p class="project-meta">Monocular depth estimation, hand tracking, augmented reality, human-computer interaction</p>
<h3>Air Instrument: Depth-Aware Virtual Music Placement</h3>
<p class="project-abstract">
Air Instrument explores how a normal webcam can turn a room into an interactive musical stage. The system first
estimates scene depth using Depth Anything V2, detects candidate floor or surface regions, and lets users place
virtual instruments into available 3D space through hand gestures. Once instruments are placed, a playing mode uses
MediaPipe hand tracking to control expressive parameters such as pitch and volume without touching any physical
device.
<br><br>
The project combines monocular depth estimation, spatial reasoning, gesture recognition, and augmented reality
rendering into a live demo. Our goal is to study how depth-aware scene understanding can support natural interaction:
where can an object be placed, how large should it appear, and how can the user control it through movement?
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Image retrieval with CLIP.">
<img src="assets/group_Q.png" alt="Image retrieval preview" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group Q</span>
</div>
<div class="project-content">
<p class="project-meta">CLIP, FAISS, patch-level matching, image retrieval</p>
<h3>Image retrieval with CLIP</h3>
<p class="project-abstract">
Got a clue? ImageDetective can find the picture. Describe what you’re looking for, upload an image, or search by visual details. ImageDetective connects text and images through foundation models, combining global semantic search with patch-level matching for smarter and more explainable retrieval.
<br><br>
We propose a vision-language retrieval system that supports bidirectional search between images and natural language. With CLIP, we map both modalities into a shared embedding space, where images can be retrieved using FAISS indexing. If accuracy needs to be improved, a further step is to use SAM3 for text-guided semantic segmentation to enable patch-level matching.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Structure from Motion with COLMAP">
<img src="assets/group_I.png" alt="" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group I</span>
</div>
<div class="project-content">
<p class="project-meta">Structure from Motion, 3D reconstruction, mobile robots</p>
<h3>One video, a 3D reconstruction of civil infrastructure</h3>
<p class="project-abstract">
Structure from Motion (SfM) is one of the most widely used techniques for reconstructing
objects and scenes from images or video. This project tests the hand-crafted
Structure-from-Motion algorithms available in COLMAP for reconstructing 3D environments
characterized by poor illumination, using frames captured with limited, linear camera
movement. The objective is to evaluate the limitations and accuracy of state-of-the-art
methods in challenging, hard-to-reconstruct environments.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Hand signs">
<img src="assets/group_A.png" alt="" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group A</span>
</div>
<div class="project-content">
<p class="project-meta">Gesture detection, tracking</p>
<h3>Real-time hand gesture detection: from rock-paper-scissors to sign interpretation</h3>
<p class="project-abstract">
This project focuses on building a computer vision system capable of recognizing a variety of hand signs and gestures captured by a static camera.
The core goal is to develop a robust gesture recognition pipeline that can distinguish between different hand configurations in real time.
As a proof-of-concept, the system will be integrated into the game of rock-paper-scissors, where it detects each player's gesture and determines the outcome of each round.
The final result will be an interactive demo showcasing accurate and responsive hand gesture recognition in a playful, real-world scenario.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Cooking action recognition">
<img src="assets/group_P.png" alt="" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group P</span>
</div>
<div class="project-content">
<p class="project-meta">Action recognition, video understanding, self-supervised embeddings</p>
<h3>From Raw Footage to Recipe: Extracting Cooking Steps from Egocentric Video</h3>
<p class="project-abstract">
This project builds a system that watches egocentric cooking videos and automatically extracts the sequence of cooking actions performed, with the goal of reconstructing a recipe from raw footage alone.
Because most frames in a cooking video are irrelevant, the pipeline first applies a relevance classifier to filter out background activity, then routes the remaining clips through an RNN-based action classifier that identifies steps such as cutting, peeling, and boiling.
Video representations are produced by V-JEPA 2, which encodes each video as a sequence of 64-frame block embeddings without requiring labeled pretraining data.
The result is an end-to-end pipeline that turns an unstructured kitchen video into a structured, step-by-step recipe.
</p>
<p>
<a href="https://github.com/PabloLandro/computer-vision" target="_blank" rel="noopener noreferrer">
GitHub / Code
</a>
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Real-time whiteboard transcription pipeline.">
<img src="assets/group_W.png" alt="Whiteboard with detected text regions and entity overlays" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group W</span>
</div>
<div class="project-content">
<p class="project-meta">Computer vision, OCR, segmentation, vision-language models, object tracking</p>
<h3>Real-Time Whiteboard Transcription with Temporal Ledger</h3>
<p class="project-abstract">
When a professor is at the board, you have two choices: pay attention or copy. You can't really do both at the same time.
<br><br>
We wanted to eliminate that trade-off. Our system transcribes in real time what the professor writes, so the student is free to just listen and understand.
<br><br>
The pipeline captures the full evolution of whiteboard content across a lecture, every correction and erasure included, and synthesises it into structured Markdown output.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="AI image captioning system turning video frames into short action labels.">
<img src="assets/group_V.png" alt="Group V image captioning preview" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group V</span>
</div>
<div class="project-content">
<p class="project-meta">Video understanding, vision-language models, action captioning</p>
<h3>Action/Event-Focused Captioning: A Three-Model Comparison</h3>
<p class="project-abstract">
This project explores how pretrained image-captioning models can be adapted to produce short action-focused captions for video activity timelines. Instead of generating long descriptive captions, we fine-tune BLIP, ViT-GPT2, and Microsoft GIT on COCO action captions so that the models output compact labels such as “person walking” or “coffee being poured.”
<br><br>
For video inference, frames are sampled over time, captioned by the fine-tuned models, and de-duplicated into a simple activity timeline. The project compares original and fine-tuned models using BLEU-1, BLEU-2, METEOR, and ROUGE-L, and analyzes whether architecture choice still matters after all models are adapted to the same action-caption task.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card add-project-card">
<a href="https://github.com/Computer-Vision-2026/Computer-Vision-2026.github.io/edit/main/index.html" target="_blank" rel="noopener">
<span class="add-project-icon" aria-hidden="true">+</span>
<strong>Add Your Own</strong>
<span>Open GitHub, copy the example card, and submit your project pitch as a pull request.</span>
</a>
</article>
</div>
</div>
</section>
</main>
<footer>
<div class="footer-inner">
<span>Computer Vision &amp; Pattern Recognition, USI, Spring 2026</span>
<a href="https://github.com/Computer-Vision-2026" target="_blank" rel="noopener">GitHub organization</a>
<a href="https://search.usi.ch/courses/35275471/computer-vision-pattern-recognition" target="_blank" rel="noopener">Official course profile</a>
</div>
</footer>
</body>
</html>