<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Project ideas for the USI Computer Vision & Pattern Recognition course, Spring 2026.">
<title>Computer Vision & Pattern Recognition | USI</title>
<link rel="icon" href="assets/favicon.svg" type="image/svg+xml">
<link rel="stylesheet" href="styles.css">
</head>
<body>
<nav class="site-nav" aria-label="Main navigation">
<div class="brand">USI Computer Vision 2026</div>
<div class="nav-links">
<a href="#abstract">Course</a>
<a href="#projects">Projects</a>
<a href="https://github.com/Computer-Vision-2026" target="_blank" rel="noopener">GitHub</a>
<a href="https://search.usi.ch/courses/35275471/computer-vision-pattern-recognition" target="_blank" rel="noopener">Official profile</a>
</div>
</nav>
<header class="hero">
<div class="hero-content">
<h1>Computer Vision</h1>
<p class="hero-copy">
A Spring 2026 course connecting the historical foundations of vision with modern deep learning,
transformers, and hands-on project work.
</p>
</div>
</header>
<main>
<section id="abstract">
<div class="section-inner about-grid">
<div class="about-copy">
<div class="section-heading">
<h2>Abstract</h2>
</div>
<p>
Machine learning has profoundly changed computer vision, but the field's current methods build on a long
history of image formation, geometry, perception, recognition, and representation learning. This course
takes a holistic view of the task of vision.
</p>
<p>
Lectures and tutorials are accompanied by bi-weekly quizzes and project work. Assessment combines
the quizzes, the project, and the final exam.
</p>
<div class="topic-list" aria-label="Course topics">
<span class="topic">Foundations of vision</span>
<span class="topic">Deep learning</span>
<span class="topic">Transformers</span>
</div>
</div>
<ul class="detail-list" aria-label="Course details">
<li>
<b>Instructor</b>
<span><a href="https://francisengelmann.github.io/" target="_blank" rel="noopener">Francis Engelmann</a></span>
</li>
<li>
<b>Assistant</b>
<span><a href="https://nihermann.github.io/" target="_blank" rel="noopener">Nicolai Hermann</a></span>
</li>
<li>
<b>Format</b>
<span>Lectures, tutorials, bi-weekly quizzes, project work, and final exam.</span>
</li>
<li>
<b>Bibliography</b>
<span><i>Foundations of Computer Vision</i>, Antonio Torralba, Phillip Isola, William T. Freeman, MIT Press, 2024.</span>
</li>
<li>
<b>Programs</b>
<span>MSc Artificial Intelligence, MSc Computational Science, MSc Informatics, and Faculty of Informatics PhD students.</span>
</li>
</ul>
</div>
</section>
<section class="band" id="projects">
<div class="section-inner">
<div class="section-heading">
<h2>Student Project Ideas</h2>
<p>
Project proposals live here so classmates can quickly scan possible directions, compare ideas, and submit
additions by pull request.
</p>
</div>
<div class="projects-grid">
<!-- Copy this article to add a new project idea. Keep the teaser visual and write a 60-90 word pitch. -->
<article class="project-card">
<div class="teaser" role="img" aria-label="Abstract computer vision teaser with image grid, camera frame, and detected regions.">
<span class="teaser-label">Example project idea</span>
</div>
<div class="project-content">
<p class="project-meta">Scene understanding, geometry, foundation models</p>
<h3>Semantic Change Maps from Everyday Walks</h3>
<p class="project-abstract">
Can a short phone video reveal how a campus route changes over time? This project combines monocular
depth, semantic segmentation, and feature matching to align walks recorded on different days, then
highlights moved objects, blocked paths, or new scene elements. The result would be a small visual demo
that connects 3D reasoning, human attention, and practical scene understanding for urban navigation.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Two hands playing rock-paper-scissors, but one holds a banana instead of a valid sign, illustrating anomaly detection.">
<img src="assets/group_J.png" alt="" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group J</span>
</div>
<div class="project-content">
<p class="project-meta">Self-supervised video representations, foundation models, anomaly detection</p>
<h3>Probing V-JEPA 2: What Does a Video Model Actually See?</h3>
<p class="project-abstract">
V-JEPA 2 is Meta’s self-supervised video encoder, trained without labels to predict masked
spatio-temporal regions. We want to open up its latent space and understand how it reacts to the
visual world — and, more interestingly, to things that don’t belong in it. Starting from a frozen
pretrained encoder, we build an interactive demo that embeds short clips and surfaces structure,
similarity, and drift over time. On top of this, we explore anomaly detection as a concrete
application: can the embedding space tell a banana from a pair of scissors in a rock-paper-scissors
game, a boat on a highway, or an abnormal beat in an ECG recording?
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Teaser image for the Group B visual search project.">
<img src="assets/group_B.png" alt="" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group B</span>
</div>
<div class="project-content">
<p class="project-meta">Vision-language embeddings, segmentation, on-device retrieval</p>
<h3>One Photo, Many Aisles: Visual Search Across a Heterogeneous Marketplace</h3>
<p class="project-abstract">
A single foundation model rarely knows a sofa, a smartphone and a floral dress equally well — yet a real
marketplace catalog mixes all of them on the same shelf. This project studies where general-purpose visual
encoders like CLIP and SigLIP 2 stop being enough for e-commerce retrieval, and what has to be rebuilt
around them when the catalog is not one domain but twenty. Each query is first stripped of its context —
mannequins, human models, living-room scenes, studio gradients — then routed to a category-specific
expert whose embeddings are reranked with the fine-grained color and texture cues that generic models
quietly discard. The whole pipeline is then compressed into an on-device Android demo, raising a second
question the paper versions of Google Lens rarely address in the open: how much of a foundation-model
retrieval system actually survives when it has to run on a phone?
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Futuristic AR dashboard overlay on motorcycle FPV footage with object locking and hazard prediction highlights.">
<img src="assets/group_R.png" alt="Jarvis AR Assistant Preview" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group R</span>
</div>
<div class="project-content">
<p class="project-meta">Real-time segmentation, predictive architectures, AR visualization</p>
<h3>AI-Powered "Jarvis" AR Assistant for Motorcyclists</h3>
<p class="project-abstract">
Can we transform a standard motorcycle commute into a safer, futuristic experience? This project leverages
first-person view (FPV) footage to build an intelligent AR dashboard. By integrating SAM 3 for precise
real-time object "locking" and V-JEPA 2 to predict potential road hazards,
the system simulates a smart helmet interface showcasing 3D spatial awareness.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Interface preview of SemanticSpot showing natural language search and object segmentation in a video.">
<img src="assets/group_C.png" alt="SemanticSpot Preview" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group C</span>
</div>
<div class="project-content">
<p class="project-meta">Video Search, Segmentation, Foundation Models</p>
<h3>SemanticSpot: "Ctrl+F" for Videos</h3>
<p class="project-abstract">
Ever wanted to find a specific object or action in a long video without scrubbing through it manually? SemanticSpot is an interactive web app that lets you search videos using natural language. You can simply type a query like "the person wearing a blue hat" or "the red coffee mug". Behind the scenes, the app uses CLIP to instantly locate the exact timestamp where your search appears, and SAM (Segment Anything) to dynamically track and highlight the object on the screen. It’s a smart, zero-shot visual search engine combined with automatic segmentation!
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Real-time video stream with multiple colored bounding boxes responding to dynamic text queries such as 'car', 'truck', and 'pedestrian'.">
<img src="assets/group_E.png" alt="Interactive Video Query System Preview" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group E</span>
</div>
<div class="project-content">
<p class="project-meta">Object detection, tracking, vision-language models, real-time interaction</p>
<h3>Interactive Real-Time Semantic Querying of Driving Scenes</h3>
<p class="project-abstract">
What if you could search inside a video as it plays? This project transforms a driving scene into an interactive interface where users can type natural language queries such as "car", "truck", or "pedestrian" and instantly see matching objects highlighted and tracked in real time.
<br><br>
The demo emphasizes fluid interaction: queries can be added or removed on the fly without interrupting the video, multiple concepts can be explored simultaneously, and each query is visualized with distinct colors for clarity. This allows users to dynamically “interrogate” the scene and observe how the system adapts immediately to new inputs.
<br><br>
Rather than focusing purely on detection, the project showcases a new way of interacting with visual data, turning passive video into an active, query-driven exploration tool.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Facial expression recognition with emotion labels and bounding boxes over a human face.">
<img src="assets/group%20M.png" alt="Facial Expression Recognition Preview" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group M</span>
</div>
<div class="project-content">
<p class="project-meta">Emotion recognition, deep learning, CNN-RNN</p>
<h3>Facial Expression Recognition with Hybrid Models</h3>
<p class="project-abstract">
Can a model reliably read human emotions from a single image? This project explores facial expression recognition by classifying faces into categories such as happiness, sadness, anger, and surprise. We build a small interactive demo that visualizes predicted emotions on input images.
<br><br>
To achieve this, we combine CNNs for spatial feature extraction with RNNs to capture temporal patterns, comparing pretrained models (MobileNetV2, InceptionV3) with a custom CNN-RNN trained from scratch on the FER2013 and CK+ datasets.
<br><br>
The project focuses on evaluating performance, robustness, and how transfer learning influences emotion recognition across different data conditions.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Promptable Video Event Finder with Segmentation-Guided Motion Analysis.">
<img src="assets/group_O.png" alt="Highlights Preview" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group O</span>
</div>
<div class="project-content">
<p class="project-meta">Segmentation, Object detection, tracking, foundation models</p>
<h3>Smart Event Detection for Highlight Clips</h3>
<p class="project-abstract">
Have you ever missed a highlight during a match? This system extracts highlights from a video, either guided by a user prompt or automatically.
It uses recent segmentation models, such as Meta’s SAM 3, to track objects, detect events, and generate short highlight clips.
<br><br>
The goal is to combine modern segmentation models (such as SAM) with classical computer vision techniques. Segmentation serves as a strong perception layer, while event detection is driven by motion-based features such as trajectories, velocity, and frequency analysis, along with lightweight reasoning.
The system follows a modular design, consisting of a general perception and feature extraction pipeline combined with task-specific event detection modules.
<br><br>
The system is primarily designed for human action detection (e.g., waving, raising a hand, standing up). As an extension, it can also handle simple sports scenarios, such as tracking a ball moving toward or crossing a goal, demonstrating its ability to generalize to multi-object interactions.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Open-vocabulary tracking project.">
<img src="assets/group_X.png" alt="Two segmented puppies in a park" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group X</span>
</div>
<div class="project-content">
<p class="project-meta">Object detection, segmentation, tracking, vision-language models</p>
<h3>Open-Vocabulary Object Tracking with Grounding DINO, SAM 2 and CLIP</h3>
<p class="project-abstract">
We present an open-vocabulary object tracking system that enables users to search, segment, and track arbitrary objects in images and videos using natural language queries.
<br><br>
Our pipeline combines Grounding DINO for text-conditioned object detection, CLIP for semantic verification, and SAM 2 for segmentation and temporal tracking.
<br><br>
The system supports interactive querying through a Gradio web interface and demonstrates how modern vision foundation models can be integrated into a unified visual understanding pipeline.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Image retrieval with CLIP.">
<img src="assets/group_Q.png" alt="Image retrieval preview" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group Q</span>
</div>
<div class="project-content">
<p class="project-meta">CLIP, FAISS, patch-level matching, image retrieval</p>
<h3>Image retrieval with CLIP</h3>
<p class="project-abstract">
Got a clue? ImageDetective can find the picture. Describe what you’re looking for, upload an image, or search by visual details. ImageDetective connects text and images through foundation models, combining global semantic search with patch-level matching for smarter and more explainable retrieval.
<br><br>
We propose a vision-language retrieval system that supports bidirectional search between images and natural language. CLIP maps both modalities into a shared embedding space, and a FAISS index retrieves the closest images. If accuracy needs to improve, a further step is to use SAM 3 for text-guided semantic segmentation, which enables patch-level matching.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="SfM with COLMAP">
<img src="assets/group_I.png" alt="" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group I</span>
</div>
<div class="project-content">
<p class="project-meta">Structure from Motion, 3D reconstruction, mobile robots</p>
<h3>One video, a 3D reconstruction of civil infrastructure</h3>
<p class="project-abstract">
Structure from Motion (SfM) is one of the most widely used techniques for reconstructing
objects and scenes from images or video. This project tests the hand-crafted
SfM algorithms available in COLMAP on 3D environments
characterized by poor illumination, using frames captured along a limited, linear camera
trajectory. The objective is to evaluate the accuracy and limitations of state-of-the-art
methods in challenging, hard-to-reconstruct environments.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Hand signs and gestures">
<img src="assets/group_A.png" alt="" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group A</span>
</div>
<div class="project-content">
<p class="project-meta">Gesture detection, tracking</p>
<h3>Real-time hand gesture detection: from rock-paper-scissors to sign interpretation</h3>
<p class="project-abstract">
This project focuses on building a computer vision system capable of recognizing a variety of hand signs and gestures captured by a static camera.
The core goal is to develop a robust gesture recognition pipeline that can distinguish between different hand configurations in real time.
As a proof-of-concept, the system will be integrated into the game of rock-paper-scissors, where it detects each player's gesture and determines the outcome of each round.
The final result will be an interactive demo showcasing accurate and responsive hand gesture recognition in a playful, real-world scenario.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card">
<div class="teaser" role="img" aria-label="Cooking action recognition">
<img src="assets/group_P.png" alt="" style="position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:2;">
<span class="teaser-label" style="z-index:3;">Group P</span>
</div>
<div class="project-content">
<p class="project-meta">Action recognition, video understanding, self-supervised embeddings</p>
<h3>From Raw Footage to Recipe: Extracting Cooking Steps from Egocentric Video</h3>
<p class="project-abstract">
This project builds a system that watches egocentric cooking videos and automatically extracts the sequence of cooking actions performed, with the goal of reconstructing a recipe from raw footage alone.
Because most frames in a cooking video are irrelevant, the pipeline first applies a relevance classifier to filter out background activity, then routes the remaining clips through an RNN-based action classifier that identifies steps such as cutting, peeling, and boiling.
Video representations are produced by V-JEPA 2, which encodes each video as a sequence of 64-frame block embeddings without requiring labeled pretraining data.
The result is an end-to-end pipeline that turns an unstructured kitchen video into a structured, step-by-step recipe.
</p>
<label class="project-toggle-label">
<input class="project-toggle" type="checkbox" aria-label="Toggle full project pitch">
<span class="project-toggle-more">Read more</span>
<span class="project-toggle-less">Show less</span>
</label>
</div>
</article>
<article class="project-card add-project-card">
<a href="https://github.com/Computer-Vision-2026/Computer-Vision-2026.github.io/edit/main/index.html" target="_blank" rel="noopener">
<span class="add-project-icon" aria-hidden="true">+</span>
<strong>Add Your Own</strong>
<span>Open GitHub, copy the example card, and submit your project pitch as a pull request.</span>
</a>
</article>
</div>
</div>
</section>
</main>
<footer>
<div class="footer-inner">
<span>Computer Vision & Pattern Recognition, USI, Spring 2026</span>
<a href="https://github.com/Computer-Vision-2026" target="_blank" rel="noopener">GitHub organization</a>
<a href="https://search.usi.ch/courses/35275471/computer-vision-pattern-recognition" target="_blank" rel="noopener">Official course profile</a>
</div>
</footer>
</body>
</html>