A deep learning-based multi-class video classification system built end-to-end to understand both spatial and temporal patterns in video data.
This project was developed as Project-1 (Industry-Sponsored by NVIDIA) and trained on NVIDIA GPU servers.
This system classifies videos into four content categories:
- Animation
- Gaming
- Natural Content
- Flat Content
Unlike image classification, video understanding requires modeling motion, temporal dependencies, and long-range context.
To address this, we designed a two-stage architecture combining CNNs, Bi-LSTMs, and self-attention.
- ~93% classification accuracy on the test set (~95% with Test-Time Augmentation)
- End-to-end system built from scratch (no plug-and-play repositories)
- CNN + Bi-LSTM + Multi-Head Attention architecture
- Model ensembling and Test-Time Augmentation (TTA) for robustness
- Trained on NVIDIA A100 GPU (MIG partition)
- Pretrained CNN backbones:
  - ResNet-50 / ResNet-101
  - EfficientNet-V2-S / EfficientNet-V2-M
- Extracts frame-level semantic features
- Feature dimension: 1280 (the pooled output width of the EfficientNet-V2-S/M backbones)
- Input projection with Layer Normalization
- 4-layer Bidirectional LSTM for temporal sequence learning
- Multi-Head Self-Attention to focus on informative video segments
- Attention pooling for sequence aggregation
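The temporal stage listed above (input projection with Layer Normalization, a 4-layer Bi-LSTM, multi-head self-attention, and attention pooling) could look roughly like the sketch below. The class name `TemporalClassifier`, the hidden size, the number of heads, and the dropout rate are all assumptions chosen for illustration; only the layer types, the 4 LSTM layers, the 1280-d input, and the 4 output classes come from the description.

```python
import torch
import torch.nn as nn


class TemporalClassifier(nn.Module):
    """Sketch: projection + LayerNorm -> 4-layer Bi-LSTM -> self-attention -> attention pooling."""

    def __init__(self, feat_dim=1280, hidden=256, num_heads=4,
                 num_classes=4, dropout=0.3):
        super().__init__()
        # Input projection followed by Layer Normalization.
        self.proj = nn.Sequential(nn.Linear(feat_dim, hidden), nn.LayerNorm(hidden))
        # Bidirectional LSTM: each direction gets hidden // 2 units,
        # so the concatenated output is `hidden` wide.
        self.lstm = nn.LSTM(hidden, hidden // 2, num_layers=4, batch_first=True,
                            bidirectional=True, dropout=dropout)
        # Multi-head self-attention over the LSTM outputs.
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        # Attention pooling: one learned score per time step.
        self.pool_score = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) pre-extracted frame features
        h = self.proj(x)
        h, _ = self.lstm(h)                       # (batch, time, hidden)
        h, _ = self.attn(h, h, h)                 # query = key = value
        weights = torch.softmax(self.pool_score(h), dim=1)  # (batch, time, 1)
        pooled = (weights * h).sum(dim=1)         # weighted sum over time
        return self.head(pooled)                  # (batch, num_classes) logits
```

For a batch of two videos with 16 sampled frames each, `TemporalClassifier()(torch.randn(2, 16, 1280))` returns a `(2, 4)` tensor of class logits.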
- Source: YouTube-8M
- Total videos: ~4,000
- Categories: 4 main classes with 46 subcategories
- Split:
  - Train: 70%
  - Validation: 20%
  - Test: 10%
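One way to produce a 70/20/10 split like the one above is sketched here. Whether the original split was stratified by class is not stated, so the per-class stratification, the function name `stratified_split`, and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split sample indices into train/val/test lists, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * ratios[0])
        n_val = round(len(indices) * ratios[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        # Whatever remains after train and validation goes to test.
        test += indices[n_train + n_val:]
    return train, val, test
```

With 4,000 videos spread evenly over the four classes, this yields 2,800 training, 800 validation, and 400 test indices.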
- Framework: PyTorch
- Loss Function: Focal Loss with Label Smoothing
- Optimizer: AdamW
- Learning Rate Scheduler: Cosine Annealing with Warm Restarts
- Regularization:
  - Dropout
  - Gradient Clipping
  - Weight Decay
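The training configuration above can be sketched in PyTorch as follows. Focal loss with label smoothing has more than one formulation; the version here applies the focal factor to a smoothed target distribution and is only one plausible choice, not necessarily the one used in the project. The class name `FocalLossWithSmoothing`, the stand-in linear model, and every hyperparameter value (`gamma`, `smoothing`, learning rate, weight decay, `T_0`, `T_mult`, `max_norm`) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalLossWithSmoothing(nn.Module):
    """Focal loss computed against label-smoothed targets (one possible formulation)."""

    def __init__(self, gamma: float = 2.0, smoothing: float = 0.1):
        super().__init__()
        self.gamma = gamma
        self.smoothing = smoothing

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        num_classes = logits.size(-1)
        log_p = F.log_softmax(logits, dim=-1)
        p = log_p.exp()
        # Smoothed targets: 1 - smoothing on the true class,
        # the rest spread evenly over the other classes.
        q = torch.full_like(log_p, self.smoothing / (num_classes - 1))
        q.scatter_(1, target.unsqueeze(1), 1.0 - self.smoothing)
        # The focal factor (1 - p)^gamma down-weights confident predictions.
        loss = -(q * (1.0 - p) ** self.gamma * log_p).sum(dim=-1)
        return loss.mean()


# Stand-in model so the sketch runs on its own; the real model is the temporal classifier.
model = nn.Linear(1280, 4)
criterion = FocalLossWithSmoothing()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2)


def train_step(features: torch.Tensor, labels: torch.Tensor) -> float:
    """Run one optimization step with gradient clipping and return the loss."""
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```

In a full loop, `scheduler.step()` would be called once per epoch after the training steps. With `gamma=0` and `smoothing=0`, the loss reduces to standard cross-entropy, which is a convenient sanity check.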
- Test Accuracy: ~93% (95% with TTA)
- Balanced class-wise F1 scores (>95% with ensemble + TTA)
- Robust performance across visually diverse categories
- Backend: Flask
- Frontend: HTML, CSS (Tailwind), JavaScript
- Inference supports:
  - Single-video classification
  - Ensemble inference
  - Optional Test-Time Augmentation
- Fast inference using pre-extracted features
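Ensemble inference with optional Test-Time Augmentation can be sketched as averaging class probabilities over every model and every augmented view. How the project generates its augmented views is not described, so the function name `predict` and the idea of passing a list of precomputed feature views are assumptions; with a single model and a single view, the same function covers plain single-video classification.

```python
import torch


@torch.no_grad()
def predict(models, feature_views):
    """Average softmax probabilities over all models and all views.

    models: list of nn.Module mapping (batch, time, dim) features to (batch, classes) logits.
    feature_views: list of (batch, time, dim) tensors, one per augmented view
                   (a single-element list disables TTA).
    """
    probs = []
    for model in models:
        model.eval()
        for view in feature_views:
            probs.append(torch.softmax(model(view), dim=-1))
    mean_probs = torch.stack(probs).mean(dim=0)   # (batch, classes)
    return mean_probs.argmax(dim=-1), mean_probs
```

Averaging probabilities rather than logits keeps each model's contribution on the same scale, which matters when ensemble members are calibrated differently.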
- GPU: NVIDIA A100 (MIG – 9.8 GB VRAM)
- CPU: Intel Xeon Gold
- RAM: 251 GB
- PyTorch, Torchvision
- OpenCV
- NumPy, Pandas
- CUDA 11.8, cuDNN
- Manas Kulkarni
- Samiksha Nalawade
- Rajlakshmi Desai
Faculty Guide: Dr. Shripad Bhatlawande
This project demonstrates an end-to-end video classification pipeline that covers frame-level feature extraction, temporal sequence modeling, and deployment as a Flask-based inference service. It highlights the challenges of moving from image classification to video understanding (motion, temporal dependencies, and long-range context) and the architectural choices used to address them.