An industry-sponsored Deep Learning project by NVIDIA focused on multi-class video classification, built end-to-end as part of an academic project.

itsmanask/NVIDIA-Video-Classification-Project

🎥 NVIDIA Video Classification Project

A deep learning–based multi-class video classification system built end-to-end to understand both spatial and temporal patterns in video data.
This project was developed as Project-1 (Industry-Sponsored by NVIDIA) and trained on NVIDIA GPU servers.


📌 Overview

This system classifies videos into four content categories:

  • Animation
  • Gaming
  • Natural Content
  • Flat Content

Unlike image classification, video understanding requires modeling motion, temporal dependencies, and long-range context.
To address this, we designed a two-stage architecture combining CNNs, Bi-LSTMs, and self-attention.


🧠 Key Highlights

  • ~93% classification accuracy on the test set (~95% with Test-Time Augmentation)
  • End-to-end system built from scratch (no plug-and-play repositories)
  • CNN + Bi-LSTM + Multi-Head Attention architecture
  • Model ensembling and Test-Time Augmentation (TTA) for robustness
  • Trained on NVIDIA A100 GPU (MIG partition)

🏗️ System Architecture

Stage 1: Spatial Feature Extraction

  • Pretrained CNN backbones:
    • ResNet-50 / ResNet-101
    • EfficientNet-V2-S / EfficientNet-V2-M
  • Extracts frame-level semantic features
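  • Feature dimension: 1280 (the pooled output width of EfficientNet-V2-S/M; ResNet-50/101 natively emit 2048-dimensional pooled features)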

Stage 2: Temporal Modeling

  • Input projection with Layer Normalization
  • 4-layer Bidirectional LSTM for temporal sequence learning
  • Multi-Head Self-Attention to focus on informative video segments
  • Attention pooling for sequence aggregation
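
The four components above could be wired together roughly as follows. This is a minimal sketch, not the project's model: the class name `TemporalClassifier`, the hidden size of 256, the 4 attention heads, and the dropout rate are all assumed values chosen for illustration.

```python
import torch
import torch.nn as nn


class TemporalClassifier(nn.Module):
    """Projection + LayerNorm -> 4-layer Bi-LSTM -> self-attention -> attention pooling."""

    def __init__(self, feat_dim=1280, hidden=256, num_heads=4,
                 num_classes=4, dropout=0.3):
        super().__init__()
        # Input projection with Layer Normalization
        self.proj = nn.Sequential(nn.Linear(feat_dim, hidden), nn.LayerNorm(hidden))
        # Bidirectional LSTM; each direction uses hidden // 2 units so the
        # concatenated output is again `hidden` wide
        self.lstm = nn.LSTM(hidden, hidden // 2, num_layers=4, batch_first=True,
                            bidirectional=True, dropout=dropout)
        # Multi-head self-attention over the time axis
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        # Attention pooling: one learned score per time step
        self.pool_score = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.proj(x)                                    # (B, T, hidden)
        h, _ = self.lstm(h)                                 # (B, T, hidden)
        a, _ = self.attn(h, h, h)                           # (B, T, hidden)
        weights = torch.softmax(self.pool_score(a), dim=1)  # (B, T, 1)
        pooled = (weights * a).sum(dim=1)                   # (B, hidden)
        return self.head(pooled)                            # (B, num_classes)


model = TemporalClassifier()
clip_features = torch.randn(2, 16, 1280)  # 2 videos, 16 frames each
logits = model(clip_features)
print(logits.shape)
```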

📊 Dataset

  • Source: YouTube-8M
  • Total videos: ~4,000
  • Categories: 4 main classes with 46 subcategories
  • Split:
    • Train: 70%
    • Validation: 20%
    • Test: 10%
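
A 70/20/10 split can be produced per class so that each subset keeps the same class balance. The sketch below assumes a stratified split and uses a synthetic, perfectly balanced label list; the README does not state whether the real split was stratified, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split sample indices into train/val/test while keeping class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Synthetic labels: 4,000 videos spread evenly over 4 classes (illustrative only)
labels = [i % 4 for i in range(4000)]
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 2800 800 400
```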

⚙️ Training & Optimization

  • Framework: PyTorch
  • Loss Function: Focal Loss with Label Smoothing
  • Optimizer: AdamW
  • Learning Rate Scheduler: Cosine Annealing with Warm Restarts
  • Regularization:
    • Dropout
    • Gradient Clipping
    • Weight Decay
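
One way these pieces fit together in PyTorch is sketched below. The `FocalLoss` class is one common formulation (cross-entropy with label smoothing, scaled by a focal term), not necessarily the one used in this project, and every hyperparameter (`gamma`, smoothing factor, learning rate, weight decay, `T_0`, `T_mult`, clipping norm) is an assumed placeholder. A single linear layer stands in for the real model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalLoss(nn.Module):
    """Label-smoothed cross-entropy scaled by (1 - p_t)^gamma."""

    def __init__(self, gamma: float = 2.0, label_smoothing: float = 0.1):
        super().__init__()
        self.gamma = gamma
        self.label_smoothing = label_smoothing

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        ce = F.cross_entropy(logits, targets, reduction="none",
                             label_smoothing=self.label_smoothing)
        # Probability the model assigns to the true class
        pt = F.softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
        # Down-weight examples that are already classified confidently
        return ((1.0 - pt) ** self.gamma * ce).mean()


torch.manual_seed(0)
model = nn.Linear(1280, 4)  # stand-in for the temporal model
criterion = FocalLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2)

features = torch.randn(16, 1280)       # synthetic batch
targets = torch.randint(0, 4, (16,))

for epoch in range(3):
    optimizer.zero_grad()
    loss = criterion(model(features), targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()  # cosine schedule, restarting every T_0 * T_mult^k epochs
```

With `gamma=0` and `label_smoothing=0` this loss reduces to plain cross-entropy, which is a convenient sanity check.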

🧪 Performance

  • Test Accuracy: ~93% (95% with TTA)
  • Balanced class-wise F1 scores (>95% with ensemble + TTA)
  • Robust performance across visually diverse categories
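
For reference, class-wise F1 can be computed from true and predicted labels as in the sketch below. The label lists are made-up toy data used only to show the calculation, not results from this project, and `per_class_f1` is a hypothetical helper.

```python
def per_class_f1(y_true, y_pred, num_classes):
    """Return one F1 score per class: 2*TP / (2*TP + FP + FN)."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Toy labels for the 4 classes (0..3); one class-1 video is predicted as class 2
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 3]

f1 = per_class_f1(y_true, y_pred, num_classes=4)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print([round(s, 3) for s in f1])  # [1.0, 0.667, 0.8, 1.0]
print(accuracy)                   # 0.875
```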

🚀 Deployment

  • Backend: Flask
  • Frontend: HTML, CSS (Tailwind), JavaScript
  • Inference supports:
    • Single-video classification
    • Ensemble inference
    • Optional Test-Time Augmentation
  • Fast inference using pre-extracted features
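
Ensemble inference with TTA can be approximated by averaging softmax probabilities over every (model, augmented view) pair, as sketched below. The function `predict_ensemble_tta`, the tiny stand-in models, and the choice of a temporal flip as the augmentation are assumptions for illustration; the README does not specify which augmentations or ensemble members the project uses.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def predict_ensemble_tta(models, views):
    """Average class probabilities over all models and all augmented views.

    models: list of modules mapping (B, T, D) -> (B, num_classes)
    views:  list of tensors, each an augmented copy of the same batch
    """
    probs = []
    for model in models:
        model.eval()
        for view in views:
            probs.append(torch.softmax(model(view), dim=1))
    return torch.stack(probs).mean(dim=0)  # (B, num_classes)


torch.manual_seed(0)
T, D, num_classes = 16, 1280, 4

# Two small stand-in classifiers in place of the trained ensemble members
ensemble = [nn.Sequential(nn.Flatten(), nn.Linear(T * D, num_classes))
            for _ in range(2)]

batch = torch.randn(2, T, D)                   # pre-extracted features, 2 videos
views = [batch, torch.flip(batch, dims=[1])]   # original + time-reversed view

probs = predict_ensemble_tta(ensemble, views)
predicted = probs.argmax(dim=1)
print(probs.shape, predicted.shape)
```

Because the inputs are pre-extracted features, each extra view or model adds only a pass through the lightweight temporal network rather than the CNN backbone.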

🖥️ Hardware & Software

Hardware

  • GPU: NVIDIA A100 (MIG – 9.8 GB VRAM)
  • CPU: Intel Xeon Gold
  • RAM: 251 GB

Software

  • PyTorch, Torchvision
  • OpenCV
  • NumPy, Pandas
  • CUDA 11.8, cuDNN

👥 Team

  • Manas Kulkarni
  • Samiksha Nalawade
  • Rajlakshmi Desai

Faculty Guide: Dr. Shripad Bhatlawande


🎯 Conclusion

This project demonstrates an end-to-end video classification pipeline that combines pretrained CNN feature extraction, Bi-LSTM and attention-based temporal modeling, and a Flask-based deployment. It highlights the challenges of moving from image understanding to video understanding (motion, temporal dependencies, and long-range context) and the solutions used to address them.
