WYT-Net (Wavelet-YOLO-Transformer Network) is a lightweight, frequency-aware deep learning architecture designed for real-time deepfake detection on edge devices.
Unlike heavy Vision Transformers or deep CNN-based forensic models, WYT-Net:
- Integrates Discrete Wavelet Transform (DWT) for frequency-domain artifact detection.
- Uses SimAM parameter-free attention.
- Incorporates FastViT-inspired hybrid CNN-Transformer blocks.
- Maintains a highly efficient footprint of < 2.6 Million parameters.
- Runs efficiently on edge hardware like Raspberry Pi 4 and NVIDIA Jetson Nano.
The framework achieves an optimal balance between accuracy, efficiency, and deployability.
The rapid rise of high-fidelity deepfake media threatens digital integrity, journalism, biometric systems, and public trust. Yet most existing forensic detectors share several limitations:
- Heavy Models: Architectures like ResNet, ViT, and EfficientNet are computationally expensive.
- Hardware Dependency: High reliance on server-grade GPUs for inference.
- Edge Unfriendly: Poor feasibility for deployment on mobile or IoT devices.
- Spatial Bias: Limited frequency-domain awareness allows subtle generative artifacts to go unnoticed.
GANs and diffusion models leave high-frequency artifacts that spatial-domain CNNs struggle to capture. WYT-Net addresses this by combining wavelet-domain feature engineering with a lightweight hybrid architecture.
- ✅ Frequency-Domain Artifact Detection: Expose generative artifacts via wavelet decomposition.
- ✅ Architectural Efficiency: Design a hybrid CNN-Transformer backbone under 2.6M parameters.
- ✅ High Accuracy: Surpass 94% classification accuracy on challenging datasets.
- ✅ Edge Deployment: Guarantee real-time performance on edge devices.
- ✅ Robust Generalization: Ensure resilience against real-world artifacts using comprehensive data augmentation.
Instead of operating on standard RGB input, WYT-Net decomposes each image with a Daubechies-2 (db2) wavelet transform into four sub-bands:
| Channel | Description |
|---|---|
| LL | Approximation (Global Spatial Features) |
| LH | Horizontal Details (High-Frequency) |
| HL | Vertical Details (High-Frequency) |
| HH | Diagonal Details (High-Frequency) |
This explicit separation exposes forgery-induced noise patterns that remain hidden in the raw RGB representation.
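As a sketch of this preprocessing step (assuming the PyWavelets package; the exact channel layout used by the repo's `extract_faces.py` may differ), a single-level db2 decomposition of a grayscale face crop looks like:

```python
import numpy as np
import pywt  # PyWavelets

def dwt_channels(gray: np.ndarray) -> np.ndarray:
    """Single-level db2 decomposition into a 4-channel tensor (LL, LH, HL, HH)."""
    LL, (LH, HL, HH) = pywt.dwt2(gray, "db2")
    # Stack the approximation band with the three high-frequency detail bands.
    return np.stack([LL, LH, HL, HH], axis=0)

# A 64x64 face crop becomes a 4-channel input at roughly half resolution.
face = np.random.rand(64, 64).astype(np.float32)
channels = dwt_channels(face)
print(channels.shape)
```

Feeding the stacked sub-bands to the network (rather than RGB) is what makes the high-frequency artifacts an explicit part of the input representation.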
Modified from the YOLOv8n backbone using architectural "surgery":
- Split Backbone ➡️ Multi-scale feature extraction.
- SimAM Attention ➡️ Parameter-free 3D attention refinement for localized texture traces.
- FastViT Block ➡️ Structural reparameterization + token mixing for global structural consistency.
- Feature Fusion ➡️ Dual-stream Global Average Pooling (GAP) concatenation.
- Final MLP Head ➡️ Optimized binary classification.
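To illustrate the parameter-free attention stage, here is a minimal PyTorch sketch of the published SimAM formulation (not necessarily the repo's exact module): each activation is weighted by an inverse energy computed from its channel's spatial statistics, so no learnable parameters are added.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free 3D attention: weights each activation by an inverse
    energy derived from its deviation from the per-channel spatial mean."""
    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w - 1
        # Squared deviation from the per-channel spatial mean.
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        # Channel-wise variance estimate over the spatial dimensions.
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # Inverse energy: distinctive (low-energy) neurons get higher weight.
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        return x * torch.sigmoid(e_inv)

feat = torch.randn(1, 16, 8, 8)
out = SimAM()(feat)  # same shape as input, zero added parameters
```

Because the module is parameter-free, it refines localized texture traces without inflating the sub-2.6M parameter budget.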
| Metric | Baseline YOLOv8n | WYT-Net (Proposed) |
|---|---|---|
| Accuracy | 91.45% | 94.44% |
| mAP@0.5 | 0.8521 | 0.9371 |
| F1-Score (Fake) | 0.7145 | 0.8022 |
| Parameters | 2.04 M | 2.57 M |
| GFLOPs | 0.2498 | 0.2841 |
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Real | 0.9735 | 0.9619 | 0.9677 | 23,834 |
| Fake | 0.7738 | 0.8328 | 0.8022 | 3,731 |
| Average / Total | 0.9465 | 0.9444 | 0.9453 | 27,565 |
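The per-class F1 scores in the table follow directly from the harmonic mean of precision and recall, which is easy to verify:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the table's F1 column from its precision/recall columns.
print(round(f1(0.9735, 0.9619), 4))  # Real class -> 0.9677
print(round(f1(0.7738, 0.8328), 4))  # Fake class -> 0.8022
```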
- Inference Engine: Tencent NCNN
- Latency: ~124.5 ms per frame
- FPS: ~8 FPS
- Optimization: Multi-threading (4 cores) + ARM Neon SIMD
| Precision | Latency |
|---|---|
| FP32 | 28.5 ms |
| FP16 | 14.2 ms |
| INT8 | 8.1 ms |
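The quantization gains can be read off the table directly; assuming these latencies are per-frame and single-stream, the relative speedups versus FP32 work out as:

```python
latencies_ms = {"FP32": 28.5, "FP16": 14.2, "INT8": 8.1}

for precision, ms in latencies_ms.items():
    speedup = latencies_ms["FP32"] / ms
    fps = 1000.0 / ms
    print(f"{precision}: {ms} ms -> {fps:.1f} FPS ({speedup:.2f}x vs FP32)")
```

INT8 quantization yields roughly a 3.5x speedup over FP32, which is what pushes the pipeline comfortably past real-time frame rates.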
- Peak Power Consumption: ~3.1W
- Performance: Near real-time inference achieved natively on edge hardware.
```
.
├── Models/
│   ├── yolov8_ai_augmented.py               # Main WYT-Net implementation
│   ├── yolov8_hybrid_ai.py                  # Hybrid backbone variant
│   ├── augment_dataset.py                   # Data augmentation pipeline
│   └── extract_faces.py                     # Face detection & DWT preprocessing
├── Results/
│   ├── results_journal_Yolov8_Ai_augmented/ # Proposed model metrics & plots
│   ├── results_hybrid_no_ai_augmented/      # Ablation study metrics
│   └── results_baseline/                    # Reference benchmark metrics
├── test_model.py                            # Inference and validation script
└── requirements.txt                         # Project dependencies
```
```shell
git clone https://github.com/codewithyug06/deepfake-detection-pipelines.git
cd deepfake-detection-pipelines
pip install -r requirements.txt
```

The models were trained and evaluated on data derived from the Celeb-DF (v2) dataset.
- Original Dataset: 27,565 images
- Augmented Dataset: 3× expansion resulting in a more balanced and robust training set
- Random Rotation (±15°)
- Horizontal Flip
- Color Jittering
| Configuration | Accuracy |
|---|---|
| Baseline YOLOv8n | 91.45% |
| + Wavelet Input | 92.81% |
| + SimAM Attention | 93.56% |
| + FastViT (WYT-Net) | 94.44% |
Key Takeaway:
Wavelet-domain information, when combined with hybrid global reasoning via FastViT, significantly improves deepfake detection reliability.
- First of Its Kind: Lightweight YOLO-based frequency-aware deepfake detector explicitly optimized for edge computing
- Extreme Efficiency: Highly pruned hybrid CNN-Transformer architecture with fewer than 2.6M parameters
- Edge-Ready: Proven real-time inference on Raspberry Pi 4 and NVIDIA Jetson Nano
- Optimal Tradeoff: Superior balance between detection accuracy and computational complexity compared to state-of-the-art forensic models