Official PyTorch implementation of "One Language-Free Foundation Model Is Enough for Universal Vision Anomaly Detection", 2026.
- [2026-02-02]: ✅ UniADet has been independently reproduced by a third-party team, confirming our state-of-the-art results.
- [2026-01-13]: 🚀 Initialized the official UniADet code repository.
- [2026-01-09]: 📄 The UniADet paper is now available on arXiv.
- Introduction
- UniADet Framework
- Language-Free UniADet with Different Foundation Models
- Comparison with State-of-the-Arts
- Complexity and Efficiency Comparisons
- Ablation Studies
- Comparison with Language-Dependent AnomalyCLIP
- ToDo List
- Citation
UniADet is a language-free universal (zero- and few-shot) visual anomaly detection framework. It outperforms state-of-the-art language-dependent zero- and few-shot AD models while remaining remarkably simple and efficient.
- We rethink vision-language ADs and find that language prompts and encoders are unnecessary. This insight leads to an embarrassingly simple (language-free + dual-decoupling), efficient (0.015M or 0.02M learnable params), effective (SOTA zero-/few-shot), and general (supporting both VLMs and pure VMs) framework for universal anomaly detection.
- We fully decouple global anomaly classification and local anomaly segmentation across multi-scale hierarchical features, i.e., learning layer-wise cls/seg weights, effectively mitigating the learning conflict between different feature manifolds and substantially improving AD performance.
- Comprehensive experiments validate that our approach achieves state-of-the-art zero-shot and few-shot performance. Notably, our few-shot UniADet is the first to outperform full-shot state-of-the-art methods.
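The dual-decoupling idea above (layer-wise, task-separated weights) can be illustrated with a minimal NumPy sketch. This is our own illustration under stated assumptions, not the released implementation: the names `layer_feats`, `w_cls`, and `w_seg` are hypothetical, and the layer fusion here is a plain average.

```python
import numpy as np

def dual_decoupled_scores(layer_feats, w_cls, w_seg):
    """Sketch of dual decoupling: every hierarchical layer l keeps its OWN
    pair of weights, one for global classification (w_cls[l]) and one for
    local segmentation (w_seg[l]), so the two tasks never share parameters
    across the feature hierarchy."""
    img_scores, seg_maps = [], []
    for f, wc, ws in zip(layer_feats, w_cls, w_seg):
        # f: (num_patches, dim) patch tokens of one layer, L2-normalised
        f = f / np.linalg.norm(f, axis=-1, keepdims=True)
        img_scores.append(f.mean(axis=0) @ wc)  # layer-wise image-level score
        seg_maps.append(f @ ws)                 # layer-wise per-patch scores
    # simple average over layers; the actual fusion may be learned
    return float(np.mean(img_scores)), np.mean(seg_maps, axis=0)
```

Keeping `w_cls` and `w_seg` separate per layer is what avoids the cls/seg learning conflict described above; a shared weight would force one vector to serve both feature manifolds.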
| Backbone | Shot | MVTec-AD | VisA | Real-IAD |
|---|---|---|---|---|
| CLIP (ViT-L/14@336px) | 0 | 92.4 / 42.8 | 88.0 / 28.0 | 78.6 / 33.6 |
| CLIP (ViT-L/14@336px) | 4 | 97.7 / 58.8 | 93.3 / 36.7 | 84.3 / 37.2 |
| DINOv2 (Register ViT-L/14) | 0 | 93.5 / 50.9 | 91.3 / 32.7 | 82.5 / 43.1 |
| DINOv2 (Register ViT-L/14) | 4 | 98.7 / 65.4 | 96.9 / 45.2 | 90.3 / 48.5 |
| DINOv3 (ViT-L/16) | 0 | 94.0 / 52.7 | 91.9 / 32.5 | 81.2 / 41.6 |
| DINOv3 (ViT-L/16) | 4 | 98.2 / 69.0 | 97.1 / 45.5 | 88.5 / 49.8 |
Note: Performance is measured as Image-AUROC / Pixel-AUPR, and the same applies below.
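For reference, both metrics can be computed in a few lines of NumPy. This is a generic sketch (Mann-Whitney formulation of AUROC, average-precision formulation of AUPR, ignoring tie handling), not our evaluation code; in practice `sklearn.metrics.roc_auc_score` and `average_precision_score` compute the same quantities.

```python
import numpy as np

def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic (labels: 0/1 array)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def aupr(labels, scores):
    """Area under the precision-recall curve (average precision)."""
    order = np.argsort(-scores)
    tp = np.cumsum(labels[order] == 1)
    precision = tp / np.arange(1, len(scores) + 1)
    recall = tp / (labels == 1).sum()
    # recall only increases at positives, so this step sum equals AP
    return np.sum(np.diff(np.concatenate([[0.0], recall])) * precision)
```

Image-AUROC applies these to one score per image; Pixel-AUPR applies them to the flattened per-pixel anomaly maps.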
| Methods | Venue | Language-Free | Shots | MVTec | VisA | Real-IAD |
|---|---|---|---|---|---|---|
| **UniADet** | ours | ✅ | 0 | 93.5 / 50.9 | 91.3 / 32.7 | 82.5 / 43.1 |
| WinCLIP | CVPR 23 | ❌ | 0 | 90.4 / 18.2 | 75.5 / 5.4 | 67.0 / 3.3 |
| APRIL-GAN | CVPRW 23 | ❌ | 0 | 86.1 / - | 78.0 / - | - |
| AnomalyCLIP | ICLR 24 | ❌ | 0 | 91.6 / 34.5 | 82.0 / 21.3 | 69.5 / 26.7 |
| AdaCLIP | ECCV 24 | ❌ | 0 | 90.7 / 39.1 | 81.7 / 31.0 | 73.3 / 30.5 |
| VCPCLIP | ECCV 24 | ❌ | 0 | 92.1 / 49.4 | 83.8 / 30.1 | - |
| Bayes-PFL | CVPR 25 | ❌ | 0 | 92.5 / 48.3 | 87.0 / 29.8 | 70.0 / 27.6 |
| AA-CLIP | CVPR 25 | ❌ | 0 | 90.5 / - | 84.6 / - | - |
| FE-CLIP | ICCV 25 | ❌ | 0 | 91.9 / - | 84.6 / - | - |
| FAPrompt | ICCV 25 | ❌ | 0 | 91.9 / - | 84.6 / - | - |
| RareCLIP | ICCV 25 | ❌ | 0 | 91.5 / 46.1 | 86.1 / 27.0 | - |
| AdaptCLIP | AAAI 26 | ❌ | 0 | 93.5 / 38.3 | 84.8 / 26.1 | 74.2 / 28.2 |
| Methods | Venue | Language-Free | Shots | MVTec | VisA | Real-IAD |
|---|---|---|---|---|---|---|
| **UniADet** | ours | ✅ | 1 | 97.6 / 63.1 | 95.2 / 42.1 | 88.7 / 48.4 |
| **UniADet** | ours | ✅ | 2 | 98.0 / 64.1 | 96.1 / 44.2 | 89.0 / 46.7 |
| **UniADet** | ours | ✅ | 4 | 98.7 / 65.4 | 96.9 / 45.2 | 90.3 / 48.5 |
| MetaUAS | NeurIPS 24 | ✅ | 1 | 90.7 / 59.3 | 81.2 / 42.7 | 80.0 / 36.6 |
| APRIL-GAN | CVPRW 23 | ❌ | 4 | 92.8 / 54.5 | 92.6 / 32.2 | - |
| PromptAD | CVPR 24 | ❌ | 4 | 96.6 / 52.9 | 89.1 / 31.5 | - |
| UniVAD | CVPR 25 | ❌ | 1 | 97.8 / 55.6 | 93.5 / 42.8 | 85.1 / 37.6 |
| AdaptCLIP | AAAI 26 | ❌ | 1 | 94.5 / 53.7 | 90.5 / 38.9 | 81.8 / 36.6 |
| AdaptCLIP | AAAI 26 | ❌ | 2 | 95.7 / 55.1 | 92.2 / 40.7 | 82.9 / 37.8 |
| AdaptCLIP | AAAI 26 | ❌ | 4 | 96.6 / 57.2 | 93.1 / 41.8 | 83.9 / 39.1 |
| Methods | Venue | Language-Free | Setting | MVTec | VisA | Real-IAD |
|---|---|---|---|---|---|---|
| Dinomaly | CVPR 25 | ✅ | multi-class (full train set) | 99.6 / 69.3 | 98.7 / 53.2 | 89.3 / 42.8 |
| UniAD | NeurIPS 24 | ✅ | multi-class (full train set) | 96.5 / 44.7 | 90.8 / 33.6 | 83.0 / 21.1 |
| MuSc | ICLR 24 | ✅ | online (full test set) | 97.8 / 62.7 | 92.8 / 45.1 | - |
Important
Note1: If you find that any existing zero-shot/few-shot AD methods are missing from the table above, please feel free to open an issue so we can add them.
Note2: Dinomaly and UniAD are multi-class unsupervised AD algorithms, and they require dataset-specific training with full normal images.
Note3: MuSc is an online algorithm that requires access to statistics of the entire test dataset to evaluate the current image. Therefore, it is not strictly a zero-shot AD method.
| Shots | Methods | Models | Input Size | # Params (M) | Inf. Time (ms) |
|---|---|---|---|---|---|
| 0 | AdaCLIP | CLIP ViT-L/14@336px | 518×518 | 428.8 + 1.1e+1 | 107.4 |
| 0 | AnomalyCLIP | CLIP ViT-L/14@336px | 518×518 | 427.9 + 5.6e+0 | 70.7 |
| 0 | Bayes-PFL | CLIP ViT-L/14@336px | 518×518 | 427.9 + 2.7e+1 | 154.9 |
| 0 | AdaptCLIP-Zero | CLIP ViT-L/14@336px | 518×518 | 427.9 + 6.0e-1 | 57.5 |
| 0 | **UniADet** | CLIP ViT-L/14@336px | 518×518 | 342.9 + 1.5e-2 | 15.7 |
| 0 | **UniADet** | DINOv2 ViT-L/14 | 518×518 | 303.2 + 2.0e-2 | 41.9 |
| 1 | InCtrl | CLIP ViT-B-16+240 | 240×240 | 208.4 + 3.0e-1 | 59.0 |
| 1 | AnomalyCLIP+ | CLIP ViT-L/14@336px | 518×518 | 427.9 + 5.6e+0 | 76.2 |
| 1 | AdaptCLIP | CLIP ViT-L/14@336px | 518×518 | 342.9 + 1.8e+0 | 58.7 |
| 1 | **UniADet** | CLIP ViT-L/14@336px | 518×518 | 342.9 + 1.5e-2 | 22.4 |
| 1 | **UniADet** | DINOv2 ViT-L/14 | 518×518 | 303.2 + 2.0e-2 | 48.4 |
Note: The previously reported learnable-parameter counts (1.5e-3 and 2.0e-3) were not correct for our UniADet; the correct values are 1.5e-2 (0.015M) and 2.0e-2 (0.020M), as listed above.
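The inference times above are per-image wall-clock latencies. A minimal timing harness of the kind commonly used for such tables (warm-up runs excluded, then averaged milliseconds) could look like the sketch below; this is a generic illustration, not our benchmarking script, and `benchmark_ms` is a hypothetical name.

```python
import time

def benchmark_ms(fn, *args, warmup=3, iters=10):
    """Mean wall-clock latency of fn(*args) in milliseconds.

    Warm-up calls are run first (and discarded) so one-time costs such
    as caching or JIT compilation do not distort the measurement.
    """
    for _ in range(warmup):
        fn(*args)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters * 1e3
```

When timing GPU models, a synchronization call (e.g. `torch.cuda.synchronize()`) is also needed before reading the clock, since CUDA kernels launch asynchronously.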
Ablation studies about different components.
| No | DCS | DHF | CAA | Shot | MVTec | VisA |
|---|---|---|---|---|---|---|
| 0 | | | | 0 | 85.4 / 36.4 | 77.9 / 26.1 |
| 1 | ✓ | | | 0 | 91.8 / 38.3 | 85.9 / 27.2 |
| 2 | ✓ | ✓ | | 0 | 92.2 / 40.7 | 86.0 / 27.6 |
| 3 | ✓ | ✓ | ✓ | 0 | 92.4 / 42.8 | 88.0 / 28.0 |
| 4 | ✓ | random | ✓ | 0 | 91.3 / 41.5 | 87.5 / 26.6 |
| 5 | ✓ | ✓ | ✓ | 1 | 95.9 / 54.6 | 91.3 / 32.5 |
Note: The ablation studies are conducted with the CLIP-based UniADet.
- Essential Differences
| Feature | AnomalyCLIP | UniADet (Ours) |
|---|---|---|
| 🧠 Paradigm | 🔴 Language-Dependent | ✅ Language-Free |
| ⚡ Task Decoupling | Shared Cls/Seg Weight | ✅ Decoupled Cls/Seg Weights |
| 🏗️ Hierarchical Decoupling | Single / Last Layer | ✅ Layer-Wise Cls/Seg Weights |
| 🤖 Backbones | CLIP Only | ✅ CLIP, DINOv2-R, DINOv3 |
| 📉 Params | 5.6M | 🚀 0.015M or 0.020M (Efficient) |
Important
Note1: The official AnomalyCLIP team has clarified that their initial multi-layer claim was a code bug; the effective implementation relies only on the last layer.
Note2: A naive extension of AnomalyCLIP to multiple blocks leads to significant performance degradation.
- Commonality
| Feature | AnomalyCLIP & UniADet |
|---|---|
| Core Objective | Learning Normal / Anomaly Weights |
| Loss Function | CE for Cls and Focal + Dice for Seg |
| Training Data | Auxiliary Data (e.g., VisA or MVTec) |
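The shared segmentation objective (focal + dice) can be sketched in a few lines of NumPy. This is a minimal illustration with standard formulations and hypothetical hyper-parameters (`alpha`, `gamma`, `eps`); it is not the exact released loss code.

```python
import numpy as np

def dice_loss(probs, target, eps=1.0):
    """Soft Dice loss on flattened anomaly probabilities in [0, 1]."""
    inter = (probs * target).sum()
    return 1.0 - (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)

def focal_loss(probs, target, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy pixels via the (1-pt)^gamma term."""
    pt = np.where(target == 1, probs, 1.0 - probs)
    a = np.where(target == 1, alpha, 1.0 - alpha)
    return float(np.mean(-a * (1.0 - pt) ** gamma * np.log(np.clip(pt, 1e-8, 1.0))))
```

Dice handles the heavy normal/anomaly pixel imbalance globally, while focal re-weights individual hard pixels, which is why the two are typically summed for segmentation.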
- Performance on VisA
| Shots | Backbones | AnomalyCLIP | UniADet (Ours) |
|---|---|---|---|
| 0-Shot | CLIP | 82.0 / 21.3 | 88.0 / 28.0 |
| 0-Shot | DINOv2-R | Not supported | 91.3 / 32.7 |
| 0-Shot | DINOv3 | Not supported | 91.9 / 32.5 |
| 4-Shot | CLIP | Not supported | 93.3 / 36.7 |
| 4-Shot | DINOv2-R | Not supported | 96.9 / 45.2 |
| 4-Shot | DINOv3 | Not supported | 97.1 / 45.5 |
- [ ] Release pre-trained UniADet models
- [ ] Deploy an online UniADet demo on Hugging Face
- [ ] Open training and testing code
If you find this work useful in your research, please consider citing:
@article{uniadet,
  title={One Language-Free Foundation Model Is Enough for Universal Vision Anomaly Detection},
  author={Gao, Bin-Bin and Wang, Chengjie},
  journal={arXiv preprint arXiv:2601.05552},
  year={2026}
}

