Other Names:
- Voice-face Cross-modal Biometric Matching
- Voice-face Representation Learning
- Voice-face Cross-modal Mapping

Datasets:
- VoxCeleb1: WAV audio data, 1,251 speakers in total, 39 GB after decompression.
  - Baidu Cloud link: VoxCeleb1
  - VoxCeleb file classification: VoxCeleb1 Document
  - Decompression commands:
    - `zip -s 0 split.zip --out unsplit.zip`
    - `unzip unsplit.zip`
  - Official website: VoxCeleb1
- VoxCeleb2: MP4 video data (audio included), 5,994 speakers in total, 255 GB after decompression.
  - Baidu Cloud link: VoxCeleb2
  - Decompression commands:
    - `zip -s 0 vox2_mp4_dev.zip --out unsplit.zip`
    - `unzip unsplit.zip`
  - Official website: VoxCeleb2
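
The two decompression steps listed for each dataset (merge the split archive, then extract it) can be wrapped in a small helper. The following is only a sketch under stated assumptions: the function names `extract_split_zip` and `count_speakers` and the `DRY_RUN` variable are hypothetical, and the one-folder-per-speaker layout with `id*` names is an assumption about the extracted archive, not something stated above.

```shell
# Hypothetical helper: merge a split zip archive into one file, then extract it.
#   $1 = first part of the split archive (e.g. split.zip or vox2_mp4_dev.zip)
#   $2 = name for the merged archive (defaults to unsplit.zip)
# With DRY_RUN=1 the two commands are printed instead of executed.
extract_split_zip() {
    split_archive="$1"
    merged="${2:-unsplit.zip}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "zip -s 0 $split_archive --out $merged"
        echo "unzip $merged"
        return 0
    fi
    # Join all parts into a single archive (split size 0 means "not split").
    zip -s 0 "$split_archive" --out "$merged" || return 1
    # Extract the merged archive into the current directory.
    unzip "$merged"
}

# Hypothetical sanity check: count speaker folders directly under a directory.
# ASSUMPTION: each speaker has its own folder whose name starts with "id".
count_speakers() {
    n=0
    for d in "$1"/id*/; do
        if [ -d "$d" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

For example, `DRY_RUN=1 extract_split_zip vox2_mp4_dev.zip` prints the two commands without running them. If the layout assumption holds, running `count_speakers` on the extracted audio folder should report the speaker totals quoted above (1,251 for VoxCeleb1, 5,994 for VoxCeleb2), which gives a quick check that extraction completed.
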

If any paper description below is wrong or a paper is missing, please leave a comment so it can be corrected or added.

| Paper | Code |
|---|---|
| Nagrani A, Albanie S, Zisserman A. Seeing voices and hearing faces: Cross-modal biometric matching[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 8427-8436. | code, model |
| Horiguchi S, Kanda N, Nagamatsu K. Face-voice matching using cross-modal embeddings[C]//Proceedings of the 26th ACM international conference on Multimedia. 2018: 1011-1019. | ❎ |
| Nagrani A, Albanie S, Zisserman A. Learnable pins: Cross-modal embeddings for person identity[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 71-88. | official, pytorch |
| Kim C, Shin H V, Oh T H, et al. On learning associations of faces and voices[C]//Asian Conference on Computer Vision. Cham: Springer International Publishing, 2018: 276-292. | ❎ |
| Nawaz S, Janjua M K, Gallo I, et al. Deep latent space learning for cross-modal mapping of audio and visual signals[C]//2019 Digital Image Computing: Techniques and Applications (DICTA). IEEE, 2019. | ❎ |
| Wen Y, Ismail M A, Liu W, et al. Disjoint mapping network for cross-modal matching of voices and faces[C]//7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. | ❎ |
| Wang R, Huang H, Zhang X, et al. A novel distance learning for elastic cross-modal audio-visual matching[C]//2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2019: 300-305. | ❎ |
| Xiong C, Zhang D, Liu T, et al. Voice-face cross-modal matching and retrieval: A benchmark[J]. arXiv preprint arXiv:1911.09338, 2019. | ❎ |
| Wang R, Liu X, Cheung Y, et al. Learning discriminative joint embeddings for efficient face and voice association[C]//Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020: 1881-1884. | ❎ |
| Cheng K, Liu X, Cheung Y, et al. Hearing like seeing: Improving voice-face interactions and associations via adversarial deep semantic matching network[C]//Proceedings of the 28th ACM International Conference on Multimedia. 2020: 448-455. | ❎ |
| Tao R, Das R K, Li H. Audio-visual speaker recognition with a cross-modal discriminative network[C]//21st Annual Conference of the International Speech Communication Association, Interspeech 2020, Virtual Event, Shanghai, China, October 25-29, 2020: 2242-2246. | ❎ |
| Zheng A, Hu M, Jiang B, et al. Adversarial-metric learning for audio-visual cross-modal matching[J]. IEEE Transactions on Multimedia, 2021, 24: 338-351. | official, copy |
| Wen P, Xu Q, Jiang Y, et al. Seeking the shape of sound: An adaptive framework for learning voice-face association[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021: 16347-16356. | code |
| Nawaz S, Saeed M S, Morerio P, et al. Cross-modal speaker verification and recognition: A multilingual perspective[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. 2021: 1682-1691. | ❎ |
| Ning H, Zheng X, Lu X, et al. Disentangled representation learning for cross-modal biometric matching[J]. IEEE Transactions on Multimedia, 2021, 24: 1763-1774. | ❎ |
| Saeed M S, Khan M H, Nawaz S, et al. Fusion and orthogonal projection for improved face-voice association[C]//ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022: 7057-7061. | code |
| Chen G, Zhang D, Liu T, et al. Self-lifting: A novel framework for unsupervised voice-face association learning[C]//Proceedings of the 2022 International Conference on Multimedia Retrieval. 2022: 527-535. | code |
| Zhu B, Xu K, Wang C, et al. Unsupervised voice-face representation learning by cross-modal prototype contrast[C]//Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022: 3787-3794. | code |
| Yu Z, Liu X, Cheung Y M, et al. Detach and enhance: Learning disentangled cross-modal latent representation for efficient face-voice association and matching[C]//2022 IEEE International Conference on Data Mining (ICDM). IEEE, 2022: 648-655. | ❎ |
| Wang J, Li C, Zheng A, et al. Looking and hearing into details: Dual-enhanced Siamese adversarial network for audio-visual matching[J]. IEEE Transactions on Multimedia, 2023, 25: 7505-7516. | code |
| Saeed M S, Nawaz S, Khan M H, et al. Single-branch network for multimodal training[C]//ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023: 1-5. | code |
| Chen G, Liu X, Xu X, et al. Taking a Part for the Whole: An Archetype-agnostic Framework for Voice-Face Association[C]//Proceedings of the 31st ACM International Conference on Multimedia. 2023: 7056-7064. | code |
| Jing X, He L, Song Z, et al. Audio–Visual Fusion Based on Interactive Attention for Person Verification[J]. Sensors, 2023, 23(24): 9845. | ❎ |
| Yuan F, Wang J, Zhou X, et al. Audio Visual Cross-modal Matching based on Relational Similarity Learning[C]//2023 8th International Conference on Intelligent Computing and Signal Processing (ICSP). IEEE, 2023: 49-52. | ❎ |
| Zheng A, Yuan F, Zhang H, et al. Public-private attributes-based variational adversarial network for audio-visual cross-modal matching[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(9): 8698-8709. | code |
| Wang J, Zheng A, Yan Y, et al. Attribute-guided cross-modal interaction and enhancement for audio-visual matching[J]. IEEE Transactions on Information Forensics and Security, 2024, 19: 4986-4998. | code |
| Zhang H, Wang J, Shi H, et al. Attention-Guided Contrastive Masked Autoencoders for Self-supervised Cross-Modal Biometric Matching[C]//International Conference on Digital Forensics and Cyber Crime. Cham: Springer Nature Switzerland, 2024: 162-173. | ❎ |
| Tang J, Wang X, Xiao Z, et al. Exploring Robust Face-Voice Matching in Multilingual Environments[C]//Proceedings of the 32nd ACM International Conference on Multimedia. 2024: 11335-11341. | code |
| Sun J, Su J. Unsupervised Multi-level Search and Correspondence for Generic Voice-Face Feature Spaces[C]//International Conference on Pattern Recognition. Cham: Springer Nature Switzerland, 2024: 219-232. | ❎ |
| Chen W, Sun Y, Xu K, et al. Contrastive Learning-based Chaining-Cluster for Multilingual Voice-Face Association[C]//Proceedings of the 32nd ACM International Conference on Multimedia. 2024: 11348-11354. | ❎ |
| Chen W, Xu K, Dou Y, et al. Voice-to-Face Generation: Couple of Self-Supervised Representation Learning with Diffusion Model[C]//2024 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2024: 1-6. | ❎ |
| Chen W, Zhu B, Xu K, et al. VoiceStyle: Voice-based face generation via cross-modal prototype contrastive learning[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2024, 20(9): 1-23. | ❎ |
| Tao R, Shi Z, Jiang Y, et al. Multi-stage Face-voice Association Learning with Keynote Speaker Diarization[C]//Proceedings of the 32nd ACM International Conference on Multimedia. 2024: 11342-11347. | code |
| Kim T, Kang J. Face and voice cross-modal association with learning convex feature embedding[J]. Multimedia Systems, 2025, 31(4): 296. | ❎ |
| Wang J, Zheng A, Liu L, et al. Adaptive Interaction and Correction Attention Network for Audio-Visual Matching[J]. IEEE Transactions on Information Forensics and Security, 2025, 20: 7558-7571. doi: 10.1109/TIFS.2025.3586484. | code |
| Hannan A, Manzoor M A, Nawaz S, et al. PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association[C]//Interspeech 2025. 2025: 2710-2714. | code |
| Liu Y, Fang Y, Lin Z. MuteSwap: Silent Face-based Voice Conversion[J]. arXiv preprint arXiv:2507.00498, 2025. | code |
| Fang Z, Tao S, Wang J, et al. XM-ALIGN: Unified Cross-Modal Embedding Alignment for Face-Voice Association[J]. arXiv preprint arXiv:2512.06757, 2025. | code |
| Liu Y, Fang Y, Lin Z. Visual-informed Silent Video Identity Conversion[C]//Proceedings of the 33rd ACM International Conference on Multimedia. 2025: 2104-2112. | code |
| Zhang Z, Naito K, Dahmani H. Contrastive gated fusion for multilingual speaker verification[J]. Authorea Preprints, 2025. | code |