Papers on audio-visual matching (voice-face association) learning

Continuously updated!

Other Names:

  • Voice-face Cross-modal Biometric Matching
  • Voice-face Representation Learning
  • Voice-face Cross-modal Mapping

Get the datasets

VoxCeleb1

  • wav audio data, 1,251 people in total, 39 GB after decompression.

  • Baidu Cloud link: VoxCeleb1

  • VoxCeleb1 dataset documentation: VoxCeleb1 Document

  • Decompression commands (a consolidated sketch follows this list):

  • zip -s 0 split.zip --out unsplit.zip

  • unzip unsplit.zip

  • Vox1 official website: VoxCeleb1
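
For convenience, the steps above can be run end to end with the minimal shell sketch below. It assumes the downloaded parts are named split.z01, split.z02, ..., split.zip and sit in the current directory; the part names and the sanity check are assumptions, only the join/unzip commands above come from this repo.

```bash
# Minimal sketch: recombine and extract the VoxCeleb1 split archive.
# Assumes the downloaded parts (split.z01, split.z02, ..., split.zip) are in
# the current directory; adjust the names to what the Baidu Cloud share delivers.
set -e

# -s 0 merges the split parts back into a single ordinary zip archive.
zip -s 0 split.zip --out unsplit.zip

# Extract; expect roughly 39 GB of wav files covering 1,251 speakers.
unzip unsplit.zip

# Quick sanity check: count the speaker ID directories (layout may differ).
find . -maxdepth 2 -type d -name 'id*' | wc -l
```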

VoxCeleb2

  • MP4 video data (audio track included), 5,994 people in total, 255 GB after decompression.

  • Baidu Cloud link: VoxCeleb2

  • Decompression commands (an ffmpeg audio-extraction sketch follows this list):

  • zip -s 0 vox2_mp4_dev.zip --out unsplit.zip

  • unzip unsplit.zip

  • Vox2 official website: VoxCeleb2
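
Since VoxCeleb2 ships as mp4 clips, most voice-face pipelines pull the audio track out of each file after unzipping. The sketch below does this with ffmpeg; the 16 kHz mono wav settings, the dev/mp4 and dev/wav paths, and the loop structure are all assumptions rather than part of this repo's instructions.

```bash
# Minimal sketch: after joining/unzipping vox2_mp4_dev.zip as above, extract
# a 16 kHz mono wav per clip with ffmpeg. Paths, sample rate, and output
# layout are assumptions; adapt them to your own pipeline.
set -e

SRC=dev/mp4        # assumed location of the extracted mp4 clips
DST=dev/wav        # where the extracted audio will be written

find "$SRC" -name '*.mp4' | while read -r clip; do
    out="$DST/${clip#"$SRC"/}"          # mirror the id/session/clip layout
    out="${out%.mp4}.wav"
    mkdir -p "$(dirname "$out")"
    # -vn drops the video stream; pcm_s16le at 16 kHz mono is a common
    # front-end format for speaker and voice-face models.
    ffmpeg -loglevel error -i "$clip" -vn -acodec pcm_s16le -ar 16000 -ac 1 "$out"
done
```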

List of Papers

If there are any errors or omissions in the paper list, please feel free to leave a comment so it can be corrected or supplemented.

Papers (a trailing code / model / official marker indicates an available implementation link):
Nagrani A, Albanie S, Zisserman A. Seeing voices and hearing faces: Cross-modal biometric matching[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 8427-8436. code, model
Horiguchi S, Kanda N, Nagamatsu K. Face-voice matching using cross-modal embeddings[C]//Proceedings of the 26th ACM international conference on Multimedia. 2018: 1011-1019.
Nagrani A, Albanie S, Zisserman A. Learnable pins: Cross-modal embeddings for person identity[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 71-88. official, pytorch
Kim C, Shin H V, Oh T H, et al. On learning associations of faces and voices[C]//Asian Conference on Computer Vision. Cham: Springer International Publishing, 2018: 276-292.
Nawaz S, Janjua M K, Gallo I, et al. Deep latent space learning for cross-modal mapping of audio and visual signals[C]//2019 Digital Image Computing: Techniques and Applications (DICTA). IEEE, 2019.
Wen Y, Ismail M A, Liu W, et al. Disjoint mapping network for cross-modal matching of voices and faces[C]//7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 2019.
Wang R, Huang H, Zhang X, et al. A novel distance learning for elastic cross-modal audio-visual matching[C]//2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2019: 300-305.
Xiong C, Zhang D, Liu T, et al. Voice-face cross-modal matching and retrieval: A benchmark[J]. arXiv preprint arXiv:1911.09338, 2019.
Wang R, Liu X, Cheung Y, et al. Learning discriminative joint embeddings for efficient face and voice association[C]//Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020: 1881-1884.
Cheng K, Liu X, Cheung Y, et al. Hearing like seeing: Improving voice-face interactions and associations via adversarial deep semantic matching network[C]//Proceedings of the 28th ACM International Conference on Multimedia. 2020: 448-455.
Tao R, Das R K, Li H. Audio-visual speaker recognition with a cross-modal discriminative network[C]//21st Annual Conference of the International Speech Communication Association (Interspeech 2020), Shanghai, China, 2020: 2242-2246.
Zheng A, Hu M, Jiang B, et al. Adversarial-metric learning for audio-visual cross-modal matching[J]. IEEE Transactions on Multimedia, 2021, 24: 338-351. official, copy
Wen P, Xu Q, Jiang Y, et al. Seeking the shape of sound: An adaptive framework for learning voice-face association[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021: 16347-16356. code
Nawaz S, Saeed M S, Morerio P, et al. Cross-modal speaker verification and recognition: A multilingual perspective[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition Workshop. 2021: 1682-1691.
Ning H, Zheng X, Lu X, et al. Disentangled representation learning for cross-modal biometric matching[J]. IEEE Transactions on Multimedia, 2021, 24: 1763-1774.
Saeed M S, Khan M H, Nawaz S, et al. Fusion and orthogonal projection for improved face-voice association[C]//ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022: 7057-7061. code
Chen G, Zhang D, Liu T, et al. Self-lifting: A novel framework for unsupervised voice-face association learning[C]//Proceedings of the 2022 International Conference on Multimedia Retrieval. 2022: 527-535. code
Zhu B, Xu K, Wang C, et al. Unsupervised voice-face representation learning by cross-modal prototype contrast[C]//Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI). 2022: 3787-3794. code
Yu Z, Liu X, Cheung Y M, et al. Detach and enhance: Learning disentangled cross-modal latent representation for efficient face-voice association and matching[C]//2022 IEEE International Conference on Data Mining (ICDM). IEEE, 2022: 648-655.
Wang J, Li C, Zheng A, et al. Looking and hearing into details: Dual-enhanced Siamese adversarial network for audio-visual matching[J]. IEEE Transactions on Multimedia, 2023, 25: 7505-7516. code
Saeed M S, Nawaz S, Khan M H, et al. Single-branch network for multimodal training[C]//ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023: 1-5. code
Chen G, Liu X, Xu X, et al. Taking a Part for the Whole: An Archetype-agnostic Framework for Voice-Face Association[C]//Proceedings of the 31st ACM International Conference on Multimedia. 2023: 7056-7064. code
Jing X, He L, Song Z, et al. Audio–Visual Fusion Based on Interactive Attention for Person Verification[J]. Sensors, 2023, 23(24): 9845.
Yuan F, Wang J, Zhou X, et al. Audio Visual Cross-modal Matching based on Relational Similarity Learning[C]//2023 8th International Conference on Intelligent Computing and Signal Processing (ICSP). IEEE, 2023: 49-52.
Zheng A, Yuan F, Zhang H, et al. Public-private attributes-based variational adversarial network for audio-visual cross-modal matching[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(9): 8698-8709. code
Wang J, Zheng A, Yan Y, et al. Attribute-guided cross-modal interaction and enhancement for audio-visual matching[J]. IEEE Transactions on Information Forensics and Security, 2024, 19: 4986-4998. code
Zhang H, Wang J, Shi H, et al. Attention-Guided Contrastive Masked Autoencoders for Self-supervised Cross-Modal Biometric Matching[C]//International Conference on Digital Forensics and Cyber Crime. Cham: Springer Nature Switzerland, 2024: 162-173.
Tang J, Wang X, Xiao Z, et al. Exploring Robust Face-Voice Matching in Multilingual Environments[C]//Proceedings of the 32nd ACM International Conference on Multimedia. 2024: 11335-11341. code
Sun J, Su J. Unsupervised Multi-level Search and Correspondence for Generic Voice-Face Feature Spaces[C]//International Conference on Pattern Recognition. Cham: Springer Nature Switzerland, 2024: 219-232.
Chen W, Sun Y, Xu K, et al. Contrastive Learning-based Chaining-Cluster for Multilingual Voice-Face Association[C]//Proceedings of the 32nd ACM International Conference on Multimedia. 2024: 11348-11354.
Chen W, Xu K, Dou Y, et al. Voice-to-Face Generation: Couple of Self-Supervised Representation Learning with Diffusion Model[C]//2024 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2024: 1-6.
Chen W, Zhu B, Xu K, et al. VoiceStyle: Voice-based face generation via cross-modal prototype contrastive learning[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2024, 20(9): 1-23.
Tao R, Shi Z, Jiang Y, et al. Multi-stage Face-voice Association Learning with Keynote Speaker Diarization[C]//Proceedings of the 32nd ACM International Conference on Multimedia. 2024: 11342-11347. code
Kim T, Kang J. Face and voice cross-modal association with learning convex feature embedding[J]. Multimedia Systems, 2025, 31(4): 296.
Wang J, Zheng A, Liu L, et al. Adaptive Interaction and Correction Attention Network for Audio-Visual Matching[J]. IEEE Transactions on Information Forensics and Security, 2025, 20: 7558-7571. doi: 10.1109/TIFS.2025.3586484. code
Hannan A, Manzoor M A, Nawaz S, et al. PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association[C]//Interspeech 2025: 2710-2714. code
Liu Y, Fang Y, Lin Z. MuteSwap: Silent Face-based Voice Conversion[J]. arXiv preprint arXiv:2507.00498, 2025. code
Fang Z, Tao S, Wang J, et al. XM-ALIGN: Unified Cross-Modal Embedding Alignment for Face-Voice Association[J]. arXiv preprint arXiv:2512.06757, 2025. code
Liu Y, Fang Y, Lin Z. Visual-informed Silent Video Identity Conversion[C]//Proceedings of the 33rd ACM International Conference on Multimedia. 2025: 2104-2112. code
Zhang Z, Naito K, Dahmani H. Contrastive gated fusion for multilingual speaker verification[J]. Authorea Preprints, 2025. code

Benchmarks

Voice-Face Association Learning Evaluation

https://github.com/my-yy/vfal-eva