Papers on audio-visual matching (voice-face association) learning

Continuously updated!

Other Names:

  • Voice-face Cross-modal Biometric Matching
  • Voice-face Representation Learning
  • Voice-face Cross-modal Mapping

Get the datasets

VoxCeleb1

  • wav audio data, 1,251 people in total, 39 GB after decompression.

  • Baidu Cloud link: VoxCeleb1

  • VoxCeleb1 dataset documentation: VoxCeleb1 Document

  • Decompression commands (a consolidated sketch follows this list):

  • zip -s 0 split.zip --out unsplit.zip

  • unzip unsplit.zip

  • Vox1 official website: VoxCeleb1
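
For convenience, the steps above can be run end to end with the minimal shell sketch below. It assumes the downloaded parts are named split.z01, split.z02, ..., split.zip and sit in the current directory; the part names and the sanity check are assumptions, only the join/unzip commands above come from this repo.

```bash
# Minimal sketch: recombine and extract the VoxCeleb1 split archive.
# Assumes the downloaded parts (split.z01, split.z02, ..., split.zip) are in
# the current directory; adjust the names to what the Baidu Cloud share delivers.
set -e

# -s 0 merges the split parts back into a single ordinary zip archive.
zip -s 0 split.zip --out unsplit.zip

# Extract; expect roughly 39 GB of wav files covering 1,251 speakers.
unzip unsplit.zip

# Quick sanity check: count the speaker ID directories (layout may differ).
find . -maxdepth 2 -type d -name 'id*' | wc -l
```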

VoxCeleb2

  • MP4 video data (audio track included), 5,994 people in total, 255 GB after decompression.

  • Baidu Cloud link: VoxCeleb2

  • Decompression commands (an ffmpeg audio-extraction sketch follows this list):

  • zip -s 0 vox2_mp4_dev.zip --out unsplit.zip

  • unzip unsplit.zip

  • Vox2 official website: VoxCeleb2
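
Since VoxCeleb2 ships as mp4 clips, most voice-face pipelines pull the audio track out of each file after unzipping. The sketch below does this with ffmpeg; the 16 kHz mono wav settings, the dev/mp4 and dev/wav paths, and the loop structure are all assumptions rather than part of this repo's instructions.

```bash
# Minimal sketch: after joining/unzipping vox2_mp4_dev.zip as above, extract
# a 16 kHz mono wav per clip with ffmpeg. Paths, sample rate, and output
# layout are assumptions; adapt them to your own pipeline.
set -e

SRC=dev/mp4        # assumed location of the extracted mp4 clips
DST=dev/wav        # where the extracted audio will be written

find "$SRC" -name '*.mp4' | while read -r clip; do
    out="$DST/${clip#"$SRC"/}"          # mirror the id/session/clip layout
    out="${out%.mp4}.wav"
    mkdir -p "$(dirname "$out")"
    # -vn drops the video stream; pcm_s16le at 16 kHz mono is a common
    # front-end format for speaker and voice-face models.
    ffmpeg -loglevel error -i "$clip" -vn -acodec pcm_s16le -ar 16000 -ac 1 "$out"
done
```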

List of Papers

If there are any errors or omissions in the paper list, please feel free to leave a comment so it can be corrected or supplemented.

Papers (a trailing code / model / official marker indicates an available implementation link):
Nagrani A, Albanie S, Zisserman A. Seeing voices and hearing faces: Cross-modal biometric matching[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 8427-8436. code, model
Horiguchi S, Kanda N, Nagamatsu K. Face-voice matching using cross-modal embeddings[C]//Proceedings of the 26th ACM international conference on Multimedia. 2018: 1011-1019.
Nagrani A, Albanie S, Zisserman A. Learnable pins: Cross-modal embeddings for person identity[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 71-88. official, pytorch
Kim C, Shin H V, Oh T H, et al. On learning associations of faces and voices[C]//Asian Conference on Computer Vision. Cham: Springer International Publishing, 2018: 276-292.
Nawaz S, Janjua M K, Gallo I, et al. Deep latent space learning for cross-modal mapping of audio and visual signals[C]//2019 Digital Image Computing: Techniques and Applications (DICTA). IEEE, 2019.
Wen Y, Ismail M A, Liu W, et al. Disjoint mapping network for cross-modal matching of voices and faces[C]//7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 2019.
Wang R, Huang H, Zhang X, et al. A novel distance learning for elastic cross-modal audio-visual matching[C]//2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2019: 300-305.
Xiong C, Zhang D, Liu T, et al. Voice-face cross-modal matching and retrieval: A benchmark[J]. arXiv preprint arXiv:1911.09338, 2019.
Wang R, Liu X, Cheung Y, et al. Learning discriminative joint embeddings for efficient face and voice association[C]//Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020: 1881-1884.
Cheng K, Liu X, Cheung Y, et al. Hearing like seeing: Improving voice-face interactions and associations via adversarial deep semantic matching network[C]//Proceedings of the 28th ACM International Conference on Multimedia. 2020: 448-455.
Tao R, Das R K, Li H. Audio-visual speaker recognition with a cross-modal discriminative network[C]//21st Annual Conference of the International Speech Communication Association (Interspeech 2020), Shanghai, China, 2020: 2242-2246.
Zheng A, Hu M, Jiang B, et al. Adversarial-metric learning for audio-visual cross-modal matching[J]. IEEE Transactions on Multimedia, 2021, 24: 338-351. official, copy
Wen P, Xu Q, Jiang Y, et al. Seeking the shape of sound: An adaptive framework for learning voice-face association[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021: 16347-16356. code
Nawaz S, Saeed M S, Morerio P, et al. Cross-modal speaker verification and recognition: A multilingual perspective[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition Workshop. 2021: 1682-1691.
Ning H, Zheng X, Lu X, et al. Disentangled representation learning for cross-modal biometric matching[J]. IEEE Transactions on Multimedia, 2021, 24: 1763-1774.
Saeed M S, Khan M H, Nawaz S, et al. Fusion and orthogonal projection for improved face-voice association[C]//ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022: 7057-7061. code
Chen G, Zhang D, Liu T, et al. Self-lifting: A novel framework for unsupervised voice-face association learning[C]//Proceedings of the 2022 International Conference on Multimedia Retrieval. 2022: 527-535. code
Zhu B, Xu K, Wang C, et al. Unsupervised voice-face representation learning by cross-modal prototype contrast[C]//Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI). 2022: 3787-3794. code
Yu Z, Liu X, Cheung Y M, et al. Detach and enhance: Learning disentangled cross-modal latent representation for efficient face-voice association and matching[C]//2022 IEEE International Conference on Data Mining (ICDM). IEEE, 2022: 648-655.
Wang J, Li C, Zheng A, et al. Looking and hearing into details: Dual-enhanced Siamese adversarial network for audio-visual matching[J]. IEEE Transactions on Multimedia, 2023, 25: 7505-7516. code
Saeed M S, Nawaz S, Khan M H, et al. Single-branch network for multimodal training[C]//ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023: 1-5. code
Chen G, Liu X, Xu X, et al. Taking a Part for the Whole: An Archetype-agnostic Framework for Voice-Face Association[C]//Proceedings of the 31st ACM International Conference on Multimedia. 2023: 7056-7064. code
Jing X, He L, Song Z, et al. Audio–Visual Fusion Based on Interactive Attention for Person Verification[J]. Sensors, 2023, 23(24): 9845.
Yuan F, Wang J, Zhou X, et al. Audio Visual Cross-modal Matching based on Relational Similarity Learning[C]//2023 8th International Conference on Intelligent Computing and Signal Processing (ICSP). IEEE, 2023: 49-52.
Zheng A, Yuan F, Zhang H, et al. Public-private attributes-based variational adversarial network for audio-visual cross-modal matching[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(9): 8698-8709. code
Wang J, Zheng A, Yan Y, et al. Attribute-guided cross-modal interaction and enhancement for audio-visual matching[J]. IEEE Transactions on Information Forensics and Security, 2024, 19: 4986-4998. code
Zhang H, Wang J, Shi H, et al. Attention-Guided Contrastive Masked Autoencoders for Self-supervised Cross-Modal Biometric Matching[C]//International Conference on Digital Forensics and Cyber Crime. Cham: Springer Nature Switzerland, 2024: 162-173.
Tang J, Wang X, Xiao Z, et al. Exploring Robust Face-Voice Matching in Multilingual Environments[C]//Proceedings of the 32nd ACM International Conference on Multimedia. 2024: 11335-11341. code
Sun J, Su J. Unsupervised Multi-level Search and Correspondence for Generic Voice-Face Feature Spaces[C]//International Conference on Pattern Recognition. Cham: Springer Nature Switzerland, 2024: 219-232.
Chen W, Sun Y, Xu K, et al. Contrastive Learning-based Chaining-Cluster for Multilingual Voice-Face Association[C]//Proceedings of the 32nd ACM International Conference on Multimedia. 2024: 11348-11354.
Chen W, Xu K, Dou Y, et al. Voice-to-Face Generation: Couple of Self-Supervised Representation Learning with Diffusion Model[C]//2024 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2024: 1-6.
Chen W, Zhu B, Xu K, et al. VoiceStyle: Voice-based face generation via cross-modal prototype contrastive learning[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2024, 20(9): 1-23.
Tao R, Shi Z, Jiang Y, et al. Multi-stage Face-voice Association Learning with Keynote Speaker Diarization[C]//Proceedings of the 32nd ACM International Conference on Multimedia. 2024: 11342-11347. code
Kim T, Kang J. Face and voice cross-modal association with learning convex feature embedding[J]. Multimedia Systems, 2025, 31(4): 296.
Wang J, Zheng A, Liu L, et al. Adaptive Interaction and Correction Attention Network for Audio-Visual Matching[J]. IEEE Transactions on Information Forensics and Security, 2025, 20: 7558-7571. doi: 10.1109/TIFS.2025.3586484. code
Hannan A, Manzoor M A, Nawaz S, et al. PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association[C]//Interspeech 2025: 2710-2714. code
Liu Y, Fang Y, Lin Z. MuteSwap: Silent Face-based Voice Conversion[J]. arXiv preprint arXiv:2507.00498, 2025. code
Fang Z, Tao S, Wang J, et al. XM-ALIGN: Unified Cross-Modal Embedding Alignment for Face-Voice Association[J]. arXiv preprint arXiv:2512.06757, 2025. code
Liu Y, Fang Y, Lin Z. Visual-informed Silent Video Identity Conversion[C]//Proceedings of the 33rd ACM International Conference on Multimedia. 2025: 2104-2112. code
Zhang Z, Naito K, Dahmani H. Contrastive gated fusion for multilingual speaker verification[J]. Authorea Preprints, 2025. code

Benchmarks

Voice-Face Association Learning Evaluation

https://github.com/my-yy/vfal-eva