On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey (Text Embedding Survey)
Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, including retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The general architecture of GPTE typically leverages PLMs to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction. We then describe advanced roles enabled by PLMs, including multilingual support, multimodal integration, code understanding, and scenario-specific adaptation. Finally, we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings. This survey aims to serve as a valuable reference for both newcomers and established researchers seeking to understand the current state and future potential of GPTE.
Paper Link: https://arxiv.org/abs/2507.20783
- [2026/1/26] We created the repository on GitHub.
- [2025/11/26] We released the second version of our survey on arXiv.
- [2025/7/28] We released the first version of our survey on arXiv.
- Taxonomy of PLMs’ Roles in GPTE
- Four typical applications of text embedding
- The General Architecture of GPTE
- Representative open-source GPTE models
- Comparisons of GPTE models
- Citation
- Contact Us
We divide PLMs’ roles in GPTE into three categories: (1) Basic Roles — we examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction; (2) Advanced Roles — we describe advanced roles enabled by PLMs, including multilingual support, multimodal integration, code understanding, and scenario-specific adaptation; (3) Expected Roles — we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings.
Text embedding applications can be broadly categorized into three types based on their primary purpose: semantic similarity, semantic relevance, and semantic encoding. The first two focus on bi-text semantic computation, while the last type involves representing individual texts as high-level features for downstream tasks. Semantic similarity typically refers to symmetric tasks where both texts are treated equally, while semantic relevance addresses asymmetric tasks where one text (e.g., a query) is semantically related to another (e.g., a passage). Additionally, several hybrid cases exist within text embedding applications.
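In practice, both the symmetric (similarity) and asymmetric (relevance) cases reduce to a score between two embedding vectors, most commonly cosine similarity. A minimal sketch with NumPy — the vectors here are toy stand-ins, not outputs of any real embedding model:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for real model outputs.
query = np.array([0.1, 0.9, 0.2, 0.4])  # e.g., a search query (asymmetric case)
doc = np.array([0.2, 0.8, 0.1, 0.5])    # e.g., a candidate passage

print(cosine_similarity(query, doc))
```

The same scoring function serves both task types; what differs is how the two sides are encoded and trained (e.g., asymmetric tasks often prepend an instruction or prefix to the query side).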
The above framework illustrates the mainstream architecture for training GPTE in a supervised manner. Typically, the input word sequence is fed into a well-established PLM backbone (usually a Transformer network), producing contextual hidden representations of the words. A pooling step then aggregates these word-level hidden vectors into a single vector, yielding the embedding of the input text. Once the embedding network is in place, the subsequent phase optimizes it beyond the PLM’s initial capabilities, for which contrastive learning (CL) is the widely accepted supervision objective. This learning process can be self-supervised, weakly supervised, or supervised with high-quality data, carried out through bi-encoder semantic computation.
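The two key steps — pooling token-level hidden states into one vector and optimizing with an in-batch contrastive (InfoNCE-style) objective — can be sketched in NumPy as follows. This is an illustrative sketch, not the implementation of any particular model; the function names and the temperature value are assumptions:

```python
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Aggregate token-level hidden states (seq_len, dim) into a single text
    embedding by averaging over non-padding positions (mask: seq_len of 0/1)."""
    mask = mask[:, None]                          # (seq_len, 1)
    return (hidden * mask).sum(axis=0) / mask.sum()

def info_nce(q: np.ndarray, d: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch contrastive (InfoNCE) loss: each query's positive document sits
    at the same batch index; all other in-batch documents act as negatives.
    q, d: (batch, dim) L2-normalized query/document embeddings."""
    logits = q @ d.T / temperature                # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())      # cross-entropy on the diagonal
```

Other pooling choices (CLS token, last token for decoder-only PLMs, attention pooling) slot into the same pipeline, and larger batches supply more in-batch negatives, which is why memory-saving tricks such as gradient checkpointing and cross-device negative sharing matter for GPTE training.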
Representative open-source GPTE models (English-centered); only models released on Hugging Face with detailed and clear documentation are listed. Abbreviations: M.S. = multi-stage training; MNTP = masked next-token prediction training; COS = cosine objective.
Comparisons of GPTE models across various PLM backbones, focusing on widely adopted open-source PLMs.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL-HLT 2019, [paper].
- RoBERTa: A Robustly Optimized BERT Pretraining Approach, arxiv, 2019, [paper].
- Unsupervised Cross-lingual Representation Learning at Scale, ACL, 2020, [paper].
- Improving language understanding by generative pre-training, OpenAI, 2018, [paper].
- Qwen3 Technical Report, arxiv, 2025, [paper].
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models, ICLR, 2025, [paper].
- Don’t Judge a Language Model by Its Last Layer: Contrastive Learning with Layer-Wise Attention Pooling, COLING, 2022, [paper].
- Whitening sentence representations for better semantics and faster retrieval, arxiv, 2021, [paper].
- M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation, ACL Findings, 2024, [paper].
- jina-embeddings-v3: Multilingual Embeddings With Task LoRA, arxiv, 2024, [paper].
- LongEmbed: Extending Embedding Models for Long Context Retrieval, EMNLP, 2024, [paper].
- Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents, arxiv, 2023, [paper].
- MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining, NeurIPS, 2023, [paper].
- Benchmarking and building long-context retrieval models with loco and m2-bert, ICML, 2024, [paper].
- Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference, ACL, 2025, [paper].
- PromptBERT: Improving BERT Sentence Embeddings with Prompts, EMNLP, 2022, [paper].
- Improved Universal Sentence Embeddings with Prompt-based Contrastive Learning and Energy-based Learning, EMNLP, 2022, [paper].
- Simple Techniques for Enhancing Sentence Embeddings in Generative Language Models, ICIC, 2024, [paper].
- Improving Text Embeddings with Large Language Models, ACL, 2024, [paper].
- Generative Representational Instruction Tuning, ICLR, 2025, [paper].
- GEM: Empowering LLM for both Embedding Generation and Language Understanding, arxiv, 2025, [paper].
- KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model, arxiv, 2025, [paper].
- KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model, arxiv, 2025, [paper].
- Towards General Text Embeddings with Multi-stage Contrastive Learning, arxiv, 2023, [paper].
- Text Embeddings by Weakly-Supervised Contrastive Pre-training, arxiv, 2024, [paper].
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models, arxiv, 2025, [paper].
- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders, COLM, 2024, [paper].
- DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings, NAACL, 2022, [paper].
- Generative Representational Instruction Tuning, ICLR, 2025, [paper].
- Generating Datasets with Pretrained Language Models, EMNLP, 2021, [paper].
- Sentence Representation Learning with Generative Objective rather than Contrastive Objective, EMNLP, 2022, [paper].
- InfoCSE: Information-aggregated Contrastive Learning of Sentence Embeddings, EMNLP, 2022, [paper].
- Matryoshka Representation Learning, ICLR, 2022, [paper].
- CoSENT: a more effective sentence vector scheme than Sentence BERT, blog, 2022, [paper].
- M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation, ACL Findings, 2024, [paper].
- Dual-View Distilled BERT for Sentence Embedding, SIGIR, 2021, [paper].
- RankCSE: Unsupervised Sentence Representations Learning via Learning to Rank, ACL, 2023, [paper].
- Ranking-Enhanced Unsupervised Sentence Representation Learning, ACL, 2023, [paper].
- Text Embeddings by Weakly-Supervised Contrastive Pre-training, arxiv, 2024, [paper].
- KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model, arxiv, 2025, [paper].
- Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup, RepL4NLP, 2021, [paper].
- GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning, arxiv, 2024, [paper].
- Training Deep Nets with Sublinear Memory Cost, arxiv, 2016, [paper].
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, SC, 2020, [paper].
- KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model, arxiv, 2025, [paper].
- KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model, arxiv, 2025, [paper].
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models, ICLR, 2025, [paper].
- Gecko: Versatile Text Embeddings Distilled from Large Language Models, arxiv, 2024, [paper].
- Improving Text Embeddings with Large Language Models, ACL, 2024, [paper].
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models, arxiv, 2025, [paper].
- SKICSE: Sentence Knowable Information Prompted by LLMs Improves Contrastive Sentence Embeddings, NAACL, 2024, [paper].
- SumCSE: Summary as a transformation for Contrastive Learning, ACL, 2024, [paper].
- Contrastive Learning of Sentence Embeddings from Scratch, EMNLP, 2023, [paper].
- Exploring the Impact of Negative Samples of Contrastive Learning: A Case Study of Sentence Embedding, ACL Findings, 2022, [paper].
- SimCSE: Simple Contrastive Learning of Sentence Embeddings, EMNLP, 2021, [paper].
- Self-Guided Contrastive Learning for BERT Sentence Representations, ACL, 2021, [paper].
- Virtual Augmentation Supported Contrastive Learning of Sentence Representations, ACL Findings, 2022, [paper].
- Unsupervised Sentence Representation via Contrastive Learning with Mixing Negatives, AAAI, 2022, [paper].
- Debiased Contrastive Learning of Unsupervised Sentence Representations, ACL, 2022, [paper].
- AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark, ACL, 2025, [paper].
- Towards General Text Embeddings with Multi-stage Contrastive Learning, arxiv, 2023, [paper].
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, EMNLP, 2019, [paper].
- Text Embeddings by Weakly-Supervised Contrastive Pre-training, arxiv, 2024, [paper].
- C-Pack: Packed Resources For General Chinese Embeddings, SIGIR, 2024, [paper].
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models, ICLR, 2025, [paper].
- KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model, arxiv, 2025, [paper].
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models, arxiv, 2025, [paper].
- Your Mixture-of-Experts LLM Is Secretly an Embedding Model for Free, ICLR, 2025, [paper].
- Generative Representational Instruction Tuning, ICLR, 2025, [paper].
- Training Sparse Mixture Of Experts Text Embedding Models, arxiv, 2025, [paper].
- M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation, ACL Findings, 2024, [paper].
- Language-agnostic BERT Sentence Embedding, ACL, 2022, [paper].
- KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model, arxiv, 2025, [paper].
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models, arxiv, 2025, [paper].
- Unsupervised Dense Information Retrieval with Contrastive Learning, TMLR, 2022, [paper].
- Large Dual Encoders Are Generalizable Retrievers, EMNLP, 2022, [paper].
- Multilingual E5 Text Embeddings: A Technical Report, arxiv, 2024, [paper].
- Multilingual Sentence-T5: Scalable Sentence Encoders for Multilingual Applications, LREC-COLING, 2024, [paper].
- Toward Best Practices for Training Multilingual Dense Retrieval Models, TOIS, 2023, [paper].
- mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval, EMNLP, 2024, [paper].
- MFAQ: a Multilingual FAQ Dataset, ACL Workshop, 2021, [paper].
- Wikimedia Downloads, Wikimedia Foundation, Accessed: 2024, [dump].
- TWEAC: Transformer with Extendable QA Agent Classifiers, arxiv, 2021, [paper].
- MLQA: Evaluating Cross-lingual Extractive Question Answering, ACL, 2020, [paper].
- MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering, TACL, 2021, [paper].
- Crosslingual Generalization through Multitask Finetuning, ACL, 2023, [paper].
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, LREC, 2020, [paper].
- mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer, NAACL, 2021, [paper].
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, ICML, 2021, [paper].
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, ICML, 2022, [paper].
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, ICML, 2023, [paper].
- Learning Transferable Visual Models From Natural Language Supervision, ICML, 2021, [paper].
- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features, arxiv, 2025, [paper].
- CoCa: Contrastive Captioners are Image-Text Foundation Models, TMLR, 2022, [paper].
- Sigmoid Loss for Language Image Pre-Training, ICCV, 2023, [paper].
- Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs, arxiv, 2025, [paper].
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks, ICLR, 2025, [paper].
- Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval, arxiv, 2025, [paper].
- LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning, EMNLP Findings, 2025, [paper].
- VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents, arxiv, 2025, [paper].
- Breaking the Batch Barrier (B3) of Contrastive Learning via Smart Batch Mining, NeurIPS, 2025, [paper].
- GME: Improving Universal Multimodal Retrieval by Multimodal LLMs, CVPR, 2025, [paper].
- mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data, arxiv, 2025, [paper].
- GME: Improving Universal Multimodal Retrieval by Multimodal LLMs, CVPR, 2025, [paper].
- MegaPairs: Massive Data Synthesis for Universal Multimodal Retrieval, ACL, 2025, [paper].
- Unified Pre-training for Program Understanding and Generation, NAACL, 2021, [paper].
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages, EMNLP Findings, 2020, [paper].
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation, ACL, 2022, [paper].
- GraphCodeBERT: Pre-training Code Representations with Data Flow, ICLR, 2021, [paper].
- TreeBERT: A tree-based pre-trained model for programming language, UAI, 2021, [paper].
- Learning and Evaluating Contextual Embedding of Source Code, ICML, 2020, [paper].
- SCELMo: Source Code Embeddings from Language Models, arxiv, 2020, [paper].
- DOBF: A Deobfuscation Pre-Training Objective for Programming Languages, NeurIPS, 2021, [paper].
- CoTexT: Multi-task Learning with Code-Text Transformer, NLP4Prog, 2021, [paper].
- Unsupervised Translation of Programming Languages, NeurIPS, 2020, [paper].
- SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation, arxiv, 2021, [paper].
- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation, EMNLP, 2021, [paper].
- AVATAR: A Parallel Corpus for Java-Python Program Translation, ACL Findings, 2023, [paper].
- Measuring Coding Challenge Competence With APPS, NeurIPS, 2021, [paper].
- CoSQA: 20,000+ Web Queries for Code Search and Question Answering, ACL/IJCNLP, 2021, [paper].
- Mapping Language to Code in Programmatic Context, EMNLP, 2018, [paper].
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation, NeurIPS, 2021, [paper].
- Convolutional neural networks over tree structures for programming language processing, AAAI, 2016, [paper].
- CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks, NeurIPS, 2021, [paper].
- Leveraging Automated Unit Tests for Unsupervised Code Translation, ICLR, 2022, [paper].
- CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking, ICLR, 2025, [paper].
- Towards a Big Data Curated Benchmark of Inter-project Code Clones, ICSME, 2014, [paper].
- CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation, EMNLP Findings, 2023, [paper].
- Multilingual code snippets training for program translation, AAAI, 2022, [paper].
- Making Text Embedders Few-Shot Learners, ICLR, 2025, [paper].
- FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions, NAACL, 2025, [paper].
- FaBERT: Pre-training BERT on Persian Blogs, ACL Workshop, 2025, [paper].
- C-Pack: Packed Resources For General Chinese Embeddings, SIGIR, 2024, [paper].
- Publicly Available Clinical BERT Embeddings, ClinicalNLP, 2019, [paper].
- DisEmbed: Transforming Disease Understanding through Embeddings, arxiv, 2024, [paper].
- BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinform, 2020, [paper].
- jina-reranker-v3: Last but Not Late Interaction for Listwise Document Reranking, arxiv, 2025, [paper].
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models, arxiv, 2025, [paper].
- Apple of Sodom: Hidden Backdoors in Superior Sentence Embeddings via Contrastive Learning, arxiv, 2022, [paper].
- Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence, ACL Findings, 2023, [paper].
- ALGEN: Few-shot Inversion Attacks on Textual Embeddings via Cross-Model Alignment and Generation, ACL, 2025, [paper].
- Text Revealer: Private Text Reconstruction via Model Inversion Attacks against Transformers, arxiv, 2022, [paper].
- Transferable Embedding Inversion Attack: Uncovering Privacy Risks in Text Embeddings without Model Queries, ACL, 2024, [paper].
- Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings, NIPS, 2016, [paper].
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models, arxiv, 2025, [paper].
- ReasonIR: Training Retrievers for Reasoning Tasks, arxiv, 2025, [paper].
- jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images, arxiv, 2024, [paper].
- mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval, EMNLP, 2024, [paper].
- Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Representation Learning, ICLR Workshop, 2024, [paper].
- TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data, ACL, 2020, [paper].
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages, EMNLP Findings, 2020, [paper].
- KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation, TACL, 2021, [paper].
- Think Then Embed: Generative Context Improves Multimodal Embedding, arxiv, 2025, [paper].
- O1 Embedder: Let Retrievers Think Before Action, arxiv, 2025, [paper].
If you find this survey useful for your research or development, please cite our paper:
@misc{zhang2025rolepretrainedlanguagemodels,
title={On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey},
author={Meishan Zhang and Xin Zhang and Xinping Zhao and Shouzheng Huang and Baotian Hu and Min Zhang},
year={2025},
eprint={2507.20783},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.20783},
}
If you have any questions or suggestions, please feel free to contact us via:
Email: zhaoxinping@stu.hit.edu.cn, mason.zms@gmail.com, and hubaotian@hit.edu.cn