On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey (Text Embedding Survey)
Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, including retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The general architecture of GPTE typically leverages PLMs to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction. We then describe advanced roles enabled by PLMs, including multilingual support, multimodal integration, code understanding, and scenario-specific adaptation. Finally, we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings. This survey aims to serve as a valuable reference for both newcomers and established researchers seeking to understand the current state and future potential of GPTE.
Paper Link: https://arxiv.org/abs/2507.20783
- [2026/1/26] We created the repository on GitHub.
- [2025/11/26] We released the second version of our survey on arXiv.
- [2025/7/28] We released the first version of our survey on arXiv.
- Taxonomy of PLMs’ Roles in GPTE
- Four typical applications of text embedding
- The General Architecture of GPTE
- Representative open-source GPTE models
- Comparisons of GPTE models
- Citation
- Contact Us
We divide PLMs’ roles in GPTE into three categories: (1) Basic Roles — we examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction; (2) Advanced Roles — we describe advanced roles enabled by PLMs, including multilingual support, multimodal integration, code understanding, and scenario-specific adaptation; (3) Expected Roles — we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings.
Text embedding applications can be broadly categorized into three types based on their primary purpose: semantic similarity, semantic relevance, and semantic encoding. The first two focus on bi-text semantic computation, while the last type involves representing individual texts as high-level features for downstream tasks. Semantic similarity typically refers to symmetric tasks where both texts are treated equally, while semantic relevance addresses asymmetric tasks where one text (e.g., a query) is semantically related to another (e.g., a passage). Additionally, several hybrid cases exist within text embedding applications.
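In practice, both the symmetric (similarity) and asymmetric (relevance) cases reduce to a score between two embedding vectors, most commonly cosine similarity. A minimal sketch with NumPy — the vectors here are toy stand-ins, not outputs of any real embedding model:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for real model outputs.
query = np.array([0.1, 0.9, 0.2, 0.4])  # e.g., a search query (asymmetric case)
doc = np.array([0.2, 0.8, 0.1, 0.5])    # e.g., a candidate passage

print(cosine_similarity(query, doc))
```

The same scoring function serves both task types; what differs is how the two sides are encoded and trained (e.g., asymmetric tasks often prepend an instruction or prefix to the query side).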
The above framework illustrates the mainstream architecture for training GPTE in a supervised manner. Typically, the input word sequence is fed into a well-established PLM backbone (usually a Transformer network), producing contextual hidden representations of the words. A pooling step then aggregates these word-level hidden vectors into a single vector, yielding the embedding of the input text. Once the embedding network is in place, the subsequent phase optimizes it beyond the PLM’s initial capabilities, for which contrastive learning (CL) is the widely accepted supervision objective. This learning process can be self-supervised, weakly supervised, or supervised with high-quality data, carried out through bi-encoder semantic computation.
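The two key steps — pooling token-level hidden states into one vector and optimizing with an in-batch contrastive (InfoNCE-style) objective — can be sketched in NumPy as follows. This is an illustrative sketch, not the implementation of any particular model; the function names and the temperature value are assumptions:

```python
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Aggregate token-level hidden states (seq_len, dim) into a single text
    embedding by averaging over non-padding positions (mask: seq_len of 0/1)."""
    mask = mask[:, None]                          # (seq_len, 1)
    return (hidden * mask).sum(axis=0) / mask.sum()

def info_nce(q: np.ndarray, d: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch contrastive (InfoNCE) loss: each query's positive document sits
    at the same batch index; all other in-batch documents act as negatives.
    q, d: (batch, dim) L2-normalized query/document embeddings."""
    logits = q @ d.T / temperature                # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())      # cross-entropy on the diagonal
```

Other pooling choices (CLS token, last token for decoder-only PLMs, attention pooling) slot into the same pipeline, and larger batches supply more in-batch negatives, which is why memory-saving tricks such as gradient checkpointing and cross-device negative sharing matter for GPTE training.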
Representative open-source GPTE models (English-centered); only models released on Hugging Face with detailed and clear documentation are listed. Abbreviations: M.S. = multi-stage training; MNTP = masked next-token prediction training; COS = cosine objective.
Comparisons of GPTE models across various PLM backbones, focusing on widely adopted open-source PLMs.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL-HLT 2019, [paper].
- RoBERTa: A Robustly Optimized BERT Pretraining Approach, arxiv, 2019, [paper].
- Unsupervised Cross-lingual Representation Learning at Scale, ACL, 2020, [paper].
- Improving language understanding by generative pre-training, OpenAI, 2018, [paper].
- Qwen3 Technical Report, arxiv, 2025, [paper].
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models, ICLR, 2025, [paper].
- Don’t Judge a Language Model by Its Last Layer: Contrastive Learning with Layer-Wise Attention Pooling, COLING, 2022, [paper].
- Whitening sentence representations for better semantics and faster retrieval, arxiv, 2021, [paper].
- M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation, ACL Findings, 2024, [paper].
- jina-embeddings-v3: Multilingual Embeddings With Task LoRA, arxiv, 2024, [paper].
- LongEmbed: Extending Embedding Models for Long Context Retrieval, EMNLP, 2024, [paper].
- Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents, arxiv, 2023, [paper].
- MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining, NeurIPS, 2023, [paper].
- Benchmarking and building long-context retrieval models with loco and m2-bert, ICML, 2024, [paper].
- Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference, ACL, 2025, [paper].
- PromptBERT: Improving BERT Sentence Embeddings with Prompts, EMNLP, 2022, [paper].
- Improved Universal Sentence Embeddings with Prompt-based Contrastive Learning and Energy-based Learning, EMNLP, 2022, [paper].
- Simple Techniques for Enhancing Sentence Embeddings in Generative Language Models, ICIC, 2024, [paper].
- Improving Text Embeddings with Large Language Models, ACL, 2024, [paper].
- Generative Representational Instruction Tuning, ICLR, 2025, [paper].
- GEM: Empowering LLM for both Embedding Generation and Language Understanding, arxiv, 2025, [paper].
- KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model, arxiv, 2025, [paper].
- KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model, arxiv, 2025, [paper].
- Towards General Text Embeddings with Multi-stage Contrastive Learning, arxiv, 2023, [paper].
- Text Embeddings by Weakly-Supervised Contrastive Pre-training, arxiv, 2024, [paper].
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models, arxiv, 2025, [paper].
- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders, COLM, 2024, [paper].
- DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings, NAACL, 2022, [paper].
- Generative Representational Instruction Tuning, ICLR, 2025, [paper].
- Generating Datasets with Pretrained Language Models, EMNLP, 2021, [paper].
- Sentence Representation Learning with Generative Objective rather than Contrastive Objective, EMNLP, 2022, [paper].
- InfoCSE: Information-aggregated Contrastive Learning of Sentence Embeddings, EMNLP, 2022, [paper].
- Matryoshka Representation Learning, ICLR, 2022, [paper].
- CoSENT: a more effective sentence vector scheme than Sentence BERT, blog, 2022, [paper].
- M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation, ACL Findings, 2024, [paper].
- Dual-View Distilled BERT for Sentence Embedding, SIGIR, 2021, [paper].
- RankCSE: Unsupervised Sentence Representations Learning via Learning to Rank, ACL, 2023, [paper].
- Ranking-Enhanced Unsupervised Sentence Representation Learning, ACL, 2023, [paper].
- Text Embeddings by Weakly-Supervised Contrastive Pre-training, arxiv, 2024, [paper].
- KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model, arxiv, 2025, [paper].
- Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup, RepL4NLP, 2021, [paper].
- GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning, arxiv, 2024, [paper].
- Training Deep Nets with Sublinear Memory Cost, arxiv, 2016, [paper].
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, SC, 2020, [paper].
- KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model, arxiv, 2025, [paper].
- KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model, arxiv, 2025, [paper].
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models, ICLR, 2025, [paper].
- Gecko: Versatile Text Embeddings Distilled from Large Language Models, arxiv, 2024, [paper].
- Improving Text Embeddings with Large Language Models, ACL, 2024, [paper].
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models, arxiv, 2025, [paper].
- SKICSE: Sentence Knowable Information Prompted by LLMs Improves Contrastive Sentence Embeddings, NAACL, 2024, [paper].
- SumCSE: Summary as a transformation for Contrastive Learning, ACL, 2024, [paper].
- Contrastive Learning of Sentence Embeddings from Scratch, EMNLP, 2023, [paper].
- Exploring the Impact of Negative Samples of Contrastive Learning: A Case Study of Sentence Embedding, ACL Findings, 2022, [paper].
- SimCSE: Simple Contrastive Learning of Sentence Embeddings, EMNLP, 2021, [paper].
- Self-Guided Contrastive Learning for BERT Sentence Representations, ACL, 2021, [paper].
- Virtual Augmentation Supported Contrastive Learning of Sentence Representations, ACL Findings, 2022, [paper].
- Unsupervised Sentence Representation via Contrastive Learning with Mixing Negatives, AAAI, 2022, [paper].
- Debiased Contrastive Learning of Unsupervised Sentence Representations, ACL, 2022, [paper].
- AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark, ACL, 2025, [paper].
- Towards General Text Embeddings with Multi-stage Contrastive Learning, arxiv, 2023, [paper].
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, EMNLP, 2019, [paper].
- Text Embeddings by Weakly-Supervised Contrastive Pre-training, arxiv, 2024, [paper].
- C-Pack: Packed Resources For General Chinese Embeddings, SIGIR, 2024, [paper].
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models, ICLR, 2025, [paper].
- KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model, arxiv, 2025, [paper].
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models, arxiv, 2025, [paper].
- Your Mixture-of-Experts LLM Is Secretly an Embedding Model for Free, ICLR, 2025, [paper].
- Generative Representational Instruction Tuning, ICLR, 2025, [paper].
- Training Sparse Mixture Of Experts Text Embedding Models, arxiv, 2025, [paper].
- M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation, ACL Findings, 2024, [paper].
- Language-agnostic BERT Sentence Embedding, ACL, 2022, [paper].
- KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model, arxiv, 2025, [paper].
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models, arxiv, 2025, [paper].
- Unsupervised Dense Information Retrieval with Contrastive Learning, TMLR, 2022, [paper].
- Large Dual Encoders Are Generalizable Retrievers, EMNLP, 2022, [paper].
- Multilingual E5 Text Embeddings: A Technical Report, arxiv, 2024, [paper].
- Multilingual Sentence-T5: Scalable Sentence Encoders for Multilingual Applications, LREC-COLING, 2024, [paper].
- Toward Best Practices for Training Multilingual Dense Retrieval Models, TOIS, 2023, [paper].
- mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval, EMNLP, 2024, [paper].
- MFAQ: a Multilingual FAQ Dataset, ACL Workshop, 2021, [paper].
- Wikimedia Downloads, Wikimedia Foundation, Accessed: 2024, [dump].
- TWEAC: Transformer with Extendable QA Agent Classifiers, arxiv, 2021, [paper].
- MLQA: Evaluating Cross-lingual Extractive Question Answering, ACL, 2020, [paper].
- MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering, TACL, 2021, [paper].
- Crosslingual Generalization through Multitask Finetuning, ACL, 2023, [paper].
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, LREC, 2020, [paper].
- mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer, NAACL, 2021, [paper].
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, ICML, 2021, [paper].
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, ICML, 2022, [paper].
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, ICML, 2023, [paper].
- Learning Transferable Visual Models From Natural Language Supervision, ICML, 2021, [paper].
- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features, arxiv, 2025, [paper].
- CoCa: Contrastive Captioners are Image-Text Foundation Models, TMLR, 2022, [paper].
- Sigmoid Loss for Language Image Pre-Training, ICCV, 2023, [paper].
- Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs, arxiv, 2025, [paper].
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks, ICLR, 2025, [paper].
- Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval, arxiv, 2025, [paper].
- LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning, EMNLP Findings, 2025, [paper].
- VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents, arxiv, 2025, [paper].
- Breaking the Batch Barrier (B3) of Contrastive Learning via Smart Batch Mining, NeurIPS, 2025, [paper].
- GME: Improving Universal Multimodal Retrieval by Multimodal LLMs, CVPR, 2025, [paper].
- mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data, arxiv, 2025, [paper].
- GME: Improving Universal Multimodal Retrieval by Multimodal LLMs, CVPR, 2025, [paper].
- MegaPairs: Massive Data Synthesis for Universal Multimodal Retrieval, ACL, 2025, [paper].
- Unified Pre-training for Program Understanding and Generation, NAACL, 2021, [paper].
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages, EMNLP Findings, 2020, [paper].
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation, ACL, 2022, [paper].
- GraphCodeBERT: Pre-training Code Representations with Data Flow, ICLR, 2021, [paper].
- TreeBERT: A tree-based pre-trained model for programming language, UAI, 2021, [paper].
- Learning and Evaluating Contextual Embedding of Source Code, ICML, 2020, [paper].
- SCELMo: Source Code Embeddings from Language Models, arxiv, 2020, [paper].
- DOBF: A Deobfuscation Pre-Training Objective for Programming Languages, NeurIPS, 2021, [paper].
- CoTexT: Multi-task Learning with Code-Text Transformer, NLP4Prog, 2021, [paper].
- Unsupervised Translation of Programming Languages, NeurIPS, 2020, [paper].
- SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation, arxiv, 2021, [paper].
- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation, EMNLP, 2021, [paper].
- AVATAR: A Parallel Corpus for Java-Python Program Translation, ACL Findings, 2023, [paper].
- Measuring Coding Challenge Competence With APPS, NeurIPS, 2021, [paper].
- CoSQA: 20,000+ Web Queries for Code Search and Question Answering, ACL/IJCNLP, 2021, [paper].
- Mapping Language to Code in Programmatic Context, EMNLP, 2018, [paper].
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation, NeurIPS, 2021, [paper].
- Convolutional neural networks over tree structures for programming language processing, AAAI, 2016, [paper].
- CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks, NeurIPS, 2021, [paper].
- Leveraging Automated Unit Tests for Unsupervised Code Translation, ICLR, 2022, [paper].
- CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking, ICLR, 2025, [paper].
- Towards a Big Data Curated Benchmark of Inter-project Code Clones, ICSME, 2014, [paper].
- CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation, EMNLP Findings, 2023, [paper].
- Multilingual code snippets training for program translation, AAAI, 2022, [paper].
- Making Text Embedders Few-Shot Learners, ICLR, 2025, [paper].
- FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions, NAACL, 2025, [paper].
- FaBERT: Pre-training BERT on Persian Blogs, ACL Workshop, 2025, [paper].
- C-Pack: Packed Resources For General Chinese Embeddings, SIGIR, 2024, [paper].
- Publicly Available Clinical BERT Embeddings, ClinicalNLP, 2019, [paper].
- DisEmbed: Transforming Disease Understanding through Embeddings, arxiv, 2024, [paper].
- BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinform, 2020, [paper].
- jina-reranker-v3: Last but Not Late Interaction for Listwise Document Reranking, arxiv, 2025, [paper].
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models, arxiv, 2025, [paper].
- Apple of Sodom: Hidden Backdoors in Superior Sentence Embeddings via Contrastive Learning, arxiv, 2022, [paper].
- Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence, ACL Findings, 2023, [paper].
- ALGEN: Few-shot Inversion Attacks on Textual Embeddings via Cross-Model Alignment and Generation, ACL, 2025, [paper].
- Text Revealer: Private Text Reconstruction via Model Inversion Attacks against Transformers, arxiv, 2022, [paper].
- Transferable Embedding Inversion Attack: Uncovering Privacy Risks in Text Embeddings without Model Queries, ACL, 2024, [paper].
- Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings, NIPS, 2016, [paper].
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models, arxiv, 2025, [paper].
- ReasonIR: Training Retrievers for Reasoning Tasks, arxiv, 2025, [paper].
- jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images, arxiv, 2024, [paper].
- mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval, EMNLP, 2024, [paper].
- Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Representation Learning, ICLR Workshop, 2024, [paper].
- TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data, ACL, 2020, [paper].
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages, EMNLP Findings, 2020, [paper].
- KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation, TACL, 2021, [paper].
- Think Then Embed: Generative Context Improves Multimodal Embedding, arxiv, 2025, [paper].
- O1 Embedder: Let Retrievers Think Before Action, arxiv, 2025, [paper].
If you find this survey useful for your research or development, please cite our paper:
@misc{zhang2025rolepretrainedlanguagemodels,
title={On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey},
author={Meishan Zhang and Xin Zhang and Xinping Zhao and Shouzheng Huang and Baotian Hu and Min Zhang},
year={2025},
eprint={2507.20783},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.20783},
}
If you have any questions or suggestions, please feel free to contact us via:
Email: zhaoxinping@stu.hit.edu.cn, mason.zms@gmail.com, and hubaotian@hit.edu.cn