From 7e887d6ac2bf8e2a7b2d1f265f4a5e614ea04e47 Mon Sep 17 00:00:00 2001
From: Prabu
Date: Mon, 13 Apr 2026 23:30:10 -0500
Subject: [PATCH] Add files via upload

Prabu added pretraining.md
---
 content/pretraining.md | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
 create mode 100644 content/pretraining.md

diff --git a/content/pretraining.md b/content/pretraining.md
new file mode 100644
index 0000000..50b2092
--- /dev/null
+++ b/content/pretraining.md
@@ -0,0 +1,17 @@

# Pretraining

Foundation models are large neural networks trained on massive datasets with self-supervised learning, using a set of pretext tasks to learn useful representations (features) of the data. The transferable features produced by a pretrained foundation model, called embeddings, are then finetuned on smaller labeled datasets for specific downstream tasks. The ingredients for creating a pretrained foundation model are therefore massive (unlabeled) datasets, large neural networks, and pretext tasks. The current practice of pretraining foundation models followed by finetuning can be seen as an evolution of the transfer learning paradigm with two main changes: the use of self-supervised learning instead of supervised learning, and the convergence of model architectures around transformer variants. The effectiveness of the pretrain-and-finetune methodology, the availability of internet-scale (unlabeled) datasets, and improvements in compute capabilities have fueled the surge in the popularity of foundation models.

Many advances in pretraining foundation models were pioneered in natural language processing (NLP) and computer vision (CV), mainly due to the availability of massive unlabeled text and image datasets from the internet. Foundation models for other modalities (e.g., audio, time series, tabular data) have been developed by adapting the pretraining strategies from NLP and CV, with domain-specific modifications to the datasets and pretext tasks employed. In this section we present some key concepts for foundation model pretraining, using NLP and CV as working examples, and discuss adaptations to biomedical data and tasks in the next section.

In a typical pretraining pipeline for transformer-based foundation models, the data is converted into a set of position-encoded tokens (words or sub-words in NLP, image patches in CV) and passed through a sequence of transformer blocks to learn dynamic, contextual embeddings for the tokens. The attention mechanism in the transformer blocks lets the model capture relationships among all the tokens in parallel, without recurrence. This parallelism is computationally efficient but discards information about the order, directionality, and distance between tokens; position encoding the tokens helps retain some of these informative inter-token cues.

The learning signal for model training is derived from a pretext task, with masked token modeling and multi-view contrastive learning being two popular pretext tasks for self-supervised pretraining of foundation models. In masked token modeling, one or more tokens are masked, and the model is trained to predict the masked tokens from the unmasked ones. For text data, the model predicts a probability distribution over the vocabulary at each masked position, and this prediction is compared to the true word with a cross-entropy loss to generate the training signal. In CV, the difference between the predicted and true pixel values in the masked patches provides the self-supervision signal. In multi-view contrastive learning, multiple views of a sample are generated via data augmentation, and the model is trained to embed views of the same sample close to each other while pushing views of different samples apart. Multi-view contrastive learning is a popular technique for pretraining foundation models in CV.
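To make the masked-token-modeling pretext task concrete, below is a minimal sketch for text data, assuming PyTorch; the vocabulary size, model dimensions, layer count, and masking rate are illustrative placeholders rather than values from any particular published model. It also reflects the pipeline described above: token embeddings plus learned position encodings are passed through a stack of transformer blocks, and a cross-entropy loss is computed only at the masked positions.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, MAX_LEN, D_MODEL, MASK_ID = 30000, 128, 256, 0   # illustrative sizes; id 0 reserved for [MASK]

class MaskedTokenEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)             # token embeddings
        self.pos_emb = nn.Embedding(MAX_LEN, D_MODEL)                # learned position encodings
        block = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=2)     # stack of transformer blocks
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)                 # predicts a distribution over the vocabulary

    def forward(self, token_ids):                                     # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)        # position-encoded tokens
        h = self.blocks(x)                                            # contextual token embeddings
        return self.lm_head(h)                                        # per-token logits over the vocabulary

def masked_lm_loss(model, token_ids, mask_prob=0.15):
    mask = torch.rand(token_ids.shape) < mask_prob                   # choose positions to hide
    corrupted = token_ids.masked_fill(mask, MASK_ID)                 # replace them with the [MASK] id
    logits = model(corrupted)
    # Cross-entropy between the predicted distributions and the true tokens,
    # computed only at the masked positions.
    return nn.functional.cross_entropy(logits[mask], token_ids[mask])

model = MaskedTokenEncoder()
batch = torch.randint(1, VOCAB_SIZE, (8, MAX_LEN))                   # stand-in for a batch of tokenized text
loss = masked_lm_loss(model, batch)
loss.backward()                                                       # training signal for the model update
```

A learned position embedding is used here for brevity; sinusoidal or relative position encodings are common alternatives.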
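The contrastive pretext task can be sketched in a similar spirit. The example below is a simplified, one-directional InfoNCE-style loss, assuming PyTorch and torchvision; the backbone, augmentations, batch size, and temperature are illustrative choices, and widely used recipes (e.g., SimCLR) symmetrize the loss over both views and add a projection head and much larger batches.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from torchvision.models import resnet18

# Two random augmentations of the same image give two "views" of one sample.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),
])

encoder = resnet18(num_classes=128)                 # backbone whose final layer maps to a 128-d embedding

def info_nce_loss(z1, z2, temperature=0.1):
    """Pull views of the same sample together, push views of different samples apart."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature              # cosine similarities between all pairs of views
    targets = torch.arange(z1.size(0))              # matching views lie on the diagonal
    return F.cross_entropy(logits, targets)

images = torch.rand(16, 3, 224, 224)                # stand-in for a batch of unlabeled images
view1 = torch.stack([augment(img) for img in images])
view2 = torch.stack([augment(img) for img in images])
loss = info_nce_loss(encoder(view1), encoder(view2))
loss.backward()
```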
A useful taxonomy divides popular pretrained foundation models into encoder and decoder models. While both model types learn powerful token embeddings, encoder models are popular for downstream tasks such as classification, where a linear model is trained on the embeddings produced by the pretrained foundation model using a labeled dataset. Decoder models are used for applications that require token generation, such as summarization and question answering.

Adapting pretraining techniques from NLP and CV to other modalities usually involves modifications to the tokenization scheme, position encoding approach, attention patterns, and pretext tasks/losses. These modifications inject modality- or domain-specific inductive biases into the training process and enable the learning of powerful, application-specific embeddings.
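As a toy illustration of such a modification, the sketch below (again assuming PyTorch) swaps only the tokenization step for a different modality: a univariate time series is split into fixed-length patches, each linearly projected into a token embedding that can then be position encoded and fed through the same transformer blocks and pretext tasks described earlier. The patch length and dimensions are arbitrary placeholders, not taken from any specific model.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Turn a univariate time series into a sequence of patch tokens."""
    def __init__(self, patch_len=16, d_model=256):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)        # one embedding ("token") per patch

    def forward(self, series):                           # series: (batch, length)
        b, length = series.shape
        n_patches = length // self.patch_len
        patches = series[:, : n_patches * self.patch_len].reshape(b, n_patches, self.patch_len)
        return self.proj(patches)                        # (batch, n_patches, d_model)

tokenizer = PatchTokenizer()
signal = torch.randn(4, 512)                             # stand-in for unlabeled sensor recordings
tokens = tokenizer(signal)                               # ready for position encoding + transformer blocks
print(tokens.shape)                                      # torch.Size([4, 32, 256])
```

Analogous swaps (e.g., spectrogram patches for audio, or per-column embeddings for tabular data) leave the rest of the pretraining pipeline unchanged.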