Force-pushed a97ed40 to 53f2cd9.
Pull request overview
This pull request introduces significant changes aimed at improving pretraining efficiency for the AMPLIFY protein language model. The changes refactor core components including the tokenizer, model architecture, training loop, data loading pipeline, and dependencies.
Changes:
- Refactored tokenizer from custom implementation to HuggingFace PreTrainedTokenizerFast with new features (ambiguous token handling, position IDs)
- Updated model architecture to support flash attention, packed sequences, and compiled models
- Modernized training pipeline with new dataloader features, checkpoint management, and optimizer/scheduler implementations
- Updated dependencies to newer versions (PyTorch 2.5+, flash-attn, updated transformers/accelerate/deepspeed)
- Changed output directory naming convention from "outputs" to "logs"
Reviewed changes
Copilot reviewed 43 out of 44 changed files in this pull request and generated 25 comments.
Summary per file
| File | Description |
|---|---|
| tests/fixtures.py | Updated test output paths from "outputs" to "logs" |
| tests/example-configs/frankenstein-config/config.yaml | Updated test config paths from "outputs" to "logs" |
| src/amplify/trainer/trainer.py | Major refactoring of training loop with new dataloader integration, flash attention support, packed sequences, model compilation, improved checkpoint management |
| src/amplify/tokenizer/tokenizer.py | Deleted - functionality moved to model module |
| src/amplify/tokenizer/__init__.py | Deleted - module removed |
| src/amplify/scheduler/scheduler.py | Complete rewrite using SequentialLR with separate warmup, constant, and decay phases |
| src/amplify/optimizer/optimizer.py | Simplified to only support AdamW with fused option |
| src/amplify/model/tokenizer.py | New HuggingFace-based tokenizer with enhanced features |
| src/amplify/model/rotary.py | Simplified rotary embeddings implementation, removed docstrings |
| src/amplify/model/rmsnorm.py | Deleted - replaced with PyTorch built-in nn.RMSNorm |
| src/amplify/model/amplify.py | Major architecture changes: unified QKV projection, flash attention, simplified config, removed dropout, new checkpoint loading |
| src/amplify/model/__init__.py | Updated exports to include ProteinTokenizer |
| src/amplify/dataset/transformable_iterable_dataset.py | New wrapper class for iterable datasets with transform support |
| src/amplify/dataset/iterable_protein_dataset.py | Deleted - replaced with HuggingFace datasets |
| src/amplify/dataset/dataloader.py | Complete rewrite using HuggingFace datasets with new features (on-the-fly tokenization, packing, shuffling) |
| src/amplify/dataset/data_collator.py | Deleted - replaced with HuggingFace DataCollatorForLanguageModeling |
| src/amplify/dataset/__init__.py | Updated exports for new dataset classes |
| src/amplify/__init__.py | Updated exports to reflect module reorganization |
| scripts/fasta_to_csv.py | Fixed bug where ">" prefix was included in sequence names |
| pyproject.toml | Updated all dependencies to newer versions |
| framework-integrations/sagemaker/training/code/conf/config.yaml | Updated hydra output path from "outputs" to "logs" |
| examples/ss_prediction.ipynb | New example notebook for secondary structure prediction |
| conf/trainer/span.yaml | Deleted - span masking configuration removed |
| conf/trainer/mlm.yaml | Updated MLM config with new dataloader parameters |
| conf/trainer/base.yaml | Reorganized checkpoint config, added model compilation option, changed paths to "logs" |
| conf/tokenizer/amplify_vocab.txt | Added ambiguous amino acid tokens (X, B, O, U, Z, J) |
| conf/tokenizer/amplify_tokenizer.yaml | Updated vocab size, added ambiguous token IDs and removal flag |
| conf/scheduler/*.yaml | Removed warm restart configs, updated parameter names for new scheduler implementation |
| conf/optimizer/adam.yaml | Deleted - only AdamW supported now |
| conf/optimizer/adamw.yaml | Added fused optimizer option |
| conf/model/esm.yaml | Deleted - ESM config removed |
| conf/model/amplify.yaml | Simplified config, removed dropout and normalization options |
| conf/dataset/*.yaml | Reorganized dataset configs with new structure |
| conf/dataloader/base.yaml | New dataloader configuration file |
| conf/config.yaml | Updated defaults and wandb settings |
| conf/accelerate_compiled_deepspeed_zero2.yaml | New accelerate config with compilation support |
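For context on the scheduler rewrite listed above, a `SequentialLR` with separate warmup, constant, and decay phases can be sketched as follows. The phase lengths, factors, and optimizer settings are illustrative guesses, not the PR's actual configuration:

```python
import torch
from torch.optim.lr_scheduler import ConstantLR, LinearLR, SequentialLR

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Three phases chained by SequentialLR: linear warmup, constant plateau, linear decay.
warmup = LinearLR(optimizer, start_factor=0.01, end_factor=1.0, total_iters=100)
constant = ConstantLR(optimizer, factor=1.0, total_iters=1000)
decay = LinearLR(optimizer, start_factor=1.0, end_factor=0.1, total_iters=400)

# Milestones mark the steps at which the next sub-scheduler takes over.
scheduler = SequentialLR(optimizer, [warmup, constant, decay], milestones=[100, 1100])
```

`SequentialLR` keeps a single learning-rate timeline, which avoids the restart discontinuities of the removed warm-restart configs.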
```python
train_dataset, train_dataloader = get_dataloader(
    **cfg.tokenizer,
    **cfg.dataloader.train,
    **cfg.dataset.train,
    **cfg.trainer.train,
    merge=True,
    seed=metrics["num_epochs"],
    epoch=metrics["num_epochs"],
    return_labels=False,
)
```
The `return_labels` parameter is passed to `get_dataloader`, but the new dataloader implementation no longer accepts it. Remove the argument from this call, since it is not in the function signature.
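One defensive pattern for this class of bug — merging several config dicts into a call whose signature has changed — is to filter the kwargs against the callee's signature before calling. This is only a sketch; the `get_dataloader` stand-in below is hypothetical, not the PR's function:

```python
import inspect

def call_with_supported_kwargs(fn, **kwargs):
    """Drop keyword arguments that fn does not accept, then call it."""
    params = inspect.signature(fn).parameters
    # If fn itself takes **kwargs, everything is accepted as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return fn(**kwargs)
    supported = {k: v for k, v in kwargs.items() if k in params}
    return fn(**supported)

# Hypothetical stand-in for the new get_dataloader signature (no return_labels).
def get_dataloader(merge=False, seed=0):
    return (merge, seed)

# return_labels is silently dropped instead of raising a TypeError.
result = call_with_supported_kwargs(get_dataloader, merge=True, seed=1, return_labels=False)
```

Note the tradeoff: silently dropping unknown kwargs can also hide genuine typos, so explicitly removing the stale argument (as the comment suggests) is the cleaner fix.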
```python
if bool(cfg.trainer.tf32):
    torch.backends.fp32_precision = "tf32"
    torch.backends.cuda.matmul.fp32_precision = "tf32"
    torch.backends.cudnn.fp32_precision = "tf32"
```
The `torch.backends.fp32_precision` attribute (and the matching `fp32_precision` attributes on `torch.backends.cuda.matmul` and `torch.backends.cudnn`) is not available in the PyTorch 2.5-era releases this project pins, so these assignments will raise an `AttributeError` there. The long-standing equivalents are `torch.backends.cuda.matmul.allow_tf32` and `torch.backends.cudnn.allow_tf32`.
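A minimal sketch of the widely available flags (assuming only that `torch` is importable):

```python
import torch

def enable_tf32():
    # Long-standing TF32 switches; settable even on CPU-only builds.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

enable_tf32()
```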
```python
    )
    train_dataloader = accelerator.prepare(train_dataloader)
else:
    train_dataloader.set_epoch(metrics["num_epochs"])
```
`set_epoch` is called on `train_dataloader`, but plain PyTorch `DataLoader`s do not have this method by default; it exists only on wrappers that provide it (for example, Accelerate-prepared dataloaders), so this will raise an `AttributeError` otherwise. Consider guarding the call with `hasattr`, or document the requirement.
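A guarded version could look like this sketch; `maybe_set_epoch` and the fallback to the sampler are illustrative, not code from the PR:

```python
def maybe_set_epoch(dataloader, epoch):
    """Call set_epoch only when the dataloader (or its sampler) supports it."""
    if hasattr(dataloader, "set_epoch"):
        dataloader.set_epoch(epoch)
    elif hasattr(getattr(dataloader, "sampler", None), "set_epoch"):
        dataloader.sampler.set_epoch(epoch)

class _FakeLoader:
    """Stand-in for a wrapped dataloader that records the epoch it was given."""
    def __init__(self):
        self.epoch = None
    def set_epoch(self, epoch):
        self.epoch = epoch

loader = _FakeLoader()
maybe_set_epoch(loader, 3)
```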
```python
    **cfg.trainer.train,
    merge=True,
    seed=metrics["num_epochs"],
    epoch=metrics["num_epochs"],
```
The `epoch` parameter is passed to `get_dataloader` but is not defined in the function signature, so this call will raise a `TypeError`.
```python
xq, xk, xv = (
    self.qkv(self.attention_norm(x))
    .reshape(batch_size, seq_len, self.config.num_attention_heads, self.d_head * 3)
    .chunk(3, axis=-1)
```
`chunk(3, axis=-1)` uses an `axis` keyword, but `torch.chunk` takes `dim`, not `axis`. This should be `chunk(3, dim=-1)` instead.
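A standalone toy example of the corrected call, with made-up shapes in place of the model's real dimensions:

```python
import torch

batch_size, seq_len, num_heads, d_head = 2, 5, 4, 8
# Fused QKV output: the last dimension holds Q, K, and V side by side.
qkv_out = torch.randn(batch_size, seq_len, num_heads, d_head * 3)

# torch.chunk / Tensor.chunk take `dim`, not the NumPy-style `axis`.
xq, xk, xv = qkv_out.chunk(3, dim=-1)
```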
```python
attn_weights = xq.permute(0, 2, 1, 3) @ xk.permute(0, 2, 3, 1) / (xq.size(-1) ** 0.5)
if pad_mask is not None:
    attn_weights = attn_weights + pad_mask
if attention_mask is not None:
    attn_weights = attn_weights * attention_mask
```
Inconsistent attention mask handling: when `output_attentions` is `True`, the mask is applied multiplicatively (`attn_weights * attention_mask`), while the SDPA path converts it to bool. The multiplicative form, with `attention_mask` expanded to shape `(B, H, L, L)`, implies 0/1 values, but pre-softmax masks are normally additive, with `-inf` at masked positions. Multiplying a logit by 0 still leaves that position with nonzero probability after softmax, so attention may be computed incorrectly when `output_attentions=True`.
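A dependency-free sketch of why multiplying pre-softmax logits by a 0/1 mask differs from additive `-inf` masking:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, -1.0, 0.5]
mask = [1, 0, 1]  # position 1 should receive zero attention

# Multiplicative masking merely zeroes the logit; exp(0) = 1, so the
# masked position still gets a nonzero share of the probability mass.
mult = softmax([l * m for l, m in zip(logits, mask)])

# Additive masking with -inf drives the probability to exactly zero.
add = softmax([l if m else float("-inf") for l, m in zip(logits, mask)])
```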
```python
UserWarning(
    f"Skipping model compilation. {cfg.trainer.compile_backend} is not in available backends: {torch.compiler.list_backends()})"
)
```
`UserWarning` is instantiated but never raised, so the object is created and immediately discarded without displaying anything. Use `warnings.warn(...)` (or at minimum `raise UserWarning(...)`) so the message is actually surfaced.
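A sketch of the `warnings.warn` approach; the `maybe_compile` helper and its arguments are hypothetical stand-ins for the trainer's actual compilation logic:

```python
import warnings

def maybe_compile(model, backend, available_backends):
    """Compile when the backend is available; otherwise warn and return the model unchanged."""
    if backend not in available_backends:
        warnings.warn(
            f"Skipping model compilation. {backend} is not in available backends: {available_backends}",
            UserWarning,
        )
        return model
    return model  # placeholder for torch.compile(model, backend=backend)
```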
```python
"""
Remove ambiguous amino acids from the input sequences.

Args:
    encoded_inputs (Dict[str, List[List[int]]): Tokenized inputs with keys like 'input_ids' as tensors.

Returns:
    Dict[str, List[List[int]]]: Tokenized inputs without ambiguous amino acids.
"""
```
Docstring incorrectly states this method removes ambiguous amino acids, but it actually pads sequences and returns them as tensors. Copy-paste error from the remove_ambiguous method.
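A corrected docstring might read roughly as follows; the method name, signature, and body here are hypothetical stand-ins to make the snippet self-contained:

```python
from typing import Dict, List

def pad(encoded_inputs: Dict[str, List[List[int]]], pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """
    Pad tokenized sequences in a batch to the length of the longest sequence.

    Args:
        encoded_inputs (Dict[str, List[List[int]]]): Tokenized inputs with keys like 'input_ids'.
        pad_token_id (int): Token ID used for padding (hypothetical default).

    Returns:
        Dict[str, List[List[int]]]: Tokenized inputs padded to a uniform length.
    """
    max_len = max(len(seq) for seq in encoded_inputs["input_ids"])
    return {
        key: [seq + [pad_token_id] * (max_len - len(seq)) for seq in seqs]
        for key, seqs in encoded_inputs.items()
    }
```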
```python
class amplify_classifier(nn.Module):
    def __init__(self, pretrained_model_name_or_path, hidden_size, num_labels):
        super().__init__()
        self.trunk = AutoModel.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True)
```
The call to AutoModel.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True) will download and execute arbitrary Python code from the remote Hugging Face model repository chandar-lab/AMPLIFY_350M at runtime without pinning to an immutable revision. If that repository (or its owner account) is ever compromised, an attacker can gain code execution in any environment where this notebook is run, potentially exfiltrating data or credentials accessible to the process. To mitigate this, avoid trust_remote_code=True where possible, or at minimum pin to a specific commit/revision and vendor or review the model code before execution so that future upstream changes cannot introduce arbitrary code execution.
```python
model = model.to(device)

# Load AMPLIFY tokenizer
tokenizer = AutoTokenizer.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)
```
Similar to the model load, AutoTokenizer.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True) downloads and executes tokenizer code from a remote Hugging Face repository at runtime with no pinning to a specific revision. A compromise of that repository (or its owner) would allow an attacker to run arbitrary code in the context of this notebook when users execute it. To reduce this risk, avoid trust_remote_code=True for the tokenizer when not strictly required, or pin to a known-safe commit and vendor/review the tokenizer implementation so that untrusted upstream changes cannot introduce malicious behavior.
Force-pushed 53f2cd9 to 0dd52f7.