
First commit for branch pretrain-efficiency #25

Open
lolalebreton wants to merge 1 commit into main from pretrain-efficiency

Conversation

@lolalebreton (Collaborator)

No description provided.

@lolalebreton lolalebreton marked this pull request as ready for review February 20, 2026 22:05
Copilot AI review requested due to automatic review settings February 20, 2026 22:05
@lolalebreton lolalebreton reopened this Feb 20, 2026

Copilot AI left a comment


Pull request overview

This pull request introduces significant changes aimed at improving pretraining efficiency for the AMPLIFY protein language model. The changes refactor core components including the tokenizer, model architecture, training loop, data loading pipeline, and dependencies.

Changes:

  • Refactored tokenizer from custom implementation to HuggingFace PreTrainedTokenizerFast with new features (ambiguous token handling, position IDs)
  • Updated model architecture to support flash attention, packed sequences, and compiled models
  • Modernized training pipeline with new dataloader features, checkpoint management, and optimizer/scheduler implementations
  • Updated dependencies to newer versions (PyTorch 2.5+, flash-attn, updated transformers/accelerate/deepspeed)
  • Changed output directory naming convention from "outputs" to "logs"

Reviewed changes

Copilot reviewed 43 out of 44 changed files in this pull request and generated 25 comments.

Show a summary per file

tests/fixtures.py: Updated test output paths from "outputs" to "logs"
tests/example-configs/frankenstein-config/config.yaml: Updated test config paths from "outputs" to "logs"
src/amplify/trainer/trainer.py: Major refactoring of the training loop: new dataloader integration, flash attention support, packed sequences, model compilation, improved checkpoint management
src/amplify/tokenizer/tokenizer.py: Deleted; functionality moved to the model module
src/amplify/tokenizer/__init__.py: Deleted; module removed
src/amplify/scheduler/scheduler.py: Complete rewrite using SequentialLR with separate warmup, constant, and decay phases
src/amplify/optimizer/optimizer.py: Simplified to only support AdamW with a fused option
src/amplify/model/tokenizer.py: New HuggingFace-based tokenizer with enhanced features
src/amplify/model/rotary.py: Simplified rotary embeddings implementation; removed docstrings
src/amplify/model/rmsnorm.py: Deleted; replaced with PyTorch's built-in nn.RMSNorm
src/amplify/model/amplify.py: Major architecture changes: unified QKV projection, flash attention, simplified config, removed dropout, new checkpoint loading
src/amplify/model/__init__.py: Updated exports to include ProteinTokenizer
src/amplify/dataset/transformable_iterable_dataset.py: New wrapper class for iterable datasets with transform support
src/amplify/dataset/iterable_protein_dataset.py: Deleted; replaced with HuggingFace datasets
src/amplify/dataset/dataloader.py: Complete rewrite using HuggingFace datasets with new features (on-the-fly tokenization, packing, shuffling)
src/amplify/dataset/data_collator.py: Deleted; replaced with HuggingFace DataCollatorForLanguageModeling
src/amplify/dataset/__init__.py: Updated exports for the new dataset classes
src/amplify/__init__.py: Updated exports to reflect the module reorganization
scripts/fasta_to_csv.py: Fixed a bug where the ">" prefix was included in sequence names
pyproject.toml: Updated all dependencies to newer versions
framework-integrations/sagemaker/training/code/conf/config.yaml: Updated hydra output path from "outputs" to "logs"
examples/ss_prediction.ipynb: New example notebook for secondary structure prediction
conf/trainer/span.yaml: Deleted; span masking configuration removed
conf/trainer/mlm.yaml: Updated MLM config with new dataloader parameters
conf/trainer/base.yaml: Reorganized checkpoint config, added a model compilation option, changed paths to "logs"
conf/tokenizer/amplify_vocab.txt: Added ambiguous amino acid tokens (X, B, O, U, Z, J)
conf/tokenizer/amplify_tokenizer.yaml: Updated vocab size; added ambiguous token IDs and a removal flag
conf/scheduler/*.yaml: Removed warm-restart configs; updated parameter names for the new scheduler implementation
conf/optimizer/adam.yaml: Deleted; only AdamW is supported now
conf/optimizer/adamw.yaml: Added a fused optimizer option
conf/model/esm.yaml: Deleted; ESM config removed
conf/model/amplify.yaml: Simplified config; removed dropout and normalization options
conf/dataset/*.yaml: Reorganized dataset configs with a new structure
conf/dataloader/base.yaml: New dataloader configuration file
conf/config.yaml: Updated defaults and wandb settings
conf/accelerate_compiled_deepspeed_zero2.yaml: New accelerate config with compilation support
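The SequentialLR rewrite described for src/amplify/scheduler/scheduler.py can be sketched roughly as follows. The phase lengths, factors, and base learning rate below are illustrative assumptions, not the PR's actual configuration:

```python
import torch
from torch.optim.lr_scheduler import SequentialLR, LinearLR, ConstantLR

# Hypothetical phase lengths, not taken from the PR's configs.
warmup_steps, constant_steps, decay_steps = 10, 20, 30

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=1e-3)

# Linear warmup, flat plateau, then linear decay, chained by SequentialLR.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_steps),
        ConstantLR(optimizer, factor=1.0, total_iters=constant_steps),
        LinearLR(optimizer, start_factor=1.0, end_factor=0.01, total_iters=decay_steps),
    ],
    milestones=[warmup_steps, warmup_steps + constant_steps],
)

for _ in range(warmup_steps + constant_steps + decay_steps):
    optimizer.step()
    scheduler.step()
```

After the full schedule runs, the learning rate ends at base_lr * end_factor of the decay phase.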


Comment on lines +249 to +258
train_dataset, train_dataloader = get_dataloader(
    **cfg.tokenizer,
    **cfg.dataloader.train,
    **cfg.dataset.train,
    **cfg.trainer.train,
    merge=True,
    seed=metrics["num_epochs"],
    epoch=metrics["num_epochs"],
    return_labels=False,
)
Copilot AI Feb 20, 2026

The return_labels parameter is passed to get_dataloader, but the new dataloader implementation no longer accepts it; since it is not in the function signature, the call will fail and the argument should be removed.

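One defensive pattern for call sites like this, where several config dicts are splatted into get_dataloader, is to drop keys the callee does not accept before calling it. The helper below is a sketch; filter_kwargs and the stand-in get_dataloader signature are hypothetical, not part of the PR:

```python
import inspect

def filter_kwargs(func, kwargs):
    """Keep only the keyword arguments that func actually accepts."""
    sig = inspect.signature(func)
    # If func itself takes **kwargs, every key is acceptable.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in sig.parameters.values()):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k in sig.parameters}

def get_dataloader(merge=False, seed=0):  # hypothetical stand-in signature
    return merge, seed

# return_labels is silently dropped instead of raising a TypeError.
safe = filter_kwargs(get_dataloader, {"merge": True, "seed": 3, "return_labels": False})
```

The trade-off is that typos in config keys are also silently dropped, so this fits best with a logged warning about discarded keys.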
Comment on lines +139 to +142
if bool(cfg.trainer.tf32):
    torch.backends.fp32_precision = "tf32"
    torch.backends.cuda.matmul.fp32_precision = "tf32"
    torch.backends.cudnn.fp32_precision = "tf32"
Copilot AI Feb 20, 2026

The torch.backends.fp32_precision attribute does not exist in PyTorch. The correct attributes are torch.backends.cuda.matmul.allow_tf32 and torch.backends.cudnn.allow_tf32. The new attributes like torch.backends.cuda.matmul.fp32_precision and torch.backends.cudnn.fp32_precision are not valid PyTorch attributes and will raise AttributeError.

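On the PyTorch versions this PR targets (2.5+), the long-standing way to opt in to TF32 is through the allow_tf32 flags. A minimal sketch (the flags can be set even on builds without a visible GPU):

```python
import torch

def enable_tf32(enabled: bool = True) -> None:
    # These flags control TF32 use for CUDA matmuls and cuDNN
    # convolutions respectively, and predate PyTorch 2.x.
    torch.backends.cuda.matmul.allow_tf32 = enabled
    torch.backends.cudnn.allow_tf32 = enabled

enable_tf32(True)
```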
    )
    train_dataloader = accelerator.prepare(train_dataloader)
else:
    train_dataloader.set_epoch(metrics["num_epochs"])
Copilot AI Feb 20, 2026

The set_epoch method is called on train_dataloader but PyTorch DataLoaders don't have this method by default. This will raise an AttributeError unless the dataloader has been specifically wrapped to support this method. Consider checking if this method exists before calling it, or document the requirement.

**cfg.trainer.train,
merge=True,
seed=metrics["num_epochs"],
epoch=metrics["num_epochs"],
Copilot AI Feb 20, 2026

The epoch parameter is passed to get_dataloader but this parameter is not defined in the function signature. This will cause a TypeError when called.

Comment on lines +119 to +122
xq, xk, xv = (
    self.qkv(self.attention_norm(x))
    .reshape(batch_size, seq_len, self.config.num_attention_heads, self.d_head * 3)
    .chunk(3, axis=-1)
Copilot AI Feb 20, 2026

Using chunk(3, axis=-1) with axis parameter, but torch.chunk uses dim parameter, not axis. This should be chunk(3, dim=-1) instead.

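A quick self-contained check of the intended split using the documented dim parameter (the shapes here are arbitrary stand-ins for the model's dimensions):

```python
import torch

batch_size, seq_len, num_heads, d_head = 2, 4, 8, 16

# Fused QKV output: the last dimension holds Q, K, V concatenated per head.
qkv = torch.randn(batch_size, seq_len, num_heads, d_head * 3)

# torch.chunk's documented parameter is `dim`.
xq, xk, xv = qkv.chunk(3, dim=-1)
```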
Comment on lines 147 to +149
attn_weights = xq.permute(0, 2, 1, 3) @ xk.permute(0, 2, 3, 1) / (xq.size(-1) ** 0.5)
if pad_mask is not None:
    attn_weights = attn_weights + pad_mask
if attention_mask is not None:
    attn_weights = attn_weights * attention_mask
Copilot AI Feb 20, 2026

Inconsistent attention mask handling. When output_attentions is True (line 149), the attention mask is multiplied (attn_weights * attention_mask), but for SDPA (line 160), it's converted to bool. Multiplicative masking with attention_mask expanded to shape (B, H, L, L) at line 273 suggests values should be 0 or 1, but typically additive masks use -inf for masked positions. This inconsistency may lead to incorrect attention computations when output_attentions=True.

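One way to make the two paths consistent, sketched outside the PR's code, is to convert a 0/1 keep-mask into an additive bias before adding it to the attention logits. to_additive_mask below is a hypothetical helper:

```python
import torch

def to_additive_mask(keep_mask: torch.Tensor, dtype=torch.float32) -> torch.Tensor:
    """Turn a 0/1 keep-mask into an additive attention bias:
    0.0 where attention is allowed, a very large negative value
    where it is masked, so softmax sends those positions to ~0."""
    return (1.0 - keep_mask.to(dtype)) * torch.finfo(dtype).min

# Row 0 masks the second position; row 1 allows both.
keep = torch.tensor([[1.0, 0.0], [1.0, 1.0]])
bias = to_additive_mask(keep)
logits = torch.zeros(2, 2) + bias
probs = logits.softmax(dim=-1)
```

With an additive bias, the same mask can be added to logits in the manual path and passed as a float mask to SDPA, avoiding the multiplicative/boolean mismatch.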
Comment on lines +158 to +160
UserWarning(
    f"Skipping model compilation. {cfg.trainer.compile_backend} is not in available backends: {torch.compiler.list_backends()})"
)
Copilot AI Feb 20, 2026

UserWarning is instantiated but never raised. This should be raise UserWarning(...) or use warnings.warn(...) to properly display the warning message. Currently, this creates a UserWarning object that is immediately discarded.

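A corrected version using warnings.warn might look like this; warn_if_unsupported is a hypothetical wrapper, with plain values standing in for the config entry and torch.compiler.list_backends():

```python
import warnings

def warn_if_unsupported(backend: str, available: list) -> bool:
    """Emit (rather than merely instantiate) a UserWarning when the requested
    compile backend is unavailable. Returns True if compilation can proceed."""
    if backend not in available:
        warnings.warn(
            f"Skipping model compilation. {backend} is not in available backends: {available}",
            UserWarning,
            stacklevel=2,
        )
        return False
    return True
```

Using warnings.warn (instead of raising) matches the "skip and continue" intent of the original code.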
Comment on lines +188 to +196
"""
Remove ambiguous amino acids from the input sequences.

Args:
    encoded_inputs (Dict[str, List[List[int]]]): Tokenized inputs with keys like 'input_ids' as tensors.

Returns:
    Dict[str, List[List[int]]]: Tokenized inputs without ambiguous amino acids.
"""
Copilot AI Feb 20, 2026

Docstring incorrectly states this method removes ambiguous amino acids, but it actually pads sequences and returns them as tensors. Copy-paste error from the remove_ambiguous method.

class amplify_classifier(nn.Module):
    def __init__(self, pretrained_model_name_or_path, hidden_size, num_labels):
        super().__init__()
        self.trunk = AutoModel.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True)
Copilot AI Feb 20, 2026

The call to AutoModel.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True) will download and execute arbitrary Python code from the remote Hugging Face model repository chandar-lab/AMPLIFY_350M at runtime without pinning to an immutable revision. If that repository (or its owner account) is ever compromised, an attacker can gain code execution in any environment where this notebook is run, potentially exfiltrating data or credentials accessible to the process. To mitigate this, avoid trust_remote_code=True where possible, or at minimum pin to a specific commit/revision and vendor or review the model code before execution so that future upstream changes cannot introduce arbitrary code execution.

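A pinned load along the lines the comment suggests could look like the sketch below. PINNED_REVISION is a placeholder, not a real commit SHA for chandar-lab/AMPLIFY_350M, and the function only illustrates the pattern:

```python
# Placeholder SHA: replace with the audited commit of the model repository.
PINNED_REVISION = "0000000000000000000000000000000000000000"

def load_pinned(name: str = "chandar-lab/AMPLIFY_350M"):
    from transformers import AutoModel, AutoTokenizer
    # Pinning `revision` to an immutable commit means later pushes to the
    # repository cannot change the code that trust_remote_code executes.
    model = AutoModel.from_pretrained(
        name, trust_remote_code=True, revision=PINNED_REVISION
    )
    tokenizer = AutoTokenizer.from_pretrained(
        name, trust_remote_code=True, revision=PINNED_REVISION
    )
    return model, tokenizer
```

Vendoring and reviewing the remote code, then loading it locally without trust_remote_code, removes the remote-execution surface entirely.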
model = model.to(device)

# Load AMPLIFY tokenizer
tokenizer = AutoTokenizer.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)
Copilot AI Feb 20, 2026

Similar to the model load, AutoTokenizer.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True) downloads and executes tokenizer code from a remote Hugging Face repository at runtime with no pinning to a specific revision. A compromise of that repository (or its owner) would allow an attacker to run arbitrary code in the context of this notebook when users execute it. To reduce this risk, avoid trust_remote_code=True for the tokenizer when not strictly required, or pin to a known-safe commit and vendor/review the tokenizer implementation so that untrusted upstream changes cannot introduce malicious behavior.
