Force-pushed a97ed40 to 53f2cd9.
Pull request overview
This pull request introduces significant changes aimed at improving pretraining efficiency for the AMPLIFY protein language model. The changes refactor core components including the tokenizer, model architecture, training loop, data loading pipeline, and dependencies.
Changes:
- Refactored tokenizer from custom implementation to HuggingFace PreTrainedTokenizerFast with new features (ambiguous token handling, position IDs)
- Updated model architecture to support flash attention, packed sequences, and compiled models
- Modernized training pipeline with new dataloader features, checkpoint management, and optimizer/scheduler implementations
- Updated dependencies to newer versions (PyTorch 2.5+, flash-attn, updated transformers/accelerate/deepspeed)
- Changed output directory naming convention from "outputs" to "logs"
Reviewed changes
Copilot reviewed 43 out of 44 changed files in this pull request and generated 25 comments.
Summary per file
| File | Description |
|---|---|
| tests/fixtures.py | Updated test output paths from "outputs" to "logs" |
| tests/example-configs/frankenstein-config/config.yaml | Updated test config paths from "outputs" to "logs" |
| src/amplify/trainer/trainer.py | Major refactoring of training loop with new dataloader integration, flash attention support, packed sequences, model compilation, improved checkpoint management |
| src/amplify/tokenizer/tokenizer.py | Deleted - functionality moved to model module |
| src/amplify/tokenizer/__init__.py | Deleted - module removed |
| src/amplify/scheduler/scheduler.py | Complete rewrite using SequentialLR with separate warmup, constant, and decay phases |
| src/amplify/optimizer/optimizer.py | Simplified to only support AdamW with fused option |
| src/amplify/model/tokenizer.py | New HuggingFace-based tokenizer with enhanced features |
| src/amplify/model/rotary.py | Simplified rotary embeddings implementation, removed docstrings |
| src/amplify/model/rmsnorm.py | Deleted - replaced with PyTorch built-in nn.RMSNorm |
| src/amplify/model/amplify.py | Major architecture changes: unified QKV projection, flash attention, simplified config, removed dropout, new checkpoint loading |
| src/amplify/model/__init__.py | Updated exports to include ProteinTokenizer |
| src/amplify/dataset/transformable_iterable_dataset.py | New wrapper class for iterable datasets with transform support |
| src/amplify/dataset/iterable_protein_dataset.py | Deleted - replaced with HuggingFace datasets |
| src/amplify/dataset/dataloader.py | Complete rewrite using HuggingFace datasets with new features (on-the-fly tokenization, packing, shuffling) |
| src/amplify/dataset/data_collator.py | Deleted - replaced with HuggingFace DataCollatorForLanguageModeling |
| src/amplify/dataset/__init__.py | Updated exports for new dataset classes |
| src/amplify/__init__.py | Updated exports to reflect module reorganization |
| scripts/fasta_to_csv.py | Fixed bug where ">" prefix was included in sequence names |
| pyproject.toml | Updated all dependencies to newer versions |
| framework-integrations/sagemaker/training/code/conf/config.yaml | Updated hydra output path from "outputs" to "logs" |
| examples/ss_prediction.ipynb | New example notebook for secondary structure prediction |
| conf/trainer/span.yaml | Deleted - span masking configuration removed |
| conf/trainer/mlm.yaml | Updated MLM config with new dataloader parameters |
| conf/trainer/base.yaml | Reorganized checkpoint config, added model compilation option, changed paths to "logs" |
| conf/tokenizer/amplify_vocab.txt | Added ambiguous amino acid tokens (X, B, O, U, Z, J) |
| conf/tokenizer/amplify_tokenizer.yaml | Updated vocab size, added ambiguous token IDs and removal flag |
| conf/scheduler/*.yaml | Removed warm restart configs, updated parameter names for new scheduler implementation |
| conf/optimizer/adam.yaml | Deleted - only AdamW supported now |
| conf/optimizer/adamw.yaml | Added fused optimizer option |
| conf/model/esm.yaml | Deleted - ESM config removed |
| conf/model/amplify.yaml | Simplified config, removed dropout and normalization options |
| conf/dataset/*.yaml | Reorganized dataset configs with new structure |
| conf/dataloader/base.yaml | New dataloader configuration file |
| conf/config.yaml | Updated defaults and wandb settings |
| conf/accelerate_compiled_deepspeed_zero2.yaml | New accelerate config with compilation support |
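For context on the scheduler rewrite listed above, a `SequentialLR` with separate warmup, constant, and decay phases can be sketched as follows. The phase lengths, factors, and optimizer settings are illustrative guesses, not the PR's actual configuration:

```python
import torch
from torch.optim.lr_scheduler import ConstantLR, LinearLR, SequentialLR

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Three phases chained by SequentialLR: linear warmup, constant plateau, linear decay.
warmup = LinearLR(optimizer, start_factor=0.01, end_factor=1.0, total_iters=100)
constant = ConstantLR(optimizer, factor=1.0, total_iters=1000)
decay = LinearLR(optimizer, start_factor=1.0, end_factor=0.1, total_iters=400)

# Milestones mark the steps at which the next sub-scheduler takes over.
scheduler = SequentialLR(optimizer, [warmup, constant, decay], milestones=[100, 1100])
```

`SequentialLR` keeps a single learning-rate timeline, which avoids the restart discontinuities of the removed warm-restart configs.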
```python
train_dataset, train_dataloader = get_dataloader(
    **cfg.tokenizer,
    **cfg.dataloader.train,
    **cfg.dataset.train,
    **cfg.trainer.train,
    merge=True,
    seed=metrics["num_epochs"],
    epoch=metrics["num_epochs"],
    return_labels=False,
)
```
The `return_labels` parameter is passed to `get_dataloader`, but the new dataloader implementation no longer accepts it. Remove the argument from this call, since it is not in the function signature.
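One defensive pattern for this class of bug — merging several config dicts into a call whose signature has changed — is to filter the kwargs against the callee's signature before calling. This is only a sketch; the `get_dataloader` stand-in below is hypothetical, not the PR's function:

```python
import inspect

def call_with_supported_kwargs(fn, **kwargs):
    """Drop keyword arguments that fn does not accept, then call it."""
    params = inspect.signature(fn).parameters
    # If fn itself takes **kwargs, everything is accepted as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return fn(**kwargs)
    supported = {k: v for k, v in kwargs.items() if k in params}
    return fn(**supported)

# Hypothetical stand-in for the new get_dataloader signature (no return_labels).
def get_dataloader(merge=False, seed=0):
    return (merge, seed)

# return_labels is silently dropped instead of raising a TypeError.
result = call_with_supported_kwargs(get_dataloader, merge=True, seed=1, return_labels=False)
```

Note the tradeoff: silently dropping unknown kwargs can also hide genuine typos, so explicitly removing the stale argument (as the comment suggests) is the cleaner fix.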
```python
if bool(cfg.trainer.tf32):
    torch.backends.fp32_precision = "tf32"
    torch.backends.cuda.matmul.fp32_precision = "tf32"
    torch.backends.cudnn.fp32_precision = "tf32"
```
The `torch.backends.fp32_precision` attribute (and the matching `fp32_precision` attributes on `torch.backends.cuda.matmul` and `torch.backends.cudnn`) is not available in the PyTorch 2.5-era releases this project pins, so these assignments will raise an `AttributeError` there. The long-standing equivalents are `torch.backends.cuda.matmul.allow_tf32` and `torch.backends.cudnn.allow_tf32`.
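A minimal sketch of the widely available flags (assuming only that `torch` is importable):

```python
import torch

def enable_tf32():
    # Long-standing TF32 switches; settable even on CPU-only builds.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

enable_tf32()
```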
```python
    )
    train_dataloader = accelerator.prepare(train_dataloader)
else:
    train_dataloader.set_epoch(metrics["num_epochs"])
```
`set_epoch` is called on `train_dataloader`, but plain PyTorch `DataLoader`s do not have this method by default; it exists only on wrappers that provide it (for example, Accelerate-prepared dataloaders), so this will raise an `AttributeError` otherwise. Consider guarding the call with `hasattr`, or document the requirement.
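A guarded version could look like this sketch; `maybe_set_epoch` and the fallback to the sampler are illustrative, not code from the PR:

```python
def maybe_set_epoch(dataloader, epoch):
    """Call set_epoch only when the dataloader (or its sampler) supports it."""
    if hasattr(dataloader, "set_epoch"):
        dataloader.set_epoch(epoch)
    elif hasattr(getattr(dataloader, "sampler", None), "set_epoch"):
        dataloader.sampler.set_epoch(epoch)

class _FakeLoader:
    """Stand-in for a wrapped dataloader that records the epoch it was given."""
    def __init__(self):
        self.epoch = None
    def set_epoch(self, epoch):
        self.epoch = epoch

loader = _FakeLoader()
maybe_set_epoch(loader, 3)
```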
```python
    **cfg.trainer.train,
    merge=True,
    seed=metrics["num_epochs"],
    epoch=metrics["num_epochs"],
```
The `epoch` parameter is passed to `get_dataloader` but is not defined in the function signature, so this call will raise a `TypeError`.
```python
xq, xk, xv = (
    self.qkv(self.attention_norm(x))
    .reshape(batch_size, seq_len, self.config.num_attention_heads, self.d_head * 3)
    .chunk(3, axis=-1)
```
`chunk(3, axis=-1)` uses an `axis` keyword, but `torch.chunk` takes `dim`, not `axis`. This should be `chunk(3, dim=-1)` instead.
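A standalone toy example of the corrected call, with made-up shapes in place of the model's real dimensions:

```python
import torch

batch_size, seq_len, num_heads, d_head = 2, 5, 4, 8
# Fused QKV output: the last dimension holds Q, K, and V side by side.
qkv_out = torch.randn(batch_size, seq_len, num_heads, d_head * 3)

# torch.chunk / Tensor.chunk take `dim`, not the NumPy-style `axis`.
xq, xk, xv = qkv_out.chunk(3, dim=-1)
```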
```python
attn_weights = xq.permute(0, 2, 1, 3) @ xk.permute(0, 2, 3, 1) / (xq.size(-1) ** 0.5)
if pad_mask is not None:
    attn_weights = attn_weights + pad_mask
if attention_mask is not None:
    attn_weights = attn_weights * attention_mask
```
Inconsistent attention mask handling: when `output_attentions` is `True`, the mask is applied multiplicatively (`attn_weights * attention_mask`), while the SDPA path converts it to bool. The multiplicative form, with `attention_mask` expanded to shape `(B, H, L, L)`, implies 0/1 values, but pre-softmax masks are normally additive, with `-inf` at masked positions. Multiplying a logit by 0 still leaves that position with nonzero probability after softmax, so attention may be computed incorrectly when `output_attentions=True`.
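A dependency-free sketch of why multiplying pre-softmax logits by a 0/1 mask differs from additive `-inf` masking:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, -1.0, 0.5]
mask = [1, 0, 1]  # position 1 should receive zero attention

# Multiplicative masking merely zeroes the logit; exp(0) = 1, so the
# masked position still gets a nonzero share of the probability mass.
mult = softmax([l * m for l, m in zip(logits, mask)])

# Additive masking with -inf drives the probability to exactly zero.
add = softmax([l if m else float("-inf") for l, m in zip(logits, mask)])
```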
```python
UserWarning(
    f"Skipping model compilation. {cfg.trainer.compile_backend} is not in available backends: {torch.compiler.list_backends()})"
)
```
`UserWarning` is instantiated but never raised, so the object is created and immediately discarded without displaying anything. Use `warnings.warn(...)` (or at minimum `raise UserWarning(...)`) so the message is actually surfaced.
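A sketch of the `warnings.warn` approach; the `maybe_compile` helper and its arguments are hypothetical stand-ins for the trainer's actual compilation logic:

```python
import warnings

def maybe_compile(model, backend, available_backends):
    """Compile when the backend is available; otherwise warn and return the model unchanged."""
    if backend not in available_backends:
        warnings.warn(
            f"Skipping model compilation. {backend} is not in available backends: {available_backends}",
            UserWarning,
        )
        return model
    return model  # placeholder for torch.compile(model, backend=backend)
```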
```python
"""
Remove ambiguous amino acids from the input sequences.

Args:
    encoded_inputs (Dict[str, List[List[int]]): Tokenized inputs with keys like 'input_ids' as tensors.

Returns:
    Dict[str, List[List[int]]]: Tokenized inputs without ambiguous amino acids.
"""
```
Docstring incorrectly states this method removes ambiguous amino acids, but it actually pads sequences and returns them as tensors. Copy-paste error from the remove_ambiguous method.
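A corrected docstring might read roughly as follows; the method name, signature, and body here are hypothetical stand-ins to make the snippet self-contained:

```python
from typing import Dict, List

def pad(encoded_inputs: Dict[str, List[List[int]]], pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """
    Pad tokenized sequences in a batch to the length of the longest sequence.

    Args:
        encoded_inputs (Dict[str, List[List[int]]]): Tokenized inputs with keys like 'input_ids'.
        pad_token_id (int): Token ID used for padding (hypothetical default).

    Returns:
        Dict[str, List[List[int]]]: Tokenized inputs padded to a uniform length.
    """
    max_len = max(len(seq) for seq in encoded_inputs["input_ids"])
    return {
        key: [seq + [pad_token_id] * (max_len - len(seq)) for seq in seqs]
        for key, seqs in encoded_inputs.items()
    }
```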
```python
class amplify_classifier(nn.Module):
    def __init__(self, pretrained_model_name_or_path, hidden_size, num_labels):
        super().__init__()
        self.trunk = AutoModel.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True)
```
The call to AutoModel.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True) will download and execute arbitrary Python code from the remote Hugging Face model repository chandar-lab/AMPLIFY_350M at runtime without pinning to an immutable revision. If that repository (or its owner account) is ever compromised, an attacker can gain code execution in any environment where this notebook is run, potentially exfiltrating data or credentials accessible to the process. To mitigate this, avoid trust_remote_code=True where possible, or at minimum pin to a specific commit/revision and vendor or review the model code before execution so that future upstream changes cannot introduce arbitrary code execution.
```python
model = model.to(device)

# Load AMPLIFY tokenizer
tokenizer = AutoTokenizer.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True)
```
Similar to the model load, AutoTokenizer.from_pretrained("chandar-lab/AMPLIFY_350M", trust_remote_code=True) downloads and executes tokenizer code from a remote Hugging Face repository at runtime with no pinning to a specific revision. A compromise of that repository (or its owner) would allow an attacker to run arbitrary code in the context of this notebook when users execute it. To reduce this risk, avoid trust_remote_code=True for the tokenizer when not strictly required, or pin to a known-safe commit and vendor/review the tokenizer implementation so that untrusted upstream changes cannot introduce malicious behavior.
Force-pushed 53f2cd9 to 0dd52f7.