Fix bring-up bugs (transformers >=5 / datasets >=4 / py3.14) #1
Open
While bringing up the DNABERT-Epi training pipeline against current
PyPI deps (torch 2.11+, transformers 5.x, datasets 4.x) on Python 3.14
we hit a small set of upstream bugs that block end-to-end execution.
Each is independently fixable and none changes algorithmic behaviour.
1. src/models/pair_finetuning_dnabert.py
- load_sequence_data: dataset_dict["sgRNA"] → dataset_dict["sgrna"]
to match the key data_loader.load_dataset_information actually
returns (data_loader.py:168). Was a hard KeyError on first iter.
- Same function: sgrna_seqs / dna_seqs are initialised as plain
lists but were then indexed as if they were dicts
(`sgrna_seqs[dataset_name].extend(...)`) — TypeError. The intent
is clearly to accumulate across all datasets, so the fix calls
`.extend(...)` directly on the lists.
- generate_random_sequence_input: `range((n_samples/n_sgrna)//6)`
yields a float on Py3, which `range()` rejects. Wrap in `int()`.
- Trainer instantiation: `tokenizer=` kwarg was renamed to
`processing_class=` in transformers 5. Old kwarg now TypeErrors.
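The three fixes inside pair_finetuning_dnabert.py can be sketched together. All values, the loop shape, and the shim helper below are illustrative assumptions, not code copied from the repo; the stand-in Trainer classes only mimic the one keyword that changed between transformers versions.

```python
import inspect

# Fix: accumulate sequences in plain lists (loop shape assumed).
sgrna_seqs, dna_seqs = [], []
dataset_dict = {
    "set_a": {"sgrna": ["ACGT", "ACGG"], "dna": ["TTTT", "TTTA"]},
    "set_b": {"sgrna": ["GGGG"], "dna": ["CCCC"]},
}
for dataset_name, entry in dataset_dict.items():
    # Before: sgrna_seqs[dataset_name].extend(...) -> TypeError, since a
    # list cannot be indexed by a dataset-name string.
    sgrna_seqs.extend(entry["sgrna"])
    dna_seqs.extend(entry["dna"])

# Fix: range() rejects the float produced by true division.
n_samples, n_sgrna = 1000, 7               # illustrative values
steps = int((n_samples / n_sgrna) // 6)    # floor-div of a float is still a float

# Fix: transformers 5 renamed Trainer's tokenizer= kwarg.
def tokenizer_kwarg(trainer_cls, tok):
    """Pick whichever keyword this Trainer class accepts (hypothetical shim)."""
    params = inspect.signature(trainer_cls.__init__).parameters
    key = "processing_class" if "processing_class" in params else "tokenizer"
    return {key: tok}

class OldTrainer:    # stand-in for transformers < 5
    def __init__(self, model=None, tokenizer=None): ...

class NewTrainer:    # stand-in for transformers >= 5
    def __init__(self, model=None, processing_class=None): ...
```

A version-detecting shim like `tokenizer_kwarg` is only needed if the code must run on both major versions; the PR itself simply renames the kwarg.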
2. src/models/data_loader.py — BalancedSampler.__init__
`isinstance(dataset, Dataset)` checks torch.utils.data.Dataset, but
the actual object is a HuggingFace datasets.arrow_dataset.Dataset
(unrelated classes). It therefore falls through to the else-branch,
which expects a torch.Tensor from `dataset["labels"]`; datasets>=4
returns a `Column` instead, so `.tolist()` never runs and `self.labels`
ends up as a sequence of 0-d Tensors. The defaultdict then buckets each
tensor by id(), so the int-key check `0 in label_to_indices` fails even
when both classes are present (`ValueError: Dataset must contain both
classes 0 and 1`). Fix: normalise once via a more permissive detection
cascade (Tensor → .tolist() → list comprehension) so self.labels is
always a list of plain Python ints.
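The cascade can be sketched as follows. `normalise_labels` is a hypothetical helper name, and `FakeColumn` only stands in for a datasets>=4 Column (assumption: `.tolist()` is the only method that matters here); the real BalancedSampler code may differ in shape.

```python
def normalise_labels(labels):
    """Coerce any label container to a list of plain Python ints.
    Sketch of the detection cascade: Tensor/Column -> .tolist(),
    then a list comprehension to flatten 0-d tensors or scalars."""
    if hasattr(labels, "tolist"):      # torch.Tensor / np.ndarray / Column
        labels = labels.tolist()
    return [int(x) for x in labels]

# Minimal stand-in for a datasets>=4 Column.
class FakeColumn:
    def __init__(self, data):
        self._data = data

    def tolist(self):
        return list(self._data)
```

With labels normalised to ints, the `0 in label_to_indices` membership check works regardless of what the dataset returned.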
3. src/models/dnabert_module.py — test_scratch + test_transfer
`test_dataset["labels"].numpy()` → `np.asarray(list(...))`. Same
datasets>=4 root cause: a `Column` has no `.numpy()`, so this raised
AttributeError: 'Column' object has no attribute 'numpy'.
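The replacement works for any iterable of labels, old or new return type. A minimal sketch with a hypothetical stand-in for `test_dataset["labels"]`:

```python
import numpy as np

# Stand-in for test_dataset["labels"]: a datasets>=4 Column is iterable
# but has no .numpy(), so materialise it through list() first.
labels_column = [0, 1, 1, 0, 1]            # hypothetical label values

labels = np.asarray(list(labels_column))   # replaces the old .numpy() call
positives = int(labels.sum())              # plain ndarray from here on
```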
4. src/models/dnabert2_module.py (new file)
run_preprocess.py imports `models.dnabert2_module` unconditionally
but the file was never committed; ImportError before any work.
Stub keeps the import resolvable. The module is never actually
called on the DNABERT (no-2) code path.
End-to-end verification on Lazzarotto 2020 GUIDE-seq fold 0
(scratch mode, batch 128, 8 epochs, RTX 4070): pair-finetune converges
to train_loss 0.060, the off-target classifier converges to train_loss
0.0028, the checkpoint (fold0_iter0.pth, 82 MB) is saved successfully,
and downstream evaluation runs.
Note: I left `config.yaml` untouched in this PR. Bringing up on a
fresh host also requires adding two empty top-level dicts
(`dataset_name: {}` and `model_info: {}`) — every entry-point writes
into them without first creating the parent (e.g.
run_preprocess.py:46-47, run_model.py:54-56,
pair_finetuning_dnabert.py:250). That's a config-schema change rather
than a code fix and worth surfacing on its own (or handling
defensively with `setdefault` in check_set.CheckConfig).
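As a defensive alternative to editing config.yaml, the parent dicts can be created at load time. `ensure_config_sections` is a hypothetical helper name; the idea matches the `setdefault` approach suggested for check_set.CheckConfig.

```python
def ensure_config_sections(config: dict) -> dict:
    """Create the top-level dicts every entry-point writes into,
    without clobbering sections that already exist."""
    config.setdefault("dataset_name", {})
    config.setdefault("model_info", {})
    return config

cfg = ensure_config_sections({"seed": 42})
```

`setdefault` is a no-op when the key is already present, so this is safe to call from every entry-point.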
matteius pushed a commit to opensensor/bionpu that referenced this pull request (Apr 28, 2026):
Headline numbers from end-to-end paper replication on the ProArt
RTX 4070 host (2026-04-28):
Lazzarotto 2020 GUIDE-seq fold 0, scratch mode, 8 epochs:
ROC-AUC 0.9824 (paper: 0.9857 ± 0.0124 across 14 folds)
PR-AUC 0.5448 (paper: 0.5501 ± 0.0673 across 14 folds)
Both within 1σ of the paper mean despite running on current 2026-era
deps (torch 2.11, transformers 5.x, datasets 4.x) instead of the
paper's pinned 2024 versions.
Bug fixes required to run the upstream pipeline on those deps are
filed as opensensor/CRISPR_DNABERT#1. Until that PR merges, run from
the submodule's working tree with those patches applied.