
Fix bring-up bugs (transformers >=5 / datasets >=4 / py3.14) #1

Open

matteius wants to merge 1 commit into main from fix/upstream-bring-up

Conversation

@matteius

While bringing up the DNABERT-Epi training pipeline against current
PyPI deps (torch 2.11+, transformers 5.x, datasets 4.x) on Python 3.14,
I hit a small set of upstream bugs that block end-to-end execution.
Each is independently fixable, and none changes algorithmic behaviour.

Bugs fixed

1. pair_finetuning_dnabert.py

| Bug | Symptom | Fix |
| --- | --- | --- |
| `dataset_dict["sgRNA"]` (capital R) | `KeyError` on the first dataset — `data_loader.load_dataset_information` returns the key as lowercase `"sgrna"` (data_loader.py:168) | use `"sgrna"` |
| `sgrna_seqs[dataset_name].extend(...)` | `TypeError: list indices must be integers` — `sgrna_seqs` is `[]`, not `{}` | call `.extend()` directly on the lists |
| `range((n_samples / n_sgrna) // 6)` | `TypeError: 'float' object cannot be interpreted as an integer` — true division returns a float on Py3 | wrap in `int(...)` |
| `Trainer(..., tokenizer=self.tokenizer)` | `TypeError: unexpected kwarg` — renamed in transformers 5 | `processing_class=self.tokenizer` |
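
A hedged sketch of how the four fixes fit together — the function shape,
the `dataset_names` loop, the import path, and the `"dna"` key are my
reconstruction for illustration, not the repo's exact code:

```python
from transformers import Trainer
from models import data_loader   # import path assumed from the file layout

def load_sequence_data(dataset_names, n_samples, n_sgrna):
    """Illustrative reconstruction; only the marked lines are the actual fixes."""
    sgrna_seqs, dna_seqs = [], []                # plain lists, not dicts (fix 2)
    for dataset_name in dataset_names:
        info = data_loader.load_dataset_information(dataset_name)
        sgrna_seqs.extend(info["sgrna"])         # lowercase key (fix 1)
        dna_seqs.extend(info["dna"])             # "dna" key is assumed here
    n_blocks = int((n_samples / n_sgrna) // 6)   # range() needs an int (fix 3)
    return sgrna_seqs, dna_seqs, n_blocks

def build_trainer(model, training_args, tokenizer):
    # transformers 5 renamed the kwarg; tokenizer= now raises TypeError (fix 4)
    return Trainer(model=model, args=training_args, processing_class=tokenizer)
```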

2. data_loader.py — BalancedSampler.__init__

`isinstance(dataset, Dataset)` checks torch.utils.data.Dataset, but
the actual object passed in is a datasets.arrow_dataset.Dataset
(the two classes are unrelated). Execution therefore falls through to
the else-branch, which expects a torch.Tensor from `dataset["labels"]`,
but datasets >= 4 returns a `Column` instead — `.tolist()` never runs
and `self.labels` ends up as a sequence of 0-d Tensors. The defaultdict
then buckets each tensor by `id()`, so the `0 in label_to_indices`
check fails even when both classes are present
(`ValueError: Dataset must contain both classes 0 and 1`).

Fix: a more permissive normalisation cascade so `self.labels` is
always a list of plain Python ints.
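
A minimal sketch of that cascade (the helper name is mine; the real
code lives inline in `BalancedSampler.__init__`):

```python
import torch

def normalise_labels(raw):
    """Coerce dataset["labels"] to a list of plain Python ints."""
    if isinstance(raw, torch.Tensor):   # original torch path, kept explicit
        raw = raw.tolist()
    elif hasattr(raw, "tolist"):        # numpy arrays and other array-likes
        raw = raw.tolist()
    # datasets >= 4 Column and plain lists fall through unchanged
    return [int(x) for x in raw]        # ints hash by value, so 0/1 bucketing works
```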

3. dnabert_module.py — test_scratch + test_transfer

test_dataset[\"labels\"].numpy()AttributeError: 'Column' object has no attribute 'numpy'. Same datasets 4.x root cause. Replaced
with np.asarray(list(...)).
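
The replacement in one helper-sized piece (the function wrapper is
mine for illustration):

```python
import numpy as np

def labels_to_numpy(test_dataset):
    # Old: test_dataset["labels"].numpy() — works only when "labels" is a Tensor.
    # New: route through list() so a datasets >= 4 Column works too.
    return np.asarray(list(test_dataset["labels"]))
```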

4. dnabert2_module.py (new file)

run_preprocess.py imports models.dnabert2_module unconditionally,
but the file is missing from the repo (ImportError before any work
runs). The module is never actually called on the DNABERT (no-2)
code path; this stub keeps the import resolvable.
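
One plausible shape for the stub (the guard body is my suggestion; an
empty module with a docstring would also satisfy the import):

```python
"""Stub for models.dnabert2_module.

run_preprocess.py imports this module unconditionally, but the
DNABERT (no-2) code path never calls into it, so an importable
placeholder is sufficient.
"""

def __getattr__(name):
    # PEP 562 module-level __getattr__: fail loudly if anything is ever called.
    raise NotImplementedError(
        f"models.dnabert2_module.{name}: DNABERT-2 support is stubbed out"
    )
```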

End-to-end verification

After these patches, on Lazzarotto 2020 GUIDE-seq fold 0 (scratch
mode, RTX 4070, batch 128, 8 epochs):

  • pair-finetune converges to train_loss 0.060
  • off-target classifier converges to train_loss 0.0028
  • checkpoint saved successfully (fold0_iter0.pth, 82 MB)
  • downstream evaluation runs (was the path that hit bug 3)

Known not-fixed-here

Bringing up on a fresh host also requires config.yaml to declare
two empty top-level dicts:

```yaml
dataset_name: {}
model_info: {}
```

Every entry-point writes into them without first creating the
parent (e.g. run_preprocess.py:46-47, run_model.py:54-56,
pair_finetuning_dnabert.py:250). I left this out of the PR because
it's a config-schema change rather than a code fix and may be worth
treating differently (e.g. defensive setdefault in
check_set.CheckConfig).
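
If the defensive route is taken instead, the shape would be roughly
this (hypothetical; I haven't written it into check_set.CheckConfig):

```python
def ensure_top_level_sections(config: dict) -> dict:
    # Create the parents every entry-point assumes, without clobbering
    # anything a user-supplied config.yaml already declares.
    config.setdefault("dataset_name", {})
    config.setdefault("model_info", {})
    return config
```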

While bringing up the DNABERT-Epi training pipeline against current
PyPI deps (torch 2.11+, transformers 5.x, datasets 4.x) on Python 3.14,
we hit a small set of upstream bugs that block end-to-end execution.
Each is independently fixable and none changes algorithmic behaviour.

1. src/models/pair_finetuning_dnabert.py
   - load_sequence_data: dataset_dict["sgRNA"] → dataset_dict["sgrna"]
     to match the key data_loader.load_dataset_information actually
     returns (data_loader.py:168). Was a hard KeyError on first iter.
   - Same function: sgrna_seqs / dna_seqs are initialised as plain
     `list` but were then indexed as if they were dicts
     (`sgrna_seqs[dataset_name].extend(...)`) — TypeError. The intent
     is clearly to accumulate across all datasets, so the fix calls
     `.extend(...)` directly on the lists.
   - generate_random_sequence_input: `range((n_samples/n_sgrna)//6)`
     yields a float on Py3, which `range()` rejects. Wrap in `int()`.
   - Trainer instantiation: `tokenizer=` kwarg was renamed to
     `processing_class=` in transformers 5. Old kwarg now TypeErrors.

2. src/models/data_loader.py — BalancedSampler.__init__
   `isinstance(dataset, Dataset)` checks torch.utils.data.Dataset, but
   the actual object is a HuggingFace datasets.arrow_dataset.Dataset
   (unrelated classes). It then falls through to the else-branch, which
   expects a torch.Tensor from `dataset["labels"]`, but datasets >= 4
   returns a `Column`, so `.tolist()` never runs and `self.labels`
   is a sequence of 0-d Tensors. The defaultdict buckets each tensor
   by id() so the int-key check `0 in label_to_indices` fails even
   when both classes are present. Normalise once via a more permissive
   detection cascade (Tensor → .tolist → list comprehension).

3. src/models/dnabert_module.py — test_scratch + test_transfer
   `test_dataset["labels"].numpy()` → `np.asarray(list(...))`. Same
   datasets>=4 root cause; Column has no .numpy().

4. src/models/dnabert2_module.py (new file)
   run_preprocess.py imports `models.dnabert2_module` unconditionally
   but the file was never committed; ImportError before any work.
   Stub keeps the import resolvable. The module is never actually
   called on the DNABERT (no-2) code path.

End-to-end verification on Lazzarotto 2020 GUIDE-seq fold 0
(scratch mode, 8 epochs, RTX 4070): pair-finetune converges to
train_loss 0.060, off-target classifier converges to train_loss
0.0028, checkpoint saved successfully, downstream evaluation runs.

Note: I left `config.yaml` untouched in this PR. Bringing up on a
fresh host also requires adding two empty top-level dicts
(`dataset_name: {}` and `model_info: {}`) — every entry-point writes
into them without first creating the parent. That's a separate
config schema change worth surfacing on its own.
matteius pushed a commit to opensensor/bionpu that referenced this pull request Apr 28, 2026
Headline numbers from end-to-end paper replication on the ProArt
RTX 4070 host (2026-04-28):

  Lazzarotto 2020 GUIDE-seq fold 0, scratch mode, 8 epochs:
    ROC-AUC  0.9824   (paper:  0.9857 ± 0.0124 across 14 folds)
    PR-AUC   0.5448   (paper:  0.5501 ± 0.0673 across 14 folds)

Both within 1σ of the paper mean despite running on current 2026-era
deps (torch 2.11, transformers 5.x, datasets 4.x) instead of the
paper's pinned 2024 versions.

Bug fixes required to run the upstream pipeline on those deps are
filed as opensensor/CRISPR_DNABERT#1. Until that PR merges, run
from the submodule working-tree patches.