
Fix bring-up bugs (transformers >=5 / datasets >=4 / py3.14) #1

Open

matteius wants to merge 1 commit into main from fix/upstream-bring-up

Conversation

@matteius

While bringing up the DNABERT-Epi training pipeline against current
PyPI deps (torch 2.11+, transformers 5.x, datasets 4.x) on Python 3.14,
I hit a small set of upstream bugs that block end-to-end execution.
Each is independently fixable, and none changes algorithmic behaviour.

Bugs fixed

1. pair_finetuning_dnabert.py

| Bug | Symptom | Fix |
| --- | --- | --- |
| `dataset_dict["sgRNA"]` (capital R) | `KeyError` on the first dataset — `data_loader.load_dataset_information` returns the key as lowercase `"sgrna"` (data_loader.py:168) | use `"sgrna"` |
| `sgrna_seqs[dataset_name].extend(...)` | `TypeError: list indices must be integers` — `sgrna_seqs` is `[]`, not `{}` | call `.extend()` directly on the lists |
| `range((n_samples / n_sgrna) // 6)` | `TypeError: 'float' object cannot be interpreted as an integer` — true division returns a float on Py3 | wrap in `int(...)` |
| `Trainer(..., tokenizer=self.tokenizer)` | `TypeError: unexpected kwarg` — renamed in transformers 5 | `processing_class=self.tokenizer` |
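
A hedged sketch of how the four fixes fit together — the function shape,
the `dataset_names` loop, the import path, and the `"dna"` key are my
reconstruction for illustration, not the repo's exact code:

```python
from transformers import Trainer
from models import data_loader   # import path assumed from the file layout

def load_sequence_data(dataset_names, n_samples, n_sgrna):
    """Illustrative reconstruction; only the marked lines are the actual fixes."""
    sgrna_seqs, dna_seqs = [], []                # plain lists, not dicts (fix 2)
    for dataset_name in dataset_names:
        info = data_loader.load_dataset_information(dataset_name)
        sgrna_seqs.extend(info["sgrna"])         # lowercase key (fix 1)
        dna_seqs.extend(info["dna"])             # "dna" key is assumed here
    n_blocks = int((n_samples / n_sgrna) // 6)   # range() needs an int (fix 3)
    return sgrna_seqs, dna_seqs, n_blocks

def build_trainer(model, training_args, tokenizer):
    # transformers 5 renamed the kwarg; tokenizer= now raises TypeError (fix 4)
    return Trainer(model=model, args=training_args, processing_class=tokenizer)
```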

2. data_loader.py — BalancedSampler.__init__

`isinstance(dataset, Dataset)` checks torch.utils.data.Dataset, but
the actual object passed in is a datasets.arrow_dataset.Dataset
(the two classes are unrelated). Execution therefore falls through to
the else-branch, which expects a torch.Tensor from `dataset["labels"]`,
but datasets >= 4 returns a `Column` instead — `.tolist()` never runs
and `self.labels` ends up as a sequence of 0-d Tensors. The defaultdict
then buckets each tensor by `id()`, so the `0 in label_to_indices`
check fails even when both classes are present
(`ValueError: Dataset must contain both classes 0 and 1`).

Fix: a more permissive normalisation cascade so `self.labels` is
always a list of plain Python ints.
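
A minimal sketch of that cascade (the helper name is mine; the real
code lives inline in `BalancedSampler.__init__`):

```python
import torch

def normalise_labels(raw):
    """Coerce dataset["labels"] to a list of plain Python ints."""
    if isinstance(raw, torch.Tensor):   # original torch path, kept explicit
        raw = raw.tolist()
    elif hasattr(raw, "tolist"):        # numpy arrays and other array-likes
        raw = raw.tolist()
    # datasets >= 4 Column and plain lists fall through unchanged
    return [int(x) for x in raw]        # ints hash by value, so 0/1 bucketing works
```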

3. dnabert_module.py — test_scratch + test_transfer

test_dataset[\"labels\"].numpy()AttributeError: 'Column' object has no attribute 'numpy'. Same datasets 4.x root cause. Replaced
with np.asarray(list(...)).
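
The replacement in one helper-sized piece (the function wrapper is
mine for illustration):

```python
import numpy as np

def labels_to_numpy(test_dataset):
    # Old: test_dataset["labels"].numpy() — works only when "labels" is a Tensor.
    # New: route through list() so a datasets >= 4 Column works too.
    return np.asarray(list(test_dataset["labels"]))
```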

4. dnabert2_module.py (new file)

run_preprocess.py imports models.dnabert2_module unconditionally,
but the file is missing from the repo (ImportError before any work
runs). The module is never actually called on the DNABERT (no-2)
code path; this stub keeps the import resolvable.
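
One plausible shape for the stub (the guard body is my suggestion; an
empty module with a docstring would also satisfy the import):

```python
"""Stub for models.dnabert2_module.

run_preprocess.py imports this module unconditionally, but the
DNABERT (no-2) code path never calls into it, so an importable
placeholder is sufficient.
"""

def __getattr__(name):
    # PEP 562 module-level __getattr__: fail loudly if anything is ever called.
    raise NotImplementedError(
        f"models.dnabert2_module.{name}: DNABERT-2 support is stubbed out"
    )
```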

End-to-end verification

After these patches, on Lazzarotto 2020 GUIDE-seq fold 0 (scratch
mode, RTX 4070, batch 128, 8 epochs):

  • pair-finetune converges to train_loss 0.060
  • off-target classifier converges to train_loss 0.0028
  • checkpoint saved successfully (fold0_iter0.pth, 82 MB)
  • downstream evaluation runs (was the path that hit bug 3)

Known not-fixed-here

Bringing up on a fresh host also requires config.yaml to declare
two empty top-level dicts:

```yaml
dataset_name: {}
model_info: {}
```

Every entry-point writes into them without first creating the
parent (e.g. run_preprocess.py:46-47, run_model.py:54-56,
pair_finetuning_dnabert.py:250). I left this out of the PR because
it's a config-schema change rather than a code fix and may be worth
treating differently (e.g. defensive setdefault in
check_set.CheckConfig).
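
If the defensive route is taken instead, the shape would be roughly
this (hypothetical; I haven't written it into check_set.CheckConfig):

```python
def ensure_top_level_sections(config: dict) -> dict:
    # Create the parents every entry-point assumes, without clobbering
    # anything a user-supplied config.yaml already declares.
    config.setdefault("dataset_name", {})
    config.setdefault("model_info", {})
    return config
```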

While bringing up the DNABERT-Epi training pipeline against current
PyPI deps (torch 2.11+, transformers 5.x, datasets 4.x) on Python 3.14,
we hit a small set of upstream bugs that block end-to-end execution.
Each is independently fixable and none changes algorithmic behaviour.

1. src/models/pair_finetuning_dnabert.py
   - load_sequence_data: dataset_dict["sgRNA"] → dataset_dict["sgrna"]
     to match the key data_loader.load_dataset_information actually
     returns (data_loader.py:168). Was a hard KeyError on first iter.
   - Same function: sgrna_seqs / dna_seqs are initialised as plain
     `list` but were then indexed as if they were dicts
     (`sgrna_seqs[dataset_name].extend(...)`) — TypeError. The intent
     is clearly to accumulate across all datasets, so the fix calls
     `.extend(...)` directly on the lists.
   - generate_random_sequence_input: `range((n_samples/n_sgrna)//6)`
     yields a float on Py3, which `range()` rejects. Wrap in `int()`.
   - Trainer instantiation: `tokenizer=` kwarg was renamed to
     `processing_class=` in transformers 5. Old kwarg now TypeErrors.

2. src/models/data_loader.py — BalancedSampler.__init__
   `isinstance(dataset, Dataset)` checks torch.utils.data.Dataset, but
   the actual object is a HuggingFace datasets.arrow_dataset.Dataset
   (unrelated classes). It then falls through to the else-branch, which
   expects a torch.Tensor from `dataset["labels"]`, but datasets >= 4
   returns a `Column`, so `.tolist()` never runs and `self.labels`
   is a sequence of 0-d Tensors. The defaultdict buckets each tensor
   by id() so the int-key check `0 in label_to_indices` fails even
   when both classes are present. Normalise once via a more permissive
   detection cascade (Tensor → .tolist → list comprehension).

3. src/models/dnabert_module.py — test_scratch + test_transfer
   `test_dataset["labels"].numpy()` → `np.asarray(list(...))`. Same
   datasets>=4 root cause; Column has no .numpy().

4. src/models/dnabert2_module.py (new file)
   run_preprocess.py imports `models.dnabert2_module` unconditionally
   but the file was never committed; ImportError before any work.
   Stub keeps the import resolvable. The module is never actually
   called on the DNABERT (no-2) code path.

End-to-end verification on Lazzarotto 2020 GUIDE-seq fold 0
(scratch mode, 8 epochs, RTX 4070): pair-finetune converges to
train_loss 0.060, off-target classifier converges to train_loss
0.0028, checkpoint saved successfully, downstream evaluation runs.

Note: I left `config.yaml` untouched in this PR. Bringing up on a
fresh host also requires adding two empty top-level dicts
(`dataset_name: {}` and `model_info: {}`) — every entry-point writes
into them without first creating the parent. That's a separate
config schema change worth surfacing on its own.
matteius pushed a commit to opensensor/bionpu that referenced this pull request Apr 28, 2026
Headline numbers from end-to-end paper replication on the ProArt
RTX 4070 host (2026-04-28):

  Lazzarotto 2020 GUIDE-seq fold 0, scratch mode, 8 epochs:
    ROC-AUC  0.9824   (paper:  0.9857 ± 0.0124 across 14 folds)
    PR-AUC   0.5448   (paper:  0.5501 ± 0.0673 across 14 folds)

Both within 1σ of the paper mean despite running on current 2026-era
deps (torch 2.11, transformers 5.x, datasets 4.x) instead of the
paper's pinned 2024 versions.

Bug fixes required to run the upstream pipeline on those deps are
filed as opensensor/CRISPR_DNABERT#1. Until that PR merges, run
from the submodule working-tree patches.