Fix unused imports, min_redundant_inds remainder bug, and ClassBalancedSampler index mapping#70
Conversation
…dSampler Co-authored-by: rhoadesScholar <37990507+rhoadesScholar@users.noreply.github.com> Agent-Logs-Url: https://github.com/janelia-cellmap/cellmap-data/sessions/1e323ffb-2a14-4aae-99e1-8e70c1d393d3
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## rewrite #70 +/- ##
===========================================
- Coverage 81.36% 80.97% -0.40%
===========================================
Files 23 23
Lines 1567 1603 +36
===========================================
+ Hits 1275 1298 +23
- Misses 292 305 +13 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR addresses lint/correctness issues in CellMap-Data’s sampling utilities and cleans up unused test imports, improving both reliability and code hygiene.
Changes:
- Remove unused imports in multiple test modules to resolve F401 lint violations.
- Fix
min_redundant_indsto always return exactlykindices whenreplacement=Falseandk > n(including remainder handling). - Fix
ClassBalancedSamplerby adding an early “no active classes” guard and mapping crop-matrix rows to actual dataset sample indices during iteration.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
tests/test_sampler.py |
Removes unused pytest import. |
tests/test_multidataset.py |
Removes unused imports (torch, tempfile, pathlib). |
tests/test_image.py |
Removes unused local zarr import. |
tests/test_geometry.py |
Removes unused pytest import. |
tests/test_dataset.py |
Removes unused local json import. |
src/cellmap_data/utils/misc.py |
Fixes remainder handling in min_redundant_inds for k > n without replacement. |
src/cellmap_data/sampler.py |
Adds empty-active-class guard and corrects sampler output to yield dataset sample indices instead of crop-matrix row indices. |
src/cellmap_data/sampler.py
Outdated
| # Map crop index (dataset-level row) to an actual sample index. | ||
| # If n_crops equals len(dataset), the crop index IS the sample index. | ||
| if self.n_crops == len(self.dataset): | ||
| sample_idx = crop_idx | ||
| elif hasattr(self.dataset, "datasets") and hasattr( |
There was a problem hiding this comment.
The sampler now yields a mapped sample_idx, but the surrounding terminology still calls this a “crop index” (and the class docstring describes yielding “that crop index”). This makes the API/behavior confusing for callers and future maintainers. Consider renaming crop_idx to something like row_idx (matrix row) and updating the docstring/comments to reflect that __iter__ yields dataset sample indices (global indices for Concat/MultiDataset).
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
… fix naming and docstrings Co-authored-by: rhoadesScholar <37990507+rhoadesScholar@users.noreply.github.com> Agent-Logs-Url: https://github.com/janelia-cellmap/cellmap-data/sessions/b1ce1f6e-fdc4-47b9-aa82-562c86c4800d
| if self.n_crops <= 1 or total <= 0: | ||
| start, end = 0, max(total, 1) |
There was a problem hiding this comment.
In the generic fallback mapping, when len(self.dataset) == 0 this branch sets end = max(total, 1) and can yield sample_idx = 0, which is an invalid index for an empty dataset (and will likely crash downstream when DataLoader tries to fetch it). Consider explicitly erroring out when total == 0 (or forcing samples_per_epoch to 0) instead of manufacturing a [0, 1) range.
| if self.n_crops <= 1 or total <= 0: | |
| start, end = 0, max(total, 1) | |
| if total <= 0: | |
| raise ValueError( | |
| "ClassBalancedSampler: cannot sample from an empty dataset " | |
| "in generic fallback mapping (len(dataset) == 0). " | |
| "Consider setting samples_per_epoch=0 or removing the " | |
| "sampler for this split." | |
| ) | |
| if self.n_crops <= 1: | |
| start, end = 0, total |
| def test_raises_when_no_active_classes(self): | ||
| """All-False crop-class matrix must raise ValueError immediately.""" | ||
| matrix = np.zeros((4, 3), dtype=bool) | ||
| with pytest.raises(ValueError, match="no active classes"): | ||
| self._make_sampler(matrix, samples_per_epoch=5) |
There was a problem hiding this comment.
PR description says import pytest was removed from tests/test_sampler.py, but this file still imports (and now uses) pytest in the newly added test_raises_when_no_active_classes. Consider updating the PR description to avoid confusion when reviewing the unused-import cleanup.
Addresses several correctness and lint issues flagged in code review on the rewrite PR.
Unused imports
Removed F401 violations across test files:
import jsonfromtest_dataset.pyimport pytestfromtest_geometry.py,test_sampler.pyimport tempfile, pathlib,import torchfromtest_multidataset.pyimport zarr as zfromtest_image.pymin_redundant_indsremainder bugWhen
replacement=Falseandk > n, the previous implementation silently dropped indices whenk % n != 0:ClassBalancedSamplerfixesEmpty active classes guard —
__init__now raisesValueErrorimmediately if the crop-class matrix has no annotated classes, rather than crashing with a cryptic error onactive_counts.min()at iteration time.Correct sample index mapping —
__iter__previously yieldedcrop_idxdirectly, which is a row in the crop-class matrix (one row per dataset/sub-dataset), not a sample index. This meant only indices[0, n_crops)were ever yielded, ignoring the vast majority of samples. Now maps crop rows to actual sample indices:n_crops == len(dataset): direct mapping (no-op, preserves existing behavior)datasets+cumulative_sizes(CellMapMultiDataset/ConcatDataset): samples randomly within the matched sub-dataset's global index range[0, len(dataset))inton_cropscontiguous segments and samples within the crop's segment✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.