Fix NaN gene names crashing segmentation (#900)#1043
Merged
Conversation
RNA-seq .cnr files from import-rna can have NaN in the gene column for Ensembl IDs without HUGO symbols. This caused TypeError in ",".join() during segmentation (transfer_fields) and segment squashing (squash_region). Fix at three levels: filter non-string values when joining gene names in transfer_fields, drop NaN before joining in squash_region, and fill NaN gene names with "-" at the import-rna source. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1043 +/- ##
==========================================
- Coverage 66.10% 66.07% -0.03%
==========================================
Files 72 72
Lines 7375 7382 +7
Branches 1294 1296 +2
==========================================
+ Hits 4875 4878 +3
- Misses 2061 2064 +3
- Partials 439 440 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Move NaN-gene filtering into skgenome.combiners.join_strings so all callers (merge, flatten, squash, genemetrics) are covered. Use dict.fromkeys() for order-preserving dedup instead of pd.unique() to avoid a FutureWarning on plain-list input. Return "-" when all values are NaN, matching the IGNORE_GENE_NAMES convention. Simplify segfilters.squash_region to delegate to join_strings instead of reimplementing NaN filtering. Strengthen tests: add assertNotIn for "nan" and cover the all-NaN edge case. Follow-up to 94939f5 (issue #900 initial fix), informed by code review. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add an `ignore` keyword argument to `skgenome.combiners.join_strings`
so that cnvlib callers can pass domain-specific names to exclude
(e.g. IGNORE_GENE_NAMES) without skgenome needing to know about them.
Simplify `transfer_fields` in segmentation to use `join_strings` with
the new `ignore` parameter, replacing the ad-hoc list comprehension.
Fix two latent NaN gene-name bugs in plots.py:
- `gene_coords_by_name`: NaN in gene set would crash `.split(",")`.
- `gene_coords_by_range`: `str(NaN)` produced "nan" as a gene label.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change `ignore: tuple[str, str, str]` to `tuple[str, ...]` in squash_genes, get_gene_intervals, and transfer_fields. This removes 2 `# type: ignore[assignment]` comments that were needed because concatenating IGNORE_GENE_NAMES with ANTITARGET_ALIASES changed the tuple length. Fix NaN gene name bug in reports.get_gene_intervals: `str(row.gene)` on NaN produced the string "nan", which would be tallied as a real gene. Now skips non-string gene values. Remove vestigial `np.nan` from the ignore tuple in reports.gene_metrics — by_gene() already filters NaN via _get_gene_map(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
.cnrfiles fromimport-rnacan haveNaNin thegenecolumn for Ensembl IDs without HUGO symbols, causingTypeErrorin",".join()during segmentationtransfer_fields, drop NaN before joining insquash_region, and fill NaN gene names with"-"at theimport-rnasourceTest plan
test_transfer_fields_nan_gene— verifiestransfer_fieldsproduces valid gene strings with NaN inputtest_squash_region_nan_gene— verifiessquash_regionproduces valid gene strings with NaN inputCloses #900
🤖 Generated with Claude Code