Fix NaN gene names crashing segmentation (#900) by etal · Pull Request #1043 · etal/cnvkit

etal · 2026-03-11T17:24:33Z

Summary

RNA-seq .cnr files from import-rna can have NaN in the gene column for Ensembl IDs without HUGO symbols, causing TypeError in ",".join() during segmentation
Fix at three levels: filter non-string values in transfer_fields, drop NaN before joining in squash_region, and fill NaN gene names with "-" at the import-rna source
Add two unit tests verifying NaN gene handling in both code paths

Test plan

test_transfer_fields_nan_gene — verifies transfer_fields produces valid gene strings with NaN input
test_squash_region_nan_gene — verifies squash_region produces valid gene strings with NaN input
Full test suite passes (69 tests)

Closes #900

🤖 Generated with Claude Code

RNA-seq .cnr files from import-rna can have NaN in the gene column for Ensembl IDs without HUGO symbols. This caused TypeError in ",".join() during segmentation (transfer_fields) and segment squashing (squash_region). Fix at three levels: filter non-string values when joining gene names in transfer_fields, drop NaN before joining in squash_region, and fill NaN gene names with "-" at the import-rna source. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov · 2026-03-11T17:31:41Z

Codecov Report

❌ Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.07%. Comparing base (71f02c3) to head (ed0466f).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
cnvlib/plots.py	25.00%	3 Missing ⚠️
cnvlib/reports.py	60.00%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1043      +/-   ##
==========================================
- Coverage   66.10%   66.07%   -0.03%     
==========================================
  Files          72       72              
  Lines        7375     7382       +7     
  Branches     1294     1296       +2     
==========================================
+ Hits         4875     4878       +3     
- Misses       2061     2064       +3     
- Partials      439      440       +1

Flag	Coverage Δ
unittests	`66.07% <73.68%> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Move NaN-gene filtering into skgenome.combiners.join_strings so all callers (merge, flatten, squash, genemetrics) are covered. Use dict.fromkeys() for order-preserving dedup instead of pd.unique() to avoid a FutureWarning on plain-list input. Return "-" when all values are NaN, matching the IGNORE_GENE_NAMES convention. Simplify segfilters.squash_region to delegate to join_strings instead of reimplementing NaN filtering. Strengthen tests: add assertNotIn for "nan" and cover the all-NaN edge case. Follow-up to 94939f5 (issue #900 initial fix), informed by code review. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add an `ignore` keyword argument to `skgenome.combiners.join_strings` so that cnvlib callers can pass domain-specific names to exclude (e.g. IGNORE_GENE_NAMES) without skgenome needing to know about them. Simplify `transfer_fields` in segmentation to use `join_strings` with the new `ignore` parameter, replacing the ad-hoc list comprehension. Fix two latent NaN gene-name bugs in plots.py: - `gene_coords_by_name`: NaN in gene set would crash `.split(",")`. - `gene_coords_by_range`: `str(NaN)` produced "nan" as a gene label. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Change `ignore: tuple[str, str, str]` to `tuple[str, ...]` in squash_genes, get_gene_intervals, and transfer_fields. This removes 2 `# type: ignore[assignment]` comments that were needed because concatenating IGNORE_GENE_NAMES with ANTITARGET_ALIASES changed the tuple length. Fix NaN gene name bug in reports.get_gene_intervals: `str(row.gene)` on NaN produced the string "nan", which would be tallied as a real gene. Now skips non-string gene values. Remove vestigial `np.nan` from the ignore tuple in reports.gene_metrics — by_gene() already filters NaN via _get_gene_map(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

etal and others added 3 commits March 11, 2026 11:23

etal merged commit 6d07828 into master Mar 11, 2026
15 checks passed

etal deleted the claude/pedantic-shtern branch March 11, 2026 19:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix NaN gene names crashing segmentation (#900)#1043

Fix NaN gene names crashing segmentation (#900)#1043
etal merged 4 commits intomasterfrom
claude/pedantic-shtern

etal commented Mar 11, 2026

Uh oh!

codecov bot commented Mar 11, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

etal commented Mar 11, 2026

Summary

Test plan

Uh oh!

codecov bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codecov bot commented Mar 11, 2026 •

edited

Loading