Skip to content

Fix NaN gene names crashing segmentation (#900)#1043

Merged
etal merged 4 commits intomasterfrom
claude/pedantic-shtern
Mar 11, 2026
Merged

Fix NaN gene names crashing segmentation (#900)#1043
etal merged 4 commits intomasterfrom
claude/pedantic-shtern

Conversation

@etal
Copy link
Owner

@etal etal commented Mar 11, 2026

Summary

  • RNA-seq .cnr files from import-rna can have NaN in the gene column for Ensembl IDs without HUGO symbols, causing TypeError in ",".join() during segmentation
  • Fix at three levels: filter non-string values in transfer_fields, drop NaN before joining in squash_region, and fill NaN gene names with "-" at the import-rna source
  • Add two unit tests verifying NaN gene handling in both code paths

Test plan

  • test_transfer_fields_nan_gene — verifies transfer_fields produces valid gene strings with NaN input
  • test_squash_region_nan_gene — verifies squash_region produces valid gene strings with NaN input
  • Full test suite passes (69 tests)

Closes #900

🤖 Generated with Claude Code

RNA-seq .cnr files from import-rna can have NaN in the gene column for
Ensembl IDs without HUGO symbols. This caused TypeError in ",".join()
during segmentation (transfer_fields) and segment squashing
(squash_region).

Fix at three levels: filter non-string values when joining gene names
in transfer_fields, drop NaN before joining in squash_region, and fill
NaN gene names with "-" at the import-rna source.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov
Copy link

codecov bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.07%. Comparing base (71f02c3) to head (ed0466f).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
cnvlib/plots.py 25.00% 3 Missing ⚠️
cnvlib/reports.py 60.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1043      +/-   ##
==========================================
- Coverage   66.10%   66.07%   -0.03%     
==========================================
  Files          72       72              
  Lines        7375     7382       +7     
  Branches     1294     1296       +2     
==========================================
+ Hits         4875     4878       +3     
- Misses       2061     2064       +3     
- Partials      439      440       +1     
Flag Coverage Δ
unittests 66.07% <73.68%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

etal and others added 3 commits March 11, 2026 11:23
Move NaN-gene filtering into skgenome.combiners.join_strings so all
callers (merge, flatten, squash, genemetrics) are covered. Use
dict.fromkeys() for order-preserving dedup instead of pd.unique() to
avoid a FutureWarning on plain-list input. Return "-" when all values
are NaN, matching the IGNORE_GENE_NAMES convention.

Simplify segfilters.squash_region to delegate to join_strings instead
of reimplementing NaN filtering. Strengthen tests: add assertNotIn for
"nan" and cover the all-NaN edge case.

Follow-up to 94939f5 (issue #900 initial fix), informed by code review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add an `ignore` keyword argument to `skgenome.combiners.join_strings`
so that cnvlib callers can pass domain-specific names to exclude
(e.g. IGNORE_GENE_NAMES) without skgenome needing to know about them.

Simplify `transfer_fields` in segmentation to use `join_strings` with
the new `ignore` parameter, replacing the ad-hoc list comprehension.

Fix two latent NaN gene-name bugs in plots.py:
- `gene_coords_by_name`: NaN in gene set would crash `.split(",")`.
- `gene_coords_by_range`: `str(NaN)` produced "nan" as a gene label.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change `ignore: tuple[str, str, str]` to `tuple[str, ...]` in
squash_genes, get_gene_intervals, and transfer_fields. This removes
2 `# type: ignore[assignment]` comments that were needed because
concatenating IGNORE_GENE_NAMES with ANTITARGET_ALIASES changed the
tuple length.

Fix NaN gene name bug in reports.get_gene_intervals: `str(row.gene)`
on NaN produced the string "nan", which would be tallied as a real
gene. Now skips non-string gene values.

Remove vestigial `np.nan` from the ignore tuple in reports.gene_metrics
— by_gene() already filters NaN via _get_gene_map().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@etal etal merged commit 6d07828 into master Mar 11, 2026
15 checks passed
@etal etal deleted the claude/pedantic-shtern branch March 11, 2026 19:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cnvkit.py segment error

1 participant