Fix UMI barcode extraction to handle FASTQ/FASTA descriptions by Stefan-JLU · Pull Request #14 · gkudla/hyb

Stefan-JLU · 2025-10-22T14:05:08Z

FASTQ format allows descriptions in the identifier line, separated by whitespace (e.g., @READID_UMI 1:N:0:1). The solexa2fasta.awk script converts the entire identifier line to FASTA format, preserving any descriptions (>@READID_UMI 1:N:0:1).

Previously, make_comp_fasta.pl had two issues when descriptions were present:

1. UMI Detection Regex: The pattern /^[^\t]*_[A-Z]*\t/ expected the UMI barcode to appear immediately before the tab character. When a description was present, the line looked like "READID_UMI description\tseq", so the regex failed: the description, not the UMI, sat immediately before the tab.
2. Barcode Extraction: The script split the entire first field (including any description) by '_' to extract the UMI. When the description contained underscores or other text, the value extracted this way was not the barcode.
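
A minimal sketch of both failure modes, written in Python rather than the script's Perl. The pattern is the one quoted above; the input lines, the helper name `old_extract`, and the choice to take the last '_'-separated piece as the barcode are assumptions for illustration, not code from make_comp_fasta.pl.

```python
import re

# Old detection pattern, as quoted in the description above.
OLD_UMI_RE = re.compile(r"^[^\t]*_[A-Z]*\t")


def old_extract(line):
    # Assumption: the old code took the last '_'-separated piece of the
    # whole first tab-delimited field, description included.
    first_field = line.split("\t", 1)[0]
    return first_field.split("_")[-1]


# Hypothetical input lines in the "ID[ description]\tsequence" layout.
plain = "READ1_ACGT\tTTGACC"
described = "READ1_ACGT 1:N:0:1\tTTGACC"
underscored = "READ1_ACGT lane_1\tTTGACC"

print(OLD_UMI_RE.search(plain) is not None)      # True: UMI sits right before the tab
print(OLD_UMI_RE.search(described) is not None)  # False: the description sits before the tab
print(repr(old_extract(described)))              # 'ACGT 1:N:0:1': description leaks into the barcode
print(repr(old_extract(underscored)))            # '1': underscore in the description wins
```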

This fix addresses both issues:

- Updated the UMI detection regex to /^\S*_[A-Z]+\s/, which matches the UMI pattern before any whitespace (space or tab).
- Modified barcode extraction to first strip the description by taking only the portion before the first whitespace, then splitting by '_'.
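
The two changes can be sketched as follows, again in Python rather than Perl. The regex is the one quoted above; the function name `extract_umi`, the sample lines, and the assumption that the barcode is the last '_'-separated piece of the read ID are illustrative only. Note that `[A-Z]+` also requires at least one letter, where the old `[A-Z]*` accepted an empty barcode.

```python
import re

# New detection pattern, as quoted in the list above.
NEW_UMI_RE = re.compile(r"^\S*_[A-Z]+\s")


def extract_umi(line):
    """Return the UMI barcode, or None when the read ID carries none."""
    if NEW_UMI_RE.search(line) is None:
        return None
    # Keep only the part before the first whitespace (space or tab) ...
    read_id = re.split(r"\s", line, maxsplit=1)[0]
    # ... then take the last '_'-separated piece (assumed barcode position).
    return read_id.split("_")[-1]


print(extract_umi("READ1_ACGT 1:N:0:1\tTTGACC"))  # ACGT
print(extract_umi("READ1_ACGT lane_1\tTTGACC"))   # ACGT
```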

Tested with multiple cases:

- UMI with/without descriptions
- Descriptions with/without underscores
- Read IDs with/without additional underscores
- No UMI cases (to ensure no false positives)

All of my test cases correctly detect UMI barcodes regardless of whether FASTQ/FASTA descriptions are present.
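
For readers who want to reproduce a check along these lines, the case categories listed above could be laid out as a small table-driven test. This is a Python sketch, not the tests actually run for this PR: the input lines are hypothetical, and `umi_of` is an illustrative stand-in for the Perl logic that assumes the barcode is the last '_'-separated piece of the read ID.

```python
import re

NEW_UMI_RE = re.compile(r"^\S*_[A-Z]+\s")  # pattern quoted in this PR


def umi_of(line):
    # Barcode = last '_' piece of the text before the first whitespace (assumed).
    if NEW_UMI_RE.search(line) is None:
        return None
    return re.split(r"\s", line, maxsplit=1)[0].split("_")[-1]


# (hypothetical input line, expected barcode)
CASES = [
    ("READ1_ACGT\tTTGACC", "ACGT"),                # UMI, no description
    ("READ1_ACGT 1:N:0:1\tTTGACC", "ACGT"),        # UMI, description
    ("READ1_ACGT lane_1\tTTGACC", "ACGT"),         # description with underscore
    ("RUN_7_READ1_ACGT 1:N:0:1\tTTGACC", "ACGT"),  # extra underscores in read ID
    ("READ1\tTTGACC", None),                       # no UMI, no description
    ("READ1 lane_A\tTTGACC", None),                # no UMI, underscore only in description
]

for line, expected in CASES:
    assert umi_of(line) == expected, (line, umi_of(line))
print("all cases pass")
```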


Stefan-JLU wants to merge 1 commit into gkudla:master from Stefan-JLU:fix-fastq-parsing
