Fix UMI barcode extraction to handle FASTQ/FASTA descriptions #14
Open
Stefan-JLU wants to merge 1 commit into gkudla:master from
FASTQ format allows descriptions in the identifier line, separated by
whitespace (e.g., "@READID_UMI 1:N:0:1"). The solexa2fasta.awk script
converts the entire identifier line to FASTA format, preserving any
descriptions (">@READID_UMI 1:N:0:1").
Previously, make_comp_fasta.pl had two issues when descriptions were present:
1. UMI Detection Regex: The pattern /^[^\t]*_[A-Z]*\t/ expected the UMI
barcode to appear immediately before the tab character. When a description
was present, the line looked like "READID_UMI description\tseq", so the
regex failed: the description, not the UMI, sat directly before the tab.
2. Barcode Extraction: The script split the entire first field (including
any description) by '_' to extract the UMI. When the description contained
underscores or other text, the wrong value was extracted as the barcode.
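
Both failure modes can be reproduced with a short Python sketch. This is
not the Perl code from make_comp_fasta.pl: the regex is the one quoted
above, but old_extract is a hypothetical stand-in that assumes the barcode
was taken as the last '_'-separated piece of the first tab-delimited field.

```python
import re

# Pre-fix detection pattern, as quoted in the description above.
OLD_UMI_RE = re.compile(r"^[^\t]*_[A-Z]*\t")


def old_extract(line):
    """Hypothetical pre-fix extraction: split the whole first field on '_'.

    Assumption: the barcode is the last '_'-separated piece.
    """
    first_field = line.split("\t", 1)[0]
    return first_field.split("_")[-1]


plain = "READID_ACGT\tTTGACC"                  # no description
described = "READID_ACGT 1:N:0:1\tTTGACC"      # description without '_'
underscored = "READID_ACGT sample_A\tTTGACC"   # description containing '_'

# Issue 1: a description between the UMI and the tab defeats detection.
print(OLD_UMI_RE.search(plain) is not None)      # True
print(OLD_UMI_RE.search(described) is not None)  # False

# Issue 2: splitting the whole field lets the description leak into the barcode.
print(repr(old_extract(described)))    # 'ACGT 1:N:0:1'
print(repr(old_extract(underscored)))  # 'A'
```

Note that the third line is detected by the old pattern, but only because
the description happens to end in "_A" just before the tab, so the barcode
comes from the description rather than from the read ID.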
This fix addresses both issues:
- Updated the UMI detection regex to /^\S*_[A-Z]+\s/, which matches the
UMI pattern before any whitespace (space or tab)
- Modified barcode extraction to first strip the description by taking
only the portion before the first whitespace, then splitting by '_'
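
The two changes can be sketched in Python as follows. The regex is the one
given above; extract_umi is a hypothetical helper name, and taking the last
'_'-separated piece of the read ID as the barcode is an assumption, since
the Perl code itself is not shown here.

```python
import re

# Post-fix detection pattern: the UMI must be followed by whitespace
# (space or tab), and \S* keeps the match inside the read ID.
NEW_UMI_RE = re.compile(r"^\S*_[A-Z]+\s")


def extract_umi(line):
    """Return the UMI barcode from a 'READID_UMI [description]<TAB>seq' line.

    Returns None when no UMI is detected.
    Assumption: the barcode is the last '_'-separated piece of the read ID.
    """
    if NEW_UMI_RE.search(line) is None:
        return None
    # Strip the description: keep only the text before the first whitespace.
    read_id = line.split(None, 1)[0]
    return read_id.split("_")[-1]


print(extract_umi("READID_ACGT\tTTGACC"))                # ACGT
print(extract_umi("READID_ACGT 1:N:0:1\tTTGACC"))        # ACGT
print(extract_umi("READID_ACGT sample_A\tTTGACC"))       # ACGT
print(extract_umi("RUN_7_READID_ACGT 1:N:0:1\tTTGACC"))  # ACGT
print(extract_umi("READID sample_A\tTTGACC"))            # None
```

Because \S* cannot cross whitespace, an underscore in the description can
no longer trigger detection on its own, as the last call shows.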
Tested with multiple cases:
- UMI with/without descriptions
- Descriptions with/without underscores
- Read IDs with/without additional underscores
- No UMI cases (to ensure no false positives)
All of my test cases correctly detect UMI barcodes regardless of whether
FASTQ/FASTA descriptions are present.