Fix UMI barcode extraction to handle FASTQ/FASTA descriptions #14

Open
Stefan-JLU wants to merge 1 commit into gkudla:master from Stefan-JLU:fix-fastq-parsing
@Stefan-JLU

FASTQ format allows descriptions in the identifier line, separated by whitespace (e.g., @READID_UMI 1:N:0:1). The solexa2fasta.awk script converts the entire identifier line to FASTA format, preserving any descriptions (>@READID_UMI 1:N:0:1).

Previously, make_comp_fasta.pl had two issues when descriptions were present:

  1. UMI Detection Regex: The pattern /^[^\t]*_[A-Z]*\t/ expected the UMI barcode to appear immediately before the tab character. When a description was present, the line looked like "READID_UMI description\tseq", so the regex failed: the description, rather than the UMI, immediately preceded the tab.

  2. Barcode Extraction: The script split the entire first field (including any description) by '_' to extract the UMI. With descriptions containing underscores or other text, this extracted the wrong value as the barcode.
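
The two failure modes can be reproduced with a small sketch. This is a Python transliteration for illustration, not the Perl code from make_comp_fasta.pl. The names `OLD_UMI_RE` and `old_extract` are hypothetical, and taking the last '_'-separated piece as the barcode is an assumption about how the script picks the UMI.

```python
import re

# Python transliteration of the old Perl pattern /^[^\t]*_[A-Z]*\t/
OLD_UMI_RE = re.compile(r"^[^\t]*_[A-Z]*\t")


def old_extract(line):
    """Mimic the old extraction: split the whole first tab-delimited field on '_'.

    Assumption: the barcode is taken as the last '_'-separated piece.
    """
    first_field = line.split("\t", 1)[0]
    return first_field.split("_")[-1]


plain = "READ1_ACGT\tTTGCA"                 # no description
with_desc = "READ1_ACGT 1:N:0:1\tTTGCA"     # description after the read ID

# Detection works only when the UMI sits directly before the tab.
print(bool(OLD_UMI_RE.match(plain)))        # True
print(bool(OLD_UMI_RE.match(with_desc)))    # False

# Extraction drags the description along with the barcode.
print(repr(old_extract(with_desc)))         # 'ACGT 1:N:0:1'
```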

This fix addresses both issues:

  • Updated the UMI detection regex to /^\S*_[A-Z]+\s/, which matches the UMI pattern before any whitespace (space or tab)
  • Modified barcode extraction to first strip the description by taking only the portion before the first whitespace, then splitting by '_'
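
As a rough illustration of the two changes, the sketch below applies the new pattern and the strip-then-split extraction to a few lines. It is written in Python rather than Perl and is not the code from this PR. `NEW_UMI_RE` and `extract_umi` are hypothetical names, and taking the last '_'-separated piece of the read ID as the barcode is an assumption.

```python
import re

# Python transliteration of the new Perl pattern /^\S*_[A-Z]+\s/
NEW_UMI_RE = re.compile(r"^\S*_[A-Z]+\s")


def extract_umi(line):
    """Return the UMI barcode from a 'READID_UMI [description]\\tseq' line, or None."""
    if not NEW_UMI_RE.match(line):
        return None
    # Strip the description: keep only the part before the first whitespace.
    read_id = line.split(None, 1)[0]
    # Assumption: the barcode is the last '_'-separated piece of the read ID.
    return read_id.split("_")[-1]


examples = [
    "READ1_ACGT\tTTGCA",                  # UMI, no description
    "READ1_ACGT 1:N:0:1\tTTGCA",          # UMI with description
    "READ1_ACGT my_desc\tTTGCA",          # description containing '_'
    "RUN_7_READ1_ACGT 1:N:0:1\tTTGCA",    # read ID with extra underscores
    "READ1 my_DESC\tTTGCA",               # no UMI, underscore only in description
]
for line in examples:
    print(repr(line), "->", extract_umi(line))
```

Because `\S*` cannot cross whitespace, an underscore that appears only in the description never triggers detection.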

Tested with multiple cases:

  • UMI with/without descriptions
  • Descriptions with/without underscores
  • Read IDs with/without additional underscores
  • No UMI cases (to ensure no false positives)

All of my test cases pass: UMI barcodes are detected correctly whether or not FASTQ/FASTA descriptions are present, and reads without a UMI are not flagged.
