Hello,
Thanks for developing this tool.
I am trying to run CORGi chromosome-wide with multiple samples but failed to do so. Please see the issue below.
Issue description
The command I ran:
python corgi.py -r ref.fa -b sample1.bam -c chr2 -p 0 $chrEnd -o out/chr2/
First error message:
[formatdb] WARNING: Cannot add sequence number 1 because it has zero length.
[formatdb] FATAL ERROR: Fatal error when adding sequence to BLAST database.
After going through the output files, I noticed that the RefChunk.fa file that formatdb uses to create the blast-db was empty. This is what causes formatdb to fail.
I hence debugged the code and found the issue.
Source of the problem
The RefChunk.fa file is created by:
- reading the reference file,
- obtaining the fasta entry that corresponds to the selected chromosome,
- joining all of the fasta lines of the chromosome into a single string,
- and finally subsetting this string with the start and end coordinates that the user provided.
Step 4, which occurs in line 70 of samFunc.py is the problematic one:
70 refChunk = readRef(refFile,refInds[refIndex])[samPos_s-1:samPos_e-1]
Python is a 0-based language, which the script correctly accounted for by adding a -1 to both coordinates when subsetting the string. However, when using the start coordinate 0, the subset becomes [-1:chrLength], which returns an empty string!
I suspect #2 is related to the same bug.
(temporal) Solution
Simply remove the -1 in line 70. Or, of course, use 1 instead of 0 in the arguments.
A future version of the script should add more information in the documentation or include an exemption to account for chromosome-wide subsets and prevent the start coordinate from being less than 0.
Hello,
Thanks for developing this tool.
I am trying to run CORGi chromosome-wide with multiple samples but failed to do so. Please see the issue below.
Issue description
The command I ran:
python corgi.py -r ref.fa -b sample1.bam -c chr2 -p 0 $chrEnd -o out/chr2/First error message:
After going through the output files, I noticed that the RefChunk.fa file that formatdb uses to create the blast-db was empty. This is what causes formatdb to fail.
I hence debugged the code and found the issue.
Source of the problem
The RefChunk.fa file is created by:
Step 4, which occurs in line 70 of samFunc.py is the problematic one:
Python is a 0-based language, which the script correctly accounted for by adding a -1 to both coordinates when subsetting the string. However, when using the start coordinate 0, the subset becomes [-1:chrLength], which returns an empty string!
I suspect #2 is related to the same bug.
(temporal) Solution
Simply remove the -1 in line 70. Or, of course, use 1 instead of 0 in the arguments.
A future version of the script should add more information in the documentation or include an exemption to account for chromosome-wide subsets and prevent the start coordinate from being less than 0.