Name Type Value Range Default Description For Goals
logLevel String all, trace, debug, info, warn, error, fatal, off info Only the log levels error, warn, info and trace are used by Genestrip. all
threads int [-1, 64] -1 The number of consumer threads n when processing data for the goals match and filter and also during the update phase of the db goal. There is always one additional thread that reads and uncompresses a corresponding fastq or fasta file (so it is n + 1 threads in total). When negative, the number of available processors - 1 is used as n. When 0, then the corresponding goals run in single-threaded mode. db, match, matchlr, filter
progressBar boolean true Whether to show a progress bar on the command line for long-running processing steps. db, match, matchlr, filter
progressBarUpdateMs int [100, 2147483647] 1000 Update period in ms for progress bar (if shown). db, match, matchlr, filter
kMerSize int [15, 32] 31 The number of base pairs k for k-mers. Changes to this value do not affect the memory usage of a database. db, filter, match, matchlr
extractKey String `` Extract key for read descriptors. The beginning of a descriptor must match this key after the '@' for the read to be written. extract
httpBaseURL String https://ftp.ncbi.nlm.nih.gov This base URL will be extended by /pub/taxonomy/ in order to download the taxonomy file taxdmp.zip and by /genomes/genbank for files from Genbank. db
ftpBaseURL String ftp.ncbi.nih.gov db
refseq.httpBaseURL String https://ftp.ncbi.nlm.nih.gov/refseq This mirror might be considered as an alternative. (No other mirror sites are known.) db
refseq.ftpBaseURL String ftp.ncbi.nih.gov db
useHttp boolean true Use http(s) to download data from NCBI. If false, then Genestrip will do anonymous FTP instead (with login and password set to anonymous). db
ignoreMissingFastas boolean false If true, then a download of files from NCBI will not stop in case a file is missing on the server. db
maxDownloadTries int [1, 1024] 5 The number of download attempts for a file before giving up. db
seqType nominal GENOMIC, RNA, M_RNA, ALL_RNA, ALL GENOMIC Which type of sequence files to include from the RefSeq. RNA files from the RefSeq end with rna.fna.gz, whereas genomes end with genomic.fna.gz. db
rankCompletionDepth nominal cellular root, acellular root, superkingdom, domain, realm, kingdom, phylum, subphylum, superclass, class, subclass, superorder, order, suborder, superfamily, family, subfamily, tribe, genus, subgenus, species group, species, varietas, subspecies, serogroup, biotype, strain, serotype, genotype, forma, forma specialis, isolate, clade, no rank, subkingdom, section, FILE, ID, leaf `` The rank up to which tax ids from taxids.txt will be completed by descendants of the taxonomy tree (the set rank included). If not set, the completion will traverse down to the lowest possible levels of the taxonomy. Typical values could be species or strain, but all values used for assigning ranks in the taxonomy are possible. db
checkSumCacheFile boolean true If true, then md5 check sums may be skipped by creating and accessing a file named <file>.md5ok that marks whether the md5 check sum of <file> was found to be ok after a previous download of <file>. db
maxGenomesPerTaxid int [1, 2147483647] 2147483647 The maximum number of genomes per tax id to be included in the database. Note that this is an important parameter to control database size, because in some cases, there are thousands of genomic entries per tax id. db
maxKMersPerTaxid long [0, 9223372036854775807] 9223372036854775807 The limit for the number of k-mers per tax id at which adding more k-mers for this tax id to the database stops. Note that this is an important parameter to control database size, because in some cases, there are millions of k-mers per tax id. all
maxPerTaxidRank nominal cellular root, acellular root, superkingdom, domain, realm, kingdom, phylum, subphylum, superclass, class, subclass, superorder, order, suborder, superfamily, family, subfamily, tribe, genus, subgenus, species group, species, varietas, subspecies, serogroup, biotype, strain, serotype, genotype, forma, forma specialis, isolate, clade, no rank, subkingdom, section, FILE, ID, leaf `` The rank for which to consider the parameters maxGenomesPerTaxid and maxKMersPerTaxid. If null, then the maximum number of genomes is considered with respect to the direct tax id under which a genome is stored. all
alwaysAssumeGzip boolean true If true, a fastq or fasta file which is downloaded via a URL is always assumed to be g-zipped. Otherwise, it will be considered g-zipped only if the file part of the URL ends with .gz or .gzip. fastamap, fastqmap
refseq.filldb boolean true Whether the RefSeq should be used as the basis for filling the database. filldb
refseq.completeGenomesOnly boolean false If true, then only genomic accessions with the prefixes AC, NC_, NZ_ will be considered when filling the database. Otherwise, all genomic accessions will be considered. See RefSeq accession numbers and molecule types for details. filldb
refSeq.limitForGenbankAccess int [0, 2147483647] 0 Determines whether Genestrip should try to look up genomic fasta files from Genbank, if the number of corresponding reference genomes from the RefSeq is below the given limit for a requested tax id including its descendants. E.g. refSeq.limitForGenbankAccess=1 would imply that Genbank is consulted if not a single reference genome is found in the RefSeq for a requested tax id. The default refSeq.limitForGenbankAccess=0 essentially inactivates this feature. In addition, Genbank access is also influenced by the keys genbank.fastaQualities, genbank.maxPerTaxid and genbank.referenceOnly (see below). Note that refSeq.limitForGenbankAccess is disregarded if refseq.filldb=false. db
refSeq.limitForGenbankRank nominal cellular root, acellular root, superkingdom, domain, realm, kingdom, phylum, subphylum, superclass, class, subclass, superorder, order, suborder, superfamily, family, subfamily, tribe, genus, subgenus, species group, species, varietas, subspecies, serogroup, biotype, strain, serotype, genotype, forma, forma specialis, isolate, clade, no rank, subkingdom, section, FILE, ID, leaf species The rank for which to check the limit refSeq.limitForGenbankAccess. If null, then the limit applies to all requested tax ids and their descendants. db
refseq.status list of nominals NA, UNKNOWN, REVIEWED, VALIDATED, PROVISIONAL, PREDICTED, INFERRED, MODEL NA,UNKNOWN,REVIEWED,VALIDATED,PROVISIONAL,PREDICTED,INFERRED,MODEL The refseq status values restrict the considered genomic accessions with respect to the given values. By default all values are allowed / included. db
reqseq.extract.gzip boolean false Whether to create gzipped extracted fasta files in goal extractrefseqfasta. all
genbank.maxPerTaxid int [-1, 2147483647] 1 Determines the maximum number of fasta files used from Genbank per requested tax id. If this value is <= 0 then all fasta files will be used. Otherwise, if the corresponding number of matching files exceeds genbank.maxPerTaxid, then the best ones according to genbank.fastaQualities will be retained while adhering to this maximum. db
genbank.fastaQualities list of nominals ADDITIONAL, COMPLETE_LATEST, COMPLETE, CHROMOSOME_LATEST, CHROMOSOME, SCAFFOLD_LATEST, SCAFFOLD, CONTIG_LATEST, CONTIG, LATEST, NONE COMPLETE_LATEST,CHROMOSOME_LATEST Determines the allowed quality levels of fasta files from Genbank. The values must be comma-separated. If a corresponding value is included in the list, then a fasta file for a requested tax id on that quality level will be included, otherwise not (while also respecting the conditions exerted via the keys refSeq.limitForGenbankAccess and genbank.maxPerTaxid). The quality levels are based on Genbank's Assembly Summary File (columns version_status and assembly_level). If the list is empty then no fasta files from Genbank will qualify. db
genbank.referenceOnly boolean false Whether only reference genomes are accepted or not. (Reference Genomes must be fetched from GenBank.) db
maxDust int [-1, 2147483647] -1 When generating a database via the goal db, any low-complexity k-mer with too many repetitive sequences of base pairs may be omitted for storing. To do so, Genestrip employs a simple genetic dust-filter for k-mers: It assigns a dust value d to each k-mer, and if d > maxDust, then the k-mer will not be stored. Let k(i) be the length of a k-mer's i-th substring si of maximum length such that si(j) = si(j-1) holds for all bases in si. Given a k-mer with n such non-overlapping substrings and their lengths k(1), ..., k(n), then d = fib(k(1)) + ... + fib(k(n)), where fib(k(i)) is the Fibonacci number of k(i). (The Fibonacci numbers are fib(1) = 0, fib(2) = 1, fib(n) = fib(n-1) + fib(n-2).) E.g., for the 8-mer TTTCGCGA, we have n = 3 with k(1) = 3 for TTT, k(2) = 4 for CGCG and k(3) = 1 for A which gives d = fib(3) + fib(4) + fib(1) = 1 + 2 + 0 = 3. For practical purposes, maxDust = 500 may be suitable. In this case, if 31-mers were generated uniformly at random, then fewer than 0.00002 % of them would be dropped. If maxDust = -1, then dust-filtering is inactive. db
dbResizingFactor double [0.0, 1.7976931348623157E308] 1.0 TODO db
xorBloomHash boolean true all
useHLLForDBSizing boolean false all
minUpdate boolean false Perform database update regarding least common ancestors only based on genomes of tax ids as selected for the database generation (and not via all of a super-kingdom's RefSeq genomes). updatedb
refseq.updateWithCompleteGenomesOnly boolean false If true, then only genomic accessions with the prefixes AC, NC_, NZ_ will be considered when updating the database. Otherwise, all genomic accessions will be considered for the update phase. See RefSeq accession numbers and molecule types for details. updatedb
removeTempDB boolean true Whether to delete the temporary database after the final database has been saved or not. db
stepSize int [1, 2147483647] 1 Stores k-mers in steps of stepSize. E.g. if stepSize=2 then only every second k-mer from a genome is considered for entry into the database. db
idNodes boolean false Whether to add artificial nodes in the tax tree to represent ids after '>' from fasta info lines for k-mers. BEWARE: This may cause a database build to fail as only up to 32767 tax ids are allowed. db
fileNodes boolean false Whether to add artificial nodes in the tax tree to represent fasta files for k-mers. db
lowerCaseBases boolean true Whether to accept lowercase bases for k-mers. db
svgFont String SansSerif The font name for the texts in the generated tree. svgtaxtree
svgFontSize int [1, 100] 18 The font size for the texts in the generated tree. svgtaxtree
svgLineHeightFactor double [0.5, 10.0] 1.0 How much a line of text in the tree is de- or increased with regard to the normally required line height. svgtaxtree
svgIndentFactor double [0.0, 10.0] 0.75 Factor for standard indentation of child nodes in tree. svgtaxtree
svgTextGapFactor double [0.0, 1.0] 0.25 Gap between horizontal line for child node and node text as a ratio of the font size. svgtaxtree
svgKmerNodeIndentFactor double [0.0, 1.7976931348623157E308] 0.0 Factor for additional indentation to reflect k-mers of node. The base value is normalized to [0,1], where 1 corresponds to the maximum k-mers per taxid as stored in the database. svgtaxtree
svgLogBasedIndent boolean false Whether to perform log-based indentation for svgKmerNodeIndentFactor instead of linear indentation. svgtaxtree
svgReqNodesBold boolean true Whether to use bold text for tax ids requested via the project file taxids.txt. svgtaxtree
svgShowRank boolean false Whether to add the rank in the node text. svgtaxtree
logProgressUpdateCycle long [0, 9223372036854775807] 1000000 Affects the log level trace: Defines after how many reads per fastq file, information on the matching progress is logged. If less than 1, then no progress information is logged. match, matchlr, filter
classifyReads boolean true Whether to do read classification in the style of Kraken and KrakenUniq. Matching is faster without read classification, and the columns kmers, unique kmers and max contig length in resulting CSV files are usually more conclusive anyway, in particular with respect to long reads. When read classification is off, the columns reads and kmers from reads will be 0 in resulting CSV files. match
countUniqueKMers boolean true If true, the number of unique k-mers will be counted and reported. This requires less than 5% of additional main memory. match, matchlr
writeFilteredFastq boolean false If true, then the goal match writes filtered fastq files in the same way that the goal filter does. match, matchlr
writeKrakenStyleOut boolean false If true, Genestrip will write output files with suffix .out in the Kraken output format under <base dir>/projects/<project_name>/krakenout covering all reads with at least one matching k-mer. match, matchlr
writeAll boolean true If false, Genestrip will write only classified reads to kraken style output files. match
useBloomFilterForMatch boolean true If true, a bloom filter will be loaded and used during fastq file analysis (i.e. matching). Using the bloom filter tends to shorten matching time if most of the reads cannot be classified because they contain no k-mers from the database. Otherwise, using the bloom filter might increase matching time by up to 30%. It also requires more main memory. match, matchlr
maxReadTaxErrorCount double [-1.0, 1.7976931348623157E308] -1.0 The absolute or relative maximum number of k-mers that do not need to be in the database for a read to be classified (read error count). If the number is above maxReadTaxErrorCount, then the read will not be classified. Otherwise the read will be classified in the same way as done by Kraken. If maxReadTaxErrorCount is >= 1, then it is interpreted as an absolute number of k-mers. If >= 0 and < 1, it is interpreted as the ratio between the k-mers not in the database and all k-mers of the read. If maxReadTaxErrorCount < 0, then the read error count is disregarded, which means that even a single matching k-mer will lead to the read's classification. match, matchlr
maxReadClassErrorCount double [-1.0, 1.7976931348623157E308] -1.0 The absolute or relative maximum number of k-mers that do not need to be consistent with a read's destined class for the read to be classified (read class error count). If the number is above maxReadClassErrorCount, then the read will not be classified. Otherwise the read will be classified in the same way as done by Kraken. If maxReadClassErrorCount is >= 1, then it is interpreted as an absolute number of k-mers. If >= 0 and < 1, it is interpreted as the ratio between the inconsistent k-mers and all k-mers of the read. If maxReadClassErrorCount < 0, then the read class error count is disregarded, which means that even a single matching k-mer will lead to the read's classification. match, matchlr
minKMersForClass int [1, 2147483647] 1 Can be set to adjust the minimal total of k-mers under taxon t required for a read to be classified to t. E.g., consider a read r and a taxon t1 on the genus rank with two k-mers from r and a taxon t2 subordinate to t1 on the species rank with one k-mer from r. Furthermore, r shall have no other k-mers matching any taxa. Then, if minKMersForClass = 2, r would not be classified to t2 but to t1 instead, since the single k-mer under t2 is below the threshold but the total of three k-mers under t1 suffices. all
maxKMerResCounts int [0, 65536] 0 If > 0, the corresponding number of frequencies of the most frequent k-mers per tax id will be reported. match, matchlr
writeDumpedFastq boolean false If true, then filter will also generate a fastq file dumped_... with all reads not written to the corresponding filtered fastq file. filter
minPosCountFilter int [0, 1024] 1 The minimum number of a read's k-mers to be found in the bloom index such that the read is added to the filtered fastq file. If minPosCountFilter=0, then posRatioFilter becomes effective. filter
posRatioFilter double [0.0, 1.0] 0.2 Only effective if minPosCountFilter=0: The minimum ratio of a read's k-mers to be found in the bloom index such that the read is added to the filtered fastq file. filter
withProbs boolean false Whether to process bp probabilities and potentially write them to filtered fastq files. (Takes slightly more memory if true.) filter, match, matchlr
taxids list of Strings `` List of tax ids separated by ,. A tax id may have the suffix +, which means that taxonomic descendants from the project's database will be included. This list can alternatively be set via the command line parameter -tx. db2fastqtaxids, db2fastq
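
The dust value arithmetic described for maxDust can be checked with a small sketch. It is only an illustration and not Genestrip's implementation: the function names `fib`, `dust_value` and `is_stored` are hypothetical, and the sketch takes the substring lengths k(1), ..., k(n) as input instead of deriving them from a k-mer, since it only aims to reproduce the summation and the comparison against maxDust.

```python
def fib(n: int) -> int:
    """Fibonacci numbers indexed as in the maxDust description: fib(1) = 0, fib(2) = 1."""
    if n < 1:
        raise ValueError("n must be >= 1")
    a, b = 0, 1  # fib(1), fib(2)
    for _ in range(n - 1):
        a, b = b, a + b
    return a


def dust_value(substring_lengths) -> int:
    """d = fib(k(1)) + ... + fib(k(n)) for the given substring lengths (hypothetical helper)."""
    return sum(fib(k) for k in substring_lengths)


def is_stored(substring_lengths, max_dust: int) -> bool:
    """A k-mer is dropped if d > maxDust; a negative maxDust (documented as -1) disables the filter."""
    if max_dust < 0:
        return True
    return dust_value(substring_lengths) <= max_dust


# Lengths from the documented example TTTCGCGA: TTT (3), CGCG (4), A (1).
example_lengths = [3, 4, 1]
print(dust_value(example_lengths))  # 1 + 2 + 0 = 3
```

With these lengths, `is_stored(example_lengths, 2)` would reject the k-mer, while `is_stored(example_lengths, 3)` or the default `maxDust = -1` would keep it.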