genestrip/ConfigParams.md at master · pfeiferd/genestrip

Name	Type	Value Range	Default	Description	For Goals
`logLevel`	String	`all`, `trace`, `debug`, `info`, `warn`, `error`, `fatal`, `off`	`info`	Only the log levels `error`, `warn`, `info` and `trace` are used by Genestrip.	all
`threads`	int	[-1, 64]	`-1`	The number of consumer threads n when processing data with respect to the goals `match`, `filter` and also during the update phase of the `db` goal. There is always one additional thread that reads and uncompresses a corresponding fastq or fasta file (so it is n + 1 threads in total). When negative, the number of available processors - 1 is used as n. When 0, then the corresponding goals run in single-threaded mode.	`db`, `match`, `matchlr`, `filter`
`progressBar`	boolean		`true`	Whether to show a progress bar on the command line for long-running process steps.	`db`, `match`, `matchlr`, `filter`
`progressBarUpdateMs`	int	[100, 2147483647]	`1000`	Update period in ms for progress bar (if shown).	`db`, `match`, `matchlr`, `filter`
`kMerSize`	int	[15, 32]	`31`	The number of base pairs k for k-mers. Changes to this value do not affect the memory usage of a database.	`db`, `filter`, `match`, `matchlr`
`extractKey`	String		``	Extract key for read descriptors. The beginning of a descriptor must match this key after the '@' for the read to be written.	`extract`
`httpBaseURL`	String		`https://ftp.ncbi.nlm.nih.gov`	This base URL will be extended by `/pub/taxonomy/` in order to download the taxonomy file `taxdmp.zip` and by `/genomes/genbank` for files from Genbank.	`db`
`ftpBaseURL`	String		`ftp.ncbi.nih.gov`		`db`
`refseq.httpBaseURL`	String		`https://ftp.ncbi.nlm.nih.gov/refseq`	This mirror might be considered as an alternative. (No other mirror sites are known.)	`db`
`refseq.ftpBaseURL`	String		`ftp.ncbi.nih.gov`		`db`
`useHttp`	boolean		`true`	Use http(s) to download data from NCBI. If `false`, then Genestrip will do anonymous FTP instead (with login and password set to `anonymous`).	`db`
`ignoreMissingFastas`	boolean		`false`	If `true`, then a download of files from NCBI will not stop in case a file is missing on the server.	`db`
`maxDownloadTries`	int	[1, 1024]	`5`	The number of download attempts for a file before giving up.	`db`
`seqType`	nominal	`GENOMIC`, `RNA`, `M_RNA`, `ALL_RNA`, `ALL`	`GENOMIC`	Which type of sequence files to include from the RefSeq. RNA files from the RefSeq end with `rna.fna.gz`, whereas genomes end with `genomic.fna.gz`.	`db`
`rankCompletionDepth`	nominal	`cellular root`, `acellular root`, `superkingdom`, `domain`, `realm`, `kingdom`, `phylum`, `subphylum`, `superclass`, `class`, `subclass`, `superorder`, `order`, `suborder`, `superfamily`, `family`, `subfamily`, `tribe`, `genus`, `subgenus`, `species group`, `species`, `varietas`, `subspecies`, `serogroup`, `biotype`, `strain`, `serotype`, `genotype`, `forma`, `forma specialis`, `isolate`, `clade`, `no rank`, `subkingdom`, `section`, `FILE`, `ID`, `leaf`	``	The rank up to which tax ids from `taxids.txt` will be completed by descendants of the taxonomy tree (the set rank included). If not set, the completion will traverse down to the lowest possible levels of the taxonomy. Typical values could be `species` or `strain`, but all values used for assigning ranks in the taxonomy are possible.	`db`
`checkSumCacheFile`	boolean		`true`	If `true`, then md5 check sums may be skipped by creating and accessing a file named `<file>.md5ok` that marks whether the md5 check sum of `<file>` was found to be ok after a previous download of `<file>`.	`db`
`maxGenomesPerTaxid`	int	[1, 2147483647]	`2147483647`	The maximum number of genomes per tax id to be included in the database. Note that this is an important parameter to control database size, because in some cases, there are thousands of genomic entries per tax id.	`db`
`maxKMersPerTaxid`	long	[0, 9223372036854775807]	`9223372036854775807`	The limit for the number of k-mers per tax id at which adding more k-mers for this tax id to the database stops. Note that this is an important parameter to control database size, because in some cases, there are millions of k-mers per tax id.	all
`maxPerTaxidRank`	nominal	`cellular root`, `acellular root`, `superkingdom`, `domain`, `realm`, `kingdom`, `phylum`, `subphylum`, `superclass`, `class`, `subclass`, `superorder`, `order`, `suborder`, `superfamily`, `family`, `subfamily`, `tribe`, `genus`, `subgenus`, `species group`, `species`, `varietas`, `subspecies`, `serogroup`, `biotype`, `strain`, `serotype`, `genotype`, `forma`, `forma specialis`, `isolate`, `clade`, `no rank`, `subkingdom`, `section`, `FILE`, `ID`, `leaf`	``	The rank for which to consider the parameters `maxGenomesPerTaxid` and `maxKMersPerTaxid`. If `null`, then the maximum number of genomes is considered with respect to the direct tax id under which a genome is stored.	all
`alwaysAssumeGzip`	boolean		`true`	If `true`, a fastq or fasta file which is downloaded via a URL is always assumed to be g-zipped. Otherwise, it will be considered g-zipped only if the file part of the URL ends with `.gz` or `.gzip`.	`fastamap`, `fastqmap`
`refseq.filldb`	boolean		`true`	Whether the RefSeq should be used as the basis for filling the database.	`filldb`
`refseq.completeGenomesOnly`	boolean		`false`	If `true`, then only genomic accessions with the prefixes `AC`, `NC_`, `NZ_` will be considered when filling the database. Otherwise, all genomic accessions will be considered. See RefSeq accession numbers and molecule types for details.	`filldb`
`refSeq.limitForGenbankAccess`	int	[0, 2147483647]	`0`	Determines whether Genestrip should try to look up genomic fasta files from Genbank if the number of corresponding reference genomes from the RefSeq is below the given limit for a requested tax id including its descendants. E.g. `refSeq.limitForGenbankAccess=1` would imply that Genbank is consulted if not a single reference genome is found in the RefSeq for a requested tax id. The default `refSeq.limitForGenbankAccess=0` essentially deactivates this feature. In addition, Genbank access is also influenced by the keys `genbank.fastaQualities`, `genbank.maxPerTaxid` and `genbank.referenceOnly` (see below). Note that `refSeq.limitForGenbankAccess` is disregarded if `refseq.filldb=false`.	`db`
`refSeq.limitForGenbankRank`	nominal	`cellular root`, `acellular root`, `superkingdom`, `domain`, `realm`, `kingdom`, `phylum`, `subphylum`, `superclass`, `class`, `subclass`, `superorder`, `order`, `suborder`, `superfamily`, `family`, `subfamily`, `tribe`, `genus`, `subgenus`, `species group`, `species`, `varietas`, `subspecies`, `serogroup`, `biotype`, `strain`, `serotype`, `genotype`, `forma`, `forma specialis`, `isolate`, `clade`, `no rank`, `subkingdom`, `section`, `FILE`, `ID`, `leaf`	`species`	The rank for which to check the limit `refSeq.limitForGenbankAccess`. If `null`, then the limit applies to all requested tax ids and their descendants.	`db`
`refseq.status`	list of nominals	`NA`, `UNKNOWN`, `REVIEWED`, `VALIDATED`, `PROVISIONAL`, `PREDICTED`, `INFERRED`, `MODEL`	`NA,UNKNOWN,REVIEWED,VALIDATED,PROVISIONAL,PREDICTED,INFERRED,MODEL`	The refseq status values restrict the considered genomic accessions with respect to the given values. By default all values are allowed / included.	`db`
`reqseq.extract.gzip`	boolean		`false`	Whether to create gzipped extracted fasta files in goal `extractrefseqfasta`.	all
`genbank.maxPerTaxid`	int	[-1, 2147483647]	`1`	Determines the maximum number of fasta files used from Genbank per requested tax id. If this value is <= 0, then all fasta files will be used. Otherwise, if the corresponding number of matching files exceeds `genbank.maxPerTaxid`, then the best ones according to `genbank.fastaQualities` will be retained while adhering to this maximum.	`db`
`genbank.fastaQualities`	list of nominals	`ADDITIONAL`, `COMPLETE_LATEST`, `COMPLETE`, `CHROMOSOME_LATEST`, `CHROMOSOME`, `SCAFFOLD_LATEST`, `SCAFFOLD`, `CONTIG_LATEST`, `CONTIG`, `LATEST`, `NONE`	`COMPLETE_LATEST,CHROMOSOME_LATEST`	Determines the allowed quality levels of fasta files from Genbank. The values must be comma-separated. If a corresponding value is included in the list, then a fasta file for a requested tax id on that quality level will be included, otherwise not (while also respecting the conditions exerted via the keys `refSeq.limitForGenbankAccess` and `genbank.maxPerTaxid`). The quality levels are based on Genbank's Assembly Summary File (columns `version_status` and `assembly_level`). If the list is empty then no fasta files from Genbank will qualify.	`db`
`genbank.referenceOnly`	boolean		`false`	Whether only reference genomes are accepted or not. (Reference Genomes must be fetched from GenBank.)	`db`
`maxDust`	int	[-1, 2147483647]	`-1`	When generating a database via the goal `db`, any low-complexity k-mer with too many repetitive sequences of base pairs may be omitted from storage. To do so, Genestrip employs a simple genetic dust-filter for k-mers: It assigns a dust value d to each k-mer, and if d > `maxDust`, then the k-mer will not be stored. Let k(i) be the length of a k-mer's i-th substring s_i of maximum length such that s_i(j) = s_i(j-1) holds for all bases in s_i. Given a k-mer with n such non-overlapping substrings and their lengths k(1), ..., k(n), then d = fib(k(1)) + ... + fib(k(n)), where fib(k(i)) is the Fibonacci number of k(i). (The Fibonacci numbers are fib(1) = 0, fib(2) = 1, fib(n) = fib(n-1) + fib(n-2).) E.g., for the 8-mer `TTTCGCGA`, we have n = 3 with k(1) = 3 for `TTT`, k(2) = 4 for `CGCG` and k(3) = 1 for `A` which gives d = fib(3) + fib(4) + fib(1) = 1 + 2 + 0 = 3. For practical purposes, `maxDust = 500` may be suitable. In this case, if 31-mers were generated uniformly at random, then less than 0.00002 % of them would be dropped. If `maxDust = -1`, then dust-filtering is inactive.	`db`
`dbResizingFactor`	double	[0.0, 1.7976931348623157E308]	`1.0`	TODO	`db`
`xorBloomHash`	boolean		`true`		all
`useHLLForDBSizing`	boolean		`false`		all
`minUpdate`	boolean		`false`	Perform database update regarding least common ancestors only based on genomes of tax ids as selected for the database generation (and not via all of a super-kingdom's RefSeq genomes).	`updatedb`
`refseq.updateWithCompleteGenomesOnly`	boolean		`false`	If `true`, then only genomic accessions with the prefixes `AC`, `NC_`, `NZ_` will be considered when updating the database. Otherwise, all genomic accessions will be considered for the update phase. See RefSeq accession numbers and molecule types for details.	`updatedb`
`removeTempDB`	boolean		`true`	Whether to delete the temporary database after the final database has been saved or not.	`db`
`stepSize`	int	[1, 2147483647]	`1`	Stores k-mers in steps of `stepSize`. E.g. if `stepSize=2` then only every second k-mer from a genome is considered for entry into the database.	`db`
`idNodes`	boolean		`false`	Whether to add artificial nodes in the tax tree to represent ids after '>' from fasta info lines for k-mers. BEWARE: This may cause a database build to fail as only up to 32767 tax ids are allowed.	`db`
`fileNodes`	boolean		`false`	Whether to add artificial nodes in the tax tree to represent fasta files for k-mers.	`db`
`lowerCaseBases`	boolean		`true`	Whether to accept lowercase bases for k-mers.	`db`
`svgFont`	String		`SansSerif`	The font name for the texts in the generated tree.	`svgtaxtree`
`svgFontSize`	int	[1, 100]	`18`	The font size for the texts in the generated tree.	`svgtaxtree`
`svgLineHeightFactor`	double	[0.5, 10.0]	`1.0`	Factor by which the height of a line of text in the tree is decreased or increased relative to the normally required line height.	`svgtaxtree`
`svgIndentFactor`	double	[0.0, 10.0]	`0.75`	Factor for standard indentation of child nodes in tree.	`svgtaxtree`
`svgTextGapFactor`	double	[0.0, 1.0]	`0.25`	Gap between horizontal line for child node and node text as a ratio of the font size.	`svgtaxtree`
`svgKmerNodeIndentFactor`	double	[0.0, 1.7976931348623157E308]	`0.0`	Factor for additional indentation to reflect k-mers of node. The base value is normalized to [0,1], where 1 corresponds to the maximum k-mers per taxid as stored in the database.	`svgtaxtree`
`svgLogBasedIndent`	boolean		`false`	Whether to perform log-based indentation for `svgKmerNodeIndentFactor` instead of linear indentation.	`svgtaxtree`
`svgReqNodesBold`	boolean		`true`	Whether to use bold text for tax ids requested via the project file `taxids.txt`.	`svgtaxtree`
`svgShowRank`	boolean		`false`	Whether to add the rank in the node text.	`svgtaxtree`
`logProgressUpdateCycle`	long	[0, 9223372036854775807]	`1000000`	Affects the log level `trace`: Defines after how many reads per fastq file, information on the matching progress is logged. If less than 1, then no progress information is logged.	`match`, `matchlr`, `filter`
`classifyReads`	boolean		`true`	Whether to do read classification in the style of Kraken and KrakenUniq. Matching is faster without read classification and the columns `kmers`, `unique kmers` and `max contig length` in resulting CSV files are usually more conclusive anyway, in particular with respect to long reads. When read classification is off, the columns `reads` and `kmers from reads` will be 0 in resulting CSV files.	`match`
`countUniqueKMers`	boolean		`true`	If `true`, the number of unique k-mers will be counted and reported. This requires less than 5% of additional main memory.	`match`, `matchlr`
`writeFilteredFastq`	boolean		`false`	If `true`, then the goal `match` writes filtered fastq files in the same way that the goal `filter` does.	`match`, `matchlr`
`writeKrakenStyleOut`	boolean		`false`	If `true`, Genestrip will write output files with suffix `.out` in the Kraken output format under `<base dir>/projects/<project_name>/krakenout` covering all reads with at least one matching k-mer.	`match`, `matchlr`
`writeAll`	boolean		`true`	If `false`, Genestrip will write only classified reads to kraken style output files.	`match`
`useBloomFilterForMatch`	boolean		`true`	If `true`, a bloom filter will be loaded and used during fastq file analysis (i.e. matching). Using the bloom filter tends to shorten matching time if most of the reads cannot be classified because they contain no k-mers from the database. Otherwise, using the bloom filter might increase matching time by up to 30%. It also requires more main memory.	`match`, `matchlr`
`maxReadTaxErrorCount`	double	[-1.0, 1.7976931348623157E308]	`-1.0`	The absolute or relative maximum number of k-mers that do not need to be in the database for a read to be classified (read error count). If the number is above `maxReadTaxErrorCount`, then the read will not be classified. Otherwise the read will be classified in the same way as done by Kraken. If `maxReadTaxErrorCount` is >= 1, then it is interpreted as an absolute number of k-mers. If >= 0 and < 1, it is interpreted as the ratio between the k-mers not in the database and all k-mers of the read. If `maxReadTaxErrorCount` < 0, then the read error count is disregarded, which means that even a single matching k-mer will lead to the read's classification.	`match`, `matchlr`
`maxReadClassErrorCount`	double	[-1.0, 1.7976931348623157E308]	`-1.0`	The absolute or relative maximum number of k-mers that do not need to be consistent with a read's destined class for the read to be classified (read class error count). If the number is above `maxReadClassErrorCount`, then the read will not be classified. Otherwise the read will be classified in the same way as done by Kraken. If `maxReadClassErrorCount` is >= 1, then it is interpreted as an absolute number of k-mers. If >= 0 and < 1, it is interpreted as the ratio between the inconsistent k-mers and all k-mers of the read. If `maxReadClassErrorCount` < 0, then the read class error count is disregarded, which means that even a single matching k-mer will lead to the read's classification.	`match`, `matchlr`
`minKMersForClass`	int	[1, 2147483647]	`1`	Can be set to adjust the minimal total of k-mers under taxon t required for a read to be classified to t. E.g., consider a read r and a taxon t1 on the genus rank with two k-mers from r, and a taxon t2 subordinate to t1 on the species rank with one k-mer from r. Furthermore, r shall have no other k-mers matching any taxa. Then, if `minKMersForClass = 2`, r would not be classified to t2 but to t1 instead, since the single k-mer under t2 is below the threshold but the total of three k-mers under t1 suffices.	all
`maxKMerResCounts`	int	[0, 65536]	`0`	If > 0, the corresponding number of frequencies of the most frequent k-mers per tax id will be reported.	`match`, `matchlr`
`writeDumpedFastq`	boolean		`false`	If `true`, then `filter` will also generate a fastq file `dumped_...` with all reads not written to the corresponding filtered fastq file.	`filter`
`minPosCountFilter`	int	[0, 1024]	`1`	The minimum number of a read's k-mers to be found in the bloom index such that the read is added to the filtered fastq file. If `minPosCountFilter=0`, then `posRatioFilter` becomes effective.	`filter`
`posRatioFilter`	double	[0.0, 1.0]	`0.2`	Only effective if `minPosCountFilter=0`: The minimum ratio of a read's k-mers to be found in the bloom index such that the read is added to the filtered fastq file.	`filter`
`withProbs`	boolean		`false`	Whether to process bp probabilities and potentially write them to filtered fastq files. (Takes slightly more memory if `true`.)	`filter`, `match`, `matchlr`
`taxids`	list of Strings		``	List of tax ids separated by `,`. A tax id may have the suffix `+`, which means that taxonomic descendants from the project's database will be included. This list can alternatively be set via the command line parameter `-tx`.	`db2fastqtaxids`, `db2fastq`
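
The dust value arithmetic described for `maxDust` can be sketched as follows. This is an illustrative Python sketch, not Genestrip's own code: the function names `fib`, `dust_value` and `is_dropped` are hypothetical, and the sketch takes the substring lengths k(1), ..., k(n) as input rather than deriving them from a k-mer, since the splitting rule is only described informally in the table.

```python
from typing import Iterable


def fib(n: int) -> int:
    """Fibonacci number with the indexing from the table: fib(1) = 0, fib(2) = 1."""
    if n < 1:
        raise ValueError("n must be >= 1")
    a, b = 0, 1  # a = fib(1), b = fib(2)
    for _ in range(n - 1):
        a, b = b, a + b
    return a


def dust_value(substring_lengths: Iterable[int]) -> int:
    """d = fib(k(1)) + ... + fib(k(n)) for the given substring lengths."""
    return sum(fib(k) for k in substring_lengths)


def is_dropped(substring_lengths: Iterable[int], max_dust: int) -> bool:
    """A k-mer is not stored if d > maxDust; a negative maxDust disables the filter."""
    if max_dust < 0:
        return False
    return dust_value(substring_lengths) > max_dust


# Lengths from the documented example TTTCGCGA: TTT, CGCG, A
print(dust_value([3, 4, 1]))
```

With the lengths 3, 4 and 1 from the `TTTCGCGA` example, the sum is fib(3) + fib(4) + fib(1) = 1 + 2 + 0 = 3, in line with the table.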
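
The three value regimes of `maxReadTaxErrorCount` (and, analogously, `maxReadClassErrorCount`) may be easier to follow as a small decision function. The Python sketch below is only an illustration of the rule as worded in the table. The function name `within_error_limit` and its parameters are hypothetical, and rejecting reads without any k-mers is an assumption of the sketch.

```python
def within_error_limit(error_kmers: int, total_kmers: int, max_error_count: float) -> bool:
    """Return True if the read may still be classified under the given limit.

    error_kmers:     k-mers of the read counted as errors (e.g. not in the database)
    total_kmers:     all k-mers of the read
    max_error_count: value of maxReadTaxErrorCount or maxReadClassErrorCount
    """
    if total_kmers <= 0:
        # Assumption of this sketch: a read without k-mers is not handled here.
        raise ValueError("total_kmers must be positive")
    if max_error_count < 0:
        # Negative: the error count is disregarded.
        return True
    if max_error_count >= 1:
        # >= 1: interpreted as an absolute number of k-mers.
        return error_kmers <= max_error_count
    # >= 0 and < 1: interpreted as a ratio of error k-mers to all k-mers.
    return error_kmers / total_kmers <= max_error_count


print(within_error_limit(3, 100, -1.0))  # limit disregarded
print(within_error_limit(3, 100, 2.0))   # absolute limit of 2 k-mers
print(within_error_limit(5, 100, 0.1))   # relative limit of 10 %
```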
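
The interplay of `minPosCountFilter` and `posRatioFilter` for the goal `filter` can be sketched in the same way. Again, this is a hedged Python illustration rather than Genestrip's implementation: `keep_read` and its parameter names are hypothetical, and reading "minimum" as an inclusive bound (>=) is an assumption.

```python
def keep_read(pos_kmers: int, total_kmers: int,
              min_pos_count_filter: int = 1,
              pos_ratio_filter: float = 0.2) -> bool:
    """Decide whether a read would go into the filtered fastq file.

    pos_kmers:   number of the read's k-mers found in the bloom index
    total_kmers: number of all k-mers of the read
    """
    if min_pos_count_filter > 0:
        # Absolute criterion: posRatioFilter is not effective in this case.
        return pos_kmers >= min_pos_count_filter
    if total_kmers <= 0:
        # Assumption of this sketch: a read without k-mers is never kept.
        return False
    # minPosCountFilter = 0: the ratio criterion becomes effective.
    return pos_kmers / total_kmers >= pos_ratio_filter


print(keep_read(1, 50))                           # defaults: one hit suffices
print(keep_read(9, 50, min_pos_count_filter=0))   # 18 % of k-mers, below 0.2
print(keep_read(11, 50, min_pos_count_filter=0))  # 22 % of k-mers, above 0.2
```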
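
How the `threads` setting translates into the number of consumer threads n can be sketched as below. The helper `resolve_consumer_threads` is hypothetical, and clamping the result at 0 on a single-processor machine is an assumption that the table does not state.

```python
import os
from typing import Optional


def resolve_consumer_threads(threads: int, available_processors: Optional[int] = None) -> int:
    """Return the number of consumer threads n for a given `threads` value."""
    if not -1 <= threads <= 64:
        raise ValueError("threads must be in [-1, 64]")
    if threads >= 0:
        # 0 means single-threaded mode, otherwise n consumer threads.
        return threads
    if available_processors is None:
        available_processors = os.cpu_count() or 1
    # Negative: number of available processors - 1 (clamped at 0 as an assumption).
    return max(available_processors - 1, 0)


print(resolve_consumer_threads(-1, available_processors=8))
print(resolve_consumer_threads(4, available_processors=8))
```

For n > 0, the table notes that one additional reader thread is used on top of the n consumer threads.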
