Skip to content

Look into using taxon result limited blastp instead of top-n limited #322

@chasemc

Description

@chasemc

Taxon assignment is currently performed by running DIAMOND blastp on contig proteins, finding the LCA and majority vote.

Currently, the blast result simply returns the top 200 database hits which is not ideal if there are many high-scoring matches

That is currently decided here in python:

def blast(
fasta: str,
database: str,
outfpath: str,
blast_type: str = "blastp",
evalue: float = float("1e-5"),
maxtargetseqs: int = 200,
cpus: int = mp.cpu_count(),
tmpdir: str = None,
force: bool = False,
verbose: bool = False,
) -> str:
"""
Performs diamond blastp search using query sequence against diamond formatted database
Parameters
----------
fasta : str
Path to fasta file having the query sequences. Should be amino acid sequences in case of BLASTP
and nucleotide sequences in case of BLASTX
database : str
Path to diamond formatted database
outfpath : str
Path to output file
blast_type : str, optional
blastp to align protein query sequences against a protein reference database,
blastx to align translated DNA query sequences against a protein reference database, by default 'blastp'
evalue : float, optional
cutoff e-value to count hit as significant, by default float('1e-5')
maxtargetseqs : int, optional
max number of target sequences to retrieve per query by diamond, by default 200
cpus : int, optional
Number of processors to be used, by default uses all the processors of the system
tmpdir : str, optional
Path to temporary directory. By default, same as the output directory
force : bool, optional
overwrite existing diamond results, by default False
verbose : bool, optional
log progress to terminal, by default False
Returns
-------
str
Path to BLAST results
Raises
------
FileNotFoundError
`fasta` file does not exist
ValueError
provided `blast_type` is not 'blastp' or 'blastx'
subprocess.CalledProcessError
Failed to run blast
"""
if not os.path.exists(fasta):
raise FileNotFoundError(fasta)
if os.path.exists(outfpath) and not force:
if os.path.getsize(outfpath):
if verbose:
logger.warning(f"FileExistsError: {outfpath}. To overwrite use --force")
return outfpath
blast_type = blast_type.lower()
if blast_type not in ["blastp", "blastx"]:
raise ValueError(f"blast_type must be blastp or blastx. Given: {blast_type}")
if verbose:
logger.debug(f"diamond {blast_type} {fasta} against {database}")
cmd = [
"diamond",
blast_type,
"--query",
fasta,
"--db",
database,
"--evalue",
evalue,
"--max-target-seqs",
maxtargetseqs,
"--threads",
cpus,
"--outfmt",
"6",
"--out",
outfpath,
]
if tmpdir:
cmd.extend(["--tmpdir", tmpdir])
# this is done as when cmd is a list each element should be a string
cmd = [str(c) for c in cmd]
if verbose:
logger.debug(f'RunningDiamond: {" ".join(cmd)}')
subprocess.run(
cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True
)
return outfpath

And, once merged, here in Nextflow:
ext.args = '--evalue 1e-5 --max-hsps 1 --max-target-seqs 200 -b 6 --outfmt 6'

To investigate:

It looks like DIAMOND can build the database with taxon info and limit to 1 hit per species, which is more ideal.

e.g. build... (note this has been done and the taxon_mapped_nr_diamond.dmnd file exists on the server)

 diamond makedb \
    --in $DIR_PREFIX/Databases/autometa_databases/nr \
    --db $DIR_PREFIX/Databases/autometa_databases/taxon_mapped_nr_diamond \
    --taxonmap $DIR_PREFIX/autometa_databases/prot.accession2taxid.FULL.gz \
    --taxonnodes $DIR_PREFIX/autometa_databases/nodes.dmp \
    --threads 40

then run diamonnd blastp --taxon-k 1 ... ?

--taxon-k maximum number of targets to report per species

Additionally, this could potentially remove need for a taxid lookup step in Autometa by changing DIAMOND blastp's output to --outfmt 6 staxids


Sorta related to #11

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestnextflowNextflow related issues/codepythonPython related issues/codestretchwonderful for the community, yet may not be top priority

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions