2. User guide
ℹ️ This section only covers the basics of usage. Depending on whether you have (or want) conda, you can follow one procedure or the other. Using conda is recommended, as all prerequisites will be met that way. Otherwise, you need Python 3.10 or above with the latest version of pip (`python -m pip install --upgrade pip`) before starting this guide.
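If you go the pip route, you can quickly check that your interpreter meets the requirement stated above; a minimal sketch:

```python
import sys

# WISP needs Python 3.10 or above (see the note above)
ok = sys.version_info >= (3, 10)
print("Python version OK" if ok else "Please upgrade to Python 3.10+")
```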
With conda:

- Step 1: clone the repository with `git clone git@github.com:Tharos-ux/wisp.git`
- Step 2: navigate to the main `wisp` package with `cd wisp/`
- Step 3: execute the environment config tool with `bash env.sh`
Without conda:

- Step 1: clone the repository with `git clone git@github.com:Tharos-ux/wisp.git`
- Step 2: navigate to the main `wisp` package with `cd wisp/`
- Step 3: install dependencies with `python -m pip install -r requirements.txt`
ℹ️ Let's assume `your_parameter_file` is a valid .json file. You can find examples inside the `parameters_files` folder. Run the commands below with your conda environment deactivated!
- To build your database: `bash wisp.sh "wisp.py parameters_files/your_parameter_file -b"`
- To run predictions: `bash wisp.sh "wisp.py parameters_files/your_parameter_file"`
You can use managers such as Slurm to create background jobs on a cluster (ask your administrator whether Slurm is configured); commands will look like `sbatch wisp.sh "wisp.py parameters_files/your_parameter_file -b"`.
Commands are the same; however, you need to use `python wisp.py parameters_files/your_parameter_file -b` on its own (remove the call to the bash script, keep the Python part) and you're good to go!
Database creation and prediction alone won't cover all use cases; many other scripts are available. Every command in this list can be used with any of the methods described above: replace the `python` expression with the new one and you're good to go!
- `utilities.py format_tool path/to/reference/genomes` will rename freshly retrieved genomes inside `path/to/reference/genomes` according to their classification
- `utilities.py aggregate path/to/wisp/results path/to/output/txt/file` will gather information from all reads of a metagenomic sample output inside `path/to/wisp/results` into a single file saved at `path/to/output/txt/file`
- `utilities.py compare_metagenomic path/to/output path/to/reference/percentages/json path/to/aggregate/file_1 path/to/aggregate/file_2 ...` will plot the distribution of reads for metagenomic samples, saving the plot in `path/to/output`
- `utilities.py database_features path/to/database output/path/for/figures` will plot the features used by XGBoost models from a database folder `path/to/database` as surfaces (outputs will be in `output/path/for/figures`)
- `utilities.py kmers_signatures path/to/reference/genomes output/path/for/figures` will plot kmer signature variations for a set of genomes located in `path/to/reference/genomes` and output them to `output/path/for/figures`
- `utilities.py compare_outputs ["path/to/output/1","path/to/output/2", ...]` will aggregate, in a .csv file, results from the different folders in the list `["path/to/output/1","path/to/output/2", ...]`
- `utilities.py clean_rename path/to/reference/genomes` will remove conflicting characters from the names of genomes located in `path/to/reference/genomes`
- `utilities.py summary_to_dl path/to/summary_assembly.txt path/to/reference/genomes` will download the set of genomes described by the file `path/to/summary_assembly.txt` into the specified folder `path/to/reference/genomes`
- `utilities.py destroy_sequence path/to/reference/genomes output/path/for/genomes 0.06` will add 6% of random sequencing errors to the reference genomes inside `path/to/reference/genomes` and write the files to `output/path/for/genomes`
- `utilities.py mock_dataset path/to/reference/genomes output/path/for/dataset` will parse the genomes in `path/to/reference/genomes` and write a mock metagenomic sample file to `output/path/for/dataset`
- `utilities.py clean_minion path/to/minion/genomes` will extract sequences from `path/to/minion/genomes` matching the length condition and erase the headers
- `utilities.py extract_genomes 3 path/to/reference/genomes output/path/for/genomes` will extract all genomes from `path/to/reference/genomes` having at least 3 relatives at family level in the reference genomes directory into the output path `output/path/for/genomes`
- `wisp_lib/parameters_init.py your_parameter_file` will create a parameters file named `your_parameter_file` with default parameters (you need to edit those according to your own directory setup!)
- `wisp.py build parameters_files/your_parameter_file` will construct the training and test sets with the parameters defined in the parameters file
- `wisp.py model parameters_files/your_parameter_file` will create models from the train sets with the parameters defined in the parameters file
- `wisp.py predict parameters_files/your_parameter_file` will bin samples against the models with the parameters defined in the parameters file
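The three `wisp.py` sub-commands above form a pipeline (build, then model, then predict). A minimal driver sketch, assuming `wisp.py` and your parameters file exist at these placeholder paths:

```python
import subprocess

PARAMS = "parameters_files/your_parameter_file"  # placeholder path

# Run the three WISP phases in order; each one feeds the next.
commands = [
    ["python", "wisp.py", phase, PARAMS]
    for phase in ("build", "model", "predict")
]
for cmd in commands:
    print("would run:", " ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually execute
```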
The structure of the file is simple: it is a Python dictionary, which will be saved to a .json file. It contains all path names and algorithm parameters.
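As a concrete illustration, such a dictionary can be dumped to .json with the standard library (the keys shown are a small, illustrative subset of the full listing below):

```python
import json

# Illustrative subset of a WISP parameters dict; values mirror the defaults below
params = {
    "window_size": 10000,
    "sampling_objective": 500,
    "threshold": 0.10,
    "db_name": "dbname",
}

with open("your_parameter_file.json", "w") as handle:
    json.dump(params, handle, indent=4)

# Reading it back yields the same dictionary
with open("your_parameter_file.json") as handle:
    assert json.load(handle) == params
```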
'window_size': 10000 # Size of the window we use; all sequences below this threshold are discarded
'sampling_objective': 500 # Number of reads we randomly sample from each unknown sample
# Next up are the parameters for your database [kmer_size, subsampling_depth, pattern]
# One line per classification level. merged lets you jump to a given level without the previous classification steps
# kmer_size is the length of the sliding window
# subsampling_depth is the number of reads we sample, with maximum coverage as the objective, from each of our reference genomes
# pattern allows for spaced seeds: 1 means 'keep the nucleotide at this position' and 0 means 'discard it'
# the number of 1's in pattern must be equal to kmer_size
'domain_ref': [5, 50, [1, 1, 1, 1, 1]]
'phylum_ref': [5, 100, [1, 1, 1, 1, 1]]
'group_ref': [4, 100, [1, 1, 1, 1]]
'order_ref': [4, 100, [1, 1, 1, 1]]
'family_ref': [4, 100, [1, 1, 1, 1]]
'merged_ref': [4, 50, [1, 1, 1, 1]]
# Next are the parameters for your unknown sample [kmer_size, pattern]
# You need to match those parameters with the ones from the database you're using
'domain_sample': [5, [1, 1, 1, 1, 1]]
'phylum_sample': [5, [1, 1, 1, 1, 1]]
'group_sample': [4, [1, 1, 1, 1]]
'order_sample': [4, [1, 1, 1, 1]]
'family_sample': [4, [1, 1, 1, 1]]
'merged_sample': [4, [1, 1, 1, 1]]
# location of reference genomes
'input_train': "path/to/directory/train/"
# location of unknown samples
'input_unk': "path/to/directory/unk/"
# where to save the database
'database_output': "path/to/directory/data/"
# where to save the output files
'reports_output': "path/to/directory/output/jobname/"
# parameters for exploration and the algorithm (please refer to the report for details on what they mean)
# limit to consider a tree branch as pertinent
'threshold': 0.10
# number of XGBoost boosting rounds
'nb_boosts': 10
# max depth of XGBoost trees
'tree_depth': 10
# level of verbosity for output
'test_mode': 'no_test' # 'no_test', 'min_set', 'verbose'
# parameter for read selection: significance threshold for softprob
'reads_th': 0.1
'selection_mode': 'delta_mean' # 'min_max','delta_mean','delta_sum'
# force rebuilding the full model when you re-use a database but have changed model parameters
'force_model_rebuild': False
# tells the software whether it should consider both orientations, or assume reads are 3' -> 5' and compute canonical kmers (disabled)
'single_way': True
# level we're stopping prediction at
'targeted_level': 'family' # domain, phylum, group, order, family
# levels we consider during classification
'levels_list': ['domain', 'phylum', 'group', 'order', 'family']
# (disabled parameter)
'abundance_threshold': 0.25
# Email identity used to fetch WISP genomes from RefSeq
'email': 'XXXXXXXX.XXXXXXX@XXXXX.XX'
# Path to download genomes
'annotate_path': 'path/to/genomes/for/something/'
# Path to assembly_summary file (NCBI ftp format)
'accession_numbers': 'path/to/assembly_summary.txt'
# Name for database
'db_name': "dbname"
# Prefix for identifying job in output folder
'prefix_job': "jobprefix"
# Name of log file (stored inside logs folder)
'log_file': "name_for_log_file"
WISP: bacterial family identification from long reads, machine learning with XGBoost