2. User guide
ℹ️ This section only covers the basics of usage. Depending on whether you have (or want) conda, you can follow one procedure or the other. Using conda is recommended, as all prerequisites will be met that way. Otherwise, you need Python 3.10 or above with the latest version of pip (`python -m pip install --upgrade pip`) before starting this guide.
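If you go the pip route, you can quickly check that your interpreter meets the requirement stated above; a minimal sketch:

```python
import sys

# WISP needs Python 3.10 or above (see the note above)
ok = sys.version_info >= (3, 10)
print("Python version OK" if ok else "Please upgrade to Python 3.10+")
```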
With conda:

- Step 1: clone the repository with `git clone git@github.com:Tharos-ux/wisp.git`
- Step 2: navigate to the main `wisp` package with `cd wisp/`
- Step 3: execute the environment config tool with `bash env.sh`
Without conda:

- Step 1: clone the repository with `git clone git@github.com:Tharos-ux/wisp.git`
- Step 2: navigate to the main `wisp` package with `cd wisp/`
- Step 3: install dependencies with `python -m pip install -r requirements.txt`
ℹ️ Let's assume `your_parameter_file` is a valid .json file. You can find examples inside the `parameters_files` folder. Run the commands below with your conda environment deactivated!
- To build your database: `bash wisp.sh "wisp.py parameters_files/your_parameter_file -b"`
- To run predictions: `bash wisp.sh "wisp.py parameters_files/your_parameter_file"`
You can use managers such as Slurm to create background jobs on a cluster (ask your administrator whether Slurm is configured); commands will look like `sbatch wisp.sh "wisp.py parameters_files/your_parameter_file -b"`.
Commands are the same; however, you need to use `python wisp.py parameters_files/your_parameter_file -b` on its own (remove the call to the bash script, keep the Python part) and you're good to go!
Database creation and prediction alone won't cover all use cases; many other scripts are available. Every command in this list can be used with any of the methods described above: replace the `python` expression with the new one and you're good to go!
- `utilities.py format_tool path/to/reference/genomes` will rename freshly retrieved genomes inside `path/to/reference/genomes` according to their classification
- `utilities.py aggregate path/to/wisp/results path/to/output/txt/file` will gather information from all reads of a metagenomic sample output inside `path/to/wisp/results` into a single file saved at `path/to/output/txt/file`
- `utilities.py compare_metagenomic path/to/output path/to/reference/percentages/json path/to/aggregate/file_1 path/to/aggregate/file_2 ...` will plot the distribution of reads for metagenomic samples, saving the plot in `path/to/output`
- `utilities.py database_features path/to/database output/path/for/figures` will plot the features used by XGBoost models from a database folder `path/to/database` as surfaces (outputs will be in `output/path/for/figures`)
- `utilities.py kmers_signatures path/to/reference/genomes output/path/for/figures` will plot kmer signature variations for a set of genomes located in `path/to/reference/genomes` and output them to `output/path/for/figures`
- `utilities.py compare_outputs ["path/to/output/1","path/to/output/2", ...]` will aggregate, in a .csv file, results from the different folders in the list `["path/to/output/1","path/to/output/2", ...]`
- `utilities.py clean_rename path/to/reference/genomes` will remove conflicting characters from the names of genomes located in `path/to/reference/genomes`
- `utilities.py summary_to_dl path/to/summary_assembly.txt path/to/reference/genomes` will download the set of genomes described by the file `path/to/summary_assembly.txt` into the specified folder `path/to/reference/genomes`
- `utilities.py destroy_sequence path/to/reference/genomes output/path/for/genomes 0.06` will add 6% of random sequencing errors to the reference genomes inside `path/to/reference/genomes` and write the files to `output/path/for/genomes`
- `utilities.py mock_dataset path/to/reference/genomes output/path/for/dataset` will parse the genomes in `path/to/reference/genomes` and write a mock metagenomic sample file to `output/path/for/dataset`
- `utilities.py clean_minion path/to/minion/genomes` will extract sequences from `path/to/minion/genomes` matching the length condition and erase the headers
- `utilities.py extract_genomes 3 path/to/reference/genomes output/path/for/genomes` will extract all genomes from `path/to/reference/genomes` having at least 3 relatives at family level in the reference genomes directory into the output path `output/path/for/genomes`
- `wisp_lib/parameters_init.py your_parameter_file` will create a parameters file named `your_parameter_file` with default parameters (you need to edit those according to your own directory setup!)
- `wisp.py build parameters_files/your_parameter_file` will construct the training and test sets with the parameters defined in the parameters file
- `wisp.py model parameters_files/your_parameter_file` will create models from the train sets with the parameters defined in the parameters file
- `wisp.py predict parameters_files/your_parameter_file` will bin samples against the models with the parameters defined in the parameters file
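The three `wisp.py` sub-commands above form a pipeline (build, then model, then predict). A minimal driver sketch, assuming `wisp.py` and your parameters file exist at these placeholder paths:

```python
import subprocess

PARAMS = "parameters_files/your_parameter_file"  # placeholder path

# Run the three WISP phases in order; each one feeds the next.
commands = [
    ["python", "wisp.py", phase, PARAMS]
    for phase in ("build", "model", "predict")
]
for cmd in commands:
    print("would run:", " ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually execute
```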
The structure of the file is simple: it is a Python dictionary, which will be saved to a .json file. It contains all path names and algorithm parameters.
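As a concrete illustration, such a dictionary can be dumped to .json with the standard library (the keys shown are a small, illustrative subset of the full listing below):

```python
import json

# Illustrative subset of a WISP parameters dict; values mirror the defaults below
params = {
    "window_size": 10000,
    "sampling_objective": 500,
    "threshold": 0.10,
    "db_name": "dbname",
}

with open("your_parameter_file.json", "w") as handle:
    json.dump(params, handle, indent=4)

# Reading it back yields the same dictionary
with open("your_parameter_file.json") as handle:
    assert json.load(handle) == params
```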
'window_size': 10000 # Size of the window we use; all sequences below this threshold are discarded
'sampling_objective': 500 # Number of reads we randomly sample from each unknown sample
# Next up are the parameters for your database [kmer_size, subsampling_depth, pattern]
# One line per classification level. merged lets you jump to a given level without the previous classification steps
# kmer_size is the length of the sliding window
# subsampling_depth is the number of reads we sample, with maximum coverage as the objective, from each of our reference genomes
# pattern allows for spaced seeds: 1 means 'keep the nucleotide at this position' and 0 means 'discard it'
# the number of 1's in pattern must be equal to kmer_size
'domain_ref': [5, 50, [1, 1, 1, 1, 1]]
'phylum_ref': [5, 100, [1, 1, 1, 1, 1]]
'group_ref': [4, 100, [1, 1, 1, 1]]
'order_ref': [4, 100, [1, 1, 1, 1]]
'family_ref': [4, 100, [1, 1, 1, 1]]
'merged_ref': [4, 50, [1, 1, 1, 1]]
# Next are the parameters for your unknown sample [kmer_size, pattern]
# You need to match those parameters with the ones from the database you're using
'domain_sample': [5, [1, 1, 1, 1, 1]]
'phylum_sample': [5, [1, 1, 1, 1, 1]]
'group_sample': [4, [1, 1, 1, 1]]
'order_sample': [4, [1, 1, 1, 1]]
'family_sample': [4, [1, 1, 1, 1]]
'merged_sample': [4, [1, 1, 1, 1]]
# location of reference genomes
'input_train': "path/to/directory/train/"
# location of unknown samples
'input_unk': "path/to/directory/unk/"
# where to save the database
'database_output': "path/to/directory/data/"
# where to save the output files
'reports_output': "path/to/directory/output/jobname/"
# parameters for exploration and the algorithm (please refer to the report for details on what they mean)
# limit to consider a tree branch as pertinent
'threshold': 0.10
# number of XGBoost boosting rounds
'nb_boosts': 10
# max depth of XGBoost trees
'tree_depth': 10
# level of verbosity for output
'test_mode': 'no_test' # 'no_test', 'min_set', 'verbose'
# parameter for read selection: significance threshold for softprob
'reads_th': 0.1
'selection_mode': 'delta_mean' # 'min_max','delta_mean','delta_sum'
# force rebuilding the full model when you re-use a database but have changed model parameters
'force_model_rebuild': False
# tells the software whether it should consider both orientations, or assume reads are 3' -> 5' and compute canonical kmers (disabled)
'single_way': True
# level we're stopping prediction at
'targeted_level': 'family' # domain, phylum, group, order, family
# levels we consider during classification
'levels_list': ['domain', 'phylum', 'group', 'order', 'family']
# (disabled parameter)
'abundance_threshold': 0.25
# Email identity used to fetch WISP genomes from RefSeq
'email': 'XXXXXXXX.XXXXXXX@XXXXX.XX'
# Path to download genomes
'annotate_path': 'path/to/genomes/for/something/'
# Path to assembly_summary file (NCBI ftp format)
'accession_numbers': 'path/to/assembly_summary.txt'
# Name for database
'db_name': "dbname"
# Prefix for identifying job in output folder
'prefix_job': "jobprefix"
# Name of log file (stored inside logs folder)
'log_file': "name_for_log_file"
WISP: bacterial family identification from long reads, machine learning with XGBoost