4. Developer documentation
Siegfried edited this page Jul 26, 2022 · 6 revisions
The five classes inside this folder contain support functions that are not specific to WISP.
- `my_checker` checks types and compares them to function signatures. Usage of its two decorators is described inside the class.
- `my_coros` contains the coroutine method; this class aims to gather multi-threading tools.
- `my_fasta` loads and processes .fna/.fastq/.fasta files. It uses the `biopython` package; however, this class is poorly optimized as of now.
- `my_logs` contains the logging facilities for this project.
- `my_maths` contains mathematical tools. It is currently unused, but stays as a placeholder for future expansions.

To help identify which package a function comes from, all functions inside this folder are prefixed by `my_`.
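To illustrate the kind of check a type-checking decorator can perform, here is a minimal sketch. It is not the code of `my_checker`: the names `check_types` and `my_scale` are hypothetical, and the approach (comparing bound arguments with the annotations of the function) is only one plausible way to do it.

```python
import inspect
from functools import wraps
from typing import get_type_hints


def check_types(func):
    """Hypothetical decorator: compare call arguments with the annotations of func."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @wraps(func)
    def wrapper(*args, **kwargs):
        # Map positional and keyword arguments to parameter names.
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only annotations that are plain classes are checked in this sketch.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{func.__name__}: '{name}' expects {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@check_types
def my_scale(value: float, factor: int) -> float:
    """Toy function used only to demonstrate the decorator."""
    return value * factor
```

With this sketch, `my_scale(2.0, 3)` runs normally, while `my_scale(2.0, "3")` raises a `TypeError` before the function body is executed.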
All representation facilities are condensed inside this package.
- `mass_analysis` contains functions to compare multiple outputs.
- `plotters` plots models and databases (model interpretation).
- `tex_report` contains tools to create a LaTeX report.
- `tree_rendering` creates trees with the graphviz engine; it is disabled as of now, as graphviz does not run on the cluster I had access to.
- `visualisation_tool` plots results of predictions and validation of models (output interpretation).

These support functions, however, are specific to WISP.
- `data_manipulation` contains all functions to create, edit, and check the existence of files used by WISP.
- `kmers_coders` contains all functions that encode and decode k-mers and split reads into subreads.
- `parameters_init` creates a parameter file.
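
As an illustration of what encoding, decoding, and subread splitting can look like, here is a minimal sketch. It does not reproduce `kmers_coders`: the 2-bit-per-nucleotide scheme, the non-overlapping split, and all function names below are assumptions chosen for the example.

```python
NUCLEOTIDES = "ACGT"
CODES = {base: index for index, base in enumerate(NUCLEOTIDES)}


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer, 2 bits per nucleotide (assumed scheme)."""
    code = 0
    for base in kmer:
        # A KeyError is raised for any symbol outside A, C, G, T (e.g. N).
        code = (code << 2) | CODES[base]
    return code


def decode_kmer(code: int, k: int) -> str:
    """Inverse of encode_kmer; k is needed because leading 'A's encode as zeros."""
    bases = []
    for _ in range(k):
        bases.append(NUCLEOTIDES[code & 3])
        code >>= 2
    return "".join(reversed(bases))


def split_subreads(read: str, size: int) -> list:
    """Cut a read into consecutive, non-overlapping subreads of a fixed size.

    A trailing fragment shorter than `size` is dropped.
    """
    return [read[i:i + size] for i in range(0, len(read) - size + 1, size)]
```

For example, `decode_kmer(encode_kmer("ACGT"), 4)` gives back `"ACGT"`, and `split_subreads("ACGTACGTAC", 4)` keeps the two complete subreads and drops the trailing `"AC"`.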
- `build_softprob` contains methods to build models.
- `predictors` performs predictions and the computations on their results.
- `sample_class` contains database creation utilities.
- `utilities` is a call script for most of the support functions.
- `wisp_build` is a loop to build databases and models.
- `wisp_predict` is a loop to predict samples.
- `wisp` is a thread manager (the interface we call) for prediction and build.
- `_version` and `versioneer` handle version management, to track the program's current version and display it in reports and such.
- `setup` and `MANIFEST` are about creating the package (future PyPI distribution intended).
- `Dockerfile` is an attempt to create a Docker environment.
- `processing` is about testing leave-one-out (see article).
- `wisp_pipeline` executes the full pipeline over a set of files.
- `env` creates the conda environment and directories.

In this section, I want to emphasize some thoughts my director and I had concerning the future of the software:
- Accepting raw `.fast5` files as inputs to lower the weight of sequencing errors. Regarding inputs, reworking the MinION input function will be mandatory in the near future; as of now, all basecalling information is discarded.
- Shifting from hierarchical classification to a graph-like one, by treating classification as a 3-dimensional proximity graph in which nodes in the same z-axis plane belong to the same taxonomic level. This way, we could build on composition proximity at each level to draw distances, and use custom weights to help with interpretation.
- Incorporating ORI-like ASP to help with read binning. ORI seeks to minimize the number of strains that account for the maximum number of reads. Here, the implementation would follow a similar approach, but focused on the family level.
- Reworking the parameter file system, as it is quite clunky. The call functions are also quite tedious to use for non-programmers, and would benefit from shorter commands.
- Adding a weight based on read size when aggregating results, so that small genome parts shared between taxa count less than large, well-assigned parts.
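
The last idea, weighting by read size, could be sketched as follows. This is only an illustration under stated assumptions: it does not describe how WISP currently aggregates results, the function name `aggregate_by_length` is hypothetical, the weight is simply the read length, and the family names and lengths are arbitrary example values.

```python
from collections import defaultdict


def aggregate_by_length(predictions):
    """Turn per-read predictions into length-weighted family scores.

    predictions: iterable of (family, read_length) pairs, one per read.
    Returns a dict mapping each family to its share of the total read length.
    """
    weights = defaultdict(int)
    for family, read_length in predictions:
        weights[family] += read_length
    total = sum(weights.values())
    if total == 0:
        return {}
    return {family: weight / total for family, weight in weights.items()}


# Arbitrary example: one long read against two short ones.
reads = [("Enterobacteriaceae", 12000), ("Bacillaceae", 500), ("Bacillaceae", 700)]
scores = aggregate_by_length(reads)
best = max(scores, key=scores.get)
```

In this example an unweighted majority vote would pick the family supported by the two short reads, whereas the length-weighted score favours the family supported by the single long read.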
WISP : Bacterial families identification from long reads, machine learning with XGBoost