
4. Developer documentation

Siegfried edited this page Jul 26, 2022 · 6 revisions

Classes organization

Support tools: package wisp_tools

The five classes inside this folder contain support functions that are not specific to WISP.

  • my_checker checks argument types and compares them to the function signature. Usage of its two decorators is described inside the class.
  • my_coros contains the coroutine method; this class is meant to gather multi-threading tools.
  • my_fasta loads and processes .fna/.fastq/.fasta files. It relies on the Biopython package, but is poorly optimized for now.
  • my_logs contains the logging facilities for this project.
  • my_maths contains mathematical tools. It is currently unused, but stays as a placeholder for future expansions.

To make their package of origin easy to identify, all functions inside this folder are prefixed with my_.
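As an illustration of the kind of check my_checker performs, here is a minimal sketch of a decorator that compares call arguments to the annotations in a function signature. It is not the WISP implementation: the names check_types and my_repeat are hypothetical, and only plain classes such as str or int are verified.

```python
import functools
import inspect
from typing import get_type_hints


def check_types(func):
    """Hypothetical decorator: compare call arguments to the annotations of func."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Map positional and keyword arguments to parameter names.
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes are checked; other annotations are ignored in this sketch.
            if type(expected) is type and not isinstance(value, expected):
                raise TypeError(
                    f"{func.__name__}: argument '{name}' should be "
                    f"{expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@check_types
def my_repeat(pattern: str, times: int) -> str:
    """Illustrative function following the my_ prefix convention."""
    return pattern * times
```

With this sketch, my_repeat("ab", 2) runs normally, while my_repeat("ab", "2") raises a TypeError before the function body is executed.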

Plotting facilities: package wisp_view

All representation facilities are condensed inside this package.

  • mass_analysis contains functions to compare multiple outputs.
  • plotters is about plotting models and databases (model interpretation).
  • tex_report contains tools to create a LaTeX report.
  • tree_rendering creates trees with the Graphviz engine; it is disabled for now, as Graphviz does not run on the cluster I had access to.
  • visualisation_tool plots the results of predictions and of model validation (output interpretation).

Encoding kmers and manipulating data: package wisp_lib

Unlike those of wisp_tools, these support functions are specific to WISP.

  • data_manipulation contains all functions to create, edit, and check existence of files used by WISP.
  • kmers_coders contains all functions that encode and decode kmers, and split subreads.
  • parameters_init is about creating a parameter file.
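To give an idea of what encoding kmers and splitting subreads can look like, here is a minimal sketch. The actual scheme used by kmers_coders is not described on this page, so the 2-bit-per-base encoding and the function names encode_kmer, decode_kmer and split_subreads are assumptions made for illustration only.

```python
from typing import List

BASES = "ACGT"
# Each base maps to a 2-bit code: A=0, C=1, G=2, T=3 (assumed scheme).
CODE = {base: index for index, base in enumerate(BASES)}


def encode_kmer(kmer: str) -> int:
    """Pack a kmer into an integer, two bits per base."""
    value = 0
    for base in kmer.upper():
        value = (value << 2) | CODE[base]
    return value


def decode_kmer(value: int, k: int) -> str:
    """Rebuild the kmer of length k from its integer code."""
    bases = []
    for _ in range(k):
        bases.append(BASES[value & 3])  # read the two lowest bits
        value >>= 2
    return "".join(reversed(bases))


def split_subreads(read: str, size: int) -> List[str]:
    """Cut a read into consecutive, non-overlapping subreads of a given size."""
    return [read[i:i + size] for i in range(0, len(read) - size + 1, size)]
```

The length k has to be passed to decode_kmer because leading A bases are encoded as zeros and would otherwise be lost. Bases outside A, C, G, T raise a KeyError in this sketch.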

Other modules: main wisp package

WISP core utilities

  • build_softprob contains methods to build models.
  • predictors performs the predictions and the computations made on them.
  • sample_class contains database creation utilities.
  • utilities is a call script for most of the support functions.
  • wisp_build is a loop to build databases and models.
  • wisp_predict is a loop to predict samples.
  • wisp is a thread manager (interface we call) for prediction and build.

Other files

  • _version and versioneer handle version management, to track the current version of the program and display it in reports and similar outputs.
  • setup and MANIFEST are used to create the package (a future PyPI distribution is intended).
  • Dockerfile is an attempt to create a Docker environment.
  • processing is about testing leave-one-out (see article).
  • wisp_pipeline is about executing full pipeline over a set of files.
  • env is about creating conda environment and directories.

Development roadmap

In this section, I want to highlight some thoughts my director and I had concerning the future of the software:

  • Accepting raw .fast5 files as input to reduce the impact of sequencing errors. On the input side, the MinION input function will have to be reworked in the near future; for now, all basecalling information is discarded.
  • Shifting from hierarchical classification to a graph-like one, by treating classification as a 3-dimensional proximity graph in which nodes lying in the same z-axis plane belong to the same taxonomic level. This way, we could build on composition proximity at each level to derive distances, and use custom weights to help with interpretation.
  • Incorporating ORI-like ASP to help read binning. ORI seeks to minimize the number of strains that account for the maximum number of reads. Here, the implementation would follow a similar approach, but focused on the family level.
  • Reworking the parameter file system, which is quite clunky. The call functions are also tedious to use for non-programmers, and would benefit from shorter commands.
  • Weighting reads by their size when aggregating results, so that small shared genome parts count for less than large, well-assigned parts.
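The last idea in this list can be sketched as follows. This is only a possible design, not existing WISP code: the function name aggregate_weighted, the linear weighting by read length, and the example taxa are all assumptions.

```python
from collections import defaultdict


def aggregate_weighted(predictions):
    """Aggregate per-read predictions, each read counting in proportion to its length.

    predictions: iterable of (taxon, read_length) pairs (hypothetical input format).
    Returns a dict mapping each taxon to its share of the total assigned bases.
    """
    scores = defaultdict(float)
    for taxon, read_length in predictions:
        scores[taxon] += read_length
    total = sum(scores.values())
    if total == 0:
        return {}
    return {taxon: score / total for taxon, score in scores.items()}


# Three short reads vote for one family, one long read for another.
reads = [
    ("Bacillaceae", 150),
    ("Bacillaceae", 200),
    ("Bacillaceae", 150),
    ("Enterobacteriaceae", 4500),
]
shares = aggregate_weighted(reads)
best = max(shares, key=shares.get)
```

With an unweighted majority vote, the three short reads would win. Here the single long read dominates, which is the behaviour the roadmap item aims for. A less aggressive weighting (for instance logarithmic in read length) would be an equally valid choice.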
