GitHub - GenoRobotics-EPFL/Vladimir-project

Summary

This repo contains code to analyse the sequence of reads from the MinION sequencer in real time. A pipeline creates a consensus for one gene and outputs the quality of the consensus and the result of the detection. Two pipelines were implemented: the naive pipeline and the best-x pipeline.

The report associated with the code can be found in the root directory of the repo (called "report.pdf").

Usage

To launch the pipeline, go to the root directory of the repo and execute the pipeline:

foo@bar:~$ python3 <pathToPipelineFile>

where <pathToPipelineFile> is either src/pipelineBestSequence.py (for the best-x pipeline) or src/pipelineNaïve.py (for the naive pipeline). For instance:

foo@bar:~$ python3 src/pipelineBestSequence.py

Each pipeline file defines many default parameters at the top, which you must change depending on your use case.

The pipeline can be used in two contexts. The first is real-time sequencing. In that case, you must first launch MinKNOW and the sequencer, so that you can see where the new fastq files will be created. Then copy that path and enter it as a parameter at the top of the pipeline file. The pipeline will then wait for new files to arrive in that directory and will take ALL of the new files when it starts a new iteration. The second context is testing, where the output of the sequencer is simulated with a script (explained below). In that case, each iteration takes only 1 new fastq file of reads, the one corresponding to the iteration number.
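To make the real-time behaviour concrete, here is a minimal sketch of how a directory could be polled for new fastq files, with every new file taken at once. It is not the code from the pipeline files: the function names, the polling interval, and the "*.fastq" pattern are assumptions made for illustration (your MinKNOW setup may, for example, write compressed files instead).

```python
import time
from pathlib import Path


def collect_new_fastq(directory, seen):
    """Return the fastq files in `directory` not yet in `seen`, and mark them as seen.

    `seen` is a set of file names handled in previous iterations (hypothetical bookkeeping).
    """
    new_files = sorted(
        p for p in Path(directory).glob("*.fastq") if p.name not in seen
    )
    seen.update(p.name for p in new_files)
    return new_files


def wait_for_new_fastq(directory, seen, poll_seconds=5.0, timeout_seconds=None):
    """Block until at least one new fastq file appears, then return ALL new files.

    Returns an empty list if `timeout_seconds` elapses first.
    """
    start = time.monotonic()
    while True:
        new_files = collect_new_fastq(directory, seen)
        if new_files:
            return new_files
        if timeout_seconds is not None and time.monotonic() - start >= timeout_seconds:
            return []
        time.sleep(poll_seconds)
```

Each pipeline iteration would then call wait_for_new_fastq with the path copied from MinKNOW and process whatever list it returns.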

When doing a simulation, the pipelines will take the reads that are in the folder "fastqpass". The pipelines will take each file as one "iteration" of reads coming from the sequencer. In order to populate that folder, the following script can be used:

foo@bar:~$ python3 src/simulateRealTimeOutput.py <pathToFastqFile> <minutesBeforeNextFile>

where <minutesBeforeNextFile> must be a non-negative integer that represents the number of minutes the script will sleep before creating the next fastq file. This is done so that we can simulate the sequencer outputting reads at a regular time interval. If 0 is supplied, the script creates all the files immediately. For instance:

foo@bar:~$ python3 src/simulateRealTimeOutput.py allData/Allium_Ursinum_ITS.fastq 0

The script can be used in the background to simulate the sequencer creating a fastq file every minute.
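As an illustration of the idea behind the simulation, the sketch below splits one fastq file into several smaller files, optionally sleeping between them. It is not the content of simulateRealTimeOutput.py: the function name, the reads_per_file parameter, and the output file names are hypothetical, and each read is assumed to occupy exactly 4 lines.

```python
import time
from itertools import islice
from pathlib import Path


def split_fastq(source, out_dir, reads_per_file=100, minutes_between_files=0):
    """Write `source` into `out_dir` as consecutive chunk files and return their paths.

    Assumes the standard 4-line fastq record layout (header, sequence, '+', qualities).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(source) as handle:
        while True:
            # Read the next group of reads (4 lines per read).
            chunk = list(islice(handle, 4 * reads_per_file))
            if not chunk:
                break
            # Sleep before every file except the first; 0 means no waiting at all.
            if written and minutes_between_files > 0:
                time.sleep(60 * minutes_between_files)
            target = out_dir / f"chunk_{len(written)}.fastq"
            target.write_text("".join(chunk))
            written.append(target)
    return written
```

With minutes_between_files=0 all files are created immediately, which mirrors the behaviour described for an argument of 0.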

The normal use case to test the pipeline is to first call the simulateRealTimeOutput.py script to populate the fastqpass folder, and then launch a pipeline.

Data

The repo contains 3 fastq files, stored in the allData folder. Three more were then created using the "dataDowngrader.py" script. Those can be used for debugging purposes.

Output

The pipelines write all their results to their respective output folders, outputPipelineBest and outputPipelineNaive. The main output file, result.txt, contains the detection result of each iteration (and the consensus). The folder also contains other outputs, for instance graphs of the depth coverage.

stdout will show the progression of the pipeline.

Installation

The first program that the pipelines use is Medaka. Follow the installation instructions on the Medaka repo. The method that worked for me was using the conda channel. To test that the installation works, you should be able to execute:

foo@bar:~$ medaka_consensus -h

Another required program is SPOA. I included a binary in the src/ directory, but it may not work on your machine. In that case, follow the SPOA instructions to compile it from source and replace the binary in this repo (it must be placed in the src/ directory). You should be able to execute:

foo@bar:~$ ./src/spoa --version

Another required program is Mosdepth. The same applies as for SPOA. You should be able to execute:

foo@bar:~$ ./src/mosdepth --version

The last program is blastn. Follow the instructions in the Identification repo of GenoRobotics. You must be able to execute:

foo@bar:~$ blastn -version

In addition, you need 4 databases, one for each gene. Their names must be exactly the same as the geneName used as parameter for the pipeline. For instance, you should be able to execute:

foo@bar:~$ blastdbcmd -db rbcL -info

For the Python packages, you can use a conda environment created from the provided environment.yml file.
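The individual checks above can also be bundled into one small script. The following is only a sketch: the function name and its parameters are hypothetical, and it only checks that each program can be found and is executable, not that it runs correctly or that the BLAST databases exist.

```python
import os
import shutil
from pathlib import Path


def missing_tools(
    path_tools=("medaka_consensus", "blastn", "blastdbcmd"),
    local_binaries=("src/spoa", "src/mosdepth"),
    repo_root=".",
):
    """Return the names of the required programs that could not be found.

    `path_tools` are looked up on PATH; `local_binaries` are paths relative to
    `repo_root` that must exist and be executable.
    """
    missing = [tool for tool in path_tools if shutil.which(tool) is None]
    for relative_path in local_binaries:
        candidate = Path(repo_root) / relative_path
        if not (candidate.is_file() and os.access(candidate, os.X_OK)):
            missing.append(relative_path)
    return missing


if __name__ == "__main__":
    # Run from the root directory of the repo.
    problems = missing_tools()
    if problems:
        print("Missing or not executable:", ", ".join(problems))
    else:
        print("All required programs were found.")
```

An empty result only means the programs are present; the version commands shown above are still the way to confirm that each one actually starts.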

Author

Vladimir Hanin (vladimir.hanin@outlook.com)
