This repo contains code to analyse the sequence of reads from the MinION sequencer in real time. A pipeline creates a consensus for one gene and outputs both the quality of the consensus and the result of the detection. Two pipelines were implemented: the naive pipeline and the best-x pipeline.
The report associated with the code can be found in the root directory of the repo (called "report.pdf").
To launch the pipeline, go to the root directory of the repo and execute the pipeline:
foo@bar:~$ python3 <pathToPipelineFile>
where <pathToPipelineFile> is either src/pipelineBestSequence.py (for the best-x pipeline) or src/pipelineNaïve.py (for the naive pipeline).
For instance:
foo@bar:~$ python3 src/pipelineBestSequence.py
Each pipeline file defines many default parameters at the top, which you must change depending on your use case.
The pipeline can be used in two contexts. The first is real-time sequencing. In that case, you must first launch MinKNOW and the sequencer, so that you can see where the new fastq files will be created. Then copy that path and enter it as a parameter at the top of the pipeline file. The pipeline will then wait for new files to arrive in that directory and will take ALL of the new files when it starts a new iteration.
The second context is testing, where the output of the sequencer is simulated with a script (explained below). In that case, each iteration takes only 1 new fastq file of reads, the one corresponding to the iteration number.
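For the real-time context described above, the "wait for new files, then take all of them" behaviour can be sketched as follows. This is only an illustration, not the code of the pipelines: the function names, the `.fastq` suffix filter and the polling interval are assumptions.

```python
import time
from pathlib import Path


def collect_new_fastq(directory, already_seen):
    """Return the fastq files in `directory` that were not seen before.

    `already_seen` is a set of paths that is updated in place, so a later
    call only returns files that arrived in the meantime.
    """
    current = {p for p in Path(directory).iterdir() if p.suffix == ".fastq"}
    new_files = sorted(current - already_seen)
    already_seen.update(new_files)
    return new_files


def wait_for_new_fastq(directory, already_seen, poll_seconds=10):
    """Block until at least one new fastq file exists, then return ALL new ones."""
    while True:
        new_files = collect_new_fastq(directory, already_seen)
        if new_files:
            return new_files
        time.sleep(poll_seconds)  # hypothetical polling interval
```

One iteration would then call `wait_for_new_fastq(path, seen)` with the directory copied from MinKNOW and a `seen` set kept across iterations.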
When doing a simulation, the pipelines will take the reads that are in the folder "fastqpass". The pipelines will take each file as one "iteration" of reads coming from the sequencer. In order to populate that folder, the following script can be used:
foo@bar:~$ python3 src/simulateRealTimeOutput.py <pathToFastqFile> <minutesBeforeNextFile>
where <minutesBeforeNextFile> must be a non-negative integer giving the number of minutes the script sleeps before creating the next fastq file. This simulates the sequencer outputting reads at a regular time interval. If 0 is supplied, the script creates all the files immediately.
For instance:
foo@bar:~$ python3 src/simulateRealTimeOutput.py allData/Allium_Ursinum_ITS.fastq 0
The script can be used in the background to simulate the sequencer creating a fastq file every minute.
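To give an idea of what such a simulation involves, here is a minimal sketch that splits one fastq file into numbered files and optionally sleeps between them. It is not the code of simulateRealTimeOutput.py: the function name, the `reads_per_file` parameter and the `iteration_<n>.fastq` file names are hypothetical, and it assumes each read occupies exactly 4 lines.

```python
import time
from itertools import islice
from pathlib import Path


def split_fastq(source, out_dir, reads_per_file=100, minutes_between_files=0):
    """Write `source` into `out_dir` as numbered chunks of whole reads."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines_per_file = 4 * reads_per_file  # assumes 4-line FASTQ records
    written = []
    with open(source) as handle:
        while True:
            chunk = list(islice(handle, lines_per_file))
            if not chunk:
                break
            # Sleep only between files; with 0 all files are created directly.
            if written and minutes_between_files > 0:
                time.sleep(60 * minutes_between_files)
            target = out_dir / f"iteration_{len(written)}.fastq"  # hypothetical name
            target.write_text("".join(chunk))
            written.append(target)
    return written
```

Calling `split_fastq("allData/Allium_Ursinum_ITS.fastq", "fastqpass")` would populate the folder in one go, in the same spirit as passing 0 to the real script.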
The normal use case to test the pipeline is to first call the simulateRealTimeOutput.py script to populate the fastqpass folder, and then launch a pipeline.
The repo contains 3 fastq files, stored in the allData folder. Three additional ones were created using the "dataDowngrader.py" script. These can be used for debugging purposes.
The pipelines write all their results to their respective output folders, outputPipelineBest and outputPipelineNaive.
The main output file, result.txt, contains the detection result of each iteration (and the consensus). The folder also contains other outputs, for instance a graph of the depth coverage.
stdout will show the progression of the pipeline.
The first program that the pipelines use is Medaka. Follow the installation instructions in its repo. The method that worked for me was using the conda channel. To test that the installation works, you should be able to execute:
foo@bar:~$ medaka_consensus -h
Another required program is SPOA. I included a binary in the src/ directory, but it may not work on your machine. In that case, follow the SPOA instructions to compile it from source and replace the binary of this repo (it must be placed in the src/ directory). You should be able to execute:
foo@bar:~$ ./src/spoa --version
Another program is Mosdepth. The same applies as for SPOA. You should be able to execute:
foo@bar:~$ ./src/mosdepth --version
The last program is blastn. Follow the instructions in the Identification repo of Genorobotics. You must be able to execute:
foo@bar:~$ blastn -version
In addition, you need 4 databases, one for each gene mentioned above. Their names must be exactly the same as the geneName used as parameter for the pipeline. For instance, you should be able to execute:
foo@bar:~$ blastdbcmd -db rbcL -info
For the Python packages, you can use a conda environment with the environment.yml file provided.
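The installation checks above can also be gathered in a small helper that reports which tools are missing before a pipeline is launched. This is a sketch and not part of the repo: the function name is hypothetical, and it only checks that the commands and binaries exist and are executable, not that they run correctly or that the BLAST databases are present.

```python
import os
import shutil
from pathlib import Path

# Commands expected on PATH and binaries expected in src/, as listed above.
PATH_COMMANDS = ["medaka_consensus", "blastn", "blastdbcmd"]
LOCAL_BINARIES = ["src/spoa", "src/mosdepth"]


def find_missing_tools(repo_root=".", which=shutil.which):
    """Return the names of required tools that cannot be found."""
    missing = [cmd for cmd in PATH_COMMANDS if which(cmd) is None]
    for relative in LOCAL_BINARIES:
        path = Path(repo_root) / relative
        # The binary must exist and be executable by the current user.
        if not (path.is_file() and os.access(path, os.X_OK)):
            missing.append(relative)
    return missing


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

Run it from the root directory of the repo, inside the conda environment, so that the relative src/ paths and the PATH lookups match what the pipelines will see.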
Vladimir Hanin (vladimir.hanin@outlook.com)