GetCodingSequences: a coding/non-coding sequence extractor for use with GenBank files.

Oliver Bonham-Carter, Allegheny College


Figure 1. GCS stands for Get Coding Sequences. Genetic Music: Use your ears to study DNA!!

Description

Bioinformatics tools often take sequences as their input. This program, GCS, creates FASTA files of the coding sequences (those that produce protein) found in a GenBank file. It also outputs the non-coding sequences (those that produce no known protein) from the same GenBank file. These sequences can then be used for research or to test new tools.

Figure 2. In a GenBank file, there are location references for the coding regions.

Mechanism

GCS locates the coding sequences in a GenBank file by reading their location references in the record, as shown in Figure 2. It then uses these start and end markers to pull out the actual sequences and places this sequence data into FASTA files. The non-coding regions are found by removing the coding regions from the main sequence: whatever remains once all coding material has been removed is treated as the non-coding region. Sequences are then extracted from this body of non-coding genetic material.

    numOfSeqs_int = 20
    maxSize_int = 400

Note: as shown above, the size of the extracted sequences is set to 400 base pairs (maxSize_int). This value, along with the number of sequences to produce (numOfSeqs_int), may be customized in main.py.
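The mechanism described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code in main.py: the function names, the regular expression, and the random sampling strategy are all assumptions made for the example. It handles only simple `start..end` CDS locations (GenBank coordinates are 1-based and inclusive) and skips forms such as `complement(...)` or `join(...)`.

```python
import random
import re

# Hypothetical helpers for illustration; none of these names come from GCS.
# Matches feature lines such as "     CDS             190..255".
CDS_LINE = re.compile(r"^[ \t]+CDS[ \t]+(\d+)\.\.(\d+)[ \t]*$", re.MULTILINE)


def find_cds_locations(genbank_text):
    """Return (start, end) pairs for simple CDS features (1-based, inclusive)."""
    return [(int(s), int(e)) for s, e in CDS_LINE.findall(genbank_text)]


def extract_coding(sequence, locations):
    """Slice each coding region out of the main sequence."""
    return {(s, e): sequence[s - 1:e] for s, e in locations}


def remove_coding(sequence, locations):
    """Return what is left of the sequence once every coding region is cut out."""
    kept = []
    cursor = 0  # 0-based index of the first base not yet processed
    for s, e in sorted(locations):
        if s - 1 > cursor:
            kept.append(sequence[cursor:s - 1])
        cursor = max(cursor, e)  # tolerate overlapping regions
    kept.append(sequence[cursor:])
    return "".join(kept)


def sample_noncoding(noncoding, num_of_seqs, max_size, seed=None):
    """Pick num_of_seqs windows of at most max_size bases at random positions."""
    rng = random.Random(seed)
    last_start = max(len(noncoding) - max_size, 0)
    samples = []
    for _ in range(num_of_seqs):
        start = rng.randint(0, last_start)
        samples.append(noncoding[start:start + max_size])
    return samples
```

Here `num_of_seqs` and `max_size` play the roles of `numOfSeqs_int` and `maxSize_int`. Passing a `seed` makes the otherwise arbitrary selection repeatable, which can help when the output is used to test other tools.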

Running the code

You must first install Poetry to manage the code's dependencies, and to run the program.

* Set up with Poetry:
    + poetry install
* Find online help:
    + poetry run gcs --bighelp
* Produce reduced-size sequences from a GenBank file:
    + poetry run gcs --data-file data/df.gb
* Produce full-size sequences from a GenBank file:
    + poetry run gcs --data-file data/df.gb --fullseqs

OUTPUT: All output files are saved in the directory `0_out/`.

  • Coding files (C_startLocation-endLocation.fasta) are named according to their locations as detailed in the GenBank file.
  • The non-coding sequences (nC_0.fasta) are arbitrarily selected from the DNA text string after all the coding material has been removed from the main sequence.
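The naming scheme above can be illustrated with a short Python sketch. It is written under stated assumptions and is not the GCS implementation: the helper names are hypothetical, the FASTA header is assumed to reuse the file name without its extension, the 70-character line width is a common convention rather than something the project specifies, and non-coding files are assumed to be numbered from 0 upward.

```python
from pathlib import Path


def coding_file_name(start, end):
    """File name for a coding sequence, built from its GenBank location."""
    return f"C_{start}-{end}.fasta"


def noncoding_file_name(index):
    """File name for a non-coding sequence (assumed numbering: 0, 1, 2, ...)."""
    return f"nC_{index}.fasta"


def format_fasta(header, sequence, width=70):
    """Return one FASTA record: a '>' header line plus wrapped sequence lines."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{header}", *chunks]) + "\n"


def write_outputs(coding, noncoding, out_dir="0_out"):
    """Write one FASTA file per sequence and return the paths written.

    coding maps (start, end) to a sequence; noncoding is a list of sequences.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for (start, end), seq in coding.items():
        path = out_dir / coding_file_name(start, end)
        path.write_text(format_fasta(path.stem, seq))
        written.append(path)
    for index, seq in enumerate(noncoding):
        path = out_dir / noncoding_file_name(index)
        path.write_text(format_fasta(path.stem, seq))
        written.append(path)
    return written
```

Keeping the location in the file name means each coding file can be traced back to its feature in the GenBank record without opening it.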

Future Work

This program is used primarily to obtain DNA sequence data. One of the main reasons to produce genetic data as sequence files is to provide data for another excellent project: Genmus, which converts DNA FASTA sequences into piano music.

This is also a work in progress. If you see any way to improve it, please let me know, or make the improvement in the code yourself via a pull request. I would be very grateful for any productive input that you may have.