Automated Genome Feature Discovery is an s(CASP) program that searches a DNA sequence for promoters, which mark regions of transcription.
DNA sequences can be modeled as a string of characters consisting of 'A', 'T', 'C', and 'G'. Broadly speaking, a transcription region consists of a translation section, which is prefixed by a start codon (usually ATG) and suffixed by a stop codon (varies). Upstream of the translation section, there are typically one or more AT-rich sequences which aid in transcription.
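As a rough illustration of this model, the following Python sketch scans a DNA string for translation sections: an ATG start codon followed, in the same reading frame, by a stop codon. It assumes the stop codons of the standard genetic code (TAA, TAG, TGA). The function name `find_translation_sections` is hypothetical and is not part of this repository.

```python
# Assumption: stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_translation_sections(dna):
    """Return (start, end) index pairs for each ATG...stop section.

    `start` is the index of the 'A' in ATG; `end` is one past the stop codon.
    Only the first in-frame stop codon after each start codon is used.
    """
    dna = dna.upper()
    sections = []
    pos = dna.find(START_CODON)
    while pos != -1:
        # Walk codon by codon in the reading frame set by the start codon.
        for i in range(pos + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                sections.append((pos, i + 3))
                break
        pos = dna.find(START_CODON, pos + 1)
    return sections


print(find_translation_sections("CCATGAAATAGCC"))  # [(2, 11)]
```

Note that the out-of-frame TGA at index 3 is ignored, because only codons aligned with the start codon are checked.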
Depicted below is a model transcription region where the character 'X' represents a wildcard and 'Y' represents the payload. In this example, there are two AT-rich sequences: one (the Pribnow box) appearing 10 characters before the translation section and another appearing 35 characters before the translation section. This structure is typical for bacteria.
The AT-rich sequences rarely appear exactly as shown in the model. The model shows the most likely character for each position, but in reality each position follows an observed probability distribution.
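One common way to capture per-position distributions like these is a position weight matrix. The sketch below builds one from a handful of aligned example sites and scores a candidate window against it. The example sites are made up for illustration (they are not observed data), and the function names are hypothetical rather than taken from promoter.pl.

```python
import math

BASES = "ACGT"

# Made-up aligned sites resembling the Pribnow box consensus TATAAT.
# Illustrative only, NOT observed frequencies.
EXAMPLE_SITES = ["TATAAT", "TATGAT", "TACAAT", "TATACT", "GATAAT"]


def build_pwm(sites, pseudocount=1.0):
    """Return a list of {base: probability} dicts, one per position."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        # The pseudocount keeps unseen bases from getting probability zero.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm


def log_odds_score(window, pwm, background=0.25):
    """Sum of log2(p / background) over positions; higher is a better match."""
    return sum(
        math.log2(pwm[i][base] / background) for i, base in enumerate(window)
    )
```

With a matrix built from `EXAMPLE_SITES`, a window such as TATAAT scores higher than GCGCGC, which is the behavior a promoter search would rely on when ranking candidate positions.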
git clone https://github.com/Emiller88/Comet-Galaxy.git
make
- Clone the git repository to your local machine.
git clone https://github.com/Emiller88/Comet-Galaxy.git
- Run the data_to_list.py script to generate input for the s(CASP) program. (Note: You can supply your own genome by replacing raw_input2.txt with the file name of any local text file.)
python data_to_list.py raw_input2.txt
- Run the program!
scasp promoter.pl
Alternatively, download an example file from https://github.com/nf-core/modules/raw/master/tests/data/fasta/E_coli/NC_010473.fa and run the following instead.
python data_to_list.py NC_010473.fa
However, this is extremely slow because it is not optimized.
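For readers curious what a conversion step of this kind might involve, here is a minimal sketch. It is not the repository's data_to_list.py, whose actual output format is not shown here. It assumes a FASTA-style input (header lines starting with '>') and a hypothetical target format of a single Prolog-style fact, `dna([...]).`, holding lowercase atoms.

```python
def read_fasta_sequence(lines):
    """Concatenate sequence lines, skipping '>' header lines and blanks."""
    parts = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            parts.append(line.upper())
    return "".join(parts)


def to_prolog_list(sequence, predicate="dna"):
    """Format a sequence as a hypothetical fact such as dna([a,t,g])."""
    # Characters other than A, C, G, T (for example N) are dropped.
    atoms = ",".join(base.lower() for base in sequence if base in "ACGT")
    return f"{predicate}([{atoms}])."


sequence = read_fasta_sequence([">example header", "ATG", "", "cca"])
print(to_prolog_list(sequence))  # dna([a,t,g,c,c,a]).
```

Emitting one atom per base gives a program a list it can pattern-match on directly, though for a full genome that list becomes very long.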
The s(CASP) program is an implementation of rules for identifying promoters in DNA sequences. Our team first compiled these rules in common English as seen in this document. These English rules were then converted to s(CASP) code.