Automated Genome Feature Discovery

Overview

Automated Genome Feature Discovery is an s(CASP) program that searches a DNA sequence for promoters, which identify regions of transcription.

DNA sequences can be modeled as strings over the characters 'A', 'T', 'C', and 'G'. Broadly speaking, a transcription region consists of a translation section, which is prefixed by a start codon (usually ATG) and suffixed by a stop codon (which varies). Upstream of the translation section, there are typically one or more AT-rich sequences that aid in transcription.
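As a rough illustration of the start/stop structure described above, the following Python sketch looks for each start codon and the first in-frame stop codon after it. It is not the repository's s(CASP) logic. The function name `find_translation_sections` is hypothetical, and the stop codons are assumed to be the three standard ones (TAA, TAG, TGA).

```python
# Assumption: the three standard stop codons; the source only says the stop codon "varies".
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_translation_sections(dna, start_codon="ATG"):
    """Return (start, end) index pairs, one per start codon that is
    followed by an in-frame stop codon. `end` is exclusive."""
    dna = dna.upper()
    sections = []
    pos = dna.find(start_codon)
    while pos != -1:
        # Walk codon by codon (steps of 3) from just after the start codon.
        for i in range(pos + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                sections.append((pos, i + 3))
                break
        # Continue searching for the next start codon.
        pos = dna.find(start_codon, pos + 1)
    return sections


if __name__ == "__main__":
    print(find_translation_sections("CCATGAAATAGGG"))
```

A start codon with no in-frame stop codon after it is simply skipped, so a truncated sequence yields an empty list rather than an error.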

Depicted below is a model transcription region where the character 'X' represents a wildcard and 'Y' represents the payload. In this example, there are two AT-rich sequences: one (the Pribnow box) appearing 10 characters before the translation section and another appearing 35 characters before the translation section. This structure is typical for bacteria.

The AT-rich sequences rarely appear exactly as shown in the model. The model shows only the most likely character for each position; in reality, the character at each position follows an observed probability distribution.
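One common way to represent per-position probabilities is a position weight matrix. The Python sketch below builds one from a handful of example sites and scores a candidate window against it. The example sites are invented variations on the TATAAT consensus and are not observed data. The function names, the pseudocount, and the uniform background of 0.25 are likewise assumptions, and this is not necessarily how promoter.pl scores candidates.

```python
import math
from collections import Counter


def build_probabilities(sites, pseudocount=1.0):
    """Per-position character probabilities from equal-length example sites.
    A pseudocount keeps unseen characters from getting probability zero."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    table = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        table.append({c: (counts[c] + pseudocount) / total for c in "ACGT"})
    return table


def consensus(table):
    """Most likely character at each position."""
    return "".join(max(column, key=column.get) for column in table)


def log_odds_score(window, table, background=0.25):
    """Sum of log2(probability / background) over the window; higher is a better match."""
    return sum(math.log2(table[i][c] / background) for i, c in enumerate(window))


if __name__ == "__main__":
    # Invented example sites, for illustration only.
    sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT", "TATACT"]
    table = build_probabilities(sites)
    print(consensus(table))
    print(log_odds_score("TATAAT", table), log_odds_score("GCGCGC", table))
```

Scoring with log-odds rather than exact matching lets a window that differs from the consensus in one position still rank well above an unrelated window.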

Quick Start

git clone https://github.com/Emiller88/Comet-Galaxy.git
cd Comet-Galaxy
make

Usage

Clone the git repository to your local machine.

git clone https://github.com/Emiller88/Comet-Galaxy.git

Run the data_to_list.py script to generate input for the s(CASP) program. (Note: you can supply your own genome by replacing raw_input2.txt with the name of any local text file.)

python data_to_list.py raw_input2.txt

Run the program!

scasp promoter.pl

FASTA

An example FASTA file can be downloaded from https://github.com/nf-core/modules/raw/master/tests/data/fasta/E_coli/NC_010473.fa. To use it, run the following instead:

python data_to_list.py NC_010473.fa

However, this is extremely slow because it is not optimized.
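For readers unfamiliar with the format, a FASTA file is a header line starting with '>' followed by lines of sequence. The Python sketch below shows one way to collect the sequence of the first record from such a file. It is only an illustration of the format: the function name `first_fasta_sequence` is hypothetical, and this is not a description of what data_to_list.py does internally.

```python
def first_fasta_sequence(lines):
    """Concatenate the sequence lines of the first FASTA record.
    `lines` is any iterable of strings, such as an open file."""
    parts = []
    seen_header = False
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seen_header:
                break  # a second header means the first record has ended
            seen_header = True
            continue
        parts.append(line.upper())
    return "".join(parts)


if __name__ == "__main__":
    example = [">example record\n", "acgt\n", "TTGA\n", ">another record\n", "GGGG\n"]
    print(first_fasta_sequence(example))
```

To read from disk, pass an open file handle, for example `with open("NC_010473.fa") as handle: sequence = first_fasta_sequence(handle)`.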

Output

Design

The s(CASP) program is an implementation of rules for identifying promoters in DNA sequences. Our team first compiled these rules in common English as seen in this document. These English rules were then converted to s(CASP) code.
