SubtiML

20.490 Final Project

Expression data:

Table S2 from "Condition-dependent transcriptome reveals high-level regulatory architecture in Bacillus subtilis" (Nicolas, Mäder, Dervyn et al., 2012). Stored as expression_data.xlsx. Genome sequence:
Bacillus subtilis subsp. subtilis str. 168 genome sequence downloaded from NCBI, reference NC_000964.3. Stored as bsubt168.fna. Experimental growth conditions:
The different experimental growth conditions tested are summarized in experiments_order.txt. More information about what exactly the conditions correspond to can be found in the paper by Nicolas et al.

To generate the cleaned dataset, click through the jupyter notebook data processing.ipynb The notebook performs the following processing steps:

Retrieve the sequences for the regions 20, 50, 100, 200, 1000bp upstream of each gene
Compute the mean expression level across different replicates for the same conditions
For each gene, store all of the expression data in a single array
For each gene, also compute a normalized (z-scored) array of expression data
Store all this data in cleaned_data.csv
Store the order of the conditions stored in the expression array in experiments_order.txt
Visualize correlatedness of growth conditions

To follow the training protocol, click through the jupyter notebook train CNN.ipynb Two separate training tasks were performed

A CNN binary classification model was trained on each experimental growth condition to predict whether a given upstream DNA sequence would lead to up- or downregulation of the gene in the given growth condition
A CNN regression model was trained to predict the full expression profile across all experimental conditions for a given upstream DNA sequence

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
Train CNN.ipynb		Train CNN.ipynb
bsubt168.fna		bsubt168.fna
cleaned_data.csv		cleaned_data.csv
data processing.ipynb		data processing.ipynb
experiments_order.txt		experiments_order.txt
expression_data.xlsx		expression_data.xlsx