We implemented a processing and analysis pipeline that extracts information from sequencing data to provide insight into multiple sclerosis (MS) treatments. A presentation with the analysis and results is available here.
This preprocessing pipeline works on all transcriptomic sequencing data, and the analysis and visualization script can be used for any table of counts (genomics data and beyond).
MS is a neurodegenerative disease that affects more than 2.3 million people worldwide. It is characterized by a loss of nerve myelin, which results in slower synaptic transmission and possible cell death, leading to walking difficulties, vision problems, fatigue, numbness and a range of cognitive changes. The cause is unknown, but it is associated with autoimmune processes.
Here we look at changes in the gene expression profile of human immune cells in response to two of the most common treatments for the disease: interferon-beta and vitamin D. We compared large-scale sequencing data across conditions using the following analysis pipeline:
Run downloads_refGenome.s to download files. This step is optional. The rest of the processing and analysis pipeline works on all SRA sequencing data.
We used FastQC to get a comprehensive description of the sequencing reads' quality. This allows visual and quantitative inspection of data quality to inform processing and filtering. Run fastq.s.
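To make the idea of read quality concrete, here is a minimal Python sketch of one of the quantities such a report summarizes: the mean Phred score of each read in a FASTQ file. It is not what fastq.s or FastQC runs. The function name `mean_phred_per_read`, the example reads, and the Phred+33 encoding are assumptions made for illustration.

```python
import io


def mean_phred_per_read(fastq_handle, offset=33):
    """Yield (read_id, mean Phred quality) for each 4-line FASTQ record.

    Assumes Phred+33 encoding and strictly 4 lines per record
    (illustrative assumptions, not taken from the pipeline scripts).
    """
    while True:
        header = fastq_handle.readline().strip()
        if not header:
            break  # end of file
        fastq_handle.readline()  # sequence line (not needed here)
        fastq_handle.readline()  # '+' separator line
        quality = fastq_handle.readline().strip()
        scores = [ord(ch) - offset for ch in quality]
        mean_score = sum(scores) / len(scores) if scores else 0.0
        yield header[1:], mean_score


# Two made-up reads: 'I' encodes Phred 40 and '#' encodes Phred 2.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\nII##\n")
results = dict(mean_phred_per_read(example))
print(results)
```

Reads whose mean score is low, or whose quality drops toward the 3' end, are the ones the trimming step is meant to clean up.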
Based on the FastQC output, we filtered the reads using Trim Galore, which provides flexible preprocessing of sequencing reads. Run Trimming.s to use the hyperparameters chosen for these datasets.
Note: one should assess data quality after any preprocessing step.
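As a rough picture of what 3' quality trimming does, the sketch below implements the running-sum approach described in the Cutadapt documentation (Trim Galore wraps Cutadapt). It is a simplified illustration, not the code that Trimming.s runs. The function name `quality_trim_3prime` and the cutoff values are assumptions, and the parameters actually chosen for these datasets are not reproduced here.

```python
def quality_trim_3prime(seq, quals, cutoff=20):
    """Trim low-quality bases from the 3' end of a read.

    Subtract the cutoff from each quality score, accumulate the sum
    from the 3' end, and cut where that sum reaches its minimum.
    `cutoff=20` is an illustrative default, not the pipeline's setting.
    """
    running = 0
    best = 0
    cut = len(seq)  # by default keep the whole read
    for i in range(len(quals) - 1, -1, -1):
        running += quals[i] - cutoff
        if running > 0:
            break  # the remaining 5' part is good enough; stop scanning
        if running < best:
            best = running
            cut = i
    return seq[:cut], quals[:cut]


# Made-up read whose quality decays toward the 3' end.
seq = "ACGTACGTAC"
quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
trimmed_seq, trimmed_quals = quality_trim_3prime(seq, quals, cutoff=10)
print(trimmed_seq, trimmed_quals)
```

Because the cut is placed at the minimum of the running sum, a single good base surrounded by poor ones (the score of 11 above) does not stop the trimming early.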
We used HISAT2 to align the reads from all conditions to the human reference genome, identifying the genes and genomic sequences they come from. Run the IndexSorting.s script.
To inspect differences in gene regulation across treatments, we generated a table of counts with patients (observations) as columns and genes (features) as rows. We used HTSeq, implemented in htseq.s.
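The layout of the count table can be sketched in a few lines of Python. This is an illustration of the genes-by-samples structure described above, not the contents of htseq.s. It assumes per-sample output in the two-column, tab-separated form that htseq-count writes, where summary rows such as `__no_feature` start with a double underscore. The sample names, gene names, and function names are hypothetical.

```python
def parse_htseq_counts(text):
    """Parse 'gene<TAB>count' lines, skipping the '__' summary rows."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        gene, value = line.split("\t")
        if gene.startswith("__"):
            continue  # e.g. __no_feature, __ambiguous
        counts[gene] = int(value)
    return counts


def build_count_table(per_sample_text):
    """Return (sample names, {gene: [count per sample]}).

    Genes are rows and samples are columns; a gene missing from a
    sample is given a count of 0.
    """
    samples = sorted(per_sample_text)
    parsed = {s: parse_htseq_counts(per_sample_text[s]) for s in samples}
    genes = sorted({g for counts in parsed.values() for g in counts})
    rows = {g: [parsed[s].get(g, 0) for s in samples] for g in genes}
    return samples, rows


# Hypothetical per-sample outputs.
example = {
    "patient1_control": "geneA\t10\ngeneB\t0\n__no_feature\t5\n",
    "patient1_treated": "geneA\t3\ngeneB\t7\n__no_feature\t2\n",
}
samples, rows = build_count_table(example)
print(samples)
print(rows)
```

A table in this shape, written out as a delimited file, is the kind of input the analysis and visualization script expects.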
We extracted genes significantly modulated across conditions, correcting for multiple comparisons, explored data structure in an unsupervised manner using multidimensional scaling (MDS), and drew predictions between treatment conditions using generalized linear models (GLMs). To this end we used edgeR, GGally, heatmap3, biomaRt, statmod and built-in R modules. Run AnalysisVisualization.R to visualize and analyze the sequencing data.
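The multiple-comparison correction mentioned above can be illustrated with the Benjamini-Hochberg procedure. The README does not state which correction AnalysisVisualization.R applies, so treating it as Benjamini-Hochberg is an assumption. The Python sketch below is only meant to show the idea on made-up p-values, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is scaled by m / rank (rank 1 = smallest), then a
    running minimum from the largest rank downward keeps the adjusted
    values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

Genes would then be called significant when their adjusted value falls below a chosen false discovery rate, for example 0.05.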