Processing and analysis pipeline to extract information from sequencing data.

pedroherrerovidal/GeneTranscriptomicDataAnalysis

Sequencing data processing and analysis

This repository implements a processing and analysis pipeline that extracts information from sequencing data to provide insight into multiple sclerosis (MS) treatments. A presentation with the analysis and results is available here.

This preprocessing pipeline works on all transcriptomic sequencing data, and the analysis and visualization script can be used for any table of counts (genomics data and beyond).

MS is a neurodegenerative disease that affects more than 2.3 million people worldwide. It is characterized by a loss of nerve myelin, which results in slower nerve signal conduction and possible cell death, leading to walking difficulties, vision problems, fatigue, numbness and a range of cognitive changes. The cause is unknown, but the disease is associated with autoimmune processes.

Here we look at changes in the gene expression profile of human immune cells in response to two of the most common treatments for the disease: interferon-beta and vitamin D. We compared large-scale sequencing data across conditions using the following analysis pipeline:

Download raw reads from open source repositories

Run downloads_refGenome.s to download files. This step is optional. The rest of the processing and analysis pipeline works on all SRA sequencing data.

Assess quality of the raw data

We used FastQC to get a comprehensive description of the sequencing reads' quality. This allows for visual and quantitative inspection of the data quality to inform data processing and filtering. Run fastq.s.
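
To make concrete what a read-quality check measures, the sketch below decodes the quality string of each FASTQ record into Phred scores and reports the mean quality per read. It is an illustration only, not how FastQC is implemented. It assumes Phred+33 encoding and four-line FASTQ records; the function names and the two example reads are hypothetical.

```python
from statistics import mean


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def mean_quality_per_read(fastq_text):
    """Return {read_id: mean Phred score} for four-line FASTQ records."""
    lines = fastq_text.strip().splitlines()
    result = {}
    for i in range(0, len(lines), 4):
        read_id = lines[i][1:].split()[0]  # drop the leading '@'
        quality = lines[i + 3]             # fourth line of the record
        result[read_id] = mean(phred_scores(quality))
    return result


# Hypothetical two-read example: 'I' encodes Q40, '#' encodes Q2.
example = """@read1
ACGT
+
IIII
@read2
ACGT
+
II##
"""
qualities = mean_quality_per_read(example)
for read_id, q in qualities.items():
    print(read_id, q)
```

A read whose mean quality drops sharply toward its 3' end, like read2 here, is the kind of pattern that motivates trimming.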

Data preprocessing and filtering

Based on the FastQC output, we filter the reads using Trim Galore, which provides flexible preprocessing of sequencing reads. Run Trimming.s to apply the parameters chosen for these datasets.
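
Trim Galore is a wrapper around Cutadapt, and the sketch below illustrates the running-sum idea documented for Cutadapt's 3' quality trimming: subtract a cutoff from every base quality, accumulate the differences from the 3' end, and cut where the sum is lowest. It is a simplified illustration, not the code of either tool. The function names, the cutoff and the example read are hypothetical and are not the parameters set in Trimming.s.

```python
def quality_trim_index(qualities, cutoff):
    """Return the index at which to cut the 3' end of a read.

    Walk from the 3' end, accumulate (quality - cutoff), and cut at the
    position where that running sum is lowest.
    """
    running = 0
    lowest = 0
    cut = len(qualities)  # default: keep the whole read
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running > 0:   # good bases now outweigh the low-quality tail
            break
        if running < lowest:
            lowest = running
            cut = i
    return cut


def trim_read(sequence, qualities, cutoff=20):
    """Trim a read and its qualities at the computed cut position."""
    cut = quality_trim_index(qualities, cutoff)
    return sequence[:cut], qualities[:cut]


# Hypothetical 10-base read whose quality collapses after the fourth base.
quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
seq = "ACGTACGTAC"
trimmed_seq, trimmed_quals = trim_read(seq, quals, cutoff=10)
print(trimmed_seq, trimmed_quals)
```

Using a running sum rather than cutting at the first low-quality base lets a single acceptable base inside a poor tail (the 11 above) be removed together with its neighbours.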

Note: data quality should be reassessed after every preprocessing step.

Align reads to the reference genome

Here we use HISAT2 to map the reads from all conditions to the human reference genome, identifying the genes and DNA sequences they come from. Run the IndexSorting.s script.

Generate and structure data

To inspect differences in gene regulation across treatments, we generated a table of counts with patients (observations) as columns and genes (features) as rows. We used HTSeq, implemented in htseq.s.
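
The layout of that table can be sketched as follows. The example assumes per-sample output in the two-column, tab-separated format written by htseq-count, where summary rows such as `__no_feature` start with a double underscore. The sample names, gene names and counts are made up for illustration, and this is not the code in htseq.s.

```python
import csv
import io


def parse_htseq_counts(text):
    """Parse one sample's htseq-count output into {gene_id: count}."""
    counts = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        # Skip malformed rows and the summary rows (e.g. __no_feature).
        if len(row) != 2 or row[0].startswith("__"):
            continue
        counts[row[0]] = int(row[1])
    return counts


def build_count_table(per_sample):
    """Build a table with genes as rows and samples as columns."""
    samples = sorted(per_sample)
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    table = [["gene"] + samples]
    for gene in genes:
        table.append([gene] + [per_sample[s].get(gene, 0) for s in samples])
    return table


# Hypothetical per-sample outputs (in practice, read from files).
sample_outputs = {
    "control_1": "GENE_A\t10\nGENE_B\t0\n__no_feature\t5\n",
    "treated_1": "GENE_A\t25\nGENE_B\t3\n__no_feature\t7\n",
}
table = build_count_table(
    {name: parse_htseq_counts(text) for name, text in sample_outputs.items()}
)
for row in table:
    print("\t".join(str(cell) for cell in row))
```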

Gene expression analysis

We extracted genes that are significantly modulated across conditions, correcting for multiple comparisons; explored the data structure in an unsupervised way using multidimensional scaling (MDS); and drew predictions between treatment conditions using generalized linear models (GLMs). To this end we used edgeR, GGally, heatmap3, biomaRt, statmod and built-in R packages. Run AnalysisVisualization.R to visualize and analyze the sequencing data.
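
To illustrate what correcting for multiple comparisons does when thousands of genes are tested at once, here is a minimal sketch of the Benjamini-Hochberg adjustment, a common false discovery rate correction. Whether AnalysisVisualization.R applies exactly this correction is an assumption; the sketch is written in Python for illustration, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
p = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(p)
significant = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, significant)
```

In this example three genes have raw p-values below 0.05, but only the first remains below that threshold after adjustment.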
