forked from geparada/RNA_seq_course2020
-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathChapterI_Unix_introduction.Rmd
executable file
·182 lines (101 loc) · 11.3 KB
/
ChapterI_Unix_introduction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
---
title: " Chapter I - Introduction to Unix "
output: html_notebook
---
# 0. Let's start!
Today we will be working remotely though [ssh](https://en.wikipedia.org/wiki/Secure_Shell). This allows us to access to higher computing power that is allocated on servers that we are renting for you. In order to connect to the server is important you take note of the IP address you will be given and then you can open a terminal and run the following command:
`ssh course@IP_address`
Where instead of `IP_address` you should insert the address corresponding to your own working server. Once you connect to the remote server, we will use [screen](https://en.wikipedia.org/wiki/GNU_Screen) to create a session. This will allow us to smoothly resume our working session after any disconnection from the server. To create a new session use:
`screen -S day1`
Where `day1` corresponds to the name of the session, but it can be replaced by any name you prefer. After any interruption on the ssh connection, you can check the created sessions using:
`screen -ls`
If you want to resume a session, just use:
`screen -r day1`
If your session is still attached, but you want to re-attach to your current terminal, you will need to first de-attach the session and the re-attach it:
`screen -d day1`
`screen -r day1`
# 1. Basic Unix
UNIX is an operating system that was developed in the 1960s and it has been the base of widely used operating systems such as Linux and Mac OS. As the data coming out from biological outputs growing exponentially through time, Knowing the UNIX basics have become essential to do bioinformatics. Here there is a list of the most used commands:
|Commad| Use| Examples|
|:--|:------|:-----|
|`cd` | Tove between directories| `cd FASTQ` (to enter a folder)|
|`cd ..` | To change to the parent of the current directory| |
|`cd -` | To go to previous directory| |
|`pwd` | Where am I?| |
|`ls` | List files | `ls` (on this directory); `ls FASTQ` (list files inside FASTQ) |
|`mkdir` | make new directory| `mkdir my_folder` |
|`mv` | move a file or directory| `mv file my_folder/` (move file to a folder) |
|`cp` | copy file| `cp my_folder/* new_folder/` (copy all files of al folder) |
|`rm` |Remove a file or folder| `rm file`; `rm -rf folder` |
|`less` |Displays the contents of a file| `less file.txt` (press `q` to quit) |
|`head`| Displays the first ten lines of a file | |
|`tail`| Displays the last ten lines of a file | |
|`cat` |Print file content|`cat file1 file2 > file.total` (merge files)|
|`zcat` |Print file content of compresed files|`zcat SRR7889597.fastq.gz` |
|`gzcat` |mac OSx version of zcat |`gzcat SRR7889597.fastq.gz` |
|`wc` | word count | `wc -l file ` (count lines inside of a file)|
|`grep` | Pattern matching | `grep patern file` (get lines that match the patern)|
| `cut` | extract columns from files | `cut -f 2 file` (extract second column)
|`sort` | sort file by column |`sort -nr -k 2 file` (sort file by second column) |
|`wget` | download file |`wget -r file_url -O local_file` |
|`gzip` | compress / uncompress files |`wget -d file.gz` (uncompress a .gz file) |
During the day to day work as a bioinformatician, the terminal will be your best friend. In this section, we are going to review one of the most used terminal commands by doing some practical exercises.
**Exercise 1.0.1**
A) Use `ls` to list the files that are inside the course directory (`RNA_seq_course2020`).
B) From the commands listed on the table above, use the right command to enter the Genome folder and use `ls` again to list the files inside the folder.
C) Find command to print the first lines of `dm6.fa`
D) Use `less` to display the content of `dm6.fa`. What is the name of this format? What does each sequence represent? (hint: use `q` to exit)
E) Count the number of lines inside `dm6.fa` to estimate the number of sequences present inside
**Exercise 1.0.2**
A) Count the number of sequences present at `dm6.fa` (hint: count lines that start with `>`, but as this is a special UNIX character, refer to it as `\>`. Which command can be used for pattern matching and count lines?).
B) Can you calculate the exact number of nucleotides that each D. melanogaster assembly is made of? (hint: explore additional options of `grep` and `wc`)
C) Compare your results with the official stats reported for [dm6](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001215.4/) and [dm3](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001215.2/).
## 1.2 Efficient use of the terminal
To accelerate our interaction with the terminal, we can use the TAB key:
![](Images/TAB.png)
For example, if you are at course folder and you write `ls Gene_` and then press TAB key once, it will complete the command to `ls Gene_annotation`. But if you just write `ls G` and press TAB two times, it will show you the two options (Gene_annotation and Genome).
Another powerful thing is to pipe (`|`) the output of one command into and another command. For example you can do:
`ls FASTQ/*.fastq.gz | wc -l`
To count all the FASTQ files that are stored inside of `FASTQ/` folder.
Finally, if you want to know more, every command has its own options, you can see the entire description of a command using `man`. For instance, use man to know more about `head` typing on the command line:
`man head`
Different data types are stored using different file formats. Fasta files a one of the most widely used formats to store sequences. In the following excersice we are going to use some terminal skills to get an older version of D. melanogaster genome and compare it with the newest version.
**Exercise 1.2.1**
Read the manual of `head` command. Then use it to extract the first 100 lines of `Gene_annotation/dm6.Ensembl.genes.gtf` and count all the lines that have `CDS` in the third column. (hint: pipe head, grep and wc -l)
# 2. Input data format
In order to quantify mRNA by RNA-seq analysis, different files are required which as a different format. RNA-Seq reads are normally formatted as [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format), SRA or CRAM. To process this data, the most common initial step is to map the reads to the genome, which is often formatted as [FASTA](https://en.wikipedia.org/wiki/FASTA_format). Finally, to quantify gene expression, we need to know which part of the genome encodes for genes. This information can be found at gene annotation files, which can be formatted as [GTF](http://mblab.wustl.edu/GTF22.html#fields) or [BED](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) files. Explore the different hyperlinks provided for detailed information about the different file formats.
## 2.1 FASTA format
FASTA files contain nucleotide or amino acid sequence information. The data is composed of identifiers, which starts with `>` and nucleotide o amino acid sequences. Hundred of thousands of genomes coming from different species are available in public repositories as FASTA files. It is important to notice that a given species can have different genomic assemblies. For instance, a popular assembly human genome assembly is `hg19` that was published in February 2009, and as a huge amount of data is available for that assembly some researchers keep doing their analyses with that version. But other recent analyses have started to use `hg38` instead, which was published in December 2013 and represent a more curated version of the human genome.
For today's practice, we are going to work with a smaller genome. We already have downloaded for you the `dm6` assembly of D. melanogaster's genome, an invertebrate organism widely used as a model organism for genetic and behavioral studies. The genome file is at course material folder at `Genome/dm6.fa`.
**Exercise 2.1.1**
Genome assemblyies have different versions. During this exercise we will compare the newest assembly of Dm melagogaster's genome (`dm6`) with an older version (`dm3`)
A) Create a new directory named `Older` under the Genome directory (hint: look for the appropriate command on the table above)
B) Enter the newly created directory and use `wget` to download a file from the following URL:
http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/dm3.fa.gz
Check the manual pages of wget by inserting `man wget` and find an option to redirect the downloading to a file named `dm3.fa.gz`
C) The obtained file `dm3.fa.gz` is a compressed file, therefore it cannot be directly oppened using `less`. To print the first lines of this file we can use a combination of two commands; `zcat` and `head`. Use `zcat dm3.fa.gz | head` to print the fist 10 lines. Notice the pipe `|` which here is used to couple these two commands. Now pipe `zcat`with `wc -l` to count the total number of lines inside `dm3.fa.gz`.
D) To uncompress and compress `.gz` files we can use gzip. Explore the man pages of it (`man gzip`) and find the right option to uncompress the file. Then use `wc -l` to check if the number of lines is maintained.
## 2.2 FASTQ format
RNA-seq reads often come as FASTQ, which is an extension of FASTA file format. Every read is represented by four lines:
* First line: start with `@` and correspond to a read identifier.
* Second line: read nucleotide sequence.
* Third line: Start with `+` and it can be followed by additional information, but often it is just `+`.
* Fourth line: Phred quality score represent the reliability of a base call (as higher the better). It is encoded as ASCII characters. It can have different encodings, such as Sanger, Solexa, Illumina 1.3+ or Illumina 1.8+. Please check [FASTQ format description](https://en.wikipedia.org/wiki/FASTQ_format) for more details.
**Exercise 2.2.1**
Which of the fastq files found at `FASTQ/` has the highest and the lowest amount of reads? How many reads do they have? What is the read length?
## 2.3 Gene annotation
Efforts have been made to identify the genomic regions that encode for genes. This information can be stored as BED or GTF files (among other formats). BED files store information about genomic coordinates. Each row of a BED file represents a genomic interval, in the case of gene annotation file, it represents the start and end of a transcript. Exons coordinates are encoded as `blocks`, where the `blockStarts` column is a list of all exon starts relative to the transcript starts and `blockSizes` column contain all the exon sizes. Please read [BED description format](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) for more information.
GTF files instead are gene-centric annotations, in which a gene can be described by multiple rows. Each row contains the coordinates and additional information of a particular gene feature. Please check the full [GTF description](http://mblab.wustl.edu/GTF22.html)
**Exercise 2.3.1**
Get D. melanogaster gene annotation files for `dm6` assembly in from [UCSC Table Browser](http://genome-euro.ucsc.edu/cgi-bin/hgTables). Choosing the following values:
* clade: Insect.
* genome: D. melanogaster.
* assembly: Aug. 2014 (BDGP Release 6 + ISO1 MT/dm6).
* group: Gene and Gene Predictions.
* track: Ensembl Genes.
* table: ensGene.
And get the annotation files as BED and GTF (choosing the right output formats). Save the BED file as `dm6.Ensembl.genes.bed12` and GTF as `dm6.Ensembl.genes.gtf` at `Gene_annotation` folder.
**Exercise 2.3.1**
Process `dm6.Ensembl.genes.bed12` to find the five transcripts with the highest number of exons. (hint use `sort`, look for `-r`, `-n` and `-k` flags at its manual).
**Exercise 2.3.2**
Process `dm6.Ensembl.genes.gtf` to find the total number of genes