---
title: "datatable"
author: "Maja Kuzman"
date: "7/1/2020"
output: html_document
---


```{r}
ddf <- data.table::fread("http://hex.bioinfo.hr/~mfabijanic/tidyData.txt", header=T)
ddf
``` 
I will show you some examples on this data, and you will do the exercises on data that ships with R (iris).  

---

# Loading the library  

If you don't have the package, install it with:  
(only once)  
```{r, eval=FALSE}
install.packages("data.table")  # the package name must be quoted
```

Once it is installed, you need to load it to R:  
(once per session)  

```{r}
library(data.table)
```

If you are all working on the same machine and sharing threads, be collegial and limit data.table to a single thread:
```{r}
setDTthreads(threads = 1, restore_after_fork = FALSE)
```

If you already have a data frame, convert it to a data.table with:  

```{r}
ddf_dt <- as.data.table(ddf)
```

---
class: invert, center, middle
# Packages in R: data.table   


#### All you need to know:  

#### DT[i,j,by]  
     
##### i: select those rows  
##### j: do this to them  
##### by: do it per groups  
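
To see the three parts working together, here is a minimal sketch on a made-up table (`toy`, `grp` and `val` are hypothetical names, not part of the course data):

```{r}
library(data.table)

# Hypothetical toy table, only for illustration
toy <- data.table(grp = c("a", "a", "b", "b", "b"),
                  val = c(1, 2, 3, 4, 5))

# i:  keep rows with val > 1
# j:  sum val for the kept rows
# by: do it once per value of grp
toy_sum <- toy[val > 1, .(total = sum(val)), by = grp]
toy_sum
```

Read it as: "take `toy`, subset rows with `i`, then compute `j`, grouped by `by`".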

---
# data.table: SELECTING ROWS  


Selecting rows:
How many people are taller than 170 and shorter than 180?  

In data.table:  
You select rows much as you would select elements of a vector:  

```{r}
ddf_dt[Height<180 & Height>170]
nrow(ddf_dt[Height<180 & Height>170])
```

---

# data.table: SELECTING COLUMNS   

Show the Age and Continent of the first 5 people.

In data.table, use .() instead of c();  
.() is short for list().  
```{r}
ddf_dt[1:5,.(Age, Continent)]
```


If the column names are stored in a character vector, prefix the vector's name with ..:
```{r}
cnames <- c("Age","Continent")
ddf_dt[1:5,..cnames]
```
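
The `..` prefix is one of several ways to select columns by name. As a small sketch on a made-up table (`toy_cols` and `wanted` are hypothetical names), these three forms should return the same columns:

```{r}
library(data.table)

# Hypothetical toy table, only for illustration
toy_cols <- data.table(id = 1:3,
                       age = c(30, 41, 25),
                       city = c("A", "B", "C"))
wanted <- c("id", "city")

sel_dots <- toy_cols[, ..wanted]                  # look up `wanted` outside the table
sel_with <- toy_cols[, wanted, with = FALSE]      # treat j as a vector of column names
sel_sd   <- toy_cols[, .SD, .SDcols = wanted]     # .SD restricted to the named columns
sel_dots
```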

---

# data.table: SELECTING COLUMNS

Note on column selection: ranges of columns, and negated ranges, also work:
```{r}
ddf_dt[1:3,Age:Continent]
```

```{r}
ddf_dt[1:3,-(Age:Continent)]
```

```{r}
ddf_dt
ddf_dt[1:3,!(Age:Gender)]
```

---

# data.table: exercise iris 1  

```{r, eval=FALSE}
ddf_dt <- as.data.table(ddf)
ddf_dt[1:5,.(Age,Continent)]
cnames <- c("Age","Continent")
ddf_dt[1:5,..cnames]
```

Mini exercise:  
```{r}
iris
```

- convert iris to data.table - call it iris_dt  
- Select all rows in iris_dt with Sepal.Length<6.7  
- Select as before, but show only columns Sepal.Length and Species  

```{r}
iris_dt <- as.data.table(iris)
iris_dt[Sepal.Length < 6.7]
iris_dt[Sepal.Length < 6.7, .(Sepal.Length,Species)]

```

---
# data.table: OPERATION ON COLUMNS

Calculations of mean:  
```{r}
ddf_dt[,mean(Height)]
```

By groups:
```{r}
ddf_dt[,mean(Height), by=Gender]
```

---

# data.table: OPERATION ON COLUMNS

In base R: calculate the mean, sd, max, and min of Height and the number of people per Gender, collect the results in a data frame, and order it by the mean:  
```{r}
meanic <- by(ddf$Height,INDICES = ddf$Gender, mean)
sdic <- by(ddf$Height,INDICES = ddf$Gender, sd)
maxx <- by(ddf$Height,INDICES = ddf$Gender, max)
minx <- by(ddf$Height,INDICES = ddf$Gender, min)
nr_grp <- by(ddf$Height,INDICES = ddf$Gender, length)
res_df <- data.frame(as.numeric(meanic),
                     as.numeric(sdic),
                     as.numeric(maxx),
                     as.numeric(minx),
                     as.numeric(nr_grp))
res_df[order(res_df$as.numeric.meanic.),]    
```

---

# data.table: OPERATION ON COLUMNS

The same in data.table: calculate the mean, sd, min, and max of Height and the number of people per Gender, and order the result by the mean:  

```{r}
ddf_dt[ , 
        .(mean=mean(Height),
          sd  =sd(Height),
          min_x= min(Height),
          max_x= max(Height),
          N= .N),
        by=Gender][order(mean)]
```

---
# data.table: .N, by 

The .N gives number of observations:  
```{r}
ddf_dt[,.N]
ddf_dt[,.N, Gender]
```




You can group by multiple variables:  
```{r}
ddf_dt[,.N, by=.(Gender,Height>180)]
```
### Exercise: 

How many men and women are there from different continents?
```{r}
ddf_dt[,.N,.(Gender,Continent)]
```

---
# data.table: exercise iris 2  

```{r, eval=FALSE}
ddf_dt[,
       .(N=.N), 
       by=.(Gender,Height>180)]
```

Exercise on iris_dt:  

- Select all rows where Sepal.Length < 6.7 and Species=="virginica"   
- For those rows, use chaining ([][]) to calculate the mean Petal.Width  
- How many of those flowers have Sepal.Width>3 and how many less than 3?  

```{r}
iris_dt[Sepal.Length<6.7 & (Species=="virginica")]
iris_dt[Sepal.Length<6.7 & (Species=="virginica")][,mean(Petal.Width)]
iris_dt[Sepal.Length<6.7 & (Species=="virginica"),mean(Petal.Width)]
iris_dt[Sepal.Length<6.7 & (Species=="virginica"),.N, .(Sepal.Width<3)]
```

---

# data.table:  add a new column 

Use ':=' to add a new column
```{r}
ddf_dt[,N:=.N]
ddf_dt[1:3]
```

If you want to add multiple columns, use ':=' as a function:  
```{r}

ddf_dt[ ,
       ':='(N_grp=.N, mean=mean(Height)),
       by=.(Gender, Continent)]
ddf_dt
```
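
Because ':=' changes the table in place, a plain assignment does not protect the original. The sketch below uses made-up objects (`orig_dt`, `alias_dt` and `safe_dt` are hypothetical names) to show the difference between a second name for the same table and an explicit `copy()`:

```{r}
library(data.table)

# Hypothetical toy table, only for illustration
orig_dt  <- data.table(x = 1:3)
alias_dt <- orig_dt          # second name for the same table, no copy is made
safe_dt  <- copy(orig_dt)    # independent copy

alias_dt[, y := x * 2L]      # adds y by reference

names(orig_dt)   # y shows up here too
names(safe_dt)   # the copy is untouched
```

If you want to experiment with ':=' without touching your original table, work on a `copy()`.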

---

# data.table: Exercise add a new column 

```{r, eval=FALSE}
ddf_dt[ ,
       ':='(N_grp=.N, mean=mean(Height)),
       by=.(Gender,Continent)]
```

- Add columns to iris_dt that represent mean and sd of Sepal.Width grouped by species.  
- use function uniqueN to check how many unique mean Sepal Widths there are. 

```{r}
iris_dt[,':='(meanSepWid=mean(Sepal.Width), 
              sdSepWid=sd(Sepal.Width)),
        Species]
iris_dt[,uniqueN(meanSepWid)]
```


---
# data.table:  .I, .GRP  

The .I holds row numbers:
```{r}
ddf_dt[,.(row_id=.I,Gender,Continent)]
```

--

The .GRP holds unique group number:  

```{r}
ddf_dt[,.GRP,.(Gender, Continent)]
```

---
# data.table Exercise :=, .N, .I, .GRP  


```{r, eval=FALSE}
ddf_dt[ ,
       ':='(N_grp=.N, mean=mean(Height)),
       by=.(Gender, Continent)]
ddf_dt[,.(row_id=.I,Gender,Continent)]
ddf_dt[,.GRP,.(Gender,Continent)]
```

- Add a column to iris_dt that holds the number of observations per species, counting only rows for which Petal.Length is smaller than 6.5.  
- One great benefit of data.table is the ability to sub-assign by reference. Try it: select all rows that have Species=="virginica" and use := to rename those Species entries to new_virginica.  

```{r}
iris_dt[Petal.Length<6.5,N:=.N,Species]
iris_dt[Species=="virginica",Species:="new_virginica"]
iris_dt
```


---
# data.table MORE ADVANCED USAGE: keys  

You can set a key on a data.table with the setkey() function. This sorts the table by the key column(s) and can make subsetting and joining on them (much) faster (for example with the merge function!).   
```{r}
setkey(ddf_dt, Gender, Continent, Height)
ddf_dt
```
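
As a small illustration of what a key does, here is a sketch on a made-up table (`keyed`, `grp` and `val` are hypothetical names): after `setkey()` the rows are sorted, `key()` reports the key, and `.()` in `i` looks rows up by key value.

```{r}
library(data.table)

# Hypothetical toy table, only for illustration
keyed <- data.table(grp = c("b", "a", "c", "a"),
                    val = c(2, 1, 4, 3))

setkey(keyed, grp)    # sorts by grp (by reference) and marks grp as the key
key(keyed)            # name of the key column
keyed                 # rows are now ordered by grp

keyed_a <- keyed[.("a")]   # keyed subset: rows where grp == "a"
keyed_a
```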

---

# data.table MORE ADVANCED USAGE: .SD, .SDcols  

Select all columns with .SD. Select only a subset of all columns by .SDcols:  
```{r}
ddf_dt[,.SD, .SDcols=c(2,4,5)]
```

This is especially useful when you want to apply the same operation to several columns, for example to calculate the mean of each selected column per Gender: 
```{r}
ddf_dt[,lapply(.SD,mean), by=Gender,.SDcols=c(2,4,5)]
```


---
# data.table MORE ADVANCED USAGE: .SD, .SDcols  

Or, for example, select the first and last row of each group:  
```{r}
ddf_dt[, .SD[c(1, .N)], by=Gender]
```
It is easier if you read .SD as: Subset of Data  
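
Another common use of the same idea is picking one row per group. A minimal sketch on a made-up table (`scores`, `team`, `player` and `pts` are hypothetical names): inside each group, `.SD` is that group's rows, so it can be indexed like any data.table.

```{r}
library(data.table)

# Hypothetical toy table, only for illustration
scores <- data.table(team   = c("x", "x", "y", "y"),
                     player = c("p1", "p2", "p3", "p4"),
                     pts    = c(10, 15, 7, 3))

# For each team, keep the row with the highest pts
best <- scores[, .SD[which.max(pts)], by = team]
best
```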

---
# data.table exercise MORE ADVANCED USAGE

```{r, eval=FALSE}
ddf_dt[,lapply(.SD,mean), by=Gender,.SDcols=c(2,4,5)]
ddf_dt[, .SD[c(1, .N)], by=Gender]
```


Do the following in a single command:  

- Order the rows by Petal.Width and select the first three (smallest) observations.  
- Calculate the mean of the first four columns of iris_dt for those observations.      
```{r}
iris_dt[order(Petal.Width),.SD[1:3]]
iris_dt[order(Petal.Width),lapply(.SD[1:3],mean),.SDcols=1:4]

```


---

# data.table MORE ADVANCED USAGE {}

Suppressing intermediate output with {}:

If you want to do several things but don't need to save every step in a separate column, no problem! Check this out:  

Let's calculate BMI!

- BMI = kg/m^2
- heightinmeters = Height/100
- heightinmeterssquared = heightinmeters^2
- Weight is already in kg :)
- BMI = Weight/heightinmeterssquared

---
# data.table MORE ADVANCED USAGE {} solved  

```{r}
ddf_dt[,BMI:= {
          heightinmeters = Height/100
          heightinmeterssquared = heightinmeters^2
          Weight/heightinmeterssquared
     }]
```
### Exercise:  

How many overweight imaginary people are there? (BMI>25)

```{r}
ddf_dt[BMI>25,.N]
ddf_dt[,.N,BMI>25]

```

# data.table exercise MORE ADVANCED USAGE: merging 


Let's define dummy data tables:  
```{r}
dt1 <- data.table(x = c("a", "b", "c", "d"), y = c(11.9, 21.4, 5.7, 18))
dt2 <- data.table(x= c("a","b","k"),y = c(10, 15, 20), z = c("one", "two", "three"))
dt1
dt2
```

---
# data.table exercise MORE ADVANCED USAGE: merging 

Merge those two data tables by variable x


```{r}
merge(dt1,dt2, by="x")
```

```{r, eval=FALSE}
merge(dt1,dt2, by="x", all.x = T)
merge(dt1,dt2, by="x", all.y = T)
```

---
# data.table exercise MORE ADVANCED USAGE: merging  

```{r, eval=FALSE}
merge(dt1,dt2, by="x", all.y = T)
```

```{r}
dt1[dt2, on = .(x)]
```
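
The `X[Y, on = ...]` form keeps every row of the table inside the brackets. A short sketch on made-up tables (`left_dt`, `right_dt`, `v1` and `v2` are hypothetical names) shows this, and how `nomatch = NULL` (available in recent data.table versions) should turn it into an inner join:

```{r}
library(data.table)

# Hypothetical toy tables, only for illustration
left_dt  <- data.table(x = c("a", "b", "c"), v1 = 1:3)
right_dt <- data.table(x = c("a", "b", "k"), v2 = c(10, 20, 30))

# Every row of right_dt is kept; v1 is NA where left_dt has no match ("k")
all_right <- left_dt[right_dt, on = .(x)]
all_right

# Only rows whose x appears in both tables
inner <- left_dt[right_dt, on = .(x), nomatch = NULL]
inner
```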

---

# data.table exercise MORE ADVANCED USAGE: roll

```{r}
dt1
dt2
```
---
# keep rolling! - FORWARD JOIN:   
Join on the CLOSEST SMALLER (or equal) NUMERICAL VALUE of y in dt2 - keep all observations from dt1!:  

```{r}
dt2[dt1, on = .(y), roll = T]
```
---
# keep rolling! - BACKWARD JOIN:   
Join on the CLOSEST LARGER NUMERICAL VALUE of y in dt2 that is NOT MORE THAN 4 AWAY - keep all observations from dt1!:  
```{r}
dt2[dt1, on = .(y), roll = -4]
```
roll can also take "nearest".
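
A minimal sketch of `roll = "nearest"`, on made-up tables that mimic the y values used above (`ref_dt`, `probe_dt` and `label` are hypothetical names): each probe value is matched to the reference row whose y is closest, in either direction.

```{r}
library(data.table)

# Hypothetical toy tables, only for illustration
ref_dt   <- data.table(y = c(10, 15, 20), label = c("one", "two", "three"))
probe_dt <- data.table(y = c(11.9, 21.4, 5.7, 18))

# For each y in probe_dt, take the row of ref_dt with the nearest y
near <- ref_dt[probe_dt, on = .(y), roll = "nearest"]
near
```

Note that the y column of the result holds the values from `probe_dt`, not the matched values from `ref_dt`.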

---

# data.table exercise MORE ADVANCED USAGE: foverlaps

```{r}
dt3 <- data.table(min_y = c(0, 10, 15, 20), max_y = c(10, 15, 20, 30))
setkey(dt3, min_y, max_y)
dt1[, `:=` (dt1_y_end = c(13, 25, 10, 22), dt1_y=y)]
setkey(dt1, dt1_y, dt1_y_end)
foverlaps(dt1, dt3, type = "any")
```

The windows do not have to be of equal width:
```{r}
dt1
dt3
```
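
The `type` argument controls what counts as an overlap. As a sketch on made-up integer intervals (`windows_dt`, `spans_dt` and their column names are hypothetical), `type = "any"` accepts partial overlaps, while `type = "within"` only matches spans that lie completely inside a window:

```{r}
library(data.table)

# Hypothetical toy intervals, only for illustration
windows_dt <- data.table(start = c(0L, 10L, 20L), end = c(10L, 20L, 30L))
setkey(windows_dt, start, end)

spans_dt <- data.table(from = c(2L, 8L, 21L), to = c(5L, 12L, 29L))
setkey(spans_dt, from, to)

# Any overlap: the span 8-12 touches two windows, so it appears twice
hits_any <- foverlaps(spans_dt, windows_dt, type = "any")
hits_any

# Fully contained: the span 8-12 fits in no window, so its window columns are NA
hits_within <- foverlaps(spans_dt, windows_dt, type = "within")
hits_within
```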