-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path07-data-manipulation.Rmd
395 lines (280 loc) · 17 KB
/
07-data-manipulation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
# Data Manipulation{#datman}
```{block2, type = 'rmdtip'}
In most instances, the structure of the available data will not meet the specific requirements needed to perform the analyses you are interested in. Data analysts typically spend most of their time cleaning, filtering, restructuring data as well as harmonizing and joining data from different sources.
In this lesson, you will learn about the fundamental data wrangling operations using the `dplyr` library from the Tidyverse suite. You'll also be introduced to [tibbles](https://tibble.tidyverse.org/){target="_blank"}, a modern take on data frames in R. Tibbles are lightweight and work seamlessly within the Tidyverse ecosystem. For more on tibbles, see [Tibbles in R for Data Science](https://r4ds.had.co.nz/tibbles.html){target="_blank"}.
```
## Preparation
If not already installed on your machine, install both the `tidyverse` and `nycflights13` libraries. (see "libraries" in lesson [2](#core)).
The code below loads a sample dataset (a [tibble](https://tibble.tidyverse.org/){target="_blank"}) from the library `nycflights13` into the variable `flights_from_nyc`. We will use this sample data in this lesson.
```{r, echo=TRUE, message=FALSE, warning = FALSE}
library(nycflights13)
flights_from_nyc <- nycflights13::flights
```
The operator `::` is used to indicate that the function `flights` (that returns our sample dataset) is situated within the library `nycflights13`. This helps avoid ambiguities when functions from different loaded libraries have the same names.
In order to run the following data wrangling examples on your machine, add both lines above as well as the code snippets provided in the upcoming examples to a new R script file.
```{block2, type = 'rmdtip'}
Once you have loaded the flights table, open the **Environment Tab** in RStudio and double-click variable `flights_from_nyc` to inspect the variable contents.
Alternatively, you may inspect `flights_from_nyc` by writing it to the console.
```
## Data manipulation
The `dplyr` library provides a number of functions to investigate the basic characteristics of inputs. For instance, the function `count()` can be used to count the number of rows of a data frame or tibble. The code below uses `flights_from_nyc` as input to this function. The returned row count `n` is represented in a table.
```{r, echo=TRUE, message=FALSE, warning = FALSE}
library(tidyverse)
library(knitr)
flights_from_nyc %>%
dplyr::count() %>%
knitr::kable()
```
In the example above, we use the pipe operator `%>%`, a key feature of `magrittr` and `tidyverse` syntax. The operator links a sequence of analysis steps, passing `flights_from_nyc` through `count()` and then `kable()`. This makes the code simpler and more readable. For a deeper understanding, check out [this video](https://www.youtube.com/watch?v=sohARFx6aTo){target="_blank"}.
The same result could be achieved with a traditional approach using an intermediate variable or by nesting functions. However, these methods can make the code less readable compared to using the pipe operator:
**Intermediate Variable:**
```{r, eval=FALSE, echo=TRUE, message=FALSE, warning = FALSE}
cnt <- dplyr::count(flights_from_nyc)
knitr::kable(cnt)
```
**Nested Function:**
```{r, eval=FALSE, echo=TRUE, message=FALSE, warning = FALSE}
knitr::kable(dplyr::count(flights_from_nyc))
```
The result is the same. However, nested structures are harder to read. Accordingly, the pipe operator is a powerful tool to simplify your code. See [this video](https://www.youtube.com/watch?v=sohARFx6aTo){target="_blank"} to learn more about it.
The function `count()` can also be used to count the number of rows of a table that has the same value for a given column, usually representing a category.
In the example below, the column name `origin` is provided as an argument to the function `count()`, so rows representing flights from the same origin are counted together – `EWR` is the Newark Liberty International Airport, `JFK` is the John F. Kennedy International Airport, and `LGA` is LaGuardia Airport.
```{r, echo=TRUE, message=FALSE, warning = FALSE}
flights_from_nyc %>%
dplyr::count(origin) %>%
knitr::kable()
```
Notice how the code becomes more readable with each operation on a new line after `%>%`. Remember, the pipe operator passes the left-hand object as the first argument to the right-hand function. To change the argument placement, use `.` within the function call.
For example, you could explicitly pass variable `flights_from_nyc` as a first argument by `dplyr::count(., origin)`. Passing the variable as a second argument `dplyr::count(origin, .)` will return an error as the data frame is expected as a first input to function `count`.
### Summarise
To carry out more complex aggregations, the function `summarise()` can be used in combination with the function `group_by()` to summarise the values of the rows of a data frame or tibble. Rows having the same value for a selected column (in the example below, the same `origin`) are grouped together, then values are aggregated based on the defined function (using one or more columns in the calculation).
In the example below, the function `mean()` is applied to the column `distance` to calculate a new column `mean_distance_traveled_from` (the mean distance travelled by flights starting from each airport).
```{r, echo=TRUE}
flights_from_nyc %>%
dplyr::group_by(origin) %>%
dplyr::summarise(
mean_distance_traveled_from = mean(distance)
) %>%
knitr::kable()
```
The results show that the average distance covered by flights starting from `JFK` is significantly higher that flights departing from `LGA` or `EWR`.
### Select and filter
The function `select()` can be used to select a subset of **columns**. For instance in the code below, the function `select()` is used to select the columns `origin`, `dest`, and `dep_delay`. The function `slice_head` is used to include only the first `n` rows in the output.
```{r, echo=TRUE}
flights_from_nyc %>%
dplyr::select(origin, dest, dep_delay) %>%
dplyr::slice_head(n = 5) %>%
knitr::kable()
```
The function `filter()` can instead be used to filter **rows** based on a specified condition. In the example below, the output of the `filter` step only includes the rows where the value of `month` is `11` (i.e., the eleventh month, November).
```{r, echo=TRUE}
flights_from_nyc %>%
dplyr::select(origin, dest, year, month, day, dep_delay) %>%
dplyr::filter(month == 11) %>%
dplyr::slice_head(n = 5) %>%
knitr::kable()
```
Notice how `filter` is used in combination with `select`. All functions in the `dplyr` library can be combined, in any other order that makes logical sense. However, if the `select` step didn’t include `month`, that same column couldn’t have been used in the `filter` step.
### Mutate
The function `mutate()` can be used to add a new column to an output table. The `mutate` step in the code below adds a new column `air_time_hours` to the table obtained through the pipe, that is the flight air time in hours, dividing the flight air time in minutes by `60`.
```{r, echo=TRUE}
flights_from_nyc %>%
dplyr::select(flight, origin, dest, air_time) %>%
dplyr::mutate(
air_time_hours = air_time / 60
) %>%
dplyr::slice_head(n = 5) %>%
knitr::kable()
```
```{block2, type = 'rmdexercise'}
Run the mutate example above in a new script and replace `dplyr::mutate` with `dplyr::transmute`. What happens to your results?
<details closed>
<summary><ins>**See solution!**</ins></summary>
<p><i><font color="grey">
The `transmute` function creates a new data frame containing only column `air_time_hours`. The function drops other table columns.
</font></i>
</p>
</details>
```
### Arrange
The function `arrange()` sorts a `tibble or data frame in ascending order based on the values in the specified column. If a negative sign is specified before the column name, the descending order is used. The code below would produce a table showing all the rows when ordered by descending order of air time.
```{r, echo=TRUE}
flights_from_nyc %>%
dplyr::select(origin, dest, air_time) %>%
dplyr::arrange(-air_time) %>%
dplyr::slice_head(n = 5) %>%
knitr::kable()
```
In the examples above, we have used `slice_head` to present only the first `n` rows in a table, based on the existing order.
### Exercise: data manipulation
```{block2, type = 'rmdexercise'}
The Food and Agriculture Organization (FAO) is a specialized agency of the United Nations that leads international efforts to defeat hunger. On their [Website](http://www.fao.org/home/en/){target="_blank"} they provide comprehensive datasets on global crop and livestock production. Your task in this exercise is to create a table that shows national African sorghum production in 2019.
```
1. Create an RScript and install or load the libraries `tidyverse` and `knitr`, if not done yet.
2. Bulk download [African Crop and Livestock Production](http://www.fao.org/faostat/en/#data/QCL){target="_blank"} data as CSV:
```{r L06-faodata, echo=FALSE, fig.cap="FAO Data Download", out.width='100%', fig.asp=.75, fig.align='center'}
knitr::include_graphics('images/FAODataAfrica.png')
```
3. Read data from comma-separated CSV (`"Production_Crops_Livestock_E_Africa_NOFLAG.csv"`) into your Script.
```{echo=FALSE}
fao_data <- read.csv(directory as string, header = TRUE, sep = ",")
```
4. Use the pipe operator to perform the following operations:
a) Select columns `Area`, `Item`, `Element`, `Unit` and `Y2019`
b) Filter rows that contain sorghum production (`Item == "Sorghum" & Element == "Production"`)
c) Sort the table based on yield in descending order (`arrange`)
d) remove rows including `No Data` by means of function `drop_na()`
e) render the table using the function `kable()` of `library knitr`
See [my solution](Solution_Exercise_DataWrangling1.R){target="_blank"}!
**Note on Encoding:** When working with datasets from various sources, you may encounter encoding errors. If such an issue arises, a simple approach is to specify the encoding in the `read.csv()` function. For instance, you can use `read.csv(..., fileEncoding = "latin1")`. However, the best case would be, to check the original encoding of the data beforehand and match it in your `read.csv()` command. Encoding can be tricky, and while it's outside the scope of this exercise, being aware of it is crucial in data analysis.
## Join
A join operation combines two tables into one by matching rows that have the same values in a specified column. This operation is usually executed on columns containing identifiers, which are matched through different tables containing different data about the same real-world entities.
For instance, the table created below (data frame `city_telephone_prexix`) contains the telephone prefixes of three cities.
```{r, echo=TRUE, message=FALSE, warning=FALSE}
city_telephone_prexix <- data.frame(
city = c("Leicester", "Birmingham", "London"),
telephon_prefix = c("0116", "0121", "0171")
) %>%
tibble::as_tibble()
city_telephone_prexix %>%
knitr::kable()
```
That information can be combined with the data present in a second table `city_info_wide` (see below) through the join operation on the columns containing the city names.
```{r, echo=TRUE, message=FALSE, warning=FALSE}
city_info_wide <- data.frame(
city = c("Leicester", "Nottingham"),
population = c(329839, 321500),
area = c(73.3, 74.6),
density = c(4500, 4412)
) %>%
tibble::as_tibble()
city_info_wide %>%
knitr::kable()
```
```{block2, type = 'rmdtip'}
Note that data frames in the code above are converted to [tibbles](https://tibble.tidyverse.org/){target="_blank"}. This step is necessary because the `kable()` function requires `tibbles` as input.
```
Joins between two tables can be executed by different join operations. The `dplyr` library offers join operations, which correspond to `SQL` joins illustrated in the image below.
```{r L06-joins, echo=FALSE, fig.cap="Join types", out.width='70%', fig.asp=.75, fig.align='center'}
knitr::include_graphics('images/joins.png')
```
Please take your time to understand the examples below that show different realizations of joins between table `city_telephone_prexix` and table `city_info_wide`.
The first four examples execute the exact same *full join* operation using three different syntaxes: with or without using the pipe operator and specifying the `by` argument or not. Note that all those approaches to writing the join are valid and produce the same result. The choice about which approach to use will depend on the code you are writing. In particular, you might find it useful to use the syntax that uses the pipe operator when the join operation is itself only one stem in a series of data manipulation steps. Using the `by` argument is usually advisable unless you are certain that you aim to join two tables with all and exactly the column that have the same names in the two table.
Note how the result of the join operations is *not* saved to a variable. The function `knitr::kable` is added after each join operation through a pipe `%>%` to display the resulting table in a nice format.
\newpage
```{r, echo=TRUE, message=FALSE, warning=FALSE}
# Option 1: without using the pipe operator
# full join verb
dplyr::full_join(
# left table
city_info_wide,
# right table
city_telephone_prexix,
# columns to match
by = c("city" = "city")
) %>%
knitr::kable()
```
```{r, echo=TRUE, message=FALSE, warning=FALSE}
# Option 2: without using the pipe operator
# and without using the argument "by"
# as columns have the same name
# in the two tables.
# Same result as Option 1
# full join verb
dplyr::full_join(
# left table
city_info_wide,
# right table
city_telephone_prexix
) %>%
knitr::kable()
```
\newpage
```{r, echo=TRUE, message=FALSE, warning=FALSE}
# Option 3: using the pipe operator
# and without using the argument "by"
# as columns have the same name
# in the two tables.
# Same result as Option 1 and 2
# left table
city_info_wide %>%
# full join verb
dplyr::full_join(
# right table
city_telephone_prexix
) %>%
knitr::kable()
```
```{r, echo=TRUE, message=FALSE, warning=FALSE}
# Option 4: using the pipe operator
# and using the argument "by".
# Same result as Option 1, 2 and 3
# left table
city_info_wide %>%
# full join verb
dplyr::full_join(
# right table
city_telephone_prexix,
# columns to match
by = c("city" = "city")
) %>%
knitr::kable()
```
\newpage
```{r, echo=TRUE, message=FALSE, warning=FALSE}
# Left join
# Using syntax similar to Option 1 above
# left join
dplyr::left_join(
# left table
city_info_wide,
# right table
city_telephone_prexix,
# columns to match
by = c("city" = "city")
) %>%
knitr::kable()
```
```{r, echo=TRUE, message=FALSE, warning=FALSE}
# Right join
# Using syntax similar to Option 2 above
# right join verb
dplyr::right_join(
# left table
city_info_wide,
# right table
city_telephone_prexix
) %>%
knitr::kable()
```
\newpage
```{r, echo=TRUE, message=FALSE, warning=FALSE}
# Inner join
# Using syntax similar to Option 3 above
# left table
city_info_wide %>%
# inner join
dplyr::inner_join(
# right table
city_telephone_prexix
) %>%
knitr::kable()
```
Compare the results of respective full, left, right and inner join operations. Turn to the discussion forum in case you need further explanations.
### Exercise: join
```{block2, type = 'rmdexercise'}
In the previous exercise, we created a table that shows national African sorghum production in 2019. In this exercise we will join crop production statistics with a table that contains national boundaries and visualize sorghum production quantities in a simple map.
```
1. Create an RScript and install and load the libraries `tidyverse`, `knitr`, `ggplot2` and `maps`, if not done yet.
2. Copy the [code from the previous exercise](Solution_Exercise_DataWrangling1.R){target="_blank"} into your new RScript. Note that the result of the pipe operations is *not* saved to a variable. Save it to a variable.
3. Use the `ggplot2` function [`map_data`](https://ggplot2.tidyverse.org/reference/map_data.html){target="_blank"} to convert the built in sample dataset `world` (comes with library [`maps`](https://cran.r-project.org/web/packages/maps/index.html){target="_blank"}) to a data frame:
```{echo=FALSE}
world_ctry <- map_data("world")
```
4. Inspect the structure of this data frame. Every row represents a node (defined by long/lat) of a polygon feature (national boundaries).
5. Join tables (left table: geographic features, right table: sorghum production statistics) based on country names. Make sure to choose a join method (`full_join`, `inner_join`, `left_join` or `right_join`) that allows for retaining all the geographic features.
[Our exercise solution](Solution_Exercise_DataWrangling2.R){target="_blank"} creates a simple output map. Don't worry in [Lesson 10](#data-viz) we will cover visualization methods in more detail.
```{block2, type = 'rmdtip'}
Take a look at the [dplyr Cheatsheet](https://www.rstudio.com/resources/cheatsheets/){target="_blank"} which shows the most important dplyr operations at a glance.
```