-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathDATASETS.Rmd
141 lines (100 loc) · 3.83 KB
/
DATASETS.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
---
title: 'Datasets and dataframes'
---
```{r, include=FALSE}
library(tidyverse)
library(tufte)
```
# (PART) Data {- #data}
# The `dataframe` {#datasets-dataframes}
A `dataframe` is a container for our data.
It's much like a spreadsheet, but with some constraints applied. 'Constraints'
might sound bad, but they're actually helpful: they make dataframes more
structured and predictable to work with. The main constraints are that:
- Each column is a [vector](#vectors-and-lists), and so can
[only store one type of data](#vectors-and-lists).
- Every column has to be the same length (although missing values are
allowed).
- Each column must have a name.
A `tibble` is an updated version of a dataframe with a whimsical name, which is
part of the `tidyverse`. It's almost exactly the same a dataframe, but with some
rough edges smoothed off — it's safe and preferred to use `tibble` in place of
`data.frame`.
You can make a simple tibble or dataframe like this:
```{r}
data.frame(myvariable = 1:10)
```
Using a tible is much the same, but allows some extra tricks like creating one
variable from another:
```{r}
tibble(
height_m = rnorm(10, 1.5, .2),
weight_kg = rnorm(10, 65, 10),
bmi = weight_kg / height_m ^ 2,
overweight = bmi > 25
)
```
#### Using 'built in' data {- #built-in-data}
The quickest way to see a dataframe in action is to use one that is built in to
R
([this page](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html)
lists all the built-in datasets). For example:
```{r}
head(airquality)
```
Or
```{r}
head(mtcars)
```
In both these examples the datasets are already loaded and available to be used
with the `head()` function.
:::{.exercise}
To find a list of all the built in datasets you can type `help(datasets)` into
the console, or see
<https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html>.
Familiarise yourself with some of the other included datasets, e.g.
`datasets::attitude`. Watch out that not all the included datasets are
_dataframes_: Some are just vectors of observations (e.g. the `airmiles` data)
and some are 'time-series', (e.g. the `co2` data)
:::
#### Looking at dataframes {- #looking-at-data}
As we've already seen, using `print(df)` within an RMarkdown document creates a
nice interactive table you can use to look at your data.
However you won't want to print your whole data file when you Knit your
RMarkdown document. The `head` function can be useful if you just want to show a
few rows:
```{r}
head(mtcars)
```
Or we can use `glimpse()` function from the `dplyr::` package (see the
[section on loading and using packages](#packages)) for a different view of the
first few rows of the `mtcars` data. This flips the dataframe so the variables
are listed in the first column of the output:
```{r}
glimpse(mtcars)
```
You can use the `pander()` function (from the `pander::` package) to format
tables nicely, for when you Knit a document to HTML, Word or PDF. For example:
```{r}
library(pander)
pander(head(airquality), caption="Tables always need a caption.")
```
See the section on [sharing and publishing](#sharing-and-publication) for more
ways to format and present tables.
Other useful functions for looking at and exploring datasets include:
- `summary(df)`
- `psych::describe(df)`
- `skimr::skim(df)`
:::{.exercise}
Experiment with a few of the functions for viewing/summarising dataframes.
:::
There are also some helpful plotting functions which accept a whole dataframe as
their input:
```{r, fig.cap="Box plot of all variables in a dataset."}
boxplot(airquality)
```
```{r, fig.cap="Correlation heatmap of all variables in a dataset. Colours indicate size of the correlation between pairs of variables."}
psych::cor.plot(airquality)
```
These plots might not be worth including in a final write-up, but are very
useful when exploring your data.