-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03_data_in_r.qmd
428 lines (355 loc) · 16.2 KB
/
03_data_in_r.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
# Data in R
In the previous section we learned how to store, extract, and manipulate
numerical vectors. But working with real data requires using multiple types of
data with different shapes and sizes. In this section we will learn how **R**\
handles different data types and structures, and how to use them to study and
summarize data.
## Data types
Data types are classifications of data that help **R** conform to our intuition.
For example, multiplying $2$ by $7$ feels right, but multiplying "pineapple" by
"stalactite" does not. There are six data types in **R**: double, integer,
logical, character, complex, and raw, each with its own rules for storing and
handling. Learning these rules will allow us to analyze data later with less
effort and fewer mistakes.
Data type **complex** is for imaginary numbers, and type **raw** is for raw
bytes of data. It is unlikely that you will ever need these data types, so I
will not explain them in these notes.
**Doubles** are regular numbers with a decimal value (which may be zero). In general, **R** will save any number that we type as a double.
```{r double value}
my_double <- 5
typeof(my_double)
```
**Integers** are numbers that have no decimal component. To create an integer
you must type a number followed by an `L`:
```{r integer value}
my_integer <- 5L
typeof(my_integer)
```
As data scientists we rarely use integers because we can save them as doubles.
But they can help us do complicated operations because **R** stores them more
precisely and with less memory than doubles.
**Logicals** are truth values `TRUE` and `FALSE`, which are useful when we
compare numbers or objects:
```{r logical value}
my_comparison <- -3 < 1
typeof(my_comparison)
```
::: {.callout-tip}
#### Write TRUE and FALSE explicitly
At the beginning of every session, **R** saves `T` and `F` as shortcuts for
`TRUE` and `FALSE`. But `T` and `F` are not reserved words; they are regular
objects that we can modify and even delete inadvertently. An accidental misuse
of `T` or `F` will cost you more time and effort than whatever you may save by
typing a single letter instead of a full word. So, I strongly suggest you always
write the complete words.
:::
There is also a special type of logical value called `NA`, which denotes a
missing value.
```{r NA as a logical value}
missing_value <- NA
typeof(missing_value)
```
Treating `NA` as a logical value allows **R** to handle missing data in
intuitive ways. Consider, for example, that if a value is missing, we can not
know whether it is bigger than a number, or whether it is equal to a word. **R**
reflects this intuition by forcing any comparison involving `NA` to result in
another `NA`.
```{r compare NA}
NA > 5
NA == "Sancho"
```
**Characters** are text like "hello", "Elvis", or "Somewhere in La Mancha"; or
symbols that we want to handle as text, like "size 45" or "mail/u". We can
create a character object by typing a character or *string* of characters
surrounded by quotes:
```{r character value}
my_character <- "Somewhere in La Mancha"
typeof(my_character)
```
It is easy to confuse character strings with objects because both look like
chunks of text in the code. For example, `the_thing` and `"the_thing"` look
alike. However, `the_thing` is the name of an object that can contain any type
of data, while `"the_thing"` is a character string, i.e., a piece of data
that we can assign to any name. If we forget the quotation marks when writing a
name, **R** will look for an object that likely doesn't exist, so we will likely
get an error.
```{r}
#| error: true
noquotes
```
Also, notice that **R** will be treat *anything* surrounded by quotes as a
character string---regardless of what is between the quotes.
```{r numbers as characters}
nums_as_chars <- "-5 + 71"
typeof(nums_as_chars)
```
Still, we can identify strings because **R** always shows them surrounded by
quotation marks and in a different colors from other data types.
```{r character vs double}
typeof("9")
typeof(9)
```
A special type of character string is a *factor*, which stores categorical
information like color or level of agreement. Factors can only have certain
values called "levels", which may be ordered (e.g., "agree", "neutral",
"disagree") or disordered (e.g., "red" or "green"). To create a factor, we can
start with a vector of strings and use function `as.factor()` to convert it to a
factor.
```{r create factor}
my_factor <- as.factor("red")
my_factor
```
They may seem pointless now, but factors will matter more after we understand
the different ways that **R** can group and organize data.
## Data structures
Data structures are ways of organizing data that make it easier for us to
manipulate and understand it. I will explain five different data structures,
each with its own advantages and limitations: atomic vectors, matrices, arrays,
lists, and data frames.
### Atomic vectors
Atomic vectors are the most basic type of data structure. They are
one-dimensional groups of data where all values must be of the same type. There
is only one exception: any vector can include `NA` as a value regardless of the
type of the other values. This internal consistency of vectors makes it easy for
us to store values that are supposed to describe the same property.
To create an atomic vector, we can group values using the combine function `c()`:
```{r atomic vector}
quijote_characters <- c("Don Quijote", "Sancho Panza", NA)
```
Atomic vectors can have almost^[The maximum length (number of elements) of a
vector is 2^31 - 1 ~ 2*10^9. See `help(.Machine)` for more information.] as many
elements as you want (including zero).
```{r}
# length() counts the number of elements in the vector
length(quijote_characters)
length(c())
```
::: {.callout-note}
#### Coercion
Adding different data types to the same atomic vector does not produce an error.
Instead, **R** automatically follows specific rules to *coerce* everything
inside the vector to be of the same type. If a character string is present in an
atomic vector, **R** will convert all other values to character strings. If a
vector only contains logicals and numbers, **R** will convert the logicals to
numbers; every `TRUE` becomes a `1`, and every `FALSE` becomes a `0`. The only
values that are not coerced are `NA`s.
Following these rules helps preserve information. It is easy, for example, to
recognize the original types of strings `"TRUE"` and `"3.14"`, or to transform a
vector of `1`s and `0`s back to `TRUE`s and `FALSE`s.
:::
### Matrices
Matrices are two-dimensional "sheets" of data. To create a matrix, first give
`matrix()` an atomic vector to reorganize into a matrix. Then, define the number
of rows using the `nrow` argument, or the number of columns using the `ncol`
argument. `matrix()` will reshape your vector into a matrix with the specified
number of rows (or columns).
```{r create matrix}
scores_vec <- c(-27, 2, 2, 14, -28, 35, 8, 13, 4)
scores_mat <- matrix(data = scores_vec, nrow = 3)
scores_mat
# Equivalently
scores_mat <- matrix(data = scores_vec, ncol = 3)
scores_mat
```
Like atomic vectors, matrices can have any data type, but only one (or `NA`).
```{r}
character_mat <- matrix(
data = c("Mario", "Peach", "Luigi", "Yoshi"),
ncol = 2
)
character_mat
```
By default `matrix()` will fill up the matrix column by column. But we can fill
the matrix row by row if we include the argument `byrow = TRUE`:
```{r create matrix filling by row}
scores_vec <- c(-27, 2, 2, 14, -28, 35, 8, 13, 4)
scores_mat <- matrix(data = scores_vec, nrow = 3, byrow = TRUE)
scores_mat
```
When showing a matrix, **R** shows expressions with square brackets (e.g.,
`[,1]`). The numbers inside the square brackets are positional indices that
denote the "coordinates" of the matrix. Two-dimensional objects like matrices
have two indices. The first index always refers to the row, and the second
always refers to the column. So, as with vectors, we can use square bracket
notation `[ ]` to extract values from matrices.
```{r extract value from matrix}
scores_mat[c(1, 3), 2] # Rows 1 and 3 in column 2
```
We can even extract entire rows or columns.
```{r extract rows and columns from matrix}
scores_mat[2,] # Extract entire second row
scores_mat[, 1] # Extract entire first column
```
::: {.callout-note}
## Matrices are fancy vectors
Deep down, **R** thinks of a matrix as a vector folded to look like a square.
That means, among other things, that you can reference an element of a vector
with a single positional index. Check what happens if you run `scores_mat[5]`.
:::
We can define names for the rows and the columns of a matrix using the
`rownames()` and `colnames()` functions.
```{r naming rows and columns of matrix}
rownames(scores_mat) <- c("Andrew", "Booker", "Comstock")
colnames(scores_mat) <- c("Columbia", "Rapture", "Atlantic")
scores_mat
```
Now we can use these names to extract values:
```{r extracting values using names}
scores_mat["Andrew", c("Rapture", "Columbia")]
```
There are several useful functions to do matrix operations. For example:
```{r matrix operations}
#| eval: false
t(scores_mat) # Transpose the matrix
diag(scores_mat) # Extract values in diagonal
scores_mat + scores_mat # Matrix addition
scores_mat * scores_mat # Element-wise multiplication
scores_mat %*% scores_mat # Matrix multiplication
```
### Arrays
An array is a multidimensional object that stacks groups of data. Using 1
dimension in an array forms a column of data with multiple values; using 2
dimensions is like a sheet of paper with several columns of data; 3 dimensions
are like a book with several sheets; 4 dimensions are like a box with several
books, and so on. Note that layers of an array have consistent sizes. All books
have the same number of sheets, and all sheets have the same number of rows and
columns.
We can use `array()` to create an n-dimensional array. The first argument in
`array()` must be a vector with the values that we want to store in the array.
The second argument must be a vector where the length denotes the number of
dimensions, and the values denote the size of each dimension.
```{r array with three dimensions}
# This is an array with 3 matrices, each with 2 rows and 2 columns
array(c(25:28, 35:38, 45:48), dim = c(2, 2, 3))
```
The `dim` argument works from the inside out. The first value is the number of
elements in each column, the second value is the number of columns, the third
element is the number of matrices, and so on.
Note that the total number of elements in the array is equal to multiplying the
sizes of all dimensions. If the vector we use to build the array has a different
number of elements, **R** will discard or recycle values from the vector. Check
it yourself.
::: {.callout-tip}
## Applied inception
Try to make an array with 4 dimensions. Following the metaphor from above, try
to make an array with 3 books, each of which has 4 sheets with 2 columns and 2
rows each. See a quick solution below.
```{r inception solution}
#| code-fold: true
#| eval: false
array(c(1:48), dim = c(2, 2, 4, 3))
```
:::
Vectors, matrices, and arrays need all of its values to be of the same type.
This requirement seems severe, but it allows the computer to store large sets of
numbers in a simple and efficient way; and it accelerates computations because
it tells **R** that it can manipulate all values in the object in the same way.
However, we often need to store different types of data in a single
place---maybe because all the data belong to the same underlying subject of
study. For example, we can describe a dog based on its height, weight, and age
(numerical values), and on its color and breed (character strings or factors).
**R** can keep all this information in a single place.
### Lists
Lists can group **R** objects like vectors, arrays, and other lists in a set of
one dimension. To create a list, use the function `list()` and separate each of
its elements with a comma.
```{r diverse list}
all_in_one_list <- list(c(3.1, 10), "El Zorro", list(character_mat))
all_in_one_list
```
The double-bracketed indices tell you which *element* of the list is displayed.
The single-bracketed indices tell you which *subelement* of an element is
displayed. For example, `3.1` is the first subelement of the first element in
the list, and `"El Zorro"` is the first subelement of the second element. This
double notation helps us recognize which level of the stacking we are in,
regardless of what we stack inside the list.
There are two ways to access an element from a list, depending on what we want
to do with the output. We can use single-bracket notation `[ ]` to get a new
list with elements from the original list:
```{r extracting list elements as a new list}
new_list <- all_in_one_list[1]
new_list
typeof(new_list)
```
Or we can use double-bracket notation `[[ ]]` to get only the subelements of an
element from the original list (we can not extract multiple elements this way).
```{r extracting a single element without more lists}
new_item <- all_in_one_list[[1]]
new_item
typeof(new_item)
```
We can name the elements of a list when we first create it:
```{r name list at creation}
countries_info <- list(
countries = c("Japan", "Egypt", "Mexico", "Finland"),
temperatures_fahrenheit = c(50, 90, 65, -10),
speak_spanish = c(FALSE, FALSE, TRUE, FALSE)
)
```
Or we can name them after creating the list using `names()`:
```{r name list after creation}
names(all_in_one_list) <- c("weight", "hero", "game")
all_in_one_list
```
With a named list, we can also use dollar sign notation `$` to extract elements.
This notation produces the same result as using the double bracket notation `[[ ]]`
```{r extract with $ notation}
countries_info$speak_spanish
countries_info[["speak_spanish"]]
```
### Data frames
Data frames are the most common storage structure for data analysis. We can
think of them as a group of atomic vectors (columns) of the same length.
Usually, each row of a data frame represents an individual observation and each
column represents a different measurement or variable of that observation.
Different vectors in a data frame can have different data types, but they must
all have the same length. If we use vectors of different lengths, **R** will
recycle values of the shorter vectors to ensure that the data frame has a square
shape.
We can create a data frame using the `data.frame()` function. Give
`data.frame()` any number of vectors of equal length, each separated with a
comma. Each vector should be set equal to a name that describes the vector.
`data.frame()` will turn each vector into a column of the new data frame:
```{r create data frame from scratch}
aliens_df <- data.frame(
name = c("Bender", "Fry", "Nibbler", "Zoidberg"),
species = c("Robot", "Human", "Nibblonian", "Decapodian"),
fingers_per_hand = c(3, 4, 3, NA)
)
aliens_df
```
We can use the function `dim()` to get the size of each dimension of the data
frame, and the function `str()` to get a compact summary:
```{r minimal summary of data frame}
dim(aliens_df)
str(aliens_df)
```
As you can see, `str()` gives us the data frame dimensions, reminds us that
`aliens_df` is of type data.frame, lists the names and data types of all of the variables (columns) contained in the data frame, and prints the first five values.
To us, a data frame resembles a matrix, but to **R** it is a list with an
attribute "class" set to "data.frame"
```{r type and class of data frame}
typeof(aliens_df)
class(aliens_df)
```
This means that we can extract extract values from data frames the same way we
extract from lists. We can use single brackets `[]` to get a new data frame:
```{r extracting values from data frame}
aliens_df[2]
# Equivalently
aliens_df["species"]
```
Or we can use double brackets `[[ ]]` or dollar sign notation `$` to get atomic vectors:
```{r}
aliens_df$name
```
Also, as with matrices, we can use a single bracket with two indices (note that
this produces an atomic vector):
```{r}
aliens_df[c(1,2), "fingers_per_hand"]
```
Creating data frames from scratch is cumbersome and prone to errors. In the next
section, we will see how to import data from different sources into **R**, as
well as basic ways to prepare it for analysis.
## References
Most of this section is based on ["Hands-On Programming with R"](https://rstudio-education.github.io/hopr/), by Garret Grolemund; and on ["An Introduction to R"](https://intro2r.com/), by Alex Douglas, Deon Roos, Francesca Mancini, Ana Couto & David Lusseau.