-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path01_intro.Rmd
417 lines (256 loc) · 19.1 KB
/
01_intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
---
title: "Intro to R"
author: "Kyla McConnell"
output: html_document
---
# Welcome to R!
Why should you learn R?
It's...
- totally free!
- open-source, so is always being extended for all different types of applications.
- a scripting language, so you can share your analysis AND all the steps it took for you to get there, making your analysis reproducible and transparent.
- an established tool with tons of tutorials and help pages (try StackOverflow, Google, or YouTube)
- And it has an amazing and inclusive community (check out #rstats on Twitter and R-Ladies chapters around the world)
## Step One: Install R and R-Studio
R is a programming language that can be written directly into your computer's console/terminal. However, almost all users of R pair it with the program R-Studio, an integrated development environment (IDE) that offers a ton of features that will make your life much easier!
Download R here (if in Germany): https://ftp.gwdg.de/pub/misc/cran/
Or select a closer location ("mirror") here to optimize performance: https://ftp.gwdg.de/pub/misc/cran/
Then download R-Studio (Desktop) here: https://rstudio.com/products/rstudio/download/
More installation info here if you have trouble: https://rstudio-education.github.io/hopr/starting.html
Mac users -- a program called XQuartz is needed for viewing some plots. This might not come up as a problem for you until later, but to be safe, you can also go ahead and install it now: https://www.xquartz.org/
Note: You must install both R AND R-Studio in order to use R-Studio for the rest of this tutorial.
## Step Two: Check out R-Studio
R-Studio can make your R journey much easier! With R-Studio, you go from a programming language of text-only to a whole program, not too unlike Microsoft Word in comparison to text files.
In general, you have four panels:
Top-left: your scripting panel, in which you can write and save code. This may only open when you open a new file (File -> New File -> R Script / R-Markdown, we'll talk about the difference below)
Bottom-left: the console panel, for input/output that won't be saved.
Top-right: the environment panel, which generally can show you dataframes you are working with as well as other information you've saved to variables (and sometimes has other features too, depending on what you're working with)
Bottom-right: install and update packages, preview plots, read help files, and some other features
![R-Studio](img/rstudio_panels.png)
Image from https://ourcodingclub.github.io/tutorials/intro-to-r/
## Step Three: Open your first script
There are two types of file that you can use with R-Studio: a simple script, and a more feature rich R-Markdown document like this one.
Both of these file types allow you to save code to continue working on t, re-run it later, or have as a record of your analysis steps.
### R script (.R)
R scripts are the most simple files for saving your R code. They consist only of code.
To open a new script, select File -> New File -> R Script
In this type of file, you write lines of code in the order they are meant to be run, and can run each line by clicking that line and pressing Control + R (on Windows) or Cmd + Enter (on Mac)
You can try out this type of script by typing some mathematical operations in the script and running those lines. The output will show below, in your console panel.
All text in an R script is read as code. To leave comments for yourself or to indicate that a certain line shouldn't be run as code, use a hashtag (#) at the beginning of that line.
### R-Markdown file (.Rmd)
R-Markdown files are more complex files, that contain both code and text. They can also include images and some basic formatting.
To open a new R-Markdown file, select File -> New File -> R Markdown
This will open a pop-up where you can give your file a title (you can also change this later).
When you open a new R-Markdown file, you will be shown a basic template. You can see a header at the top, called the YAML information for that file.
For example:
---
title: "Intro to R"
author: "Kyla McConnell"
date: "12/21/2020"
output: html_document
---
Changing this information will change what is displayed if you save the file to a final format, which is denoted in the "output" line. It's not super important for basic applications of R-Markdown but can be customized more for more advanced uses.
### How to use R-Markdown
If you type into general whitespace in a normal R-file, this will be interpreted as code. In R-Markdown, however, text in whitespace is interpreted as plain text. You can use this to keep notes for yourself when learning something, or leave a record of your decisions in an analysis.
To add code to R-Markdown, you use code chunks that you can either type out (three backticks and {r} at the beginning and three backticks at the end) or add by clicking the "Insert" button in the top bar and selecting R.
```{r}
```
To run code in these blocks, click the green arrow at the top right corner of the box. The arrow with the green line under it will run all code blocks above the current one, but will not run the current block itself.
The output of the code block will show directly below it (not in the console, like in a standard R script.) Often, this output will be proceeded by a number in brackets -- this just helps you keep track of how many numbers are output if they are on more than one line. Most of the time, you can safely ignore this.
#### Try it out
1. Open a new R-Markdown file
2. Delete or adapt the template that is provided in the file
3. Add a new (R) code block and try some basic arithmetic. What happens if you have multiple lines of code in your code block? What happens if you chain together multiple operations? Try using different types of arithmetic symbols (+, - , * , / , ^)
Bonus: Can you figure out what the %% operator does (ex: 11 %% 2, 11 %% 3)
### Variables
One of the most fundamental parts of working in R is saving output to variables. This allows you to give a name or label to the output of some string of code. In R, you assign variables with the following syntax:
name_of_variable <- code
For example:
```{r}
fav_number <- 12
fav_number * 3
```
You can see that the code remembered that fav_number is actually 12.
In R, you can have as many variables as you'd like.
```{r}
worst_number <- 7
difference <- fav_number - worst_number
```
If you just assign variables, you'll see no output. However, if you look in the Environment tab of your top right panel in R-Studio, you will see the values you have saved to variables.
You can also call a variable by name to see what is saved in it.
```{r}
difference
```
Variables can also be used to create new variables, ex:
```{r}
double_number <- fav_number * 2
double_number
```
#### Try it out
1. You are planning a pizza party and want to figure out how many pieces of pizza each of your guests can eat before the pizzas run out. Create a variable called "guests" and assign it to the number of guests you have at your party, 15.
2. You order 5 pizzas. Assign the variable "pizzas" to the amount of pizzas you have. 3. Each pizza has 12 slices. Assign the variable "slices" to the amount of slices you have over all pizzas (hint: multiple the slices per pizza by the total number of pizzas -- you can use the pizzas variable here instead of retyping the number)
4. Divide your "slices" variable by your "guests" variable to figure out how many slices each person can have.
5. 5 more people arrive to the party uninvited. Update your "guests" variable to the total number you have now. How many slices can each person have now?
### Data types
So far, we've been using numbers to show how variables work. In R, these are called *numeric*. You can check data types with the command `class()`
There is also the type *integer*, which allows only whole numbers. You can force numbers into this form if it's needed with `as.integer()` -- this will truncate the decimal (cut it off), not round.
To round, use `round()`
```{r}
class(8.8)
as.integer(8.8)
round(8.8)
```
Another common data type is *character* -- this is text. Character variables should always be enclosed in either double or single quotation marks "/' Some operations won't work on strings, if they are expecting numeric. Check out the error here -- it may sound confusing but try to identify what it is telling you.
```{r}
class("hello")
my_word <- "hello"
my_word + 2
```
Another common variable type is *logical*. This can be either TRUE or FALSE, nothing else. You can assess logical statements with:
test if something is identical with ==
test if something is not identical with !=
test if something is greater than with > or less than with <
or try greater than or equals to >= or less than or equal to with <=
```{r}
7 < 8
"dog" == "dogs"
8.8 >= 5.6
"seven" == 7
class("seven" == 7)
```
Finally, there is also the data type *factor*, which identifies string variables that should be treated as categories / distinct levels of a grouping -> will become more relevant when we work with datasets
#### Try it out
1. Assign your name to the variable "my_name". What data type is the variable?
2. Try the command `nchar()` on your name variable. What does it do? Try also `toupper()`
3. What does nchar(my_name) >= 4 return? What does this mean?
Bonus: What does TRUE == 1 return? What about FALSE == 1? Try them both with 0 too. What does this tell you about how R deals with logical variables?
### Vectors
A special type of variable is the *vector*. Vectors contain multiple items of the same type and are enclosed in the syntax `c()`.
The command `length()` returns number of items in the vector.
For example:
```{r}
grocery_list <- c("apples", "bananas", "bread", "orange juice")
length(grocery_list)
```
You can also do "vector-wise" operations:
```{r}
my_numbers <- c(12, 14, 26, 57, 82)
my_numbers + 2
my_numbers + my_numbers
```
If you do a mathematical operation to a vector, it is broadcast to each item, i.e. it does each item separately.
Similarly, if you add/subtract/multiply/divide a vector by another vector, it will match the items up one at a time in a loop.
You can also take a certain item from a vector using square brackets. For example, take the first item:
```{r}
my_numbers[1]
```
#### Try it out
1. Make a variable called long_array and assign it the values 1, 1, 2, 2, 5, 10 (don't forget the `c()` syntax!)
2. Make a variable called short_array and assign it the values 1, 6, 10
3. Subtract short_array from long_array
4. Subtract long_array from short_array
5. Look at the output. What does this tell you about how R deals with vectors of different lengths?
### Installing packages
R has a lot of built-in functions, but most of them are basic operations. To do more advanced operations, you will probably have to use packages. *Packages* are extensions to R that any advanced R-programmer can make and submit to the R Consortium for inclusion on their platform, CRAN. When packages are on CRAN, any R-user can easily download and use them.
To install packages you can either use a command: `install.packages("packagename")` or you can use the panel at the lower right, under the Packages tab. Click Install, type the name of the package, and you will see some output in the console as the package installs.
Try it now with one of the most used packages:
```{r}
install.packages("tidyverse")
```
You may have to pick a CRAN "mirror". This is just meant to streamline your download by sending it from a close source -- pick whichever city is closest to you (it will also work even if you pick one further away.)
### Using packages
When you want to use a feature from a package, you will first have to call it with the `library()` command. This will only work if you've already installed the package, otherwise you'll get an error that the package wasn't found.
Usually, you should use the first few lines of your script or Markdown (in a code block) to load all packages you'll need for the script.
You'll know you've forgotten a package if R throws an error about not finding a certain command that you know exists, and other similar situations.
For example, let's load the tidyverse package for use later in this tutorial.
```{r}
library(tidyverse)
```
### Reading in dataframes / tibbles
If you are using R for data analysis, you generally have existing data you'd like to analyze. First, you have to know what type of file you're dealing with.
If you work with your data in Excel or a similar program, you can go to File -> Save As and select either comma-separated values (csv) or tab-separated values (tsv). This will save the file in what is really a simple text format with either commas or tabs between the rows. These file formats are easiest to work with in R.
You will also need to know where the file is stored. If you are working in an R script, you will need to set a "working directory" -- that is, the location that R should start looking for files -- with `setwd()`. If you work in R-Markdown, R will automatically start looking in the folder that your R-Markdown file is saved in. If you data is in that folder, you can call it by name including the extension, ex: "experiment_data.csv" or "jump_data.tsv". If it is within a folder that is inside of the folder your R-Markdown is in, you can include this: "data/jump_data.tsv"
Once you know the file type you're working with and where it's saved, you can read it in using the base-R, i.e. built-in, methods of `read.csv()` and `read.tsv()`. However, I would recommend using the modified calls `read_csv()` and `read_tsv()` from the tidyverse package because there are some improvements.
Make sure tidyverse is installed and loaded with a `library()` call.
Then, use the appropriate file type call on the path of the file, depending on where it is saved and whether you're using a script or a Markdown. *Save the output to a variable.*
Be aware: Different system languages can affect csv files. Some languages will separate instead with semicolons (;), especially in languages that use a comma as a decimal separator (ex: 8,35). In this case, you'll need `read_csv2()`
```{r}
library(tidyverse)
jump_data <- read_csv("jump.csv")
```
### Exploring your data
Now you have a data file read in, but how do you see what's in it?
One easy way is using the `head()` function, which will show the first six rows by default.
```{r}
head(jump_data)
```
You can also pass the `head()` call an extra, optional argument to show more rows. Just add n= the number of rows you want to see.
```{r}
head(jump_data, n=3)
```
You can also see your dataframe in the upper right panel under the Environment tab. Just click on the name of your dataframe to see it in a separate tab. In this tab, you can also sort columns and Filter rows -- just for viewing purposes though. This tab will get a bit slow if you start having huge dataframes but is often a good first look.
Another useful call is `summary()`. Call it on a dataframe to get each column and useful info based on the data type. For example, numeric columns will show the min, median, max and the quartiles (25% increments).
```{r}
summary(jump_data)
```
#### Try it out
1. Load in the dataframe "spr_example.csv" as a tibble, i.e. using the tidyverse call `read_csv()` Make sure to save it to a variable.
2. Take a look at your dataframe in the Environment tab.
3. Try the command `colnames()` on the variable you saved the dataframe to. What does it do?
4. We already tried the `head()` command, now try the `tail()` command. What does it do?
### Working with columns
In R, whenever we want to work with a specific column, we can call that column with dataframe_name$column_name. That is, the dollar symbol is used to pull out a specific column.
Let's try pulling out the columns individually to look at them. You'll notice that R-Studio will helpfully suggest the column names as soon as it sees the $
```{r}
jump_data$participant
```
```{r}
jump_data$height_cm
```
### Descriptive stats
Let's say we want to know a bit more about the columns. We can look at basic descriptive statistics -- the mean, the median and the range.
We need to work with a numeric column -- that is, one that is only numbers.
```{r}
mean(jump_data$height_cm)
```
```{r}
range(jump_data$height_cm)
```
```{r}
median(jump_data$height_cm)
```
You can also call summary on just one column:
```{r}
summary(jump_data$participant)
```
#### Try it out
1. Use the variable you saved the spr_example data to. What is the range of response times (RT)?
2. Save the mean RT to a variable called "average_RT". This number is in milliseconds. What would the equivalent be in seconds (hint: divide by 1000)
3. How many unique words are in this experiment? How many conditions are there?
### Categorical columns
By default, tibbles (the type of dataframe that is produced by a `read_csv()` call and the similar tidyverse calls) will read in any text as a character type. This is because, under the hood, R sometimes does weird things with factors (also, not every time you have text in a dataframe do you want to consider it as a grouping).
However, if you do have a grouping variable (participant, condition, type of something), it will generally work better with a lot of R applications to make sure R knows it is a factor.
Below, you can see with the `class()` call that the column is of type character. If you call `summary()` on the column, it tells you it contains 88 strings -- R doesn't realize that there are four participants or that the same label repeats.
```{r}
spr_data <- read_csv("data/spr_example.csv")
class(spr_data$participant)
summary(spr_data$participant)
```
To change a character column into a factor use `as.factor()` on the column name and save it over the original column.
Below, I convert the participant column to a factor -- check out the summary call. Now, R see that there are 22 instances of each of the four labeled groups.
```{r}
spr_data$participant <- as.factor(spr_data$participant)
class(spr_data$participant)
summary(spr_data$participant)
```
Factors are made up of groupings called 'levels'. To see the unique groupings of a given factor, use `levels()`
```{r}
levels(spr_data$participant)
```
We'll revisit the idea of factors later! R has a lot of functionality that will differ based on whether the column is a factor or character type.
For a quick reference, you can also check in the data tab on the upper right -- each column will show its type (Factor w/ x levels vs. chr vs. num).
### Conclusion
In this tutorial, you've learned how to install R, work with R-Studio, create R-Markdown files and get started writing your own R code. You've seen how to work with variables to give labels to code output and use it later. We've introduce the basic data types: numeric, integer, character, factor & vector. You've installed your first package and learned to call it with `library()` and read in data files from your computer to look at and work with it.
Next time, we'll get into more details of tidyverse and how to work with data frames: rename columns, reorder the entries in a column, select only entries that meet a condition, and more. This will allow you to work with your data frame easily and effectively.
### Reading
As a follow-up to this tutorial, please read chapters 3.3 - 3.9 in Navarro, Learning Statistics with R (p. 46 - 66).