-
Notifications
You must be signed in to change notification settings - Fork 29
/
Copy pathwalkthrough.qmd
275 lines (213 loc) · 11.3 KB
/
walkthrough.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
---
execute:
freeze: auto
---
# A walkthrough to get started {#walkthrough}
```{r, message = FALSE, warning = FALSE, echo = FALSE}
knitr::opts_knit$set(root.dir = fs::dir_create(tempfile()))
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = TRUE)
options(crayon.enabled = FALSE)
```
```{r, include = FALSE, eval = TRUE}
library(fs)
library(targets)
library(tarchetypes)
library(tidyverse)
library(withr)
```
```{r, include = FALSE, eval = TRUE}
dir_create("R")
write_csv(airquality, "data.csv")
withr::with_dir("R", {
tar_script({
get_data <- function(file) {
read_csv(file, col_types = cols()) %>%
filter(!is.na(Ozone))
}
fit_model <- function(data) {
lm(Ozone ~ Temp, data) %>%
coefficients()
}
plot_model <- function(model, data) {
ggplot(data) +
geom_point(aes(x = Temp, y = Ozone)) +
geom_abline(intercept = model[1], slope = model[2])
}
})
file_move("_targets.R", "functions.R")
})
tar_script({
library(targets)
library(tarchetypes)
tar_source()
options(crayon.enabled = FALSE)
options(tidyverse.quiet = TRUE)
tar_option_set(packages = c("readr", "dplyr", "ggplot2"))
list(
tar_target(file, "data.csv", format = "file"),
tar_target(data, get_data(file)),
tar_target(model, fit_model(data)),
tar_target(plot, plot_model(model, data))
)
})
```
This chapter walks through a short example of a [`targets`](https://github.com/ropensci/targets)-powered data analysis project. The source code is available at <https://github.com/wlandau/targets-four-minutes>, and you can visit <https://rstudio.cloud/project/3946303> to try out the code in a web browser (no download or installation required). The [documentation website](https://docs.ropensci.org/targets/index.html#examples) links to other examples. The contents of the chapter are also explained in a four-minute video tutorial:
<center>
<iframe src="https://player.vimeo.com/video/700982360?h=38c890bd4f" width="640" height="360" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen></iframe>
</center>
## About this example
The goal of this short analysis is to assess the relationship among ozone and temperature in base R’s `airquality` dataset. We track a data file, prepare a dataset, fit a model, and plot the model against the data.
## File structure
The file structure of the project looks like this.
```{r, eval = FALSE}
├── _targets.R
├── data.csv
├── R/
│ ├── functions.R
```
`data.csv` contains the data we want to analyze.
```
Ozone,Solar.R,Wind,Temp,Month,Day
36,118,8.0,72,5,2
12,149,12.6,74,5,3
...
```
`R/functions.R` contains our custom user-defined functions. (See the [functions chapter](#functions) for a discussion of function-oriented workflows.)
```{r, eval = FALSE}
# R/functions.R
get_data <- function(file) {
read_csv(file, col_types = cols()) %>%
filter(!is.na(Ozone))
}
fit_model <- function(data) {
lm(Ozone ~ Temp, data) %>%
coefficients()
}
plot_model <- function(model, data) {
ggplot(data) +
geom_point(aes(x = Temp, y = Ozone)) +
geom_abline(intercept = model[1], slope = model[2])
}
```
## Target script file
Whereas files `data.csv` and `functions.R` are typical user-defined components of a [project-oriented workflow](https://rstats.wtf/projects), the target script file `_targets.R` file is special. Every [`targets`](https://github.com/ropensci/targets) workflow needs a [target script file](https://docs.ropensci.org/targets/reference/tar_script.html) to configure and define the pipeline.^[By default, the target script is a file called `_targets.R` in the project's [root directory](https://martinctc.github.io/blog/rstudio-projects-and-working-directories-a-beginner%27s-guide/). However, you can set the target script file path to something other than `_targets.R`. You can either set the path persistently for your project using [`tar_config_set()`](https://docs.ropensci.org/targets/reference/tar_config_set.html), or you can set it temporarily for an individual function call using the `script` argument of [`tar_make()`](https://docs.ropensci.org/targets/reference/tar_make.html) and related functions.] The [`use_targets()`](https://docs.ropensci.org/targets/reference/use_targets.html) function in `targets` version >= 0.12.0 creates an initial target script with comments to help you fill it in. Ours looks like this:
```{r, eval = FALSE}
# _targets.R file
library(targets)
library(tarchetypes)
tar_source()
tar_option_set(packages = c("readr", "dplyr", "ggplot2"))
list(
tar_target(file, "data.csv", format = "file"),
tar_target(data, get_data(file)),
tar_target(model, fit_model(data)),
tar_target(plot, plot_model(model, data))
)
```
All target script files have these requirements.
1. Load the packages needed to define the pipeline, e.g. [`targets`](https://github.com/ropensci/targets) itself.^[target scripts created with `tar_script()` automatically insert a `library(targets)` line at the top by default.]
1. Use `tar_option_set()` to declare the packages that the targets themselves need, as well as other settings such as the default [storage format](https://docs.ropensci.org/targets/reference/tar_target.html#storage-formats).
1. Load your custom functions and small input objects into the R session: in our case, with `source("R/functions.R")`.
1. Write the pipeline at the bottom of `_targets.R`. A pipeline is a list of target objects, which you can create with `tar_target()`. Each target is a step of the analysis. It looks and feels like a variable in R, but during `tar_make()`, it will reproducibly store a value in `_targets/objects/`.
:::{.callout-tip}
## Start small
Even if you plan to create a large-scale heavy-duty pipeline with hundreds of time-consuming targets, it is best to start small. First create a version of the pipeline with a small number of quick-to-run targets, follow the sections below to inspect and test it, and then scale up to the full-sized pipeline after you are sure everything is working.
:::
## Inspect the pipeline
Before you run the pipeline for real, it is best to check for obvious errors. [`tar_manifest()`](https://docs.ropensci.org/targets/reference/tar_manifest.html) lists verbose information about each target.
```{r, eval = TRUE}
tar_manifest(fields = all_of("command"))
```
[`tar_visnetwork()`](https://docs.ropensci.org/targets/reference/tar_visnetwork.html) displays the dependency graph of the pipeline, showing a natural left-to-right flow of work. It is good practice to make sure the graph has the correct nodes connected with the correct edges. Read more about dependencies and the graph in the [dependencies section](https://books.ropensci.org/targets/targets.html#dependencies) of a [later chapter](#targets).
```{r, eval = TRUE}
tar_visnetwork()
```
## Run the pipeline
`tar_make()` runs the pipeline. It creates a reproducible new external R process which then reads the target script and runs the correct targets in the correct order.^[In `targets` version 0.3.1.9000 and above, you can set the path of the local data store to something other than `_targets/`. A project-level `_targets.yaml` file keeps track of the path. Functions [`tar_config_set()`](https://docs.ropensci.org/targets/reference/tar_config_set.html) and [`tar_config_get()`](https://docs.ropensci.org/targets/reference/tar_config_get.html) can help.]
```{r, eval = TRUE}
tar_make()
```
The output of the pipeline is saved to the `_targets/` data store, and you can read the output with `tar_read()` (see also `tar_load()`).
```{r}
tar_read(plot)
```
The next time you run `tar_make()`, `targets` skips everything that is already up to date, which saves a lot of time in large projects with long runtimes.
```{r, eval = TRUE}
tar_make()
```
You can use `tar_visnetwork()` and `tar_outdated()` to check ahead of time which targets are up to date.
```{r, eval = TRUE}
tar_visnetwork()
```
```{r, eval = TRUE}
tar_outdated()
```
## Changes
The `targets` package notices when you make changes to code and data, and those changes affect which targets rerun and which targets are skipped.^[Internally, special rules called "cues" decide whether a target reruns. The [`tar_cue()`](https://docs.ropensci.org/targets/reference/tar_cue.html) function lets you suppress some of these cues, and the [`tarchetypes`](https://docs.ropensci.org/tarchetypes/) package supports nuanced [cue factories](https://docs.ropensci.org/tarchetypes/reference/index.html#section-cues) and [target factories](https://docs.ropensci.org/tarchetypes/reference/index.html#section-targets-with-custom-invalidation-rules) to further customize target invalidation behavior. The [`tar_cue()`](https://docs.ropensci.org/targets/reference/tar_cue.html) function documentation [explains cues in detail](https://docs.ropensci.org/targets/reference/tar_cue.html#target-invalidation-rules), as well as [specifics on how `targets` detects changes to upstream dependencies](https://docs.ropensci.org/targets/reference/tar_cue.html#dependency-based-invalidation-and-user-defined-functions).]
### Change code
If you change one of your functions, the targets that depend on it will no longer be up to date, and `tar_make()` will rebuild them. For example, let's increase the font size of the plot.
```{r, include = FALSE, eval = TRUE}
withr::with_dir("R", {
tar_script({
get_data <- function(file) {
read_csv(file, col_types = cols()) %>%
filter(!is.na(Ozone))
}
fit_model <- function(data) {
lm(Ozone ~ Temp, data) %>%
coefficients()
}
plot_model <- function(model, data) {
ggplot(data) +
geom_point(aes(x = Temp, y = Ozone)) +
geom_abline(intercept = model[1], slope = model[2]) +
theme_gray(24)
}
})
file_move("_targets.R", "functions.R")
})
```
```{r, eval = FALSE}
# Edit functions.R...
plot_model <- function(model, data) {
ggplot(data) +
geom_point(aes(x = Temp, y = Ozone)) +
geom_abline(intercept = model[1], slope = model[2]) +
theme_gray(24) # Increased the font size.
}
```
`targets` detects the change. `plot` is "outdated" (i.e. invalidated) and the others are still up to date.
```{r, eval = TRUE}
tar_visnetwork()
```
```{r, eval = TRUE}
tar_outdated()
```
Thus, `tar_make()` reruns `plot` and nothing else.^[We would see similar behavior if we changed the R expressions in any `tar_target()` calls in the target script file.]
```{r, eval = TRUE}
tar_make()
```
Sure enough, we have a new plot.
```{r}
tar_read(plot)
```
### Change data
If we change the data file `data.csv`, `targets` notices the change. This is because `file` is a file target (i.e. with `format = "file"` in `tar_target()`), and the return value from last `tar_make()` identified `"data.csv"` as the file to be tracked for changes. Let's try it out. Below, let's use only the first 100 rows of the `airquality` dataset.
```{r, eval = TRUE}
write_csv(head(airquality, n = 100), "data.csv")
```
Sure enough, `raw_data_file` and everything downstream is out of date, so all our targets are outdated.
```{r, eval = TRUE}
tar_visnetwork()
```
```{r, eval = TRUE}
tar_outdated()
```
```{r, eval = TRUE}
tar_make()
```
## Read metadata
:::{.callout-tip}
## Performance
See the [performance chapter](#performance) for options, settings, and other choices to make the pipeline more efficient. This chapter also has guidance for [monitoring the progress of a running pipeline](https://books.ropensci.org/targets/performance.html#monitoring-the-pipeline).
:::