---
title: "001 - What is a database?"
output:
  html_notebook:
    highlight: tango
    number_sections: yes
    theme: readable
    toc: yes
    toc_float: yes
    code_folding: none
    includes:
      before_body: assets/header.html
      after_body: assets/footer.html
---

# What is a DB?

A database is a rectangular storage system: columns and rows of data, just like a spreadsheet. Unlike a spreadsheet, you read and write data in a DB through code or the command line, not through a point-and-click interface like Google Sheets or Excel.

I'll mostly be using "database" to mean a relational database management system, like PostgreSQL or SQLite, but there are other paradigms.

The code you use to interact with the DB is called "SQL" (Structured Query Language), but you don't *have* to know SQL. In the code examples I'll show, there will be a "vanilla SQL" version and a "tidy R" version.
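
To give you a feel for the two flavors before we get there, here's a minimal sketch (not run; `con` is a hypothetical database connection and `yields` is a table we'll build later):

```{r eval = FALSE}
# Vanilla SQL: a query string sent to the database through a connection (`con`)
DBI::dbGetQuery(con, "SELECT * FROM yields WHERE Mg_ha > 5")

# Tidy R: the same question asked of a local data frame
library(dplyr, warn.conflicts = FALSE)
yields %>% filter(Mg_ha > 5)
```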

## Where is a DB?

A database consists of a server and one or more clients. "Server" and "client" here just mean computer applications; the server can be a webserver living out in the cloud, or it can be your own laptop. The client can be a command line terminal, an R session, a SAS session, a webpage running JavaScript, etc.

I'll demonstrate later on how to use a local DB, where the server and the client are both on your computer. Then I'll discuss the PSA On-Farm database, which is hosted on Azure and can be accessed from anywhere with an internet connection.
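
As a quick preview of the local case, here's a minimal sketch using the DBI and RSQLite packages (assuming both are installed; the file name is just an example). SQLite is unusual in that the "server" is simply a file on disk:

```{r eval = FALSE}
library(DBI)

# Open a connection to a local SQLite database;
# the file is created if it doesn't exist yet
con <- dbConnect(RSQLite::SQLite(), "toy_database.sqlite")

# List the tables the database contains (none yet, for a fresh file)
dbListTables(con)

# Always close the connection when you're done
dbDisconnect(con)
```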

## What's inside?

Inside a database there are one or more tables, which are analogous to individual tabs in a regular spreadsheet. What makes these tables different is that:

* Each column only contains one type of data (numbers, text, dates)
* The type of each column must be decided before you put any data in it
* There is nothing outside the "rectangle" (no plots, no notes in the margins, no comment boxes floating on top)

There are additional benefits to using a structured relational database:

* Each table has a "key" that connects an observation between tables
* Keys can map one row to another row, many rows to one row, or vice versa
* Planning a structure (schema) helps you decide how to organize your data

Are there downsides to using a database?

* You have to learn how to use them, by reading this tutorial
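
To make the typed-column and key ideas concrete, here's a minimal sketch of how the two example tables from the next section might be declared in vanilla SQL (via DBI/RSQLite; the file name is illustrative, and the exact type names vary a bit between database systems):

```{r eval = FALSE}
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "toy_database.sqlite")

# Every column's type is declared up front, and plot_id is the key
# that other tables can point back to
dbExecute(con, "
  CREATE TABLE treatments (
    plot_id INTEGER PRIMARY KEY,
    crop    TEXT,
    tillage TEXT
  )
")

# yields refers back to treatments through the shared plot_id key
dbExecute(con, "
  CREATE TABLE yields (
    plot_id INTEGER REFERENCES treatments (plot_id),
    Mg_ha   REAL
  )
")

dbDisconnect(con)
```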

## Example

Let's look at a toy database with two tables:

```{r echo = FALSE}
# Treatments assigned to each plot
treatments <- data.frame(
  plot_id = c(101, 102, 103, 104, 201, 202, 203, 204),
  crop = c("corn", "corn", "soy", "soy", "corn", "soy", "corn", "soy"),
  tillage = c("no-till", "strip-till", "strip-till", "no-till", "strip-till", "no-till", "no-till", "strip-till"),
  stringsAsFactors = FALSE
)

# Yields observed on the same plots
yields <- data.frame(
  plot_id = c(101, 102, 103, 104, 201, 202, 203, 204),
  Mg_ha = c(10, 12, 4.5, 4.5, 11, 3, 9.5, 5),
  stringsAsFactors = FALSE
)
```

```{r paged.print = F}
treatments
yields
```
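
These are plain R data frames so far. If you wanted them to live in an actual database, a minimal sketch with DBI/RSQLite might look like this (the file name is just an example; `dbWriteTable` creates each table from the data frame you hand it):

```{r eval = FALSE}
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "toy_database.sqlite")

# Copy the two data frames into the database as tables
dbWriteTable(con, "treatments", treatments)
dbWriteTable(con, "yields", yields)

# Confirm both tables are now in the database
dbListTables(con)

dbDisconnect(con)
```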

If you were making a spreadsheet of yields, you would probably want all of this in one table, so you could associate each observation with the treatments. But I'm sure you've noticed that copy-pasting columns of treatments inevitably ends up with some errors that you have to go back and clean up. Then when you want to add another observation type, like cover crop biomass or ground cover or weed count, you have to do it all over again.

With a database, you instead record the minimally identifiable information, like `plot_id` and the observation. Then once all your data is recorded and stored and cleaned, you can **join** tables together for your analyses. Note that here we're using `plot_id` as our **key**; that's how we know which row from one table is connected to a row from another table.

This sort of structure is especially helpful where information from one table would need to be repeated many times. Imagine an experiment with 3 or 4 columns of treatment variables, and then a table of sensor readings taken every 15 minutes for a year (~35,000 rows per plot!). You'd have a lot of duplicated information, and any mistakes (misspellings, miscoded treatments, rows out of order) would break the whole thing.

## What do I do with it though?

In general, databases are for **storage** and **organization**, and you use other tools like R or SAS for **analysis**. Once you have the two tables, you join them either in the DB (via SQL) or locally (using `merge` or `dplyr::full_join`):

```{r paged.print = F}
yield_data <- merge(treatments, yields)
yield_data
```
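
That was the local version. Here's a sketch of the same join done inside the DB via vanilla SQL, assuming the two tables have been loaded into a database reachable through a DBI connection `con` (as in the earlier sketches):

```{r eval = FALSE}
library(DBI)

# The database performs the join and sends back one combined table;
# USING (plot_id) matches rows on our key column
yield_data <- dbGetQuery(con, "
  SELECT *
  FROM treatments
  JOIN yields USING (plot_id)
")
```
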
And then analysis is straightforward, using your normal tools.

```{r paged.print = F}
# Two-way ANOVA of yield by crop and tillage
model <- lm(Mg_ha ~ crop * tillage, data = yield_data)
anova(model)

# Mean yield for each crop x tillage combination
aggregate(
  Mg_ha ~ crop * tillage,
  data = yield_data,
  FUN = mean
)
```

Or if you prefer tidyverse-flavored R:

```{r paged.print = F}
library(dplyr, warn.conflicts = FALSE)

# Mean and SD of yield for each crop x tillage combination
yield_data %>%
  group_by(crop, tillage) %>%
  summarise_at(
    vars(Mg_ha),
    list(~mean(.), ~sd(.))
  )
```