-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy path1.1-prerequisites.Rmd
748 lines (497 loc) · 66.3 KB
/
1.1-prerequisites.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
# Basics of statistics {#basics}
This chapter introduces some important terms useful for doing data analyses. We introduce the Bayesian approach of data analyses. We also introduce the essentials of the classical frequentist tests (e.g. t-tests), which can be seen as an alternative to the Bayesian approach. Even though we will not use null hypotheses tests later [@Amrhein.2019], we introduce them here because we need to understand the scientific literature. For each classical test treated, we provide a suggestion how to present the statistical results without using null hypothesis tests. We further discuss some differences between the Bayesian and frequentist approach.
## Variables and observations
Empirical research involves data collection. Data are collected by recording measurements of variables for observational units. An observational unit may be, for example, an individual, a plot, a time interval or a combination of those. The collection of all units ideally is a random sample of the entire population of units we are interested in. The measurements (or observations) of the random sample are stored in a data table. A data table is a collection of variables (columns). Data tables normally are handled as objects of class `data.frame` in R. All measurements on a row in a data table belong to the same observational unit. The variables can be of different scales (Table \@ref(tab:scalemeasurement)).
<br>
Table: (\#tab:scalemeasurement) Scales of measurements
Scale | Examples | Properties | Coding in R |
:-------|:------------------|:------------------|:--------------------|
Nominal | Sex, genotype, habitat | Identity (values have a unique meaning) | `factor()` |
Ordinal | Elevational zones | Identity and order (values have an ordered relationship) | `ordered()` |
Numeric | Discrete: counts; continuous: body weight, wing length | Identity, order, and interval | `intgeger()` `numeric()` |
<br>
Nominal and ordinal variables may also be called "categorical" variables.
The aim of many studies is to describe how a variable of interest ($y$; e.g. the time to build a nest) is related to one or more predictor variables ($x$; e.g. the sex of the bird, its age class, and the number of individuals in the colony - representing a nominal, ordinal and numeric predictor). Depending on the author, the y-variable is called "outcome variable", "response" or "dependent variable". The x-variables are called "predictors", "explanatory variables" or "independent variables". We avoid the terms "dependent" and "independent" variables because often we do not know whether the variable $y$ is in fact depending on the $x$ variables, and often the x-variables are not independent of each other. In this book, we try to use "outcome" and "predictor" variables because these terms sound most neutral to us in that they refer to how the statistical model is constructed rather than to an assumed real relationship.
"Predictors" are often called a "covariable" if they are numeric (e.g. the colony size), and "factor" if they are nominal or ordinal (e.g. sex and age class). The characteristic of a factor is that it has defined values, called levels (in our example, the factor "sex" has the levels "female" and "male", the factor "age class" has the levels "juvenile", "immature" and "adult").
## Displaying and summarizing data
### Histogram
While nominal and ordinal variables can be summarized by giving the absolute number or the proportion of observations for each level (e.g number of females and number of males), numeric variables normally are summarized by a location and a scatter statistic, such as the mean and the standard deviation, or the median and some quantiles (see below). Hence, the location tells us around what value our observations lay and it is sometimes called the "measure of central tendency". The distribution of a numeric variable is graphically displayed in a histogram (Fig. \@ref(fig:histogram)).
```{r histogram, echo=FALSE, fig.width=4, fig.height=3,fig.cap='Histogram of the length of the forearm of statistics course participants.'}
load("RData/datacourse.RData")
par(mar=c(4,4,1,1))
hist(dat$ell, las=1, xlab="Lenght of forearm [cm]", ylab="Number of students", main=NA,
col="tomato")
box()
```
:::: {.greenbox data-latex=""}
::: {.center data-latex=""}
Draw a histogram
:::
To draw a histogram, the variable is displayed on the x-axis and the observed values are assigned to classes. We use the function <span class="codeInBox">hist</span>. Remember to call the helpfile, if you forgot how a function works and what arguments it has; for that, type <span class="codeInBox">?hist</span> in the R console. There, we see that the edges of the classes can be set with the argument <span class="codeInBox">breaks=</span>. The values given in the <span class="codeInBox">breaks=</span> argument must at least span the values of the variable. If the argument <span class="codeInBox">breaks=</span> is not specified, R searches for break-values that make the histogram look smooth. The number of observations falling in each class is given on the y-axis. The y-axis can be re-scaled so that the area of the histogram equals 1 by setting the argument <span class="codeInBox">density=TRUE</span>. In that case, the values on the y-axis correspond to the density values of a probability distribution (chapter \@ref(distributions)). You can also save the result of the hist-function into an object, e.g. <span class="codeInBox">t.hist <- hist(dat$ell)</span>, possibly with the argument <span class="codeInBox">plot=F (F for FALSE)</span>. Using <span class="codeInBox">t.hist</span>, you may then fully customize your histogram (e.g. overlay two histograms with slightly shifted columns).
::::
### Location and scatter
Typical location statistics are mean, median or mode.
There are different types of means, e.g.:
- Arithmetic mean: $\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{1}^{n}x_i$
(R function `mean`), where $n$ is the sample size. The parameter $\mu$ is the (unknown) true mean of the entire population of which the $n$ measurements $x_i$ are a random sample of. $\bar{x}$ is called the sample mean and it is used as an estimate for $\mu$. The $^$ (the "hat") above any parameter indicates that the parameter value is obtained from a sample and, therefore, it may be different from the true value; it is an estimate of the true value.
- Geometric mean: $\hat{\mu}_{geom} = \bar{x}_{geom} = \sqrt{\prod_{1}^{n}x_i}$
(no R function in the base package, but you may use: `exp(mean(log(x)))`)
The median is the 50% quantile: 50% of the measurements are below (and, hence 50% above) the median. If $x_1,..., x_n$ are the ordered measurements of a variable, then the median is:
- $\begin{aligned}
& Median =
\begin{cases}
x_{(n+1)/2} & \quad \text{if } n \text{ is odd}\\
\frac{1}{2}(x_{n/2} + x_{n/2+1}) & \quad \text{if } n \text{ is even}
\end{cases}
\end{aligned}$
(R function `median`)
The mode is the value that is occurring with highest frequency or that has the highest density.
Scatter is also called spread, scale or variance. It describes how far away from the location parameter single observations can be found, or how the measurements are scattered around their mean. The variance is defined as the average squared difference between the observations and the mean:
- Variance $\hat{\sigma^2} = s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$
(R function `var`). The term $(n-1)$ is called the degrees of freedom. It is used in the denominator of the variance formula instead of $n$ to prevent underestimating the (true) variance - remember that the $x_i$ are a sample of the population we are interested in, and $\hat{\sigma^2}$ is used as an estimate of the true variance $\sigma^2$ of this population. Because $\bar{x}$ is in average closer to $x_i$ than the unknown true mean $\mu$ would be, the variance would be underestimated if $n$ is used in the denominator.
<!-- <font size="1"> The maximum likelihood estimate of the variance corresponds to the variance formula using $n$ instead of $n-1$ in the denominator, see, e.g., @Royle.2008b.</font> -->
The variance is used to compare the degree of scatter among different groups. However, its values are difficult to interpret because of the squared unit. Therefore, the square root of the variance, the standard deviation is normally reported:
- Standard deviation $\hat{\sigma} = s = \sqrt{s^2}$
(R Function `sd`). The standard deviation is approximately the average deviation of an observation from the sample mean. In the case of a [normal distribution](#normdist), about two thirds (68%) of the data are within one standard deviation around the mean, and 95% within two standard deviations (more precisely within 1.96 SD).
Sometimes, the inverse of the variance is used, called precision:
- Precision $\hat{p} = \frac{1}{\sigma^2}$
The variance and standard deviation each describe the scatter with a single value. Thus, we have to assume that the observations are scattered symmetrically around their mean in order to get a picture of the distribution of the measurements. When the measurements are spread asymmetrically (skewed distribution), then it may be more useful to describe the scatter with more than one value. Such statistics can be quantiles from the lower and upper tail of the data.
Quantiles inform us about both location and spread of a distribution. The $p$th-quantile is the value with the property that the proportion $p$ of all values are less than or equal to the value of the quantile. The median is the 50% quantile. The 25% quantile and the 75% quantiles are also called the lower and upper quartiles, respectively. The range between the 25% and the 75% quartiles is called the interquartile range. This range includes 50% of the (central part of the) distribution and is also a measure of scatter. The R function `quantile` extracts sample quantiles. The median, the quartiles, and the interquartile range can be graphically displayed using box-and-whisker plots (boxplots in short, R function `boxplot`). The horizontal fat bar is the median (Fig. \@ref(fig:boxplot)). The box marks the interquartile range. The whiskers reach out to the last observation within 1.5 times the interquartile range from the quartile. Circles mark observations beyond the whiskers.
```{r boxplot, echo=F, fig.width=4, fig.height=3, fig.cap="Boxplot of the length of forearm of statistics course participants who own car or not."}
par(mar=c(4,4,1,1))
boxplot(ell~car, data=dat, las=1, ylab="Lenght of forearm [cm]",
col="tomato", xaxt="n", xlab="", boxwex=0.2)
axis(1, at=c(1,2), labels=c("Not owing a car", "Car owner"))
n <- table(dat$car)
axis(1, at=c(1,2), labels=paste("n=", n, sep=""), mgp=c(3,2, 0))
```
The boxplot is an appealing tool for comparing location, variance and distribution of measurements among groups.
A value that is far away from the distribution - such that it seems not to belong to the distribution - is called an outlier. It is not always clear whether an observation at the edge of a distribution should be considered an outlier or not, and how to deal with it. A first step is to check whether it could be due to a data entry error. Then, analyses may be done without and with the outlier(s) to see whether that has an effect on our conclusions. If you omit outliers, you must report and justify this. Note that median and interquartile range are less affected by outliers than mean and standard deviation.
To summarize: For symmetric distributions, mean and standard deviation (SD) area meaningful statistics to describe the distribution (Fig. \@ref(fig:locationScatter), left). Mean and median are identical. The more skewed a distribution is, the more different are mean and median. Mean $\pm$1 SD becomes flawed as it suggests symmetry that does not exist, and its end may even go beyond the distribution (e.g. below the minimum in Fig. \@ref(fig:locationScatter), right). For skewed distributions, median and interquartile range are more informative.
```{r locationScatter, echo=F, fig.width=6, fig.height=3, fig.cap="A symmetric and a skewed distribution with the corresponding mean (dot, with $\\pm1$ SD) and median (triangle, with interquartile range)."}
par(mfrow=c(1,2),mar=c(1,1,3,1))
set.seed(8)
d.sym <- rnorm(500)
d.asym <- rgamma(500,0.5)
hist(d.sym, las=1, ylab="", col="tomato", xaxt="n", xlab="", main="", yaxt="n")
t.y <- par("usr")[4] + diff(par("usr")[c(3,4)])*c(0.03,0.08) # find y-position for mean and median
segments(mean(d.sym)-sd(d.sym),t.y[1],mean(d.sym)+sd(d.sym),t.y[1], xpd=NA, lwd=2) # xpd=NA allows plotting in the margins
points(mean(d.sym),t.y[1],pch=16,xpd=NA)
segments(quantile(d.sym,0.25),t.y[2],quantile(d.sym,0.75),t.y[2], xpd=NA, lwd=2)
points(median(d.sym),t.y[2],pch=17,xpd=NA)
hist(d.asym, las=1, ylab="", col="gold", xaxt="n", xlab="", main="", yaxt="n")
t.y <- par("usr")[4] + diff(par("usr")[c(3,4)])*c(0.03,0.08) # find y-position for mean and median
segments(mean(d.asym)-sd(d.asym),t.y[1],mean(d.asym)+sd(d.asym),t.y[1], xpd=NA, lwd=2) # xpd=NA allows plotting in the margins
points(mean(d.asym),t.y[1],pch=16,xpd=NA)
segments(quantile(d.asym,0.25),t.y[2],quantile(d.asym,0.75),t.y[2], xpd=NA, lwd=2)
points(median(d.asym),t.y[2],pch=17,xpd=NA)
```
### Correlations
A correlation measures the strength with which characteristics of two variables are associated with each other (co-occur). If both variables are numeric, we can visualize the correlation using a scatterplot.
```{r scatterplot, echo=F, fig.width=4, fig.height=3, fig.cap="Scatterplot of the length of forewarm and the comfort temperature of statistics course participants."}
par(mar=c(4,4,1,1))
plot(temp~ell, data=dat, las=1, xlab="Lenght of forearm [cm]",
ylab="Comfort temperature [°C]", pch=16)
```
The covariance between variable $x$ and $y$ is defined as:
- Covariance $q = \frac{1}{n-1}\sum_{i=1}^{n}((x_i-\bar{x})*(y_i-\bar{y}))$
(R function `cov`), with $\bar{x}$ and $\bar{y}$ being the means of $x$ and $y$. As for the variance, the units of the covariance are sqared and, therefore, covariance values are difficult to interpret. A standardized covariance is the Pearson correlation coefficient:
- Pearson correlation coefficient: $r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}$
(R function `cor`)
Mean, variance, standard deviation, covariance and correlation are sensitive to outliers. Single observations with extreme values normally have a disproportionately high influence on these statistics. When outliers are present in the data, we may prefer a more robust correlation measure such as the Spearman correlation or Kendall’s tau. Both are based on the ranks of the measurements instead of the measurements themselves. The rank is simply the rank of each observation when these are sorted. The function `rank` ranks values, with different methods to deal with ties (see the helpfile using `?rank`).
- Spearman correlation coefficient: correlation between rank(x) and rank(y), i.e. $r_s=\frac{\sum_{i=1}^{n}(R_{x_i}-\bar{R_x})(R_{y_i}-\bar{R_y})}{\sqrt{\sum_{i=1}^{n}(R_{x_i}-\bar{R_x})^2\sum_{i=1}^{n}(R_{y_i}-\bar{R_y})^2}}$
(R function `cor(x,y, method="spearman")`), with $R_x$ being the rank of observation $x$ among all $x$ observations, and $\bar{R_x}$ the mean of the ranks of $x$.
- Kendall's tau: $\tau = 1-\frac{4I}{(n(n-1))}$, where $I$ = number of $(i,k)$ for which $(x_i < x_k)$ & $(y_i > y_k)$ or vice versa, and $(i,k)$ has all pairs of observations with $k>i$.
(R function `cor(x,y, method="kendall")`). $I$ counts the number of observation pairs that are "disconcordant", i.e. where the ranking of the measurements of the two data points is not the same for the both variables. The more such discordant pairs, the less the correlation.
Figure \@ref(fig:correlations) illustrates that the correlation $q$ is larger with more data points (B vs. A) and steeper relationships (C vs. D). Pearson correlation $r$, Spearman correlation $r_s$ and Kendall's $\tau$ are always 1 when dots are in line (A-D), or -1 when the line descends (D), in which case $q$ is negative, too. The outlier in E has a strong effect on $r$, but much less on $r_s$ and $\tau$. F shows a more real world example. $q$ can have any value from $-\infty$ to $\infty$, while $r$, $r_s$ and $\tau$ are bound between -1 and 1. If the dots are completely horizontal (not shown), $q$ is 0 and the other three measures of correlation are $NA$, i.e. not defined.
```{r correlations, echo=F, fig.width=7, fig.height=4, fig.cap="Covariance, Pearson correlation, Spearman correlation, and Kendall's tau for different patterns of x-y-data."}
cplot <- function(x,y) {
plot(t.x, t.y, xlab="", ylab="", xaxt="n", yaxt="n", pch=16, xlim=c(0,10),ylim=c(0,10))
legend("topleft", bty="n", legend=c(paste("q =",round(cov(t.x,t.y),3)),
paste("r =",round(cor(t.x,t.y),3)),
paste("r_s =",round(cor(t.x,t.y,method="spearman"),3)),
paste("tau =",round(cor(t.x,t.y,method="kendall"),3))),
cex=1.2)
}
par(mfrow=c(2,3),mar=c(1,1,1,1))
t.x <- 2:6; t.y <- 2:6
cplot(t.x); legend("topright",legend="A",bty="n", text.font=2, cex=1.2)
t.x <- 1:8; t.y <- 1:8
cplot(t.x); legend("topright",legend="B",bty="n", text.font=2, cex=1.2)
t.x <- 2:6; t.y <- t.x/2
cplot(t.x); legend("topright",legend="C",bty="n", text.font=2, cex=1.2)
t.x <- 2:6; t.y <- 5-t.x/2
cplot(t.x); legend("topright",legend="D",bty="n", text.font=2, cex=1.2)
t.x <- 2:8; t.y <- c(c(1.4,1.2,1.3,1.0,1.33,1.22),10)
cplot(t.x); legend("topright",legend="E",bty="n", text.font=2, cex=1.2)
set.seed(8)
t.x <- runif(20,2,10); t.y <- rnorm(length(t.x),0.5*t.x,0.5)
cplot(t.x); legend("topright",legend="F",bty="n", text.font=2, cex=1.2)
# t.x <- 2:6; t.y <- rep(1,length(t.x))
# cplot(t.x) # not plotted, but q = 0 and the other 3 are NA
```
For nominal variables, covariance and the above presented correlations cannot be calculated and would not be meaningful. But nominal variables may also be correlated or not - generally, we talk of "balanced" data if they are uncorrelated and "unbalanced" data otherwise. Measures of two nominal variables can be aggregated in a crosstabulation, e.g. the number of birds observed per combination of sex (male or female) and age (juvenile, immature or adult). In this table, there are 6 cells, and if there is the same number of males and females for each age class (and, thereby, also the same number per age class for both sexes), the data is balanced, i.e. sex and age are uncorrelated. The more the 6 numbers deviate from this, the more unbalanced the data is regarding sex and age. The $\chi^2$-value is a measure of this balancedness, with a value of 0 for perfect balanced data, and increasingly larger values for less balanced data (also dependent on the size of the crosstabulation).
### Principal components analyses PCA
The principal components analysis (PCA) is a multivariate correlation analysis. A multidimensional data set with $k$ variables can be seen as a cloud of points (observations) in a $k$-dimensional space. Imagine, we could move around in the space and look at the cloud from different locations. From some locations, the data looks highly correlated, whereas from others, we cannot see the correlation. That is what PCA is doing. It is rotating the coordinate system (defined by the original variables) of the data cloud so that the correlations are no longer visible. The axes of the new coordinates system are linear combinations of the original variables. They are called principal components. There are as many principal components as there are original variables, i.e. $k$, $p_1, ..., p_k$. The principal components meet further requirements:
- the first component explains most variance
- the second component explains most of the remaining variance and is perpendicular (= uncorrelated) to the first one
- third component explains most of the remaining variance and is perpendicular to the first two
- and so on
For example, in a two-dimensional data set $(x_1, x_2)$ the principal components become
$pc_{1i} = b_{11}x_{1i} + b_{12}x_{2i}$
$pc_{2i} = b_{21}x_{1i} + b_{22}x_{2i}$
with $b_{jk}$ being loadings of original variable $k$ on principal component $j$. Fig. \@ref(fig:principal) shows the two principal components PC1 and PC2 for a two-dimensional data set. They can be calculated using linear algebra: Principal components are eigenvectors of the covariance or correlation matrix. Note how the data spreads maximally along PC1.
```{r principal, echo=FALSE, fig.width=4, fig.height=3, fig.cap='Principal components of a two dimensional data set based on the covariance matrix (green) and the correlation matrix (brown).'}
set.seed(2353)
x1 <- rnorm(100, 5, 1)
x2 <- rnorm(100, 0.8*x1, 1)
par(mar=c(4,4,0.3,0.3))
plot(x1-mean(x1),x2-mean(x2), asp=1, pch=16, col="blue", xlab="x_1", ylab="x_2", cex=0.8)
# PCA based on covariance matrix
pc1 <- eigen(cov(cbind(x1,x2)))$vectors[,1]
pc2 <- eigen(cov(cbind(x1,x2)))$vectors[,2]
abline(0, pc1[2]/pc1[1], col="green", lwd=2)
abline(0, pc2[2]/pc2[1], col="green", lwd=2)
# PCA based on correlation matrix
pc1 <- eigen(cor(cbind(x1,x2)))$vectors[,1]
pc2 <- eigen(cor(cbind(x1,x2)))$vectors[,2]
abline(0, pc1[2]/pc1[1], col="brown", lwd=2)
abline(0, pc2[2]/pc2[1], col="brown", lwd=2)
# label PC1 and PC2:
text(2.2,2.65,"PC1")
text(-2.9,2.2,"PC2")
```
The choice between correlation or covariance matrix is important. The covariance matrix is an unstandardized correlation matrix and, therefore, the units, e.g. whether centimeters or meters are used, influence the result of the PCA. When the PCA is based on the covariance matrix, the result will change when we change the unit of a variable, e.g. from centimeters to meters. A PCA based on the covariance matrix only makes sense when the variances are comparable among the variables, i.e. when all variables are measured in the same unit, and when we want to weight each variable according to its variance. Usually, this is not the case and, therefore, the PCA must be based on the correlation matrix. Note that this is not the default setting in the most commonly used function `princomp` (use argument `cor=TRUE`) and `prcomp` (use argument `scale.=TRUE`)!
```{r}
pca <- princomp(cbind(x1,x2)) # PCA based on covariance matrix
pca <- princomp(cbind(x1,x2), cor=TRUE) # PCA based on correlation matrix
loadings(pca)
```
The loadings measure the correlation of each variable with the principal components. They inform about what aspects of the data each component is measuring. The signs of the loadings are arbitrary, thus we can multiplied them by -1 without changing the PCA. Sometimes this can be handy for describing the meaning of the principal component in a paper. For example, @Zbinden.2018 combined the number of hunting licenses, the duration of the hunting period, and the number of black grouse cocks that were allowed to be hunted per hunter in a principal component in order to measure hunting pressure. All three variables had a negative loading on the first component, so that high values of the component meant low hunting pressure. Before the subsequent analyses, for which a measure of hunting pressure was of interest, the authors changed the signs of the loadings so that the first principal component could be used as a measure of hunting pressure.
This example also illustrates what PCA is often used for in ecology: You can reduce the number of predictors (for a model) by combining meaningful sets of predictors into fewer PCs. In the example, three predictors were combined into one predictor. This is often useful when you have many predictors, especially when these contain ecologically meaningful sets of predictors (in the example: hunting pressure variables; could also be weather variables, habitat variables etc) and, importantly, when they correlate. If there is not correlation, you don't gain anything by using PCs instead of the original variables. But if they correlate, the first PC (or the first few PCs) can represent a relevant proportion of the total variance.
Hence, this proportion of variance explained by each PC is, beside the loadings, an important information. If the first few components explain the main part of the variance, it means that maybe not all $k$ variables are necessary to describe the data, or, in other words, the original $k$ variables contain a lot of redundant information. Then, you can use fewer than $k$ PCs without loosing (much) information.
```{r}
# extract the variance captured by each component
summary(pca)
```
Ridge regression is similar to doing a PCA within a linear model while components with low variance are shrinked to a higher degree than components with a high variance [may be there will be a chapter on ridge regression some day].
As stated, PCA can be used to generate a lower-dimensional representation of multi-dimensional data, aiming to represent as much of the original multi-dimensional variance as possible e.g. in two dimensions (PC1 vs. PC2). There is a number of other techniques, especially from community ecology, with a similar aim, e.g. correspondence analyses (CA), redundancy analyses (RDA), or non-metric multidimensional scaling (NMDS). All these methods, including PCA, are "ordination techniques". PCA simply turns the data cloud around without any distortion (only scaling), thereby showing as much of the euclidean distance as possible in the first PCs. Other techniques use other distance measures than euclidean. There may be additional variables that constrain the ordination (e.g. constrained = canonical CA). The R package vegan provides many of these ordination techniques and tutorials are available online. You may also start with wikipedia, e.g. on [multivariate statistics](https://en.wikipedia.org/wiki/Multivariate_statistics).
## Inferential statistics
### Uncertainty
> There is never a "yes-or-no" answer,
> there will always be uncertainty
[Amrhein (2017)](https://peerj.com/preprints/26857)
The decision whether an effect is important or not cannot be based on data alone. For making a decision, we should, beside the data, carefully consider the consequences of each decision, the aims we would like to achieve, and the risk, i.e. how bad it is to make the wrong decision. Structured decision making or decision analyses provide methods to combine consequences of decisions, objectives of different stakeholders, and risk attitudes of decision makers to make optimal decisions [@Hemming.2022, @Runge.2020]. In most data analyses, particularly in basic research and when working on case studies, we normally do not consider consequences of decisions. However, the results will be more useful when presented in a way that other scientists can use them for a meta-analysis (including structured decision making), or stakeholders and politicians can use them for making better decisions. Results useful for this must include information on the size of a parameter of interest, such as the effect of a drug or an average survival, together with an uncertainty measure.
Therefore, inferential statistics aims to describe the process that presumably has generated the data, and it quantifies the uncertainty of the described process that is due to the fact that the data is just a random sample from the larger population we would like to know the process of. In other words: Using regression modelling, we find patterns in our data, and if we make good models and careful ecological reasoning, we can make an educated guess about the underlying process.
Quantification of uncertainty is only possible if:
1. the mechanisms that generated the data are known
2. the observations are a random sample from the population of interest
Most studies aim at understanding the mechanisms that generated the data, thus they are generally not known beforehand. To overcome that problem, we construct models, e.g. statistical models, that are (strong) abstractions of the data generating process. And we report the model assumptions. All uncertainty measures are conditional on the model we used to analyze the data, i.e., they are only reliable if the model describes the data generating process realistically. Because statistical models essentially never describe the data generating process perfectly, the true uncertainty almost always is (much) higher than the one we report.
We can only make inference about the population under study if we have a random sample from this population. In order to obtain a random sample, a good study design is a prerequisite. To illustrate how inference about a big population is drawn from a small sample, we here use simulated data. The advantage of using simulated data is that the mechanism that generated the data is known, and we can play around with a big population.
Imagine there are 300 000 PhD students in the world and we would like to know how many statistics courses they have taken before they started their PhD on average (Fig. \@ref(fig:histtruesample)). We use random number generators (`rpois` and `rgamma`) to simulate a number of courses for each of the 300 000 virtual students. We use these 300 000 as the big population that in real life we almost never can sample in total. Normally, we only have values for a small sample of students. To simulate that situation we draw 12 numbers at random from the 300 000 (R function `sample`). Then, we estimate the average number of pre-PhD courses from the sample of 12 students, and we compare that *sample mean* with the *true mean* of the 300 000 students.
```{r}
# simulate the virtual true population
set.seed(235325) # set seed for random number generator
# simulate fake data of the whole population using an overdispersed Poisson
# distribution, i.e. a Poisson distribution whose mean has a gamma distribution
statscourses <- rpois(300000, rgamma(300000, 2, 3))
# draw a random sample from the population
n <- 12 # sample size
y <- sample(statscourses, 12, replace=FALSE)
```
```{r histtruesample, echo=F, fig.width=4.5, fig.height=3.5, fig.cap='Histogram of the number of statistics courses that 300000 virtual PhD students have taken before their PhD started. The rug (small ticks) on the x-axis shows the random sample of 12 drawn from the 300000 students. The black vertical line indicates the mean of the 300000 students (true mean) and the blue line indicates the mean of the sample (sample mean).', echo=FALSE}
par(mar=c(4,5,1,1))
# draw a histogram of the 300000 students
hist(statscourses, breaks=seq(-0.5, 10.5, by=1), main=NA,
xlab="Number of statistics courses", ylab="Number of students", ylim=c(-5000,170000))
box(bty="l")
# add the sample
rug(jitter(y), lwd=2)
# add the mean of the sample
abline(v=mean(y), col="blue", lwd=2)
# add the true mean of the population
abline(v=mean(statscourses), lwd=2)
text(4, 150000, "Sample mean", col="blue", adj=c(0,1))
text(4, 120000, "True mean", adj=c(0,1))
```
We observe the sample mean - but from that, what do we know about the population mean? There are two different approaches to answer this question. 1) We could ask us, how much the sample mean would scatter, if we repeated the study many times? This approach is the basis of frequentist statistics. 2) We could ask us for any possible value, what is the probability that it is the true population mean? To do so, we use probability theory and that is the basis of Bayesian statistics.
Both approaches use (essentially similar) models. Only the mathematical techniques to calculate uncertainty measures differ between the two approaches. When no information other than the data is used to construct the model, the results are practically identical (at least for large enough sample sizes).
A frequentist 95% confidence interval CI (blue horizontal segment in Fig. \@ref(fig:CImean)) is constructed such that, if you were to (hypothetically) repeat the sampling many times, 95% of these many intervals constructed would contain the true value of the parameter (here the mean number of courses). From the Bayesian approach, we get a so-called posterior distribution (pink in Fig. \@ref(fig:CImean)), from which we can construct a 95% interval (e.g. using the 2.5% and 97.5% quantiles). This interval has traditionally been called 95% credible interval CrI. It can be interpreted that we are 95% sure that the true mean is inside that interval.
Both, confidence interval and posterior distribution, represent to the statistical uncertainty of the sample mean, i.e., they measure how far away the sample mean could be from the true mean. In this virtual example, we know the true mean is `r round(mean(statscourses),2)`, thus in the lower part of the 95% CI or in the lower quantiles of the posterior distribution (hence, in the lower part of the 95% CrI). The grey histogram in Fig. \@ref(fig:CImean) shows how the means of many different virtual samples of 12 students scatter around the true mean. The 95% interval of these virtual means corresponds to the 95% CI, and the variance of these virtual means correspond to the variance of the posterior distribution. This virtual example shows that posterior distribution and 95% CI correctly measure the statistical uncertainty (variance, width of the interval), however we never know exactly how far the sample mean is from the true mean. In real life, we do not know the true mean.
```{r CImean, fig.width=6, fig.height=3, fig.cap='Histogram (in gray) of the means of repeated samples from the true population. The scatter of these means visualizes the true uncertainty of the mean in this example. The blue vertical line indicates the mean of one sample. The blue segment shows its 95% confidence interval (obtained by fequensist methods). The violet line shows the posterior distribution of the mean (obtained by Bayesian methods).',echo=FALSE, message=FALSE}
library(arm)
set.seed(123)
nsim <- 10000
# frequentist theoretical uncertainty of mean
# we draw many new samples from the true population
my <- numeric(nsim)
for(i in 1:nsim) my[i] <- mean(sample(statscourses, 12, replace=FALSE))
# Bayesian uncertainty of the mean
mod <- lm(y~1) # define the model (Normal(mu, sigma2))
bsim <- sim(mod, n.sim=nsim) # apply Bayes theorem
# frequentist uncertainty of the mean
mean_freqCI <- predict(mod, interval="confidence")[1,]
par(mar=c(4,4,0.1,6), mgp=c(2,0.5,0), tck=-0.01)
hist(my, main=NA, xlab="Mean(y)", freq=FALSE, ylim=c(0, 2.5), col="grey",
cex.axis=0.8, las=1)
bm <- density(bsim@coef[,1])
lines(bm$x, bm$y, lwd=2, col="violet")
abline(v= coef(mod), lwd=2, col="blue")
segments(mean_freqCI["lwr"], 0, mean_freqCI["upr"], lwd=3, col="blue")
text(0, 2.5, "True repetitions of the study", adj=c(0,1), cex=0.9, xpd=NA)
text(1.2, 2.2, "Sample mean with 95% CI", col="blue", adj=c(0,1), cex=0.9, xpd=NA)
text(1.2, 1.5, "Posterior distribution of the mean", col="violet", adj=c(0,1), cex=0.9, xpd=NA)
```
Uncertainty intervals are indispensable for inference from statistical analyses. An estimate without an uncertainty interval, called a point estimate, is quite useless. E.g. a point estimate of 10 has a very different meaning if its 95% uncertainty interval spans from -120 to 130 compared to when it spans from 9.6 to 10.4. Therefore, in most cases we provide the point estimate together with its uncertainty interval. But note that this does not yet guarantee correct inference: uncertainty intervals, as point estimates, are only reliable if the model is a realistic abstraction of the data generating process (i.e. if the model assumptions are realistic).
Both terms, confidence and credible interval, suggest that the interval indicates *confidence* or *credibility*, but the intervals actually show *uncertainty*. Therefore, it has been suggested to rename the interval into *uncertainty* interval, or *compatibility* interval as it indicates the range of values of the parameter (e.g. the mean) that is compatible with the data [@Gelman.2019].
### Standard error
The standard error SE is, beside the uncertainty interval, an alternative possibility to measure uncertainty. It measures an average deviation of the sample mean from the (unknown) true population mean. The frequentist method for obtaining the SE is based on the central limit theorem. According to the central limit theorem, the sum of independent, not necessarily normally distributed random numbers are normally distributed when sample size is large enough (Chapter \@ref(distributions)). Because the mean is a sum (divided by a constant, the sample size) it can be assumed that the distribution of the means of many samples is normal. The standard deviation SD of the many means is called the standard error SE. It can be mathematically shown that the standard error SE equals the standard deviation SD of the sample divided by the square root of the sample size:
- Frequentist SE = SD/sqrt(n) = $\frac{\hat{\sigma}}{\sqrt{n}}$
- Bayesian SE = the SD of the posterior distribution
It is very important to keep the difference between SE and SD in mind! SD measures the scatter of the data, whereas SE measures the statistical uncertainty of the mean (or of another estimated parameter, Fig. \@ref(fig:sesd)). SD is a descriptive statistic, describing a characteristic of the data. The SE, on the other hand, is an inferential statistic; it is a measure of how far the sample mean may be away from the true mean (in Bayesian reasoning, the true mean is expected to be within $\pm$ 1 SE of the sample mean with a probability of about 67%, and within $\pm$ 2 SE with 95%). As sample size increases, the SE becomes smaller, whereas SD does not change (on average). The SE becomes smaller because the uncertainty about our sample mean is reduced with increasing sample size. The SD of the sample, however, is an estimate of the SD in the big population, so whether we have a small or large sample, its SD remains the same on average.
```{r sesd, fig.width=5, fig.height=3, fig.cap='Illustration of the difference between SD and SE. The SD measures the scatter in the data (sample, tickmarks on the x-axis). The SD is an estimate for the scatter in the big population (grey histogram, normally not known). The SE measures the uncertainty of the sample mean (in blue). The SE measures approximately how far, in average the sample mean (blue) is from the true mean (brown).',echo=FALSE, message=FALSE}
par(mar=c(4,4,0.1,6), mgp=c(2,0.5,0), tck=-0.01, cex.lab=0.8, cex.axis=0.7)
hist(statscourses, breaks=seq(-0.5, 10.5, by=1), main=NA, freq=FALSE, ylim=c(-0.02, 0.6),
las=1, xlab="Stats courses") # draw histogram of the population
rug(jitter(y), col="black", lwd=2) # add the sample
abline(v= coef(mod), lwd=2, col="blue")
segments(coef(mod)-sd(y), 0, coef(mod)+sd(y),
lwd=5, col="green", lend="butt")
segments(coef(mod)-summary(mod)$coef[2], 0.02, coef(mod)+summary(mod)$coef[2],
lwd=5, col="blue", lend="butt")
abline(v=mean(statscourses), col="brown")
text(2, 0.2, "Estimated SD around the mean", adj=c(0,1), col="green", cex=0.9, xpd=NA)
text(2, 0.3, "Estimated mean with SE", adj=c(0,1), col="blue", cex=0.9, xpd=NA)
box()
```
## Bayes theorem and the common aim of frequentist and Bayesian methods
### Bayes theorem for discrete events
```{r, echo=FALSE}
tab <- table(dat$car, dat$birthday)
```
The Bayes theorem describes the probability of event A conditional on event B (the probability of A after B has already occurred) from the probability of B conditional on A and the two probabilities of the events A and B:
$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
Imagine, event A is "The person likes wine as a birthday present." and event B "The person has no car.". The conditional probability of A given B is the probability that a person not owing a car likes wine. Answers from students whether they have a car and what they like as a birthday present are summarized in Table \@ref(tab:winecar).
Table: (\#tab:winecar) Cross table of the student's birthday preference and car ownership.
car/birthday | flowers | wine | **sum** |
:------|:---------------|:---------------|:------------------|
no car | `r tab[1,1]` | `r tab[1,2]` | **`r sum(tab[1,])`** |
car | `r tab[2,1]` | `r tab[2,2]` | **`r sum(tab[2,])`** |
**sum** | **`r sum(tab[,1])`**| **`r sum(tab[,2])`**| **`r sum(tab)`** |
We can apply the Bayes theorem to obtain the probability that the student likes wine given it has no car, $P(A|B)$. Let's assume that only the ones who prefer wine go together for having a glass of wine at the bar after the statistics course. While they drink wine, the tell each other about their cars and they obtain the probability that a student who likes wine has no car, $P(B|A) = `r round(tab[1,2]/sum(tab[,2]),2)`$. During the statistics class the teacher asked the students about their car ownership and birthday preference. Therefore, they know that $P(A) =$ likes wine $= `r round(sum(tab[,2])/sum(tab),2)`$ and $P(B) =$ no car $= `r round(sum(tab[1,])/ sum(tab),2)`$. With these information, they can obtain the probability that a student likes wine given it has no car, even if not all students without cars went to the bar: $P(A|B) = \frac{`r round(tab[1,2]/sum(tab[,2]),2)`*`r round(sum(tab[,2])/sum(tab),2)`}{`r round(sum(tab[1,])/ sum(tab),2)`} = `r round((tab[1,2]/sum(tab[,2]))*(sum(tab[,2])/sum(tab))/ (sum(tab[1,])/ sum(tab)), 2)`$.
### Bayes theorem for continuous parameters
When we use the Bayes theorem for analyzing data, then the aim is to make probability statements for parameters. Because most parameters are measured at a continuous scale we use probability density functions to describe what we know about them. The Bayes theorem can be formulated for probability density functions denoted with $p(\theta)$, e.g. for a parameter $\theta$ (for example probability density functions see Chapter \@ref(distributions)).
What we are interested in is the probability of the parameter $\theta$ given the data, i.e., $p(\theta|y)$. This probability density function is called the posterior distribution of the parameter $\theta$. Here is the Bayes theorem formulated for obtaining the posterior distribution of a parameter from the data $y$, the prior distribution of the parameter $p(\theta)$ and assuming a model for the data generating process. The data model is defined by the likelihood that specifies how the data $y$ is distributed given the parameters $p(y|\theta)$:
$p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)} = \frac{p(y|\theta)p(\theta)}{\int p(y|\theta)p(\theta) d\theta}$
The probability of the data $p(y)$ is also called the scaling constant, because it is a constant. It is the integral of the likelihood over all possible values of the parameter(s) of the model.
### Estimating a mean assuming that the variance is known
For illustration, we first describe a simple (unrealistic) example for which it is almost possible to follow the mathematical steps for solving the Bayes theorem even for non-mathematicians. Even if we cannot follow all steps, this example will illustrate the principle how the Bayesian theorem works for continuous parameters. The example is unrealistic because we assume that the variance $\sigma^2$ in the data $y$ is known.
We construct a data model by assuming that $y$ is normally distributed:
$p(y|\theta) = normal(\theta, \sigma)$, with $\sigma$ known. The function $normal$ defines the probability density function of the normal distribution (Chapter \@ref(distributions)).
The parameter, for which we would like to get the posterior distribution is $\theta$, the mean. We specify a prior distribution for $\theta$. Because the normal distribution is a conjugate prior for a normal data model with known variance, we use the normal distribution. Conjugate priors have nice mathematical properties (see Chapter \@ref(priors)) and are therefore preferred when the posterior distribution is obtained algebraically.
That is the prior:
$p(\theta) = normal(\mu_0, \tau_0)$
With the above data, data model and prior, the posterior distribution of the mean $\theta$ is defined by:
$p(\theta|y) = normal(\mu_n, \tau_n)$, where
$\mu_n= \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau_0^2}+\frac{n}{\sigma^2}}$ and
$\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}$
$\bar{y}$ is the arithmetic mean of the data. Because only this value is needed in order to obtain the posterior distribution, this value is called the sufficient statistics.
From the mathematical formulas above and also from Fig. \@ref(fig:triplot) we see that the mean of the posterior distribution is a weighted average between the prior mean and $\bar{y}$ with weights equal to the precisions ($\frac{1}{\tau_0^2}$ and $\frac{n}{\sigma^2}$).
```{r triplot, fig.cap='Hypothetical example showing the result of applying the Bayes theorem for obtaining a posterior distribution of a continuous parameter. The likelhood is defined by the data and the model, the prior is expressing the knowledge about the parameter before looking at the data. Combining the two distributions using the Bayes theorem results in the posterior distribution.',echo=FALSE, results='hide', dpi=700}
library(blmeco)
par(mar=c(4,4,0.8,0.1), mgp=c(2,0.5,0), tck=-0.01)
triplot.normal.knownvariance(theta.data=3, variance.known=2, n=3,
prior.theta=0, prior.variance=10)
```
### Estimating the mean and the variance
We now move to a more realistic example, which is estimating the mean and the variance of a sample of weights of Snowfinches *Montifringilla nivalis* (Fig. \@ref(fig:ssp)). To analyze those data, a model with two parameters (the mean and the variance or standard deviation) is needed. The data model (or likelihood) is specified as $p(y|\theta, \sigma) = normal(\theta, \sigma)$.
```{r ssp, fig.align='center', echo=FALSE, fig.cap='Snowfinches stay above the treeline for winter. They come to feeders.'}
knitr::include_graphics('images/snowfinch2.JPG', dpi = 150)
```
```{r,echo=TRUE}
# weight (g)
y <- c(47.5, 43, 43, 44, 48.5, 37.5, 41.5, 45.5)
n <- length(y)
```
Because there are two parameters, we need to specify a two-dimensional prior distribution. We looked up in @Gelman.2014 that the conjugate prior distribution in our case is an Normal-Inverse-Chisquare distribution:
$p(\theta, \sigma) = N-Inv-\chi^2(\mu_0, \sigma_0^2/\kappa_0; v_0, \sigma_0^2)$
From the same reference we looked up how the posterior distribution looks like in our case:
$p(\theta,\sigma|y) = \frac{p(y|\theta, \sigma)p(\theta, \sigma)}{p(y)} = N-Inv-\chi^2(\mu_n, \sigma_n^2/\kappa_n; v_n, \sigma_n^2)$, with
$\mu_n= \frac{\kappa_0}{\kappa_0+n}\mu_0 + \frac{n}{\kappa_0+n}\bar{y}$,
$\kappa_n = \kappa_0+n$,
$v_n = v_0 +n$,
$v_n\sigma_n^2=v_0\sigma_0^2+(n-1)s^2+\frac{\kappa_0n}{\kappa_0+n}(\bar{y}-\mu_0)^2$
For this example, we need the arithmetic mean $\bar{y}$ and standard deviation $s^2$ from the sample for obtaining the posterior distribution. Therefore, these two statistics are the sufficient statistics.
The above formula look intimidating, but we never really do that calculations. We let `R` doing that for us in most cases by simulating many numbers from the posterior distribution, e.g., using the function `sim` from the package arm [@Gelman.2007]. In the end, we can visualize the distribution of these many numbers to have a look at the posterior distribution.
In Fig. \@ref(fig:jointpdist) the two-dimensional $(\theta, \sigma)$ posterior distribution is visualized by using simulated values. The two dimensional distribution is called the joint posterior distribution. The mountain of dots in Fig. \@ref(fig:jointpdist) visualize the Normal-Inverse-Chisquare that we calculated above. When all values of one parameter is displayed in a histogram ignoring the values of the other parameter, it is called the marginal posterior distribution. Algebraically, the marginal distribution is obtained by integrating one of the two parameters out over the joint posterior distribution. This step is definitively way easier when simulated values from the posterior distribution are available!
```{r jointpdist, fig.cap='Visualization of the joint posterior distribution for the mean and standard deviation of Snowfinch weights. The lower left panel shows the two-dimensional joint posterior distribution, whereas the upper and right panel show the marginal posterior distributions of each parameter separately.', echo=FALSE, fig.height=7}
mod <- lm(y~1)
bsim <- sim(mod, n.sim=20000)
xrange <- range(bsim@coef)
yrange <- range(bsim@sigma)
layout(matrix(c(2,0,1,3), ncol=2, byrow=TRUE), heights=c(1,2),
width=c(2,1))
par(mar=c(5.1, 4.1, 0.2, 0.2), cex.lab=0.8, cex.axis=0.8)
plot(bsim@coef, bsim@sigma, xlab=expression(theta), ylab=expression(sigma), las=1, cex=0.5, xlim=xrange, ylim=yrange, col=rgb(0,0,0,0.05), pch=16)
par(mar=c(0, 4.1, 1, 0.2))
histtheta <- hist(bsim@coef, plot=FALSE)
plot(histtheta$breaks, seq(0, max(histtheta$density)+0.01, length=
length(histtheta$breaks)), type="n", xaxt="n", xlim = xrange,
las=1, ylab = "Density", yaxs="i")
rect(histtheta$breaks[1:(length(histtheta$breaks)-1)], 0,
histtheta$breaks[2:(length(histtheta$breaks))], histtheta$density,
col=grey(0.5))
mtext(paste(expression(p(theta|y)), "= t-distribution"), side=3, adj=0, cex=0.6)
par(mar=c(5.1, 0, 0.2, 0.2))
histsigma<- hist(bsim@sigma, plot=FALSE)
plot(seq(0, max(histsigma$density)+0.01, length=length(histsigma$breaks)),
histsigma$breaks, type="n", yaxt="n", ylim=yrange,
xlab="Density", xaxs="i",
xaxt="n")
axis(1, at=seq(0, 0.4, by=0.1), labels=c(NA, 0.1, 0.3, 0.3, 0.4))
rect(0, histsigma$breaks[2:(length(histsigma$breaks))], histsigma$density,
histsigma$breaks[1:(length(histsigma$breaks)-1)],col=grey(0.5))
text(0.04, 18, "p(sigma|y) = \nInv-Chisq-dist.", adj=c(0,1), cex=0.6)
```
The marginal posterior distributions of every parameter is what we normally report in a paper to report statistical uncertainty.
In our example, the marginal distribution of the mean is a t-distribution (Chapter \@ref(distributions)). Frequentist statistical methods also use a t-distribution to describe the uncertainty of an estimated mean for the case when the variance is not known. Thus, frequentist methods came to the same solution using a completely different approach and different techniques. Doesn't that increase dramatically our trust in statistical methods?
## Classical frequentist tests and alternatives
### Nullhypothesis testing
Null hypothesis testing is constructing a model that describes how the data would look like in case of what we expect to be would not be. Then, the data is compared to how the model thinks the data should look like. If the data does not look like the model thinks they should, we reject that model and accept that our expectation may be true.
To decide whether the data looks like the null-model thinks the data should look like the p-value is used. The p-value is the probability of observing the data or more extreme data given the null hypothesis is true.
Small p-values indicate that it is rather unlikely to observe the data or more extreme data given the null hypothesis $H_0$ is true.
Null hypothesis testing is problematic. We discuss some of the problems after having introduces the most commonly used classical tests.
### Comparison of a sample with a fixed value (one-sample t-test)
In some studies, we would like to compare the data to a theoretical value. The theoretical value is a fixed value, e.g. calculated based on physical, biochemical, ecological or any other theory. The statistical task is then to compare the mean of the data including its uncertainty with the theoretical value. The result of such a comparison may be an estimate of the mean of the data with its uncertainty or an estimate of the difference of the mean of the data to the theoretical value with the uncertainty of this difference.
<!-- fk: HIER WEITER: include a figure that shows the data in a box and the fixed value and the mean with CI, and one only showing data with stars-->
For example, a null hypothesis could be $H_0:$"The mean of Snowfinch weights is exactly 40g."
A normal distribution with a mean of $\mu_0=40$ and a variance equal to the estimated variance in the data $s^2$ is then assumed to describe how we would expect the data to look like given the null hypothesis was true. From that model it is possible to calculate the distribution of hypothetical means of many different hypothetical samples of sample size $n$. The result is a t-distribution (Fig. \@ref(fig:nht)). In classical tests, the distribution is standardized so that its variance was one. Then the sample mean, or in classical tests a standardized difference between the mean and the hypothetical mean of the null hypothesis (here 40g), called test statistics $t = \frac{\bar{y}-\mu_0}{\frac{s}{\sqrt{n}}}$, is compared to that (standardized) t-distribution. If the test statistics falls well within the expected distribution the null hypothesis is accepted. Then, the data is well compatible with the null hypothesis. However, if the test statistics falls in the tails or outside the distribution, then the null hypothesis is rejected and we could write that the mean weight of Snowfinches is statistically significantly different from 40g. Unfortunately, we cannot infer about the probability of the null hypothesis and how relevant the result is.
```{r nht, fig.cap='Illustration of a one-sample t-test. The blue histogram shows the distribution of the measured weights with the sample mean (lightblue) indicated as a vertical line. The black line is the t-distribution that shows how hypothetical sample means are expected to be distributed if the big population of Snowfinches has a mean weight of 40g (i.e., if the null hypothesis was true). Orange area shows the area of the t-distribution that lays equal or farther away from 40g than the sample mean. The orange area is the p-value.', fig.width=8, fig.height=5, fig.align='center',echo=FALSE}
# use density function for t-distribution as given in Gelman et al. (2014) BDA-book
dt_BDA <- function(x,v,m,s) gamma((v+1)/2)/(gamma(v/2)*sqrt(v*pi)*s)*(1+1/v*((x-m)/s)^2)^(-(v+1)/2)
par(mar=c(4.5, 5, 2, 2))
hist(y, col="blue", xlim=c(30, 52), las=1, freq=FALSE, main=NA, ylim=c(0, 0.3))
abline(v=mean(y), lwd=3, col="lightblue")
abline(v=40, lwd=2)
text(41, 0.31, "H0", xpd=NA)
text(30, 0.25, "t(7, 40, SE(mean(y))", adj=c(0,0), cex=0.8)
x <- seq(20, 60, length=100)
dy <- dt_BDA(x, v=7, m=40, s=sd(y)/sqrt(n))
lines(x, dy)
index <- x>=mean(y)
polygon(c(x[index], rev(x[index])), c(rep(0, sum(index)), rev(dy[index])), border=NA, col="orange")
index <- x<= 40- (mean(y)-40)
polygon(c(x[index], rev(x[index])), c(rep(0, sum(index)), rev(dy[index])), border=NA, col="orange")
```
We can use the r-function `t.test` to calculate the p-value of a one sample t-test.
```{r,echo=TRUE}
t.test(y, mu=40)
```
The output of the r-function `t.test` also includes the mean and the 95% confidence interval (or compatibility or uncertainty interval) of the mean. The CI could, alternatively, be obtained as the 2.5% and 97.5% quantiles of a t-distribution with a mean equal to the sample mean, degrees of freedom equal to the sample size minus one and a standard deviation equal to the standard error of the mean.
```{r}
# lower limit of 95% CI
mean(y) + qt(0.025, df=length(y)-1)*sd(y)/sqrt(n)
# upper limit of 95% CI
mean(y) + qt(0.975, df=length(y)-1)*sd(y)/sqrt(n)
```
When applying the Bayes theorem for obtaining the posterior distribution of the mean we end up with the same t-distribution as described above, in case we use flat prior distributions for the mean and the standard deviation. Thus, the two different approaches end up with the same result!
```{r, fig.cap='Illustration of the posterior distribution of the mean. The blue histogram shows the distribution of the measured weights with the sample mean (lightblue) indicated as a vertical line. The black line is the posterior distribution that shows what we know about the mean after having looked at the data. The area under the posterior density function that is larger than 40 is the posterior probability of the hypothesis that the true mean Snwofinch weight is larger than 40g.',echo=TRUE, fig.height=6}
par(mar=c(4.5, 5, 2, 2))
hist(y, col="blue", xlim=c(30,52), las=1, freq=FALSE, main=NA, ylim=c(0, 0.3))
abline(v=mean(y), lwd=2, col="lightblue")
abline(v=40, lwd=2)
lines(density(bsim@coef))
text(45, 0.3, "posterior distribution\nof the mean of y", cex=0.8, adj=c(0,1), xpd=NA)
```
The posterior probability of the hypothesis that the true mean Snowfinch weight is larger than 40g, $P(H:\mu>40) =$, is equal to the proportion of simulated random values from the posterior distribution, saved in the vector `bsim@coef`, that are larger than 40.
```{r}
# Two ways of calculating the proportion of values
# larger than a specific value within a vector of values
round(sum(bsim@coef[,1]>40)/nrow(bsim@coef),2)
round(mean(bsim@coef[,1]>40),2)
# Note: logical values TRUE and FALSE become
# the numeric values 1 and 0 within the functions sum() and mean()
```
We, thus, can be pretty sure that the mean Snowfinch weight (in the big world population) is larger than 40g. Such a conclusion is not very informative, because it does not tell us how much larger we can expect the mean Snowfinch weight to be. Therefore, we prefer reporting a credible interval (or compatibility interval or uncertainty interval) that tells us what values for the mean Snowfinch weight are compatible with the data (given the data model we used realistically reflects the data generating process). Based on such an interval, we can conclude that we are pretty sure that the mean Snowfinch weight is between 40 and 48g.
```{r}
# 80% credible interval, compatibility interval, uncertainty interval
quantile(bsim@coef[,1], probs=c(0.1, 0.9))
# 95% credible interval, compatibility interval, uncertainty interval
quantile(bsim@coef[,1], probs=c(0.025, 0.975))
# 99% credible interval, compatibility interval, uncertainty interval
quantile(bsim@coef[,1], probs=c(0.005, 0.995))
```
### Comparison of the locations between two groups (two-sample t-test)
Many research questions aim at measuring differences between groups. For example, we could be curious to know how different in size car owner are from people not owning a car.
A boxplot is a nice possibility to visualize the ell length measurements of two (or more) groups (Fig. \@ref(fig:boxplt)). From the boxplot, we do not see how many observations are in the two samples. We can add that information to the plot. The boxplot visualizes the samples but it does not show what we know about the big (unmeasured) population mean. To show that, we need to add a compatibility interval (or uncertainty interval, credible interval, confidence interval, in brown in Fig. \@ref(fig:boxplt)).
```{r boxplt, fig.cap='Ell length of car owners (Y) and people not owning a car (N). Horizontal bar = median, box = interquartile range, whiskers = extremest observation within 1.5 times the interquartile range from the quartile, circles=observations farther than 1.5 times the interquartile range from the quartile. Filled brown circles = means, vertical brown bars = 95% compatibility interval.',echo=FALSE}
par(mar=c(4.5,4.5,2,0.2))
boxplot(ell~car, data=dat, las=1, ylab="Length of ell (cm)", col="grey", boxwex=0.2,
ylim=c(35, 50))
means <- tapply(dat$ell, dat$car, mean)
semeans <- tapply(dat$ell, dat$car, function(x) sd(x)/sqrt(length(x)))
segments(c(1.2,2.2), means-2*semeans, c(1.2,2.2), means+2*semeans, lwd=2, col="brown")
points(c(1.2,2.2), means, pch=21, bg="brown")
text(x=c(1,2), y=c(36, 36), labels=paste("n=", table(dat$car)))
```
When we added the two means with a compatibility interval, we see what we know about the two means, but we do still not see what we know about the difference between the two means. The uncertainties of the means do not show the uncertainty of the difference between the means. To do so, we need to extract the difference between the two means from a model that describes (abstractly) how the data has been generated. Such a model is a linear model that we will introduce in Chapter \@ref(lm). The second parameter measures the differences in the means of the two groups. And from the simulated posterior distribution we can extract a 95% compatibility interval.
Thus, we can conclude that the average ell length of car owners is with high probability between 0.5 cm smaller and 2.5 cm larger than the averag ell of people not having a car.
```{r, echo=TRUE}
mod <- lm(ell~car, data=dat)
mod
bsim <- sim(mod, n.sim=nsim)
quantile(bsim@coef[,"carY"], prob=c(0.025, 0.5, 0.975))
```
The corresponding two-sample t-test gives a p-value for the null hypothesis: "The difference between the two means equals zero.", a confidence interval for the difference and the two means. While the function `lm`gives the difference Y minus N, the function `t.test`gives the difference N minus Y. Luckily the two means are also given in the output, so we know which group mean is the larger one.
```{r, echo=TRUE}
t.test(ell~car, data=dat, var.equal=TRUE)
```
In both possibilities, we used to compare the to means, the Bayesian posterior distribution of the difference and the t-test or the confidence interval of the difference, we used a data model. We thus assumed that the observations are normally distributed. In some cases, such an assumption is not a reasonable assumption. Then the result is not reliable. In such cases, we can either search for a more realistic model or use non-parametric (also called distribution free) methods. Nowadays, we have almost infinite possibilities to construct data models (e.g. generalized linear models and beyond). Therefore, we normally start looking for a model that fits the data better. However, in former days, all these possiblities did not exist (or were not easily available for non-mathematicians). Therefore, we here introduce two of such non-parametric methods, the Wilcoxon-test (or Mann-Whitney-U-test) and the randomisation test.
Some of the distribution free statistical methods are based on the rank instead of the value of the observations. The principle of the Wilcoxon-test is to rank the observations and sum the ranks per group. It is not completely true that the non-parametric methods do not have a model. The model of the Wilcoxon-test "knows" how the difference in the sum of the ranks between two groups is distributed given the mean of the two groups do not differ (null hypothesis). Therefore, it is possible to get a p-value, e.g. by the function `wilcox.test`.
```{r, echo=TRUE}
wilcox.test(ell~car, data=dat)
```
The note in the output tells us that ranking is ambiguous, when some values are equal. Equal values are called ties when they should be ranked.
The result of the Wilcoxon-test tells us how probable it is to observe the difference in the rank sum between the two sample or a more extreme difference given the means of the two groups are equal. That is at least something.
A similar result is obtained by using a randomisation test. This test is not based on ranks but on the original values. The aim of the randomisation is to simulate a distribution of the difference in the arithmetic mean between the two groups assuming this difference would be zero. To do so, the observed values are randomly distributed among the two groups. Because of the random distribution among the two groups, we expect that, if we repeat that virtual experiment many times, the average difference between the group means would be zero (both virtual samples are drawn from the same big population).
We can use a loop in R for repeating the random re-assignement to the two groups and, each time, extracting the difference between the group means. As a result, we have a vector of many (`nsim`) values that all are possible differences between group means given the two samples were drawn from the same population. The proportion of these values that have an equal or larger absolute value give the probability that the observed or a larger difference between the group means is observed given the null hypothesis would be true, thus that is a p-value.
```{r, echo=TRUE}
diffH0 <- numeric(nsim)
for(i in 1:nsim){
randomcars <- sample(dat$car)
rmod <- lm(ell~randomcars, data=dat)
diffH0[i] <- coef(rmod)["randomcarsY"]
}
mean(abs(diffH0)>abs(coef(mod)["carY"])) # p-value
```
Visualizing the possible differences between the group means given the null hypothesis was true shows that the observed difference is well within what is expected if the two groups would not differ in their means (Fig. \@ref(fig:ranhist)).
```{r ranhist, fig.cap='Histogram if differences between the means of randomly assigned groups (grey) and the difference between the means of the two observed groups (red)', echo=FALSE}
par(mar=c(5.5, 4.5, 0.5, 0), cex=0.8)
hist(diffH0, main=NA); abline(v=coef(mod)[2], lwd=2, col="red", main=NA)
```
The randomization test results in a p-value and, we could also report the observed difference between the group means. However, it does not tell us, what values of the difference all would be compatible with the data. We do not get an uncertainty measurement for the difference.
In order to get a compatibility interval without assuming a distribution for the data (thus non-parametric) we could bootstrap the samples.
Bootstrapping is sampling observations from the data with replacement. For example, if we have a sample of 8 observations, we draw 8 times a random observation from the 8 observation. Each time, we assume that all 8 observations are available. Thus a bootstrapped sample could include some observations several times, whereas others are missing. In this way, we simulate the variance in the data that is due to the fact that our data do not contain the whole big population.
Also bootstrapping can be programmed in R using a loop.
```{r, echo=TRUE}
diffboot <- numeric(1000)
for(i in 1:nsim){
ngroups <- 1
while(ngroups==1){
bootrows <- sample(1:nrow(dat), replace=TRUE)
ngroups <- length(unique(dat$car[bootrows]))
}
rmod <- lm(ell~car, data=dat[bootrows,])
diffboot[i] <- coef(rmod)[2]
}
quantile(diffboot, prob=c(0.025, 0.975))
```
The resulting values for the difference between the two group means can be interpreted as the distribution of those differences, if we had repeated the study many times (Fig. \@ref(fig:histboot)).
A 95% interval of the distribution corresponds to a 95% compatibility interval (or confidence interval or uncertainty interval).
```{r histboot, fig.cap='Histogram of the boostrapped differences between the group means (grey) and the observed difference.', echo=TRUE}
hist(diffboot); abline(v=coef(mod)[2], lwd=2, col="red")
```
For both methods, randomisation test and bootstrapping, we have to assume that all observations are independent. Randomization and bootstrapping becomes complicated or even unfeasible when data are structured.
## Summary
Bayesian data analysis is applying the Bayes theorem for summarizing knowledge based on data, priors and the model assumptions.
Frequentist statistics is quantifying uncertainty by hypothetical repetitions.