# Normal Linear Models{#lm}
```{r fig.align='center', echo=FALSE, fig.link=''}
knitr::include_graphics('images/snowfinch3.jpg', dpi = 150)
```
------
## Linear regression
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(arm)       # sim() for drawing from the joint posterior distribution
library(tidyverse) # tibble, dplyr, tidyr and ggplot2 are used throughout this chapter
```
### Background
Linear regression is the basis of a large part of applied statistical analysis. Analysis of variance (ANOVA) and analysis of covariance (ANCOVA) can be considered special cases of linear regression, and generalized linear models are extensions of linear regression.
Typical questions that can be answered using linear regression are: How does $y$ change with changes in $x$? How is $y$ predicted from $x$? An ordinary linear regression (i.e., one numeric $x$ and one numeric $y$ variable) can be represented by a scatterplot of $y$ against $x$. We search for the line that fits best and describe how the observations scatter around this regression line (see Fig. \@ref(fig:figlm) for an example). The model formula of a simple linear regression with one continuous predictor variable $x_i$ (the subscript $i$ denotes the $i=1,\dots,n$ data points) is:
\begin{align}
\mu_i &=\beta_0 + \beta_1 x_i \\
y_i &\sim normal(\mu_i, \sigma^2)
(\#eq:lm)
\end{align}
While the first part of Equation \@ref(eq:lm) describes the regression line, the second part describes how the data points, also called observations, are distributed around the regression line (Figure \@ref(fig:illlm)). In other words: the observation $y_i$ stems from a normal distribution with mean $\mu_i$ and variance $\sigma^2$. The mean of the normal distribution, $\mu_i$, equals the sum of the intercept ($\beta_0$) and the product of the slope ($\beta_1$) and the continuous predictor value, $x_i$.
Equation \@ref(eq:lm) is called the data model, because it describes mathematically the process that has (or, better, that we think has) produced the data. This nomenclature also helps to distinguish data models from models for parameters such as prior or posterior distributions.
The differences between observation $y_i$ and the predicted values $\mu_i$ are the residuals (i.e., $\epsilon_i=y_i-\mu_i$). Equivalently to Equation \@ref(eq:lm), the regression could thus be written as:
\begin{align}
y_i &= \beta_0 + \beta_1 x_i + \epsilon_i\\
\epsilon_i &\sim normal(0, \sigma^2)
(\#eq:lmalternativ)
\end{align}
We prefer the notation in Equation \@ref(eq:lm) because, in this formula, the stochastic part (second row) is nicely separated from the deterministic part (first row) of the model, whereas, in the second notation \@ref(eq:lmalternativ) the first row contains both stochastic and deterministic parts.
For illustration, we here simulate a data set and below fit a linear regression to these simulated data. The advantage of simulating data is that the following analyses can be reproduced without having to read data into R. Further, for simulating data, we need to translate the algebraic model formula into R code, which helps us understand the model structure.
```{r lmsimulation}
set.seed(34) # set a seed for the random number generator
# define the data structure
n <- 50 # sample size
x <- runif(n, 10, 30) # sample values of the predictor variable
# define values for each model parameter
sigma <- 5 # standard deviation of the residuals
b0 <- 2 # intercept
b1 <- 0.7 # slope
# simulate y-values from the model
mu <- b0 + b1 * x # define the regression line (deterministic part)
y <- rnorm(n, mu, sd = sigma) # simulate y-values
# save data in a data.frame
dat <- tibble(x = x, y = y)
```
```{r illlm, fig.asp=0.45, fig.cap="Illustration of a linear regression. The blue line represents the deterministic part of the model, i.e., here the regression line. The stochastic part is represented by a probability distribution, here the normal distribution. The normal distribution changes its mean but not its variance along the x-axis, and it describes how the data are distributed. The blue line and the orange distribution together are a statistical model, i.e., an abstract representation of the data, which are shown in black.", echo=FALSE}
mod <- lm(y~x, data=dat)
par(mar=c(3,4,0.5,0.5))
plot(dat$x, dat$y, pch=16, las=1, xlab="Predictor (x)",
ylab="Outcome (y)")
abline(mod, lwd=2, col="blue")
ynew <- seq(0,30, length=100)
for(xx in c(14,24)){
dynew <- dnorm(ynew, coef(mod)[1]+xx*coef(mod)[2], summary(mod)$sigma)
lines(xx+dynew/max(dynew)*4, ynew, lwd=2, col="orange")
abline(v=xx)
}
```
Using matrix notation, Equation \@ref(eq:lm) can also be written in one row:
$$\boldsymbol{y} \sim
normal(\boldsymbol{X} \boldsymbol{\beta}, \sigma^2\boldsymbol{I})$$
where $\boldsymbol{I}$ is the $n \times n$ identity matrix (it transforms the variance parameter to an $n \times n$ matrix with its diagonal elements equal to $\sigma^2$; $n$ is the sample size). The multiplication by $\boldsymbol{I}$ is necessary because we use vector notation, $\boldsymbol{y}$ instead of $y_{i}$. Here, $\boldsymbol{y}$ is the vector of all observations, whereas $y_{i}$ is a single observation, $i$. When using vector notation, we can write the linear predictor of the model, $\beta_0 + \beta_1 x_i$, as a multiplication of the vector of the model coefficients
$$\boldsymbol{\beta} =
\begin{pmatrix}
\beta_0 \\
\beta_1
\end{pmatrix}$$
times the model matrix
$$\boldsymbol{X} =
\begin{pmatrix}
1 & x_1 \\
\dots & \dots \\
1 & x_n
\end{pmatrix}$$
where $x_1, \dots, x_n$ are the observed values of the predictor variable, $x$. The first column of $\boldsymbol{X}$ contains only ones because the values in this column are multiplied with the intercept, $\beta_0$. To obtain the predicted value for each observation, $\boldsymbol{\mu}$, the product of the second element of $\boldsymbol{\beta}$, $\beta_1$, with the corresponding element in the second column of $\boldsymbol{X}$ is added to the intercept:
\begin{align}
\boldsymbol{X \beta}=
\begin{pmatrix}
1 & x_1 \\
\dots & \dots \\
1 & x_n
\end{pmatrix}
\times
\begin{pmatrix}
\beta_0 \\
\beta_1
\end{pmatrix} =
\begin{pmatrix}
\beta_0 + \beta_1x_1 \\
\dots \\
\beta_0 + \beta_1x_n
\end{pmatrix}=
\begin{pmatrix}
\hat{y}_1 \\
\dots \\
\hat{y}_n
\end{pmatrix} =
\boldsymbol{\mu}
(\#eq:lmmatrix)
\end{align}
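As a small check, the matrix multiplication in Equation \@ref(eq:lmmatrix) can be reproduced with the simulated data from above (`%*%` is R's matrix multiplication operator):
```{r}
X <- cbind(1, x)          # model matrix: a column of ones and the predictor values
beta <- c(b0, b1)         # vector of model coefficients used in the simulation
mu_check <- X %*% beta    # matrix multiplication yields the vector of means
head(cbind(mu_check, mu)) # identical to the element-wise calculation of mu above
```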
### Fitting a Linear Regression in R
In Equation \@ref(eq:lm), the fitted values $\mu_i$ are directly defined by the model coefficients, $\beta_{0}$ and $\beta_{1}$ . Therefore, when we can estimate $\beta_{0}$, $\beta_{1}$ , and $\sigma^2$, the model is fully defined. The last parameter $\sigma^2$ describes how the observations scatter around the regression line and relies on the assumption that the residuals are normally distributed. The estimates for the model parameters of a linear regression are obtained by searching for the best fitting regression line. To do so, we search for the regression line that minimizes the sum of the squared residuals. This model fitting method is called the least-squares method, abbreviated as LS. It has a very simple solution using matrix algebra [see e.g., @Aitkin.2009].
The least-squares estimates for the model parameters of a linear regression are obtained in R using the function `lm`.
```{r}
mod <- lm(y ~ x, data = dat)
coef(mod)
summary(mod)$sigma
```
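As a cross-check (not needed for the analysis), the matrix-algebra least-squares solution mentioned above can be computed by hand; its result should match `coef(mod)`:
```{r}
X <- model.matrix(~ x, data = dat)               # model matrix (1, x)
beta_ls <- solve(t(X) %*% X) %*% t(X) %*% dat$y  # (X'X)^-1 X'y
cbind(beta_ls, coef(mod))                        # both columns should agree
```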
The object `mod` produced by `lm` contains the estimates for the intercept, $\beta_0$, and the slope, $\beta_1$. The residual standard deviation $\sigma$ is extracted using the function `summary`. We can show the result of the linear regression as a line in a scatter plot with the covariate (`x`) on the x-axis and the observations (`y`) on the y-axis (Fig. \@ref(fig:figlm)).
```{r figlm, echo=FALSE, fig.cap="Linear regression. Black dots = observations, blue solid line = regression line, orange dotted lines = residuals. The fitted values lie where the orange dotted lines touch the blue regression line.", fig.asp=.45}
dat %>%
mutate(ypred = fitted(lm(y~x))) %>%
ggplot(aes(x = x, y = y, xend = x, yend = ypred)) +
geom_segment(col = "orange", lty = 2) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
# alternative: geom_abline(intercept = coef(mod)[1], slope = coef(mod)[2]) +
labs(x = "Predictor (x)", y = "Outcome (y)")
```
Conclusions drawn from a model depend on the model assumptions. When model assumptions are violated, estimates usually are biased and inappropriate conclusions can be drawn. We devote Chapter \@ref(residualanalysis) to the assessment of model assumptions, given its importance.
### Drawing Conclusions
To answer the question about how strongly $y$ is related to $x$, taking into account statistical uncertainty, we look at the joint posterior distribution of $\boldsymbol{\beta}$ (the vector that contains $\beta_{0}$ and $\beta_{1}$) and $\sigma^2$, the residual variance. The function `sim` calculates the joint posterior distribution and returns simulated values from this distribution.
<font size="1">
<div style="border: 2px solid grey;">
What does `sim` do?
It simulates parameter values from the joint posterior distribution of a model assuming flat prior distributions. For a normal linear regression, it first draws a random value, $\sigma^*$ from the marginal posterior distribution of $\sigma$, and then draws random values from the conditional posterior distribution for $\boldsymbol{\beta}$ given $\sigma^*$ [@Gelman.2014].
The conditional posterior distribution of the parameter vector $\boldsymbol{\beta}$, $p(\boldsymbol{\beta}|\sigma^*,\boldsymbol{y,X})$, can be analytically derived. With flat prior distributions, it is a uni- or multivariate normal distribution $p(\boldsymbol{\beta}|\sigma^*,\boldsymbol{y,X})=normal(\boldsymbol{\hat{\beta}},V_\beta(\sigma^*)^2)$ with:
\begin{align}
\boldsymbol{\hat{\beta}=(\boldsymbol{X^TX})^{-1}X^Ty}
(\#eq:sim)
\end{align}
and $V_\beta = (\boldsymbol{X^T X})^{-1}$.
The marginal posterior distribution of $\sigma^2$ is independent of specific values of $\boldsymbol{\beta}$. It is, for flat prior distributions, an inverse chi-square distribution $p(\sigma^2|\boldsymbol{y,X})=\text{Inv-}\chi^2(n-k,\hat{\sigma}^2)$, where $\hat{\sigma}^2 = \frac{1}{n-k}(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\hat{\beta}})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\hat{\beta}})$, and $k$ is the number of parameters. The marginal posterior distribution of $\boldsymbol{\beta}$ can be obtained by integrating the conditional posterior distribution $p(\boldsymbol{\beta}|\sigma^2,\boldsymbol{y,X})=normal(\boldsymbol{\hat{\beta}},V_\beta\sigma^2)$ over the distribution of $\sigma^2$. This results in a uni- or multivariate $t$-distribution.
Because `sim` simulates values $\beta_0^*$ and $\beta_1^*$ always conditional on $\sigma^*$, a triplet of values ($\beta_0^*$, $\beta_1^*$, $\sigma^*$) is one draw of the joint posterior distribution. When we visualize the distribution of the simulated values for one parameter only, ignoring the values for the other, we display the marginal posterior distribution of that parameter. Thus, the distribution of all simulated values for the parameter $\beta_0$ is a $t$-distribution even if a normal distribution has been used for simulating the values. The $t$-distribution is a consequence of using a different $\sigma^2$-value for every draw of $\beta_0$.
</div>
</font>
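To make the algorithm described in the box concrete, the two-step draw can be coded by hand. This is a minimal sketch that mimics the procedure (it is not the actual source code of `arm::sim`); it uses `MASS::mvrnorm` for the multivariate normal draw and the model `mod` fitted above.
```{r}
X <- model.matrix(mod)                   # model matrix
n <- nrow(X); k <- ncol(X)               # sample size and number of coefficients
beta_hat <- coef(mod)                    # least-squares estimate
V_beta <- solve(t(X) %*% X)              # (X'X)^-1
s2 <- sum(residuals(mod)^2) / (n - k)    # estimate of the residual variance
one_draw <- function() {
  sigma2_star <- (n - k) * s2 / rchisq(1, df = n - k)            # sigma^2 | y
  beta_star <- MASS::mvrnorm(1, beta_hat, V_beta * sigma2_star)  # beta | sigma^2, y
  c(beta_star, sigma = sqrt(sigma2_star))
}
one_draw()  # one draw from the joint posterior distribution
```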
Using the function `sim` from the package `arm`, we can draw values from the joint posterior distribution of the model parameters and describe the marginal posterior distribution of each model parameter using these simulated values.
```{r}
library(arm)
nsim <- 1000
bsim <- sim(mod, n.sim = nsim)
```
The function `sim` simulates (in our example) 1000 values from the joint posterior distribution of the three model parameters $\beta_0$ , $\beta_1$, and $\sigma$. These simulated values are shown in Figure \@ref(fig:simfirstexample).
```{r simfirstexample, echo=FALSE, fig.dim = c(8,6.5), fig.cap="Joint (scatterplots) and marginal (histograms) posterior distribution of the model parameters. The six scatterplots show, using different axes, the three-dimensional cloud of 1000 simulations from the joint posterior distribution of the three parameters."}
histofun <- function(x) {
basicdat <- hist(x, plot = FALSE)
segments(
basicdat$mids,
min(x),
basicdat$mids,
min(x) + (max(x) - min(x)) * 2 / 3 * basicdat$density / max(basicdat$density),
lwd = 10,
lend = "butt",
col = grey(0.5)
)
}
panelfun <- function(x, y)
points(x, y, cex = 0.6)
pairs(
cbind(coef(bsim), bsim@sigma),
diag.panel = histofun,
panel = panelfun,
labels = c(expression(beta[0]), expression(beta[1]), expression(sigma))
)
```
The posterior distribution describes, given the data and the model, which parameter values are more plausible relative to each other. It expresses the uncertainty of the parameter estimate. It shows what we know about the model parameter after having looked at the data, given that the model is realistic.
The 2.5% and 97.5% quantiles of the marginal posterior distributions can be used as 95% uncertainty intervals of the model parameters. The function `coef` extracts the simulated values for the beta coefficients, returning a matrix with *nsim* rows and the number of columns corresponding to the number of parameters. In our example, the first column contains the simulated values from the posterior distribution of the intercept and the second column contains values from the posterior distribution of the slope. The "2" in the second argument of the apply-function (see Chapter \@ref(rmisc)) indicates that the `quantile` function is applied columnwise.
```{r}
apply(X = coef(bsim), MARGIN = 2, FUN = quantile, probs = c(0.025, 0.975)) %>%
round(2)
```
We also can calculate an uncertainty interval of the estimated residual standard deviation, $\hat{\sigma}$.
```{r}
quantile(bsim@sigma, probs = c(0.025, 0.975)) %>%
round(1)
```
We can further get a posterior probability for specific hypotheses, such as “The slope parameter is larger than 1” or “The slope parameter is larger than 0.5”. These probabilities are the proportion of simulated values from the posterior distribution that are larger than 1 and 0.5, respectively.
```{r}
sum(coef(bsim)[,2] > 1) / nsim # alternatively: mean(coef(bsim)[,2]>1)
sum(coef(bsim)[,2] > 0.5) / nsim
```
From this, there is very little evidence in the data that the slope is larger than 1, but we are quite confident that the slope is larger than 0.5 (assuming that our model is realistic).
We often want to show the effect of $x$ on $y$ graphically, with information about the uncertainty of the parameter estimates included in the graph. To draw such effect plots, we use the simulated values from the posterior distribution of the model parameters. From the deterministic part of the model, we know the regression line $\mu_i = \beta_0 + \beta_1 x_i$. The simulation from the joint posterior distribution of $\beta_0$ and $\beta_1$ gives 1000 pairs of intercepts and slopes that describe 1000 different regression lines. We can draw these regression lines in an x-y plot (scatter plot) to show the uncertainty in the regression line estimation (Fig. \@ref(fig:figlmer1)). Note that in this case it is not advisable to use `ggplot` because we draw many lines in one plot, which makes `ggplot` rather slow.
```{r figlmer1, fig.cap = "Regression with 1000 lines based on draws from the joint posterior distribution of the intercept and slope parameters to visualize the uncertainty of the estimated regression line.", fig.asp=0.45}
par(mar = c(4, 4, 0, 0))
plot(x, y, pch = 16, las = 1,
     xlab = "Predictor (x)", ylab = "Outcome (y)")
for(i in 1:nsim) {
abline(coef(bsim)[i,1], coef(bsim)[i,2], col = rgb(0, 0, 0, 0.05))
}
```
A more convenient way to show uncertainty is to draw the 95% uncertainty interval, CrI, of the regression line. To this end, we first define new x-values for which we would like to have the fitted values (about 100 points across the range of x will produce smooth-looking lines when connected by line segments). We save these new x-values within the new tibble `newdat`. Then, we create a new model matrix that contains these new x-values (`newmodmat`) using the function `model.matrix`. We then calculate the 1000 fitted values for each element of the new x (one value for each of the 1000 simulated regressions, Fig. \@ref(fig:figlmer1)), using matrix multiplication (%*%). We save these values in the matrix “fitmat”. Finally, we extract the 2.5% and 97.5% quantiles for each x-value from fitmat, and draw the lines for the lower and upper limits of the credible interval (Fig. \@ref(fig:figlmer2)).
```{r figlmer2, fig.cap = "Regression with 95% credible interval of the posterior distribution of the fitted values.", fig.asp=0.45}
# Calculate 95% credible interval
newdat <- tibble(x = seq(10, 30, by = 0.1))
newmodmat <- model.matrix( ~ x, data = newdat)
fitmat <- matrix(ncol = nsim, nrow = nrow(newdat))
for(i in 1:nsim) {fitmat[,i] <- newmodmat %*% coef(bsim)[i,]}
newdat$CrI_lo <- apply(fitmat, 1, quantile, probs = 0.025)
newdat$CrI_up <- apply(fitmat, 1, quantile, probs = 0.975)
# Make plot
regplot <-
ggplot(dat, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
geom_line(data = newdat, aes(x = x, y = CrI_lo), lty = 3) +
geom_line(data = newdat, aes(x = x, y = CrI_up), lty = 3) +
labs(x = "Predictor (x)", y = "Outcome (y)")
regplot
```
The interpretation of the 95% uncertainty interval is straightforward: We are 95% sure that the true regression line is within the credible interval (given the data and the model). As with all statistical results, this interpretation is only valid in the model world (if the world would look like the model). The larger the sample size, the narrower the interval, because each additional data point increases information about the true regression line.
The uncertainty interval measures statistical uncertainty of the regression line, but it does not describe how new observations would scatter around the regression line. If we want to describe where future observations will be, we have to report the posterior predictive distribution. We can get a sample of random draws from the posterior predictive distribution $\hat{y}|\boldsymbol{\beta},\sigma^2,\boldsymbol{X}\sim normal( \boldsymbol{X \beta, \sigma^2})$ using the simulated joint posterior distributions of the model parameters, thus taking the uncertainty of the parameter estimates into account. We draw a new $\hat{y}$-value from $normal( \boldsymbol{X \beta, \sigma^2})$ for each simulated set of model parameters. Then, we can visualize the 2.5% and 97.5% quantiles of this distribution for each new x-value.
```{r figlmer3, cache = TRUE, fig.cap = "Regression line with 95% uncertainty interval (dotted lines) and the 95% interval of the simulated predictive distribution (broken lines). Note that we increased the number of simulations to 50,000 to produce smooth lines.", fig.asp=0.45}
# increase number of simulation to produce smooth lines of the posterior
# predictive distribution
set.seed(34)
nsim <- 50000
bsim <- sim(mod, n.sim=nsim)
fitmat <- matrix(ncol=nsim, nrow=nrow(newdat))
for(i in 1:nsim) fitmat[,i] <- newmodmat%*%coef(bsim)[i,]
# prepare matrix for simulated new data
newy <- matrix(ncol=nsim, nrow=nrow(newdat))
# for each simulated fitted value, simulate one new y-value
for(i in 1:nsim) {
newy[,i] <- rnorm(nrow(newdat), mean = fitmat[,i], sd = bsim@sigma[i])
}
# Calculate 2.5% and 97.5% quantiles
newdat$pred_lo <- apply(newy, 1, quantile, probs = 0.025)
newdat$pred_up <- apply(newy, 1, quantile, probs = 0.975)
# Add the posterior predictive distribution to plot
regplot +
geom_line(data = newdat, aes(x = x, y = pred_lo), lty = 2) +
geom_line(data = newdat, aes(x = x, y = pred_up), lty = 2)
```
Of future observations, 95% are expected to be within the interval defined by the broken lines in Fig. \@ref(fig:figlmer3). Increasing sample size will not give a narrower predictive distribution because the predictive distribution primarily depends on the residual variance $\sigma^2$ which is a property of the data that is independent of sample size.
The way we produced Fig. \@ref(fig:figlmer3) is somewhat tedious compared to how easy we could have obtained the same figure using frequentist methods: `predict(mod, newdata = newdat, interval = "prediction")` would have produced the y-values for the lower and upper lines in Fig. \@ref(fig:figlmer3) in one R-code line. However, once we have a simulated sample of the posterior predictive distribution, we have much more information than is contained in the frequentist prediction interval. For example, we could give an estimate for the proportion of observations greater than 20, given $x = 25$.
```{r}
sum(newy[newdat$x == 25, ] > 20) / nsim
```
Thus, we expect 44% of future observations with $x = 25$ to be higher than 20. We can extract similar information for any relevant threshold value.
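For comparison, the frequentist prediction interval mentioned above can be obtained in a single line; its limits should be numerically close to the simulated 2.5% and 97.5% quantiles (a cross-check only):
```{r}
predint <- predict(mod, newdata = newdat, interval = "prediction")
head(cbind(predint, sim_lo = newdat$pred_lo, sim_up = newdat$pred_up))
```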
Another reason to learn the more complicated R code we presented here, compared to the frequentist methods, is that, for more complicated models such as mixed models, the frequentist methods to obtain confidence intervals of fitted values are much more complicated than the Bayesian method just presented. The latter can be used with only slight adaptations for mixed models and also for generalized linear mixed models.
### Interpretation of the R summary output
The least-squares solution for $\boldsymbol{\beta}$ is given in Equation \@ref(eq:sim). Most statistical software, including R, return an estimated frequentist standard error for each $\beta_k$. We extract these standard errors together with the estimates for the model parameters using the `summary` function.
```{r}
summary(mod)
```
The summary output first gives a rough summary of the residual distribution. However, we will do more rigorous residual analyses in Chapter \@ref(residualanalysis). The estimates of the model coefficients follow. The column "Estimate" contains the estimates for the intercept $\beta_0$ and the slope $\beta_1$ . The column "Std. Error" contains the estimated (frequentist) standard errors of the estimates. The last two columns contain the t-value and the p-value of the classical t-test for the null hypothesis that the coefficient equals zero. The last part of the summary output gives the parameter $\sigma$ of the model, named "residual standard error" and the residual degrees of freedom.
We think the name "residual standard error" for "sigma" is confusing, because $\sigma$ is not a measurement of uncertainty of a parameter estimate like the standard errors of the model coefficients are. $\sigma$ is a model parameter that describes how the observations scatter around the fitted values, that is, it is a standard deviation. It is independent of sample size, whereas the standard errors of the estimates for the model parameters will decrease with increasing sample size. Such a standard error of the estimate of $\sigma$, however, is not given in the summary output. Note that, by using Bayesian methods, we could easily obtain the standard error of the estimated $\sigma$ by calculating the standard deviation of the posterior distribution of $\sigma$.
The $R^2$ and the adjusted $R^2$ measure the proportion of variance in the outcome variable $y$ that is explained by the predictors in the model. $R^2$ is calculated from the sum of squared residuals, $SSR = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, and the total sum of squares, $SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$, where $\bar{y}$ is the mean of $y$. $SST$ is a measure of the total variance in $y$ and $SSR$ is a measure of the variance that cannot be explained by the model; thus $R^2 = 1- \frac{SSR}{SST}$ is the proportion of variance that can be explained by the model. If $SSR$ is close to $SST$, $R^2$ is close to zero and the model cannot explain a lot of variance. The smaller $SSR$, the closer $R^2$ is to 1. This version of $R^2$ approaches 1 as the number of model parameters approaches the sample size, even if none of the predictor variables correlates with the outcome. It is exactly 1 when the number of model parameters equals the sample size, because $n$ measurements can be described exactly by $n$ parameters. The adjusted $R^2$, $R^2_{adj} = \frac{var(y)-\hat\sigma^2}{var(y)}$, takes the sample size $n$ and the number of model parameters $k$ into account (see the explanation of variance in Chapter \@ref(basics)). Therefore, the adjusted $R^2$ is recommended as a measure of the proportion of explained variance.
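These quantities can be verified by hand from the fitted regression (a small check; `summary(mod)$r.squared` and `summary(mod)$adj.r.squared` contain the values printed in the summary above):
```{r}
SSR <- sum((dat$y - fitted(mod))^2)     # sum of squared residuals
SST <- sum((dat$y - mean(dat$y))^2)     # total sum of squares
n <- nrow(dat); k <- length(coef(mod))  # sample size and number of coefficients
c(R2 = 1 - SSR/SST,
  R2_adj = 1 - (SSR/(n - k)) / (SST/(n - 1)))
c(summary(mod)$r.squared, summary(mod)$adj.r.squared)
```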
## Linear model with one categorical predictor (one-way ANOVA)
The aim of analysis of variance (ANOVA) is to compare means of an outcome variable $y$ between different groups. To do so in the frequentist framework, variances between and within the groups are compared using F-tests (hence the name "analysis of variance"). When doing an ANOVA in a Bayesian way, inference is based on the posterior distributions of the group means and of the differences between the group means.
One-way ANOVA means that we only have one predictor variable, specifically a categorical predictor variable (in R defined as a "factor"). We illustrate the one-way ANOVA based on an example of simulated data (Fig. \@ref(fig:figanova)). We have measured weights of 30 virtual individuals for each of 3 groups. Possible research questions could be: How big are the differences between the group means? Are individuals from group 2 heavier than the ones from group 1? Which group mean is higher than 7.5 g?
```{r figanova, fig.asp=0.45, fig.cap = "Weights (g) of the 30 individuals in each group. The dark horizontal line is the median, the box contains 50% of the observations (i.e., the interquartile range), the whiskers mark the range of all observations that are less than 1.5 times the interquartile range away from the edge of the box."}
# settings for the simulation
set.seed(626436)
b0 <- 12 # mean of group 1 (reference group)
sigma <- 2 # residual standard deviation
b1 <- 3 # difference between group 1 and group 2
b2 <- -5 # difference between group 1 and group 3
n <- 90 # sample size
# generate data
group <- factor(rep(c("group 1","group 2", "group 3"), each=30))
simresid <- rnorm(n, mean=0, sd=sigma) # simulate residuals
y <- b0 +
as.numeric(group=="group 2") * b1 +
as.numeric(group=="group 3") * b2 +
simresid
dat <- tibble(y, group)
# make figure
dat %>%
ggplot(aes(x = group, y = y)) +
geom_boxplot(fill = "orange") +
labs(y = "Weight (g)", x = "") +
ylim(0, NA)
```
An ANOVA is a linear regression with a categorical predictor variable instead of a continuous one. The categorical predictor variable with $k$ levels is (as a default in R) transformed to $k-1$ indicator variables. An indicator variable is a binary variable containing 0 and 1, where 1 indicates a specific level (a category of the predictor variable). Often, one indicator variable is constructed for every level except for the reference level. In our example, the categorical variable is "group" with the three levels "group 1", "group 2", and "group 3" ($k = 3$). Group 1 is taken as the reference level (the default in R is the first level in alphabetical order), and for each of the other two groups an indicator variable is constructed, $I(group_i = 2)$ and $I(group_i = 3)$. The indicator function $I()$ returns 1 if the expression is true and 0 otherwise. We can write the model as a formula:
\begin{align}
\mu_i &=\beta_0 + \beta_1 I(group_i=2) + \beta_2 I(group_i=3) \\
y_i &\sim normal(\mu_i, \sigma^2)
(\#eq:anova)
\end{align}
where $y_i$ is the $i$-th observation (weight measurement for individual $i$ in our example), and $\beta_{0,1,2}$ are the model coefficients. The residual variance is $\sigma^2$. The model coefficients $\beta_{0,1,2}$ constitute the deterministic part of the model. From the model formula it follows that the group means, $m_g$, are:
\begin{align}
m_1 &=\beta_0 \\
m_2 &=\beta_0 + \beta_1 \\
m_3 &=\beta_0 + \beta_2 \\
(\#eq:anovamw)
\end{align}
There are other possibilities to describe three group means with three parameters, for example:
\begin{align}
m_1 &=\beta_1 \\
m_2 &=\beta_2 \\
m_3 &=\beta_3 \\
(\#eq:anovamwalt)
\end{align}
In this case, the model formula would be:
\begin{align}
\mu_i &= \beta_1 I(group_i=1) + \beta_2 I(group_i=2) + \beta_3 I(group_i=3) \\
y_i &\sim normal(\mu_i, \sigma^2)
(\#eq:anovaalt)
\end{align}
The way the group means are calculated within a model is called the parameterization of the model. Different statistical software use different parameterizations. The parameterization used by R by default is the one shown in Equation \@ref(eq:anova). R automatically takes the first level as the reference (the first level is the first one alphabetically unless the user defines a different order for the levels). The mean of the first group (i.e., of the first factor level) is the intercept, $\beta_0$, of the model. The mean of another factor level is obtained by adding, to the intercept, the estimate of the corresponding parameter (which is the difference from the reference group mean).
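If a different reference level is desired, it can be set explicitly before fitting the model. A short illustration (done on a copy of the data so that the parameterization used in the rest of this chapter stays unchanged):
```{r}
dat2 <- dat
dat2$group <- relevel(dat2$group, ref = "group 2")  # make group 2 the reference level
coef(lm(y ~ group, data = dat2))                    # intercept is now the mean of group 2
```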
The parameterization of the model is defined by the model matrix. In the case of a one-way ANOVA, there are as many columns in the model matrix as there are factor levels (i.e., groups); thus there are $k$ factor levels and $k$ model coefficients. Recall from Equation \@ref(eq:lmmatrix) that for each observation, the entry in the $j$-th column of the model matrix is multiplied by the $j$-th element of the model coefficients and the $k$ products are summed to obtain the fitted values. For a data set with $n = 5$ observations of which the first two are from group 1, the third from group 2, and the last two from group 3, the model matrix used for the parameterization described in Equation \@ref(eq:anovamw) and defined in R by the formula `~ group` is
\begin{align}
\boldsymbol{X}=
\begin{pmatrix}
1 & 0 & 0 \\
1 & 0 & 0 \\
1 & 1 & 0 \\
1 & 0 & 1 \\
1 & 0 & 1 \\
\end{pmatrix}
\end{align}
If the parameterization of Equation \@ref(eq:anovamwalt) (corresponding R formula: `~ group - 1`) were used, the model matrix would be
\begin{align}
\boldsymbol{X}=
\begin{pmatrix}
1 & 0 & 0 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
0 & 0 & 1 \\
\end{pmatrix}
\end{align}
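Both model matrices can be inspected directly in R. The rows extracted below correspond to the five hypothetical observations used above (in the simulated data, rows 1 and 2 belong to group 1, row 31 to group 2, and rows 61 and 62 to group 3):
```{r}
# default parameterization with intercept (formula ~ group)
model.matrix(~ group, data = dat)[c(1, 2, 31, 61, 62), ]
# parameterization without intercept (formula ~ group - 1)
model.matrix(~ group - 1, data = dat)[c(1, 2, 31, 61, 62), ]
```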
To obtain the parameter estimates for the model parameterized according to Equation \@ref(eq:anovamw), we fit the model in R:
```{r}
# fit the model
mod <- lm(y~group, data=dat)
# parameter estimates
mod
summary(mod)$sigma
```
The "Intercept" is $\beta_0$. The other coefficients are named with the factor name ("group") and the factor level (either "group 2" or "group 3"). These are $\beta_1$ and $\beta_2$ , respectively. Before drawing conclusions from an R output we need to examine whether the model assumptions are met, that is, we need to do a residual analysis as described in Chapter \@ref(residualanalysis).
Different questions can be answered using the above ANOVA: What are the group means? What is the difference in the means between group 1 and group 2? What is the difference between the means of the heaviest and lightest group? In a Bayesian framework we can directly assess how strongly the data support the hypothesis that the mean of the group 2 is larger than the mean of group 1. We first simulate from the posterior distribution of the model parameters.
```{r}
library(arm)
nsim <- 1000
bsim <- sim(mod, n.sim=nsim)
```
Then we obtain the posterior distributions for the group means according to the parameterization of the model formula (Equation \@ref(eq:anovamw)).
```{r}
m.g1 <- coef(bsim)[,1]
m.g2 <- coef(bsim)[,1] + coef(bsim)[,2]
m.g3 <- coef(bsim)[,1] + coef(bsim)[,3]
```
The histograms of the simulated values from the posterior distributions of the three means are given in Fig. \@ref(fig:figanovares). The three means are well separated and, based on our data, we are confident that the group means differ. From these simulated posterior distributions we obtain the means and use the 2.5% and 97.5% quantiles as limits of the 95% uncertainty intervals (Fig. \@ref(fig:figanovares), right).
```{r figanovares, fig.asp=0.45, fig.cap = "Distribution of the simulated values from the posterior distributions of the group means (left); group means with 95% uncertainty intervals obtained from the simulated distributions (right)."}
# save simulated values from posterior distribution in tibble
post <-
tibble(`group 1` = m.g1, `group 2` = m.g2, `group 3` = m.g3) %>%
gather("groups", "Group means")
# histograms per group
leftplot <-
ggplot(post, aes(x = `Group means`, fill = groups)) +
geom_histogram(aes(y=..density..), binwidth = 0.5, col = "black") +
labs(y = "Density") +
theme(legend.position = "top", legend.title = element_blank())
# plot mean and 95%-CrI
rightplot <-
post %>%
group_by(groups) %>%
dplyr::summarise(
mean = mean(`Group means`),
CrI_lo = quantile(`Group means`, probs = 0.025),
CrI_up = quantile(`Group means`, probs = 0.975)) %>%
ggplot(aes(x = groups, y = mean)) +
geom_point() +
geom_errorbar(aes(ymin = CrI_lo, ymax = CrI_up), width = 0.1) +
ylim(0, NA) +
labs(y = "Weight (g)", x ="")
multiplot(leftplot, rightplot, cols = 2)
```
To obtain the posterior distribution of the difference between the means of group 1 and group 2, we simply calculate this difference for each draw from the joint posterior distribution of the group means.
```{r}
d.g1.2 <- m.g1 - m.g2
mean(d.g1.2)
quantile(d.g1.2, probs = c(0.025, 0.975))
```
The estimated difference is `r round(mean(d.g1.2), 2)`. In the small model world, we are 95% sure that the difference between the means of group 1 and 2 is between `r round(quantile(d.g1.2, probs = 0.025), 2)` and `r round(quantile(d.g1.2, probs = 0.975), 2)`.
How strongly do the data support the hypothesis that the mean of group 2 is larger than the mean of group 1? To answer this question we calculate the proportion of the draws from the joint posterior distribution for which the mean of group 2 is larger than the mean of group 1.
```{r}
sum(m.g2 > m.g1) / nsim
```
This means that in all of the `r nsim %>% as.integer` simulations from the joint posterior distribution, the mean of group 2 was larger than the mean of group 1. Therefore, there is a very high probability (i.e., it is close to 1; because probabilities are never exactly 1, we write >0.999) that the mean of group 2 is larger than the mean of group 1.
## Other variants of normal linear models: Two-way ANOVA, analysis of covariance and multiple regression
Up to now, we introduced normal linear models with one predictor only. We can add more predictors to the model, and these can be numerical or categorical. Traditionally, models with 2 or 3 categorical predictors are called two-way or three-way ANOVA, respectively. Models with a mixture of categorical and numerical predictors are called ANCOVA, and models containing only numerical predictors are called multiple regressions. Nowadays, we only use the term "normal linear model" as an umbrella term for all these types of models.
While it is easy to add additional predictors in the R formula of the model, it becomes more difficult to interpret the coefficients of such multi-dimensional models. Two important topics arise with multi-dimensional models: *interactions* and *partial effects*. We dedicate a separate section to partial effects (Section \@ref(collinearity)) and introduce interactions in this chapter using two examples. The first is a model including two categorical predictors, and the second is a model with one categorical and one numeric predictor.
### Linear model with two categorical predictors (two-way ANOVA)
In the first example, we ask how large the differences in wing length are between age and sex classes of the Coal tit *Periparus ater*. Wing lengths were measured on 19 Coal tit museum skins with known sex and age class.
```{r}
data(periparusater)
dat <- as_tibble(periparusater) # give the data a short handy name
dat$age <- recode_factor(dat$age, "4"="adult", "3"="juvenile") # replace EURING code
dat$sex <- recode_factor(dat$sex, "2"="female", "1"="male") # replace EURING code
```
To describe differences in wing length between the age classes or between the sexes a normal linear model with two categorical predictors is fitted to the data. The two predictors are specified on the right side of the model formula separated by the "+" sign, which means that the model is an additive combination of the two effects (as opposed to an interaction, see following).
```{r}
mod <- lm(wing ~ sex + age, data=dat)
```
After having seen that the residual distribution does not appear to violate the model assumptions (as assessed with diagnostic residual plots, see Chapter \@ref(residualanalysis)), we can draw inferences. We first have a look at the model parameter estimates:
```{r}
mod
summary(mod)$sigma
```
R has taken the first level of the factors age and sex (as defined in the data.frame dat) as the reference levels. The intercept is the expected wing length for individuals having the reference level in age and sex, thus adult females. The other two parameters provide estimates of what is to be added to the intercept to get the expected wing length for the other levels. The parameter `sexmale` is the average difference between females and males. We can conclude that males have, on average, a `r round(coef(mod)[2],1)` mm longer wing than females. Similarly, the parameter `agejuvenile` measures the difference between the age classes, and we can conclude that, on average, juveniles have a `r abs(round(coef(mod)[3],1))` mm shorter wing than adults. When we insert the parameter estimates into the model formula, we get the recipe for calculating the expected value for each age and sex combination:
$\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}I(sex=male) + \hat{\beta_2}I(age=juvenile)$ which yields
$\hat{y_i}$ = `r round(coef(mod)[1],1)` $+$ `r round(coef(mod)[2],1)` $I(sex=male) +$ `r round(coef(mod)[3],1)` $I(age=juvenile)$.
Alternatively, we could use matrix notation. We construct a new data set that contains one virtual individual for each age and sex class.
```{r wingnd}
newdat <- tibble(expand.grid(sex=factor(levels(dat$sex)),
age=factor(levels(dat$age))))
# expand.grid creates a data frame with all combination of values given
newdat
newdat$fit <- predict(mod, newdata=newdat) # fast way of getting fitted values
# or
Xmat <- model.matrix(~sex+age, data=newdat) # creates a model matrix
newdat$fit <- Xmat %*% coef(mod)
```
For this new data set the model matrix contains four rows (one for each combination of age class and sex) and three columns. The first column contains only ones because the values of this column are multiplied by the intercept ($\beta_0$) in the matrix multiplication. The second column contains an indicator variable for males (so only the rows corresponding to males contain a one) and the third column has ones for juveniles.
\begin{align}
\hat{y} =
\boldsymbol{X \hat{\beta}} =
\begin{pmatrix}
1 & 0 & 0 \\
1 & 1 & 0 \\
1 & 0 & 1 \\
1 & 1 & 1 \\
\end{pmatrix}
\times
\begin{pmatrix}
`r round(coef(mod)[1],1)` \\
`r round(coef(mod)[2],1)` \\
`r round(coef(mod)[3],1)`
\end{pmatrix} =
\begin{pmatrix}
`r round(newdat$fit[1],1)` \\
`r round(newdat$fit[2],1)` \\
`r round(newdat$fit[3],1)` \\
`r round(newdat$fit[4],1)`
\end{pmatrix} =
\boldsymbol{\mu}
(\#eq:lmmatrixwing)
\end{align}
The result of the matrix multiplication is a vector containing the expected wing length for adult and juvenile females and adult and juvenile males.
When creating the model matrix with `model.matrix`, care has to be taken that the columns in the model matrix match the parameters in the vector of model coefficients. This requires that the formula given to `model.matrix` is identical to the formula of the fitted model (same order of terms!), and that the factors in `newdat` have the same levels, in the same order, as in the data the model was fitted to.
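A quick programmatic check guards against such a mismatch; the following line (a defensive sketch) stops with an error if the columns of the model matrix and the model coefficients do not line up:
```{r}
# the column names of the model matrix must match the coefficient names
stopifnot(identical(colnames(Xmat), names(coef(mod))))
```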
To describe the uncertainty of the fitted values, we use 2000 sets of parameter values of the joint posterior distribution to obtain 2000 values for each of the four fitted values. These are stored in the object "fitmat". In the end, we extract for every fitted value, i.e., for every row in fitmat, the 2.5% and 97.5% quantiles as the lower and upper limits of the 95% uncertainty interval.
```{r }
nsim <- 2000
bsim <- sim(mod, n.sim=nsim)
fitmat <- matrix(ncol=nsim, nrow=nrow(newdat))
for(i in 1:nsim) fitmat[,i] <- Xmat %*% coef(bsim)[i,]
newdat$lwr <- apply(fitmat, 1, quantile, probs=0.025)
newdat$upr <- apply(fitmat, 1, quantile, probs=0.975)
```
```{r fgwingpa, fig.asp=0.45, fig.cap="Wing length measurements on 19 museum skins of coal tits per age class and sex. Fitted values are from the additive model (black triangles) and from the model including an interaction (black dots). Vertical bars = 95% uncertainty intervals."}
dat$sexage <- factor(paste(dat$sex, dat$age))
newdat$sexage <- factor(paste(newdat$sex, newdat$age))
dat$pch <- 21
dat$pch[dat$sex=="male"] <- 22
dat$col="blue"
dat$col[dat$age=="adult"] <- "orange"
par(mar=c(4,4,0.5,0.5))
plot(wing~jitter(as.numeric(sexage), amount=0.05), data=dat, las=1,
ylab="Wing length (mm)", xlab="Sex and age", xaxt="n", pch=dat$pch,
bg=dat$col, cex.lab=1.2, cex=1, cex.axis=1, xlim=c(0.5, 4.5))
axis(1, at=c(1:4), labels=levels(dat$sexage), cex.axis=1)
segments(as.numeric(newdat$sexage), newdat$lwr, as.numeric(newdat$sexage), newdat$upr, lwd=2,
lend="butt")
points(as.numeric(newdat$sexage), newdat$fit, pch=17)
```
We can see that the fitted values are not equal to the arithmetic means of the groups; this is especially clear for juvenile males. The fitted values are constrained because only three parameters were used to estimate four means. In other words, this model assumes that the age difference is equal in both sexes and, vice versa, that the difference between the sexes does not change with age. If the effect of sex changes with age, we would include an *interaction* between sex and age in the model. Including an interaction adds a fourth parameter enabling us to estimate the group means exactly. In R, an interaction is indicated with the `:` sign.
```{r}
mod2 <- lm(wing ~ sex + age + sex:age, data=dat)
# alternative formulations of the same model:
# mod2 <- lm(wing ~ sex * age, data=dat)
# mod2 <- lm(wing ~ (sex + age)^2, data=dat)
```
The formula for this model is $\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}I(sex=male) + \hat{\beta_2}I(age=juvenile) + \hat{\beta_3}I(age=juvenile)I(sex=male)$. From this formula we get the following expected values for the sexes and age classes:
for adult females: $\hat{y} = \beta_0$
for adult males: $\hat{y} = \beta_0 + \beta_1$
for juveniles females: $\hat{y} = \beta_0 + \beta_2$
for juveniles males: $\hat{y} = \beta_0 + \beta_1 + \beta_2 + \beta_3$
The interaction parameter measures how much the difference between the sexes differs between the age classes.
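The interaction parameter can also be recovered from the raw group means as a "difference of differences" (a small numerical check; compare the result with the coefficient `sexmale:agejuvenile` of `mod2`):
```{r}
gm <- tapply(dat$wing, list(dat$age, dat$sex), mean)  # group means (age x sex)
(gm["juvenile", "male"] - gm["juvenile", "female"]) -
  (gm["adult", "male"] - gm["adult", "female"])       # difference of the sex differences
coef(mod2)["sexmale:agejuvenile"]
```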
To obtain the fitted values the R-code above can be recycled with two adaptations. First, the model name needs to be changed to "mod2". Second, importantly, the model matrix needs to be adapted to the new model formula.
```{r}
newdat$fit2 <- predict(mod2, newdata=newdat)
bsim <- sim(mod2, n.sim=nsim)
Xmat <- model.matrix(~ sex + age + sex:age, data=newdat)
fitmat <- matrix(ncol=nsim, nrow=nrow(newdat))
for(i in 1:nsim) fitmat[,i] <- Xmat %*% coef(bsim)[i,]
newdat$lwr2 <- apply(fitmat, 1, quantile, probs=0.025)
newdat$upr2 <- apply(fitmat, 1, quantile, probs=0.975)
print(newdat[,c(1:5,7:9)], digits=3)
```
These fitted values are now exactly equal to the arithmetic mean of each group.
```{r}
tapply(dat$wing, list(dat$age, dat$sex), mean) # arithmetic mean per group
```
We can also see that the uncertainty of the fitted values is larger for the model with an interaction than for the additive model. This is because, in the model including the interaction, an additional parameter has to be estimated based on the same amount of data. Therefore, the information available per parameter is smaller than in the additive model. In the additive model, some information is pooled between the groups by making the assumption that the difference between the sexes does not depend on age.
The degree to which a difference in wing length is ‘important’ depends on the context of the study. Here, for example, we could consider effects of wing length on flight energetics and maneuverability or methodological aspects like measurement error. Mean between-observer difference in wing length measurement is around 0.3 mm [@Jenni.1989]. Therefore, we may consider that the interaction is important because its point estimate is larger than 0.3 mm.
```{r}
mod2
summary(mod2)$sigma
```
Further, we think a difference of 1 mm in wing length may be relevant compared to the among-individual variation of which the standard deviation is around 2 mm. Therefore, we report the parameter estimates of the model including the interaction together with their uncertainty intervals.
```{r sumtabpa, results="asis", echo=FALSE}
tab <- data.frame(Parameter = names(coef(mod2)),
                  Estimate = coef(mod2),
                  lwr = apply(bsim@coef, 2, quantile, prob=0.025),
                  upr = apply(bsim@coef, 2, quantile, prob=0.975))
knitr::kable(tab, caption="Parameter estimates of the model for wing length of Coal tits with 95% uncertainty interval.", row.names=FALSE, digits=2)
```
```
From these parameters we obtain an estimated difference in wing length between the sexes for adults of `r round(coef(mod2)[2], 1)` mm. The posterior probability of the hypothesis that males have an average wing length that is at least 1 mm larger than that of females is `mean(bsim@coef[,2]>1)`, which is `r round(mean(bsim@coef[,2]>1),2)`. Thus, there is some evidence that adult Coal tit males have substantially longer wings than adult females in these data. However, we do not draw further conclusions on other differences from these data because statistical uncertainty is large due to the low sample size.
### A linear model with a categorical and a numeric predictor (ANCOVA)
An analysis of covariance, ANCOVA, is a normal linear model that contains at least one factor and one continuous variable as predictor variables. The continuous variable is also called a covariate, hence the name analysis of covariance. An ANCOVA can be used, for example, when we are interested in how the biomass of grass depends on the distance from the surface of the soil to the ground water in two different species (*Alopecurus pratensis*, *Dactylis glomerata*). The two species were grown by @Ellenberg.1953 in tanks that showed a gradient in distance from the soil surface to the ground water. The distance from the soil surface to the ground water is used as a covariate (‘water’). We further assume that the species react differently to the water conditions. Therefore, we include an interaction between species and water. The model formula is then
$\hat{y_i} = \beta_0 + \beta_1I(species=Dg) + \beta_2water_i + \beta_3I(species=Dg)water_i$
$y_i \sim normal(\hat{y_i}, \sigma^2)$
To fit the model, it is important to first check whether the factor is indeed defined as a factor and the continuous variable contains numbers (i.e., numeric or integer values) in the data frame.
```{r}
data(ellenberg)
index <- is.element(ellenberg$Species, c("Ap", "Dg")) & complete.cases(ellenberg$Yi.g)
dat <- ellenberg[index,c("Water", "Species", "Yi.g")] # select two species
dat <- droplevels(dat)
str(dat)
```
Species is a factor with two levels and Water is an integer variable, so we are fine and we can fit the model
```{r}
mod <- lm(log(Yi.g) ~ Species + Water + Species:Water, data=dat)
# plot(mod) # 4 standard residual plots
```
We log-transform the biomass so that the residuals are closer to normally distributed; the normal distribution assumption is then met reasonably well. However, a slight banana-shaped relationship exists between the residuals and the fitted values, indicating a slight non-linear relationship between biomass and water. Further, the residuals showed substantial autocorrelation because the grass biomass was measured in different tanks: measurements from the same tank were more similar than measurements from different tanks after correcting for the distance to water. Thus, the analysis we have done here suffers from pseudoreplication. We will re-analyze the example data in a more appropriate way in Chapter \@ref(lmer).
Let's have a look at the model matrix (first and last six rows only).
```{r}
head(model.matrix(mod)) # print the first 6 rows of the matrix
tail(model.matrix(mod)) # print the last 6 rows of the matrix
```
The first column of the model matrix contains only 1s. These are multiplied by the intercept in the matrix multiplication that yields the fitted values. The second column contains the indicator variable for species *Dactylis glomerata* (Dg). Species *Alopecurus pratensis* (Ap) is the reference level. The third column contains the values for the covariate. The last column contains the product of the indicator for species Dg and water. This column specifies the interaction between species and water.
The parameters are the intercept, the difference between the species, a slope for water and the interaction parameter.
```{r}
mod
summary(mod)$sigma
```
These four parameters define two regression lines, one for each species (Figure \@ref(fig:fgbiom) Left). For Ap, it is $\hat{y_i} = \beta_0 + \beta_2water_i$, and for Dg it is $\hat{y_i} = (\beta_0 + \beta_1) + (\beta_2 + \beta_3)water_i$. Thus, $\beta_1$ is the difference in the intercept between the species and $\beta_3$ is the difference in the slope.
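The species-specific intercepts and slopes can be extracted directly from the coefficient vector (a small helper computation; the coefficient names follow R's default parameterization of this model):
```{r}
b <- coef(mod)
rbind(Ap = c(intercept = unname(b["(Intercept)"]),
             slope = unname(b["Water"])),
      Dg = c(intercept = unname(b["(Intercept)"] + b["SpeciesDg"]),
             slope = unname(b["Water"] + b["SpeciesDg:Water"])))
```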
```{r, fgbiom, fig.asp=0.45, echo=FALSE, fig.cap="Aboveground biomass (g, log-transformed) in relation to distance to ground water and species (two grass species). Fitted values from a model including an interaction species x water (left) and a model without interaction (right) are added. The dotted line indicates water=0."}
par(mfrow=c(1,2), mgp=c(2, 0.5, 0), mar=c(5.5,2,2,0.5), oma=c(0,2,0,0))
plot(dat$Water, log(dat$Yi.g), col=c("orange", "blue")[as.numeric(dat$Species)],
las=1, ylab=NA,
xlab="Average distance to ground water (cm)", pch=c(16, 17)[as.numeric(dat$Species)])
mtext("log(Yi.g) ~ Species * Water", side=3, adj=0)
mtext(expression(paste("Above-ground biomass ", (g/m^2), " [log]")),
side=2, outer=TRUE)
# insert regression lines per species
abline(coef(mod)[1], coef(mod)[3], lwd=2, col="orange")
abline(coef(mod)[1]+coef(mod)[2], coef(mod)[3]+coef(mod)[4], lwd=2, col="blue")
abline(v=0, lty=3)
legend(0, -1.2, pch=c(16, 17), lty=1, lwd=2, col=c("orange", "blue"),
legend=c("Species Ap", "Species Dg"), horiz=TRUE, xpd=NA, bty="n")
mod2 <- lm(log(Yi.g)~Species + Water, data=dat)
# visualise the model
plot(dat$Water, log(dat$Yi.g), col=c("orange", "blue")[as.numeric(dat$Species)],
las=1, ylab=NA, yaxt="n",
xlab="Average distance to ground water (cm)", pch=c(16, 17)[as.numeric(dat$Species)])
mtext("log(Yi.g) ~ Species + Water", side=3, adj=0)
abline(coef(mod2)[1], coef(mod2)[3], lwd=2, col="orange")
abline(coef(mod2)[1]+coef(mod2)[2], coef(mod2)[3], lwd=2, col="blue")
abline(v=0, lty=3)
```
As a consequence of including an interaction in the model, the interpretation of the main effects becomes difficult. From the above model output, we read that the intercept of the species Dg is lower than the intercept of the species Ap. However, from a graphical inspection of the data, we would expect that the average biomass of species Dg is higher than that of species Ap. The estimated main effect of species is counter-intuitive because it is measured where water is zero (i.e., it is the difference in the intercepts, not the difference between the mean biomasses of the species). Therefore, the main effect of species in the above model does not have a biologically meaningful interpretation. We have two possibilities to get a meaningful species effect. First, we could delete the interaction from the model (Figure \@ref(fig:fgbiom) Right). Then the difference in the intercepts reflects an average difference between the species. However, the fit of such an additive model is much worse than that of the model with the interaction, and an average difference between the species may not make much sense because this difference depends strongly on water. Therefore, we prefer to use a model including the interaction and may opt for the second possibility. Second, we could move the location where water equals 0 to the center of the data by transforming, specifically centering, the variable water: $water.c = water - mean(water)$. When the predictor variable (water) is centered, the species main effect (the difference in the intercepts) corresponds to the difference in fitted values measured in the center of the data.
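A minimal sketch of the centering approach (the variable name `Water.c` is introduced here only for illustration):
```{r}
dat$Water.c <- dat$Water - mean(dat$Water, na.rm = TRUE)  # center the covariate
mod.c <- lm(log(Yi.g) ~ Species + Water.c + Species:Water.c, data = dat)
coef(mod.c)["SpeciesDg"]  # species difference at the average distance to ground water
```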
For drawing biological conclusions from these data, we refer to Chapter \@ref(lmer), where we use a more appropriate model.
## Partial coefficients and some comments on collinearity {#collinearity}
Many biologists think that it is forbidden to include correlated predictor variables in a model. They use variance inflation factors (VIF) to omit some of the variables. However, omitting important variables from the model just because a correlation coefficient exceeds a threshold value can have undesirable effects. Here, we explain why, present the usefulness and limits of partial coefficients (also called partial correlations or partial effects), and then give some guidelines on how to deal with collinearity.
As an example, we look at hatching dates of Snowfinches and how these dates relate to the date when snow melt started (the first date in the season when a minimum of 5% of the ground is snow free). A thorough analysis of the data is presented by @Schano.2021. An important question is how well Snowfinches can adjust their hatching dates to the snow conditions. For Snowfinches, it is important to raise their nestlings during snow melt: nestlings grow faster when they are reared during the snow melt than after the snow has completely melted, because their parents find nutrient-rich insect larvae at the edges of melting snow patches.
```{r sfdat}
load("RData/snowfinch_hatching_date.rda")
# Pearson's correlation coefficient
cor(datsf$elevation, datsf$meltstart, use = "pairwise.complete")
mod <- lm(meltstart~elevation, data=datsf)
100*coef(mod)[2] # change in meltstart with 100m change in elevation
```
Hatching dates of Snowfinch broods were inferred from citizen science data from the Alps, where snow melt starts later at higher elevations than at lower elevations. Thus, the start of snow melt is correlated with elevation (Pearson's correlation coefficient `r round(cor(datsf$elevation, datsf$meltstart, use = "pairwise.complete") ,2)`). On average, snow starts melting `r round(100*coef(mod)[2])` days later with every 100m increase in elevation.
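As an aside, with only two predictors the variance inflation factor mentioned above reduces to $1/(1-r^2)$, where $r$ is the correlation between the two predictors. A quick sketch (not needed for the following analyses):
```{r}
r.elevmelt <- cor(datsf$elevation, datsf$meltstart, use="pairwise.complete")
1/(1 - r.elevmelt^2)  # VIF when elevation and meltstart are both used as predictors
```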
```{r sfex1}
mod1 <- lm(hatchday.mean~meltstart, data=datsf)
mod1
```
From a normal linear regression of hatching date on the date of snow melt, we obtain an estimate of `r round(coef(mod1)[2],2)` days delay in hatching date per one day later snow melt. This effect size describes the relationship in data that were collected along an elevational gradient. Along this gradient many other factors change as well, such as average temperature, air pressure or sun radiation, and all of them may influence the birds' decision to start breeding. Consequently, from the raw correlation between hatching date and start of snow melt we cannot conclude how Snowfinches react to changes in the start of snow melt, because the correlation seen in the data may be caused by other factors changing with elevation (such a correlation is called a "pseudocorrelation"). However, we are interested in the correlation between hatching date and date of snow melt independent of other factors changing with elevation. In other words, we would like to measure how much, on average, hatching date is delayed when snow melt starts one day later while all other factors are kept constant. This is called the partial effect of snow melt date. To estimate it, we include elevation as a covariate in the model.
```{r sfex}
library(arm)
mod <- lm(hatchday.mean~elevation + meltstart, data=datsf)
mod
```
From this model, we obtain an estimate of `r round(coef(mod)["meltstart"],2)` days delay in hatching date per one day later snow melt at a given elevation. At a given elevation, the roughly one month difference in snow melt date between early and late years thus translates into a difference in hatching date of `r round(30*coef(mod)["meltstart"],2)` days (Figure \@ref(fig:sfexfig)). We further obtain an estimate of `r round(100*coef(mod)["elevation"],2)` days later hatching for each 100m increase in elevation. Thus, a snow melt that starts `r round(100*coef(mod)["elevation"]/coef(mod)["meltstart"],2)` days later delays average hatching date by about as much as a 100m increase in elevation does.
When we estimate the coefficient within a constant elevation (coloured regression lines in Figure \@ref(fig:sfexfig)), it is lower than the raw correlation and closer to a causal relationship, because it is corrected for elevation. However, in observational studies we can never be sure whether the partial coefficients can be interpreted as a causal relationship unless we include all factors that influence hatching date. Nevertheless, partial effects give much more insight into a system than univariate analyses because we can separate the effects of simultaneously acting variables (that we have measured). The result indicates that Snowfinches may not react very sensitively to variation in the timing of snow melt, whereas at higher elevations they clearly breed later than at lower elevations.
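The meaning of "partial" can also be seen as follows: the same coefficient is obtained when we first remove the (linear) effect of elevation from both hatching date and snow melt date and then regress the residuals on each other (the Frisch-Waugh-Lovell theorem). A minimal sketch, restricted to the complete cases of the three variables so that the same observations are used as in `mod`:
```{r}
cc <- complete.cases(datsf[, c("hatchday.mean", "meltstart", "elevation")])
d <- datsf[cc, ]
r.hatch <- resid(lm(hatchday.mean ~ elevation, data=d))  # hatching date, elevation effect removed
r.melt <- resid(lm(meltstart ~ elevation, data=d))       # snow melt date, elevation effect removed
coef(lm(r.hatch ~ r.melt))[2]  # numerically the same as coef(mod)["meltstart"]
```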
```{r sfexfig, echo=FALSE, fig.cap="Illustration of the partial coefficient of snow melt date in a model of hatching date. Panel A shows the entire raw data together with the regression lines drawn for three different elevations. The regression lines span the range of snow melt dates occurring at the respective elevation (shown in panel C). Panel B is the same as panel A, but zoomed in to better see the regression lines and with an additional regression line (in black) from the model that does not take elevation into account."}
nsim <- 5000
bsim <- sim(mod, n.sim=nsim)
bsim1 <- sim(mod1, n.sim=nsim)
newdat <- data.frame(meltstart=c(60:130, 70:150, 100:160),
elevation=rep(c(2000, 2400, 2800),
times=c(length(60:130), length(70:150),length(100:160))))
Xmat <- model.matrix(~elevation + meltstart, data=newdat)
Xmat1 <- model.matrix(~meltstart, data=newdat)
fitmat <- matrix(ncol=nsim, nrow=nrow(newdat))
fitmat1 <- matrix(ncol=nsim, nrow=nrow(newdat))
for(i in 1:nsim){
fitmat[,i] <- Xmat%*%bsim@coef[i,]
fitmat1[,i] <- Xmat1%*%bsim1@coef[i,]
}
newdat$fit <- apply(fitmat, 1, mean)
newdat$lwr <- apply(fitmat, 1, quantile, probs=0.025)
newdat$upr <- apply(fitmat, 1, quantile, probs=0.975)
newdat$fit1 <- apply(fitmat1, 1, mean)
newdat$lwr1 <- apply(fitmat1, 1, quantile, probs=0.025)
newdat$upr1 <- apply(fitmat1, 1, quantile, probs=0.975)
par(mfrow=c(2,2), mar=c(3.4,3,1,1), mgp=c(2,0.5,0))
colkey <- c("orange", "brown", "blue")
plot(datsf$meltstart, datsf$hatchday.mean, pch=16, col=rgb(0.6,0,0.9,0.5),
xlab="Start day of snowmelt [d since 1/1]",
ylab="Hatching date [d since 1/1]", las=1, cex=0.5)
text(55, 225, "A", adj=c(0,0))
index <- newdat$elevation==2000
lines(newdat$meltstart[index], newdat$fit[index], lwd=2, col=colkey[1])
lines(newdat$meltstart[index], newdat$lwr[index], lwd=1, lty=3, col=colkey[1])
lines(newdat$meltstart[index], newdat$upr[index], lwd=1, lty=3, col=colkey[1])
index <- newdat$elevation==2400
lines(newdat$meltstart[index], newdat$fit[index], lwd=2, col=colkey[2])
lines(newdat$meltstart[index], newdat$lwr[index], lwd=1, lty=3, col=colkey[2])
lines(newdat$meltstart[index], newdat$upr[index], lwd=1, lty=3, col=colkey[2])
index <- newdat$elevation==2800
lines(newdat$meltstart[index], newdat$fit[index], lwd=2, col=colkey[3])
lines(newdat$meltstart[index], newdat$lwr[index], lwd=1, lty=3, col=colkey[3])
lines(newdat$meltstart[index], newdat$upr[index], lwd=1, lty=3, col=colkey[3])
plot(datsf$elevation, datsf$meltstart, pch=16, col=rgb(0,0,0,0.5),
xlab="Elevation [m asl]", ylab="Start of snowmelt [d since 1/1]", las=1)
text(1500, 150, "C", adj=c(0,0))
segments(2000, 60, 2000, 130, lwd=2, col=colkey[1])
segments(2400, 70, 2400, 150, lwd=2, col=colkey[2])
segments(2800, 100, 2800, 160, lwd=2, col=colkey[3])
plot(datsf$meltstart, datsf$hatchday.mean, pch=16, col=rgb(0.6,0,0.9,0.5),
xlab="Start day of snowmelt [d since 1/1]", ylab="Hatching date [d since 1/1]",
las=1, cex=0.5, ylim=c(165, 185))
text(55, 183.5, "B", adj=c(0,0))
index <- newdat$elevation==2000
lines(newdat$meltstart[index], newdat$fit[index], lwd=2, col=colkey[1])
lines(newdat$meltstart[index], newdat$lwr[index], lwd=1, lty=3, col=colkey[1])
lines(newdat$meltstart[index], newdat$upr[index], lwd=1, lty=3, col=colkey[1])
index <- newdat$elevation==2400
lines(newdat$meltstart[index], newdat$fit[index], lwd=2, col=colkey[2])
lines(newdat$meltstart[index], newdat$lwr[index], lwd=1, lty=3, col=colkey[2])
lines(newdat$meltstart[index], newdat$upr[index], lwd=1, lty=3, col=colkey[2])
index <- newdat$elevation==2800
lines(newdat$meltstart[index], newdat$fit[index], lwd=2, col=colkey[3])
lines(newdat$meltstart[index], newdat$lwr[index], lwd=1, lty=3, col=colkey[3])
lines(newdat$meltstart[index], newdat$upr[index], lwd=1, lty=3, col=colkey[3])
index <- order(newdat$meltstart)
lines(newdat$meltstart[index], newdat$fit1[index], lwd=2)
lines(newdat$meltstart[index], newdat$lwr1[index], lwd=1, lty=3)
lines(newdat$meltstart[index], newdat$upr1[index], lwd=1, lty=3)
plot(0:1, 0:1, type="n", axes=FALSE, xlab=NA, ylab=NA)
legend(-0.1,1, lwd=2, col=c(colkey, 1),
legend=c("2000 m", "2400 m", "2800 m", "not corrected for elevation"), xpd=NA)
```
We have seen that it can be very useful to include more than one predictor variable in a model even if they are correlated with each other. In fact, there is nothing wrong with that. However, correlated predictors (collinearity) make things more complicated.
For example, partial regression lines should not be drawn across the whole range of values of a variable, to avoid extrapolating beyond the data. At 2800 m asl, snow melt never starts at the beginning of March. Therefore, the blue regression line would not make sense for snow melt dates in March.
Further, correlations among predictors sometimes indicate that these predictors measure the same underlying aspect, and we are actually interested in the effect of this underlying aspect on our response. For example, we could also include the date of the end of snow melt. Both variables, the start and the end of snow melt, measure the timing of snow melt. Including both as predictors in the model would result in partial coefficients that measure how much hatching date changes when snow melt starts one day later while the end date is held constant. That interpretation is a mixture of the effects of timing and duration rather than of snow melt timing alone. Similarly, the coefficient of the end of snow melt measures a mixture of duration and timing. Thus, if we include two variables that are correlated because they measure the same aspect (just slightly differently), we get coefficients that are hard to interpret and may not measure what we are actually interested in. In such cases, we obtain model coefficients that are easier to interpret if we include just one variable for each aspect that we are interested in, e.g., one timing variable (such as the start of snow melt) and the duration of snow melt, which may or may not be correlated with the start of snow melt.
To summarize, the decision of what to do with correlated predictors primarily depends on the question we are interested in, i.e., what exactly the partial coefficients should estimate.
A further drawback of collinearity is that model fitting can become difficult. When strong correlations are present, fitting algorithms may fail. If they do not fail, the statistical uncertainty of the estimates often becomes large. This is because the partial coefficient of one variable needs to be estimated for constant values of the other predictors in the model, which means that only a reduced range of values is available, as illustrated in Figure \@ref(fig:sfexfig) C. However, if uncertainty intervals (confidence, credible or compatibility intervals) are reported alongside the estimates, then using correlated predictors in the same model is absolutely fine, provided the fitting algorithm was successful.
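To illustrate this inflation of uncertainty, here is a small simulation sketch with made-up data (all variable names are hypothetical and unrelated to the Snowfinch example):
```{r}
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2.low <- rnorm(n)                # nearly uncorrelated with x1
x2.high <- x1 + rnorm(n, sd=0.1)  # strongly correlated with x1
y.low <- 1 + x1 + x2.low + rnorm(n)
y.high <- 1 + x1 + x2.high + rnorm(n)
summary(lm(y.low ~ x1 + x2.low))$coefficients[, "Std. Error"]
summary(lm(y.high ~ x1 + x2.high))$coefficients[, "Std. Error"]
# the standard errors of the slopes are much larger in the collinear case
```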
The correlations per se can be interesting. Further reading on how to visualize and analyse data with complex correlation structures:
- principal component analysis [@Manly.1994]
- path analyses, e.g. @Shipley.2009
- structural equation models [@Hoyle.2012]
```{r fig.align='center', echo=FALSE, fig.link=''}
knitr::include_graphics('images/ruchen.jpg', dpi = 150)
```
## Ordered Factors and Contrasts {#orderedfactors}
In this chapter, we have seen that the model matrix is an $n \times k$ matrix (with $n$ = sample size and $k$ = number of model coefficients) that is multiplied by the vector of the $k$ model coefficients to obtain the fitted values of a normal linear model. The first column of the model matrix normally contains only ones. This column is multiplied by the intercept. The other columns contain the observed values of the predictor variables if these are numeric variables, or indicator variables (= dummy variables) for factor levels if the predictors are categorical variables (= factors). For categorical variables, the model matrix can be constructed in a number of ways. How it is constructed determines how the model coefficients can be interpreted. For example, coefficients could represent differences between the means of specific factor levels and the mean of the reference level. That is what we have introduced above. However, they could also represent a linear, quadratic or cubic effect of an ordered factor. Here, we show how this works.
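Before turning to ordered factors, a toy illustration of this mechanism may help (made-up data, not one of the data sets used in this book): the model matrix contains a column of ones and a dummy variable, and multiplying it by the coefficient vector returns the fitted values.
```{r}
toy <- data.frame(y=c(3.1, 2.8, 4.0, 5.2), group=factor(c("a", "a", "b", "b")))
m <- lm(y ~ group, data=toy)
X <- model.matrix(~ group, data=toy)  # column of ones plus dummy variable "groupb"
X
all.equal(as.numeric(X %*% coef(m)), as.numeric(fitted(m)))  # TRUE
```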
An ordered factor is a categorical variable with levels that have a natural order, for example, ‘low’, ‘medium’ and ‘high’. How do we tell R that a factor is ordered? The swallow data contain a factor ‘nesting_aid’ that indicates the type of aid provided in a barn for the nesting swallows. The natural order of the levels is none < support (e.g., a wooden stick in the wall that helps support a nest built by the swallow) < artificial_nest < both (support and artificial nest). However, when we read in the data, R orders these levels alphabetically rather than according to this logical order.
```{r}
data(swallows)
levels(swallows$nesting_aid)
```
With the function `contrasts` we see how R will construct the model matrix.
```{r}
contrasts(swallows$nesting_aid)
```
R will construct three dummy variables and call them ‘both’, ‘none’, and ‘support’. The variable ‘both’ will have an entry of one when the observation is ‘both’ and zero otherwise. Similarly, the other two dummy variables are indicator variables of the other two levels, and ‘artif_nest’ is the reference level. The model coefficients can then be interpreted as the differences between ‘artif_nest’ and each of the other levels. The instructions for how to transform a factor into columns of the model matrix are called the contrasts.
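To see these dummy variables in action, we could look at the first few rows of the resulting model matrix (a quick sketch):
```{r}
head(model.matrix(~ nesting_aid, data=swallows))
```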
Now, let's bring the levels into their natural order and define the factor as an ordered factor.
```{r}
swallows$nesting_aid <- factor(swallows$nesting_aid, levels=c("none", "support", "artif_nest", "both"), ordered=TRUE)
levels(swallows$nesting_aid)
```
The levels are now in the natural order. R will, from now on, use this order for analyses, tables, and plots, and because we defined the factor to be an ordered factor, R will use polynomial contrasts:
```{r}
contrasts(swallows$nesting_aid)
```
When using polynomial contrasts, R will construct three (= number of levels minus one) variables that are called ‘.L’, ‘.Q’, and ‘.C’ for the linear, quadratic and cubic effects. The contrast matrix defines which numeric value will be inserted in each of the three corresponding columns of the model matrix for each observation; for example, an observation with ‘support’ in the factor ‘nesting_aid’ will get the values -0.224, -0.5 and 0.671 in the columns L, Q and C of the model matrix. These contrasts define yet another way to obtain 4 different group means:
$m_1 = \beta_0 - 0.671 \beta_1 + 0.5 \beta_2 - 0.224 \beta_3$
$m_2 = \beta_0 - 0.224 \beta_1 - 0.5 \beta_2 + 0.671 \beta_3$
$m_3 = \beta_0 + 0.224 \beta_1 - 0.5 \beta_2 - 0.671 \beta_3$
$m_4 = \beta_0 + 0.671 \beta_1 + 0.5 \beta_2 + 0.224 \beta_3$
The group means are the same, independent of whether a factor is defined as ordered or not. The ordering also has no effect on the variance that is explained by the factor ‘nesting_aid’ or on the overall model fit. Only the model coefficients and their interpretation depend on whether a factor is defined as ordered or not. When we define a factor as ordered, the coefficients can be interpreted as linear, quadratic, cubic, or higher-order polynomial effects. The number of polynomials will always be the number of factor levels minus one (unless the intercept is omitted from the model, in which case it is the number of factor levels). Linear, quadratic, and further polynomial effects are normally more interesting for ordered factors than single differences from a reference level because linear and polynomial trends tell us something about consistent changes in the outcome along the ordered factor levels. Therefore, an ordered factor with k levels is treated like a covariate consisting of the centered level numbers (-1.5, -0.5, 0.5, 1.5 in our case with four levels), and k-1 orthogonal polynomials of this covariate are included in the model. Thus, if we have an ordered factor A with three levels, `y~A` yields the same fitted values as `y~x+I(x^2)`, with x=-1 for the lowest, x=0 for the middle and x=1 for the highest level.
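This equivalence can be checked with a toy example (made-up data; the factor levels and the response are hypothetical):
```{r}
set.seed(2)
A <- factor(rep(c("low", "medium", "high"), each=10),
            levels=c("low", "medium", "high"), ordered=TRUE)
x <- rep(c(-1, 0, 1), each=10)  # centered level numbers
y <- 2 + 0.8*x - 0.3*x^2 + rnorm(30, sd=0.5)
# the coefficients differ (orthogonal vs. raw polynomials), but the fitted values are identical
all.equal(fitted(lm(y ~ A)), fitted(lm(y ~ x + I(x^2))))
```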
Note that it is also possible to define our own contrasts if we are interested in specific differences or trends. However, it is not trivial to find meaningful and orthogonal (= uncorrelated) contrasts.
## Quadratic and Higher Polynomial Terms {#polynomials}
The straight regression line for the biomass of the grass species Ap *Alopecurus pratensis* dependent on the distance to the ground water does not fit well (Figure \@ref(fig:fgbiom)). The residuals at low and high values of water tend to be positive, whereas intermediate water levels are associated with negative residuals. This points to a possible violation of the model assumptions.
The problem is that the relationship between distance to water and biomass of species Ap is not linear. In real life, we often find non-linear relationships, but if the shape of the relationship is quadratic (plus, potentially, a few more polynomials) we can still use ‘linear modeling’ (the term ‘linear’ refers to the function being linear in the model coefficients: $f(x) = \beta_0 + \beta_1x + \beta_2x^2$ is linear in the $\beta$ coefficients, whereas, e.g., $f(x) = \beta^x$ is not). We simply add the quadratic term of the predictor variable, that is, water in our example, as a further predictor in the linear predictor:
$\hat{y}_i = \beta_0 + \beta_1 water_i + \beta_2 water_i^2$.
A quadratic term can be fitted in R using the function `I()`, which tells R that we want the squared values of distance to water. If we do not use `I()`, the `^2` is interpreted as formula syntax (crossing of terms) rather than squaring, so `Water^2` on its own would be reduced to `Water`. The model specification is then `lm(log(Yi.g) ~ Water + I(Water^2), data=...)`.
The cubic term would be added by `+I(Water^3)`.
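We can verify this behaviour of the formula syntax directly (a quick check; no data are needed because `terms` only parses the formula):
```{r}
attr(terms(log(Yi.g) ~ Water + Water^2), "term.labels")     # "Water" only
attr(terms(log(Yi.g) ~ Water + I(Water^2)), "term.labels")  # "Water" and "I(Water^2)"
```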
As with interactions, a polynomial term changes the interpretation of the lower-order polynomials. Therefore, we normally include all polynomials up to a specific degree. Furthermore, polynomials are normally correlated (unless a special transformation is used, see below), which can cause problems when fitting the model, such as non-convergence. To avoid collinearity among polynomials, so-called orthogonal polynomials can be used. These are polynomials that are uncorrelated. To that end, we can use the function `poly`, which creates as many orthogonal polynomials of the variable as we want:
`poly(dat$Water, 2)` creates two columns: the first can be used to model the linear effect of water, the second to model the quadratic term of water:
```{r}
t.poly <- poly(dat$Water, 2)
dat$Water.l <- t.poly[,1] # linear term for water
dat$Water.q <- t.poly[,2] # quadratic term for water
mod <- lm(log(Yi.g) ~ Water.l + Water.q, data=dat)
```
When orthogonal polynomials are used, the estimated linear and quadratic effects can be interpreted as purely linear and purely quadratic influences of the predictor on the outcome. The function `poly` applies a specific transformation to the original variable. To reproduce the transformation (e.g., to obtain the corresponding orthogonal polynomials for new data used to draw an effect plot), the function `predict` can be used with the poly-object created from the original data.
```{r}
newdat <- data.frame(Water = seq(0,130))
# transformation analogous to the one used to fit the model:
newdat$Water.l <- predict(t.poly, newdat$Water)[,1]
newdat$Water.q <- predict(t.poly, newdat$Water)[,2]
```
These transformed variables can then be used to calculate fitted values that correspond to the water values specified in the new data.
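For example, fitted values for the new data could be obtained as follows (a minimal sketch; `predict` works here because `newdat` contains the columns `Water.l` and `Water.q` used in the model formula):
```{r}
newdat$fit <- predict(mod, newdata=newdat)  # fitted values on the log scale
head(newdat)
```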