# Design-based, model-based, and model-assisted approach for sampling and inference {#Approaches}
Section \@ref(DBvsMB) already mentioned the design-based and the model-based approach for sampling and statistical inference. In this chapter, the fundamental differences between these two approaches are explained in more detail. Several misconceptions about the design-based approach\index{Design-based approach} for sampling and statistical inference, based on classical sampling theory\index{Classical sampling theory}, seem to be quite persistent. These misconceptions are the result of confusion about basic statistical concepts such as independence, expectation, and bias and variance of estimators or predictors. These concepts have a different meaning in the design-based and the model-based approach\index{Model-based approach}. Moreover, a population mean is still often confused with a model-mean, and a population variance with a model-variance, leading to invalid formulas for the sampling variance of an estimator of the population mean. The fundamental differences between these two approaches are illustrated with simulations, in the hope that this gives a better understanding of the subject. In addition, the difference between model-dependent inference (as used in the model-based approach) and model-assisted inference is explained. This chapter has been published as part of a journal paper, see @Brus2021.
## Two sources of randomness
In my classes about spatial sampling, I ask the participants the following question. Suppose we have measurements of a soil property, for instance soil organic carbon content, at two locations separated by 20 cm. Do you think these two measurements are correlated? I ask them to vote for one of three answers:
1. yes, they are (>80\% confident);
2. no, they are not (>80\% confident); or
3. I do not know.
Most students vote for answer 1, the others for answer 3; hardly anyone votes for answer 2. Then I explain that it is impossible to say which answer is correct, simply because correlation requires two series of data, not just two numbers. The question then is how to generate two series of data. We need some random process\index{Random process} for this. This random process differs between the design-based and the model-based approach.
In the design-based approach, the random process is the random selection of sampling units, whereas in the model-based approach randomness is introduced via the statistical model of the spatial variation (Table \@ref(tab:approach)). So, the design-based approach requires probability sampling, i.e., random sampling using a random number generator\index{Random number generator}, in such a way that all population units have a positive probability of being included in the sample and that these inclusion probabilities are known for at least the selected population units [@sar92]. In theory, a probability sampling\index{Probability sampling} design can be used to generate an infinite number of samples, although in practical applications only one sample is selected.
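As a minimal illustration of known and positive inclusion probabilities, consider simple random sampling without replacement. The sketch below (the population size, sample size, and number of repetitions are arbitrary choices) checks empirically that every unit has inclusion probability $n/N$.

```{r, eval = FALSE}
# Sketch: under simple random sampling without replacement of n out of
# N units, every unit has inclusion probability n / N. N, n, and the
# number of repetitions are arbitrary choices for illustration.
N <- 400; n <- 10
counts <- numeric(N)
for (i in 1:10000) {
  units <- sample(N, size = n)
  counts[units] <- counts[units] + 1
}
range(counts / 10000)  # all relative frequencies close to n / N = 0.025
```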
The spatial variation model\index{Spatial variation model} used in the model-based approach contains two terms, one for the mean (deterministic part) and one for the error with a specified probability distribution. For instance, Equation \@ref(eq:OKmodel) in Chapter \@ref(Introkriging) describes the model used in ordinary kriging. This model can be used to simulate an infinite number of spatial populations. All these populations together are referred to as a superpopulation [@sar92; @loh99]. Depending on the model of spatial variation, the simulated populations may show spatial structure\index{Spatial structure} because the mean is a function of covariates, as in kriging with an external drift, and/or because the errors are spatially autocorrelated. A superpopulation\index{Superpopulation} is a construct; the populations do not exist in the real world. The populations are similar, but not identical. For instance, the mean differs among the populations. The expectation of the population mean, i.e., the average over all possible simulated populations, equals the superpopulation mean\index{Superpopulation mean}, commonly referred to as the model-mean\index{Model-mean}, parameter $\mu$ in Equation \@ref(eq:OKmodel). The variance also differs among the populations. Unlike the mean, the average of the population variance over all populations generally is not equal to the model-variance, parameter $\sigma^2$ in Equation \@ref(eq:OKmodel), but smaller. I will come back to this later. The differences between the simulated spatial populations illustrate our uncertainty about the spatial variation of the study variable in the population that is sampled or will be sampled.
In the design-based approach, only one population is considered, the one sampled, but the statistical inference is based on all samples that can be generated by a probability sampling. The top row of Figure \@ref(fig:plotsimulationsDBMB) shows five simple random samples of size ten. The population is the same in all plots. Proponents of the design-based approach do not like to consider other populations than the one sampled. Their challenge is to characterise this one population from a probability sample.
Conversely, in the model-based approach only one sample is considered, but the statistical inference is based on all populations that can be generated with the spatial variation model. Proponents of the model-based approach do not like to consider other samples than the one selected. Their challenge is to get the most out of the sample that is selected. The bottom row of Figure \@ref(fig:plotsimulationsDBMB) shows a spatial coverage sample, superimposed on five different populations simulated with an ordinary kriging model, using a spherical semivariogram with a nugget of 0.1, a partial sill of 0.6, and a range of 75 m. Note that in the model-based approach there is no need to select a probability sample (see Table \@ref(tab:approach)); there are no requirements on how the units are selected.
```{r, echo=FALSE}
field <- read_sf(system.file("extdata/leest.gpkg", package = "sswr")) %>%
st_set_crs(NA_crs_)
grdField <- st_make_grid(field, cellsize = 1, what = "centers")
grdField <- grdField[field] %>%
st_coordinates() %>%
as_tibble() %>%
mutate(
x1 = X / 1000 - 597,
x2 = Y / 1000 - 5654,
dummy = 1
)
coordinates(grdField) <- c("x1", "x2")
#simulate 5 fields
set.seed(314)
# unconditional Gaussian simulation by means of simple kriging
vgmodel <- vgm(model = "Sph", nugget = 0.1, psill = 0.6, range = 0.075)
simulation <- krige(
formula = dummy ~ 1,
locations = grdField,
newdata = grdField,
model = vgmodel,
nmax = 100,
nsim = 5,
beta = 0, #beta is model-mean
dummy = TRUE, #this is to enforce unconditional simulation
debug.level = 0
)
df_grdField <- as.data.frame(grdField)
cnst <- 5
df_grdField[["pop1"]] <- simulation[[1]] + cnst
df_grdField[["pop2"]] <- simulation[[2]] + cnst
df_grdField[["pop3"]] <- simulation[[3]] + cnst
df_grdField[["pop4"]] <- simulation[[4]] + cnst
df_grdField[["pop5"]] <- simulation[[5]] + cnst
#select 5 simple random samples of 10 points
set.seed(314)
n <- 10
sampleId <- sample(x = seq_len(nrow(df_grdField)), size = 5 * n)
samplesdf <- df_grdField[sampleId, c("x1","x2")]
samplesdf$sam <- rep(x = paste("sam", 1:5, sep = ""), each = n)
samplesdf$sam <- factor(x = samplesdf$sam, levels = paste("sam", 1:5, sep = ""), ordered = TRUE)
#select a spatial coverage sample of 10 points
myStrata <- field %>%
as_Spatial %>%
stratify(nStrata = 10, cellSize = 2, equalArea = FALSE, nTry = 10)
mySCsample <- spsample(myStrata) %>%
as("data.frame")
mySCsample$x1 <- mySCsample$x1 / 1000 - 597
mySCsample$x2 <- mySCsample$x2 / 1000 - 5654
```
```{r plotsimulationsDBMB, echo = FALSE, out.width = "100%", fig.asp = 0.4, fig.cap = "Random process considered in the design-based (top row) and the model-based approach (bottom row). The design-based approach considers only the sampled population, but all samples that can be generated by the sampling design. The model-based approach considers only the selected sample, but all populations that can be generated by the model."}
#Plot 5 times same simulated field 1 and 5 SI samples
plt1 <- ggplot(data = samplesdf) +
geom_tile(data = df_grdField, mapping = aes(x = x1, y = x2, fill = pop1)) +
geom_point(mapping = aes(x = x1, y = x2), size = 1.5, colour = "red") +
scale_x_continuous(name = "", breaks = NULL) +
scale_y_continuous(name = "", breaks = NULL) +
scale_fill_viridis_c(name = "SOM") +
coord_fixed() +
theme(legend.position = "none") +
facet_wrap(~ sam, ncol = 5, nrow = 1)
#Plot 5 simulated fields and spatial coverage sample
grdF <- df_grdField %>% pivot_longer(cols = c("pop1", "pop2", "pop3", "pop4", "pop5"))
plt2 <- ggplot(data = grdF) +
geom_tile(mapping = aes(x = x1, y = x2, fill = value)) +
geom_point(data = mySCsample, mapping = aes(x = x1, y = x2), size = 1.5, colour = "red") +
scale_x_continuous(name = "", breaks = NULL) +
scale_y_continuous(name = "", breaks = NULL) +
scale_fill_viridis_c(name = "SOM") +
coord_fixed() +
theme(legend.position = "none") +
facet_wrap(~name, nrow = 1, ncol = 5)
grid.arrange(plt1, plt2, nrow = 2)
```
As stressed by @dgr90 and @bru97, both approaches have their strengths and weaknesses. Broadly speaking, the design-based approach is the most appropriate if interest is in the population mean (total, proportion) or the population means (totals, proportions) of a restricted number of subpopulations (subareas). The model-based approach is the most appropriate if our aim is to map the study variable. Further, the strength of the design-based approach is the strict validity\index{Validity} of the estimates. Validity means that an objective assessment of the uncertainty of the estimator is warranted and that the coverage of confidence intervals is (almost) correct, provided that the sample is large enough to assume an approximately normal distribution of the estimator and design-unbiasedness of the variance estimator [@sar92]. The strength of the model-based approach is efficiency, i.e., more precise estimates of the (sub)population mean given the sample size, provided that a reasonably good model is used. So, if validity is more important than efficiency, the design-based approach is the best choice; in the reverse case, the model-based approach is preferable. For further reading, I recommend @Cassel1977 and @han83.
## Identically and independently distributed {#iid}
In a review paper on spatial sampling by @Wang2012, there is a section with the caption 'Sampling of i.i.d. populations'. Here, i.i.d. stands for "identically and independently distributed\index{Identically and independently distributed}". In this section of @Wang2012 we can read: "In SRS (simple random sampling) it is assumed that the population is independent and identically distributed". This is one of the old misconceptions revitalised by this review paper. I will make clear that in statistics i.i.d. is not a characteristic of populations, so the concept of i.i.d. populations does not make sense. The same misconception can be found in @Plant2012: "There is considerable literature on sample size estimation, much of which is discussed by Cochran (1977, chapter 4). This literature, however, is valid for samples of independent data but may not retain its validity for spatial data". Also according to @Wang2010, the classical formula for the variance of the estimator of the mean with simple random sampling, $V=\sigma^2/n$, only holds when data are independent. They say: "However in the case of spatial data, although members of the sample are independent by construction, data values that are near to one another in space, are unlikely to be independent because of a fundamental property of attributes in space, which is that they show spatial structure or continuity (spatial autocorrelation)". According to @Wang2010, the variance should be approximated by
\begin{equation}
V(\hat{\bar{z}})=\frac{\sigma^2 - \overline{\mathrm{Cov}(z_i,z_j)}}{n} \;,
(\#eq:Wang2010)
\end{equation}
with $V(\hat{\bar{z}})$ the variance of the estimator of the regional mean (mean of spatial population), $\sigma^2$ the population variance, $n$ the sample size, and $\overline{\mathrm{Cov}(z_i,z_j)}$ the average autocovariance between all pairs of individuals $(i, j)$ in the population (sampled and unsampled). So, according to this formula, ignoring the mean covariance within the population leads to an overestimation of the variance of the estimator of the mean. In Section \@ref(effectivesamplesize) I will make clear that this formula is incorrect and that the classical formula is still valid, also for populations showing spatial structure or continuity.
Remarkably, in other publications we can read that the classical formula for the variance of the estimator of the population mean with simple random sampling *underestimates* the true variance for populations showing spatial structure, see for instance @Griffith2005 and @Plant2012. The reasoning is that due to the spatial structure, there is less information in the sample data about the population mean. In Section \@ref(effectivesamplesize) I explain that this is also a misconception. Do not get confused by these publications and stick to the classical formulas which you can find in standard textbooks on sampling theory, such as @coc77 and @loh99, as well as in Chapter \@ref(SI) of this book.
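As a quick numerical check of this claim, the sketch below repeatedly selects simple random samples with replacement from the spatially structured population `pop1` simulated in the previous section and compares the variance of the sample means with the classical formula (the sample size and the number of repetitions are arbitrary choices).

```{r, eval = FALSE}
# Sketch: for simple random sampling with replacement, the variance of
# the sample mean over repeated sampling matches S^2 / n, with S^2 the
# population variance (divisor N), despite the spatial structure.
n <- 10
N <- nrow(df_grdField)
S2 <- mean((df_grdField$pop1 - mean(df_grdField$pop1))^2)
means <- replicate(
  10000, mean(df_grdField$pop1[sample(N, size = n, replace = TRUE)]))
c(empirical = var(means), classical = S2 / n)  # nearly equal
```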
The concept of independence of random variables is illustrated with a simulation. The top row of Figure \@ref(fig:iid) shows five simple random samples of size two. The two points are repeatedly selected from the same population (showing clear spatial structure), so this top row represents the design-based approach. The bottom row shows two points, not selected randomly and independently, but at a fixed distance of 10 m. These two points are placed on different populations generated by the model described above, so the bottom row represents the model-based approach.
```{r iid, echo = FALSE, out.width = "100%", fig.asp = 0.4, fig.cap = "Illustration of independence in design-based and model-based approach. The top row shows five samples of two points selected randomly and independently from each other from one population (design-based approach). The bottom row shows two points not selected randomly, at a distance of 10 m from each other, from five model realisations (model-based approach)."}
set.seed(413)
n <- 2
sampleId <- sample(x = seq_len(nrow(df_grdField)), size = 5 * n)
samplesdf <- df_grdField[sampleId, ]
samplesdf$sam <- rep(x = paste("sam", 1:5, sep = ""), each = n)
samplesdf$sam <- factor(x = samplesdf$sam, levels = paste("sam", 1:5, sep = ""), ordered = TRUE)
#select two points at fixed distance
h <- 0.010
xy1st <- c(0.405, 0.738)
set.seed(314)
angle <- runif(n = 1, min = 0, max = 2 * pi)
dx <- numeric(length = 2)
dx[1] <- h * sin(angle)
dx[2] <- h * cos(angle)
xy2nd <- xy1st + dx
onepair <- data.frame(x = c(xy1st[1], xy2nd[1]), y = c(xy1st[2], xy2nd[2]))
plt1 <- ggplot(data = samplesdf) +
geom_tile(data = df_grdField, mapping = aes(x = x1, y = x2, fill = pop1)) +
geom_point(mapping = aes(x = x1, y = x2), size = 1.5, colour = "red") +
scale_x_continuous(name = "", breaks = NULL) +
scale_y_continuous(name = "", breaks = NULL) +
scale_fill_viridis_c(name = "SOM") +
coord_fixed() +
theme(legend.position = "none") +
facet_wrap(~ sam, ncol = 5, nrow = 1)
plt2 <- ggplot(data = grdF) +
geom_tile(mapping = aes(x = x1, y = x2, fill = value)) +
geom_point(data = onepair, mapping = aes(x = x, y = y), size = 1.5, colour = "red") +
scale_x_continuous(name = "", breaks = NULL) +
scale_y_continuous(name = "", breaks = NULL) +
scale_fill_viridis_c(name = "SOM") +
coord_fixed() +
theme(legend.position = "none") +
facet_wrap(~ name, nrow = 1, ncol = 5)
grid.arrange(plt1, plt2, nrow = 2)
```
(ref:ScatterplotsTwopointslabel) Scatter plots of the values of a study variable $z$ at two randomly and independently selected points, 1,000 times selected from one population (design-based approach), and at two fixed points with a separation distance of 10 m, selected non-randomly from 1,000 model realisations (model-based approach).
```{r ScatterplotsTwopoints, echo = FALSE, out.width = "100%", fig.cap = "(ref:ScatterplotsTwopointslabel)" }
#simulate 1000 values at two points separated by distance h
c <- variogramLine(vgmodel, covariance = TRUE, dist_vector = c(0, h))
C <- matrix(nrow = 2, ncol = 2)
C[1, 1] <- C[2, 2] <- c$gamma[1]
C[1, 2] <- C[2, 1] <- c$gamma[2]
#cholesky decomposition
Upper <- chol(C)
set.seed(31415)
nsim <- 1000
Z <- matrix(nrow = nsim, ncol = 2)
for (i in 1:nsim) {
N <- rnorm(n = 2, 0, 1)
Z[i, ] <- crossprod(Upper, N)
}
Z <- Z + cnst
Zdf_MB <- as.data.frame(Z)
units <- sample(x = seq_len(nrow(df_grdField)), size = nsim)
Zdf_DB <- data.frame(V1 = df_grdField$pop1[units])
units <- sample(x = seq_len(nrow(df_grdField)), size = nsim)
Zdf_DB$V2 <- df_grdField$pop1[units]
Zdf <- rbind(Zdf_MB, Zdf_DB)
Zdf$approach <- rep(c("Model-based", "Design-based"), each = 1000)
Zdf$approach <- factor(Zdf$approach, levels = c("Design-based", "Model-based"), ordered = TRUE)
ggplot(data = Zdf) +
geom_point(mapping = aes(x = V1, y = V2), size = 1) +
scale_x_continuous(name = "z location 1", limits = c(0, 10)) +
scale_y_continuous(name = "z location 2\n", limits = c(0, 10)) +
coord_fixed() +
facet_wrap(~approach)
```
The values measured at the two points are plotted against each other in a scatter plot, not for just five simple random samples or five populations, but for 1,000 samples and 1,000 populations (Figure \@ref(fig:ScatterplotsTwopoints)). As we can see, there is no correlation between the two variables generated by the repeated random selection of the two points (design-based approach), whereas the two variables generated by the repeated simulation of populations (model-based approach) are correlated.
Instead of two points, we may select two series of probability samples independently from each other, for instance two series of simple random samples (SI) of size 10, or two series of systematic random samples with random origin (SY) with an average size of 10, see Figure \@ref(fig:TwoseriesSISY).
```{r TwoseriesSISY, echo = FALSE, out.width = "100%", fig.cap = "Two series (a and b) of simple random samples of ten points (top) and two series (a and b) of systematic random samples of, on average, ten points (bottom). The samples of series a and b are selected independently from each other."}
#select 10 SI samples of 10 points
set.seed(314)
n <- 10
sampleId <- sample(x = seq_len(nrow(df_grdField)), size = 10 * n)
sampleSI <- df_grdField[sampleId, c("x1","x2")]
sampleSI$sam <- rep(c("a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3", "b4", "b5"), each = n)
plt1 <- ggplot(data = sampleSI) +
geom_tile(data = df_grdField, mapping = aes(x = x1, y = x2, fill = pop1)) +
geom_point(mapping = aes(x = x1, y = x2), size = 1.5, colour = "red") +
scale_x_continuous(name = "", breaks = NULL) +
scale_y_continuous(name = "", breaks = NULL) +
scale_fill_viridis_c(name = "SOM") +
coord_fixed() +
theme(legend.position = "none") +
facet_wrap(~ sam, ncol = 5, nrow = 2)
#select 10 systematic random samples of 10 points
sampleSize <- 10
gridded(grdField) <- TRUE
set.seed(143)
sampleSY <- NULL
for (i in 1:10) {
sampleSYi <- as(spsample(x = grdField, n = sampleSize, type = "regular", bb = bbox(grdField)), "data.frame")
sampleSYi$sam <- i
sampleSY <- rbind(sampleSY, sampleSYi)
}
labels <- c(paste0("a", 1:5), paste0("b", 1:5))
sampleSY$sam <- labels[sampleSY$sam]
#make plot of SY samples
plt2 <- ggplot(data = sampleSY) +
geom_tile(data = df_grdField, mapping = aes(x = x1, y = x2, fill = pop1)) +
geom_point(mapping = aes(x = x1, y = x2), size = 1.5, colour = "red") +
scale_x_continuous(name = "", breaks = NULL) +
scale_y_continuous(name = "", breaks = NULL) +
scale_fill_viridis_c(name = "SOM") +
coord_fixed() +
theme(legend.position = "none") +
facet_wrap(~ sam, ncol = 5, nrow = 2)
grid.arrange(plt1, plt2, nrow = 2)
```
Again, if we plot the sample means of pairs of simple random samples and pairs of systematic random samples against each other, we see that the two averages are not correlated (Figure \@ref(fig:ScatterplotsSISY)). Note that the variation of the averages of the systematic random samples is considerably smaller than that of the simple random samples. The sampled population shows spatial structure, and by spreading the sampling units throughout the population, the precision of the estimated population mean is increased, see Chapter \@ref(SY).
```{r ScatterplotsSISY, echo = FALSE, out.width = "100%", fig.cap = "Scatter plots of averages of 1,000 pairs of simple random samples of ten points (SI) and of averages of 1,000 pairs of systematic random samples of ten points on average (SY)."}
#make scatter plot of means estimated from repeated SI samples
mz <- matrix(nrow = nsim, ncol = 2)
for (i in 1:nsim) {
sampleId1 <- sample(x = seq_len(nrow(df_grdField)), size = 10)
mz[i, 1] <- mean(df_grdField$pop1[sampleId1])
sampleId2 <- sample(x = seq_len(nrow(df_grdField)), size = 10)
mz[i, 2] <- mean(df_grdField$pop1[sampleId2])
}
mz_SI <- as.data.frame(mz)
#make scatter plot of means estimated from repeated SY samples
mz <- matrix(nrow = nsim, ncol = 2)
sp_grdField <- df_grdField[, c("x1", "x2", "pop1")]
gridded(sp_grdField) <- ~ x1 + x2
for (i in 1:nsim) {
sampleSY1 <- spsample(x = sp_grdField, n = sampleSize, type = "regular")
sampleSY2 <- spsample(x = sp_grdField, n = sampleSize, type = "regular")
#subsetting rows with overlay method
z1 <- over(x = sampleSY1, y = sp_grdField)
z2 <- over(x = sampleSY2, y = sp_grdField)
mz[i, 1] <- mean(z1$pop1)
mz[i, 2] <- mean(z2$pop1)
}
mz_SY <- as.data.frame(mz)
mz_both <- rbind(mz_SI, mz_SY)
mz_both$design <- rep(c("SI", "SY"), each = 1000)
ggplot(data = mz_both) +
geom_point(mapping = aes(x = V1, y = V2), size = 1) +
scale_x_continuous(name = "Sample average series a", limits = c(5, 7)) +
scale_y_continuous(name = "Sample average series b\n", limits = c(5, 7)) +
coord_fixed() +
facet_wrap(~design)
```
This sampling experiment shows that independence is not a characteristic of a population, as presumed by @Wang2012, but of random variables generated by a random process (in the experiment, the values at the points or the sample means). As the random process differs between the design-based and the model-based approach, independence has a different meaning in these two approaches. For this reason, it is imperative to be more specific when using the term independence, by saying that data are *design-independent*\index{Independence!design-independence} or that you *assume* that the data are *model-independent*\index{Independence!model-independence}.
## Bias and variance {#BiasandVariance}
Bias and variance are commonly used statistics to quantify the quality of an estimator. Bias quantifies the systematic error of the estimator, variance its random error. Both are defined as expectations. But are these expectations over realisations of a probability sampling design (samples) or over realisations of a statistical model (populations)? As with independence, it is important to distinguish *design-bias*\index{Bias!design-bias} from *model-bias*\index{Bias!model-bias} and *design-variance*\index{Design-variance} (commonly referred to as sampling variance) from *model-variance*\index{Model-variance}.
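The two kinds of expectation can be sketched with the populations `pop1`, ..., `pop5` simulated earlier. Five realisations are far too few for a reliable model-expectation, but the sketch shows the mechanics; the sample size and the number of repetitions are arbitrary choices.

```{r, eval = FALSE}
# Design-expectation: average the error of the sample mean over
# repeated probability samples from the single fixed population pop1.
N <- nrow(df_grdField)
n <- 10
err_design <- replicate(1000, {
  units <- sample(N, size = n)
  mean(df_grdField$pop1[units]) - mean(df_grdField$pop1)
})
mean(err_design)  # estimate of the design-bias, close to 0
var(err_design)   # estimate of the design-variance
# Model-expectation: average the error of one fixed sample over the
# model realisations pop1, ..., pop5 (in practice many more are needed).
units <- sample(N, size = n)
err_model <- sapply(paste0("pop", 1:5), function(pop)
  mean(df_grdField[[pop]][units]) - mean(df_grdField[[pop]]))
mean(err_model)  # crude estimate of the model-bias
```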
The concept of model-unbiasedness deserves more attention. Figure \@ref(fig:preferentialsample) shows a preferential sample\index{Preferential sample} from a population simulated by sequential Gaussian simulation with a constant mean of 10 and an exponential semivariogram without nugget, a sill of 5, and a distance parameter of 20. The points are selected by sampling with draw-by-draw selection probabilities proportional to size (pps sampling, Chapter \@ref(pps)), using the square of the simulated values as a size variable. A similar sample might be collected in practice for delineating soil contamination or for detecting hot spots of soil bacteria. Many points are selected at locations with large values, few at locations with small values. The sample data are used in ordinary kriging (Figure \@ref(fig:preferentialsample)). The prediction errors are computed by subtracting the simulated population from the kriged map.
```{r, echo = FALSE}
s1 <- s2 <- 1:100 - 0.5
grd <- expand.grid(s1, s2)
N <- nrow(grd)
names(grd) <- c("s1", "s2")
coordinates(grd) <- ~s1 + s2
vgmodel <- vgm(model = "Exp", psill = 5, range = 20)
set.seed(314)
nsim <- 1
sim <- krige(
dummy ~ 1,
locations = grd,
newdata = grd,
model = vgmodel,
nmax = 100,
nsim = nsim,
beta = 10,
dummy = TRUE,
debug.level = 0
)
sim <- as.data.frame(sim)
names(sim)[3] <- "z"
sim$size <- sim$z^2
#select pps sample
set.seed(314)
units <- sample(N, 100, prob = sim$size)
sam <- sim[units, ]
maxsimz <- max(sim$z)
minsimz <- min(sim$z)
#now predict at nodes of grid using the simulated values at the pps sample
predgrd <- grd
coordinates(sam) <- ~s1 + s2
preds <- krige(
z ~ 1,
locations = sam,
newdata = predgrd,
model = vgmodel,
debug.level = 0
)
preds <- as.data.frame(preds)
names(preds)[3] <- "zpred"
#map predicted values
#replace one predicted value by the maximum and another by the minimum of the simulated values, so that the two maps span the same range of values
preds$zpred[which.max(preds$zpred)] <- maxsimz
preds$zpred[1] <- minsimz
```
```{r preferentialsample, echo = FALSE, out.width = "100%", fig.cap = "Preferential sample (size of open dots is proportional to value of study variable) from a simulated field (z) and map of ordinary kriging predictions (zpred)."}
df <- data.frame(s1 = sim$s1, s2 = sim$s2, z = sim$z, zpred = preds$zpred)
d <- df %>% pivot_longer(cols = c("z", "zpred"))
sam <- as(sam, "data.frame")
ggplot(data = d) +
geom_tile(mapping = aes(x = s1, y = s2, fill = value)) +
geom_point(data = sam, mapping = aes(x = s1, y = s2, size = z), shape = 1) +
scale_fill_viridis_c(name = "z") +
coord_fixed() +
facet_grid(~ name)
```
```{r}
error <- preds$zpred - sim$z
```
Figure \@ref(fig:histogramerrorpreferentialsample) shows a histogram of the prediction errors. The population mean error equals `r formatC(mean(error), 3, format = "f")`, not 0. You may have expected a positive systematic error because of the overrepresentation of locations with large values, but on the other hand, kriging predictions are best linear unbiased predictions\index{Best linear unbiased predictor} (BLUP), so from that point of view this systematic error might be unexpected. BLUP means that at individual locations the ordinary kriging predictions are unbiased. However, this apparently does not guarantee that the average of the prediction errors, averaged over all population units, equals 0. The reason is that unbiasedness is defined here over all realisations (populations) of the statistical model of spatial variation. So, the U in BLUP stands for model-unbiasedness. For other model realisations, sampled at the same points, we may have much smaller values, leading to a negative mean error for that population. On average, over all populations, the error at any point will be 0, and consequently so will the average of the population mean errors.
```{r histogramerrorpreferentialsample, echo = FALSE, fig.width = 5, fig.cap = "Frequency distribution of the errors of ordinary kriging predictions from a preferential sample."}
ggplot() +
geom_histogram(aes(x = error), binwidth = 0.5, fill = "black", alpha = 0.5, colour = "black") +
scale_y_continuous(name = "Count") +
scale_x_continuous(name = "Prediction error")
```
This experiment shows that model-unbiasedness does not protect us against selection bias\index{Bias!selection bias}, i.e., bias due to preferential sampling.
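This model-expectation argument can be checked with a sketch: keep the preferentially selected sample locations fixed, simulate a modest number of new populations with the same model, krige each population from the values at those locations, and average the population mean errors. The sketch below assumes the objects `grd`, `vgmodel`, and `units` from the chunks above are still in memory; the number of realisations is an arbitrary, modest choice.

```{r, eval = FALSE}
# Sketch: with the sample locations held fixed, the population mean
# error of ordinary kriging, averaged over model realisations, is
# close to 0 (model-unbiasedness), even though it differs from 0 for
# the single realisation shown above.
nreal <- 25
sims <- krige(dummy ~ 1, locations = grd, newdata = grd, model = vgmodel,
  nmax = 100, nsim = nreal, beta = 10, dummy = TRUE, debug.level = 0)
sims <- as.data.frame(sims)
me <- numeric(nreal)
for (r in 1:nreal) {
  popr <- data.frame(s1 = sims$s1, s2 = sims$s2, z = sims[[paste0("sim", r)]])
  samr <- popr[units, ]
  coordinates(samr) <- ~ s1 + s2
  predr <- krige(z ~ 1, locations = samr, newdata = grd, model = vgmodel,
    debug.level = 0)
  me[r] <- mean(as.data.frame(predr)$var1.pred - popr$z)
}
mean(me)  # close to 0
```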
## Effective sample size {#effectivesamplesize}
Another persistent misconception is that when estimating the variance of the estimator of the mean of a spatial population or the correlation of two variables of a population, we must account for autocorrelation\index{Autocorrelation} of the sample data. This misconception occurs, for instance, in @Griffith2005 and in various sections (for instance, sections 3.5, 10.1, and 11.2) of @Plant2012. The reasoning is that, due to the spatial autocorrelation in the sample data, there is less information in the data about the parameter of interest, and so the effective sample size\index{Effective sample size} is smaller than the actual sample size. An early example of this misconception is Barnes' publication on the required sample size for estimating nonparametric tolerance intervals [@Barnes1988]. @dgr92 showed that a basic probability sampling design like simple random sampling requires fewer sampling points than the model-based sampling design proposed by Barnes.
The misconception is caused by confusing population parameters with model parameters. Recall that the population mean and the model-mean are not the same; the model-mean $\mu$ of Equation \@ref(eq:OKmodel) is the expectation of the population means over all populations that can be simulated with the model. The same holds for the variance of a variable, as well as for the covariance and the Pearson correlation coefficient of two variables. All these parameters can be defined as parameters of a finite or infinite population, or as parameters of random variables generated by a superpopulation model. Using an effective sample size to quantify the variance of an estimator is perfectly correct for model parameters, but not so for population parameters. For instance, when the correlation coefficient is defined as a population parameter and sampling units are selected by simple random sampling, there is no need to apply the method proposed by @Clifford1989, which adjusts the *p*-value\index{\emph{p}-value of a test} of a significance test of the correlation for spatial autocorrelation in the data.
I elaborate on this for the mean as the parameter of interest. Suppose a sample is selected in some way (it need not be random) and the sample mean is used as an estimator of the model-mean. Note that for a model with a constant mean, as in Equation \@ref(eq:OKmodel), the sample mean is a model-unbiased\index{Unbiasedness!model-unbiasedness} estimator of the model-mean, but in general not the best linear unbiased estimator\index{Best linear unbiased estimator} (BLUE) of the model-mean. If the random variables are model-independent, the variance of the sample mean, used as an estimator of the model-mean, can be computed by
\begin{equation}
V(\hat{\mu}) = \frac{\sigma^2}{n} \;,
(\#eq:VindModelMean)
\end{equation}
with $\sigma^2$ the model-variance of the random variable (see Equation \@ref(eq:OKmodel)). The variance presented in Equation \@ref(eq:VindModelMean) is necessarily a model-variance, as it quantifies our uncertainty about the model-mean, which only exists in the model-based approach. If the random variables are not model-independent, the model-variance of the sample mean can be computed by [@gru06]
\begin{equation}
V(\hat{\mu}) = \frac{\sigma^2}{n} \{1+(n-1)\bar{\rho}\} \;,
(\#eq:VdepModelMean)
\end{equation}
with $\bar{\rho}$ the mean correlation within the sample (the average of the correlation over all pairs of distinct sampling points). The term inside the curly brackets is larger than 1, unless $\bar{\rho}$ equals 0. So, the variance of the estimator of the model-mean with dependent data is larger than with independent data. The number of independent observations that is equivalent to a sample of $n$ spatially autocorrelated data, referred to as the effective sample size, can be computed by [@gru06]
\begin{equation}
n_{\mathrm{eff}}= \frac{n}{\{1+(n-1)\bar{\rho}\}} \;.
(\#eq:effectivesamplesize)
\end{equation}
So, if we substitute $n_{\mathrm{eff}}$ for $n$ in Equation \@ref(eq:VindModelMean), we obtain the variance presented in Equation \@ref(eq:VdepModelMean). Equation \@ref(eq:effectivesamplesize) is equivalent to equation (2) in @Griffith2005. Figure \@ref(fig:effectivesamplesize) shows that the effective sample size decreases sharply with the mean correlation. With a mean correlation of 0, the effective sample size equals the actual sample size; with a mean correlation of 1, the effective sample size equals 1.
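The sketch below computes the mean correlation and the effective sample size for a concrete configuration: 25 points on a square grid, with an exponential semivariogram without nugget (all settings are arbitrary choices for illustration).

```{r, eval = FALSE}
# Sketch: effective sample size of 25 points on a square grid with a
# spacing of 10 units, for an exponential semivariogram without nugget
# (sill 1, distance parameter 5); all settings are arbitrary choices.
vgm_exp <- vgm(model = "Exp", psill = 1, range = 5)
pnts <- expand.grid(s1 = seq(5, 45, by = 10), s2 = seq(5, 45, by = 10))
D <- as.matrix(dist(pnts))
C <- variogramLine(vgm_exp, dist_vector = D, covariance = TRUE)
meanrho <- mean(C[lower.tri(C)])  # sill is 1, so correlation = covariance
n <- nrow(pnts)
n / (1 + (n - 1) * meanrho)  # effective sample size, smaller than n
```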
```{r effectivesamplesize, echo = FALSE, fig.asp = .7, fig.width = 5, fig.cap = "Effective sample sizes as a function of the mean correlation within the sample, for samples of size 25 and 100."}
meanrho <- seq(from = 0, to = 1, length.out = 201)
n <- c(25, 100)
n25 <- n[1] / (1 + (n[1] - 1) * meanrho)
n100 <- n[2] / (1 + (n[2] - 1) * meanrho)
df <- data.frame(meanrho, n25, n100)
df_lf <- df %>% pivot_longer(cols = c("n25", "n100"))
df_lf$name <- factor(df_lf$name, levels = c("n25", "n100"), ordered = TRUE)
ggplot(data = df_lf) +
geom_line(mapping = aes(x = meanrho, y = value, linetype = name), size = 1) +
scale_linetype_manual(values = c("twodash", "dotted"), name = "") +
scale_x_continuous(name = "Mean correlation") +
scale_y_continuous(name = "Effective sample size", limits = c(1, 100))
```
To illustrate the difference between the model-variance and the design-variance of a sample mean, I simulated a finite population of 100 units, located at the nodes of a square grid, with a model-mean of 10, an exponential semivariogram without nugget, an effective range of three times the distance between adjacent population units, and a sill of 1 (Figure \@ref(fig:finitepopulation)). The model-variance of the average of a simple random sample *without replacement* of size $n$ is computed using Equation \@ref(eq:VdepModelMean), and the design-variance of the sample mean, used as an estimate of the population mean, is computed by (see Equation \@ref(eq:EstVarMeanSI))
\begin{equation}
V(\hat{\bar{z}})=\left(1-\frac{n}{N}\right)\frac{S^2}{n} \;,
(\#eq:varmeanSRSwithout)
\end{equation}
with $N$ the total number of population units ($N=100$). This is done for a range of sample sizes: $n = 10, 11, \dots, 100$. Note that for $n < 100$ the model-variance of the sample mean for a given $n$ differs between samples. For samples showing strong spatial clustering, the mean correlation is relatively large, and consequently the model-variance is relatively large (see Equation \@ref(eq:VdepModelMean)). There is less information in these samples about the model-mean than in samples without spatial clustering of the points. Therefore, to estimate the expectation of the model-variance over repeated simple random sampling for a given $n$, I selected 200 simple random samples of that size $n$ and averaged the 200 model-variances. Figure \@ref(fig:MBvarDBvar) shows the result. Both the model-variance and the design-variance of the sample mean decrease with the sample size. For all sample sizes, the model-variance is larger than the design-variance. The design-variance goes to 0 for $n = 100$ (see Equation \@ref(eq:varmeanSRSwithout)), whereas the model-variance for $n = 100$ equals 0.0509. This can be explained as follows. Although with $n = 100$ we know the population mean without error, this population mean is only an estimate of the model-mean. Recall that the model-mean is the expectation of the population mean over all realisations of the model.
```{r finitepopulation, echo = FALSE, fig.width = 5, fig.cap = "Simple random sample without replacement of ten points from a finite population simulated with a model with a model-mean of 10, a model-variance of 1, and an exponential semivariogram (without nugget) with a distance parameter equal to the distance between neighbours (effective range is three times this distance). The mean correlation between the sampling points equals approximately 0.039, and the model-variance of the estimator of the model-mean approximately 0.135."}
s1 <- s2 <- 1:10 - 0.5
grd <- expand.grid(s1, s2)
N <- nrow(grd)
names(grd) <- c("s1", "s2")
#simulate finite populations
sigmasq <- 1
vgmodel <- vgm(model = "Exp", psill = sigmasq, range = 1)
H <- as.matrix(dist(grd))
C <- variogramLine(vgmodel, dist_vector = H, covariance = TRUE)
Upper <- chol(C)
set.seed(314)
sim <- matrix(nrow = N, ncol = 10000)
for (i in 1:10000) {
G <- rnorm(n = nrow(grd), 0, 1)
sim[, i] <- crossprod(Upper, G) + 10
}
simdf <- as.data.frame(cbind(grd, sim))
#select one population, and select SRS without replacement
names(simdf)[17] <- "z"
set.seed(314)
units <- sample(N, 10, replace = FALSE)
SRS <- simdf[units, c(1, 2)]
d <- spDists(as.matrix(SRS[, c(1, 2)]))
C <- variogramLine(vgmodel, dist_vector = d, covariance = TRUE)
meanrho <- mean(C[lower.tri(C, diag = FALSE)] / sigmasq)
modelvar <- round(sigmasq / 10 * (1 + (10 - 1) * meanrho), 3)
ggplot(data = simdf) +
geom_point(mapping = aes(x = s1, y = s2, colour = z), size = 3) +
geom_point(data = SRS, mapping = aes(x = s1, y = s2), size = 6, shape = 1, stroke = 1.5) +
scale_colour_viridis_c(name = "z") +
scale_x_continuous(name = "Easting", breaks = c(2, 4, 6, 8)) +
scale_y_continuous(name = "Northing", breaks = c(2, 4, 6, 8)) +
coord_fixed()
```
```{r, echo = FALSE, eval = FALSE}
n <- 10:100
DBvar <- (1 - n / N) * var(simdf$z) / n
MBvar <- matrix(nrow = length(n), ncol = 200)
set.seed(314)
for (i in seq_len(length(n))) {
for (j in 1:200) {
units <- sample(N, n[i], replace = FALSE)
SRS <- simdf[units, c(1, 2)]
d <- spDists(as.matrix(SRS[, c(1, 2)]))
C <- variogramLine(vgmodel, dist_vector = d, covariance = TRUE)
meanrho <- mean(C[lower.tri(C, diag = FALSE)] / sigmasq)
MBvar[i, j] <- (sigmasq / n[i]) * (1 + (n[i] - 1) * meanrho)
}
}
save(n, DBvar, MBvar, file = "results/MBvarDBvar.rda")
```
```{r MBvarDBvar, echo = FALSE, fig.asp = .7, fig.width = 5, fig.cap = "Model-variance (MB) and design-variance (DB) of the average of a simple random sample without replacement as a function of the sample size."}
load(file = "results/MBvarDBvar.rda")
meanMBvar <- apply(MBvar, MARGIN = 1, FUN = mean)
df <- data.frame(n, DBvar, meanMBvar)
d <- df %>% pivot_longer(cols = c("DBvar", "meanMBvar"))
ggplot(data = d) +
geom_point(mapping = aes(x = n, y = value, shape = name), size = 2) +
scale_shape_manual(values = c(3, 1), name = "Variance", labels = c("DB", "MB")) +
scale_x_continuous(name = "Sample size", limits = c(10, 100), breaks = c(10, 30, 50, 70, 100)) +
scale_y_continuous(name = "Variance")
```
In Figure \@ref(fig:HistogramsMeanVariance) we can see that the population mean shows considerable variation. The variance of 10,000 simulated population means equals 0.0513, which is nearly equal to the value of 0.0509 for the model-variance computed with Equation \@ref(eq:VdepModelMean).
In observational research, I cannot think of situations in which interest is in the estimation of the mean of a superpopulation model. This is in contrast to experimental research, in which we are interested in the effects of treatments; think, for instance, of the effects of different types of soil tillage on the soil carbon stock. These treatment effects are quantified by differences between model-means. Also, in time-series analysis of data collected in observational studies, we might be more interested in the model-mean than in the mean over a bounded period of time.
```{r HistogramsMeanVariance, echo = FALSE, out.width = "100%", fig.asp = 0.5, fig.cap = "Frequency distributions of means and variances of 10,000 simulated populations."}
#compute population means
popmeans <- apply(simdf[, -c(1, 2)], MARGIN = 2, FUN = mean)
plt1 <- ggplot() +
geom_histogram(aes(x = popmeans), fill = "black", alpha = 0.5, binwidth = 0.1, colour = "black") +
scale_y_continuous(name = "Count") +
scale_x_continuous(name = "Population mean")
#compute population variances
S2 <- apply(simdf[, -c(1, 2)], MARGIN = 2, FUN = var)
plt2 <- ggplot() +
geom_histogram(aes(x = S2), fill = "black", alpha = 0.5, binwidth = 0.1, colour = "black") +
scale_y_continuous(name = "Count") +
scale_x_continuous(name = "Population variance")
grid.arrange(plt1, plt2, nrow = 1)
```
Now let us return to Equation \@ref(eq:Wang2010). What is wrong with this variance estimator? Where @Griffith2005 confused the population mean and the model-mean, @Wang2010 confused the population variance with the sill\index{Sill} (a priori variance) of the random process that has generated the population [@webster2007]. The parameter $\sigma^2$ in their formula is defined as the population variance, and with that definition the variance estimator is clearly wrong. If we define $\sigma^2$ in this formula as the sill, the formula makes more sense, but even then the equation is not fully correct. The variance computed with this equation is not the design-variance of the average of a simple random sample selected from the sampled population, but the *expectation* of this design-variance over all realisations of the model. So, it is a model-based prediction of the design-variance of the estimator of the population mean, estimated from a simple random sample, see Chapter \@ref(MBpredictionofDesignVariance). For the population actually sampled, the design-variance is either smaller or larger than this expectation. Figure \@ref(fig:HistogramsMeanVariance) shows that there is considerable variation in the population variance among the 10,000 populations simulated with the model. Consequently, for an individual population, the variance of the estimator of the population mean, estimated from a simple random sample, can differ substantially from the model-expectation of this variance. Do not use Equation \@ref(eq:Wang2010) for estimating the design-variance of the estimator of the population mean, but simply use Equation \@ref(eq:varmeanSRSwithout) (for simple random sampling with replacement and for simple random sampling of infinite populations, the term $(1-n/N)$ can be dropped). Equation \@ref(eq:Wang2010) is only relevant for comparing simple random sampling with other sampling designs under a variety of models of spatial variation, see Section \@ref(AnalyticalApproach) [@Ripley1981; @dom94].
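A quick check of this distinction, reusing the 10,000 populations in `simdf` simulated above with a sill (`sigmasq`) of 1:

```{r, eval = FALSE}
# Sketch: the average population variance over the model realisations
# is smaller than the sill of the generating model, so reading sigma^2
# in the formula of Wang et al. (2010) as the population variance is
# wrong.
S2 <- apply(simdf[, -c(1, 2)], MARGIN = 2, FUN = var)
mean(S2)  # smaller than the sill sigmasq = 1
```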
## Exploiting spatial structure in design-based approach {#ExploitSpatialStructure}
A further misconception is that, in the design-based approach, the possibilities of exploiting our knowledge of the spatial structure of the study variable are limited, because the sampling units are selected randomly. This would indeed be a serious drawback, but fortunately it is not true. Knowledge of the spatial structure can be used at the stage of designing the sample, at the stage of the statistical inference once the data are collected, or at both stages (Table \@ref(tab:TableExploitingSpatialStructure)).
I distinguish the situation in which maps of covariates are available from the situation in which such maps are lacking. In the first situation, the covariate maps can be used, for instance, to stratify the population (Chapter \@ref(STSI)). With a quantitative covariate, optimal stratification methods are available. Other options are, for instance, pps sampling (Chapter \@ref(pps)), and balanced sampling and well-spread sampling in covariate space with the local pivotal method (Chapter \@ref(BalancedSpreaded)). At the inference stage, the covariate maps can be used in a model-assisted approach\index{Model-assisted approach}, using, for instance, a linear regression model to increase the precision of the design-based estimator (Chapter \@ref(Modelassisted), Section \@ref(ModelassistedvsModeldependent)).
If no covariate maps are available, we may anticipate the presence of spatial structure by spreading the sampling units throughout the study area. This spreading can be done in many ways, for instance by systematic random sampling (Chapter \@ref(SY)), compact geographical stratification (Section \@ref(geostrata)), well-spread sampling in geographical space with the local pivotal method (LPM) (Subsection \@ref(LPM)), and generalised random-tessellation stratified (GRTS) sampling (Subsection \@ref(GRTS)). At the inference stage, again a model-assisted approach can be advantageous, using the spatial coordinates in a regression model.
```{r TableExploitingSpatialStructure, echo = FALSE}
tabledf <- data.frame(stage = c("Sampling", "", "", "", "Inference"), with = c("Stratified random sampling", "pps sampling", "Balanced sampling", "Covariate space spreading with LPM", "Model-assisted: regression model"), without = c("Systematic random sampling", "Compact geographical stratification", "Geographical spreading with LPM", "GRTS sampling", "Model-assisted: spatial regression model"))
knitr::kable(
tabledf, caption = "Strategies in the design-based approach for exploiting knowledge about the spatial structure of the study variable.",
col.names = c("Stage", "Covariates available", "No covariates"),
booktabs = TRUE,
linesep = ""
) %>%
kable_classic(font_size = 8) %>%
add_footnote("LPM: local pivotal method.", notation = "none")
```
## Model-assisted vs. model-dependent {#ModelassistedvsModeldependent}
In this section, the difference between the model-assisted approach and the model-based approach is explained. The model-assisted approach is a hybrid approach in between the design-based and the model-based approach. It tries to build the strength of the model-based approach, a potential increase in the accuracy of the estimates, into the design-based approach. As in the design-based approach, sampling units are selected by probability sampling, and consequently bias and variance are defined as design-bias and design-variance (Table \@ref(tab:threeapproaches)). As in the model-based approach, a superpopulation model is used. However, the role of this model in the two approaches is fundamentally different. In both approaches we assume that the population of interest is a realisation of the superpopulation model. However, as explained above, in the model-based approach the statistical properties of the estimators (predictors), such as bias and variance, are defined over all possible realisations of the model (Table \@ref(tab:threeapproaches)). So, unbiasedness and minimum variance of an estimator (predictor) mean *model*-unbiasedness and minimum *model*-variance. By contrast, in the model-assisted approach, the model is used to derive an efficient estimator (Chapter \@ref(Modelassisted)). To stress its different role in the model-assisted approach, the model is referred to as a working model\index{Working model}.
```{r threeapproaches, echo = FALSE}
approach <- data.frame(Approach = c("Design-based", "Model-assisted", "Model-based"), Sampling = c("Prob. sampling", "Prob. sampling", "No requirement"), Inference = c("Design-based", "Model-assisted", "Model-depend."), Coefficients = c("No model", "Population par.", "Superpop. par."), Criteria = c("Design-bias, Design-variance", "Design-bias, Design-variance", "Model-bias, Model-variance"))
knitr::kable(
approach, caption = "Three statistical approaches for sampling and inference.",
col.names = c("Approach", "Sampling", "Inference", "Regression coefficients", "Quality criteria"),
booktabs = TRUE,
linesep = ""
) %>%
kable_classic(font_size = 7)
```
An important property of model-assisted estimators is that, even if a poor working model is used (i.e., our assumptions about how the population is generated are incorrect), the results are still valid for moderate sample sizes: the empirical coverage rate\index{Empirical coverage rate} of a model-assisted estimate of the confidence interval of the population mean is still approximately equal to the nominal coverage rate. This is because the mismatch between the superpopulation model and the applied model-assisted estimator results in a large design-variance of the estimator of the population mean, which is reflected in the estimated standard errors and thus in wider confidence intervals. This is illustrated with a simulation study, in which I compare the effect of using a correct vs. an incorrect model in estimation.
A population is simulated with a simple linear regression model with an intercept of 15 ($\beta_0 = 15$), a slope coefficient of 0.5 ($\beta_1 = 0.5$), and a constant residual standard deviation of 2 ($\sigma_{\epsilon}=2$). This is done by first simulating a population of covariate values with a model-mean of 20 ($\mu(x)=20$), using an exponential semivariogram without nugget, a sill variance of 25, and a distance parameter of 20 distance units. This field of covariate values is then linearly transformed using the above-mentioned regression coefficients. Finally, 'white noise\index{White noise}' is added by drawing independently for each population unit a random number from a normal distribution with zero mean and a standard deviation of 2 (Figure \@ref(fig:plotsimulatedbivariatepopulation)).
```{r simulatebivariatepopulation, echo = FALSE}
s1 <- s2 <- 1:200 - 0.5
mypopulation <- expand.grid(s1, s2)
names(mypopulation) <- c("s1", "s2")
N <- nrow(mypopulation)
vgmodel <- vgm(model = "Exp", psill = 25, range = 20, nugget = 0)
coordinates(mypopulation) <- ~s1 + s2
set.seed(314)
sim <- krige(
dummy ~ 1,
locations = mypopulation,
newdata = mypopulation,
model = vgmodel,
nmax = 20,
nsim = 1,
beta = 20,
dummy = TRUE,
debug.level = 0
)
mypopulation$x <- sim$sim1
mypopulation <- as(mypopulation, "data.frame")
mypopulation$z <- 15 + 0.5 * mypopulation$x + rnorm(n = N, mean = 0, sd = 2)
```
The population mean of the study variable equals `r formatC(mean(mypopulation$z), 3, format = "f")`, which is pretty close to the known model-mean $\mu(z)$: $\beta_0 + \beta_1 \mu(x)= 15 + 0.5 \cdot 20$.
```{r plotsimulatedbivariatepopulation, echo = FALSE, out.width = "100%", fig.cap = "Realisation of simple linear regression model. x is the covariate, z is the study variable."}
plt1 <- ggplot(data = mypopulation) +
geom_raster(mapping = aes(x = s1, y = s2, fill = x)) +
scale_fill_viridis_c(name = "x") +
scale_y_continuous(name = "Northing") +
scale_x_continuous(name = "Easting") +
coord_fixed()
plt2 <- ggplot(data = mypopulation) +
geom_raster(mapping = aes(x = s1, y = s2, fill = z)) +
scale_fill_viridis_c(name = "z") +
scale_y_continuous(name = "Northing") +
scale_x_continuous(name = "Easting") +
coord_fixed()
grid.arrange(plt1, plt2, nrow = 1)
```
```{r scatterplotbivariatepopulation, echo = FALSE, fig.width = 5, out.width= "60%", fig.cap = "Exhaustive scatter plot of the simulated population, with population fit of a simple linear regression model (green line), and of a ratio model fitted with weights inversely proportional to the covariate (red line)."}
model_regr_pop <- lm(z ~ x, data = mypopulation)
model_ratio_pop <- lm(z ~ x - 1, weights = 1 / x, data = mypopulation)
ggplot(mypopulation) +
geom_point(mapping = aes(x = x, y = z), alpha = 0.25) +
geom_abline(intercept = coef(model_regr_pop)[1], slope = coef(model_regr_pop)[2], colour = "green") +
geom_abline(intercept = 0, slope = coef(model_ratio_pop)[1], colour = "red") +
scale_x_continuous(limits = c(0, 40)) +
scale_y_continuous(limits = c(0, 40), name = "z")
sdres_regr <- summary(model_regr_pop)$sigma
fit_ratio <- fitted(model_ratio_pop)
e_ratio <- mypopulation$z - fit_ratio
sdres_ratio <- sqrt(sum(e_ratio^2) / nrow(mypopulation))
```
Figure \@ref(fig:scatterplotbivariatepopulation) shows a scatter plot for all population units. The Pearson correlation coefficient equals `r formatC(cor(mypopulation$x,mypopulation$z), 3, format = "f")`. Two models are fitted to the exhaustive scatter plot: a simple linear regression model and a ratio model. The ratio model assumes that the intercept $\beta_0$ equals 0 and that the residual variance is proportional to the covariate values: $\sigma^2_{\epsilon} \propto x$. The population fits of the coefficients of the simple linear regression model are `r formatC(coef(model_regr_pop)[1], 4, format = "f")` and `r formatC(coef(model_regr_pop)[2], 4, format = "f")`, which are very close to the model regression coefficients. The fitted ratio model is clearly very poor. The residual standard deviation of the population fit of the ratio model equals `r formatC(sdres_ratio, 3, format = "f")`, which is much larger than `r formatC(sdres_regr, 3, format = "f")` of the simple linear regression model.
The population mean of the study variable is estimated by selecting 5,000 times a simple random sample of 25 units. Each sample is used to estimate the population mean with two model-assisted estimators: the simple regression estimator and the ratio estimator (Chapter \@ref(Modelassisted)). The first estimator correctly assumes that the population is a realisation of a simple linear regression model, whereas the latter incorrectly assumes that it is a realisation of a ratio model. For each sample, the standard errors of the two estimators are estimated as well, which are used to compute a 95\% confidence interval of the population mean. Then the empirical coverage rate is computed, i.e., the proportion of samples for which the population mean is inside the 95\% confidence interval. Ideally, this empirical coverage rate is equal to the nominal coverage rate of 0.95.
```{r coveragerate1, echo = FALSE}
mx_pop <- mean(mypopulation$x)
mz_pop <- mean(mypopulation$z)
n <- 25
S <- 5000
alpha <- 0.05
se_regr <- se_ratio <- ind_regr <- ind_ratio <- numeric(length = S)
set.seed(314)
for (i in 1:S) {
  # select a simple random sample with replacement of n units
  units <- sample(N, size = n, replace = TRUE)
  mySI <- mypopulation[units, ]
  mx_sam <- mean(mySI$x)
  mz_sam <- mean(mySI$z)
  # simple regression estimator of the population mean
  model_regr <- lm(z ~ x, data = mySI)
  b1 <- coef(model_regr)[2]
  mz_regr <- mz_sam + b1 * (mx_pop - mx_sam)
  # estimated standard error of the regression estimator
  e_regr <- residuals(model_regr)
  S2e <- var(e_regr)
  se_regr[i] <- sqrt(S2e / n)
margin <- qt(1 - alpha / 2, n - 2, lower.tail = TRUE) * se_regr[i]
lower <- mz_regr - margin
upper <- mz_regr + margin
ind_regr[i] <- (mz_pop > lower & mz_pop < upper)
  # ratio estimator: slope of the WLS fit times the population mean of x
  model_ratio <- lm(z ~ x - 1, weights = 1 / x, data = mySI)
  b <- coef(model_ratio)
  mz_ratio <- b * mx_pop
  e_ratio <- residuals(model_ratio)
  se_ratio[i] <- sqrt(var(e_ratio) / n)
margin <- qt(1 - alpha / 2, n - 1, lower.tail = TRUE) * se_ratio[i]
lower <- mz_ratio - margin
upper <- mz_ratio + margin
ind_ratio[i] <- (mz_pop > lower & mz_pop < upper)
}
coverage_regr <- mean(ind_regr)
coverage_ratio <- mean(ind_ratio)
```
The coverage rates of the simple regression estimator and the ratio estimator equal `r formatC(coverage_regr, 3, format = "f")` and `r formatC(coverage_ratio, 3, format = "f")`, respectively. Both are very close to the nominal coverage rate of 0.95. So, although the ratio estimator assumes an improper superpopulation model, the estimated confidence interval is still valid. The price we pay for the invalid model assumption is not a coverage rate below the nominal rate, but an increased standard error of the estimated population mean. The average over the 5,000 samples of the estimated standard error of the regression estimator equals `r formatC(mean(se_regr), 3, format = "f")`, whereas that of the ratio estimator equals `r formatC(mean(se_ratio), 3, format = "f")`. The larger standard error of the ratio estimator leads to wider confidence intervals, which explains why the coverage rate is still correct.
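The wider intervals can be made explicit from the standard errors stored in the sampling experiment; a minimal sketch, using the objects `se_regr` and `se_ratio` computed above (not evaluated here):
```{r, eval = FALSE}
# average half-width of the 95% confidence intervals over the 5,000 samples;
# the wider intervals of the ratio estimator compensate for its larger
# standard error, keeping the coverage rate near the nominal 0.95
mean(qt(0.975, df = n - 2) * se_regr)
mean(qt(0.975, df = n - 1) * se_ratio)
```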
This sampling experiment is now repeated for sample sizes $n=10, 25, 50, 100$ and for confidence levels $1-\alpha = 0.01, 0.02, \dots, 0.99$.
```{r, echo = FALSE, eval = FALSE}
n <- c(10, 25, 50, 100)
alphas <- (1:99) / 100
nsam <- 5000
mx_pop <- mean(mypopulation$x)
mz_pop <- mean(mypopulation$z)
mz_regr <- mz_ratio <- v_mz_regr <- v_mz_ratio <- av_mz_regr <- av_mz_ratio <- matrix(nrow = length(n), ncol = nsam)
ind_regr <- ind_ratio <- ind_regr_av <- ind_ratio_av <- array(dim = c(length(n), nsam, length(alphas)))
set.seed(314)
for (i in seq_len(length(n))) {
  pi <- n[i] / N # sampling fraction n/N, used in the g-weighted variance estimators
for (j in 1:nsam) {
units <- sample(N, size = n[i], replace = TRUE)
mySI <- mypopulation[units, ]
mx_sam <- mean(mySI$x)
mz_sam <- mean(mySI$z)
    model_regr <- lm(z ~ x, data = mySI)
    model_ratio <- lm(z ~ x - 1, weights = 1 / x, data = mySI)
    # simple regression and ratio estimators of the population mean
    mz_regr[i, j] <- mz_sam + coef(model_regr)[2] * (mx_pop - mx_sam)
    mz_ratio[i, j] <- mz_sam / mx_sam * mx_pop
e_regr <- residuals(model_regr)
e_ratio <- residuals(model_ratio)
    # variance estimators based on the residual variance
    av_mz_regr[i, j] <- (var(e_regr) * (n[i] - 1) / (n[i] - 2)) / n[i]
    av_mz_ratio[i, j] <- var(e_ratio) / n[i]
    # g-weighted variance estimators
    g_regr <- 1 + (mx_pop - mx_sam) * (mySI$x - mx_sam) / (mean(mySI$x^2) - mx_sam^2)
    g_ratio <- mx_pop / mx_sam
    v_mz_regr[i, j] <- sum(g_regr^2 * (e_regr / pi)^2) / N^2
    v_mz_ratio[i, j] <- sum(g_ratio^2 * (e_ratio / pi)^2) / N^2
for (k in seq_len(length(alphas))) {
se <- sqrt(v_mz_regr[i, j])
margin <- qt(1 - alphas[k] / 2, n[i] - 2, lower.tail = TRUE) * se
lower <- mz_regr[i, j] - margin
upper <- mz_regr[i, j] + margin
ind_regr[i, j, k] <- (mz_pop > lower & mz_pop < upper)
se <- sqrt(v_mz_ratio[i, j])
margin <- qt(1 - alphas[k] / 2, n[i] - 1, lower.tail = TRUE) * se
lower <- mz_ratio[i, j] - margin
upper <- mz_ratio[i, j] + margin
ind_ratio[i, j, k] <- (mz_pop > lower & mz_pop < upper)
se <- sqrt(av_mz_regr[i, j])
margin <- qt(1 - alphas[k] / 2, n[i] - 2, lower.tail = TRUE) * se
lower <- mz_regr[i, j] - margin
upper <- mz_regr[i, j] + margin
ind_regr_av[i, j, k] <- (mz_pop > lower & mz_pop < upper)
se <- sqrt(av_mz_ratio[i, j])
margin <- qt(1 - alphas[k] / 2, n[i] - 1, lower.tail = TRUE) * se
lower <- mz_ratio[i, j] - margin
upper <- mz_ratio[i, j] + margin
ind_ratio_av[i, j, k] <- (mz_pop > lower & mz_pop < upper)
}
}
}
save(mz_regr, mz_ratio, v_mz_regr, v_mz_ratio, av_mz_regr, av_mz_ratio, ind_regr, ind_ratio, ind_regr_av, ind_ratio_av, file = "results/Coveragerates_modelassistedestimators.rda")
```
```{r coveragerates, echo = FALSE, out.width = "100%", fig.cap = "Empirical versus nominal coverage rates of confidence intervals for the population mean, estimated by the simple regression estimator, for sample sizes 10, 25, 50, and 100."}
load(file = "results/Coveragerates_modelassistedestimators.rda")
n <- c(10, 25, 50, 100)
alphas <- (1:99) / 100
nsam <- 5000
Ep_mz_regr <- apply(mz_regr, MARGIN = 1, FUN = mean)
Ep_mz_ratio <- apply(mz_ratio, MARGIN = 1, FUN = mean)
Vp_mz_regr <- apply(mz_regr, MARGIN = 1, FUN = var)
Vp_mz_ratio <- apply(mz_ratio, MARGIN = 1, FUN = var)
Ep_av_mz_regr <- apply(av_mz_regr, MARGIN = 1, FUN = mean)
Ep_av_mz_ratio <- apply(av_mz_ratio, MARGIN = 1, FUN = mean)
Ep_v_mz_regr <- apply(v_mz_regr, MARGIN = 1, FUN = mean)
Ep_v_mz_ratio <- apply(v_mz_ratio, MARGIN = 1, FUN = mean)
coverage_regr <- coverage_ratio <- coverage_regr_av <- coverage_ratio_av <- matrix(nrow = length(n), ncol = length(alphas))
for (k in seq_len(length(alphas))) {
coverage_regr[, k] <- apply(ind_regr[, , k], MARGIN = 1, FUN = mean)
coverage_ratio[, k] <- apply(ind_ratio[, , k], MARGIN = 1, FUN = mean)
coverage_regr_av[, k] <- apply(ind_regr_av[, , k], MARGIN = 1, FUN = mean)
coverage_ratio_av[, k] <- apply(ind_ratio_av[, , k], MARGIN = 1, FUN = mean)
}
df <- data.frame(confidence = 1 - alphas,
n10 = coverage_regr_av[1, ],
n25 = coverage_regr_av[2, ],
n50 = coverage_regr_av[3, ],
n100 = coverage_regr_av[4, ])
d <- df %>% pivot_longer(cols = c("n10", "n25", "n50", "n100"))
d$name <- factor(d$name, levels = c("n10", "n25", "n50", "n100"), ordered = TRUE)
ggplot(d) +
geom_abline(intercept = 0, slope = 1, colour = "green") +
geom_point(mapping = aes(x = confidence, y = value), alpha = 0.5) +
scale_x_continuous(limits = c(0, 1), name = "Confidence level") +
scale_y_continuous(limits = c(0, 1), name = "Coverage rate") +
facet_wrap(~ name, ncol = 2, nrow = 2) +
coord_fixed()
```
```{r plotcoverageratesratioestimator, echo = FALSE, out.width = "100%", fig.cap = "Empirical versus nominal coverage rates of confidence intervals for the population mean, estimated by the ratio estimator, for sample sizes 10, 25, 50, and 100."}
df <- data.frame(confidence = 1 - alphas,
n10 = coverage_ratio_av[1, ],
n25 = coverage_ratio_av[2, ],
n50 = coverage_ratio_av[3, ],
n100 = coverage_ratio_av[4, ])
d <- df %>% pivot_longer(cols = c("n10", "n25", "n50", "n100"))
d$name <- factor(d$name, levels = c("n10", "n25", "n50", "n100"), ordered = TRUE)
ggplot(d) +
geom_abline(intercept = 0, slope = 1, colour = "green") +
geom_point(mapping = aes(x = confidence, y = value), alpha = 0.5) +
scale_x_continuous(limits = c(0, 1), name = "Confidence level") +
scale_y_continuous(limits = c(0, 1), name = "Coverage rate") +
facet_wrap(~ name, ncol = 2, nrow = 2) +
coord_fixed()
```
```{r TableRegressionRatioEstimator, echo = FALSE}
# relative bias (%), following the definition below: (mean estimate - true mean) / true mean
bias_regr <- ((Ep_mz_regr - mz_pop) / mz_pop) * 100
Sp_regr <- sqrt(Vp_mz_regr)
ase_regr <- sqrt(av_mz_regr)
Ep_ase_regr <- apply(ase_regr, MARGIN = 1, FUN = mean)
bias_ratio <- ((Ep_mz_ratio - mz_pop) / mz_pop) * 100
Sp_ratio <- sqrt(Vp_mz_ratio)
ase_ratio <- sqrt(av_mz_ratio)
Ep_ase_ratio <- apply(ase_ratio, MARGIN = 1, FUN = mean)
bias <- c(bias_regr, bias_ratio)
Sp <- c(Sp_regr, Sp_ratio)
Ep_ase <- c(Ep_ase_regr, Ep_ase_ratio)
tabledf <- data.frame(estimator = c(rep("Regression", 4), rep("Ratio", 4)), size = n, bias = round(bias, 4), sd = round(Sp, 3), se = round(Ep_ase, 3))
knitr::kable(
tabledf, caption = "Estimated relative bias of the regression estimator and ratio estimator, standard deviation of 5,000 regression/ratio estimates, and average of 5,000 estimated standard errors of the regression/ratio estimator.",
col.names = c("Estimator", "n", "Bias (%)", "Standard deviation", "Average standard error"),
booktabs = TRUE,
linesep = ""
) %>%
add_footnote("n: sample size.", notation = "none")
```
Figures \@ref(fig:coveragerates) and \@ref(fig:plotcoverageratesratioestimator) show that the empirical coverage rates are close to the nominal coverage rates for all four sample sizes, both estimators, and all confidence levels. For the regression estimator and $n=10$, the empirical coverage rate is somewhat too small, because the standard error of the regression estimator is slightly underestimated at this sample size: the average of the estimated standard errors (the square root of the estimated variance of the regression estimator) equals `r formatC(Ep_ase_regr[1], 3, format = "f")`, which is somewhat smaller than the standard deviation of the 5,000 regression estimates, `r formatC(Sp_regr[1], 3, format = "f")`. For all sample sizes, the standard deviation of the 5,000 ratio estimates is considerably larger than that of the 5,000 regression estimates (Table \@ref(tab:TableRegressionRatioEstimator)). For $n=10$, the standard error of the ratio estimator is also underestimated (the average of the 5,000 estimated standard errors is smaller than the standard deviation of the 5,000 ratio estimates), but as a percentage of that standard deviation the underestimation is smaller than for the regression estimator.
The relative bias, computed as
\begin{equation}
\mathrm{bias} = \frac{\frac{1}{5000}\sum_{s=1}^{5000} (\hat{\bar{z}}_s-\bar{z})}{\bar{z}} \;,
\end{equation}
is about 0 for both estimators and all four sample sizes.
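This can be verified directly from the saved simulation results; a minimal sketch, using the objects loaded above (not evaluated here):
```{r, eval = FALSE}
# relative bias (%) of the regression and ratio estimators for n = 10, 25, 50, 100
(apply(mz_regr, MARGIN = 1, FUN = mean) - mz_pop) / mz_pop * 100
(apply(mz_ratio, MARGIN = 1, FUN = mean) - mz_pop) / mz_pop * 100
```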
By contrast, if a poor superpopulation model is used in the model-based approach, the predictions and the prediction error variances are still model-unbiased, i.e., unbiased over realisations of the assumed model. However, for the particular population that is sampled, we may have a serious systematic error in the estimated population mean, and the variance of the local predictions may be seriously over- or underestimated. For this reason, model-based inference is also referred to as *model-dependent* inference\index{Model-dependent predictor}, stressing that we fully rely on the model and that the validity\index{Validity} of the estimates and predictions depends on the quality of the model [@han83].
```{r, echo=FALSE}
rm(list = ls())
```