-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy path14-SmallAreaEstimation.Rmd
854 lines (672 loc) · 58.6 KB
/
14-SmallAreaEstimation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
# Sampling for estimating parameters of domains {#SmallAreaEstimation}
This chapter is about probability sampling and estimation of means or totals of subpopulations (subareas, subregions). In the sampling literature, these subpopulations are referred to as domains of interest\index{Domain of interest}, or shortly domains. Ideally, at the stage of designing a sample, these domains are known, and of every population unit we know to which domain it belongs. In that situation it is most convenient to use these domains as strata in random sampling, so that we can control the sample size in each domain (Chapter \@ref(STSI)).
If we have multiple maps with domains, think for instance of a soil class map, a map with land cover classes, and a map with countries, we can make an overlay of these maps to construct the cross-tabulation strata. However, this may result in numerous strata, in some cases even more than the sample size. In this situation, an attractive solution is multiway stratification (Subsection \@ref(Multiwaystratification)). With this design, the domains of interest are used as strata, not their cross-classification, and the sample sizes of these marginal strata are controlled.
Even with a multiway stratified sample, resulting in controlled sample sizes for each domain, the sample size of a domain can be too small for a reliable estimate of the mean or total when using just the data of this domain. In this case we may use model-assisted estimators (Chapter \@ref(Modelassisted)) to increase the precision. Not only the data collected from a given domain are used to estimate the mean or total, but also data outside the domain (Section \@ref(SmallDomainsModelAssisted)) (@cha94, @Rao2003, @Falorsi2008).
We may also wish to estimate the mean or total of domains that are not used as strata or marginal strata. The sample size in these domains is then not controlled and varies among samples selected with the sampling design. As before with multiway stratified sampling, the mean can either be estimated with the direct estimator, using the data from the domain only (Section \@ref(LargeDomainsDirectEstimator)), or a model-assisted estimator, also using data from outside the domain (Section \@ref(SmallDomainsModelAssisted)).
## Direct estimator for large domains {#LargeDomainsDirectEstimator}
If the sample size of a domain $d$ is considered large enough to obtain a reliable estimate of the mean and, besides, the size of the domain is known, the mean of that domain can be estimated by the direct estimator\index{Direct estimator}:
\begin{equation}
\hat{\bar{z}}_{d}=\frac{1}{N_{d}}\sum_{k \in \mathcal{S}_d}\frac{z_{dk}}{\pi _{dk}} \;,
(\#eq:piestimatormeandomain)
\end{equation}
where $N_{d}$ is the size of the domain, $z_{dk}$ is the value for unit $k$ of domain $d$, and $\pi _{dk}$ is the inclusion probability of this point.
When the domain is not used as a (marginal) stratum, so that the sample size of the domain is random, the mean of the domain can be estimated best by the ratio estimator:
\begin{equation}
\hat{\bar{z}}_{\text{ratio},d}= \frac{\hat{t}_d(z)}{\widehat{N}_d}=
\frac{\sum_{k \in \mathcal{S}_d}\frac{z_{dk}}{\pi_{dk}}}{\sum_{k \in \mathcal{S}_d}\frac{1}{\pi_{dk}}} \;.
(\#eq:generalratiodomain)
\end{equation}
```{block2, type='rmdnote'}
The ratio estimator\index{Ratio estimator} can also be used when the size of the domain is unknown. An example of this is estimating the mean of soil classes *as observed in the field*, not *as depicted on a soil map*. A soil map is impure, i.e., the map units contain patches with soil classes that differ from the soil class as indicated on the map. The area of a given true soil class is not known.
```
For simple random sampling without replacement, $\pi_{dk} = n/N$. Inserting this in Equation \@ref(eq:generalratiodomain) gives
\begin{equation}
\hat{\bar{z}}_{\text{ratio},d}=\frac{1}{n_{d}}\sum_{k \in \mathcal{S}_d}z_{dk} \;.
(\#eq:ratiodomainSI)
\end{equation}
The mean of the domain is simply estimated by the mean of the $z$-values observed in the domain, i.e., the sample mean in domain $d$. The variance of this estimator can be estimated by
\begin{equation}
\widehat{V}\!\left(\hat{\bar{z}}_{\text{ratio},d}\right) =
\frac{1}{\hat{a}_{d}^{2}}\cdot\frac{1}{n\,(n-1)}\sum_{k \in \mathcal{S}_d}(z_{dk}-\bar{z}_{\mathcal{S}_d})^{2} \;,
(\#eq:VarratiodomainSI)
\end{equation}
where $\bar{z}_{\mathcal{S}_d}$ is the sample mean in domain $d$, and $\hat{a}_{d}$ is the estimated relative size of domain $d$:
\begin{equation}
\hat{a}_{d}=\frac{n_{d}}{n} \;.
(\#eq:estimatedrelativesizedomain)
\end{equation}
Refer to section (8.2.2) in @gru06 for the ratio estimator and its standard error with stratified simple random sampling, in case the domains cut across the strata, and other sampling designs.
The ratio estimator and its standard error can be computed with function `svyby` of package **survey** [@Lumley2020]. This is illustrated with Eastern Amazonia. We wish to estimate the mean aboveground biomass (AGB) of the 16 ecoregions from a simple random sample of 200 units.
```{r, echo = FALSE}
mz_pop <- tapply(grdAmazonia$AGB, INDEX = grdAmazonia$Ecoregion, FUN = mean)
```
```{r}
library(survey)
set.seed(314)
n <- 200
mysample <- grdAmazonia %>%
mutate(N = n()) %>%
slice_sample(n = n)
design_si <- svydesign(id = ~ 1, data = mysample, fpc = ~ N)
res <- svyby(~ AGB, by = ~ Ecoregion, design = design_si, FUN = svymean)
```
The ratio estimates of the mean AGB are shown in Table \@ref(tab:TableRatioEstimatesEcoregions). Two ecoregions are missing in the table: no units are selected from these ecoregions so that a direct estimate is not available. There are three ecoregions with an estimated standard error of 0.0. These ecoregions have less than two sampling units only, so that the standard error cannot be estimated.
(ref:TableRatioEstimatesEcoregionslabel) Ratio estimates and estimated standard errors of the ratio estimator of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia, with simple random sampling without replacement of size 200.
```{r TableRatioEstimatesEcoregions, echo = FALSE}
rownames(res) <- NULL
res[, c(2, 3)] <- round(res[, c(2, 3)], 1)
knitr::kable(
res, caption = "(ref:TableRatioEstimatesEcoregionslabel)",
booktabs = TRUE,
linesep = ""
) %>%
kable_classic() %>%
add_footnote("The estimated standard errors of 0.0 are non-availables.", notation = "none")
```
The simple random sampling is repeated 1,000 times, and every sample is used to estimate the mean AGB of the ecoregions both with the $\pi$ estimator and the ratio estimator. As can be seen in Table \@ref(tab:TableRatiovsHTEstimatesEcoregions) the standard deviation of the ratio estimates is much smaller than that of the $\pi$ estimates. The reason is that the number of sampling units in an ecoregion varies among samples, i.e., the sample size of an ecoregion is random. When many units are selected from an ecoregion, the estimated total of that ecoregion is large. The estimated mean as obtained with the $\pi$ estimator then is large too, because the estimated total is divided by the fixed size (total number of population units, $N_d$) of the ecoregion. However, in the ratio estimator the size of an ecoregion is estimated from the same sample, although we know its size, see Equation \@ref(eq:generalratiodomain). With many units selected from an ecoregion, the estimated size of that ecoregion, $\widehat{N}_d$, is also large. By dividing the large estimated total by the large estimated size, a more stable estimate of the mean of the domain is obtained. For quite a few ecoregions the standard deviations are very large, especially of the $\pi$ estimator. These are the ecoregions with very small average sample sizes. With simple random sampling, the expected sample size can simply be computed by $E[n] = n \; N_d/N$. In the following section, alternative estimators are described for these ecoregions with small expected sample sizes. To speed up the computations, I used a 5 km $\times$ 5 km subgrid of `grdAmazonia` in this sampling experiment.
```{r, echo = FALSE, eval = FALSE}
n <- 200
grdAmazonia <- read_rds(file = "results/grdAmazonia_5km.rds")
grdAmazonia$lnSWIR2 <- log(grdAmazonia$SWIR2)
mz_pop <- tapply(grdAmazonia$AGB, INDEX = grdAmazonia$Ecoregion, FUN = mean)
ecoregions <- sort(unique(grdAmazonia$Ecoregion))
N <- nrow(grdAmazonia)
N_eco <- tapply(grdAmazonia$AGB, INDEX = grdAmazonia$Ecoregion, FUN = length)
number_of_samples <- 1000
mx_eco_pop <- tapply(grdAmazonia$lnSWIR2, INDEX = grdAmazonia$Ecoregion, FUN = mean)
mX_eco_pop <- data.frame(Intercept = rep(1, length(mx_eco_pop)), lnSWIR2 = mx_eco_pop)
set.seed(314)
mz_eco_HT <- mz_eco_ratio <- mz_eco_regr <- mz_eco_synt <- mz_eco_extsynt <- matrix(nrow = length(ecoregions), ncol = number_of_samples, dimnames = list(ecoregions, seq(1:number_of_samples)))
for (i in 1:number_of_samples) {
print(i)
mypop <- grdAmazonia
units <- sample(nrow(mypop), size = n, replace = FALSE)
mysample <- mypop[units, c("AGB", "Ecoregion")]
# HT estimator
tz <- (N / n) * tapply(mysample$AGB, INDEX = mysample$Ecoregion, FUN = sum)
mz_eco_HT[names(tz), i] <- tz / N_eco[names(tz)]
# ratio estimator
mysample$N <- N
design_si <- svydesign(id = ~ 1, data = mysample, fpc = ~ N)
res <- svyby(~ AGB, by = ~ Ecoregion, design = design_si, FUN = svymean)
mz_eco_ratio[res$Ecoregion, i] <- res$AGB
# regression estimator
mypop$ind <- rep(0, nrow(mypop))
mypop$ind[units] <- 1
mypop$AGB[mypop$ind == 0] <- NA
res <- forestinventory::twophase(AGB ~ lnSWIR2, data = as.data.frame(mypop),
phase_id = list(phase.col = "ind", terrgrid.id = 1),
small_area = list(sa.col = "Ecoregion", areas = ecoregions, unbiased = TRUE),
psmall = TRUE, exhaustive = mX_eco_pop)$estimation
mz_eco_regr[, i] <- res$estimate
# synthetic estimator
res <- forestinventory::twophase(AGB ~ lnSWIR2, data = as.data.frame(mypop),
phase_id = list(phase.col = "ind", terrgrid.id = 1),
small_area = list(sa.col = "Ecoregion", areas = ecoregions, unbiased = FALSE),
psmall = FALSE, exhaustive = mX_eco_pop)$estimation
mz_eco_synt[, i] <- res$estimate
# extended synthetic estimator
res <- forestinventory::twophase(AGB ~ lnSWIR2, data = as.data.frame(mypop),
phase_id = list(phase.col = "ind", terrgrid.id = 1),
small_area = list(sa.col = "Ecoregion", areas = ecoregions, unbiased = TRUE),
psmall = FALSE, exhaustive = mX_eco_pop)$estimation
mz_eco_extsynt[, i] <- res$estimate
}
save(mz_eco_HT, mz_eco_ratio, mz_eco_regr, mz_eco_synt, mz_eco_extsynt, file = "results/MeanAGB_ecoregions.rda")
```
(ref:TableRatiovsHTEstimatesEcoregions) Standard deviations of 1,000 $\pi$ estimates (HT) and 1,000 ratio estimates (Ratio) of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia, with simple random sampling without replacement of size 200.
```{r TableRatiovsHTEstimatesEcoregions, echo = FALSE}
load(file = "results/MeanAGB_ecoregions.rda")
sd_mz_HT <- apply(mz_eco_HT, MARGIN = 1, FUN = var, na.rm = TRUE) %>% sqrt(.) %>% round(1)
sd_mz_ratio <- apply(mz_eco_ratio, MARGIN = 1, FUN = var, na.rm = TRUE) %>% sqrt(.) %>% round(1)
notNA <- apply(mz_eco_HT, MARGIN = 1, FUN = function(x) {
sum(!is.na(x))})
N_eco <- tapply(grdAmazonia$AGB, INDEX = grdAmazonia$Ecoregion, FUN = length)
N <- nrow(grdAmazonia)
En <- round(n * N_eco / N, 2)
nms <- names(sd_mz_HT)
names(sd_mz_HT) <- NULL
names(sd_mz_ratio) <- NULL
names(En) <- NULL
res <- data.frame(nms, sd_mz_HT, sd_mz_ratio, En)
rownames(res) <- NULL
knitr::kable(
res, caption = "(ref:TableRatiovsHTEstimatesEcoregions)",
col.names = c("Ecoregion", "HT", "Ratio", "n"),
booktabs = TRUE,
linesep = ""
) %>%
kable_classic() %>%
add_footnote("n: expected sample size.", notation = "none")
```
No covariates are used in the ratio estimator. If we wish to exploit covariates, the mean of a domain can be estimated best by the ratio of the regression estimate of the domain total and the estimated size of the domain:
\begin{equation}
\hat{\bar{z}}_{\text{ratio},d}= \frac{\hat{t}_{\text{regr},d}(z)}{\widehat{N}_d}\;.
\end{equation}
For a large domain with a reasonable sample size, the regression estimate can be computed from the data of that domain (Chapter \@ref(Modelassisted)). For small domains, also the data from outside these domains can be used to estimate the population regression coefficients. This is explained in Subsection \@ref(RegressionestimatorSmallDomain).
## Model-assisted estimators for small domains {#SmallDomainsModelAssisted}
When the domains are not well represented in the sample, the direct estimators from the previous section lead to large standard errors. In this situation, we may try to increase the precision by also using observations from outside the domain. If we have covariates related to the study variable, we may exploit this ancillary information by fitting a regression model relating the study variable to the covariates and using the fitted model to predict the study variable for all population units (nodes of discretisation grid), see Chapter \@ref(Modelassisted)\index{Model-assisted approach}. However, for a small domain\index{Small domain}, we may have too few sampled units in that domain to fit a separate regression model. The alternative then is to use the entire sample to estimate the regression coefficients, and to use this global regression model to estimate the means of the domains. This introduces a systematic error, a design-bias, in the estimator. However, this extra error is potentially outweighed by the reduction of the random error due to the use of the globally estimated regression coefficients. If one or more units are selected from a domain, the observations of the study variable of these units can be used to correct for the bias. This leads to the regression estimator for small domains\index{Regression estimator!for small domains}. In the absence of such data, the mean of the domain can still be estimated by the so-called synthetic estimator\index{Synthetic estimator}.
There are quite a few packages for model-assisted estimation of means of small areas, the **maSAE** package [@maSAE], the **JoSAE** package [@JoSAE], the **rsae** package [@rsae], and the **forestinventory** package [@Hill2021]. I use package **forestinventory** for model-assisted estimation (Subsections \@ref(RegressionestimatorSmallDomain) and \@ref(SyntheticestimatorSmallDomain)) and package **JoSAE** for model-based prediction of the means of small areas (Section \@ref(SmallAreaModelBased)).
### Regression estimator {#RegressionestimatorSmallDomain}
In the regression estimator, the potential bias due to the globally estimated regression coefficients can be eliminated by adding the $\pi$ estimator of the mean of the regression residuals to the mean of the predictions in the domain (compare with Equation \@ref(eq:GREG)) (@Mandallaz2007, @Mandallaz2013):
\begin{equation}
\hat{\bar{z}}_{\text{regr},d} = \frac{1}{N_d} \sum_{k=1}^{N_d} \mathbf{x}^{\mathrm{T}}_{dk} \hat{\mathbf{b}} + \frac{1}{N_d} \sum_{k \in \mathcal{S}_d} \frac{e_{dk}}{\pi_{dk}} = \bar{\mathbf{x}}_d^{\mathrm{T}} \hat{\mathbf{b}} + \frac{1}{N_d} \sum_{k \in \mathcal{S}_d} \frac{e_{dk}}{\pi_{dk}} \;,
(\#eq:regressionestimatorsmalldomain)
\end{equation}
with $\mathbf{x}_{dk}$ the vector with covariate values for unit $k$ in domain $d$, $\hat{\mathbf{b}}$ the vector with globally estimated regression coefficients, $e_{dk}$ the residual for unit $k$ in domain $d$, $\pi_{dk}$ the inclusion probability of that unit, and $\bar{\mathbf{x}}_d$ the mean of the covariates in domain $d$. Alternatively, the mean of the residuals in a domain is estimated by the ratio estimator:
\begin{equation}
\hat{\bar{z}}_{\text{regr},d} = \bar{\mathbf{x}}_d^{\mathrm{T}} \hat{\mathbf{b}} + \frac{1}{\widehat{N}_d} \sum_{k \in \mathcal{S}_d} \frac{e_{dk}}{\pi_{dk}} \;,
(\#eq:regressionestimatorsmalldomainratio)
\end{equation}
with $\widehat{N}_d$ the estimated size of domain $d$, see Equation \@ref(eq:generalratiodomain). The regression coefficients can be estimated by Equation \@ref(eq:EstimatorMultipleRegressionCoefficients). With simple random sampling, the second term in Equation \@ref(eq:regressionestimatorsmalldomainratio) is equal to the sample mean of the residuals, so that the estimator reduces to
\begin{equation}
\hat{\bar{z}}_{\text{regr},d}=\bar{\mathbf{x}}_d^{\mathrm{T}} \hat{\mathbf{b}} + \bar{e}_{\mathcal{S}_d}\;,
(\#eq:regressionestimatorsmalldomainSI)
\end{equation}
with $\bar{e}_{\mathcal{S}_d}$ the sample mean of the residuals in domain $d$.
A regression estimate can only be computed if we have at least one observation of the study variable in the domain $d$. The variance of the regression estimator of the mean for a small domain can be estimated by [@Hill2021]
\begin{equation}
\widehat{V}\!\left(\hat{\bar{z}}_{\text{regr},d}\right) = \bar{\mathbf{x}}_d^{\mathrm{T}} \widehat{\mathbf{C}}(\hat{\mathbf{b}}) \bar{\mathbf{x}}_d + \widehat{V}\!\left(\hat{\bar{e}}_d \right)\;,
(\#eq:Varregressionestimatorsmalldomain)
\end{equation}
with $\widehat{\mathbf{C}}(\hat{\mathbf{b}})$ the matrix with estimated sampling variances and covariances of the estimators of the regression coefficients. The first variance component is the contribution due to uncertainty about the regression coefficients, the second component accounts for the uncertainty about the mean of the residuals in the domain. For simple random sampling, the sampling variance of the $\pi$ estimator of the mean of the residuals in a domain can be estimated by the sample variance of the residuals in that domain divided by the sample size $n_d$. This variance estimator is presented in @Hill2021. If the domain is not used as a stratum and the domain mean of the residuals is estimated by the ratio estimator, the second variance component can be estimated by
\begin{equation}
\widehat{V}\!\left(\hat{\bar{e}}_{\text{ratio},d}\right) = \left(\frac{n}{n_{d}}\right)^{2}\cdot\frac{1}{n\,(n-1)}\sum_{k \in \mathcal{S}_d}(e_{dk}-\bar{e}_{\mathcal{S}_d})^{2} \;.
(\#eq:Varratioestimatorofmeanresidual)
\end{equation}
With simple random sampling, the sampling variances and covariances of the estimators of the regression coefficients can be estimated by (equation 2 in @Hill2021)
\begin{equation}
\widehat{\mathbf{C}}(\hat{\mathbf{b}}) = \frac{1}{n} \left( \sum_{k \in \mathcal{S}} \mathbf{x}_k \mathbf{x}^{\mathrm{T}}_k \right) ^{-1} \left( \frac{1}{n^2} \sum_{k \in \mathcal{S}} e_k^2 \mathbf{x}_k \mathbf{x}^{\mathrm{T}}_k \right) \frac{1}{n}\left(\sum_{k \in \mathcal{S}}^n \mathbf{x}_k \mathbf{x}^{\mathrm{T}}_k \right)^{-1} \;.
(\#eq:samplingVarregressioncoefficients)
\end{equation}
The sampling variances and covariances of the estimators of the population regression coefficients are not equal to the model-variances and covariances as obtained with multiple linear regression, using functions `lm` and `vcov`, see Section \@ref(GREG) and Chapter \@ref(Approaches).
Function `twophase` of package **forestinventory** [@Hill2021] can be used to compute the regression estimate for small domains and its standard error.
```{block2, type='rmdnote'}
The function name 'twophase' is somewhat confusing. It suggests that we have a large sample which is subsampled in a second phase, as described in Chapter \@ref(Twophase). This is not the case here. However, upon considering infinite populations, @Hill2021 treat the grid that discretises the infinite population as the first-phase sample. The sampling error introduced by this discretisation grid can then be accounted for. I ignore this sampling error, which will be very small anyway, because the number of grid cells is very large.
```
By assigning the domain means of the covariates to argument `exhaustive` of function `twophase` the sampling error of the first phase is ignored. Function `twophase` assumes simple random sampling (unless optional argument `cluster` is used). Note that for the unobserved population units (not selected units) the AGB values are changed into non-availables. In package **survey** also a function `twophase` is defined, for this reason the name of the package is made explicit by `forestinventory::twophase`. With argument `psmall = TRUE` and element `unbiased = TRUE` in the list `small_area` the regression estimate is computed. Log-transformed SWIR2 is used as a covariate in the regression estimator of the mean AGB of the ecoregions.
```{r, echo = FALSE}
mz_pop <- tapply(grdAmazonia$AGB, INDEX = grdAmazonia$Ecoregion, FUN = mean)
```
```{r}
library(forestinventory)
n <- 200
set.seed(314)
units <- sample(nrow(grdAmazonia), size = n, replace = FALSE)
mysample <- grdAmazonia[units, c("AGB", "SWIR2", "Ecoregion")]
grdAmazonia <- grdAmazonia %>%
mutate(lnSWIR2 = log(SWIR2),
id = row_number(),
ind = as.integer(id %in% units))
grdAmazonia$AGB[grdAmazonia$ind == 0L] <- NA
mx_eco_pop <- tapply(
grdAmazonia$lnSWIR2, INDEX = grdAmazonia$Ecoregion, FUN = mean)
mX_eco_pop <- data.frame(
Intercept = rep(1, length(mx_eco_pop)), lnSWIR2 = mx_eco_pop)
res <- forestinventory::twophase(AGB ~ lnSWIR2,
data = as.data.frame(grdAmazonia),
phase_id = list(phase.col = "ind", terrgrid.id = 1),
small_area = list(sa.col = "Ecoregion",
areas = sort(unique(mysample$Ecoregion)),
unbiased = TRUE),
psmall = TRUE, exhaustive = mX_eco_pop)
regr <- res$estimation
```
(ref:TableRegressionEstimatesEcoregionslabel) Regression estimates of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia, for simple random sample without replacement of size 200, using lnSWIR2 as a predictor.
```{r TableRegressionEstimatesEcoregions, echo = FALSE}
df <- regr[c("area", "estimate", "ext_variance", "g_variance", "n2G")]
df$estimate <- round(df$estimate, 1)
df$ext_variance <- formatC(df$ext_variance, 1, format = "f", big.mark = ",")
df$g_variance <- formatC(df$g_variance, 1, format = "f", big.mark = ",")
knitr::kable(
df, caption = "(ref:TableRegressionEstimatesEcoregionslabel)",
booktabs = TRUE,
col.names = c("Ecoregion", "AGB", "ext_var", "g_var", "n2G"),
linesep = "",
) %>%
kable_classic() %>%
add_footnote(" For explanation of variances of regression estimator, see text. n2G: sample size of ecoregion; NA: not available; NaN: not a number.", notation = "none")
```
The alternative is to save the selected units (sample) in a data frame, passed to function `twophase` with argument `data`. The results are identical because the true means of the covariate $x$ specified with argument `exhaustive` contains all required information at the population level.
For two ecoregions, no regression estimate of the mean AGB is obtained (Table \@ref(tab:TableRegressionEstimatesEcoregions)). No units are selected from these domains. The estimated variance of the estimated domain mean is in the column g_var. In the estimated variance ext_var the first variance component of Equation \@ref(eq:Varregressionestimatorsmalldomain) is ignored. Note that for the ecoregions with a sample size of one unit (the sample size per domain is in column `n2G`), no estimate of the variance is available, because the variance of the estimated mean of the residuals cannot be estimated from one unit.
Figure \@ref(fig:RatioversusRegrEcoregions) shows the regression estimates plotted against the ratio estimates. The intercept of the line, fitted with ordinary least squares (OLS), is larger than 0, and the slope is smaller than 1. Using the regression model predictions in the estimation of the means leads to some smoothing.
(ref:RatioversusRegrEcoregionslabel) Scatter plot of the ratio and the regression estimates of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia for simple random sample without replacement of size 200. In the regression estimate, lnSWIR2 is used as a predictor. The line is fitted by ordinary least squares.
```{r RatioversusRegrEcoregions, echo = FALSE, fig.width = 5, fig.cap = "(ref:RatioversusRegrEcoregionslabel)"}
mysample$N <- nrow(grdAmazonia)
design_si <- svydesign(id = ~ 1, data = mysample, fpc = ~ N)
res <- svyby(~ AGB, by = ~ Ecoregion, design = design_si, FUN = svymean)
names(regr)[1] <- "Ecoregion"
df <- merge(res, regr, by = "Ecoregion")
slr <- lm(estimate ~ AGB, data = df)
ab <- coef(slr)
ggplot(data = df) +
geom_point(mapping = aes(x = AGB, y = estimate), size = 2) +
geom_abline(intercept = ab[1], slope = ab[2]) +
scale_x_continuous(name = "Ratio estimate", limits = c(50, 300)) +
scale_y_continuous(name = "Regression estimate", limits = c(50, 300)) +
coord_fixed()
```
I quantified the gain in precision of the estimated mean AGB due to the use of the regression model by the variance of the ratio estimator divided by the variance of the regression estimator (Table \@ref(tab:TableGainRegressionestimator)). For ratios larger than 1, there is a gain in precision. Both variances are estimated from 1,000 repeated ratio and regression estimates obtained with simple random sampling without replacement of size 200. For all but two small ecoregions, there is a gain. For quite a few ecoregions, the gain is quite large. These are the ecoregions where the globally fitted regression model explains a large part of the spatial variation of AGB.
```{r TableGainRegressionestimator, echo = FALSE}
load(file = "results/MeanAGB_ecoregions.rda")
ecoregions <- sort(unique(grdAmazonia$Ecoregion))
v_ratio <- apply(mz_eco_ratio, MARGIN = 1, FUN = var, na.rm = TRUE)
v_regr <- apply(mz_eco_regr, MARGIN = 1, FUN = var, na.rm = TRUE)
gain <- round(v_ratio / v_regr, 2)
df <- data.frame(ecoregions, gain = gain)
rownames(df) <- NULL
knitr::kable(
df, caption = "Estimated gain in precision of the estimated mean AGB of ecoregions in Eastern Amazonia, as quantified by the ratio of the estimated variance of the ratio estimator (no covariate used) to the estimated variance of the regression estimator (using lnSWIR2 as a predictor), for simple random sampling without replacement of size 200.",
booktabs = TRUE,
col.names = c("Ecoregion", "Gain"),
linesep = ""
) %>%
kable_classic()
```
### Synthetic estimator {#SyntheticestimatorSmallDomain}
For small domains from which no units are selected, the mean can still be estimated by the synthetic estimator\index{Synthetic estimator}, also referred to as the synthetic regression estimator, by dropping the second term in Equation \@ref(eq:regressionestimatorsmalldomain):
\begin{equation}
\hat{\bar{z}}_{\text{syn},d}=\bar{\mathbf{x}}_d^{\mathrm{T}} \hat{\mathbf{b}}\;.
(\#eq:syntheticestimatorsmalldomain)
\end{equation}
The variance can be estimated by
\begin{equation}
\widehat{V}\!\left(\hat{\bar{z}}_{\text{syn},d}\right) = \bar{\mathbf{x}}_d^{\mathrm{T}} \widehat{\mathbf{C}}(\hat{\mathbf{b}}) \bar{\mathbf{x}}_d \;.
(\#eq:Varsyntheticestimatorsmalldomain)
\end{equation}
This is equal to the first variance component of Equation \@ref(eq:Varregressionestimatorsmalldomain).
The synthetic estimate can be computed with function `twophase`, with argument `psmall = FALSE` and element `unbiased = FALSE` in the list `small_area`.
```{r}
res <- forestinventory::twophase(AGB ~ lnSWIR2,
data = as.data.frame(grdAmazonia),
phase_id = list(phase.col = "ind", terrgrid.id = 1),
small_area = list(sa.col = "Ecoregion", areas = ecoregions, unbiased = FALSE),
psmall = FALSE, exhaustive = mX_eco_pop)
synt <- res$estimation
```
(ref:TableSyntheticEstimateslabel) Synthetic estimates of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia, for simple random sample without replacement of size 200, using lnSWIR2 as a predictor.
```{r TableSyntheticEstimates, echo = FALSE}
df <- synt[c("area", "estimate", "g_variance", "n2G")]
df[, c(2, 3)] <- round(df[, c(2, 3)], 1)
knitr::kable(
df, caption = "(ref:TableSyntheticEstimateslabel)",
booktabs = TRUE,
col.names = c("Ecoregion", "AGB", "g_var", "n2G"),
linesep = ""
) %>%
kable_classic() %>%
add_footnote(" n2G: sample size of ecoregion.", notation = "none")
```
```{r, echo = FALSE}
names(synt)[1] <- "Ecoregion"
df <- merge(synt, regr, by = "Ecoregion")
d_estimates <- df$estimate.x - df$estimate.y
md <- mean(d_estimates)
d_vars <- df$g_variance.x - df$g_variance.y
```
For all ecoregions, also the unsampled ones, a synthetic estimate of the mean AGB is obtained (Table \@ref(tab:TableSyntheticEstimates)). For the sampled ecoregions, the synthetic estimate differs from the regression estimate. This difference can be quite large for ecoregions with a small sample size. Averaged over all sampled ecoregions, the difference, computed as synthetic estimate minus regression estimate, equals `r formatC(md, 1, format = "f")` 10^9^ kg ha^-1^. The variance of the regression estimator is always much larger than the variance of the synthetic estimator. The difference is the variance of the estimator of the domain mean of the residuals. However, recall that the regression estimator is design-unbiased, whereas the synthetic estimator is not. A more fair comparison is on the basis of the root mean squared error (RMSE) (Table \@ref(tab:tableRMSEs)). For the regression estimator, the RMSE is equal to its standard error and therefore not shown in the table.
```{r, echo = FALSE}
load(file = "results/MeanAGB_ecoregions.rda")
v_mz_eco_regr <- apply(mz_eco_regr, MARGIN = 1, FUN = var, na.rm = TRUE)
v_mz_eco_synt <- apply(mz_eco_synt, MARGIN = 1, FUN = var, na.rm = TRUE)
m_mz_eco_synt <- apply(mz_eco_synt, MARGIN = 1, FUN = mean, na.rm = TRUE)
bias_synt <- m_mz_eco_synt - mz_pop
RMSE_synt <- sqrt(v_mz_eco_synt + bias_synt^2)
RMSE_regr <- sqrt(v_mz_eco_regr)
dRMSE <- RMSE_regr - RMSE_synt
```
```{r tableRMSEs, echo = FALSE}
ecoregions_short <- ecoregions
ecoregions_short[c(1, 16)] <- c("Amazon-Orinoco Carib. mangroves", "Xingu-Toc.-Arag. moist forests")
df <- data.frame(Ecoregion = ecoregions_short, se.reg = round(sqrt(v_mz_eco_regr), 1), se.syn = round(sqrt(v_mz_eco_synt), 1), bias.syn = round(bias_synt, 1), RMSE.syn = round(RMSE_synt, 1))
rownames(df) <- NULL
knitr::kable(
df, caption = "Estimated standard error (se), bias, and root mean squared error (RMSE) of the regression estimator (reg) and the synthetic estimator (syn) of the mean AGB of ecoregions in Eastern Amazonia. The regression estimator is design-unbiased, so the RMSE of the regression estimator is equal to its standard error.",
booktabs = TRUE,
col.names = c("Ecoregion", "RMSE reg", "se syn", "bias syn", "RMSE syn"),
linesep = ""
) %>%
kable_classic()
```
In the synthetic estimator and the regression estimator, both quantitative covariates and categorical variables can be used. If one or more categorical variables are included in the estimator, the variable names in the data frame with the true means of the ancillary variables per domain, specified with argument `exhaustive`, must correspond to the column names of the design matrix that is generated with function `lm`, see Subsection \@ref(RegressionEstimatorSTSI).
## Model-based prediction {#SmallAreaModelBased}
The alternative for design-based and model-assisted estimation of the means or totals of small domains is model-based prediction. The fundamental difference between model-assisted estimation and model-based prediction is explained in Chapter \@ref(Approaches). The models used in this section are linear mixed models\index{Linear mixed model}. In a linear mixed model, the mean of the study variable is modelled as a linear combination of covariates, similar to a linear regression model. The difference with a linear regression model is that the residuals of the mean are not assumed independent. The dependency of the residuals is also modelled. Two types of linear mixed model are described: a random intercept model and a geostatistical model.
### Random intercept model
A basic linear mixed model that can be used for model-based prediction of means of small domains is the random intercept model\index{Random intercept model}:
\begin{equation}
\begin{split}
Z_{dk} &= \mathbf{x}_{dk}^{\text{T}} \pmb{\beta} + v_d + \epsilon_{dk} \\
v_d &\sim \mathcal{N}(0,\sigma^2_v) \\
\epsilon_{dk} &\sim \mathcal{N}(0,\sigma^2_{\epsilon}) \;.
\end{split}
(\#eq:RandomInterceptModel)
\end{equation}
Two random variables are now involved, both with a normal distribution with mean zero: $v_d$, a random intercept at the domain level with variance $\sigma^2_v$, and the residuals $\epsilon_{dk}$ at the unit level with variance $\sigma^2_{\epsilon}$. The variance $\sigma^2_v$ can be interpreted as a measure of the heterogeneity among the domains after accounting for the fixed effect [@Breidenbach2012]. With this model, the mean of a domain can be predicted by
\begin{equation}
\hat{\bar{z}}_{\text{mb},d} = \bar{\mathbf{x}}_d^{\mathrm{T}} \hat{\pmb{\beta}} + \hat{v}_d \;,
(\#eq:mbpredictordomainmeanrandomintercept)
\end{equation}
with $\hat{\pmb{\beta}}$ the best linear unbiased estimates (BLUE) of the regression coefficients and $\hat{v}_d$ the best linear unbiased prediction (BLUP) of the intercept for domain $d$, $v_d$. The model-based predictor can also be written as
\begin{equation}
\hat{\bar{z}}_{\text{mb},d} = \bar{\mathbf{x}}_d^{\mathrm{T}} \hat{\pmb{\beta}} + \lambda_d \left( \frac{1}{n_d }\sum_{k \in \mathcal{S}_d} \epsilon_{dk} \right) \;,
(\#eq:mbpredictordomainmeanrandomintercept2)
\end{equation}
with $\lambda_d$ a weight for the second term that corrects for the bias of the synthetic estimator. This weight is computed by
\begin{equation}
\lambda_d = \frac{\hat{\sigma}^2_v}{\hat{\sigma}^2_v + \hat{\sigma}^2_{\epsilon}/n_d}\;.
(\#eq:weightrandomintercept)
\end{equation}
This equation shows that the larger the estimated residual variance $\hat{\sigma}^2_{\epsilon}$, the smaller the weight for the bias correction factor, and the larger the sample size $n_d$, the larger the weight. Comparing Equations \@ref(eq:mbpredictordomainmeanrandomintercept) and \@ref(eq:mbpredictordomainmeanrandomintercept2) shows that the random intercept of a domain is predicted by the sample mean of the residuals of that domain, multiplied by a weight factor computed by Equation \@ref(eq:weightrandomintercept).
The means of the small domains can be computed with function `eblup.mse.f.wrap` of package **JoSAE** [@JoSAE]. It requires as input a linear mixed model generated with function `lme` of package **nlme** [@nlme]. The simple random sample of size 200 selected before is used to fit the linear mixed model, with lnSWIR2 as a fixed effect, i.e., the effect of lnSWIR2 on the mean AGB. The random effect is added by assigning another formula to argument `random`. The formula `~ 1 | Ecoregion` means that the intercept is treated as a random variable and that it varies among the ecoregions. This linear mixed model is referred to as a random intercept model: the intercepts are allowed to differ among the small domains, whereas the effects of the covariates, lnSWIR2 in our case, is equal for all domains.
Tibble `grdAmazonia` is converted to a `data.frame` to avoid problems with function `eblup.mse.f.wrap` hereafter. A simple random sample with replacement of size 200 is selected.
```{r}
grdAmazonia <- grdAmazonia %>%
as.data.frame() %>%
mutate(x1 = x1 /1000,
x2 = x2 / 1000)
set.seed(314)
mysample <- grdAmazonia %>%
dplyr::select(x1, x2, AGB, lnSWIR2, Ecoregion) %>%
slice_sample(n = n, replace = TRUE)
```
```{r}
library(nlme)
library(JoSAE)
lmm_AGB <- lme(fixed = AGB ~ lnSWIR2, data = mysample, random = ~ 1 | Ecoregion)
```
The fixed effects\index{Fixed effect} of the linear mixed model can be extracted with function `fixed.effects`.
```{r}
fixed_lmm <- fixed.effects(lmm_AGB)
```
The fixed effects of the linear mixed model differ somewhat from the fixed effects in the simple linear regression model (fixed_lm):
```{r, echo = FALSE}
slm <- lm(AGB ~ lnSWIR2, data = mysample)
ab.lm <- coef(slm)
(df <- data.frame(fixed_lm = ab.lm, fixed_lmm = fixed_lmm))
```
The random effect\index{Random effect} can be extracted with function `random.effects`.
```{r}
random.effects(lmm_AGB)
```
The random intercepts are added to the fixed intercept; the coefficient of lnSWIR2 is the same for all ecoregions:
```{r}
coef(lmm_AGB)
```
The fitted model can now be used to predict the means of the ecoregions as follows. As a first step, a data frame must be defined, with the size and the population mean of the covariate lnSWIR2 per domain. This data frame is passed to function `eblup.mse.f.wrap` with argument `domain.data`. This function computes the model-based prediction, as well as the regression estimator (Equation \@ref(eq:regressionestimatorsmalldomain)) and the synthetic estimator (Equation \@ref(eq:syntheticestimatorsmalldomain)) and their variances. The model-based predictor is the variable `EBLUP` in the output data frame. For the model-based predictor, two standard errors are computed, see @Breidenbach2012 for details.
```{r}
N_eco <- tapply(grdAmazonia$AGB, INDEX = grdAmazonia$Ecoregion, FUN = length)
df_eco <- data.frame(Ecoregion = ecoregions, N = N_eco, lnSWIR2 = mx_eco_pop)
res <- eblup.mse.f.wrap(domain.data = df_eco, lme.obj = lmm_AGB)
df <- data.frame(Ecoregion = res$domain.ID, mb = res$EBLUP,
se.1 = res$EBLUP.se.1, se.2 = res$EBLUP.se.2)
```
Table \@ref(tab:TableRandomInterceptModelEstimates) shows the model-based predictions and the estimated standard errors of the mean AGB of the ecoregions, obtained with the random intercept model.
(ref:TableRandomInterceptModelEstimateslabel) Model-based predictions of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia, for simple random sample without replacement of size 200, obtained with the random intercept model and lnSWIR2 as a predictor.
```{r TableRandomInterceptModelEstimates, echo = FALSE}
df[, c(2, 3, 4)] <- round(df[, c(2, 3, 4)], 1)
knitr::kable(
df, caption = "(ref:TableRandomInterceptModelEstimateslabel)",
booktabs = TRUE,
col.names = c("Ecoregion", "AGB", "se.1", "se.2"),
linesep = ""
) %>%
kable_classic() %>%
add_footnote(" se.1 and se.2 are standard errors, for explanation see text.", notation = "none")
```
Note that with this model no predictions of the mean AGB are obtained for the unsampled ecoregions. This is because the random intercept $v_d$ cannot be predicted in the absence of data, see Equations \@ref(eq:mbpredictordomainmeanrandomintercept) and \@ref(eq:mbpredictordomainmeanrandomintercept2).
### Geostatistical model
In a geostatistical model, there is only one random variable, the residual of the model-mean, not two random variables as in the random intercept model. See Equation \@ref(eq:OKmodel) for a geostatistical model with a constant mean and Equation \@ref(eq:KEDmodel2) for a model with a mean that is a linear combination of covariates. In a geostatistical model, the covariance of the residuals of the mean at two locations is modelled as a function of the distance (and direction) of the points. Instead of the covariance, often the semivariance is modelled, i.e., half the variance of the difference of the residuals at two locations, see Chapter \@ref(Introkriging) for details.
The simple random sample of size 200 selected before is used to estimate the regression coefficients for the mean, an intercept, and a slope coefficient for lnSWIR2, and besides the parameters of a spherical semivariogram model for the residuals of the mean. The two regression coefficients and the three semivariogram parameters are estimated by restricted maximum likelihood (REML), see Subsection \@ref(REML). This estimation procedure is also used in function `lme` to fit the random intercept model. Here, function `likfit` of package **geoR** [@geoR] is used to estimate the model parameters. First, a geoR object must be generated with function `as.geodata`.
```{r}
library(geoR)
dGeoR <- as.geodata(mysample, header = TRUE,
coords.col = c("x1", "x2"), data.col = "AGB", covar.col = "lnSWIR2")
vgm_REML <- likfit(geodata = dGeoR, trend = ~ lnSWIR2,
cov.model = "spherical",
ini.cov.pars = c(600, 600), nugget = 1500,
lik.method = "REML", messages = FALSE)
```
The estimated intercept and slope are `r formatC(vgm_REML$beta[1], 0, format = "f", big.mark = ",")` and `r formatC(vgm_REML$beta[2], 1, format = "f")`, respectively. The estimated semivariogram parameters are `r formatC(vgm_REML$tausq, 0, format = "f", big.mark = ",")` (10^9^ kg ha^-1^)^2^, `r formatC(vgm_REML$sigmasq, 0, format = "f", big.mark = ",")` (10^9^ kg ha^-1^)^2^, and `r formatC(vgm_REML$phi, 0, format = "f")` km for the nugget, partial sill, and range, respectively. These model parameters are used to predict AGB for all units in the population, using function `krige` of package **gstat** [@peb04]. The REML estimates of the semivariogram parameters are passed to function `vgm` with arguments `nugget`, `psill`, and `range`. The coordinates of the sample are shifted to a random point within a 1 km $\times$ 1 km grid cell. This is done to avoid the coincidence of a sampling point and a prediction point, which leads to an error message when predicting AGB at the nodes of the grid.
```{r, eval = FALSE}
library(gstat)
mysample$x1 <- jitter(mysample$x1, amount = 0.5)
mysample$x2 <- jitter(mysample$x2, amount = 0.5)
coordinates(mysample) <- ~ x1 + x2
vgm_REML_gstat <- vgm(model = "Sph",
nugget = vgm_REML$nugget, psill = vgm_REML$sigmasq, range = vgm_REML$phi)
coordinates(grdAmazonia) <- ~ x1 + x2
predictions <- krige(
formula = AGB ~ lnSWIR2,
locations = mysample,
newdata = grdAmazonia,
model = vgm_REML_gstat,
debug.level = 0) %>% as("data.frame")
```
```{r, eval = FALSE, echo = FALSE}
save(predictions, file = "results/ModelBasedPredictionsAGB_Amazonia.rda")
```
The first six rows of `predictions` are shown below.
```{r, echo = FALSE}
load(file = "results/ModelBasedPredictionsAGB_Amazonia.rda")
head(predictions)
```
Besides a prediction (variable `var1.pred`), for every population unit the variance of the prediction error is computed (`var1.var`). The unitwise predictions can be averaged across all units of an ecoregion to obtain a model-based prediction of the mean of that ecoregion.
```{r}
AGBpred_unit <- predictions$var1.pred
mz_eco_mb <- tapply(AGBpred_unit, INDEX = grdAmazonia$Ecoregion,
FUN = mean) %>% round(1)
```
A difficulty is the computation of the standard error of these model-based predictions of the ecoregion mean. We cannot simply sum the unitwise variances and divide the sum by the squared number of units, because the prediction errors of units with a mutual distance smaller than the estimated range of the spherical semivariogram are correlated. A straightforward approach to obtain the standard error of the predicted mean is geostatistical simulation\index{Geostatistical simulation}. A large number of maps are simulated, conditional on the selected sample. For an infinite number of maps, the "average map", i.e., the map obtained by averaging for each unit all simulated values of that unit, is equal to the map with predicted AGB. For each simulated map, the average of the simulated values across all units of an ecoregion is computed. This results in as many averages as we have simulated maps. The variance of the averages of an ecoregion is an estimate of the variance of the predicted mean of that ecoregion. To reduce computing time, a 5 km $\times$ 5 km subgrid of `grdAmazonia` is used in the geostatistical simulation.
```{r, eval = FALSE}
grdAmazonia <- read_rds(file = "results/grdAmazonia_5km.rds")
grdAmazonia <- grdAmazonia %>%
mutate(x1 = x1 /1000, x2 = x2 / 1000, lnSWIR2 = log(SWIR2))
nsim <- 1000
coordinates(grdAmazonia) <- ~ x1 + x2
simulations <- krige(
formula = AGB ~ lnSWIR2,
locations = mysample,
newdata = grdAmazonia,
model = vgm_REML_gstat,
nmax = 100, nsim = nsim,
debug.level = 0) %>% as("data.frame")
grdAmazonia <- as_tibble(grdAmazonia)
AGBsim_eco <- matrix(nrow = length(ecoregions), ncol = nsim)
for (i in 1:nsim) {
AGBsim_eco[, i] <- tapply(simulations[, i + 2],
INDEX = grdAmazonia$Ecoregion, FUN = mean)
}
```
```{r, eval = FALSE, echo = FALSE}
save(AGBsim_eco, file = "results/AGBsimulated_ecoregions.rda")
```
```{r, echo = FALSE}
load(file = "results/AGBsimulated_ecoregions.rda")
se_mz_eco_mb <- round(sqrt(apply(AGBsim_eco, MARGIN = 1, FUN = var)), 1)
```
(ref:TableGeostatisticalModelEstimateslabel) Model-based predictions of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia, using a simple random sample without replacement of size 200, obtained with the geostatistical model and lnSWIR2 as a predictor for the mean.
```{r TableGeostatisticalModelEstimates, echo = FALSE}
df <- data.frame(Ecoregion = ecoregions, mz_eco_mb, se_mz_eco_mb)
df[, c(2, 3)] <- round(df[, c(2, 3)], 1)
rownames(df) <- NULL
knitr::kable(
df, caption = "(ref:TableGeostatisticalModelEstimateslabel)",
booktabs = TRUE,
col.names = c("Ecoregion", "AGB", "se"),
linesep = ""
) %>%
kable_classic() %>%
add_footnote(" se: standard error of predicted mean.", notation = "none")
```
Similar to the synthetic estimator, for all ecoregions an estimate of the mean AGB is obtained, also for the unsampled ecoregions (Table \@ref(tab:TableGeostatisticalModelEstimates)). The model-based prediction is strongly correlated with the synthetic estimate (Figure \@ref(fig:MBvsSynt)).
(ref:MBvsSyntlabel) Scatter plot of the model-based prediction and the synthetic estimate of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia. The solid line is the 1:1 line.
```{r MBvsSynt, echo = FALSE, fig.width = 5, fig.cap = "(ref:MBvsSyntlabel)"}
df <- data.frame(ecoregions, mb = mz_eco_mb, se.mb = se_mz_eco_mb, synt = synt$estimate, se.synt = sqrt(synt$g_variance))
ggplot(data = df) +
geom_point(mapping = aes(x = synt, y = mb), size = 2) +
geom_abline(intercept = 0, slope = 1) +
scale_x_continuous(name = "Synthetic estimate", limits = c(50, 300)) +
scale_y_continuous(name = "Model-based prediction", limits = c(50, 300)) +
coord_fixed()
```
The most striking difference is the standard error. The standard errors of the synthetic estimator range from 3.7 to 7.1 (Table \@ref(tab:tableRMSEs)), whereas the standard errors of the geostatistical predictions range from 5.9 to 15.9. However, these two standard errors are fundamentally different and should not be compared. The standard error of the synthetic estimator is a *sampling* standard error, i.e., it quantifies the variation of the estimated mean of an ecoregion over repeated random sampling with the sampling design, in this case simple random sampling of 200 units. The model-based standard error is not a sampling standard error but a model standard error, which expresses our uncertainty about the means of the domains due to our imperfect knowledge of the spatial variation of AGB. Given the observations of AGB at the selected sample, the map with the covariate lnSWIR2, and the estimated semivariogram model parameters, we are uncertain about the exact value of AGB at unsampled units. No samples are considered other than the one actually selected. For the fundamental difference between design-based, model-assisted, and model-based estimates of means, refer to Section \@ref(DBvsMB) and Chapter \@ref(Approaches).
It makes more sense to compare the two model-based predictions, the random intercept model predictions and the geostatistical predictions, and their standard errors. Figure \@ref(fig:MBvsMB) shows that the two model-based predictions are very similar.
(ref:MBvsMBlabel) Scatter plot of model-based predictions of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia, obtained with the random intercept model and the geostatistical model. The solid line is the 1:1 line.
```{r MBvsMB, echo = FALSE, fig.width = 5, fig.cap = "(ref:MBvsMBlabel)"}
ecos_in_sam <- sort(unique(mysample$Ecoregion))
res <- eblup.mse.f.wrap(domain.data = df_eco, lme.obj = lmm_AGB)
TF <- (ecoregions %in% ecos_in_sam)
n_eco <- tapply(mysample$AGB, INDEX = mysample$Ecoregion, FUN = length)
df <- data.frame(n_eco, mb.geo = mz_eco_mb[TF], semb.geo = se_mz_eco_mb[TF], mb.rint = res$EBLUP, semb.rint = res$EBLUP.se.1)
ggplot(data = df) +
geom_point(mapping = aes(x = mb.rint, y = mb.geo), size = 2) +
geom_abline(intercept = 0, slope = 1) +
scale_x_continuous(name = "Random intercept model prediction ", limits = c(50, 300)) +
scale_y_continuous(name = "Geostatistical prediction", limits = c(50, 300)) +
coord_fixed()
```
For six ecoregions, the standard errors of the geostatistical model predictions are much smaller than those of the random intercept model predictions (Figure \@ref(fig:MBvsMBse)). These are ecoregions with small sample sizes.
(ref:MBvsMBselabel) Scatter plot of the estimated standard error of model-based predictions of the mean AGB (10^9^ kg ha^-1^) of ecoregions obtained with the random intercept model and the geostatistical model, using a simple random sample without replacement of size 200 from Eastern Amazonia. The numbers refer to the number of sampling units in an ecoregion. The solid line is the 1:1 line.
```{r MBvsMBse, echo = FALSE, fig.width = 5, fig.cap = "(ref:MBvsMBselabel)"}
ggplot(data = df) +
geom_point(mapping = aes(x = semb.rint, y = semb.geo), size = 2) +
geom_text(mapping = aes(x = semb.rint, y = semb.geo, label = n_eco), nudge_x = 1) +
geom_abline(intercept = 0, slope = 1) +
scale_x_continuous(name = "Standard error random intercept model prediction ", limits = c(5, 40)) +
scale_y_continuous(name = "Standard error geostatistical prediction", limits = c(5, 40)) +
coord_fixed()
```
```{block2, type = 'rmdnote'}
If a different semivariogram model were used, both the predicted means per ecoregion and the standard errors would be different. Especially the variance is sensitive to the semivariogram. For this reason, the model-based predictions are also referred to as model-dependent predictions\index{Model-dependent predictor}, see Chapter \@ref(Approaches).
```
## Supplemental probability sampling of small domains
The sample size in small domains of interest can be so small that no reliable statistical estimate of the mean or total of these domains can be obtained. In this case, we may decide to collect a supplemental sample\index{Supplemental sample} from these domains. It is convenient to use these domains as strata in supplemental probability sampling, so that we can control the sample sizes in the strata. If we can safely assume that the study variable at the units of the first sample are not changed, there is no need to revisit these units; otherwise, we must revisit them to observe the current values.
There are two approaches for using the two probability samples to estimate the population mean or total of a small domain [@Grafstrom2019]. In the first approach, the two samples are combined, and then the merged sample is used to estimate the population mean or total. In the second approach, the samples are not combined, but the two estimates from the separate samples. In this section only the first approach is illustrated with a simple situation in which the two samples are easily combined. Refer to @Grafstrom2019 for a more general approach of how multiple probability samples can be combined.
Suppose that the original sample is a simple random sample from the entire study area. A supplemental sample is selected from small domains, i.e., domains that have few selected units only. For a given small domain, the first sample is supplemented by selecting a simple random sample from the units not yet selected in the first sample. The size of the supplemental sample of a domain depends on the number of units of that domain in the first sample. The first sample is supplemented so that the total sample size of that domain is fixed. In this case, the combined sample of a domain is a simple random sample from that domain, so that the usual estimators for simple random sampling can be used to estimate the domain mean or total and its standard error.
This sampling strategy is illustrated with Eastern Amazonia. A simple random sample without replacement of 400 units is selected.
```{r, echo = FALSE}
rm(grdAmazonia)
```
```{r}
grdAmazonia$Biome <- as.factor(grdAmazonia$Biome)
biomes <- c("Mangrove", "Forest_dry", "Grassland", "Forest_moist")
levels(grdAmazonia$Biome) <- biomes
n1 <- 400
set.seed(123)
units_1 <- sample(nrow(grdAmazonia), size = n1, replace = FALSE)
mysample_1 <- grdAmazonia[units_1, c("AGB", "Biome")]
print(n1_biome <- table(mysample_1$Biome))
```
The selected units are removed from the sampling frame. For each of the three small biomes, Mangrove, Forest_dry, and Grassland, the size of the supplemental sample is computed so that the total sample size becomes 40. The supplemental sample is selected by stratified simple random sampling without replacement, using the small biomes as strata (Chapter \@ref(STSI)).
```{r}
units_notselected <- grdAmazonia[-units_1, ]
Biomes_NFM <- units_notselected[units_notselected$Biome != "Forest_moist", ]
n_biome <- 40
n2_biome <- rep(n_biome, 3) - n1_biome[-4]
ord <- unique(Biomes_NFM$Biome)
units_2 <- sampling::strata(Biomes_NFM, stratanames = "Biome",
size = n2_biome[ord], method = "srswor")
mysample_2 <- getdata(Biomes_NFM, units_2)
mysample_2 <- mysample_2[c("AGB", "Biome")]
```
The two samples are merged, and the means of the domains are estimated by the sample means.
```{r}
mysample <- rbind(mysample_1, mysample_2)
print(mz_biome <- tapply(mysample$AGB, INDEX = mysample$Biome, FUN = mean))
```
Finally, the standard error is estimated, accounting for sampling without replacement from a finite population (Equation \@ref(eq:EstVarMeanSI)).
```{r}
N_biome <- table(grdAmazonia$Biome)
fpc <- (1 - n_biome / N_biome)
S2z_biome <- tapply(mysample$AGB, INDEX = mysample$Biome, FUN = var)
print(se_mz_biome <- sqrt(fpc * (S2z_biome / n_biome)))
```
```{r, echo = FALSE, eval = FALSE}
mz_biome <- v_mz_biome <- matrix(nrow = 10000, ncol = 4)
set.seed(314)
for (i in 1:10000) {
units_1 <- sample(nrow(grdAmazonia), size = n1, replace = FALSE)
mysample_1 <- grdAmazonia[units_1, c("AGB", "Biome")]
n1_biome <- table(mysample_1$Biome)
units_notselected <- grdAmazonia[-units_1, ]
Biomes.NFM <- units_notselected[units_notselected$Biome != "Forest_moist", ]
n_biome <- 40
n2_biome <- rep(n_biome, 3) - n1_biome[-4]
if (sum(n2_biome < 1) > 0) {
next
} else {
ord <- unique(Biomes.NFM$Biome)
units_2 <- sampling::strata(Biomes.NFM, stratanames = "Biome", size = n2_biome[ord], method = "srswor")
mysample_2 <- getdata(Biomes.NFM, units_2)
mysample_2 <- mysample_2[c("AGB", "Biome")]
mysample <- rbin_biome(mysample_1, mysample_2)
mz_biome[i, ] <- tapply(mysample$AGB, INDEX = mysample$Biome, FUN = mean)
N_biome <- table(grdAmazonia$Biome)
fpc <- (1 - n_biome / N_biome)
S2z_biome <- tapply(mysample$AGB, INDEX = mysample$Biome, FUN = var)
v_mz_biome[i, ] <- fpc * (S2z_biome / n_biome)
}
}
save(mz_biome, v_mz_biome, file = "results/SupplementalSampleEstimates_Amazonia.rda")
```
```{r, echo = FALSE}
load(file = "results/SupplementalSampleEstimates_Amazonia.rda")
nas <- apply(mz_biome, MARGIN = 2, FUN = mean, na.rm = TRUE)
Ep_mz_biome <- apply(mz_biome, MARGIN = 2, FUN = mean, na.rm = TRUE)
Vp_mz_biome <- apply(mz_biome, MARGIN = 2, FUN = var, na.rm = TRUE)
se_mz_biome <- sqrt(v_mz_biome)
Ep_se_mz_biome <- apply(se_mz_biome, MARGIN = 2, FUN = mean, na.rm = TRUE)
mz_biome_pop <- tapply(grdAmazonia$AGB, INDEX = grdAmazonia$Biome, FUN = mean)
alpha <- 0.05
margin <- qt(1 - alpha / 2, 39, lower.tail = TRUE) * sqrt(v_mz_biome[, 1:3])
lower <- mz_biome[, 1:3] - margin
upper <- mz_biome[, 1:3] + margin
Mz_biome_pop <- matrix(nrow = 10000, ncol = 3, data = mz_biome_pop[1:3], byrow = TRUE)
ind <- (Mz_biome_pop > lower & Mz_biome_pop < upper)
coverage <- apply(ind, MARGIN = 2, FUN = sum, na.rm = TRUE)
dum <- apply(ind, MARGIN = 2, FUN = function(x) {
sum(!is.na(x))
})
coveragerate_95 <- coverage / dum
alpha <- 0.10
margin <- qt(1 - alpha / 2, 39, lower.tail = TRUE) * sqrt(v_mz_biome[, 1:3])
lower <- mz_biome[, 1:3] - margin
upper <- mz_biome[, 1:3] + margin
ind <- (Mz_biome_pop > lower & Mz_biome_pop < upper)
coverage <- apply(ind, MARGIN = 2, FUN = sum, na.rm = TRUE)
dum <- apply(ind, MARGIN = 2, FUN = function(x) {
sum(!is.na(x))
})
coveragerate_90 <- coverage / dum
alpha <- 0.20
margin <- qt(1 - alpha / 2, 39, lower.tail = TRUE) * sqrt(v_mz_biome[, 1:3])
lower <- mz_biome[, 1:3] - margin
upper <- mz_biome[, 1:3] + margin
ind <- (Mz_biome_pop > lower & Mz_biome_pop < upper)
coverage <- apply(ind, MARGIN = 2, FUN = sum, na.rm = TRUE)
dum <- apply(ind, MARGIN = 2, FUN = function(x) {
sum(!is.na(x))})
coveragerate_80 <- coverage / dum
row_names <- c("Average of estimated means", "True means", "Standard deviation of estimated means", "Average of estimated standard errors", "Coverage rate 95%", "Coverage rate 90%", "Coverage rate 80%")
dom <- matrix(nrow = 7, ncol = 3, dimnames = list(row_names, NULL))
for (i in 1:3) {
dom[, i] <- c(round(Ep_mz_biome[i], 3), round(mz_biome_pop[i], 3), round(sqrt(Vp_mz_biome[i]), 3), round(Ep_se_mz_biome[i], 3), round(coveragerate_95[i], 3), round(coveragerate_90[i], 3), round(coveragerate_80[i], 3))
}
```
This sampling approach and estimation are repeated 10,000 times, i.e., a simple random sample without replacement of size 400 is selected 10,000 times from Eastern Amazonia, and the samples from the three small domains are supplemented so that the total sample sizes in these domains become 40. In two out of the 10,000 samples, the size of the first sample in one of the domains exceeded 40 units. These two samples are discarded. Ideally, these samples are not discarded, but their sizes in the small domains are reduced to 40 units, which are then used to estimate the means of the domains.
(ref:SmalldomainEstimateslabel) Summary statistics of 10,000 estimated means of AGB (10^9^ kg ha^-1^) of small domains (biomes) in Eastern Amazonia, estimated by combining a simple random sample without replacement of size 400 from all units and a supplemental stratified simple random sample without replacement from the units in the small domains not included in the simple random sample. The total sample size per small domain is 40.
```{r SmalldomainEstimates, echo = FALSE}
knitr::kable(
dom, caption = "(ref:SmalldomainEstimateslabel)",
col.names = c("Mangrove", "Dry forest", "Grassland"),
row.names = TRUE,
booktabs = TRUE,
linesep = ""
) %>%
kable_classic()
```
For all three small domains, the average of the 10,000 estimated means of AGB is about equal to the true mean (Table \@ref(tab:SmalldomainEstimates)). Also the mean of the 10,000 estimated standard errors is very close to the standard deviation of the 10,000 estimated means. The coverage rates of 95, 90, and 80% confidence intervals are about equal to the nominal coverage rates.
This simple approach is feasible because at the domain level the two merged samples are a simple random sample. This approach is also applicable when the first sample is a stratified simple random sample from the entire population, and the supplemental sample is a stratified simple random sample from a small domain using as strata the intersections of the strata used in the first phase and that domain.
```{r, echo = FALSE}
rm(list = ls())
```