# Model-assisted estimation {#Modelassisted}
In many cases ancillary information is available that could be useful to increase the accuracy of the estimated mean or total of the study variable. The ancillary variable(s) can be qualitative (i.e., classifications) or quantitative. As we have seen before, both types of ancillary variable can be used at the design stage, i.e., in selecting the sampling units, to improve the performance of the sampling strategy, for instance by stratification (Chapter \@ref(STSI)), selecting sampling units with probabilities proportional to size (Chapter \@ref(pps)), or through balancing and/or spreading the sample on the covariates (Chapter \@ref(BalancedSpreaded)). This chapter explains how these covariates can be used at the stage of *estimation*, once the data are collected.
In the design-based approach for sampling, various estimators are developed that exploit one or more covariates. These estimators are derived from different superpopulation models\index{Superpopulation model} of the study variable. A superpopulation model is a statistical model that can be used to generate an infinite number of populations, a superpopulation, through simulation. An example is the simulation of spatial populations using a geostatistical model, see Chapter \@ref(MBpredictionofDesignVariance). A superpopulation is a construct; it does not exist in reality. We assume that the population of interest is one of the populations that can be generated with the chosen model. The combination of probability sampling and estimators that are built on a superpopulation model is referred to as the model-assisted approach\index{Model-assisted approach}. A superpopulation model is also used in the model-based approach; however, its role there is fundamentally different from that in the model-assisted approach, see Chapter \@ref(Approaches). To stress the different use of the superpopulation model in the model-assisted approach, this model is referred to as the "working model"\index{Working model}, i.e., the superpopulation model that is used to derive a model-assisted estimator.
@Breidt2017 present an overview of model-assisted estimators derived from a general working model:
\begin{equation}
Z_k = \mu(\mathbf{x}_k)+\epsilon_k\;,
(\#eq:workingmodel)
\end{equation}
with $\mu(\mathbf{x}_k)$ the model-mean\index{Model-mean} for population unit $k$ which is a function of the covariate values of that unit collected in vector $\mathbf{x}_k = (1, x_{1,k}, \dots , x_{J,k})^{\mathrm{T}}$ and $\epsilon_k$ a random variable with zero mean. Note that I use uppercase $Z$ to distinguish the random variable $Z_k$ of unit $k$ from one realisation of this random variable for unit $k$ in the population of interest, $z_k$. The model-mean $\mu(\mathbf{x}_k)$ can be a linear or a non-linear combination of the covariates. If the study variable and the covariate values were observed for all population units, all these data could be used to compute a so-called hypothetical population fit\index{Population fit of model parameters} of the model parameters. These model parameters can then be used to compute *estimates* of the model-means $\mu(\mathbf{x}_k)$, denoted by $m(\mathbf{x}_k)$, for all population units. For instance, with a (multiple) regression model $m(\mathbf{x}_k)=\mathbf{x}_k^{\text{T}} \mathbf{b}$, with $\mathbf{b}$ the vector with regression coefficients estimated from observations of the study variable $z$ and the covariates on *all* population units. In practice, we have a sample only, which is used to estimate $m(\mathbf{x}_k)$ by $\hat{m}(\mathbf{x}_k)$. For the multiple regression model $\hat{m}(\mathbf{x}_k)= \mathbf{x}_k^{\text{T}} \hat{\mathbf{b}}$, with $\hat{\mathbf{b}}$ the vector with regression coefficients estimated from the sample data. This leads to the generalised difference estimator\index{Generalised difference estimator} [@Wu2001]:
\begin{equation}
\hat{\bar{z}}_{\mathrm{dif}}=\frac{1}{N} \sum_{k=1}^N \hat{m}(\mathbf{x}_k) + \frac{1}{N} \sum_{k \in \mathcal{S}} \frac{z_k-\hat{m}(\mathbf{x}_k)}{\pi_k}\;,
(\#eq:GeneralizedDifferenceEstimator)
\end{equation}
with $\pi_k$ the inclusion probability of unit $k$. The first term is the population mean of model predictions of the study variable, and the second term is the $\pi$ estimator of the population mean of the residuals.
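In code, the generalised difference estimator of Equation \@ref(eq:GeneralizedDifferenceEstimator) is a one-liner. The sketch below is a minimal illustration, not part of the case study: `m_hat` holds the predictions $\hat{m}(\mathbf{x}_k)$ for all $N$ population units, and `s`, `pi_s`, and `z_s` are the sample indices, the inclusion probabilities, and the observations of the sampled units (all object names are hypothetical).

```{r, eval = FALSE}
# population mean of the predictions, plus the pi estimator
# of the population mean of the residuals
mz_dif <- mean(m_hat) + sum((z_s - m_hat[s]) / pi_s) / N
```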
A wide variety of model-assisted estimators have been developed and tested over the past decades. They differ in the working model used to obtain the estimates $\hat{m}(\mathbf{x}_k)$ in Equation \@ref(eq:GeneralizedDifferenceEstimator). The best-known class of model-assisted estimators is the generalised regression estimator\index{Regression estimator!generalised regression estimator}, which uses a linear model in prediction [@sar92]. Alternative model-assisted estimators use machine learning techniques for prediction. In the era of big data, with a vastly increasing number of exhaustive data sets and a rapid development of machine learning techniques, these estimators have great potential for spatial sample surveys.
## Generalised regression estimator {#GREG}
The working model of the generalised regression estimator is the heteroscedastic multiple linear regression model\index{Linear regression model!heteroscedastic}:
\begin{equation}
Z_k = \mathbf{x}^{\mathrm{T}}_k \pmb{\beta}+\epsilon_k \;,
(\#eq:GREGmodel)
\end{equation}
with $\epsilon_k$ uncorrelated residuals, with zero mean and variance $\sigma^2(\epsilon_k)$. Note that the variance of the residuals $\sigma^2(\epsilon_k)$ need not be constant but may differ among the population units. If $\{z_k,x_{1,k}, \dots , x_{J,k}\}$ were observed for all units $k= 1, \dots , N$ in the population, the regression coefficients $\pmb{\beta}$ would be estimated by
\begin{equation}
\mathbf{b} = \left(\sum_{k=1}^N \frac{\mathbf{x}_k\mathbf{x}_k^{\mathrm{T}}}{\sigma^2(\epsilon_k)} \right)^{-1} \sum_{k=1}^N \frac{\mathbf{x}_k z_k}{\sigma^2(\epsilon_k)}\;,
(\#eq:populationfitGREG)
\end{equation}
with $\mathbf{x}_k$ the vector $(1, x_{1,k}, \dots , x_{J,k})^{\mathrm{T}}$ and $\sigma^2(\epsilon_k)$ the variance of the residual of unit $k$. Similar to the distinction between model-mean and population mean (see Chapter \@ref(Approaches)), here the model regression coefficients\index{Model regression coefficient} $\pmb{\beta}$ are distinguished from the population regression coefficients\index{Population regression coefficient} $\mathbf{b}$. The means $m(\mathbf{x}_k)$ would then be computed by
\begin{equation}
m(\mathbf{x}_k) = \mathbf{x}_k^{\mathrm{T}} \mathbf{b}\;.
(\#eq:mxi)
\end{equation}
If we have a probability sample from the population of interest, $\mathbf{b}$ is estimated by replacing the population totals in Equation \@ref(eq:populationfitGREG) by their $\pi$ estimators:
\begin{equation}
\hat{\mathbf{b}} = \left(\sum_{k \in \mathcal{S}} \frac{\mathbf{x}_k\mathbf{x}_k^{\mathrm{T}}}{\sigma^2(\epsilon_k) \pi_k} \right)^{-1} \sum_{k \in \mathcal{S}} \frac{\mathbf{x}_k z_k}{\sigma^2(\epsilon_k) \pi_k} \;.
(\#eq:EstimatorGREGCoefficients)
\end{equation}
```{block2, type='rmdcaution'}
With unequal inclusion probabilities, the design-based estimators of the population regression coefficients differ from the usual ordinary least squares\index{Ordinary least squares} (OLS) estimators of the regression coefficients defined as model parameters. The values $\hat{b}_j$ are estimates of the *population parameters* $b_j$.
```
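For illustration, Equation \@ref(eq:EstimatorGREGCoefficients) can be coded directly as a weighted least squares fit. This is a sketch only, assuming a design matrix `X` of the sampled units (first column all ones), observations `z`, inclusion probabilities `pi_s`, and residual variances `s2e` (all hypothetical object names).

```{r, eval = FALSE}
w <- 1 / (s2e * pi_s)       # unit weights 1 / (sigma^2(epsilon_k) pi_k)
XWX <- crossprod(X, w * X)  # sum over the sample of w_k x_k x_k^T
XWz <- crossprod(X, w * z)  # sum over the sample of w_k x_k z_k
b_hat <- solve(XWX, XWz)    # design-based estimate of b
```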
The mean values $m(\mathbf{x}_k)$ are now estimated by
\begin{equation}
\hat{m}(\mathbf{x}_k) = \mathbf{x}_k^{\mathrm{T}} \hat{\mathbf{b}}\;.
(\#eq:estmxi)
\end{equation}
Plugging Equation \@ref(eq:estmxi) into the generalised difference estimator, Equation \@ref(eq:GeneralizedDifferenceEstimator), leads to the generalised regression estimator\index{Regression estimator!generalised regression estimator} for the population mean:
\begin{equation}
\hat{\bar{z}}_{\mathrm{regr}} = \frac{1}{N} \sum_{k=1}^N \mathbf{x}_k^{\mathrm{T}} \hat{\mathbf{b}} + \frac{1}{N} \sum_{k \in \mathcal{S}} \frac{z_k-\mathbf{x}^{\mathrm{T}}_k\hat{\mathbf{b}}} {\pi_k}
\;.
(\#eq:GREG)
\end{equation}
This estimator can also be written as
\begin{equation}
\hat{\bar{z}}_{\text{regr}}= \hat{\bar{z}}_{\pi}+\sum_{j=1}^J \hat{b}_j(\bar{x}_j-\hat{\bar{x}}_{j,\pi}) \;,
(\#eq:GREG2)
\end{equation}
with $\hat{\bar{z}}_{\pi}$ and $\hat{\bar{x}}_{j,\pi}$ the $\pi$ estimator of the study variable and the $j$th covariate, respectively, $\bar{x}_j$ the population mean of the $j$th covariate, and $\hat{b}_j$ the estimated slope coefficient associated with the $j$th covariate. So, the generalised regression estimate is equal to the $\pi$ estimate when the estimated means of the covariates are equal to the population means. This is the rationale of balanced sampling (Chapter \@ref(BalancedSpreaded)).
The alternative formulation of the regression estimator (Equation \@ref(eq:GREG2)) shows that we do not need to know the covariate values for all population units. Knowledge of the population means of the covariates is sufficient. This is because a linear relation is assumed between the study variable and the covariates. On the contrary, for non-linear working models such as a random forest model, exhaustive knowledge of the covariates is needed so that the estimated mean $\hat{m}(\mathbf{x}_k)$ in Equation \@ref(eq:GeneralizedDifferenceEstimator) can be computed for every unit in the population.
@sar92 worked out the generalised regression estimator for various superpopulation models, such as the simple and multiple linear regression model, the ratio model, and the analysis of variance (ANOVA) model.
### Simple and multiple regression estimators {#RegressionEstimator}
The working model of the simple and the multiple regression estimator is the homoscedastic linear regression model\index{Linear regression model!homoscedastic}. The only difference with the heteroscedastic model is that the variance of the residuals is assumed constant: $\sigma^2(\epsilon_k) = \sigma^2(\epsilon), k = 1 , \dots , N$.
In the simple linear regression model, the mean is a linear function of a single covariate, $\mu(x_k)= \alpha + \beta\;x_k$. The simple linear regression model\index{Simple linear regression model} leads to the simple regression estimator\index{Regression estimator!simple regression estimator}. With simple random sampling, this estimator for the population mean is
\begin{equation}
\hat{\bar{z}}_{\text{regr}}= \bar{z}_{\mathcal{S}}+\hat{b}\left( \bar{x}-\bar{x}_{\mathcal{S}}\right) \;,
(\#eq:SimpleRegressionEstimatorSI)
\end{equation}
where $\bar{z}_{\mathcal{S}}$ and $\bar{x}_{\mathcal{S}}$ are the sample means of the study variable and the covariate, respectively, $\bar{x}$ is the population mean of the covariate, and $\hat{b}$ is the estimated slope coefficient:
\begin{equation}
\hat{b}=\frac{\sum_{k \in \mathcal{S}} (x_k-\bar{x}_{\mathcal{S}})(z_k-\bar{z}_{\mathcal{S}})}{\sum_{k \in \mathcal{S}}(x_k-\bar{x}_{\mathcal{S}})^2} \;.
(\#eq:OLSSlope)
\end{equation}
The rationale of the regression estimator is as follows. Suppose the estimated mean of the covariate is smaller than the population mean of the covariate. With a positive correlation between the study variable and the covariate, the estimated mean of the study variable is then also expected to be smaller than the population mean of the study variable. The difference between the population mean and the estimated mean of the covariate can therefore be used to improve the $\pi$ estimate of the mean of $z$ (which for simple random sampling equals the sample mean $\bar{z}_{\mathcal{S}}$), by adding a term proportional to this difference. As a scaling factor, the estimated slope of the fitted regression line is used.
The sampling variance of this regression estimator can be estimated by first computing the regression residuals $e_k= z_k - \hat{z}_k,\;k = 1, \dots , n$ at the sampling units, with $\hat{z}_k = \hat{a} + \hat{b} x_k$ the predicted value for unit $k$. Note that I use symbol $\epsilon$ (Equation \@ref(eq:GREGmodel)) for the residuals from the model with the model regression coefficients $\pmb{\beta}$, whereas for the residuals from the model with the estimated population regression coefficients $\hat{\mathbf{b}}$ I use symbol $e$. To compute the residuals $e$, an estimate of the intercept $a$ is also needed. With simple random sampling, this intercept can be estimated by
\begin{equation}
\hat{a} = \bar{z}_{\mathcal{S}} - \hat{b}\; \bar{x}_{\mathcal{S}}\;.
(\#eq:OLSIntercept)
\end{equation}
The sampling variance of the regression estimator of the population mean is *approximately* equal to the sampling variance of the $\pi$ estimator of the mean of the model residuals:
\begin{equation}
\widehat{V}\!\left(\hat{\bar{z}}_{\mathrm{regr}}\right)=\left(1-\frac{n}{N}\right)\frac{\widehat{S^{2}}(e)}{n} \;,
(\#eq:VarianceRegressionEstimatorSI)
\end{equation}
with $\widehat{S^{2}}(e)$ the estimated population variance of the regression residuals:
\begin{equation}
\widehat{S^{2}}(e)=\frac{1}{n-1}\sum_{k \in \mathcal{S}} e_k^2 \;.
(\#eq:VarianceResiduals)
\end{equation}
```{block2, type='rmdnote'}
The variance estimator is an approximation because the regression coefficients are also estimated from the sample, which makes the regression estimator non-linear. The approximation of the variance is based on a Taylor linearisation\index{Taylor linearisation of regression estimator} of the regression estimator (@sar92, p. 235).
```
For simple random sampling with replacement from finite populations and simple random sampling of infinite populations, the finite population correction factor $1-n/N$ must be dropped, see Chapter \@ref(SI).
In the multiple linear regression model\index{Multiple linear regression model}, the mean is a linear function of multiple covariates. This model leads to the multiple regression estimator\index{Regression estimator!multiple regression estimator}. With simple random sampling, the population regression coefficients of this estimator can be estimated by
\begin{equation}
\hat{\mathbf{b}} = \left(\sum_{k \in \mathcal{S}} \mathbf{x}_k\mathbf{x}_k^{\mathrm{T}} \right)^{-1} \sum_{k \in \mathcal{S}} \mathbf{x}_k z_k \;.
(\#eq:EstimatorMultipleRegressionCoefficients)
\end{equation}
Comparison with the general estimator of the population regression coefficients (Equation \@ref(eq:EstimatorGREGCoefficients)) shows that the variance of the residuals, $\sigma^2(\epsilon_k)$, is missing, as it is assumed constant. Besides, the inclusion probabilities $\pi_k$ are missing because they are equal for all population units with simple random sampling.
The simple regression estimator is illustrated with Eastern Amazonia. The population mean of the aboveground biomass (AGB) is estimated by the simple regression estimator, using natural logs of MODIS short-wave infrared radiation (SWIR2) as a covariate.
A simple random sample without replacement of 100 units is selected using `slice_sample` of package **dplyr**, and the two population regression coefficients are estimated with Equation \@ref(eq:EstimatorMultipleRegressionCoefficients).
```{r}
grdAmazonia <- grdAmazonia %>%
mutate(lnSWIR2 = log(SWIR2))
set.seed(321)
n <- 100
mysample <- grdAmazonia %>%
dplyr::select(AGB, lnSWIR2) %>%
slice_sample(n = n)
X <- cbind(rep(1, n), mysample$lnSWIR2)
XXinv <- solve(t(X) %*% X)
Xz <- t(X) %*% mysample$AGB
print(ab <- t(XXinv %*% Xz))
```
The same estimates are obtained by ordinary least squares (OLS) fitting of the model with function `lm`.
```{r}
lm_sample <- lm(AGB ~ lnSWIR2, data = mysample)
print(ab_mb <- coef(lm_sample))
```
As already stressed above, the design-based estimates of the population regression coefficients are equal to the model-based OLS estimates of the regression coefficients only for equal probability sampling designs. Also be aware that the variance of the design-based estimates of the population regression coefficients is not equal to the model-based variance of the model regression coefficients. See Section 11.2.2.1 of @loh99 for how to estimate the variance of the design-based estimates of the population regression coefficients.
Figure \@ref(fig:ScatterAGBvsSWIR2) shows the scatter plot for the simple random sample and the fitted simple linear regression model.
(ref:ScatterAGBvsSWIR2label) Scatter plot of AGB (10^9^ kg ha^-1^) against lnSWIR2 of a simple random sample of size 100 from Eastern Amazonia and the fitted simple linear regression model for AGB.
```{r ScatterAGBvsSWIR2, echo = FALSE, fig.width = 5, fig.asp = .7, fig.cap = "(ref:ScatterAGBvsSWIR2label)"}
ggplot(data = mysample) +
geom_point(mapping = aes(x = lnSWIR2, y = AGB)) +
geom_abline(intercept = ab[1], slope = ab[2]) +
scale_x_continuous(name = "lnSWIR2") +
scale_y_continuous(name = "AGB")
```
The simple random sample is used to estimate the population mean of the study variable AGB by the simple regression estimator and to approximate the sampling variance of the regression estimator. The residuals of the fitted model can be extracted with function `residuals` because in this case the OLS estimates of the regression coefficients are equal to the design-based estimates. With unequal inclusion probabilities, the residuals must be computed by predicting the study variable for the selected units, using the design-based estimates of the regression coefficients, and subtracting the predictions from the observations of the study variable.
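That manual computation can be sketched as follows (a hypothetical illustration for an unequal probability design; `X` is the design matrix of the sampled units and `b_ht` the vector of design-based coefficient estimates, both names assumed here).

```{r, eval = FALSE}
z_pred <- as.numeric(X %*% b_ht)  # predictions from design-based coefficients
e <- mysample$AGB - z_pred        # residuals used in the variance approximation
```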
```{r}
mx_pop <- mean(grdAmazonia$lnSWIR2)
mx_sam <- mean(mysample$lnSWIR2)
mz_sam <- mean(mysample$AGB)
mz_regr <- mz_sam + ab[2] * (mx_pop - mx_sam)
e <- residuals(lm_sample)
S2e <- var(e)
N <- nrow(grdAmazonia)
se_mz_regr <- sqrt((1 - n / N) * S2e / n)
```
The difference $\delta(x)$ between the population mean of the covariate lnSWIR2 (`r formatC(mx_pop, 3, format = "f")`) and its estimated mean (`r formatC(mx_sam, 3, format = "f")`) equals `r formatC(mx_pop - mx_sam, 3, format = "f")`. We may expect the difference between the unknown population mean of the study variable AGB and its sample mean (`r formatC(mz_sam, 3, format = "f")`) to be equal to $\delta(x)$, multiplied by the estimated slope of the line, which equals `r formatC(ab[2], 1, format = "f")`. The result, `r formatC(ab[2] * (mx_pop - mx_sam), 4, format = "f")`, is added to the simple random sample estimate, so that the ultimate regression estimate is adjusted downward to `r formatC(mz_regr, 1, format = "f")` 10^9^ kg ha^-1^.
The estimated approximate standard error of the regression estimator equals `r formatC(se_mz_regr, 3, format = "f")` 10^9^ kg ha^-1^. The approximated variance is a simplification of a more complicated approximation derived from writing the regression estimator of the population total as a weighted sum of the $\pi$-expanded observations (@sar92, equation (6.5.9)):
\begin{equation}
\hat{\bar{z}}_{\mathrm{regr}}=\frac{1}{N}\sum_{k \in \mathcal{S}} g_k \frac{z_k}{\pi_k}\;,
(\#eq:AlternativeRegressionEstimator)
\end{equation}
with $g_k$ the weight for unit $k$. For simple random sampling, the weights are equal to (@sar92, equation (6.5.12))
\begin{equation}
g_k = 1+\frac{(\bar{x}-\bar{x}_{\mathcal{S}})(x_k-\bar{x}_{\mathcal{S}})}{\widehat{S^2}(x)}\;.
(\#eq:RegressionWeights)
\end{equation}
These weights are now computed.
```{r}
S2x <- sum((mysample$lnSWIR2 - mean(mysample$lnSWIR2))^2) / n
g <- 1 + ((mx_pop - mx_sam) * (mysample$lnSWIR2 - mx_sam)) / S2x
```
The sample mean of the weights equals 1,
```{r}
mean(g)
```
and the sample mean of the product of the weights and the covariate $x$ equals the population mean of the covariate.
```{r}
all.equal(mean(g * mysample$lnSWIR2), mean(grdAmazonia$lnSWIR2))
```
In other words, the sample is calibrated on the known population means. The variance of the regression estimator of the population mean can be approximated by (@sar92, section 6.6)
\begin{equation}
\widehat{V}\!\left(\hat{\bar{z}}_{\mathrm{regr}}\right)=\left(1-\frac{n}{N}\right)\frac{\sum_{k \in \mathcal{S}} g_k^2e_k^2}{n(n-1)} \;.
(\#eq:AlternativeVarianceRegressionEstimator)
\end{equation}
Comparing this with Equation \@ref(eq:VarianceRegressionEstimatorSI) shows that in the first approximation we assumed that all weights are equal to 1.
The alternative approximate standard error is computed in the next code chunk.
```{r}
S2ge <- sum(g^2 * e^2) / (n - 1)
print(se_mz_regr <- sqrt((1 - n / N) * S2ge / n))
```
The regression estimator and its standard error can be computed with package **survey** [@Lumley2020]. After specifying the sampling design with function `svydesign`, function `calibrate` is used to calibrate the sample on the known population totals $N$ and $t(x) = \sum_{k=1}^N x_k$, with $x_k$ the value of covariate lnSWIR2 for unit $k$.
```{r}
library(survey)
mysample$fpc <- N
design_si <- svydesign(id = ~ 1, data = mysample, fpc = ~ fpc)
populationtotals <- c(N, sum(grdAmazonia$lnSWIR2))
mysample_cal <- calibrate(design_si, formula = ~ lnSWIR2,
population = populationtotals, calfun = "linear")
```
The calibrated weights\index{Calibrated weights} can be extracted with function `weights`.
```{block2, type = 'rmdcaution'}
The calibrated weights are divided by the inclusion probabilities $\pi=n/N$, so that the sample sum of the weights equals $N$ and not the sample size $n$ (as in the code chunk above).
```
```{r}
g <- weights(mysample_cal)
all.equal(sum(g), N)
```
The sample sum of the product of the weights and the covariate equals the population total of the covariate.
```{r}
all.equal(sum(g * mysample$lnSWIR2), sum(grdAmazonia$lnSWIR2))
```
Finally, the population mean can be estimated with function `svymean`. This is simply the sample sum of the product of the weights and the study variable AGB, divided by $N$.
```{r}
svymean(~ AGB, mysample_cal)
```
The standard error is computed with Equation \@ref(eq:AlternativeVarianceRegressionEstimator). Figure \@ref(fig:SamplingDistributionRegression) shows the sampling distribution of the simple regression estimator along with the distribution of the $\pi$ estimator, obtained by repeating simple random sampling of 100 units and estimation 10,000 times.
```{r, eval = FALSE, echo = FALSE}
number_of_samples <- 10000
mz_HT <- mz_regr <- v_mz_regr <- numeric(length = number_of_samples)
set.seed(314)
for (i in 1:number_of_samples) {
units <- sample(nrow(grdAmazonia), size = n, replace = FALSE)
mysample <- grdAmazonia[units, c("AGB", "lnSWIR2")]
mz_HT[i] <- mean(mysample$AGB)
mysample$fpc <- N
design_si <- svydesign(id = ~ 1, data = mysample, fpc = ~ fpc)
mysample_cal <- calibrate(design_si, formula = ~ lnSWIR2, population = populationtotals, calfun = "linear")
res <- svymean(~ AGB, mysample_cal) %>% as.data.frame(.)
mz_regr[i] <- res$mean
v_mz_regr[i] <- res$AGB^2
}
save(mz_HT, mz_regr, v_mz_regr, file = "results/SimpleRegressionEstimatesAGB_Amazonia.rda")
```
(ref:SamplingDistributionRegressionlabel) Approximated sampling distribution of the simple regression estimator (Regression) and the $\pi$ estimator (HT) of the mean AGB (10^9^ kg ha^-1^) in Eastern Amazonia, for simple random sampling without replacement of size 100.
```{r SamplingDistributionRegression, echo = FALSE, fig.width = 5, fig.asp = .8, fig.cap = "(ref:SamplingDistributionRegressionlabel)"}
load(file = "results/SimpleRegressionEstimatesAGB_Amazonia.rda")
estimates <- data.frame(mz_regr, mz_HT)
names(estimates) <- c("Regression", "HT")
df <- estimates %>% pivot_longer(cols = c("Regression", "HT"))
df$name <- factor(df$name, levels = c("Regression", "HT"), ordered = TRUE)
ggplot(data = df) +
geom_boxplot(aes(y = value, x = name)) +
geom_hline(yintercept = mean(grdAmazonia$AGB), colour = "red") +
scale_x_discrete(name = "Estimator") +
scale_y_continuous(name = "Estimated mean AGB")
bias <- mean(mz_regr) - mean(grdAmazonia$AGB)
v_mz_regr_sim <- var(mz_regr)
m_v_mz_regr_sim <- mean(v_mz_regr)
v_mz_SI <- (1 - n / N) * var(grdAmazonia$AGB) / n
gain <- v_mz_SI / var(mz_regr)
```
The average of the 10,000 regression estimates equals `r formatC(mean(mz_regr), 1, format = "f")` 10^9^ kg ha^-1^. The population mean of the study variable AGB equals `r formatC(mean(grdAmazonia$AGB), 1, format = "f")` 10^9^ kg ha^-1^, so the estimated bias of the regression estimator equals `r formatC(bias, 1, format = "f")` 10^9^ kg ha^-1^, which is negligibly small relative to the estimated population mean. The variance of the 10,000 regression estimates equals `r formatC(v_mz_regr_sim, 2, format = "f")` (10^9^ kg ha^-1^)^2^, and the average of the 10,000 estimated approximate variances using Equation \@ref(eq:AlternativeVarianceRegressionEstimator) equals `r formatC(m_v_mz_regr_sim, 2, format = "f")` (10^9^ kg ha^-1^)^2^. The gain in precision due to the regression estimator, quantified by the ratio of the variance of the $\pi$ estimator to the variance of the regression estimator, equals `r formatC(gain, 3, format = "f")`.
For simple random sampling, the ratio of the variances of the simple regression estimator and the $\pi$ estimator
is independent of the sample size and equals $1-r^2$, with $r$ the correlation coefficient of the study variable and the covariate (@sar92, p. 274).
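As a quick check of this relation against the sampling experiment above, the population correlation of AGB and lnSWIR2 can be computed; $1-r^2$ should then be close to the reciprocal of the gain in precision.

```{r, eval = FALSE}
r <- cor(grdAmazonia$AGB, grdAmazonia$lnSWIR2)
1 - r^2  # compare with var(mz_regr) / v_mz_SI
```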
Using multiple covariates in the regression estimator is straightforward with function `calibrate` of package **survey**. As a first step, the best model is selected with function `regsubsets` of package **leaps** [@leaps].
```{r}
library(leaps)
n <- 100
set.seed(321)
mysample <- grdAmazonia %>%
dplyr::select(AGB, lnSWIR2, Terra_PP, Prec_dm, Elevation, Clay) %>%
slice_sample(n = n)
models <- regsubsets(AGB ~ ., data = mysample, nvmax = 4)
res_sum <- summary(models)
res_sum$outmat
```
The best model with one predictor is the model with lnSWIR2, the best model with two predictors is the one with lnSWIR2 and Terra_PP, etc. Of these models, the third model, i.e., the model with lnSWIR2, Terra_PP, and Elevation, is the best when using adjusted $R^2$ as a selection criterion.
```{r}
which.max(res_sum$adjr2)
```
The standard error of the estimated mean AGB is somewhat reduced by adding the covariates Terra_PP and Elevation to the regression estimator.
```{r}
mysample$fpc <- nrow(grdAmazonia)
design_si <- svydesign(id = ~ 1, data = mysample, fpc = ~ fpc)
totals <- c(nrow(grdAmazonia), sum(grdAmazonia$lnSWIR2),
sum(grdAmazonia$Terra_PP), sum(grdAmazonia$Elevation))
mysample_cal <- calibrate(design_si, formula = ~ lnSWIR2 + Terra_PP + Elevation,
population = totals, calfun = "linear")
svymean(~ AGB, mysample_cal)
```
Another interesting package for model-assisted estimation is package **mase** [@mase2018]. The regression estimate can be computed with function `greg`.
```{r}
library(mase)
covars <- c("lnSWIR2", "Terra_PP", "Elevation")
res <- greg(y = mysample$AGB, xsample = mysample[covars],
xpop = grdAmazonia[covars], pi = rep(n / N, n),
var_est = TRUE, var_method = "LinHTSRS", model = "linear")
res$pop_mean
```
The multiple regression estimate is equal to the estimate obtained with function `calibrate` of package **survey**. The estimated standard error equals
```{r}
sqrt(res$pop_mean_var)
```
which is slightly smaller than the standard error computed with package **survey**. The standard error obtained with function `greg` is computed by ignoring the g-weights [@McConville2020]. In an exercise, the two approximate standard errors are compared in a sampling experiment.
### Penalised least squares estimation
In the previous subsection, I first selected a best subset of covariates before using these covariates in estimating the population regression coefficients. The alternative is to skip the selection of the best model and to estimate the population regression coefficients of *all* covariates by penalised least squares (PLS) estimation\index{Penalised least squares}. In PLS estimation a penalty equal to the sum of the absolute or squared values of the population regression coefficients is added to the minimisation criterion, see @McConville2020 for details. PLS estimation is implemented in function `gregElasticNet` of package **mase**.
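For reference, a sketch of the elastic net minimisation criterion (in the form popularised by package **glmnet**; details such as weighting and scaling of the covariates may differ in the implementation used by **mase**):
\begin{equation*}
\min_{\pmb{\beta}} \; \sum_{k \in \mathcal{S}} \frac{\left(z_k - \mathbf{x}_k^{\mathrm{T}}\pmb{\beta}\right)^2}{\pi_k} + \lambda \left( \alpha \sum_{j=1}^J |\beta_j| + \frac{1-\alpha}{2} \sum_{j=1}^J \beta_j^2 \right) \;,
\end{equation*}
with $\lambda$ the overall penalty parameter and $\alpha \in [0,1]$ balancing the lasso ($\alpha = 1$) and ridge ($\alpha = 0$) penalties.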
```{r}
covars <- c("lnSWIR2", "Terra_PP", "Prec_dm", "Elevation", "Clay")
res <- gregElasticNet(
y = mysample$AGB, xsample = mysample[covars],
xpop = grdAmazonia[covars], pi = rep(n / N, n),
var_est = TRUE, var_method = "LinHTSRS", model = "linear",
lambda = "lambda.min", cvfolds = 100)
signif(res$coefficients, 4)
```
All five covariates are used in prediction, but all estimated coefficients except that of lnSWIR2 are small.
As shown below, the estimated standard error is considerably larger than the standard error obtained with lnSWIR2, Terra_PP, and Elevation as predictors. In this case, the elastic net regression estimator\index{Regression estimator!elastic net regression estimator} does not work as well as the multiple regression estimator using the best subset of the covariates.
```{r}
sqrt(res$pop_mean_var)
```
#### Exercises {-}
1. Write an **R** script to select a simple random sample without replacement of size 50 from Eastern Amazonia. Compute the regression estimator of the population mean of AGB and its standard error "by hand", i.e., without using package **survey**, using lnSWIR2 as a covariate. Use Equation \@ref(eq:VarianceRegressionEstimatorSI) to estimate the standard error.
+ Use the same sample to compute the regression estimator with functions `calibrate` and `svymean` of package **survey**. The regression estimate and its standard error can be extracted from the output object of `svymean` with methods `coef` and `SE`, respectively.
+ Repeat both estimation procedures 1,000 times (for-loop). Check that the 2 $\times$ 1,000 estimated population means obtained with both estimators are all equal (use function `all.equal`), and compute summary statistics of the two approximated standard errors. Which approximate standard error estimator has the largest mean value?
2. Write an **R** script to compute the sampling variance of the simple regression estimator of the mean AGB, using lnSWIR2 as a covariate, for simple random sampling and sample sizes 10, 25, and 100, assuming that the population regression coefficients are perfectly known. Hint: fit a simple linear regression model on all data, and compute the population variance of the residuals.
+ Next, select 10,000 times a simple random sample with replacement of size 10 (for-loop). Use each sample to estimate the population mean of AGB with the simple regression estimator (using sample estimates of the population regression coefficients), and estimate the approximate variance of the estimator of the mean. Compute the variance of the 10,000 regression estimates and the average of the 10,000 approximate variance estimates. Repeat this for sample sizes 25 and 100.
+ Compute for each sample size the difference between the experimental variance (variance of the 10,000 regression estimates) and the variance obtained with the population fit of the regression model as a proportion of the experimental variance. Explain what you see.
+ Compute for each sample size the difference between the average of the 10,000 approximated variances and the experimental variance as a proportion of the experimental variance. Explain what you see.
### Regression estimator with stratified simple random sampling {#RegressionEstimatorSTSI}
With stratified simple random sampling there are two regression estimators: *separate* and *combined*. In the first estimator, the regression estimator for simple random sampling is applied at the level of the strata\index{Regression estimator!separate regression estimator}. This implies that for each stratum separately a vector with population regression coefficients $\mathbf{b}_h$ is estimated. The regression estimates of the stratum means are then combined by computing the weighted average, using the relative sizes of the strata as weights:
\begin{equation}
\hat{\bar{z}}_{\mathrm{sregr}}=\sum_{h=1}^H w_h \hat{\bar{z}}_{\text{regr,}h} \;,
(\#eq:SeparateRegressionEstimator)
\end{equation}
with, for the simple regression estimator and simple random sampling within the strata
\begin{equation}
\hat{\bar{z}}_{\text{regr,}h} = \bar{z}_{\mathcal{S}h}+\hat{b}_h\left( \bar{x}_h-\bar{x}_{\mathcal{S}h}\right) \;,
(\#eq:RegressionEstimatorStratumMean)
\end{equation}
with $\bar{z}_{\mathcal{S}h}$ and $\bar{x}_{\mathcal{S}h}$ the stratum sample means of the study variable and the covariate, respectively, $\bar{x}_h$ the mean of the covariate in stratum $h$, and $\hat{b}_h$ the estimated slope coefficient for stratum $h$.
The variance of this separate regression estimator of the population mean can be estimated by first estimating the variances of the regression estimators of the stratum means using Equation \@ref(eq:VarianceRegressionEstimatorSI), and then combining these variances using Equation \@ref(eq:EstVarMeanSTSI).
The separate regression estimator is illustrated with Eastern Amazonia. Biomes are used as strata. There are four biomes, the levels of which are given short names using function `levels`.
```{r}
grdAmazonia$Biome <- as.factor(grdAmazonia$Biome)
levels(grdAmazonia$Biome)
```
```{r}
biomes <- c("Mangrove", "Forest_dry", "Grassland", "Forest_moist")
levels(grdAmazonia$Biome) <- biomes
```
Moist forest is by far the largest stratum: it covers 92\% of the area. Mangrove, Forest_dry, and Grassland cover 0.4, 2.3, and 5.5\% of the area, respectively. A stratified simple random sample of size 100 is selected using function `strata` of package **sampling**, see Chapter \@ref(STSI). I chose five units as a minimum stratum sample size. Note that the stratum sample sizes are not proportional to the stratum sizes.
```{r}
library(sampling)
N_h <- table(grdAmazonia$Biome)
n_h <- c(5, 5, 5, 85)
set.seed(314)
units <- sampling::strata(grdAmazonia, stratanames = "Biome",
size = n_h[unique(grdAmazonia$Biome)], method = "srswor")
mysample <- getdata(grdAmazonia, units)
```
As a first step in estimation, for each stratum the mean of the covariate over all units in a stratum (population mean per stratum) and the sample means of the study variable and the covariate are computed.
```{r}
mx_h_pop <- tapply(grdAmazonia$lnSWIR2, INDEX = grdAmazonia$Biome, FUN = mean)
mz_h_sam <- tapply(mysample$AGB, INDEX = mysample$Biome, FUN = mean)
mx_h_sam <- tapply(mysample$lnSWIR2, INDEX = mysample$Biome, FUN = mean)
```
The next step is to estimate the regression coefficients (intercept and slope) per stratum. This is done in a for-loop. The estimated slope coefficient is used to compute the regression estimator per stratum. The residuals are extracted to approximate the variance of the regression estimator per stratum.
```{r}
b_h <- mz_h_regr <- v_mz_h_regr <- numeric(length = 4)
for (i in 1:4) {
subsam <- subset(mysample, mysample$Biome == levels(grdAmazonia$Biome)[i])
lm_sample <- lm(AGB ~ lnSWIR2, data = subsam)
b_h[i] <- coef(lm_sample)[2]
mz_h_regr[i] <- mz_h_sam[i] + b_h[i] * (mx_h_pop[i] - mx_h_sam[i])
e <- residuals(lm_sample)
S2e_h <- var(e)
v_mz_h_regr[i] <- (1 - n_h[i] / N_h[i]) * S2e_h / n_h[i]
}
```
Finally, the separate regression estimate is computed as a weighted average of the regression estimates per stratum.
```{r}
w_h <- N_h / sum(N_h)
print(mz_sepreg <- sum(w_h * mz_h_regr))
```
The standard error of the separate regression estimator is computed as the square root of the weighted sum of the estimated variances of the regression estimators per stratum, using the squared relative sizes of the strata as weights.
```{r}
sum(w_h^2 * v_mz_h_regr) %>% sqrt(.)
```
The separate regression estimator can be computed with package **survey**. The computation of the population totals merits special attention. For the simple regression estimator using simple random sampling, these totals are the total number of population units and the population total of the covariate lnSWIR2. These are the population totals associated with the columns of the design matrix that is constructed with function `lm` to estimate the regression coefficients. The column with ones results in an estimated intercept, the column with lnSWIR2 values in an estimated slope.
The model that is fitted now is an analysis of covariance (ANCOVA) model\index{ANCOVA model} with factor Biome and covariate lnSWIR2.
```{r}
ancova <- lm(AGB ~ Biome * lnSWIR2, data = mysample)
```
**R** uses the so-called cornerstone representation of the ANCOVA model. The reference level is stratum Mangrove. The question is what population totals must be passed to function `calibrate` with this ANCOVA model. This can be determined by printing the design matrix that is used to fit the ANCOVA model. Only the first two rows are printed.
```{r}
designmat <- model.matrix(ancova, mysample)
```
```{r, echo = FALSE}
rownames(designmat) <- NULL
designmat[c(1, 2), ]
```
With this model formulation, the first population total is the total number of population units. The second, third, and fourth population totals are the number of population units in stratum levels 2, 3, and 4. The fifth population total is the population total of covariate lnSWIR2 and the sixth, seventh, and eighth population totals are the totals of covariate lnSWIR2 in stratum levels 2, 3, and 4.
```{r}
N_h <- as.numeric(N_h)
lut <- data.frame(Biome = biomes, N_h)
mysample <- merge(x = mysample, y = lut)
design_stsi <- svydesign(
id = ~ 1, strata = ~ factor(Biome), data = mysample, fpc = ~ N_h)
tx_pop <- sum(grdAmazonia$lnSWIR2)
tx_h_pop <- N_h * mx_h_pop
totals <- c(sum(N_h), N_h[c(2, 3, 4)], tx_pop, tx_h_pop[c(2, 3, 4)])
names(totals) <- names(coef(ancova))
mysample_cal <- calibrate(
design_stsi, formula = ~ Biome * lnSWIR2,
population = totals, calfun = "linear")
svymean(~ AGB, mysample_cal)
```
```{block2, type = 'rmdnote'}
The line `names(totals) <- names(coef(ancova))` is not strictly needed. It merely suppresses a warning that the names of the numeric vector with the population totals do not match the column names of the design matrix. If we accept this warning, we do not need to fit the ANCOVA model at all.
```
Alternatively, we may use the following formula in function `lm`.
```{r}
ancova2 <- lm(AGB ~ 0 + Biome / lnSWIR2, data = mysample)
designmat <- model.matrix(ancova2, mysample)
```
```{r, echo = FALSE}
rownames(designmat) <- NULL
designmat[c(1, 2), ]
```
With this formula, the population totals are the number of population units in stratum levels 1, 2, 3, and 4, as well as the population totals of covariate lnSWIR2 of the strata.
```{r}
totals <- c(N_h, tx_h_pop)
names(totals) <- names(coef(ancova2))
mysample_cal <- calibrate(
design_stsi, formula = ~ 0 + Biome / lnSWIR2, population = totals,
calfun = "linear")
svymean(~ AGB, mysample_cal)
```
<!--Recall the alternative formulation of the regression estimator, Equation \@ref(eq:AlternativeRegressionEstimator), and that the sample sum of the $\pi$-expanded $x$-values multiplied by the calibrated weights $g_k$ are equal to the population totals. This explains why the population totals should match the columns of the design matrix.-->
#### Combined regression estimator
The alternative to the separate simple regression estimator is the combined simple regression estimator\index{Regression estimator!combined regression estimator}:
\begin{equation}
\hat{\bar{z}}_{\mathrm{cregr}} = \hat{\bar{z}}_{\pi}+\hat{b}\left( \bar{x}-\hat{\bar{x}}_{\pi}\right) \;,
(\#eq:CombinedRegressionEstimator)
\end{equation}
with $\hat{b}$ the estimated slope coefficient, estimated by Equation \@ref(eq:EstimatorGREGCoefficients), discarding the variance of the residuals $\sigma^2(\epsilon_k)$ as they are assumed constant, and using the appropriate inclusion probabilities which differ among the strata, and $\hat{\bar{z}}_{\pi}$ and $\hat{\bar{x}}_{\pi}$ the $\pi$ estimators of the population mean of the study variable and the covariate for stratified simple random sampling, respectively. Working Equation \@ref(eq:EstimatorGREGCoefficients) out for stratified simple random sampling yields
\begin{equation}
\hat{b}=\frac{\sum_{h=1}^H w^2_h \widehat{S^2}_h(z,x)}{\sum_{h=1}^H w^2_h \widehat{S^2}_h(x)}\;,
(\#eq:bCombinedRegressionEstimator)
\end{equation}
with $\widehat{S^2}_h(z,x)$ the estimated covariance of the study variable and the covariate in stratum $h$ and $\widehat{S^2}_h(x)$ the estimated variance of the covariate.
In the combined simple regression estimator only one regression coefficient $b$ is estimated, the slope coefficient for the entire population. This combined regression estimator is recommended when the stratum sample sizes are so small, as in our case, that the estimated regression coefficients per stratum, $\hat{b}_h$, become unreliable.
```{block2, type='rmdcaution'}
Estimator \@ref(eq:bCombinedRegressionEstimator) is for infinite populations and for stratified simple random sampling with replacement of finite populations. For sampling without replacement from finite populations, finite population corrections $1-n_h/N_h$ must be added to the numerator and denominator of $\hat{b}$ (@coc77, p. 202).
```
The approximate variance of the combined regression estimator can be estimated as follows:
1. Compute residuals: $e_k = z_k - (\hat{a} + \hat{b} x_k)$, with $\hat{a}$ and $\hat{b}$ the estimated regression coefficients for the whole population.
2. Estimate for each stratum the variance of the estimator of the mean of the residuals: $\widehat{V}\!\left(\hat{\bar{e}}_h\right)=\widehat{S^{2}}_h(e)/n_h$, with $\widehat{S^{2}}_h(e)$ the estimated variance of the residuals in stratum $h$.
3. Combine the estimated variances per stratum: $\widehat{V}\!\left(\hat{\bar{z}}_{\text{cregr}}\right)=\sum_{h=1}^Hw^2_h\widehat{V}\!\left(\hat{\bar{e}}_h\right)$.
The next code chunks show the estimation procedure. First, the population means of the study variable AGB and of the covariate lnSWIR2 are estimated by the $\pi$ estimator, see Chapter \@ref(STSI).
```{r}
mz_h_HT <- tapply(mysample$AGB, INDEX = mysample$Biome, FUN = mean)
mx_h_HT <- tapply(mysample$lnSWIR2, INDEX = mysample$Biome, FUN = mean)
mz_HT <- sum(w_h * mz_h_HT)
mx_HT <- sum(w_h * mx_h_HT)
```
The next step is to estimate the population regression coefficients, using Equation \@ref(eq:EstimatorGREGCoefficients) in which the variances $\sigma^2(\epsilon_k)$ can be dropped, as these are assumed constant. The inclusion probabilities are in column `Prob` of `mysample`.
```{r}
W <- diag(x = 1 / mysample$Prob, nrow = n, ncol = n)
X <- cbind(rep(1, n), mysample$lnSWIR2)
XWX <- t(X) %*% W %*% X
XWz <- t(X) %*% W %*% mysample$AGB
print(ab <- t(solve(XWX, XWz)))
```
Note that the same estimates are obtained by model-based estimation, using weighted least squares\index{Weighted least squares}, based on the assumption that the variances $\sigma^2(\epsilon_k)$ are proportional to the inclusion probabilities (which is a weird assumption).
```{r}
lm_wls <- lm(AGB ~ lnSWIR2, weights = 1 / Prob, data = mysample)
coef(lm_wls)
```
```{block2, type = 'rmdnote'}
In model-based estimation, the weights differ among the units because of assumed differences in the variance of the residuals, whereas in design-based estimation we assign different weights to the observations because the units have different inclusion probabilities [@loh99].
```
Finally, the combined regression estimate is computed.
```{r}
print(mz_combreg <- mz_HT + ab[2] * (mx_pop - mx_HT))
```
To approximate the variance of the combined regression estimator, first the residuals are computed. Then these residuals are used to estimate the spatial variance of the residuals within the strata, $\widehat{S^{2}}_h(e)$, and the variance of the estimator of the mean of the residuals, $\widehat{V}\!\left(\hat{\bar{e}}_h\right)$. Finally, by taking the square root, the estimated standard error is obtained.
```{r}
mysample$e <- mysample$AGB - (ab[1] + ab[2] * mysample$lnSWIR2)
v_me_h <- numeric(length = 4)
for (i in 1:4) {
subsam <- subset(mysample, mysample$Biome == levels(grdAmazonia$Biome)[i])
S2e_h <- var(subsam$e)
v_me_h[i] <- (1 - n_h[i] / N_h[i]) * S2e_h / n_h[i]
}
print(se_mz_combreg <- sqrt(sum(w_h^2 * v_me_h)))
```
Computing the combined regression estimator with package **survey** proceeds as follows.
```{r}
design_stsi <- svydesign(
id = ~ 1, strata = ~ factor(Biome), data = mysample, fpc = ~ N_h)
totals <- c(nrow(grdAmazonia), sum(grdAmazonia$lnSWIR2))
mysample_cal <- calibrate(
design_stsi, formula = ~ lnSWIR2, population = totals,
calfun = "linear")
svymean(~ AGB, mysample_cal)
```
Function `calibrate` computes the regression estimate and its standard error with the calibrated weights\index{Calibrated weights} $g_k$ (@sar92, equation (6.5.12)). This explains the difference between the two standard errors.
```{r, echo = FALSE}
#check the standard error of the regression estimator obtained with package survey, using calibrated weights $g_k$.
lut <- data.frame(Biome = biomes, pi = n_h / N_h)
mysample <- merge(x = mysample, y = lut)
S2x <- sum(((mysample$lnSWIR2 - mx_HT)^2) / mysample$pi) / sum(mysample$pi^-1)
g <- 1 + (mx_pop - mx_HT) * (mysample$lnSWIR2 - mx_HT) / S2x
lm_sample <- lm(AGB ~ lnSWIR2, weights = 1 / Prob, data = mysample)
e <- residuals(lm_sample)
ge <- g * e
S2geh <- tapply(ge, INDEX = mysample$Biome, FUN = var)
se_mz_combreg <- sqrt(sum(w_h^2 * (1 - n_h / N_h) * S2geh / n_h))
```
## Ratio estimator {#RatioEstimator}
In some cases it is reasonable to assume that the fitted line passes through the origin. An example is the case study on poppy area in Kandahar (Chapter \@ref(pps)). The covariate is the agricultural area within the 5 km squares that serve as sampling units. It is reasonable to assume that when the agricultural area is zero, the poppy area is zero too. So, if we have an estimate of the ratio of the total poppy area to the total agricultural area in the population, and the population total of the agricultural area is known, the total poppy area can be estimated by multiplying the estimated ratio by this known total:
\begin{equation}
\hat{t}_{\mathrm{ratio}}(z)=\frac{\hat{t}_{\pi}(z)}{\hat{t}_{\pi}(x)} \;t(x) = \hat{b} \;t(x)\;,
(\#eq:RatioEstimator)
\end{equation}
with $\hat{t}_{\pi}(z)$ and $\hat{t}_{\pi}(x)$ the $\pi$ estimators of the total of the study variable (poppy area) and the ancillary variable (agricultural area), respectively, and $t(x)$ the total of the ancillary variable, which must be known.
The working model\index{Working model} of the ratio estimator\index{Ratio estimator} is a heteroscedastic model, i.e., a model with non-constant variance, without intercept (see Exercise 3 hereafter):
\begin{equation}
\begin{split}
Z(x_k) &= \beta \;x_k + \epsilon_k \\
\sigma^2(\epsilon_k) &= \sigma^2 x_k \;,
\end{split}
(\#eq:ratiomodel)
\end{equation}
with $\beta$ the slope of the line and $\sigma^2$ a constant (variance of residual for $x_k = 1$). The residual variance is assumed proportional to the covariate $x$.
```{block2, type='rmdnote'}
The ratio estimator was applied before to estimate the population mean or population total from a systematic random sample (Chapter \@ref(SY)), a one-stage and two-stage cluster random sample (Sections \@ref(SIC) and \@ref(TwostageSISI)), and a ppswor sample (Section \@ref(ppswor)). By taking $x_k = 1,\; k=1, \dots , N$, $\hat{t}_{\pi}(x)$ in Equation \@ref(eq:RatioEstimator) is equal to $\hat{N}$, and $t(x)$ is equal to $N$. For (two-stage) cluster random sampling $M$ is used for the total number of population units ($N$ is the total number of clusters or primary sampling units in the population) and therefore $\hat{t}_{\pi}(x)=\hat{M}$ and $t(x)=M$. This yields the ratio estimators of the population total appropriate for these sampling designs.
```
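The reduction described in the note above can be checked numerically. The sketch below uses simulated data under Poisson sampling with unequal inclusion probabilities; all object names and parameter values are illustrative, not part of the case study.

```{r}
# Sketch with simulated data: Poisson sampling with unequal inclusion
# probabilities. Taking x_k = 1 for all k, the ratio estimator of the
# population total reduces to N times the Hajek estimator of the mean.
set.seed(314)
N <- 1000
z <- rgamma(N, shape = 2, scale = 50)
pi <- runif(N, min = 0.02, max = 0.10)
s <- runif(N) < pi            # Poisson sample: independent inclusion decisions
tz_HT <- sum(z[s] / pi[s])    # pi estimator of the total of z
N_HT <- sum(1 / pi[s])        # pi estimator of N (x_k = 1 for all k)
tz_ratio <- tz_HT / N_HT * N  # ratio estimator with t(x) = N
```

The same value is obtained as `N * weighted.mean(z[s], w = 1 / pi[s])`, showing that with $x_k = 1$ the ratio estimator of the mean is simply the $\pi$-weighted sample mean.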
Equation \@ref(eq:RatioEstimator) is a general estimator that can be used for any probability sampling design, not only for simple random sampling. For simple random sampling, the coefficient $b$ is estimated by the ratio of the sample means of $z$ and $x$.
For simple random sampling, the sampling variance of the ratio estimator of the population total can be approximated by
\begin{equation}
\widehat{V}\!\left(\hat{t}_{\mathrm{ratio}}(z)\right)=N^2\frac{\widehat{S^{2}}(e)}{n} \;,
(\#eq:VarianceRatioEstimatorSI)
\end{equation}
with $\widehat{S^{2}}(e)$ the estimated variance of the residuals $e_k=z_k-\hat{b}x_k$:
\begin{equation}
\widehat{S^{2}}(e)=\frac{1}{n-1}\sum_{k \in \mathcal{S}}e_k^2 \;.
(\#eq:VarianceResidualsRatioEstimator)
\end{equation}
For simple random sampling without replacement from finite populations, Equation \@ref(eq:VarianceRatioEstimatorSI) must be multiplied by $\left(1-\frac{n}{N}\right)$.
In **R** the ratio estimator for the total poppy area and the estimator of its variance for a simple random sample without replacement can be computed as follows.
```{r}
n <- 50
N <- nrow(grdKandahar)
units <- sample(N, size = n, replace = FALSE)
mysample <- grdKandahar[units, ]
b <- mean(mysample$poppy) / mean(mysample$agri)
tx_pop <- sum(grdKandahar$agri)
print(tz_ratio <- b * tx_pop)
e <- mysample$poppy - b * mysample$agri
print(se_tz_ratio <- sqrt(N^2 * (1 - (n / N)) * var(e) / n))
```
An improved variance approximation is obtained with Equation \@ref(eq:AlternativeVarianceRegressionEstimator). For the ratio model and simple random sampling, the calibrated weights are equal to (@sar92, p. 248)
\begin{equation}
g = \frac{t(x)}{\hat{t}_{\pi}(x)} \;.
(\#eq:weightsratiomodel)
\end{equation}
```{r}
pi <- n / N
tx_HT <- sum(mysample$agri / pi)
g <- tx_pop / tx_HT
S2ge <- sum(g^2 * e^2) / (n - 1)
print(se_tz_ratio <- sqrt(N^2 * (1 - n / N) * S2ge / n))
```
The ratio estimate and the estimated standard error of the ratio estimator can be computed with package **survey** as follows.
```{r}
mysample$N <- N
design_si <- svydesign(id = ~ 1, data = mysample, fpc = ~ N)
b <- svyratio(~ poppy, ~ agri, design = design_si)
predict(b, total = tx_pop)
```
(ref:SamplingDistributionRatiolabel) Approximated sampling distribution of the ratio estimator (Ratio) and the $\pi$ estimator (HT) of the total poppy area (ha) in Kandahar with simple random sampling without replacement of size 50.
```{r SamplingDistributionRatio, echo = FALSE, fig.width = 5, fig.asp = .8, fig.cap = "(ref:SamplingDistributionRatiolabel)"}
number_of_samples <- 10000
tz_ratio <- v_tz_ratio <- tz_HT <- numeric(length = number_of_samples)
set.seed(314)
for (i in 1:number_of_samples) {
mysample <- sample(N, size = n, replace = FALSE)
b <- mean(grdKandahar$poppy[mysample]) / mean(grdKandahar$agri[mysample])
tz_ratio[i] <- b * tx_pop
e <- grdKandahar$poppy[mysample] - b * grdKandahar$agri[mysample]
v_tz_ratio[i] <- N^2 * (1 - (n / N)) * var(e) / n
tz_HT[i] <- mean(grdKandahar$poppy[mysample]) * N
}
estimates <- data.frame(tz_ratio, tz_HT)
names(estimates) <- c("Ratio", "HT")
df <- estimates %>% pivot_longer(cols = c("Ratio", "HT"))
df$name <- factor(df$name, levels = c("Ratio", "HT"), ordered = TRUE)
ggplot(data = df) +
geom_boxplot(aes(y = value, x = name)) +
geom_hline(yintercept = mean(grdKandahar$poppy) * N, colour = "red") +
scale_x_discrete(name = "Estimator") +
scale_y_continuous(name = "Estimated total poppy area")
m_tz_ratio_sim <- mean(tz_ratio)
bias <- mean(tz_ratio) - sum(grdKandahar$poppy)
v_tz_ratio_sim <- var(tz_ratio)
m_v_tz_ratio_sim <- mean(v_tz_ratio)
gain <- var(tz_HT) / var(tz_ratio)
```
Figure \@ref(fig:SamplingDistributionRatio) shows the sampling distribution of the ratio estimator and the $\pi$ estimator, obtained by repeating simple random sampling of size 50 and estimation 10,000 times. The average of the 10,000 ratio estimates of the total poppy area equals `r formatC(m_tz_ratio_sim, 0, format = "f", big.mark = ",")` ha. The population total of poppy equals `r formatC(sum(grdKandahar$poppy), 0, format = "f", big.mark = ",")` ha, so the estimated bias of the ratio estimator equals `r formatC(bias, 0, format = "f")` ha. The boxplots in Figure \@ref(fig:SamplingDistributionRatio) show that the ratio estimator has less extreme outliers. The standard deviation of the 10,000 ratio estimates equals `r formatC(sqrt(v_tz_ratio_sim), 0, format = "f", big.mark = ",")` ha. The gain in precision due to the ratio estimator, quantified by the ratio of the variance of the $\pi$ estimator to the variance of the ratio estimator, equals `r formatC(gain, 3, format = "f")`.
#### Exercises {-}
3. Write an **R** script to compute the ratio of the population total poppy area to the population total agricultural area ($t(z)/t(x)$). Then use all data to fit a linear model without intercept for the poppy area, using the agricultural area as a covariate, assuming that the variance of the residuals is proportional to the agricultural area (heteroscedastic model). Hint: use function `lm` with argument `formula = poppy ~ agri - 1` and argument `weights = 1 / agri`. Also fit a model without intercept, assuming a constant variance of the residuals (homoscedastic model). Compare the estimated slopes of the two models with the ratio of the total poppy area to the total agricultural area.
### Ratio estimators with stratified simple random sampling {#RatioEstimatorSTSI}
With stratified simple random sampling, there are, similar to the regression estimator, two options for estimating a population parameter: either estimate the ratios separately for the strata or estimate a combined ratio. The separate ratio estimator\index{Ratio estimator!separate ratio estimator} of the population total is
\begin{equation}
\hat{t}_{\mathrm{sratio}}(z)=\sum_{h=1}^H \hat{t}_{\mathrm{ratio},h}(z) \;,
(\#eq:SeparateRatioEstimatorSTSI)
\end{equation}
with
\begin{equation}
\hat{t}_{\mathrm{ratio},h}(z)=\frac{\hat{t}_{\pi,h}(z)}{\hat{t}_{\pi,h}(x)} t_h(x) \;,
(\#eq:RatioEstimatorStratumTotal)
\end{equation}
in which $\hat{t}_{\pi,h}(z)$ and $\hat{t}_{\pi,h}(x)$ are the $\pi$ estimators of the population total of the study variable and the covariate for stratum $h$, respectively.
The combined ratio estimator\index{Ratio estimator!combined ratio estimator} is
\begin{equation}
\hat{t}_{\mathrm{cratio}}(z)=\frac{\sum_{h=1}^H\hat{t}_{\pi,h}(z)}{\sum_{h=1}^H\hat{t}_{\pi,h}(x)} t(x) \;.
(\#eq:CombinedRatioEstimatorSTSI)
\end{equation}
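The two estimators can also be computed directly from Equations \@ref(eq:SeparateRatioEstimatorSTSI) and \@ref(eq:CombinedRatioEstimatorSTSI). The sketch below uses simulated data with two strata, so all object names and parameter values are illustrative only.

```{r}
# Sketch with simulated data: separate vs. combined ratio estimators of a
# population total for stratified simple random sampling.
set.seed(314)
N_h <- c(400, 600)
stratum <- rep(1:2, times = N_h)
x <- rgamma(sum(N_h), shape = 2, scale = 10)
z <- c(0.2, 0.5)[stratum] * x + rnorm(sum(N_h), mean = 0, sd = sqrt(x))
n_h <- c(20, 30)
units <- c(sample(which(stratum == 1), size = n_h[1]),
           sample(which(stratum == 2), size = n_h[2]))
# pi estimators of the stratum totals of z and x
tz_h_HT <- N_h * tapply(z[units], INDEX = stratum[units], FUN = mean)
tx_h_HT <- N_h * tapply(x[units], INDEX = stratum[units], FUN = mean)
tx_h_pop <- tapply(x, INDEX = stratum, FUN = sum)
# separate ratio estimator: one ratio per stratum
tz_sratio <- sum(tz_h_HT / tx_h_HT * tx_h_pop)
# combined ratio estimator: one ratio for the whole population
tz_cratio <- sum(tz_h_HT) / sum(tx_h_HT) * sum(tx_h_pop)
```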
The code chunks below show how the combined and separate ratio estimates can be computed with package **survey**. First, two equal-sized strata are constructed, using the median of the covariate `agri` as a stratum bound. Stratum sample sizes are computed, and a stratified simple random sample without replacement is selected.
```{r}
median_agri <- quantile(grdKandahar$agri, probs = 0.5)
grdKandahar$stratum <- findInterval(grdKandahar$agri, median_agri) + 1
N_h <- table(grdKandahar$stratum)
n_h <- round(n * N_h / sum(N_h))
set.seed(314)
units <- sampling::strata(grdKandahar, stratanames = "stratum",
size = n_h, method = "srswor")
mysample <- getdata(grdKandahar, units)
```
The stratum sizes `N_h` are added to `mysample`, function `svydesign` specifies the sampling design, function `svyratio` estimates the population ratio and its variance, and finally function `predict` estimates the population total.
```{r}
lut <- data.frame(stratum = c(1, 2), N_h)
mysample <- merge(x = mysample, y = lut)
design_stsi <- svydesign(
id = ~ 1, strata = ~ stratum, data = mysample, fpc = ~ Freq)
common <- svyratio(~ poppy, ~ agri, design_stsi, separate = FALSE)
predict(common, total = sum(grdKandahar$agri))
```
The same estimate is obtained with function `calibrate`.
```{r}
mysample_cal <- calibrate(
design_stsi, ~ agri - 1, population = tx_pop, variance = 1)
svytotal(~ poppy, mysample_cal)
```
Computing the separate ratio estimator goes along the same lines. Function `svyratio` with argument `separate = TRUE` estimates the ratio and its variance for each stratum separately. To predict the population total, the stratum totals of the covariate must be passed to function `predict` using argument `total`.
```{r}
separate <- svyratio(~ poppy, ~ agri, design_stsi, separate = TRUE)
tx_h_pop <- tapply(grdKandahar$agri, INDEX = grdKandahar$stratum, FUN = sum)
predict(separate, total = tx_h_pop)
```
## Poststratified estimator {#PoststratifiedEstimator}
In stratified random sampling (Chapter \@ref(STSI)), the population is divided into several disjoint subpopulations and from each subpopulation a probability sample is selected. The subpopulations then serve as strata. The larger the difference in the stratum means and the smaller the variance within the strata, the larger the gain in precision compared to simple random sampling, see Subsection \@ref(WhyStratify).
The alternative to using the subpopulations as strata at the sampling stage is to use them as poststrata in estimating the population mean. For instance, if we have selected a simple random sample from a spatial population and we have a map of subpopulations possibly related to the study variable, then these subpopulations can still be used in the poststratified estimator\index{Poststratified estimator}. All that needs to be done is to classify the selected sampling units. Hereafter, the subpopulations that serve as poststrata are referred to as groups.
For any probability sampling design, the population mean can be estimated by
\begin{equation}
\hat{\bar{z}}_{\text{pos}}=
\sum_{g=1}^{G} w_{g}\frac{\hat{t}_g(z)}{\widehat{N}_{g}} =
\sum_{g=1}^{G} w_{g}\frac{\sum_{k \in \mathcal{S}_g}\frac{z_{k}}{\pi_k}}
{\sum_{k \in \mathcal{S}_g}\frac{1}{\pi_k}} \;,
(\#eq:PostStratifiedEstimator)
\end{equation}
where $\mathcal{S}_g$ is the sample from group $g$, $w_g=N_g/N$ is the relative size of group $g$, $\hat{t}_g(z)$ is the estimated total of the study variable for group $g$, $\widehat{N}_g$ is the estimator of the size of group $g$, and $\pi_k$ is the inclusion probability of unit $k$. The estimated group means are weighted by their relative sizes $w_g$, which are assumed to be known. In spite of this, the group means are estimated by dividing the estimated group totals by their \emph{estimated} size, $\widehat{N}_g$, because this ratio estimator is more precise than the $\pi$ estimator of the group mean.
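Equation \@ref(eq:PostStratifiedEstimator) can be written out in a few lines of code. The sketch below uses simulated data under Poisson sampling, so the unequal inclusion probabilities, the three groups, and all object names are illustrative only.

```{r}
# Sketch with simulated data: general poststratified estimator of the mean,
# with estimated group totals divided by estimated group sizes.
set.seed(314)
N <- 2000
group <- sample(1:3, size = N, replace = TRUE, prob = c(0.5, 0.3, 0.2))
z <- c(10, 20, 40)[group] + rnorm(N, mean = 0, sd = 5)
pi <- runif(N, min = 0.02, max = 0.08)
s <- which(runif(N) < pi)                        # Poisson sample
w_g <- as.numeric(table(group)) / N              # known relative group sizes
tz_g <- tapply(z[s] / pi[s], INDEX = group[s], FUN = sum)   # estimated group totals
N_g_hat <- tapply(1 / pi[s], INDEX = group[s], FUN = sum)   # estimated group sizes
mz_pos <- sum(w_g * tz_g / N_g_hat)
```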
The poststratified estimator is the natural estimator for the one-way ANOVA model\index{ANOVA model!one-way ANOVA model},
\begin{equation}
\begin{split}
Z_k &= \mu_g + \epsilon_k \\
\sigma^2_k &= \sigma^2_g \;,
\end{split}
(\#eq:ANOVAmodel)
\end{equation}
with $\mu_g$ the mean for group (subpopulation) $g=1, \dots , G$ and $\sigma^2_g$ the variance of the study variable of group $g$.
For simple random sampling, the poststratified estimator reduces to
\begin{equation}
\hat{\bar{z}}_{\text{pos}}=\sum_{g=1}^{G}w_{g}\,\bar{z}_{\mathcal{S}_g} \;,
(\#eq:PostStratifiedEstimatorSI)
\end{equation}
where $\bar{z}_{\mathcal{S}_g}$ is the sample mean of group $g$. If for all groups we have at least two sampling units, $n_g \geq 2$, the variance of this poststratified estimator of the mean can be estimated by
\begin{equation}
\widehat{V}\!\left(\hat{\bar{z}}_{\text{pos}}|\mathbf{n}_g\right)=
\sum_{g=1}^{G}w_{g}^{2}\frac{\widehat{S_{g}^{2}}}{n_{g}} \;,
(\#eq:CondVarPostStratifiedEstimator)
\end{equation}
where $n_{g}$ is the number of sampling units in group $g$, and $\widehat{S_{g}^{2}}$ is the estimated spatial variance of $z$ in
group $g$, which for simple random sampling can be estimated by
\begin{equation}
\widehat{S_{g}^{2}}=\frac{1}{n_{g}-1}\sum_{k \in \mathcal{S}_g}(z_{k}-\bar{z}_{\mathcal{S}_g})^{2} \;.
(\#eq:S2g)
\end{equation}
This is an estimator of the *conditional* sampling variance, i.e., the variance of the poststratified estimator over all simple random samples with group sample sizes, collected in the vector $\mathbf{n}_g$, equal to the group sample sizes in the sample actually selected. The poststratified estimator requires that the sizes (areas) of the strata are known. See Section \@ref(TwophaseStratification) for a sampling strategy that does not require known stratum sizes.
The poststratified estimator is illustrated with study area Voorst. We consider the situation that we do not have the map with the five combinations of soil type and land use that served as strata in Chapter \@ref(STSI). The soil-land use classes (groups) used in the poststratified estimator are only observed at the selected sampling units. Only three poststrata are distinguished: the original strata BA, EA, and PA are merged into one stratum SA with function `fct_collapse` of package **forcats** [@forcats]. The sizes of these poststrata must be known.
```{r poststratification}
library(forcats)
grdVoorst$poststratum <- fct_collapse(
grdVoorst$stratum, SA = c("BA", "EA", "PA"))
print(N_g <- tapply(grdVoorst$z, INDEX = grdVoorst$poststratum, FUN = length))
```
One hundred points are selected by simple random sampling with replacement. The expected sample sizes per group are proportional to the size of the groups, $E(n_g/n) = N_g/N$, but for a single sample the sample proportions may deviate considerably from the population proportions.
```{r}
n <- 100
N <- nrow(grdVoorst)
set.seed(314)
units <- sample(N, size = n, replace = TRUE)
mysample <- grdVoorst[units, ]
n_g <- tapply(mysample$z, INDEX = mysample$poststratum, FUN = length)
print(n_g)
```
The population mean is estimated by first computing the sample means per group, followed by computing the weighted average of the sample means, using the relative sizes of the groups as weights.
```{r}
mz_g <- tapply(mysample$z, INDEX = mysample$poststratum, FUN = mean)
w_g <- N_g / N
print(mz_pst <- sum(w_g * mz_g))
```
```{r, echo = FALSE}
n_g <- as.numeric(n_g)
```
The variance of the estimator of the mean is estimated by computing the sample variances per group, dividing these by the sample sizes per group, and computing the weighted average of these ratios, using the squared relative group sizes as weights. This estimated sampling variance is the variance of the estimator of the mean over all simple random samples with `r n_g[1]` units of group SA, `r n_g[2]` units of group RA, and `r n_g[3]` units of group XF.
```{r}
S2z_g <- tapply(mysample$z, INDEX = mysample$poststratum, FUN = var)
v_mz_g <- S2z_g / as.numeric(n_g)
print(condse_mz_pst <- sqrt(sum(w_g^2 * v_mz_g)))
```
Note that this variance estimator can be computed only if each group contains at least two sampled units. For this reason, I recommend using a limited number of groups, especially for small sample sizes.
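As a minimal illustration with hypothetical data (not the Voorst sample), `var` returns `NA` for a group containing a single unit, so the variance estimator breaks down; merging such a group into another group with `fct_collapse` solves this.

```{r}
library(forcats)
# Hypothetical mini-example: group B contains a single unit, so its sample
# variance is NA and the poststratified variance estimator cannot be computed.
z_demo <- c(1.2, 3.4, 2.0)
g_demo <- factor(c("A", "A", "B"))
tapply(z_demo, INDEX = g_demo, FUN = var)
# Merging the single-unit group B into group A restores the estimator.
g_merged <- fct_collapse(g_demo, A = c("A", "B"))
tapply(z_demo, INDEX = g_merged, FUN = var)
```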
Function `postStratify` of package **survey** can be used to compute the poststratified estimator and its standard error.
```{r}
mysample$weights <- N / n
design_si <- svydesign(id = ~ 1, weights = ~ weights, data = mysample)
pop <- data.frame(poststratum = c("SA", "RA", "XF"), Freq = N_g)
mysample_pst <- postStratify(
design_si, strata = ~ poststratum, population = pop)
svymean(~ z, mysample_pst)
```
```{block2, type = 'rmdwarning'}
@loh99 warns about data snooping\index{Data snooping}. By defining groups after analysing the data, arbitrarily small sampling variances of the estimated mean can be obtained.
```
## Model-assisted estimation using machine learning techniques {#RandomForest}
@Breidt2017 review model-assisted estimators based on machine learning techniques\index{Machine learning technique}. Of special interest is the general approach proposed by @Wu2001 for incorporating non-linear predictions in the model-assisted estimator. They show how non-linear predictions of the study variable, for instance obtained by a regression tree or random forest, can be used in the model-calibration estimator:
\begin{equation}
\hat{\bar{z}}_{\mathrm{MC}}= \hat{\bar{z}}_{\pi} + \hat{a}\left(1-\frac{1}{N}\sum_{k \in \mathcal{S}}\frac{1}{\pi_k}\right)+ \hat{b}\left(\frac{1}{N}\sum_{k=1}^N \hat{m}(\mathbf{x}_k) - \frac{1}{N}\sum_{k \in \mathcal{S}} \frac{\hat{m}(\mathbf{x}_k)}{\pi_k} \right)
\;,
(\#eq:ModelCalibrationEstimator)
\end{equation}
with $\hat{b}$ a slope coefficient estimated by
\begin{equation}
\hat{b} = \frac{\sum_{k \in \mathcal{S}} 1/\pi_k \{\hat{m}(\mathbf{x}_k)-\hat{\bar{m}}_{\pi}\} \{z_k-\hat{\bar{z}}_{\pi}\}}{\sum_{k \in \mathcal{S}} 1/\pi_k \{\hat{m}(\mathbf{x}_k)-\hat{\bar{m}}_{\pi}\}^2}\;,
(\#eq:SlopeCalibrationEstimator)
\end{equation}
with $\hat{\bar{z}}_{\pi}$ the $\pi$ estimator of the population mean of the study variable, $\hat{\bar{m}}_{\pi}$ the $\pi$ estimator of the population mean of the predicted values, and $\hat{a}$ an intercept estimated by
\begin{equation}
\hat{a} = (1-\hat{b})\left(\frac{1}{N}\sum_{k \in \mathcal{S}}\frac{z_k}{\pi_k}\right)\;.
(\#eq:InterceptCalibrationEstimator)
\end{equation}
The second term in Equation \@ref(eq:ModelCalibrationEstimator) cancels for all sampling designs for which the sum of the design weights, i.e., the sum of the reciprocals of the inclusion probabilities, equals the population size: $\sum_{k \in \mathcal{S}} 1/\pi_k=N$. Only for some unequal probability sampling designs may this not be the case.
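To make the computations concrete, here is a sketch of the model-calibration estimator for simple random sampling, where $\pi_k = n/N$ for all units, so the weights $1/\pi_k$ cancel in $\hat{b}$ and the second term of Equation \@ref(eq:ModelCalibrationEstimator) vanishes. The synthetic population, sample sizes, and working model (a regression tree fitted with package **rpart**) are illustrative only, not taken from the text.

```{r}
library(rpart)
# Illustrative synthetic population with a non-linear relation between x and z
set.seed(1)
N_pop <- 1000; n_s <- 50
x <- runif(N_pop)
z <- 2 + 5 * x^2 + rnorm(N_pop, sd = 0.5)
s <- sample(N_pop, size = n_s)
# Non-linear working model fitted on the sample; predictions for all units
tree <- rpart(z ~ x, data = data.frame(z = z[s], x = x[s]))
m <- predict(tree, newdata = data.frame(x = x))
# Pi estimators of the means of z and m (equal inclusion probabilities)
mz_pi <- mean(z[s]); mm_pi <- mean(m[s])
# Slope b-hat of Equation (eq:SlopeCalibrationEstimator); weights cancel
b <- sum((m[s] - mm_pi) * (z[s] - mz_pi)) / sum((m[s] - mm_pi)^2)
# Second term cancels because sum(1/pi_k) = N, leaving
mz_MC <- mz_pi + b * (mean(m) - mm_pi)
print(mz_MC)
```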
The alternative is to plug the fitted values $\hat{m}(\mathbf{x}_k)$ into the generalised difference estimator, Equation \@ref(eq:GeneralizedDifferenceEstimator). If we drop the second term, the model-calibration estimator\index{Model-calibration estimator} can be rewritten as
\begin{equation}
\hat{\bar{z}}_{\mathrm{MC}}=\frac{1}{N}\sum_{k=1}^N\hat{b}\;\hat{m}(\mathbf{x}_k)+\frac{1}{N}\sum_{k \in \mathcal{S}}\frac{z_k-\hat{b}\;\hat{m}(\mathbf{x}_k)}{\pi_k}
\;.
(\#eq:ModelCalibrationEstimator2)
\end{equation}
Comparison with the generalised difference estimator, Equation \@ref(eq:GeneralizedDifferenceEstimator), shows that these two estimators are equivalent when $\hat{b}=1$. For non-linear working models, generally $\hat{b} \neq 1$, so that these two estimators are not the same. @Wu2003 shows that the calibration estimator has a general optimality property.
In case you are confused by all these model-assisted estimators, let me clarify. The most general estimator is the model-calibration estimator. If we take for $\hat{b}$ the value 1, this estimator is equivalent to the generalised difference estimator (Equation \@ref(eq:GeneralizedDifferenceEstimator)). The predictions $\hat{m}(\mathbf{x}_k)$ in these estimators can be computed either by a linear model or a non-linear model. If a linear model is used in the generalised difference estimator, this estimator is equal to the generalised regression estimator (Equation \@ref(eq:GREG)). With linear models, $\hat{b}$ in Equation \@ref(eq:ModelCalibrationEstimator2) equals 1, so that all three estimators are equal.
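The equality $\hat{b}=1$ for a linear working model can be verified numerically: for fitted values from ordinary least squares, the sample covariance of $\hat{m}(\mathbf{x})$ and $z$ equals the sample variance of $\hat{m}(\mathbf{x})$, because the residuals are orthogonal to the fitted values. The following sketch, with illustrative data not taken from the text, checks this for an equal-probability sample, in which the weights $1/\pi_k$ in Equation \@ref(eq:SlopeCalibrationEstimator) cancel.

```{r}
# Numerical check that b-hat equals 1 for a linear working model fitted by
# ordinary least squares on an equal-probability sample (illustrative data)
set.seed(2)
x_s <- runif(40)
z_s <- 1 + 3 * x_s + rnorm(40, sd = 0.2)
m_s <- fitted(lm(z_s ~ x_s))
b <- sum((m_s - mean(m_s)) * (z_s - mean(z_s))) / sum((m_s - mean(m_s))^2)
print(b)  # equals 1 up to numerical precision
```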
For simple random sampling, the inclusion probabilities of the units are the same for all units: $\pi_k = n/N$, reducing Equations \@ref(eq:ModelCalibrationEstimator) and \@ref(eq:SlopeCalibrationEstimator) to