14-SmallAreaEstimation.Rmd

# Sampling for estimating parameters of domains {#SmallAreaEstimation}

This chapter is about probability sampling and estimation of means or totals of subpopulations (subareas, subregions). In the sampling literature, these subpopulations are referred to as domains of interest\index{Domain of interest}, or shortly domains. Ideally, at the stage of designing a sample, these domains are known, and of every population unit we know to which domain it belongs. In that situation it is most convenient to use these domains as strata in random sampling, so that we can control the sample size in each domain (Chapter \@ref(STSI)). 

If we have multiple maps with domains, think for instance of a soil class map, a map with land cover classes, and a map with countries, we can make an overlay of these maps to construct the cross-tabulation strata. However, this may result in numerous strata, in some cases even more than the sample size. In this situation, an attractive solution is multiway stratification (Subsection \@ref(Multiwaystratification)). With this design, the domains of interest are used as strata, not their cross-classification, and the sample sizes of these marginal strata are controlled.

Even with a multiway stratified sample, resulting in controlled sample sizes for each domain, the sample size of a domain can be too small for a reliable estimate of the mean or total when using just the data of this domain. In this case we may use model-assisted estimators (Chapter \@ref(Modelassisted)) to increase the precision. Not only the data collected from a given domain are used to estimate the mean or total, but also data outside the domain (Section \@ref(SmallDomainsModelAssisted)) (@cha94, @Rao2003, @Falorsi2008).

We may also wish to estimate the mean or total of domains that are not used as strata or marginal strata. The sample size in these domains is then not controlled and varies among samples selected with the sampling design. As before with multiway stratified sampling, the mean can either be estimated with the direct estimator, using the data from the domain only (Section \@ref(LargeDomainsDirectEstimator)), or a model-assisted estimator, also using data from outside the domain (Section \@ref(SmallDomainsModelAssisted)).

## Direct estimator for large domains {#LargeDomainsDirectEstimator}

If the sample size of a domain $d$ is considered large enough to obtain a reliable estimate of the mean and, besides, the size of the domain is known, the mean of that domain can be estimated by the direct estimator\index{Direct estimator}:

\begin{equation}
\hat{\bar{z}}_{d}=\frac{1}{N_{d}}\sum_{k \in \mathcal{S}_d}\frac{z_{dk}}{\pi _{dk}} \;,
(\#eq:piestimatormeandomain)
\end{equation}

where $N_{d}$ is the size of the domain, $z_{dk}$ is the value for unit $k$ of domain $d$, and $\pi _{dk}$ is the inclusion probability of this point.

When the domain is not used as a (marginal) stratum, so that the sample size of the domain is random, the mean of the domain can be estimated best by the ratio estimator:

\begin{equation}
\hat{\bar{z}}_{\text{ratio},d}= \frac{\hat{t}_d(z)}{\widehat{N}_d}=
\frac{\sum_{k \in \mathcal{S}_d}\frac{z_{dk}}{\pi_{dk}}}{\sum_{k \in \mathcal{S}_d}\frac{1}{\pi_{dk}}} \;.
(\#eq:generalratiodomain)
\end{equation}

```{block2, type='rmdnote'}
The ratio estimator\index{Ratio estimator} can also be used when the size of the domain is unknown. An example of this is estimating the mean of soil classes *as observed in the field*, not *as depicted on a soil map*. A soil map is impure, i.e., the map units contain patches with soil classes that differ from the soil class as indicated on the map. The area of a given true soil class is not known. 
```

For simple random sampling without replacement, $\pi_{dk} = n/N$. Inserting this in Equation \@ref(eq:generalratiodomain) gives

\begin{equation}
\hat{\bar{z}}_{\text{ratio},d}=\frac{1}{n_{d}}\sum_{k \in \mathcal{S}_d}z_{dk} \;.
(\#eq:ratiodomainSI)
\end{equation}

The mean of the domain is simply estimated by the mean of the $z$-values observed in the domain, i.e., the sample mean in domain $d$. The variance of this estimator can be estimated by

\begin{equation}
\widehat{V}\!\left(\hat{\bar{z}}_{\text{ratio},d}\right) =
\frac{1}{\hat{a}_{d}^{2}}\cdot\frac{1}{n\,(n-1)}\sum_{k \in \mathcal{S}_d}(z_{dk}-\bar{z}_{\mathcal{S}_d})^{2} \;,
(\#eq:VarratiodomainSI)
\end{equation}

where $\bar{z}_{\mathcal{S}_d}$ is the sample mean in domain $d$, and $\hat{a}_{d}$ is the estimated relative size of domain $d$:

\begin{equation}
\hat{a}_{d}=\frac{n_{d}}{n} \;.
(\#eq:estimatedrelativesizedomain)
\end{equation}

Refer to section (8.2.2) in @gru06 for the ratio estimator and its standard error with stratified simple random sampling, in case the domains cut across the strata, and other sampling designs.

The ratio estimator and its standard error can be computed with function `svyby` of package **survey** [@Lumley2020]. This is illustrated with Eastern Amazonia. We wish to estimate the mean aboveground biomass (AGB) of the 16 ecoregions from a simple random sample of 200 units.

```{r, echo = FALSE}
mz_pop <- tapply(grdAmazonia$AGB, INDEX = grdAmazonia$Ecoregion, FUN = mean)
```

```{r}
library(survey)
set.seed(314)
n <- 200
mysample <- grdAmazonia %>%
  mutate(N = n()) %>%
  slice_sample(n = n)
design_si <- svydesign(id = ~ 1, data = mysample, fpc = ~ N)
res <- svyby(~ AGB, by = ~ Ecoregion, design = design_si, FUN = svymean)
```

The ratio estimates of the mean AGB are shown in Table \@ref(tab:TableRatioEstimatesEcoregions). Two ecoregions are missing in the table: no units are selected from these ecoregions so that a direct estimate is not available. There are three ecoregions with an estimated standard error of 0.0. These ecoregions have less than two sampling units only, so that the standard error cannot be estimated.

(ref:TableRatioEstimatesEcoregionslabel) Ratio estimates and estimated standard errors of the ratio estimator of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia, with simple random sampling without replacement of size 200.

```{r TableRatioEstimatesEcoregions, echo = FALSE}
rownames(res) <- NULL
res[, c(2, 3)] <- round(res[, c(2, 3)], 1)
knitr::kable(
  res, caption = "(ref:TableRatioEstimatesEcoregionslabel)",
  booktabs = TRUE,
  linesep = ""
) %>%
  kable_classic() %>%
   add_footnote("The estimated standard errors of 0.0 are non-availables.", notation = "none")
```

The simple random sampling is repeated 1,000 times, and every sample is used to estimate the mean AGB of the ecoregions both with the $\pi$ estimator and the ratio estimator. As can be seen in Table \@ref(tab:TableRatiovsHTEstimatesEcoregions) the standard deviation of the ratio estimates is much smaller than that of the $\pi$ estimates. The reason is that the number of sampling units in an ecoregion varies among samples, i.e., the sample size of an ecoregion is random. When many units are selected from an ecoregion, the estimated total of that ecoregion is large. The estimated mean as obtained with the $\pi$ estimator then is large too, because the estimated total is divided by the fixed size (total number of population units, $N_d$) of the ecoregion. However, in the ratio estimator the size of an ecoregion is estimated from the same sample, although we know its size, see Equation \@ref(eq:generalratiodomain). With many units selected from an ecoregion, the estimated size of that ecoregion, $\widehat{N}_d$, is also large. By dividing the large estimated total by the large estimated size, a more stable estimate of the mean of the domain is obtained. For quite a few ecoregions the standard deviations are very large, especially of the $\pi$ estimator. These are the ecoregions with very small average sample sizes. With simple random sampling, the expected sample size can simply be computed by $E[n] = n \; N_d/N$. In the following section, alternative estimators are described for these ecoregions with small expected sample sizes. To speed up the computations, I used a 5 km $\times$ 5 km subgrid of `grdAmazonia` in this sampling experiment.

```{r, echo = FALSE, eval = FALSE}
n <- 200
grdAmazonia <- read_rds(file = "results/grdAmazonia_5km.rds")
grdAmazonia$lnSWIR2 <- log(grdAmazonia$SWIR2)
mz_pop <- tapply(grdAmazonia$AGB, INDEX = grdAmazonia$Ecoregion, FUN = mean)
ecoregions <- sort(unique(grdAmazonia$Ecoregion))
N <- nrow(grdAmazonia)
N_eco <- tapply(grdAmazonia$AGB, INDEX = grdAmazonia$Ecoregion, FUN = length)
number_of_samples <- 1000
mx_eco_pop <- tapply(grdAmazonia$lnSWIR2, INDEX = grdAmazonia$Ecoregion, FUN = mean)
mX_eco_pop <- data.frame(Intercept = rep(1, length(mx_eco_pop)), lnSWIR2 = mx_eco_pop)

set.seed(314)
mz_eco_HT <- mz_eco_ratio <- mz_eco_regr <- mz_eco_synt <- mz_eco_extsynt <- matrix(nrow = length(ecoregions), ncol = number_of_samples, dimnames = list(ecoregions, seq(1:number_of_samples)))
for (i in 1:number_of_samples) {
  print(i)
  mypop <- grdAmazonia
  units <- sample(nrow(mypop), size = n, replace = FALSE)
  mysample <- mypop[units, c("AGB", "Ecoregion")]

# HT estimator
  tz <- (N / n) * tapply(mysample$AGB, INDEX = mysample$Ecoregion, FUN = sum)
  mz_eco_HT[names(tz), i] <- tz / N_eco[names(tz)]

# ratio estimator
  mysample$N <- N
  design_si <- svydesign(id = ~ 1, data = mysample, fpc = ~ N)
  res <- svyby(~ AGB, by = ~ Ecoregion, design = design_si, FUN = svymean)
  mz_eco_ratio[res$Ecoregion, i] <- res$AGB

# regression estimator
  mypop$ind <- rep(0, nrow(mypop))
  mypop$ind[units] <- 1
  mypop$AGB[mypop$ind == 0] <- NA
  res <- forestinventory::twophase(AGB ~ lnSWIR2, data = as.data.frame(mypop),
                                   phase_id = list(phase.col = "ind", terrgrid.id = 1),
                                   small_area = list(sa.col = "Ecoregion", areas = ecoregions, unbiased = TRUE),
                                   psmall = TRUE, exhaustive = mX_eco_pop)$estimation
  mz_eco_regr[, i] <- res$estimate

# synthetic estimator
  res <- forestinventory::twophase(AGB ~ lnSWIR2, data = as.data.frame(mypop),
                                   phase_id = list(phase.col = "ind", terrgrid.id = 1),
                                   small_area = list(sa.col = "Ecoregion", areas = ecoregions, unbiased = FALSE),
                                   psmall = FALSE, exhaustive = mX_eco_pop)$estimation
  mz_eco_synt[, i] <- res$estimate

# extended synthetic estimator
  res <- forestinventory::twophase(AGB ~ lnSWIR2, data = as.data.frame(mypop),
                                   phase_id = list(phase.col = "ind", terrgrid.id = 1),
                                   small_area = list(sa.col = "Ecoregion", areas = ecoregions, unbiased = TRUE),
                                   psmall = FALSE, exhaustive = mX_eco_pop)$estimation
  mz_eco_extsynt[, i] <- res$estimate

}
save(mz_eco_HT, mz_eco_ratio, mz_eco_regr, mz_eco_synt, mz_eco_extsynt, file = "results/MeanAGB_ecoregions.rda")
```

(ref:TableRatiovsHTEstimatesEcoregions) Standard deviations of 1,000 $\pi$ estimates (HT) and 1,000 ratio estimates (Ratio) of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia, with simple random sampling without replacement of size 200.

```{r TableRatiovsHTEstimatesEcoregions, echo = FALSE}
load(file = "results/MeanAGB_ecoregions.rda")
sd_mz_HT <- apply(mz_eco_HT, MARGIN = 1, FUN = var, na.rm = TRUE) %>% sqrt(.) %>% round(1)
sd_mz_ratio <- apply(mz_eco_ratio, MARGIN = 1, FUN = var, na.rm = TRUE) %>% sqrt(.) %>% round(1)
notNA <- apply(mz_eco_HT, MARGIN = 1, FUN = function(x) {
  sum(!is.na(x))})
N_eco <- tapply(grdAmazonia$AGB, INDEX = grdAmazonia$Ecoregion, FUN = length)
N <- nrow(grdAmazonia)
En <- round(n * N_eco / N, 2)
nms <- names(sd_mz_HT)
names(sd_mz_HT) <- NULL
names(sd_mz_ratio) <- NULL
names(En) <- NULL
res <- data.frame(nms, sd_mz_HT, sd_mz_ratio, En)
rownames(res) <- NULL
knitr::kable(
  res, caption = "(ref:TableRatiovsHTEstimatesEcoregions)",
  col.names = c("Ecoregion", "HT", "Ratio", "n"),
  booktabs = TRUE,
  linesep = ""
) %>%
  kable_classic() %>%
   add_footnote("n: expected sample size.", notation = "none")
```

No covariates are used in the ratio estimator. If we wish to exploit covariates, the mean of a domain can be estimated best by the ratio of the regression estimate of the domain total and the estimated size of the domain:

\begin{equation}
\hat{\bar{z}}_{\text{ratio},d}= \frac{\hat{t}_{\text{regr},d}(z)}{\widehat{N}_d}\;.
\end{equation}

For a large domain with a reasonable sample size, the regression estimate can be computed from the data of that domain (Chapter \@ref(Modelassisted)). For small domains, also the data from outside these domains can be used to estimate the population regression coefficients. This is explained in Subsection \@ref(RegressionestimatorSmallDomain). 

## Model-assisted estimators for small domains {#SmallDomainsModelAssisted}

When the domains are not well represented in the sample, the direct estimators from the previous section lead to large standard errors. In this situation, we may try to increase the precision by also using observations from outside the domain. If we have covariates related to the study variable, we may exploit this ancillary information by fitting a regression model relating the study variable to the covariates and using the fitted model to predict the study variable for all population units (nodes of discretisation grid), see Chapter \@ref(Modelassisted)\index{Model-assisted approach}. However, for a small domain\index{Small domain}, we may have too few sampled units in that domain to fit a separate regression model. The alternative then is to use the entire sample to estimate the regression coefficients, and to use this global regression model to estimate the means of the domains. This introduces a systematic error, a design-bias, in the estimator. However, this extra error is potentially outweighed by the reduction of the random error due to the use of the globally estimated regression coefficients. If one or more units are selected from a domain, the observations of the study variable of these units can be used to correct for the bias. This leads to the regression estimator for small domains\index{Regression estimator!for small domains}. In the absence of such data, the mean of the domain can still be estimated by the so-called synthetic estimator\index{Synthetic estimator}.

There are quite a few packages for model-assisted estimation of means of small areas, the **maSAE** package [@maSAE], the **JoSAE** package [@JoSAE], the **rsae** package [@rsae], and the **forestinventory** package [@Hill2021]. I use package **forestinventory** for model-assisted estimation (Subsections \@ref(RegressionestimatorSmallDomain) and \@ref(SyntheticestimatorSmallDomain)) and package **JoSAE** for model-based prediction of the means of small areas (Section \@ref(SmallAreaModelBased)).

### Regression estimator {#RegressionestimatorSmallDomain}

In the regression estimator, the potential bias due to the globally estimated regression coefficients can be eliminated by adding the $\pi$ estimator of the mean of the regression residuals to the mean of the predictions in the domain (compare with Equation \@ref(eq:GREG)) (@Mandallaz2007, @Mandallaz2013):

\begin{equation}
\hat{\bar{z}}_{\text{regr},d} = \frac{1}{N_d} \sum_{k=1}^{N_d} \mathbf{x}^{\mathrm{T}}_{dk} \hat{\mathbf{b}} + \frac{1}{N_d} \sum_{k \in \mathcal{S}_d} \frac{e_{dk}}{\pi_{dk}} = \bar{\mathbf{x}}_d^{\mathrm{T}} \hat{\mathbf{b}} + \frac{1}{N_d} \sum_{k \in \mathcal{S}_d} \frac{e_{dk}}{\pi_{dk}} \;,
(\#eq:regressionestimatorsmalldomain)
\end{equation}

with $\mathbf{x}_{dk}$ the vector with covariate values for unit $k$ in domain $d$, $\hat{\mathbf{b}}$ the vector with globally estimated regression coefficients, $e_{dk}$ the residual for unit $k$ in domain $d$, $\pi_{dk}$ the inclusion probability of that unit, and $\bar{\mathbf{x}}_d$ the mean of the covariates in domain $d$. Alternatively, the mean of the residuals in a domain is estimated by the ratio estimator:

\begin{equation}
\hat{\bar{z}}_{\text{regr},d} =  \bar{\mathbf{x}}_d^{\mathrm{T}} \hat{\mathbf{b}} + \frac{1}{\widehat{N}_d} \sum_{k \in \mathcal{S}_d} \frac{e_{dk}}{\pi_{dk}} \;,
(\#eq:regressionestimatorsmalldomainratio)
\end{equation}

with $\widehat{N}_d$ the estimated size of domain $d$, see Equation \@ref(eq:generalratiodomain). The regression coefficients can be estimated by Equation \@ref(eq:EstimatorMultipleRegressionCoefficients). With simple random sampling, the second term in Equation \@ref(eq:regressionestimatorsmalldomainratio) is equal to the sample mean of the residuals, so that the estimator reduces to 

\begin{equation}
\hat{\bar{z}}_{\text{regr},d}=\bar{\mathbf{x}}_d^{\mathrm{T}} \hat{\mathbf{b}} + \bar{e}_{\mathcal{S}_d}\;,
(\#eq:regressionestimatorsmalldomainSI)
\end{equation}

with $\bar{e}_{\mathcal{S}_d}$ the sample mean of the residuals in domain $d$.

A regression estimate can only be computed if we have at least one observation of the study variable in the domain $d$. The variance of the regression estimator of the mean for a small domain can be estimated by [@Hill2021]

\begin{equation}
\widehat{V}\!\left(\hat{\bar{z}}_{\text{regr},d}\right) = \bar{\mathbf{x}}_d^{\mathrm{T}} \widehat{\mathbf{C}}(\hat{\mathbf{b}}) \bar{\mathbf{x}}_d + \widehat{V}\!\left(\hat{\bar{e}}_d \right)\;,
(\#eq:Varregressionestimatorsmalldomain)
\end{equation}

with $\widehat{\mathbf{C}}(\hat{\mathbf{b}})$ the matrix with estimated sampling variances and covariances of the estimators of the regression coefficients. The first variance component is the contribution due to uncertainty about the regression coefficients, the second component accounts for the uncertainty about the mean of the residuals in the domain. For simple random sampling, the sampling variance of the $\pi$ estimator of the mean of the residuals in a domain can be estimated by the sample variance of the residuals in that domain divided by the sample size $n_d$. This variance estimator is presented in @Hill2021. If the domain is not used as a stratum and the domain mean of the residuals is estimated by the ratio estimator, the second variance component can be estimated by

\begin{equation}
\widehat{V}\!\left(\hat{\bar{e}}_{\text{ratio},d}\right) = \left(\frac{n}{n_{d}}\right)^{2}\cdot\frac{1}{n\,(n-1)}\sum_{k \in \mathcal{S}_d}(e_{dk}-\bar{e}_{\mathcal{S}_d})^{2} \;.
(\#eq:Varratioestimatorofmeanresidual)
\end{equation}

With simple random sampling, the sampling variances and covariances of the estimators of the regression coefficients can be estimated by (equation 2 in @Hill2021)

\begin{equation}
\widehat{\mathbf{C}}(\hat{\mathbf{b}}) = \frac{1}{n} \left( \sum_{k \in \mathcal{S}} \mathbf{x}_k \mathbf{x}^{\mathrm{T}}_k \right) ^{-1} \left( \frac{1}{n^2} \sum_{k \in \mathcal{S}} e_k^2  \mathbf{x}_k \mathbf{x}^{\mathrm{T}}_k \right) \frac{1}{n}\left(\sum_{k \in \mathcal{S}}^n \mathbf{x}_k \mathbf{x}^{\mathrm{T}}_k \right)^{-1} \;.
(\#eq:samplingVarregressioncoefficients)
\end{equation}

The sampling variances and covariances of the estimators of the population regression coefficients are not equal to the model-variances and covariances as obtained with multiple linear regression, using functions `lm` and `vcov`, see Section \@ref(GREG) and Chapter \@ref(Approaches).

Function `twophase` of package **forestinventory** [@Hill2021] can be used to compute the regression estimate for small domains and its standard error. 

```{block2, type='rmdnote'}
The function name 'twophase' is somewhat confusing. It suggests that we have a large sample which is subsampled in a second phase, as described in Chapter \@ref(Twophase). This is not the case here. However, upon considering infinite populations, @Hill2021 treat the grid that discretises the infinite population as the first-phase sample. The sampling error introduced by this discretisation grid can then be accounted for. I ignore this sampling error, which will be very small anyway, because the number of grid cells is very large.
```

By assigning the domain means of the covariates to argument `exhaustive` of function `twophase` the sampling error of the first phase is ignored.  Function `twophase` assumes simple random sampling (unless optional argument `cluster` is used). Note that for the unobserved population units (not selected units) the AGB values are changed into non-availables. In package **survey** also a function `twophase` is defined, for this reason the  name of the package is made explicit by `forestinventory::twophase`. With argument `psmall = TRUE` and element `unbiased = TRUE` in the list `small_area` the regression estimate is computed. Log-transformed SWIR2 is used as a covariate in the regression estimator of the mean AGB of the ecoregions.

```{r, echo = FALSE}
mz_pop <- tapply(grdAmazonia$AGB, INDEX = grdAmazonia$Ecoregion, FUN = mean)
```

```{r}
library(forestinventory)
n <- 200
set.seed(314)
units <- sample(nrow(grdAmazonia), size = n, replace = FALSE)
mysample <- grdAmazonia[units, c("AGB", "SWIR2", "Ecoregion")]
grdAmazonia <- grdAmazonia %>%
  mutate(lnSWIR2 = log(SWIR2),
         id = row_number(),
         ind = as.integer(id %in% units))
grdAmazonia$AGB[grdAmazonia$ind == 0L] <- NA
mx_eco_pop <- tapply(
  grdAmazonia$lnSWIR2, INDEX = grdAmazonia$Ecoregion, FUN = mean)
mX_eco_pop <- data.frame(
  Intercept = rep(1, length(mx_eco_pop)), lnSWIR2 = mx_eco_pop)
res <- forestinventory::twophase(AGB ~ lnSWIR2,
  data = as.data.frame(grdAmazonia),
  phase_id = list(phase.col = "ind", terrgrid.id = 1),
  small_area = list(sa.col = "Ecoregion",
                    areas = sort(unique(mysample$Ecoregion)),
                    unbiased = TRUE),
  psmall = TRUE, exhaustive = mX_eco_pop)
regr <- res$estimation
```


(ref:TableRegressionEstimatesEcoregionslabel) Regression estimates of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia, for simple random sample without replacement of size 200, using lnSWIR2 as a predictor.

```{r TableRegressionEstimatesEcoregions, echo = FALSE}
df <- regr[c("area", "estimate", "ext_variance", "g_variance", "n2G")]
df$estimate <- round(df$estimate, 1)
df$ext_variance <- formatC(df$ext_variance, 1, format = "f", big.mark = ",")
df$g_variance <- formatC(df$g_variance, 1, format = "f", big.mark = ",")

knitr::kable(
  df, caption = "(ref:TableRegressionEstimatesEcoregionslabel)",
  booktabs = TRUE,
  col.names = c("Ecoregion", "AGB", "ext_var", "g_var", "n2G"),
  linesep = "",
) %>%
  kable_classic() %>%
   add_footnote(" For explanation of variances of regression estimator, see text. n2G: sample size of ecoregion; NA: not available; NaN: not a number.", notation = "none")
```

The alternative is to save the selected units (sample) in a data frame, passed to function `twophase` with argument `data`. The results are identical because the true means of the covariate $x$ specified with argument `exhaustive` contains all required information at the population level.

For two ecoregions, no regression estimate of the mean AGB is obtained (Table \@ref(tab:TableRegressionEstimatesEcoregions)). No units are selected from these domains. The estimated variance of the estimated domain mean is in the column g_var. In the estimated variance ext_var the first variance component of Equation \@ref(eq:Varregressionestimatorsmalldomain) is ignored. Note that for the ecoregions with a sample size of one unit (the sample size per domain is in column `n2G`), no estimate of the variance is available, because the variance of the estimated mean of the residuals cannot be estimated from one unit.

Figure \@ref(fig:RatioversusRegrEcoregions) shows the regression estimates plotted against the ratio estimates. The intercept of the line, fitted with ordinary least squares (OLS), is larger than 0, and the slope is smaller than 1. Using the regression model predictions in the estimation of the means leads to some smoothing.

(ref:RatioversusRegrEcoregionslabel) Scatter plot of the ratio and the regression estimates of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia for simple random sample without replacement of size 200. In the regression estimate, lnSWIR2 is used as a predictor. The line is fitted by ordinary least squares.

```{r RatioversusRegrEcoregions, echo = FALSE, fig.width = 5, fig.cap = "(ref:RatioversusRegrEcoregionslabel)"}
mysample$N <- nrow(grdAmazonia)
design_si <- svydesign(id = ~ 1, data = mysample, fpc = ~ N)
res <- svyby(~ AGB, by = ~ Ecoregion, design = design_si, FUN = svymean)

names(regr)[1] <- "Ecoregion"

df <- merge(res, regr, by = "Ecoregion")
slr <- lm(estimate ~ AGB, data = df)
ab <- coef(slr)
ggplot(data = df) +
        geom_point(mapping = aes(x = AGB, y = estimate), size = 2) +
        geom_abline(intercept = ab[1], slope = ab[2]) +
        scale_x_continuous(name = "Ratio estimate", limits = c(50, 300)) +
        scale_y_continuous(name = "Regression estimate", limits = c(50, 300)) +
        coord_fixed()
```

I quantified the gain in precision of the estimated mean AGB due to the use of the regression model by the variance of the ratio estimator divided by the variance of the regression estimator (Table \@ref(tab:TableGainRegressionestimator)). For ratios larger than 1, there is a gain in precision. Both variances are estimated from 1,000 repeated ratio and regression estimates obtained with simple random sampling without replacement of size 200. For all but two small ecoregions, there is a gain. For quite a few ecoregions, the gain is quite large. These are the ecoregions where the globally fitted regression model explains a large part of the spatial variation of AGB.

```{r TableGainRegressionestimator, echo = FALSE}
load(file = "results/MeanAGB_ecoregions.rda")
ecoregions <- sort(unique(grdAmazonia$Ecoregion))
v_ratio <- apply(mz_eco_ratio, MARGIN = 1, FUN = var, na.rm = TRUE)
v_regr <- apply(mz_eco_regr, MARGIN = 1, FUN = var, na.rm = TRUE)
gain <- round(v_ratio / v_regr, 2)
df <- data.frame(ecoregions, gain = gain)
rownames(df) <- NULL
knitr::kable(
  df, caption = "Estimated gain in precision of the estimated mean AGB of ecoregions in Eastern Amazonia, as quantified by the ratio of the estimated variance of the ratio estimator (no covariate used) to the estimated variance of the regression estimator (using lnSWIR2 as a predictor), for simple random sampling without replacement of size 200.",
  booktabs = TRUE,
  col.names = c("Ecoregion", "Gain"),
  linesep = ""
) %>%
  kable_classic()
```


### Synthetic estimator {#SyntheticestimatorSmallDomain}

For small domains from which no units are selected, the mean can still be estimated by the synthetic estimator\index{Synthetic estimator}, also referred to as the synthetic regression estimator, by dropping the second term in Equation \@ref(eq:regressionestimatorsmalldomain): 

\begin{equation}
\hat{\bar{z}}_{\text{syn},d}=\bar{\mathbf{x}}_d^{\mathrm{T}} \hat{\mathbf{b}}\;.
(\#eq:syntheticestimatorsmalldomain)
\end{equation}

The variance can be estimated by

\begin{equation}
\widehat{V}\!\left(\hat{\bar{z}}_{\text{syn},d}\right) = \bar{\mathbf{x}}_d^{\mathrm{T}} \widehat{\mathbf{C}}(\hat{\mathbf{b}}) \bar{\mathbf{x}}_d \;.
(\#eq:Varsyntheticestimatorsmalldomain)
\end{equation}

This is equal to the first variance component of Equation \@ref(eq:Varregressionestimatorsmalldomain). 
The synthetic estimate can be computed with function `twophase`, with argument `psmall = FALSE` and element `unbiased = FALSE` in the list `small_area`.

```{r}
res <- forestinventory::twophase(AGB ~ lnSWIR2,
  data = as.data.frame(grdAmazonia),
  phase_id = list(phase.col = "ind", terrgrid.id = 1),
  small_area = list(sa.col = "Ecoregion", areas = ecoregions, unbiased = FALSE),
  psmall = FALSE, exhaustive = mX_eco_pop)
synt <- res$estimation
```
(ref:TableSyntheticEstimateslabel) Synthetic estimates of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia, for simple random sample without replacement of size 200, using lnSWIR2 as a predictor.

```{r TableSyntheticEstimates, echo = FALSE}
df <- synt[c("area", "estimate", "g_variance", "n2G")]
df[, c(2, 3)] <- round(df[, c(2, 3)], 1)

knitr::kable(
  df, caption = "(ref:TableSyntheticEstimateslabel)",
  booktabs = TRUE,
  col.names = c("Ecoregion", "AGB", "g_var", "n2G"),
  linesep = ""
) %>%
  kable_classic()  %>%
   add_footnote(" n2G: sample size of ecoregion.", notation = "none")
```


```{r, echo = FALSE}
names(synt)[1] <- "Ecoregion"
df <- merge(synt, regr, by = "Ecoregion")
d_estimates <- df$estimate.x - df$estimate.y
md <- mean(d_estimates)
d_vars <- df$g_variance.x - df$g_variance.y
```

For all ecoregions, also the unsampled ones, a synthetic estimate of the mean AGB is obtained (Table \@ref(tab:TableSyntheticEstimates)). For the sampled ecoregions, the synthetic estimate differs from the regression estimate. This difference can be quite large for ecoregions with a small sample size. Averaged over all sampled ecoregions, the difference, computed as synthetic estimate minus regression estimate,  equals `r formatC(md, 1, format = "f")` 10^9^ kg ha^-1^. The variance of the regression estimator is always much larger than the variance of the synthetic estimator. The difference is the variance of the estimator of the domain mean of the residuals. However, recall that the regression estimator is design-unbiased, whereas the synthetic estimator is not. A more fair comparison is on the basis of the root mean squared error (RMSE) (Table \@ref(tab:tableRMSEs)). For the regression estimator, the RMSE is equal to its standard error and therefore not shown in the table.

```{r, echo = FALSE}
load(file = "results/MeanAGB_ecoregions.rda")
v_mz_eco_regr <- apply(mz_eco_regr, MARGIN = 1, FUN = var, na.rm = TRUE)
v_mz_eco_synt <- apply(mz_eco_synt, MARGIN = 1, FUN = var, na.rm = TRUE)
m_mz_eco_synt <- apply(mz_eco_synt, MARGIN = 1, FUN = mean, na.rm = TRUE)
bias_synt <- m_mz_eco_synt - mz_pop
RMSE_synt <- sqrt(v_mz_eco_synt + bias_synt^2)
RMSE_regr <- sqrt(v_mz_eco_regr)
dRMSE <- RMSE_regr - RMSE_synt
```

```{r tableRMSEs, echo = FALSE}
ecoregions_short <- ecoregions
ecoregions_short[c(1, 16)] <- c("Amazon-Orinoco Carib. mangroves", "Xingu-Toc.-Arag. moist forests")
df <- data.frame(Ecoregion = ecoregions_short, se.reg = round(sqrt(v_mz_eco_regr), 1), se.syn = round(sqrt(v_mz_eco_synt), 1), bias.syn = round(bias_synt, 1), RMSE.syn = round(RMSE_synt, 1))
rownames(df) <- NULL
knitr::kable(
  df, caption = "Estimated standard error (se), bias, and root mean squared error (RMSE) of the regression estimator (reg) and the synthetic estimator (syn) of the mean AGB of ecoregions in Eastern Amazonia. The regression estimator is design-unbiased, so the RMSE of the regression estimator is equal to its standard error.",
  booktabs = TRUE,
  col.names = c("Ecoregion", "RMSE reg", "se syn", "bias syn", "RMSE syn"),
  linesep = ""
) %>%
  kable_classic()
```


In the synthetic estimator and the regression estimator, both quantitative covariates and categorical variables can be used. If one or more categorical variables are included in the estimator, the variable names in the data frame with the true means of the ancillary variables per domain, specified with argument `exhaustive`,  must correspond to the column names of the design matrix that is generated with function `lm`, see Subsection \@ref(RegressionEstimatorSTSI).

## Model-based prediction {#SmallAreaModelBased}

The alternative for design-based and model-assisted estimation of the means or totals of small domains is model-based prediction. The fundamental difference between model-assisted estimation and model-based prediction is explained in Chapter \@ref(Approaches). The models used in this section are linear mixed models\index{Linear mixed model}. In a linear mixed model, the mean of the study variable is modelled as a linear combination of covariates, similar to a linear regression model. The difference with a linear regression model is that the residuals of the mean are not assumed independent. The dependency of the residuals is also modelled. Two types of linear mixed model are described: a random intercept model and a geostatistical model. 

### Random intercept model

A basic linear mixed model that can be used for model-based prediction of means of small domains is the random intercept model\index{Random intercept model}:

\begin{equation}
\begin{split}
Z_{dk} &= \mathbf{x}_{dk}^{\text{T}} \pmb{\beta} + v_d + \epsilon_{dk} \\
v_d &\sim \mathcal{N}(0,\sigma^2_v) \\
\epsilon_{dk} &\sim \mathcal{N}(0,\sigma^2_{\epsilon}) \;.
\end{split}
(\#eq:RandomInterceptModel)
\end{equation}

Two random variables are now involved, both with a normal distribution with mean zero: $v_d$, a random intercept at the domain level with variance $\sigma^2_v$, and the residuals $\epsilon_{dk}$ at the unit level with variance $\sigma^2_{\epsilon}$. The variance $\sigma^2_v$ can be interpreted as a measure of the heterogeneity among the domains after accounting for the fixed effect [@Breidenbach2012]. With this model, the mean of a domain can be predicted by

\begin{equation}
\hat{\bar{z}}_{\text{mb},d} = \bar{\mathbf{x}}_d^{\mathrm{T}} \hat{\pmb{\beta}} + \hat{v}_d \;,
(\#eq:mbpredictordomainmeanrandomintercept)
\end{equation}

with $\hat{\pmb{\beta}}$ the best linear unbiased estimates (BLUE) of the regression coefficients and $\hat{v}_d$ the best linear unbiased prediction (BLUP) of the intercept for domain $d$, $v_d$. The model-based predictor can also be written as

\begin{equation}
\hat{\bar{z}}_{\text{mb},d} = \bar{\mathbf{x}}_d^{\mathrm{T}} \hat{\pmb{\beta}} + \lambda_d \left( \frac{1}{n_d }\sum_{k \in \mathcal{S}_d} \epsilon_{dk} \right) \;,
(\#eq:mbpredictordomainmeanrandomintercept2)
\end{equation}

with $\lambda_d$ a weight for the second term that corrects for the bias of the synthetic estimator. This weight is computed by 

\begin{equation}
\lambda_d = \frac{\hat{\sigma}^2_v}{\hat{\sigma}^2_v + \hat{\sigma}^2_{\epsilon}/n_d}\;.
(\#eq:weightrandomintercept)
\end{equation}

This equation shows that the larger the estimated residual variance $\hat{\sigma}^2_{\epsilon}$, the smaller the weight for the bias correction factor, and the larger the sample size $n_d$, the larger the weight. Comparing Equations \@ref(eq:mbpredictordomainmeanrandomintercept) and \@ref(eq:mbpredictordomainmeanrandomintercept2) shows that the random intercept of a domain is predicted by the sample mean of the residuals of that domain, multiplied by a weight factor computed by Equation \@ref(eq:weightrandomintercept).  

The means of the small domains can be computed with function `eblup.mse.f.wrap` of package **JoSAE** [@JoSAE]. It requires as input a linear mixed model generated with function `lme` of package **nlme** [@nlme]. The simple random sample of size 200 selected before is used to fit the linear mixed model, with lnSWIR2 as a fixed effect, i.e., the effect of lnSWIR2 on the mean AGB. The random effect is added by assigning another formula to argument `random`. The formula `~ 1 | Ecoregion` means that the intercept is treated as a random variable and that it varies among the ecoregions. This linear mixed model is referred to as a random intercept model: the intercepts are allowed to differ among the small domains, whereas the effects of the covariates, lnSWIR2 in our case, is equal for all domains.

Tibble `grdAmazonia` is converted to a `data.frame` to avoid problems with function `eblup.mse.f.wrap` hereafter. A simple random sample with replacement of size 200 is selected.

```{r}
grdAmazonia <- grdAmazonia %>%
  as.data.frame() %>%
  mutate(x1 = x1 /1000,
         x2 = x2 / 1000)
set.seed(314)
mysample <- grdAmazonia %>%
  dplyr::select(x1, x2, AGB, lnSWIR2, Ecoregion) %>%
  slice_sample(n = n, replace = TRUE)
```

```{r}
library(nlme)
library(JoSAE)
lmm_AGB <- lme(fixed = AGB ~ lnSWIR2, data = mysample, random = ~ 1 | Ecoregion)
```

The fixed effects\index{Fixed effect} of the linear mixed model can be extracted with function `fixed.effects`.

```{r}
fixed_lmm <- fixed.effects(lmm_AGB)
```

The fixed effects of the linear mixed model differ somewhat from the fixed effects in the simple linear regression model (fixed_lm):
 
```{r, echo = FALSE}
slm <- lm(AGB ~ lnSWIR2, data = mysample)
ab.lm <- coef(slm)
(df <- data.frame(fixed_lm = ab.lm, fixed_lmm = fixed_lmm))
```

The random effect\index{Random effect} can be extracted with function `random.effects`. 

```{r}
random.effects(lmm_AGB)
```

The random intercepts are added to the fixed intercept; the coefficient of lnSWIR2 is the same for all ecoregions: 

```{r}
coef(lmm_AGB)
```

The fitted model can now be used to predict the means of the ecoregions as follows. As a first step, a data frame must be defined, with the size and the population mean of the covariate lnSWIR2 per domain. This data frame is passed to function `eblup.mse.f.wrap` with argument `domain.data`. This function computes the model-based prediction, as well as the regression estimator (Equation \@ref(eq:regressionestimatorsmalldomain)) and the synthetic estimator (Equation \@ref(eq:syntheticestimatorsmalldomain)) and their variances. The model-based predictor is the variable `EBLUP` in the output data frame. For the model-based predictor, two standard errors are computed, see @Breidenbach2012 for details.

```{r}
N_eco <- tapply(grdAmazonia$AGB, INDEX = grdAmazonia$Ecoregion, FUN = length)
df_eco <- data.frame(Ecoregion = ecoregions, N = N_eco, lnSWIR2 = mx_eco_pop)
res <- eblup.mse.f.wrap(domain.data = df_eco, lme.obj = lmm_AGB)
df <- data.frame(Ecoregion = res$domain.ID, mb = res$EBLUP,
  se.1 = res$EBLUP.se.1, se.2 = res$EBLUP.se.2)
```

Table \@ref(tab:TableRandomInterceptModelEstimates) shows the model-based predictions and the estimated standard errors of the mean AGB of the ecoregions, obtained with the random intercept model.

(ref:TableRandomInterceptModelEstimateslabel) Model-based predictions of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia, for simple random sample without replacement of size 200, obtained with the random intercept model and lnSWIR2 as a predictor.

```{r TableRandomInterceptModelEstimates, echo = FALSE}
df[, c(2, 3, 4)] <- round(df[, c(2, 3, 4)], 1)

knitr::kable(
  df, caption = "(ref:TableRandomInterceptModelEstimateslabel)",
  booktabs = TRUE,
  col.names = c("Ecoregion", "AGB", "se.1", "se.2"),
  linesep = ""
) %>%
  kable_classic() %>%
   add_footnote(" se.1 and se.2 are standard errors, for explanation see text.", notation = "none")
```

Note that with this model no predictions of the mean AGB are obtained for the unsampled ecoregions. This is because the random intercept $v_d$ cannot be predicted in the absence of data, see Equations \@ref(eq:mbpredictordomainmeanrandomintercept) and \@ref(eq:mbpredictordomainmeanrandomintercept2).

### Geostatistical model 

In a geostatistical model, there is only one random variable, the residual of the model-mean, not two random variables as in the random intercept model. See Equation \@ref(eq:OKmodel) for a geostatistical model with a constant mean and Equation \@ref(eq:KEDmodel2) for a model with a mean that is a linear combination of covariates. In a geostatistical model, the covariance of the residuals of the mean at two locations is modelled as a function of the distance (and direction) of the points. Instead of the covariance, often the semivariance is modelled, i.e., half the variance of the difference of the residuals at two locations, see Chapter \@ref(Introkriging) for details.

The simple random sample of size 200 selected before is used to estimate the regression coefficients for the mean, an intercept, and a slope coefficient for lnSWIR2, and besides the parameters of a spherical semivariogram model for the residuals of the mean. The two regression coefficients and the three semivariogram parameters are estimated by restricted maximum likelihood (REML), see Subsection \@ref(REML). This estimation procedure is also used in function `lme` to fit the random intercept model. Here, function `likfit` of package **geoR** [@geoR] is used to estimate the model parameters. First, a geoR object must be generated with function `as.geodata`.

```{r}
library(geoR)
dGeoR <- as.geodata(mysample, header = TRUE,
  coords.col = c("x1", "x2"), data.col = "AGB", covar.col = "lnSWIR2")
vgm_REML <- likfit(geodata = dGeoR, trend = ~ lnSWIR2,
  cov.model = "spherical",
  ini.cov.pars = c(600, 600), nugget = 1500,
  lik.method = "REML", messages = FALSE)
```

The estimated intercept and slope are `r formatC(vgm_REML$beta[1], 0, format = "f", big.mark = ",")` and `r formatC(vgm_REML$beta[2], 1, format = "f")`, respectively. The estimated semivariogram parameters are `r formatC(vgm_REML$tausq, 0, format = "f", big.mark = ",")`  (10^9^ kg ha^-1^)^2^, `r formatC(vgm_REML$sigmasq, 0, format = "f", big.mark = ",")` (10^9^ kg ha^-1^)^2^, and `r formatC(vgm_REML$phi, 0, format = "f")` km for the nugget, partial sill, and range, respectively. These model parameters are used to predict AGB for all units in the population, using function `krige` of package **gstat** [@peb04]. The REML estimates of the semivariogram parameters are passed to function `vgm` with arguments `nugget`, `psill`, and `range`. The coordinates of the sample are shifted to a random point within a 1 km $\times$ 1 km grid cell. This is done to avoid the coincidence of a sampling point and a prediction point, which leads to an error message when predicting AGB at the nodes of the grid.

```{r, eval = FALSE}
library(gstat)
mysample$x1 <- jitter(mysample$x1, amount = 0.5)
mysample$x2 <- jitter(mysample$x2, amount = 0.5)
coordinates(mysample) <- ~ x1 + x2
vgm_REML_gstat <- vgm(model = "Sph",
  nugget = vgm_REML$nugget, psill = vgm_REML$sigmasq, range = vgm_REML$phi)
coordinates(grdAmazonia) <- ~ x1 + x2
predictions  <- krige(
  formula = AGB ~ lnSWIR2,
  locations = mysample,
  newdata = grdAmazonia,
  model = vgm_REML_gstat,
  debug.level = 0) %>% as("data.frame")
```

```{r, eval = FALSE, echo = FALSE}
save(predictions, file = "results/ModelBasedPredictionsAGB_Amazonia.rda")
```

The first six rows of `predictions` are shown below.

```{r, echo = FALSE}
load(file = "results/ModelBasedPredictionsAGB_Amazonia.rda")
head(predictions)
```

Besides a prediction (variable `var1.pred`), for every population unit the variance of the prediction error is computed (`var1.var`). The unitwise predictions can be averaged across all units of an ecoregion to obtain a model-based prediction of the mean of that ecoregion.

```{r}
AGBpred_unit <- predictions$var1.pred
mz_eco_mb <- tapply(AGBpred_unit, INDEX = grdAmazonia$Ecoregion,
                    FUN = mean) %>% round(1)
```

A difficulty is the computation of the standard error of these model-based predictions of the ecoregion mean. We cannot simply sum the unitwise variances and divide the sum by the squared number of units, because the prediction errors of units with a mutual distance smaller than the estimated range of the spherical semivariogram are correlated. A straightforward approach to obtain the standard error of the predicted mean is geostatistical simulation\index{Geostatistical simulation}. A large number of maps are simulated, conditional on the selected sample. For an infinite number of maps, the "average map", i.e., the map obtained by averaging for each unit all simulated values of that unit, is equal to the map with predicted AGB. For each simulated map, the average of the simulated values across all units of an ecoregion is computed. This results in as many averages as we have simulated maps. The variance of the averages of an ecoregion is an estimate of the variance of the predicted mean of that ecoregion. To reduce computing time, a 5 km $\times$ 5 km subgrid of `grdAmazonia` is used in the geostatistical simulation.  

```{r, eval = FALSE}
grdAmazonia <- read_rds(file = "results/grdAmazonia_5km.rds")
grdAmazonia <- grdAmazonia %>%
  mutate(x1 = x1 /1000, x2 = x2 / 1000, lnSWIR2 = log(SWIR2))
nsim <- 1000
coordinates(grdAmazonia) <- ~ x1 + x2
simulations  <- krige(
  formula = AGB ~ lnSWIR2,
  locations = mysample,
  newdata = grdAmazonia,
  model = vgm_REML_gstat,
  nmax = 100, nsim = nsim,
  debug.level = 0) %>% as("data.frame")
grdAmazonia <- as_tibble(grdAmazonia)
AGBsim_eco <- matrix(nrow = length(ecoregions), ncol = nsim)
for (i in 1:nsim) {
  AGBsim_eco[, i] <- tapply(simulations[, i + 2],
  INDEX = grdAmazonia$Ecoregion, FUN = mean)
}
```

```{r, eval = FALSE, echo = FALSE}
save(AGBsim_eco, file = "results/AGBsimulated_ecoregions.rda")
```

```{r, echo = FALSE}
load(file = "results/AGBsimulated_ecoregions.rda")
se_mz_eco_mb <- round(sqrt(apply(AGBsim_eco, MARGIN = 1, FUN = var)), 1)
```

(ref:TableGeostatisticalModelEstimateslabel) Model-based predictions of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia, using a simple random sample without replacement of size 200, obtained with the geostatistical model and lnSWIR2 as a predictor for the mean.

```{r TableGeostatisticalModelEstimates, echo = FALSE}
df <- data.frame(Ecoregion = ecoregions, mz_eco_mb, se_mz_eco_mb)
df[, c(2, 3)] <- round(df[, c(2, 3)], 1)
rownames(df) <- NULL
knitr::kable(
  df, caption = "(ref:TableGeostatisticalModelEstimateslabel)",
  booktabs = TRUE,
  col.names = c("Ecoregion", "AGB", "se"),
  linesep = ""
) %>%
  kable_classic() %>%
   add_footnote(" se: standard error of predicted mean.", notation = "none")
```

Similar to the synthetic estimator, for all ecoregions an estimate of the mean AGB is obtained, also for the unsampled ecoregions (Table \@ref(tab:TableGeostatisticalModelEstimates)). The model-based prediction is strongly correlated with the synthetic estimate (Figure \@ref(fig:MBvsSynt)).

(ref:MBvsSyntlabel) Scatter plot of the model-based prediction and the synthetic estimate of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia. The solid line is the 1:1 line.

```{r MBvsSynt, echo = FALSE, fig.width = 5, fig.cap = "(ref:MBvsSyntlabel)"}
df <- data.frame(ecoregions, mb = mz_eco_mb, se.mb = se_mz_eco_mb, synt = synt$estimate, se.synt = sqrt(synt$g_variance))
ggplot(data = df) +
        geom_point(mapping = aes(x = synt, y = mb), size = 2) +
        geom_abline(intercept = 0, slope = 1) +
        scale_x_continuous(name = "Synthetic estimate", limits = c(50, 300)) +
        scale_y_continuous(name = "Model-based prediction", limits = c(50, 300)) +
        coord_fixed()
```

The most striking difference is the standard error. The standard errors of the synthetic estimator range from 3.7 to 7.1 (Table \@ref(tab:tableRMSEs)), whereas the standard errors of the geostatistical predictions range from 5.9 to 15.9. However, these two standard errors are fundamentally different and should not be compared. The standard error of the synthetic estimator is a *sampling* standard error, i.e., it quantifies the variation of the estimated mean of an ecoregion over repeated random sampling with the sampling design, in this case simple random sampling of 200 units. The model-based standard error is not a sampling standard error but a model standard error, which expresses our uncertainty about the means of the domains due to our imperfect knowledge of the spatial variation of AGB. Given the observations of AGB at the selected sample, the map with the covariate lnSWIR2, and the estimated semivariogram model parameters, we are uncertain about the exact value of AGB at unsampled units. No samples are considered other than the one actually selected. For the fundamental difference between design-based, model-assisted, and model-based estimates of means, refer to Section \@ref(DBvsMB) and Chapter \@ref(Approaches).

It makes more sense to compare the two model-based predictions, the random intercept model predictions and the geostatistical predictions, and their standard errors. Figure \@ref(fig:MBvsMB) shows that the two model-based predictions are very similar. 

(ref:MBvsMBlabel) Scatter plot of model-based predictions of the mean AGB (10^9^ kg ha^-1^) of ecoregions in Eastern Amazonia, obtained with the random intercept model and the geostatistical model. The solid line is the 1:1 line.

```{r MBvsMB, echo = FALSE, fig.width = 5, fig.cap = "(ref:MBvsMBlabel)"}
ecos_in_sam <- sort(unique(mysample$Ecoregion))
res <- eblup.mse.f.wrap(domain.data = df_eco, lme.obj = lmm_AGB)
TF <- (ecoregions %in% ecos_in_sam)
n_eco <- tapply(mysample$AGB, INDEX = mysample$Ecoregion, FUN = length)
df <- data.frame(n_eco, mb.geo = mz_eco_mb[TF], semb.geo = se_mz_eco_mb[TF], mb.rint = res$EBLUP, semb.rint = res$EBLUP.se.1)
ggplot(data = df) +
        geom_point(mapping = aes(x = mb.rint, y = mb.geo), size = 2) +
        geom_abline(intercept = 0, slope = 1) +
        scale_x_continuous(name = "Random intercept model prediction ", limits = c(50, 300)) +
        scale_y_continuous(name = "Geostatistical prediction", limits = c(50, 300)) +
        coord_fixed()
```

For six ecoregions, the standard errors of the geostatistical model predictions are much smaller than those of the random intercept model predictions (Figure \@ref(fig:MBvsMBse)). These are ecoregions with small sample sizes.

(ref:MBvsMBselabel) Scatter plot of the estimated standard error of model-based predictions of the mean AGB (10^9^ kg ha^-1^) of ecoregions obtained with the random intercept model and the geostatistical model, using a simple random sample without replacement of size 200 from Eastern Amazonia. The numbers refer to the number of sampling units in an ecoregion. The solid line is the 1:1 line.

```{r MBvsMBse, echo = FALSE, fig.width = 5, fig.cap = "(ref:MBvsMBselabel)"}
ggplot(data = df) +
        geom_point(mapping = aes(x = semb.rint, y = semb.geo), size = 2) +
        geom_text(mapping = aes(x = semb.rint, y = semb.geo, label = n_eco), nudge_x = 1) +
        geom_abline(intercept = 0, slope = 1) +
        scale_x_continuous(name = "Standard error random intercept model prediction ", limits = c(5, 40)) +
        scale_y_continuous(name = "Standard error geostatistical prediction", limits = c(5, 40)) +
        coord_fixed()

```

```{block2, type = 'rmdnote'}
If a different semivariogram model were used, both the predicted means per ecoregion and the standard errors would be different. Especially the variance is sensitive to the semivariogram. For this reason, the model-based predictions are also referred to as model-dependent predictions\index{Model-dependent predictor}, see Chapter \@ref(Approaches).
```

## Supplemental probability sampling of small domains

The sample size in small domains of interest can be so small that no reliable statistical estimate of the mean or total of these domains can be obtained. In this case, we may decide to collect a supplemental sample\index{Supplemental sample} from these domains. It is convenient to use these domains as strata in supplemental probability sampling, so that we can control the sample sizes in the strata. If we can safely assume that the study variable at the units of the first sample are not changed, there is no need to revisit these units; otherwise, we must revisit them to observe the current values.

There are two approaches for using the two probability samples to estimate the population mean or total of a small domain [@Grafstrom2019]. In the first approach, the two samples are combined, and then the merged sample is used to estimate the population mean or total. In the second approach, the samples are not combined, but the two estimates from the separate samples. In this section only the first approach is illustrated with a simple situation in which the two samples are easily combined. Refer to @Grafstrom2019 for a more general approach of how multiple probability samples can be combined.

Suppose that the original sample is a simple random sample from the entire study area. A supplemental sample is selected from small domains, i.e., domains that have few selected units only. For a given small domain, the first sample is supplemented by selecting a simple random sample from the units not yet selected in the first sample. The size of the supplemental sample of a domain depends on the number of units of that domain in the first sample. The first sample is supplemented so that the total sample size of that domain is fixed. In this case, the combined sample of a domain is a simple random sample from that domain, so that the usual estimators for simple random sampling can be used to estimate the domain mean or total and its standard error.

This sampling strategy is illustrated with Eastern Amazonia. A simple random sample without replacement of 400 units is selected.

```{r, echo = FALSE}
rm(grdAmazonia)
```


```{r}
grdAmazonia$Biome <- as.factor(grdAmazonia$Biome)
biomes <- c("Mangrove", "Forest_dry", "Grassland", "Forest_moist")
levels(grdAmazonia$Biome) <- biomes
n1 <- 400
set.seed(123)
units_1 <- sample(nrow(grdAmazonia), size = n1, replace = FALSE)
mysample_1 <- grdAmazonia[units_1, c("AGB", "Biome")]
print(n1_biome <- table(mysample_1$Biome))
```

The selected units are removed from the sampling frame. For each of the three small biomes, Mangrove, Forest_dry, and Grassland, the size of the supplemental sample is computed so that the total sample size becomes 40. The supplemental sample is selected by stratified simple random sampling without replacement, using the small biomes as strata (Chapter \@ref(STSI)).

```{r}
units_notselected <- grdAmazonia[-units_1, ]
Biomes_NFM <- units_notselected[units_notselected$Biome != "Forest_moist", ]
n_biome <- 40
n2_biome <- rep(n_biome, 3) - n1_biome[-4]
ord <- unique(Biomes_NFM$Biome)
units_2 <- sampling::strata(Biomes_NFM, stratanames = "Biome",
  size = n2_biome[ord], method = "srswor")
mysample_2 <- getdata(Biomes_NFM, units_2)
mysample_2 <- mysample_2[c("AGB", "Biome")]
```

The two samples are merged, and the means of the domains are estimated by the sample means.

```{r}
mysample <- rbind(mysample_1, mysample_2)
print(mz_biome <- tapply(mysample$AGB, INDEX = mysample$Biome, FUN = mean))
```

Finally, the standard error is estimated, accounting for sampling without replacement from a finite population (Equation \@ref(eq:EstVarMeanSI)).

```{r}
N_biome <- table(grdAmazonia$Biome)
fpc <- (1 - n_biome / N_biome)
S2z_biome <- tapply(mysample$AGB, INDEX = mysample$Biome, FUN = var)
print(se_mz_biome <- sqrt(fpc * (S2z_biome / n_biome)))
```


```{r, echo = FALSE, eval = FALSE}
mz_biome <- v_mz_biome <- matrix(nrow = 10000, ncol = 4)
set.seed(314)
for (i in 1:10000) {
  units_1 <- sample(nrow(grdAmazonia), size = n1, replace = FALSE)
  mysample_1 <- grdAmazonia[units_1, c("AGB", "Biome")]
  n1_biome <- table(mysample_1$Biome)

  units_notselected <- grdAmazonia[-units_1, ]
  Biomes.NFM <- units_notselected[units_notselected$Biome != "Forest_moist", ]
  n_biome <- 40
  n2_biome <- rep(n_biome, 3) - n1_biome[-4]
  if (sum(n2_biome < 1) > 0) {
    next
    } else {
  ord <- unique(Biomes.NFM$Biome)
  units_2 <- sampling::strata(Biomes.NFM, stratanames = "Biome", size = n2_biome[ord], method = "srswor")
  mysample_2 <- getdata(Biomes.NFM, units_2)
  mysample_2 <- mysample_2[c("AGB", "Biome")]

  mysample <- rbin_biome(mysample_1, mysample_2)
  mz_biome[i, ] <- tapply(mysample$AGB, INDEX = mysample$Biome, FUN = mean)

  N_biome <- table(grdAmazonia$Biome)
  fpc <- (1 - n_biome / N_biome)
  S2z_biome <- tapply(mysample$AGB, INDEX = mysample$Biome, FUN = var)
  v_mz_biome[i, ] <- fpc * (S2z_biome / n_biome)
  }
}
save(mz_biome, v_mz_biome, file = "results/SupplementalSampleEstimates_Amazonia.rda")
```

```{r, echo = FALSE}
load(file = "results/SupplementalSampleEstimates_Amazonia.rda")
nas <- apply(mz_biome, MARGIN = 2, FUN = mean, na.rm = TRUE)

Ep_mz_biome <- apply(mz_biome, MARGIN = 2, FUN = mean, na.rm = TRUE)
Vp_mz_biome <- apply(mz_biome, MARGIN = 2, FUN = var, na.rm = TRUE)
se_mz_biome <- sqrt(v_mz_biome)
Ep_se_mz_biome <- apply(se_mz_biome, MARGIN = 2, FUN = mean, na.rm = TRUE)
mz_biome_pop <- tapply(grdAmazonia$AGB, INDEX = grdAmazonia$Biome, FUN = mean)

alpha <- 0.05
margin <- qt(1 - alpha / 2, 39, lower.tail = TRUE) * sqrt(v_mz_biome[, 1:3])
lower <- mz_biome[, 1:3] - margin
upper <- mz_biome[, 1:3] + margin
Mz_biome_pop <- matrix(nrow = 10000, ncol = 3, data = mz_biome_pop[1:3], byrow = TRUE)
ind <- (Mz_biome_pop > lower &  Mz_biome_pop < upper)
coverage <- apply(ind, MARGIN = 2, FUN = sum, na.rm = TRUE)
dum <- apply(ind, MARGIN = 2, FUN = function(x) {
  sum(!is.na(x))
  })
coveragerate_95 <- coverage / dum

alpha <- 0.10
margin <- qt(1 - alpha / 2, 39, lower.tail = TRUE) * sqrt(v_mz_biome[, 1:3])
lower <- mz_biome[, 1:3] - margin
upper <- mz_biome[, 1:3] + margin
ind <- (Mz_biome_pop > lower &  Mz_biome_pop < upper)
coverage <- apply(ind, MARGIN = 2, FUN = sum, na.rm = TRUE)
dum <- apply(ind, MARGIN = 2, FUN = function(x) {
  sum(!is.na(x))
  })
coveragerate_90 <- coverage / dum

alpha <- 0.20
margin <- qt(1 - alpha / 2, 39, lower.tail = TRUE) * sqrt(v_mz_biome[, 1:3])
lower <- mz_biome[, 1:3] - margin
upper <- mz_biome[, 1:3] + margin
ind <- (Mz_biome_pop > lower &  Mz_biome_pop < upper)
coverage <- apply(ind, MARGIN = 2, FUN = sum, na.rm = TRUE)
dum <- apply(ind, MARGIN = 2, FUN = function(x) {
  sum(!is.na(x))})
coveragerate_80 <- coverage / dum

row_names <- c("Average of estimated means", "True means", "Standard deviation of estimated means", "Average of estimated standard errors", "Coverage rate 95%", "Coverage rate 90%", "Coverage rate 80%")
dom <- matrix(nrow = 7, ncol = 3, dimnames = list(row_names, NULL))
for (i in 1:3) {
  dom[, i] <- c(round(Ep_mz_biome[i], 3), round(mz_biome_pop[i], 3), round(sqrt(Vp_mz_biome[i]), 3), round(Ep_se_mz_biome[i], 3), round(coveragerate_95[i], 3), round(coveragerate_90[i], 3), round(coveragerate_80[i], 3))
}
```

This sampling approach and estimation are repeated 10,000 times, i.e., a simple random sample without replacement of size 400 is selected 10,000 times from Eastern Amazonia, and the samples from the three small domains are supplemented so that the total sample sizes in these domains become 40. In two out of the 10,000 samples, the size of the first sample in one of the domains exceeded 40 units. These two samples are discarded. Ideally, these samples are not discarded, but their sizes in the small domains are reduced to 40 units, which are then used to estimate the means of the domains.

(ref:SmalldomainEstimateslabel) Summary statistics of 10,000 estimated means of AGB (10^9^ kg ha^-1^) of small domains (biomes) in Eastern Amazonia, estimated by combining a simple random sample without replacement of size 400 from all units and a supplemental stratified simple random sample without replacement from the units in the small domains not included in the simple random sample. The total sample size per small domain is 40.

```{r SmalldomainEstimates, echo = FALSE}
knitr::kable(
  dom, caption = "(ref:SmalldomainEstimateslabel)",
  col.names = c("Mangrove", "Dry forest", "Grassland"),
  row.names = TRUE,
  booktabs = TRUE,
  linesep = ""
) %>%
  kable_classic()
```

For all three small domains, the average of the 10,000 estimated means of AGB is about equal to the true mean (Table \@ref(tab:SmalldomainEstimates)). Also the mean of the 10,000 estimated standard errors is very close to the standard deviation of the 10,000 estimated means. The coverage rates of 95, 90, and 80% confidence intervals are about equal to the nominal coverage rates.

This simple approach is feasible because at the domain level the two merged samples are a simple random sample. This  approach is also applicable when the first sample is a stratified simple random sample from the entire population, and the supplemental sample is a stratified simple random sample from a small domain using as strata the intersections of the strata used in the first phase and that domain.

```{r, echo = FALSE}
rm(list = ls())
```