# Design-based, model-based, and model-assisted approach for sampling and inference {#Approaches}
Section \@ref(DBvsMB) already mentioned the design-based and the model-based approach for sampling and statistical inference. In this chapter, the fundamental differences between these two approaches are explained in more detail. Several misconceptions about the design-based approach\index{Design-based approach} for sampling and statistical inference, based on classical sampling theory\index{Classical sampling theory}, seem to be quite persistent. These misconceptions are the result of confusion about basic statistical concepts such as independence, expectation, and bias and variance of estimators or predictors. These concepts have a different meaning in the design-based and the model-based approach\index{Model-based approach}. Moreover, a population mean is still often confused with a model-mean, and a population variance with a model-variance, leading to invalid formulas for the sampling variance of an estimator of the population mean. The fundamental differences between these two approaches are illustrated with simulations, in the hope that this gives a better understanding of the subject. In addition, the difference between model-dependent inference (as used in the model-based approach) and model-assisted inference is explained. This chapter has been published as part of a journal paper, see @Brus2021.
## Two sources of randomness
In my classes about spatial sampling, I ask the participants the following question. Suppose we have measurements of a soil property, for instance soil organic carbon content, at two locations separated by 20 cm. Do you think these two measurements are correlated? I ask them to vote for one of three answers:
1. yes, they are (>80\% confident);
2. no, they are not (>80\% confident); or
3. I do not know.
Most students vote for answer 1, the others for answer 3; hardly anyone votes for answer 2. Then I explain that it is impossible to say which answer is correct, simply because correlation requires two series of data, not just two numbers. The question then is how to generate two series of data. We need some random process\index{Random process} for this. This random process differs between the design-based and the model-based approach.
In the design-based approach, the random process is the random selection of sampling units, whereas in the model-based approach randomness is introduced via the statistical model of the spatial variation (Table \@ref(tab:approach)). So, the design-based approach requires probability sampling, i.e., random sampling using a random number generator\index{Random number generator}, in such a way that all population units have a positive probability of being included in the sample and that these inclusion probabilities are known for at least the selected population units [@sar92]. In theory, a probability sampling\index{Probability sampling} design can be used to generate an infinite number of samples, although in practical applications only one sample is selected.
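As a minimal illustration of known and positive inclusion probabilities, consider simple random sampling without replacement. The sketch below (the population size, sample size, and number of repetitions are arbitrary choices) checks empirically that every unit has inclusion probability $n/N$.

```{r, eval = FALSE}
# Sketch: under simple random sampling without replacement of n out of
# N units, every unit has inclusion probability n / N. N, n, and the
# number of repetitions are arbitrary choices for illustration.
N <- 400; n <- 10
counts <- numeric(N)
for (i in 1:10000) {
  units <- sample(N, size = n)
  counts[units] <- counts[units] + 1
}
range(counts / 10000)  # all relative frequencies close to n / N = 0.025
```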
The spatial variation model\index{Spatial variation model} used in the model-based approach contains two terms, one for the mean (deterministic part) and one for the error with a specified probability distribution. For instance, Equation \@ref(eq:OKmodel) in Chapter \@ref(Introkriging) describes the model used in ordinary kriging. This model can be used to simulate an infinite number of spatial populations. All these populations together are referred to as a superpopulation [@sar92; @loh99]. Depending on the model of spatial variation, the simulated populations may show spatial structure\index{Spatial structure} because the mean is a function of covariates, as in kriging with an external drift, and/or because the errors are spatially autocorrelated. A superpopulation\index{Superpopulation} is a construct; the populations do not exist in the real world. The populations are similar, but not identical. For instance, the mean differs among the populations. The expectation of the population mean, i.e., the average over all possible simulated populations, equals the superpopulation mean\index{Superpopulation mean}, commonly referred to as the model-mean\index{Model-mean}, parameter $\mu$ in Equation \@ref(eq:OKmodel). The variance also differs among the populations. Unlike the mean, the average of the population variance over all populations generally is not equal to the model-variance, parameter $\sigma^2$ in Equation \@ref(eq:OKmodel), but smaller. I will come back to this later. The differences between the simulated spatial populations illustrate our uncertainty about the spatial variation of the study variable in the population that is sampled or will be sampled.
In the design-based approach, only one population is considered, the one sampled, but the statistical inference is based on all samples that can be generated by a probability sampling. The top row of Figure \@ref(fig:plotsimulationsDBMB) shows five simple random samples of size ten. The population is the same in all plots. Proponents of the design-based approach do not like to consider other populations than the one sampled. Their challenge is to characterise this one population from a probability sample.
Conversely, in the model-based approach only one sample is considered, but the statistical inference is based on all populations that can be generated with the spatial variation model. Proponents of the model-based approach do not like to consider other samples than the one selected. Their challenge is to get the most out of the sample that is selected. The bottom row of Figure \@ref(fig:plotsimulationsDBMB) shows a spatial coverage sample, superimposed on five different populations simulated with an ordinary kriging model, using a spherical semivariogram with a nugget of 0.1, a partial sill of 0.6, and a range of 75 m. Note that in the model-based approach there is no need to select a probability sample (see Table \@ref(tab:approach)); there are no requirements on how the units are selected.
```{r, echo=FALSE}
field <- read_sf(system.file("extdata/leest.gpkg", package = "sswr")) %>%
st_set_crs(NA_crs_)
grdField <- st_make_grid(field, cellsize = 1, what = "centers")
grdField <- grdField[field] %>%
st_coordinates() %>%
as_tibble() %>%
mutate(
x1 = X / 1000 - 597,
x2 = Y / 1000 - 5654,
dummy = 1
)
coordinates(grdField) <- c("x1", "x2")
#simulate 5 fields
set.seed(314)
# unconditional Gaussian simulation by means of simple kriging
vgmodel <- vgm(model = "Sph", nugget = 0.1, psill = 0.6, range = 0.075)
simulation <- krige(
formula = dummy ~ 1,
locations = grdField,
newdata = grdField,
model = vgmodel,
nmax = 100,
nsim = 5,
beta = 0, #beta is model-mean
dummy = TRUE, #this is to enforce unconditional simulation
debug.level = 0
)
df_grdField <- as.data.frame(grdField)
cnst <- 5
df_grdField[["pop1"]] <- simulation[[1]] + cnst
df_grdField[["pop2"]] <- simulation[[2]] + cnst
df_grdField[["pop3"]] <- simulation[[3]] + cnst
df_grdField[["pop4"]] <- simulation[[4]] + cnst
df_grdField[["pop5"]] <- simulation[[5]] + cnst
#select 5 simple random samples of 10 points
set.seed(314)
n <- 10
sampleId <- sample(x = seq_len(nrow(df_grdField)), size = 5 * n)
samplesdf <- df_grdField[sampleId, c("x1","x2")]
samplesdf$sam <- rep(x = paste("sam", 1:5, sep = ""), each = n)
samplesdf$sam <- factor(x = samplesdf$sam, levels = paste("sam", 1:5, sep = ""), ordered = TRUE)
#select a spatial coverage sample of 10 points
myStrata <- field %>%
as_Spatial %>%
stratify(nStrata = 10, cellSize = 2, equalArea = FALSE, nTry = 10)
mySCsample <- spsample(myStrata) %>%
as("data.frame")
mySCsample$x1 <- mySCsample$x1 / 1000 - 597
mySCsample$x2 <- mySCsample$x2 / 1000 - 5654
```
```{r plotsimulationsDBMB, echo = FALSE, out.width = "100%", fig.asp = 0.4, fig.cap = "Random process considered in the design-based (top row) and the model-based approach (bottom row). The design-based approach considers only the sampled population, but all samples that can be generated by the sampling design. The model-based approach considers only the selected sample, but all populations that can be generated by the model."}
#Plot 5 times same simulated field 1 and 5 SI samples
plt1 <- ggplot(data = samplesdf) +
geom_tile(data = df_grdField, mapping = aes(x = x1, y = x2, fill = pop1)) +
geom_point(mapping = aes(x = x1, y = x2), size = 1.5, colour = "red") +
scale_x_continuous(name = "", breaks = NULL) +
scale_y_continuous(name = "", breaks = NULL) +
scale_fill_viridis_c(name = "SOM") +
coord_fixed() +
theme(legend.position = "none") +
facet_wrap(~ sam, ncol = 5, nrow = 1)
#Plot 5 simulated fields and spatial coverage sample
grdF <- df_grdField %>% pivot_longer(cols = c("pop1", "pop2", "pop3", "pop4", "pop5"))
plt2 <- ggplot(data = grdF) +
geom_tile(mapping = aes(x = x1, y = x2, fill = value)) +
geom_point(data = mySCsample, mapping = aes(x = x1, y = x2), size = 1.5, colour = "red") +
scale_x_continuous(name = "", breaks = NULL) +
scale_y_continuous(name = "", breaks = NULL) +
scale_fill_viridis_c(name = "SOM") +
coord_fixed() +
theme(legend.position = "none") +
facet_wrap(~name, nrow = 1, ncol = 5)
grid.arrange(plt1, plt2, nrow = 2)
```
As stressed by @dgr90 and @bru97, both approaches have their strengths and weaknesses. Broadly speaking, the design-based approach is the most appropriate if interest is in the population mean (total, proportion) or the population means (totals, proportions) of a restricted number of subpopulations (subareas). The model-based approach is the most appropriate if our aim is to map the study variable. Further, the strength of the design-based approach is the strict validity\index{Validity} of the estimates. Validity means that an objective assessment of the uncertainty of the estimator is warranted and that the coverage of confidence intervals is (almost) correct, provided that the sample is large enough to assume an approximately normal distribution of the estimator and design-unbiasedness of the variance estimator [@sar92]. The strength of the model-based approach is efficiency, i.e., more precise estimates of the (sub)population mean given the sample size, provided that a reasonably good model is used. So, if validity is more important than efficiency, the design-based approach is the best choice; in the reverse case, the model-based approach is preferable. For further reading, I recommend @Cassel1977 and @han83.
## Identically and independently distributed {#iid}
In a review paper on spatial sampling by @Wang2012, there is a section with the caption 'Sampling of i.i.d. populations'. Here, i.i.d. stands for "identically and independently distributed\index{Identically and independently distributed}". In this section of @Wang2012 we can read: "In SRS (simple random sampling) it is assumed that the population is independent and identically distributed". This is one of the old misconceptions revitalised by this review paper. I will make clear that in statistics i.i.d. is not a characteristic of populations, so the concept of i.i.d. populations does not make sense. The same misconception can be found in @Plant2012: "There is considerable literature on sample size estimation, much of which is discussed by Cochran (1977, chapter 4). This literature, however, is valid for samples of independent data but may not retain its validity for spatial data". Also according to @Wang2010, the classical formula for the variance of the estimator of the mean with simple random sampling, $V=\sigma^2/n$, only holds when data are independent. They say: "However in the case of spatial data, although members of the sample are independent by construction, data values that are near to one another in space, are unlikely to be independent because of a fundamental property of attributes in space, which is that they show spatial structure or continuity (spatial autocorrelation)". According to @Wang2010, the variance should be approximated by
\begin{equation}
V(\hat{\bar{z}})=\frac{\sigma^2 - \overline{\mathrm{Cov}(z_i,z_j)}}{n} \;,
(\#eq:Wang2010)
\end{equation}
with $V(\hat{\bar{z}})$ the variance of the estimator of the regional mean (mean of spatial population), $\sigma^2$ the population variance, $n$ the sample size, and $\overline{\mathrm{Cov}(z_i,z_j)}$ the average autocovariance between all pairs of individuals $(i, j)$ in the population (sampled and unsampled). So, according to this formula, ignoring the mean covariance within the population leads to an overestimation of the variance of the estimator of the mean. In Section \@ref(effectivesamplesize) I will make clear that this formula is incorrect and that the classical formula is still valid, also for populations showing spatial structure or continuity.
Remarkably, in other publications we can read that the classical formula for the variance of the estimator of the population mean with simple random sampling *underestimates* the true variance for populations showing spatial structure, see for instance @Griffith2005 and @Plant2012. The reasoning is that due to the spatial structure, there is less information in the sample data about the population mean. In Section \@ref(effectivesamplesize) I explain that this is also a misconception. Do not get confused by these publications and stick to the classical formulas which you can find in standard textbooks on sampling theory, such as @coc77 and @loh99, as well as in Chapter \@ref(SI) of this book.
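As a quick numerical check of this claim, the sketch below repeatedly selects simple random samples with replacement from the spatially structured population `pop1` simulated in the previous section and compares the variance of the sample means with the classical formula (the sample size and the number of repetitions are arbitrary choices).

```{r, eval = FALSE}
# Sketch: for simple random sampling with replacement, the variance of
# the sample mean over repeated sampling matches S^2 / n, with S^2 the
# population variance (divisor N), despite the spatial structure.
n <- 10
N <- nrow(df_grdField)
S2 <- mean((df_grdField$pop1 - mean(df_grdField$pop1))^2)
means <- replicate(
  10000, mean(df_grdField$pop1[sample(N, size = n, replace = TRUE)]))
c(empirical = var(means), classical = S2 / n)  # nearly equal
```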
The concept of independence of random variables is illustrated with a simulation. The top row of Figure \@ref(fig:iid) shows five simple random samples of size two. The two points are repeatedly selected from the same population (showing clear spatial structure), so this top row represents the design-based approach. The bottom row shows two points, not selected randomly and independently, but at a fixed distance of 10 m. These two points are placed on different populations generated by the model described above, so the bottom row represents the model-based approach.
```{r iid, echo = FALSE, out.width = "100%", fig.asp = 0.4, fig.cap = "Illustration of independence in design-based and model-based approach. The top row shows five samples of two points selected randomly and independently from each other from one population (design-based approach). The bottom row shows two points not selected randomly, at a distance of 10 m from each other, from five model realisations (model-based approach)."}
set.seed(413)
n <- 2
sampleId <- sample(x = seq_len(nrow(df_grdField)), size = 5 * n)
samplesdf <- df_grdField[sampleId, ]
samplesdf$sam <- rep(x = paste("sam", 1:5, sep = ""), each = n)
samplesdf$sam <- factor(x = samplesdf$sam, levels = paste("sam", 1:5, sep = ""), ordered = TRUE)
#select two points at fixed distance
h <- 0.010
xy1st <- c(0.405, 0.738)
set.seed(314)
angle <- runif(n = 1, min = 0, max = 2 * pi)
dx <- numeric(length = 2)
dx[1] <- h * sin(angle)
dx[2] <- h * cos(angle)
xy2nd <- xy1st + dx
onepair <- data.frame(x = c(xy1st[1], xy2nd[1]), y = c(xy1st[2], xy2nd[2]))
plt1 <- ggplot(data = samplesdf) +
geom_tile(data = df_grdField, mapping = aes(x = x1, y = x2, fill = pop1)) +
geom_point(mapping = aes(x = x1, y = x2), size = 1.5, colour = "red") +
scale_x_continuous(name = "", breaks = NULL) +
scale_y_continuous(name = "", breaks = NULL) +
scale_fill_viridis_c(name = "SOM") +
coord_fixed() +
theme(legend.position = "none") +
facet_wrap(~ sam, ncol = 5, nrow = 1)
plt2 <- ggplot(data = grdF) +
geom_tile(mapping = aes(x = x1, y = x2, fill = value)) +
geom_point(data = onepair, mapping = aes(x = x, y = y), size = 1.5, colour = "red") +
scale_x_continuous(name = "", breaks = NULL) +
scale_y_continuous(name = "", breaks = NULL) +
scale_fill_viridis_c(name = "SOM") +
coord_fixed() +
theme(legend.position = "none") +
facet_wrap(~ name, nrow = 1, ncol = 5)
grid.arrange(plt1, plt2, nrow = 2)
```
(ref:ScatterplotsTwopointslabel) Scatter plots of the values of a study variable $z$ at two randomly and independently selected points, 1,000 times selected from one population (design-based approach), and at two fixed points with a separation distance of 10 m, selected non-randomly from 1,000 model realisations (model-based approach).
```{r ScatterplotsTwopoints, echo = FALSE, out.width = "100%", fig.cap = "(ref:ScatterplotsTwopointslabel)" }
#simulate 1000 values at two points separated by distance h
c <- variogramLine(vgmodel, covariance = TRUE, dist_vector = c(0, h))
C <- matrix(nrow = 2, ncol = 2)
C[1, 1] <- C[2, 2] <- c$gamma[1]
C[1, 2] <- C[2, 1] <- c$gamma[2]
#cholesky decomposition
Upper <- chol(C)
set.seed(31415)
nsim <- 1000
Z <- matrix(nrow = nsim, ncol = 2)
for (i in 1:nsim) {
N <- rnorm(n = 2, 0, 1)
Z[i, ] <- crossprod(Upper, N)
}
Z <- Z + cnst
Zdf_MB <- as.data.frame(Z)
units <- sample(x = seq_len(nrow(df_grdField)), size = nsim)
Zdf_DB <- data.frame(V1 = df_grdField$pop1[units])
units <- sample(x = seq_len(nrow(df_grdField)), size = nsim)
Zdf_DB$V2 <- df_grdField$pop1[units]
Zdf <- rbind(Zdf_MB, Zdf_DB)
Zdf$approach <- rep(c("Model-based", "Design-based"), each = 1000)
Zdf$approach <- factor(Zdf$approach, levels = c("Design-based", "Model-based"), ordered = TRUE)
ggplot(data = Zdf) +
geom_point(mapping = aes(x = V1, y = V2), size = 1) +
scale_x_continuous(name = "z location 1", limits = c(0, 10)) +
scale_y_continuous(name = "z location 2\n", limits = c(0, 10)) +
coord_fixed() +
facet_wrap(~approach)
```
The values measured at the two points are plotted against each other in a scatter plot, not for just five simple random samples or five populations, but for 1,000 samples and 1,000 populations (Figure \@ref(fig:ScatterplotsTwopoints)). As we can see, there is no correlation between the two variables generated by the repeated random selection of the two points (design-based approach), whereas the two variables generated by the repeated simulation of populations (model-based approach) are correlated.
Instead of two points, we may select two series of probability samples independently from each other, for instance two series of simple random samples (SI) of size 10, or two series of systematic random samples with random origin (SY) with an average size of 10, see Figure \@ref(fig:TwoseriesSISY).
```{r TwoseriesSISY, echo = FALSE, out.width = "100%", fig.cap = "Two series (a and b) of simple random samples of ten points (top) and two series (a and b) of systematic random samples of, on average, ten points (bottom). The samples of series a and b are selected independently from each other."}
#select 10 SI samples of 10 points
set.seed(314)
n <- 10
sampleId <- sample(x = seq_len(nrow(df_grdField)), size = 10 * n)
sampleSI <- df_grdField[sampleId, c("x1","x2")]
sampleSI$sam <- rep(c("a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3", "b4", "b5"), each = n)
plt1 <- ggplot(data = sampleSI) +
geom_tile(data = df_grdField, mapping = aes(x = x1, y = x2, fill = pop1)) +
geom_point(mapping = aes(x = x1, y = x2), size = 1.5, colour = "red") +
scale_x_continuous(name = "", breaks = NULL) +
scale_y_continuous(name = "", breaks = NULL) +
scale_fill_viridis_c(name = "SOM") +
coord_fixed() +
theme(legend.position = "none") +
facet_wrap(~ sam, ncol = 5, nrow = 2)
#select 10 systematic random samples of 10 points
sampleSize <- 10
gridded(grdField) <- TRUE
set.seed(143)
sampleSY <- NULL
for (i in 1:10) {
sampleSYi <- as(spsample(x = grdField, n = sampleSize, type = "regular", bb = bbox(grdField)), "data.frame")
sampleSYi$sam <- i
sampleSY <- rbind(sampleSY, sampleSYi)
}
labels <- c(paste0("a", 1:5), paste0("b", 1:5))
sampleSY$sam <- labels[sampleSY$sam]
#make plot of SY samples
plt2 <- ggplot(data = sampleSY) +
geom_tile(data = df_grdField, mapping = aes(x = x1, y = x2, fill = pop1)) +
geom_point(mapping = aes(x = x1, y = x2), size = 1.5, colour = "red") +
scale_x_continuous(name = "", breaks = NULL) +
scale_y_continuous(name = "", breaks = NULL) +
scale_fill_viridis_c(name = "SOM") +
coord_fixed() +
theme(legend.position = "none") +
facet_wrap(~ sam, ncol = 5, nrow = 2)
grid.arrange(plt1, plt2, nrow = 2)
```
Again, if we plot the sample means of pairs of simple random samples and pairs of systematic random samples against each other, we see that the two averages are not correlated (Figure \@ref(fig:ScatterplotsSISY)). Note that the variation of the averages of the systematic random samples is considerably smaller than that of the simple random samples. The sampled population shows spatial structure, and by spreading the sampling units throughout the population, the precision of the estimated population mean is increased, see Chapter \@ref(SY).
```{r ScatterplotsSISY, echo = FALSE, out.width = "100%", fig.cap = "Scatter plots of averages of 1,000 pairs of simple random samples of ten points (SI) and of averages of 1,000 pairs of systematic random samples of ten points on average (SY)."}
#make scatter plot of means estimated from repeated SI samples
mz <- matrix(nrow = nsim, ncol = 2)
for (i in 1:nsim) {
sampleId1 <- sample(x = seq_len(nrow(df_grdField)), size = 10)
mz[i, 1] <- mean(df_grdField$pop1[sampleId1])
sampleId2 <- sample(x = seq_len(nrow(df_grdField)), size = 10)
mz[i, 2] <- mean(df_grdField$pop1[sampleId2])
}
mz_SI <- as.data.frame(mz)
#make scatter plot of means estimated from repeated SY samples
mz <- matrix(nrow = nsim, ncol = 2)
sp_grdField <- df_grdField[, c("x1", "x2", "pop1")]
gridded(sp_grdField) <- ~ x1 + x2
for (i in 1:nsim) {
sampleSY1 <- spsample(x = sp_grdField, n = sampleSize, type = "regular")
sampleSY2 <- spsample(x = sp_grdField, n = sampleSize, type = "regular")
#subsetting rows with overlay method
z1 <- over(x = sampleSY1, y = sp_grdField)
z2 <- over(x = sampleSY2, y = sp_grdField)
mz[i, 1] <- mean(z1$pop1)
mz[i, 2] <- mean(z2$pop1)
}
mz_SY <- as.data.frame(mz)
mz_both <- rbind(mz_SI, mz_SY)
mz_both$design <- rep(c("SI", "SY"), each = 1000)
ggplot(data = mz_both) +
geom_point(mapping = aes(x = V1, y = V2), size = 1) +
scale_x_continuous(name = "Sample average series a", limits = c(5, 7)) +
scale_y_continuous(name = "Sample average series b\n", limits = c(5, 7)) +
coord_fixed() +
facet_wrap(~design)
```
This sampling experiment shows that independence is not a characteristic of a population, as presumed by @Wang2012, but of random variables generated by a random process (in the experiment, the values at the points or the sample means). As the random process differs between the design-based and the model-based approach, independence has a different meaning in these two approaches. For this reason, it is imperative to be more specific when using the term independence, by saying that data are *design-independent*\index{Independence!design-independence} or that you *assume* that the data are *model-independent*\index{Independence!model-independence}.
## Bias and variance {#BiasandVariance}
Bias and variance are commonly used statistics to quantify the quality of an estimator. Bias quantifies the systematic error of the estimator, variance its random error. Both are defined as expectations. But are these expectations over realisations of a probability sampling design (samples) or over realisations of a statistical model (populations)? As with independence, it is important to distinguish *design-bias*\index{Bias!design-bias} from *model-bias*\index{Bias!model-bias} and *design-variance*\index{Design-variance} (commonly referred to as sampling variance) from *model-variance*\index{Model-variance}.
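The two kinds of expectation can be sketched with the populations `pop1`, ..., `pop5` simulated earlier. Five realisations are far too few for a reliable model-expectation, but the sketch shows the mechanics; the sample size and the number of repetitions are arbitrary choices.

```{r, eval = FALSE}
# Design-expectation: average the error of the sample mean over
# repeated probability samples from the single fixed population pop1.
N <- nrow(df_grdField)
n <- 10
err_design <- replicate(1000, {
  units <- sample(N, size = n)
  mean(df_grdField$pop1[units]) - mean(df_grdField$pop1)
})
mean(err_design)  # estimate of the design-bias, close to 0
var(err_design)   # estimate of the design-variance
# Model-expectation: average the error of one fixed sample over the
# model realisations pop1, ..., pop5 (in practice many more are needed).
units <- sample(N, size = n)
err_model <- sapply(paste0("pop", 1:5), function(pop)
  mean(df_grdField[[pop]][units]) - mean(df_grdField[[pop]]))
mean(err_model)  # crude estimate of the model-bias
```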
The concept of model-unbiasedness deserves more attention. Figure \@ref(fig:preferentialsample) shows a preferential sample\index{Preferential sample} from a population simulated by sequential Gaussian simulation with a constant mean of 10 and an exponential semivariogram without nugget, a sill of 5, and a distance parameter of 20. The points are selected by sampling with draw-by-draw selection probabilities proportional to size (pps sampling, Chapter \@ref(pps)), using the square of the simulated values as a size variable. A similar sample might be collected in practice for delineating soil contamination or for detecting hot spots of soil bacteria. Many points are selected at locations with large values, few at locations with small values. The sample data are used in ordinary kriging (Figure \@ref(fig:preferentialsample)). The prediction errors are computed by subtracting the simulated population from the kriged map.
```{r, echo = FALSE}
s1 <- s2 <- 1:100 - 0.5
grd <- expand.grid(s1, s2)
N <- nrow(grd)
names(grd) <- c("s1", "s2")
coordinates(grd) <- ~s1 + s2
vgmodel <- vgm(model = "Exp", psill = 5, range = 20)
set.seed(314)
nsim <- 1
sim <- krige(
dummy ~ 1,
locations = grd,
newdata = grd,
model = vgmodel,
nmax = 100,
nsim = nsim,
beta = 10,
dummy = TRUE,
debug.level = 0
)
sim <- as.data.frame(sim)
names(sim)[3] <- "z"
sim$size <- sim$z^2
#select pps sample
set.seed(314)
units <- sample(N, 100, prob = sim$size)
sam <- sim[units, ]
maxsimz <- max(sim$z)
minsimz <- min(sim$z)
#now predict at nodes of grid using the simulated values at the pps sample
predgrd <- grd
coordinates(sam) <- ~s1 + s2
preds <- krige(
z ~ 1,
locations = sam,
newdata = predgrd,
model = vgmodel,
debug.level = 0
)
preds <- as.data.frame(preds)
names(preds)[3] <- "zpred"
#map predicted values
#replace one predicted value by the maximum and another by the minimum of the simulated values, so that the two maps span the same range of values
preds$zpred[which.max(preds$zpred)] <- maxsimz
preds$zpred[1] <- minsimz
```
```{r preferentialsample, echo = FALSE, out.width = "100%", fig.cap = "Preferential sample (size of open dots is proportional to value of study variable) from a simulated field (z) and map of ordinary kriging predictions (zpred)."}
df <- data.frame(s1 = sim$s1, s2 = sim$s2, z = sim$z, zpred = preds$zpred)
d <- df %>% pivot_longer(cols = c("z", "zpred"))
sam <- as(sam, "data.frame")
ggplot(data = d) +
geom_tile(mapping = aes(x = s1, y = s2, fill = value)) +
geom_point(data = sam, mapping = aes(x = s1, y = s2, size = z), shape = 1) +
scale_fill_viridis_c(name = "z") +
coord_fixed() +
facet_grid(~ name)
```
```{r}
error <- preds$zpred - sim$z
```
Figure \@ref(fig:histogramerrorpreferentialsample) shows a histogram of the prediction errors. The population mean error equals `r formatC(mean(error), 3, format = "f")`, not 0. You may have expected a positive systematic error because of the overrepresentation of locations with large values, but on the other hand, kriging predictions are best linear unbiased predictions\index{Best linear unbiased predictor} (BLUP), so from that point of view this systematic error might be unexpected. BLUP means that at individual locations the ordinary kriging predictions are unbiased. However, this apparently does not guarantee that the average of the prediction errors, averaged over all population units, equals 0. The reason is that unbiasedness is defined here over all realisations (populations) of the statistical model of spatial variation. So, the U in BLUP stands for model-unbiasedness. For other model realisations, sampled at the same points, we may have much smaller values, leading to a negative mean error for that population. On average, over all populations, the error at any point will be 0, and consequently so will the average of the population mean errors.
```{r histogramerrorpreferentialsample, echo = FALSE, fig.width = 5, fig.cap = "Frequency distribution of the errors of ordinary kriging predictions from a preferential sample."}
ggplot() +
geom_histogram(aes(x = error), binwidth = 0.5, fill = "black", alpha = 0.5, colour = "black") +
scale_y_continuous(name = "Count") +
scale_x_continuous(name = "Prediction error")
```
This experiment shows that model-unbiasedness does not protect us against selection bias\index{Bias!selection bias}, i.e., bias due to preferential sampling.
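This model-expectation argument can be checked with a sketch: keep the preferentially selected sample locations fixed, simulate a modest number of new populations with the same model, krige each population from the values at those locations, and average the population mean errors. The sketch below assumes the objects `grd`, `vgmodel`, and `units` from the chunks above are still in memory; the number of realisations is an arbitrary, modest choice.

```{r, eval = FALSE}
# Sketch: with the sample locations held fixed, the population mean
# error of ordinary kriging, averaged over model realisations, is
# close to 0 (model-unbiasedness), even though it differs from 0 for
# the single realisation shown above.
nreal <- 25
sims <- krige(dummy ~ 1, locations = grd, newdata = grd, model = vgmodel,
  nmax = 100, nsim = nreal, beta = 10, dummy = TRUE, debug.level = 0)
sims <- as.data.frame(sims)
me <- numeric(nreal)
for (r in 1:nreal) {
  popr <- data.frame(s1 = sims$s1, s2 = sims$s2, z = sims[[paste0("sim", r)]])
  samr <- popr[units, ]
  coordinates(samr) <- ~ s1 + s2
  predr <- krige(z ~ 1, locations = samr, newdata = grd, model = vgmodel,
    debug.level = 0)
  me[r] <- mean(as.data.frame(predr)$var1.pred - popr$z)
}
mean(me)  # close to 0
```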
## Effective sample size {#effectivesamplesize}
Another persistent misconception is that when estimating the variance of the estimator of the mean of a spatial population or the correlation of two variables of a population, we must account for autocorrelation\index{Autocorrelation} of the sample data. This misconception occurs, for instance, in @Griffith2005 and in various sections (for instance, sections 3.5, 10.1, and 11.2) of @Plant2012. The reasoning is that, due to the spatial autocorrelation in the sample data, there is less information in the data about the parameter of interest, and so the effective sample size\index{Effective sample size} is smaller than the actual sample size. An early example of this misconception is Barnes' publication on the required sample size for estimating nonparametric tolerance intervals [@Barnes1988]. @dgr92 showed that a basic probability sampling design like simple random sampling requires fewer sampling points than the model-based sampling design proposed by Barnes.
The misconception is caused by confusing population parameters with model parameters. Recall that the population mean and the model-mean are not the same; the model-mean $\mu$ of Equation \@ref(eq:OKmodel) is the expectation of the population means over all populations that can be simulated with the model. The same holds for the variance of a variable, as well as for the covariance and the Pearson correlation coefficient of two variables. All these parameters can be defined as parameters of a finite or infinite population, or as parameters of random variables generated by a superpopulation model. Using an effective sample size to quantify the variance of an estimator is perfectly correct for model parameters, but not so for population parameters. For instance, when the correlation coefficient is defined as a population parameter and sampling units are selected by simple random sampling, there is no need to apply the method proposed by @Clifford1989, which adjusts the *p*-value\index{\emph{p}-value of a test} of a significance test of the correlation for spatial autocorrelation in the data.
I elaborate on this for the mean as the parameter of interest. Suppose a sample is selected in some way (it need not be random) and the sample mean is used as an estimator of the model-mean. Note that for a model with a constant mean, as in Equation \@ref(eq:OKmodel), the sample mean is a model-unbiased\index{Unbiasedness!model-unbiasedness} estimator of the model-mean, but in general not the best linear unbiased estimator\index{Best linear unbiased estimator} (BLUE) of the model-mean. If the random variables are model-independent, the variance of the sample mean, used as an estimator of the model-mean, can be computed by
\begin{equation}
V(\hat{\mu}) = \frac{\sigma^2}{n} \;,
(\#eq:VindModelMean)
\end{equation}
with $\sigma^2$ the model-variance of the random variable (see Equation \@ref(eq:OKmodel)). The variance presented in Equation \@ref(eq:VindModelMean) is necessarily a model-variance, as it quantifies our uncertainty about the model-mean, which only exists in the model-based approach. If the random variables are not model-independent, the model-variance of the sample mean can be computed by [@gru06]
\begin{equation}
V(\hat{\mu}) = \frac{\sigma^2}{n} \{1+(n-1)\bar{\rho}\} \;,
(\#eq:VdepModelMean)
\end{equation}
with $\bar{\rho}$ the mean correlation within the sample (the average of the correlation over all pairs of distinct sampling points). The term inside the curly brackets is larger than 1, unless $\bar{\rho}$ equals 0. So, the variance of the estimator of the model-mean with dependent data is larger than with independent data. The number of independent observations that is equivalent to a sample of $n$ spatially autocorrelated data, referred to as the effective sample size, can be computed by [@gru06]
\begin{equation}
n_{\mathrm{eff}}= \frac{n}{\{1+(n-1)\bar{\rho}\}} \;.
(\#eq:effectivesamplesize)
\end{equation}
So, if we substitute $n_{\mathrm{eff}}$ for $n$ in Equation \@ref(eq:VindModelMean), we obtain the variance presented in Equation \@ref(eq:VdepModelMean). Equation \@ref(eq:effectivesamplesize) is equivalent to equation (2) in @Griffith2005. Figure \@ref(fig:effectivesamplesize) shows that the effective sample size decreases sharply with the mean correlation. With a mean correlation of 0, the effective sample size equals the actual sample size; with a mean correlation of 1, the effective sample size equals 1.
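The sketch below computes the mean correlation and the effective sample size for a concrete configuration: 25 points on a square grid, with an exponential semivariogram without nugget (all settings are arbitrary choices for illustration).

```{r, eval = FALSE}
# Sketch: effective sample size of 25 points on a square grid with a
# spacing of 10 units, for an exponential semivariogram without nugget
# (sill 1, distance parameter 5); all settings are arbitrary choices.
vgm_exp <- vgm(model = "Exp", psill = 1, range = 5)
pnts <- expand.grid(s1 = seq(5, 45, by = 10), s2 = seq(5, 45, by = 10))
D <- as.matrix(dist(pnts))
C <- variogramLine(vgm_exp, dist_vector = D, covariance = TRUE)
meanrho <- mean(C[lower.tri(C)])  # sill is 1, so correlation = covariance
n <- nrow(pnts)
n / (1 + (n - 1) * meanrho)  # effective sample size, smaller than n
```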
```{r effectivesamplesize, echo = FALSE, fig.asp = .7, fig.width = 5, fig.cap = "Effective sample sizes as a function of the mean correlation within the sample, for samples of size 25 and 100."}
meanrho <- seq(from = 0, to = 1, length.out = 201)
n <- c(25, 100)
n25 <- n[1] / (1 + (n[1] - 1) * meanrho)
n100 <- n[2] / (1 + (n[2] - 1) * meanrho)
df <- data.frame(meanrho, n25, n100)
df_lf <- df %>% pivot_longer(cols = c("n25", "n100"))
df_lf$name <- factor(df_lf$name, levels = c("n25", "n100"), ordered = TRUE)
ggplot(data = df_lf) +
geom_line(mapping = aes(x = meanrho, y = value, linetype = name), size = 1) +
scale_linetype_manual(values = c("twodash", "dotted"), name = "") +
scale_x_continuous(name = "Mean correlation") +
scale_y_continuous(name = "Effective sample size", limits = c(1, 100))
```
To illustrate the difference between the model-variance and the design-variance of a sample mean, I simulated a finite population of 100 units, located at the nodes of a square grid, with a model-mean of 10, an exponential semivariogram without nugget, an effective range of three times the distance between adjacent population units, and a sill of 1 (Figure \@ref(fig:finitepopulation)). The model-variance of the average of a simple random sample *without replacement* of size $n$ is computed using Equation \@ref(eq:VdepModelMean), and the design-variance of the sample mean, used as an estimate of the population mean, is computed by (see Equation \@ref(eq:EstVarMeanSI))
\begin{equation}
V(\hat{\bar{z}})=\left(1-\frac{n}{N}\right)\frac{S^2}{n} \;,
(\#eq:varmeanSRSwithout)
\end{equation}
with $N$ the total number of population units ($N=100$). This is done for a range of sample sizes: $n = 10, 11, \dots, 100$. Note that for $n < 100$ the model-variance of the sample mean for a given $n$ differs between samples. For samples showing strong spatial clustering, the mean correlation is relatively large, and consequently the model-variance is relatively large (see Equation \@ref(eq:VdepModelMean)). There is less information in these samples about the model-mean than in samples without spatial clustering of the points. Therefore, to estimate the expectation of the model-variance over repeated simple random sampling for a given $n$, I selected 200 simple random samples of that size $n$ and averaged the 200 model-variances. Figure \@ref(fig:MBvarDBvar) shows the result. Both the model-variance and the design-variance of the sample mean decrease with the sample size. For all sample sizes, the model-variance is larger than the design-variance. The design-variance goes to 0 for $n = 100$ (see Equation \@ref(eq:varmeanSRSwithout)), whereas the model-variance for $n = 100$ equals 0.0509. This can be explained as follows. Although with $n = 100$ we know the population mean without error, this population mean is only an estimate of the model-mean. Recall that the model-mean is the expectation of the population mean over all realisations of the model.
```{r finitepopulation, echo = FALSE, fig.width = 5, fig.cap = "Simple random sample without replacement of ten points from a finite population simulated with a model with a model-mean of 10, a model-variance of 1, and an exponential semivariogram (without nugget) with a distance parameter equal to the distance between neighbours (effective range is three times this distance). The mean correlation between the sampling points equals approximately 0.039, and the model-variance of the estimator of the model-mean approximately 0.135."}
s1 <- s2 <- 1:10 - 0.5
grd <- expand.grid(s1, s2)
N <- nrow(grd)
names(grd) <- c("s1", "s2")
#simulate finite populations
sigmasq <- 1
vgmodel <- vgm(model = "Exp", psill = sigmasq, range = 1)
H <- as.matrix(dist(grd))
C <- variogramLine(vgmodel, dist_vector = H, covariance = TRUE)
Upper <- chol(C)
set.seed(314)
sim <- matrix(nrow = N, ncol = 10000)
for (i in 1:10000) {
G <- rnorm(n = nrow(grd), 0, 1)
sim[, i] <- crossprod(Upper, G) + 10
}
simdf <- as.data.frame(cbind(grd, sim))
#select one population, and select SRS without replacement
names(simdf)[17] <- "z"
set.seed(314)
units <- sample(N, 10, replace = FALSE)
SRS <- simdf[units, c(1, 2)]
d <- spDists(as.matrix(SRS[, c(1, 2)]))
C <- variogramLine(vgmodel, dist_vector = d, covariance = TRUE)
meanrho <- mean(C[lower.tri(C, diag = FALSE)] / sigmasq)
modelvar <- round(sigmasq / 10 * (1 + (10 - 1) * meanrho), 3)
ggplot(data = simdf) +
geom_point(mapping = aes(x = s1, y = s2, colour = z), size = 3) +
geom_point(data = SRS, mapping = aes(x = s1, y = s2), size = 6, shape = 1, stroke = 1.5) +
scale_colour_viridis_c(name = "z") +
scale_x_continuous(name = "Easting", breaks = c(2, 4, 6, 8)) +
scale_y_continuous(name = "Northing", breaks = c(2, 4, 6, 8)) +
coord_fixed()
```
```{r, echo = FALSE, eval = FALSE}
n <- 10:100
DBvar <- (1 - n / N) * var(simdf$z) / n
MBvar <- matrix(nrow = length(n), ncol = 200)
set.seed(314)
for (i in seq_len(length(n))) {
for (j in 1:200) {
units <- sample(N, n[i], replace = FALSE)
SRS <- simdf[units, c(1, 2)]
d <- spDists(as.matrix(SRS[, c(1, 2)]))
C <- variogramLine(vgmodel, dist_vector = d, covariance = TRUE)
meanrho <- mean(C[lower.tri(C, diag = FALSE)] / sigmasq)
MBvar[i, j] <- (sigmasq / n[i]) * (1 + (n[i] - 1) * meanrho)
}
}
save(n, DBvar, MBvar, file = "results/MBvarDBvar.rda")
```
```{r MBvarDBvar, echo = FALSE, fig.asp = .7, fig.width = 5, fig.cap = "Model-variance (MB) and design-variance (DB) of the average of a simple random sample without replacement as a function of the sample size."}
load(file = "results/MBvarDBvar.rda")
meanMBvar <- apply(MBvar, MARGIN = 1, FUN = mean)
df <- data.frame(n, DBvar, meanMBvar)
d <- df %>% pivot_longer(cols = c("DBvar", "meanMBvar"))
ggplot(data = d) +
geom_point(mapping = aes(x = n, y = value, shape = name), size = 2) +
scale_shape_manual(values = c(3, 1), name = "Variance", labels = c("DB", "MB")) +
scale_x_continuous(name = "Sample size", limits = c(10, 100), breaks = c(10, 30, 50, 70, 100)) +
scale_y_continuous(name = "Variance")
```
In Figure \@ref(fig:HistogramsMeanVariance) we can see that the population mean shows considerable variation. The variance of 10,000 simulated population means equals 0.0513, which is nearly equal to the value of 0.0509 for the model-variance computed with Equation \@ref(eq:VdepModelMean).
In observational research, I cannot think of situations in which interest is in the estimation of the mean of a superpopulation model. This is in contrast to experimental research, in which we are interested in the effects of treatments; think, for instance, of the effects of different types of soil tillage on the soil carbon stock. These treatment effects are quantified by differences between model-means. Also, in time-series analysis of data collected in observational studies, we might be more interested in the model-mean than in the mean over a bounded period of time.
```{r HistogramsMeanVariance, echo = FALSE, out.width = "100%", fig.asp = 0.5, fig.cap = "Frequency distributions of means and variances of 10,000 simulated populations."}
#compute population means
popmeans <- apply(simdf[, -c(1, 2)], MARGIN = 2, FUN = mean)
plt1 <- ggplot() +
geom_histogram(aes(x = popmeans), fill = "black", alpha = 0.5, binwidth = 0.1, colour = "black") +
scale_y_continuous(name = "Count") +
scale_x_continuous(name = "Population mean")
#compute population variances
S2 <- apply(simdf[, -c(1, 2)], MARGIN = 2, FUN = var)
plt2 <- ggplot() +
geom_histogram(aes(x = S2), fill = "black", alpha = 0.5, binwidth = 0.1, colour = "black") +
scale_y_continuous(name = "Count") +
scale_x_continuous(name = "Population variance")
grid.arrange(plt1, plt2, nrow = 1)
```
Now let us return to Equation \@ref(eq:Wang2010). What is wrong with this variance estimator? Where @Griffith2005 confused the population mean and the model-mean, @Wang2010 confused the population variance with the sill\index{Sill} (a priori variance) of the random process that has generated the population [@webster2007]. The parameter $\sigma^2$ in their formula is defined as the population variance, and with that definition the variance estimator is clearly wrong. If we define $\sigma^2$ in this formula as the sill, the formula makes more sense, but even then the equation is not fully correct. The variance computed with this equation is not the design-variance of the average of a simple random sample selected from the sampled population, but the *expectation* of this design-variance over all realisations of the model. So, it is a model-based prediction of the design-variance of the estimator of the population mean, estimated from a simple random sample, see Chapter \@ref(MBpredictionofDesignVariance). For the population actually sampled, the design-variance is either smaller or larger than this expectation. Figure \@ref(fig:HistogramsMeanVariance) shows that there is considerable variation in the population variance among the 10,000 populations simulated with the model. Consequently, for an individual population, the variance of the estimator of the population mean, estimated from a simple random sample, can differ substantially from the model-expectation of this variance. Do not use Equation \@ref(eq:Wang2010) for estimating the design-variance of the estimator of the population mean, but simply use Equation \@ref(eq:varmeanSRSwithout) (for simple random sampling with replacement and for simple random sampling of infinite populations, the term $(1-n/N)$ can be dropped). Equation \@ref(eq:Wang2010) is only relevant for comparing simple random sampling with other sampling designs under a variety of models of spatial variation, see Section \@ref(AnalyticalApproach) [@Ripley1981; @dom94].
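A quick check of this distinction, reusing the 10,000 populations in `simdf` simulated above with a sill (`sigmasq`) of 1:

```{r, eval = FALSE}
# Sketch: the average population variance over the model realisations
# is smaller than the sill of the generating model, so reading sigma^2
# in the formula of Wang et al. (2010) as the population variance is
# wrong.
S2 <- apply(simdf[, -c(1, 2)], MARGIN = 2, FUN = var)
mean(S2)  # smaller than the sill sigmasq = 1
```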
## Exploiting spatial structure in design-based approach {#ExploitSpatialStructure}
A further misconception is that, in the design-based approach, the possibilities of exploiting our knowledge of the spatial structure of the study variable are limited, because the sampling units are selected randomly. This would indeed be a serious drawback, but fortunately it is not true. Knowledge of the spatial structure can be used at the stage of designing the sample, at the stage of the statistical inference once the data are collected, or at both stages (Table \@ref(tab:TableExploitingSpatialStructure)).
I distinguish the situation in which maps of covariates are available from the situation in which such maps are lacking. In the first situation, the covariate maps can be used, for instance, to stratify the population (Chapter \@ref(STSI)). With a quantitative covariate, optimal stratification methods are available. Other options are, for instance, pps sampling (Chapter \@ref(pps)), and balanced sampling and well-spread sampling in covariate space with the local pivotal method (Chapter \@ref(BalancedSpreaded)). At the inference stage, the covariate maps can be used in a model-assisted approach\index{Model-assisted approach}, using, for instance, a linear regression model to increase the precision of the design-based estimator (Chapter \@ref(Modelassisted), Section \@ref(ModelassistedvsModeldependent)).
If no covariate maps are available, we may anticipate the presence of spatial structure by spreading the sampling units throughout the study area. This spreading can be done in many ways, for instance by systematic random sampling (Chapter \@ref(SY)), compact geographical stratification (Section \@ref(geostrata)), well-spread sampling in geographical space with the local pivotal method (LPM) (Subsection \@ref(LPM)), and generalised random-tessellation stratified (GRTS) sampling (Subsection \@ref(GRTS)). At the inference stage, again a model-assisted approach can be advantageous, using the spatial coordinates in a regression model.
```{r TableExploitingSpatialStructure, echo = FALSE}
tabledf <- data.frame(stage = c("Sampling", "", "", "", "Inference"), with = c("Stratified random sampling", "pps sampling", "Balanced sampling", "Covariate space spreading with LPM", "Model-assisted: regression model"), without = c("Systematic random sampling", "Compact geographical stratification", "Geographical spreading with LPM", "GRTS sampling", "Model-assisted: spatial regression model"))
knitr::kable(
tabledf, caption = "Strategies in the design-based approach for exploiting knowledge about the spatial structure of the study variable.",
col.names = c("Stage", "Covariates available", "No covariates"),
booktabs = TRUE,
linesep = ""
) %>%
kable_classic(font_size = 8) %>%
add_footnote("LPM: local pivotal method.", notation = "none")
```
## Model-assisted vs. model-dependent {#ModelassistedvsModeldependent}
In this section, the difference between the model-assisted approach and the model-based approach is explained. The model-assisted approach is a hybrid approach in between the design-based and the model-based approach. It tries to build the strength of the model-based approach, a potential increase in the accuracy of the estimates, into the design-based approach. As in the design-based approach, sampling units are selected by probability sampling, and consequently bias and variance are defined as design-bias and design-variance (Table \@ref(tab:threeapproaches)). As in the model-based approach, a superpopulation model is used. However, the role of this model in the two approaches is fundamentally different. In both approaches we assume that the population of interest is a realisation of the superpopulation model. However, as explained above, in the model-based approach the statistical properties of the estimators (predictors), such as bias and variance, are defined over all possible realisations of the model (Table \@ref(tab:threeapproaches)). So, unbiasedness and minimum variance of an estimator (predictor) mean *model*-unbiasedness and minimum *model*-variance. By contrast, in the model-assisted approach, the model is used to derive an efficient estimator (Chapter \@ref(Modelassisted)). To stress its different role in the model-assisted approach, the model is referred to as a working model\index{Working model}.
```{r threeapproaches, echo = FALSE}
approach <- data.frame(Approach = c("Design-based", "Model-assisted", "Model-based"), Sampling = c("Prob. sampling", "Prob. sampling", "No requirement"), Inference = c("Design-based", "Model-assisted", "Model-depend."), Coefficients = c("No model", "Population par.", "Superpop. par."), Criteria = c("Design-bias, Design-variance", "Design-bias, Design-variance", "Model-bias, Model-variance"))
knitr::kable(
approach, caption = "Three statistical approaches for sampling and inference.",
col.names = c("Approach", "Sampling", "Inference", "Regression coefficients", "Quality criteria"),
booktabs = TRUE,
linesep = ""
) %>%
kable_classic(font_size = 7)
```
An important property of model-assisted estimators is that, even if a poor working model is used (i.e., our assumptions about how the population is generated are incorrect), the results are still valid for moderate sample sizes: the empirical coverage rate\index{Empirical coverage rate} of a model-assisted estimate of the confidence interval of the population mean is still approximately equal to the nominal coverage rate. This is because the mismatch between the superpopulation model and the applied model-assisted estimator results in a large design-variance of the estimator of the population mean, which is reflected in the estimated standard errors and thus in wider confidence intervals. This is illustrated with a simulation study, in which I compare the effect of using a correct vs. an incorrect model in estimation.
A population is simulated with a simple linear regression model with an intercept of 15 ($\beta_0 = 15$), a slope coefficient of 0.5 ($\beta_1 = 0.5$), and a constant residual standard deviation of 2 ($\sigma_{\epsilon}=2$). This is done by first simulating a population of covariate values with a model-mean of 20 ($\mu(x)=20$), using an exponential semivariogram without nugget, a sill variance of 25, and a distance parameter of 20 distance units. This field of covariate values is then linearly transformed using the above-mentioned regression coefficients. Finally, 'white noise\index{White noise}' is added by drawing independently for each population unit a random number from a normal distribution with zero mean and a standard deviation of 2 (Figure \@ref(fig:plotsimulatedbivariatepopulation)).
```{r simulatebivariatepopulation, echo = FALSE}
s1 <- s2 <- 1:200 - 0.5
mypopulation <- expand.grid(s1, s2)
names(mypopulation) <- c("s1", "s2")
N <- nrow(mypopulation)
vgmodel <- vgm(model = "Exp", psill = 25, range = 20, nugget = 0)
coordinates(mypopulation) <- ~s1 + s2
set.seed(314)
sim <- krige(
dummy ~ 1,
locations = mypopulation,
newdata = mypopulation,
model = vgmodel,
nmax = 20,
nsim = 1,
beta = 20,
dummy = TRUE,
debug.level = 0
)
mypopulation$x <- sim$sim1
mypopulation <- as(mypopulation, "data.frame")
mypopulation$z <- 15 + 0.5 * mypopulation$x + rnorm(n = N, mean = 0, sd = 2)
```
The population mean of the study variable equals `r formatC(mean(mypopulation$z), 3, format = "f")`, which is pretty close to the known model-mean $\mu(z)$: $\beta_0 + \beta_1 \mu(x)= 15 + 0.5 \cdot 20$.
```{r plotsimulatedbivariatepopulation, echo = FALSE, out.width = "100%", fig.cap = "Realisation of simple linear regression model. x is the covariate, z is the study variable."}
plt1 <- ggplot(data = mypopulation) +
geom_raster(mapping = aes(x = s1, y = s2, fill = x)) +
scale_fill_viridis_c(name = "x") +
scale_y_continuous(name = "Northing") +
scale_x_continuous(name = "Easting") +
coord_fixed()
plt2 <- ggplot(data = mypopulation) +
geom_raster(mapping = aes(x = s1, y = s2, fill = z)) +
scale_fill_viridis_c(name = "z") +
scale_y_continuous(name = "Northing") +
scale_x_continuous(name = "Easting") +
coord_fixed()
grid.arrange(plt1, plt2, nrow = 1)
```
```{r scatterplotbivariatepopulation, echo = FALSE, fig.width = 5, out.width= "60%", fig.cap = "Exhaustive scatter plot of the simulated population, with population fit of a simple linear regression model (green line), and of a ratio model fitted with weights inversely proportional to the covariate (red line)."}
model_regr_pop <- lm(z ~ x, data = mypopulation)
model_ratio_pop <- lm(z ~ x - 1, weights = 1 / x, data = mypopulation)
ggplot(mypopulation) +
geom_point(mapping = aes(x = x, y = z), alpha = 0.25) +
geom_abline(intercept = coef(model_regr_pop)[1], slope = coef(model_regr_pop)[2], colour = "green") +
geom_abline(intercept = 0, slope = coef(model_ratio_pop)[1], colour = "red") +
scale_x_continuous(limits = c(0, 40)) +
scale_y_continuous(limits = c(0, 40), name = "z")
sdres_regr <- summary(model_regr_pop)$sigma
fit_ratio <- fitted(model_ratio_pop)
e_ratio <- mypopulation$z - fit_ratio
sdres_ratio <- sqrt(sum(e_ratio^2) / nrow(mypopulation))
```
Figure \@ref(fig:scatterplotbivariatepopulation) shows a scatter plot for all population units. The Pearson correlation coefficient equals `r formatC(cor(mypopulation$x,mypopulation$z), 3, format = "f")`. Two models are fitted to the exhaustive scatter plot: a simple linear regression model and a ratio model. The ratio model assumes that the intercept $\beta_0$ equals 0 and that the residual variance is proportional to the covariate values: $\sigma^2_{\epsilon} \propto x$. The population fits of the coefficients of the simple linear regression model are `r formatC(coef(model_regr_pop)[1], 4, format = "f")` and `r formatC(coef(model_regr_pop)[2], 4, format = "f")`, which are very close to the model regression coefficients. The fitted ratio model is clearly very poor. The residual standard deviation of the population fit of the ratio model equals `r formatC(sdres_ratio, 3, format = "f")`, which is much larger than `r formatC(sdres_regr, 3, format = "f")` of the simple linear regression model.
The population mean of the study variable is estimated by selecting 5,000 times a simple random sample of 25 units. Each sample is used to estimate the population mean with two model-assisted estimators: the simple regression estimator and the ratio estimator (Chapter \@ref(Modelassisted)). The first estimator correctly assumes that the population is a realisation of a simple linear regression model, whereas the latter incorrectly assumes that it is a realisation of a ratio model. For each sample, the standard errors of the two estimators are estimated as well, which are used to compute a 95\% confidence interval of the population mean. Then the empirical coverage rate is computed, i.e., the proportion of samples for which the population mean is inside the 95\% confidence interval. Ideally, this empirical coverage rate is equal to the nominal coverage rate of 0.95.
```{r coveragerate1, echo = FALSE}
mx_pop <- mean(mypopulation$x)
mz_pop <- mean(mypopulation$z)
n <- 25
S <- 5000
alpha <- 0.05
se_regr <- se_ratio <- ind_regr <- ind_ratio <- numeric(length = S)
set.seed(314)
for (i in 1:S) {
  # select a simple random sample with replacement of n units
  units <- sample(N, size = n, replace = TRUE)
  mySI <- mypopulation[units, ]
  mx_sam <- mean(mySI$x)
  mz_sam <- mean(mySI$z)
  # simple regression estimator of the population mean
  model_regr <- lm(z ~ x, data = mySI)
  b1 <- coef(model_regr)[2]
  mz_regr <- mz_sam + b1 * (mx_pop - mx_sam)
  # estimated standard error of the regression estimator
  e_regr <- residuals(model_regr)
  S2e <- var(e_regr)
  se_regr[i] <- sqrt(S2e / n)
margin <- qt(1 - alpha / 2, n - 2, lower.tail = TRUE) * se_regr[i]
lower <- mz_regr - margin
upper <- mz_regr + margin
ind_regr[i] <- (mz_pop > lower & mz_pop < upper)
  # ratio estimator: slope of the WLS fit times the population mean of x
  model_ratio <- lm(z ~ x - 1, weights = 1 / x, data = mySI)
  b <- coef(model_ratio)
  mz_ratio <- b * mx_pop
  e_ratio <- residuals(model_ratio)
  se_ratio[i] <- sqrt(var(e_ratio) / n)
margin <- qt(1 - alpha / 2, n - 1, lower.tail = TRUE) * se_ratio[i]
lower <- mz_ratio - margin
upper <- mz_ratio + margin
ind_ratio[i] <- (mz_pop > lower & mz_pop < upper)
}
coverage_regr <- mean(ind_regr)
coverage_ratio <- mean(ind_ratio)
```
The coverage rates of the simple regression estimator and the ratio estimator equal `r formatC(coverage_regr, 3, format = "f")` and `r formatC(coverage_ratio, 3, format = "f")`, respectively. Both are very close to the nominal coverage rate of 0.95. So, although the ratio estimator assumes an improper superpopulation model, the estimated confidence interval is still valid. The price we pay for the invalid model assumption is not a coverage rate below the nominal rate, but an increased standard error of the estimated population mean. The average over the 5,000 samples of the estimated standard error of the regression estimator equals `r formatC(mean(se_regr), 3, format = "f")`, whereas that of the ratio estimator equals `r formatC(mean(se_ratio), 3, format = "f")`. The larger standard error of the ratio estimator leads to wider confidence intervals, which explains why the coverage rate is still correct.
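The wider intervals can be made explicit from the standard errors stored in the sampling experiment; a minimal sketch, using the objects `se_regr` and `se_ratio` computed above (not evaluated here):
```{r, eval = FALSE}
# average half-width of the 95% confidence intervals over the 5,000 samples;
# the wider intervals of the ratio estimator compensate for its larger
# standard error, keeping the coverage rate near the nominal 0.95
mean(qt(0.975, df = n - 2) * se_regr)
mean(qt(0.975, df = n - 1) * se_ratio)
```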
This sampling experiment is now repeated for sample sizes $n=10, 25, 50, 100$ and for confidence levels $1-\alpha = 0.01, 0.02, \dots, 0.99$.
```{r, echo = FALSE, eval = FALSE}
n <- c(10, 25, 50, 100)
alphas <- (1:99) / 100
nsam <- 5000
mx_pop <- mean(mypopulation$x)
mz_pop <- mean(mypopulation$z)
mz_regr <- mz_ratio <- v_mz_regr <- v_mz_ratio <- av_mz_regr <- av_mz_ratio <- matrix(nrow = length(n), ncol = nsam)
ind_regr <- ind_ratio <- ind_regr_av <- ind_ratio_av <- array(dim = c(length(n), nsam, length(alphas)))
set.seed(314)
for (i in seq_len(length(n))) {
  pi <- n[i] / N # sampling fraction n/N, used in the g-weighted variance estimators
for (j in 1:nsam) {
units <- sample(N, size = n[i], replace = TRUE)
mySI <- mypopulation[units, ]
mx_sam <- mean(mySI$x)
mz_sam <- mean(mySI$z)
    model_regr <- lm(z ~ x, data = mySI)
    model_ratio <- lm(z ~ x - 1, weights = 1 / x, data = mySI)
    # simple regression and ratio estimators of the population mean
    mz_regr[i, j] <- mz_sam + coef(model_regr)[2] * (mx_pop - mx_sam)
    mz_ratio[i, j] <- mz_sam / mx_sam * mx_pop
e_regr <- residuals(model_regr)
e_ratio <- residuals(model_ratio)
    # variance estimators based on the residual variance
    av_mz_regr[i, j] <- (var(e_regr) * (n[i] - 1) / (n[i] - 2)) / n[i]
    av_mz_ratio[i, j] <- var(e_ratio) / n[i]
    # g-weighted variance estimators
    g_regr <- 1 + (mx_pop - mx_sam) * (mySI$x - mx_sam) / (mean(mySI$x^2) - mx_sam^2)
    g_ratio <- mx_pop / mx_sam
    v_mz_regr[i, j] <- sum(g_regr^2 * (e_regr / pi)^2) / N^2
    v_mz_ratio[i, j] <- sum(g_ratio^2 * (e_ratio / pi)^2) / N^2
for (k in seq_len(length(alphas))) {
se <- sqrt(v_mz_regr[i, j])
margin <- qt(1 - alphas[k] / 2, n[i] - 2, lower.tail = TRUE) * se
lower <- mz_regr[i, j] - margin
upper <- mz_regr[i, j] + margin
ind_regr[i, j, k] <- (mz_pop > lower & mz_pop < upper)
se <- sqrt(v_mz_ratio[i, j])
margin <- qt(1 - alphas[k] / 2, n[i] - 1, lower.tail = TRUE) * se
lower <- mz_ratio[i, j] - margin
upper <- mz_ratio[i, j] + margin
ind_ratio[i, j, k] <- (mz_pop > lower & mz_pop < upper)
se <- sqrt(av_mz_regr[i, j])
margin <- qt(1 - alphas[k] / 2, n[i] - 2, lower.tail = TRUE) * se
lower <- mz_regr[i, j] - margin
upper <- mz_regr[i, j] + margin
ind_regr_av[i, j, k] <- (mz_pop > lower & mz_pop < upper)
se <- sqrt(av_mz_ratio[i, j])
margin <- qt(1 - alphas[k] / 2, n[i] - 1, lower.tail = TRUE) * se
lower <- mz_ratio[i, j] - margin
upper <- mz_ratio[i, j] + margin
ind_ratio_av[i, j, k] <- (mz_pop > lower & mz_pop < upper)
}
}
}
save(mz_regr, mz_ratio, v_mz_regr, v_mz_ratio, av_mz_regr, av_mz_ratio, ind_regr, ind_ratio, ind_regr_av, ind_ratio_av, file = "results/Coveragerates_modelassistedestimators.rda")
```
```{r coveragerates, echo = FALSE, out.width = "100%", fig.cap = "Empirical versus nominal coverage rates of confidence intervals for the population mean, estimated by the simple regression estimator, for sample sizes 10, 25, 50, and 100."}
load(file = "results/Coveragerates_modelassistedestimators.rda")
n <- c(10, 25, 50, 100)
alphas <- (1:99) / 100
nsam <- 5000
Ep_mz_regr <- apply(mz_regr, MARGIN = 1, FUN = mean)
Ep_mz_ratio <- apply(mz_ratio, MARGIN = 1, FUN = mean)
Vp_mz_regr <- apply(mz_regr, MARGIN = 1, FUN = var)
Vp_mz_ratio <- apply(mz_ratio, MARGIN = 1, FUN = var)
Ep_av_mz_regr <- apply(av_mz_regr, MARGIN = 1, FUN = mean)
Ep_av_mz_ratio <- apply(av_mz_ratio, MARGIN = 1, FUN = mean)
Ep_v_mz_regr <- apply(v_mz_regr, MARGIN = 1, FUN = mean)
Ep_v_mz_ratio <- apply(v_mz_ratio, MARGIN = 1, FUN = mean)
coverage_regr <- coverage_ratio <- coverage_regr_av <- coverage_ratio_av <- matrix(nrow = length(n), ncol = length(alphas))
for (k in seq_len(length(alphas))) {
coverage_regr[, k] <- apply(ind_regr[, , k], MARGIN = 1, FUN = mean)
coverage_ratio[, k] <- apply(ind_ratio[, , k], MARGIN = 1, FUN = mean)
coverage_regr_av[, k] <- apply(ind_regr_av[, , k], MARGIN = 1, FUN = mean)
coverage_ratio_av[, k] <- apply(ind_ratio_av[, , k], MARGIN = 1, FUN = mean)
}
df <- data.frame(confidence = 1 - alphas,
n10 = coverage_regr_av[1, ],
n25 = coverage_regr_av[2, ],
n50 = coverage_regr_av[3, ],
n100 = coverage_regr_av[4, ])
d <- df %>% pivot_longer(cols = c("n10", "n25", "n50", "n100"))
d$name <- factor(d$name, levels = c("n10", "n25", "n50", "n100"), ordered = TRUE)
ggplot(d) +
geom_abline(intercept = 0, slope = 1, colour = "green") +
geom_point(mapping = aes(x = confidence, y = value), alpha = 0.5) +
scale_x_continuous(limits = c(0, 1), name = "Confidence level") +
scale_y_continuous(limits = c(0, 1), name = "Coverage rate") +
facet_wrap(~ name, ncol = 2, nrow = 2) +
coord_fixed()
```
```{r plotcoverageratesratioestimator, echo = FALSE, out.width = "100%", fig.cap = "Empirical versus nominal coverage rates of confidence intervals for the population mean, estimated by the ratio estimator, for sample sizes 10, 25, 50, and 100."}
df <- data.frame(confidence = 1 - alphas,
n10 = coverage_ratio_av[1, ],
n25 = coverage_ratio_av[2, ],
n50 = coverage_ratio_av[3, ],
n100 = coverage_ratio_av[4, ])
d <- df %>% pivot_longer(cols = c("n10", "n25", "n50", "n100"))
d$name <- factor(d$name, levels = c("n10", "n25", "n50", "n100"), ordered = TRUE)
ggplot(d) +
geom_abline(intercept = 0, slope = 1, colour = "green") +
geom_point(mapping = aes(x = confidence, y = value), alpha = 0.5) +
scale_x_continuous(limits = c(0, 1), name = "Confidence level") +
scale_y_continuous(limits = c(0, 1), name = "Coverage rate") +
facet_wrap(~ name, ncol = 2, nrow = 2) +
coord_fixed()
```
```{r TableRegressionRatioEstimator, echo = FALSE}
# relative bias (%), following the definition below: (mean estimate - true mean) / true mean
bias_regr <- ((Ep_mz_regr - mz_pop) / mz_pop) * 100
Sp_regr <- sqrt(Vp_mz_regr)
ase_regr <- sqrt(av_mz_regr)
Ep_ase_regr <- apply(ase_regr, MARGIN = 1, FUN = mean)
bias_ratio <- ((Ep_mz_ratio - mz_pop) / mz_pop) * 100
Sp_ratio <- sqrt(Vp_mz_ratio)
ase_ratio <- sqrt(av_mz_ratio)
Ep_ase_ratio <- apply(ase_ratio, MARGIN = 1, FUN = mean)
bias <- c(bias_regr, bias_ratio)
Sp <- c(Sp_regr, Sp_ratio)
Ep_ase <- c(Ep_ase_regr, Ep_ase_ratio)
tabledf <- data.frame(estimator = c(rep("Regression", 4), rep("Ratio", 4)), size = n, bias = round(bias, 4), sd = round(Sp, 3), se = round(Ep_ase, 3))
knitr::kable(
tabledf, caption = "Estimated relative bias of the regression estimator and ratio estimator, standard deviation of 5,000 regression/ratio estimates, and average of 5,000 estimated standard errors of the regression/ratio estimator.",
col.names = c("Estimator", "n", "Bias (%)", "Standard deviation", "Average standard error"),
booktabs = TRUE,
linesep = ""
) %>%
add_footnote("n: sample size.", notation = "none")
```
Figures \@ref(fig:coveragerates) and \@ref(fig:plotcoverageratesratioestimator) show that the empirical coverage rates are close to the nominal coverage rates for all four sample sizes, both estimators, and all confidence levels. For the regression estimator and $n=10$, the empirical coverage rate is somewhat too small, because the standard error of the regression estimator is slightly underestimated at this sample size: the average of the estimated standard errors (the square root of the estimated variance of the regression estimator) equals `r formatC(Ep_ase_regr[1], 3, format = "f")`, which is somewhat smaller than the standard deviation of the 5,000 regression estimates, `r formatC(Sp_regr[1], 3, format = "f")`. For all sample sizes, the standard deviation of the 5,000 ratio estimates is considerably larger than that of the 5,000 regression estimates (Table \@ref(tab:TableRegressionRatioEstimator)). For $n=10$, the standard error of the ratio estimator is also underestimated (the average of the 5,000 estimated standard errors is smaller than the standard deviation of the 5,000 ratio estimates), but as a percentage of that standard deviation the underestimation is smaller than for the regression estimator.
The relative bias, computed as
\begin{equation}
\mathrm{bias} = \frac{\frac{1}{5000}\sum_{s=1}^{5000} (\hat{\bar{z}}_s-\bar{z})}{\bar{z}} \;,
\end{equation}
is about 0 for both estimators and all four sample sizes.
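This can be verified directly from the saved simulation results; a minimal sketch, using the objects loaded above (not evaluated here):
```{r, eval = FALSE}
# relative bias (%) of the regression and ratio estimators for n = 10, 25, 50, 100
(apply(mz_regr, MARGIN = 1, FUN = mean) - mz_pop) / mz_pop * 100
(apply(mz_ratio, MARGIN = 1, FUN = mean) - mz_pop) / mz_pop * 100
```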
By contrast, if a poor superpopulation model is used in the model-based approach, the predictions and the prediction error variances are still model-unbiased, i.e., unbiased over realisations of the assumed model. However, for the particular population that is sampled, we may have a serious systematic error in the estimated population mean, and the variance of the local predictions may be seriously over- or underestimated. For this reason, model-based inference is also referred to as *model-dependent* inference\index{Model-dependent predictor}, stressing that we fully rely on the model and that the validity\index{Validity} of the estimates and predictions depends on the quality of the model [@han83].
```{r, echo=FALSE}
rm(list = ls())
```