-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy path06-Cluster.Rmd
554 lines (426 loc) · 32.9 KB
/
06-Cluster.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
# Cluster random sampling {#Cl}
With stratified random sampling using geographical strata and systematic random sampling, the sampling units are well spread throughout the study area. In general, this leads to an increase of the precision of the estimated mean (total). This is because many spatial populations show spatial structure, so that the values of the study variable at two close units are more similar than those at two distant units. With large study areas the price to be paid for this is long travel times, so that fewer sampling units can be observed in a given survey time. In this situation, it can be more efficient to select *spatial clusters*\index{Spatial cluster} of population units. In cluster random sampling\index{Cluster random sampling}, once a cluster is selected, *all* units in this cluster are observed. Therefore, this design is also referred to as *one-stage* cluster random sampling. The clusters are not subsampled as in two-stage cluster random sampling (see Chapter \@ref(Twostage)).
In spatial sampling, a popular cluster shape is a transect\index{Transect}. This is because the individual sampling units of a transect can easily be located in the field, which was in particular an advantage in the pre-GPS era.
The implementation of cluster random sampling is not straightforward. Frequently this sampling design is improperly implemented. A proper selection technique is as follows [@gru06]. In the first step, a starting unit is selected, for instance by simple random sampling. Then the remaining units of the cluster to which the starting unit belongs are identified by making use of the definition of the cluster. For instance, with clusters defined as E-W oriented transects with a spacing of 100 m between the units of a cluster, all units E and W of the starting unit at a distance of 100 m, 200 m, etc. that fall inside the study area are selected. These two steps are repeated until the required number of *clusters*, not the number of units, is selected.
A requirement of a valid selection method is that the same cluster is selected, regardless of which of its units is used as a starting unit. In the example above, this is the case: regardless of which of the units of the transect is selected first, the final set of units selected is the same because, as stated above, all units E and W of the starting unit are selected.
```{block2, type = 'rmdnote'}
An example of an improper implementation of cluster random sampling is the following selection procedure. A cluster is defined as an E-W oriented transect of four units with a mutual spacing of 100 m. A cluster is selected by randomly selecting a starting unit. The remaining three units of the cluster are selected E of this starting unit. Units outside the study area are ignored. With this selection method, the set of selected units is *not* independent of the starting unit, and therefore this selection method is invalid.
```
Note that the size, i.e., the number of units, of a cluster need not be constant. With the proper selection method described above, the selection probability of a cluster is proportional to its size. With irregularly shaped study areas, the size of the clusters can vary strongly. The size of the clusters can be controlled by subdividing the study area into blocks, for instance, stripes perpendicular to the direction of the transects, or square blocks in case the clusters are grids. In this case, the remaining units are identified by extending the transect or grid to the boundary of the block. With irregularly shaped areas, blocking will not entirely eliminate the variation in cluster sizes\index{Cluster size}.
Cluster random sampling is illustrated with the selection of E-W oriented transects in Voorst. In order to delimit the length of the transects, the study area is split into six 1 km $\times$ 1 km zones. In this case, the zones have an equal size, but this is not needed. Note that these zones do not serve as strata. When used as strata, from each zone, one or more clusters would be selected, see Section \@ref(StratifiedCl).
In the code chunk below, function `findInterval` of the **base** package is used to determine for all discretisation points in which zone they fall.
```{r}
cell_size <- 25
w <- 1000 #width of zones
grdVoorst <- grdVoorst %>%
mutate(zone = s1 %>% findInterval(min(s1) + 1:5 * w + 0.5 * cell_size))
```
As a first step in the **R** code below, variable `cluster` is added to `grdVoorst` indicating to which cluster a unit belongs. Note that each unit belongs exactly to one cluster. The operator `%%` computes the modulus of the s1-coordinate and the spacing of units within a transect (cluster). Function `str_c` of package **stringr** [@stringr] joins the resulting vector, the vector with the s2-coordinate, and the vector with the zone into a single character vector. The sizes of the clusters are computed with function `tapply`.
```{r}
spacing <- 100
grdVoorst <- grdVoorst %>%
mutate(
cluster = str_c(
(s1 - min(s1)) %% spacing,
s2 - min(s2),
zone, sep = "_"),
unit = row_number())
M_cl <- tapply(grdVoorst$z, INDEX = grdVoorst$cluster, FUN = length)
```
```{r clustersize, echo = FALSE, fig.width=5, fig.cap = "Frequency distribution of the size of clusters in Voorst. Clusters are E-W oriented transects within zones, with a spacing of 100 m between units."}
cnts <- table(M_cl) %>% as.numeric()
df <- data.frame(Size = min(M_cl):max(M_cl), Count = cnts)
ggplot(data = df) +
geom_bar(aes(x = Size, y = Count), stat = "identity")
```
In total, there are `r length(unique(grdVoorst$cluster))` clusters in the population. Figure \@ref(fig:clustersize) shows the frequency distribution of the size of the clusters.
Clusters are selected with probabilities proportional to their size and with replacement (ppswr). This can be done by simple random sampling with replacement of elementary units (centres of grid cells) and identifying the clusters to which these units belong. Finally, all units of the selected clusters are included in the sample. In the code chunk below, a function is defined for selecting clusters by ppswr. Note variable `cldraw`, that has value 1 for all units selected in the first draw, value 2 for all units selected in the second draw, etc. This variable is needed in estimating the population mean, as explained in Section \@ref(clustersamplingestimators).
```{r}
cl_ppswr <- function(sframe, n) {
units <- sample(nrow(sframe), size = n, replace = TRUE)
units_cl <- sframe$cluster[units]
mysamples <- NULL
for (i in seq_len(length(units_cl))) {
mysample <- sframe[sframe$cluster %in% units_cl[i], ]
mysample$start <- 0
mysample$start[mysample$unit %in% units[i]] <- 1
mysample$cldraw <- rep(i, nrow(mysample))
mysamples <- rbind(mysamples, mysample)
}
mysamples
}
```
Function `cl_ppswr` is now used to select six times a cluster by ppswr.
```{r}
n <- 6
set.seed(314)
mysample <- cl_ppswr(sframe = grdVoorst, n = n)
```
As our population actually is infinite, the centres of the selected grid cells are jittered to a random point within the selected grid cells. Note that the same noise is added to all units of a given cluster.
```{r, echo = FALSE, eval = FALSE}
for (i in 1:n) {
units <- which(mysample$cldraw == i)
mysample$s1[units] <- mysample$s1[units] + runif(1, min = -12.5, max = 12.5)
mysample$s2[units] <- mysample$s2[units] + runif(1, min = -12.5, max = 12.5)
}
```
```{r}
mysample <- mysample %>%
group_by(cldraw) %>%
mutate(s1 = s1 + runif(1, min = -12.5, max = 12.5),
s2 = s2 + runif(1, min = -12.5, max = 12.5))
```
Figure \@ref(fig:ClVoorst) shows the selected sample. Note that in this case the second west-most zone has two transects (clusters), whereas one zone has none, showing that the zones are not used as strata. The total number of selected points equals `r nrow(mysample)`. Similar to systematic random sampling, with cluster random sampling the total sample size is random, so that we do not have perfect control of the total sample size. This is because in this case the size, i.e., the number of points, of the clusters is not constant but varies.
```{r ClVoorst, echo = FALSE, out.width = '100%', fig.cap = "Cluster random sample from Voorst selected by ppswr."}
ggplot(grdVoorst, mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = as.factor(zone))) +
geom_raster() +
scale_fill_viridis_d(alpha = 0.5) +
geom_point(data = mysample, size = 1.5) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed() +
theme(legend.position = "none") +
theme(panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_blank())
```
The output data frame of function `cl` has a variable named `start`. This is an indicator with value 1 if this point of the cluster is selected first, and 0 otherwise. When in the field, it appears that the first selected point of a cluster does not belong to the target population, all other points of that cluster are also discarded. This is to keep the selection probabilities of the clusters exactly proportional to their size. Column `cldraw` is needed in estimation because clusters are selected with replacement. In case a cluster is selected more than once, multiple means of that cluster are used in estimation, see next section.
## Estimation of population parameters {#clustersamplingestimators}
With ppswr sampling\index{pps sampling!with replacement (ppswr)} of clusters, the population total can be estimated by the pwr estimator:
\begin{equation}
\hat{t}(z) = \frac{1}{n}\sum_{j \in \mathcal{S}} \frac{t_{j}(z)}{p_{j}} \;,
(\#eq:EstTotalCl1)
\end{equation}
with $n$ the number of cluster draws, $p_j$ the draw-by-draw selection probability of cluster $j$, and $t_j(z)$ the total of cluster $j$:
\begin{equation}
t_j(z) = \sum_{k=1}^{M_j} z_{kj} \;,
(\#eq:clustertotal)
\end{equation}
with $M_j$ the size (number of units) of cluster $j$ and $z_{kj}$ the study variable value of unit $k$ in cluster $j$.
The draw-by-draw selection probability of a cluster\index{Draw-by-draw selection probability!of a cluster} equals
\begin{equation}
p_{j} = \frac{M_j}{M} \;,
(\#eq:drawbydraw)
\end{equation}
with $M$ the total number of population units (for Voorst $M$ equals 7,528). Inserting this in Equation \@ref(eq:EstTotalCl1) yields
\begin{equation}
\hat{t}(z) = \frac{M}{n} \sum_{j \in \mathcal{S}} \frac{t_{j}(z)}{M_{j}} = \frac{M}{n} \sum_{j \in \mathcal{S}} \bar{z}_{j} \;,
(\#eq:EstTotalCl)
\end{equation}
with $\bar{z}_{j}$ the mean of cluster $j$. Note that if a cluster is selected more than once, multiple means of that cluster are used in the estimator.
Dividing this estimator by the total number of population units, $M$, yields the estimator of the population mean:
\begin{equation}
\hat{\bar{\bar{z}}}=\frac{1}{n}\sum\limits_{j \in \mathcal{S}} \bar{z}_{j} \;.
(\#eq:EstMeanCl)
\end{equation}
Note the two bars in $\hat{\bar{\bar{z}}}$, indicating that the observations are averaged twice.
For an infinite population of points discretised by the centres of a finite number of grid cells, $z_{kj}$ in Equation \@ref(eq:clustertotal) is the study variable value at a randomly selected point within the grid cell multiplied by the area of the grid cell. The estimated population total thus obtained is equal to the estimated population mean (Equation \@ref(eq:EstMeanCl)) multiplied by the area of the study area.
The sampling variance of the estimator of the mean with ppswr sampling of clusters is equal to (@coc77, equation (9A.6))
\begin{equation}
V(\hat{\bar{\bar{z}}})= \frac{1}{n}\sum_{j=1}^N \frac{M_j}{M} (\bar{z}_j-\bar{z})^2 \;,
(\#eq:TrueVarEstMeanCl)
\end{equation}
with $N$ the total number of clusters (for Voorst, $N=960$), $\bar{z}_j$ the mean of cluster $j$, and $\bar{z}$ the population mean. Note that $M_j/M$ is the selection probability of cluster $j$.
This sampling variance can be estimated by (@coc77, equation (9A.22))
\begin{equation}
\widehat{V}\!\left(\hat{\bar{\bar{z}}}\right)=\frac{\widehat{S^2}(\bar{z})}{n} \;,
(\#eq:VarEstMeanCl)
\end{equation}
where $\widehat{S^2}(\bar{z})$ is the estimated variance of cluster means (the between-cluster variance):
\begin{equation}
\widehat{S^2}(\bar{z}) = \frac{1}{n-1}\sum_{j \in \mathcal{S}}(\bar{z}_{j}-\hat{\bar{\bar{z}}})^2 \;.
(\#eq:S2EstMeanCl)
\end{equation}
In **R** the population mean and the sampling variance of the estimator of the population means can be estimated as follows.
```{r, echo = FALSE, eval = FALSE}
mz_cl <- tapply(
mysample$z, INDEX = mysample$cldraw, FUN = mean)
mz <- mean(mz_cl)
se_mz <- sqrt(var(mz_cl) / n)
```
```{r}
est <- mysample %>%
group_by(cldraw) %>%
summarise(mz_cl = mean(z)) %>%
summarise(mz = mean(mz_cl),
se_mz = sqrt(var(mz_cl) / n()))
```
The estimated mean equals `r formatC(est$mz, 1, format = "f")` g kg^-1^, and the estimated standard error equals `r formatC(est$se_mz, 1, format = "f")` g kg^-1^. Note that the size of the clusters does not appear in these formulas. This simplicity is due to the fact that the clusters are selected with probabilities proportional to size. The effect of the cluster size on the variance is implicitly accounted for. To understand this, consider that larger clusters result in smaller variance among their means.
The same estimates are obtained with functions `svydesign` and `svymean` of package **survey** [@Lumley2020]. Argument `weights` specifies the weights of the sampled clusters equal to $M/(M_j\; n)$ (Equation \@ref(eq:EstTotalCl)).
```{r}
library(survey)
M <- nrow(grdVoorst)
mysample$weights <- M / (M_cl[mysample$cluster] * n)
design_cluster <- svydesign(id = ~ cldraw, weights = ~ weights, data = mysample)
svymean(~ z, design_cluster, deff = "replace")
```
The design effect\index{Design effect} `DEff` as estimated from the selected cluster sample is considerably larger than 1. About 4 times more sampling points are needed with cluster random sampling compared to simple random sampling to estimate the population mean with the same precision.
A confidence interval estimate of the population mean can be computed with method `confint`. The number of degrees of freedom equals the number of cluster draws minus 1.
```{r}
confint(svymean(~ z, design_cluster, df = degf(design_cluster), level = 0.95))
```
Figure \@ref(fig:SamplingDistributionCl) shows the approximated sampling distribution of the pwr estimator of the mean soil organic matter (SOM) concentration with cluster random sampling and of the $\pi$ estimator with simple random sampling, obtained by repeating the random sampling with each design and estimation 10,000 times. The size of the simple random samples is equal to the expected sample size of the cluster random sampling design (rounded to nearest integer).
```{r, echo = FALSE, eval = FALSE}
number_of_samples <- 10000
mz <- v_mz <- mz_SI <- sampleSizes <- numeric(length = number_of_samples)
#compute size of clusters
M_cl <- tapply(grdVoorst$z, INDEX = grdVoorst$cluster, FUN = length)
#compute expected sample size
p <- M_cl / sum(M_cl)
m_n <- round(n * sum(p * M_cl), 0)
SI <- function(sframe, n) {
units <- sample(nrow(sframe), size = n, replace = TRUE)
mysample <- sframe[units, ]
mysample
}
set.seed(314)
for (i in 1:number_of_samples) {
mysample <- cl_ppswr(sframe = grdVoorst, n = n)
clusterMeans <- tapply(mysample$z, INDEX = mysample$cldraw, FUN = mean)
mz[i] <- mean(clusterMeans)
v_mz[i] <- var(clusterMeans) / n
sampleSizes[i] <- nrow(mysample)
mySIsample <- SI(grdVoorst, n = m_n)
mz_SI[i] <- mean(mySIsample$z)
}
save(mz, mz_SI, v_mz, sampleSizes, m_n, file = "results/Cl_Voorst.rda")
```
(ref:SamplingDistributionCllabel) Approximated sampling distribution of the pwr estimator of the mean SOM concentration (g kg^-1^) in Voorst with cluster random sampling (Cl) and of the $\pi$ estimator with simple random sampling (SI). In Cl six clusters are selected by ppswr. In SI 49 units are selected, which is equal to the expected sample size of Cl.
```{r SamplingDistributionCl, echo = FALSE, fig.asp = .8, fig.width=5, fig.cap = "(ref:SamplingDistributionCllabel)"}
load(file = "results/Cl_Voorst.rda")
estimates <- data.frame(mz, mz_SI)
names(estimates) <- c("Cl", "SI")
df <- estimates %>% pivot_longer(cols = c("Cl", "SI"))
ggplot(data = df) +
geom_boxplot(aes(y = value, x = name)) +
geom_hline(yintercept = mean(grdVoorst$z), colour = "red") +
scale_x_discrete(name = "Sampling design") +
scale_y_continuous(name = "Estimated mean SOM")
```
The variance of the 10,000 estimated population means with cluster random sampling equals `r formatC(var(estimates$Cl), 1, format = "f")` (g kg^-1^)^2^. This is considerably larger than with simple random sampling: `r formatC(var(estimates$SI), 1, format = "f")` (g kg^-1^)^2^. The large variance is caused by the strong spatial clustering of points. This may save travel time in large study areas, but in Voorst the saved travel time will be very limited, and therefore cluster random sampling in Voorst is not a good idea. The average of the estimated variances with cluster random sampling equals `r formatC(mean(v_mz), 1, format = "f")` (g kg^-1^)^2^. The difference with the variance of the 10,000 estimated means is small because the estimator of the variance, Equation \@ref(eq:VarEstMeanCl), is unbiased. Figure \@ref(fig:histsamplesizeCl) shows the approximated sampling distribution of the sample size. The expected sample size can be computed as follows:
```{r}
p <- M_cl / sum(M_cl)
print(m_n <- n * sum(p * M_cl))
```
So, the unequal draw-by-draw selection probabilities of the clusters are accounted for in computing the expected sample size\index{Expected sample size}.
```{r histsamplesizeCl, echo = FALSE, fig.width=5, fig.cap = "Approximated sampling distribution of the sample size with cluster random sampling from Voorst, for six clusters selected by ppswr."}
n_df <- data.frame(n = sampleSizes)
ggplot(n_df) +
geom_histogram(aes(x = n, y = ..density..), fill = "black", alpha = 0.5, breaks = 33.5:60.5, colour = "black") +
geom_density(aes(x = n, y = ..density..), lwd = 1, adjust = 2) +
scale_x_continuous(name = "Sample size") +
scale_y_continuous(name = "Density")
```
#### Exercises {-}
1. Write an **R** script to compute the true sampling variance of the estimator of the mean SOM concentration in Voorst for cluster random sampling and clusters selected with ppswr, $n = 6$, see Equation \@ref(eq:TrueVarEstMeanCl). Compare the sampling variance for cluster random sampling with the sampling variance for simple random sampling with a sample size equal to the expected sample size of cluster random sampling.
2. As an alternative we may select three times a transect, using three 2 km $\times$ 1 km zones obtained by joining two neighbouring 1 km $\times$ 1 km zones of Figure \@ref(fig:ClVoorst). Do you expect that the sampling variance of the estimator of the population mean is equal to, larger or smaller than that of the sampling design with six transects of half the length?
## Clusters selected with probabilities proportional to size, without replacement
In the previous section the clusters were selected with replacement (ppswr). The advantage of sampling with replacement is that this keeps the statistical inference simple, more specifically the estimation of the standard error of the estimator of the population mean. However, in sampling from finite populations, cluster sampling with replacement is less efficient than cluster sampling without replacement, especially with large sampling fractions of clusters, i.e., if $1-n/N$ is small, with $N$ being the total number of clusters and $n$ the sample size, i.e., the number of cluster draws. If a cluster is selected more than once, there is less information about the population mean in this sample than in a sample with all clusters different. Selection of clusters with probabilities proportional to size without replacement (ppswor) is not straightforward\index{pps sampling!without replacement (ppswor)}.
```{block2, type='rmdnote'}
The problem is the computation of the inclusion probabilities of the clusters. After we have selected a first cluster, we must adapt the sum of the sizes of the $N-1$ remaining clusters and recompute the selection probabilities of the remaining clusters in the second draw, etc. Section 6.4 of @loh99 nicely describes how the inclusion probabilities of the $N$ clusters in a cluster random sample of size two, selected by ppswor, can be computed.
```
Many algorithms have been developed for ppswor sampling, see @Tille2006 for an overview, and quite a few of them are implemented in package **sampling** [@Tille2016]. In the next code chunk, function `UPpivotal` is used to select a cluster random sample with ppswor. For an explanation of this algorithm, see Subsection \@ref(pivotalmethod).
```{r}
library(sampling)
n <- 6
pi <- n * M_cl / M
set.seed(314)
eps <- 1e-6
sampleind <- UPpivotal(pik = pi, eps = eps)
clusters <- sort(unique(grdVoorst$cluster))
clusters_sampled <- clusters[sampleind == 1]
mysample <- grdVoorst[grdVoorst$cluster %in% clusters_sampled, ]
```
All inclusion probabilities `pi` are smaller than 1. However, this is not guaranteed when computed as in the code chunk above. In case there are units whose size is an extreme positive outlier, the inclusion probabilities of these units can become larger than 1. These units then are always selected. The inclusion probabilities of the remaining units must be recomputed, using a reduced sample size and the sizes of the remaining units. This is what is done with function `inclusionprobabilities` of package **sampling**.
The population mean can be estimated by the $\pi$ estimator by writing a few lines of **R** code yourself or by using function `svymean` of package **survey** as shown hereafter. Estimation of the sampling variance in pps sampling of clusters without replacement is difficult^[The problem is the computation of the joint inclusion probabilities of pairs of points.]. A simple solution is to treat the cluster sample as a ppswr sample and to estimate the variance with Equation \@ref(eq:VarEstMeanCl). With small sampling fractions, this variance approximation is fine: the overestimation of the variance is negligible. For larger sampling fractions, various alternative variance approximations are developed, see @Berger2004 for details. One of the methods is Brewer's method\index{Brewer's variance estimator}, which is implemented in function `svydesign`.
```{r}
names(pi) <- clusters
mysample$pi <- pi[mysample$cluster]
design_clppswor <- svydesign(
id = ~ cluster, data = mysample, pps = "brewer", probs = ~ pi, fpc = ~ pi)
svymean(~ z, design_clppswor)
```
Another variance estimator implemented in function `svydesign` is the Hartley-Rao estimator\index{Hartley-Rao's variance estimator}. The two estimated standard errors are nearly equal.
```{r}
p2sum <- sum((n * M_cl[mysample$cluster] / M)^2) / n
design_hr <- svydesign(
id = ~ cluster, data = mysample, pps = HR(p2sum), probs = ~ pi, fpc = ~ pi)
svymean(~ z, design_hr)
```
## Simple random sampling of clusters {#SIC}
Suppose the clusters have unequal size, but we do not know the size of the clusters, so that we cannot select the clusters with probabilities proportional to their size. In this case, we may select the clusters by simple random sampling without replacement. The inclusion probability of a cluster equals $n/N$ with $n$ the number of selected clusters and $N$ the total number of clusters in the population. This yields the following $\pi$ estimator of the population total:
\begin{equation}
\hat{t}(z) = \frac{N}{n} \sum_{j \in \mathcal{S}} t_{j}(z)\;.
(\#eq:EstTotalClEqual)
\end{equation}
The population mean can be estimated by dividing this estimator of the population total by the total number of units in the population $M$:
\begin{equation}
\hat{\bar{\bar{z}}}_{\pi}(z) = \frac{\hat{t}(z)}{M}\;.
(\#eq:EstMeanHTClEqual)
\end{equation}
Alternatively, we may estimate the population mean by dividing the estimate of the population total by the *estimated* population size:
\begin{equation}
\widehat{M} = \sum_{j \in \mathcal{S}} \frac{M_{j}}{\pi_{j}} = \frac{N}{n} \sum_{j \in \mathcal{S}} M_{j} \;.
(\#eq:EstPopulatonSizeClEqual)
\end{equation}
This leads to the ratio estimator\index{Ratio estimator} of the population mean:
\begin{equation}
\hat{\bar{\bar{z}}}_{\text{ratio}}(z) = \frac{\hat{t}(z)}{\widehat{M}} \;.
(\#eq:EstMeanRatioClEqual)
\end{equation}
The $\pi$ estimator and the ratio estimator are equal when the clusters are selected with probabilities proportional to size. This is because the estimated population size is equal to the true population size.
```{r}
print(M_HT <- sum(1 / mysample$pi))
```
However, when clusters of different size are selected with equal probabilities, the two estimators are different. This is shown below. Six clusters are selected by simple random sampling without replacement.
```{r}
set.seed(314)
clusters <- sort(unique(grdVoorst$cluster))
units_cl <- sample(length(clusters), size = n, replace = FALSE)
clusters_sampled <- clusters[units_cl]
mysample <- grdVoorst[grdVoorst$cluster %in% clusters_sampled, ]
```
The $\pi$ estimate and the ratio estimate of the population mean are computed for the selected sample.
```{r}
N <- length(clusters)
mysample$pi <- n / N
tz_HT <- sum(mysample$z / mysample$pi)
mz_HT <- tz_HT / M
M_HT <- sum(1 / mysample$pi)
mz_ratio <- tz_HT / M_HT
```
The $\pi$ estimate equals `r formatC(mz_HT, 3, format = "f")` g kg^-1^, and the ratio estimate equals `r formatC(mz_ratio, 3, format = "f")` g kg^-1^. The $\pi$ estimate of the population mean can also be computed by first computing totals of clusters, see Equations \@ref(eq:EstTotalClEqual) and \@ref(eq:EstMeanHTClEqual).
```{r}
tz_cluster <- tapply(mysample$z, INDEX = mysample$cluster, FUN = sum)
pi_cluster <- n / N
tz_HT <- sum(tz_cluster / pi_cluster)
print(mz_HT <- tz_HT / M)
```
The variance of the $\pi$ estimator of the population mean can be estimated by first estimating the variance of the estimator of the total:
\begin{equation}
\widehat{V}(\hat{t}(z)) = N^2\left(1-\frac{n}{N}\right)\frac{\widehat{S^2}(t(z))}{n}
(\#eq:EstVarTotalHTClequal)
\;,
\end{equation}
and dividing this variance by the squared number of population units:
\begin{equation}
\widehat{V}(\hat{\bar{\bar{z}}}) = \frac{1}{M^2} \widehat{V}(\hat{t}(z)) \;.
(\#eq:EstVarMeanHTClequal)
\end{equation}
```{r}
fpc <- 1 - n / N
v_tz <- N^2 * fpc * var(tz_cluster) / n
se_mz_HT <- sqrt(v_tz / M^2)
```
The estimated standard error equals `r formatC(se_mz_HT, 1, format = "f")` g kg^-1^.
To compute the variance of the ratio estimator of the population mean, we first compute residuals of cluster totals:
\begin{equation}
e_j = t_j(z)-\hat{b}M_j \;,
(\#eq:residualsclustertotals)
\end{equation}
with $\hat{b}$ the ratio of the estimated population mean of the cluster totals to the estimated population mean of the cluster sizes:
\begin{equation}
\hat{b}=\frac{\frac{1}{n}\sum_{j \in \mathcal{S}} t_{j}}{\frac{1}{n}\sum_{j \in \mathcal{S}} M_{j}} \;.
(\#eq:ratioclustertotalclustersize)
\end{equation}
The variance of the ratio estimator of the population mean can be estimated by
\begin{equation}
\hat{V}(\hat{\bar{\bar{z}}}_{\text{ratio}})=\left(1-\frac{n}{N}\right)\frac{1}{(\frac{1}{n}\sum_{j \in \mathcal{S}} M_{j})^2}\frac{\widehat{S^2}_e}{n} \;,
(\#eq:varratioestimatormeanCl)
\end{equation}
with $\widehat{S^2}_e$ the estimated variance of the residuals.
```{r}
m_M_cl <- mean(M_cl[unique(mysample$cluster)])
b <- mean(tz_cluster) / m_M_cl
e_cl <- tz_cluster - b * M_cl[sort(unique(mysample$cluster))]
S2e <- var(e_cl)
print(se_mz_ratio <- sqrt(fpc * 1 / m_M_cl^2 * S2e / n))
```
The ratio estimate can also be computed with function `svymean` of package **survey**, which also provides an estimate of the standard error of the estimated mean.
```{r}
design_SIC <- svydesign(
id = ~ cluster, probs = ~ pi, fpc = ~ pi, data = mysample)
svymean(~ z, design_SIC)
```
## Stratified cluster random sampling {#StratifiedCl}
The basic sampling designs stratified random sampling (Chapter \@ref(STSI)) and cluster random sampling can be combined into stratified cluster random sampling\index{Stratified random sampling!stratified cluster random sampling}. So, instead of selecting simple random samples from the strata, within each stratum clusters are randomly selected. Figure \@ref(fig:STCl) shows a stratified cluster random sample from Voorst. The strata consist of three 2 km $\times$ 1 km zones, obtained by joining two neighbouring 1 km $\times$ 1 km zones (Figure \@ref(fig:ClVoorst)). The clusters are the same as before, i.e., E-W oriented transects within 1 km $\times$ 1 km zones, with an inter-unit spacing of 100 m. Within each stratum, two times a cluster is selected by ppswr. The stratification avoids the clustering of the selected transects in one part of the study area. Compared to (unstratified) cluster random sampling, the geographical spreading of the clusters is improved, which may lead to an increase of the precision of the estimated population mean. In Figure \@ref(fig:STCl) and in the most western stratum, the two selected transects are in the same 1 km $\times$ 1 km zone. The alternative would be to use the six zones as strata, leading to an improved spreading of the clusters, but there is also a downside with this design, see Exercise 3. Note that the selection probabilities are now equal to
\begin{equation}
p_{jh}= M_j/M_h \;,
(\#eq:drawbySTSICl)
\end{equation}
with $M_h$ the total number of population units of stratum $h$.
```{r}
grdVoorst$zone_stratum <- as.factor(grdVoorst$zone)
levels(grdVoorst$zone_stratum) <- rep(c("a", "b", "c"), each = 2)
n_h <- c(2, 2, 2)
set.seed(324)
stratumlabels <- unique(grdVoorst$zone_stratum)
mysample <- NULL
for (i in 1:3) {
grd_h <- grdVoorst[grdVoorst$zone_stratum == stratumlabels[i], ]
mysample_h <- cl_ppswr(sframe = grd_h, n = n_h[i])
mysample <- rbind(mysample, mysample_h)
}
```
```{r STCl, echo = FALSE, out.width = '100%', fig.cap = "Stratified cluster random sample from Voorst, with three strata. From each stratum two times a cluster is selected by ppswr."}
s1bnd <- seq(from = min(grdVoorst$s1) + w, to = min(grdVoorst$s1) + 5 * w,
by = w) + cell_size / 2
ggplot(grdVoorst, mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = zone_stratum)) +
geom_raster() +
scale_fill_viridis_d(alpha = 0.5) +
geom_point(data = mysample, size = 1.5) +
geom_vline(xintercept = s1bnd / 1000) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed() +
theme(legend.position = "none") +
theme(panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_blank())
```
The population mean is estimated by first estimating the stratum means using Equation \@ref(eq:EstMeanCl) at the level of the strata, followed by computing the weighted average of the estimated stratum means using Equation \@ref(eq:HTMeanSTSI2). The variance of the estimator of the population mean is estimated in the same way, by first estimating the variance of the estimator of the stratum means using Equations \@ref(eq:VarEstMeanCl) and \@ref(eq:S2EstMeanCl) at the level of the strata, followed by computing the weighted average of the estimated variances of the estimated stratum means (Equation \@ref(eq:EstVarMeanSTSI)).
```{r, echo = FALSE, eval = FALSE}
mz_h <- v_mz_h <- numeric(length = 3)
for (i in 1:3) {
units <- which(mysample$zone_stratum == letters[i])
mysample_h <- mysample[units, ]
mz_cl <- tapply(mysample_h$z, INDEX = mysample_h$cldraw, FUN = mean)
mz_h[i] <- mean(mz_cl)
v_mz_h[i] <- var(mz_cl) / n_h[i]
}
M_h <- tapply(grdVoorst$z, INDEX = grdVoorst$zone_stratum, FUN = length)
w_h <- M_h / M
mz <- sum(w_h * mz_h)
se_mz <- sqrt(sum(w_h^2 * v_mz_h))
```
```{r}
strata_size <- grdVoorst %>%
group_by(zone_stratum) %>%
summarise(M_h = n()) %>%
mutate(w_h = M_h / sum(M_h))
est <- mysample %>%
group_by(zone_stratum, cldraw) %>%
summarise(mz_cl = mean(z), .groups = "drop_last") %>%
summarise(mz_h = mean(mz_cl),
v_mz_h = var(mz_cl) / n()) %>%
left_join(strata_size, by = "zone_stratum") %>%
summarise(mz = sum(w_h * mz_h),
se_mz = sqrt(sum(w_h^2 * v_mz_h)))
```
The estimated mean equals `r formatC(est$mz, 1, format = "f")` g kg^-1^, and the estimated standard error equals `r formatC(est$se_mz, 1, format = "f")` g kg^-1^. The same estimates are obtained with function `svymean`. Weights for the clusters are computed as before, but now at the level of the strata. Note argument `nest = TRUE`, which means that the clusters are nested within the strata.
```{r}
mysample$weights <- strata_size$M_h[mysample$zone_stratum] /
(M_cl[mysample$cluster] * n_h[mysample$zone_stratum])
design_strcluster <- svydesign(id = ~ cldraw, strata = ~ zone_stratum,
weights = ~ weights, data = mysample, nest = TRUE)
svymean(~ z, design_strcluster)
```
#### Exercises {-}
3. Why is it attractive in stratified random cluster sampling to select at least two clusters per stratum?
```{r, echo = FALSE}
rm(list = ls())
```