# Sampling for estimating the semivariogram {#SamplingVariogram}
For model-based sampling as described in Chapters \@ref(MBgridspacing) and \@ref(MBSamplePattern), we must specify the (residual) semivariogram. In case we do not have a reasonable prior estimate of the semivariogram, we may decide to first collect data with the specific aim of estimating the semivariogram. This semivariogram is subsequently used to design a model-based sample for mapping. This chapter is about how to design a reconnaissance sample survey\index{Reconnaissance sample survey} for estimating the semivariogram.
The first question is how many observations we need for this. @web92 gave as a rule of thumb that 150 to 225 points are needed to obtain a reliable semivariogram when estimated by the method-of-moments (MoM). @lar00b showed that with maximum likelihood (ML) estimation only half to two-thirds of these observations are needed to achieve the same precision of the estimated semivariogram parameters.
Once we have decided on the sample size, we must select the locations of the sampling units. Two random sampling designs for semivariogram estimation are described, nested sampling (Section \@ref(NestedSampling)) and independent sampling of pairs of points (Section \@ref(IndependentSamplingPairs)). Section \@ref(samplingforvariogram) is devoted to model-based optimisation of the sampling pattern for semivariogram estimation. Section \@ref(SamplingEstimationandPrediction) is about how to design a single sample that can be used both for estimation of the semivariogram and prediction (mapping). In a final section, a practical solution is described for the problem of how to design a sample for semivariogram estimation and prediction.
## Nested sampling {#NestedSampling}
Nested sampling can be used to estimate the semivariance at a limited number of separation distances, see @Oliver1986 and @Webster2006. First, we must decide on these separation distances. We need point-pairs at various separation distances, especially at short separation distances, so that we get reliable estimates of the part of the semivariogram that has a strong effect on the kriging weights. Usually, separation distances are chosen in a geometric progression, for instance 2, 8, 32, 128, and 512 m. The multiplier, which is four in this example, should not be too small; as a rule of thumb, use three or larger.
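As a small illustration (the largest distance of 512 m and the multiplier of four are merely the values of the example above), such a geometric progression of separation distances can be generated as follows:
```{r, eval = FALSE}
#illustration only: separation distances in a geometric progression,
#largest distance 512 m, multiplier 4 (values from the example above)
multiplier <- 4
lags <- 512 / multiplier^(4:0)
lags
```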
There are two versions of nested sampling\index{Nested sampling}. In the first stage of the first version, several main stations are selected in such a way that they cover the study area well, for instance by spatial coverage sampling. In the second stage, each of the main stations is used as a starting point to select one point at a distance equal to the largest chosen separation distance (512 m in the example) in a random direction from the main station. This doubles the sample size. In the third stage, all points selected in the previous stages (main stations of stage 1 plus the points of stage 2) are used as starting points to select one point at a distance equal to the second largest separation distance (128 m), and so on. All points selected in the various stages are included in the nested sample. The code chunk below shows the function for random selection of one point at distance $h$ from a starting point. Note the `while` loop, which continues until a point is found that is inside the area. This is checked with function `over` of package **sp**.
```{r}
SelectPoint <- function(start, h, area) {
  dxy <- numeric(length = 2)
  inArea <- NA
  while (is.na(inArea)) {
    angle <- runif(n = 1, min = 0, max = 2 * pi)
    dxy[1] <- h * sin(angle); dxy[2] <- h * cos(angle)
    xypnt <- start + dxy
    coordinates(xypnt) <- ~ s1 + s2
    inArea <- as.numeric(over(x = xypnt, y = area))[1]
  }
  xypoint <- as.data.frame(xypnt)
  xypoint
}
```
The first stage of the second version is equal to that of the first version. However, in the second stage each of the main stations serves as a starting point for randomly selecting a *pair of points* with a separation distance equal to the largest chosen separation distance. The main station is halfway between the two points of the selected pair. In the third stage, each of the substations is used to select in the same way a pair of points separated by the second largest chosen distance, and so on. Only the points selected in the final stage are used as sampling points. The **R** code below shows the function for random selection of two points separated by $h$ distance units, with the starting point halfway between the two points of the pair. The `while` loop continues until both points of a pair are inside the area.
```{r}
SelectPair <- function(start, h, area) {
  dxy <- numeric(length = 2)
  xypoints <- NULL
  inArea1 <- inArea2 <- NA
  while (is.na(inArea1) | is.na(inArea2)) {
    angle <- runif(n = 1, min = 0, max = 2 * pi)
    dxy[1] <- h * sin(angle) / 2; dxy[2] <- h * cos(angle) / 2
    xypnt1 <- start + dxy
    coordinates(xypnt1) <- ~ s1 + s2
    inArea1 <- as.numeric(over(x = xypnt1, y = area))[1]
    dxy[1] <- -dxy[1]; dxy[2] <- -dxy[2]
    xypnt2 <- start + dxy
    coordinates(xypnt2) <- ~ s1 + s2
    inArea2 <- as.numeric(over(x = xypnt2, y = area))[1]
  }
  xypoints <- rbind(as.data.frame(xypnt1), as.data.frame(xypnt2))
  xypoints
}
```
The **R** code below shows the selection of a nested sample from Hunter Valley using both versions. Only one main station is selected. In total 16 points are selected in four stages. The separation distances are 2,000, 1,000, 500, and 250 m. Sixteen points is not enough for estimating the semivariogram and a multiplier of two is rather small, but this example is for illustrative purposes only.
Note that the separation distances are in descending order. The largest separation distance should not be chosen too large: if the main station is somewhere in the middle of the study area, it may then happen with the first version that no point at that distance from the main station lies inside the study area, so that the `while` loop in `SelectPoint` does not terminate. A similar problem may occur with the second version when in subsequent stages a station is selected near the border of the study area. A copy of `grdHunterValley` is made because we need both the original data frame and a gridded version of it.
```{r}
library(sp)
grid <- grdHunterValley
gridded(grdHunterValley) <- ~ s1 + s2
lags <- c(2000, 1000, 500, 250)
```
The next code chunk is an implementation of the first version of nested sampling.
```{r NestedSampling1}
set.seed(614)
unit <- sample(nrow(grid), 1)
mainstation <- grid[unit, c("s1", "s2")]
newpnt <- SelectPoint(start = mainstation, h = lags[1], area = grdHunterValley)
mysample_nested <- rbind(mainstation, newpnt)
for (j in 2:length(lags)) {
  newpnts <- NULL
  for (i in seq_len(nrow(mysample_nested))) {
    pnts <- SelectPoint(
      start = mysample_nested[i, ], h = lags[j], area = grdHunterValley)
    newpnts <- rbind(newpnts, pnts)
  }
  mysample_nested <- rbind(mysample_nested, newpnts)
}
```
The **R** code for the second version is presented in the next code chunk.
```{r NestedSampling2}
unit <- sample(nrow(grid), 1)
mainstation <- grid[unit, c("s1", "s2")]
pnt <- SelectPoint(start = mainstation, h = lags[1], area = grdHunterValley)
stations <- rbind(mainstation, pnt)
allstations <- rbind(mainstation, pnt)
for (j in 2:length(lags)) {
  newstations <- NULL
  for (i in seq_len(nrow(stations))) {
    pnts <- SelectPair(
      start = stations[i, ], h = lags[j], area = grdHunterValley)
    newstations <- rbind(newstations, pnts)
    allstations <- rbind(allstations, pnts)
  }
  stations <- newstations
}
mysample_nested_2 <- as_tibble(stations)
```
Figure \@ref(fig:BalancedNestedSample) shows the two selected nested samples. For the sample selected with the second version, the stations that served as starting points for the selection of the point-pairs are also plotted.
```{r BalancedNestedSample, echo = FALSE, out.width = "100%", fig.cap = "Balanced nested samples from Hunter Valley, selected with the two versions of nested sampling. In the subfigure of the second version the selected sampling points (symbol x) are plotted together with the selected stations (halfway between the two points of a pair)."}
mysample_nested$stage <- as.factor(c(1, 1, 2, 2, rep(3, 4), rep(4, 8)))
plt1 <- ggplot(mysample_nested) +
geom_tile(data = grid, mapping = aes(x = s1 / 1000, y = s2 / 1000), fill = "grey") +
geom_point(aes(x = s1 / 1000, y = s2 / 1000, shape = stage), size = 1) +
scale_shape(name="Stage") +
ggtitle("First version") +
theme(plot.title = element_text(size = 10, hjust = 0.5)) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
nrstations <- nrow(allstations) - nrow(mysample_nested_2)
allstations <- allstations[1:nrstations, ]
allstations$station <- as.factor(c(rep(1, 2), rep(2, 4), rep(3, 8)))
plt2 <- ggplot(mysample_nested_2) +
geom_tile(data = grid, mapping = aes(x = s1 / 1000, y = s2 / 1000), fill = "grey") +
geom_point(data = allstations, aes(x = s1 / 1000, y = s2 / 1000, shape = station), size = 1) +
geom_point(aes(x = s1 / 1000, y = s2 / 1000), shape = 4, size = 1) +
scale_shape(name="Station") +
ggtitle("Second version") +
theme(plot.title = element_text(size = 10, hjust = 0.5)) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
grid.arrange(plt1, plt2, nrow = 2)
#remove column stage
mysample_nested <- mysample_nested [, -3]
```
The samples of Figure \@ref(fig:BalancedNestedSample) are examples of *balanced* nested samples\index{Nested sampling!balanced nested sampling}. The number of point-pairs separated by a given distance doubles with every stage. As a consequence, the estimated semivariances for the smallest separation distance are much more precise than those for the largest distance. We are most uncertain about the estimated semivariances for the largest separation distances. If in the first stage only one point-pair separated by the largest distance is selected, we have only one degree of freedom for estimating the variance component associated with this stage. It is more efficient to select more than one main station, say about ten, and to select fewer points in the final stages. For instance, with the second version we may decide to select a point-pair at only half the number of stations selected in the penultimate stage; see the sketch below. The nested sample then becomes unbalanced.
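A minimal sketch of such an unbalanced variant (my illustration, not the implementation used elsewhere in this chapter): in the final stage of the second version, point-pairs are selected at a random half of the stations only.
```{r, eval = FALSE}
#sketch: unbalanced final stage of the second version, assuming `stations`
#holds the stations selected in the penultimate stage
units <- sample(nrow(stations), size = ceiling(nrow(stations) / 2))
newstations <- NULL
for (i in units) {
  pnts <- SelectPair(
    start = stations[i, ], h = lags[length(lags)], area = grdHunterValley)
  newstations <- rbind(newstations, pnts)
}
```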
The model for nested sampling with four stages is a hierarchical analysis of variance (ANOVA) model\index{ANOVA model!hierarchical ANOVA model} with random effects:
\begin{equation}
Z_{ijkl}=\mu+A_i+B_{ij}+C_{ijk}+\epsilon_{ijkl} \;,
(\#eq:ANOVAmodelnested)
\end{equation}
with $\mu$ the mean, $A_i$ the effect of the $i$th first stage station, $B_{ij}$ the effect of the $j$th second stage station within the $i$th first stage station, and so on. $A_i$, $B_{ij}$, $C_{ijk}$, and $\epsilon_{ijkl}$ are random quantities (random effects) all with zero mean and variances $\sigma^2_1$, $\sigma^2_2$, $\sigma^2_3$, and $\sigma^2_4$, respectively.
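Writing $h_1 > h_2 > h_3 > h_4$ for the separation distances chosen for stages 1 to 4, the semivariances at these distances follow from the variance components by accumulation (this is the cumulation referred to at the end of this section):
\begin{equation}
\begin{split}
\gamma(h_4) &= \sigma^2_4 \\
\gamma(h_3) &= \sigma^2_4+\sigma^2_3 \\
\gamma(h_2) &= \sigma^2_4+\sigma^2_3+\sigma^2_2 \\
\gamma(h_1) &= \sigma^2_4+\sigma^2_3+\sigma^2_2+\sigma^2_1 \;.
\end{split}
(\#eq:semivarfromcomponents)
\end{equation}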
For balanced designs, the variance components\index{Variance component} can be estimated by the MoM from a hierarchical ANOVA. The first step is to assign factors to the sampling points that indicate the grouping of the sampling points in the various stages. The number of factors needed is the number of stages minus one. All factors have two levels. Figures \@ref(fig:FactorLevelsBalancedNestedSample) and \@ref(fig:FactorLevelsBalancedNestedSample2) show the levels of the three factors. The levels of the first factor show the strongest spatial clustering, those of the second factor the second strongest, and so on.
```{r}
mysample_nested$factor1 <- rep(rep(1:2), times = 8)
mysample_nested$factor2 <- rep(rep(1:2, each = 2), times = 4)
mysample_nested$factor3 <- rep(rep(1:2, each = 4), times = 2)
```
```{r FactorLevelsBalancedNestedSample, echo = FALSE, out.width = "100%", fig.asp = 0.30, fig.cap = "The levels of the three factors assigned to the sampling points of the balanced nested sample selected with the first version."}
df <- mysample_nested %>% pivot_longer(cols = c("factor1", "factor2", "factor3"))
ggplot(df) +
geom_tile(data = grid, mapping = aes(x = s1 / 1000, y = s2 / 1000), fill = "grey") +
geom_point(aes(x = s1 / 1000, y = s2 / 1000), shape = 4, size = 0.8) +
geom_text(aes(x = (s1 - 150) / 1000, y = s2 / 1000, label = value), size = 2) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed() +
theme(axis.text=element_text(size=8),
axis.title=element_text(size=8)) +
facet_wrap(~name)
```
The **R** code below shows the construction of the three factors for the second version of nested sampling.
```{r}
mysample_nested_2$factor1 <- rep(1:2, each = 8)
mysample_nested_2$factor2 <- rep(rep(1:2, each = 4), times = 2)
mysample_nested_2$factor3 <- rep(rep(1:2, each = 2), times = 4)
```
```{r FactorLevelsBalancedNestedSample2, echo = FALSE, out.width = "100%", fig.asp = 0.30, fig.cap = "The levels of the three factors assigned to the sampling points of the balanced nested sample selected with the second version."}
df <- mysample_nested_2 %>% pivot_longer(cols = c("factor1", "factor2", "factor3"))
ggplot(df) +
geom_tile(data = grid, mapping = aes(x = s1 / 1000, y = s2 / 1000), fill = "grey") +
geom_point(aes(x = s1 / 1000, y = s2 / 1000), shape = 4, size = 0.8) +
geom_text(aes(x = (s1 - 150) / 1000, y = s2 / 1000, label = value), size = 2) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed() +
theme(axis.text=element_text(size=8),
axis.title=element_text(size=8)) +
facet_wrap(~name)
```
For unbalanced nested designs\index{Nested sampling!unbalanced nested sampling}, the variance components can be estimated by restricted maximum likelihood\index{Restricted maximum likelihood estimation} (REML) [@Webster2006]. REML estimation is also recommended if, in Equation \@ref(eq:ANOVAmodelnested), the constant mean $\mu$ is replaced by a linear combination of one or more covariates (fixed effects). The semivariances at the chosen separation distances are obtained by cumulating the estimated variance components.
The **R** code below shows how the variance components and the semivariances can be estimated with function `lme` of the package **nlme** [@nlme], once the data are collected and added to the data frame. This function fits linear mixed-effects models and allows for nested random effects. It can be used both for balanced and unbalanced nested samples, and for a constant mean or a mean that is a linear combination of covariates. Argument `fixed` is a formula describing the fixed effects with the response variable on the left-hand side of the ~ operator and the covariates for the mean on the right-hand side. If a constant mean is assumed, as in our example, this is indicated by the 1 on the right-hand side of the ~ operator. Argument `random` is a one-sided formula (no response variable is on the left-hand side of the ~ operator). On the right-hand side of the $|$ separator, the nested structure of the data is specified using the factors of Figure \@ref(fig:FactorLevelsBalancedNestedSample). The 1 on the left-hand side of the $|$ separator means that we assume that all regression coefficients associated with the covariates are fixed (non-random) quantities.
```{r EstimateVarianceComponents, eval = FALSE}
library(nlme)
lmodel <- lme(
fixed = z ~ 1, data = mysample_nested,
random = ~ 1 | factor1 / factor2 / factor3)
res <- as.matrix(VarCorr(lmodel))
sigmas <- as.numeric(res[c(2, 4, 6, 7), 1])
sigma <- rev(sigmas)
semivar <- cumsum(sigma)
```
Random sampling of the points is not strictly needed, because a model-based approach is followed here. The model of Equation \@ref(eq:ANOVAmodelnested) is a superpopulation model, i.e., we assume that the population is generated by this model (see Chapter \@ref(Approaches)). @Papritz2011, for instance, selected the points (using the second version) non-randomly to improve the control of the nested subareas and the average separation distances.
@Lark2011CAGEO describes a method for optimisation of a nested design, given the total number of points and the chosen separation distances.
#### Exercises {-}
1. Write an **R** script to select with the first version a balanced nested sample from Hunter Valley. Use as separation distances 1,000, 500, 200, 100, and 50 m.
+ Add the factors that are needed for estimating the variance components to the data frame with the selected sampling points.
+ Overlay the sampling points with the `SpatialPixelsDataFrame`, and estimate the semivariances for the attribute compound topographic index (cti).
## Independent sampling of pairs of points {#IndependentSamplingPairs}
With the nested design, the estimated semivariances for the different separation distances are not independent. Independent estimated semivariances can be obtained by independent random selection of pairs of points (IPP sampling). Independence here means design-independence, see Section \@ref(iid). Similar to a regression model, a semivariogram can be defined as a superpopulation model or as a population model. Only in the current section is a semivariogram defined at the population level. Such a semivariogram is referred to as a non-ergodic semivariogram\index{Non-ergodic semivariogram} or local semivariogram\index{Local semivariogram} [@bru94b].
IPP sampling is straightforward for simple random sampling\index{Simple random sampling of point-pairs|(}. For each separation distance a point-pair is selected by first selecting one point fully randomly from the study area. The second point is then selected randomly on the circle with the first point at its centre and a radius equal to the chosen separation distance. If this second point falls outside the study area, both points are discarded. This is repeated until we have the required number of point-pairs for this separation distance. The next code chunk is an implementation of this selection procedure.
```{r SIPairs}
SIpairs <- function(h, n, area) {
  topo <- as(getGridTopology(area), "data.frame")
  cell_size <- topo$cellsize[1]
  xy <- coordinates(area)
  dxy <- numeric(length = 2)
  xypnts1 <- xypnts2 <- NULL
  i <- 1
  while (i <= n) {
    unit1 <- sample(length(area), size = 1)
    xypnt1 <- xy[unit1, ]
    xypnt1[1] <- jitter(xypnt1[1], amount = cell_size / 2)
    xypnt1[2] <- jitter(xypnt1[2], amount = cell_size / 2)
    angle <- runif(n = 1, min = 0, max = 2 * pi)
    dxy[1] <- h * sin(angle); dxy[2] <- h * cos(angle)
    xypnt2 <- as.data.frame(t(xypnt1 + dxy))
    coordinates(xypnt2) <- ~ s1 + s2
    inArea <- as.numeric(over(x = xypnt2, y = area))[1]
    if (!is.na(inArea)) {
      xypnts1 <- rbind(xypnts1, xypnt1)
      xypnts2 <- rbind(xypnts2, as.data.frame(xypnt2))
      i <- i + 1
    }
    rm(xypnt1, xypnt2)
  }
  cbind(xypnts1, xypnts2)
}
```
IPP sampling is illustrated with the compound topographic index (cti, which is the same as topographic wetness index) data of Hunter Valley. Five separation distances are chosen, collected in the numeric vector `h`, and for each distance $n=100$ point-pairs are selected by simple random sampling.
```{r IPPSampling}
library(sp)
h <- c(50, 100, 200, 500, 1000)
n <- 100
set.seed(123)
allpairs <- NULL
for (i in seq_len(length(h))) {
  pairs <- SIpairs(h = h[i], n = n, area = grdHunterValley)
  allpairs <- rbind(allpairs, pairs, make.row.names = FALSE)
}
```
The `data.frame` `allpairs` has four variables: the spatial coordinates of the first and of the second point of a pair. An overlay is made of the selected points with the `SpatialPixelsDataFrame`, and the cti values are extracted.
```{r Allpairs}
pnt1 <- allpairs[, c(1, 2)]
coordinates(pnt1) <- ~ s1 + s2
z1 <- over(x = pnt1, y = grdHunterValley)["cti"]
pnt2 <- allpairs[, c(3, 4)]
coordinates(pnt2) <- ~ s1 + s2
z2 <- over(x = pnt2, y = grdHunterValley)["cti"]
mysample <- data.frame(h = rep(h, each = n), z1, z2)
names(mysample)[c(2, 3)] <- c("z1", "z2")
```
The semivariances for the chosen separation distances are estimated as well as the variance of these estimated semivariances.
```{r}
gammah <- vgammah <- numeric(length = length(h))
for (i in seq_len(length(h))) {
  units <- which(mysample$h == h[i])
  pairsh <- mysample[units, ]
  gammah[i] <- mean((pairsh$z1 - pairsh$z2)^2, na.rm = TRUE) / 2
  vgammah[i] <- var((pairsh$z1 - pairsh$z2)^2, na.rm = TRUE) / (n * 4)
}
```
A spherical model with nugget is fitted to the sample semivariogram, using function `nls`, with weights equal to the reciprocal of the estimated variances of the estimated semivariances.
```{r}
sample_vg <- data.frame(h, gammah, vgammah)
SphNug <- function(h, range, psill, nugget) {
  h <- h / range
  nugget + psill * ifelse(h < 1, (1.5 * h - 0.5 * h^3), 1)
}
fit.var <- nls(gammah ~ SphNug(h, range, psill, nugget),
  data = sample_vg, start = list(psill = 4, range = 200, nugget = 1),
  weights = 1 / vgammah, algorithm = "port", lower = c(0, 0, 0))
print(pars <- signif(coef(fit.var), 3))
```
Figure \@ref(fig:variogramCTI) shows the sample semivariogram and the fitted model.
```{r variogramCTI, echo = FALSE, fig.asp = 0.7, fig.width = 5, fig.cap = "Sample semivariogram, obtained by independent sampling of pairs of points, and fitted spherical model of compound topographic index in Hunter Valley."}
#Compute fitted values
dist <- seq(from = 0.1, to = 1000, by = 0.1)
fitted <- SphNug(h = dist, range = pars["range"], psill = pars["psill"], nugget = pars["nugget"])
#Plot sample semivariogram and fitted model
df <- data.frame(h = dist, fitted = fitted)
ggplot(data = sample_vg) +
geom_point(mapping = aes(x = h, y = gammah), size = 2) +
geom_line(data = df, mapping = aes(x = dist, y = fitted), colour = "red") +
scale_x_continuous(name = "Separation distance (m)") +
scale_y_continuous(name = "Semivariance", limits = c(0, NA))
```
The covariances of the estimated semivariances at different separation distances are zero, as the point-pairs are selected independently. This keeps estimation of the variances and covariances of the estimated semivariogram parameters simple. In the next code chunk, this is done by bootstrapping\index{Bootstrap sample}.
In bootstrapping, for each separation distance a simple random sample *with replacement* of point-pairs is selected from the original sample of point-pairs. A point-pair can be selected more than once. The sample size (number of draws) is equal to the total number of point-pairs per separation distance in the original sample.
Every run of the bootstrap results in as many bootstrap samples as there are separation distances. The bootstrap samples are used to fit a semivariogram model. The whole procedure is repeated 500 times, resulting in 500 vectors with model parameters. These vectors can be used to estimate the variances and covariances of the estimators of the three semivariogram parameters.
```{r bootstrap}
allpars <- NULL
R <- 500
for (j in 1:R) {
  gammah <- vgammah <- numeric(length = length(h))
  for (i in seq_len(length(h))) {
    units <- which(mysample$h == h[i])
    mysam_btsp <- mysample[units, ] %>%
      slice_sample(n = n, replace = TRUE)
    gammah[i] <- mean((mysam_btsp$z1 - mysam_btsp$z2)^2, na.rm = TRUE) / 2
    vgammah[i] <- var((mysam_btsp$z1 - mysam_btsp$z2)^2, na.rm = TRUE) / (n * 4)
  }
  sample_vg <- data.frame(h, gammah, vgammah)
  tryCatch({
    fittedvariogram <- nls(gammah ~ SphNug(h, range, psill, nugget),
      data = sample_vg, start = list(psill = 4, range = 200, nugget = 1),
      weights = 1 / vgammah, algorithm = "port", lower = c(0, 0, 0))
    pars <- coef(fittedvariogram)
    allpars <- rbind(allpars, pars)}, error = function(e) {})
}
#compute variance-covariance matrix
signif(var(allpars), 3)
```
Note the large variance of the range parameter (the standard deviation is 258 m), as well as the negative covariance of the nugget and the partial sill parameter (the Pearson correlation coefficient is -0.72). Histograms of the three estimated semivariogram parameters are shown in Figure \@ref(fig:histogramsvariogramparameters)\index{Simple random sampling of point-pairs|)}.
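The standard deviation and correlation coefficient just quoted follow directly from this bootstrap variance-covariance matrix; a minimal sketch:
```{r, eval = FALSE}
#standard deviations and correlations of the estimated semivariogram
#parameters, derived from the bootstrap variance-covariance matrix
vcv <- var(allpars)
sqrt(diag(vcv))  #standard deviations, e.g. of the range parameter
cov2cor(vcv)     #Pearson correlations, e.g. of nugget and partial sill
```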
```{r histogramsvariogramparameters, echo = FALSE, out.width = "100%", fig.asp = 0.6, fig.cap = "Frequency distributions of estimated parameters of a spherical semivariogram of compound topographic index in Hunter Valley."}
df <- as.data.frame(allpars)
names(df) <- c("PartialSill", "Range", "Nugget")
plt1 <- ggplot(data = df) +
geom_histogram(aes(x = Nugget), fill = "black", alpha = 0.5, colour = "black", binwidth = 0.25) +
scale_y_continuous(name="Count")
plt2 <- ggplot(data = df) +
geom_histogram(aes(x = PartialSill), fill = "black", alpha = 0.5, colour = "black", binwidth = 0.5) +
scale_x_continuous(name="Partial sill ") +
scale_y_continuous(name="Count")
plt3 <- ggplot(data = df) +
geom_histogram(aes(x = Range), fill = "black", alpha = 0.5, colour = "black", binwidth = 150) +
scale_y_continuous(name="Count")
grid.arrange(plt1, plt2, plt3, nrow = 2)
```
```{block2, type = 'rmdnote'}
@Marcelli2019 show how a probability sample of *points* (instead of pairs of points) can be used in design-based estimation of the semivariogram. From the $n$ randomly selected points all $n(n-1)/2$ point-pairs are constructed. The *second-order inclusion probabilities* of these point-pairs are used to estimate the mean semivariance for separation distance classes. This sampling strategy makes better use of the data and is therefore potentially more efficient than IPP sampling.
```
#### Exercises {-}
2. Write an **R** script to select simple random samples of pairs of points for estimating the semivariogram of cti in Hunter Valley. Use as separation distances 25, 50, 100, 200, and 400 m. Note that these separation distances are smaller than those used above. Select 100 pairs per separation distance.
+ Compute the sample semivariogram, and estimate a spherical model with nugget using function `nls`.
+ Compare the estimated semivariogram parameters with the estimates obtained with the larger separation distances.
+ Estimate the variance-covariance matrix of the estimated semivariogram parameters by bootstrapping.
+ Compare the variances of the estimated semivariogram parameters with the variances obtained with the larger separation distances. Which variance is changed most?
## Optimisation of sampling pattern for semivariogram estimation {#samplingforvariogram}
There is a rich literature on model-based optimisation of the sampling locations for semivariogram estimation. Several design criteria (minimisation criteria) have been proposed for optimising the sampling pattern. Various authors have proposed a measure of the uncertainty about the semivariogram parameters as a minimisation criterion. These criteria are described and illustrated in the next subsection. The kriging variance is sensitive to errors in the estimated semivariogram. Therefore, @lar02 proposed to use a measure of the uncertainty about the kriging variance as a minimisation criterion (Subsection \@ref(UncertainKrigingVariance)).
### Uncertainty about semivariogram parameters {#UncertainSemivariogramParameters}
@mul99 as well as @bog99 proposed as a minimisation criterion the determinant of the variance-covariance matrix\index{Determinant of variance-covariance matrix} of the semivariogram parameters, estimated by generalised least squares fitting of the MoM sample semivariogram. For instance, if we have two semivariogram parameters, $\theta_1$ and $\theta_2$, the determinant of the 2 $\times$ 2 variance-covariance matrix equals the product of the variances of the two estimated parameters minus the squared covariance of the two estimated parameters. If the two estimated parameters are correlated, the determinant of the matrix is smaller than if they are uncorrelated, in which case the covariance term is zero. The determinant is a measure of our *joint* uncertainty about the semivariogram parameters.
@zhu05 proposed as a minimisation criterion the log of the determinant of the inverse Fisher information matrix\index{Fisher information matrix} in ML estimation\index{Maximum likelihood estimation} of the semivariogram, hereafter denoted by logdet. The Fisher information about a semivariogram parameter is a function of the likelihood of that parameter; the likelihood of a semivariogram parameter is the probability of the data as a function of that parameter. The log of this likelihood can be plotted against values of the parameter. The flatter the log-likelihood surface\index{Log-likelihood surface}, the less information the data contain about the parameter. The flatness of the surface can be measured by the first derivative of the log-likelihood with respect to the semivariogram parameter. Strongly negative or positive derivative values indicate a steep surface. The Fisher information for a model parameter is defined as the expectation of the *square* of the first derivative of the log-likelihood with respect to that parameter, see @Ly2017 for a nice tutorial on this subject. The more information we have about a semivariogram parameter, the less uncertain we are about that parameter. This explains why the inverse of the Fisher information can be used as a measure of uncertainty. The inverse Fisher information matrix contains the variances and covariances of the estimated semivariogram parameters.
The code chunks hereafter show how logdet can be computed. It makes use of the result of @Kitanidis87 who showed that each element of the Fisher information matrix $\mathbf{I}(\theta)$ can be obtained with (see also @lar02)
\begin{equation}
[\mathbf{I}(\theta)]_{ij}=\frac{1}{2}\mathrm{Tr}\left[\mathbf{A}^{-1}\frac{\partial\mathbf{A}}{\partial\theta_i}\mathbf{A}^{-1}\frac{\partial\mathbf{A}}{\partial\theta_j}\right]\;,
(\#eq:FisherInformation)
\end{equation}
with $\mathbf{A}$ the correlation matrix of the sampling points, $\frac{\partial\mathbf{A}}{\partial\theta_i}$ the partial derivative of the correlation matrix with respect to the $i$th semivariogram parameter, and Tr[$\cdot$] the trace of a matrix.
As an illustration, I selected a simple random sample of 50 points from Hunter Valley. A matrix with distances between the points of a sample is computed. Preliminary values for the semivariogram parameters $\xi$ (ratio of spatial dependence\index{Ratio of spatial dependence}) and $\phi$ (distance parameter\index{Distance parameter}) are obtained by visual inspection of the sample semivariogram, and the **gstat** [@peb04] function `variogramLine` is used to compute the correlation matrix.
```{r, echo = FALSE}
rm(grdHunterValley)
```
```{r}
library(sp)
library(gstat)
set.seed(314)
mysample0 <- grdHunterValley %>%
slice_sample(n = 50)
coordinates(mysample0) <- ~ s1 + s2
D <- spDists(mysample0)
xi <- 0.8; phi <- 200
thetas <- c(xi, phi)
vgmodel <- vgm(model = "Exp", psill = thetas[1],
range = thetas[2], nugget = 1 - thetas[1])
A <- variogramLine(vgmodel, dist_vector = D, covariance = TRUE)
```
In the next step, the semivariogram parameters are slightly changed one-by-one. The changes, referred to as perturbations\index{Perturbation}, are a small fraction of the preliminary semivariogram parameter values. The perturbed semivariogram parameters are used to compute the perturbed correlation matrices (`pA`) and the partial derivatives of the correlation matrix (`dA`) for each perturbation.
```{r}
perturbation <- 0.01
pA <- dA <- list()
for (i in seq_len(length(thetas))) {
  thetas_pert <- thetas
  thetas_pert[i] <- (1 + perturbation) * thetas[i]
  vgmodel_pert <- vgm(model = "Exp", psill = thetas_pert[1],
    range = thetas_pert[2], nugget = 1 - thetas_pert[1])
  pA[[i]] <- variogramLine(vgmodel_pert, dist_vector = D, covariance = TRUE)
  dA[[i]] <- (pA[[i]] - A) / (thetas[i] * perturbation)
}
```
Finally, the Fisher information matrix is computed using Equation \@ref(eq:FisherInformation). We do not need to compute the inverse of the Fisher information matrix, because the determinant of the inverse of a matrix is equal to the inverse of the determinant of the (not inverted) matrix. The determinant is computed with function `determinant`.
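In other words, $\log\left|\mathbf{I}^{-1}(\pmb{\theta})\right| = -\log\left|\mathbf{I}(\pmb{\theta})\right|$, which explains the minus sign in the computation of `logdet0` in the next code chunk.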
```{r}
I <- matrix(0, length(thetas), length(thetas))
for (i in seq_len(length(thetas))) {
  m_i <- solve(A, dA[[i]])
  for (j in i:length(thetas)) {
    m_j <- solve(A, dA[[j]])
    I[i, j] <- I[j, i] <- 0.5 * sum(diag(m_i %*% m_j))
  }
}
logdet0 <- -determinant(I, logarithm = TRUE)$modulus
```
The joint uncertainty about the semivariogram parameters, as quantified by the log of the determinant of the inverse of the information matrix, equals `r formatC(logdet0, 3, format = "f")`. Hereafter, we will see how much this joint uncertainty can be reduced by optimising the spatial pattern of the sample used for semivariogram estimation, compared to the simple random sample used in the above calculation. Note that a preliminary semivariogram is needed to compute an optimised sampling pattern for semivariogram estimation.
Function `optimUSER` of package **spsann** can be used to search for the sampling locations with the minimum value of logdet. This function has been used before in Section \@ref(SamplePatternOK). Package **spsann** cannot deal with the `r formatC(nrow(grdHunterValley), 0, format = "f", big.mark = ",")` candidate grid nodes of Hunter Valley; these are too many. I therefore selected a subgrid of 50 m $\times$ 50 m.
The size of the sample for estimation of the semivariogram is passed to function `optimUSER` with argument `points`. The objective function to be minimised is passed to function `optimUSER` with argument `fun`. The objective function `logdet` is defined in package **sswr**. Argument `model` specifies the model *type*, using the characters for model types of package **gstat**. Argument `thetas` specifies the preliminary semivariogram parameter values. Argument `perturbation` specifies how much the semivariogram parameters are changed to compute the perturbed correlation matrices (`pA`) and the partial derivatives of the correlation matrix (`dA`).
```{r, eval = FALSE}
gridded(grdHunterValley) <- ~ s1 + s2
candi <- spsample(grdHunterValley, type = "regular",
cellsize = c(50, 50), offset = c(0.5, 0.5))
candi <- as.data.frame(candi)
names(candi) <- c("x", "y")
schedule <- scheduleSPSANN(
initial.acceptance = c(0.8, 0.95),
initial.temperature = 0.15, temperature.decrease = 0.9,
chains = 300, chain.length = 10, stopping = 10,
x.min = 0, y.min = 0, cellsize = 50)
set.seed(314)
res <- optimUSER(
points = 50, candi = candi,
fun = logdet,
model = "Exp", thetas = thetas, perturbation = 0.01,
schedule = schedule, track = TRUE)
```
```{r, eval = FALSE, echo = FALSE}
write_rds(res, file = "results/MBSample_logdet_phi200nug02_HunterValley_50pnts.rds")
```
```{r, echo = FALSE}
res <- read_rds(file = "results/MBSample_logdet_phi200nug02_HunterValley_50pnts.rds")
logdetopt <- tail(res$objective$energy$obj, 1)
```
Figure \@ref(fig:MBVariogram) shows the optimised sampling pattern of 50 points. The logdet of the optimised sample equals `r formatC(logdetopt, 3, format = "f")`, which is `r formatC(100*logdetopt/logdet0, 0, format = "f")`\% of the value of the simple random sample used above to illustrate the computations. The optimised sample consists of two clusters. There are quite a few point-pairs with nearly coinciding points.
### Uncertainty about the kriging variance {#UncertainKrigingVariance}
@lar02 proposed as a minimisation criterion the estimation variance of the kriging variance (VKV) due to uncertainty in the ML estimates of the semivariogram parameters. This variance is approximated by a first order Taylor series\index{First order Taylor series}, requiring the partial derivatives of the kriging variance with respect to the semivariogram parameters:
\begin{equation}
VKV(\mathbf{s}_0) = \sum_{i=1}^p \sum_{j=1}^p \mathrm{Cov}(\theta_i,\theta_j) \frac{\partial V_{\mathrm{OK}}(\mathbf{s}_0)} {\partial \theta_i} \frac{\partial V_{\mathrm{OK}}(\mathbf{s}_0)} {\partial \theta_j}\;,
(\#eq:varkrigingvar)
\end{equation}
with $p$ the number of semivariogram parameters, $\mathrm{Cov}(\theta_i,\theta_j)$ the covariance of the semivariogram parameters $\theta_i$ and $\theta_j$ (elements of the inverse of the Fisher information matrix $\mathbf{I}^{-1}(\pmb{\theta})$, Equation \@ref(eq:FisherInformation)), and $\frac{\partial V_{\mathrm{OK}}(\mathbf{s}_0)} {\partial \theta_i}$ the partial derivative of the kriging variance with respect to the $i$th semivariogram parameter at prediction location $\mathbf{s}_0$.
The first step in designing a sample for semivariogram estimation using a population parameter of VKV as a minimisation criterion is to select a sample for the second sampling round. In the code chunk below, a spatial coverage sample of 100 points is selected, using function `stratify` of package **spcosa**, see Section \@ref(SpatialCoverage). Once the observations of this sample are collected in the second sampling round, these data are used for mapping by ordinary kriging.
To optimise the sampling pattern of the first sampling round for variogram estimation, the population mean of VKV (MVKV) is used as a minimisation criterion. This population mean is estimated from a centred square grid of 200 points, the evaluation sample.
```{r}
library(spcosa)
gridded(grdHunterValley) <- ~ s1 + s2
set.seed(314)
mystrata <- stratify(grdHunterValley, nStrata = 100, equalArea = FALSE, nTry = 10)
mysample_SC <- as(spsample(mystrata), "SpatialPoints")
mysample_eval <- spsample(
x = grdHunterValley, n = 200, type = "regular", offset = c(0.5, 0.5))
```
The following code chunks show how VKV at the evaluation point is computed. First, the correlation matrix of the spatial coverage sample (`A`) is computed as well as the correlation matrix of the spatial coverage sample and the evaluation points (`A0`). Correlation matrix `A` is extended with a column and a row with ones, see Equation \@ref(eq:krigingeqsmatrix).
```{r}
D <- spDists(mysample_SC)
vgmodel <- vgm(model = "Exp", psill = thetas[1], range = thetas[2],
nugget = 1 - thetas[1])
A <- variogramLine(vgmodel, dist_vector = D, covariance = TRUE)
nobs <- length(mysample_SC)
B <- matrix(data = 1, nrow = nobs + 1, ncol = nobs + 1)
B[1:nobs, 1:nobs] <- A
B[nobs + 1, nobs + 1] <- 0
D0 <- spDists(x = mysample_eval, y = mysample_SC)
A0 <- variogramLine(vgmodel, dist_vector = D0, covariance = TRUE)
b <- cbind(A0, 1)
```
Next, the semivariogram parameters are perturbed one-by-one, and the perturbed correlation matrices `pA` and `pA0` are computed.
```{r}
pA <- pA0 <- list()
for (i in seq_len(length(thetas))) {
  thetas_pert <- thetas
  thetas_pert[i] <- (1 + perturbation) * thetas[i]
  vgmodel_pert <- vgm(model = "Exp", psill = thetas_pert[1],
    range = thetas_pert[2], nugget = 1 - thetas_pert[1])
  pA[[i]] <- variogramLine(vgmodel_pert, dist_vector = D, covariance = TRUE)
  pA0[[i]] <- variogramLine(vgmodel_pert, dist_vector = D0, covariance = TRUE)
}
pB <- pb <- list()
for (i in seq_len(length(thetas))) {
  pB[[i]] <- B
  pB[[i]][1:nobs, 1:nobs] <- pA[[i]]
  pb[[i]] <- cbind(pA0[[i]], 1)
}
```
Next, the kriging variance and the perturbed kriging variances are computed, and the partial derivatives of the kriging variance with respect to the semivariogram parameters are approximated. See Equations \@ref(eq:krigingweights) and \@ref(eq:OKvariance) for how the kriging weights `l` and the kriging variance `var` are computed.
```{r}
var <- numeric(length = length(mysample_eval))
pvar <- matrix(nrow = length(mysample_eval), ncol = length(thetas))
for (i in seq_len(length(mysample_eval))) {
  #compute kriging weights and Lagrange multiplier
  l <- solve(B, b[i, ])
  var[i] <- 1 - l[1:nobs] %*% A0[i, ] - l[nobs + 1]
  for (j in seq_len(length(thetas))) {
    pl <- solve(pB[[j]], pb[[j]][i, ])
    pvar[i, j] <- 1 - pl[1:nobs] %*% pA0[[j]][i, ] - pl[nobs + 1]
  }
}
dvar <- list()
for (i in seq_len(length(thetas))) {
  dvar[[i]] <- (pvar[, i] - var) / (thetas[i] * perturbation)
}
```
Finally, the partial derivatives of the kriging variance are used to approximate VKV at the 200 evaluation points (Equation \@ref(eq:varkrigingvar)). For this, the variances and covariances of the estimated semivariogram parameters are needed, estimated by the inverse of the Fisher information matrix. The Fisher information matrix computed in Subsection \@ref(UncertainSemivariogramParameters) for the simple random sample of 50 points is also used here.
Note that the variance-covariance matrix of estimated semivariogram parameters is computed from the sample for semivariogram estimation only. The spatial coverage sample of the second sampling round is not used for estimating the semivariogram, but for prediction only. Section \@ref(SamplingEstimationandPrediction) is about designing one sample, instead of two samples, that is used both for estimation of the model parameters and for prediction.
```{r}
invI <- solve(I)
VKV <- numeric(length = length(var))
for (i in seq_len(length(thetas))) {
  for (j in seq_len(length(thetas))) {
    VKVij <- invI[i, j] * dvar[[i]] * dvar[[j]]
    VKV <- VKV + VKVij
  }
}
MVKV0 <- mean(VKV)
```
For the simple random sample, the square root of MVKV equals `r formatC(sqrt(MVKV0), 3, format = "f")`. The mean kriging variance (MKV) at these points equals `r formatC(mean(var), 3, format = "f")`, so the uncertainty about the kriging variance is substantial. Hereafter, we will see how much MVKV can be reduced by optimising the sampling pattern with spatial simulated annealing.
As for logdet, the sample with the minimum value of MVKV can be searched for with **spsann** function `optimUSER`. The objective function `MVKV` is defined in package **sswr**. Argument `points` specifies the size of the sample for semivariogram estimation. Argument `psample` specifies the sample used for prediction at the evaluation points (after the second round of sampling), and argument `esample` specifies the sample with evaluation points for estimating MVKV. The optimisation requires substantial computing time: with 200 evaluation points and the annealing schedule specified below, the computing time was 46.25 minutes (processor AMD Ryzen 5, 16 GB RAM).
```{r, echo = FALSE}
candi <- spsample(grdHunterValley, type = "regular",
cellsize = c(50, 50), offset = c(0.5, 0.5))
candi <- as.data.frame(candi)
names(candi) <- c("x", "y")
```
```{r, eval = FALSE}
schedule <- scheduleSPSANN(
initial.acceptance = c(0.8, 0.95),
initial.temperature = 0.002, temperature.decrease = 0.8,
chains = 300, chain.length = 2, stopping = 10,
x.min = 0, y.min = 0, cellsize = 50)
set.seed(314)
res <- optimUSER(
points = 50, candi = candi,
fun = MVKV,
psample = mysample_SC, esample = mysample_eval,
model = "Exp", thetas = thetas, perturbation = 0.01,
schedule = schedule, track = TRUE)
```
```{r, eval = FALSE, echo = FALSE}
write_rds(res, file = "results/MBSample_MVKV_phi200nug02_HunterValley_50pnts.rds")
```
```{r, echo = FALSE}
res <- read_rds(file = "results/MBSample_MVKV_phi200nug02_HunterValley_50pnts.rds")
MVKVopt <- tail(res$objective$energy$obj, 1)
```
Figure \@ref(fig:MBVariogram) shows the optimised sample. The minimised value of MVKV is `r formatC(100*MVKVopt/MVKV0, 0, format = "f")`\% of the value of the simple random sample used to illustrate the computations. The optimised sample points are clustered in an ellipse.
```{r MBVariogram, echo = FALSE, out.width = "100%", fig.cap = "Optimised sampling patterns of 50 points for semivariogram estimation, using the log of the determinant of the inverse Fisher information matrix of the semivariogram parameters (logdet) and the mean estimation variance of the kriging variance (MVKV) as minimisation criteria."}
res <- read_rds(file = "results/MBSample_logdet_phi200nug02_HunterValley_50pnts.rds")
mysample_logdet <- res$points
res <- read_rds(file = "results/MBSample_MVKV_phi200nug02_HunterValley_50pnts.rds")
mysample_MVKV <- res$points
mysamples <- rbind(mysample_logdet, mysample_MVKV)
mysamples$criterion <- rep(c("logdet", "MVKV"), each = 50)
ggplot(mysamples) +
geom_raster(data = as(grdHunterValley, "data.frame"), mapping = aes(x = s1 / 1000, y = s2 / 1000), fill = "grey") +
geom_point(mapping = aes(x = x / 1000, y = y / 1000), shape = 1, size = 1.5) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
facet_wrap(~ criterion) +
coord_fixed()
```
```{block2, type = 'rmdnote'}
Both minimisation criteria, logdet and MVKV, are a function of the semivariogram parameters $\pmb{\theta}$, showing that the problem is circular. Using a preliminary estimate of the semivariogram parameters, $\hat{\pmb{\theta}}$, leads to a locally optimal design at $\hat{\pmb{\theta}}$. For this reason, @bog99 and @zhu05 proposed a Bayesian approach, in which a multivariate prior distribution for the semivariogram parameters is postulated. The expected value over this distribution of the criterion is minimised. @lar02 computed the average of VKV over a number of semivariograms.
```
Both methods for sample optimisation rely, amongst other things, on the assumption that the mean and the variance are constant throughout the area. Under this assumption, it is not a problem that the sampling units are spatially clustered: we assume that the semivariogram estimated from the data collected in a small portion of the study area is representative for the whole study area. If we do not feel comfortable with this assumption, spreading the sampling units throughout the study area by the sampling methods described in the next two sections can be a good option.
#### Exercises {-}
3. Write an **R** script to design a model-based sample of 50 points for Hunter Valley, to estimate the semivariogram for a study variable. Use logdet as a minimisation criterion. Use as a prior estimate of the semivariogram an exponential model with a distance parameter of 200 m and a ratio of spatial dependence of 0.5. Compare the sample with the optimised sample in Figure \@ref(fig:MBVariogram), which was obtained with the same value for the distance parameter and a spatial dependence ratio of 0.8.
4. Repeat this for MVKV as a minimisation criterion. Use a spatial coverage sample of 100 points for prediction and a square grid of 200 points for evaluation.
## Optimisation of sampling pattern for semivariogram estimation and mapping {#SamplingEstimationandPrediction}
In practice, a reconnaissance survey for semivariogram estimation often is not feasible, so a single sample must be designed that is suitable both for estimating the semivariogram parameters and for mapping, i.e., prediction with the estimated semivariogram parameters at the nodes of a fine discretisation grid. Another reason for designing a single sample is that in a reconnaissance survey we can seldom afford a sample size large enough to obtain reliable estimates of the semivariogram parameters. @Papritz2011 found that for a sample size of 192 points the estimated variance components with balanced and unbalanced nested designs were highly uncertain. For this reason, it is attractive to also use the sampling points designed for spatial prediction (mapping) for estimating the semivariogram. Designing two samples, one for estimation of the semivariogram and one for spatial prediction, is suboptimal; designing one sample that can be used both for estimation of the semivariogram parameters and for prediction is potentially more efficient.
Finally, with nested sampling and IPP sampling, we aim at estimating the semivariogram of the residuals of a constant mean (see Equation \@ref(eq:ANOVAmodelnested)). In other words, with these designs we aim at estimating the parameters of a semivariogram model used in ordinary kriging. In situations where we have covariates that can partly explain the spatial variation of the study variable, kriging with an external drift is more appropriate. In these situations, the reconnaissance survey should be tailored to estimating both the regression coefficients associated with the covariates and the parameters of the residual semivariogram.
Model-based methods for designing a single sample for estimation of the model parameters and for prediction with the estimated model parameters have been proposed by, amongst others, @zim06, @zhu06, @zhu06b, and @Marchant2007. The methods differ in the minimisation criterion used. @zim06 proposed to minimise the kriging variance (at the centre of a square grid cell) augmented by an amount that accounts for the additional uncertainty in the kriging predictions due to uncertainty in the estimated semivariogram parameters, hereafter referred to as the augmented kriging variance\index{Augmented kriging variance} (AKV):
\begin{equation}
AKV(\mathbf{s}_0) = V_\mathrm{OK}(\mathbf{s}_0) + \mathrm{E}[\tau^2(\mathbf{s}_0)]\;,
(\#eq:augmentedvar)
\end{equation}
with $V_\mathrm{OK}(\mathbf{s}_0)$ the ordinary kriging variance, see Equation \@ref(eq:OKvariance), and $\mathrm{E}[\tau^2(\mathbf{s}_0)]$ the expectation of the additional variance component due to uncertainty about the semivariogram parameters estimated by ML. The additional variance component is approximated by a first order Taylor series:
\begin{equation}
\mathrm{E}[\tau^2(\mathbf{s}_0)]=\sum_{i=1}^p \sum_{j=1}^p \mathrm{Cov}(\theta_i,\theta_j) \frac{\partial \lambda^{\mathrm{T}}} {\partial \theta_i} \mathbf{A}\frac{\partial \lambda} {\partial \theta_j}\;,
(\#eq:tausq)
\end{equation}
with $\frac{\partial \lambda} {\partial \theta_j}$ the vector of partial derivatives of the kriging weights with respect to the $j$th semivariogram parameter. Comparing Equations \@ref(eq:tausq) and \@ref(eq:varkrigingvar) shows that the two variances differ. VKV quantifies our uncertainty about the estimated kriging variance, whereas $\mathrm{E}[\tau^2]$ quantifies our uncertainty about the kriging prediction due to uncertainty about the semivariogram parameters. I use the mean of the AKV over the nodes of a prediction grid (evaluation grid) as a minimisation criterion (MAKV). The same criterion can also be used in situations where we have maps of covariates that we want to use in prediction. In that case, the aim is to design a single sample that is used both for estimation of the *residual* semivariogram and for prediction by kriging with an external drift. The ordinary kriging variance $V_\mathrm{OK}(\mathbf{s}_0)$ in Equation \@ref(eq:augmentedvar) is then replaced by the prediction error variance with kriging with an external drift $V_\mathrm{KED}(\mathbf{s}_0)$, see Equation \@ref(eq:KEDvariance).
@zhu06 proposed as a minimisation criterion a linear combination of AKV (Equation \@ref(eq:augmentedvar)) and VKV (Equation \@ref(eq:varkrigingvar)), referred to as the estimation adjusted criterion\index{Estimation adjusted criterion} (EAC):
\begin{equation}
EAC(\mathbf{s}_0) = AKV(\mathbf{s}_0) + \frac{1}{2V_{\mathrm{OK}}(\mathbf{s}_0)} VKV(\mathbf{s}_0)\;.
(\#eq:EAC)
\end{equation}
Again, the mean of the EAC values (MEAC) over the nodes of a prediction grid (the evaluation grid) is used as a minimisation criterion.
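A minimal sketch of how MAKV and MEAC would be computed, assuming numeric vectors `var` (kriging variances), `tausq` (Equation \@ref(eq:tausq)), and `VKV` (Equation \@ref(eq:varkrigingvar)) are available for all evaluation points (they are computed step by step further on in this chapter):
```{r, eval = FALSE}
#sketch: MAKV and MEAC from per-point kriging variances (var), expected
#additional variance (tausq), and variance of the kriging variance (VKV)
AKV <- var + tausq
MAKV <- mean(AKV)
EAC <- AKV + VKV / (2 * var)
MEAC <- mean(EAC)
```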
Computing time for optimisation of the coordinates of a large sample, say, $>50$ points, can become prohibitively long. To reduce computing time, @zhu06 proposed a two-step approach. In the first step, for a fixed proportion $p \in (0,1)$ the locations of $(1-p)\;n$ points are optimised for prediction with given parameters, for instance by minimising MKV. This 'prediction sample' is supplemented with $p \;n$ points, so that the combined sample of size $n$ minimises logdet or MVKV. This is repeated for different values of $p$. In the second step, MEAC is computed for the combined samples of size $n$, and the proportion and the associated sample with minimum MEAC are selected.
A simplification of this two-step approach is to select in the first step a square grid or a spatial coverage sample (Section \@ref(SpatialCoverage)), and to supplement this sample by a fixed number of points whose coordinates are optimised by spatial simulated annealing (SSA), using either MAKV or MEAC computed from both samples (grid sample or spatial coverage sample plus supplemental sample) as a minimisation criterion. In SSA the grid or spatial coverage sample is fixed, i.e., the locations are not further optimised. @Lark2018 recommended as a rule of thumb to add about 10\% of the fixed sample as short distance points.
The following code chunks show how the AKV and EAC can be computed. First, a spatial coverage sample of 90 points is selected using function `stratify` of package **spcosa**, see Section \@ref(SpatialCoverage). In addition, a simple random sample of 10 points is selected. This sample is the initial supplemental sample, whose locations are optimised. As before, a square grid of 200 points is used as an evaluation sample.
```{r}
library(spcosa)
set.seed(314)
mystrata <- stratify(grdHunterValley, nStrata = 90, equalArea = FALSE, nTry = 10)
mysample_SC <- as(spsample(mystrata), "SpatialPoints")
nsup <- 10
units <- sample(nrow(grdHunterValley), nsup)
mysample_sup0 <- as(grdHunterValley[units, ], "SpatialPoints")
mysample_eval <- spsample(
x = grdHunterValley, n = 200, type = "regular", offset = c(0.5, 0.5))
```
The next step is to compute the inverse of the Fisher information matrix, given a preliminary semivariogram model; this inverse is used as the variance-covariance matrix of the estimated semivariogram parameters. Contrary to Section \@ref(samplingforvariogram), now *all* sampling locations are used to compute this matrix. The locations of the spatial coverage sample and the supplemental sample are merged into one `SpatialPoints` object.
```{r}
mysample <- rbind(mysample_sup0, mysample_SC)
```
To learn how the Fisher information matrix is computed, refer to the code chunks in Section \@ref(samplingforvariogram). The inverse of this matrix can be computed with function `solve`.
```{r, echo = FALSE}
D <- spDists(mysample)
vgmdl <- vgm(model = "Exp", psill = thetas[1], range = thetas[2], nugget = 1 - thetas[1])
A <- variogramLine(vgmdl, dist_vector = D, covariance = TRUE)
#compute perturbed covariance matrices and their finite-difference derivatives
pA <- dA <- list()
for (i in seq_len(length(thetas))) {
thetas_pert <- thetas
thetas_pert[i] <- (1 + perturbation) * thetas[i]
vgmdl <- vgm(model = "Exp", psill = thetas_pert[1],
range = thetas_pert[2], nugget = 1 - thetas_pert[1])
pA[[i]] <- variogramLine(vgmdl, dist_vector = D, covariance = TRUE)
dA[[i]] <- (pA[[i]] - A) / (thetas[i] * perturbation)
}
#Fisher information matrix: I[i, j] = 0.5 tr(A^-1 dA_i A^-1 dA_j)
I <- matrix(0, length(thetas), length(thetas))
for (i in seq_len(length(thetas))) {
m_i <- solve(A, dA[[i]])
for (j in i:length(thetas)) {
m_j <- solve(A, dA[[j]])
I[i, j] <- I[j, i] <- 0.5 * sum(diag(m_i %*% m_j))
}
}
#inverse of Fisher information matrix: variance-covariance matrix of the
#estimated semivariogram parameters
invI <- solve(I)
nobs <- length(mysample)
#extended matrix of the ordinary kriging system; the zero in the lower-right
#corner completes the unbiasedness constraint (kriging weights sum to one)
nrowB <- nobs + 1
B <- matrix(data = 1, nrow = nrowB, ncol = nrowB)
B[1:nobs, 1:nobs] <- A
B[nrowB, nrowB] <- 0
#compute matrix with covariances between prediction nodes and sampling points
D0 <- spDists(x = mysample_eval, y = mysample)
vgmodel <- vgm(model = "Exp", psill = thetas[1], range = thetas[2], nugget = 1 - thetas[1])
A0 <- variogramLine(vgmodel, dist_vector = D0, covariance = TRUE)
#compute perturbed B and perturbed b
pB <- pA0 <- pb <- list()
for (i in seq_len(length(thetas))) {
pB[[i]] <- B
pB[[i]][1:nobs, 1:nobs] <- pA[[i]]
thetas_pert <- thetas
thetas_pert[i] <- (1 + perturbation) * thetas[i]
vgmodel_pert <- vgm(model = "Exp", psill = thetas_pert[1], range = thetas_pert[2], nugget = 1 - thetas_pert[1])
pA0[[i]] <- variogramLine(vgmodel_pert, dist_vector = D0, covariance = TRUE)
pb[[i]] <- cbind(pA0[[i]], 1)
}
```
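The diagonal of `invI` contains the approximate variances of the estimated semivariogram parameters, so their square roots are approximate standard errors. The next line is merely a quick inspection of the Fisher information computation, not part of the original workflow:

```{r, eval = FALSE}
#approximate standard errors of the estimated semivariogram parameters
sqrt(diag(invI))
```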
In the next code chunk, for each evaluation point the kriging weights (`L`), the kriging variance (`var`), the perturbed kriging weights (`pL`), and the perturbed kriging variances (`pvar`) are computed. In the final lines, the partial derivatives of the kriging weights (`dL`) and the kriging variances (`dvar`) with respect to the semivariogram parameters are computed. The partial derivatives of the kriging variances with respect to the semivariogram parameters are needed for computing VKV, see Equation \@ref(eq:varkrigingvar), which in turn is needed for computing criterion EAC, see Equation \@ref(eq:EAC).
```{r}
L <- matrix(nrow = length(mysample_eval), ncol = nobs)
pL <- array(
dim = c(length(mysample_eval), length(mysample), length(thetas)))
var <- numeric(length = length(mysample_eval))
pvar <- matrix(nrow = length(mysample_eval), ncol = length(thetas))
for (i in seq_len(length(mysample_eval))) {
b <- c(A0[i, ], 1)
l <- solve(B, b)
L[i, ] <- l[1:nobs]
var[i] <- 1 - l[1:nobs] %*% A0[i, ] - l[-(1:nobs)]
for (j in seq_len(length(thetas))) {
pl <- solve(pB[[j]], pb[[j]][i, ])
pL[i, , j] <- pl[1:nobs]
pvar[i, j] <- 1 - pl[1:nobs] %*% pA0[[j]][i, ] - pl[-(1:nobs)]
}
}
dL <- dvar <- list()
for (i in seq_len(length(thetas))) {
dL[[i]] <- (pL[, , i] - L) / (thetas[i] * perturbation)
dvar[[i]] <- (pvar[, i] - var) / (thetas[i] * perturbation)
}
```
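Because of the unbiasedness constraint in the ordinary kriging system, the weights in each row of `L` should sum to one. This can be used as a quick sanity check (not part of the original workflow):

```{r, eval = FALSE}
summary(rowSums(L))
```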
In the next code chunk, the expected variance due to uncertainty about the semivariogram parameters (Equation \@ref(eq:tausq)) is computed.
```{r}
tausq <- numeric(length = length(mysample_eval))
tausqk <- 0
for (k in seq_len(length(mysample_eval))) {
for (i in seq_len(length(dL))) {
for (j in seq_len(length(dL))) {
tausqijk <- invI[i, j] * t(dL[[i]][k, ]) %*% A %*% dL[[j]][k, ]
tausqk <- tausqk + tausqijk
}
}
tausq[k] <- tausqk
tausqk <- 0
}
```
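The triple loop can also be written more compactly. The following equivalent reformulation, which is not part of the original code, collects for each evaluation point the weight derivatives in a matrix `G` and evaluates the double sum over the parameter pairs in one step; it should reproduce `tausq` exactly:

```{r, eval = FALSE}
tausq_alt <- sapply(seq_len(length(mysample_eval)), function(k) {
  #columns of G: partial derivatives of the kriging weights at evaluation point k
  G <- sapply(dL, function(d) d[k, ])
  sum(invI * (t(G) %*% A %*% G))
})
all.equal(tausq, tausq_alt)
```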
The AKVs are computed by adding the extra variances due to semivariogram uncertainty to the kriging variances (Equation \@ref(eq:augmentedvar)). Next, the VKV and EAC values are computed. Both AKV and EAC differ among the evaluation points; as a summary, the mean of each is computed.
```{r}
augmentedvar <- var + tausq
MAKV0 <- mean(augmentedvar)
VKV <- numeric(length = length(var))
for (i in seq_len(length(dvar))) {
for (j in seq_len(length(dvar))) {
VKVij <- invI[i, j] * dvar[[i]] * dvar[[j]]
VKV <- VKV + VKVij
}
}
EAC <- augmentedvar + (VKV / (2 * var))
MEAC0 <- mean(EAC)
```
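To see how much the semivariogram uncertainty contributes to the augmented kriging variance, the ratio of the two components can be summarised (a quick inspection, not part of the original workflow):

```{r, eval = FALSE}
summary(tausq / augmentedvar)
```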
For the spatial coverage sample of 90 points supplemented by a simple random sample of 10 points, MAKV equals `r formatC(MAKV0, 3, format = "f")` and MEAC equals `r formatC(MEAC0, 3, format = "f")`.
The sample can be optimised with **spsann** function `optimUSER`. Argument `points` is a list with a data frame or matrix containing the coordinates of the fixed points (assigned to subargument `fixed`) and an integer specifying the number of supplemental points whose locations are optimised (assigned to subargument `free`). As stressed above, an important difference from Section \@ref(samplingforvariogram) is that the free and the fixed sample are merged and used together, both for estimation of the semivariogram and for prediction. The objective functions `MAKV` and `MEAC` are defined in package **sswr**. Computing time for minimisation of MAKV was 28.16 minutes and for MEAC 28.88 minutes.
```{r, eval = FALSE}
candi <- spsample(grdHunterValley, type = "regular",
cellsize = c(50, 50), offset = c(0.5, 0.5))
candi <- as.data.frame(candi)
names(candi) <- c("x", "y")
schedule <- scheduleSPSANN(
initial.acceptance = c(0.8, 0.95),
initial.temperature = 0.008, temperature.decrease = 0.8,
chains = 300, chain.length = 10, stopping = 10,
x.min = 0, y.min = 0, cellsize = 50)
fixed <- coordinates(mysample_SC)
colnames(fixed) <- c("x", "y")
pnts <- list(fixed = fixed, free = nsup)
set.seed(314)
res <- optimUSER(
points = pnts, candi = candi,
fun = MAKV,
esample = mysample_eval,
model = "Exp", thetas = thetas, perturbation = 0.01,
schedule = schedule, track = TRUE)
```
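After SSA has finished, the optimised sample can be extracted from the output object: column `free` of `res$points` indicates which rows are the supplemental (optimised) points, and the trace of the minimisation criterion is stored in `res$objective$energy`, its last value being the minimised MAKV. This is the same structure as used in the code producing Figure \@ref(fig:FigMBEK).

```{r, eval = FALSE}
mysample_opt <- res$points
mysup_opt <- mysample_opt[mysample_opt$free == 1, c("x", "y")]
tail(res$objective$energy$obj, 1)
```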
```{r, eval = FALSE, echo = FALSE}
save(res, mysample_SC, mystrata, file = "results/MBSample_MAKV_phi200nug02_HunterValley.rda")
```
```{r, echo = FALSE, eval = FALSE}
schedule <- scheduleSPSANN(
initial.acceptance = c(0.8, 0.95),
initial.temperature = 0.015,
temperature.decrease = 0.8,
chains = 300,
chain.length = 10,
stopping = 10,
x.min = 0, y.min = 0,
cellsize = 50)
set.seed(314)
res <- optimUSER(
points = pnts,
candi = candi,
fun = MEAC,
esample = mysample_eval,
model = "Exp",
thetas = thetas,
perturbation = 0.01,
schedule = schedule,
track = TRUE)
save(res, mysample_SC, mystrata, file = "results/MBSample_MEAC_phi200nug02_HunterValley.rda")
```
Figure \@ref(fig:FigMBEK) shows for Hunter Valley a spatial coverage sample of 90 points, supplemented by 10 points whose locations are optimised by SSA, using MAKV and MEAC as minimisation criteria.
```{r FigMBEK, echo = FALSE, out.width = "100%", fig.cap = "Optimised sampling pattern of 10 points supplemented to a spatial coverage sample of 90 points, for semivariogram estimation and prediction, using the mean augmented kriging variance (MAKV) and the mean estimation adjusted criterion (MEAC) as minimisation criteria. The prior semivariogram used in optimising the sampling pattern of the supplemental sample is an exponential semivariogram with a distance parameter of 200 m and a ratio of spatial dependence of 0.8."}
load(file = "results/MBSample_MAKV_phi200nug02_HunterValley.rda")
mysample <- res$points
MAKVopt <- tail(res$objective$energy$obj, 1)
units <- which(mysample$free == 1)
mysupsample_AKV <- mysample[units, c("x", "y")]
myscsample_AKV <- mysample[-units, c("x", "y")]
#compute shortest distance from supplemental point to spatial coverage points
D <- spDists(x = as(mysupsample_AKV, "matrix"), y = as(myscsample_AKV, "matrix"))
DminAV <- apply(D, MARGIN = 1, FUN = min)
#compute distance between neighbouring spatial coverage points
DSC <- spDists(x = as(myscsample_AKV, "matrix"), y = as(myscsample_AKV, "matrix"))
diag(DSC) <- NA
DSCmin <- apply(DSC, MARGIN = 1, FUN = min, na.rm = TRUE)
mDSCmin <- mean(DSCmin)
load(file = "results/MBSample_MEAC_phi200nug02_HunterValley.rda")
mysample <- res$points
MEACopt <- tail(res$objective$energy$obj, 1)
units <- which(mysample$free == 1)
mysupsample_EAC <- mysample[units, c("x", "y")]
myscsample_EAC <- mysample[-units, c("x", "y")]
#compute shortest distance from supplemental point to spatial coverage points
D <- spDists(x = as(mysupsample_EAC, "matrix"), y = as(myscsample_EAC, "matrix"))
DminEAC <- apply(D, MARGIN = 1, FUN = min)
mysupsamples <- rbind(mysupsample_AKV, mysupsample_EAC)
mysupsamples$criterion <- rep(c("MAKV", "MEAC"), each = 10)
myscsamples <- rbind(myscsample_AKV, myscsample_EAC)
myscsamples$criterion <- rep(c("MAKV", "MEAC"), each = 90)
plot(mystrata) +
geom_point(data = myscsamples, mapping = aes(x = x, y = y), size = 1) +
geom_point(data = mysupsamples, mapping = aes(x = x, y = y), shape = 1, size = 1, colour = "red") +
coord_fixed() +
facet_wrap(~ criterion)
```
The frequency distribution of the shortest distance from the supplemental points to the spatial coverage sample is shown in Figure \@ref(fig:HistogramShortestDistance). With both criteria several supplemental points are at very short distance from a spatial coverage point, whereas the remaining points are at large distances from the spatial coverage points. The average distance between neighbouring spatial coverage sampling points equals `r formatC(mDSCmin, 0, format = "f")` m.
```{r HistogramShortestDistance, echo = FALSE, out.width = "100%", fig.asp = 0.5, fig.cap = "Frequency distributions of the shortest distance to the spatial coverage sample of the supplemental sample, optimised with the mean augmented kriging variance (MAKV) and the mean estimation adjusted criterion (MEAC)."}
df <- data.frame(MAKV = DminAV, MEAC = DminEAC)
dflf <- df %>% pivot_longer(cols = c("MAKV", "MEAC"))
ggplot(data = dflf) +
geom_histogram(aes(x = value), fill = "black", alpha = 0.5, colour = "black", binwidth = 25) +
scale_y_continuous(name = "Count") +
xlab("Minimum distance (m)") +
facet_grid(~ name)
```
MAKV of the optimised sample equals `r formatC(MAKVopt, 3, format = "f")`, which is `r formatC(MAKVopt/MAKV0*100, 0, format = "f")`\% of the MAKV of the initial sample. MEAC of the optimised sample equals `r formatC(MEACopt, 3, format = "f")`, which is `r formatC(MEACopt/MEAC0*100, 0, format = "f")`\% of the MEAC of the initial sample. The reduction of these two criteria through the optimisation is much smaller than that of logdet and MVKV in Section \@ref(samplingforvariogram). This can be explained by the small number of sampling locations that is optimised: only the locations of 10 points are optimised, while the other 90 are fixed. In Section \@ref(samplingforvariogram) all 100 locations were optimised.
#### Exercises {-}
5. Write an **R** script to select from Hunter Valley a spatial coverage sample of 80 points supplemented by 20 points. Use MEAC as a minimisation criterion and an exponential semivariogram with a distance parameter of 200 m and a ratio of spatial dependence of 0.8. Compare the minimised MEAC with the MEAC reported above, obtained by supplementing a spatial coverage sample of 90 points with 10 points.
## A practical solution
The optimised samples shown above suggest a straightforward, practical sampling design for both estimation of the model parameters and prediction: a spatial coverage sample supplemented with points at some chosen, fixed, short distance from randomly selected points of the spatial coverage sample. Figure \@ref(fig:SpatialCoveragePlus) shows an example. A simple random subsample without replacement of size 10 is selected from the 90 points of the spatial coverage sample. These points are used as starting points, from each of which a supplemental point is selected at a distance of 20 m in a random direction.
```{r, echo = FALSE}
mysample_SC <- myscsample_AKV
```
```{r}
h <- 20
m <- 10
set.seed(314)
units <- sample(nrow(mysample_SC), m, replace = FALSE)
mySCsubsample <- mysample_SC[units, ]
dxy <- matrix(nrow = m, ncol = 2)
angle <- runif(n = m, min = 0, max = 2 * pi)
dxy[, 1] <- h * sin(angle); dxy[, 2] <- h * cos(angle)
mysupsample <- mySCsubsample + dxy
```
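As a quick check, not in the original workflow, the distance between each supplemental point and its starting point should equal `h`:

```{r, eval = FALSE}
summary(sqrt(rowSums((mysupsample - mySCsubsample)^2)))
```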
```{r SpatialCoveragePlus, echo = FALSE, fig.width = 5, fig.cap = "Spatial coverage sample of 90 points supplemented by 10 points at short distance (20 m) from randomly selected spatial coverage points."}
#pdf(file = "SpatialCoveragePlusSample.pdf", width = 5, height = 5)
plot(mystrata) +
geom_point(data = mysample_SC, mapping = aes(x = x, y = y), size = 1) +
geom_point(data = mysupsample, mapping = aes(x = x, y = y), shape = 1, size = 1, colour = "red")
#dev.off()
```
```{r, echo = FALSE}
mysample <- rbind(mysupsample, mysample_SC)
MAKVpract <- MAKV(points = mysample,
  esample = mysample_eval,
  model = "Exp",
  thetas = thetas,
  perturbation = 0.01)
MEACpract <- MEAC(points = mysample,
  esample = mysample_eval,
  model = "Exp",
  thetas = thetas,
  perturbation = 0.01)
```
MAKV of this sample equals `r formatC(MAKVpract, 3, format = "f")`, and MEAC equals `r formatC(MEACpract, 3, format = "f")`. For MAKV, `r formatC(100*(MAKVpract-MAKVopt)/(MAKV0-MAKVopt), 0, format = "f")`\% of the maximal reduction is realised by this practical solution; for MEAC, this is `r formatC(100*(MEACpract-MEACopt)/(MEAC0-MEACopt), 0, format = "f")`\%.
```{r, echo = FALSE}
rm(list = ls())
```