# Spatial interpolation in 3D using Ensemble ML
```{r, results = "asis", echo = FALSE}
status("drafting")
```
```{r, include=FALSE, message=FALSE, results='hide'}
ls <- c("rgdal", "raster", "plotKML", "ranger", "mlr", "forestError", "randomForestSRC",
"xgboost", "glmnet", "matrixStats", "landmap", "yardstick", "Cubist", "ggfortify",
"hexbin", "parallelMap", "Metrics", "fastSave", "devtools", "openair")
new.packages <- ls[!(ls %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos = "https://cloud.r-project.org")
lapply(ls, require, character.only = TRUE)
#load.pigz("eml_data.RData")
source("PSM_functions.R")
```
## Mapping concentrations of geochemical elements
Ensemble ML can also be used for mapping soil variables in 3D. Consider for example
the Geochemical and mineralogical data set for USA48 [@smith2014geochemical]. This is a public data set
produced and maintained by the [U.S. Geological Survey](https://mrdata.usgs.gov/ngdb/soil/) and contains laboratory measurements
of chemical elements and minerals in soil at 4,857 sites (for three depths: 0 to 5 cm,
A horizon and C horizon; Fig. \@ref(fig:ds801-example)).
We have previously imported the sampling points and overlaid them with a large stack of
covariate layers using e.g.:
```{r}
## Training data ----
ngs.m <- read.csv(gzfile("./input/ds_801_all.csv.gz"))
ngs.m$olc_id = olctools::encode_olc(ngs.m$latitude, ngs.m$longitude, 11)
## Spatial overlay ----
sel.pnts = !duplicated(ngs.m$olc_id)
summary(sel.pnts)
ngs.m.pnts = ngs.m[which(sel.pnts), c("olc_id", "longitude", "latitude")]
coordinates(ngs.m.pnts) <- ~ longitude + latitude
proj4string(ngs.m.pnts) <- "+init=epsg:4326"
prj = "+proj=aea +lat_0=23 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs"
ngs.xy = spTransform(ngs.m.pnts, prj)
```
```{r, eval=FALSE}
tif.lst = list.files("./1km", glob2rx("*.tif$"), full.names = TRUE)
## 262 layers
ov.tmp = parallel::mclapply(1:length(tif.lst), function(j){ terra::extract(terra::rast(tif.lst[j]), terra::vect(ngs.xy)) }, mc.cores = 80)
ov.tmp = dplyr::bind_cols(lapply(ov.tmp, function(i){i[,2]}))
names(ov.tmp) = tools::file_path_sans_ext(basename(tif.lst))
ov.tmp$olc_id = ngs.xy$olc_id
reg.matrix = plyr::join(ngs.m, ov.tmp)
saveRDS.gz(reg.matrix, "./input/ds801_geochem1km.rds")
```
We can load the regression matrix, which contains all target variables and covariates:
```{r}
ds801 = readRDS("./input/ds801_geochem1km.rds")
dim(ds801)
```
which has a total of 4818 unique locations:
```{r}
str(levels(as.factor(ds801$olc_id)))
```
The individual records can be browsed directly via <https://mrdata.usgs.gov/ngdb/soil/>;
for example, a single record includes:
```{r ds801-example, echo = FALSE, out.width = "100%", fig.cap = "Example of a geochemical sample with observations, measurements and coordinates of the site."}
knitr::include_graphics("./img/Fig_geochemical_USGS_record.jpg")
```
Covariates prepared to help interpolation of the geochemicals include:
- Distance to cities [@nelson2019suite];
- [MODIS LST](https://doi.org/10.5281/zenodo.1420114) (monthly daytime and nighttime);
- [MODIS EVI](https://lpdaac.usgs.gov/products/mod13q1v006/) (long-term monthly values);
- [Lights at night images](https://eogdata.mines.edu/products/dmsp/);
- [Snow occurrence probability](https://climate.esa.int/en/odp/#/project/snow);
- Soil property and class maps for USA48 [@ramcharan2018soil];
- Terrain / hydrological indices.
We can focus on predicting the concentration of lead (Pb). Pb seems to change
with depth, and this seems to be consistent across different land cover classes:
```{r cor-depth, echo=TRUE, fig.width=8, out.width="100%", fig.cap="Distribution of Pb as a function of land cover classes."}
openair::scatterPlot(ds801[ds801$pb_ppm<140,], x = "hzn_depth", y = "pb_ppm", method = "hexbin", col = "increment", log.x=TRUE, log.y=TRUE, xlab="Depth", ylab="Pb [ppm]", z.lim=c(0,100), type="landcover1")
```
Because we are aiming at producing predictions of geochemical elements for different
depths in soil, we can also use the depth of the soil sample as one of the covariates.
This makes the prediction system 3D and the model can thus be used to predict at
any new 3D location ($X, Y, d$). To fit an RF model to this data we can use:
```{r}
ds801$log.pb = log1p(ds801$pb_ppm)
pr.vars = c(readRDS("./input/pb.pr.vars.rds"), "hzn_depth")
sel.pb = complete.cases(ds801[,c("log.pb", pr.vars)])
mrf = ranger::ranger(y=ds801$log.pb[sel.pb], x=ds801[sel.pb, pr.vars],
num.trees = 85, importance = 'impurity')
mrf
```
which results in an R-square of about 0.67. Because many training samples have exactly
the same coordinates (same site, three depths), we assume that this model is
over-fitting, i.e. that the out-of-bag accuracy is [probably over-optimistic](https://opengeohub.github.io/spatial-sampling-ml/).
Instead, we can fit an Ensemble model where we block points within 30 by 30 km blocks:
```{r, cache=TRUE, warning=FALSE, message=FALSE}
if(!exists("eml.pb")){
lrn.rf = mlr::makeLearner("regr.ranger", num.trees=85, importance="impurity",
num.threads = parallel::detectCores())
lrns.pb <- list(lrn.rf, mlr::makeLearner("regr.xgboost"), mlr::makeLearner("regr.cvglmnet"))
tsk0.pb <- mlr::makeRegrTask(data = ds801[sel.pb, c("log.pb", pr.vars)],
target = "log.pb", blocking = as.factor(ds801$ID[sel.pb]))
init.pb <- mlr::makeStackedLearner(lrns.pb, method="stack.cv", super.learner="regr.lm",
resampling=mlr::makeResampleDesc(method="CV", blocking.cv=TRUE))
parallelMap::parallelStartSocket(parallel::detectCores())
eml.pb = train(init.pb, tsk0.pb)
parallelMap::parallelStop()
}
```
```{r}
summary(eml.pb$learner.model$super.model$learner.model)
```
which shows a somewhat lower R-square of 0.44; this time whole sites have
been taken out during cross-validation, and hence this seems to be a more realistic estimate of the
mapping accuracy [@roberts2017cross]. The accuracy plot shows that the model has
some problems with predicting higher values, but overall matches the observed values:
```{r, message=FALSE, echo=TRUE}
t.pb = quantile(ds801$log.pb, c(0.001, 0.01, 0.999), na.rm=TRUE)
plot_hexbin(varn="log.pb", breaks=c(t.pb[1], seq(t.pb[2], t.pb[3], length=25)),
meas=eml.pb$learner.model$super.model$learner.model$model$log.pb,
pred=eml.pb$learner.model$super.model$learner.model$fitted.values,
main="Pb [EML]")
```
```{r ac-pb1, echo=FALSE, fig.width=6, fig.cap="Accuracy plot for Pb concentration in soil fitted using Ensemble ML.", out.width="90%"}
knitr::include_graphics("./img/plot_CV_log.pb.png")
```
The variables most important for explaining the distribution of the target variable (based on the variable importance)
seem to be soil depth (`hzn_depth`), soil type maps, annual daytime temperature and travel time to cities:
```{r varimp-pb, echo=TRUE, fig.width=8, out.width="90%", fig.cap="Variable importance for 3D prediction model for Pb concentrations."}
library(ggplot2)
xl.pb <- as.data.frame(mlr::getFeatureImportance(eml.pb[["learner.model"]][["base.models"]][[1]])$res)
xl.pb$relative_importance = 100*xl.pb$importance/sum(xl.pb$importance)
xl.pb = xl.pb[order(xl.pb$relative_importance, decreasing = T),]
xl.pb$variable = paste0(c(1:nrow(xl.pb)), ". ", xl.pb$variable)
ggplot(data = xl.pb[1:20,], aes(x = reorder(variable, relative_importance), y = relative_importance)) +
geom_bar(fill = "steelblue",
stat = "identity") +
coord_flip() +
labs(title = "Variable importance",
x = NULL,
y = NULL) +
theme_bw() + theme(text = element_text(size=15))
```
If we plot the travel time to cities vs Pb concentrations, we can clearly see that
Pb is negatively correlated with travel time to cities (following a log-log linear relationship):
```{r cor-cities, echo=TRUE, fig.width=8, out.width="100%", fig.cap="Distribution of Pb as a function of travel time to cities for different depths."}
openair::scatterPlot(ds801[ds801$pb_ppm<140,], x = "travel_time_to_cities_1_usa48", y = "pb_ppm", method = "hexbin", col = "increment", log.x=TRUE, log.y=TRUE, xlab="Travel time to cities 1", ylab="Pb [ppm]", type="hzn_depth")
```
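The strength of this log-log relationship can be quantified with a simple linear model; a minimal sketch, using the same subset as in the plot above:
```{r, eval=FALSE}
## sketch: log-log linear model of Pb vs travel time to cities
m.tt = lm(log1p(pb_ppm) ~ log1p(travel_time_to_cities_1_usa48),
          data = ds801[ds801$pb_ppm < 140,])
summary(m.tt)$coefficients
```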
## Predictions in 3D
To produce predictions we can focus on the area around the Chicago conurbation. We can
load the covariate layers by using:
```{r}
g1km = readRDS("./input/chicago_grid1km.rds")
```
This contains all layers we used for training. We can generate predictions by
adding a depth column, then writing the predictions to GeoTIFFs:
```{r}
for(k in c(5, 30, 60)){
out.tif = paste0("./output/pb_ppm_", k, "cm_1km.tif")
if(!file.exists(out.tif)){
g1km$hzn_depth = k
sel.na = complete.cases(g1km)
newdata = g1km[sel.na, eml.pb$features]
pred = predict(eml.pb, newdata=newdata)
g1km.sp = SpatialPixelsDataFrame(as.matrix(g1km[sel.na,c("x","y")]),
data=pred$data, proj4string=CRS("EPSG:5070"))
g1km.sp$pred = expm1(g1km.sp$response)
rgdal::writeGDAL(g1km.sp["pred"], out.tif, type="Int16", mvFlag=-32768, options=c("COMPRESS=DEFLATE"))
#gc()
}
}
```
This finally gives the following pattern:
```{r pred-pb1, echo=FALSE, fig.width=6, fig.cap="Predictions of Pb concentration for different soil depths based on Ensemble ML. Points indicate training points used to build the predictive mapping model. Red color indicates high values. Values of Pb clearly drop with soil depth.", out.width="90%"}
knitr::include_graphics("./img/Pb_predictions_d1.jpg")
knitr::include_graphics("./img/Pb_predictions_d2.jpg")
knitr::include_graphics("./img/Pb_predictions_d3.jpg")
```
Based on these results, it can be said that in general:
1. The distribution of Pb across the USA seems to be controlled by a mixture of factors
including climatic factors, soil-terrain units and anthropogenic factors (travel distance to cities);
2. Overall, big urban areas show significantly higher concentrations of some heavy
metals, and this relationship is log-log linear;
3. Soil depth comes out as the overall most important covariate for some geochemical
elements, hence mapping soil variables in 3D is fully justified.
## Modeling multiple geochemicals using a single ML model
The package **randomForestSRC** [@Ishwaran2021] provides functionality for so-called
["Multivariate Outcomes"](https://www.randomforestsrc.org/articles/getstarted.html#multivariate-outcomes), which means that one can fit a model with multivariate outcomes (one model to predict multiple target variables).
In the case of geochemical data, a multitude of geochemicals (50+) is usually analyzed in the lab, so
it seems more practical to fit a single model that predicts all $p$ target variables (outcomes) at once than
to fit $p$ separate models.
In the case of the `ds801` dataset we can first specify the target variables that we would like to map:
```{r}
tvs = c("as_ppm", "al_pct", "cd_ppm", "cr_ppm", "cu_ppm", "fe_pct",
"mo_ppm", "ni_ppm", "se_ppm", "th_ppm", "zn_ppm", "pb_ppm")
## number of missing values per variable:
sapply(tvs, function(i){sum(is.na(ds801[,i]))})
```
These variables are obviously highly cross-correlated:
```{r corPlot, echo=FALSE, fig.width=7, out.width="100%", fig.cap="Correlation matrix for a selection of geochemicals. Numbers indicate correlation coefficients (0 to 100%). All variables seem to be positively correlated."}
ds801.x <- scale(log1p(ds801[,tvs]))
openair::corPlot(as.data.frame(ds801.x))
```
Here especially `zn`, `cu`, `fe`, `al` and `ni` seem to be highly cross-correlated.
The **multicollinearity** of these variables can be reduced by converting them e.g.
to Principal Components [@Venables2002Springer]. Principal Component Analysis helps
reduce redundancy in the data and can thus help with modeling.
Before transforming the original values to Principal Components we can also filter out
all missing values by replacing them with the mean value (for a scaled variable this is 0):
```{r}
ds801.x[is.na(ds801.x)] <- 0
ds801.xd <- as.data.frame(ds801.x)
pcs.ds801 <- stats::prcomp(x = ds801.xd)
```
```{r biplot, echo=FALSE, fig.width=6, out.width="90%", fig.cap="Results of PCA in a biplot."}
ggplot2::autoplot(pcs.ds801, size=1, asp=1, colour="darkgrey", shape=3, loadings = TRUE, loadings.label = TRUE)
```
This again shows that most of the geochemicals are positively correlated, with about
6 groups of highly correlated variables.
Note that, in the previous step, we converted all log-normal or percent variables
using the `log1p` and/or logit transformation to: (1) produce a close-to-normal distribution
and (2) avoid producing predictions beyond the physical range of the data (e.g. negative values or values >100%).
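For illustration, a minimal sketch of the logit transform for a percent variable (here `al_pct`; `qlogis`/`plogis` are base R's logit and inverse logit, and `eps` keeps values off the 0/1 bounds):
```{r, eval=FALSE}
## sketch: logit-transform a percent variable so that back-transformed
## predictions stay within 0--100%
eps = 1e-4
al.p = pmin(pmax(ds801$al_pct / 100, eps), 1 - eps)
al.logit = qlogis(al.p)           ## forward: log(p / (1 - p))
al.back = plogis(al.logit) * 100  ## back-transform to percent
```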
The first 6 components explain the majority of variance in the data, but to be on the safe side
we can model up to PC11:
```{r}
summary(pcs.ds801)
```
As a rule of thumb, we advise focusing on components that explain >95% of the variance in the data;
to be certain one can also use >99% as the threshold.
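A minimal sketch of picking the number of components for such a threshold:
```{r, eval=FALSE}
## sketch: smallest number of PCs that reach 99% of cumulative variance
cum.var = cumsum(pcs.ds801$sdev^2) / sum(pcs.ds801$sdev^2)
which(cum.var >= 0.99)[1]
```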
Note the steps we have taken so far: (1) we first transformed the target variables to achieve a close-to-normal distribution,
(2) we scaled all values to mean = 0 and standard deviation = 1, (3) we replaced missing values with the
mean value (0 in this case), and (4) we derived Principal Components using all data so nothing is lost.
Next, we can fit a model using the randomForestSRC package and all PCs except the last (which is
usually pure noise):
```{r}
nComp = 11; Ntree = 85
xdta = ds801[,pr.vars]
ydta = pcs.ds801$x[,1:nComp]
tfm = get.mv.formula(colnames(ydta))
## spatial tiles 30x30km
sp.tiles = unique(ds801$ID)
## we define own custom bootstrapping principle by
## organizing training points using spatial blocks;
samp = randomForestSRC:::make.sample(ntree=Ntree, samp.size=length(sp.tiles),
boot.size=length(sp.tiles)/2)
samp.df = plyr::join(ds801["ID"], cbind(data.frame(ID=sp.tiles), data.frame(samp)), type="left")
samp.df = as.matrix(samp.df[,-1])
## sampsize must be identical for each tree
n.col = min(colSums(samp.df))
samp.dfa = apply(X = samp.df, MARGIN = 2, FUN = function(i){fill.null(i, n.col)})
m.s1 = randomForestSRC::rfsrc(tfm, data.frame(ydta, xdta),
bootstrap = "by.user",
samp = samp.dfa,
importance=TRUE)
print(m.s1, outcome.target = "PC1")
```
This basically fits a multivariate RF of the form `rfsrc(Multivar(y1, y2, ..., yd) ~ . , my.data, ...)`.
The resulting model shows that a significant part of the variance is explained. As expected, the explained variance
is usually highest for the leading components (PC1, PC2) and then gradually drops.
We can access the variable importance for each individual outcome, e.g. for PC1:
```{r varimp-pc1, echo=TRUE, fig.width=8, out.width="90%", fig.cap="Variable importance for 3D prediction model for PC1."}
vmp <- as.data.frame(get.mv.vimp(m.s1))["PC1"]
vmp$relative_importance = 100*vmp$PC1/sum(vmp$PC1)
vmp = vmp[order(vmp$relative_importance, decreasing = TRUE),]
vmp$variable = paste0(c(1:nrow(vmp)), ". ", row.names(vmp))
ggplot(data = vmp[1:20,], aes(x = reorder(variable, relative_importance), y = relative_importance)) +
geom_bar(fill = "steelblue",
stat = "identity") +
coord_flip() +
labs(title = "Variable importance",
x = NULL,
y = NULL) +
theme_bw() + theme(text = element_text(size=15))
```
## Predicting geochemicals using the decomposition-composition method
To generate predictions, we can loop the `predict.rfsrc` function across the
different PCs and soil depths:
```{r}
for(j in 1:nComp){
for(k in c(5, 30, 60)){
out.tif = paste0("./output/pc", j, "_", k, "cm_1km.tif")
if(!file.exists(out.tif)){
g1km$hzn_depth = k
sel.na = complete.cases(g1km)
newdata = g1km[sel.na, pr.vars]
pred = predict(m.s1, m.target = paste0("PC", j), newdata=newdata)
g1km.sp = SpatialPixelsDataFrame(as.matrix(g1km[sel.na,c("x","y")]),
data=data.frame(pred$regrOutput[[paste0("PC", j)]][[1]]),
proj4string=CRS("EPSG:5070"))
g1km.sp$pred = g1km.sp@data[,1]*100
rgdal::writeGDAL(g1km.sp["pred"], out.tif, type="Int16", mvFlag=-32768, options=c("COMPRESS=DEFLATE"))
#gc()
}
}
}
```
This gives predictions of each PC (in the scaled space) for all depths.
Note that, because the randomForestSRC package [uses all cores by default](https://www.randomforestsrc.org/articles/parallel.html) to run model training and prediction,
there is no need to specify any further parallelization, i.e. it is fine to run the processing in a loop.
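If needed, the number of cores used by randomForestSRC can be capped via options, e.g.:
```{r, eval=FALSE}
## optionally limit randomForestSRC to e.g. 8 cores:
options(rf.cores = 8, mc.cores = 8)
```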
After we generate the predictions, we can (1) [back-transform the values to the original scale](https://stats.stackexchange.com/questions/229092/how-to-reverse-pca-and-reconstruct-original-variables-from-several-principal-com) using the PCA model eigenvectors, and (2) undo the log-transformation.
For example, for the top-soil predictions we first load all predicted PCs:
```{r}
## use a regex directly; mixing regex anchors inside glob2rx() is unreliable
ls.5cm = list.files('./output', pattern="^pc[0-9]+_5cm_1km\\.tif$", full.names = TRUE)
pred.tops = raster::stack(ls.5cm)
pred.tops = as(pred.tops, "SpatialGridDataFrame")
```
Next, we [back-transform the values](https://stats.stackexchange.com/questions/229092/how-to-reverse-pca-and-reconstruct-original-variables-from-several-principal-com) to the original scale:
```{r}
Xhat = as.matrix(pred.tops@data[,paste0("pc", 1:nComp, "_5cm_1km")]/100) %*% t(pcs.ds801$rotation[,1:nComp])
## scale-back to the original mean / stdev:
Xhat.s = sweep(Xhat %*% diag(attr(ds801.x, 'scaled:scale')), 2, attr(ds801.x, 'scaled:center'), "+")
Xhat.s = as.data.frame(expm1( Xhat.s ))
names(Xhat.s) = tvs
```
We can plot two predictions of `Pb` (using EML produced in the previous example; and using PCA analysis) next to each other:
```{r pca-pred, echo=TRUE, fig.width=8, out.width="100%", fig.cap="Predicted Pb concentration using ensemble ML (left) vs using the decomposition-composition PCA method (right)."}
pred.tops$Pb_5cm_1km.PCA = log1p(Xhat.s[,"pb_ppm"])
pred.tops$pb.pca_ppm_5cm_1km = pred.tops$Pb_5cm_1km.PCA * 100
#rgdal::writeGDAL(pred.tops["pb.pca_ppm_5cm_1km"], "./output/pb.pca_ppm_5cm_1km.tif",
# type="Int16", mvFlag=-32768, options=c("COMPRESS=DEFLATE"))
pred.tops$Pb_5cm_1km.EML = log1p(readGDAL("./output/pb_ppm_5cm_1km.tif")$band1)
r.pal = rev(rainbow(65)[1:48])
pred.tops.df = as.data.frame(pred.tops[c("Pb_5cm_1km.EML", "Pb_5cm_1km.PCA")])
all_df_long <- tidyr::gather(pred.tops.df, 'layer', 'value', -s1, -s2)
pred.pnts = as.data.frame(ngs.xy[!is.na(over(y=pred.tops[1], x=ngs.xy)[,1]),], xy=TRUE)
ggplot() +
geom_raster(data = all_df_long, aes(x=s1, y=s2, fill = value)) +
facet_wrap(. ~ layer, ncol = 2) +
geom_point(data = pred.pnts, aes(x=longitude, y=latitude), shape="+", size=2) +
scale_fill_gradientn(colours=r.pal, name="", limits=c(2.8,3.9)) +
coord_equal() +
theme_minimal() + xlab("Westing")+ ylab("Northing")
```
This shows that the predictions are quite similar, except that the PCA method somewhat smooths out some very
high values (Chicago city area). The difference also comes from the fact that,
for the map on the left, we have used an Ensemble model, while the map on the right is
purely based on RF.
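The agreement between the two maps can also be checked numerically, e.g.:
```{r, eval=FALSE}
## sketch: correlation between the EML and PCA-based Pb predictions (log scale)
cor(pred.tops$Pb_5cm_1km.EML, pred.tops$Pb_5cm_1km.PCA, use = "complete.obs")
```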
It also seems possible to derive prediction errors using the `quantreg` [function](https://www.randomforestsrc.org/reference/quantreg.rfsrc.html), although
this can be an order of magnitude more computationally demanding. In addition, for multiple
predicted PCs one would need to derive a composite prediction error (usually a
geometric mean of the variances).
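For example, a hypothetical sketch of such a composite error, assuming a pixels-by-PCs matrix `pc.var` of prediction variances (not derived above):
```{r, eval=FALSE}
## sketch: composite error as the geometric mean of per-PC variances;
## `pc.var` is a hypothetical pixels-by-PCs matrix of prediction variances
composite.var = exp(rowMeans(log(pc.var)))
composite.sd = sqrt(composite.var)
```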
## Predicting geochemicals assuming No-human-influence
Using this integrated model for geochemicals, we could also map the hypothetical distribution of a target geochemical
**assuming no or marginal human influence** (NoHI). For this we can use the same model,
but during prediction replace the values of the covariates that quantify
human impact. For example, the layers `travel_time_to_cities`, `population_density`,
lights at night images, density of industrial complexes and proximity to roads and
cities are all variables that basically quantify the human / industrial footprint [@mu2022global].
To predict values of geochemicals assuming no human influence, i.e. to estimate the
background or natural concentrations of geochemicals, we can thus replace the values
of all such layers using the following simple principles [@sanderman2017soil]:
- for lights at night images and human population density we replace all values with 0;
- for travel time to cities, buffer densities around industrial sources of pollution and similar,
we use some very high value, e.g. the 99% probability quantile for the whole of USA48 (see the sketch below).
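A minimal sketch of deriving such a replacement value from the training data:
```{r, eval=FALSE}
## sketch: 99% quantile of travel time to cities across training points
quantile(ds801$travel_time_to_cities_1_usa48, probs = 0.99, na.rm = TRUE)
```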
We can then re-make all predictions using the NoHI settings:
```{r}
## only top-soil
k = 5
for(j in 1:nComp){
out.tif = paste0("./output/pc", j, ".nohi_", k, "cm_1km.tif")
if(!file.exists(out.tif)){
g1km$hzn_depth = k
sel.na = complete.cases(g1km)
newdata = g1km[sel.na, pr.vars]
## replace values / simulate noHI:
newdata[,grep("travel", pr.vars)] = 1000
newdata[,grep("pop", pr.vars)] = 0
#newdata[,grep("light", pr.vars)] = 0
pred = predict(m.s1, m.target = paste0("PC", j), newdata=newdata)
g1km.sp = SpatialPixelsDataFrame(as.matrix(g1km[sel.na,c("x","y")]),
data=data.frame(pred$regrOutput[[paste0("PC", j)]][[1]]),
proj4string=CRS("EPSG:5070"))
g1km.sp$pred = g1km.sp@data[,1]*100
rgdal::writeGDAL(g1km.sp["pred"], out.tif, type="Int16", mvFlag=-32768, options=c("COMPRESS=DEFLATE"))
#gc()
}
}
```
Again, we need to back-transform the values to the original scale:
```{r}
lsN.5cm = list.files('./output', pattern="^pc[0-9]+\\.nohi_5cm_1km\\.tif$", full.names = TRUE)
predN.tops = raster::stack(lsN.5cm)
predN.tops = as(predN.tops, "SpatialGridDataFrame")
XhatN = as.matrix(predN.tops@data[,paste0("pc", 1:nComp, ".nohi_5cm_1km")]/100) %*% t(pcs.ds801$rotation[,1:nComp])
XhatN.s = sweep(XhatN %*% diag(attr(ds801.x, 'scaled:scale')), 2, attr(ds801.x, 'scaled:center'), "+")
XhatN.s = as.data.frame(expm1( XhatN.s ))
names(XhatN.s) = tvs
```
We can plot the two predictions for e.g. Pb to see if there is any difference:
```{r nohi-pred, echo=TRUE, fig.width=8, out.width="100%", fig.cap="Predicted Pb concentration using decomposition-composition PCA method with original covariates (left) and assuming no human influence (right)."}
pred.tops$Pb.nohi_5cm_1km.PCA = log1p(XhatN.s[,"pb_ppm"])
pred.tops$pb.pca.nohi_ppm_5cm_1km = pred.tops$Pb.nohi_5cm_1km.PCA * 100
#rgdal::writeGDAL(pred.tops["pb.pca.nohi_ppm_5cm_1km"], "./output/pb.pca.nohi_ppm_5cm_1km.tif",
# type="Int16", mvFlag=-32768, options=c("COMPRESS=DEFLATE"))
predN.tops.df = as.data.frame(pred.tops[c("Pb_5cm_1km.PCA", "Pb.nohi_5cm_1km.PCA")])
allN_df_long <- tidyr::gather(predN.tops.df, 'layer', 'value', -s1, -s2)
ggplot() +
geom_raster(data = allN_df_long, aes(x=s1, y=s2, fill = value)) +
facet_wrap(. ~ layer, ncol = 2) +
geom_point(data = pred.pnts, aes(x=longitude, y=latitude), shape="+", size=2) +
scale_fill_gradientn(colours=r.pal, name="", limits=c(2.8,3.9)) +
coord_equal() +
theme_minimal() + xlab("Westing")+ ylab("Northing")
```
This has significantly reduced the total values of Pb. Basically, the map on the right
most likely shows the *background* or *natural* concentration of Pb assuming
no human influence (notice that the city of Chicago and similar features are no longer visible on the map).
The difference between the two maps is an estimate of the human-induced Pb pollution; a sketch of deriving it is given below.
If there are historic Pb soil measurements, the map on the right could also be validated.
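A minimal sketch of deriving that difference map (both predictions are in ppm after back-transformation):
```{r, eval=FALSE}
## sketch: human-induced Pb enrichment as the difference between the
## original and the NoHI predictions
pred.tops$pb_human_ppm = Xhat.s[,"pb_ppm"] - XhatN.s[,"pb_ppm"]
summary(pred.tops$pb_human_ppm)
```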
Note that we have only *modified* the values for travel time to cities, population density etc.;
the MODIS LST night-time temperatures should probably also be adjusted since they
likewise reveal big urban areas. We have not done this here because converting
MODIS LST images to a no-human-influence state would probably require modeling in itself, but it is
most likely doable.
## Advantages and limitations of running 3D predictive mapping
In summary, 3D soil mapping is relatively straightforward to implement, especially
for mapping soil variables from soil profiles (multiple samples per soil layer).
Before modeling the target variable with depth, it is a good idea to plot the
relationship between the target variable and depth under different settings
(as in Fig. \@ref(fig:cor-depth)). 3D soil mapping based on Machine Learning is
now increasingly common [@hengl2019predictive; @sothe2022large].
A limitation of 3D predictive mapping is the size of the data, i.e. data volumes
increase proportionally to the number of depth slices we need to predict (in this case three).
Also, we have shown that points with exactly the same coordinates can lead to
Random Forest overfitting, with the model emphasizing some covariate layers that are possibly
less important; hence it is again important to use a blocking parameter that
separates training and validation points.
Soil depth is for many soil variables the most important explanatory variable, but
where it does not correlate with the target variable, there is probably
also no need for 3D soil mapping. In that case one could simply first aggregate
all values to a fixed block depth and convert the modeling from 3D to 2D;
a sketch of this is given below.
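A hypothetical sketch of such 3D-to-2D aggregation, assuming horizon-limit columns `hzn_top` and `hzn_bot` (in cm) in the training data:
```{r, eval=FALSE}
## sketch: thickness-weighted mean Pb per site, collapsing 3D to 2D;
## `hzn_top` / `hzn_bot` are hypothetical horizon-limit columns
ds801$thick = ds801$hzn_bot - ds801$hzn_top
pb.2d = sapply(split(ds801, ds801$olc_id),
               function(i){ weighted.mean(i$pb_ppm, w = i$thick, na.rm = TRUE) })
```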
By combining PCA and predictive mapping (we refer to this as the **Decomposition-composition PCA method**)
one can integrate the modeling of multiple cross-correlated variables with multicollinearity reduction (PCA).
This seems especially applicable for cases where there are ≫10 target variables which
are highly cross-correlated, possibly have inconsistent missing values, and after transformation
follow a close-to-normal distribution.
As a rule of thumb, one can model only the PCs that explain up to 99% of the variance in the data; this way the modeling
effort can be significantly reduced (e.g. for 50+ target variables one could model only
a dozen or so PCs). A disadvantage of the Decomposition-composition PCA method is that interpretation of the
scaled values is somewhat more complex, and certainly more abstract.