glmidwcv: Cross validation, n-fold and leave-one-out for the hybrid method of generalised linear models ('glm') and inverse distance weighted ('IDW') ('glmidw')

Description

This function is a cross validation function for the hybrid method of 'glm' and 'idw' using 'gstat' (glmidw) (see reference #1), where the data splitting is based on a stratified random sampling method (see the 'datasplit' function for details).

Usage

glmidwcv(
  formula = NULL,
  longlat,
  trainxy,
  y,
  family = "gaussian",
  idp = 2,
  nmaxidw = 12,
  validation = "CV",
  cv.fold = 10,
  predacc = "VEcv",
  ...
)

Value

A list with the following components: me, rme, mae, rmae, mse, rmse, rrmse, vecv and e1; or vecv only.

Arguments

formula: a formula defining the response variable and predictive variables for 'glm'.
longlat: a dataframe contains longitude and latitude of point samples.
trainxy: a dataframe contains longitude (long), latitude (lat), predictive variables and the response variable of point samples.
y: a vector of the response variable in the formula, that is, the left part of the formula.
family: a description of the error distribution and link function to be used in the model. See '?glm' for details.
idp: a numeric number specifying the inverse distance weighting power.
nmaxidw: for a local predicting: the number of nearest observations that should be used for a prediction or simulation, where nearest is defined in terms of the space of the spatial locations. By default, 12 observations are used.
validation: validation methods, include 'LOO': leave-one-out, and 'CV': cross-validation.
cv.fold: integer; number of folds in the cross-validation. if > 1, then apply n-fold cross validation; the default is 10, i.e., 10-fold cross validation that is recommended.
predacc: can be either "VEcv" for vecv or "ALL" for all measures in function pred.acc.
...: other arguments passed on to 'glm' and 'gstat'.

Author

Jin Li

References

Li, J., Alvarez, B., Siwabessy, J., Tran, M., Huang, Z., Przeslawski, R., Radke, L., Howard, F. and Nichol, S. (2017). "Application of random forest, generalised linear model and their hybrid methods with geostatistical techniques to count data: Predicting sponge species richness." Environmental Modelling & Software 97: 112-129.

A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18-22.

Pebesma, E.J., 2004. Multivariable geostatistics in S: the gstat package. Computers & Geosciences, 30: 683-691.

Examples

Run this code

# \donttest{
library(spm)

data(petrel)
gravel <- petrel[, c(1, 2, 6:9, 5)]
longlat <- petrel[, c(1, 2)]
model <- log(gravel + 1) ~  lat +  bathy + I(long^3) + I(lat^2) + I(lat^3)
y <- log(gravel[, 7] +1)

set.seed(1234)
glmidwcv1 <- glmidwcv(formula = model, longlat = longlat, trainxy =  gravel,
y = y, idp = 2, nmaxidw = 12, validation = "CV", predacc = "ALL")
glmidwcv1 # Since the default 'family' is used, actually a 'lm' model is used.

data(spongelonglat)
longlat <- spongelonglat[, 7:8]
model <- sponge ~ long + I(long^2)
y = spongelonglat[, 1]
set.seed(1234)
glmidwcv1 <- glmidwcv(formula = model, longlat = longlat, trainxy = spongelonglat,
y = y, family = poisson, idp = 2, nmaxidw = 12, validation = "CV",
predacc = "ALL")
glmidwcv1

# glmidw for count data
data(spongelonglat)
longlat <- spongelonglat[, 7:8]
model <- sponge ~ . # use all predictive variables in the dataset
y = spongelonglat[, 1]
set.seed(1234)
n <- 20 # number of iterations,60 to 100 is recommended.
VEcv <- NULL
for (i in 1:n) {
 glmidwcv1 <- glmidwcv(formula = model, longlat = longlat, trainxy = spongelonglat,
 y = y, family = poisson, idp = 2, nmaxidw = 12, validation = "CV",
 predacc = "VEcv")
 VEcv [i] <- glmidwcv1
 }
 plot(VEcv ~ c(1:n), xlab = "Iteration for GLM", ylab = "VEcv (%)")
 points(cumsum(VEcv) / c(1:n) ~ c(1:n), col = 2)
 abline(h = mean(VEcv), col = 'blue', lwd = 2)

# glmidw for percentage data
longlat <- petrel[, c(1, 2)]
model <- gravel / 100 ~  lat +  bathy + I(long^3) + I(lat^2) + I(lat^3)
set.seed(1234)
n <- 20 # number of iterations,60 to 100 is recommended.
VEcv <- NULL
for (i in 1:n) {
glmidwcv1 <- glmcv(formula = model, longlat = longlat, trainxy = gravel,
y = gravel[, 7] / 100, family = binomial(link=logit), idp = 2, nmaxidw = 12,
validation = "CV", predacc = "VEcv")
VEcv [i] <- glmidwcv1
}
plot(VEcv ~ c(1:n), xlab = "Iteration for GLM", ylab = "VEcv (%)")
points(cumsum(VEcv) / c(1:n) ~ c(1:n), col = 2)
abline(h = mean(VEcv), col = 'blue', lwd = 2)
# }

Run the code above in your browser using DataLab