outliers.detect: Detect outliers in a set of geographical coordinates

Description

This function takes an input data.frame of latitude and longitude coordinates, generates pseudo-absences using environmental data, and then uses both presences and pseudo-absences to train an SVM model that flags possible outliers for a given species.

Usage

outliers.detect(
  longlat,
  training = NULL,
  hi_res = TRUE,
  crop = FALSE,
  threshold = 0.05,
  method = "all"
)

Value

If method = "all", a list containing, for each point, whether it was classified as an outlier (TRUE or FALSE), along with the confusion matrix for the training data. If method = "geo" or method = "env", a data.frame is returned.

Arguments

longlat: data.frame. With two columns containing latitude and longitude, describing the locations of a species, which may contain outliers.
training: data.frame. With the same formatting as longlat, indicating only known locations where a target species occurs. Used exclusively as training data for method "svm".
hi_res: logical. Specifies if 1 km resolution environmental data should be used. If FALSE, 10 km resolution data is used instead.
crop: logical. Indicates whether environmental data should be cropped to an extent similar to what is given in longlat and training. Useful for avoiding long processing times at higher resolutions.
threshold: numeric. Value indicating the threshold for classifying outliers in methods "geo" and "env". E.g.: under the default of 0.05, points whose average distance is greater than the 95% quantile of the average distances of all points are classified as outliers.
method: A string specifying the outlier detection method. "geo" calculates the Euclidean distance between point coordinates and classifies as outliers those points beyond the quantile set by threshold. "env" performs the same calculation but uses the environmental data extracted at those points instead. "svm" uses the dataset given to longlat and its corresponding extracted environmental data to train a support vector machine model that then predicts outliers.
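
The distance-based thresholding described for "geo" can be illustrated with a minimal base R sketch. This is not the package's implementation: flag_by_mean_distance is a hypothetical helper, the column names and simulated coordinates are made up, and it assumes "average distance" means each point's mean Euclidean distance to all other points.

```r
# Hypothetical helper, not part of gecko: flag points whose mean distance
# to all other points exceeds the (1 - threshold) quantile.
flag_by_mean_distance <- function(longlat, threshold = 0.05) {
  d <- as.matrix(dist(longlat))              # pairwise Euclidean distances
  n <- nrow(d)
  mean_dist <- unname(rowSums(d)) / (n - 1)  # drop the zero self-distance
  cutoff <- unname(quantile(mean_dist, probs = 1 - threshold))
  data.frame(longlat, mean_dist = mean_dist, outlier = mean_dist > cutoff)
}

# Simulated data (assumption): 19 clustered points plus one distant point
set.seed(42)
pts <- data.frame(
  long = c(runif(19, -17.10, -17.05), -16.30),
  lat  = c(runif(19, 32.73, 32.76), 33.05)
)
res <- flag_by_mean_distance(pts)
res[res$outlier, ]
```

An "env"-style variant would presumably apply the same calculation to a table of extracted environmental values instead of coordinates.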

Details

The environmental data used is WorldClim, which requires a long download; see gecko::gecko.setDir(). This function is heavily based on the methods described in Liu et al. (2017). There the authors describe SVM_pdSDM, a pseudo-SDM method similar to a two-class presence-only SVM that is capable of using pseudo-absence points, implemented with the ksvm function in the R package kernlab. It is suggested that, for each set of "n" occurrence records, "2 * n" pseudo-absence points are generated. When using it, keep in mind works highlighting its limitations, such as Meynard et al. (2019). See the References section.
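
The "2 * n" pseudo-absence idea and the use of ksvm can be sketched roughly as follows. This is only an illustration under stated assumptions, not the function's internals and not the exact SVM_pdSDM procedure: pseudo-absences are drawn uniformly from an arbitrary wider bounding box, raw coordinates stand in for environmental variables, and all object names and simulated values are hypothetical. The model is only fitted if kernlab is installed.

```r
set.seed(1)
n <- 30

# Simulated presence records (assumption: coordinates stand in for
# extracted environmental variables)
presences <- data.frame(long = runif(n, -17.10, -17.05),
                        lat  = runif(n, 32.73, 32.76))

# Assumption: 2 * n pseudo-absences drawn uniformly from a wider box
pseudo_abs <- data.frame(long = runif(2 * n, -17.30, -16.70),
                         lat  = runif(2 * n, 32.60, 32.90))

train <- rbind(presences, pseudo_abs)
label <- factor(rep(c("presence", "absence"), times = c(n, 2 * n)))

if (requireNamespace("kernlab", quietly = TRUE)) {
  # Two-class SVM with a radial basis kernel
  model <- kernlab::ksvm(as.matrix(train), label,
                         type = "C-svc", kernel = "rbfdot")

  # Presence records that the model assigns to the absence class
  predicted <- kernlab::predict(model, as.matrix(presences))
  flagged <- presences[predicted == "absence", ]

  # Confusion matrix on the training data
  fitted_class <- kernlab::predict(model, as.matrix(train))
  print(table(observed = label, predicted = fitted_class))
}
```

Presence records predicted as absences are the candidates for outliers in this sketch; how the function itself derives its flags may differ.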

References

Liu, C., White, M. and Newell, G. (2017) ‘Detecting outliers in species distribution data’, Journal of Biogeography, 45(1), pp. 164–176. doi:10.1111/jbi.13122.

Meynard, C.N., Kaplan, D.M. and Leroy, B. (2019) ‘Detecting outliers in species distribution data: Some caveats and clarifications on a virtual species study’, Journal of Biogeography, 46(9), pp. 2141–2144. doi:10.1111/jbi.13626.

Examples

if (FALSE) {
# Not run automatically.
# Records to screen for possible outliers
new_occurences = gecko.data("records")
new_occurences = new_occurences[new_occurences$species == "Hogna maderiana", 2:3]
# Known locations of the species, passed as the training argument
old_occurences = data.frame(X = runif(10, -17.1, -17.05), Y = runif(10, 32.73, 32.76))
outliers.detect(new_occurences, old_occurences)
}
