CleanCoordinatesDS: Geographic Coordinate Cleaning based on Dataset Properties

Description

Identifies potentially problematic coordinates based on dataset properties. Includes tests to flag potential errors from converting ddmm to dd.dd, and periodicity in the data decimals that indicates rounding or a raster basis linked to low coordinate precision.

Usage

CleanCoordinatesDS(x, lon = "decimallongitude", lat = "decimallatitude",
                   ds = "dataset",
                   ddmm = TRUE, periodicity = TRUE,
                   ddmm.pvalue = 0.025, ddmm.diff = 0.2, 
                   periodicity.target = "lon_lat", periodicity.thresh = 3.5, 
                   periodicity.diagnostics = FALSE, 
                   periodicity.subsampling = NULL,
                   value = "dataset", verbose = TRUE)

Arguments

x

a data.frame containing geographical coordinates and species names.

lon

a character string. The column with the longitude coordinates. Default = “decimallongitude”.

lat

a character string. The column with the latitude coordinates. Default = “decimallatitude”.

ds

a character string. The column with the dataset of each record. If x should be treated as a single dataset, this column must be identical for all records. Default = “dataset”.

ddmm

logical. If TRUE, testing for erroneous conversion from a degree minute format (ddmm) to a decimal degree (dd.dd) format. See details.

periodicity

logical. If TRUE, testing for periodicity in the data, which can indicate imprecise coordinates, due to rounding or rasterization.

ddmm.pvalue

numeric. The p-value threshold for the one-sided binomial test (see Details) used to decide whether a dataset passes the ddmm test. A dataset is flagged only if both the ddmm.pvalue and the ddmm.diff criteria are met. Default = 0.025.

ddmm.diff

numeric. The threshold difference for the ddmm test. Indicates by which fraction the records with decimals below 0.6 must outnumber the records with decimals above 0.6. Default = 0.2.

periodicity.target

a character string. One of ‘lat’, ‘lon’, ‘lon_lat’. Sets the target for the periodicity test, which runs on latitude and longitude separately. If ‘lon_lat’, the tests run sequentially and the results for both, plus a combined flag, are returned. Default = ‘lon_lat’.

periodicity.thresh

numerical. The threshold for flagging in the periodicity test. Indicates the factor by which one of the two periodic bins must outnumber the other. Default = 3.5. Higher values are more conservative and flag fewer datasets.

periodicity.diagnostics

logical. If TRUE, plots a series of diagnostics visualizing the periodicity test. Default = FALSE.

periodicity.subsampling

numerical. If defined, only a random subsample of n = periodicity.subsampling records is used for the periodicity test. Speeds up analyses when working with many large datasets. Default = NULL.

value

a character string defining the output value. One of “dataset”, “clean” or “flags”. See Value. Default = “dataset”.

verbose

logical. If TRUE, reports the name of the test and the number of records flagged.

Value

Depending on the ‘value’ argument, either a summary per dataset (“dataset”), a data.frame containing the records considered correct by the test (“clean”), or a logical vector with TRUE = test passed and FALSE = test failed/potentially problematic (“flags”). Default = “dataset”.

If “dataset”: a data.frame with one row for each dataset in x and columns depending on the output option: ‘detail’ shows the highest level of detail, ‘flag’ shows only the flags from the tests and ‘minimal’ shows only the combined flags. Available columns are:

binomial.pvalue = p-value of the ddmm test, compared to ddmm.pvalue
perc.difference = the percentage of difference from the expectation under a binomial test
pass.ddmm = logical flag summarizing the ddmm test; if TRUE: passed, if FALSE: potentially problematic
mle = the maximum likelihood for the rate parameters of the periodicity test
rate.ratio = ratio between the two rates of the periodicity model, compared to periodicity.thresh
zero.mle = size of the maximum likelihood zero bin from the zero test
zero.rate.ratio = ratio by which the number of zero decimals surpasses the number of records with other decimals
pass.zero = logical flag summarizing the zero test; if TRUE: passed, if FALSE: potentially problematic
pass.periodicity = logical flag summarizing the periodicity test; if TRUE: passed, if FALSE: potentially problematic

Flags for the periodicity test are marked according to the coordinate tested: ‘lon’ = longitude, ‘lat’ = latitude, ‘com’ = AND combination of the two former.
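To illustrate how a per-dataset summary can be used to subset records, the sketch below builds a mock summary by hand. The mock table is an assumption made for the example: it only mimics the layout described above (one row per dataset, a logical pass.ddmm column, dataset names as row names) and is not real output of CleanCoordinatesDS.

```r
# Toy records from two datasets
records <- data.frame(
  dataset = c("a", "a", "b", "b", "b"),
  decimallongitude = c(-41.25, -40.73, -41.10, -41.30, -40.50),
  decimallatitude = c(-11.62, -10.08, -11.20, -10.40, -11.50)
)

# Mock per-dataset summary (hypothetical, NOT produced by CleanCoordinatesDS)
mock_flags <- data.frame(pass.ddmm = c(TRUE, FALSE),
                         row.names = c("a", "b"))

# Names of the datasets that passed, then the records belonging to them
passed <- rownames(mock_flags)[mock_flags$pass.ddmm]
kept <- records[records$dataset %in% passed, ]
```

Using %in% rather than == keeps the subsetting correct when more than one dataset passes.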

Details

This function checks the statistical distribution of decimals within datasets of geographic distribution records to identify datasets with potential errors or biases. Three potential error sources can be identified.

The ddmm flag tests for the particular pattern that emerges if geographical coordinates in a degree minute annotation are transferred into decimal degrees by simply replacing the degree symbol with the decimal point. This kind of problem has been observed in older datasets first recorded on paper using typewriters, where, for example, a point was used as the symbol for degrees. The function uses a binomial test to check if more records than expected have decimals below 0.6 (which is the maximum that can be obtained in minutes, as one degree has 60 minutes) and if the number of these records is higher than the number of those above 0.59 by a certain proportion.

The periodicity test uses rate estimation in a Poisson process to estimate if there is periodicity in the decimals of a dataset (as would be expected from, for example, rounding or data that were collected in a raster format) and if there is a disproportionate number of records with the decimal 0 (full degrees), which indicates rounding and thus low precision.

The default values were empirically optimized with GBIF data, but should probably be adapted to the data at hand.
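The idea behind the ddmm check can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helper name ddmm_sketch, the expected share of 0.6 (decimals assumed uniform between 0 and 1), and the way the excess is measured are all choices made for this example.

```r
# Hypothetical helper illustrating the ddmm idea; NOT the package's code.
ddmm_sketch <- function(coords, pvalue = 0.025, diff = 0.2) {
  # Decimal fraction of each coordinate, in [0, 1)
  dec <- abs(coords) %% 1
  below <- sum(dec < 0.6)
  total <- length(dec)
  # One-sided binomial test: are there more decimals below 0.6 than the
  # 60% expected if decimals were uniformly distributed?
  bt <- binom.test(below, total, p = 0.6, alternative = "greater")
  # Relative excess of the observed share over the expected share
  # (an assumed measure, chosen for illustration)
  excess <- (below / total - 0.6) / 0.6
  list(p.value = bt$p.value,
       excess = excess,
       # Flagged only if both criteria are met; pass = not flagged
       pass = !(bt$p.value < pvalue && excess > diff))
}

set.seed(1)
# Genuine decimal degrees: decimals spread over [0, 1)
good <- runif(1000, min = -42, max = -40)
# Degree-minute values typed as if decimal (e.g. 41 deg 35 min as -41.35):
# the "decimals" never reach 0.6
deg <- sample(c(-42, -41), 1000, replace = TRUE)
minutes <- sample(0:59, 1000, replace = TRUE)
bad <- deg - minutes / 100

res_good <- ddmm_sketch(good)
res_bad <- ddmm_sketch(bad)
```

Here res_good$pass should be TRUE and res_bad$pass FALSE, because the mis-converted values have no decimals at or above 0.6 at all.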

Examples

# NOT RUN {
#Create test dataset
clean <- data.frame(dataset = rep("clean", 1000),
                    decimallongitude = runif(min = -42, max = -40, n = 1000),
                    decimallatitude = runif(min = -12, max = -10, n = 1000))
                    
bias.long <- c(round(runif(min = -42, max = -40, n = 500), 1),
               round(runif(min = -42, max = -40, n = 300), 0),
               runif(min = -42, max = -40, n = 200))
bias.lat <- c(round(runif(min = -12, max = -10, n = 500), 1),
              round(runif(min = -12, max = -10, n = 300), 0),
              runif(min = -12, max = -10, n = 200))
bias <- data.frame(dataset = rep("biased", 1000),
                   decimallongitude = bias.long,
                   decimallatitude = bias.lat)
test <- rbind(clean, bias)

# }
# NOT RUN {
#run CleanCoordinatesDS
flags <- CleanCoordinatesDS(test)

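#check problems
#clean: records from datasets that passed (%in% also works if several datasets pass)
hist(test[test$dataset %in% rownames(flags[flags$summary,]), "decimallongitude"])
#biased: records from datasets that were flagged
hist(test[test$dataset %in% rownames(flags[!flags$summary,]), "decimallongitude"])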

# }
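Independently of the package, the decimal fractions that the periodicity and zero checks look at can be inspected directly in base R. The sketch below is an illustration under stated assumptions: decimal_part, zero_share and grid_share are hypothetical helpers, and the simple shares they report are not the rate ratios estimated by CleanCoordinatesDS.

```r
set.seed(42)
# Precise coordinates versus a mix of rounded and precise coordinates
precise <- runif(1000, min = -42, max = -40)
rounded <- c(round(runif(500, min = -42, max = -40), 1),
             round(runif(300, min = -42, max = -40), 0),
             runif(200, min = -42, max = -40))

# Hypothetical helper: decimal fraction of a coordinate, in [0, 1)
decimal_part <- function(x) abs(x) %% 1

# Share of records at full degrees (decimal part numerically zero)
zero_share <- function(x, tol = 1e-8) {
  d <- decimal_part(x)
  mean(d < tol | d > 1 - tol)
}

# Share of records sitting on a regular grid of width `step` degrees
grid_share <- function(x, step = 0.1, tol = 1e-8) {
  r <- abs(x) / step
  mean(abs(r - round(r)) < tol)
}

shares <- data.frame(
  zero = c(zero_share(precise), zero_share(rounded)),
  grid = c(grid_share(precise), grid_share(rounded)),
  row.names = c("precise", "rounded")
)
```

A histogram such as hist(decimal_part(rounded), breaks = 50) shows the same pattern visually: spikes at multiples of 0.1 and a large spike at 0.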