dist_mat: Distance matrix estimation

Description

This function estimates the distance matrix separately from the Conley standard errors. This step can be helpful when running multiple Conley standard error estimations based on the same distance matrix. A prerequisite for using this function is that the data must not be modified between applying this function and inserting the results into conleyreg.

Usage

dist_mat(
  data,
  unit = NULL,
  time = NULL,
  lat = NULL,
  lon = NULL,
  dist_comp = NULL,
  dist_cutoff = NULL,
  crs = NULL,
  verbose = TRUE,
  ncores = NULL,
  par_dim = c("cross-section", "time", "r", "cpp"),
  sparse = FALSE,
  batch = TRUE,
  batch_ram_opt = NULL,
  dist_round = FALSE,
  st_distance = FALSE,
  dist_which = NULL
)

Value

Returns an object of S3 class conley_dist. It contains modified distance matrices, the used dist_cutoff, a sparse matrix identifier, and information on the potential panel structure. In the cross-sectional case and the balanced panel case, the distances are stored in one matrix, while in unbalanced panel applications, distances come as a list of matrices. The function optimizes the distance matrices with respect to computational performance, setting distances beyond dist_cutoff to zero and actual off-diagonal zeros to NaN. Hence, these objects are only to be used in conleyreg.
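
The storage convention described above can be illustrated with a small base R sketch. It is not conleyreg's internal code: the helper name encode_distances is hypothetical, and treating "beyond dist_cutoff" as strictly greater than the cutoff is an assumption made for illustration.

```r
# Hypothetical illustration of the encoding described above
# (not conleyreg's actual implementation).
encode_distances <- function(d, dist_cutoff) {
  stopifnot(is.matrix(d), nrow(d) == ncol(d))
  out <- d
  off_diag <- row(d) != col(d)
  # Genuine zero distances between distinct observations would be
  # indistinguishable from dropped pairs, so they are flagged as NaN.
  out[off_diag & d == 0] <- NaN
  # Pairs beyond the cutoff are stored as zero (assumption: strict ">").
  out[off_diag & d > dist_cutoff] <- 0
  out
}

# Three points: 1 and 2 coincide, 3 lies 120 km from 1 and 40 km from 2
d <- matrix(c(  0,  0, 120,
                0,  0,  40,
              120, 40,   0), nrow = 3, byrow = TRUE)
enc <- encode_distances(d, dist_cutoff = 100)
print(enc)
```

Because zero then means "outside the cutoff" rather than "same location", such a matrix is not a valid distance matrix for general use, which is why the object should only be passed to conleyreg.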

Arguments

data: input data. Either (i) in non-spatial data frame format (includes tibbles and data tables) with columns denoting coordinates or (ii) in sf format. In case of an sf object, all non-point geometry types are converted to spatial points, based on the feature's centroid. When using a non-spatial data frame format with projected, i.e. non-longlat, coordinates, crs must be specified. Note that the projection can influence the computed distances, which is a general phenomenon in GIS software and not specific to conleyreg. The computationally fastest option is to use a data table with coordinates in the crs in which the distances are to be derived (longlat for spherical and projected for planar), and with time and unit set as keys in the panel case. An sf object as input is the slowest option.
unit: the variable identifying the cross-sectional dimension. Only needs to be specified if data is not cross-sectional. Assumes that units do not change their location over time.
time: the variable identifying the time dimension. Only needs to be specified if data is not cross-sectional.
lat: the variable specifying the latitude
lon: the variable specifying the longitude
dist_comp: choice between spherical and planar distance computations. When unspecified, the input data determines the method: longlat uses spherical (Haversine) distances, alternatives (projected data) use planar (Euclidean) distances. When inserting projected data but specifying dist_comp = "spherical", the data is transformed to longlat. Combining unprojected data with dist_comp = "planar" transforms the data to an azimuthal equidistant format centered at the data's centroid.
dist_cutoff: the distance cutoff in km. If not specified, the distance matrices contain all bilateral distances. If specified, the cutoff must be at least as large as the largest distance cutoff in the Conley standard error corrections in which you use the resulting matrix. If you e.g. specify distance cutoffs of 100, 200, and 500 km in the subsequent conleyreg calls, dist_cutoff in this function must be set to at least 500. dist_cutoff also makes it possible to pre-compute distance matrices in applications where a full distance matrix would not fit into the computer's memory, provided that sparse = TRUE.
crs: the coordinate reference system, if the data is projected. Object of class crs or input string to sf::st_crs. This parameter can be omitted if the data is in longlat format (EPSG: 4326), i.e. not projected. If the projection does not use meters as units, this function converts the units to meters.
verbose: logical specifying whether to print messages on intermediate estimation steps. Defaults to TRUE.
ncores: the number of CPU cores to use in the estimations. Defaults to the machine's number of CPUs.
par_dim: the dimension along which the function parallelizes in unbalanced panel applications. Can be set to "cross-section" (default) or "time". Use "r" and "cpp" to define parallelization based on the language rather than the dimension. In this function, "r" is equivalent to "time" and parallelizes along the time dimension using the parallel package. "cross-section" is equivalent to "cpp" and parallelizes along the cross-sectional dimension using OpenMP in C++. Some Mac users do not have access to OpenMP by default. par_dim is then always set to "r". Thus, depending on the application, the function can be notably faster on Windows and Linux than on Macs. When st_distance = TRUE, par_dim defaults to "time".
sparse: logical specifying whether to use sparse rather than dense (regular) matrices in distance computations. Defaults to FALSE. Only has an effect when st_distance = FALSE. Sparse matrices are more efficient than dense matrices, when the distance matrix has a lot of zeros arising from points located outside the respective dist_cutoff. It is recommended to keep the default unless the machine is unable to allocate enough memory. The function always uses dense matrices when dist_cutoff is not specified.
batch: logical specifying whether distances are inserted into a sparse matrix element by element (FALSE) or all at once as a batch (TRUE). Defaults to TRUE. This argument only has an effect when st_distance = FALSE and sparse = TRUE. Batch insertion is faster than element-wise insertion, but requires more memory.
batch_ram_opt: the degree to which batch insertion should be optimized for RAM usage. Can be set to one out of the three levels: "none", "moderate" (default), and "heavy". Higher levels imply lower RAM usage, but also lower speeds.
dist_round: logical specifying whether to round distances to full kilometers. This further reduces memory consumption and can be a solution when even sparse matrices cannot accommodate the data. Note, though, that this rounding introduces a bias.
st_distance: logical specifying whether distances should be computed via sf::st_distance (TRUE) or via conleyreg's internal, computationally optimized distance functions (FALSE). The default (FALSE) produces the same distances as sf::st_distance does with S2 enabled. That is, it uses Haversine (great circle) distances for longlat data and Euclidean distances otherwise. Cases in which you might want to set this argument to TRUE are e.g. when you want to enforce the GEOS approach to computing distances or when you are using a peculiar projection, for which the sf package might include further procedures. Cross-sectional parallelization is not available when st_distance = TRUE and the function automatically switches to parallelization along the time dimension, if the data is a panel and ncores != 1. Third and fourth dimensions, termed Z and M in sf, are not accounted for in any case. Note that sf::st_distance is considerably slower than conleyreg's internal distance functions.
dist_which: the type of distance to use when st_distance = TRUE. If unspecified, the function defaults to great circle distances for longlat data and to Euclidean distances otherwise. See sf::st_distance for options.
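
To make the spherical versus planar distinction in dist_comp more concrete, here is a minimal base R sketch of the two distance formulas. The helper names haversine_km and euclidean_km are hypothetical, and the Earth radius of 6371 km is an assumption; conleyreg's internal functions may use a different constant and are implemented differently.

```r
# Hypothetical helpers for illustration only; not part of conleyreg.

# Great circle (Haversine) distance in km for longlat coordinates in degrees.
# radius_km = 6371 is an assumed mean Earth radius.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Planar (Euclidean) distance in km for projected coordinates in meters
euclidean_km <- function(x1, y1, x2, y2) {
  sqrt((x2 - x1)^2 + (y2 - y1)^2) / 1000
}

# Two points on the equator, half the globe apart
half_circle <- haversine_km(0, 0, 0, 180)
# Classic 3-4-5 triangle, scaled to meters
planar <- euclidean_km(0, 0, 3000, 4000)
```

The Haversine formula takes unprojected degrees as input, while the Euclidean formula only yields meaningful distances on projected coordinates, which is why the function transforms the data when dist_comp does not match the input.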

Details

This function runs the distance matrix estimations separately from the Conley standard error correction. You can pass the resulting object to the dist_mat argument in conleyreg, skipping the distance matrix computations and various checks in that function. Pre-computing the distance matrix is only more efficient than deriving it via conleyreg when estimating various models that use the same distance matrices. The input data must not be modified between calling this function and inserting the results into conleyreg. Do not reorder the observations, add or delete variables, or undertake any other operation on the data.
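
The reuse pattern described here might look as follows. This is a sketch under assumptions rather than a canonical workflow: it presumes the conleyreg package is installed, borrows rnd_locations and the positional cutoff argument from the package's own example, and uses the hypothetical names cutoffs and fits.

```r
# Cutoffs (in km) for the subsequent conleyreg calls
cutoffs <- c(100, 200, 500)

# Only runs if conleyreg is available
if (requireNamespace("conleyreg", quietly = TRUE)) {
  library(conleyreg)

  # Generate cross-sectional example data
  data <- rnd_locations(100, output_type = "data.frame")
  data$y <- sample(c(0, 1), 100, replace = TRUE)
  data$x1 <- stats::runif(100, -50, 50)

  # Compute the distance matrix once, with a cutoff at least as large
  # as the largest cutoff used afterwards
  dm <- dist_mat(data, lat = "lat", lon = "lon", dist_cutoff = max(cutoffs))

  # Reuse the same matrix for every cutoff; data is left untouched in between
  fits <- lapply(cutoffs, function(co) {
    conleyreg(y ~ x1, data, co, dist_mat = dm)
  })
}
```

The data frame is created before dist_mat is called and is not changed afterwards, in line with the requirement that the data must not be modified between the two steps.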

Examples

if (FALSE) {
# Generate cross-sectional example data
data <- rnd_locations(100, output_type = "data.frame")
data$y <- sample(c(0, 1), 100, replace = TRUE)
data$x1 <- stats::runif(100, -50, 50)

# Compute distance matrix in cross-sectional case
dm <- dist_mat(data, lat = "lat", lon = "lon")

# Compute distance matrix in panel case
data$time <- rep(1:10, each = 10)
data$unit <- rep(1:10, times = 10)
dm <- dist_mat(data, unit = "unit", time = "time", lat = "lat", lon = "lon")

# Use distance matrix in conleyreg function
conleyreg(y ~ x1, data, 1000, dist_mat = dm)
}
