fuzzyjoin (version 0.1.5)

geo_join: Join two tables based on a geo distance of longitudes and latitudes

Description

This allows joining based on combinations of longitudes and latitudes. If you are using a distance metric that is *not* based on latitude and longitude, use distance_join instead. Distances are calculated with the distHaversine, distGeo, distCosine, etc. functions in the geosphere package.

Usage

geo_join(x, y, by = NULL, max_dist, method = c("haversine", "geo",
  "cosine", "meeus", "vincentysphere", "vincentyellipsoid"),
  unit = c("miles", "km"), mode = "inner", distance_col = NULL, ...)

geo_inner_join(x, y, by = NULL, method = "haversine", max_dist = 1, distance_col = NULL, ...)

geo_left_join(x, y, by = NULL, method = "haversine", max_dist = 1, distance_col = NULL, ...)

geo_right_join(x, y, by = NULL, method = "haversine", max_dist = 1, distance_col = NULL, ...)

geo_full_join(x, y, by = NULL, method = "haversine", max_dist = 1, distance_col = NULL, ...)

geo_semi_join(x, y, by = NULL, method = "haversine", max_dist = 1, distance_col = NULL, ...)

geo_anti_join(x, y, by = NULL, method = "haversine", max_dist = 1, distance_col = NULL, ...)

Arguments

x

A tbl

y

A tbl

by

Columns by which to join the two tables

max_dist

Maximum distance to use for joining

method

Method to use for computing distance: one of "haversine" (default), "geo", "cosine", "meeus", "vincentysphere", "vincentyellipsoid"

unit

Unit of distance for threshold (default "miles")

mode

One of "inner", "left", "right", "full", "semi", or "anti"

distance_col

If given, will add a column with this name containing the geographical distance between the two matched points

...

Extra arguments passed on to the distance method
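As a rough illustration of how max_dist, unit, and by interact, the sketch below expresses the same threshold once in miles and once in kilometers. The table pts and the variables max_miles and max_km are hypothetical names introduced only for this example. by is given explicitly because the two tables also share the name column, and the longitude column is listed first on the assumption that the geosphere functions expect (longitude, latitude) order.

```r
# Sketch only: the same threshold in miles and in kilometers.
miles_to_km <- 1.609344            # kilometers per international mile
max_miles <- 200
max_km <- max_miles * miles_to_km  # 321.8688

# Hypothetical points; longitude first, then latitude
pts <- data.frame(
  name = c("a", "b", "c"),
  longitude = c(-100, -101, -110),
  latitude = c(40, 40, 40)
)

# Run the joins only when fuzzyjoin is installed
if (requireNamespace("fuzzyjoin", quietly = TRUE)) {
  by_miles <- fuzzyjoin::geo_join(pts, pts,
                                  by = c("longitude", "latitude"),
                                  max_dist = max_miles, unit = "miles",
                                  mode = "inner", distance_col = "dist")
  by_km <- fuzzyjoin::geo_join(pts, pts,
                               by = c("longitude", "latitude"),
                               max_dist = max_km, unit = "km",
                               mode = "inner", distance_col = "dist")
  print(by_miles)
  print(by_km)
}
```

Both calls should select the same pairs of points; only the unit of the reported dist column differs.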

Details

"haversine" was chosen as the default because in some tests it was approximately the fastest method. By far the slowest method is "vincentyellipsoid"; in fuzzy joins it should be used only when there are very few pairs and accuracy is imperative.

If you need to use a custom geo method, you may want to write it directly with the multi_by and multi_match_fun arguments to fuzzy_join.
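A minimal sketch of that approach follows. It uses an equirectangular approximation as a stand-in custom metric; equirect_km, within_100_km, and the pts table are hypothetical names, not part of fuzzyjoin. It assumes that multi_match_fun is called with two matrices holding one row per candidate pair, with columns in the order given by multi_by, and that it should return a logical vector.

```r
# Hypothetical custom metric: equirectangular approximation, in kilometers.
equirect_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  x <- (lon2 - lon1) * to_rad * cos((lat1 + lat2) / 2 * to_rad)
  y <- (lat2 - lat1) * to_rad
  radius_km * sqrt(x^2 + y^2)
}

# Assumed contract: v1 and v2 have one row per candidate pair, with
# columns (longitude, latitude); the result is one logical per pair.
within_100_km <- function(v1, v2) {
  v1 <- matrix(v1, ncol = 2)  # defensive: keep a single pair as a 1-row matrix
  v2 <- matrix(v2, ncol = 2)
  equirect_km(v1[, 1], v1[, 2], v2[, 1], v2[, 2]) <= 100
}

# Hypothetical points; longitude first, then latitude
pts <- data.frame(
  name = c("a", "b", "c"),
  longitude = c(-100, -101, -110),
  latitude = c(40, 40, 40)
)

# Run the join only when fuzzyjoin is installed
if (requireNamespace("fuzzyjoin", quietly = TRUE)) {
  matched <- fuzzyjoin::fuzzy_join(pts, pts,
                                   multi_by = c("longitude", "latitude"),
                                   multi_match_fun = within_100_km,
                                   mode = "inner")
  print(matched)
}
```

The threshold is fixed inside within_100_km for brevity; wrapping it in a function that takes max_dist and returns the matcher would make it reusable.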

Examples

# NOT RUN {
library(dplyr)
library(fuzzyjoin)
data("state")

# find pairs of US states whose centers are within
# 200 miles of each other
states <- tibble(state = state.name,
                 longitude = state.center$x,
                 latitude = state.center$y)

s1 <- rename(states, state1 = state)
s2 <- rename(states, state2 = state)

pairs <- s1 %>%
  geo_inner_join(s2, max_dist = 200) %>%
  filter(state1 != state2)

pairs

# plot them (borders() requires the maps package)
library(ggplot2)
ggplot(pairs, aes(x = longitude.x, y = latitude.x,
                  xend = longitude.y, yend = latitude.y)) +
  geom_segment(color = "red") +
  borders("state") +
  theme_void()

# also get distances
s1 %>%
  geo_inner_join(s2, max_dist = 200, distance_col = "distance")

# }
