matchmaker R package

The goal of {matchmaker} is to provide dictionary-based cleaning for R users in a simple and intuitive manner built on the {forcats} package. Some of the features of this package include:

  • preservation of factor orders
  • ability to specify explicit and implicit missing values
  • option to replace by fuzzy matching (regular expressions, anchored by default)
  • optional variable selection by fuzzy matching
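
To make the "anchored by default" point concrete, here is a minimal sketch using match_vec() (introduced in the Example section below). The toy dictionary and the pattern hi+gh are invented for illustration, and the sketch assumes that the anchor_regex argument of match_vec() controls whether a .regex key must match a whole value.

```r
library("matchmaker")

# Toy dictionary (hypothetical, not part of the package data): a single
# regex key, marked by the ".regex " prefix, and its replacement value.
toy_regex_dict <- data.frame(
  options = ".regex hi+gh",
  values = "High",
  stringsAsFactors = FALSE
)

x <- c("high", "hiiigh", "very high")

# Default: the pattern is anchored, so it is expected to match only values
# that consist entirely of the pattern ("high", "hiiigh").
anchored <- match_vec(x,
  dictionary = toy_regex_dict,
  from = "options",
  to = "values"
)

# Assumption: anchor_regex = FALSE lets the pattern match anywhere in a value.
unanchored <- match_vec(x,
  dictionary = toy_regex_dict,
  from = "options",
  to = "values",
  anchor_regex = FALSE
)

print(anchored)
print(unanchored)
```

Comparing the two printed vectors shows whether "very high" is picked up once anchoring is switched off.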

Installation

You can install {matchmaker} from CRAN:

install.packages("matchmaker")

Example

The matchmaker package has two user-facing functions that perform dictionary-based cleaning:

  • match_vec() will translate the values in a single vector
  • match_df() will translate values in all specified columns of a data frame

Each of these functions has four mandatory options:

  • x: your data. This will be a vector or data frame depending on the function.
  • dictionary: a data frame with at least two columns, one holding the keys (the values to replace) and one holding the values (their replacements)
  • from: a character or number specifying which column contains the keys
  • to: a character or number specifying which column contains the values
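
As a rough sketch of how these four options fit together, the snippet below calls match_vec() on a small character vector. The vector and the toy dictionary are made up for illustration; the dictionary rows reuse the readmission entries shown in the Dictionary section, including the .missing keyword for missing values.

```r
library("matchmaker")

# Toy dictionary (illustrative): keys in "options", replacements in "values".
toy_dict <- data.frame(
  options = c("y", "n", "u", ".missing"),
  values = c("Yes", "No", "Unknown", "Missing"),
  stringsAsFactors = FALSE
)

# Toy data: coded answers with one missing value.
toy_x <- c("y", "n", NA, "u", "y")

# x, dictionary, from and to are the four options described above;
# from and to are given here as column names.
toy_cleaned <- match_vec(toy_x,
  dictionary = toy_dict,
  from = "options",
  to = "values"
)

print(toy_cleaned)
```

Here from and to could equally be given as column positions (1 and 2), since both accept a character or a number.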

Most users will work with match_df() to transform values across specific columns. A typical workflow is to:

  1. construct your dictionary in a spreadsheet program based on your data
  2. read in your data and dictionary to data frames in R
  3. match!

library("matchmaker")

# Read in data set
dat <- read.csv(matchmaker_example("coded-data.csv"),
  stringsAsFactors = FALSE
)
dat$date <- as.Date(dat$date)

# Read in dictionary
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
  stringsAsFactors = FALSE
)

Data

This is the top of our data set, generated for example purposes.

id      date        readmission  treated  facility  age_group  lab_result_01  lab_result_02  lab_result_03  has_symptoms  followup
ef267c  2019-07-08  NA           0        C         10         unk            high           inc            NA            u
e80a37  2019-07-07  y            0        3         10         inc            unk            norm           y             oui
b72883  2019-07-07  y            1        8         30         inc            norm           inc                          oui
c9ee86  2019-07-09  n            1        4         40         inc            inc            unk            y             oui
40bc7a  2019-07-12  n            1        6         0          norm           unk            norm           NA            n
46566e  2019-07-14  y            NA       B         50         unk            unk            inc            NA            NA

Dictionary

The dictionary looks like this:

options   values        grp                  orders
y         Yes           readmission          1
n         No            readmission          2
u         Unknown       readmission          3
.missing  Missing       readmission          4
0         Yes           treated              1
1         No            treated              2
.missing  Missing       treated              3
1         Facility 1    facility             1
2         Facility 2    facility             2
3         Facility 3    facility             3
4         Facility 4    facility             4
5         Facility 5    facility             5
6         Facility 6    facility             6
7         Facility 7    facility             7
8         Facility 8    facility             8
9         Facility 9    facility             9
10        Facility 10   facility             10
.default  Unknown       facility             11
0         0-9           age_group            1
10        10-19         age_group            2
20        20-29         age_group            3
30        30-39         age_group            4
40        40-49         age_group            5
50        50+           age_group            6
high      High          .regex ^lab_result_  1
norm      Normal        .regex ^lab_result_  2
inc       Inconclusive  .regex ^lab_result_  3
y         yes           .global              Inf
n         no            .global              Inf
u         unknown       .global              Inf
unk       unknown       .global              Inf
oui       yes           .global              Inf
.missing  missing       .global              Inf

Matching

# Clean spelling based on dictionary -----------------------------
cleaned <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp"
)
head(cleaned)
#>       id       date readmission treated    facility age_group
#> 1 ef267c 2019-07-08     Missing     Yes     Unknown     10-19
#> 2 e80a37 2019-07-07         Yes     Yes Facility  3     10-19
#> 3 b72883 2019-07-07         Yes      No Facility  8     30-39
#> 4 c9ee86 2019-07-09          No      No Facility  4     40-49
#> 5 40bc7a 2019-07-12          No      No Facility  6       0-9
#> 6 46566e 2019-07-14         Yes Missing     Unknown       50+
#>   lab_result_01 lab_result_02 lab_result_03 has_symptoms followup
#> 1       unknown          High  Inconclusive      missing  unknown
#> 2  Inconclusive       unknown        Normal          yes      yes
#> 3  Inconclusive        Normal  Inconclusive      missing      yes
#> 4  Inconclusive  Inconclusive       unknown          yes      yes
#> 5        Normal       unknown        Normal      missing       no
#> 6       unknown       unknown  Inconclusive      missing  missing
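
The output above suggests how the grp column is interpreted: a plain column name restricts entries to that column, a .regex entry selects every column whose name matches the pattern, and .global entries apply to all columns, with column-specific entries taking priority (readmission "y" becomes "Yes", not "yes"). The sketch below reproduces this on a toy data frame; the data and dictionary are invented for illustration and mirror a subset of the rows shown earlier.

```r
library("matchmaker")

# Toy data (illustrative): three coded columns.
toy_dat <- data.frame(
  readmission = c("y", "n", NA),
  lab_result_01 = c("high", "unk", "inc"),
  followup = c("oui", "u", "n"),
  stringsAsFactors = FALSE
)

# Toy dictionary with the three kinds of grp entries seen above:
# a column name, a ".regex" column selector, and ".global".
toy_df_dict <- data.frame(
  options = c(
    "y", "n", ".missing",
    "high", "inc",
    "y", "n", "u", "unk", "oui"
  ),
  values = c(
    "Yes", "No", "Missing",
    "High", "Inconclusive",
    "yes", "no", "unknown", "unknown", "yes"
  ),
  grp = c(
    rep("readmission", 3),
    rep(".regex ^lab_result_", 2),
    rep(".global", 5)
  ),
  stringsAsFactors = FALSE
)

toy_clean <- match_df(toy_dat,
  dictionary = toy_df_dict,
  from = "options",
  to = "values",
  by = "grp"
)

print(toy_clean)
```

Keeping shared spellings in .global and column-specific codes under their own grp value avoids repeating the same key for every column.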

Version

0.1.1

License

GPL-3

Last Published

February 21st, 2020

Functions in matchmaker (0.1.1)