compare: Compare similar data sets

Description

Compare versions of a data set by evaluating each against a set of rules or other quality indicators. This function takes two or more data sets and compares the performance of data sets \(2,3,\ldots\) against that of the first data set (default) or against the previous one (by setting how='sequential').

Usage

compare(x, ...)
# S4 method for validator
compare(x, ..., .list = list(), how = c("to_first",
  "sequential"))
# S4 method for indicator
compare(x, ..., .list = NULL)

Arguments

x

An R object

...

data frames, comma separated. Names become column names in the output.

.list

Optional list of data sets. It is concatenated with the data sets passed in ....

how

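How to compare: "to_first" (default) compares each data set to the first one; "sequential" compares each data set to the previous one.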

Value

For validator: An array where each column represents one dataset. The rows count the following attributes:

Number of validations performed
Number of validations that evaluate to NA (unverifiable)
Number of validations that evaluate to a logical (verifiable)
Number of validations that evaluate to TRUE
Number of validations that evaluate to FALSE
Number of extra validations that evaluate to NA (new unverifiable)
Number of validations that still evaluate to NA (still unverifiable)
Number of validations that still evaluate to TRUE
Number of extra validations that evaluate to TRUE
Number of validations that still evaluate to FALSE
Number of extra validations that evaluate to FALSE
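
The first five rows can be illustrated with a minimal base R sketch. It assumes the validation outcomes for one data set are available as a logical vector (TRUE = satisfied, FALSE = violated, NA = unverifiable). The vector `current` and the names in `counts` are hypothetical and chosen for illustration; this is not the package's own code.

```r
# Hypothetical validation outcomes for a single data set.
current <- c(TRUE, FALSE, NA, TRUE, NA, FALSE, TRUE)

# Tally the basic attributes described in the list above.
counts <- c(
  validations  = length(current),            # all validations performed
  unverifiable = sum(is.na(current)),        # evaluate to NA
  verifiable   = sum(!is.na(current)),       # evaluate to a logical
  satisfied    = sum(current, na.rm = TRUE), # evaluate to TRUE
  violated     = sum(!current, na.rm = TRUE) # evaluate to FALSE
)
print(counts)
```

Under these assumptions the verifiable and unverifiable counts add up to the number of validations, and the satisfied and violated counts add up to the verifiable count.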

For indicator: A list with the following components:

numeric: An array collecting results of numeric scalar indicators (e.g. mean(x)).
nonnumeric: An array collecting results of nonnumeric scalar indicators (e.g. names(which.max(table(x))))
array: A list of arrays, collecting results of vector-indicators (e.g. x/mean(x))

Comparing datasets by performance against validator objects

Suppose we have a current and a previous version of a data set. Both can be inspected by confronting them with a rule set. The status changes in rule violations can be partitioned as shown in the following figure.

[Figure: cellwise splitting]

This function computes the partition for two or more datasets, comparing the current set to the first (default) or to the previous (by setting how='sequential').
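
As a rough illustration of such a partition, the base R sketch below compares two logical vectors of validation outcomes for the same cells. The vectors `previous` and `current`, the helpers `is_true` and `is_false`, and the reading of "still" (same status in both versions) versus "new" (status reached only in the current version) are assumptions made for this example; the package may define the categories differently.

```r
# Hypothetical outcomes for the same seven validations in two versions.
previous <- c(TRUE, TRUE,  FALSE, FALSE, NA, NA,   TRUE)
current  <- c(TRUE, FALSE, TRUE,  FALSE, NA, TRUE, NA)

# NA-safe helpers: an NA outcome is neither TRUE nor FALSE.
is_true  <- function(v) !is.na(v) & v
is_false <- function(v) !is.na(v) & !v

# Split each current status by whether the previous version had it too.
partition <- c(
  still_unverifiable = sum(is.na(current) & is.na(previous)),
  new_unverifiable   = sum(is.na(current) & !is.na(previous)),
  still_satisfied    = sum(is_true(current) & is_true(previous)),
  new_satisfied      = sum(is_true(current) & !is_true(previous)),
  still_violated     = sum(is_false(current) & is_false(previous)),
  new_violated       = sum(is_false(current) & !is_false(previous))
)
print(partition)
```

Because every validation falls into exactly one of the six categories, the counts sum to the total number of validations.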

References

The figure is reproduced from MPJ van der Loo and E. De Jonge (2018) Statistical Data Cleaning with applications in R (John Wiley & Sons).

Examples


# NOT RUN {
library(validate)
data(retailers)

rules <- validator(turnover >= 0, staff >= 0, other.rev >= 0)

# start with raw data
step0 <- retailers

# impute turnovers
step1 <- step0
step1$turnover[is.na(step1$turnover)] <- mean(step1$turnover, na.rm = TRUE)

# flip sign of negative revenues
step2 <- step1
step2$other.rev <- abs(step2$other.rev)
  
# create an overview of differences, comparing to the previous step
compare(rules, raw = step0, imputed = step1, flipped = step2, how = "sequential")

# create an overview of differences compared to raw data
out <- compare(rules, raw = step0, imputed = step1, flipped = step2)
out

# graphical overview
plot(out)
barplot(out)

# transform data to data.frame (easy for use with ggplot)
as.data.frame(out)


# }
