checkDataIntegrity: Conveniently checking data integrity

Description

This function is designed to make it easy to perform some data integrity checks, specifically checking for values that are impossible or unrealistic. These values can then be replaced by another value, or the offending cases can be deleted from the dataframe.

Usage

checkDataIntegrity(x,
                   dat,
                   newValue = NA,
                   removeCases = FALSE,
                   validValueSuffix = "_validValue",
                   newValueSuffix = "_newValue",
                   totalVarName = "numberOfInvalidValues",
                   append = TRUE,
                   replace = TRUE,
                   silent = FALSE,
                   rmarkdownOutput = FALSE,
                   callingSelf = FALSE)

Arguments

x

This can be either a vector or a list. If it is a vector, it should have two elements: the first is a regular expression matching one or more variables in the dataframe specified in dat, and the second is the condition the matching variables have to satisfy. If it is a list, it should be a list of such vectors. Each condition should start with a comparison operator followed by a value (e.g. "<30" or ">=0").
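To make the meaning of such a pair concrete, here is a minimal base R sketch of how one c(regex, condition) vector could be interpreted. The helper applyCheck is hypothetical and written only for illustration; it is not part of the package and does not claim to reproduce how checkDataIntegrity works internally.

```r
# Hypothetical helper, not part of the package: applies one
# c(regex, condition) pair to a dataframe using base R only.
applyCheck <- function(check, dat, newValue = NA) {
  # Names of the variables matched by the regular expression
  varNames <- grep(check[1], names(dat), value = TRUE)
  for (varName in varNames) {
    # Build an expression such as "dat[[varName]] <30" and evaluate it,
    # yielding TRUE/FALSE for every case
    valid <- eval(parse(text = paste("dat[[varName]]", check[2])))
    # Replace values that fail the condition; missing values are left alone
    dat[[varName]][!is.na(valid) & !valid] <- newValue
  }
  dat
}

checked <- applyCheck(c("mpg", "<30"), mtcars)
sum(is.na(checked$mpg))  # number of mpg values that were not below 30
```

Evaluating the condition as text keeps the sketch short; it is an assumption made for this illustration, not a statement about the function's implementation.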

dat

The dataframe containing the variables of which we should check the integrity.

newValue

The new value to be assigned to cases not satisfying the specified conditions.

removeCases

Whether to delete cases that do not satisfy the criterion from the dataframe (if FALSE, they're not deleted, but the offending value is replaced by newValue).

validValueSuffix

Suffix appended to each original variable name to name a new logical variable that indicates, for each case, whether the original value satisfied the specified criterion (TRUE) or not (FALSE).

newValueSuffix

If replace is FALSE, original values are not replaced, but instead new variables are created where the offending values have been replaced. This suffix is appended to each original variable name to create the new variable name.

totalVarName

This is the name of a variable that contains, for each case, the total number of invalid values among all variables checked.
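The relationship between the per-variable validity indicators and the per-case total can be sketched in base R as follows. This is an illustration under assumptions: the column names simply combine the variable names with the documented default suffix and total name, and the exact columns the function returns may differ.

```r
# Base R illustration only; names mimic the documented defaults.
checks <- list(c("mpg", "<30"), c("gear", "<5"))

# Empty dataframe with one row per case in mtcars
flags <- data.frame(row.names = rownames(mtcars))

for (check in checks) {
  for (varName in grep(check[1], names(mtcars), value = TRUE)) {
    values <- mtcars[[varName]]
    # TRUE where the value satisfies the condition, FALSE otherwise
    flags[[paste0(varName, "_validValue")]] <-
      eval(parse(text = paste("values", check[2])))
  }
}

# Per-case count of values that failed their condition
flags$numberOfInvalidValues <- rowSums(!as.matrix(flags))
head(flags)
```

A total like this makes it easy to spot cases with several invalid values, for example with flags[flags$numberOfInvalidValues > 1, ].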

append

Whether to append the columns to the dataframe, or only return the new columns.

replace

Whether to replace the offending values with the value specified in newValue or whether to create new columns (see newValueSuffix).

silent

Whether to suppress display of the log and only store it as an attribute of the returned dataframe (if FALSE, the log is also displayed).

rmarkdownOutput

Whether to format the log so that it's ready to be included in RMarkdown reports.

callingSelf

For internal use; whether the function calls itself.

Value

The dataframe with the corrections, and the log stored in attribute checkDataIntegrity_log.
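As a short sketch of retrieving the log afterwards, the following uses only the arguments and the attribute name documented on this page. It assumes the package providing checkDataIntegrity has already been loaded; the exists() guard is there only so the snippet does nothing when that is not the case.

```r
# Assumes checkDataIntegrity is available in the current session;
# the guard skips the example otherwise.
if (exists("checkDataIntegrity")) {
  # silent = TRUE: do not display the log, only store it
  checked <- checkDataIntegrity(c("mpg", "<30"),
                                mtcars,
                                silent = TRUE)

  # The log is stored as an attribute of the returned dataframe
  integrityLog <- attr(checked, "checkDataIntegrity_log")
  print(integrityLog)
}
```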

Examples

# NOT RUN {
### Default behavior: return dataframe with
### offending values replaced by NA

checkDataIntegrity(c('mpg', '<30'),
                   mtcars)

### Check two conditions, and instead of returning the
### dataframe with the results appended, only return the
### columns indicating which cases 'pass', what the new
### values would be, and how many invalid values were
### found for each case (to easily remove cases that
### provided many invalid values)

checkDataIntegrity(list(c('mpg', '<30'),
                        c('gear', '<5')),
                   mtcars,
                   append = FALSE)
# }