dataPreparation (version 0.1)

whichAreIncluded: Identify columns that are included in others

Description

Find all columns that contain no more information than another column. For example, if one column holds an amount and another holds the same amount rounded, the second column is included in the first.
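
To make the notion of inclusion concrete, here is a minimal base R sketch. It assumes that "b is included in a" means every distinct value of a maps to a single value of b. The helper is_included is hypothetical and is not part of dataPreparation; the package's own implementation may differ.

```r
# Hypothetical helper (not part of dataPreparation): b is treated as included
# in a when each distinct value of a is paired with exactly one value of b.
is_included <- function(a, b) {
  pairs <- unique(data.frame(a = a, b = b))
  nrow(pairs) == length(unique(a))
}

amount  <- c(10.2, 10.4, 3.4, 10.2)
rounded <- round(amount)  # 10, 10, 3, 10

is_included(amount, rounded)  # TRUE: amount determines rounded
is_included(rounded, amount)  # FALSE: 10 maps to both 10.2 and 10.4
```

The check is asymmetric: the rounded column can be rebuilt from the amount, but not the other way around.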

Usage

whichAreIncluded(dataSet, verbose = TRUE)

Arguments

dataSet

Matrix, data.frame or data.table

verbose

Should the function print messages about what it is doing? (logical, defaults to TRUE)

Value

A list of the indices of the columns that are included in another column of dataSet.

Details

This function performs an exhaustive search over every pair of columns, so it can be slow on data sets with many columns. Be very careful when using this function:

- if there is an id column, every other column will be reported as included in it,
- the order of the columns influences the result.

And last but not least, when using a machine learning algorithm it is not always smart to drop columns even if they carry no additional information: the extreme example is the id column.
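
The two cautions above can be reproduced with a small pairwise scan. This is only an illustrative sketch under the same assumption about what inclusion means; is_included and find_included are hypothetical helpers, not the package's code.

```r
# Hypothetical helpers (not part of dataPreparation).
is_included <- function(a, b) {
  nrow(unique(data.frame(a = a, b = b))) == length(unique(a))
}

# Scan every ordered pair of columns; once a column is flagged as included
# in another one, it is no longer used as a reference or tested again.
find_included <- function(df) {
  included <- integer(0)
  for (i in seq_along(df)) {
    if (i %in% included) next
    for (j in seq_along(df)) {
      if (j == i || j %in% included) next
      if (is_included(df[[i]], df[[j]])) included <- c(included, j)
    }
  }
  sort(included)
}

# An id column determines every other column, so everything else is flagged.
with_id <- data.frame(id = 1:4,
                      amount = c(10.2, 10.4, 3.4, 10.2),
                      rounded = c(10, 10, 3, 10))
names(with_id)[find_included(with_id)]  # "amount" "rounded"

# Two columns in bijection: whichever comes first is kept, the other flagged.
xy <- data.frame(x = 1:3, y = c("a", "b", "c"))
names(xy)[find_included(xy)]                # "y"
names(xy)[find_included(xy[, c("y", "x")])] # "y" when indexing xy by position 2? see note
```

Note that find_included returns positions in the data frame it was given, so always index the same (reordered) data frame when translating positions to names: with columns ordered as y, x, position 2 refers to x.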

Examples

# NOT RUN {
# Load the toy data set
library(data.table)
library(dataPreparation)
data(messy_adult)

# Check for included columns
whichAreIncluded(messy_adult)

# The result also includes columns that are constant, duplicated, or in bijection with another column.
# Let's add a column that is strictly included in another one
messy_adult$are50OrMore <- messy_adult$age > 50
whichAreIncluded(messy_adult)

# As one can see, this column, which carries no more information than age, is spotted.

# But be careful: if there is an id column, every other column will be flagged as included:
messy_adult$id <- seq_len(nrow(messy_adult)) # build id
setcolorder(messy_adult, c("id", setdiff(names(messy_adult), "id"))) # Set id as first column
whichAreIncluded(messy_adult)
# }