fscaret (version 0.8.5.1)

dataPreprocess: dataPreprocess

Description

The functionality is realized in two main steps:
  1. Check for near zero variance predictors and flag as near zero if:
    1. the percentage of unique values is less than 20% and
    2. the ratio of the most frequent to the second most frequent value is greater than 20,
  2. Check for susceptibility to multicollinearity
    1. Calculate correlation matrix
    2. Find variables with correlation 0.9 or more and delete them

Usage

dataPreprocess(trainMatryca_nr, testMatryca_nr, labelsFrame, lk_col, lk_row, with.labels)

Arguments

trainMatryca_nr
Input training data matrix
testMatryca_nr
Input testing data matrix
labelsFrame
Transposed data frame of column names
lk_col
Number of columns
lk_row
Number of rows
with.labels
If with.labels=TRUE, additional data frame with preprocessed inputs corresponding to original data set column numbers as output is generated

References

Kuhn M. (2008) Building Predictive Models in R Using the caret Package Journal of Statistical Software 28(5) http://www.jstatsoft.org/.

Examples

Run this code
library(fscaret)

# Create data sets and labels data frame
trainMatrix <- matrix(rnorm(150*120,mean=10,sd=1), 150, 120)

# Adding some near-zero variance attributes

temp1 <- matrix(runif(150,0.0001,0.0005), 150, 12)

# Adding some highly correlated attributes

sampleColIndex <- sample(ncol(trainMatrix), size=10)

temp2 <- matrix(trainMatrix[,sampleColIndex]*2, 150, 10)

# Output variable

output <- matrix(rnorm(150,mean=10,sd=1), 150, 1)

trainMatrix <- cbind(trainMatrix,temp1,temp2, output)

colnames(trainMatrix) <- paste("X",c(1:ncol(trainMatrix)),sep="")

# Subset test data set

testMatrix <- trainMatrix[sample(round(0.1*nrow(trainMatrix))),]

labelsDF <- data.frame("Labels"=paste("X",c(1:ncol(trainMatrix)),sep=""))

lk_col <- ncol(trainMatrix)
lk_row <- nrow(trainMatrix)

with.labels = TRUE

testRes <- dataPreprocess(trainMatrix, testMatrix,
			  labelsDF, lk_col, lk_row, with.labels)
			  
summary(testRes)

# Selected attributes after data set preprocessing
testRes$labelsDF

# Training and testing data sets after preprocessing
testRes$trainMatryca
testRes$testMatryca

Run the code above in your browser using DataCamp Workspace