dataPreprocess
The preprocessing is carried out in two main steps:
1. Check for near-zero-variance predictors. A predictor is flagged as near zero variance if:
   - the percentage of unique values is less than 20%
   - the ratio of the most frequent value to the second most frequent value is greater than 20
2. Check for susceptibility to multicollinearity:
   - calculate the correlation matrix
   - find variables with a correlation of 0.9 or more and delete them
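The two checks above can be sketched in base R. This is a minimal illustration under stated assumptions, not the fscaret implementation: the helper names `is_near_zero_var` and `drop_correlated` are hypothetical, the two near-zero-variance conditions are assumed to apply jointly (as in caret's `nearZeroVar`), and the correlation filter simply drops the later column of each highly correlated pair.

```r
# Hypothetical helper: flag a numeric vector as near zero variance.
# Assumption: both thresholds must be met at the same time.
is_near_zero_var <- function(x, unique_cut = 20, freq_cut = 20) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) == 1) return(TRUE)        # constant column
  pct_unique <- 100 * length(counts) / length(x)
  freq_ratio <- counts[[1]] / counts[[2]]      # most frequent / second most frequent
  pct_unique < unique_cut && freq_ratio > freq_cut
}

# Hypothetical helper: return the indices of columns to keep.
# A column is dropped when its absolute correlation with an
# already kept column is at or above the cutoff.
drop_correlated <- function(m, cutoff = 0.9) {
  cm <- abs(cor(m))
  keep <- integer(0)
  for (j in seq_len(ncol(m))) {
    if (!any(cm[j, keep] >= cutoff, na.rm = TRUE)) keep <- c(keep, j)
  }
  keep
}

set.seed(1)
m <- cbind(
  a = rnorm(100),
  b = c(rep(0, 99), 1),          # 2% unique values, frequency ratio 99
  c = rnorm(100)
)
m <- cbind(m, d = 2 * m[, "a"])  # perfectly correlated with column a

# Step 1: remove near-zero-variance columns
nzv <- apply(m, 2, is_near_zero_var)
m1 <- m[, !nzv, drop = FALSE]

# Step 2: remove highly correlated columns
kept <- m1[, drop_correlated(m1), drop = FALSE]
colnames(kept)
```

In this toy matrix, column `b` is removed by the first step and column `d` by the second. caret's `findCorrelation` uses a different rule for deciding which column of a correlated pair to remove, so results on real data may differ from this sketch.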
Usage
dataPreprocess(trainMatryca_nr, testMatryca_nr, labelsFrame, lk_col, lk_row, with.labels)
Arguments
- trainMatryca_nr
Input training data matrix
- testMatryca_nr
Input testing data matrix
- labelsFrame
Transposed data frame of column names
- lk_col
Number of columns in the input training data matrix
- lk_row
Number of rows in the input training data matrix
- with.labels
If with.labels=TRUE, an additional data frame is returned that maps the preprocessed inputs to their column numbers in the original data set
References
Kuhn M. (2008) Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5). http://www.jstatsoft.org/
Examples
# NOT RUN {
library(fscaret)
# Create data sets and labels data frame
trainMatrix <- matrix(rnorm(150*120,mean=10,sd=1), 150, 120)
# Adding some near-zero variance attributes
temp1 <- matrix(runif(150,0.0001,0.0005), 150, 12)
# Adding some highly correlated attributes
sampleColIndex <- sample(ncol(trainMatrix), size=10)
temp2 <- matrix(trainMatrix[,sampleColIndex]*2, 150, 10)
# Output variable
output <- matrix(rnorm(150,mean=10,sd=1), 150, 1)
trainMatrix <- cbind(trainMatrix,temp1,temp2, output)
colnames(trainMatrix) <- paste("X",c(1:ncol(trainMatrix)),sep="")
# Draw a random 10% of the rows as the test data set
testMatrix <- trainMatrix[sample(nrow(trainMatrix), round(0.1*nrow(trainMatrix))),]
labelsDF <- data.frame("Labels"=paste("X",c(1:(ncol(trainMatrix)-1)),sep=""))
lk_col <- ncol(trainMatrix)
lk_row <- nrow(trainMatrix)
with.labels <- TRUE
testRes <- dataPreprocess(trainMatrix, testMatrix,
                          labelsDF, lk_col, lk_row, with.labels)
summary(testRes)
# Selected attributes after data set preprocessing
testRes$labelsDF
# Training and testing data sets after preprocessing
testRes$trainMatryca
testRes$testMatryca
# }