dataPreprocess
From fscaret v0.8.5.6
by Jakub Szlek
The functionality is realized in two main steps:
- Check for near zero variance predictors; a predictor is flagged as near zero variance if:
  - the percentage of unique values is less than 20%, and
  - the ratio of the most frequent to the second most frequent value is greater than 20.
- Check for susceptibility to multicollinearity:
  - calculate the correlation matrix,
  - find variables with a correlation of 0.9 or more and delete them.
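The two steps can be sketched in base R as follows. This is only an illustration of the checks described above, not the code fscaret runs internally. The helper `is_nzv`, the toy matrix `x`, and the choice to compare absolute correlations and drop the later column of each correlated pair are all assumptions made for the example.

```r
# Illustrative sketch only; NOT the fscaret implementation.
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)
x <- cbind(x, c(rep(0, 99), 1))   # column 6: near zero variance
x <- cbind(x, x[, 1] * 2)         # column 7: perfectly correlated with column 1

# Step 1: flag near zero variance predictors
# (hypothetical helper; thresholds taken from the description above)
is_nzv <- function(v, unique_cut = 20, freq_cut = 20) {
  counts <- sort(table(v), decreasing = TRUE)
  pct_unique <- 100 * length(counts) / length(v)
  # Ratio of the most frequent to the second most frequent value
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique < unique_cut && freq_ratio > freq_cut
}
nzv_cols <- which(apply(x, 2, is_nzv))
keep <- setdiff(seq_len(ncol(x)), nzv_cols)

# Step 2: correlation matrix of the remaining predictors
# (assumption: absolute correlation is compared against 0.9)
cm <- abs(cor(x[, keep, drop = FALSE]))
cm[upper.tri(cm, diag = TRUE)] <- 0   # look at each pair only once
# Drop the later column of every pair with correlation of 0.9 or more
drop_local <- which(apply(cm, 1, function(r) any(r >= 0.9)))
final_keep <- setdiff(keep, keep[drop_local])

x_clean <- x[, final_keep, drop = FALSE]
dim(x_clean)
```

In this toy data, column 6 is removed by the first step and column 7 by the second, so only the five independent random columns remain.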
Usage
dataPreprocess(trainMatryca_nr, testMatryca_nr, labelsFrame, lk_col, lk_row, with.labels)
Arguments
- trainMatryca_nr
  - Input training data matrix
- testMatryca_nr
  - Input testing data matrix
- labelsFrame
  - Transposed data frame of column names
- lk_col
  - Number of columns
- lk_row
  - Number of rows
- with.labels
  - If with.labels=TRUE, the output also includes a data frame that maps the preprocessed inputs to their column numbers in the original data set
References
Kuhn M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5).
Examples
library(fscaret)
# Create data sets and labels data frame
trainMatrix <- matrix(rnorm(150*120,mean=10,sd=1), 150, 120)
# Adding some near-zero variance attributes
temp1 <- matrix(runif(150,0.0001,0.0005), 150, 12)
# Adding some highly correlated attributes
sampleColIndex <- sample(ncol(trainMatrix), size=10)
temp2 <- matrix(trainMatrix[,sampleColIndex]*2, 150, 10)
# Output variable
output <- matrix(rnorm(150,mean=10,sd=1), 150, 1)
trainMatrix <- cbind(trainMatrix,temp1,temp2, output)
colnames(trainMatrix) <- paste("X",c(1:ncol(trainMatrix)),sep="")
# Subset test data set (random 10% of the rows)
testMatrix <- trainMatrix[sample(nrow(trainMatrix), round(0.1*nrow(trainMatrix))),]
labelsDF <- data.frame("Labels"=paste("X",c(1:ncol(trainMatrix)),sep=""))
lk_col <- ncol(trainMatrix)
lk_row <- nrow(trainMatrix)
with.labels <- TRUE
testRes <- dataPreprocess(trainMatrix, testMatrix,
                          labelsDF, lk_col, lk_row, with.labels)
summary(testRes)
# Selected attributes after data set preprocessing
testRes$labelsDF
# Training and testing data sets after preprocessing
testRes$trainMatryca
testRes$testMatryca