# dataPreprocess

##### dataPreprocess

The functionality is realized in two main steps:

1. Check for near-zero-variance predictors and flag a predictor as near zero variance if:
   1. the percentage of unique values is less than 20%, and
   2. the ratio of the most frequent to the second most frequent value is greater than 20.
2. Check for susceptibility to multicollinearity:
   1. Calculate the correlation matrix.
   2. Find variables with a correlation of 0.9 or more and delete them.
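
The two steps above can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper names `is_near_zero_var` and `drop_correlated` are hypothetical, the use of absolute correlation is an assumption, and keeping the first column of each correlated pair is one possible deletion policy that may differ from what `dataPreprocess` does.

```r
# Hypothetical helper (not part of fscaret): flags a column as near zero
# variance using the two thresholds described above.
is_near_zero_var <- function(x, unique_cut = 20, freq_cut = 20) {
  counts <- sort(table(x), decreasing = TRUE)
  pct_unique <- 100 * length(counts) / length(x)
  # A constant column has no second value; treat its ratio as infinite
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique < unique_cut && freq_ratio > freq_cut
}

# Hypothetical helper: for each pair of columns whose absolute correlation
# (assumption) reaches the cutoff, keep the first column and drop the second.
drop_correlated <- function(m, cutoff = 0.9) {
  cm <- abs(cor(m))
  keep <- rep(TRUE, ncol(m))
  for (j in seq_len(ncol(m))) {
    if (!keep[j]) next
    for (k in seq_len(ncol(m))) {
      # isTRUE() guards against NA correlations from constant columns
      if (k > j && keep[k] && isTRUE(cm[j, k] >= cutoff)) keep[k] <- FALSE
    }
  }
  m[, keep, drop = FALSE]
}

set.seed(1)
x1 <- rnorm(100)
demo <- cbind(a = x1,                  # ordinary predictor
              b = 2 * x1,              # perfectly correlated with a
              c = rnorm(100),          # independent predictor
              d = c(rep(0, 99), 1))    # near zero variance

# Step 1: remove near-zero-variance columns
nzv <- apply(demo, 2, is_near_zero_var)
# Step 2: remove highly correlated columns from what remains
filtered <- drop_correlated(demo[, !nzv, drop = FALSE])
colnames(filtered)
```

In this toy data, column `d` has 2% unique values and a frequency ratio of 99, so it meets both near-zero-variance conditions, while `b` is an exact multiple of `a` and is removed in the correlation step.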

Keywords
robust, univar
##### Usage
dataPreprocess(trainMatryca_nr, testMatryca_nr, labelsFrame, lk_col, lk_row, with.labels)
##### Arguments
trainMatryca_nr
Input training data matrix
testMatryca_nr
Input testing data matrix
labelsFrame
Transposed data frame of column names
lk_col
Number of columns
lk_row
Number of rows
with.labels
If with.labels=TRUE, an additional data frame is generated as output, mapping the preprocessed inputs to their column numbers in the original data set
##### References

Kuhn M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5). http://www.jstatsoft.org/

##### Aliases
• dataPreprocess
##### Examples
library(fscaret)

# Create data sets and labels data frame
trainMatrix <- matrix(rnorm(150*120,mean=10,sd=1), 150, 120)

# Adding some near-zero variance attributes

temp1 <- matrix(runif(150,0.0001,0.0005), 150, 12)

# Adding some highly correlated attributes

sampleColIndex <- sample(ncol(trainMatrix), size=10)

temp2 <- matrix(trainMatrix[,sampleColIndex]*2, 150, 10)

# Output variable

output <- matrix(rnorm(150,mean=10,sd=1), 150, 1)

trainMatrix <- cbind(trainMatrix,temp1,temp2, output)

colnames(trainMatrix) <- paste("X",c(1:ncol(trainMatrix)),sep="")

# Subset test data set

testMatrix <- trainMatrix[sample(round(0.1*nrow(trainMatrix))),]

labelsDF <- data.frame("Labels"=paste("X",c(1:ncol(trainMatrix)),sep=""))

lk_col <- ncol(trainMatrix)
lk_row <- nrow(trainMatrix)

with.labels = TRUE

testRes <- dataPreprocess(trainMatrix, testMatrix,
labelsDF, lk_col, lk_row, with.labels)

summary(testRes)

# Selected attributes after data set preprocessing
testRes$labelsDF

# Training and testing data sets after preprocessing
testRes$trainMatryca
testRes$testMatryca

Documentation reproduced from package fscaret, version 0.8.5.1, License: GPL-2 | GPL-3
