vtreat-package: vtreat: a package for simple variable treatment

Description

vtreat is a package that treats variables so that models built on them can be used safely in production. Common problems vtreat defends against: NA values, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training). Use prepare as you would use model.matrix. The variables prepare produces are all numeric and never take the value NA, so they are safe to use in modeling.

Details

Package: vtreat
Type: Package
Version: 0.5.16
Date: 2015-09-12
License: GNU General Public License version 3

References

See:
http://www.win-vector.com/blog/2014/06/r-minitip-dont-use-data-matrix-when-you-mean-model-matrix/
http://www.win-vector.com/blog/2012/07/modeling-trick-impact-coding-of-categorical-variables-with-many-levels/
http://practicaldatascience.com/
"Effect codes" in Cohen et al., Applied multiple regression/correlation for the behavioral sciences.

First, build a list of variable treatments from your training data using designTreatmentsC (for models predicting binary categorical outcomes) or designTreatmentsN (for models predicting numeric outcomes). If you have enough data, we suggest running the design step on a subset of the data that is disjoint from both the training and test sets; this avoids many issues, including miscounting degrees of freedom on effect or impact codes arising from categorical variables with a large number of levels. Then apply the list of treatments to a data frame using prepare to get a treated data frame. All of the code assumes we are working only with rows where the outcome (y-value) is finite and is neither NA nor NaN.
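The workflow above can be sketched as follows. This is a minimal illustration rather than the package's prescribed recipe: the synthetic data frame, the three-way split, and the names d, grp, dDesign, dTrain, and dTest are all assumptions made for the example. Only designTreatmentsC and prepare come from vtreat, called with the same argument pattern as the package examples.

```r
library(vtreat)  # assumes the vtreat package is installed

set.seed(2015)
n <- 90
# Hypothetical data: one categorical input, one numeric input, a logical outcome.
d <- data.frame(
  x = sample(c('a', 'b', 'c', NA), n, replace = TRUE),
  z = ifelse(runif(n) < 0.1, NA, rnorm(n)),
  stringsAsFactors = FALSE
)
d$y <- (ifelse(is.na(d$z), 0, d$z) + rnorm(n)) > 0
d$y[c(3, 17)] <- NA  # simulate rows with a missing outcome

# Keep only rows with a usable outcome
# (for a numeric outcome, is.finite(d$y) also drops NaN and Inf).
d <- d[!is.na(d$y), , drop = FALSE]

# Assign every row to exactly one of three disjoint groups.
grp <- sample(rep(c('design', 'train', 'test'), length.out = nrow(d)))
dDesign <- d[grp == 'design', , drop = FALSE]
dTrain  <- d[grp == 'train',  , drop = FALSE]
dTest   <- d[grp == 'test',   , drop = FALSE]

# Design the treatments on the design rows only, then apply them elsewhere.
vars <- setdiff(colnames(d), 'y')
treatments <- designTreatmentsC(dDesign, vars, 'y', TRUE)
dTrainTreated <- prepare(treatments, dTrain, pruneSig = c())
dTestTreated  <- prepare(treatments, dTest,  pruneSig = c())
```

A model would then be fit on dTrainTreated and evaluated on dTestTreated, so the rows used to estimate the treatments are never reused for fitting or scoring.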

Examples

# categorical example
dTrainC <- data.frame(x=c('a','a','a','b','b',NA,NA),
   z=c(1,2,3,4,NA,6,NA),y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,TRUE))
dTestC <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
dTrainCTreated <- prepare(treatmentsC,dTrainC,pruneSig=1.0,scale=TRUE)
varsC <- setdiff(colnames(dTrainCTreated),'y')
# all input variables should be mean 0
sapply(dTrainCTreated[,varsC,drop=FALSE],mean)
# all slopes should be 1
sapply(varsC,function(c) { lm(paste('y',c,sep='~'),
   data=dTrainCTreated)$coefficients[[2]]})
dTestCTreated <- prepare(treatmentsC,dTestC,pruneSig=c(),scale=TRUE)

# numeric example
dTrainN <- data.frame(x=c('a','a','a','a','b','b',NA,NA),
   z=c(1,2,3,4,5,NA,7,NA),y=c(0,0,0,1,0,1,1,1))
dTestN <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y')
dTrainNTreated <- prepare(treatmentsN,dTrainN,pruneSig=1.0,scale=TRUE)
varsN <- setdiff(colnames(dTrainNTreated),'y')
# all input variables should be mean 0
sapply(dTrainNTreated[,varsN,drop=FALSE],mean) 
# all slopes should be 1
sapply(varsN,function(c) { lm(paste('y',c,sep='~'),
   data=dTrainNTreated)$coefficients[[2]]}) 
dTestNTreated <- prepare(treatmentsN,dTestN,pruneSig=c(),scale=TRUE)

# for large data sets you can consider designing the treatments on 
# a subset like so: d[sample(1:dim(d)[[1]],1000),,drop=FALSE]
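
The subsetting idiom in the comment above fails when the data frame has fewer than 1000 rows, because sample cannot draw more items than exist without replacement. A slightly more defensive variant, using base R only, might look like the sketch below. The helper name sampleRows and the toy data frames are hypothetical and are not part of vtreat.

```r
# Hypothetical helper: draw at most n rows, without replacement.
sampleRows <- function(d, n = 1000) {
  idx <- sample.int(nrow(d), size = min(n, nrow(d)))
  d[idx, , drop = FALSE]
}

set.seed(1)
big <- data.frame(x = rnorm(5000), y = rnorm(5000))
small <- data.frame(x = 1:10, y = 1:10)

dDesignBig <- sampleRows(big)      # capped at 1000 rows
dDesignSmall <- sampleRows(small)  # fewer than 1000 rows: all rows, shuffled
```

The sampled frame can then be passed to designTreatmentsC or designTreatmentsN in place of the full data set.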
