vtreat (version 1.0.1)

prepare: Apply treatments and restrict to useful variables.

Description

Use a treatment plan to prepare a data frame for analysis. The resulting frame will have new effective variables that are numeric and free of NaN/NA. If the outcome column is present it will be copied over. The intent is that these frames are compatible with more machine learning techniques and avoid many corner cases (NA, NaN, novel levels, too many levels). Note: each column is processed independently of all others.

Usage

prepare(treatmentplan, dframe, ..., pruneSig = NULL, scale = FALSE,
  doCollar = FALSE, varRestriction = NULL, codeRestriction = NULL,
  parallelCluster = NULL)

Arguments

treatmentplan

Plan built by designTreatmentsC() or designTreatmentsN()

dframe

Data frame to be treated

...

no additional arguments; declared to force named binding of later arguments

pruneSig

suppress variables with significance above this level

scale

optional, if TRUE replace numeric variables with single-variable model regressions ("move to outcome-scale"). These have mean zero and (for variables with significance less than 1) slope 1 when regressed against the outcome (lm for regression problems, glm for classification problems).

doCollar

optional, if TRUE collar numeric variables by cutting them off at the tail probability specified by collarProb during treatment design.

varRestriction

optional list of treated variable names to restrict to

codeRestriction

optional list of treated variable codes to restrict to

parallelCluster

(optional) a cluster object created by package parallel or package snow
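
The doCollar and scale arguments above can be pictured with a small base R sketch. It is an illustration under stated assumptions, not vtreat's internal code: the helper names collar and toOutcomeScale, the tail probability p = 0.1, and the toy data are all hypothetical, and only the regression (lm) case is shown.

```r
# Illustrative sketch only: base R, not vtreat's implementation.
x <- c(1, 2, 3, 4, 5, 6, 100)   # numeric variable with one extreme value
y <- c(0, 0, 0, 1, 0, 1, 1)     # numeric outcome

# Collaring: clip v to its [p, 1 - p] quantile range.
# p stands in for a tail probability; the value 0.1 is an assumption.
collar <- function(v, p = 0.1) {
  bounds <- quantile(v, probs = c(p, 1 - p), na.rm = TRUE, names = FALSE)
  pmin(pmax(v, bounds[1]), bounds[2])
}
xCollared <- collar(x)

# Outcome scaling: center v and multiply by the slope of lm(outcome ~ v),
# so the new variable has mean zero and slope 1 when regressed on the outcome.
toOutcomeScale <- function(v, outcome) {
  slope <- unname(coef(lm(outcome ~ v))[2])
  slope * (v - mean(v))
}
xScaled <- toOutcomeScale(xCollared, y)

print(range(xCollared))        # extreme value has been pulled in
print(mean(xScaled))           # numerically zero
print(coef(lm(y ~ xScaled)))   # slope on xScaled is numerically 1
```

In the sketch, collaring is applied before scaling so that a single extreme value does not dominate the fitted slope; whether prepare() orders the steps this way is not stated in this page.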

Value

treated data frame (all columns numeric, without NA or NaN)
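
The property stated under Value can be checked directly on any data frame. The following is a minimal base R sketch; isFullyNumericAndComplete is a hypothetical helper, not a vtreat function, and the two toy frames are made up for illustration.

```r
# Hypothetical helper (not part of vtreat): TRUE when every column is
# numeric and no column contains NA or NaN.
isFullyNumericAndComplete <- function(d) {
  allNumeric <- all(vapply(d, is.numeric, logical(1)))
  noMissing <- !any(vapply(d, anyNA, logical(1)))
  allNumeric && noMissing
}

good <- data.frame(a = c(1, 2, 3), b = c(0, 1, 0))
bad <- data.frame(a = c(1, NA, 3), b = c("u", "v", "w"))

print(isFullyNumericAndComplete(good))
print(isFullyNumericAndComplete(bad))
```

Applied to a frame returned by prepare(), such a check would be expected to pass, given the Value description above.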

See Also

mkCrossFrameCExperiment, mkCrossFrameNExperiment, designTreatmentsC, designTreatmentsN, designTreatmentsZ

Examples

# NOT RUN {
dTrainN <- data.frame(x= c('a','a','a','a','b','b','b'),
                      z= c(1,2,3,4,5,6,7),
                      y= c(0,0,0,1,0,1,1))
dTestN <- data.frame(x= c('a','b','c',NA),
                     z= c(10,20,30,NA))
treatmentsN <- designTreatmentsN(dTrainN, colnames(dTrainN), 'y')
dTrainNTreated <- prepare(treatmentsN, dTrainN, pruneSig= 0.2)
dTestNTreated <- prepare(treatmentsN, dTestN, pruneSig= 0.2)

dTrainC <- data.frame(x= c('a','a','a','b','b','b'),
                      z= c(1,2,3,4,5,6),
                      y= c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
dTestC <- data.frame(x= c('a','b','c',NA),
                     z= c(10,20,30,NA))
treatmentsC <- designTreatmentsC(dTrainC, colnames(dTrainC),'y',TRUE)
dTrainCTreated <- prepare(treatmentsC, dTrainC, varRestriction= c('z_clean'))
dTestCTreated <- prepare(treatmentsC, dTestC, varRestriction= c('z_clean'))

dTrainZ <- data.frame(x= c('a','a','a','b','b','b'),
                      z= c(1,2,3,4,5,6))
dTestZ <- data.frame(x= c('a','b','c',NA),
                     z= c(10,20,30,NA))
treatmentsZ <- designTreatmentsZ(dTrainZ, colnames(dTrainZ))
dTrainZTreated <- prepare(treatmentsZ, dTrainZ, codeRestriction= c('lev'))
dTestZTreated <- prepare(treatmentsZ, dTestZ, codeRestriction= c('lev'))


# }