vtreat (version 0.5.23)

prepare: Apply treatments and restrict to useful variables.

Description

Use a treatment plan to prepare a data frame for analysis. The resulting frame will have new effective variables that are numeric and free of NaN/NA. If the outcome column is present it will be copied over. The intent is that these frames are compatible with more machine learning techniques and avoid many corner cases (NA, NaN, novel levels, too many levels). Note: each column is processed independently of all others.

Usage

prepare(treatmentplan, dframe, pruneSig, ..., scale = FALSE,
  doCollar = TRUE, varRestriction = c(), parallelCluster = NULL)

Arguments

treatmentplan
Plan built by designTreatmentsC() or designTreatmentsN()
dframe
Data frame to be treated
pruneSig
suppress variables with significance above this level
...
no additional arguments; declared to force named binding of later arguments
scale
optional; if TRUE, replace numeric variables with single-variable model regressions ("move to outcome-scale"). These have mean zero and (for variables with significance less than 1) slope 1 when regressed against the outcome.
doCollar
optional; if TRUE, collar numeric variables by cutting off values beyond a tail probability specified by collarProb during treatment design.
varRestriction
optional list of treated variable names to restrict to
parallelCluster
(optional) a cluster object created by package parallel or package snow
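
To make the doCollar description more concrete, the following is a minimal base-R sketch of what collaring a numeric variable can look like: bounds are estimated from training data at a chosen tail probability and then applied to new data. The helper name collar_sketch and the 0.1 tail probability are hypothetical and chosen for illustration; this is not vtreat's internal implementation.

```r
# Hypothetical helper (not part of vtreat): clip values to quantile bounds
# estimated from a reference (training) vector.
collar_sketch <- function(x, ref, tail_prob = 0.1) {
  # Lower and upper cut points taken from the reference data
  bounds <- quantile(ref, probs = c(tail_prob, 1 - tail_prob),
                     na.rm = TRUE, names = FALSE)
  # Values below the lower bound are raised, values above the upper bound
  # are lowered; NA stays NA in this sketch
  pmin(pmax(x, bounds[1]), bounds[2])
}

train_z <- c(1, 2, 3, 4, 5, 6, 7)
test_z <- c(10, 20, 30, NA)
collared <- collar_sketch(test_z, ref = train_z)
print(collared)
```

In this sketch the out-of-range test values are pulled back to the upper bound learned from the training data, which is the kind of extrapolation guard that collaring is meant to provide. Unlike prepare(), the sketch leaves NA in place.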

Value

  • treated data frame (all columns numeric, without NA or NaN)

See Also

designTreatmentsC designTreatmentsN

Examples

dTrainN <- data.frame(x=c('a','a','a','a','b','b','b'),
    z=c(1,2,3,4,5,6,7),y=c(0,0,0,1,0,1,1))
dTestN <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y')
dTrainNTreated <- prepare(treatmentsN,dTrainN,1.0)
dTestNTreated <- prepare(treatmentsN,dTestN,1.0)

dTrainC <- data.frame(x=c('a','a','a','b','b','b'),
    z=c(1,2,3,4,5,6),y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
dTestC <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
dTrainCTreated <- prepare(treatmentsC,dTrainC,1.0)
dTestCTreated <- prepare(treatmentsC,dTestC,1.0)
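
Because the Usage signature places scale, doCollar, and the other options after `...`, they have to be passed by name. The sketch below repeats the classification example with named arguments and then checks the property stated under Value (all columns numeric, no NA). It assumes the vtreat package is installed; the variable list c('x','z') and the name treated are illustrative choices, not taken from the original example.

```r
library(vtreat)

dTrainC <- data.frame(x = c('a','a','a','b','b','b'),
                      z = c(1,2,3,4,5,6),
                      y = c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
dTestC <- data.frame(x = c('a','b','c',NA), z = c(10,20,30,NA))

# Design on the input columns only (assumption: the outcome is left out
# of the variable list here)
treatmentsC <- designTreatmentsC(dTrainC, c('x','z'), 'y', TRUE)

# Arguments after `...` are bound by name
treated <- prepare(treatmentsC, dTestC, pruneSig = 1.0,
                   scale = TRUE, doCollar = TRUE)

# Check the documented guarantees on the result
all_numeric <- all(vapply(treated, is.numeric, logical(1)))
no_missing <- !anyNA(treated)
print(c(all_numeric = all_numeric, no_missing = no_missing))
```

Note that dTestC contains a novel level ('c') and missing values in both columns, so this exercises the corner cases listed in the Description.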
