prepare.treatmentplan: Apply treatments and restrict to useful variables.

Description

Use a treatment plan to prepare a data frame for analysis. The resulting frame will have new effective variables that are numeric and free of NaN/NA. If the outcome column is present it will be copied over. The intent is that these frames are compatible with more machine learning techniques, and avoid a lot of corner cases (NA,NaN, novel levels, too many levels). Note: each column is processed independently of all others. Also copies over outcome if present. Note: treatmentplan's are not meant for long-term storage, a warning is issued if the version of vtreat that produced the plan differs from the version running prepare().

Usage

# S3 method for treatmentplan
prepare(
  treatmentplan,
  dframe,
  ...,
  pruneSig = NULL,
  scale = FALSE,
  doCollar = FALSE,
  varRestriction = NULL,
  codeRestriction = NULL,
  trackedValues = NULL,
  extracols = NULL,
  parallelCluster = NULL,
  use_parallel = TRUE,
  check_for_duplicate_frames = TRUE
)

Arguments

treatmentplan

Plan built by designTreantmentsC() or designTreatmentsN()

dframe

Data frame to be treated

...

no additional arguments, declared to forced named binding of later arguments

pruneSig

suppress variables with significance above this level

scale

optional if TRUE replace numeric variables with single variable model regressions ("move to outcome-scale"). These have mean zero and (for variables with significant less than 1) slope 1 when regressed (lm for regression problems/glm for classification problems) against outcome.

doCollar

optional if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design.

varRestriction

optional list of treated variable names to restrict to

codeRestriction

optional list of treated variable codes to restrict to

trackedValues

optional named list mapping variables to know values, allows warnings upon novel level appearances (see track_values)

extracols

extra columns to copy.

parallelCluster

(optional) a cluster object created by package parallel or package snow.

use_parallel

logical, if TRUE use parallel methods.

check_for_duplicate_frames

logical, if TRUE check if we called prepare on same data.frame as design step.

Value

treated data frame (all columns numeric- without NA, NaN)

Examples

Run this code

# NOT RUN {
# categorical example
set.seed(23525)

# we set up our raw training and application data
dTrainC <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, NA, 6, NA),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE))

dTestC <- data.frame(
  x = c('a', 'b', 'c', NA), 
  z = c(10, 20, 30, NA))

# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsC
# and dTrainCTreated
unpack[
  treatmentsC = treatments,
  dTrainCTreated = crossFrame
  ] <- mkCrossFrameCExperiment(
    dframe = dTrainC,
    varlist = setdiff(colnames(dTrainC), 'y'),
    outcomename = 'y',
    outcometarget = TRUE,
    verbose = FALSE)

# the treatments include a score frame relating new
# derived variables to original columns
treatmentsC$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
  print(.)

# the treated frame is a "cross frame" which
# is a transform of the training data built 
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
dTrainCTreated %.>%
  head(.) %.>%
  print(.)

# Any future application data is prepared with
# the prepare method.
dTestCTreated <- prepare(treatmentsC, dTestC, pruneSig=NULL)

dTestCTreated %.>%
  head(.) %.>%
  print(.)

# }

Run the code above in your browser using DataLab

Description

Usage

Arguments

Value

See Also

Examples