designTreatmentsZ: Design variable treatments with no outcome variable.

Description

The data frame is assumed to have only atomic columns, except for dates (which are converted to numeric). Note: each column is processed independently of all others.

Usage

designTreatmentsZ(
  dframe,
  varlist,
  ...,
  minFraction = 0,
  weights = c(),
  rareCount = 0,
  collarProb = 0,
  codeRestriction = NULL,
  customCoders = NULL,
  verbose = TRUE,
  parallelCluster = NULL,
  use_parallel = TRUE,
  missingness_imputation = NULL,
  imputation_map = NULL
)

Value

treatment plan (for use with prepare)

Arguments

dframe: Data frame to learn treatments from (training data), must have at least 1 row.
varlist: Names of columns to treat (effective variables).
...: no additional arguments, declared to force named binding of later arguments.
minFraction: optional minimum frequency a categorical level must have to be converted to an indicator column.
weights: optional training weights for each row
rareCount: optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off.
collarProb: what fraction of the data (pseudo-probability) to collar data at if doCollar is set during prepare.treatmentplan.
codeRestriction: what types of variables to produce (character array of level codes, NULL means no restriction).
customCoders: map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/main/extras/CustomLevelCoders.md).
verbose: if TRUE print progress.
parallelCluster: (optional) a cluster object created by package parallel or package snow.
use_parallel: logical, if TRUE use parallel methods (if parallel cluster is set).
missingness_imputation: function of signature f(values: numeric, weights: numeric), simple missing value imputer.
imputation_map: map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers.
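
As an illustration of the missingness_imputation and imputation_map arguments, the sketch below supplies imputers matching the documented f(values, weights) signature. The function names median_imputer and weighted_mean_imputer and the example columns are hypothetical. This page does not say whether missing values are removed before the imputer is called, so both functions drop NAs defensively.

```r
library(vtreat)

# Hypothetical imputer: ignores the weights and returns the median
# of the non-missing values.
median_imputer <- function(values, weights) {
  median(values, na.rm = TRUE)
}

# Hypothetical imputer: weighted mean of the non-missing values.
weighted_mean_imputer <- function(values, weights) {
  ok <- !is.na(values)
  sum(values[ok] * weights[ok]) / sum(weights[ok])
}

d <- data.frame(
  x = c(1, 2, NA, 10),
  z = c(NA, 4, 6, 8)
)

plan <- designTreatmentsZ(
  d, c("x", "z"),
  missingness_imputation = median_imputer,           # imputer given for numeric columns in general
  imputation_map = list(z = weighted_mean_imputer),  # imputer given for column z only
  verbose = FALSE
)
treated <- prepare(plan, d)
```

Keeping the imputers as plain two-argument functions means they can be checked on small vectors before being handed to designTreatmentsZ.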

Details

The main fields are mostly vectors with names (all with the same names in the same order):

- vars : (character array without names) names of variables (in same order as names on the other diagnostic vectors)
- varMoves : logical TRUE if the variable varied during hold out scoring, only variables that move will be in the treated frame

See the vtreat vignette for a bit more detail and a worked example.

Columns that do not vary are not passed through.
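
A minimal sketch of inspecting a plan and confirming that a constant column is dropped. It assumes the installed vtreat version exposes the per-variable diagnostics as a data frame named scoreFrame with columns varName, origName, and code; that field is not described on this page. The data and the column name k are hypothetical.

```r
library(vtreat)

d <- data.frame(
  x = c("a", "a", "b", "b", NA, "c"),
  z = c(1, 2, 3, NA, 5, 6),
  k = rep(1, 6)  # constant column: does not vary
)

# Restrict the plan to cleaned numerics, missingness indicators,
# and categorical level indicators.
plan <- designTreatmentsZ(
  d, colnames(d),
  codeRestriction = c("clean", "isBAD", "lev"),
  verbose = FALSE
)

# Assumed field: one row per derived variable.
print(plan$scoreFrame[, c("varName", "origName", "code")])

treated <- prepare(plan, d)
# The constant column k is expected to be absent from the treated frame.
print(colnames(treated))
```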

Examples

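library(vtreat)

# Training data: a categorical column (x) and a numeric column (z), each with a missing value.
dTrainZ <- data.frame(x=c('a','a','a','a','b','b',NA,'e','e'),
    z=c(1,2,3,4,5,6,7,NA,9))
# Test data, including levels ('x', 'c') that do not occur in the training data.
dTestZ <- data.frame(x=c('a','x','c',NA),
    z=c(10,20,30,NA))
# Design the treatment plan without reference to any outcome column.
treatmentsZ <- designTreatmentsZ(dTrainZ, colnames(dTrainZ),
  rareCount=0)
# Apply the plan to both frames.
dTrainZTreated <- prepare(treatmentsZ, dTrainZ)
dTestZTreated <- prepare(treatmentsZ, dTestZ)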
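
The collarProb argument only has an effect when doCollar is set during prepare. The sketch below, on hypothetical data with one extreme value, compares the treated numeric column with and without collaring. The exact collar limits depend on how vtreat computes the quantiles, so no specific values are claimed.

```r
library(vtreat)

# Hypothetical data: nineteen ordinary values and one extreme value.
d <- data.frame(z = c(1:19, 1000))

# Record collar limits at roughly the 5% and 95% points of the training data.
plan <- designTreatmentsZ(d, "z", collarProb = 0.05, verbose = FALSE)

plain    <- prepare(plan, d)                   # no collaring applied
collared <- prepare(plan, d, doCollar = TRUE)  # values limited to the collar range

print(range(plain$z))
print(range(collared$z))
```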
