parsnip (version 0.1.1)

descriptors: Data Set Characteristics Available when Fitting Models

Description

When using the fit() functions, a set of descriptor functions is available for use in model arguments. For example, to choose an argument value based on the current number of rows in the data set, use the .obs() function. See Details below.

Usage

.cols()

.preds()

.obs()

.lvls()

.facts()

.x()

.y()

.dat()

Details

Existing functions:

  • .obs(): The current number of rows in the data set.

  • .preds(): The number of columns in the data set that are associated with the predictors prior to dummy variable creation.

  • .cols(): The number of predictor columns available after dummy variables are created (if any).

  • .facts(): The number of factor predictors in the data set.

  • .lvls(): If the outcome is a factor, this is a table with the counts for each level (and NA otherwise).

  • .x(): The predictors, returned in the format given: either a data frame or a matrix.

  • .y(): The known outcomes, returned in the format given: a vector, matrix, or data frame.

  • .dat(): A data frame containing all of the predictors and the outcomes. If fit_xy() was used, the outcomes are attached as the column ..y.

For example, if you use the model formula Sepal.Width ~ . with the iris data, the values would be

 .preds() =   4          (the 4 columns in `iris` other than Sepal.Width)
 .cols()  =   5          (3 numeric columns + 2 from Species dummy variables)
 .obs()   = 150
 .lvls()  =  NA          (no factor outcome)
 .facts() =   1          (the Species predictor)
 .y()     = <vector>     (Sepal.Width as a vector)
 .x()     = <data.frame> (The other 4 columns as a data frame)
 .dat()   = <data.frame> (The full data set)

If the formula Species ~ . were used:

 .preds() =   4          (the 4 numeric columns in `iris`)
 .cols()  =   4          (same)
 .obs()   = 150
 .lvls()  =  c(setosa = 50, versicolor = 50, virginica = 50)
 .facts() =   0
 .y()     = <vector>     (Species as a vector)
 .x()     = <data.frame> (The other 4 columns as a data frame)
 .dat()   = <data.frame> (The full data set)
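
The counts in the two listings above can be checked by hand with base R. The following is a minimal sketch only: describe_data() is a hypothetical helper written for this illustration, not a parsnip function, and it does not claim to reproduce how parsnip computes the descriptors internally.

```r
# Hypothetical helper (not part of parsnip): recompute the descriptor
# counts for a formula and a data frame using base R only.
describe_data <- function(formula, data) {
  mf <- model.frame(formula, data)
  y  <- model.response(mf)
  x  <- mf[, -1, drop = FALSE]          # predictors before dummy variables
  mm <- model.matrix(formula, data)     # predictors after dummy variables
  list(
    preds = ncol(x),
    cols  = ncol(mm) - 1,               # drop the intercept column
    obs   = nrow(data),
    lvls  = if (is.factor(y)) table(y) else NA,
    facts = sum(vapply(x, is.factor, logical(1)))
  )
}

reg <- describe_data(Sepal.Width ~ ., iris)  # numeric outcome
cls <- describe_data(Species ~ ., iris)      # factor outcome

str(reg)
str(cls)
```

With the iris data, reg should match the first listing (4 predictors, 5 columns, 1 factor) and cls the second (4 predictors, 4 columns, no factors, 50 rows per level).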

To use these in a model fit, pass them to a model specification. The evaluation is delayed until the time when the model is run via fit() (and the variables listed above are available). For example:

library(modeldata)
data("lending_club")

rand_forest(mode = "classification", mtry = .cols() - 2)
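
The delayed evaluation described above can be illustrated in base R. This is a sketch of the general idea only, under the assumption that the argument is stored as an unevaluated expression and evaluated once the descriptors exist. The names mtry_expr and fit_time_env are hypothetical, and parsnip's own mechanism may differ.

```r
# Capture the argument without evaluating it. At this point no `.cols()`
# function has been defined, so evaluating the expression would fail.
mtry_expr <- quote(.cols() - 2)

# Hypothetical stand-in for fit time: the descriptor is defined from the
# data, and only then is the stored expression evaluated.
fit_time_env <- new.env()
fit_time_env$.cols <- function() 5   # e.g. Sepal.Width ~ . with iris

mtry_value <- eval(mtry_expr, envir = fit_time_env)
mtry_value
#> [1] 3
```

Because the expression is stored rather than its value, the same model specification can be reused on data sets with different numbers of columns.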

When a model specification contains no descriptors, the descriptor values are not computed.