run_classifiers
Tunes classifiers, post-stratifies, and carries out EBMA.
run_classifiers(
y,
L1.x,
L2.x,
mrp.L2.x,
L2.unit,
L2.reg,
L2.x.scale,
pcs,
pc.names,
folds,
bin.proportion,
bin.size,
cv.folds,
cv.data,
ebma.fold,
census,
ebma.size,
ebma.n.draws,
k.folds,
cv.sampling,
loss.unit,
loss.fun,
best.subset,
lasso,
pca,
gb,
svm,
mrp,
deep.mrp,
best.subset.L2.x,
lasso.L2.x,
pca.L2.x,
gb.L2.x,
svm.L2.x,
gb.L2.unit,
gb.L2.reg,
svm.L2.unit,
svm.L2.reg,
deep.L2.x,
deep.L2.reg,
deep.splines,
lasso.lambda,
lasso.n.iter,
gb.interaction.depth,
gb.shrinkage,
gb.n.trees.init,
gb.n.trees.increase,
gb.n.trees.max,
gb.n.minobsinnode,
svm.kernel,
svm.gamma,
svm.cost,
ebma.tol,
cores,
verbose
)
y: Outcome variable. A character scalar containing the column name of the outcome variable in survey.
L1.x: Individual-level covariates. A character vector containing the column names of the individual-level variables in survey and census used to predict outcome y. Note that the geographic unit is specified in argument L2.unit.
L2.x: Context-level covariates. A character vector containing the column names of the context-level variables in survey and census used to predict outcome y. To exclude context-level variables, set L2.x = NULL.
mrp.L2.x: MRP context-level covariates. A character vector containing the column names of the context-level variables in survey and census to be used by the MRP classifier. The character vector should be empty if no context-level variables are to be used by the MRP classifier. If NULL and mrp is set to TRUE, then MRP uses the variables specified in L2.x. Default is NULL. Note: for the empty MrP model, set L2.x = NULL and mrp.L2.x = "".
L2.unit: Geographic unit. A character scalar containing the column name of the geographic unit in survey and census at which outcomes should be aggregated.
L2.reg: Geographic region. A character scalar containing the column name of the geographic region in survey and census by which geographic units are grouped (L2.unit must be nested within L2.reg). Default is NULL.
L2.x.scale: Scale context-level covariates. A logical argument indicating whether the context-level covariates should be normalized. Default is TRUE. Note that if set to FALSE, the context-level covariates should be normalized prior to calling auto_MrP().
pcs: Principal components. A character vector containing the column names of the principal components of the context-level variables in survey and census. Default is NULL.
pc.names: A character vector of the principal component variable names in the data.
folds: EBMA and cross-validation folds. A character scalar containing the column name of the variable in survey that specifies the fold to which an observation is allocated. The variable should contain integers running from \(1\) to \(k + 1\), where \(k\) is the number of cross-validation folds. Value \(k + 1\) refers to the EBMA fold. Default is NULL. Note: if folds is NULL, then ebma.size, k.folds, and cv.sampling must be specified.
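As an illustration, a folds variable for \(k = 5\) cross-validation folds plus an EBMA fold might be constructed as follows. This is a minimal sketch with made-up data; the column name fold and the data are hypothetical, not part of the package.

```r
set.seed(123)
k <- 5
# Toy survey data frame with a binary outcome
survey <- data.frame(y = rbinom(100, 1, 0.5))
n <- nrow(survey)
# Allocate roughly 1/3 of respondents to the EBMA fold (value k + 1) ...
ebma_idx <- sample(n, size = round(n / 3))
survey$fold <- NA_integer_
survey$fold[ebma_idx] <- k + 1L
# ... and spread the remaining respondents evenly across folds 1..k
rest <- setdiff(seq_len(n), ebma_idx)
survey$fold[rest] <- sample(rep_len(1:k, length(rest)))
```

The resulting column contains integers from 1 to \(k + 1\), as the folds argument expects.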
bin.proportion: Proportion of ideal types. A character scalar containing the column name of the variable in census that indicates the proportion of individuals by ideal type and geographic unit. Default is NULL. Note: if bin.proportion is NULL, then bin.size must be specified.
bin.size: Bin size of ideal types. A character scalar containing the column name of the variable in census that indicates the bin size of ideal types by geographic unit. Default is NULL. Note: ignored if bin.proportion is provided, but must be specified otherwise.
cv.folds: Data for cross-validation. A list of \(k\) data.frames, one for each fold to be used in \(k\)-fold cross-validation.
cv.data: A data.frame containing the survey data used in classifier training.
ebma.fold: A data.frame containing the data not used in classifier training.
census: Census data. A data.frame whose column names include L1.x, L2.x, L2.unit, and, if specified, L2.reg and pcs, as well as either bin.proportion or bin.size.
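For intuition, a toy census data frame with a bin.proportion-style column might look as follows. All column names and values here are made up for illustration only.

```r
# One row per ideal type within each geographic unit; 'proportion' gives
# each ideal type's share of the unit's population.
census <- data.frame(
  state      = rep(c("A", "B"), each = 2),
  age_group  = rep(c("18-34", "35+"), times = 2),
  proportion = c(0.4, 0.6, 0.3, 0.7)
)
# The proportions sum to one within each geographic unit
sums <- tapply(census$proportion, census$state, sum)
```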
ebma.size: EBMA fold size. A number in the open unit interval indicating the proportion of respondents to be allocated to the EBMA fold. Default is \(1/3\). Note: ignored if folds is provided, but must be specified otherwise.
ebma.n.draws: EBMA number of samples. An integer-valued scalar specifying the number of bootstrapped samples to be drawn from the EBMA fold and used for tuning EBMA. Default is \(100\).
k.folds: Number of cross-validation folds. An integer-valued scalar indicating the number of folds to be used in cross-validation. Default is \(5\). Note: ignored if folds is provided, but must be specified otherwise.
cv.sampling: Cross-validation sampling method. A character-valued scalar indicating whether cross-validation folds should be created by sampling individual respondents ("individuals") or geographic units ("L2 units"). Default is "L2 units". Note: ignored if folds is provided, but must be specified otherwise.
loss.unit: Loss function unit. A character-valued scalar indicating whether performance loss should be evaluated at the level of individual respondents ("individuals"), geographic units ("L2 units"), or both. Default is c("individuals", "L2 units"). With multiple loss units, parameters are ranked for each loss unit and the loss unit with the lowest rank sum is chosen. Ties are broken according to the order in the search grid.
loss.fun: Loss function. A character-valued scalar indicating whether prediction loss should be measured by the mean squared error ("MSE"), the mean absolute error ("MAE"), binary cross-entropy ("cross-entropy"), mean squared false error ("msfe"), the F1 score ("f1"), or a combination thereof. Default is c("MSE", "cross-entropy", "msfe", "f1"). With multiple loss functions, parameters are ranked for each loss function and the parameter combination with the lowest rank sum is chosen. Ties are broken according to the order in the search grid.
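The rank-sum rule can be illustrated with made-up loss values for three candidate parameter values under two loss functions (the numbers are invented for this sketch):

```r
# Each row is a candidate parameter value; each column a loss (lower = better)
losses <- data.frame(
  mse           = c(0.10, 0.12, 0.11),
  cross_entropy = c(0.35, 0.40, 0.38)
)
# Rank candidates under each loss function separately
ranks <- apply(losses, 2, rank, ties.method = "first")
# Sum the ranks across loss functions for each candidate
rank_sum <- rowSums(ranks)
# Pick the candidate with the lowest rank sum; which.min returns the first
# minimum, so ties are broken by the order of the search grid.
best <- which.min(rank_sum)
```

Here the first candidate ranks best under both losses, so it is selected.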
best.subset: Best subset classifier. A logical argument indicating whether the best subset classifier should be used for predicting outcome y. Default is TRUE.
lasso: Lasso classifier. A logical argument indicating whether the lasso classifier should be used for predicting outcome y. Default is TRUE.
pca: PCA classifier. A logical argument indicating whether the PCA classifier should be used for predicting outcome y. Default is TRUE.
gb: GB classifier. A logical argument indicating whether the GB classifier should be used for predicting outcome y. Default is TRUE.
svm: SVM classifier. A logical argument indicating whether the SVM classifier should be used for predicting outcome y. Default is TRUE.
mrp: MRP classifier. A logical argument indicating whether the standard MRP classifier should be used for predicting outcome y. Default is FALSE.
deep.mrp: Deep MRP classifier. A logical argument indicating whether the deep MRP classifier should be used for predicting outcome y. Default is FALSE.
best.subset.L2.x: Best subset context-level covariates. A character vector containing the column names of the context-level variables in survey and census to be used by the best subset classifier. If NULL and best.subset is set to TRUE, then best subset uses the variables specified in L2.x. Default is NULL.
lasso.L2.x: Lasso context-level covariates. A character vector containing the column names of the context-level variables in survey and census to be used by the lasso classifier. If NULL and lasso is set to TRUE, then lasso uses the variables specified in L2.x. Default is NULL.
pca.L2.x: PCA context-level covariates. A character vector containing the column names of the context-level variables in survey and census whose principal components are to be used by the PCA classifier. If NULL and pca is set to TRUE, then PCA uses the principal components of the variables specified in L2.x. Default is NULL.
gb.L2.x: GB context-level covariates. A character vector containing the column names of the context-level variables in survey and census to be used by the GB classifier. If NULL and gb is set to TRUE, then GB uses the variables specified in L2.x. Default is NULL.
svm.L2.x: SVM context-level covariates. A character vector containing the column names of the context-level variables in survey and census to be used by the SVM classifier. If NULL and svm is set to TRUE, then SVM uses the variables specified in L2.x. Default is NULL.
gb.L2.unit: GB L2.unit. A logical argument indicating whether L2.unit should be included in the GB classifier. Default is FALSE.
gb.L2.reg: GB L2.reg. A logical argument indicating whether L2.reg should be included in the GB classifier. Default is FALSE.
svm.L2.unit: SVM L2.unit. A logical argument indicating whether L2.unit should be included in the SVM classifier. Default is FALSE.
svm.L2.reg: SVM L2.reg. A logical argument indicating whether L2.reg should be included in the SVM classifier. Default is FALSE.
deep.L2.x: Deep MRP context-level covariates. A character vector containing the column names of the context-level variables in survey and census to be used by the deep MRP classifier. If NULL and deep.mrp is set to TRUE, then deep MRP uses the variables specified in L2.x. Default is NULL.
deep.L2.reg: Deep MRP L2.reg. A logical argument indicating whether L2.reg should be included in the deep MRP classifier. Default is TRUE.
deep.splines: Deep MRP splines. A logical argument indicating whether splines should be used in the deep MRP classifier. Default is TRUE.
lasso.lambda: Lasso penalty parameter. A numeric vector of non-negative values. The penalty parameter controls the shrinkage of the context-level variables in the lasso model. Default is a sequence with minimum 0.1 and maximum 250 that is equally spaced on the log scale. The number of values is controlled by the lasso.n.iter parameter.
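A grid matching this description (minimum 0.1, maximum 250, equally spaced on the log scale, length set by lasso.n.iter) can be built as follows; this is one way to construct such a sequence, not necessarily the package's internal code.

```r
# 100 lambda values from 0.1 to 250, equally spaced on the log scale
lasso_lambda <- exp(seq(log(0.1), log(250), length.out = 100))
```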
lasso.n.iter: Lasso number of lambda values. An integer-valued scalar specifying the number of lambda values to search over. Default is \(100\). Note: ignored if a vector of lasso.lambda values is provided.
gb.interaction.depth: GB interaction depth. An integer-valued vector whose values specify the interaction depth of GB. The interaction depth defines the maximum depth of each tree grown (i.e., the maximum level of variable interactions). Default is c(1, 2, 3).
gb.shrinkage: GB learning rate. A numeric vector whose values specify the learning rate or step-size reduction of GB. Values between \(0.001\) and \(0.1\) usually work, but a smaller learning rate typically requires more trees. Default is c(0.04, 0.01, 0.008, 0.005, 0.001).
gb.n.trees.init: GB initial total number of trees. An integer-valued scalar specifying the initial number of total trees to fit by GB. Default is \(50\).
gb.n.trees.increase: GB increase in total number of trees. An integer-valued scalar specifying by how many trees the total number of trees to fit should be increased (until gb.n.trees.max is reached). Default is \(50\).
gb.n.trees.max: GB maximum number of trees. An integer-valued scalar specifying the maximum number of trees to fit by GB. Default is \(1000\).
gb.n.minobsinnode: GB minimum number of observations in the terminal nodes. An integer-valued scalar specifying the minimum number of observations that each terminal node of the trees must contain. Default is \(20\).
svm.kernel: SVM kernel. A character-valued scalar specifying the kernel to be used by SVM. The possible values are "linear", "polynomial", "radial", and "sigmoid". Default is "radial".
svm.gamma: SVM kernel parameter. A numeric vector whose values specify the gamma parameter in the SVM kernel. This parameter is needed for all kernel types except linear. Default is a sequence with minimum = 1e-5, maximum = 1e-1, and length = 20 that is equally spaced on the log scale.
svm.cost: SVM cost parameter. A numeric vector whose values specify the cost of constraints violation in SVM. Default is a sequence with minimum = 0.5, maximum = 10, and length = 5 that is equally spaced on the log scale.
ebma.tol: EBMA tolerance. A numeric vector containing the tolerance values for improvements in the log-likelihood before the EM algorithm stops optimization. Values should range at least from \(0.01\) to \(0.001\). Default is c(0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001).
cores: The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1.
verbose: Verbose output. A logical argument indicating whether or not verbose output should be printed. Default is FALSE.