As the document-term matrix grows quickly with an increasing number of abstracts, it can easily reach several thousand columns. It can therefore be important to extract the columns that carry most of the information for the decision-making process. This function uses a generalized linear model combined with elastic net regularization to extract these features. In contrast to an ordinary regression model or a pure L2 penalty (ridge regression), the elastic net (like the lasso) sets some regression coefficients exactly to 0. The selected features are thus exactly the features with a non-zero coefficient.
select_features(object, ...)

# S4 method for MetaNLP
select_features(object, alpha = 0.8, lambda = "avg", seed = NULL, ...)
Value: An object of class MetaNLP where the columns were selected via elastic net.
object: An object of class MetaNLP.
...: Additional arguments for cv.glmnet. An important option might be type.measure to specify which loss is used when the cross validation is executed.
alpha: The elastic net mixing parameter, with \(0 \leq \alpha \leq 1\). alpha = 1 corresponds to the lasso penalty, alpha = 0 to the ridge penalty.
lambda: The weight parameter of the penalty. The possible values are "avg", "min", "1se" or a numeric value which directly determines \(\lambda\). When choosing "avg", "min" or "1se", cross validation is executed to determine \(\lambda\). Note that cross validation uses random folds, so the results are not necessarily replicable. "avg" runs the cross validation 10 times, computes the \(\lambda\) that minimizes the loss in each iteration and then uses the median of these values as the final value for which the objective function is minimized. "min" and "1se" carry out the cross validation just once; \(\lambda\) is then either the value for which the cross-validated error is minimized (option "min") or the value that gives the most regularized model such that the cross-validated error is within one standard error of the minimum (option "1se"). See the sketch after the argument descriptions for examples of these choices.
seed: A numeric value which is used as a local seed for this function. Default is seed = NULL, so no seed is set. Setting a seed leads to replicable results of the cross validation, such that each call of select_features selects the same columns. If a seed is set, the option lambda = "avg" yields the same results as lambda = "min".
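To illustrate the lambda and seed options, a minimal sketch; it reuses the test data shipped with the package (as in the example at the end of this page), and the numeric penalty value is only an illustration:

obj <- MetaNLP(system.file("extdata", "test_data.csv", package = "MetaNLP", mustWork = TRUE))
# three ways of choosing the penalty weight
sel_min <- select_features(obj, lambda = "min")   # lambda minimizing the cross-validated error
sel_1se <- select_features(obj, lambda = "1se")   # most regularized model within one standard error
sel_fix <- select_features(obj, lambda = 0.1)     # fixed numeric lambda, no cross validation
# with a seed, repeated calls are expected to select the same columns
sel_a <- select_features(obj, lambda = "min", seed = 42)
sel_b <- select_features(obj, lambda = "min", seed = 42)
identical(sel_a, sel_b)                           # expected to be TRUE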
The computational aspects are executed by the glmnet package. First, a model is fitted via glmnet. The elastic net parameter \(\alpha\) can be specified by the user. The parameter \(\lambda\), which determines the weight of the penalty, can either be chosen via cross validation (using cv.glmnet) or by giving a numeric value.
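The following self-contained sketch illustrates this principle with the glmnet package directly; it uses toy data and is an illustration only, not the code used by select_features:

library(glmnet)
set.seed(1)
# toy document-term matrix (word counts) and toy inclusion labels
x <- matrix(rbinom(100 * 20, size = 5, prob = 0.2), nrow = 100)
colnames(x) <- paste0("word", 1:20)
y <- rbinom(100, 1, plogis(x[, 1] - 1))
# cross-validated elastic net fit; the selected features are the columns with non-zero coefficients
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0.8)
cf <- as.matrix(coef(cv_fit, s = "lambda.min"))
selected <- setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")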
# load the example data shipped with the package
path <- system.file("extdata", "test_data.csv", package = "MetaNLP", mustWork = TRUE)
obj <- MetaNLP(path)
# select features with mixing parameter 0.7 and the CV-error minimizing lambda
obj2 <- select_features(obj, alpha = 0.7, lambda = "min")
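# Additional cv.glmnet arguments can be passed through "...", e.g. a
# misclassification loss for the cross validation; this sketch assumes the
# underlying model is a binomial classification model
obj3 <- select_features(obj, alpha = 0.7, lambda = "1se", type.measure = "class", seed = 42)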