glinternet: Fit a linear interaction model with group-lasso regularization that enforces strong hierarchy in the estimated coefficients

Description

The regularization path is computed along a grid of values for the regularization parameter lambda. The function handles categorical variables with arbitrary numbers of levels, continuous variables, and combinations of the two. It supports squared error loss and logistic loss.

The multicore option requires that the package be compiled with OpenMP support. Examples of compilers that qualify include gcc (>= 4.2) and icc. We also recommend a higher level of optimization, such as -O3 in gcc.

Usage

glinternet(X, Y, numLevels, lambda = NULL, nLambda = 50, lambdaMinRatio = 0.01,
 interactionCandidates=NULL, interactionPairs=NULL, screenLimit = NULL, numToFind = NULL,
family = c("gaussian","binomial"), tol = 1e-05, maxIter=5000, verbose=FALSE,
numCores = 1)

Arguments

X

Matrix of features or predictors with dimension nobs x nvars; each row is an observation vector. Categorical variables must be coded as 0, 1, 2, ...

Y

Target variable of length nobs. Continuous for squared error loss, 0-1 for logistic loss.

numLevels

Number of levels for each variable, of length nvars. Set to 1 for continuous variables.

lambda

A user supplied lambda sequence. Typical usage is to have the program compute its own lambda sequence based on nLambda and lambdaMinRatio. Supplying a value of lambda overrides this.

nLambda

The number of lambda values. Default is 50.

lambdaMinRatio

Smallest value for lambda, as a fraction of lambdaMax, the (data derived) entry value (i.e. the smallest value for which all coefficients are zero). The default is 0.01.

interactionCandidates

An optional vector of variable indices. This will force the algorithm to only consider interactions between interactionCandidates and all other variables.

interactionPairs

An optional nx2 matrix of variable indices. This will force the algorithm to only consider the interaction pairs defined by this matrix. For example, matrix(c(1,2,1,5), ncol=2, byrow=TRUE) restricts the model to two interaction pairs: one between variables 1 and 2, and another between 1 and 5.

screenLimit

If not NULL (the default is NULL), limits the size of the interaction search space to screenLimit x nvars by only considering interactions with the best screenLimit candidate main effects at each point along the regularization path. Set this for large problems or when memory is limited.

numToFind

Stops the program after numToFind interaction pairs are found. Default is NULL, in which case all values of lambda are fit.

family

A character string describing the target variable: "gaussian" for continuous (the default), "binomial" for logistic.

tol

Convergence tolerance in the adaptive FISTA algorithm.

maxIter

Maximum number of iterations in adaptive FISTA. Default 5000.

verbose

If TRUE, prints progress. Default is FALSE.

numCores

Number of threads to run. For this to work, the package must be installed with OpenMP enabled. Default is 1 thread.
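The coding rules for X, numLevels, and interactionPairs can be sketched in base R. The data frame df and its column names are hypothetical and made up for illustration; only the glinternet() argument names come from the usage above, and the final call is left commented out because it needs the package and a response Y.

```r
# Hypothetical mixed-type data: two continuous columns and two factors.
set.seed(1)
df <- data.frame(
  age    = rnorm(6),
  income = rnorm(6),
  region = factor(c("north", "south", "east", "south", "north", "east")),
  smoker = factor(c("no", "yes", "no", "no", "yes", "yes"))
)

is_cat <- vapply(df, is.factor, logical(1))

# Categorical columns: integer codes shifted to start at 0 (0, 1, 2, ...).
# Continuous columns are passed through unchanged.
X <- vapply(df, function(col) {
  if (is.factor(col)) as.numeric(as.integer(col) - 1L) else as.numeric(col)
}, numeric(nrow(df)))

# Number of levels per column; 1 marks a continuous variable.
numLevels <- ifelse(is_cat, vapply(df, nlevels, integer(1)), 1L)

# Restrict the search to two pairs: columns (1, 3) and columns (2, 4).
pairs <- matrix(c(1, 3, 2, 4), ncol = 2, byrow = TRUE)

# With a response Y of length nrow(X), the fit would then be requested as:
# fit <- glinternet(X, Y, numLevels, interactionPairs = pairs)
```

Each row of pairs names one candidate interaction by column position in X, so the indices refer to the original column order rather than to the separate categorical and continuous groups.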

Value

An object of class glinternet with the components

call

The user function call.

fitted

The fitted values, with dimension nobs x nLambda. If numToFind is specified, the program is likely to stop before all nLambda models have been fit.

lambda

The actual lambda sequence used.

objValue

Objective values for each lambda.

activeSet

A list (of length nLambda) of the variables found. Internally, the categorical and continuous variables are separated into two groups, and each group has its own indexing system (1-based). For example, the categorical-continuous interaction c(i, j) refers to the interaction between the i-th categorical variable and the j-th continuous variable.

betahat

List (with one entry per lambda value) of coefficients for the variables in activeSet. The first component is the intercept. Subsequent entries correspond to the variables in activeSet. For example, if the first variable in activeSet is a 3-level categorical variable, then components 2-4 of betahat are the coefficients for this variable.

numLevels

The number of levels for each variable.

family

The target variable type.
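Because activeSet indexes categorical and continuous variables separately, mapping an entry back to a column of X takes one lookup per group. The sketch below assumes that each group keeps the column order of X, which the description above does not state explicitly. The helper toOriginalIndex is hypothetical and not part of the package.

```r
# Example layout: columns 2 and 4 are categorical, the rest continuous.
numLevels <- c(1, 3, 1, 2, 1)

catCols  <- which(numLevels > 1)    # original columns of the categorical group
contCols <- which(numLevels == 1)   # original columns of the continuous group

# Hypothetical helper: translate a within-group index to a column of X.
# Assumes each group preserves the column order of X.
toOriginalIndex <- function(i, type = c("cat", "cont")) {
  type <- match.arg(type)
  if (type == "cat") catCols[i] else contCols[i]
}

# A categorical-continuous interaction c(2, 3) pairs the 2nd categorical
# variable with the 3rd continuous variable.
pair <- c(toOriginalIndex(2, "cat"), toOriginalIndex(3, "cont"))
```

Under that ordering assumption, pair holds the positions of the two interacting variables in the original X.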

Details

The sequence of models implied by lambda is fit by FISTA (fast iterative soft thresholding) with adaptive step size and adaptive momentum restart. The continuous features are standardized to have unit norm and mean zero before computing the lambda sequence and fitting. The returned coefficients are unstandardized. Categorical variables are not standardized.
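The standardization of continuous features can be sketched in base R. This illustrates "mean zero, unit norm" under the assumption that the norm is the Euclidean norm of each centered column; it is not the package's internal code.

```r
set.seed(1)
X <- matrix(rnorm(20 * 3, mean = 5, sd = 2), nrow = 20)

# Center each column at zero.
centered <- sweep(X, 2, colMeans(X))

# Scale each centered column to unit Euclidean norm (an assumption here).
norms <- sqrt(colSums(centered^2))
Z <- sweep(centered, 2, norms, "/")

colMeans(Z)    # numerically zero for every column
colSums(Z^2)   # numerically one for every column
```

Since the returned coefficients are unstandardized, they apply to the original columns of X and not to a rescaled matrix such as Z.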

References

Michael Lim and Trevor Hastie (2013). Learning interactions via hierarchical group-lasso regularization. https://arxiv.org/abs/1308.2719

Examples


# NOT RUN {
library(glinternet)

# gaussian response, continuous features
Y = rnorm(100)
X = matrix(rnorm(100*10), nrow=100)
numLevels = rep(1, 10)   # 1 marks each variable as continuous
fit = glinternet(X, Y, numLevels)

# binary response, continuous features (X and numLevels reused from above)
Y = rbinom(100, 1, 0.5)
fit = glinternet(X, Y, numLevels, family="binomial")

# binary response, categorical variables coded 0, 1, 2 (Y reused from above)
X = matrix(sample(0:2, 100*10, replace=TRUE), nrow=100)
numLevels = rep(3, 10)   # every variable has 3 levels
fit = glinternet(X, Y, numLevels, family="binomial")
# }
