MGLMsparsereg: Fit multivariate GLM sparse regression

Description

Fit sparse regression in multivariate generalized linear models.

Usage

MGLMsparsereg(
  formula,
  data,
  dist,
  lambda,
  penalty,
  weight,
  init,
  penidx,
  maxiters = 150,
  ridgedelta,
  epsilon = 1e-05,
  regBeta = FALSE,
  overdisp
)
MGLMsparsereg.fit(
  Y,
  X,
  dist,
  lambda,
  penalty,
  weight,
  init,
  penidx,
  maxiters = 150,
  ridgedelta,
  epsilon = 1e-05,
  regBeta = FALSE,
  overdisp
)

Arguments

formula

an object of class formula (or one that can be coerced to that class): a symbolic description of the model to be fitted. The response has to be on the left hand side of ~.

data

an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data when using function MGLMsparsereg, the variables are taken from environment(formula), typically the environment from which MGLMsparsereg is called.

dist

a description of the error distribution to fit. See dist for details.

lambda

penalty parameter.

penalty

penalty type for the regularization term. Can be chosen from "sweep", "group", or "nuclear". See Details for the description of each penalty type.

weight

an optional vector of weights assigned to each row of the data. Should be NULL or a numeric vector. Could be a variable from data, or a variable from environment(formula) with the length equal to the number of rows of the data. If weight=NULL, equal weights of ones will be assigned.

init

an optional matrix of initial value of the parameter estimates. Should have the compatible dimension with the data. See dist for details of the dimensions in each distribution.

penidx

a logical vector indicating the variables to be penalized. The default value is rep(TRUE, p), which means all predictors are subject to regularization. If X contains intercept, use penidx=c(FALSE,rep(TRUE,p-1)).

maxiters

an optional numeric controlling the maximum number of iterations. The default value is maxiters=150.

ridgedelta

an optional numeric controlling the behavior of the Nesterov's accelerated proximal gradient method. The default value is $\frac{1}{pd}$.

epsilon

an optional numeric controlling the stopping criterion. The algorithm terminates when the relative change in the objective values of two successive iterates is less then epsilon. The default value is epsilon=1e-5.

regBeta

an optional logical variable used when running negative multinomial regression (dist="NegMN"). regBeta controls whether to run regression on the over-dispersion parameter. The default is regBeta=FALSE.

overdisp

an optional numerical variable used only when fitting sparse negative multinomial model dist="NegMN" and regBeta=FALSE. overdisp gives the over dispersion value for all the observations. The default value is estimated using negative-multinomial regression. When dist="MN", "DM", "GDM" or regBeta=TRUE, the value of overdisp is ignored.

a matrix containing the multivariate categorical response data. Rows of the matrix represent observations, while columns are the different categories. Rows and columns of all zeros are automatically removed when dist="MN", "DM", or "GDM".

design matrix (including intercept). Number of rows of the matrix should match that of Y.

Value

Returns an object of class "MGLMsparsereg". An object of class "MGLMsparsereg" is a list containing at least the following components:

coefficients the estimated matrix of regression coefficients.
logL the final loglikelihood value.
AIC Akaike information criterion.
BIC Bayesian information criterion.
Dof degrees of freedom of the estimated parameter.
iter number of iterations used.
maxlambda the maxmum tuning parameter such that the estimated coefficients are not all zero. This value is returned only when the tuning parameter lambda given to the function is large enough such that all the parameter estimates are zero; otherwise, maxlambda is not computed.
call a matched call.
data the data used to fit the model: a list of the predictor matrix and the response matrix.
penalty the penalty chosen when running the penalized regression.

Details

In general, we consider regularization problem $$ \min_B h(B) = -l(B)+ J(B), $$ where $l(B)$ is the loglikelihood function and $J(B)$ is the regularization function.

Sparsity in the individual elements of the parameter matrix $B$ is achieved by the lasso penalty (dist="sweep") $$ J(B) = \lambda \sum_{k\in penidx} \sum_{j=1}^d \|B_{kj}\| $$

Sparsity in the rows of the regression parameter matrix $B$ is achieved by the group penalty (dist="group") $$ J(B) = \lambda \sum_{k \in penidx} \|B_{k \cdot}\|_2, $$ where $\|v\|_2$ is the $l_2$ norm of a vector $v$. In other words, $\|B_{k\cdot}\|_2$ is the $l_2$ norm of the $k$-th row of the parameter matrix $B$.

Sparsity in the rank of the parameter matrix $B$ is achieved by the nuclear norm penalty (dist="nuclear") $$ J(B) = \lambda \|B\|_*= \lambda \sum_{i=1}^{min(p, d)} \sigma_i(B), $$ where $\sigma_i(B)$ are the singular values of the parameter matrix $B$. The nuclear norm $\|B\|_*$ is a convex relaxation of $rank(B)=\|\sigma(B)\|_0$.

See dist for details about distributions.

Examples

Run this code

# NOT RUN {
## Generate Dirichlet Multinomial data
dist <- "DM"
n <- 100
p <- 15
d <- 5
m <- runif(n, min=0, max=25) + 25
set.seed(134)
X <- matrix(rnorm(n*p),n, p)
alpha <- matrix(0, p, d)
alpha[c(1,3, 5), ] <- 1
Alpha <- exp(X%*%alpha)
Y <- rdirmn(size=m, alpha=Alpha)

## Tuning
ngridpt <- 10
p <- ncol(X)
d <- ncol(Y)
pen <- 'nuclear'
spfit <- MGLMsparsereg(formula=Y~0+X, dist=dist, lambda=Inf, penalty=pen)

# }

Run the code above in your browser using DataLab