h2o (version 2.8.4.4)

h2o.glm: H2O: Generalized Linear Models

Description

Fit a generalized linear model, specified by a response variable, a set of predictors, and a description of the error distribution.

Usage

h2o.glm(x, y, data, key = "", offset = NULL, family, link,
        tweedie.p = ifelse(family == "tweedie", 1.5, NA_real_),
        prior = NULL, nfolds = 0, alpha = 0.5, lambda = 1e-5,
        lambda_search = FALSE, nlambda = -1, lambda.min.ratio = -1,
        max_predictors = -1, return_all_lambda = FALSE,
        strong_rules = TRUE, standardize = TRUE, intercept = TRUE,
        non_negative = FALSE, use_all_factor_levels = FALSE,
        variable_importances = FALSE, epsilon = 1e-4, iter.max = 100,
        higher_accuracy = FALSE, beta_constraints = NULL, 
        disable_line_search = FALSE)

Arguments

x
A character vector containing the column names of the predictors in the model.
y
A character string representing the response variable in the model.
data
An H2OParsedData object containing the variables in the model.
key
An optional unique hex key assigned to the resulting model. If none is given, a key will automatically be generated.
offset
An optional character string representing the offset term in the model.
family
A character string specifying the error distribution of the model; one of "gaussian", "binomial", "poisson", "gamma", or "tweedie".
link
A character string specifying the link function. The default is the canonical link for the family ("identity" for "gaussian", "logit" for "binomial", "log" for "poisson", "inverse" for "gamma", and "tweedie" for "tweedie").
tweedie.p
A numeric specifying the power for the variance function when family = "tweedie".
prior
An optional numeric specifying the prior probability of class 1 in the response when family = "binomial". The default prior is the observational frequency of class 1.
nfolds
A non-negative integer specifying the number of folds for cross-validation; nfolds = 0 indicates no cross-validation.
alpha
A numeric in [0, 1] specifying the elastic-net mixing parameter. The elastic-net penalty is defined as $$P(\alpha,\beta) = (1-\alpha)\frac{1}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1 = \sum_j \left[\frac{1-\alpha}{2}\beta_j^2 + \alpha|\beta_j|\right],$$ so that alpha = 1 gives the lasso penalty and alpha = 0 the ridge penalty.
lambda
A non-negative shrinkage parameter for the elastic-net, which multiplies $P(\alpha,\beta)$ in the objective. When lambda = 0, no elastic-net penalty is applied and an ordinary generalized linear model is fit.
lambda_search
A logical value indicating whether to conduct a search over the space of lambda values, starting from the lambda argument to lambda.min.ratio times the smallest lambda that produces zeros for all the coefficient estimates.
nlambda
The number of lambda values to use when lambda_search = TRUE.
lambda.min.ratio
A non-negative number that specifies the minimum value for lambda as a fraction of the smallest lambda that yields the zero vector for the coefficient estimates.
max_predictors
When lambda_search = TRUE, a non-negative integer specifying an early stopping rule for the maximum number of predictors in the model.
return_all_lambda
A logical value indicating whether to return every model built during the lambda search. If return_all_lambda = FALSE, then only the model corresponding to the optimal lambda will be returned.
strong_rules
A logical value indicating whether to use strong rules to remove predictors with gradients near zero at the starting solution prior to model training.
standardize
A logical value indicating whether the numeric predictors should be standardized to have a mean of 0 and a variance of 1 prior to training the models.
intercept
A logical value indicating whether to include the intercept term in the models. This only has a practical effect when all of the predictors are numeric.
non_negative
A logical value indicating whether the coefficient estimates will be constrained to be non-negative.
use_all_factor_levels
A logical value indicating whether dummy variables should be used for all factor levels of the categorical predictors. When TRUE, this results in an over-parameterized model.
variable_importances
A logical value indicating whether the variable importances should be computed.
epsilon
A non-negative number specifying the magnitude of the maximum difference between the coefficient estimates from successive iterations. Defines the convergence criterion for h2o.glm.
iter.max
A non-negative integer specifying the maximum number of iterations.
higher_accuracy
A logical value indicating whether to use line search to produce more accurate estimates.
beta_constraints
A data.frame or H2OParsedData object with the columns ["names", "lower_bounds", "upper_bounds", "beta_given"], where each row corresponds to a predictor in the GLM: "names" contains the predictor names, "lower_bounds" and "upper_bounds" are the lower and upper bounds on the corresponding coefficient estimates, and "beta_given" supplies a given (prior) value for each coefficient. A sketch of constructing such a frame follows this argument list.
disable_line_search
A logical value indicating whether line search should be disabled.
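
As a rough, self-contained sketch (not part of the original examples) of how beta_constraints might be used with the prostate data from the Examples section; the bounds and given values below are arbitrary and for illustration only.

library(h2o)
localH2O = h2o.init()
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = h2o.importFile(localH2O, path = prostatePath, key = "prostate.hex")

# Arbitrary illustrative box constraints: one row per constrained predictor,
# with lower/upper bounds of -1/1 and a given (prior) value of 0
bc = data.frame(names        = c("AGE", "RACE", "PSA", "DCAPS"),
                lower_bounds = rep(-1, 4),
                upper_bounds = rep(1, 4),
                beta_given   = rep(0, 4),
                stringsAsFactors = FALSE)

h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "DCAPS"),
        data = prostate.hex, family = "binomial",
        beta_constraints = bc)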

Value

An object of class H2OGLMModel with slots key, data, model, and xval. The model slot is a list with the following components (a sketch for inspecting them follows this list):
  • coefficients: A named vector of the coefficients estimated in the model.
  • rank: The numeric rank of the fitted linear model.
  • family: The family of the error distribution.
  • deviance: The deviance of the fitted model.
  • aic: Akaike's Information Criterion for the final computed model.
  • null.deviance: The deviance for the null model.
  • iter: The number of algorithm iterations needed to compute the model.
  • df.residual: The residual degrees of freedom.
  • df.null: The residual degrees of freedom for the null model.
  • y: The response variable in the model.
  • x: A vector of the predictor variable(s) in the model.
  • auc: Area under the curve.
  • training.err: Average training error.
  • threshold: Best threshold.
  • confusion: Confusion matrix.

The xval slot is a list of H2OGLMModel objects representing the cross-validation models (each of these objects has xval equal to an empty list).
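
As a rough sketch (not part of the original page) of how these components might be inspected, assuming prostate.hex has been imported as in the Examples section:

fit = h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "DCAPS"),
              data = prostate.hex, family = "binomial")
fit@model$coefficients    # named vector of coefficient estimates
fit@model$deviance        # deviance of the fitted model
fit@model$null.deviance   # deviance for the null model
fit@model$aic             # AIC of the final computed model
fit@model$auc             # area under the curve (binomial family)
fit@model$confusion       # confusion matrix (binomial family)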

See Also

h2o.gbm, h2o.randomForest

Examples

# -- CRAN examples begin --
library(h2o)
localH2O = h2o.init()

# Run GLM of CAPSULE ~ AGE + RACE + PSA + DCAPS
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = h2o.importFile(localH2O, path = prostatePath, key = "prostate.hex")
h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS"), data = prostate.hex, 
        family = "binomial", nfolds = 0, alpha = 0.5, lambda_search = FALSE, 
        use_all_factor_levels = FALSE, variable_importances = FALSE,
        higher_accuracy = FALSE)

# Run GLM of VOL ~ CAPSULE + AGE + RACE + PSA + GLEASON
myX = setdiff(colnames(prostate.hex), c("ID", "DPROS", "DCAPS", "VOL"))
h2o.glm(y = "VOL", x = myX, data = prostate.hex, family = "gaussian",
        nfolds = 0, alpha = 0.1, lambda_search = FALSE,
        use_all_factor_levels = FALSE, variable_importances = FALSE,
        higher_accuracy = FALSE)
# -- CRAN examples end --

# GLM variable importance
# Also see:
#   https://github.com/h2oai/h2o/blob/master/R/tests/testdir_demos/runit_demo_VI_all_algos.R
data.hex = h2o.importFile(
  localH2O,
  path = "https://raw.github.com/h2oai/h2o/master/smalldata/bank-additional-full.csv",
  key = "data.hex")
myX = 1:20    # use the first 20 columns as predictors
myY = "y"     # response column
my.glm = h2o.glm(x=myX, y=myY, data=data.hex, family="binomial",
                 standardize=TRUE, use_all_factor_levels=TRUE,
                 higher_accuracy=TRUE, lambda_search=TRUE,
                 return_all_lambda=TRUE, variable_importances=TRUE)
best_model = my.glm@best_model
n_coeff = abs(my.glm@models[[best_model]]@model$normalized_coefficients)
VI = abs(n_coeff[-length(n_coeff)])    # drop the last entry (the intercept)
glm.VI = VI[order(VI, decreasing = TRUE)]
print(glm.VI)
