h2o.glm: H2O Generalized Linear Models

Description

Fit a generalized linear model, specified by a response variable, a set of predictors, and a description of the error distribution.

Usage

h2o.glm(x, y, training_frame, model_id, validation_frame = NULL,
  ignore_const_cols = TRUE, max_iterations = 50, beta_epsilon = 0,
  solver = c("IRLSM", "L_BFGS"), standardize = TRUE,
  family = c("gaussian", "binomial", "poisson", "gamma", "tweedie",
  "multinomial"), link = c("family_default", "identity", "logit", "log",
  "inverse", "tweedie"), tweedie_variance_power = NaN,
  tweedie_link_power = NaN, alpha = 0.5, prior = NULL, lambda = 1e-05,
  lambda_search = FALSE, nlambdas = -1, lambda_min_ratio = -1,
  nfolds = 0, fold_column = NULL, fold_assignment = c("AUTO", "Random",
  "Modulo"), keep_cross_validation_predictions = FALSE,
  beta_constraints = NULL, offset_column = NULL, weights_column = NULL,
  intercept = TRUE, max_active_predictors = -1, objective_epsilon = -1,
  gradient_epsilon = -1, non_negative = FALSE, compute_p_values = FALSE,
  remove_collinear_columns = FALSE, max_runtime_secs = 0,
  missing_values_handling = c("MeanImputation", "Skip"))

Arguments

x

A vector containing the names or indices of the predictor variables to use in building the GLM model.

y

A character string or index that represents the response variable in the model.

training_frame

An H2OFrame object containing the variables in the model.

model_id

(Optional) The unique id assigned to the resulting model. If none is given, an id will automatically be generated.

validation_frame

(Optional) An H2OFrame object containing the variables in the model, used for validation. Defaults to NULL.

ignore_const_cols

A logical value indicating whether or not to ignore all the constant columns in the training frame.

max_iterations

A non-negative integer specifying the maximum number of iterations.

beta_epsilon

A non-negative number specifying the magnitude of the maximum difference between the coefficient estimates from successive iterations. Defines the convergence criterion for h2o.glm.
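To illustrate this criterion, the following base R sketch checks whether two successive coefficient vectors differ by less than a tolerance. The helper name `has_converged` and the sample values are hypothetical, and the strict inequality is an assumption; the exact comparison performed inside h2o.glm may differ.

```r
# Hypothetical helper: TRUE when the largest absolute coefficient change
# between two successive iterations falls below beta_epsilon.
has_converged <- function(beta_old, beta_new, beta_epsilon) {
  max(abs(beta_new - beta_old)) < beta_epsilon
}

# Made-up coefficient estimates from two successive iterations.
beta_prev <- c(0.50, -1.20, 0.03)
beta_curr <- c(0.5001, -1.2002, 0.03)

has_converged(beta_prev, beta_curr, beta_epsilon = 1e-3)  # TRUE
has_converged(beta_prev, beta_curr, beta_epsilon = 1e-4)  # FALSE
```

A smaller beta_epsilon demands more iterations before the check passes, so it interacts with max_iterations.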

solver

A character string specifying the solver used: "IRLSM" (supports more features) or "L_BFGS" (scales better for datasets with many columns).

standardize

A logical value indicating whether the numeric predictors should be standardized to have a mean of 0 and a variance of 1 prior to training the models.
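As a rough picture of what standardization means here, the base R sketch below centers and scales a small matrix of made-up predictor values. It uses `scale()`, which divides by the sample standard deviation; whether h2o.glm uses exactly the same estimator internally is an assumption.

```r
# Toy numeric predictors (made-up values).
X <- cbind(age = c(50, 60, 70, 80), psa = c(1.5, 4.0, 9.5, 20.0))

# scale() centers each column to mean 0 and divides by its standard
# deviation, so every column ends up with variance 1.
X_std <- scale(X, center = TRUE, scale = TRUE)

colMeans(X_std)        # approximately 0 for each column
apply(X_std, 2, var)   # 1 for each column
```

Standardizing puts predictors on a common scale, which matters when a penalty is applied to the coefficients.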

family

A character string specifying the distribution of the model: gaussian, binomial, poisson, gamma, tweedie, multinomial.

link

A character string specifying the link function. The default is the canonical link for the family. The supported links for each of the family specifications are: "gaussian": "identity", "log"

tweedie_variance_power

A numeric specifying the power for the variance function when family = "tweedie".

tweedie_link_power

A numeric specifying the power for the link function when family = "tweedie".

alpha

A numeric in [0, 1] specifying the elastic-net mixing parameter. The elastic-net penalty is defined to be: $$P(\alpha,\beta) = (1-\alpha)/2||\beta||_2^2 + \alpha||\beta||_1 = \sum_j [(1-\alpha)/2 \beta_j^2 + \alpha|\beta_j|]$$ making alpha = 1

prior

(Optional) A numeric specifying the prior probability of class 1 in the response when family = "binomial". The default prior is the observational frequency of class 1. Must be from (0,1) exclusive range or NULL (no prior).

lambda

A non-negative shrinkage parameter for the elastic-net, which multiplies $P(\alpha,\beta)$ in the objective function. When lambda = 0, no elastic-net penalty is applied and ordinary generalized linear models are fit.
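The penalty term can be worked through numerically. The base R sketch below evaluates the elastic-net penalty formula given in the alpha description for a made-up coefficient vector, then scales it by lambda. The function name `elastic_net_penalty` and the coefficient values are hypothetical; this only illustrates the formula and is not how h2o.glm computes it internally.

```r
# Elastic-net penalty P(alpha, beta) as written in the alpha description.
elastic_net_penalty <- function(beta, alpha) {
  (1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta))
}

beta <- c(0.5, -2, 0)   # made-up coefficients

elastic_net_penalty(beta, alpha = 1)    # L1 term only: 2.5
elastic_net_penalty(beta, alpha = 0)    # L2 term only: 2.125
elastic_net_penalty(beta, alpha = 0.5)  # mix of both: 2.3125

# lambda multiplies the whole penalty; lambda = 0 removes it entirely.
lambda <- 1e-05
lambda * elastic_net_penalty(beta, alpha = 0.5)
```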

lambda_search

A logical value indicating whether to search over a range of lambda values, starting from lambda max. When enabled, the given lambda is interpreted as lambda min.

nlambdas

The number of lambda values to use when lambda_search = TRUE.

lambda_min_ratio

Smallest value for lambda as a fraction of lambda.max. By default if the number of observations is greater than the number of variables then lambda_min_ratio = 0.0001; if the number of observations is less than the number of variables the

nfolds

(Optional) Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.

fold_column

(Optional) Column with cross-validation fold index assignment per observation.

fold_assignment

Cross-validation fold assignment scheme, used if fold_column is not specified. Must be "AUTO", "Random" or "Modulo".
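The difference between the two named schemes can be sketched in base R. The interpretation below is an assumption based on the option names: "Modulo" is taken to mean row index modulo the number of folds, and "Random" to mean an independent random draw per row. The variable names are hypothetical and the fold labels h2o.glm actually produces may differ.

```r
n_rows <- 10
nfolds <- 3

# "Modulo"-style assignment (assumed meaning): row index modulo nfolds,
# which cycles through the folds in order.
fold_modulo <- (seq_len(n_rows) - 1) %% nfolds
fold_modulo  # 0 1 2 0 1 2 0 1 2 0

# "Random"-style assignment (assumed meaning): each row draws a fold at random.
set.seed(42)  # only to make the sketch reproducible
fold_random <- sample(0:(nfolds - 1), n_rows, replace = TRUE)
table(fold_random)
```

A deterministic scheme gives the same folds on every run, while a random scheme depends on the seed.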

keep_cross_validation_predictions

Whether to keep the predictions of the cross-validation models.

beta_constraints

A data.frame or H2OParsedData object with the columns ["names", "lower_bounds", "upper_bounds", "beta_given", "rho"], where each row corresponds to a predictor in the GLM. "names" contains the predictor names, "lower_bounds" and "upper_bounds" are the low

offset_column

Specify the offset column.

weights_column

Specify the weights column.

intercept

Logical, include constant term (intercept) in the model.

max_active_predictors

(Optional) Convergence criteria for number of predictors when using L1 penalty.

objective_epsilon

Convergence criteria. Converge if relative change in objective function is below this threshold.

gradient_epsilon

Convergence criteria. Converge if gradient l-infinity norm is below this threshold.
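The objective_epsilon and gradient_epsilon checks can be written out as small base R functions. Both helper names and all numeric values are hypothetical, and dividing by the previous objective value (assumed non-zero) is one possible reading of "relative change"; h2o.glm may define it differently.

```r
# Hypothetical check for objective_epsilon: relative change in the objective.
# Assumes obj_old is non-zero.
objective_converged <- function(obj_old, obj_new, objective_epsilon) {
  abs(obj_old - obj_new) / abs(obj_old) < objective_epsilon
}

# Hypothetical check for gradient_epsilon: the l-infinity norm is the
# largest absolute component of the gradient.
gradient_converged <- function(gradient, gradient_epsilon) {
  max(abs(gradient)) < gradient_epsilon
}

objective_converged(obj_old = 100, obj_new = 99.9999,
                    objective_epsilon = 1e-5)              # TRUE
gradient_converged(c(1e-5, -3e-4, 2e-6),
                   gradient_epsilon = 1e-4)                # FALSE
```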

non_negative

Logical, allow only positive coefficients.

compute_p_values

(Optional) Logical, compute p-values, only allowed with IRLSM solver and no regularization. May fail if there are collinear predictors.

remove_collinear_columns

(Optional) Logical, valid only with no regularization. If set, collinear columns will be automatically ignored (their coefficients will be 0).

max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

missing_values_handling

(Optional) Controls handling of missing values. Can be either "MeanImputation" or "Skip". MeanImputation replaces missing values with mean for numeric and most frequent level for categorical, Skip ignores observations with any missing value. Applied both

...

(Currently Unimplemented) coefficients.

Value

A subclass of H2OModel is returned. The specific subclass depends on the machine learning task at hand (if it's binomial classification, then an H2OBinomialModel is returned; if it's regression, then an H2ORegressionModel is returned). The default print-out of the models is shown, but further GLM-specific information can be queried from the object. To access these various items, please refer to the seealso section below.
Upon completion of the GLM, the resulting object has coefficients, normalized coefficients, residual/null deviance, aic, and a host of model metrics including MSE, AUC (for logistic regression), degrees of freedom, and confusion matrices. Please refer to the more in-depth GLM documentation available here: http://h2o-release.s3.amazonaws.com/h2o-dev/rel-shannon/2/docs-website/h2o-docs/index.html#Data+Science+Algorithms-GLM

Examples

h2o.init()

# Run GLM of CAPSULE ~ AGE + RACE + PSA + DCAPS
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = h2o.importFile(path = prostatePath, destination_frame = "prostate.hex")
h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS"), training_frame = prostate.hex,
        family = "binomial", nfolds = 0, alpha = 0.5, lambda_search = FALSE)

# Run GLM of VOL ~ CAPSULE + AGE + RACE + PSA + GLEASON
myX = setdiff(colnames(prostate.hex), c("ID", "DPROS", "DCAPS", "VOL"))
h2o.glm(y = "VOL", x = myX, training_frame = prostate.hex, family = "gaussian",
        nfolds = 0, alpha = 0.1, lambda_search = FALSE)


# GLM variable importance
# Also see:
#   https://github.com/h2oai/h2o/blob/master/R/tests/testdir_demos/runit_demo_VI_all_algos.R
data.hex = h2o.importFile(
  path = "https://s3.amazonaws.com/h2o-public-test-data/smalldata/demos/bank-additional-full.csv",
  destination_frame = "data.hex")
myX = 1:20
myY = "y"
my.glm = h2o.glm(x = myX, y = myY, training_frame = data.hex, family = "binomial",
                 standardize = TRUE, lambda_search = TRUE)
