h2o (version 3.10.3.6)

h2o.naiveBayes: Compute naive Bayes probabilities on an H2O dataset.

Description

The naive Bayes classifier assumes independence between predictor variables conditional on the response, and a Gaussian distribution of numeric predictors with mean and standard deviation computed from the training dataset. When building a naive Bayes classifier, every row in the training dataset that contains at least one NA will be skipped completely. If the test dataset has missing values, then those predictors are omitted in the probability calculation during prediction.

Usage

h2o.naiveBayes(x, y, training_frame, model_id = NULL, nfolds = 0,
  seed = -1, fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"),
  fold_column = NULL, keep_cross_validation_predictions = FALSE,
  keep_cross_validation_fold_assignment = FALSE, validation_frame = NULL,
  ignore_const_cols = TRUE, score_each_iteration = FALSE,
  balance_classes = FALSE, class_sampling_factors = NULL,
  max_after_balance_size = 5, max_hit_ratio_k = 0, laplace = 0,
  threshold = 0.001, eps = 0, compute_metrics = TRUE,
  max_runtime_secs = 0)

Arguments

x
A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used.
y
The name of the response variable in the model. If the data does not contain a header, this is the column index number, starting at 0 and increasing from left to right. The response must be either an integer or a categorical variable.
training_frame
Id of the training data frame (Not required, to allow initial validation of model parameters).
model_id
Destination id for this model; auto-generated if not specified.
nfolds
Number of folds for N-fold cross-validation (0 to disable or >= 2). Defaults to 0.
seed
Seed for random numbers. It affects the stochastic parts of the algorithm, which might or might not be enabled by default. Defaults to -1 (time-based random number).
fold_assignment
Cross-validation fold assignment scheme, if fold_column is not specified. The 'Stratified' option will stratify the folds based on the response variable, for classification problems. Must be one of: "AUTO", "Random", "Modulo", "Stratified". Defaults to AUTO.
fold_column
Column with cross-validation fold index assignment per observation.
keep_cross_validation_predictions
Logical. Whether to keep the predictions of the cross-validation models. Defaults to FALSE.
keep_cross_validation_fold_assignment
Logical. Whether to keep the cross-validation fold assignment. Defaults to FALSE.
validation_frame
Id of the validation data frame.
ignore_const_cols
Logical. Ignore constant columns. Defaults to TRUE.
score_each_iteration
Logical. Whether to score during each iteration of model training. Defaults to FALSE.
balance_classes
Logical. Balance training data class counts via over/under-sampling (for imbalanced data). Defaults to FALSE.
class_sampling_factors
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
max_after_balance_size
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. Defaults to 5.0.
max_hit_ratio_k
Maximum number (top K) of predictions to use for hit ratio computation (multi-class only; 0 to disable). Defaults to 0.
laplace
Laplace smoothing parameter. Defaults to 0.
threshold
The minimum standard deviation to use for observations without enough data. Must be at least 1e-10. Defaults to 0.001.
eps
A threshold cutoff to deal with numeric instability; must be positive.
compute_metrics
Logical. Compute metrics on training data. Defaults to TRUE.
max_runtime_secs
Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0.
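
The laplace and threshold arguments are easier to see on a toy calculation. The base R sketch below uses the textbook form of Laplace smoothing and a simple floor on the standard deviation; both helpers (smoothed_probs, floor_sd) are hypothetical names introduced for illustration, and the exact formulas H2O applies internally are an assumption here, not taken from this page.

```r
# Categorical predictor observed within one response class: "no" never occurs.
values <- factor(c("yes", "yes", "yes", "yes"), levels = c("yes", "no"))
counts <- table(values)

# Hypothetical helper: textbook Laplace-smoothed conditional probabilities.
# Adds `laplace` pseudo-counts to every level so no level gets probability 0.
smoothed_probs <- function(counts, laplace) {
  (counts + laplace) / (sum(counts) + laplace * length(counts))
}

p_raw    <- smoothed_probs(counts, laplace = 0)  # unsmoothed estimate
p_smooth <- smoothed_probs(counts, laplace = 3)  # smoothed estimate

# Hypothetical helper: floor a class standard deviation at a minimum value,
# in the spirit of the `threshold` argument (assumed behaviour).
floor_sd <- function(x, threshold = 0.001) {
  max(sd(x), threshold)
}
sd_floored <- floor_sd(c(2.5, 2.5, 2.5))  # constant values have sd 0

print(p_raw)
print(p_smooth)
print(sd_floored)
```

Without smoothing, a level that never appears in a class gets probability zero, which zeroes out the whole product for that class at prediction time; a positive laplace value avoids this.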

Value

Returns an object of class H2OBinomialModel if the response has two categorical levels, and H2OMultinomialModel otherwise.

Details

The naive Bayes classifier assumes independence between predictor variables conditional on the response, and a Gaussian distribution of numeric predictors with mean and standard deviation computed from the training dataset. When building a naive Bayes classifier, every row in the training dataset that contains at least one NA will be skipped completely. If the test dataset has missing values, then those predictors are omitted in the probability calculation during prediction.
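
The behaviour described above can be sketched in base R, without H2O. This is a minimal illustration under stated assumptions, not H2O's implementation: the toy data, the predict_nb helper and all variable names are hypothetical. It shows rows with an NA being dropped at training time, per-class Gaussian parameters being estimated, and missing predictors being left out of the product at prediction time.

```r
# Toy training data: two numeric predictors, a two-level response, one NA.
train <- data.frame(
  x1 = c(1.0, 1.2, 0.8, 3.0, 3.2, NA),
  x2 = c(5.0, 5.5, 4.5, 1.0, 1.5, 1.2),
  y  = factor(c("a", "a", "a", "b", "b", "b"))
)

# Training: rows containing any NA are dropped entirely.
train_complete <- train[complete.cases(train), ]

# Per-class prior, and per-class mean and standard deviation of each predictor.
predictors <- c("x1", "x2")
classes <- levels(train_complete$y)
prior <- table(train_complete$y) / nrow(train_complete)
stats <- lapply(classes, function(k) {
  rows <- train_complete[train_complete$y == k, predictors]
  list(mean = sapply(rows, mean), sd = sapply(rows, sd))
})
names(stats) <- classes

# Prediction: multiply the prior by one Gaussian density per observed predictor.
# Predictors that are NA in the new row are left out of the product.
predict_nb <- function(newrow) {
  observed <- predictors[!is.na(unlist(newrow[predictors]))]
  score <- sapply(classes, function(k) {
    dens <- dnorm(unlist(newrow[observed]),
                  mean = stats[[k]]$mean[observed],
                  sd = stats[[k]]$sd[observed])
    as.numeric(prior[k]) * prod(dens)
  })
  score / sum(score)  # normalise to posterior probabilities
}

p_full    <- predict_nb(data.frame(x1 = 1.1, x2 = 5.2))
p_missing <- predict_nb(data.frame(x1 = NA, x2 = 5.2))
print(p_full)
print(p_missing)
```

Because the predictors are treated as independent given the class, dropping a missing predictor simply removes one factor from the product; no imputation is needed.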

Examples

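library(h2o)
h2o.init()  # start or connect to a local H2O cluster
# Locate the house votes dataset shipped with the h2o package and upload it
votesPath <- system.file("extdata", "housevotes.csv", package = "h2o")
votes.hex <- h2o.uploadFile(path = votesPath, header = TRUE)
# Column 1 is the response; columns 2 to 17 are the predictors
h2o.naiveBayes(x = 2:17, y = 1, training_frame = votes.hex, laplace = 3)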
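
A slightly longer variant of this example, offered as a sketch: it keeps the fitted model, turns on cross-validation through the nfolds and seed arguments listed under Arguments, and scores the training frame. It assumes a local H2O cluster can be started and that the h2o package's h2o.performance and h2o.predict functions are available; the values nfolds = 5 and seed = 1234 are arbitrary choices for illustration.

```r
library(h2o)
h2o.init()

votesPath <- system.file("extdata", "housevotes.csv", package = "h2o")
votes.hex <- h2o.uploadFile(path = votesPath, header = TRUE)

# Fit with 5-fold cross-validation; a fixed seed makes the folds repeatable.
model <- h2o.naiveBayes(x = 2:17, y = 1, training_frame = votes.hex,
                        laplace = 3, nfolds = 5, seed = 1234)

# Class of the returned model object (see the Value section).
class(model)

# Training metrics and cross-validated metrics.
h2o.performance(model)
h2o.performance(model, xval = TRUE)

# Predicted class and per-class probabilities for each row.
pred <- h2o.predict(model, newdata = votes.hex)
head(pred)
```

Assigning the result to a variable, rather than printing it as in the shorter example, is what makes the later calls to inspect and score the model possible.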
