spark.glm: Generalized Linear Models

Description

Fits a generalized linear model against a SparkDataFrame. Users can call summary to print a summary of the fitted model, predict to make predictions on new data, and write.ml/read.ml to save/load fitted models.

Usage

spark.glm(data, formula, ...)
# S4 method for SparkDataFrame,formula
spark.glm(data, formula, family = gaussian,
  tol = 1e-06, maxIter = 25, weightCol = NULL, regParam = 0)
# S4 method for GeneralizedLinearRegressionModel
summary(object)
# S3 method for summary.GeneralizedLinearRegressionModel
print(x, ...)
# S4 method for GeneralizedLinearRegressionModel
predict(object, newData)
# S4 method for GeneralizedLinearRegressionModel,character
write.ml(object, path,
  overwrite = FALSE)

Arguments

data

a SparkDataFrame for training.

formula

a symbolic description of the model to be fitted. Currently only a few formula operators are supported, including '~', '.', ':', '+', and '-'.

...

additional arguments passed to the method.

family

a description of the error distribution and link function to be used in the model. This can be a character string naming a family function, a family function, or the result of a call to a family function. See the R family documentation at https://stat.ethz.ch/R-manual/R-devel/library/stats/html/family.html. The currently supported families are binomial, gaussian, Gamma, and poisson.

tol

positive convergence tolerance of iterations.

maxIter

integer giving the maximum number of iteratively reweighted least squares (IRLS) iterations.

weightCol

the weight column name. If this is not set or NULL, we treat all instance weights as 1.0.

regParam

regularization parameter for L2 regularization.

object

a fitted generalized linear model.

x

summary object of a fitted generalized linear model, as returned by the summary function.

newData

a SparkDataFrame for testing.

path

the directory where the model is saved.

overwrite

whether to overwrite the output path if it already exists. The default is FALSE, which means an exception is thrown if the output path exists.
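
The three accepted forms of the family argument (a character string, a family function, or the result of calling a family function) all describe the same kind of object. The following is a minimal base R sketch of how such an argument can be normalized, following the pattern used by stats::glm. The helper name resolve_family is hypothetical and not part of SparkR, and whether spark.glm resolves the argument in exactly this way is an assumption.

```r
# Hypothetical helper (not part of SparkR): normalizes the three accepted
# forms of a family argument into a family object, following the pattern
# used by stats::glm in base R.
resolve_family <- function(family) {
  if (is.character(family)) {
    # a string such as "poisson" names a family function
    family <- get(family, mode = "function", envir = parent.frame())
  }
  if (is.function(family)) {
    # a family function such as poisson is called with its default link
    family <- family()
  }
  if (is.null(family$family)) {
    stop("'family' not recognized")
  }
  family
}

from_string   <- resolve_family("poisson")
from_function <- resolve_family(poisson)
from_call     <- resolve_family(poisson(link = "log"))

# All three describe the same distribution and link.
print(c(from_string$family, from_function$family, from_call$family))
print(c(from_string$link, from_function$link, from_call$link))
```

Passing the result of a call, as in poisson(link = "log"), is the only one of the three forms that lets the caller name a link explicitly; the other two fall back to the family's default link.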

Value

spark.glm returns a fitted generalized linear model.

summary returns summary information of the fitted model as a list. The list includes at least coefficients (the coefficients matrix, containing the coefficients, their standard errors, t values and p values), null.deviance (null/residual degrees of freedom), aic (AIC) and iter (the number of iterations IRLS takes). If the data contain collinear columns, the coefficients matrix provides only the coefficients.

predict returns a SparkDataFrame containing predicted labels in a column named "prediction".

Examples


# NOT RUN {
sparkR.session()
data(iris)
df <- createDataFrame(iris)
model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian")
summary(model)

# fitted values on training data
fitted <- predict(model, df)
head(select(fitted, "Sepal_Length", "prediction"))

# save the fitted model to the given path
path <- "path/to/model"
write.ml(model, path)

# can also read back the saved model and print
savedModel <- read.ml(path)
summary(savedModel)
# }
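
As a further sketch, the components listed under Value can be read from the summary object by name, and overwrite = TRUE lets write.ml replace a model saved earlier at the same path. This assumes SparkR is installed and a local Spark session can be started; the temporary path is chosen only for illustration.

```r
library(SparkR)
sparkR.session()

# createDataFrame replaces the dots in the iris column names with underscores
df <- createDataFrame(iris)
model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian")

# summary returns a list, so its components are accessed by name
s <- summary(model)
s$coefficients   # coefficients matrix
s$aic            # AIC
s$iter           # number of IRLS iterations

# illustrative path; the second call would fail without overwrite = TRUE
path <- file.path(tempdir(), "spark-glm-model")
write.ml(model, path)
write.ml(model, path, overwrite = TRUE)

sparkR.session.stop()
```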
