glm.fit: Fitting generalized linear models

Description

glm.fit is used to fit generalized linear models specified by a model matrix and response vector. glm is a simplified interface for scidbdf objects, similar to (but much simpler than) the standard R glm function.

Usage

"glm.fit"(x,y,weights=NULL,family=gaussian())
"glm"(formula, family=gaussian(), data, weights)
model_scidb(formula, data, factors=NULL)

Arguments

x

a model matrix of dimension 'n * p'.

y

a response vector of length 'n'.

formula

an object of class formula (or one that can be coerced to that class): a symbolic description of the model to be fitted. See details for limitations.

data

an object of class scidbdf.

weights

an optional vector of 'prior weights' to be used in the fitting process. Should be 'NULL' or a numeric or scidb vector.

family

a description of the error distribution and link function to be used in the model, supplied as the result of a call to a family function.

factors

a list of factor encodings to use in the model matrix. See details.

Value

The glm.fit and glm functions return a list of model output values described below. The glm method uses an S3 class to implement print, summary, and predict methods.

coefficients model coefficient vector (SciDB array)
stderr vector of model coefficient standard errors (SciDB array)
tval vector of model coefficient t ratio values using estimated dispersion value (SciDB array)
pval vector of two-tailed p-values corresponding to the t ratio based on a Student t distribution. (It is possible that the dispersion is not known and there are no residual degrees of freedom from which to estimate it. In that case the estimate is 'NaN'.)
aic a version of Akaike's An Information Criterion value.
null.deviance the deviance for the null model, comparable with deviance.
res.deviance up to a constant, minus twice the maximized log-likelihood.
dispersion For binomial and Poisson families the dispersion is fixed at one and the number of parameters is the number of coefficients. For gaussian, Gamma and inverse gaussian families the dispersion is estimated from the residual deviance, and the number of parameters is the number of coefficients plus one. For a gaussian family the MLE of the dispersion is used so this is a valid value of AIC, but for Gamma and inverse gaussian families it is not. Other families set this value to NA.
df.null the residual degrees of freedom for the null model.
df.residual the residual degrees of freedom.
converged FALSE if the model did not converge.
totalObs total number of observations in the model.
nOK total number of observations corresponding to nonzero weights.
loglik converged model log-likelihood value.
rss residual sum of squares.
iter number of model iterations.
weights vector of weights used in the model (SciDB array).
family model family function.
y response vector (SciDB array).
x model matrix (SciDB array).
factors a list of factor variable levels (SciDB arrays) or NULL if no factors are present in the data.
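
The relationship between the coefficients, stderr, tval and pval entries can be illustrated with a minimal base R sketch. It uses stats::glm on an ordinary data.frame rather than the scidb method, and the simulated data and variable names are hypothetical; it only shows how a t ratio and a two-tailed Student t p-value are usually derived from an estimate, its standard error and the residual degrees of freedom.

```r
# Base R illustration only; this is not the scidb implementation.
set.seed(1)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- 1 + 0.5 * d$x1 + rnorm(200)

fit <- stats::glm(y ~ x1 + x2, family = gaussian(), data = d)
s <- summary(fit)

est <- coef(s)[, "Estimate"]      # analogous to 'coefficients'
se <- coef(s)[, "Std. Error"]     # analogous to 'stderr'
tval <- est / se                  # t ratio using the estimated dispersion
# Two-tailed p-value from a Student t distribution with the
# residual degrees of freedom.
pval <- 2 * pt(-abs(tval), df = fit$df.residual)

# The hand-computed values agree with what summary() reports.
agree <- isTRUE(all.equal(unname(pval), unname(coef(s)[, "Pr(>|t|)"])))
print(agree)
```

With zero residual degrees of freedom the dispersion cannot be estimated, which is the 'NaN' case noted for pval above.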

model_scidb returns an output list with:

formula the model formula.
model the model matrix (SciDB array).
response the model response vector (SciDB array).
names an R character vector of variable names in the model matrix.
intercept a logical value; if TRUE the model includes an intercept term.
factors a list of factor variable levels (SciDB arrays) or NULL if no factors are present in the data.

Details

The glm function works similarly to a limited version of the usual glm function, but with a scidbdf data.frame-like SciDB array instead of a standard data.frame.

Formulas in the glm function may only refer to variables explicitly defined in the data scidbdf object. That means that you should bind interaction and transformed terms to your data before invoking the function. The indicated response must refer to a single-column response term in the data (the two-column response form is not accepted).
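
As a minimal sketch of binding derived terms before fitting, the following uses base R only. The column names (x1, x2, log_x1, x1_x2) and the simulated data are hypothetical, and stats::glm stands in for the scidb method; the assumption is that the finished data frame would then be uploaded with as.scidb, as in the Examples section, and passed to glm with the same formula.

```r
# Base R illustration only; column names and data are hypothetical.
set.seed(42)
d <- data.frame(x1 = runif(50, 1, 10), x2 = rnorm(50))
d$y <- rpois(50, lambda = exp(0.1 * d$x2 + 0.05 * log(d$x1)))

# Bind the transformed and interaction terms as explicit columns.
d$log_x1 <- log(d$x1)
d$x1_x2 <- d$x1 * d$x2

# The formula refers only to columns that exist in the data.
f <- y ~ x2 + log_x1 + x1_x2
fit_bound <- stats::glm(f, family = poisson(), data = d)

# For comparison: the inline form that standard glm accepts.
fit_inline <- stats::glm(y ~ x2 + log(x1) + I(x1 * x2),
                         family = poisson(), data = d)

# Both fits give the same coefficient values.
same_fit <- isTRUE(all.equal(unname(coef(fit_bound)),
                             unname(coef(fit_inline))))
print(same_fit)
```

Keeping every term as a named column means the formula needs no evaluation beyond column lookup, which matches the restriction described above.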

Categorical (factor) variables in the data must be represented as strings. They will be encoded as treatment-style contrast variables with the first listed value set to the baseline value. No other automated contrast encodings are available yet (you are free to build your own model matrix and use glm.fit for that). All other variables will be coerced to double-precision values.
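
Treatment-style contrasts can be illustrated with base R's model.matrix. This is a sketch of the encoding idea, not of the scidb implementation: the data are hypothetical, and taking the "first listed value" to mean the first value in order of appearance is an assumption made here for illustration.

```r
# Base R illustration only; data and level ordering are assumptions.
d <- data.frame(
  group = c("b", "a", "c", "a", "b", "c"),  # categorical variable as strings
  z = c(1.5, 2.0, 0.5, 3.0, 2.5, 1.0),
  stringsAsFactors = FALSE
)

# Assumption: the baseline is the first value in order of appearance ("b").
d$g <- factor(d$group, levels = unique(d$group))

# Treatment contrasts: one indicator column per non-baseline level.
X <- model.matrix(~ g + z, data = d,
                  contrasts.arg = list(g = "contr.treatment"))
print(colnames(X))

# Rows belonging to the baseline level have zeros in every indicator column.
baseline_rows <- X[d$group == "b", c("ga", "gc"), drop = FALSE]
print(all(baseline_rows == 0))
```

A model matrix built by hand in this way, with a different contrast scheme if needed, is the kind of input that glm.fit accepts once it has been uploaded to SciDB.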

Use the model_scidb function to build a model matrix from a formula and a scidbdf data frame-like SciDB array. The matrix is returned within an output list as a sparse SciDB matrix of class scidb, with character string variables encoded as treatment contrasts as described above. If you already have a list of factor-level codes for categorical variables (for example, from the output of glm), you can supply it in the factors argument. See help for predict for an example.

Examples

## Not run: 
# # Using glm.fit
# x <- as.scidb(matrix(rnorm(5000*20),nrow=5000))
# y <- as.scidb(rnorm(5000))
# M <- glm.fit(x, y)
# coef(M)[]
# 
# # Using glm (similar to standard glm in this case)
# # From the 'glm' help:
# ## Dobson (1990) Page 93: Randomized Controlled Trial :
# counts <- c(18,17,15,20,10,20,25,13,12)
# outcome <- gl(3,1,9)
# treatment <- gl(3,3)
# d.AD <- data.frame(treatment, outcome, counts)
# glm.D93 <- glm(counts ~ outcome + treatment, family = poisson(),data=d.AD)
# summary(glm.D93)
# 
# # Compare with:
# d.AD_sci = as.scidb(d.AD)
# glm.D93_sci = glm(counts ~ outcome + treatment, family = poisson(), data=d.AD_sci)
# summary(glm.D93_sci)
# ## End(Not run)