stan_biglm: Bayesian regularized linear but big models via Stan

Description

This is the same model as with stan_lm but it utilizes the output from biglm in the biglm package in order to proceed when the data is too large to fit in memory.

Usage

stan_biglm(biglm, xbar, ybar, s_y, has_intercept = TRUE, ..., prior = R2(stop("'location' must be specified")), prior_intercept = NULL, prior_PD = FALSE, algorithm = c("sampling", "meanfield", "fullrank"), adapt_delta = NULL)
stan_biglm.fit(b, R, SSR, N, xbar, ybar, s_y, has_intercept = TRUE, ..., prior = R2(stop("'location' must be specified")), prior_intercept = NULL, prior_PD = FALSE, algorithm = c("sampling", "meanfield", "fullrank"), adapt_delta = NULL)

Arguments

biglm

The list output by biglm in the biglm package. The original call to biglm must not have an intercept and must utilize centered but not standardized predictors. See the Details section or the Example.

xbar

A numeric vector of column means of the implicit design matrix for the observations included in the model

ybar

A numeric scalar indicating the sample mean of the outcome for the observations included in the model

s_y

A numeric scalar indicating the unbiased sample standard deviation of the outcome for the observations included in the model

has_intercept

A logical scalar indicating whether to add an intercept to the model when estimating it

...

Further arguments passed to the function in the rstan package (sampling, vb, or optimizing), corresponding to the estimation method named by algorithm. For example, if algorithm is "sampling" it is possible to specify iter, chains, cores, refresh, etc.

prior

Must be a call to R2 with its location argument specified or NULL, which would indicate a standard uniform prior for the $R^2$.

prior_intercept

Either NULL (the default) or a call to normal. If a normal prior is specified without a scale, then the standard deviation is taken to be the marginal standard deviation of the outcome divided by the square root of the sample size, which is legitimate because the marginal standard deviation of the outcome is a primitive parameter being estimated.

prior_PD

A logical scalar (defaulting to FALSE) indicating whether to draw from the prior predictive distribution instead of conditioning on the outcome.

algorithm

A string (possibly abbreviated) indicating the estimation approach to use. Can be "sampling" for MCMC (the default), "optimizing" for optimization, "meanfield" for variational inference with independent normal distributions, or "fullrank" for variational inference with a multivariate normal distribution. See rstanarm-package for more details on the estimation algorithms. NOTE: not all fitting functions support all four algorithms.

adapt_delta

Only relevant if algorithm="sampling". See adapt_delta for details.

b

A numeric vector of OLS coefficients, excluding the intercept

R

A square upper-triangular matrix from the QR decomposition of the design matrix

SSR

A numeric scalar indicating the sum-of-squared residuals for OLS

N

An integer scalar indicating the number of included observations

Value

The output of both stan_biglm and stan_biglm.fit is an object of stanfit-class rather than stanreg-objects, which is more limited and less convenient but necessitated by the fact that stan_biglm does not bring the full design matrix into memory. Without the full design matrix, some of the elements of a stanreg-objects object, such as residuals, cannot be calculated. Thus, the functions in the rstanarm package that take stanreg-objects as input, such as posterior_predict, cannot be used.
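
Because the stanreg helper functions are unavailable, posterior summaries have to be computed from the draws directly. The following is a minimal sketch of one way to do that, not part of rstanarm: summarize_draws is a hypothetical helper, and the draws matrix is simulated here so that the code runs without Stan. With a real fit, a matrix of draws can be obtained with as.matrix(post), whose column names will differ from the made-up ones below.

```r
# Hypothetical helper: summarize a matrix of posterior draws
# (rows = draws, columns = parameters).
summarize_draws <- function(draws, probs = c(0.05, 0.5, 0.95)) {
  t(apply(draws, 2, function(x) {
    c(mean = mean(x), sd = sd(x), quantile(x, probs))
  }))
}

# Simulated stand-in for as.matrix(post); the column names are assumptions.
set.seed(1)
fake_draws <- cbind(wt = rnorm(500, -3, 0.5), qsec = rnorm(500, 1, 0.2))

# One row per parameter, with the mean, sd, and requested quantiles.
draw_summary <- summarize_draws(fake_draws)
print(draw_summary)
```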

Details

The stan_biglm function is intended to be used in the same circumstances as the biglm function in the biglm package but with an informative prior on the $R^2$ of the regression. As with biglm, the memory required to estimate the model depends largely on the number of predictors rather than the number of observations. However, the original call to biglm must be a little unconventional. The original formula must not include an intercept, and all the columns of the implicit design matrix must be expressed as deviations from the sample mean. If the design matrix is on the hard disk, the column sums must be accumulated and divided by the sample size to produce the column means, and then the column means must be swept from the design matrix on disk. Observations with a missing value on any of the predictors or the outcome do not contribute to the column means, which must be passed as the xbar argument. If the outcome is also expressed as the deviation from its sample mean, then the coefficients produced by biglm are the same as if the raw data were used and an intercept were included. The sample mean and sample standard deviation of the outcome must also be passed.
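
The two-pass procedure described above can be sketched in base R. This is an illustration under stated assumptions rather than the package's own code: mtcars is split into four chunks as a hypothetical stand-in for data read from disk, and lm is used in place of biglm so that the example has no extra dependencies. With real data, each centered chunk would instead be passed to biglm and then update from the biglm package.

```r
# Hypothetical stand-in for a data set that is read from disk in chunks.
vars <- c("mpg", "wt", "qsec", "am")
chunks <- split(mtcars[, vars], rep(1:4, each = 8))

# Pass 1: accumulate the row count, the column sums, and the outcome's sum
# of squares, using complete rows only.
n <- 0
sums <- setNames(numeric(length(vars)), vars)
sumsq_y <- 0
for (chunk in chunks) {
  chunk <- chunk[complete.cases(chunk), , drop = FALSE]
  n <- n + nrow(chunk)
  sums <- sums + colSums(chunk)
  sumsq_y <- sumsq_y + sum(chunk$mpg^2)
}
means <- sums / n
xbar <- means[c("wt", "qsec", "am")]
ybar <- means[["mpg"]]
# Unbiased sample sd; this shortcut formula can lose precision on huge data.
s_y <- sqrt((sumsq_y - n * ybar^2) / (n - 1))

# Pass 2: sweep the means out of every chunk. The chunks are row-bound here
# only because the toy data fit in memory.
centered <- do.call(rbind, lapply(chunks, function(chunk) {
  chunk <- chunk[complete.cases(chunk), , drop = FALSE]
  sweep(as.matrix(chunk), 2, means)
}))
centered <- as.data.frame(centered)

# Centered, no-intercept OLS versus ordinary OLS on the raw data.
ols_centered <- lm(mpg ~ wt + qsec + am - 1, data = centered)
ols_raw <- lm(mpg ~ wt + qsec + am, data = mtcars)
b <- coef(ols_centered)

slopes_match <- isTRUE(all.equal(b, coef(ols_raw)[-1]))
# The OLS intercept can be recovered from the means and the slopes.
intercept <- ybar - sum(xbar * b)
intercept_match <- isTRUE(all.equal(intercept, coef(ols_raw)[["(Intercept)"]]))
sd_match <- isTRUE(all.equal(s_y, sd(mtcars$mpg)))
```

The three logical flags check the claims in the text: the slopes agree with the raw-data fit, the intercept is recoverable from xbar, ybar, and the slopes, and the streamed standard deviation equals sd of the outcome.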

Examples

library(rstanarm)

# create inputs
ols <- lm(mpg ~ wt + qsec + am - 1, # next line is critical for centering
          data = as.data.frame(scale(mtcars, scale = FALSE)))
b <- coef(ols)
R <- qr.R(ols$qr)
SSR <- crossprod(ols$residuals)[1]
N <- length(ols$fitted.values)
xbar <- colMeans(mtcars[,c("wt", "qsec", "am")])
y <- mtcars$mpg
ybar <- mean(y)
s_y <- sd(y)
post <- stan_biglm.fit(b, R, SSR, N, xbar, ybar, s_y, prior = R2(.75),
                       # the next line is only to make the example go fast
                       chains = 1, iter = 1000, seed = 12345)
cbind(lm = b, stan_lm = rstan::get_posterior_mean(post)[14:16]) # shrunk
