ca,cabase,calm,caglm,caprcomp,cakm,cameans,caquantile,caagg: Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones

Description

Easy parallelization of most statistical computations.

Usage

ca(cls,z,ovf,estf,estcovf=NULL,conv2mat=TRUE,findmean=TRUE)
cabase(cls,ovf,estf,estcovf=NULL,findmean=TRUE,cacall=FALSE)
calm(cls,lmargs) 
caglm(cls,glmargs) 
caprcomp(cls,prcompargs, p)
cakm(cls,mtdf,ncenters,p)
cameans(cls,cols,na.rm=FALSE) 
caquantile(cls,vec, probs = c(0.25, 0.5, 0.75),na.rm=FALSE) 
caagg(cls,ynames,xnames,dataname,FUN)

Arguments

cls

A cluster run under the parallel package.

z

A data frame, matrix or vector, one observation per row/element.

ovf

Overall statistical function, say glm.

estf

Function to extract the point estimate (typically vector-valued) from the output of ovf.

estcovf

If provided, function to extract the estimated covariance matrix of the output of estf.

conv2mat

If TRUE, convert data frame input to a matrix (needed for some cases of 'ovf').

findmean

If TRUE, output the average of the estimates from the chunks; otherwise, output only the estimates themselves.

lmargs

Quoted string representing arguments to lm, e.g. R formula, data specification.

glmargs

Quoted string representing arguments to glm, e.g. R formula, data specification, and family argument.

prcompargs

Quoted string representing arguments to prcomp.

p

Number of columns in data.

na.rm

If TRUE, remove NA values from the analysis.

mtdf

Quoted name of a distributed matrix or data frame.

ncenters

Number of clusters to find.

cacall

If TRUE, indicates that cabase has been called by ca.

cols

A quoted string that evaluates to a data frame or matrix.

vec

A quoted string that evaluates to a vector.

ynames

A vector of quoted variable names.

xnames

A vector of quoted variable names.

dataname

Quoted name of a data frame or matrix.

probs

As in the argument with the same name in quantile. Should not be 0.00 or 1.00, as asymptotic normality doesn't hold.

FUN

Quoted name of a function.

Value

R list with these components:
- thts, the results of applying the requested estimator to the chunks; the estimator from chunk i is in row i
- tht, the chunk-averaged overall estimator, if requested
- thtcov, the estimated covariance matrix of tht, if available
The wrapper functions return the following list elements:
- calm,caglm: estimated regression coefficients and their estimated covariance matrix
- caprcomp: sdev (square roots of the eigenvalues) and rotation, as with prcomp; thts is returned as well.
- cakm: centers and size, as with kmeans; thts is returned as well.
The wrappers that return thts are useful for algorithms that may exhibit some instability. For prcomp, for instance, the eigenvectors corresponding to the smaller eigenvalues may have high variances in the nonparallel version, which will be reflected in large differences from chunk to chunk; thus caprcomp returns the thts element from the output of cabase. Note that such instability reflects a fundamental problem with the algorithm on these variables, not one introduced by Software Alchemy; on the contrary, the ability to reveal it is an important advantage of the Software Alchemy approach.
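As a rough illustration of how thts can be inspected for instability, the sketch below computes the chunk-to-chunk standard deviation of each estimated quantity. The matrix here is made up purely for illustration (it is not output from caprcomp or cakm), and the column names are hypothetical.

```r
# Hypothetical thts matrix: 3 chunks (rows), 2 estimated quantities (columns).
# The numbers are invented for illustration only.
thts <- rbind(c(1.02,  0.10),
              c(0.98, -0.45),
              c(1.01,  0.62))
colnames(thts) <- c("stable", "unstable")

# chunk-to-chunk standard deviation of each column
spread <- apply(thts, 2, sd)
spread
```

A column whose spread is large relative to the others is a candidate for closer inspection.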

Details

Implements the "Software Alchemy" method for parallelizing statistical computations (N. Matloff, Parallel Computation for Data Science, Chapman and Hall, 2015; research article to appear in the Journal of Statistical Software). This can result in substantial speedups in computation.

The data are broken into chunks, and the given estimator is applied to each chunk. The results are averaged and, optionally, an estimated covariance matrix is computed.
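The chunk-and-average idea can be sketched in base R without a cluster. This is only an illustration under stated assumptions, not the package's internal code: the helper name chunk_avg_lm, the chunk count r, the round-robin assignment of rows to chunks, and the covariance formula (sum of the per-chunk covariance matrices divided by r^2, as for an average of independent estimates) are all choices made for this sketch.

```r
# Illustrative sketch only; not the partools implementation.
# chunk_avg_lm is a hypothetical helper name.
chunk_avg_lm <- function(d, formula, r = 4) {
  # assign rows to r chunks of nearly equal size (round-robin)
  idx <- split(seq_len(nrow(d)), rep(seq_len(r), length.out = nrow(d)))
  # fit the model separately on each chunk
  fits <- lapply(idx, function(i) lm(formula, data = d[i, , drop = FALSE]))
  thts <- do.call(rbind, lapply(fits, coef))   # row i = estimate from chunk i
  tht <- colMeans(thts)                        # chunk-averaged estimate
  # covariance of an average of r independent estimates (assumption)
  thtcov <- Reduce(`+`, lapply(fits, vcov)) / r^2
  list(thts = thts, tht = tht, thtcov = thtcov)
}

# simulated data, generated in random order
set.seed(1)
n <- 20000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

out <- chunk_avg_lm(d, y ~ x1 + x2)
full <- coef(lm(y ~ x1 + x2, data = d))
max(abs(out$tht - full))   # should be small for large n
```

In this sketch the chunks are processed sequentially with lapply; the speedup comes only when each chunk is handled by a separate cluster node.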

In cabase, the data object is assumed to be a distributed data frame or matrix, produced by distribsplit or readnscramble. Note by the way that the data object is not specified explicitly in the argument list; this is done through the function ovf.

The key point is that the resulting estimator is statistically equivalent to the original, nonparallel one, in the sense that they have the same asymptotic statistical accuracy. Since one would use Software Alchemy only with large data sets anyway (otherwise, parallel computation would not be needed for speed), the asymptotic aspect should not be an issue. In other words, one achieves the same statistical accuracy while possibly attaining much faster computation.

Wrapper functions, applying Software Alchemy to the corresponding R function (or function elsewhere in this package):

calm: Wrapper for lm.
caglm: Wrapper for glm.
caprcomp: Wrapper for prcomp.
cakm: Wrapper for kmeans.
cameans: Wrapper for colMeans.
caquantile: Wrapper for quantile.
caagg: Like distribagg, but finds the average value of FUN across the cluster nodes.

A note on NA values: Some R functions such as lm, glm and prcomp have an na.action argument. The default is na.omit, which means that cases with at least one NA value will be discarded. (This is also settable via options().) However, na.omit seems to have no effect in prcomp unless that function's formula option is used. When in doubt, apply the function na.omit directly; e.g. na.omit(d) for a data frame d returns a data frame consisting of only the intact rows of d.
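A minimal base R sketch of applying na.omit directly, using a small made-up data frame (the object names d and d_complete are chosen for this illustration only):

```r
# small made-up data frame with missing values
d <- data.frame(x = c(1, NA, 3, 4), y = c(10, 20, NA, 40))

# keep only the rows that contain no NA
d_complete <- na.omit(d)
nrow(d_complete)
```

One possible workflow, offered as a suggestion rather than a requirement, is to clean the data this way before distributing it to the cluster nodes.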

The method assumes that the base estimator is asymptotically normal, and assumes i.i.d. data. If your data set is stored in some sorted order, it must be randomized first, say by using the scramble option in distribsplit or by calling readnscramble, depending on whether your data is already in memory or still in a file.
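The effect of randomizing row order can be sketched in base R as follows. This only illustrates the idea of scrambling and is not the code used by distribsplit or readnscramble; the data frame and object names are made up for the example.

```r
set.seed(1)
# made-up data stored in sorted order of x
d <- data.frame(id = 1:10, x = sort(rnorm(10)))

# randomly permute the rows so that contiguous chunks are no longer sorted
shuffled <- d[sample(nrow(d)), ]
head(shuffled)
```

Without this step, splitting sorted data into contiguous chunks would give each node a non-representative slice, violating the i.i.d. assumption.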

Examples

# load the package, then set up 'parallel' cluster
library(partools)
cls <- makeCluster(2)
setclsinfo(cls)

# generate simulated test data, as distributed data frame
n <- 25000
u <- matrix(nrow=n,ncol=4)
u[,1:3] <- rnorm(3*n)
u[,4] <- u[,1] + 2*u[,2] + u[,3]
distribsplit(cls,"u")
# apply the function
calm(cls,"u[,4] ~ u[,1]+u[,2]")$tht
# check; results should be approximately the same
lm(u[,4] ~ u[,1]+u[,2])

# Census data on programmers and engineers; include a quadratic term for
# age, due to nonmonotone relation to income
data(prgeng) 
distribsplit(cls,"prgeng") 
caout <- calm(cls,"wageinc ~ age+I(age^2)+sex+wkswrkd,data=prgeng")
caout$tht
# compare to nonparallel
lm(wageinc ~ age+I(age^2)+sex+wkswrkd,data=prgeng)
# get standard errors of the beta-hats
sqrt(diag(caout$thtcov))

# find mean age for all combinations of the cit and sex variables
caagg(cls,"age",c("cit","sex"),"prgeng","mean") 
# compare to nonparallel
aggregate(age ~ cit+sex,data=prgeng,mean)  

data(newadult) 
distribsplit(cls,"newadult") 
caglm(cls,"gt50 ~ ., family = binomial,data=newadult")$tht 

caprcomp(cls,'newadult,scale=TRUE',5)$sdev
prcomp(newadult,scale=TRUE)$sdev

cameans(cls,"prgeng")
cameans(cls,"prgeng[,c('age','wageinc')]")
caquantile(cls,'prgeng$age')

pe <- prgeng[,c(1,3,8)] 
distribsplit(cls,"pe") 
z1 <- cakm(cls,'pe',3,3); z1$size; z1$centers 
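# compare the per-chunk results to check whether the algorithm is unstable
z1$thts  # looks unstable

# shut down the cluster when finished
stopCluster(cls)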
