assign.N.sample: Obtain a Set of Random Samples for X.spmd

Description

This utility function samples data randomly from X.spmd to form a relatively small subset of the original data. Running the EM algorithm on the smaller subset is typically fast and captures the rough structure of the entire dataset.

Usage

assign.N.sample(total.sample = 5000, N.org.spmd)

Arguments

total.sample

the total number of samples to select from the original data X.spmd.

N.org.spmd

the original data size, i.e. nrow(X.spmd).

Value

A list is returned containing:

`N`	total sample size across all \(S\) processors
`N.spmd`	sample size of given processor
`N.allspmds`	a collection of sample sizes for all \(S\) processors
`ID.spmd`	indices of the samples selected on the given processor

Note that N and N.allspmds are the same across all \(S\) processors, but N.spmd and ID.spmd are most likely distinct on each processor. The lengths of these elements are \(1\) for N and N.spmd, \(S\) for N.allspmds, and N.spmd for ID.spmd.
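The shapes of these elements can be illustrated serially in base R, without MPI. The sketch below mimics \(S = 4\) processors in a single R session. The helper name mock.assign.N.sample and its proportional allocation rule are assumptions made for illustration only; they are not the pmclust implementation.

```r
# Hypothetical serial mock-up: one list element per "processor".
# The proportional allocation below is an assumption, not pmclust's actual rule.
mock.assign.N.sample <- function(total.sample, N.org.allspmds) {
  S <- length(N.org.allspmds)
  # Split total.sample across processors in proportion to local data size.
  N.allspmds <- floor(total.sample * N.org.allspmds / sum(N.org.allspmds))
  N <- sum(N.allspmds)
  lapply(seq_len(S), function(i) {
    list(
      N = N,                       # length 1, same on every processor
      N.spmd = N.allspmds[i],      # length 1, differs by processor
      N.allspmds = N.allspmds,     # length S, same on every processor
      # Local row indices drawn without replacement; length N.spmd.
      ID.spmd = sort(sample.int(N.org.allspmds[i], N.allspmds[i]))
    )
  })
}

set.seed(123)
ret <- mock.assign.N.sample(5000, c(5200, 5400, 5600, 5800))

# Check the stated lengths on every mock processor.
sapply(ret, function(r) length(r$ID.spmd) == r$N.spmd)
sapply(ret, function(r) length(r$N.allspmds) == length(ret))
```

Because of rounding in this mock allocation, N may differ slightly from total.sample; whether the real function behaves the same way is not stated here.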

Details

This utility function performs simple random sampling without replacement from the original dataset X.spmd. A different random seed should be set on each processor before calling this function.
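Why the seeds matter can be seen in a small base R sketch (no MPI involved). The helper draw.ids and the "base seed plus rank" scheme are hypothetical and only stand in for whatever seeding mechanism is used in practice.

```r
# Draw n of n.org local row indices without replacement (base R, no MPI).
draw.ids <- function(seed, n.org = 20, n = 5) {
  set.seed(seed)
  sort(sample.int(n.org, n))  # replace = FALSE is the default
}

# Same seed on two "processors": exactly the same indices are selected.
same <- identical(draw.ids(123), draw.ids(123))
print(same)  # TRUE

# Rank-dependent seeds (hypothetical scheme: base seed + rank).
ids.by.rank <- lapply(0:3, function(rank) draw.ids(123 + rank))
print(ids.by.rank)
```

With identical seeds, processors holding data of the same size would pick the same row positions, which is why distinct seeds are recommended.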

References

Programming with Big Data in R Website: http://r-pbd.org/

Examples

# NOT RUN {
# Save this code in a file "demo.r" and run it on 4 processors with
# > mpiexec -np 4 Rscript demo.r

### Setup environment.
library(pmclust, quietly = TRUE)
comm.set.seed(123, diff = TRUE)   # a different random stream on each rank

### Generate an example local data size and assign the sample sizes.
N.org.spmd <- 5000 + sample(1:1000, 1)
ret.spmd <- assign.N.sample(total.sample = 5000, N.org.spmd)
cat("Rank:", comm.rank(), " Size:", ret.spmd$N.spmd,
    "\n", sep = "")

### Quit.
finalize()
# }
