assign.N.sample: Obtain a set of random samples for X.worker

Description

This utility function randomly samples data from X.worker to form a relatively small subset of the original data. Running the EM algorithm on the smaller subset is typically fast and captures the rough structure of the entire dataset.

Usage

assign.N.sample(total.sample = 5000, N.org.worker)

Arguments

total.sample

the total number of samples to select from the original data X.worker.

N.org.worker

the original data size, i.e. nrow(X.worker).

Value

A list is returned containing:

N

total sample size across all $S$ processors.

N.worker

sample size of the given processor.

N.allworkers

a collection of sample sizes for all $S$ processors.

ID.worker

indices of the selected samples, ranging from 1 to N.org.worker.

Note that N and N.allworkers are the same across all $S$ processors, but N.worker and ID.worker are most likely all distinct. The lengths of these elements are $1$ for N and N.worker, $S$ for N.allworkers, and N.worker for ID.worker.
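As a minimal sketch of how the returned indices might be used, the snippet below subsets a toy local matrix by ID.worker. The list ret.worker here is built by hand to mimic the documented structure, and its values are made up for illustration; it is not produced by a real call to assign.N.sample.

```r
# Toy local data standing in for X.worker (20 rows, 3 columns).
set.seed(1)
X.worker <- matrix(rnorm(20 * 3), nrow = 20, ncol = 3)

# Hand-built stand-in for the returned list (hypothetical values, S = 2).
ret.worker <- list(
  N            = 10,               # total sample size across all processors
  N.worker     = 4,                # sample size on this processor
  N.allworkers = c(4, 6),          # sample sizes for all processors
  ID.worker    = c(2, 7, 11, 19)   # selected row indices in 1..nrow(X.worker)
)

# Keep only the sampled rows; drop = FALSE preserves the matrix shape.
X.sub <- X.worker[ret.worker$ID.worker, , drop = FALSE]
```

The subset X.sub has N.worker rows and could then be passed to the EM step in place of the full local data.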

Details

This utility function performs simple random sampling without replacement from the original dataset X.worker. A different random seed should be set on each processor before calling this function. An easy way is to call, for example, set.seed(123 + mpi.comm.rank()).
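The idea can be illustrated serially, without MPI. The sketch below simulates four ranks in a loop, seeds each one differently, and draws indices without replacement. The proportional split of total.sample across ranks is an assumption made for illustration only; the page does not state how the function actually allocates sample sizes.

```r
# Serial illustration only; not the package's implementation.
S <- 4                  # assumed number of processors
total.sample <- 5000

# Each simulated rank gets its own seed and its own local data size.
N.org <- integer(S)
for (rank in 0:(S - 1)) {
  set.seed(123 + rank)
  N.org[rank + 1] <- 5000 + sample(1:1000, 1)
}

# Assumed allocation: proportional to local size, remainder to the last rank.
N.allworkers <- floor(total.sample * N.org / sum(N.org))
N.allworkers[S] <- total.sample - sum(N.allworkers[-S])

# Simple random sampling without replacement on each simulated rank.
ID <- vector("list", S)
for (rank in 0:(S - 1)) {
  set.seed(123 + rank)
  ID[[rank + 1]] <- sort(sample.int(N.org[rank + 1], N.allworkers[rank + 1]))
}
```

Because sample.int draws without replacement by default, no index appears twice within a rank, and the distinct seeds keep the ranks from drawing identical index sets.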

References

High Performance Statistical Computing Website: http://thirteen-01.stat.iastate.edu/snoweye/hpsc/

Examples

# Save code in a file "demo.r" and run on 4 processors with
# > mpirun -np 4 Rscript demo.r

### Set up the MPI environment.
library(Rmpi)
library(pmclust)  # assumed source of assign.N.sample(); must be installed
invisible(mpi.comm.dup(0, 1))

### Generate example data.
set.seed(123 + mpi.comm.rank())
N.org.worker <- 5000 + sample(1:1000, 1)
ret.worker <- assign.N.sample(total.sample = 5000, N.org.worker)
cat("Rank:", mpi.comm.rank(), " Size:", ret.worker$N.worker,
    "\n", sep = "")

### Quit Rmpi.
mpi.quit()