pbdMPI (version 0.3-9)

global reading: Global Reading Functions

Description

These functions perform global reads from a specified file.

Usage

comm.read.table(file, header = FALSE, sep = "", quote = "\"'",
                dec = ".",
                na.strings = "NA", colClasses = NA, nrows = -1, skip = 0,
                check.names = TRUE, fill = !blank.lines.skip,
                strip.white = FALSE,
                blank.lines.skip = TRUE, comment.char = "#",
                allowEscapes = FALSE,
                flush = FALSE,
                fileEncoding = "", encoding = "unknown",
                read.method = .pbd_env$SPMD.IO$read.method[1],
                balance.method = .pbd_env$SPMD.IO$balance.method[1],
                comm = .pbd_env$SPMD.CT$comm)

comm.read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".",
              fill = TRUE, comment.char = "", ...,
              read.method = .pbd_env$SPMD.IO$read.method[1],
              balance.method = .pbd_env$SPMD.IO$balance.method[1],
              comm = .pbd_env$SPMD.CT$comm)

comm.read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",",
               fill = TRUE, comment.char = "", ...,
               read.method = .pbd_env$SPMD.IO$read.method[1],
               balance.method = .pbd_env$SPMD.IO$balance.method[1],
               comm = .pbd_env$SPMD.CT$comm)

Arguments

file

as in read.table().

header

as in read.table().

sep

as in read.table().

quote

as in read.table().

dec

as in read.table().

na.strings

as in read.table().

colClasses

as in read.table().

nrows

as in read.table().

skip

as in read.table().

check.names

as in read.table().

fill

as in read.table().

strip.white

as in read.table().

blank.lines.skip

as in read.table().

comment.char

as in read.table().

allowEscapes

as in read.table().

flush

as in read.table().

fileEncoding

as in read.table().

encoding

as in read.table().

...

as in read.csv*().

read.method

either "gbd" or "common".

balance.method

the balance method used for read.method = "gbd" when nrows = -1 and skip = 0.

comm

a communicator number.

Value

A distributed data.frame is returned.

All factors are disabled; such columns are read as characters or as whatever type the data should be.

Details

These functions will apply read.table() locally and sequentially from rank 0, 1, 2, ...

By default, for small datasets (see .pbd_env$SPMD.IO$max.read.size), only rank 0 reads the file. It then scatters the rows to the other ranks when read.method = "gbd", or broadcasts them to the other ranks when read.method = "common".

As the dataset size increases, every rank performs the reading, and each rank reads its own portion of the rows in "gbd" format, as described in the pbdDEMO vignettes and used in pmclust.
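The idea of each rank reading only its own portion of the rows can be sketched in base R with the skip and nrows arguments of read.table(). This is only an illustration under stated assumptions, not the pbdMPI implementation: the ranks are simulated by a loop in a single process, the total row count is assumed to be known in advance, and the names path, n_ranks, counts, starts, and chunks are hypothetical.

```r
# Write a small file to stand in for the shared input file.
path <- tempfile(fileext = ".txt")
write.table(iris, path, sep = "\t", quote = FALSE, row.names = FALSE)

n_rows  <- nrow(iris)  # assumed known in advance for this sketch
n_ranks <- 4L          # simulated number of ranks

# Contiguous split of the rows; where the leftover rows go is arbitrary here.
counts <- rep(n_rows %/% n_ranks, n_ranks)
extra  <- seq_len(n_rows %% n_ranks)
counts[extra] <- counts[extra] + 1

# Number of data rows that precede each simulated rank's chunk.
starts <- cumsum(c(0, counts[-n_ranks]))

# Read the column names once from the header line.
header <- names(read.table(path, header = TRUE, sep = "\t", nrows = 1))

# Each simulated rank reads only its own rows.
chunks <- lapply(seq_len(n_ranks), function(i) {
  read.table(path, header = FALSE, sep = "\t", quote = "",
             skip = starts[i] + 1,  # + 1 skips the header line
             nrows = counts[i], col.names = header,
             stringsAsFactors = FALSE)
})

sapply(chunks, nrow)
```

Stacking the chunks with do.call(rbind, chunks) recovers all rows of the original table, which is the property a row-distributed read is meant to preserve.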

comm.load.balance() is called for the "gbd" method when nrows = -1 and skip = 0. The default method, "block", generally performs better: it distributes rows equally and spreads any residual rows evenly over the higher ranks. "block0" is the other way around, leaving the residual rows on the lower ranks. "block.cyclic" is only useful for converting to a ddmatrix, as in pbdDMAT.
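The difference between "block" and "block0" can be illustrated with a small base R sketch that computes how many rows each rank would hold. The helper block_counts() is hypothetical and is not part of pbdMPI; it only mirrors the description above (residual rows on the higher ranks for "block", on the lower ranks for "block0") and may differ from the package's actual code.

```r
# Hypothetical helper, not part of pbdMPI: rows per rank for n rows on p ranks.
block_counts <- function(n, p, method = c("block", "block0")) {
  method <- match.arg(method)
  counts <- rep(n %/% p, p)  # equal share for every rank
  extra  <- n %% p           # residual rows left to hand out
  if (extra > 0) {
    # "block": residuals go to the highest ranks; "block0": to the lowest.
    idx <- if (method == "block") seq(p - extra + 1, p) else seq_len(extra)
    counts[idx] <- counts[idx] + 1
  }
  counts
}

block_counts(150, 4, "block")   # 37 37 38 38
block_counts(150, 4, "block0")  # 38 38 37 37
```

In both cases the counts differ by at most one row between ranks, so the choice mainly affects which ranks hold the extra rows.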

References

Programming with Big Data in R Website: http://r-pbd.org/

See Also

comm.load.balance() and comm.write.table()

Examples

# NOT RUN {
### Save code in a file "demo.r" and run with 4 processors by
### SHELL> mpiexec -np 4 Rscript demo.r

spmd.code <- "
### Initial.
suppressMessages(library(pbdMPI, quietly = TRUE))

### Check.
if(comm.size() != 4){
  comm.stop(\"4 processors are required.\")
}

### Manually distributed iris.
da <- iris[get.jid(nrow(iris)),]

### Dump data.
comm.write.table(da, file = \"iris.txt\", quote = FALSE, sep = \"\\t\",
                 row.names = FALSE)

### Read back in.
da.gbd <- comm.read.table(\"iris.txt\", header = TRUE, sep = \"\\t\",
                          quote = \"\")
comm.print(c(nrow(da), nrow(da.gbd)), all.rank = TRUE)

### Read in common.
da.common <- comm.read.table(\"iris.txt\", header = TRUE, sep = \"\\t\",
                             quote = \"\", read.method = \"common\")
comm.print(c(nrow(da.common), sum(da.common != iris)))

### Finish.
finalize()
"
# execmpi(spmd.code, nranks = 4L)
# }
