runjags (version 1.2.1-0)

xgrid.run: Remote execution of user-specified R functions on Apple Xgrid distributed computing clusters

Description

Allows arbitrary R code to be executed on Apple Xgrid distributed computing clusters and the results returned to the user's R session. Jobs can be run either synchronously (the process waits for the job to complete before returning the results) or asynchronously (the process terminates on submission of the job and results are retrieved at a later time). Access to an Xgrid cluster with R (along with all packages required by the function) installed is required. Due to the dependence on Xgrid software to perform the underlying submission and retrieval of jobs, these functions can only be used on machines running Mac OS X. The two utility functions xgrid.jobs and xgrid.delete allow the currently running jobs to be examined and deleted from inside R. *Note* Apple has discontinued Xgrid from Mac OS X 10.8 onwards, so future development and testing of these functions will be extremely limited.

Usage

xgrid.run(f=function(iteration){}, niters=1, object.list=list(),
	file.list=character(0), max.threads=100, arguments=as.list(1:niters),
	Rversion="", packages=list(), artfun=function() writeLines("1"),
	email=NA, profiling=TRUE, cpuarch=NA, minosversion=NA,
	queueforserver=FALSE, hostnode=NA, forcehost=FALSE, ramrequired=10,
	jobname=NA, cleanup=TRUE, showprofiles=FALSE, Rpath='/usr/bin/R',
	Rbuild='64', max.filesize="1GB", 
	mgridpath=system.file("xgrid", "mgrid.sh", package="runjags"),
	hostname=Sys.getenv("XGRID_CONTROLLER_HOSTNAME"),
	password=Sys.getenv("XGRID_CONTROLLER_PASSWORD"), tempdir=FALSE,
	keep.files=FALSE, show.output=TRUE, threads=min(niters, max.threads), ...)

xgrid.submit(f=function(iteration){}, niters=1, object.list=list(),
	file.list=character(0), max.threads=100, arguments=as.list(1:niters),
	Rversion="", packages=list(), artfun=function() writeLines("1"),
	email=NA, profiling=TRUE, cpuarch=NA, minosversion=NA,
	queueforserver=FALSE, hostnode=NA, forcehost=FALSE, ramrequired=10,
	jobname=NA, Rpath='/usr/bin/R', Rbuild='64', max.filesize="1GB",
	mgridpath=system.file("xgrid", "mgrid.sh", package="runjags"),
	hostname=Sys.getenv("XGRID_CONTROLLER_HOSTNAME"),
	password=Sys.getenv("XGRID_CONTROLLER_PASSWORD"), show.output=TRUE,
	separate.jobs=FALSE, threads=min(niters, max.threads), ...)

xgrid.results(jobinfo, wait=TRUE, partial.retrieve=!wait, cleanup=!partial.retrieve, show.output=TRUE)

xgrid.jobs(comment=FALSE, user=FALSE, jobs=10,
	mgridpath=system.file("xgrid", "mgrid.sh", package="runjags"),
	hostname=Sys.getenv("XGRID_CONTROLLER_HOSTNAME"),
	password=Sys.getenv("XGRID_CONTROLLER_PASSWORD"))

xgrid.delete(jobinfo, keep.files=FALSE)

xapply(X, FUN, method.options=list(), ...)

Arguments

f
the function to be iterated over on Xgrid. This must take at least 1 argument, the first of which represents the value of the 'arguments' list to be passed to the function for that iteration, which is the iteration number unless 'arguments' (or 'X' for x
niters
the total number of iterations over which to evaluate the function f. This can be greater than the number of threads, in which case multiple iterations are evaluated serially as part of the same task. No default.
object.list
a named list of objects that will be copied to the global environment on Xgrid and so will be visible inside the function. Alternatively, this can be a character vector of objects, that will be looked for in the global environment, rather than a named lis
file.list
a vector of filenames representing files in the current working directory that will be copied to the working directory of the executed function. This allows R code to be source()d, datasets to be loaded, and compiled code to be dynamically linked within
max.threads
the maximum number of tasks (or jobs) to split into.
arguments
a list of values to be passed as the first argument to the function, with each element of the list specifying the value at that iteration. Default is as.list(1:niters) which passes only the iteration number to the function.
Rversion
the required R version for worker nodes to be given tasks - may include '=' or '>=' to signify exact or minimum version requirements.
packages
a list of R packages that must be installed on host nodes for them to be used.
artfun
an optional user-specified R function to determine the suitability of nodes in an ART script - must either cat() 1 (indicating suitable) or 0 (indicating unsuitable) to stdout.
email
an email address to be used to notify of job status.
profiling
option to use ART ranking to select the most suitable host nodes preferentially.
cpuarch
option to restrict the job to 'ppc' or 'intel' nodes.
minosversion
option to restrict the job to nodes running a minimum Mac OS version.
queueforserver
option to restrict the job to nodes considered to be Server machines.
hostnode
option to prefer (or restrict to if forcehost==TRUE) running the job on the specified nodes - must be provided as a single character string with the colon character (:) separating node names.
forcehost
option to restrict the job to only nodes specified by 'hostnode'.
ramrequired
the minimum amount of free RAM (obtained using an approximation) for each node to be assigned a task.
jobname
the name to give the job on Xgrid (optional).
cleanup
option to remove the job from Xgrid after completion.
showprofiles
option to show the node scores based on the ART ranking used.
Rpath
the path to the R executable on the xgrid machines. If not all machines on the xgrid cluster have R (or a required package) installed then it is possible to use an ART script to ensure the job is sent to only machines that do - see the examples section fo
Rbuild
the preferred binary of R to invoke. '64' results in '{Rpath}64' (if it exists), '32' in '{Rpath}32' (if it exists) and '' (or either of '32' or '64' if they are not found) results in {Rpath}. Notice that this indicates a preference, not a certainty - if
max.filesize
the maximum total size of the objects produced by the function for each thread if xgrid.method=separatejobs, or for the entire job if xgrid.method=separatetasks. This is a failsafe designed to prevent attempted transfer of huge files bringing the xgrid c
mgridpath
the path to the local mgrid script - default uses the version installed with the runjags package.
hostname
the hostname of the Xgrid server to connect to.
password
the password for the Xgrid server given by hostname.
tempdir
for xgrid.run, option to use the temporary directory as specified by the system rather than creating files in the working directory. Any files created in the temporary directory are removed when the function exits. A temporary directory cannot be used for
keep.files
option to keep the folder with files needed to run the job rather than deleting it when the job is deleted from Xgrid. This may be useful for attempting to bug fix failing jobs. Default FALSE.
show.output
option to print the output of the function (obtained using cat, writeLines or print for example) at each iteration after retrieving the job(s) from xgrid. If FALSE, the output is suppressed. Default TRUE.
separate.jobs
option to submit multiple jobs to Xgrid, to help with file size constraints (see the entry for 'threads' below).
threads
the number of threads (either jobs if separate.jobs==TRUE or tasks otherwise) to generate for the job. Each thread is sent to a separate node for execution, so the more threads there are the faster the job will finish (unless the number of threads exceeds
...
additional arguments to be passed to the function provided by f.
jobinfo
the output of a call to xgrid.submit.
wait
option to wait for the Xgrid job to complete if it has not done so already.
partial.retrieve
for xgrid.results, option to retrieve results of partially completed jobs. Default is !wait (so FALSE when wait=TRUE). When TRUE, cleanup defaults to FALSE.
comment
option to display any comments relevant to the Xgrid jobs running.
user
option to display information on the user that submitted each Xgrid job.
jobs
the number of (most recent) jobs to display information for.
X
for xapply, a vector (atomic or list) over which to apply the function provided. Equivalent to 'arguments' for xgrid.run, with niters = length(X).
FUN
for xapply, the function to be passed to xgrid.run as 'f'.
method.options
for xapply, any arguments (with the exception of 'f', 'niters' and 'arguments' which are ignored) to be passed to xgrid.run.
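
The way 'niters', 'arguments', 'threads' and '...' fit together can be sketched locally without any Xgrid calls. The following is only an illustration under stated assumptions: the contiguous chunking rule and the 'scale' argument are hypothetical and are not taken from the runjags source. The last line shows one way to build the colon-separated string that 'hostnode' expects.

```r
# Local sketch only: nothing here contacts an Xgrid controller.

# The function to iterate over; 'scale' stands in for an extra argument
# that would be supplied through '...'
f <- function(iteration, scale = 1) iteration * scale

niters <- 7
threads <- 3
arguments <- as.list(1:niters)   # the documented default for 'arguments'

# ASSUMPTION: contiguous, near-equal chunks; the real allocation of
# iterations to threads is not described in this help page
thread_id <- sort(rep(seq_len(threads), length.out = niters))
chunks <- split(seq_along(arguments), thread_id)

# Each thread evaluates its own iterations serially
results_by_thread <- lapply(chunks, function(idx) {
	lapply(arguments[idx], f, scale = 2)
})

# Recombine into one list with one element per iteration
results <- unlist(results_by_thread, recursive = FALSE, use.names = FALSE)

# 'hostnode' must be a single string with ':' between node names
hostnode <- paste(c("newnode1", "newnode2"), collapse = ":")
```

With more iterations than threads, each thread simply works through several elements of 'arguments' in turn, which is the serial evaluation described under 'niters'.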

Value

  • For xgrid.submit, a list containing the jobname (which will be required by xgrid.results to retrieve the job) and the job ID(s) for use with the xgrid command line facilities.
  • For xgrid.run and xgrid.results, the output of the function over all iterations, returned as a list with one element per iteration. If the function returned an error, then the error is held in the list as the return value at the iteration that produced it. If the function returns an object that exceeds 'max.filesize' when combined with the results for other iterations in that job (or is greater than max.filesize/threads for multi-task jobs), the results for that thread are replaced with an error message (this prevents the xgrid controller crashing due to transferring large files).
  • For xapply, the same value as xgrid.run (or as xgrid.submit if xgrid.options=list(submitandstop=TRUE), in which case the results can be retrieved using xgrid.results).
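
The per-iteration error handling described above can be imitated locally to see what a mixed list of results and errors looks like. This is a sketch that assumes errors are stored as condition objects via tryCatch; the help page does not say in what form xgrid.run actually stores them, so the checking code may need adapting.

```r
# Local sketch only: mimics a result list in which one iteration failed
f <- function(iteration) {
	if (iteration == 2) stop("iteration 2 failed")
	sqrt(iteration)
}

# ASSUMPTION: an error is kept as a condition object in place of the result
results <- lapply(1:3, function(i) tryCatch(f(i), error = function(e) e))

# Identify the failed iterations and keep only the successful results
failed <- vapply(results, inherits, logical(1), what = "error")
good_results <- results[!failed]
error_messages <- vapply(results[failed], conditionMessage, character(1))
```

Checking the returned list in this way before further processing avoids an error from one iteration silently propagating into later calculations.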

Details

These functions allow R functions to be run on Xgrid distributed computing clusters from within R. All the functionality could be replicated by saving all necessary objects to files and using the Xgrid command line utility to submit and retrieve the job manually; these functions merely provide the convenience of not having to do so. Xgrid support is only available on machines running Mac OS X 10.5-10.7 (Xgrid support was discontinued in Mac OS X 10.8).

The xgrid controller hostname and password can also be set as environment variables. The command line version of R knows about environment variables set in the .profile file, but unfortunately the GUI version does not and requires them to be set from within R using:

Sys.setenv(XGRID_CONTROLLER_HOSTNAME="")

Sys.setenv(XGRID_CONTROLLER_PASSWORD="")

(These lines could be copied into your .Rprofile file for a 'set and forget' solution.)
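
As a quick check that the variables are visible to the current session, they can be read back with Sys.getenv, which is also how the default 'hostname' and 'password' arguments are filled in according to the Usage section. The values below are placeholders invented for illustration, not real controller details.

```r
# Placeholder values: substitute the details of your own controller
Sys.setenv(XGRID_CONTROLLER_HOSTNAME = "xgrid.example.com",
	XGRID_CONTROLLER_PASSWORD = "not-a-real-password")

# Read the values back, as the default arguments do
hostname <- Sys.getenv("XGRID_CONTROLLER_HOSTNAME")
password <- Sys.getenv("XGRID_CONTROLLER_PASSWORD")

# Sys.getenv returns "" for unset variables, so test for non-empty strings
credentials_set <- nzchar(hostname) && nzchar(password)
```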

Note that the runjags package also contains a utility shell script called 'mgrid' that enhances the capabilities of Xgrid substantially. To install it, navigate in the terminal to the folder given by system.file("xgrid", package="runjags") and type 'sudo cp mgrid.sh /usr/local/bin/mgrid' (or similar) to make the script visible in your search path. Help on the mgrid script can then be obtained by typing 'mgrid' (with no arguments) at the command line.

See Also

xgrid.run.jags for functions to run JAGS models on Xgrid, or run.jags to do so locally.

mclapply and parLapply in the parallel package for parallel execution of code over multiple local (or remote) cores.
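
For machines without access to Xgrid, a local alternative using the parallel package might look like the sketch below. It uses a two-worker socket cluster (an arbitrary choice for illustration) and the same datasets as the xapply example; it does not involve any runjags function.

```r
library(parallel)

# The same list of datasets used in the xapply example
datasets <- list(c(1, 5, 6, NA), c(9, 2, NA, 0), c(-1, 4, 10, 20))

# Start two local worker processes (the number is an arbitrary choice)
cl <- makeCluster(2)

# Apply mean() to each dataset on the workers, passing na.rm through '...';
# the cluster is stopped even if the computation fails
results <- tryCatch(
	parLapply(cl, datasets, mean, na.rm = TRUE),
	finally = stopCluster(cl)
)
```

parLapply is used rather than mclapply because socket clusters also work on platforms where forking is unavailable.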

Examples

# A basic example of synchronous running of code over 100 iterations, 
# split up between 10 tasks:

# The function to evaluate:
f <- function(iteration){
	# All objects supplied to object.list will be visible here, but
	# remember to call all necessary libraries within the function
	
	cat("Running iteration", iteration, "\n")
	# Some lengthy code evaluation....
	
	output <- rpois(10, iteration)
	return(output)
}

# Run the function on xgrid for 100 iterations split between 10 machines:
results <- xgrid.run(f, niters=100, threads=10)



# A basic example of xapply to calculate the mean of a list of numbers:

# A list of 3 datasets from which to calculate the mean:
datasets <- list(c(1,5,6,NA), c(9,2,NA,0), c(-1,4,10,20))

# Standard lapply syntax:
results1 <- lapply(datasets, mean, na.rm=TRUE)

# Equivalent xapply syntax:
results2 <- xapply(datasets, mean,
	xgrid.options=list(wait.interval='15s'), na.rm=TRUE)

# Or submit the job:
id <- xapply(datasets, mean, xgrid.options=list(submitandstop=TRUE),
	na.rm=TRUE)
# And retrieve the results:
results3 <- xgrid.results(id)




# Submit an xgrid job just to see which packages are installed
# on a particular machine.


# A function to harvest details of R version and installed packages:
f <- function(i){

	# The 'Archs' column is not available in older versions of R:
	archavail <- any(dimnames(installed.packages())[[2]]=='Archs')

	if(archavail){
		packagesinst <- installed.packages()[,c('Version', 'Archs', 'Built')]
	}else{
		packagesinst <- installed.packages()[,c('Version', 'OS_type', 'Built')]
	}

	# Add a row describing the R installation itself:
	Rinst <- unlist(R.version[c('version.string', 'arch', 'platform')])
	names(Rinst) <- c('Version', 'Archs', 'Built')
	return(rbind(R=Rinst, packagesinst))

}

# Or to get more details about a particular package:
g <- function(i){
	p <- library(help='bayescount')
	return(p$info)
}

# Get the information back from 2 specific machines called 'newnode1' 
# and 'newnode2':
results <- xgrid.run(f, niters=2, threads=2,
	hostnode='newnode1:newnode2')

# See the installed packages on the two nodes:
results