runSimulation: Run a Monte Carlo simulation given a data.frame of conditions and simulation functions

Description

This function runs a Monte Carlo simulation study given a set of predefined simulation functions, design conditions, and number of replications. Results can be saved as temporary files in case of interruptions and may be restored by re-running runSimulation, provided that the respective temp file can be found in the working directory. runSimulation supports parallel and cluster computing, global and local debugging, error handling (including fail-safe stopping when functions fail too often, even across nodes), and tracking of error and warning messages. For convenience, all functions available in the R workspace are exported across all computational nodes so that they are more easily accessible (however, other R objects are not, and therefore must be passed to the fixed_objects input to become available across nodes).

Usage

runSimulation(design, replications, generate, analyse, summarise, fixed_objects = NULL, packages = NULL, filename = "SimDesign-results", save = FALSE, save_results = FALSE, save_seeds = FALSE, load_seed = NULL, seed = NULL, parallel = FALSE, ncores = parallel::detectCores(), cl = NULL, MPI = FALSE, max_errors = 50, as.factor = TRUE, save_generate_data = FALSE, save_details = list(), edit = "none", verbose = TRUE)
"print"(x, drop.extras = FALSE, drop.design = FALSE, ...)
"head"(x, ...)
"tail"(x, ...)
"summary"(object, ...)

Arguments

design

a data.frame object containing the Monte Carlo simulation conditions to be studied, where each row represents a unique condition

replications

number of replication to perform per condition (i.e., each row in design). Must be greater than 0

generate

user-defined data and parameter generating function. See Generate for details

analyse

user-defined computation function which acts on the data generated from Generate. See Analyse for details

summarise

(optional but recommended) user-defined summary function to be used after all the replications have completed within each design condition. Omitting this function will return a list of matrices (or a single matrix, if only one row in design is supplied) containing only the results returned form Analyse. Ommiting this function is only recommended for didactic purposes because it leaves out a large amount of information and generally is not as flexible internally

fixed_objects

(optional) an object (usually a list) containing additional user-defined objects that should remain fixed across conditions. This is useful when including long fixed vectors/matrices of population parameters, data that should be used across all conditions and replications (e.g., including a fixed design matrix for linear regression), or simply can be used to control constant global elements such as sample size

packages

a character vector of external packages to be used during the simulation (e.g., c('MASS', 'mvtnorm', 'simsem') ). Use this input when parallel = TRUE or MPI = TRUE to use non-standard functions from additional packages, otherwise the functions must be made available by using explicit library or require calls within the provided simulation functions. Alternatively, functions can be called explicitly without attaching the package with :: (e.g., mvtnorm::rmvnorm())

filename

(optional) the name of the .rds file to save the final simulation results to when save = TRUE. When NULL, the final simulation object is not saved to the drive. As well, if the same file name already exists in the working directly at the time of saving then a new file will be generated instead and a warning will be thrown; this helps avoid accidentally overwriting existing files. Default is 'SimDesign-results'

save

logical; save the simulation state and final results to the hard-drive? This is useful for simulations which require an extended amount of time. When TRUE, a temp file will be created in the working directory which allows the simulation state to be saved and recovered (in case of power outages, crashes, etc). To recover you simulation at the last known location simply rerun the same code you used to initially define the simulation and the object will automatically be detected and read-in. Upon completion, and if filename is not NULL, the final results will also be saved to the working directory. Default is FALSE

save_results

logical; save the results returned from Analyse to external .rds files located in the defined save_results_dirname directory/folder? Use this if you would like to keep track of the individual parameters returned from the analyses. Each saved object will contain a list of three elements containing the condition (row from design), results (as a list or matrix), and try-errors. When TRUE, a temp file will be used to track the simulation state (in case of power outages, crashes, etc). When TRUE, temporary files will also be saved to the working directory (in the same was as when save = TRUE to better track the state of the simulation. Default is FALSE

save_seeds

logical; save the .Random.seed states prior to performing each replication into plain text files located in the defined save_seeds_dirname directory/folder? Use this if you would like to keep track of the simulation state within each replication and design condition. Primarily, this is useful for completely replicating any cell in the simulation if need be, especially when tracking down hard-to-find errors and bugs. As well, see the load_seed input to load a given .Random.seed to exactly replicate the generated data and analysis state (handy for debugging). When TRUE, temporary files will also be saved to the working directory (in the same was as when save = TRUE to better track the state of the simulation. Default is FALSE

load_seed

a character object indicating which file to load from when the .Random.seeds have be saved (after a run with save_seeds = TRUE). E.g., load_seed = 'design-row-2/seed-1' will load the first seed in the second row of the design input. Note that it is important NOT to modify the design input object, otherwise the path may not point to the correct saved location. Default is NULL

seed

a vector of integers to be used for reproducibility. The length of the vector must be equal the number of rows in design. This argument calls set.seed or clusterSetRNGStream for each condition, respectively, but will not be run when MPI = TRUE. Default is NULL, indicating that no seed is set for each condition

parallel

logical; use parallel processing from the parallel package over each unique condition?

ncores

number of cores to be used in parallel execution. Default uses all available

cluster object defined by makeCluster used to run code in parallel. If NULL and parallel = TRUE, a local cluster object will be defined which selects the maximum number cores available and will be stop the cluster when the simulation is complete. Note that supplying a cl object will automatically set the parallel argument to TRUE

MPI

logical; use the foreach package in a form usable by MPI to run simulation in parallel on a cluster? Default is FALSE

max_errors

the simulation will terminate when more than this number of constitutive errors are thrown in any given condition. The purpose of this is to indicate that likely something problematic is going wrong in the generate-analyse phases and should be inspected. Default is 50

as.factor

logical; coerce the input design elements into factors when the simulation is complete? If the columns inputs are numeric then these will be treated as ordered. Default is TRUE

save_generate_data

logical; save the data returned from Generate to external .rds files located in the defined save_generate_data_dirname directory/folder? It is generally recommended to leave this argument as FALSE because saving datasets will often consume a large amount of disk space, and by and large saving data is not required or recommended for simulations. A more space-friendly version is available when using the save_seed flag. When TRUE, temporary files will also be saved to the working directory (in the same was as when save = TRUE to better track the state of the simulation. Default is FALSE

save_details

a list pertaining to information about how and where files should be saved when save, save_results, or save_generate_data are triggered.

edit

a string indicating where to initiate a browser() call for editing and debugging. General options are 'none' (default) and 'all', which are used to disable debugging and to debug all the user defined functions, respectively. Specific options include: 'generate' to edit the data simulation function, 'analyse' to edit the computational function, and 'summarise' to edit the aggregation function.

Alternatively, users may place browser calls within the respective functions for debugging at specific lines (note: parallel computation flags will automatically be disabled when a browser() is detected)

verbose

logical; print messages to the R console? Default is TRUE

SimDesign object returned from runSimulation

drop.extras

logical; don't print information about warnings, errors, simulation time, and replications? Default is FALSE

drop.design

logical; don't include information about the (potentially factorized) simulation design? This may be useful if you wish to cbind() the original design data.frame to the simulation results instead of using the auto-factorized version. Default is FALSE

...

additional arguments

object

SimDesign object returned from runSimulation

Value

a data.frame (also of class 'SimDesign') with the original design conditions in the left-most columns, simulation results and ERROR/WARNING's (if applicable) in the middle columns, and additional information (such as REPLICATIONS, SIM_TIME, and SEED) in the right-most columns.

Saving data, results, seeds, and the simulation state

To conserve RAM, temporary objects (such as data generated across conditions and replications) are discarded; however, these can be saved to the hard-disk by passing the appropriate flags. For longer simulations it is recommended to use save = TRUE to temporarily save the simulation state, and to use the save_results flag to write the analysis results the to hard-disc. The generated data can be saved by passing save_generate_data = TRUE, however it is often more memory efficient to use the save_seeds option instead to only save R's .Random.seed state instead (still allowing for complete reproducibility); individual .Random.seed terms may also be read in with the load_seed input to reproduce the exact simulation state at any given replication. Finally, providing a vector of seeds is also possible to ensure that each simulation condition is completely reproducible under the single/multi-core method selected. Finally, when the Monte Carlo simulation is complete it is recommended to write the results to a hard-drive for safe keeping, particularly with the save and filename arguments provided (for reasons that are more obvious in the parallel computation descriptions below). Using the filename argument (along with save = TRUE) supplied is much safer than using something like saveRDS directly because files will never accidentally be overwritten, and instead a new file name will be created when a conflict arises; this type of safety is prevalent in many aspects of the package and helps to avoid many unrecoverable (yet surprisingly common) mistakes.

Resuming temporary results

In the event of a computer crash, power outage, etc, if save = TRUE was used then the original code used to execute runSimulation() need only be re-run to resume the simulation. The saved temp file will be read into the function automatically, and the simulation will continue one the condition where it left off before the simulation state was terminated.

A note on parallel computing

When running simulations in parallel (either with parallel = TRUE or MPI = TRUE) R objects defined in the global environment will generally not be visible across nodes. Hence, you may see errors such as Error: object 'something' not found if you try to use an object that is defined in the workspace but is not passed to runSimulation. To avoid this type or error, simply pass additional objects to the fixed_objects input (usually it's convenient to supply a named list of these objects). Fortunately, however, custom functions defined in the global environment are exported across nodes automatically. This makes it convenient when writing code because custom functions will always be available across nodes if they are visible in the R workspace. As well, note the packages input to declare packages which must be loaded via library() in order to make specific non-standard R functions available across nodes.

Cluster computing

SimDesign code may be released to a computing system which supports parallel cluster computations using the industry standard Message Passing Interface (MPI) form. This simply requires that the computers be setup using the usual MPI requirements (typically, running some flavor of Linux, have password-less open-SSH access, IP addresses have been added to the /etc/hosts file or ~/.ssh/config, etc). More generally though, these resources are widely available through professional organizations dedicated to super-computing. To setup the R code for an MPI cluster one need only add the argument MPI = TRUE, wrap the appropriate MPI directives around runSimulation, and submit the files using the suitable BASH commands to execute the mpirun tool. For example,

The necessary SimDesign files must be uploaded to the dedicated master node so that a BASH call to mpirun can be used to distribute the work across slaves. For instance, if the following BASH command is run on the master node then 16 processes will be summoned (1 master, 15 slaves) across the computers named localhost, slave1, and slave2 in the ssh config file. mpirun -np 16 -H localhost,slave1,slave2 R --slave -f simulation.R

Network computing

If you access have to a set of computers which can be linked via secure-shell (ssh) on the same LAN network then Network computing (a.k.a., a Beowulf cluster) may be a viable and useful option. This approach is similar to MPI computing approach except that it offers more localized control and requires more hands-on administrative access to the master and slave nodes. The setup generally requires that the master node has SimDesign installed and the slave/master nodes have all the required R packages pre-installed (Unix utilities such as dsh are very useful for this purpose). Finally, the master node must have ssh access to the slave nodes, each slave node must have ssh access with the master node, and a cluster object (cl) from the parallel package must be defined on the master node. Setup for network computing is generally more straightforward and controlled than the setup for MPI jobs in that it only requires the specification of a) the respective IP addresses within a defined R script, and b) the user name (if different from the master node's user name. Otherwise, only a) is required). However, on Linux I have found it is also important to include relevant information about the host names and IP addresses in the /etc/hosts file on the master and slave nodes, and to ensure that the selected port (passed to makeCluster) on the master node is not hindered by a firewall. As an example, using the following code the master node (primary) will spawn 7 slaves and 1 master, while a separate computer on the network with the associated IP address will spawn an additional 6 slaves. Information will be collected on the master node, which is also where the files and objects will be saved using the save inputs (if requested).

The object cl is passed to runSimulation on the master node and the computations are distributed across the respective IP addresses. Finally, it's usually good practice to use stopCluster(cl) when all the simulations are said and done to release the communication between the computers, which is what the above code shows. Alternatively, if you have provided suitable names for each respective slave node, as well as the master, then you can define the cl object using these instead (rather than supplying the IP addresses in your R script). This requires that the master node has itself and all the slave nodes defined in the /etc/hosts and ~/.ssh/config files, while the slave nodes require themselves and the master node in the same files (only 2 IP addresses required on each slave). Following this setup, and assuming the user name is the same across all nodes, the cl object could instead be defined with

Or, even more succinctly if all communication elements required are identical to the master node,

Poor man's cluster computing for independent nodes

In the event that you do not have access to a Beowulf-type cluster (described in the section on "Network Computing") but have multiple personal computers then the simulation code can be manually distributed across each independent computer instead. This simply requires passing a smaller value to the replications argument on each computer and later aggregating the results using the aggregate_simulations function. For instance, if you have two computers available on different networks and wanted a total of 500 replications you could pass replications = 300 to one computer and replications = 200 to the other along with a filename argument (or simply saving the final objects as .rds files manually after runSimulation() has finished). This will create two distinct .rds files which can be combined later with the aggregate_simulations function. The benefit of this approach over MPI or setting up a Beowulf cluster is that computers need not be linked on the same network, and, should the need arise, the temporary simulation results can be migrated to another computer in case of a complete hardware failure by moving the saved temp files to another node, modifying the suitable compname input to save_details (or, if the filename and tmpfilename were modified, matching those files accordingly), and resuming the simulation as normal. Note that this is also a useful tactic if the MPI or Network computing options require you to submit smaller jobs due to time and resource constraint-related reasons, where fewer replications/nodes should be requested. After all the jobs are completed and saved to their respective files, aggregate_simulations can then collapse the files as if the simulations were run all at once. Hence, SimDesign makes submitting smaller jobs to super-computing resources considerably less error prone than managing a number of smaller jobs manually .

Details

The strategy for organizing the Monte Carlo simulation work-flow is to

For a skeleton version of the work-flow, which is often useful when initially defining a simulation, see SimFunctions. This function will write template simulation code to one/two files so that modifying the required functions and objects can begin immediately with minimal error. This means that you can focus on your Monte Carlo simulation immediately rather than worrying about the administrative code-work required to organize the simulation work-flow.

Additional information for each condition are also contained in the data.frame object returned by runSimulation: REPLICATIONS to indicate the number of Monte Carlo replications, SIM_TIME to indicate how long (in seconds) it took to complete all the Monte Carlo replications for each respective design condition, SEED if the seed argument was used, columns containing the number of replications which had to be re-run due to errors (where the error messages represent the names of the columns prefixed with a ERROR: string), and columns containing the number of warnings prefixed with a WARNING: string.

Additional examples, presentation files, and tutorials can be found on the package wiki located at https://github.com/philchalmers/SimDesign/wiki.

Examples

Run this code


# skeleton functions to be saved and edited
SimFunctions()

#### Step 1 --- Define your conditions under study and create design data.frame

# (use EXPLICIT names, avoid things like N <- 100. That's fine in functions, not here)
Design <- expand.grid(sample_size = c(30, 60, 90, 120),
                      group_size_ratio = c(1, 4, 8),
                      standard_deviation_ratio = c(.5, 1, 2))
dim(Design)
head(Design)

#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 2 --- Define generate, analyse, and summarise functions

# help(Generate)
Generate <- function(condition, fixed_objects = NULL){
    N <- condition$sample_size      # alternatively, could use Attach() to make objects available
    grs <- condition$group_size_ratio
    sd <- condition$standard_deviation_ratio
    if(grs < 1){
        N2 <- N / (1/grs + 1)
        N1 <- N - N2
    } else {
        N1 <- N / (grs + 1)
        N2 <- N - N1
    }
    group1 <- rnorm(N1)
    group2 <- rnorm(N2, sd=sd)
    dat <- data.frame(group = c(rep('g1', N1), rep('g2', N2)), DV = c(group1, group2))
    dat
}

# help(Analyse)
Analyse <- function(condition, dat, fixed_objects = NULL){
    welch <- t.test(DV ~ group, dat)
    ind <- t.test(DV ~ group, dat, var.equal=TRUE)

    # In this function the p values for the t-tests are returned,
    #  and make sure to name each element, for future reference
    ret <- c(welch = welch$p.value, independent = ind$p.value)
    ret
}

# help(Summarise)
Summarise <- function(condition, results, fixed_objects = NULL){
    #find results of interest here (e.g., alpha < .1, .05, .01)
    ret <- EDR(results, alpha = .05)
    ret
}


#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 3 --- Collect results by looping over the rows in design

# test to see if it works and for debugging
Final <- runSimulation(design=Design, replications=5,
                       generate=Generate, analyse=Analyse, summarise=Summarise)
head(Final)

# didactic demonstration when summarise function is not supplied (returns list of matrices)
Final2 <- runSimulation(design=Design, replications=5,
                       generate=Generate, analyse=Analyse)
print(Final2[1:3])

# or a single condition
Final3 <- runSimulation(design=Design[1, ], replications=5,
                       generate=Generate, analyse=Analyse)
Final3


## Not run: 
# # complete run with 1000 replications per condition
# Final <- runSimulation(design=Design, replications=1000, parallel=TRUE,
#                        generate=Generate, analyse=Analyse, summarise=Summarise)
# head(Final, digits = 3)
# View(Final)
# 
# ## save results to a file upon completion (not run)
# # runSimulation(design=Design, replications=1000, parallel=TRUE, save=TRUE, filename = 'mysim',
# #               generate=Generate, analyse=Analyse, summarise=Summarise)
# 
# 
# 
# ## Debug the generate function. See ?browser for help on debugging
# ##   Type help to see available commands (e.g., n, c, where, ...),
# ##   ls() to see what has been defined, and type Q to quit the debugger
# runSimulation(design=Design, replications=1000,
#               generate=Generate, analyse=Analyse, summarise=Summarise,
#               parallel=TRUE, edit='generate')
# 
# ## Alternatively, place a browser() within the desired function line to
# ##   jump to a specific location
# Summarise <- function(condition, results, fixed_objects = NULL){
#     #find results of interest here (e.g., alpha < .1, .05, .01)
#     ret <- EDR(results[,nms], alpha = .05)
#     browser()
#     ret
# }
# 
# runSimulation(design=Design, replications=1000,
#               generate=Generate, analyse=Analyse, summarise=Summarise,
#               parallel=TRUE)
# 
# 
# 
# 
# ## EXTRA: To run the simulation on a MPI cluster, use the following setup on each node (not run)
# # library(doMPI)
# # cl <- startMPIcluster()
# # registerDoMPI(cl)
# # Final <- runSimulation(design=Design, replications=1000, MPI=TRUE, save=TRUE,
# #                        generate=Generate, analyse=Analyse, summarise=Summarise)
# # saveRDS(Final, 'mysim.rds')
# # closeCluster(cl)
# # mpi.quit()
# 
# 
# ## Similarly, run simulation on a network linked via ssh
# ##  (two way ssh key-paired connection must be possible between master and slave nodes)
# ##
# ## define IP addresses, including primary IP
# # primary <- '192.168.2.20'
# # IPs <- list(
# #     list(host=primary, user='phil', ncore=8),
# #     list(host='192.168.2.17', user='phil', ncore=8)
# # )
# # spec <- lapply(IPs, function(IP)
# #                    rep(list(list(host=IP$host, user=IP$user)), IP$ncore))
# # spec <- unlist(spec, recursive=FALSE)
# #
# # cl <- parallel::makeCluster(type='PSOCK', master=primary, spec=spec)
# # Final <- runSimulation(design=Design, replications=1000, parallel = TRUE, save=TRUE,
# #                        generate=Generate, analyse=Analyse, summarise=Summarise, cl=cl)
# 
# #~~~~~~~~~~~~~~~~~~~~~~~~
# # Step 4 --- Post-analysis: Create a new R file for analyzing the Final data.frame with R based
# # regression stuff, so use the lm() function to find main effects, interactions, plots, etc.
# # This is where you get to be a data analyst!
# 
# library(dplyr)
# Final2 <- tbl_df(Final)
# Final2 %>% summarise(mean(welch), mean(independent))
# Final2 %>% group_by(standard_deviation_ratio, group_size_ratio) %>%
#    summarise(mean(welch), mean(independent))
# 
# # quick ANOVA analysis method with all two-way interactions
# SimAnova( ~ (sample_size + group_size_ratio + standard_deviation_ratio)^2, Final)
# 
# # or more specific anovas
# SimAnova(independent ~ (group_size_ratio + standard_deviation_ratio)^2,
#     Final)
# 
# # make some plots
# library(ggplot2)
# library(reshape2)
# welch_ind <- Final[,c('group_size_ratio', "standard_deviation_ratio",
#     "welch", "independent")]
# dd <- melt(welch_ind, id.vars = names(welch_ind)[1:2])
# 
# ggplot(dd, aes(factor(group_size_ratio), value)) +
#     geom_abline(intercept=0.05, slope=0, col = 'red') +
#     geom_abline(intercept=0.075, slope=0, col = 'red', linetype='dotted') +
#     geom_abline(intercept=0.025, slope=0, col = 'red', linetype='dotted') +
#     geom_boxplot() + facet_wrap(~variable)
# 
# ggplot(dd, aes(factor(group_size_ratio), value, fill = factor(standard_deviation_ratio))) +
#     geom_abline(intercept=0.05, slope=0, col = 'red') +
#     geom_abline(intercept=0.075, slope=0, col = 'red', linetype='dotted') +
#     geom_abline(intercept=0.025, slope=0, col = 'red', linetype='dotted') +
#     geom_boxplot() + facet_grid(variable~standard_deviation_ratio) +
#     theme(legend.position = 'none')
# 
# ## End(Not run)

Run the code above in your browser using DataLab