runSimulation: Run a Monte Carlo simulation given a data.frame of conditions and simulation functions

Description

This function runs a Monte Carlo simulation study given a set of predefined simulation functions, design conditions, and number of replications. Results can be saved as temporary files in case of interruptions and may be restored by re-running runSimulation, provided that the respective temp file can be found in the working directory. To conserve RAM, temporary objects (such as data generated across conditions and replications) are discarded; however, these can be saved to the hard-disk by passing the appropriate flags. For longer simulations it is recommended to use save = TRUE to temporarily save the simulation state and also use the save_results flag to write the analysis results the to hard-disc. runSimulation supports parallel and cluster computing, global and local debugging, error handling (including fail-safe stopping when functions fail too often, even across nodes), and tracking of error and warning messages.

Usage

runSimulation(design, replications, generate, analyse, summarise,
  fixed_objects = NULL, parallel = FALSE, packages = NULL,
  ncores = parallel::detectCores(), MPI = FALSE, save = FALSE,
  save_results = FALSE, save_generate_data = FALSE, filename = NULL,
  max_errors = 50, cl = NULL, seed = NULL, save_details = list(),
  edit = "none", verbose = TRUE)

Arguments

design

a data.frame object containing the Monte Carlo simulation conditions to be studied, where each row represents a unique condition

replications

number of replication to perform per condition (i.e., each row in design). Must be greater than 0

generate

user-defined data and parameter generating function. See generate for details

analyse

user-defined computation function which acts on the data generated from generate. See analyse for details

summarise

user-defined summary function to be used after all the replications have completed within each design condition.

Note that if you only care to save the results from analyse then simply have this function return some arb

fixed_objects

(optional) an object (usually a list) containing additional user-defined objects that should remain fixed across conditions. This is useful when including long fixed vectors/matrices of population parameters, data that should be used across a

parallel

logical; use parallel processing from the parallel package over each unique condition?

packages

a character vector of external packages to be used during the simulation (e.g., c('MASS', 'mvtnorm', 'simsem') ). Use this input when parallel = TRUE or MPI = TRUE to use non-standard functions from additional packag

ncores

number of cores to be used in parallel execution. Default uses all available

MPI

logical; use the foreach package in a form usable by MPI to run simulation in parallel on a cluster? Default is FALSE

save

logical; save the simulation state to the hard-drive? This is useful for simulations which require an extended amount of time. When TRUE, a temp file will be created in the working directory which allows the simulation state to be saved and r

save_results

logical; save the results returned from analyse to external .rds files located in the defined save_results_dirname directory/folder? Use this if you would like to keep track of the

save_generate_data

logical; save the data returned from generate to external .rds files located in the defined save_generate_data_dirname directory/folder? It is generally recommended to leave this

filename

(optional) the name of the .rds file to save the final simulation results to. When NULL the final simulation object is not saved to the drive. As well, if the same file name already exists in the working directly at the time of s

max_errors

the simulation will terminate when more than this number of constitutive errors are thrown in any given condition. The purpose of this is to indicate that likely something problematic is going wrong in the generate-analyse phases and should be inspected.

cluster object defined by makeCluster to be used when parallel = TRUE. If NULL a local cluster object will be defined which selects the maximum number cores available and will

seed

a vector of integers to be used for reproducibility. The length of the vector must be equal the number of rows in design. This argument calls set.seed or

save_details

a list pertaining to information about how and where files should be saved when save, save_results, or save_generate_data are triggered.

[object Object],[object Object],[object Object],[object Object],[object Ob

edit

a string indicating where to initiate a browser() call for editing and debugging. General options are 'none' (default) and 'all', which are used to disable debugging and to debug all the user defined functions, r

verbose

logical; print messages to the R console? Default is TRUE

Value

a data.frame (also of class 'SimDesign') with the original design conditions in the left-most columns, simulation results in the middle columns, additional information (such as REPLICATIONS and SIM_TIME), to the right of the results, and ERROR/WARNING's in the right-most columns

A note on parallel computing

When running simulations in parallel (either with parallel = TRUE or MPI = TRUE) R objects defined in the global environment will generally not be visible across nodes. Hence, you may see errors such as Error: object 'something' not found if you try to use an object that is defined in the workspace but is not passed to runSimulation. To avoid this type or error, simply pass additional objects to the fixed_objects input (usually it's convenient to supply a named list of these objects). Fortunately, however, custom functions defined in the global environment are exported across nodes automatically. This makes it convenient when writing code because custom functions will always be available across nodes if they are visible in the R workspace. As well, note the packages input to declare packages which must be loaded via library() in order to make specific non-based R functions available across nodes.

Storing and resuming temporary results

In the event of a computer crash, power outage, etc, if save = TRUE was used then the original code used to execute runSimulation() need only be re-run to resume the simulation. The saved temp file will be read into the function automatically, and the simulation will continue where it left off before the simulation state was terminated. Upon completion, a data.frame with the simulation will be returned in the R session. If specified, an .rds file may also be saved to the hard-drive if a suitable filename argument was included. Finally, to save the complete list of results returned from analyse to unique files use save_results = TRUE.

Poor man's cluster computing for independent nodes

In the event that you do not have access to a Beowulf-type cluster (described in the section on "Network Computing") but have multiple personal computers then the simulation code can be manually distributed across each independent computer instead. This simply requires passing a smaller value to the replications argument on each computer and later aggregating the results using the aggregate_simulations function.

For instance, if you have two computers available on different networks and wanted a total of 500 replications you could pass replications = 300 to one computer and replications = 200 to the other along with a filename argument (or simply saving the final objects as .rds files manually). This will create two distinct .rds files which can be combined later with the aggregate_simulations function. The benefit of this approach over MPI or setting up a Beowulf cluster is that computers need not be linked over a LAN network, and should the need arise the temporary simulation results can be migrated to another computer in case of a complete hardware failure by modifying the suitable compname input to save_details (or, if the filename and tmpfilename were modified, matching those files accordingly).

Note that this is also a useful tactic if the MPI or Network computing options require you to submit smaller jobs due to time and resource constraint-related reasons, where fewer replications/nodes should be requested. After all the jobs are completed and saved to their respective files the aggregate_simulations can then collapse the files as if the simulations were run all at once. Hence, SimDesign makes submitting smaller jobs to super-computing resources considerably less error prone then managing a number of smaller jobs manually.

Details

The strategy for organizing the Monte Carlo simulation work-flow is to

[object Object],[object Object],[object Object],[object Object]

For a skeleton version of the work-flow, which is often useful when initially defining a simulation, see SimDesign_functions. This function will write template simulation code to one/two files so that modifying the required functions and objects can begin immediately with minimal error. This means that you can focus on your Monte Carlo simulation immediately rather than worrying about the administrative code-work required to organize the simulation work-flow.

Additional information for each condition are also contained in the data.frame object returned by runSimulation: REPLICATIONS to indicate the number of Monte Carlo replications, SIM_TIME to indicate how long (in seconds) it took to complete all the Monte Carlo replications for each respective design condition, SEED if the seed argument was used, columns containing the number of replications due to try() errors where the error messages represent the names of the columns prefixed with a ERROR: string, and columns containing the number of warnings prefixed with a WARNING: string.

Additional examples, presentation files, and tutorials can be found on the package wiki located at https://github.com/philchalmers/SimDesign/wiki.

Examples

Run this code

#### Step 1 --- Define your conditions under study and create design data.frame

# (use EXPLICIT names, avoid things like N <- 100. That's fine in functions, not here)
sample_sizes <- c(30, 60, 90, 120)
standard_deviation_ratios <- c(1, 4, 8)
group_size_ratios <- c(.5, 1, 2)

Design <- expand.grid(sample_size=sample_sizes,
                      group_size_ratio=group_size_ratios,
                      standard_deviation_ratio=standard_deviation_ratios)
dim(Design)
head(Design)

#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 2 --- Define generate, analyse, and summarise functions

# skeleton functions to be edited
SimDesign_functions()

# help(generate)
Generate <- function(condition, fixed_objects = NULL){

    #require packages/define functions if needed, or better yet index with the :: operator

    N <- condition$sample_size
    grs <- condition$group_size_ratio
    sd <- condition$standard_deviation_ratio

    if(grs < 1){
        N2 <- N / (1/grs + 1)
        N1 <- N - N2
    } else {
        N1 <- N / (grs + 1)
        N2 <- N - N1
    }
    group1 <- rnorm(N1)
    group2 <- rnorm(N2, sd=sd)
    dat <- data.frame(group = c(rep('g1', N1), rep('g2', N2)), DV = c(group1, group2))

    return(dat)
}

# help(analyse)

Analyse <- function(condition, dat, fixed_objects = NULL, parameters = NULL){

    # require packages/define functions if needed, or better yet index with the :: operator
    require(stats)
    mygreatfunction <- function(x) print('Do some stuff')

    #wrap computational statistics in try() statements to control estimation problems
    welch <- t.test(DV ~ group, dat)
    ind <- t.test(DV ~ group, dat, var.equal=TRUE)

    # In this function the p values for the t-tests are returned,
    #  and make sure to name each element, for future reference
    ret <- c(welch = welch$p.value, independent = ind$p.value)

    return(ret)
}

# help(summarise)

Summarise <- function(condition, results, fixed_objects = NULL, parameters_list = NULL){

    #find results of interest here (e.g., alpha < .1, .05, .01)
    lessthan.05 <- EDR(results, alpha = .05)

    # return the results that will be appended to the design input
    ret <- c(lessthan.05=lessthan.05)
    return(ret)
}


#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 3 --- Collect results by looping over the rows in design

# test to see if it works and for debugging
Final <- runSimulation(design=Design, replications=5, parallel=FALSE,
                       generate=Generate, analyse=Analyse, summarise=Summarise)

# complete run with 1000 replications per condition
Final <- runSimulation(design=Design, replications=1000, parallel=TRUE,
                       generate=Generate, analyse=Analyse, summarise=Summarise)
head(Final)
View(Final)

## save results to a file upon completion (not run)
# runSimulation(design=Design, replications=1000, parallel=TRUE, filename = 'mysim',
#               generate=Generate, analyse=Analyse, summarise=Summarise)



## Debug the generate function. See ?browser for help on debugging
##   Type help to see available commands (e.g., n, c, where, ...),
##   ls() to see what has been defined, and type Q to quit the debugger
runSimulation(design=Design, replications=1000,
              generate=Generate, analyse=Analyse, summarise=Summarise,
              parallel=TRUE, edit='generate')

## Alternatively, place a browser() within the desired function line to
##   jump to a specific location
Summarise <- function(condition, results, parameters_list = NULL){

    #find results of interest here (e.g., alpha < .1, .05, .01)
    nms <- c('welch', 'independent')
    lessthan.05 <- EDR(results[,nms], alpha = .05)

    browser()

    # return the results that will be appended to the design input
    ret <- c(lessthan.05=lessthan.05)
    return(ret)
}

runSimulation(design=Design, replications=1000,
              generate=Generate, analyse=Analyse, summarise=Summarise,
              parallel=TRUE)




## EXTRA: To run the simulation on a MPI cluster, use the following setup on each node (not run)
# library(doMPI)
# cl <- startMPIcluster()
# registerDoMPI(cl)
# Final <- runSimulation(design=Design, replications=1000, MPI=TRUE, save=TRUE,
#                        generate=Generate, analyse=Analyse, summarise=Summarise)
# saveRDS(Final, 'mysim.rds')
# closeCluster(cl)
# mpi.quit()


## Similarly, run simulation on a network linked via ssh
##  (two way ssh key-paired connection must be possible between master and slave nodes)
##
## define IP addresses, including primary IP
# primary <- '192.168.2.20'
# IPs <- list(
#     list(host=primary, user='phil', ncore=8),
#     list(host='192.168.2.17', user='phil', ncore=8)
# )
# spec <- lapply(IPs, function(IP)
#                    rep(list(list(host=IP$host, user=IP$user)), IP$ncore))
# spec <- unlist(spec, recursive=FALSE)
#
# cl <- parallel::makeCluster(type='PSOCK', master=primary, spec=spec)
# Final <- runSimulation(design=Design, replications=1000, parallel = TRUE, save=TRUE,
#                        generate=Generate, analyse=Analyse, summarise=Summarise, cl=cl)

#~~~~~~~~~~~~~~~~~~~~~~~~
# Step 4 --- Post-analysis: Create a new R file for analyzing the Final data.frame with R based
# regression stuff, so use the lm() function to find main effects, interactions, plots, etc.
# This is where you get to be a data analyst!

psych::describe(Final)
psych::describeBy(Final, group = Final$standard_deviation_ratio)

# make into factors (if helpful)
Final$f_gsr <- with(Final, factor(group_size_ratio))
Final$f_sdr <- with(Final, factor(standard_deviation_ratio))

#lm analysis (might want to change DV to a logit for better stability)
mod <- lm(lessthan.05.welch ~ f_gsr * f_sdr, Final)
car::Anova(mod)

mod2 <- lm(lessthan.05.independent ~ f_gsr * f_sdr, Final)
car::Anova(mod2)

# make some plots
library(ggplot2)
library(reshape2)
welch_ind <- Final[,c('group_size_ratio', "standard_deviation_ratio",
    "lessthan.05.welch", "lessthan.05.independent")]
dd <- melt(welch_ind, id.vars = names(welch_ind)[1:2])

ggplot(dd, aes(factor(group_size_ratio), value)) +
    geom_abline(intercept=0.05, slope=0, col = 'red') +
    geom_abline(intercept=0.075, slope=0, col = 'red', linetype='dotted') +
    geom_abline(intercept=0.025, slope=0, col = 'red', linetype='dotted') +
    geom_boxplot() + facet_wrap(~variable)

ggplot(dd, aes(factor(group_size_ratio), value, fill = factor(standard_deviation_ratio))) +
    geom_abline(intercept=0.05, slope=0, col = 'red') +
    geom_abline(intercept=0.075, slope=0, col = 'red', linetype='dotted') +
    geom_abline(intercept=0.025, slope=0, col = 'red', linetype='dotted') +
    geom_boxplot() + facet_grid(variable~standard_deviation_ratio) +
    theme(legend.position = 'none')

Run the code above in your browser using DataLab