runSimulation: Run a Monte Carlo simulation given a data.frame of conditions and simulation functions

Description

This function runs a Monte Carlo simulation study given the simulation functions, the design conditions, and the number of replications. Results can be saved as temporary files in case of interruptions and may be restored by rerunning the exact function calls again, provided that the respective temp file can be found in the working directory. To conserve RAM, temporary objects (such as generated data across conditions and replications) are discarded. Supports parallel and cluster computing, global and local debugging, and is designed to be cross-platform. For a skeleton version of the work-flow which may be useful when initially defining a simulation, see SimDesign_functions. Additional examples can be found on the package wiki, located at https://github.com/philchalmers/SimDesign/wiki

Usage

runSimulation(design, replications, generate, analyse, summarise,
  parallel = FALSE, MPI = FALSE, save = FALSE, save_every = 1,
  clean = TRUE, seed = NULL, compname = Sys.info()["nodename"],
  filename = paste0(compname, "_Final_", replications, ".rds"),
  tmpfilename = paste0(compname, "_tmpsim.rds"), main = NULL,
  ncores = parallel::detectCores(), edit = "none", verbose = TRUE)

Arguments

design

a data.frame object containing the Monte Carlo simulation conditions to be studied, where each row represents a unique condition

replications

number of replication to perform per condition (i.e., each row in design)

generate

user-defined data and parameter generating function. See generate for details

analyse

user-defined computation function which acts on the data generated from generate. See analyse for details

summarise

user-defined summary function to be used after all the replications have completed. See summarise for details

parallel

logical; use parallel processing from the parallel package over each unique condition?

NOTE: When using packages other than the basic packages which are attached by default (e.g., stats, graphics, utils<

MPI

logical; use the doMPI package to run simulation in parallel on a cluster? Default is FALSE

save

logical; save the final simulation and temp files to the hard-drive? This is useful for simulations which require an extended amount of time. Default is FALSE

save_every

a number indicating how often to temporarily save your simulation results to disk. Default is 1 to save after every condition is complete, but set to NA if you don't want to save any temp files

clean

logical; remove any temp files that are created after the simulation is complete? Default is TRUE

seed

a vector of integers (or single number) to be used for reproducibility. The length of the vector must be equal to either 1 or the number of rows in design; if 1, this will be repeated for each condition. This argument calls

compname

name of the computer running the simulation. Normally this doesn't need to be modified, but in the event that a node breaks down while running a simulation the results from the tmp files may be resumed on another computer by changing the name of the n

filename

the name of the file to save the results to. Default is the system name with the number of replications and 'Final' appended to the string

tmpfilename

the name of the temporary file, default is the system name with 'tmpsim.rds' appended at the end. This file will be read in if it is in the working directory, and the simulation will continue where at the last point this file was saved (useful in ca

main

(optional) user-defined organization function defining how the simulation should be organized. When NULL, the internal function definition is used (and the majority of the time this is sufficient). See main

ncores

number of cores to be used in parallel execution. Default uses all available

edit

a string indicating where to initiate a browser() call for editing and debugging. Options are 'none' (default), 'main' to edit the main function calls loop, 'generate' to edit the data simulation function,

verbose

logical; print messages to the R console?

code

filename

Cluster computing

If the package is installed across a cluster of computers, and all the computers are accessible on the same LAN network, then the package may be run within the MPI paradigm. This simply requires that the computers be setup using the usual MPI requirements (typically, running some flavor of Linux, have password-less openSSH access, addresses have been added to the /etc/hosts file, etc). To setup the R code for an MPI cluster one need only add the argument MPI = TRUE and submit the files using the suitable BASH commands.

For instances, if the following code is run on the master node through a terminal then 16 processes will be summoned (1 master, 15 slaves) across the computers named localhost, slave1, and slave2.

mpirun -np 16 -H localhost,slave1,slave2 R --slave -f simulation.R

Poor man's cluster computing

In the event that you do not have access to a Beowulf-type cluster, but have multiple personal computers, then the simulation code can be manually distributed across each computer instead. This simply requires passing a smaller value to the each argument on each computer, and later aggregating the results using the aggregate_simulations function.

For instance, if you have two computers available and wanted 500 replications you could pass replications = 300 to one computer and replications = 200 to the other along with a save = TRUE argument. This will create two distinct .rds files which can be combined later with the aggregate_simulations function. The benefit of this approach over MPI is that computers need not be linked over a LAN network, and should the need arise the temporary simulation results can be migrated to another computer in case of a complete hardware failure by modifying the suitable compname input (or, if the filename and tmpfilename were modified, matching those files as well).

Details

The strategy for organizing the Monte Carlo simulation work-flow is to

[object Object],[object Object],[object Object],[object Object]

Examples

Run this code

#### Step 1 --- Define your conditions under study and create design data.frame

# (use EXPLICIT names, avoid things like N <- 100. That's fine in functions, not here)
sample_sizes <- c(30, 60, 90, 120)
standard_deviation_ratios <- c(1, 4, 8)
group_size_ratios <- c(.5, 1, 2)

Design <- expand.grid(sample_size=sample_sizes,
                      group_size_ratio=group_size_ratios,
                      standard_deviation_ratio=standard_deviation_ratios)
dim(Design)
head(Design)

#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 2 --- Define generate, analyse, and summarise functions

# skeleton functions to be edited
SimDesign_functions()

# help(generate)
Generate <- function(condition){

    #require packages/define functions if needed, or better yet index with the :: operator

    N <- condition$sample_size
    grs <- condition$group_size_ratio
    sd <- condition$standard_deviation_ratio

    if(grs == 0.5){
        N2 <- N/3
        N1 <- N - N2
    } else if(grs == 2){
        N1 <- N/3
        N2 <- N - N1
    } else {
        N1 <- N2 <- N / 2
    }

    group1 <- rnorm(N1)
    group2 <- rnorm(N2, sd=sd)
    dat <- data.frame(group = c(rep('g1', N1), rep('g2', N2)), DV = c(group1, group2))

    return(dat)
}

# help(analyse)

Analyse <- function(condition, dat, parameters = NULL){

    # require packages/define functions if needed, or better yet index with the :: operator
    require(stats)
    mygreatfunction <- function(x) print('Do some stuff')

    #wrap computational statistics in try() statements to control estimation problems
    welch <- try(t.test(DV ~ group, dat), silent=TRUE)
    ind <- try(t.test(DV ~ group, dat, var.equal=TRUE), silent=TRUE)

    # check if any errors occured. This will re-draw the data
    check_error(welch, ind)

    # In this function the p values for the t-tests are returned,
    #  and make sure to name each element, for future reference
    ret <- c(welch = welch$p.value, independent = ind$p.value)

    return(ret)
}

# help(summarise)

Summarise <- function(condition, results, parameters_list = NULL){

    #find results of interest here (e.g., alpha < .1, .05, .01)
    nms <- c('welch', 'independent')
    lessthan.05 <- EDR(results[,nms], alpha = .05)

    # return the results that will be appended to the design input
    ret <- c(lessthan.05=lessthan.05)
    return(ret)
}


#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 3 --- Collect results by looping over the rows in design

# test to see if it works and for debugging
Final <- runSimulation(design=Design, replications=10, parallel=FALSE,
                       generate=Generate, analyse=Analyse, summarise=Summarise)

# complete run with 1000 replications per condition
Final <- runSimulation(design=Design, replications=1000, parallel=TRUE,
                       generate=Generate, analyse=Analyse, summarise=Summarise)
head(Final)
View(Final)

## Debug the generate function. See ?browser for help on debugging
##   Type help to see available commands (e.g., n, c, where, ...),
##   ls() to see what has been defined, and type Q to quit the debugger
runSimulation(design=Design, replications=1000,
              generate=Generate, analyse=Analyse, summarise=Summarise,
              parallel=TRUE, edit='generate')

## Alternatively, place a browser() within the desired function line to
##   jump to a specific location
Summarise <- function(condition, results, parameters_list = NULL){

    #find results of interest here (e.g., alpha < .1, .05, .01)
    nms <- c('welch', 'independent')
    lessthan.05 <- EDR(results[,nms], alpha = .05)

    browser()

    # return the results that will be appended to the design input
    ret <- c(lessthan.05=lessthan.05)
    return(ret)
}

runSimulation(design=Design, replications=1000,
              generate=Generate, analyse=Analyse, summarise=Summarise,
              parallel=TRUE)




## EXTRA: To run the simulation on a MPI cluster, use the following setup on each node (not run)
# library(doMPI)
# cl <- startMPIcluster()
# registerDoMPI(cl)
# Final <- runSimulation(design=Design, replications=1000, MPI=TRUE,
#                        generate=Generate, analyse=Analyse, summarise=Summarise)
# closeCluster(cl)
# mpi.quit()



#~~~~~~~~~~~~~~~~~~~~~~~~
# Step 4 --- Post-analysis: Create a new R file for analyzing the Final data.frame with R based
# regression stuff, so use the lm() function to find main effects, interactions, plots, etc.
# This is where you get to be a data analyst!

psych::describe(Final)
psych::describeBy(Final, group = Final$standard_deviation_ratio)

# make into factors (if helpful)
Final$f_gsr <- with(Final, factor(group_size_ratio))
Final$f_sdr <- with(Final, factor(standard_deviation_ratio))

#lm analysis (might want to change DV to a logit for better stability)
mod <- lm(lessthan.05.welch ~ f_gsr * f_sdr, Final)
car::Anova(mod)

mod2 <- lm(lessthan.05.independent ~ f_gsr * f_sdr, Final)
car::Anova(mod2)

# make some plots
library(ggplot2)
library(reshape2)
welch_ind <- Final[,c('group_size_ratio', "standard_deviation_ratio",
    "lessthan.05.welch", "lessthan.05.independent")]
dd <- melt(welch_ind, id.vars = names(welch_ind)[1:2])

ggplot(dd, aes(factor(group_size_ratio), value)) +
    geom_abline(intercept=0.05, slope=0, col = 'red') +
    geom_abline(intercept=0.075, slope=0, col = 'red', linetype='dotted') +
    geom_abline(intercept=0.025, slope=0, col = 'red', linetype='dotted') +
    geom_boxplot() + facet_wrap(~variable)

ggplot(dd, aes(factor(group_size_ratio), value, fill = factor(standard_deviation_ratio))) +
    geom_abline(intercept=0.05, slope=0, col = 'red') +
    geom_abline(intercept=0.075, slope=0, col = 'red', linetype='dotted') +
    geom_abline(intercept=0.025, slope=0, col = 'red', linetype='dotted') +
    geom_boxplot() + facet_grid(variable~standard_deviation_ratio) +
    theme(legend.position = 'none')

Run the code above in your browser using DataLab