# SimCorrMix v0.1.1

## Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions

Generate continuous (normal, non-normal, or mixture distributions), binary, ordinal,
and count (regular or zero-inflated, Poisson or Negative Binomial) variables with a specified
correlation matrix, or one continuous variable with a mixture distribution. This package can
be used to simulate data sets that mimic real-world clinical or genetic data sets (i.e.,
plasmodes, as in Vaughan et al., 2009 <DOI:10.1016/j.csda.2008.02.032>). The methods
extend those found in the 'SimMultiCorrData' R package. Standard normal variables with an
imposed intermediate correlation matrix are transformed to generate the desired distributions.
Continuous variables are simulated using either Fleishman (1978)'s third order
<DOI:10.1007/BF02293811> or Headrick (2002)'s fifth order
<DOI:10.1016/S0167-9473(02)00072-5> polynomial transformation method (the power method
transformation, PMT). Non-mixture distributions require the user to specify mean, variance,
skewness, standardized kurtosis, and standardized fifth and sixth cumulants. Mixture
distributions require these inputs for the component distributions plus the mixing
probabilities. Simulation occurs at the component level for continuous mixture
distributions. The target correlation matrix is specified in terms of correlations with
components of continuous mixture variables. These components are transformed into the
desired mixture variables using random multinomial variables based on the mixing
probabilities. However, the package provides functions to approximate expected correlations
with continuous mixture variables given target correlations with the components. Binary and
ordinal variables are simulated using a modification of ordsample() in package 'GenOrd'.
Count variables are simulated using the inverse CDF method. There are two simulation
pathways which calculate intermediate correlations involving count variables differently.
Correlation Method 1 adapts Yahav and Shmueli's 2012 method <DOI:10.1002/asmb.901> and
performs best with large count variable means and positive correlations or small means and
negative correlations. Correlation Method 2 adapts Barbiero and Ferrari's 2015
modification of the 'GenOrd' package <DOI:10.1002/asmb.2072> and performs best under the
opposite scenarios. The optional error loop may be used to improve the accuracy of the
final correlation matrix. The package also contains functions to calculate the
standardized cumulants of continuous mixture distributions, check parameter inputs,
calculate feasible correlation boundaries, and summarize and plot simulated variables.
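The count-variable step described above (correlated standard normals pushed through an inverse CDF) can be sketched in base R. This is only an illustration under assumed values: `rho_z`, the two Poisson means, and the seed are made up for the example. It is not the package's `corrvar` implementation, and it skips the intermediate-correlation calculation, so the correlation of the resulting counts is only close to `rho_z`, not equal to it.

```r
set.seed(1234)                      # hypothetical seed, for reproducibility only
n      <- 10000
rho_z  <- 0.5                       # assumed correlation imposed on the normals
lambda <- c(5, 10)                  # assumed Poisson means

# Impose the correlation on independent standard normals via a Cholesky factor
Sigma <- matrix(c(1, rho_z, rho_z, 1), nrow = 2)
Z <- matrix(rnorm(n * 2), nrow = n) %*% chol(Sigma)

# Inverse CDF method: normal -> uniform -> Poisson quantile
U <- pnorm(Z)
counts <- cbind(qpois(U[, 1], lambda[1]), qpois(U[, 2], lambda[2]))

rho_counts <- cor(counts)[1, 2]     # correlation actually achieved by the counts
colMeans(counts)
rho_counts
```

The gap between `rho_z` and `rho_counts` is the reason an intermediate correlation has to be calculated before the transformation.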

## Readme

# SimCorrMix

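The goal of **SimCorrMix** is to generate continuous (normal, non-normal, or mixture distributions), binary, ordinal, and count (Poisson or Negative Binomial, regular or zero-inflated) variables with a specified correlation matrix, or one continuous variable with a mixture distribution. This package can be used to simulate data sets that mimic real-world clinical or genetic data sets (i.e., plasmodes, as in Vaughan et al., 2009). The methods extend those found in the **SimMultiCorrData** package. Standard normal variables with an imposed intermediate correlation matrix are transformed to generate the desired distributions. Continuous variables are simulated using either Fleishman (1978)'s third-order or Headrick (2002)'s fifth-order polynomial transformation method (the power method transformation, PMT). There are two simulation pathways which calculate intermediate correlations involving count variables differently. **Correlation Method 1** adapts Yahav and Shmueli's 2012 method and performs best with large count variable means and positive correlations or small means and negative correlations. **Correlation Method 2** adapts Barbiero and Ferrari's 2015 modification of the **GenOrd** package and performs best under the opposite scenarios.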

There are several vignettes which accompany this package that may help the user understand the simulation and analysis methods.

- **Comparison of Correlation Methods 1 and 2** describes the two simulation pathways that can be followed for generation of correlated data (using `corrvar` and `corrvar2`).
- **Continuous Mixture Distributions** demonstrates how to simulate one continuous mixture variable using `contmixvar1` and gives a step-by-step guideline for comparing a simulated distribution to the target distribution.
- **Expected Cumulants and Correlations for Continuous Mixture Variables** derives the equations used by the function `calc_mixmoments` to find the mean, standard deviation, skew, standardized kurtosis, and standardized fifth and sixth cumulants for a continuous mixture variable. The vignette also explains how the functions `rho_M1M2` and `rho_M1Y` calculate the expected correlations with continuous mixture variables based on the target correlations with the components.
- **Overall Workflow for Generation of Correlated Data** gives a step-by-step guideline to follow with an example containing continuous non-mixture and mixture, ordinal, zero-inflated Poisson, and zero-inflated Negative Binomial variables. It executes both correlated data simulation functions with and without the error loop.
- **Variable Types** describes the different types of variables that can be simulated in **SimCorrMix**, details the algorithm involved in the optional error loop that helps to minimize correlation errors, and explains how the feasible correlation boundaries are calculated for each of the two simulation pathways (using `validcorr` and `validcorr2`).

## Installation instructions

**SimCorrMix** can be installed using the following code:

```
## from GitHub
install.packages("devtools")
devtools::install_github("AFialkowski/SimCorrMix", build_vignettes = TRUE)
## from CRAN
install.packages("SimCorrMix")
```

## Example

This basic example compares a simulated continuous mixture variable with its target distribution, following Headrick and Kowalchuk's steps (2007). The mixture components are *Normal(-2, 1)* and *Normal(2, 1)*. The mixing proportions are 0.4 and 0.6.

### Step 1: Obtain the standardized cumulants

The values of *γ*_{1}, *γ*_{2}, *γ*_{3}, and *γ*_{4} are all 0 for normal variables. The mean and standard deviation of the mixture variable are found with `calc_mixmoments`.

```
library("SimCorrMix")
#> Loading required package: SimMultiCorrData
#>
#> Attaching package: 'SimMultiCorrData'
#> The following object is masked from 'package:stats':
#>
#> poly
library("printr")
options(scipen = 999)
n <- 10000
mix_pis <- c(0.4, 0.6)
mix_mus <- c(-2, 2)
mix_sigmas <- c(1, 1)
mix_skews <- rep(0, 2)
mix_skurts <- rep(0, 2)
mix_fifths <- rep(0, 2)
mix_sixths <- rep(0, 2)
Nstcum <- calc_mixmoments(mix_pis, mix_mus, mix_sigmas, mix_skews,
  mix_skurts, mix_fifths, mix_sixths)
```
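As a cross-check on the first few values, the mean, standard deviation, and skew of this particular mixture can be worked out by hand from the standard mixture-moment formulas. The sketch below uses base R only and does not call `calc_mixmoments`; the skew expression assumes symmetric (zero-skew) components, which holds here because both components are normal. The variable names `mix_mean`, `mix_var`, `mix_sd`, and `mix_skew` are introduced for illustration.

```r
mix_pis    <- c(0.4, 0.6)
mix_mus    <- c(-2, 2)
mix_sigmas <- c(1, 1)

# Mean: probability-weighted average of the component means
mix_mean <- sum(mix_pis * mix_mus)

# Variance: weighted second raw moments minus the squared mean
mix_var <- sum(mix_pis * (mix_sigmas^2 + mix_mus^2)) - mix_mean^2
mix_sd  <- sqrt(mix_var)

# Skew: third central moment of the mixture over sd^3
# (valid as written only when every component has zero skew)
d <- mix_mus - mix_mean
mix_skew <- sum(mix_pis * (d^3 + 3 * d * mix_sigmas^2)) / mix_sd^3

round(c(mean = mix_mean, sd = mix_sd, skew = mix_skew), 5)
```

These agree with the Mean (0.4), SD (2.2), and Skew (-0.2885) reported in the target summary table in Step 2.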

### Step 2: Simulate the variable

Note that `calc_mixmoments` returns the standard deviation, not the variance. The simulation functions require variance as the input. First, the parameter inputs are checked with `validpar`.

```
validpar(k_mix = 1, method = "Polynomial", means = Nstcum[1],
  vars = Nstcum[2]^2, mix_pis = mix_pis, mix_mus = mix_mus,
  mix_sigmas = mix_sigmas, mix_skews = mix_skews, mix_skurts = mix_skurts,
  mix_fifths = mix_fifths, mix_sixths = mix_sixths)
#> [1] TRUE
Nmix2 <- contmixvar1(n, "Polynomial", Nstcum[1], Nstcum[2]^2, mix_pis, mix_mus,
  mix_sigmas, mix_skews, mix_skurts, mix_fifths, mix_sixths)
#> Total Simulation time: 0.002 minutes
```
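To make the idea of simulating at the component level concrete, here is a minimal base-R sketch that draws a component label for each observation from the mixing probabilities and then samples from the selected normal component. It is not how `contmixvar1` is implemented (that function uses the power method transformation); the seed and the names `comp` and `y` are assumptions for the example.

```r
set.seed(99)                        # hypothetical seed, for reproducibility only
n          <- 10000
mix_pis    <- c(0.4, 0.6)
mix_mus    <- c(-2, 2)
mix_sigmas <- c(1, 1)

# Draw a component label for every observation using the mixing probabilities
comp <- sample(seq_along(mix_pis), size = n, replace = TRUE, prob = mix_pis)

# Draw each observation from its selected component
y <- rnorm(n, mean = mix_mus[comp], sd = mix_sigmas[comp])

c(mean = mean(y), sd = sd(y))       # should be near 0.4 and 2.2
```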

Look at a summary of the target distribution and compare to a summary of the simulated distribution.

```
SumN <- summary_var(Y_comp = Nmix2$Y_comp, Y_mix = Nmix2$Y_mix,
  means = Nstcum[1], vars = Nstcum[2]^2, mix_pis = mix_pis, mix_mus = mix_mus,
  mix_sigmas = mix_sigmas, mix_skews = mix_skews, mix_skurts = mix_skurts,
  mix_fifths = mix_fifths, mix_sixths = mix_sixths)
knitr::kable(SumN$target_mix, digits = 5, row.names = FALSE,
  caption = "Summary of Target Distribution")
```

Distribution | Mean | SD | Skew | Skurtosis | Fifth | Sixth |
---|---|---|---|---|---|---|
1 | 0.4 | 2.2 | -0.2885 | -1.15402 | 1.79302 | 6.17327 |

```
knitr::kable(SumN$mix_sum, digits = 5, row.names = FALSE,
  caption = "Summary of Simulated Distribution")
```

Distribution | N | Mean | SD | Median | Min | Max | Skew | Skurtosis | Fifth | Sixth |
---|---|---|---|---|---|---|---|---|---|---|
1 | 10000 | 0.4 | 2.19989 | 1.05078 | -5.69433 | 5.341 | -0.2996 | -1.15847 | 1.84723 | 6.1398 |

### Step 3: Determine if the constants generate a valid PDF

```
Nmix2$constants
```

c0 | c1 | c2 | c3 | c4 | c5 |
---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 |
0 | 1 | 0 | 0 | 0 | 0 |

```
Nmix2$valid.pdf
#> [1] "TRUE" "TRUE"
```
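One condition usually cited for a valid power method PDF is that the polynomial transformation is strictly increasing in the standard normal variable. The sketch below checks that condition numerically on a grid for a set of six constants `c0` to `c5`. The helper `pmt_is_increasing` is hypothetical and is not part of **SimCorrMix**; the package's own check behind `valid.pdf` may use different or additional criteria, and a grid check is only an approximation.

```r
# Hypothetical helper: TRUE if the derivative of
# p(z) = c0 + c1*z + c2*z^2 + c3*z^3 + c4*z^4 + c5*z^5
# is positive at every point of the grid z
pmt_is_increasing <- function(con, z = seq(-10, 10, by = 0.001)) {
  stopifnot(length(con) == 6)
  deriv <- con[2] + 2 * con[3] * z + 3 * con[4] * z^2 +
    4 * con[5] * z^3 + 5 * con[6] * z^4
  all(deriv > 0)
}

pmt_is_increasing(c(0, 1, 0, 0, 0, 0))   # constants from the table above: TRUE
pmt_is_increasing(c(0, 1, 0, -1, 0, 0))  # made-up constants, derivative 1 - 3z^2: FALSE
```

For the constants in this example the polynomial reduces to the identity, so each component is simply a standard normal variable before rescaling.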

### Step 4: Select a critical value

Let *α* = 0.05. Since there are no quantile functions for mixture distributions, determine where the cumulative probability equals 1 − *α* = 0.95. The boundaries for `uniroot` were determined through trial and error.

```
fx <- function(x) 0.4 * dnorm(x, -2, 1) + 0.6 * dnorm(x, 2, 1)
cfx <- function(x, alpha, FUN = fx) {
  integrate(function(x, FUN = fx) FUN(x), -Inf, x, subdivisions = 1000,
    stop.on.error = FALSE)$value - (1 - alpha)
}
y_star <- uniroot(cfx, c(3.3, 3.4), tol = 0.001, alpha = 0.05)$root
y_star
#> [1] 3.382993
```
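Because both components here are normal, the mixture CDF is also available in closed form as a weighted sum of `pnorm` terms, which gives an independent check on the critical value without numerical integration. This is a sketch for this specific example only; the names `mix_cdf` and `y_star_check` are introduced for illustration, and the wide search interval is an assumption that works because the CDF is monotone.

```r
# Closed-form CDF of the 0.4 * Normal(-2, 1) + 0.6 * Normal(2, 1) mixture
mix_cdf <- function(x) 0.4 * pnorm(x, -2, 1) + 0.6 * pnorm(x, 2, 1)

alpha <- 0.05

# Solve mix_cdf(x) = 1 - alpha over a deliberately wide interval
y_star_check <- uniroot(function(x) mix_cdf(x) - (1 - alpha),
                        interval = c(-10, 10), tol = 1e-8)$root
y_star_check
```

The result should agree with `y_star` above to about three decimal places.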

### Step 5: Calculate the cumulative probability for the simulated variable up to 1 − *α*

We will use the function `SimMultiCorrData::sim_cdf_prob` to determine the cumulative probability for *Y* up to `y_star`. This function is based on Martin Maechler's `ecdf` function in the **stats** package.

```
sim_cdf_prob(sim_y = Nmix2$Y_mix[, 1], delta = y_star)$cumulative_prob
#> [1] 0.9504
```
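The same quantity can be approximated with the base `ecdf` function directly. The sketch below is self-contained and therefore simulates its own mixture sample in base R rather than reusing `Nmix2`; the seed and the names `Fn` and `p_hat` are assumptions for the example, and the exact value will differ slightly from the 0.9504 shown above.

```r
set.seed(2018)                      # hypothetical seed, for reproducibility only
n <- 10000

# Base-R stand-in for the simulated mixture variable
comp <- sample(1:2, n, replace = TRUE, prob = c(0.4, 0.6))
y <- rnorm(n, mean = c(-2, 2)[comp], sd = 1)

y_star <- 3.382993                  # critical value found in Step 4

Fn <- ecdf(y)                       # empirical CDF as a step function
p_hat <- Fn(y_star)                 # proportion of simulated values <= y_star
p_hat
```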

This is approximately equal to the 1 − *α* value of 0.95, indicating the method provides a **good approximation to the actual distribution.**

### Step 6: Plot graphs

```
plot_simpdf_theory(sim_y = Nmix2$Y_mix[, 1], ylower = -10, yupper = 10,
  title = "PDF of Mixture of Normal Distributions", fx = fx, lower = -Inf,
  upper = Inf)
```

We can also plot the empirical CDF and show the cumulative probability up to `y_star`.

```
plot_sim_cdf(sim_y = Nmix2$Y_mix[, 1], calc_cprob = TRUE, delta = y_star)
```

## Functions in SimCorrMix

Name | Description |
---|---|
corrvar2 | Generation of Correlated Ordinal, Continuous (mixture and non-mixture), and/or Count (Poisson and Negative Binomial, regular and zero-inflated) Variables: Correlation Method 2 |
norm_ord | Calculate Correlations of Ordinal Variables Obtained from Discretizing Normal Variables |
intercorr_nb | Calculate Intermediate MVN Correlation for Negative Binomial Variables: Correlation Method 1 |
calc_mixmoments | Find Standardized Cumulants of a Continuous Mixture Distribution by Method of Moments |
intercorr_cont | Calculate Intermediate MVN Correlation for Continuous Variables Generated by Polynomial Transformation Method |
intercorr_pois | Calculate Intermediate MVN Correlation for Poisson Variables: Correlation Method 1 |
intercorr_cat_nb | Calculate Intermediate MVN Correlation for Ordinal - Negative Binomial Variables: Correlation Method 1 |
intercorr_cat_pois | Calculate Intermediate MVN Correlation for Ordinal - Poisson Variables: Correlation Method 1 |
intercorr_cont_nb | Calculate Intermediate MVN Correlation for Continuous - Negative Binomial Variables: Correlation Method 1 |
intercorr_pois_nb | Calculate Intermediate MVN Correlation for Poisson - Negative Binomial Variables: Correlation Method 1 |
plot_simtheory | Plot Simulated Data and Target Distribution Data by Name or Function for Continuous or Count Variables |
rho_M1M2 | Approximate Correlation between Two Continuous Mixture Variables M1 and M2 |
rho_M1Y | Approximate Correlation between Continuous Mixture Variable M1 and Random Variable Y |
validpar | Parameter Check for Simulation or Correlation Validation Functions |
summary_var | Summary of Simulated Variables |
ord_norm | Calculate Intermediate MVN Correlation to Generate Variables Treated as Ordinal |
plot_simpdf_theory | Plot Simulated Probability Density Function and Target PDF by Distribution Name or Function for Continuous or Count Variables |
intercorr | Calculate Intermediate MVN Correlation for Ordinal, Continuous, Poisson, or Negative Binomial Variables: Correlation Method 1 |
SimCorrMix | Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions |
intercorr2 | Calculate Intermediate MVN Correlation for Ordinal, Continuous, Poisson, or Negative Binomial Variables: Correlation Method 2 |
contmixvar1 | Generation of One Continuous Variable with a Mixture Distribution Using the Power Method Transformation |
corr_error | Error Loop to Correct Final Correlation of Simulated Variables |
corrvar | Generation of Correlated Ordinal, Continuous (mixture and non-mixture), and/or Count (Poisson and Negative Binomial, regular and zero-inflated) Variables: Correlation Method 1 |
intercorr_cont_nb2 | Calculate Intermediate MVN Correlation for Continuous - Negative Binomial Variables: Correlation Method 2 |
intercorr_cont_pois | Calculate Intermediate MVN Correlation for Continuous - Poisson Variables: Correlation Method 1 |
maxcount_support | Calculate Maximum Support Value for Count Variables: Correlation Method 2 |
intercorr_cont_pois2 | Calculate Intermediate MVN Correlation for Continuous - Poisson Variables: Correlation Method 2 |
validcorr | Determine Correlation Bounds for Ordinal, Continuous, Poisson, and/or Negative Binomial Variables: Correlation Method 1 |
validcorr2 | Determine Correlation Bounds for Ordinal, Continuous, Poisson, and/or Negative Binomial Variables: Correlation Method 2 |

## Vignettes of SimCorrMix

Name |
---|
Bibliography.bib |
cont_mixture.Rmd |
corr_mixture.Rmd |
method_comp.Rmd |
preamble-mathjax.tex |
variable_types.Rmd |
workflow.Rmd |

## Details

Field | Value |
---|---|
Type | Package |
License | GPL-2 |
Encoding | UTF-8 |
LazyData | true |
RoxygenNote | 6.0.1 |
VignetteBuilder | knitr |
URL | https://github.com/AFialkowski/SimCorrMix |
NeedsCompilation | no |
Packaged | 2018-07-01 12:57:48 UTC; Allison |
Repository | CRAN |
Date/Publication | 2018-07-01 13:31:03 UTC |
