sortinghat (version 0.1)

simdata_contaminated: Generates random variates from K multivariate contaminated normal populations.

Description

We generate $n_k$ observations $(k = 1, \ldots, K)$ from each of $K$ multivariate contaminated normal distributions. Let $N_p(\mu, \Sigma)$ denote the $p$-dimensional multivariate normal distribution with mean vector $\mu$ and positive-definite covariance matrix $\Sigma$. The $k$th population then has a $p$-dimensional multivariate contaminated normal distribution, whose mixture form is given under Details.

Usage

simdata_contaminated(n, mean, cov, epsilon = rep(0, K),
    kappa = rep(1, K), seed = NULL)

Arguments

n
a vector (of length K) of the sample sizes for each population
mean
a vector or a list (of length K) of mean vectors
cov
a symmetric covariance matrix or a list (of length K) of symmetric covariance matrices
epsilon
a vector (of length K) indicating the probability of sampling a contaminated population (i.e., outlier) for each population
kappa
a vector (of length K) that determines the amount of scale contamination for each population
seed
seed for random number generation (if NULL, the seed is not set)

Value

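  • named list containing:
      • x: a matrix whose rows are the generated observations and whose columns are the p features
      • y: a vector giving the population from which each row of x was generated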

Details

The $k$th population is distributed as the mixture

$$(1 - \epsilon_k) N_p(\mu_k, \Sigma_k) + \epsilon_k N_p(\mu_k, \kappa_k \Sigma_k),$$

where $\epsilon_k \in [0, 1]$ is the probability of sampling from the contaminated component (i.e., of drawing an outlier) and $\kappa_k \ge 1$ determines the amount of scale contamination. The contaminated normal distribution can be viewed as a mixture of two multivariate normal distributions with the same mean, where the second has a scaled covariance matrix and can therefore introduce extreme outliers for sufficiently large $\kappa_k$.
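The mixture above can be sketched in base R for a single population. This is only an illustration of the definition, not the package's implementation: the helper name rcontam_normal and its return value are hypothetical, and the Cholesky-based sampling is one of several ways to draw multivariate normal variates.

```r
# Hypothetical helper (not part of sortinghat): draws n observations from
# (1 - epsilon) * N_p(mu, Sigma) + epsilon * N_p(mu, kappa * Sigma).
rcontam_normal <- function(n, mu, Sigma, epsilon = 0, kappa = 1) {
  p <- length(mu)
  # TRUE for observations drawn from the contaminated component.
  contaminated <- rbinom(n, size = 1, prob = epsilon) == 1
  # Standard normal draws, one row per observation.
  z <- matrix(rnorm(n * p), nrow = n, ncol = p)
  # chol() returns an upper-triangular R with t(R) %*% R equal to Sigma,
  # so the rows of z %*% R have covariance Sigma.
  x <- z %*% chol(Sigma)
  # Scaling a row by sqrt(kappa) multiplies its covariance by kappa.
  x[contaminated, ] <- sqrt(kappa) * x[contaminated, ]
  # Shift every row by the mean vector.
  x <- sweep(x, 2, mu, "+")
  list(x = x, contaminated = contaminated)
}

set.seed(42)
draws <- rcontam_normal(n = 1000, mu = c(1, 0), Sigma = diag(2),
                        epsilon = 0.05, kappa = 10)
dim(draws$x)
mean(draws$contaminated)  # empirical share of contaminated draws
```

With epsilon = 0 no row is ever flagged as contaminated, so the sketch reduces to ordinary multivariate normal sampling.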

The number of populations, K, is determined from the length of the vector of sample sizes, n. The mean vectors and covariance matrices can each be given in a list of length K. If one covariance matrix is given (as a matrix or a list having 1 element), then all populations share this common covariance matrix. The same logic applies to population means.
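The sharing rule for covariance matrices can be illustrated with a short base R sketch. The helper expand_cov is hypothetical and only mirrors the documented behavior; how sortinghat handles this internally is not shown here.

```r
# Hypothetical helper (not part of sortinghat): expands a single covariance
# matrix, or a one-element list, into a list of length K.
expand_cov <- function(cov, K) {
  # Wrap a bare matrix in a list so both input forms are handled alike.
  if (!is.list(cov)) cov <- list(cov)
  # A single matrix is shared by all K populations.
  if (length(cov) == 1) cov <- rep(cov, K)
  stopifnot(length(cov) == K)
  cov
}

n <- c(10, 10, 10)
K <- length(n)                      # number of populations inferred from n
shared_cov <- expand_cov(diag(2), K)
length(shared_cov)
```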

The contamination probabilities in epsilon can be given as a numeric vector of length K or as a single value, in which case the value is replicated K times. The same idea applies to the scale contamination in the kappa argument.

By default, epsilon is a vector of zeros, and kappa is a vector of ones. Hence, no contamination is applied by default.
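Because both mixture components share the mean $\mu_k$, the covariance of the $k$th population follows directly from the mixture weights. This is a standard property of mixtures, added here as a supplementary worked step rather than taken from the package documentation:

$$\operatorname{Cov}(X_k) = (1 - \epsilon_k) \Sigma_k + \epsilon_k \kappa_k \Sigma_k = \{1 + \epsilon_k (\kappa_k - 1)\} \Sigma_k.$$

With the defaults $\epsilon_k = 0$ and $\kappa_k = 1$, the factor equals 1 and the population is exactly $N_p(\mu_k, \Sigma_k)$. For instance, $\epsilon_k = 0.05$ and $\kappa_k = 10$ inflate the covariance by a factor of $1 + 0.05 \times 9 = 1.45$.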

Examples

library(sortinghat)

# Generates 10 observations from each of two multivariate contaminated normal
# populations with equal covariance matrices. Each population has a
# contamination probability of 0.05 and scale contamination of 10.
mean_list <- list(c(1, 0), c(0, 1))
cov_identity <- diag(2)
data <- simdata_contaminated(n = c(10, 10), mean = mean_list,
                             cov = cov_identity, epsilon = 0.05, kappa = 10,
                             seed = 42)
dim(data$x)
table(data$y)

# Generates 10 observations from each of three multivariate contaminated
# normal populations with unequal covariance matrices. The contamination
# probabilities and scales differ for each population as well.
set.seed(42)
mean_list <- list(c(-3, -3), c(0, 0), c(3, 3))
cov_list <- list(cov_identity, 2 * cov_identity, 3 * cov_identity)
data2 <- simdata_contaminated(n = c(10, 10, 10), mean = mean_list,
                              cov = cov_list, epsilon = c(0.05, 0.1, 0.2),
                              kappa = c(2, 5, 10))
dim(data2$x)
table(data2$y)
