simdataset: Dataset Simulation

Description

Simulates a dataset of sample size n given the parameters of a finite mixture model with Gaussian components.

Usage

simdataset(n, Pi, Mu, S, n.noise = 0, n.out = 0, alpha = 0.001,
           max.out = 100000, int = NULL, lambda = NULL)

Arguments

n

sample size

Pi

vector of mixing proportions (length K)

Mu

matrix consisting of components' mean vectors (K x p)

S

set of components' covariance matrices (p x p x K)

n.noise

number of noise variables

n.out

number of outlying observations

alpha

level for simulating outliers

max.out

maximum number of trials to simulate outliers

int

interval for noise and outlier generation

lambda

inverse Box-Cox transformation coefficients
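As an illustration of what an inverse Box-Cox coefficient does to one coordinate, here is a minimal sketch in base R. The helper `inv_boxcox` is hypothetical and not part of the package. The shifted form used here (with the trailing `- 1`) is an assumption, chosen because it leaves the data unchanged when the coefficient equals 1. This page does not state the exact formula the package applies.

```r
# Hypothetical helper, NOT a MixSim function: shifted inverse Box-Cox transform.
# Assumption: the "- 1" shift is included so that lambda = 1 is the identity.
inv_boxcox <- function(y, lambda) {
  if (lambda == 0) {
    exp(y) - 1                          # limiting case as lambda -> 0
  } else {
    (lambda * y + 1)^(1 / lambda) - 1   # requires lambda * y + 1 > 0
  }
}

y <- c(0, 0.5, 1, 2)
inv_boxcox(y, lambda = 1)     # returns y unchanged
inv_boxcox(y, lambda = 0.5)   # computes (0.5 * y + 1)^2 - 1
```

Under this assumed form, a coefficient other than 1 skews the corresponding coordinate, so the transformed clusters are no longer Gaussian in that direction.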

Value

X

simulated dataset of dimension (n + n.out) x (p + n.noise); noise coordinates are provided in the last n.noise columns

id

classification vector (length n + n.out); 0 represents an outlier

Details

The function simulates a dataset of n observations from a mixture model with parameters 'Pi' (mixing proportions), 'Mu' (mean vectors), and 'S' (covariance matrices). Mixture component sample sizes are produced as a realization from a multinomial distribution with probabilities given by the mixing proportions.

To make a dataset more challenging for clustering, a user might want to simulate noise variables or outliers. Parameter 'n.noise' specifies the desired number of noise variables. If an interval 'int' is specified, noise will be simulated from a Uniform distribution on the interval given by 'int'. Otherwise, noise will be simulated uniformly between the smallest and largest coordinates of the mean vectors. 'n.out' specifies the number of observations lying outside the (1 - 'alpha') ellipsoidal contours of all mixture components. Outliers are simulated on a hypercube specified by the interval 'int'; at most 'max.out' trials are made to generate them.

A user can apply an inverse Box-Cox transformation by providing a vector of coefficients 'lambda', one per coordinate. The value 1 implies that no transformation is needed for the corresponding coordinate.
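The steps described above can be sketched in base R as follows. This is an illustration of the description, not the package's actual implementation. The parameter values are made up, and two details are assumptions: outliers are accepted by comparing squared Mahalanobis distances with a chi-square quantile, and noise coordinates are drawn the same way for regular observations and outliers.

```r
set.seed(1)

# Illustrative parameters (assumed, not from the package): K = 2, p = 2
Pi <- c(0.3, 0.7)                                   # mixing proportions
Mu <- rbind(c(0, 0), c(1, 1))                       # K x p mean vectors
S  <- array(c(diag(0.01, 2), diag(0.02, 2)),
            dim = c(2, 2, 2))                       # p x p x K covariances
n <- 200; n.noise <- 1; n.out <- 5
alpha <- 0.001; max.out <- 100000; int <- c(0, 1)
K <- length(Pi); p <- ncol(Mu)

# 1. Component sizes: one multinomial draw with probabilities Pi
Nk <- as.vector(rmultinom(1, size = n, prob = Pi))
id <- rep(seq_len(K), times = Nk)

# 2. Gaussian draws per component, using the Cholesky factor of S[, , k]
X <- do.call(rbind, lapply(seq_len(K), function(k) {
  Z <- matrix(rnorm(Nk[k] * p), ncol = p)
  sweep(Z %*% chol(S[, , k]), 2, Mu[k, ], "+")
}))

# 3. Outliers: uniform on the hypercube 'int', kept only if they fall outside
#    the (1 - alpha) ellipsoid of every component (assumed chi-square cutoff)
crit <- qchisq(1 - alpha, df = p)
out <- matrix(numeric(0), ncol = p)
trials <- 0
while (nrow(out) < n.out && trials < max.out) {
  trials <- trials + 1
  x <- runif(p, int[1], int[2])
  d2 <- sapply(seq_len(K), function(k) mahalanobis(x, Mu[k, ], S[, , k]))
  if (all(d2 > crit)) out <- rbind(out, unname(x))
}
X  <- rbind(X, out)
id <- c(id, rep(0, nrow(out)))                      # 0 marks an outlier

# 4. Noise variables: uniform on 'int', appended as the last n.noise columns
noise <- matrix(runif(nrow(X) * n.noise, int[1], int[2]), ncol = n.noise)
X <- cbind(X, noise)

dim(X)      # (n + n.out) rows and (p + n.noise) columns
table(id)   # counts per component, with 0 for outliers
```

The trial cap mirrors the role of 'max.out': if the components cover most of the hypercube, rejection sampling may stop before all requested outliers are found.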

References

Maitra, R. and Melnykov, V. (2010) "Simulating data to study performance of finite mixture modeling and clustering algorithms", Journal of Computational and Graphical Statistics, 19(2), 354-376.

Examples

set.seed(1234)

repeat{
   Q <- MixSim(BarOmega = 0.01, K = 4, p = 2)
   if (Q$fail == 0) break
}

# simulate a dataset of size 500 and add 10 outliers simulated on (0,1)x(0,1)
A <- simdataset(n = 500, Pi = Q$Pi, Mu = Q$Mu, S = Q$S, n.out = 10, int = c(0, 1))
colors <- c("red", "green", "blue", "brown", "magenta")
plot(A$X, xlab = "x1", ylab = "x2", type = "n")
for (k in 0:4){
   points(A$X[A$id == k, ], col = colors[k+1], pch = 19, cex = 0.5)
}

repeat{
   Q <- MixSim(MaxOmega = 0.1, K = 4, p = 1)
   if (Q$fail == 0) break
}

# simulate a dataset of size 300 with 1 noise variable
A <- simdataset(n = 300, Pi = Q$Pi, Mu = Q$Mu, S = Q$S, n.noise = 1)
plot(A$X, xlab = "x1", ylab = "x2", type = "n")
for (k in 1:4){
   points(A$X[A$id == k, ], col = colors[k+1], pch = 19, cex = 0.5)
}