Learn R Programming

MixSim (version 1.0-1)

simdataset: Dataset Simulation

Description

Simulates a datasets of sample size n given parameters of finite mixture model with Gaussian components

Usage

simdataset(n, Pi, Mu, S, n.noise = 0, n.out = 0, alpha = 0.001,
           max.out = 100000, int = NULL, lambda = NULL)

Arguments

n
sample size
Pi
vector of mixing proprtions (length K)
Mu
matrix consisting of components' mean vectors (K x p)
S
set of components' covariance matrices (p x p x K)
n.noise
number of noise variables
n.out
number of outlying observations
alpha
level for simulating outliers
max.out
maximum number of trials to simulate outliers
int
interval for noise and outlier generation
lambda
inverse Box-Cox transformation coefficients

Value

  • Xsimulated dataset (n + n.out) x (p + n.noise); noise coordiantes are provided in the last n.noise columns
  • idclassification vector (length n + n.out); 0 represents an outlier

Details

The function simulates a dataset of n observations from a mixture model with parameters 'Pi' (mixing proportions), 'Mu' (mean vectors), and 'S' (covariance matrices). Mixture component sample sizes are produced as a realization from a multinomial distribution with probabilities given by mixing proportions. To make a dataset more challenging for clustering, a user might want to simulate noise variables or outliers. Parameter 'n.noise' specifies the desired number of noise variables. If an interval 'int' is specified, noise will be simulated from a Uniform distribution on the interval given by 'int'. Otherwise, noise will be simulated uniformly between the smallest and largest coordinates of mean vectors. 'n.out' specifies the number of obervations outside (1 - 'alpha') elipsoidal contours for all mixture components. Outliers are simulated on a hypercube specified by the interval 'int'. A user can apply an inverse Box-Cox transformation providing a vector of coefficients 'lambda'. The value 1 implies that no transformation is needed for the corresponding coordinate.

References

Maitra, R. and Melnykov, V. (2010) "Simulating data to study performance of finite mixture modeling and clustering algorithms", The Journal of Computational and Graphical Statistics, 2:19, 354-376.

See Also

MixSim, overlap, pdplot

Examples

Run this code
set.seed(1234)

repeat{
   Q <- MixSim(BarOmega = 0.01, K = 4, p = 2)
   if (Q$fail == 0) break
}

# simulate a dataset of size 300 and add 10 outliers simulated on (0,1)x(0,1)
A <- simdataset(n = 500, Pi = Q$Pi, Mu = Q$Mu, S = Q$S, n.out = 10, int = c(0, 1))
colors <- c("red", "green", "blue", "brown", "magenta")
plot(A$X, xlab = "x1", ylab = "x2", type = "n")
for (k in 0:4){
   points(A$X[A$id == k, ], col = colors[k+1], pch = 19, cex = 0.5)
}

repeat{
   Q <- MixSim(MaxOmega = 0.1, K = 4, p = 1)
   if (Q$fail == 0) break
}

# simulate a dataset of size 300 with 1 noise variable
A <- simdataset(n = 300, Pi = Q$Pi, Mu = Q$Mu, S = Q$S, n.noise = 1)
plot(A$X, xlab = "x1", ylab = "x2", type = "n")
for (k in 1:4){
   points(A$X[A$id == k, ], col = colors[k+1], pch = 19, cex = 0.5)
}

Run the code above in your browser using DataLab