stream (version 1.5-1)

DSD_Gaussians: Mixture of Gaussians Data Stream Generator

Description

A data stream generator that produces a data stream with a mixture of static Gaussians.

Usage

DSD_Gaussians(
  k = 2,
  d = 2,
  mu,
  sigma,
  p,
  noise = 0,
  noise_range,
  separation_type = c("auto", "Euclidean", "Mahalanobis"),
  separation = 0.2,
  space_limit = c(0.2, 0.8),
  variance_limit = 0.01,
  outliers = 0,
  outlier_options = NULL,
  verbose = FALSE
)

Value

Returns a DSD_Gaussians object (subclass of DSD_R, DSD), which is a list of the defined parameters. The parameters are either passed in through the function call or created internally. They include:

description

A brief description of the DSD object.

k

The number of clusters.

d

The number of dimensions.

mu

The matrix of means of the dimensions in each cluster.

sigma

The covariance matrices, one per cluster.

p

The probability vector for the clusters.

noise

A flag that determines whether noise is generated.

outs

Outlier spatial positions.

outs_pos

Outlier stream positions.

outs_vv

Outlier virtual variance.

Arguments

k

Determines the number of clusters.

d

Determines the number of dimensions.

mu

A matrix of means for each dimension of each cluster.

sigma

A list of length k of covariance matrices.

p

A vector of probabilities that determines the likelihood of generating a data point from a particular cluster.

noise

Noise probability between 0 and 1. Noise is uniformly distributed within noise_range (see below).

noise_range

A matrix with d rows and 2 columns. The first column contains the minimum values and the second column contains the maximum values for noise.

separation_type

The type of separation distance calculation. Accepts "auto" (the default), "Euclidean" (Euclidean norm), or "Mahalanobis" (Mahalanobis distance).

separation

The minimum separation distance between all generated constructs, measured according to the separation_type parameter. When k>0, generated constructs include clusters. When outliers>0, generated constructs include outliers.

space_limit

Defines the space bounds. All constructs are generated inside these bounds. For clusters this means that their centroids must be within these space bounds.

variance_limit

Upper limit for the randomly generated variance when creating cluster covariance matrices.

outliers

Determines the number of data points marked as outliers. Outliers generated by DSD_Gaussians are statistically separated from clusters by enough distance that outlier detectors can find them in the overall data stream. The separation distance between clusters and outliers is determined by the separation and outlier_virtual_variance parameters. The outlier virtual variance defines an empty space around outliers, which separates them from their surroundings. Unlike noise, outliers are data points of interest for end-users, and the goal of outlier detectors is to find them in data streams. For more details, read the "Introduction to stream" vignette.

outlier_options

Effective only when outliers>0. Comprises the following list of options:

  • predefined_outlier_space_positions - (Default=NULL) A predefined list of outlier spatial positions. Similar to mu.

  • predefined_outlier_stream_positions - (Default=NULL) A predefined list of outlier stream positions. Must have the same number of elements as predefined_outlier_space_positions.

  • outlier_horizon - (Default=500) A horizon in the generated data stream, measured in data points, that will contain the requested number of outliers.

  • outlier_virtual_variance - (Default=1) A variance used to create the virtual covariance matrices for outliers. This virtual statistical distribution helps to define an empty space around outliers that separates them from other constructs, both clusters and outliers.

verbose

If TRUE, prints out the cluster and outlier generation process.
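
To make the expected argument shapes concrete, the following base-R sketch builds mu, sigma, p and noise_range for k = 2 clusters in d = 2 dimensions and checks their dimensions. The values are illustrative assumptions, and the layout of mu (one row per cluster) is inferred from the rbind() call in the Examples section. The DSD_Gaussians() call itself is left commented out because it requires the stream package.

```r
k <- 2  # number of clusters
d <- 2  # number of dimensions

# One row of means per cluster (layout assumed from the Examples section)
mu <- rbind(c(-0.5, -0.5),
            c( 0.5,  0.5))

# A list of k covariance matrices, each d x d (values are illustrative)
sigma <- list(diag(0.01, d), diag(0.02, d))

# Cluster probabilities; they should sum to 1
p <- c(0.1, 0.9)

# d rows, 2 columns: minimum in column 1, maximum in column 2
noise_range <- rbind(c(-1, 1),
                     c(-1, 1))

# Basic shape checks before handing the values to the generator
stopifnot(
  nrow(mu) == k, ncol(mu) == d,
  length(sigma) == k,
  all(vapply(sigma, function(s) all(dim(s) == c(d, d)), logical(1))),
  length(p) == k, isTRUE(all.equal(sum(p), 1)),
  all(dim(noise_range) == c(d, 2)),
  all(noise_range[, 1] < noise_range[, 2])
)

# With the stream package loaded, these could then be passed on:
# stream <- DSD_Gaussians(k = k, d = d, mu = mu, sigma = sigma, p = p,
#                         noise = 0.2, noise_range = noise_range)
```

Checking shapes up front is a design choice of this sketch, not something the documentation requires; it simply surfaces mismatches between k, d and the supplied matrices early.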

Author

Michael Hahsler, Dalibor Krleža

Details

DSD_Gaussians creates a mixture of k static clusters and a number of outliers (set by the outliers argument) in a d-dimensional space. The cluster centers mu and the covariance matrices sigma can be supplied or will be randomly generated. The probability vector p defines for each cluster the probability that the next data point will be chosen from it (defaults to equal probability). The outlier spatial positions predefined_outlier_space_positions and the outlier stream positions predefined_outlier_stream_positions can be supplied or will be randomly generated.
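
The role of p, mu and sigma can be illustrated with a small base-R sketch of drawing points from a Gaussian mixture. This is not the package's implementation; sample_mixture() is a hypothetical helper, the parameter values are made up for illustration, and noise and outliers are left out.

```r
set.seed(42)  # reproducible illustration

k <- 2; d <- 2
mu <- rbind(c(-0.5, -0.5), c(0.5, 0.5))      # one row per cluster
sigma <- list(diag(0.01, d), diag(0.01, d))  # one covariance matrix per cluster
p <- c(0.1, 0.9)                             # cluster probabilities

# Hypothetical helper: draw n points from the mixture
sample_mixture <- function(n, mu, sigma, p) {
  d <- ncol(mu)
  # Step 1: choose a cluster for each point according to p
  cluster <- sample(seq_len(nrow(mu)), size = n, replace = TRUE, prob = p)
  # Step 2: draw each point from its cluster's Gaussian.
  # If z ~ N(0, I) and R = chol(S), so that t(R) %*% R = S,
  # then mu + t(R) %*% z ~ N(mu, S).
  points <- t(vapply(cluster, function(i) {
    R <- chol(sigma[[i]])
    as.vector(mu[i, ] + t(R) %*% rnorm(d))
  }, numeric(d)))
  list(points = points, cluster = cluster)
}

s <- sample_mixture(1000, mu, sigma, p)
dim(s$points)            # 1000 rows, 2 columns
table(s$cluster) / 1000  # proportions should be close to p
```

With p = c(0.1, 0.9), roughly one point in ten comes from the first cluster, which is the "different densities" effect used in the second example of the Examples section.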

Separation between generated clusters and outliers can be imposed by using Euclidean or Mahalanobis distance, which is controlled by the separation_type parameter. The separation value is then supplied in the separation parameter.
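
The difference between the two distance types can be seen in a short base-R sketch. The centers and covariance below are illustrative assumptions, and the exact way DSD_Gaussians combines covariance matrices when checking separation is not described here, so this only shows the two distance measures themselves.

```r
a <- c(0.2, 0.2)    # center of a first construct (illustrative)
b <- c(0.6, 0.5)    # center of a second construct (illustrative)
S <- diag(0.01, 2)  # covariance of the first construct (illustrative)

# Euclidean distance between the two centers
d_euclid <- sqrt(sum((a - b)^2))

# Mahalanobis distance of b from the distribution N(a, S);
# stats::mahalanobis() returns the squared distance, hence sqrt()
d_mahal <- sqrt(mahalanobis(b, center = a, cov = S))

d_euclid  # 0.5
d_mahal   # 5
```

The same pair of centers is 0.5 apart in Euclidean terms but 5 apart in Mahalanobis terms, because the latter scales by the cluster's spread. This is why the Examples section uses a much larger separation value, together with a larger space_limit, when separation_type is "Mahalanobis".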

The generation method is similar to the one suggested by Jain and Dubes (1988).

References

Jain and Dubes (1988). Algorithms for Clustering Data. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.

See Also

DSD

Examples

library(stream)

# create data stream with three clusters in a 3-dimensional data space
stream1 <- DSD_Gaussians(k=3, d=3)
plot(stream1)


# create data stream with specified cluster positions,
# 20% noise in a given bounding box and
# with different densities (ratio of 1 to 9 between the two clusters)
stream2 <- DSD_Gaussians(k=2, d=2,
    mu=rbind(c(-.5,-.5), c(.5,.5)),
    noise=0.2, noise_range=rbind(c(-1,1),c(-1,1)),
    p=c(.1,.9))
plot(stream2)

# create 2 clusters and 2 outliers. Clusters and outliers
# are separated by Euclidean distance of 0.5 or more.
stream3 <- DSD_Gaussians(k=2, d=2,
    separation_type="Euclidean", separation=0.5,
    space_limit=c(0,1),
    outliers=2)
plot(stream3)

# create 2 clusters and 2 outliers separated by a Mahalanobis
# distance of 6 or more.
stream4 <- DSD_Gaussians(k=2, d=2,
  separation_type="Mahalanobis", separation=6,
  space_limit=c(0,25), variance_limit=2,
  outliers=2)
plot(stream4)

# spread outliers over 20000 data instances
stream5 <- DSD_Gaussians(k=2, d=2,
  separation_type="Mahalanobis", separation=6,
  space_limit=c(0,45), variance_limit=2,
  outliers=20, outlier_options=list(
    outlier_horizon=20000,
    outlier_virtual_variance = 0.3))
plot(stream5, n=20000)