This function creates a synthetic data stream with data points in roughly \([0, 1]^d\) by choosing points from k clusters, following a sequence through these clusters. Each cluster has a density function following a d-dimensional normal distribution. Outliers are introduced in the test set.
synthetic_stream(k = 10, d = 2, n_subseq = 100, p_transition = 0.5, p_swap = 0,
n_train = 5000, n_test = 1000, p_outlier = 0.01, rangeVar = c(0, 0.005))
k: number of clusters.
d: dimensionality of the data set.
n_subseq: length of the subsequence which will be repeated to create the data set.
p_transition: probability that the next position in the subsequence will belong to a different cluster.
p_swap: probability that two data points are swapped. This represents measurement errors (e.g., data points arrive out of order) or that the data stream does not exactly follow the subsequence.
n_train: size of training set (without outliers).
n_test: size of test set (with outliers).
p_outlier: probability that a data point is replaced by an outlier (a randomly chosen point in \([0,1]^d\)).
rangeVar: used to create the random covariance matrices for the clusters. See genPositiveDefMat() in clusterGeneration for details.
A list with the following elements:
test data.
training data.
sequence of the test data points through the clusters.
sequence of the training data points through the clusters.
index where points are swapped in the test data.
index where points are swapped in the training data.
logical vector for outliers in test data.
centers and covariance matrices for the clusters.
The data generation process creates a data set consisting of k clusters in roughly \([0,1]^d\). The data points for each cluster are drawn from a multivariate normal distribution given a random mean and a random variance/covariance matrix for each cluster. The temporal aspect is modeled by a fixed subsequence (of length n_subseq) through the k clusters. At each step in the subsequence, the next data point moves to a randomly chosen other cluster with probability p_transition and otherwise stays in the same cluster; thus we can create slowly or quickly changing data. For the complete sequence, the subsequence is repeated to create n_test/n_train data points.
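The subsequence construction described above can be sketched in base R. This is only an illustration under stated assumptions, not the function's actual implementation: the variable names `subseq` and `sequence` are hypothetical, the first cluster is assumed to be chosen uniformly at random, and k is assumed to be at least 2.

```r
# Illustrative sketch only; not the package's implementation.
set.seed(1)
k <- 10             # number of clusters (assumed >= 2)
n_subseq <- 100     # length of the repeating subsequence
p_transition <- 0.5 # probability of moving to a different cluster
n <- 1000           # number of positions to generate

# Build one subsequence: start in a random cluster (an assumption), then at
# each step move to a different cluster with probability p_transition.
subseq <- integer(n_subseq)
subseq[1] <- sample.int(k, 1)
for (i in seq_len(n_subseq)[-1]) {
  if (runif(1) < p_transition) {
    others <- setdiff(seq_len(k), subseq[i - 1])
    subseq[i] <- others[sample.int(length(others), 1)]
  } else {
    subseq[i] <- subseq[i - 1]
  }
}

# Repeat the subsequence until n positions are filled.
sequence <- rep(subseq, length.out = n)
```

A small p_transition yields long runs within the same cluster (slowly changing data), while a value close to 1 changes cluster at almost every step.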
The data set is generated by drawing a data point from the cluster corresponding to each position in the sequence. Outliers are introduced by replacing data points in the data set with probability p_outlier by randomly chosen data points in \([0,1]^d\).
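As a rough illustration of this step, the following base R sketch draws one point per sequence position and then replaces some points with uniform outliers. The cluster centers, the diagonal covariance matrices, and the placeholder `sequence` are assumptions made for the example (the function itself relies on genPositiveDefMat() for the covariances), and `draw_point` is a hypothetical helper.

```r
# Illustrative sketch only; not the package's implementation.
set.seed(2)
k <- 3; d <- 2; n <- 500; p_outlier <- 0.01

# Assumed cluster parameters: random centers, small diagonal covariances.
mu <- matrix(runif(k * d, 0.2, 0.8), nrow = k, ncol = d)
Sigma <- lapply(seq_len(k), function(j) diag(runif(d, 0, 0.005), d))

# Placeholder cluster sequence (a real one would follow the subsequence).
sequence <- rep(seq_len(k), length.out = n)

# Draw x = mu + L z, where L is the lower Cholesky factor of Sigma
# and z is a vector of standard normal variates.
draw_point <- function(j) {
  L <- t(chol(Sigma[[j]]))
  as.vector(mu[j, ] + L %*% rnorm(d))
}
x <- t(vapply(sequence, draw_point, numeric(d)))  # n x d matrix

# Replace each point with probability p_outlier by a uniform point in [0,1]^d.
outlier <- runif(n) < p_outlier
x[outlier, ] <- matrix(runif(sum(outlier) * d), ncol = d)
```

The logical vector `outlier` plays the same role as the outlier indicator returned for the test data.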
Finally, to introduce imperfection in the temporal sequence (e.g., because the data does not follow exactly a repeating sequence or because observations do not arrive in the correct order), we swap two consecutive observations with probability p_swap.
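The swapping step might look roughly like the sketch below. It assumes that each position i is selected independently with probability p_swap and exchanged with position i + 1; whether the function handles overlapping swaps this way is not stated in the documentation, and the names `swap` and `sequence` are hypothetical.

```r
# Illustrative sketch only; not the package's implementation.
set.seed(3)
n <- 20; p_swap <- 0.1
sequence <- seq_len(n)  # stand-in for an ordered stream of observations

# Positions i whose observation is exchanged with the one at i + 1.
swap <- which(runif(n - 1) < p_swap)
for (i in swap) {
  sequence[c(i, i + 1)] <- sequence[c(i + 1, i)]
}
```

For a data matrix `x`, the same exchange would be applied to rows, e.g. `x[c(i, i + 1), ] <- x[c(i + 1, i), ]`, so that data points and their cluster labels stay aligned.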
# NOT RUN {
## create only test data (with outliers)
ds <- synthetic_stream(n_train = 0)
## plot test data
plot(ds$test, pch = ds$sequence_test, col = "gray")
text(ds$model$mu[, 1], ds$model$mu[, 2], 1:10)
## mark outliers
points(ds$test[ds$outlier_position, ], pch = 3, lwd = 2, col = "red")
# }