synthetic_stream(k = 10, d = 2, n_subseq = 100, p_transition = 0.5, p_swap = 0,
n_train = 5000, n_test = 1000, p_outlier = 0.01, rangeVar = c(0, 0.005))
genPositiveDefMat()
The generated data points lie in k clusters in roughly $[0,1]^d$. The data
points for each cluster are drawn from a multivariate normal distribution
given a random mean and a random variance/covariance matrix for each cluster.
The temporal aspect is modeled by a fixed subsequence (of length n_subseq)
through the k clusters. In each step of the subsequence, the transition
probability p_transition determines whether the next data point is in the same
cluster or in a randomly chosen other cluster; this makes it possible to create
slowly or fast changing data. For the complete sequence, the subsequence is
repeated to create n_test/n_train data points.
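The construction of the sequence can be sketched in base R as follows. This is an illustration only, not the package's implementation: make_sequence() is a hypothetical helper, and the sketch assumes that p_transition is the probability of moving to a different cluster (the text leaves the direction open).

```r
# Hypothetical helper (not part of the package): builds a fixed subsequence
# of cluster labels and repeats it until n positions are filled.
make_sequence <- function(k = 10, n_subseq = 100, p_transition = 0.5, n = 1000) {
  subseq <- integer(n_subseq)
  subseq[1] <- sample.int(k, 1)              # random start cluster
  for (i in seq_len(n_subseq)[-1]) {
    if (k > 1 && runif(1) < p_transition) {
      # ASSUMPTION: with probability p_transition, jump to another cluster
      others <- setdiff(seq_len(k), subseq[i - 1])
      subseq[i] <- others[sample.int(length(others), 1)]
    } else {
      subseq[i] <- subseq[i - 1]             # stay in the same cluster
    }
  }
  rep(subseq, length.out = n)                # repeat the subsequence
}

set.seed(1)
s <- make_sequence(k = 5, n_subseq = 20, p_transition = 0.5, n = 50)
```

Under this assumption, a small p_transition yields long runs in the same cluster (slowly changing data), and a large value yields frequent jumps.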
The data set is generated by drawing a data point from
the cluster corresponding to each position in the sequence. Outliers are
introduced by replacing data points in the data set with probability
p_outlier by randomly chosen data points in $[0,1]^d$.
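A minimal base-R sketch of the drawing and outlier steps is shown here. It is not the package's code: make_cov() and draw_one() are hypothetical helpers, the random covariance matrices are built from random eigenvalues in rangeVar as a stand-in for genPositiveDefMat(), and drawing the means uniformly from $[0,1]^d$ is an assumption.

```r
set.seed(42)
k <- 3; d <- 2
p_outlier <- 0.05
rangeVar <- c(0, 0.005)

# ASSUMPTION: cluster means are drawn uniformly from [0,1]^d
mu <- matrix(runif(k * d), nrow = k)

# Stand-in for genPositiveDefMat(): Sigma = Q diag(ev) Q^T with a random
# orthogonal Q and eigenvalues drawn uniformly from rangeVar
make_cov <- function(d, rangeVar) {
  Q <- qr.Q(qr(matrix(rnorm(d * d), nrow = d)))
  ev <- runif(d, rangeVar[1], rangeVar[2])
  Q %*% diag(ev, nrow = d) %*% t(Q)
}
Sigma <- lapply(seq_len(k), function(j) make_cov(d, rangeVar))

# a short, fixed cluster sequence used only for this illustration
sequence <- rep(c(1, 2, 3, 3, 2), length.out = 200)

# draw one multivariate normal point for cluster j: x = mu_j + L z,
# where L L^T = Sigma_j and z is standard normal
draw_one <- function(j) {
  e <- eigen(Sigma[[j]], symmetric = TRUE)
  L <- e$vectors %*% diag(sqrt(pmax(e$values, 0)), nrow = d)
  as.vector(mu[j, ] + L %*% rnorm(d))
}
x <- t(vapply(sequence, draw_one, numeric(d)))

# replace each point with probability p_outlier by a uniform point in [0,1]^d
is_outlier <- runif(nrow(x)) < p_outlier
if (any(is_outlier)) {
  x[is_outlier, ] <- matrix(runif(sum(is_outlier) * d), ncol = d)
}
outlier_position <- which(is_outlier)
```

Because the variances are small relative to the unit cube, the clusters stay compact while the outliers scatter over the whole of $[0,1]^d$.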
Finally, to introduce imperfection
in the temporal sequence (e.g., because the data does not follow exactly a
repeating sequence or because observations do not arrive in the correct order),
we swap two consecutive observations with probability p_swap.

## create only test data (with outliers)
library(rEMM)  # package assumed to provide synthetic_stream()
ds <- synthetic_stream(n_train = 0)
## plot test data
plot(ds$test, pch = ds$sequence_test, col ="gray")
text(ds$model$mu[,1], ds$model$mu[,2], 1:10)
## mark outliers
points(ds$test[ds$outlier_position,], pch=3, lwd=2, col="red")
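The swapping of consecutive observations controlled by p_swap can be illustrated with the following sketch. swap_order() is a hypothetical helper, not a function of the package, and the rule that an observation takes part in at most one swap is an assumption made here for simplicity.

```r
# Hypothetical helper: returns a permutation of 1..n in which neighbouring
# positions are exchanged with probability p_swap.
swap_order <- function(n, p_swap) {
  idx <- seq_len(n)
  i <- 1
  while (i < n) {
    if (runif(1) < p_swap) {
      idx[c(i, i + 1)] <- idx[c(i + 1, i)]   # exchange positions i and i + 1
      i <- i + 2   # ASSUMPTION: each observation is swapped at most once
    } else {
      i <- i + 1
    }
  }
  idx
}

set.seed(7)
ord <- swap_order(10, p_swap = 0.3)
```

The resulting index vector can be applied to both the data matrix and the vector of cluster labels (for example x[ord, ] and labels[ord], both placeholder names) so that the two stay aligned.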