DSC_Sample
Extract a Fixedsize Sample from a Data Stream
Extracts a sample form a data stream using Reservoir Sampling. The sample is stored as a set of microclusters to be compatible with other data DSC stream clustering algorithms.
Usage
DSC_Sample(k = 100, biased = FALSE)
Arguments
 k
 the number of points to be sampled from the stream.
 biased
 if
FALSE
then a regular (unbiased) reservoir sampling is used. If true then the sample is biased towards keeping more recent data points (see Details section).
Details
If biased=FALSE
then the reservoir sampling algorithm by McLeod
and Bellhouse (1983) is used. This sampling makes sure that each
data point has the same chance to be sampled. All sampled points
will have a weight of 1. Note that this might not be ideal for an evolving
stream since very old data points have the same chance to be in the sample
as newer points.
If bias=TRUE
then sampling prefers newer points using the modified
reservoir sampling algorithm 2.1 by Aggarwal (2006). New points are always
added. They replace a random point in thre reservoir with a probability
of reservoir size over k
. This an exponential bias function of
$2^{lambda}$ with $lambda=1/k$.
Value

An object of class
DSC_Sample
(subclass of
DSC
, DSC_R
, DSC_Micro
).
References
Vitter, J. S. (1985): Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1), 3757.
McLeod, A.I., Bellhouse, D.R. (1983): A Convenient Algorithm for Drawing a Simple Random Sample. Applied Statistics, 32(2), 182184. Aggarwal C. (2006) On Biased Reservoir Sampling in the Presence of Stream Evolution. International Conference on Very Large Databases (VLDB'06). 607618.
See Also
Examples
stream < DSD_Gaussians(k=3, noise=0.05)
sample < DSC_Sample(k=20)
update(sample, stream, 500)
sample
# plot microclusters
plot(sample, stream)
# reclustering
kmeans < DSC_Kmeans(3)
recluster(kmeans, sample)
plot(kmeans, stream, type="both")
# sample from an evolving stream
stream < DSD_Benchmark(1)
sample < DSC_Sample(k=20)
update(sample, stream, 1000)
plot(sample, stream)
# Note: the clusters move from left to right and the sample keeps many
# outdated points
# use a biased sample to keep more recent data points
stream < DSD_Benchmark(1)
sample < DSC_Sample(k=20, biased=TRUE)
update(sample, stream, 1000)
plot(sample, stream)