A DSD class that reads a data stream from a file or any R connection.
DSD_ReadCSV(file, k=NA, o=NA,
  take=NULL, class=NULL, outlier=NULL, loop=FALSE,
  sep=",", header=FALSE, skip=0, colClasses=NA, ...)
close_stream(dsd)
file: A file/URL or an open connection.
k: Number of true clusters, if known.
o: Number of outliers, if known.
take: Indices of columns to extract from the file.
class: Column index for the class attribute/cluster label. If take is specified, then it also needs to include the class/label column.
outlier: Column index for the outlier mark. If take is specified, then it also needs to include the outlier column.
loop: If enabled, the object will loop through the stream when the end has been reached. If disabled, the object will warn the user upon reaching the end.
sep: The character string that separates dimensions in data points in the stream.
header: Does the first line contain variable names?
skip: The number of lines of the data file to skip before beginning to read data.
colClasses: A vector of classes to be assumed for the columns, passed on to read.table.
...: Further arguments are passed on to read.table. This can, for example, be used for encoding, quotes, etc.
dsd: An object of class DSD_ReadCSV.
An object of class DSD_ReadCSV (subclass of DSD_R, DSD).
DSD_ReadCSV uses read.table() to read data from an R connection. The connection keeps track of the position from which the stream is currently being read. Connections are typically files stored on disk, but many other kinds are possible (see connection).
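The role of the connection can be illustrated with base R alone. The following is a minimal sketch, not the package's implementation: it assumes a small temporary file and shows that an open connection remembers its position, so that successive read.table() calls return consecutive chunks of the data.

```r
# Write a small CSV file to a temporary location (illustrative data only).
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,2", "3,4", "5,6", "7,8"), tmp)

# Open the connection once; it keeps track of the read position.
con <- file(tmp, open = "r")

# Each call continues where the previous one stopped.
first  <- read.table(con, sep = ",", nrows = 2)
second <- read.table(con, sep = ",", nrows = 2)

close(con)
unlink(tmp)
```

Here first holds the first two lines of the file and second holds the next two, because the connection was opened before the first call and was never reopened in between.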
The implementation tries to deal gracefully with slightly corrupted data by dropping points that cannot be read consistently and issuing a warning. This is not always possible, however, in which case an error is raised instead.
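One way such dropping could work is sketched below in base R. This is an assumption for illustration only and is not taken from the package source: the variable names and the expected field count are hypothetical, and lines whose number of fields differs from the expected one are discarded with a warning before parsing.

```r
# Illustrative raw lines; the second and fourth are malformed.
lines <- c("1,2,3", "4,5", "6,7,8", "9,10,11,12")
expected <- 3  # hypothetical number of columns we expect per point

# Count the fields on each line using a text connection.
tc <- textConnection(lines)
n_fields <- count.fields(tc, sep = ",")
close(tc)

# Drop inconsistent lines and tell the user about it.
bad <- n_fields != expected
if (any(bad)) {
  warning(sum(bad), " malformed line(s) dropped", call. = FALSE)
}

# Parse only the consistent lines.
clean <- read.table(text = lines[!bad], sep = ",")
```

A check of this kind handles lines with too few or too many fields, but it cannot catch every form of corruption (for example, a non-numeric value in a numeric column), which is consistent with the caveat above that an error may still occur.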
The position in the file can be reset to the beginning using reset_stream(). The connection can be closed using close_stream().
# NOT RUN {
# creating data and writing it to disk
stream <- DSD_Gaussians(k=3, d=5, outliers=1, space_limit=c(0,2),
  outlier_options = list(outlier_horizon=10))
write_stream(stream, "data.txt", n=10, header = TRUE, sep=",", class=TRUE, write_outliers=TRUE)
# reading the same data back (as a loop)
stream2 <- DSD_ReadCSV("data.txt", k=3, o=1, sep=",", header = TRUE, loop=TRUE,
  class="class", outlier="outlier")
stream2
# get points (first a single point and then 20 using loop)
get_points(stream2)
p <- get_points(stream2, n=20, outlier=TRUE)
message(paste("Outliers",sum(attr(p,"outlier"))))
# clean up
close_stream(stream2)
file.remove("data.txt")
# example with a part of the kddcup1999 data (take only cont. variables)
file <- system.file("examples", "kddcup10000.data.gz", package="stream")
stream <- DSD_ReadCSV(gzfile(file),
  take=c(1, 5, 6, 8:11, 13:20, 23:42), class=42, k=7)
stream
get_points(stream, 5, class = TRUE)
# plot 100 points (projected on the first two principal components)
plot(stream, n=100, method="pc")
close_stream(stream)
# }