Read a Data Stream from File
A DSD class that reads a data stream from a file or any R connection.
DSD_ReadCSV(file, k=NA, take=NULL, class=NULL, loop=FALSE, sep=",", header=FALSE, skip=0, colClasses = NA, ...)
close_stream(dsd)
- file: A file/URL or an open connection.
- k: Number of true clusters, if known.
- take: Indices of columns to extract from the file.
- class: Column index for the class attribute/cluster label. If take is specified, it must also include the class/label column.
- loop: If enabled, the object will loop through the stream when the end has been reached. If disabled, the object will warn the user upon reaching the end.
- sep: The character string that separates dimensions in data points in the stream.
- header: Does the first line contain variable names?
- skip: The number of lines of the data file to skip before beginning to read data.
- colClasses: A vector of classes to be assumed for the columns, passed on to read.table.
- ...: Further arguments are passed on to read.table. This can, for example, be used for encoding, quotes, etc.
- dsd: An object of class DSD_ReadCSV.
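Because sep, header, skip, colClasses and any further arguments are handed to read.table, their effect can be tried out with read.table directly before building a stream. The following is a minimal sketch using base R only; the file contents and the column names x, y and label are made up for illustration.

```r
# Hypothetical file: one line of preamble, a header line, then two data points.
path <- tempfile(fileext = ".csv")
writeLines(c("exported by sensor 7",
             "x;y;label",
             "0.5;1.5;a",
             "2.5;3.5;b"), path)

# skip = 1 drops the preamble, header = TRUE takes column names from the
# next line, and colClasses fixes the type of each column.
dat <- read.table(path, sep = ";", header = TRUE, skip = 1,
                  colClasses = c("numeric", "numeric", "character"))
unlink(path)

print(dat)
str(dat)
```

The same values for sep, header, skip and colClasses would then be given to DSD_ReadCSV.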
DSD_ReadCSV uses read.table() to read in data from an R
connection. The connection is responsible for maintaining where the stream
is currently being read from. In general, the connections will consist of files
stored on disk, but many other kinds of connection are possible (see ?connections).
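The role of the connection can be illustrated with base R alone. In this sketch (not the package's own code; the file contents are made up), two consecutive read.table() calls on the same open connection return consecutive chunks, because the connection remembers the read position between calls.

```r
# Write a small four-line CSV file to a temporary location (hypothetical data).
path <- tempfile(fileext = ".csv")
writeLines(c("1,10", "2,20", "3,30", "4,40"), path)

# Open the connection once; it keeps track of the read position.
con <- file(path, open = "r")

# Each call consumes the next two rows from the open connection.
first_chunk  <- read.table(con, sep = ",", nrows = 2)
second_chunk <- read.table(con, sep = ",", nrows = 2)

close(con)
unlink(path)

print(first_chunk)
print(second_chunk)
```

Passing a file name instead of an open connection would make read.table() open and close the file on every call, so each call would start again from the first line.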
The implementation tries to deal gracefully with slightly corrupted data by dropping points that cannot be read consistently and producing a warning. This is not always possible, however, and in that case an error is raised instead.
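To see in advance whether a file contains such rows, one option is to count the fields on each line with count.fields() from base R. This is only a sketch of a pre-check under the assumption that "corrupted" means a wrong number of fields; it is not how the package itself detects bad points, and the file contents are made up.

```r
# Hypothetical file in which the second line is missing a field.
path <- tempfile(fileext = ".csv")
writeLines(c("1,2,3", "4,5", "6,7,8"), path)

# Number of fields found on each line.
n_fields <- count.fields(path, sep = ",")

# Treat the most common field count as the expected one.
expected <- as.integer(names(which.max(table(n_fields))))

# Line numbers whose field count deviates from the expected one.
bad_lines <- which(n_fields != expected)
unlink(path)

print(n_fields)
print(bad_lines)
```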
The position in the file can be reset to the beginning using
reset_stream(). The connection can be closed using close_stream().
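A short sketch of resetting and closing, assuming the stream package is installed in the version documented here and using only the functions named on this page. The three-line file is made up for illustration.

```r
library(stream)

# Hypothetical three-point data file.
path <- tempfile(fileext = ".csv")
writeLines(c("1,10", "2,20", "3,30"), path)

stream <- DSD_ReadCSV(path, sep = ",")

# Read two points, then rewind to the beginning of the file.
first_two <- get_points(stream, n = 2)
reset_stream(stream)

# After the reset, the next point read is the first point again.
again <- get_points(stream, n = 1)

# Release the underlying connection when done.
close_stream(stream)
unlink(path)

print(first_two)
print(again)
```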
Returns an object of class DSD_ReadCSV.
library(stream)

# creating data and writing it to disk
stream <- DSD_Gaussians(k=3, d=5)
write_stream(stream, "data.txt", n=10, header = TRUE, sep=",")

# reading the same data back (as a loop)
stream2 <- DSD_ReadCSV("data.txt", sep=",", header = TRUE, loop=TRUE)
stream2

# get points (first a single point and then 20 using loop)
get_points(stream2)
get_points(stream2, n=20)

# clean up
close_stream(stream2)
file.remove("data.txt")

# example with a part of the kddcup1999 data (take only cont. variables)
file <- system.file("examples", "kddcup10000.data.gz", package="stream")
stream <- DSD_ReadCSV(gzfile(file), take=c(1, 5, 6, 8:11, 13:20, 23:42), class=42, k=7)
stream
get_points(stream, 5, class = TRUE)

# plot 100 points (projected on the first two principal components)
plot(stream, n=100, method="pc")
close_stream(stream)