DSD_ReadCSV: Read a Data Stream from File

Description

A DSD class that reads a data stream from a file or any R connection.

Usage

DSD_ReadCSV(file, k=NA, o=NA,
  take=NULL, class=NULL, outlier=NULL, loop=FALSE,
  sep=",", header=FALSE, skip=0, colClasses = NA, ...)
close_stream(dsd)

Arguments

file

A file/URL or an open connection.

k

Number of true clusters, if known.

o

Number of outliers, if known.

take

Indices of the columns to extract from the file.

class

Column index of the class attribute/cluster label. If take is specified, it must also include the class/label column.

outlier

Column index of the outlier mark. If take is specified, it must also include the outlier column.

loop

If enabled, the object will loop through the stream when the end has been reached. If disabled, the object will warn the user upon reaching the end.

sep

The character string that separates the dimensions (fields) of each data point in the stream.

header

Does the first line contain variable names?

skip

The number of lines of the data file to skip before beginning to read data.

colClasses

A vector of classes to be assumed for the columns. It is passed on to read.table.

...

Further arguments are passed on to read.table. These can be used, for example, to set the encoding or the quoting characters.

dsd

An object of class DSD_ReadCSV.

Value

An object of class DSD_ReadCSV (subclass of DSD_R, DSD).

Details

DSD_ReadCSV uses read.table() to read data from an R connection. The connection is responsible for keeping track of the current read position in the stream. Typically the connection is a file stored on disk, but many other kinds of connections can be used (see connection).
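The role of the connection can be illustrated with base R alone. The following is a minimal sketch, not the package's implementation: it assumes a small temporary CSV file with made-up values and shows that repeated read.table() calls on an already open connection continue where the previous call stopped.

```r
# Write a small CSV file to a temporary location (illustrative data only).
path <- tempfile(fileext = ".csv")
writeLines(c("1,10", "2,20", "3,30", "4,40", "5,50"), path)

# Open the connection once; it keeps track of the current read position.
con <- file(path, open = "r")

# Each call consumes the next 'nrows' lines from the open connection.
chunk1 <- read.table(con, sep = ",", nrows = 2)
chunk2 <- read.table(con, sep = ",", nrows = 2)

# Release the connection and remove the temporary file.
close(con)
unlink(path)
```

Here chunk1 holds the first two lines of the file and chunk2 the next two, because the connection was opened before the first read and never reopened.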

The implementation tries to deal gracefully with slightly corrupted data by dropping points that cannot be read consistently and producing a warning. However, this is not always possible, in which case an error is raised instead.
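As an illustration of this kind of tolerant reading, the sketch below drops lines whose field count does not match the expected number of columns and issues a warning. It is a base R approximation under assumed inputs (the lines and the expected column count are made up), and it is not how DSD_ReadCSV is implemented internally.

```r
# Illustrative input: the second line is missing one field.
lines <- c("1,2,3", "4,5", "6,7,8")
expected_cols <- 3L  # assumed number of columns per data point

# Count the fields on each line.
tc <- textConnection(lines)
n_fields <- count.fields(tc, sep = ",")
close(tc)

# Keep only the lines with the expected number of fields and warn about the rest.
ok <- n_fields == expected_cols
if (any(!ok)) {
  warning(sum(!ok), " malformed line(s) dropped")
}
clean <- read.table(text = lines[ok], sep = ",")
```

Filtering before parsing keeps the resulting data frame rectangular. Input that is damaged in other ways (for example, unbalanced quotes) may still cause read.table() to fail.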

The position in the file can be reset to the beginning using reset_stream(). The connection can be closed using close_stream().
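What resetting means at the connection level can be sketched in base R. This example assumes a seekable file connection and made-up data. It rewinds the connection with seek() and is only an analogue of reset_stream(), not a description of the package internals.

```r
# Illustrative data in a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c("1,10", "2,20", "3,30"), path)

con <- file(path, open = "r")
first_pass <- read.table(con, sep = ",", nrows = 2)

# Rewind to the beginning; only possible if the connection supports seeking.
if (isSeekable(con)) {
  invisible(seek(con, where = 0))
}
second_pass <- read.table(con, sep = ",", nrows = 2)

# Close the connection once it is no longer needed.
close(con)
unlink(path)
```

After the rewind, second_pass contains the same rows as first_pass. Connections that are not seekable cannot be rewound this way.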

Examples


# NOT RUN {
# creating data and writing it to disk
stream <- DSD_Gaussians(k=3, d=5, outliers=1, space_limit=c(0,2),
  outlier_options = list(outlier_horizon=10))
write_stream(stream, "data.txt", n=10, header = TRUE, sep=",", class=TRUE, write_outliers=TRUE)

# reading the same data back (as a loop)
stream2 <- DSD_ReadCSV("data.txt", k=3, o=1, sep=",", header = TRUE, loop=TRUE,
                       class="class", outlier="outlier")
stream2

# get points (first a single point and then 20 using loop)
get_points(stream2)
p <- get_points(stream2, n=20, outlier=TRUE)
message(paste("Outliers",sum(attr(p,"outlier"))))

# clean up
close_stream(stream2)
file.remove("data.txt")

# example with a part of the kddcup1999 data (take only cont. variables)
file <- system.file("examples", "kddcup10000.data.gz", package="stream")
stream <- DSD_ReadCSV(gzfile(file),
        take=c(1, 5, 6, 8:11, 13:20, 23:42), class=42, k=7)
stream

get_points(stream, 5, class = TRUE)


# plot 100 points (projected on the first two principal components)
plot(stream, n=100, method="pc")

close_stream(stream)
# }
