sample.textmatrix: Create a random sample of files

Description

Creates a subset of the documents of a corpus to help reduce a corpus in size through random sampling.

Usage

sample.textmatrix(textmatrix, samplesize, index.return=FALSE)

Arguments

textmatrix

A document-term matrix.

samplesize

Desired number of files

index.return

if set to true, the positions of the subset in the original column vectors will be returned as well.

Value

filelist

a list of filenames of the documents in the corpus.).

If index.return is set to true, a list is returned; x contains the filenames and ix contains the position of the sample files in the original filelist.

Details

Often a corpus is so big that it cannot be processed in memory. One technique to reduce the size is to select a subset of the documents randomly, assuming that through the random selection the nature of the term sets and distributions will not be changed.

Examples

Run this code

# NOT RUN {
# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/"))
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/"))
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/"))
write( c("dog", "mouse", "dog"), file=paste(td, "D4", sep="/"))

# create matrices
myMatrix = textmatrix(td, minWordLength=1)

sample(myMatrix, 3)

# clean up
unlink(td, recursive=TRUE)

# }