lsa (version 0.73.4)

sample.textmatrix: Create a random sample of files

Description

Creates a subset of the documents of a corpus to help reduce a corpus in size through random sampling.

Usage

sample.textmatrix(textmatrix, samplesize, index.return=FALSE)

Value

filelist

A list of filenames of the documents in the corpus.

ix

If index.return is set to TRUE, a list is returned: x contains the filenames and ix contains the positions of the sampled files in the original filelist.
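
To make the relationship between x and ix concrete, the sketch below uses base R only. The matrix `tm` and the list `res` are made up for illustration and are not produced by lsa; the sketch assumes only that x holds file names and ix their positions, as described above, and that documents are stored as columns.

```r
# Hypothetical term-by-document matrix: terms in rows, documents in columns.
tm <- matrix(
  c(1, 1, 1, 0,
    0, 1, 0, 1,
    1, 0, 2, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("dog", "mouse", "cat"), c("D1", "D2", "D3", "D4"))
)

# Hand-built stand-in for a result with index.return = TRUE (an assumption,
# not actual lsa output): documents D1 and D3 were drawn.
res <- list(x = c("D1", "D3"), ix = c(1L, 3L))

# The positions in ix select the same columns as the names in x.
subset_by_ix   <- tm[, res$ix, drop = FALSE]
subset_by_name <- tm[, res$x, drop = FALSE]
identical(subset_by_ix, subset_by_name)
```

Keeping the positions is useful when the sample has to be mapped back to other per-document data that is stored in the same order as the original columns.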

Arguments

textmatrix

A document-term matrix.

samplesize

Desired number of files in the sample.

index.return

If set to TRUE, the positions of the subset in the original column vectors are returned as well.

Author

Fridolin Wild f.wild@open.ac.uk

Details

Often a corpus is so big that it cannot be processed in memory. One technique to reduce the size is to select a subset of the documents randomly, assuming that through the random selection the nature of the term sets and distributions will not be changed.
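
The idea can be sketched in base R without lsa. The helper `sample_columns` below is hypothetical and is not the package's implementation; it assumes documents are stored as columns and draws them without replacement. The toy corpus is random data generated only for this illustration.

```r
# Hypothetical helper, not part of lsa: draw `samplesize` document columns
# at random, without replacement, keeping them in their original order.
sample_columns <- function(tm, samplesize, index.return = FALSE) {
  stopifnot(samplesize <= ncol(tm))
  ix <- sort(sample.int(ncol(tm), samplesize))
  if (index.return) {
    list(x = colnames(tm)[ix], ix = ix)
  } else {
    colnames(tm)[ix]
  }
}

set.seed(1)  # fixed seed so the draw is repeatable
# Toy corpus: 5 terms, 100 documents, Poisson-distributed counts.
tm <- matrix(rpois(500, lambda = 2), nrow = 5,
             dimnames = list(paste0("term", 1:5), paste0("D", 1:100)))

picked  <- sample_columns(tm, 20, index.return = TRUE)
reduced <- tm[, picked$ix, drop = FALSE]

# Compare relative term frequencies in the full and the reduced corpus.
round(rbind(full   = rowSums(tm) / sum(tm),
            sample = rowSums(reduced) / sum(reduced)), 2)
```

Comparing the two rows gives a rough check of the assumption stated above, namely that random selection leaves the term distribution largely unchanged. With a real corpus the agreement depends on the sample size and on how rare the terms are.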

See Also

textmatrix

Examples

# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/"))
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/"))
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/"))
write( c("dog", "mouse", "dog"), file=paste(td, "D4", sep="/"))

# create matrices
myMatrix = textmatrix(td, minWordLength=1)

# draw a random sample of 3 of the 4 documents
sample.textmatrix(myMatrix, 3)

# clean up
unlink(td, recursive=TRUE)