TextReuseCorpus: TextReuseCorpus

Description

This is the constructor function for a TextReuseCorpus, modeled on the virtual S3 class Corpus from the tm package. The object is a TextReuseCorpus, which is essentially a list containing objects of class TextReuseTextDocument. Arguments are passed along to that constructor function.

To create the corpus, pass exactly one of the following: a character vector of paths to text files using the paths = parameter, a directory containing text files (with any extension) using the dir = parameter, or a character vector of documents using the text = parameter, where each element of the character vector is a document. If the character vector passed to text = has names, those names are used as the document IDs. Otherwise, IDs are assigned to the documents automatically.
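
As a minimal sketch of the text = route described above, the following builds a corpus from a named character vector. The document contents and the names a and b are made up for illustration, and n = 3 is assumed to be forwarded to the tokenizer through the ... argument.

```r
library(textreuse)

# Two illustrative documents; the vector names become the document IDs.
docs <- c(
  a = "The quick brown fox jumps over the lazy dog near the river bank",
  b = "A quick brown fox jumped over a lazy dog by the river bank today"
)

corpus <- TextReuseCorpus(text = docs, tokenizer = tokenize_ngrams, n = 3,
                          progress = FALSE)

is.TextReuseCorpus(corpus)  # check the class of the result
names(corpus)               # document IDs taken from the vector names
corpus[["a"]]               # retrieve a single document by ID
```

Passing an unnamed vector instead should leave ID assignment to the constructor.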

Usage

TextReuseCorpus(
  paths,
  dir = NULL,
  text = NULL,
  meta = list(),
  progress = interactive(),
  tokenizer = tokenize_ngrams,
  ...,
  hash_func = hash_string,
  minhash_func = NULL,
  keep_tokens = FALSE,
  keep_text = TRUE,
  skip_short = TRUE
)

is.TextReuseCorpus(x)

skipped(x)

Arguments

paths

A character vector of paths to files to be opened.

dir

The path to a directory of text files.

text

A character vector (possibly named) of documents.

Details

If skip_short = TRUE, this function will skip very short or empty documents. A very short document is one with too few words to create at least two n-grams. For example, if five-grams are desired, then a document must be at least six words long. If no value of n is provided, the function assumes n = 3, so a document must be at least four words long. A warning is issued with the document ID of each skipped document. Use skipped() to get the IDs of skipped documents.
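
The skipping rule can be illustrated with a small sketch. The two documents and their names (long, short) are hypothetical, and n = 5 is assumed to reach the tokenizer through the ... argument; the warning is suppressed only to keep the output tidy.

```r
library(textreuse)

docs <- c(
  long  = "one two three four five six seven eight",  # 8 words: enough for five-grams
  short = "only three words"                          # 3 words: fewer than the 6 required
)

# skip_short = TRUE is the default, so the short document is dropped
# (normally with a warning naming its ID).
corpus <- suppressWarnings(
  TextReuseCorpus(text = docs, n = 5, progress = FALSE)
)

skipped(corpus)  # IDs of the documents that were dropped
names(corpus)    # IDs of the documents that were kept
```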

This function will use multiple cores on non-Windows machines if the "mc.cores" option is set. For example, to use four cores: options("mc.cores" = 4L).

Examples

library(textreuse)

# Build a corpus from the sample legal texts shipped with the package
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, meta = list("description" = "Field Codes"))

# Subset by position or file name
corpus[[1]]
names(corpus)
corpus[["ca1851-match"]]