readCorpus: Create kRp.corpus objects from text files or data frames

Description

You can either read a corpus from text files (one file per text, also see the Hierarchy section below) or from TIF compliant data frames (see the Data frames section below).

Usage

readCorpus(
  dir,
  hierarchy = list(),
  lang = "kRp.env",
  tagger = "kRp.env",
  encoding = "",
  pattern = NULL,
  recursive = FALSE,
  ignore.case = FALSE,
  mode = "text",
  format = "file",
  mc.cores = getOption("mc.cores", 1L),
  id = "",
  ...
)

Arguments

dir

Either a file path to the root directory of the text corpus, or a TIF compliant data frame. If a directory path (character string), texts can be recursively ordered into subfolders named exactly as defined by hierarchy. If hierarchy is an empty list, all text files located in dir are parsed without a hierarchical structure. If a data frame, also set format="obj" and provide hierarchy levels as additional columns, as described in the Data frames section.

hierarchy

A named list of named character vectors describing the directory hierarchy level by level. If TRUE instead, the hierarchy structure is taken directly from the directory tree. See section Hierarchy for details.

lang

A character string naming the language of the analyzed corpus. See kRp.POS.tags for all supported languages. If set to "kRp.env", the language is fetched via get.kRp.env. This information will also be passed to the readerControl list of the VCorpus call.

tagger

A character string pointing to the tokenizer/tagger command you want to use for basic text analysis. Defaults to tagger="kRp.env" to get the settings by get.kRp.env. Set to "tokenize" to use tokenize.

encoding

Character string describing the current encoding. See DirSource for details, omitted if format="obj".

pattern

A regular expression for file matching. See DirSource for details, omitted if format="obj".

recursive

Logical, indicates whether directories should be read recursively. See DirSource for details, omitted if format="obj".

ignore.case

Logical, indicates whether pattern is matched case insensitively. See DirSource for details, omitted if format="obj".

mode

Character string defining the reading mode. See DirSource for details, omitted if format="obj".

format

Either "file" or "obj", depending on whether you want to scan files or analyze the text in a given object, like a character vector. If the latter and treetag is used as the tagger, texts will be written to temporary files for the process (see dir).

mc.cores

The number of cores to use for parallelization, see mclapply. This value is passed through to simpleCorpus.

id

A character string describing the main subject/purpose of the text corpus.

...

Additional options which are passed through to the defined tagger.

Value

An object of class kRp.corpus.

Hierarchy

To import a hierarchically structured text corpus you must categorize all texts in a directory structure that resembles the hierarchy. If, for example, you would like to import a corpus on two different topics and two different sources, your hierarchy has two nested levels (topic and source). The root directory dir would then need to have two subdirectories (one for each topic), which in turn must each have two subdirectories (one for each source), and the actual text files are found in those.

To use this hierarchical structure in readCorpus, the hierarchy argument is used. It is a named list, where each list item represents one hierarchical level (here again topic and source), and its value is a named character vector describing the actual topics and sources to be used. It is important to understand how these character vectors are treated: The names of elements must exactly match the corresponding subdirectory name, whereas the value is a free text description. The names of the list items, however, describe the hierarchical level and are not matched with directory names.
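The relationship between the directory tree and the hierarchy list can be sketched with base R alone. The topic and source names below (TopicA, SourceX, and so on) and the file name text01.txt are hypothetical placeholders, not part of the package; the commented-out readCorpus call assumes tm.plugin.koRpus and an English language package are installed.

```r
# Build a throwaway directory tree: <root>/<topic>/<source>/<file>.txt
# (all directory and file names here are hypothetical placeholders)
root <- file.path(tempdir(), "corpus_sketch")
for (topic in c("TopicA", "TopicB")) {
  for (src in c("SourceX", "SourceY")) {
    leaf <- file.path(root, topic, src)
    dir.create(leaf, recursive = TRUE, showWarnings = FALSE)
    writeLines("Some example text.", file.path(leaf, "text01.txt"))
  }
}

# List item names ("Topic", "Source") label the levels and are not matched
# against directories; the element names of each vector must match the
# subdirectory names, the values are free text descriptions
hierarchy <- list(
  Topic = c(TopicA = "First topic", TopicB = "Second topic"),
  Source = c(SourceX = "First source", SourceY = "Second source")
)

# Sanity check: every topic/source combination exists as a directory
combos <- expand.grid(
  topic = names(hierarchy$Topic),
  source = names(hierarchy$Source),
  stringsAsFactors = FALSE
)
expected <- file.path(root, combos$topic, combos$source)
all(dir.exists(expected))

# With the packages installed, the tree could then be read like this:
# myCorpus <- readCorpus(dir = root, hierarchy = hierarchy,
#                        tagger = "tokenize", lang = "en")
```

Checking the tree before calling readCorpus is only a convenience for spotting misspelled directory names early.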

Data frames

In order to import a corpus from a data frame, the object must be in Text Interchange Format (TIF) as described by [1]. As a minimum, the data frame must have two character columns, doc_id and text.

You can provide additional information on hierarchical categories by using further columns, where the column name must match the category name (hierarchical level). The order of those columns in the data frame is not important, as you must still fully define the hierarchical structure using the hierarchy argument. All columns you omit are ignored, but the values used in the hierarchy list and the respective columns must match, as rows with unmatched category levels will also be ignored.

Note that the special column names path and file will also be imported automatically.
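As a minimal sketch, a data frame with this layout could be put together as follows. The document IDs and texts are made up for illustration, and the sketch assumes that the values in the category columns correspond to the element names of the hierarchy vectors (for example "Winner") rather than to their free text descriptions; check this against your installed version.

```r
# A small data frame in TIF layout: doc_id and text are character columns,
# Topic and Source carry the hierarchy levels (contents are made up)
myCorpus_df <- data.frame(
  doc_id = c("doc01", "doc02", "doc03", "doc04"),
  text = c(
    "First example text.",
    "Second example text.",
    "Third example text.",
    "Fourth example text."
  ),
  Topic = c("Winner", "Winner", "Edwards", "Edwards"),
  Source = c("Wikipedia_prev", "Wikipedia_new", "Wikipedia_prev", "Wikipedia_new"),
  stringsAsFactors = FALSE
)

# The hierarchy still has to be defined in full
hierarchy <- list(
  Topic = c(Winner = "Reality Winner", Edwards = "Natalie Edwards"),
  Source = c(Wikipedia_prev = "Wikipedia (old)", Wikipedia_new = "Wikipedia (new)")
)

# Assumption: column values are matched against the element names;
# rows flagged FALSE here would be expected to be dropped on import
matched <- myCorpus_df$Topic %in% names(hierarchy$Topic) &
  myCorpus_df$Source %in% names(hierarchy$Source)
matched
```

Such an object would then be passed as dir together with format="obj", as in the data frame example in the Examples section.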

References

[1] Text Interchange Formats (https://github.com/ropensci/tif)

Examples

# NOT RUN {
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  # "flat" corpus, parse all texts in the given dir
  myCorpus <- readCorpus(
    dir=file.path(
      path.package("tm.plugin.koRpus"), "examples", "corpus", "Winner", "Wikipedia_prev"
    ),
    # use tokenize() so examples run without a TreeTagger installation
    tagger="tokenize",
    lang="en"
  )
 
  # corpus with one category named "Source"
  myCorpus <- readCorpus(
    dir=file.path(
      path.package("tm.plugin.koRpus"), "examples", "corpus", "Winner"
    ),
    hierarchy=list(
      Source=c(
        Wikipedia_prev="Wikipedia (old)",
        Wikipedia_new="Wikipedia (new)"
      )
    ),
    tagger="tokenize",
    lang="en"
  )
 
  # two hierarchical levels, "Topic" and "Source"
  myCorpus <- readCorpus(
    dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
    hierarchy=list(
      Topic=c(
        Winner="Reality Winner",
        Edwards="Natalie Edwards"
      ),
      Source=c(
        Wikipedia_prev="Wikipedia (old)",
        Wikipedia_new="Wikipedia (new)"
      )
    ),
    tagger="tokenize",
    lang="en"
  )
 
  # get hierarchy from directory tree
  myCorpus <- readCorpus(
    dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
    hierarchy=TRUE,
    tagger="tokenize",
    lang="en"
  )
  
  
# }
# NOT RUN {
    # if the same corpus is available as TIF compliant data frame
    myCorpus <- readCorpus(
      dir=myCorpus_df,
      hierarchy=list(
        Topic=c(
          Winner="Reality Winner",
          Edwards="Natalie Edwards"
        ),
        Source=c(
          Wikipedia_prev="Wikipedia (old)",
          Wikipedia_new="Wikipedia (new)"
        )
      ),
      lang="en",
      format="obj"
    )
  
# }
# NOT RUN {
} else {}
# }