cwbtools (version 0.1.0)

cwbtools-package: Tools to Create, Modify and Manage CWB Corpora

Description

Tools to Create, Modify and Manage CWB Corpora.

Details

The Corpus Workbench (CWB) offers a classic approach for working with large, linguistically and structurally annotated corpora that ensures memory efficiency and fast query execution (Evert and Hardie 2011). Technically, the design of the CWB (Christ 1994) implements the indexing and compression of corpora suggested by Witten et al. (1999).

The maturity of the CWB and the efficiency of its original C implementation notwithstanding, both the convenience and the flexibility of the traditional CWB command line tools are limited. Restrictions on the portability of code across platforms inhibit the ideal of reproducible research.

The 'cwbtools' package combines portable pure R tools to create indexed corpus files with convenience wrappers for the original C implementation of the CWB as exposed by the RcppCWB package. Additional functionality to add and modify corpus annotations from within R makes working with CWB indexed corpora much more flexible. "Pure R" workflows to enrich corpora with annotations, whether generated with standard NLP tools or manually, can be implemented seamlessly and conveniently.

The cwbtools package is a companion of the RcppCWB and polmineR packages and a building block of an infrastructure to support the combination of quantitative and qualitative approaches when working with textual data.
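To illustrate the division of labor with RcppCWB, here is a minimal sketch of accessing an indexed corpus through RcppCWB's low-level interface. It assumes the sample REUTERS corpus that RcppCWB ships in its extdata directory; treat the exact path and function signature as assumptions rather than guaranteed API.

```r
# Minimal sketch: inspect an indexed corpus via RcppCWB's C-level
# interface. The sample REUTERS corpus and the registry path below
# are assumptions based on what RcppCWB is documented to ship.
library(RcppCWB)

registry <- system.file("extdata", "cwb", "registry", package = "RcppCWB")

# number of tokens of the positional attribute "word"
n_tokens <- cl_attribute_size(
  corpus = "REUTERS", attribute = "word",
  attribute_type = "p", registry = registry
)
n_tokens
```

Workflows built with cwbtools produce corpora that can be queried through exactly this kind of RcppCWB (or polmineR) call.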

References

Christ, Oliver (1994): "A Modular and Flexible Architecture for an Integrated Corpus Query System". Proceedings of COMPLEX'94, pp. 23-32.

Evert, Stefan and Andrew Hardie (2011): "Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium." In: Proceedings of the Corpus Linguistics 2011 conference, University of Birmingham, UK.

Witten, Ian H., Alistair Moffat and Timothy C. Bell (1999): Managing Gigabytes: Compressing and Indexing Documents and Images. 2nd edition. San Francisco et al.: Morgan Kaufmann.

Examples

# NOT RUN {
library(cwbtools)   # add_corpus_positions(), encode_corpusdata()
library(data.table) # as.data.table()

library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
reuters.tm <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))

library(tidytext)
reuters.tibble <- tidy(reuters.tm)
# collapse multi-valued metadata fields into single pipe-separated strings
reuters.tibble[["topics_cat"]] <- sapply(
  reuters.tibble[["topics_cat"]],
  function(x) paste(x, collapse = "|")
)
reuters.tibble[["places"]] <- sapply(
  reuters.tibble[["places"]],
  function(x) paste(x, collapse = "|")
)
reuters.tidy <- unnest_tokens(
  reuters.tibble, output = "word", input = "text", to_lower = FALSE
)

cdata <- list(
  tokenstream = as.data.table(reuters.tidy[, c("id", "word")]),
  metadata = as.data.table(reuters.tibble[, c("id", "topics_cat", "places", "language")])
)
cdata <- add_corpus_positions(cdata)

# use separate temporary directories for the registry and the binary corpus data
registry_dir_tmp <- file.path(normalizePath(tempdir(), winslash = "/"), "registry")
data_dir_tmp <- file.path(normalizePath(tempdir(), winslash = "/"), "data_dir")
dir.create(registry_dir_tmp, showWarnings = FALSE)
dir.create(data_dir_tmp, showWarnings = FALSE)

encode_corpusdata(
  cdata, corpus = "REUTERS", encoding = "utf8",
  registry_dir = registry_dir_tmp, data_dir = data_dir_tmp
)
# }
