decode: Decode corpus or subcorpus.

Description

Decode a corpus or subcorpus and return an object of the class specified by the argument to.

Usage

decode(.Object, ...)
# S4 method for character
decode(.Object, to = c("data.table", "Annotation"),
  ...)
# S4 method for slice
decode(.Object, to = "data.table")
# S4 method for partition
decode(.Object, to = "data.table")

Arguments

.Object

The corpus or subcorpus to decode.

...

Further arguments.

to

The class of the returned object, stated as a length-one character vector.

Value

An object of the class specified by the argument to, i.e. a data.table or an Annotation object.

Details

The primary purpose of the method is type conversion. Once the corpus or subcorpus is available in the format specified by the argument to, the data can be processed with tools that do not rely on the Corpus Workbench (CWB). Supported output formats are a data.table (which can easily be converted to a data.frame or tibble) and an Annotation object as defined in the package NLP. Another purpose of decoding a corpus can be to rework it and re-import it into the CWB (e.g. using the cwbtools package).
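
The conversion to a data.frame mentioned above can be sketched as follows. This is a minimal illustration, not output from a real corpus: the toy table only stands in for the result of decode(), and its column layout (cpos, word, struc, id) is an assumption modelled on the columns used in the examples of this page.

```r
library(data.table)

# Toy stand-in for the data.table returned by decode().
# The columns are an assumption, not guaranteed output of the method.
dt <- data.table(
  cpos = 0:5,
  word = c("Oil", "prices", "rose", "Kuwait", "cut", "output"),
  struc = c(0L, 0L, 0L, 1L, 1L, 1L),
  id = c("127", "127", "127", "144", "144", "144")
)

# Convert to a plain data.frame for tools that do not know data.table
df <- as.data.frame(dt)

# Count tokens per region without using any CWB functionality
tokens_per_struc <- dt[, list(n_tokens = .N), by = "struc"]
```

From here on, the data is an ordinary R table, so it can be handed to any package that accepts a data.frame.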

An earlier version of the method included an option to decode a single s-attribute, which is no longer supported. See the s_attribute_decode function of the package RcppCWB instead.

Examples

# NOT RUN {
use("polmineR")

# Decode corpus as data.table
dt <- decode("GERMAPARLMINI", to = "data.table")

# Decode a subcorpus
sc <- subset(corpus("GERMAPARLMINI"), speaker == "Angela Dorothea Merkel")
dt <- decode(sc, to = "data.table")

# Decode partition
P <- partition("REUTERS", places = "kuwait", regex = TRUE)
dt <- decode(P)

# Previous versions of polmineR offered an option to decode a single
# s-attribute. This is how you could proceed to get a table with metadata.
dt[, "word" := NULL]
dt[,
  list(
    cpos_left = min(.SD[["cpos"]]),
    cpos_right = max(.SD[["cpos"]]),
    id = unique(.SD[["id"]])
  ),
  by = "struc"
]

# }
# NOT RUN {
# Decode subcorpus as Annotation object
if (requireNamespace("NLP")){
  library(NLP)
  p <- subset(corpus("GERMAPARLMINI"), date == "2009-11-10" & speaker == "Angela Dorothea Merkel")
  s <- as(p, "String")
  a <- as(p, "Annotation")
  
  # With an NLP Annotation object, you can use the different annotators of
  # the openNLP package. Here is a short scenario showing how to inspect the
  # tokenized words and the sentences.

  words <- s[a[a$type == "word"]]
  sentences <- s[a[a$type == "sentence"]] # does not yet work perfectly for plenary protocols 
}
# }
