partition: Initialize a partition.

Description

Create a subcorpus stored in an object of the partition class. Counts are performed for the p-attribute defined by the parameter p_attribute.

Usage

partition(.Object, ...)
# S4 method for character
partition(.Object, def = NULL, name = "",
  encoding = NULL, p_attribute = NULL, regex = FALSE, xml = "flat",
  decode = TRUE, type = get_type(.Object), mc = FALSE, verbose = TRUE,
  ...)
# S4 method for list
partition(.Object, ...)
# S4 method for environment
partition(.Object, slots = c("name", "corpus", "size",
  "p_attribute"))
# S4 method for partition
partition(.Object, def = NULL, name = "",
  regex = FALSE, p_attribute = NULL, decode = TRUE, xml = NULL,
  verbose = TRUE, mc = FALSE, ...)
# S4 method for Corpus
partition(.Object, def = NULL, name = "",
  encoding = NULL, regex = FALSE, xml = "flat",
  type = get_type(.Object), verbose = TRUE, ...)

Arguments

.Object

character-vector - the CWB-corpus to be used

...

parameters passed into the partition-method

def

list consisting of a set of character vectors (see details and examples)

name

name of the new partition, defaults to ""

encoding

encoding of the corpus (typically "LATIN1" or "UTF-8"); if NULL, the encoding provided in the registry file of the corpus (charset="...") will be used

p_attribute

the p_attribute(s) for which term frequencies shall be retrieved

regex

logical, whether the values defining the partition are interpreted as regular expressions (defaults to FALSE)

xml

either 'flat' (default) or 'nested'

decode

whether to turn token ids to strings (set FALSE to minimize object.size / memory consumption)

type

character vector (length 1) specifying the type of corpus / partition (e.g. "plpr")

mc

whether to use multicore (for counting terms)

verbose

logical, defaults to TRUE

slots

character vector

Value

An object of the S4 class 'partition'

Details

The function sets up a partition (subcorpus) based on a list of s-attributes with respective values.

The s-attributes defining the partition can be passed in as a list, e.g. list(interjection="speech", year="2013"), or, for convenience, directly.

The values defining the partition may contain regular expressions. To use regular expression syntax, set the parameter regex to TRUE. Regular expressions are passed into grep, i.e. the regex syntax used in R needs to be used (double backslashes etc.). If regular expressions are used, the length of the character vector needs to be 1. If regex is FALSE, the length of the character vectors can be > 1; matching s-attributes are identified with the operator %in%.
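The difference between the two matching modes can be sketched in base R. This is only an illustration of grep versus %in% on a hypothetical vector of s-attribute values; the vectors speakers, wanted and dates are assumptions made for the example, and the code does not reproduce the internals of partition.

```r
# Hypothetical s-attribute values, one per region of a corpus
speakers <- c(
  "Angela Dorothea Merkel", "Volker Kauder",
  "Frank-Walter Steinmeier", "Angela Dorothea Merkel"
)

# regex = FALSE: exact matching with %in%, several values are allowed
wanted <- c("Volker Kauder", "Frank-Walter Steinmeier")
exact_hits <- which(speakers %in% wanted)

# regex = TRUE: a single pattern (length 1), evaluated by grep
regex_hits <- grep(".*Merkel", speakers)

# Backslashes have to be doubled in R string literals
dates <- c("2009-10-28", "2009-11-10", "2010-01-15")
nov_2009 <- grep("^2009-11-\\d{2}$", dates)
```

Here exact_hits selects the regions of the two named speakers, regex_hits selects every region whose speaker value matches the pattern, and nov_2009 selects the dates in November 2009.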

The XML imported into the CWB may be "flat" or "nested". This needs to be indicated with the parameter xml (default is "flat"). If you generate a partition based on a flat XML structure, some performance gain may be achieved when ordering the s-attributes with decreasingly restrictive conditions. If you have a nested XML, it is mandatory that the order of the s-attributes provided reflects the hierarchy of the XML: the top-level elements need to be positioned at the beginning of the list with the s-attributes, the most restrictive elements at the end.

If p_attribute is not NULL, a count of tokens in the corpus will be performed and kept in the stat-slot of the partition-object. The length of the p_attribute character vector may be 1 or more. If two or more p-attributes are provided, the occurrence of combinations will be counted. A typical scenario is to combine the p-attributes "word" or "lemma" and "pos".
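What counting combinations of p-attributes means can be sketched with a small data frame in base R. The token data below is invented for illustration, and the resulting tables only approximate the idea of the stat-slot; the actual layout of that slot is not shown here.

```r
# Hypothetical token stream annotated with two positional attributes
tokens <- data.frame(
  word = c("die", "Rede", "die", "die", "Rede", "folgt"),
  pos  = c("ART", "NN", "PRELS", "ART", "NN", "VVFIN"),
  stringsAsFactors = FALSE
)

# One p-attribute: frequency of each word form
word_counts <- table(tokens$word)

# Two p-attributes: frequency of each (word, pos) combination
combo_counts <- aggregate(
  count ~ word + pos,
  data = cbind(tokens, count = 1L),
  FUN = sum
)
```

With a single p-attribute, "die" is counted as one item; with the combination of word and pos, its occurrences as ART and as PRELS are counted separately.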

Examples

# NOT RUN {
library(polmineR)
use("polmineR")
spd <- partition("GERMAPARLMINI", party = "SPD", interjection = "speech")
kauder <- partition("GERMAPARLMINI", speaker = "Volker Kauder", p_attribute = "word")
merkel <- partition("GERMAPARLMINI", speaker = ".*Merkel", p_attribute = "word", regex = TRUE)
s_attributes(merkel, "date")
s_attributes(merkel, "speaker")
merkel <- partition(
  "GERMAPARLMINI", speaker = "Angela Dorothea Merkel",
  date = "2009-11-10", interjection = "speech", p_attribute = "word"
  )
merkel <- subset(merkel, !word %in% punctuation)
merkel <- subset(merkel, !word %in% tm::stopwords("de"))
   
# a certain defined time segment
days <- seq(
  from = as.Date("2009-10-28"),
  to = as.Date("2009-11-11"),
  by = "1 day"
)
period <- partition("GERMAPARLMINI", date = days)
# }