partition: Initialize a partition

Description

Set up an object of the partition class. Frequency lists are computed and kept in the stat-slot if pAttribute is not NULL.

Usage

partition(.Object, ...)
"partition"(.Object, def = NULL, name = c(""), encoding = NULL, pAttribute = NULL, meta = NULL, regex = FALSE, xml = "flat", id2str = TRUE, type = NULL, mc = FALSE, verbose = TRUE, ...)
"partition"(.Object, ...)
"partition"(.Object)
"partition"(.Object, def = NULL, name = c(""), regex = FALSE, pAttribute = NULL, id2str = TRUE, type = NULL, verbose = TRUE, mc = FALSE, ...)

Arguments

.Object

character-vector - the CWB-corpus to be used

...

parameters passed into the partition-method

def

list consisting of a set of character vectors (see details and examples)

name

name of the new partition, defaults to "noName"

encoding

encoding of the corpus (typically "LATIN1" or "UTF-8"); if NULL, the encoding provided in the registry file of the corpus (charset="...") will be used

pAttribute

the pAttribute(s) for which term frequencies shall be retrieved

Value

An object of the S4 class 'partition'

Details

The function sets up a partition based on a list of s-attributes with their respective values. The s-attributes defining the partition are passed as a list, e.g. list(text_type="speech", text_year="2013"). The values of the list may contain regular expressions. To use regular expression syntax, set the parameter regex to TRUE. Regular expressions are passed into grep, i.e. R's regex syntax needs to be used (double backslashes etc.).
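As a rough illustration of the regex syntax mentioned above, the following sketch uses only base R's grepl on a made-up vector of s-attribute values. The vector speakers and its contents are hypothetical; the sketch shows how R interprets the patterns, not how partition applies them internally.

```r
# Hypothetical s-attribute values, for illustration only
speakers <- c("Angela Dorothea Merkel", "Volker Kauder", "Dr. Angela Merkel")

# ".*Merkel" matches any value that contains "Merkel"
merkel_hits <- grepl(".*Merkel", speakers)

# A literal dot has to be escaped; inside an R string the
# backslash itself is doubled, hence "\\."
dr_hits <- grepl("^Dr\\. ", speakers)

speakers[merkel_hits]
speakers[dr_hits]
```

The same doubled-backslash convention applies to any pattern supplied as an s-attribute value when regex is TRUE.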

The XML imported into the CWB may be "flat" or "nested". This needs to be indicated with the parameter xml (default is "flat"). If you generate a partition based on a flat XML structure, some performance gain may be achieved by ordering the s-attributes from the most restrictive condition to the least restrictive. If you have a nested XML, it is mandatory that the order of the s-attributes provided reflects the hierarchy of the XML: the top-level elements need to be positioned at the beginning of the list of s-attributes, the most restrictive elements at the end.

If pAttribute is not NULL, a count of tokens in the corpus will be performed and kept in the stat-slot of the partition-object. The length of the pAttribute character vector may be 1 or more. If two or more p-attributes are provided, the occurrences of their combinations will be counted. A typical scenario is to combine the p-attributes "word" or "lemma" with "pos".
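To make the idea of counting combinations of p-attributes more concrete, here is a minimal base R sketch on a toy token table. The data frame tokens and the resulting layout are assumptions for illustration; the actual structure of the stat-slot is not described here and may differ.

```r
# Toy token stream with two p-attributes (hypothetical data)
tokens <- data.frame(
  word = c("the", "run", "the", "run", "run"),
  pos  = c("DET", "NOUN", "DET", "VERB", "VERB"),
  stringsAsFactors = FALSE
)

# Cross-tabulate word and pos, then keep only combinations that occur
combo_counts <- as.data.frame(
  table(word = tokens$word, pos = tokens$pos),
  stringsAsFactors = FALSE
)
combo_counts <- combo_counts[combo_counts$Freq > 0, ]

combo_counts
```

Counting the pair rather than the word alone keeps "run" as a noun apart from "run" as a verb, which is the usual motivation for combining "word" or "lemma" with "pos".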

Examples

if (require(polmineR.sampleCorpus) && require(rcqp)){
   use(polmineR.sampleCorpus)
   spd <- partition(
     "PLPRBTTXT", text_party="SPD", text_type="speech"
     )
   kauder <- partition(
     "PLPRBTTXT", text_name="Volker Kauder", pAttribute="word"
     )
   merkel <- partition(
     "PLPRBTTXT", text_name=".*Merkel",
     pAttribute="word", regex=TRUE
     )
   sAttributes(merkel, "text_date")
   sAttributes(merkel, "text_name")
   merkel <- partition(
     "PLPRBTTXT", text_name="Angela Dorothea Merkel",
     text_date="2009-11-10", text_type="speech", pAttribute="word"
     )
   merkel <- subset(merkel, !word %in% punctuation)
   merkel <- subset(merkel, !word %in% tm::stopwords("de"))
}