

Rcpp bindings for the Corpus Workbench (CWB)

The package exposes functions of the Corpus Workbench (CWB) by way of Rcpp wrappers. Furthermore, the package includes Rcpp code for performance-critical operations. The main purpose of the package is to serve as an interface to the CWB for the package polmineR.

There is a huge intellectual debt to the developers of the R package ‘rcqp’, Bernard Desgraupes and Sylvain Loiseau. The main impetus for developing RcppCWB is that using Rcpp makes it easier to maintain the package, to expand the CWB functionality exposed, and, most importantly, to make it portable to Windows systems.

Installation on Windows

Pre-compiled ‘RcppCWB’ binaries can be installed from CRAN.

install.packages("RcppCWB")

If you want to get the development version, you need to compile RcppCWB yourself. Having Rtools installed on your system is necessary. Using the mechanism offered by the devtools package, you can install RcppCWB from GitHub.

if (!"devtools" %in% installed.packages()[,"Package"]) install.packages("devtools")
devtools::install_github("PolMine/RcppCWB")

During the installation, cross-compiled versions of the corpus library (CL) are downloaded from the GitHub repository PolMine/libcl. The libcl repository also includes a reproducible workflow using Docker to build static libraries from the CWB source code.

Installation on Ubuntu

The package includes the source code of the Corpus Workbench (CWB), slightly modified to make it compatible with R requirements. Compiling the CWB requires the pcre2 and GLib libraries to be present. On Ubuntu/Debian, running the following APT command from the shell will fulfill these dependencies.

sudo apt-get install libpcre2-dev libglib2.0-dev

Then use the conventional R installation mechanism to install the R dependencies and the CRAN release of RcppCWB.

install.packages(pkgs = c("Rcpp", "knitr", "testthat"))
install.packages("RcppCWB")

To install the development version, using the mechanism offered by the devtools package is recommended.

if (!"devtools" %in% installed.packages()[,"Package"]) install.packages("devtools")
devtools::install_github("PolMine/RcppCWB", ref = "dev")

Installation on macOS

On macOS, the pcre2 and GLib libraries need to be present. We recommend using ‘Homebrew’ as a package manager for macOS. To install Homebrew, follow the instructions on the Homebrew website. It may also be necessary to install Xcode and XQuartz.

The following commands then need to be executed from a terminal window. They will install the C libraries the CWB relies on:

brew -v install pkg-config
brew -v install glib --universal
brew -v install pcre2 --universal
brew -v install readline

Then open R and use the conventional R installation mechanism to install the dependencies and the CRAN release of RcppCWB.

install.packages(pkgs = c("Rcpp", "knitr", "testthat"))
install.packages("RcppCWB")

To install the development version, using the mechanism offered by the devtools package is recommended.

if (!"devtools" %in% installed.packages()[,"Package"]) install.packages("devtools")
devtools::install_github("PolMine/RcppCWB")
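Whichever platform you install on, a quick check from within R can confirm that the package is available. The following is a minimal sketch using only base R functions; the variable name `rcppcwb_installed` is chosen for illustration.

```r
# Check whether RcppCWB can be loaded, without attaching it
rcppcwb_installed <- requireNamespace("RcppCWB", quietly = TRUE)

if (rcppcwb_installed) {
  # Report the version that was installed
  message("RcppCWB version: ", as.character(utils::packageVersion("RcppCWB")))
} else {
  message("RcppCWB is not installed; see the installation instructions above.")
}
```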

Usage

The package offers low-level access to CWB-indexed corpora. Using RcppCWB may not be intuitive at the outset: it is designed to serve as an efficient backend for packages offering higher-level functionality, such as polmineR.

RcppCWB includes a small sample corpus called ‘REUTERS’. After loading the package, we need to determine whether we can use the registry describing the corpus within the package, or whether we need to work with a temporary registry.

library(RcppCWB)
registry <- use_tmp_registry()

To start with, we get the number of tokens of the corpus.

cpos_total <- cl_attribute_size(
  corpus = "REUTERS", attribute = "word",
  attribute_type = "p", registry = registry
)
cpos_total
## [1] 4050

Next, we decode the token stream of the corpus. Corpus positions are zero-based, so they run from 0 to the corpus size minus 1.

token_stream_str <- cl_cpos2str(
  corpus = "REUTERS", p_attribute = "word",
  cpos = seq.int(from = 0, to = cpos_total - 1),
  registry = registry
  )
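To inspect the result without printing the whole corpus, it can help to decode only a short range of corpus positions. The sketch below assumes the REUTERS sample corpus and the temporary registry used above; the range `0:19` and the name `first_tokens` are arbitrary choices for illustration.

```r
library(RcppCWB)
registry <- use_tmp_registry()

# Decode the first 20 tokens (corpus positions 0 to 19)
first_tokens <- cl_cpos2str(
  corpus = "REUTERS", p_attribute = "word",
  cpos = 0:19, registry = registry
)

# Collapse the tokens into a single string for reading
paste(first_tokens, collapse = " ")
```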

To get the corpus positions of a token, we first look up the id of the token and then the positions for that id.

token_to_get <- "oil"
id_oil <- cl_str2id(corpus = "REUTERS", p_attribute = "word", str = token_to_get, registry = registry)
cpos_oil <- cl_id2cpos(corpus = "REUTERS", p_attribute = "word", id = id_oil, registry = registry)
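Corpus positions become more useful once they are turned back into text. As a rough sketch, the positions of "oil" can be expanded into small context windows and decoded with `cl_cpos2str()`. The window size of 3 tokens and the names `window`, `corpus_size` and `kwic` are assumptions made for this example, not part of the package.

```r
library(RcppCWB)
registry <- use_tmp_registry()

# Look up the id of the token and all corpus positions for that id
id_oil <- cl_str2id(corpus = "REUTERS", p_attribute = "word", str = "oil", registry = registry)
cpos_oil <- cl_id2cpos(corpus = "REUTERS", p_attribute = "word", id = id_oil, registry = registry)

# The corpus size is needed to keep windows within valid corpus positions
corpus_size <- cl_attribute_size(
  corpus = "REUTERS", attribute = "word",
  attribute_type = "p", registry = registry
)

window <- 3L # hypothetical number of context tokens on each side

# Decode a context window for the first five occurrences
kwic <- vapply(
  head(cpos_oil, 5),
  function(cpos) {
    left <- max(cpos - window, 0L)
    right <- min(cpos + window, corpus_size - 1L)
    tokens <- cl_cpos2str(
      corpus = "REUTERS", p_attribute = "word",
      cpos = seq.int(left, right), registry = registry
    )
    paste(tokens, collapse = " ")
  },
  character(1)
)
kwic
```

Higher-level packages such as polmineR offer keyword-in-context displays directly; the sketch only shows how the low-level calls fit together.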

Next, we get the frequency of the token.

oil_freq <- cl_id2freq(corpus = "REUTERS", p_attribute = "word", id = id_oil, registry = registry)
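The frequency of a token and its corpus positions are two views of the same index, so the number of positions should equal the frequency. The following sketch checks this for "oil", repeating the lookups so that it runs on its own.

```r
library(RcppCWB)
registry <- use_tmp_registry()

id_oil <- cl_str2id(corpus = "REUTERS", p_attribute = "word", str = "oil", registry = registry)
cpos_oil <- cl_id2cpos(corpus = "REUTERS", p_attribute = "word", id = id_oil, registry = registry)
oil_freq <- cl_id2freq(corpus = "REUTERS", p_attribute = "word", id = id_oil, registry = registry)

# The number of corpus positions should match the reported frequency
length(cpos_oil) == oil_freq
```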

Regular expressions can be used to look up the ids of all matching tokens, which can then be decoded.

ids <- cl_regex2id(corpus = "REUTERS", p_attribute = "word", regex = "M.*", registry = registry)
m_words <- cl_id2str(corpus = "REUTERS", p_attribute = "word", id = ids, registry = registry)
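Because the functions accept vectors of ids, the matches of a regular expression can be combined with their frequencies in one step. The sketch below builds a small frequency table for the words matching "M.*"; the data frame `m_freq` and its column names are chosen for illustration.

```r
library(RcppCWB)
registry <- use_tmp_registry()

# Ids of all tokens matching the regular expression
ids <- cl_regex2id(corpus = "REUTERS", p_attribute = "word", regex = "M.*", registry = registry)

# Decode the ids and look up their frequencies
m_freq <- data.frame(
  word = cl_id2str(corpus = "REUTERS", p_attribute = "word", id = ids, registry = registry),
  freq = cl_id2freq(corpus = "REUTERS", p_attribute = "word", id = ids, registry = registry),
  stringsAsFactors = FALSE
)

# Sort by descending frequency and show the top rows
m_freq <- m_freq[order(m_freq$freq, decreasing = TRUE), ]
head(m_freq)
```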

To use the CQP syntax, we need to initialize CQP first.

cqp_initialize(registry = registry)
## Warning in cqp_initialize(registry = registry): CQP has already been
## initialized. Re-initialization is not possible. Only resetting registry.

## [1] TRUE
cqp_query(corpus = "REUTERS", query = '"crude" "oil"')
## <pointer: 0x103757ca0>
cpos <- cqp_dump_subcorpus(corpus = "REUTERS")
cpos
##       [,1] [,2]
##  [1,]   14   15
##  [2,]   56   57
##  [3,]  548  549
##  [4,]  584  585
##  [5,]  607  608
##  [6,] 2497 2498
##  [7,] 2842 2843
##  [8,] 2891 2892
##  [9,] 2928 2929
## [10,] 3644 3645
## [11,] 3709 3710
## [12,] 3998 3999
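Each row of the matrix holds the first and the last corpus position of one match. As a minimal sketch, the matches can be decoded by expanding each row into a sequence of positions and passing it to `cl_cpos2str()`. The name `matches` is an assumption made for this example; if CQP has already been initialized, `cqp_initialize()` issues the warning shown above and only resets the registry.

```r
library(RcppCWB)
registry <- use_tmp_registry()
cqp_initialize(registry = registry)

# Run the query and retrieve the matrix of match regions
cqp_query(corpus = "REUTERS", query = '"crude" "oil"')
cpos <- cqp_dump_subcorpus(corpus = "REUTERS")

# Decode every region (row) into the matched string
matches <- apply(cpos, 1, function(region) {
  tokens <- cl_cpos2str(
    corpus = "REUTERS", p_attribute = "word",
    cpos = seq.int(region[1], region[2]), registry = registry
  )
  paste(tokens, collapse = " ")
})
matches
```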

License

The package is licensed under the GNU General Public License 3. For the copyrights for the ‘Corpus Workbench’ (CWB) and acknowledgement of authorship, see the file COPYRIGHTS.

Acknowledgements

There is a huge intellectual debt to the developers of the R-package ‘rcqp’, Bernard Desgraupes and Sylvain Loiseau. Developing RcppCWB would have been unthinkable without their original work to wrap the CWB into an R package.

The CWB is a classic and mature tool: The work of the CWB developers, Oliver Christ, Bruno Maximilian Schulze, Arne Fitschen and Stefan Evert is gratefully acknowledged.

Version: 0.6.5
License: GPL-3
Maintainer: Andreas Blaette
Last Published: September 23rd, 2024
Monthly Downloads: 811

Functions in RcppCWB (0.6.5)

region_matrix_to_struc_matrix
  Get min and max strucs of s-attribute present in region

cl_find_corpus
  Load corpus.

attribute_size
  Rcpp wrappers for CWB Corpus Library functions

cqp_query
  Execute CQP Query and Retrieve Results.

cwb_charsets
  Character sets supported by CWB

s_attr_is_descendent
  Explore XML structure of CWB corpus

s_attribute_decode
  Decode Structural Attribute.

CL: s_attributes
  Using Structural Attributes.

get_count_vector
  Get Vector with Counts for Positional Attribute.

cwb_version
  Get CWB version

cl_list_corpora
  Show CL corpora

cwb_encode
  CWB Tools for Creating Corpora

cl_lexicon_size
  Get Lexicon Size.

subcorpus_get_ranges
  Get ranges of subcorpus

matrix_to_subcorpus
  Create CWB subcorpus from matrix with regions.

use_tmp_registry
  Use Temporary Registry

ids_to_count_matrix
  Perform Count for Vector of IDs.

corpus_data_dir
  Get information from registry file

s_attr_regions
  Get regions defined by a structural attribute

cl_charset_name
  Get charset of a corpus.

check
  Check Input to Rcpp Functions.

RcppCWB-package
  Rcpp Bindings for the Corpus Workbench (CWB).

cl_attribute_size
  Get Attribute Size (of Positional/Structural Attribute).

check_pkg_registry_files
  Check Paths in Registry Files

cl_rework
  Low-level CL access.

cl_load_corpus
  Load corpus

cl_delete_corpus
  Drop loaded corpus.

get_pkg_registry
  Get Registry Directory Within Package

get_region_matrix
  Get Matrix with Regions for Strucs.

cqp_initialize
  Initialize Corpus Query Processor (CQP).

region_matrix_ops
  Get IDs and Counts for Region Matrices.

cqp_list_corpora
  List Available CWB Corpora.

get_cbow_matrix
  Get CBOW Matrix.

p_attr_default
  Get default p-attribute

CL: p_attributes
  Using Positional Attributes.

cl_struc_values
  Check whether structural attribute has values

corpus_is_loaded
  Check whether corpus is loaded