polmineR

Purpose: The focus of the package 'polmineR' is the interactive analysis of corpora using R. Core objectives for the development of the package are performance, usability, and a modular design.

Aims: Key aims for developing the package are:

  • To keep the original text accessible. A seamless integration of qualitative and quantitative steps in corpus analysis supports validation, based on inspecting the text behind the numbers.

  • To provide a library with standard tasks. It is an open source platform that will make text mining more productive, avoiding the prohibitive cost of reimplementing basics, or of running many lines of code to perform a basic task.

  • To create a package that makes the creation and analysis of subcorpora ('partitions') easy. A particular strength of the package is to support contrastive/comparative research.

  • To offer performance for users with an ordinary infrastructure. The package picks up the idea of a three-tier software design. Corpus data are managed and indexed by using the Open Corpus Workbench (CWB). The CWB is particularly efficient for storing large corpora and offers a powerful language for querying corpora, the Corpus Query Processor (CQP).

  • To support sharing consolidated and documented data, following the ideas of reproducible research.

Background: The polmineR-package was specifically developed to make full use of the XML annotation structure of the corpora created in the PolMine project (see polmine.sowi.uni-due.de). The core PolMine corpora are corpora of plenary protocols. In these corpora, speakers, parties etc. are structurally annotated. The polmineR-package is meant to help make full use of this rich annotation structure.

Core Functions

  • partition: Set up a partition (i.e. subcorpus);
  • count: Count features;
  • dispersion: Analyse the dispersion of a query across one or two dimensions (absolute and relative frequencies);
  • cooccurrences: Analyse the context of a query (including some statistics);
  • features: Compare partitions to identify features / keywords (using statistical tests such as chi square).

Installation

Windows (32 bit / i386)

At this stage, an easy way to install polmineR is available only for 32bit R. Usually, an R installation on Windows includes both 32bit and 64bit R. To keep things simple, make sure that you work with the 32bit version. If you work with RStudio (highly recommended), the menu Tools > Global Options opens a dialogue where you can choose 32bit R.
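Whether the running session is a 32bit or a 64bit build can also be checked from the R console. The following is a minimal sketch using base R only; the helper name `r_bitness` is made up for illustration.

```r
# Report whether the current R session is a 32bit or a 64bit build.
# .Machine$sizeof.pointer is 4 on 32bit builds and 8 on 64bit builds.
r_bitness <- function() {
  if (.Machine$sizeof.pointer == 4L) "32bit" else "64bit"
}

r_bitness()
R.version$arch # architecture string of the running build
```

If `r_bitness()` reports "64bit" although 32bit R is intended, the RStudio setting described above is the first thing to check.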

Before installing polmineR, the package 'rcqp' needs to be installed. In turn, rcqp requires plyr, which should be installed first.

install.packages("plyr")

To avoid compiling C code in a package, packages with compiled binaries are very handy. Windows binaries for the rcqp package are not available at CRAN, but can be installed from a repository of packages maintained at the server of the PolMine project:

install.packages("rcqp", repos = "http://polmine.sowi.uni-due.de/packages", type = "win.binary")

To explain: Compiling the C code in the rcqp package on a Windows machine is not yet possible. The package we offer uses a cross-compilation of these C libraries, i.e. binaries that have been prepared for Windows on a MacOS/Linux machine.

Before proceeding to install polmineR, we install dependencies that are not installed automatically.

install.packages(pkgs = c("htmltools", "htmlwidgets", "magrittr", "iterators", "NLP"))
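To avoid reinstalling packages that are already present, the list can be filtered first. This is a sketch only: the helper `missing_packages` is hypothetical, and the actual installation call is left commented out.

```r
# Packages named in the instructions above.
needed <- c("htmltools", "htmlwidgets", "magrittr", "iterators", "NLP")

# Hypothetical helper: return those packages that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install) # uncomment to install the missing packages
}
```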

The latest stable version of polmineR can now be installed from CRAN. Several other packages that polmineR depends on, or that its dependencies depend on, may be installed automatically.

install.packages("polmineR")

The development version of the package, which may include the most recent updates and features, can be installed from GitHub. The easiest way to do this is to use a mechanism offered by the package devtools.

install.packages("devtools")
devtools::install_github("PolMine/polmineR", ref = "dev")

The installation may throw warnings. There are three warnings you can ignore at this stage:

  • "WARNING: this package has a configure script / It probably needs manual configuration".
  • The environment variable CORPUS_REGISTRY is not defined.
  • package 'rcqp' is not installed for 'arch = x64'.

The configure script is for Linux/MacOS installation; its sole purpose is to pass tests for uploading the package to CRAN. As mentioned, Windows binaries are not yet available for 64bit R, so the warning about 'arch = x64' can be ignored. The environment variable "CORPUS_REGISTRY" can be set as follows in R:

Sys.setenv(CORPUS_REGISTRY = "C:/PATH/TO/YOUR/REGISTRY")

To set the environment variable CORPUS_REGISTRY permanently, add it to the file '.Renviron' or 'Renviron.site'. The help page for the startup process (?Startup) explains where R looks for these files.
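As an illustration of the permanent setting, the following sketch appends a CORPUS_REGISTRY entry to an Renviron file. The helper name `set_registry_permanently` and the path "C:/cwb/registry" are assumptions made for this example; the demonstration writes to a temporary file so that no real settings are touched.

```r
# Hypothetical helper: append a CORPUS_REGISTRY entry to an Renviron file.
# 'renviron' defaults to the user-level file; pass another path to try it out.
set_registry_permanently <- function(registry,
                                     renviron = path.expand("~/.Renviron")) {
  line <- paste0("CORPUS_REGISTRY=", registry)
  cat(line, "\n", sep = "", file = renviron, append = TRUE)
  invisible(renviron)
}

# Demonstration with a temporary file (the registry path is an assumed example).
demo_file <- tempfile(fileext = ".Renviron")
set_registry_permanently("C:/cwb/registry", renviron = demo_file)

readRenviron(demo_file) # load the entry into the current session
Sys.getenv("CORPUS_REGISTRY")
```

Forward slashes are used in the path on purpose, in line with the Sys.setenv() call above. An entry written to the real '.Renviron' takes effect when R is restarted.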

Two important notes concerning problems with the CORPUS_REGISTRY environment variable that may cause serious headaches:

  • The path cannot be processed if there is any whitespace in the path pointing to the registry. Whitespace may occur in the user name ("C:/Users/Donald Duck/Documents"), for instance. We do not yet know of any workaround to make rcqp/CWB process whitespace. The recommendation is to keep the registry and the indexed_corpora in a directory at a path without whitespace (such as "C:/cwb").

  • If you keep data on a different volume than your system files, R packages etc. (e.g. volume 'C:' for system files, and 'D:' for data and user files), make sure that the working directory is set (using setwd()) to a directory on the volume that holds the directory defined via CORPUS_REGISTRY. CWB/rcqp will assume that the CORPUS_REGISTRY directory is on the same volume as the current working directory (which can be identified by calling getwd()).
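Both pitfalls can be checked before starting to work. The function below is a hypothetical helper written in base R, not part of polmineR or rcqp; it only inspects the path strings and does not touch the file system.

```r
# Hypothetical helper: check a registry path against the two pitfalls above.
check_registry_path <- function(registry, wd = getwd()) {
  # Drive letter of a Windows-style path ("C:/..." gives "C"), NA otherwise.
  drive <- function(p) {
    if (grepl("^[A-Za-z]:", p)) toupper(substr(p, 1, 1)) else NA_character_
  }
  has_whitespace <- grepl("[[:space:]]", registry)
  # Paths without drive letters (non-Windows) are treated as the same volume.
  same_volume <- identical(drive(registry), drive(wd))
  c(no_whitespace = !has_whitespace, same_volume = same_volume)
}

# Whitespace in the user name: first check fails.
check_registry_path("C:/Users/Donald Duck/Documents/registry", wd = "C:/Users")
# Registry on 'C:', working directory on 'D:': second check fails.
check_registry_path("C:/cwb/registry", wd = "D:/data")
# Both checks pass.
check_registry_path("C:/cwb/registry", wd = "C:/Users")
```

The paths other than "C:/Users/Donald Duck/Documents" and "C:/cwb" are invented for the demonstration.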

Finally: polmineR is optimized for working with RStudio. If you work with 32bit R, you may have to check in the settings of RStudio that it will call 32bit R. To be sure, check the startup message.

If everything works, check whether polmineR can be loaded.

library(polmineR)
corpus() # to see corpora available on your system
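If no corpora show up, it can help to look at the registry directory directly, since a CWB registry holds one file per indexed corpus. The helper `registry_files` below is a hypothetical base R sketch and does not depend on polmineR.

```r
# Hypothetical helper: list the files in the registry directory.
# In a CWB registry, each indexed corpus is described by one file.
registry_files <- function(registry = Sys.getenv("CORPUS_REGISTRY")) {
  if (!nzchar(registry)) {
    message("CORPUS_REGISTRY is not set")
    return(character(0))
  }
  if (!dir.exists(registry)) {
    message("registry directory does not exist: ", registry)
    return(character(0))
  }
  list.files(registry)
}

registry_files()
```

An empty result means that the environment variable is unset, points to a missing directory, or that the registry does not describe any corpus yet.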

Windows (64 bit / x64)

At this stage, 64 bit support is still experimental. Apart from an installation of 64 bit R, you will need to install Rtools, available here. Rtools is a collection of tools necessary to build and compile R packages on a Windows machine.
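A quick way to see whether build tools are visible to R is to look them up on the PATH. This is a rough sketch: it assumes that the Rtools installation exposes the programs `make` and `gcc`, and it does not verify that their versions match the installed R.

```r
# Look up build tools on the PATH; an empty string means "not found".
build_tools <- Sys.which(c("make", "gcc"))
build_tools

# TRUE for each tool that was found.
tools_found <- build_tools != ""
tools_found
```

If a tool is reported as missing after installing Rtools, the PATH setting is the most likely cause.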

To interface to a core C library of the Corpus Workbench (CWB), you will need an installation of a 64 bit AND a 32 bit version of the CWB.

The "official" 32 bit version of the CWB is available here. Installation instructions are available at the CWB Website. The 32 bit version should be installed in the directory "C:Files", with admin rights.

The 64 bit version, prepared by Andreas Blaette, is available here. Install this 64 bit CWB version to "C:Files (x86)". In the unzipped downloaded zip file, you will find a bat file that will do the installation. Take care that you run the file with administrator rights. Without these rights, no files will be copied.

The interface to the Corpus Workbench is the package polmineR.Rcpp, available at GitHub. If you use git, you can clone that repository. Otherwise, you can download a zip file.

The downloaded zip file needs to be unzipped again. Then, in the directory with the 'polmineR.Rcpp'-directory, run:

R CMD build polmineR.Rcpp
R CMD INSTALL polmineR.Rcpp_0.1.0.tar.gz

If you read closely what is going on during the compilation, you will see a few warnings that libraries are not found. If creating the package is not aborted, nothing is wrong. R CMD build will look for the 64 bit files in the directory with the 32 bit dlls first and discover that they do not work for 64 bit, only then will it move to the correct location.

Once polmineR.Rcpp is installed, proceed with the instructions for installing polmineR in a 32 bit context. Future binary releases of the polmineR.Rcpp package may make things easier. In any case, the proof of concept is there that polmineR will work on a 64 bit Windows machine too.

Finally, you need to make sure that polmineR will interface to CWB indexed corpora using polmineR.Rcpp, and not with rcqp (the default). To set the interface accordingly:

setCorpusWorkbenchInterface("Rcpp")

To test whether corpora are available:

corpus()

MacOS

The following instructions for Mac users assume that R is installed on your system. Binaries are available from the Homepage of the R Project. An installation of RStudio is highly recommended. The Open Source License version of RStudio Desktop is what you need.

Installing 'polmineR'

The latest release of polmineR can be installed from CRAN using the usual install.packages-function.

install.packages("polmineR")

The development version of polmineR can be installed using devtools:

install.packages("devtools") # unless devtools is already installed
devtools::install_github("PolMine/polmineR", ref = "dev")

Installing 'rcqp'

The default interface of the polmineR package to access CWB indexed corpora is the package 'rcqp'. Accessing corpora will not work before you have installed the interface.

Installing precompiled binary of rcqp from the PolMine server

The easiest way to get rcqp for Mac is to install a precompiled binary that is available at the PolMine server:

install.packages(
  "rcqp",
  repos = "http://polmine.sowi.uni-due.de/packages",
  type = "mac.binary"
)

Building rcqp from source

If you want to get rcqp from CRAN and/or if you want to compile the C code yourself, the procedure is as follows.

First, you will need an installation of Xcode, which you can get via the Mac App Store. You will also need the Command Line Tools for Xcode. They can be installed from a terminal with:

xcode-select --install

To compile the C code in the rcqp package, there are system requirements that need to be fulfilled. Using a package manager such as Homebrew or Macports makes things considerably easier.

Option 1: Using Homebrew

We recommend using 'Homebrew'. To install Homebrew, follow the instructions on the Homebrew Homepage. The following commands will install the C libraries the rcqp package relies on:

brew -v install pkg-config
brew -v install glib --universal
brew -v install pcre --universal
brew -v install readline

Option 2: Using Macports

If you prefer using Macports, get it from https://www.macports.org/. After installing Macports, it is necessary to restart the computer. Next, an update of Macports is necessary.

sudo port -v selfupdate

Now we can install the libraries rcqp will require. Again, from the terminal.

sudo port install glib2
sudo port install pkgconfig
sudo port install pcre

Install dependencies and rcqp

Once the system requirements are there, the next steps can be done from R. Before installing rcqp, and then polmineR, we install a few packages. In the R console:

install.packages(pkgs = c("RUnit", "devtools", "plyr", "tm"))
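Before compiling rcqp, it may be worth confirming from R that the system libraries are visible to pkg-config. The helper `has_library` is hypothetical, and the module names "glib-2.0" and "libpcre" are the pkg-config names assumed here for glib and pcre.

```r
# Hypothetical helper: ask pkg-config whether a library is visible.
# Returns NA if pkg-config itself is not on the PATH.
has_library <- function(module) {
  if (!nzchar(Sys.which("pkg-config"))) return(NA)
  # With stdout = FALSE, system2() returns the exit status of the command;
  # 'pkg-config --exists' exits with status 0 if the module is found.
  status <- system2("pkg-config", c("--exists", module),
                    stdout = FALSE, stderr = FALSE)
  status == 0
}

libs_available <- sapply(c("glib-2.0", "libpcre"), has_library)
libs_available
```

A value of FALSE suggests that the corresponding Homebrew or Macports step above has not completed, or that the library was installed to a location pkg-config does not search.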

Now rcqp can be installed, and then polmineR:

install.packages("rcqp")
install.packages("polmineR")

If you would like to work with the development version, it can be installed from GitHub.

devtools::install_github("PolMine/polmineR", ref = "dev")

Linux

The pcre, glib and pkg-config libraries can be installed using apt-get.

sudo apt-get install libglib2.0-dev
sudo apt-get install libssl-dev
sudo apt-get install libcurl4-openssl-dev

The system requirements will now be fulfilled. From R, install dependencies for rcqp/polmineR first, and then rcqp and polmineR.

install.packages(c("RUnit", "devtools", "plyr", "tm")) # package names must be passed as one character vector
install.packages("rcqp")
install.packages("polmineR")

Monthly Downloads: 342

Version: 0.7.5

License: GPL-3

Maintainer: Andreas Blaette

Last Published: October 4th, 2017

Functions in polmineR (0.7.5)

  • TermDocumentMatrix: Methods for TermDocumentMatrix / DocumentTermMatrix
  • TokenStream-class: Class for token stream operations.
  • Labels-class: Labels class.
  • RegistryFile-class: Read, parse and modify registry file.
  • browse: display in browser
  • bundle-class: Bundle class
  • decode: Decode corpus or s-attribute.
  • dispersion-class: dispersion class
  • encoding: Get and set encoding.
  • encodings: Conversion between corpus and native encoding.
  • getTokenStream: Get Token Stream Based on Corpus Positions.
  • highlight: Highlight tokens.
  • CQI.super: Interfaces for accessing the CWB
  • Corpus: Corpus class.
  • chisquare: perform chisquare-text
  • context-class: Context class.
  • count: Get counts.
  • cpos: Get corpus positions for a query or queries.
  • features,partition-method: Get features by comparison.
  • flatten: flatten a nested list
  • getSlot: Get slot from object.
  • getTerms: get terms available in a corpus or partition
  • as.TermDocumentMatrix: Generate TermDocumentMatrix / DocumentTermMatrix.
  • as.VCorpus,partitionBundle-method: Coerce partitionBundle to VCorpus.
  • as.speeches: Split partition into speeches
  • partitionBundle-class: Bundle of partitions (partitionBundle class).
  • partitionBundle: Generate a bundle of partitions
  • size: Get number of tokens.
  • split,partition-method: split partition into partitionBundle
  • as.markdown: Generate markdown from a partition.
  • as.sparseMatrix: Type conversion - get sparseMatrix.
  • context: Analyze context of a node word.
  • blapply: apply a function over a list or bundle
  • cqp: Tools for CQP queries.
  • cqpserver: start CQP server
  • getEncoding: Get the encoding of a corpus.
  • getObjects: Get objects of a certain class.
  • getTemplate: Get and set templates.
  • terms-partition-method: get terms available in a corpus
  • install.corpus: Install packaged corpus from repository.
  • kwic-class: kwic (S4 class)
  • mail: Mail result.
  • matches: Matches for queries.
  • regions: Regions of a CWB corpus.
  • cooccurrences-class: Cooccurrences class.
  • cooccurrences: Get cooccurrence statistics.
  • cooccurrencesReshaped: Methods for manipulating cooccurrencesReshaped-class-objects
  • corpus: Get corpus.
  • enrich: Enrich an object.
  • features-class: Feature selection by comparison (S4 class).
  • hits-class: Get Hits.
  • html: restore fulltext as html
  • pAttributes: Get p-attributes.
  • partition: Initialize a partition.
  • resetRegistry: Reload using new CORPUS_REGISTRY.
  • textstat-class: S4 textstat class
  • ll: text statistics
  • means: calculate means
  • ngrams-class: Get N-Grams
  • polmineR-package: polmineR-package
  • read: Display and read full text
  • contextBundle-class: S4 contextBundle class
  • dispersion: Dispersion of a query or multiple queries
  • divide: divide an object into equally sized parts
  • encode: Encode s-attribute or corpus.
  • kwic: KWIC output / concordances
  • label: Assign and get labels.
  • sAttributes,character-method: Get s-attributes.
  • scatterplot: word scatterplot
  • tTest: perform t-test
  • dotplot: dotplot
  • noise: detect noise
  • pAttribute: get pAttribute
  • partition_class: Partition class and methods.
  • name: generic methods defined in the polmineR-package
  • trim: trim an object
  • use: Use a packaged corpus.
  • tempcorpus-class: S4 class to capture core information on a temporary CWB corpus
  • view: browse an object using View()
  • weigh: weigh a matrix