R-package 'polmineR'

Purpose

The focus of the package 'polmineR' is the interactive analysis of corpora using R. Core objectives for the development of the package are performance, usability, and a modular design.

Aims

Key aims for developing the package are:

  • To keep the original text accessible. A seamless integration of qualitative and quantitative steps in corpus analysis supports validation, based on inspecting the text behind the numbers.
  • To provide a library for standard tasks. It is an open source platform that makes text mining more productive by sparing users the prohibitive cost of reimplementing basics, or of writing many lines of code to perform a basic task.
  • To create a package that makes the creation and analysis of subcorpora ('partitions') easy. A particular strength of the package is to support contrastive/comparative research.
  • To offer performance for users with an ordinary infrastructure. The package picks up the idea of a three-tier software design. Corpus data are managed and indexed by using the Corpus Workbench (CWB).
  • To support sharing consolidated and documented data, following the ideas of reproducible research.

Backend

polmineR relies on the Open Corpus Workbench (CWB) as a backend and uses the rcqp package as an interface. The CWB is particularly efficient for storing large corpora and offers a powerful query language through the Corpus Query Processor (CQP). The architecture may be overengineered if you work with smaller corpora. It is meant to make working with larger corpora efficient, both locally and on a server.

Background

The polmineR package was specifically developed for the XML annotation structure of the corpora created in the PolMine project (see polmine.sowi.uni-due.de). The core PolMine corpora are corpora of plenary protocols. In these corpora, speakers, parties etc. are structurally annotated. The polmineR package is meant to help make full use of this rich annotation structure.

Core functions

  • partition: Set up a partition (i.e. a subcorpus);
  • count: Count features;
  • dispersion: Analyse the dispersion of a query across one or two dimensions (absolute and relative frequencies);
  • cooccurrences: Analyse the context of a query (including some statistics);
  • features: Compare partitions to identify features / keywords (using statistical tests such as chi-square).
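
To illustrate the kind of test behind a feature comparison, here is a minimal base R sketch of a chi-square keyness score for terms in a partition of interest versus a reference partition. It does not call polmineR and is not the package's implementation; the term counts, the partition sizes, and the helper name chisq_keyness are invented for illustration.

```r
# Toy term counts in a partition of interest (coi) and a reference partition (ref).
# All numbers are made up for illustration; they do not come from a real corpus.
counts <- data.frame(
  term = c("integration", "budget", "the"),
  coi  = c(30, 5, 600),
  ref  = c(10, 25, 600),
  stringsAsFactors = FALSE
)
size_coi <- 10000  # total number of tokens in the partition of interest
size_ref <- 10000  # total number of tokens in the reference partition

# Hypothetical helper: Pearson chi-square statistic (no continuity correction)
# for the 2x2 table "term / all other tokens" by "coi / ref". Vectorised over terms.
chisq_keyness <- function(count_coi, count_ref, size_coi, size_ref) {
  # Share of the term in both partitions taken together
  p <- (count_coi + count_ref) / (size_coi + size_ref)
  # Expected cell counts if the term were equally frequent in both partitions
  exp_coi      <- size_coi * p
  exp_ref      <- size_ref * p
  exp_coi_rest <- size_coi * (1 - p)
  exp_ref_rest <- size_ref * (1 - p)
  # Sum of (observed - expected)^2 / expected over the four cells
  (count_coi - exp_coi)^2 / exp_coi +
    (count_ref - exp_ref)^2 / exp_ref +
    ((size_coi - count_coi) - exp_coi_rest)^2 / exp_coi_rest +
    ((size_ref - count_ref) - exp_ref_rest)^2 / exp_ref_rest
}

counts$chisquare <- chisq_keyness(counts$coi, counts$ref, size_coi, size_ref)

# Terms whose frequencies differ most between the two partitions come first
ranked <- counts[order(-counts$chisquare), ]
print(ranked)
```

Sorting by the score puts the terms whose frequencies differ most at the top. The statistic itself is unsigned, so it flags underuse as well as overuse; compare the relative frequencies in both partitions to see the direction of the difference.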

State of affairs

The most recent stable version is available at CRAN. Development versions are available via GitHub.

Installation

The package can be installed on macOS, Linux, and Windows. On Windows, installation is limited to the 32-bit version of R. See the wiki for installation instructions.

Feedback

Feedback is most welcome! I want this to be a useful package, not just for me. Please do get in touch: Andreas Blaette, University of Duisburg-Essen (andreas.blaette@uni-due.de).

Install

install.packages('polmineR')

Monthly Downloads

342

Version

0.7.4

License

GPL-3

Maintainer

Andreas Blaette

Last Published

July 17th, 2017

Functions in polmineR (0.7.4)

  • RegistryFile-class: Read, parse and modify registry file.
  • TermDocumentMatrix: Methods for TermDocumentMatrix / DocumentTermMatrix.
  • CQI.super: Interfaces for accessing the CWB.
  • Corpus: Corpus class.
  • as.VCorpus,partitionBundle-method: Coerce partitionBundle to VCorpus.
  • as.markdown: Generate markdown from a partition.
  • TokenStream-class: Class for token stream operations.
  • as.TermDocumentMatrix: Generate TermDocumentMatrix / DocumentTermMatrix.
  • Labels-class: Labels class.
  • Regions-class: Regions of a CWB corpus.
  • bundle-class: Bundle class.
  • chisquare: Perform chi-square test.
  • cqpserver: Start CQP server.
  • decode: Decode corpus.
  • contextBundle-class: S4 contextBundle class.
  • cooccurrences-class: Cooccurrences class.
  • corpus: Get corpus.
  • count: Get counts.
  • encode: Encode CWB corpus.
  • encoding: Get and set encoding.
  • getObjects: Get objects of a certain class.
  • browse: Display in browser.
  • cooccurrences: Get cooccurrence statistics.
  • features-class: Feature selection by comparison (S4 class).
  • features,partition-method: Get features by comparison.
  • html: Restore full text as HTML.
  • blapply: Apply a function over a list or bundle.
  • cooccurrencesReshaped: Methods for manipulating cooccurrencesReshaped-class objects.
  • encodings: Conversion between corpus and native encoding.
  • enrich: Enrich an object.
  • highlight: Highlight tokens.
  • hits-class: Get hits.
  • install.corpus: Install packaged corpus from repository.
  • ngrams-class: Get n-grams.
  • noise: Detect noise.
  • name: Generic methods defined in the polmineR package.
  • label: Assign and get labels.
  • mail: Mail result.
  • read: Display and read full text.
  • resetRegistry: Reload using new CORPUS_REGISTRY.
  • getSlot: Get slot from object.
  • partition: Initialize a partition.
  • partitionBundle-class: Bundle of partitions (partitionBundle class).
  • partitionBundle: Generate a bundle of partitions.
  • context-class: Context class.
  • context: Analyze context of a node word.
  • cpos: Get corpus positions for a query or queries.
  • cqp: Tools for CQP queries.
  • partition_class: Partition class and methods.
  • trim: Trim an object.
  • use: Use a packaged corpus.
  • view: Browse an object using View().
  • weigh: Weigh a matrix.
  • as.sparseMatrix: Type conversion, get sparseMatrix.
  • as.speeches: Split partition into speeches.
  • dispersion-class: Dispersion class.
  • dispersion: Dispersion of a query or multiple queries.
  • flatten: Flatten a nested list.
  • getEncoding: Get the encoding of a corpus.
  • getTerms: Get terms available in a corpus or partition.
  • polmineR-package: polmineR-package.
  • textstat-class: S4 textstat class.
  • ll: Text statistics.
  • divide: Divide an object into equally sized parts.
  • dotplot: Dotplot.
  • kwic-class: kwic (S4 class).
  • getTokenStream: Get token stream based on corpus positions.
  • matches: Matches for queries.
  • means: Calculate means.
  • kwic: KWIC output / concordances.
  • pAttribute: Get pAttribute.
  • pAttributes: Get p-attributes.
  • size: Get number of tokens.
  • split,partition-method: Split partition into partitionBundle.
  • getTemplate: Get and set templates.
  • terms-partition-method: Get terms available in a corpus.
  • sAttributes,character-method: Get s-attributes.
  • scatterplot: Word scatterplot.
  • tTest: Perform t-test.
  • tempcorpus-class: S4 class to capture core information on a temporary CWB corpus.