Corpustools

The corpustools package offers various tools for analyzing text corpora. The backbone is the tCorpus R6 class, which offers features ranging from corpus management tools such as pre-processing, subsetting, Boolean (Lucene) queries and deduplication, to analysis techniques such as corpus comparison, document comparison, semantic network analysis and topic modeling. Furthermore, because tokenized texts form the backbone, it is easy to reconstruct texts for a qualitative analysis and/or validation of the results of computational text analysis methods (e.g., topic browsers, keyword-in-context lists, texts with highlighted segments for search results or document comparisons).

One of the primary goals of corpustools is to make computational text analysis available and intuitive for users who are not experienced programmers. Notably, the authors are both active as researchers in the social sciences, and strive to promote the use of computational text analysis as a research method. This is also why we place particular emphasis on the ability to reconstruct the original texts, which enables a more qualitative investigation and validation of results.

Getting started

You can install corpustools directly from CRAN:

install.packages('corpustools')

Or you can install the development version from GitHub:

# install_github() comes from the remotes package: install.packages('remotes')
remotes::install_github("kasperwelbers/corpustools")
library(corpustools)

A vignette is provided (HTML version) with instructions on how to use corpustools and an overview of useful features.

vignette('corpustools')
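As a rough illustration of the workflow described in the introduction (create a tCorpus, preprocess it, query it, and build a document term matrix), here is a minimal sketch. The two toy documents and the column names id and text are made up for this example, and the argument values are just one reasonable choice; the vignette remains the authoritative guide.

```r
library(corpustools)

# Two toy documents; ids and texts are invented for illustration only
d <- data.frame(
  id = c("doc1", "doc2"),
  text = c("The quick brown fox jumps over the lazy dog.",
           "A lazy dog sleeps while the fox runs."),
  stringsAsFactors = FALSE
)

# Build the tCorpus: tc$tokens has one row per token, tc$meta one row per document
tc <- create_tcorpus(d, doc_column = "id", text_columns = "text")

# Derive a cleaned 'feature' column from the raw 'token' column.
# tCorpus methods called with $ modify the object by reference.
tc$preprocess(column = "token", new_column = "feature",
              lowercase = TRUE, remove_stopwords = TRUE)

# Find tokens with a Lucene-like query
hits <- search_features(tc, query = "fox")

# Keyword-in-context strings, useful for qualitative validation of the hits
kwic <- get_kwic(tc, query = "fox")

# Document term matrix based on the preprocessed features
dtm <- get_dtm(tc, feature = "feature")
```

Printing tc$tokens and tc$meta afterwards shows the token-level and document-level tables that the rest of the package operates on.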

Monthly Downloads: 644
Version: 0.5.2
License: GPL-3
Repository: https://github.com/kasperwelbers/corpustools
Maintainer: Kasper Welbers
Last Published: July 7th, 2025

Functions in corpustools (0.5.2)

compare_documents: Calculate the similarity of documents
Compare two document term matrices
Get keyword-in-context (KWIC) strings
Plot a word cloud from a dtm
Get a character vector of stopwords
Compute global feature positions
Create a document term matrix
coreNLP example sentences
Count results of search hits, or of a given feature in tokens
Compare vocabulary of a subset of a tCorpus to the rest of the tCorpus
Merge tCorpus objects
plot.contextHits: S3 plot for contextHits class
Create an ego network
export_span_annotations: Export span annotations
plot.vocabularyComparison: Visualize vocabularyComparison
Create a semantic network based on the co-occurrence of tokens in documents
Visualize a semnet network
Create a semantic network based on the co-occurrence of tokens in token windows
Laplace (i.e. add constant) smoothing
melt_quanteda_dict: Convert a quanteda dictionary to a long data.table format
Compare tCorpus vocabulary to that of another (reference) tCorpus
Vectorized computation of chi^2 statistic for a 2x2 crosstab containing the values [a, b] [c, d]
feature_associations: Get common nearby features given a query or query hits
tCorpus$fold_rsyntax: Fold rsyntax annotations
S3 print for tCorpus class
refresh_tcorpus: Refresh a tCorpus object using the current version of corpustools
show_udpipe_models: Show the names of udpipe models
State of the Union addresses
summary.featureHits: S3 summary for featureHits class
Feature statistics
search_dictionary: Dictionary lookup
Access the data from a tCorpus
tCorpus$preprocess: Preprocess feature
Plot a wordcloud with words ordered and coloured according to a dimension (x)
preprocess_tokens: Preprocess tokens in a character vector
Subset tCorpus token data using a query
Support function for subset method
Basic stopword lists
summary.tCorpus: Summary of a tCorpus object
search_features: Find tokens using a Lucene-like search query
Create a tCorpus
Fold rsyntax annotations
tCorpus$replace_dictionary: Replace tokens with dictionary match
Support function for subset method
S3 subset for tCorpus class
require_package: Check if package with given version exists
search_contexts: Search for documents or sentences using Boolean queries
Show top features
tCorpus$deduplicate: Deduplicate documents
Simple Good Turing smoothing
tCorpus$set_name: Change column names of data and meta data
tCorpus$set_levels: Change levels of factor columns
set_network_attributes: Set some default network attributes for pretty plotting
tCorpus_compare: Corpus comparison
tCorpus: a corpus class for tokenized texts
tCorpus$feature_subset: Filter features
tCorpus$feats_to_columns: Cast the "feats" column in UDpipe tokens to columns
summary.contextHits: S3 summary for contextHits class
tCorpus$code_features: Code features in a tCorpus based on a search string
tokenWindowOccurence: Gives the window in which a term occurred in a matrix
transform_rsyntax: Apply rsyntax transformations
tokens_to_tcorpus: Create a tCorpus based on tokens (i.e. preprocessed texts)
tCorpus_modify_by_reference: Modify tCorpus by reference
tCorpus$delete_columns: Delete column from the data and meta data
Subset a tCorpus
tCorpus$subset_query: Subset tCorpus token data using a query
tCorpus_querying: Use Boolean queries to analyze the tCorpus
Creating a tCorpus
Merge the token and meta data.tables of a tCorpus with another data.frame
tCorpus$context: Get a context vector
Document similarity
tCorpus$lda_fit: Estimate an LDA topic model
Methods and functions for viewing, modifying and subsetting tCorpus data
Modify the token and meta data.tables of a tCorpus
tCorpus_features: Preprocessing, subsetting and analyzing features
tCorpus$search_recode: Recode features in a tCorpus based on a search string
plot.featureAssociations: Visualize feature associations
plot.featureHits: S3 plot for featureHits class
print.contextHits: S3 print for contextHits class
A tCorpus with a small sample of sotu paragraphs parsed with udpipe
udpipe_simplify: Simplify tokenIndex created with the udpipe parser
Visualize a dependency tree
Create a tCorpus using udpipe
Feature co-occurrence based semantic network analysis
udpipe_spanquote_tqueries: Get a list of tqueries for finding candidates for span quotes
tCorpus$code_dictionary: Dictionary lookup
tCorpus$udpipe_clauses: Add columns indicating who did what
tCorpus$annotate_rsyntax: Annotate tokens based on rsyntax queries
tCorpus$udpipe_quotes: Add columns indicating who said what
print.featureHits: S3 print for featureHits class
udpipe_clause_tqueries: Get a list of tqueries for extracting who did what
udpipe_quote_tqueries: Get a list of tqueries for extracting quotes
Reconstruct original texts
Helper function for aggregate_rsyntax
Create and view a full text browser
Aggregate the tokens data
as.tcorpus.default: Force an object to be a tCorpus class
as.tcorpus.tCorpus: Force an object to be a tCorpus class
Force an object to be a tCorpus class
View hits in a browser
add_multitoken_label: Choose and add multitoken strings based on multitoken categories
backbone_filter: Extract the backbone of a network
aggregate_rsyntax: Aggregate rsyntax annotations