Corpustools

The corpustools package offers various tools for analyzing text corpora. The backbone is the tCorpus R6 class, which offers features ranging from corpus management tools such as pre-processing, subsetting, Boolean (Lucene) queries and deduplication, to analysis techniques such as corpus comparison, document comparison, semantic network analysis and topic modeling. Furthermore, because tokenized texts are the backbone, it is easy to reconstruct texts for qualitative analysis and/or validation of the results of computational text analysis methods (e.g., topic browsers, keyword-in-context lists, texts with highlighted segments for search results or document comparisons).
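As a rough illustration of the corpus management features described above, the sketch below builds a tCorpus from three made-up sentences, preprocesses the tokens, and runs a Boolean query. The example texts and document ids are invented for illustration, and the arguments reflect the package documentation as understood here, so check `?create_tcorpus`, `?search_features` and `?get_kwic` for the exact options in your installed version.

```r
library(corpustools)

# Three toy documents (invented for this example)
texts <- c(
  "The economy is growing and taxes are falling.",
  "Taxes were raised to support the economy.",
  "The weather was sunny all week."
)

# Tokenize the texts into a tCorpus; doc_id supplies the document names
tc <- create_tcorpus(texts, doc_id = c("d1", "d2", "d3"))

# Add a 'feature' column with lowercased tokens and stopwords removed.
# tCorpus methods modify the object by reference, so no assignment is needed.
tc$preprocess(column = "token", new_column = "feature",
              lowercase = TRUE, remove_stopwords = TRUE)

# Boolean query: tokens in documents that mention both the economy and taxes
hits <- search_features(tc, query = "econom* AND tax*")
print(hits$hits)

# Keyword-in-context strings for the same query, for qualitative checking
kwic <- get_kwic(tc, query = "econom* AND tax*", ntokens = 5)
print(kwic)
```

The keyword-in-context output is the kind of reconstruction the paragraph refers to: each hit is shown with a few surrounding tokens so that query results can be read and validated by hand.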

One of the primary goals of corpustools is to make computational text analysis available and intuitive for users who are not experienced programmers. Notably, the authors are both active as researchers in the social sciences, and strive to promote the use of computational text analysis as a research method. This is also why we emphasize the ability to reconstruct the original texts, which enables a more qualitative investigation and validation of results.

Getting started

You can install corpustools directly from CRAN:

install.packages('corpustools')

Or you can install the development version from GitHub (this requires the remotes package):

remotes::install_github("kasperwelbers/corpustools")
library(corpustools)

A vignette is provided (HTML version) with instructions on how to use corpustools and an overview of useful features.

vignette('corpustools')
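Once the package is loaded, a first session might look like the sketch below. It turns a tiny invented corpus into a document-term matrix and then reconstructs the texts from the tokens. The texts and document ids are assumptions made for the example, and the exact shape of the `untokenize()` output may differ between versions, so treat this as a starting point rather than a reference.

```r
library(corpustools)

# Two toy documents (invented for this example)
texts <- c("The economy is growing.", "Taxes support the economy.")
tc <- create_tcorpus(texts, doc_id = c("a", "b"))

# Lowercase the tokens and drop stopwords into a new 'feature' column
tc$preprocess(column = "token", new_column = "feature",
              lowercase = TRUE, remove_stopwords = TRUE)

# Document-term matrix based on the preprocessed features
dtm <- get_dtm(tc, "feature")
print(dim(dtm))

# Rebuild the texts from the token data, e.g. to read documents
# alongside the results of an analysis
reconstructed <- untokenize(tc)
print(reconstructed)
```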

Monthly Downloads

1,190

Version

0.5.2

License

GPL-3

Maintainer

Kasper Welbers

Last Published

July 7th, 2025

Functions in corpustools (0.5.2)

compare_documents

Calculate the similarity of documents
dtm_compare

Compare two document term matrices
get_kwic

Get keyword-in-context (KWIC) strings
dtm_wordcloud

Plot a word cloud from a dtm
get_stopwords

Get a character vector of stopwords
get_global_i

Compute global feature positions
get_dtm

Create a document term matrix.
corenlp_tokens

coreNLP example sentences
count_tcorpus

Count results of search hits, or of a given feature in tokens
compare_subset

Compare vocabulary of a subset of a tCorpus to the rest of the tCorpus
merge_tcorpora

Merge tCorpus objects
plot.contextHits

S3 plot for contextHits class
ego_semnet

Create an ego network
export_span_annotations

Export span annotations
plot.vocabularyComparison

Visualize vocabularyComparison
semnet

Create a semantic network based on the co-occurrence of tokens in documents
plot_semnet

Visualize a semnet network
semnet_window

Create a semantic network based on the co-occurrence of tokens in token windows
laplace

Laplace (i.e. add constant) smoothing
melt_quanteda_dict

Convert a quanteda dictionary to a long data.table format
compare_corpus

Compare tCorpus vocabulary to that of another (reference) tCorpus
calc_chi2

Vectorized computation of chi^2 statistic for a 2x2 crosstab containing the values [a, b] [c, d]
feature_associations

Get common nearby features given a query or query hits
tCorpus$fold_rsyntax

Fold rsyntax annotations
print.tCorpus

S3 print for tCorpus class
refresh_tcorpus

Refresh a tCorpus object using the current version of corpustools
show_udpipe_models

Show the names of udpipe models
sotu_texts

State of the Union addresses
summary.featureHits

S3 summary for featureHits class
feature_stats

Feature statistics
search_dictionary

Dictionary lookup
tCorpus$get

Access the data from a tCorpus
tCorpus$preprocess

Preprocess feature
plot_words

Plot a wordcloud with words ordered and coloured according to a dimension (x)
preprocess_tokens

Preprocess tokens in a character vector
subset_query

Subset tCorpus token data using a query
docfreq_filter

Support function for subset method
stopwords_list

Basic stopword lists
summary.tCorpus

Summary of a tCorpus object
search_features

Find tokens using a Lucene-like search query
create_tcorpus

Create a tCorpus
fold_rsyntax

Fold rsyntax annotations
tCorpus$replace_dictionary

Replace tokens with dictionary match
freq_filter

Support function for subset method
subset.tCorpus

S3 subset for tCorpus class
require_package

Check if package with given version exists
search_contexts

Search for documents or sentences using Boolean queries
top_features

Show top features
tCorpus$deduplicate

Deduplicate documents
sgt

Simple Good Turing smoothing
tCorpus$set_name

Change column names of data and meta data
tCorpus$set_levels

Change levels of factor columns
set_network_attributes

Set some default network attributes for pretty plotting
tCorpus_compare

Corpus comparison
tCorpus

tCorpus: a corpus class for tokenized texts
tCorpus$feature_subset

Filter features
tCorpus$feats_to_columns

Cast the "feats" column in UDpipe tokens to columns
summary.contextHits

S3 summary for contextHits class
tCorpus$code_features

Code features in a tCorpus based on a search string
tokenWindowOccurence

Gives the window in which a term occurred in a matrix.
transform_rsyntax

Apply rsyntax transformations
tokens_to_tcorpus

Create a tcorpus based on tokens (i.e. preprocessed texts)
tCorpus_modify_by_reference

Modify tCorpus by reference
tCorpus$delete_columns

Delete column from the data and meta data
tCorpus$subset

Subset a tCorpus
tCorpus$subset_query

Subset tCorpus token data using a query
tCorpus_querying

Use Boolean queries to analyze the tCorpus
tCorpus_create

Creating a tCorpus
tCorpus$merge

Merge the token and meta data.tables of a tCorpus with another data.frame
tCorpus$context

Get a context vector
tCorpus_docsim

Document similarity
tCorpus$lda_fit

Estimate a LDA topic model
tCorpus_data

Methods and functions for viewing, modifying and subsetting tCorpus data
tCorpus$set

Modify the token and meta data.tables of a tCorpus
tCorpus_features

Preprocessing, subsetting and analyzing features
tCorpus$search_recode

Recode features in a tCorpus based on a search string
plot.featureAssociations

Visualize feature associations
plot.featureHits

S3 plot for featureHits class
print.contextHits

S3 print for contextHits class
tc_sotu_udpipe

A tCorpus with a small sample of sotu paragraphs parsed with udpipe
udpipe_simplify

Simplify tokenIndex created with the udpipe parser
tc_plot_tree

Visualize a dependency tree
udpipe_tcorpus

Create a tCorpus using udpipe
tCorpus_topmod

Topic modeling
tCorpus_semnet

Feature co-occurrence based semantic network analysis
udpipe_spanquote_tqueries

Get a list of tqueries for finding candidates for span quotes.
tCorpus$code_dictionary

Dictionary lookup
tCorpus$udpipe_clauses

Add columns indicating who did what
tCorpus$annotate_rsyntax

Annotate tokens based on rsyntax queries
tCorpus$udpipe_quotes

Add columns indicating who said what
print.featureHits

S3 print for featureHits class
udpipe_clause_tqueries

Get a list of tqueries for extracting who did what
udpipe_quote_tqueries

Get a list of tqueries for extracting quotes
untokenize

Reconstruct original texts
agg_label

Helper function for aggregate_rsyntax
browse_texts

Create and view a full text browser
agg_tcorpus

Aggregate the tokens data
as.tcorpus.default

Force an object to be a tCorpus class
as.tcorpus.tCorpus

Force an object to be a tCorpus class
as.tcorpus

Force an object to be a tCorpus class
browse_hits

View hits in a browser
add_multitoken_label

Choose and add multitoken strings based on multitoken categories
backbone_filter

Extract the backbone of a network.
aggregate_rsyntax

Aggregate rsyntax annotations