Corpustools

The corpustools package offers various tools for analyzing text corpora. The backbone is the tCorpus R6 class, which offers features ranging from corpus management tools such as pre-processing, subsetting, Boolean (Lucene) queries and deduplication, to analysis techniques such as corpus comparison, document comparison, semantic network analysis and topic modeling. Furthermore, because tokenized texts form the backbone, it is easy to reconstruct texts for qualitative analysis and/or validation of the results of computational text analysis methods (e.g., topic browsers, keyword-in-context lists, texts with highlighted segments for search results or document comparisons).

One of the primary goals of corpustools is to make computational text analysis available and intuitive for users who are not experienced programmers. Notably, the authors are both active as researchers in the social sciences and strive to promote the use of computational text analysis as a research method. This is also why we emphasize the ability to reconstruct the original texts, which enables a more qualitative investigation and validation of results.

Getting started

You can install corpustools directly from CRAN:

install.packages('corpustools')

Or you can install the development version from GitHub:

# requires the remotes package: install.packages('remotes')
remotes::install_github("kasperwelbers/corpustools")
library(corpustools)

A vignette is provided (HTML version) with instructions on how to use corpustools and an overview of useful features.

vignette('corpustools')
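As a quick orientation alongside the vignette, the following is a minimal sketch of a first session. It assumes the sotu_texts demo data listed among the package's datasets has `id` and `text` columns, and the query string is purely illustrative; treat it as a starting point and not as the recommended workflow.

```r
library(corpustools)

# sotu_texts: State of the Union demo data shipped with the package
# (assumption: it has an "id" column and a "text" column)
tc <- create_tcorpus(sotu_texts, doc_column = "id", text_columns = "text")

# Add a preprocessed "feature" column next to the original tokens
tc$preprocess(column = "token", new_column = "feature",
              lowercase = TRUE, remove_stopwords = TRUE, use_stemming = TRUE)

# Keyword-in-context listing for a Boolean query (illustrative query)
kwic <- get_kwic(tc, query = "freedom* AND america*", n = 2)
print(kwic)
```

Because the original tokens are kept next to the preprocessed features, results such as the KWIC listing can still be read in their original wording.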

Monthly Downloads: 1,397
Version: 0.5.1
License: GPL-3

Maintainer: Kasper Welbers
Last Published: May 8th, 2023

Functions in corpustools (0.5.1)

as.tcorpus: Force an object to be a tCorpus class
browse_hits: View hits in a browser
agg_tcorpus: Aggregate the tokens data
calc_chi2: Vectorized computation of chi^2 statistic for a 2x2 crosstab containing the values [a, b] [c, d]
agg_label: Helper function for aggregate_rsyntax
as.tcorpus.default: Force an object to be a tCorpus class
feature_associations: Get common nearby features given a query or query hits
dtm_compare: Compare two document term matrices
compare_corpus: Compare tCorpus vocabulary to that of another (reference) tCorpus
feature_stats: Feature statistics
plot.vocabularyComparison: Visualize vocabularyComparison
plot_semnet: Visualize a semnet network
browse_texts: Create and view a full text browser
dtm_wordcloud: Plot a word cloud from a dtm
corenlp_tokens: coreNLP example sentences
as.tcorpus.tCorpus: Force an object to be a tCorpus class
get_kwic: Get keyword-in-context (KWIC) strings
merge_tcorpora: Merge tCorpus objects
backbone_filter: Extract the backbone of a network
count_tcorpus: Count results of search hits, or of a given feature in tokens
get_stopwords: Get a character vector of stopwords
fold_rsyntax: Fold rsyntax annotations
plot.contextHits: S3 plot for contextHits class
freq_filter: Support function for subset method
plot.featureAssociations: Visualize feature associations
print.tCorpus: S3 print for tCorpus class
refresh_tcorpus: Refresh a tCorpus object using the current version of corpustools
print.contextHits: S3 print for contextHits class
subset_query: Subset tCorpus token data using a query
plot.featureHits: S3 plot for featureHits class
summary.contextHits: S3 summary for contextHits class
tCorpus$feats_to_columns: Cast the "feats" column in UDpipe tokens to columns
search_dictionary: Dictionary lookup
show_udpipe_models: Show the names of udpipe models
aggregate_rsyntax: Aggregate rsyntax annotations
compare_documents: Calculate the similarity of documents
search_features: Find tokens using a Lucene-like search query
compare_subset: Compare vocabulary of a subset of a tCorpus to the rest of the tCorpus
tCorpus$fold_rsyntax: Fold rsyntax annotations
tCorpus$get: Access the data from a tCorpus
tCorpus$feature_subset: Filter features
sotu_texts: State of the Union addresses
tCorpus$set_levels: Change levels of factor columns
tCorpus$lda_fit: Estimate an LDA topic model
create_tcorpus: Create a tCorpus
tCorpus$set_name: Change column names of data and meta data
docfreq_filter: Support function for subset method
ego_semnet: Create an ego network
tCorpus$merge: Merge the token and meta data.tables of a tCorpus with another data.frame
print.featureHits: S3 print for featureHits class
export_span_annotations: Export span annotations
get_dtm: Create a document term matrix
tCorpus_create: Creating a tCorpus
stopwords_list: Basic stopword lists
subset.tCorpus: S3 subset for tCorpus class
get_global_i: Compute global feature positions
tCorpus_modify_by_reference: Modify tCorpus by reference
tCorpus$subset: Subset a tCorpus
tCorpus_data: Methods and functions for viewing, modifying and subsetting tCorpus data
tCorpus$code_features: Code features in a tCorpus based on a search string
plot_words: Plot a wordcloud with words ordered and coloured according to a dimension (x)
preprocess_tokens: Preprocess tokens in a character vector
tCorpus$context: Get a context vector
tCorpus$search_recode: Recode features in a tCorpus based on a search string
laplace: Laplace (i.e. add constant) smoothing
tCorpus$set: Modify the token and meta data.tables of a tCorpus
tc_plot_tree: Visualize a dependency tree
semnet: Create a semantic network based on the co-occurrence of tokens in documents
tCorpus_querying: Use Boolean queries to analyze the tCorpus
tCorpus_semnet: Feature co-occurrence based semantic network analysis
tCorpus$udpipe_clauses: Add columns indicating who did what
tCorpus$udpipe_quotes: Add columns indicating who said what
tCorpus_topmod: Topic modeling
tc_sotu_udpipe: A tCorpus with a small sample of sotu paragraphs parsed with udpipe
melt_quanteda_dict: Convert a quanteda dictionary to a long data.table format
require_package: Check if package with given version exists
semnet_window: Create a semantic network based on the co-occurrence of tokens in token windows
summary.featureHits: S3 summary for featureHits class
summary.tCorpus: Summary of a tCorpus object
tCorpus$subset_query: Subset tCorpus token data using a query
udpipe_clause_tqueries: Get a list of tqueries for extracting who did what
tCorpus$annotate_rsyntax: Annotate tokens based on rsyntax queries
udpipe_quote_tqueries: Get a list of tqueries for extracting quotes
search_contexts: Search for documents or sentences using Boolean queries
set_network_attributes: Set some default network attributes for pretty plotting
udpipe_tcorpus: Create a tCorpus using udpipe
sgt: Simple Good Turing smoothing
untokenize: Reconstruct original texts
tCorpus_docsim: Document similarity
tCorpus_features: Preprocessing, subsetting and analyzing features
tCorpus$code_dictionary: Dictionary lookup
tCorpus$deduplicate: Deduplicate documents
tCorpus$preprocess: Preprocess feature
udpipe_simplify: Simplify tokenIndex created with the udpipe parser
tCorpus$delete_columns: Delete column from the data and meta data
udpipe_spanquote_tqueries: Get a list of tqueries for finding candidates for span quotes
tokenWindowOccurence: Gives the window in which a term occurred in a matrix
tCorpus$replace_dictionary: Replace tokens with dictionary match
tCorpus: A corpus class for tokenized texts
tokens_to_tcorpus: Create a tcorpus based on tokens (i.e. preprocessed texts)
tCorpus_compare: Corpus comparison
top_features: Show top features
transform_rsyntax: Apply rsyntax transformations
add_multitoken_label: Choose and add multitoken strings based on multitoken categories
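
The calc_chi2 entry in the list above describes a vectorized chi^2 statistic for a 2x2 crosstab with the values [a, b] [c, d]. To make that concrete, here is a minimal base R sketch of the plain Pearson statistic, chi^2 = n(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)). The helper name chi2_2x2 is hypothetical and this is not the package's implementation; calc_chi2 may apply corrections or criteria that are not shown here.

```r
# Hypothetical helper (not the package's calc_chi2): Pearson chi^2 for 2x2
# tables [a, b] [c, d], vectorized over equal-length numeric vectors.
chi2_2x2 <- function(a, b, c, d) {
  a <- as.numeric(a); b <- as.numeric(b)
  c <- as.numeric(c); d <- as.numeric(d)
  n <- a + b + c + d
  denom <- (a + b) * (c + d) * (a + c) * (b + d)
  # Tables with an empty row or column carry no association signal
  ifelse(denom == 0, 0, n * (a * d - b * c)^2 / denom)
}

# Illustrative counts (made up): frequency of a term in a target corpus (a)
# and a reference corpus (b), against all other tokens in each (c, d)
chi2 <- chi2_2x2(a = c(10, 5), b = c(20, 5), c = c(30, 95), d = c(40, 95))
print(chi2)
```

Computing the statistic on whole vectors at once is what makes it practical to score every term in a vocabulary in a single call, which is the kind of use a corpus comparison needs.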