Corpustools

The corpustools package offers various tools for analyzing text corpora. The backbone is the tCorpus R6 class, which offers features ranging from corpus management tools such as pre-processing, subsetting, Boolean (Lucene) queries and deduplication, to analysis techniques such as corpus comparison, document comparison, semantic network analysis and topic modeling. Furthermore, because tokenized texts form the backbone, it is easy to reconstruct texts for qualitative analysis and/or validation of the results of computational text analysis methods (e.g., topic browsers, keyword-in-context lists, texts with highlighted segments for search results or document comparisons).

One of the primary goals of corpustools is to make computational text analysis available and intuitive for users who are not experienced programmers. Notably, the authors are both active as researchers in the social sciences, and strive to promote the use of computational text analysis as a research method. This is also why we place particular emphasis on the ability to reconstruct the original texts, which enables a more qualitative investigation and validation of results.

Getting started

You can install corpustools directly from CRAN:

install.packages('corpustools')

Or you can install the development version from GitHub:

# install_github() is provided by the devtools package (install it first if needed)
devtools::install_github("kasperwelbers/corpustools")
library(corpustools)

A vignette is provided (HTML version) with instructions on how to use corpustools and an overview of useful features.

vignette('corpustools')
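As a rough illustration of the workflow described in the introduction (creating a tCorpus, preprocessing, running a Lucene-like query, and inspecting keyword-in-context strings), a minimal sketch might look as follows. The three example texts and the column name doc_id are made up for illustration; the function names are those listed in the function index, and the argument values are assumptions that may need adjusting for your installed version.

```r
library(corpustools)

# Hypothetical example texts; any data.frame with an id and a text column works.
d <- data.frame(
  doc_id = c("d1", "d2", "d3"),
  text = c("The war on terror continues.",
           "Taxes and the economy dominate the debate.",
           "Terrorism and the economy were both discussed."),
  stringsAsFactors = FALSE
)

# Tokenize the texts into a tCorpus (token data plus document meta data).
tc <- create_tcorpus(d, doc_column = "doc_id", text_columns = "text")

# tCorpus methods modify the object by reference, so no assignment is needed.
# This adds a 'feature' column with lowercased tokens and stopwords removed.
tc$preprocess(column = "token", new_column = "feature",
              lowercase = TRUE, remove_stopwords = TRUE)

# Lucene-like query with a wildcard, evaluated against the token column.
hits <- search_features(tc, query = "terror*")

# Keyword-in-context strings, useful for qualitatively checking the hits.
kwic <- get_kwic(tc, query = "terror*")

# Document-term matrix built from the preprocessed features.
dtm <- get_dtm(tc, feature = "feature")
```

Printing hits and kwic shows which documents matched and in what context, which is the kind of qualitative validation the package aims to make easy.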

Monthly Downloads

1,190

Version

0.5.1

License

GPL-3

Repository

https://github.com/kasperwelbers/corpustools

Maintainer

Kasper Welbers

Last Published

May 8th, 2023

Functions in corpustools (0.5.1)

Force an object to be a tCorpus class

View hits in a browser

Aggregate the tokens data

Vectorized computation of chi^2 statistic for a 2x2 crosstab containing the values [a, b] [c, d]

Helper function for aggregate_rsyntax

as.tcorpus.default

Force an object to be a tCorpus class

feature_associations

Get common nearby features given a query or query hits

Compare two document term matrices

Compare tCorpus vocabulary to that of another (reference) tCorpus

Feature statistics

plot.vocabularyComparison

Visualize vocabularyComparison

Visualize a semnet network

Create and view a full text browser

Plot a word cloud from a dtm

coreNLP example sentences

as.tcorpus.tCorpus

Force an object to be a tCorpus class

Get keyword-in-context (KWIC) strings

Merge tCorpus objects

backbone_filter

Extract the backbone of a network.

Count results of search hits, or of a given feature in tokens

Get a character vector of stopwords

Fold rsyntax annotations

plot.contextHits

S3 plot for contextHits class

Support function for subset method

plot.featureAssociations

Visualize feature associations

S3 print for tCorpus class

refresh_tcorpus

Refresh a tCorpus object using the current version of corpustools

print.contextHits

S3 print for contextHits class

Subset tCorpus token data using a query

plot.featureHits

S3 plot for featureHits class

summary.contextHits

S3 summary for contextHits class

tCorpus$feats_to_columns

Cast the "feats" column in UDpipe tokens to columns

search_dictionary

Dictionary lookup

show_udpipe_models

Show the names of udpipe models

aggregate_rsyntax

Aggregate rsyntax annotations

compare_documents

Calculate the similarity of documents

search_features

Find tokens using a Lucene-like search query

Compare vocabulary of a subset of a tCorpus to the rest of the tCorpus

tCorpus$fold_rsyntax

Fold rsyntax annotations

Access the data from a tCorpus

tCorpus$feature_subset

Filter features

State of the Union addresses

tCorpus$set_levels

Change levels of factor columns

tCorpus$lda_fit

Estimate an LDA topic model

Create a tCorpus

tCorpus$set_name

Change column names of data and meta data

Support function for subset method

Create an ego network

Merge the token and meta data.tables of a tCorpus with another data.frame

print.featureHits

S3 print for featureHits class

export_span_annotations

Export span annotations

Create a document term matrix.

Creating a tCorpus

Basic stopword lists

S3 subset for tCorpus class

Compute global feature positions

tCorpus_modify_by_reference

Modify tCorpus by reference

Subset a tCorpus

Methods and functions for viewing, modifying and subsetting tCorpus data

tCorpus$code_features

Code features in a tCorpus based on a search string

Plot a wordcloud with words ordered and coloured according to a dimension (x)

preprocess_tokens

Preprocess tokens in a character vector

tCorpus$context

Get a context vector

tCorpus$search_recode

Recode features in a tCorpus based on a search string

Laplace (i.e. add constant) smoothing

Modify the token and meta data.tables of a tCorpus

Visualize a dependency tree

Create a semantic network based on the co-occurrence of tokens in documents

tCorpus_querying

Use Boolean queries to analyze the tCorpus

Feature co-occurrence based semantic network analysis

tCorpus$udpipe_clauses

Add columns indicating who did what

tCorpus$udpipe_quotes

Add columns indicating who said what

A tCorpus with a small sample of sotu paragraphs parsed with udpipe

melt_quanteda_dict

Convert a quanteda dictionary to a long data.table format

require_package

Check if package with given version exists

Create a semantic network based on the co-occurrence of tokens in token windows

summary.featureHits

S3 summary for featureHits class

summary.tCorpus

Summary of a tCorpus object

tCorpus$subset_query

Subset tCorpus token data using a query

udpipe_clause_tqueries

Get a list of tqueries for extracting who did what

tCorpus$annotate_rsyntax

Annotate tokens based on rsyntax queries

udpipe_quote_tqueries

Get a list of tqueries for extracting quotes

search_contexts

Search for documents or sentences using Boolean queries

set_network_attributes

Set some default network attributes for pretty plotting

Create a tCorpus using udpipe

Simple Good Turing smoothing

Reconstruct original texts

Document similarity

tCorpus_features

Preprocessing, subsetting and analyzing features

tCorpus$code_dictionary

Dictionary lookup

tCorpus$deduplicate

Deduplicate documents

tCorpus$preprocess

Preprocess feature

udpipe_simplify

Simplify tokenIndex created with the udpipe parser

tCorpus$delete_columns

Delete column from the data and meta data

udpipe_spanquote_tqueries

Get a list of tqueries for finding candidates for span quotes.

tokenWindowOccurence

Gives the window in which a term occurred in a matrix.

tCorpus$replace_dictionary

Replace tokens with dictionary match

tCorpus: a corpus class for tokenized texts

tokens_to_tcorpus

Create a tCorpus based on tokens (i.e. preprocessed texts)

tCorpus_compare

Corpus comparison

Show top features

transform_rsyntax

Apply rsyntax transformations

add_multitoken_label

Choose and add multitoken strings based on multitoken categories