

text

An R package for analyzing natural language with transformer models from Hugging Face, using natural language processing and machine learning.

The text-package has two main objectives:

  • First, to serve R-users as a point solution for transforming text to state-of-the-art word embeddings that are ready to be used for downstream tasks. The package provides a user-friendly link to language models based on transformers from Hugging Face.

  • Second, to serve as an end-to-end solution that provides state-of-the-art AI techniques tailored for social and behavioral scientists.

Text is created through a collaboration between psychology and computer science to address research needs and ensure state-of-the-art techniques. It provides powerful functions tailored to test research hypotheses in the social and behavioral sciences for both relatively small and large datasets. Text is continuously tested on Ubuntu, macOS, and Windows using the latest stable R version.

Please reference our tutorial article when using the package: The text-package: An R-package for Analyzing and Visualizing Human Language Using Natural Language Processing and Deep Learning.

Short installation guide

Most users simply need to run the installation code below. If you experience problems or want more alternatives, please see the Extended Installation Guide.

For the text-package to work, you first have to install the text-package in R, and then install and initialize the Python packages that text requires.

  1. Install the text package (at the moment the second step only works with the development version of text from GitHub).

GitHub development version:

# install.packages("devtools")
devtools::install_github("oscarkjell/text")

CRAN version:

install.packages("text")
  2. Install and initialize the Python packages that text requires:

library(text)

# Install text required python packages in a conda environment (with defaults).
textrpp_install()

# Initialize the installed conda environment.
# save_profile = TRUE saves the settings so that you don't have to run textrpp_initialize() after restarting R. 
textrpp_initialize(save_profile = TRUE)

Point solution for transforming text to embeddings

Recent significant advances in NLP research have resulted in improved representations of human language (i.e., language models). These language models have produced big performance gains in tasks related to understanding human language. Text makes these state-of-the-art models easily accessible through an interface to Hugging Face in Python.

Text provides many of the contemporary state-of-the-art language models that are based on deep learning to model word order and context. Language models can also be multilingual, representing several languages at once; for example, multilingual BERT comprises 104 different languages.

Table 1. Some of the available language models

Models                           References            Layers  Dimensions  Language
'bert-base-uncased'              Devlin et al. 2019    12      768         English
'roberta-base'                   Liu et al. 2019       12      768         English
'distilbert-base-cased'          Sanh et al. 2019      6       768         English
'bert-base-multilingual-cased'   Devlin et al. 2019    12      768         104 top languages at Wikipedia
'xlm-roberta-large'              Conneau et al. 2019   24      1024        100 languages

See HuggingFace for a more comprehensive list of models.

The textEmbed() function is the main embedding function in text. It can output contextualized embeddings for tokens (i.e., the embeddings for each single word instance of each text) and for texts (i.e., a single embedding per text, obtained by aggregating all token embeddings of the text).

library(text)
# Transform the text data to BERT word embeddings

# Example text
texts <- c("I feel great!")

# Defaults
embeddings <- textEmbed(texts)
embeddings

See Get Started for more information.

Language Analysis Tasks

It is also possible to access many language analysis tasks such as textClassify(), textGeneration(), and textTranslate().

library(text)

# Generate text from the prompt "I am happy to"
generated_text <- textGeneration("I am happy to",
                                 model = "gpt2")
generated_text

For a full list of language analysis tasks supported in text, see the References.

An end-to-end package

Text also provides functions to analyze the word embeddings with well-tested machine learning algorithms and statistics. The focus is on analyzing and visualizing texts and their relations to other texts or numerical variables. For example, the textTrain() function examines how well the word embeddings from a text can predict a numeric or categorical variable. Another example is the functions that plot statistically significant words in the word embedding space.

library(text) 
# Use data (DP_projections_HILS_SWLS_100) that have been pre-processed
# with the textProjection() function; this preprocessed test data is
# included in the package as DP_projections_HILS_SWLS_100.
plot_projection <- textProjectionPlot(
  word_data = DP_projections_HILS_SWLS_100,
  y_axes = TRUE,
  title_top = "Supervised Bicentroid Projection of Harmony in life words",
  x_axes_label = "Low vs. High HILS score",
  y_axes_label = "Low vs. High SWLS score",
  position_jitter_hight = 0.5,
  position_jitter_width = 0.8
)
plot_projection$final_plot
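
The textTrain() step that produces such projections can be sketched with the example data shipped in the package. This is a minimal illustration under stated assumptions, not a definitive recipe: the column names harmonywords and hilstotal, and the exact structure of word_embeddings_4 (e.g., $texts$harmonywords), are taken from the package's example data and may differ across versions.

```r
library(text)

# Train a ridge regression model to predict a numeric rating-scale score
# (hilstotal) from pre-computed word embeddings, with cross-validation
# handled internally by textTrain().
# NOTE: word_embeddings_4 and Language_based_assessment_data_8 are example
# datasets included in the package and are assumed to align row-wise.
hils_model <- textTrain(
  x = word_embeddings_4$texts$harmonywords,
  y = Language_based_assessment_data_8$hilstotal
)

# Cross-validated correlation between predicted and observed scores.
hils_model$results
```

The returned model object can then be passed to textPredict() to score new embeddings or texts.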

Monthly Downloads: 2,754

Version: 1.2.3

License: GPL-3

Maintainer: Oscar Kjell

Last Published: July 29th, 2024

Functions in text (1.2.3)

textDescriptives

Compute descriptive statistics of character variables.
textModelLayers

Get the number of layers in a given model.
textModels

Check downloaded, available models.
textFineTuneTask

Task Adapted Pre-Training (EXPERIMENTAL - under development)
textGeneration

Predicts the words that will follow a specified text prompt. (experimental)
textFineTuneDomain

Domain Adapted Pre-Training (EXPERIMENTAL - under development)
textEmbedStatic

Applies word embeddings from a given decontextualized static space (such as from Latent Semantic Analyses) to all character variables
textModelsRemove

Delete a specified model and model associated files.
textNER

Named Entity Recognition. (experimental)
textPCA

Compute 2 PCA dimensions of the word embeddings for individual words.
textPredictTest

Significance testing of correlations. If only y1 is provided, a t-test is computed between the absolute errors from yhat1-y1 and yhat2-y1.
textProjectionPlot

Plot words according to Supervised Dimension Projection.
textPCAPlot

Plot words according to 2-D plot from 2 PCA components.
textProjection

Compute Supervised Dimension Projection and related variables for plotting words.
textQA

Question Answering. (experimental)
textSimilarity

Compute the semantic similarity between two text variables.
textPredictAll

Predict from several models, selecting the correct input
textSimilarityMatrix

Compute semantic similarity scores between all combinations in a word embedding
textPredict

Trained models created by, e.g., textTrain() or stored on, e.g., GitHub can be used to predict new scores or classes from embeddings or text using textPredict().
textPlot

Plot words from textProjection() or textWordPrediction().
textTopicsWordcloud

This function plots word clouds of topics from a topic model, based on their significance as determined by a linear or binary regression.
textTrain

Train word embeddings to a numeric (ridge regression) or categorical (random forest) variable.
textTopicsTest

This function tests the relationship between a single topic or all topics and a variable of interest. Available tests include correlation, t-test, linear regression, binary regression, and ridge regression. (EXPERIMENTAL - under development)
textTopicsTree

textTopicsTest (EXPERIMENTAL) to get the hierarchical topic tree
textTrainN

(experimental) Compute cross-validated correlations for different sample-sizes of a data set. The cross-validation process can be repeated several times to enhance the reliability of the evaluation.
textTrainLists

Individually trains word embeddings from several text variables to several numeric or categorical variables.
textSimilarityNorm

Compute the semantic similarity between a text variable and a word norm (i.e., a text represented by one word embedding that represents a construct).
textTokenize

Tokenize according to different Hugging Face transformers.
textSum

Summarize texts. (experimental)
textZeroShot

Zero Shot Classification (Experimental)
textWordPrediction

Compute predictions based on single words for plotting words. The word embeddings of single words are trained to predict the mean value associated with that word. P-values do NOT work yet (experimental).
textTrainNPlot

(experimental) Plot cross-validated correlation coefficients across different sample sizes from the object returned by the textTrainN function. If the number of cross-validations exceeds one, then error bars will be included in the plot.
textTrainRandomForest

Train word embeddings to a categorical variable using random forest.
textrpp_uninstall

Uninstall textrpp conda environment
word_embeddings_4

Word embeddings for 4 text variables for 40 participants
textTopics

This function creates and trains a BERTopic model (based on the bertopic Python package) on a text variable in a tibble/data.frame. (EXPERIMENTAL)
textrpp_initialize

Initialize text required python packages
textTopicsReduce

textTopicsReduce (EXPERIMENTAL)
textrpp_install

Install text required python packages in conda or virtualenv environment
textTrainRegression

Train word embeddings to a numeric variable.
textTranslate

Translation. (experimental)
find_textrpp_env

Find text required Python packages env
textCentrality

Compute semantic similarity score between single words' word embeddings and the aggregated word embedding of all words.
textCentralityPlot

Plot words according to semantic similarity to the aggregated word embedding.
textClassify

Predict label and probability of a text using a pretrained classifier language model. (experimental)
textDistance

Compute the semantic distance between two text variables.
textEmbed

Extract layers and aggregate them to word embeddings, for all character variables in a given dataframe.
textEmbedRawLayers

Extract layers of hidden states (word embeddings) for all character variables in a given dataframe.
PC_projections_satisfactionwords_40

Example data for plotting a Principal Component Projection Plot.
centrality_data_harmony

Example data for plotting a Semantic Centrality Plot.
DP_projections_HILS_SWLS_100

Data for plotting a Dot Product Projection Plot.
Language_based_assessment_data_3_100

Example text and numeric data.
raw_embeddings_1

Word embeddings from textEmbedRawLayers function
Language_based_assessment_data_8

Text and numeric data for 10 participants.
find_textrpp

Find text required python packages
textDistanceMatrix

Compute semantic distance scores between all combinations in a word embedding
textEmbedLayerAggregation

Select and aggregate layers of hidden states to form a word embedding.
textDimName

Change the names of the dimensions in the word embeddings.
textDistanceNorm

Compute the semantic distance between a text variable and a word norm (i.e., a text represented by one word embedding that represents a construct/concept).
textEmbedReduce

Pre-trained dimension reduction (experimental)