text

Overview

An R package for analyzing natural language with transformer-based large language models. The text package is part of the R Language Analysis Suite, which comprises talk, text and topics.

  • talk transforms voice recordings into text, audio features, or embeddings.
  • text provides many language tasks such as converting digital text into word embeddings. talk and text offer access to Large Language Models from Hugging Face.
  • topics visualizes language patterns into topics to generate psychological insights.

The R Language Analysis Suite is developed through a collaboration between psychology and computer science to address research needs and ensure state-of-the-art techniques. The suite is continuously tested on Ubuntu, macOS and Windows using the latest stable R version.

The text-package has two main objectives:

  • First, to serve R-users as a point solution for transforming text to state-of-the-art word embeddings that are ready to be used for downstream tasks. The package provides a user-friendly link to language models based on transformers from Hugging Face.
  • Second, to serve as an end-to-end solution that provides state-of-the-art AI techniques tailored for social and behavioral scientists.

Please reference our tutorial article when using the text package: The text-package: An R-package for Analyzing and Visualizing Human Language Using Natural Language Processing and Deep Learning.

Short installation guide

Most users simply need to run the installation code below. If you experience problems or want more alternatives, please see the Extended Installation Guide.

For the text-package to work, you first have to install the text-package in R, and then install and initialize the Python packages that text requires.

  1. Install the text package (at the moment, the second step only works with the development version of text from GitHub).

GitHub development version:

# install.packages("devtools")
devtools::install_github("oscarkjell/text")

CRAN version:

install.packages("text")

  2. Install and initialize the Python packages required by text:

library(text)

# Install text required python packages in a conda environment (with defaults).
textrpp_install()

# Initialize the installed conda environment.
# save_profile = TRUE saves the settings so that you don't have to run textrpp_initialize() after restarting R. 
textrpp_initialize(save_profile = TRUE)

Point solution for transforming text to embeddings

Recent significant advances in NLP research have resulted in improved representations of human language (i.e., language models). These language models have produced big performance gains in tasks related to understanding human language. The text package makes these state-of-the-art (SOTA) models easily accessible through an interface to HuggingFace in Python.

The text package provides many of the contemporary state-of-the-art language models that are based on deep learning to model word order and context. Multilingual language models can also represent several languages; multilingual BERT covers 104 different languages.

Table 1. Some of the available language models

Models                          References          Layers  Dimensions  Language
‘bert-base-uncased’             Devlin et al. 2019  12      768         English
‘roberta-base’                  Liu et al. 2019     12      768         English
‘distilbert-base-cased’         Sanh et al. 2019    6       768         English
‘bert-base-multilingual-cased’  Devlin et al. 2019  12      768         104 top languages at Wikipedia
‘xlm-roberta-large’             Liu et al           24      1024        100 languages

See HuggingFace for a more comprehensive list of models.

The textEmbed() function is the main embedding function in text. It can output contextualized embeddings for tokens (i.e., one embedding for each single word instance of each text) and for texts (i.e., a single embedding per text, obtained by aggregating all token embeddings of the text).

library(text)
# Transform the text data to BERT word embeddings

# Example text
texts <- c("I feel great!")

# Defaults
embeddings <- textEmbed(texts)
embeddings

See Get Started for more information.
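
The difference between token embeddings and text embeddings can be illustrated with a minimal base R sketch. It does not call the text package: the embedding values are made up, three dimensions stand in for the 768 of a BERT-base model, and averaging is assumed here as the aggregation method for the sake of illustration.

```r
# Toy token embeddings for the text "I feel great!".
# The values are hypothetical; a real BERT-base embedding has 768 dimensions.
token_embeddings <- rbind(
  I     = c(0.2, -0.1,  0.4),
  feel  = c(0.0,  0.3,  0.1),
  great = c(0.7,  0.1, -0.2)
)

# One embedding per text: average each dimension across the tokens
# (mean aggregation is an assumption made for this sketch).
text_embedding <- colMeans(token_embeddings)
text_embedding
```

Each row of the matrix corresponds to a token-level embedding, and the resulting vector corresponds to a single text-level embedding.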

Language Analysis Tasks

It is also possible to access many language analysis tasks such as textClassify(), textGeneration(), and textTranslate().

library(text)

# Generate text from the prompt "I am happy to"
generated_text <- textGeneration("I am happy to",
                                 model = "gpt2")
generated_text

For a full list of language analysis tasks supported in text, see the References.

An end-to-end package

The text package also provides functions to analyse the word embeddings with well-tested machine learning algorithms and statistics. The focus is on analyzing and visualizing texts and their relation to other texts or numerical variables. For example, the textTrain() function is used to examine how well the word embeddings from a text can predict a numeric or categorical variable. Other functions plot statistically significant words in the word embedding space.

library(text) 
# Use data that have been pre-processed with the textProjectionData function;
# the pre-processed test data included in the package is called DP_projections_HILS_SWLS_100.
plot_projection <- textProjectionPlot(
  word_data = DP_projections_HILS_SWLS_100,
  y_axes = TRUE,
  title_top = " Supervised Bicentroid Projection of Harmony in life words",
  x_axes_label = "Low vs. High HILS score",
  y_axes_label = "Low vs. High SWLS score",
  position_jitter_hight = 0.5,
  position_jitter_width = 0.8
)
plot_projection$final_plot
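
The idea behind textTrain() described above, examining how well embeddings predict a numeric variable, can be sketched in base R without the text package. Everything in this sketch is an assumption made for illustration: the embeddings and the outcome are simulated, and ordinary least squares with a single hold-out split stands in for whatever models and cross-validation the package actually uses.

```r
set.seed(1)

# Simulated stand-ins: 100 "texts" with 5-dimensional embeddings and a
# numeric outcome that depends on the first two dimensions plus noise.
n <- 100
embeddings <- matrix(rnorm(n * 5), nrow = n,
                     dimnames = list(NULL, paste0("Dim", 1:5)))
outcome <- 2 * embeddings[, "Dim1"] - embeddings[, "Dim2"] + rnorm(n, sd = 0.5)
data <- data.frame(outcome = outcome, embeddings)

# Hold out the last 30 rows to evaluate out-of-sample prediction.
train <- data[1:70, ]
test  <- data[71:100, ]

# Ordinary least squares as a simple stand-in model (not the package's method).
fit <- lm(outcome ~ ., data = train)
predicted <- predict(fit, newdata = test)

# Correlation between predicted and observed values in the held-out rows.
cor(predicted, test$outcome)
```

Evaluating on held-out rows rather than on the training rows is the key design choice: it shows how well the embeddings predict the variable for texts the model has not seen.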

Featured Bluesky Post

Version 1.3 of the #r-text package is now available from #CRAN.

This new version makes it easier to apply pre-trained language assessments from the #LBAM-library (r-text.org/articles/LBA…).

#mlsky #PsychSciSky #Statistics #PsychSciSky #StatsSky #NLP

— Oscar Kjell (@oscarkjell.bsky.social) Dec 22, 2024 at 9:48

Install

install.packages('text')

Monthly Downloads

1,704

Version

1.7.0

License

GPL-3

Maintainer

Oscar Kjell

Last Published

September 1st, 2025

Functions in text (1.7.0)

textDomainCompare

Compare two language domains
textDescriptives

Compute descriptive statistics of character variables.
textDimName

Change dimension names
textDiagnostics

Run diagnostics for the text package
textDistance

Semantic distance
textEmbedLayerAggregation

Aggregate layers
textFineTuneTask

Task Adapted Pre-Training (EXPERIMENTAL - under development)
textFindNonASCII

Detect non-ASCII characters
textGeneration

Text generation
textModelLayers

Number of layers
textEmbedRawLayers

Extract layers of hidden states
textExamples

Identify language examples.
textLBAM

The LBAM library
textFineTuneDomain

Domain Adapted Pre-Training (EXPERIMENTAL - under development)
textEmbedStatic

Apply static word embeddings
textEmbedReduce

Pre-trained dimension reduction (experimental)
textNER

Named Entity Recognition. (experimental)
textPCA

textPCA()
textPredictAll

Predict from several models, selecting the correct input
textModels

Check downloaded, available models.
textModelsRemove

Delete a specified model
textPredictTest

Significance testing for model prediction performance
textPredict

textPredict, textAssess and textClassify
textProjection

Supervised Dimension Projection
textPCAPlot

textPCAPlot
textPlot

Plot words
textTokenize

Tokenize text-variables
textTopics

BERTopics
textTopicsReduce

textTopicsReduce (EXPERIMENTAL)
textSimilarityMatrix

Semantic similarity across multiple word embeddings
textSum

Summarize texts. (experimental)
textQA

Question Answering. (experimental)
textTokenizeAndCount

Tokenize and count
textProjectionPlot

Plot Supervised Dimension Projection
textSimilarityNorm

Semantic similarity between a text variable and a word norm
textSimilarity

Semantic Similarity
textTrainRegression

Train word embeddings to a numeric variable.
textTranslate

Translation. (experimental)
textTrainLists

Train lists of word embeddings
textTrain

Trains word embeddings
textTopicsWordcloud

Plot word clouds
textTopicsTest

Wrapper for topicsTest function from the topics package
textTrainNPlot

Plot cross-validated accuracies across sample sizes
textTrainN

Cross-validated accuracies across sample-sizes
textTrainRandomForest

Trains word embeddings using random forest
textTopicsTree

textTopicsTest (EXPERIMENTAL) to get the hierarchical topic tree
word_embeddings_4

Word embeddings for 4 text variables for 40 participants
textrpp_uninstall

Uninstall textrpp conda environment
textrpp_install

Install text required python packages in conda or virtualenv environment
textZeroShot

Zero Shot Classification (Experimental)
textrpp_initialize

Initialize text required python packages
centrality_data_harmony

Example data for plotting a Semantic Centrality Plot.
textClean

Cleans text from standard personal information
Language_based_assessment_data_8

Text and numeric data for 10 participants.
raw_embeddings_1

Word embeddings from textEmbedRawLayers function
DP_projections_HILS_SWLS_100

Data for plotting a Dot Product Projection Plot.
textCentralityPlot

Plots words from textCentrality()
textCentrality

Semantic similarity score between single words' embeddings and an aggregated word embedding
find_textrpp_env

Find text required python packages env
Language_based_assessment_data_3_100

Example text and numeric data.
PC_projections_satisfactionwords_40

Example data for plotting a Principal Component Projection Plot.
textCleanNonASCII

Clean non-ASCII characters
textDistanceNorm

Semantic distance between a text variable and a word norm
textEmbed

textEmbed() extracts layers and aggregates them to word embeddings, for all character variables in a given dataframe.
textDistanceMatrix

Semantic distance across multiple word embeddings