---
title: "stanza: An R Interface to the Stanford NLP Toolkit"
date: "2025-05-16"
output: github_document
---
## Overview
The stanza package provides an R interface to the Stanford NLP Group's Stanza Python library, a collection of tools for natural language processing in many human languages. With stanza, you can:
- Tokenize text into sentences, words, and multi-word tokens
- Perform part-of-speech tagging
- Extract lemmas (base forms) of words
- Identify named entities (people, locations, organizations, etc.)
- Parse syntactic dependencies
- And more!
## Installation

### Step 1: Install the R package
First, install the stanza R package from CRAN:
```r
install.packages("stanza")
```
### Step 2: Install the Python backend
You can install the Python package using either virtualenv (recommended):
```r
library("stanza")
virtualenv_install_stanza()
```
Or using conda if you prefer:
```r
library("stanza")
conda_install_stanza()
```
### Environment variables
Make sure that pip is installed along with the Python version you choose.
To use a specific Python installation for the virtualenv, set the environment variable `RETICULATE_PYTHON`. For example, when testing on Windows, we set `RETICULATE_PYTHON` to `"C:/apps/Python/python.exe"`:
```r
python_path <- normalizePath("C:/apps/Python/python.exe")
Sys.setenv(RETICULATE_PYTHON = python_path)
library("stanza")
virtualenv_install_stanza()
```
This setting is only needed during the installation. After the installation,

```r
library("stanza")
stanza_initialize(virtualenv = "stanza")
stanza_options()
stanza_download("en")
```

is sufficient, since the virtual environment in `"~\\.virtualenvs\\stanza"` is then detected automatically. However, if `RETICULATE_PYTHON` is still set to `"C:/apps/Python/python.exe"`, the correct environment is not found and stanza cannot be loaded.
## Getting Started

Load the package and initialize the connection to the Python backend:
```r
library("stanza")
stanza_initialize(virtualenv = "stanza")
```
### Download language models

Before processing text, you need to download language models. Stanza supports over 70 languages; the language codes and the performance of the models can be found on the Stanza homepage.
To download the English model:
```r
stanza_download("en")
```
Similarly, for German:
```r
stanza_download("de")
```
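Once downloaded, a model can be used like any other language. As a quick check, the following sketch builds a small German pipeline (the example sentence is our own):

```r
p_de <- stanza_pipeline(language = "de", processors = "tokenize,pos")
doc_de <- p_de("R ist ein kollaboratives Projekt mit vielen Mitwirkenden.")
sents(doc_de)
```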
## Building a Pipeline
A natural language processing pipeline can be created by specifying the language and desired processors as a comma-separated string:
```r
processors <- 'tokenize,ner,lemma,pos,mwt'
p <- stanza_pipeline(language = "en", processors = processors)
```
The Stanza documentation provides detailed information on all available processors:
- `tokenize`: Split text into sentences and words
- `mwt`: Expand multi-word tokens
- `pos`: Part-of-speech tagging
- `lemma`: Lemmatization
- `ner`: Named entity recognition
- `depparse`: Dependency parsing
- And more
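For instance, a pipeline that additionally performs dependency parsing can be requested by adding `depparse` to the processor string; this is a sketch, assuming the English model is downloaded (in Stanza, `depparse` also needs the `tokenize`, `pos`, and `lemma` processors):

```r
processors_dep <- "tokenize,pos,lemma,depparse"
p_dep <- stanza_pipeline(language = "en", processors = processors_dep)
doc_dep <- p_dep("R is a collaborative project with many contributors.")
```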
### Using specific models for processors
To select specific models for each processor, use a named list:
```r
processors_specific <- list(tokenize = 'gsd', pos = 'hdt', ner = 'conll03', lemma = 'default')
p_specific <- stanza_pipeline(language = "en", processors = processors_specific)
```
## Processing Text

The `stanza_pipeline()` function returns a pipeline function that transforms text into annotated document objects:
```r
doc <- p('R is a collaborative project with many contributors.')
doc
#> <stanza_document>
#> number of sentences: 1
#> number of tokens: 9
#> number of words: 9
```
```r
# Using the pipeline with specific processor models
doc_specific <- p_specific('R is a collaborative project with many contributors.')
doc_specific
#> <stanza_document>
#> number of sentences: 1
#> number of tokens: 9
#> number of words: 9
```
## Extracting Results
Stanza provides several helper functions to extract different types of information from the processed documents:
### Sentences

```r
sents(doc)
#> [[1]]
#> [1] "R" "is" "a" "collaborative"
#> [5] "project" "with" "many" "contributors"
#> [9] "."
```
### Words with linguistic features

```r
words(doc)
#> [1] "R" "is" "a" "collaborative"
#> [5] "project" "with" "many" "contributors"
#> [9] "."
```
### Tokens

```r
tokens(doc)
#> [1] "R" "is" "a" "collaborative"
#> [5] "project" "with" "many" "contributors"
#> [9] "."
```
### Named entities

```r
entities(doc)
#> list()
```
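The example sentence contains no named entities, hence the empty list. A sentence mentioning people or places (the example text below is our own) should yield non-empty results, with the exact spans depending on the downloaded model:

```r
doc_ner <- p("Ada Lovelace worked with Charles Babbage in London.")
entities(doc_ner)
```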
### Multi-word tokens

```r
multi_word_token(doc)
#>   tid wid         token          word
#> 1   1   1             R             R
#> 2   2   2            is            is
#> 3   3   3             a             a
#> 4   4   4 collaborative collaborative
#> 5   5   5       project       project
#> 6   6   6          with          with
#> 7   7   7          many          many
#> 8   8   8  contributors  contributors
#> 9   9   9             .             .
```