---
title: "stanza: An R Interface to the Stanford NLP Toolkit"
date: "2025-05-16"
output: github_document
---
## Overview
The stanza package provides an R interface to the Stanford NLP Group's Stanza Python library, a collection of tools for natural language processing in many human languages. With stanza, you can:
- Tokenize text into sentences, words, and multi-word tokens
- Perform part-of-speech tagging
- Extract lemmas (base forms) of words
- Identify named entities (people, locations, organizations, etc.)
- Parse syntactic dependencies
- And more!
## Installation

### Step 1: Install the R package
First, install the stanza R package from CRAN:
```r
install.packages("stanza")
```
### Step 2: Install the Python backend
You can install the Python package using either virtualenv (recommended):
```r
library("stanza")
virtualenv_install_stanza()
```
Or using conda if you prefer:
```r
library("stanza")
conda_install_stanza()
```
### Environment variables
Make sure that pip is installed along with the Python version you choose.
To use a specific Python installation for the virtualenv, set the environment variable `RETICULATE_PYTHON`. For example, when testing on Windows, we set `RETICULATE_PYTHON` to `"C:/apps/Python/python.exe"`:
```r
python_path <- normalizePath("C:/apps/Python/python.exe")
Sys.setenv(RETICULATE_PYTHON = python_path)
library("stanza")
virtualenv_install_stanza()
```
This setting is only needed during the installation. After the installation,

```r
library("stanza")
stanza_initialize(virtualenv = "stanza")
stanza_options()
stanza_download("en")
```

is sufficient, since the virtual environment in `"~\\.virtualenvs\\stanza"` is then detected automatically. However, if `RETICULATE_PYTHON` is still set to `"C:/apps/Python/python.exe"`, the correct environment is not found and stanza cannot be loaded.
## Getting Started

Load the package and initialize the connection to the Python backend:
```r
library("stanza")
stanza_initialize(virtualenv = "stanza")
```
### Download language models

Before processing text, you need to download language models. Stanza supports over 70 languages; the language codes and the performance of the models can be found on the Stanza homepage.
To download the English model:
```r
stanza_download("en")
```
Similarly, for German:
```r
stanza_download("de")
```
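Once downloaded, a model can be used like any other language. As a quick check, the following sketch builds a small German pipeline (the example sentence is our own):

```r
p_de <- stanza_pipeline(language = "de", processors = "tokenize,pos")
doc_de <- p_de("R ist ein kollaboratives Projekt mit vielen Mitwirkenden.")
sents(doc_de)
```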
## Building a Pipeline
A natural language processing pipeline can be created by specifying the language and desired processors as a comma-separated string:
```r
processors <- 'tokenize,ner,lemma,pos,mwt'
p <- stanza_pipeline(language = "en", processors = processors)
```
The Stanza documentation provides detailed information on all available processors:
- `tokenize`: Split text into sentences and words
- `mwt`: Expand multi-word tokens
- `pos`: Part-of-speech tagging
- `lemma`: Lemmatization
- `ner`: Named entity recognition
- `depparse`: Dependency parsing
- And more
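For instance, a pipeline that additionally performs dependency parsing can be requested by adding `depparse` to the processor string; this is a sketch, assuming the English model is downloaded (in Stanza, `depparse` also needs the `tokenize`, `pos`, and `lemma` processors):

```r
processors_dep <- "tokenize,pos,lemma,depparse"
p_dep <- stanza_pipeline(language = "en", processors = processors_dep)
doc_dep <- p_dep("R is a collaborative project with many contributors.")
```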
### Using specific models for processors
To select specific models for each processor, use a named list:
```r
processors_specific <- list(tokenize = 'gsd', pos = 'hdt', ner = 'conll03', lemma = 'default')
p_specific <- stanza_pipeline(language = "en", processors = processors_specific)
```
## Processing Text

The `stanza_pipeline()` function returns a pipeline function that transforms text into annotated document objects:
```r
doc <- p('R is a collaborative project with many contributors.')
doc
#> <stanza_document>
#> number of sentences: 1
#> number of tokens: 9
#> number of words: 9
```
```r
# Using the pipeline with specific processor models
doc_specific <- p_specific('R is a collaborative project with many contributors.')
doc_specific
#> <stanza_document>
#> number of sentences: 1
#> number of tokens: 9
#> number of words: 9
```
## Extracting Results
Stanza provides several helper functions to extract different types of information from the processed documents:
### Sentences

```r
sents(doc)
#> [[1]]
#> [1] "R" "is" "a" "collaborative"
#> [5] "project" "with" "many" "contributors"
#> [9] "."
```
### Words with linguistic features

```r
words(doc)
#> [1] "R" "is" "a" "collaborative"
#> [5] "project" "with" "many" "contributors"
#> [9] "."
```
### Tokens

```r
tokens(doc)
#> [1] "R" "is" "a" "collaborative"
#> [5] "project" "with" "many" "contributors"
#> [9] "."
```
### Named entities

```r
entities(doc)
#> list()
```
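The example sentence contains no named entities, hence the empty list. A sentence mentioning people or places (the example text below is our own) should yield non-empty results, with the exact spans depending on the downloaded model:

```r
doc_ner <- p("Ada Lovelace worked with Charles Babbage in London.")
entities(doc_ner)
```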
### Multi-word tokens

```r
multi_word_token(doc)
#>   tid wid         token          word
#> 1   1   1             R             R
#> 2   2   2            is            is
#> 3   3   3             a             a
#> 4   4   4 collaborative collaborative
#> 5   5   5       project       project
#> 6   6   6          with          with
#> 7   7   7          many          many
#> 8   8   8  contributors  contributors
#> 9   9   9             .             .
```