title: "stanza: An R Interface to the Stanford NLP Toolkit" date: "2025-05-16" output: github_document

Overview

The stanza package provides an R interface to the Stanford NLP Group's Stanza Python library, a collection of tools for natural language processing in many human languages. With stanza, you can (see the end-to-end sketch after this list):

  • Tokenize text into sentences, words, and multi-word tokens
  • Perform part-of-speech tagging
  • Extract lemmas (base forms) of words
  • Identify named entities (people, locations, organizations, etc.)
  • Parse syntactic dependencies
  • And more!
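
A typical session looks roughly like the following sketch; every step (installation, initialization, model download, pipeline construction, extraction) is covered in detail in the sections below.

library("stanza")
stanza_initialize(virtualenv = "stanza")   # see "Installation" and "Getting Started"
stanza_download("en")                      # one-time model download

p <- stanza_pipeline(language = "en", processors = "tokenize,mwt,pos,lemma,ner")
doc <- p("Stanza was developed at Stanford University.")
sents(doc)      # sentence and word tokens
entities(doc)   # named entities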

Installation

Step 1: Install the R package

First, install the stanza R package from CRAN:

install.packages("stanza")

Step 2: Install the Python backend

You can install the Python backend either into a virtualenv (recommended):

library("stanza")
virtualenv_install_stanza()

or into a conda environment, if you prefer:

library("stanza")
conda_install_stanza()

Environment variables

Make sure that pip is installed alongside the Python version you choose. To point the virtualenv at a specific Python installation, set the environment variable RETICULATE_PYTHON. For example, when testing on Windows, RETICULATE_PYTHON was set to "C:/apps/Python/python.exe" during the installation:

python_path <- normalizePath("C:/apps/Python/python.exe")
Sys.setenv(RETICULATE_PYTHON = python_path)
library("stanza")

virtualenv_install_stanza()

After the installation, however,

library("stanza")
stanza_initialize(virtualenv = "stanza")
stanza_options()
stanza_download("en")

is sufficient, since the virtualenv at "~/.virtualenvs/stanza" is then detected automatically. If RETICULATE_PYTHON is still set to "C:/apps/Python/python.exe", the correct environment is not found and stanza cannot be loaded.
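
One way to avoid this mismatch is to clear the variable (or remove it from .Renviron) once the installation has finished, for example:

Sys.unsetenv("RETICULATE_PYTHON")         # drop the install-time override
library("stanza")
stanza_initialize(virtualenv = "stanza")  # "~/.virtualenvs/stanza" is now found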

Getting Started

Load the package and initialize

library("stanza")
stanza_initialize(virtualenv = "stanza")
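
To verify that the Python backend was found, the helpers is_stanza_initialized() and stanza_version() exported by the package can be used:

is_stanza_initialized()   # should return TRUE once the backend is available
stanza_version()          # version of the underlying Python library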

Download language models

Before processing text, you need to download language models. Stanza supports over 70 languages; the language codes and the performance of the models are listed on the Stanza homepage.

To download the English model:

stanza_download("en")

Similarly, for German:

stanza_download("de")

Building a Pipeline

A natural language processing pipeline can be created by specifying the language and desired processors as a comma-separated string:

processors <- 'tokenize,ner,lemma,pos,mwt'
p <- stanza_pipeline(language = "en", processors = processors)

The Stanza documentation provides detailed information on all available processors:

  • tokenize: Split text into sentences and words
  • mwt: Expand multi-word tokens
  • pos: Part-of-speech tagging
  • lemma: Lemmatization
  • ner: Named entity recognition
  • depparse: Dependency parsing (see the sketch after this list)
  • And more
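
For example, a pipeline that additionally runs the dependency parser can be requested with the same comma-separated convention; a sketch (depparse needs the tokenize, mwt, pos and lemma annotations as input):

processors_dep <- 'tokenize,mwt,pos,lemma,depparse'
p_dep <- stanza_pipeline(language = "en", processors = processors_dep)
doc_dep <- p_dep('R is a collaborative project with many contributors.')
doc_dep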

Using specific models for processors

To select specific models for each processor, use a named list:

processors_specific <- list(tokenize = 'gsd', pos = 'hdt', ner = 'conll03', lemma = 'default')
p_specific <- stanza_pipeline(language = "en", processors = processors_specific)

Processing Text

The stanza_pipeline() function returns a pipeline function that transforms text into annotated document objects:

doc <- p('R is a collaborative project with many contributors.')
doc
#> <stanza_document>
#>   number of sentences: 1
#>   number of tokens: 9
#>   number of words: 9

# Using the pipeline with specific processor models
doc_specific <- p_specific('R is a collaborative project with many contributors.')
doc_specific
#> <stanza_document>
#>   number of sentences: 1
#>   number of tokens: 9
#>   number of words: 9

Extracting Results

Stanza provides several helper functions to extract different types of information from the processed documents:

Sentences

sents(doc)
#> [[1]]
#> [1] "R"             "is"            "a"             "collaborative"
#> [5] "project"       "with"          "many"          "contributors" 
#> [9] "."

Words with linguistic features

words(doc)
#> [1] "R"             "is"            "a"             "collaborative"
#> [5] "project"       "with"          "many"          "contributors" 
#> [9] "."

Tokens

tokens(doc)
#> [1] "R"             "is"            "a"             "collaborative"
#> [5] "project"       "with"          "many"          "contributors" 
#> [9] "."

Named entities

entities(doc)
#> list()
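
The example sentence simply contains no named entities, which is why the result is an empty list. A sentence that mentions people, organizations, or places fills it; a sketch using the same pipeline (output omitted):

doc_ner <- p('Angela Merkel met Emmanuel Macron in Paris.')
entities(doc_ner)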

Multi-word tokens

multi_word_token(doc)
#>   tid wid         token          word
#> 1   1   1             R             R
#> 2   2   2            is            is
#> 3   3   3             a             a
#> 4   4   4 collaborative collaborative
#> 5   5   5       project       project
#> 6   6   6          with          with
#> 7   7   7          many          many
#> 8   8   8  contributors  contributors
#> 9   9   9             .             .
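
English has hardly any multi-word tokens, so every token above maps to exactly one word. In languages such as German, contractions are expanded into several words; a sketch, assuming the German models were downloaded with stanza_download("de") as shown earlier:

p_de <- stanza_pipeline(language = "de", processors = 'tokenize,mwt')
doc_de <- p_de('Er geht zum Arzt.')   # "zum" expands to "zu" + "dem"
multi_word_token(doc_de)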

Version: 1.0-3
License: GPL-3
Maintainer: Florian Schwendinger
Last Published: June 2nd, 2025

Functions in stanza (1.0-3)

  • stanza_pipeline: NLP Pipeline
  • conda_install_stanza: Conda Install Stanza
  • stanza_initialize: Initialize Stanza
  • stanza_version: Stanza Version
  • is_stanza_initialized: Check if Stanza is Initialized
  • multi_word_token: Multi-Word Token
  • stanza_download: Download Models
  • stanza_download_method_code: Select Download Method
  • entities: Entities
  • tokens: Tokens
  • virtualenv_install_stanza: Install Stanza via Virtual Environment
  • stanza_options: Options