udpipe v0.3





Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

This natural language processing toolkit provides language-agnostic 'tokenization', 'parts of speech tagging', 'lemmatization' and 'dependency parsing' of raw text. Next to text parsing, the package also allows you to train annotation models based on data of 'treebanks' in 'CoNLL-U' format as provided at <http://universaldependencies.org/format.html>. The techniques are explained in detail in the paper: 'Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe', available at <doi:10.18653/v1/K17-3009>.


udpipe - R package for Tokenization, Tagging, Lemmatization and Dependency Parsing Based on UDPipe

This repository contains an R package which is an Rcpp wrapper around the UDPipe C++ library (http://ufal.mff.cuni.cz/udpipe, https://github.com/ufal/udpipe).

  • UDPipe provides language-agnostic tokenization, tagging, lemmatization and dependency parsing of raw text, which is an essential part of natural language processing.
  • The techniques used are explained in detail in the paper: "Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe", available at http://ufal.mff.cuni.cz/~straka/papers/2017-conll_udpipe.pdf. In that paper, you'll also find accuracies on different languages and process flow speed (measured in words per second).


The udpipe R package was designed with the following things in mind when building the Rcpp wrapper around the UDPipe C++ library:

  • Give R users simple access in order to easily tokenize, tag, lemmatize or perform dependency parsing on text in any language
  • Provide easy access to pre-trained annotation models
  • Allow R users to easily construct their own annotation model based on data in CONLL-U format as provided in more than 60 treebanks available at http://universaldependencies.org/#ud-treebanks
  • Don't rely on Python or Java so that R users can easily install this package without configuration hassle
  • No external R package dependencies beyond the strictly necessary (Rcpp and data.table, no tidyverse)

Installation & License

The package is available under the Mozilla Public License Version 2.0. The released version can be installed from CRAN with install.packages("udpipe"). Please visit the package documentation and the package vignettes for further details:

vignette("udpipe-tryitout", package = "udpipe")
vignette("udpipe-annotation", package = "udpipe")
vignette("udpipe-train", package = "udpipe")

To install the development version of this package, run: devtools::install_github("bnosac/udpipe", build_vignettes = TRUE)


Currently the package allows you to do tokenization, tagging, lemmatization and dependency parsing with one convenient function called udpipe_annotate:

dl <- udpipe_download_model(language = "dutch")

language                                                                      file_model
   dutch C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dutch-ud-2.0-170801.udpipe

udmodel_dutch <- udpipe_load_model(file = dl$file_model)
x <- udpipe_annotate(udmodel_dutch, 
                     x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.")
x <- as.data.frame(x)
 doc_id paragraph_id sentence_id token_id token lemma  upos                     xpos                                                               feats head_token_id dep_rel deps
   doc1            1           1        1    Ik    ik  PRON        Pron|per|1|ev|nom                          Case=Nom|Number=Sing|Person=1|PronType=Prs             2   nsubj <NA>
   doc1            1           1        2  ging    ga  VERB V|intrans|ovt|1of2of3|ev Aspect=Imp|Mood=Ind|Number=Sing|Subcat=Intr|Tense=Past|VerbForm=Fin             0    root <NA>
   doc1            1           1        3    op    op   ADP                Prep|voor                                                        AdpType=Prep             4    case <NA>
   doc1            1           1        4  reis  reis  NOUN          N|soort|ev|neut                                                         Number=Sing             2     obj <NA>
   doc1            1           1        5    en    en CCONJ               Conj|neven                                                                <NA>             7      cc <NA>
   doc1            1           1        6    ik    ik  PRON        Pron|per|1|ev|nom                          Case=Nom|Number=Sing|Person=1|PronType=Prs             7   nsubj <NA>
   doc1            1           1        7   nam  neem  VERB   V|trans|ovt|1of2of3|ev Aspect=Imp|Mood=Ind|Number=Sing|Subcat=Tran|Tense=Past|VerbForm=Fin             2    conj <NA>
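
As a minimal sketch of what can be done with the resulting data frame, the base R snippet below re-creates three columns (token, lemma, upos) of the rows shown above by hand, so that it runs without a model. It then counts the part-of-speech tags and extracts the verb lemmas. The object names anno, upos_counts and verb_lemmas are illustrative only and are not part of the package.

```r
# Hand-copied subset of the annotation shown above (columns token, lemma, upos).
# In practice this would be the data frame returned by as.data.frame(x).
anno <- data.frame(
  token = c("Ik", "ging", "op", "reis", "en", "ik", "nam"),
  lemma = c("ik", "ga", "op", "reis", "en", "ik", "neem"),
  upos  = c("PRON", "VERB", "ADP", "NOUN", "CCONJ", "PRON", "VERB"),
  stringsAsFactors = FALSE
)

# How often does each universal part-of-speech tag occur?
upos_counts <- sort(table(anno$upos), decreasing = TRUE)
print(upos_counts)

# Lemmas of all verbs, in order of appearance.
verb_lemmas <- anno$lemma[anno$upos == "VERB"]
print(verb_lemmas)
```

Working on the lemma and upos columns rather than on the raw tokens is what makes the annotation useful for downstream text mining, since inflected forms such as "ging" and "nam" are reduced to their dictionary forms.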

Pre-trained models

Pre-trained Universal Dependencies 2.0 models on all UD treebanks are made available for more than 50 languages, namely:

afrikaans, ancient_greek-proiel, ancient_greek, arabic, basque, belarusian, bulgarian, catalan, chinese, coptic, croatian, czech-cac, czech-cltt, czech, danish, dutch-lassysmall, dutch, english-lines, english-partut, english, estonian, finnish-ftb, finnish, french-partut, french-sequoia, french, galician-treegal, galician, german, gothic, greek, hebrew, hindi, hungarian, indonesian, irish, italian, japanese, kazakh, korean, latin-ittb, latin-proiel, latin, latvian, lithuanian, norwegian-bokmaal, norwegian-nynorsk, old_church_slavonic, persian, polish, portuguese-br, portuguese, romanian, russian-syntagrus, russian, sanskrit, serbian, slovak, slovenian-sst, slovenian, spanish-ancora, spanish, swedish-lines, swedish, tamil, turkish, ukrainian, urdu, uyghur, vietnamese.

Users of the package can download any of these models with udpipe_download_model.
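
The following sketch shows how another model from the list could be fetched and used, mirroring the Dutch example earlier on this page. It assumes an internet connection and that the download succeeds. The example sentence and the object names are illustrative.

```r
library(udpipe)

# Download the pre-trained English model; the returned data frame contains
# the language and the path of the downloaded file (column file_model).
dl <- udpipe_download_model(language = "english")

# Load the model from the path reported by the download step.
udmodel_english <- udpipe_load_model(file = dl$file_model)

# Annotate a sentence and convert the result to a data frame.
x <- udpipe_annotate(udmodel_english,
                     x = "I went on a trip and took my laptop with me.")
x <- as.data.frame(x)
head(x[, c("token", "lemma", "upos")])
```

Passing dl$file_model instead of a hard-coded file name avoids depending on the current working directory.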

Train your own models based on CONLL-U data

The package also allows you to build your own annotation model. For this, you need to provide data in CONLL-U format. These are provided for many languages at http://universaldependencies.org/#ud-treebanks, mostly under the CC-BY-SA license. How this is done is detailed in the package vignette.

vignette("udpipe-train", package = "udpipe")
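
To give an impression of what training data in CONLL-U format looks like, the base R sketch below parses a tiny hand-written fragment. The token values are taken from the Dutch example earlier on this page, with XPOS and FEATS left unspecified for brevity. It only illustrates the file layout; for real files the package offers udpipe_read_conllu, and the object names used here are illustrative.

```r
# A tiny CoNLL-U fragment: comment lines start with '#', each token line has
# 10 tab-separated fields, and a blank line ends the sentence.
conllu <- c(
  "# sent_id = 1",
  "# text = Ik ging op reis",
  "1\tIk\tik\tPRON\t_\t_\t2\tnsubj\t_\t_",
  "2\tging\tga\tVERB\t_\t_\t0\troot\t_\t_",
  "3\top\top\tADP\t_\t_\t4\tcase\t_\t_",
  "4\treis\treis\tNOUN\t_\t_\t2\tobj\t_\t_",
  ""
)

# Keep only token lines (drop comments and sentence separators).
token_lines <- conllu[nzchar(conllu) & !startsWith(conllu, "#")]

# Split on tabs and bind into a data frame with the standard column names.
fields <- do.call(rbind, strsplit(token_lines, "\t", fixed = TRUE))
colnames(fields) <- c("ID", "FORM", "LEMMA", "UPOS", "XPOS",
                      "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")
tokens <- as.data.frame(fields, stringsAsFactors = FALSE)

# '_' marks an unspecified value in CoNLL-U; map it to NA.
tokens[tokens == "_"] <- NA
print(tokens[, c("ID", "FORM", "LEMMA", "UPOS", "HEAD", "DEPREL")])
```

The HEAD column holds the ID of the token each word depends on, with 0 marking the root of the sentence, which corresponds to the head_token_id column in the annotation output shown earlier.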

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be

Functions in udpipe

Name Description
as_phrasemachine Convert Parts of Speech tags to one-letter tags which can be used to identify phrases based on regular expressions
brussels_listings Brussels AirBnB address locations available at www.insideairbnb.com
as.data.frame.udpipe_connlu Convert the result of udpipe_annotate to a tidy data frame
as.matrix.cooccurrence Convert the result of cooccurrence to a sparse matrix
document_term_frequencies Aggregate a data.frame to the document/term level by calculating how many times a term occurs per document
document_term_matrix Create a document/term matrix from a data.frame with 1 row per document/term
collocation Extract collocations - a sequence of terms which follow each other
cooccurrence Create a cooccurrence data.frame
brussels_reviews Reviews of AirBnB customers on Brussels address locations available at www.insideairbnb.com
brussels_reviews_anno Reviews of the AirBnB customers which are tokenised, POS tagged and lemmatised
dtm_bind Combine 2 document term matrices either by rows or by columns
dtm_cor Pearson Correlation for Sparse Matrices
dtm_tfidf Term Frequency - Inverse Document Frequency calculation
phrases Extract phrases - a sequence of terms which follow each other based on a sequence of Parts of Speech tags
txt_sample Boilerplate function to sample one element from a vector.
txt_show Boilerplate function to cat only 1 element of a character vector.
dtm_remove_tfidf Remove terms from a Document-Term-Matrix and documents with no terms based on the term frequency inverse document frequency
dtm_reverse Inverse operation of the document_term_matrix function
txt_previous Get the n-th previous element of a vector
txt_recode Recode text to other categories
dtm_remove_lowfreq Remove terms occurring with low frequency from a Document-Term-Matrix and documents with no terms
dtm_remove_terms Remove terms from a Document-Term-Matrix and keep only documents which have at least some terms
txt_next Get the n-th next element of a vector
txt_nextgram Based on a vector with a word sequence, get n-grams
udpipe_load_model Load a UDPipe model
udpipe_read_conllu Read in a CONLL-U file as a data.frame
predict.LDA_VEM Predict method for an object of class LDA_VEM or class LDA_Gibbs
txt_collapse Collapse a character vector while removing missing data.
txt_freq Frequency statistics of elements in a vector
udpipe_accuracy Evaluate the accuracy of your UDPipe model on holdout data
udpipe_annotate Tokenise, Tag and Dependency Parsing Annotation of raw text
udpipe_train Train a UDPipe model
unique_identifier Create a unique identifier for each combination of fields in a data frame
txt_highlight Highlight words in a character vector
udpipe_annotation_params List with training options set by the UDPipe community when building models based on the Universal Dependencies data
udpipe_download_model Download a UDPipe model provided by the UDPipe community for a specific language of choice
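
As a rough sketch of how two of the listed helpers fit together, the snippet below builds a document/term matrix from a small hand-made data frame of lemmas. It assumes that document_term_frequencies uses the first column as the document identifier and the second as the term, and that document_term_matrix returns a sparse matrix with documents as rows and terms as columns. The toy data and object names are illustrative.

```r
library(udpipe)

# Toy input: one row per occurrence of a lemma in a document.
x <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  lemma  = c("reis", "laptop", "reis", "zonnebril", "reis"),
  stringsAsFactors = FALSE
)

# Aggregate to one row per document/term with its frequency.
dtf <- document_term_frequencies(x)

# Turn the frequencies into a sparse document/term matrix.
dtm <- document_term_matrix(dtf)
dim(dtm)
```

In a real workflow the input would typically be the doc_id and lemma columns of an annotated data frame, possibly restricted to certain part-of-speech tags first.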




Type Package
License MPL-2.0
URL https://bnosac.github.io/udpipe/en/index.html, https://github.com/bnosac/udpipe
Encoding UTF-8
LinkingTo Rcpp
SystemRequirements C++11
RoxygenNote 6.0.1
VignetteBuilder knitr
NeedsCompilation yes
Packaged 2018-01-15 13:15:07 UTC; Jan
Repository CRAN
Date/Publication 2018-01-15 14:45:23 UTC
