A Data Model for the NLP Pipeline

# CRAN will not have spaCy installed, so create static vignette knitr::opts_chunk$set(eval = FALSE)

An annotation object is simply a named list with each item containing a data frame. These frames should be thought of as tables living inside of a single database, with keys linking each table to one another. All tables are in the second normal form of Edgar Codd. For the most part they also satisfy the third normal form, or, equivalently, the formal tidy data model of Hadley Wickham. The limited departures from this more stringent requirement are justified below wherever they exist. In every case the cause is a transitive dependency that would require a complex range join to reconstruct.

We were primarily considered the Java/CoreNLP backend when constructing the schema because it is the most feature-rich. With the sole exception of word embeddings in spaCy, which require a matrix-based data structure, all of the fields from the Python and R backends map seamlessly into a subset of those provided by CoreNLP. Fields where the output may differ slightly amongst backends are described in the commentary below.

Several standards have previously been proposed for representing textual annotations. These include the linguistic Annotation Framework, NLP Interchange Format, and CoNLL-X. The function from_CoNLL is included as a helper function in cleanNLP to convert from CoNLL formats into the cleanNLP data model. All of these, however, are concerned with representing annotations for interoperability between systems. Our goal is instead to create a data model well-suited to direct analysis, and therefore requires a new approach.

In this section each table is presented and justifications for its existence and form are given. Individual tables may be pulled out with access functions of the form get_*. Example tables are pulled from the State of the Union corpus, which will be discussed at length in the next section.


The documents table contains one row per document in the annotation object. What exactly constitutes a document up to the user. It might include something as granular as a paragraph or as coarse as an entire novel. For many applications, particularly stylometry, it may be useful to simultaneously work with several hierarchical levels: sections, chapters, and an entire body of work. The solution in these cases is to define a document as the smallest unit of measurement, denoting the higher-level structures as metadata. The primary key for this table is a document id, stored as an integer index. The schema is given by:

  • id integer. Id of the source document.
  • time date time. The time at which the parser was run on the text.
  • version character. Version of CoreNLP/spaCy/tokenizers library used to parse the text.
  • language character. Language of the text, in ISO 639-1 format.
  • uri character. Description of the raw text location.

By design, there should be no extrinsic meaning placed on this key. Other tables use it to map to one another and to the document table, but any metadata about the document is contained only in the document table rather than being forced into the document key. In other words, the temptation to use keys such as "Obama2016" is avoided because, while these look nice, they ultimately make programming hard.

These are all filled in automatically by the annotation function. Any number of additional corpora-specific metadata, such as the aforementioned section and chapter designations, may be attached as well. The document table for the example corpus is:

## # A tibble: 8 × 5
##      id                time version language
## 1     0 2017-04-02 22:56:00   3.7.0       en
## 2     1 2017-04-02 22:57:00   3.7.0       en
## 3     2 2017-04-02 22:58:00   3.7.0       en
## 4     3 2017-04-02 22:58:00   3.7.0       en
## 5     4 2017-04-02 22:59:00   3.7.0       en
## 6     5 2017-04-02 22:59:00   3.7.0       en
## 7     6 2017-04-02 23:00:00   3.7.0       en
## 8     7 2017-04-02 23:01:00   3.7.0       en
# ... with 1 more variables: uri 

Notice that metadata such as the president, year, and party into the document table has been included. It may seem that common fields such as year and author should be added to the formal specification but the perceived advantage is minimal. It would still be necessary for users to manually add the content of these fields at some point as any other metadata is not unambiguously extractable from the raw text.


The token table contains one row for each unique token, usually a word or punctuation mark, in any document in the corpus. Any annotator that produces an output for each token has its results displayed here. These include the lemmatizer, the part of the speech tagger and speaker indicators. The schema is given by:

  • id integer. Id of the source document.
  • sid integer. Sentence id, starting from 0.
  • tid integer. Token id, with the root of the sentence starting at 0.
  • word character. Raw word in the input text.
  • lemma character. Lemmatized form the token.
  • upos character. Universal part of speech code.
  • pos character. Language-specific part of speech code; uses the Penn Treebank codes.
  • cid integer. Character offset at the start of the word in the original document.

Given the annotators selected during the pipeline initialization, some of these columns may contain only missing data. A composite key exists by taking together the document id, sentence id, and token id. There is also a set of foreign keys \code{cid} and \code{cid_end} giving character offsets back into the original source document. An example of the table looks like this:

## # A tibble: 61,881 × 8
##       id   sid   tid      word     lemma  upos   pos   cid
## 1      0     0     1     Madam     Madam  NOUN   NNP     0
## 2      0     0     2   Speaker   Speaker  NOUN   NNP     6
## 3      0     0     3         ,         ,     .     ,    13
## 4      0     0     4       Mr.       Mr.  NOUN   NNP    15
## 5      0     0     5      Vice      Vice  NOUN   NNP    19
## 6      0     0     6 President President  NOUN   NNP    24
## 7      0     0     7         ,         ,     .     ,    33
## 8      0     0     8   Members   Members  NOUN   NNP    35
## 9      0     0     9        of        of   ADP    IN    43
## 10     0     0    10  Congress  Congress  NOUN   NNP    46
## # ... with 61,871 more rows

A phantom token "ROOT" is included at the start of each sentence (it always has tid equal to 0). This was added so that joins from the dependency table, which contains references to the sentence root, into the token table have no missing values.

The field "upos" contains the universal part of speech code, a language-agnostic classification, for the token. It could be argued that in order to maintain database normalization one should simply look up the universal part of speech code by finding the language code in the document table and joining a table mapping the Penn Treebank codes to the universal codes. This has not been done for several reasons. First, universal parts of speech are very useful for exploratory data analysis as they contain tags much more familiar to non-specialists such as "NOUN" (noun) and "CONJ" (conjunction). Asking users to apply a three table join just to access them seems overly cumbersome. Secondly, it is possible for users to use other parsers or annotation engines. These may not include granular part of speech codes and it would be difficult to figure out how to represent these if there were not a dedicated universal part of speech field.


Dependencies give the grammatical relationship between pairs of tokens within a sentence. As they are at the level of token pairs, they must be represented as a new table. Only one dependency should exist for any pair of tokens; the document id, sentence id, and source and target token ids together serve as a composite key. As dependencies exist only within a sentence, the sentence id does not need to be defined separately for the source and target. The schema is given by:

  • id integer. Id of the source document.
  • sid integer. Sentence id of the source token.
  • tid integer. Id of the source token.
  • tid_target integer. Id of the target token.
  • relation character. Language-agnostic universal dependency type.
  • relation_full character. Language specific universal dependency type.
  • word character. The source word in the raw text.
  • lemma character. Lemmatized form of the source word.
  • word_target character. The target word in the raw text.
  • lemma_target character. Lemmatized form of the target word.

Dependencies take significantly longer to calculate than the lemmatization and part of speech tagging tasks. By default, the set_language function selects a fast neural network parser that requires more memory but runs nearly twice as fast as other default parsers in the CoreNLP pipeline.

The get_dependency function has an option (set to "FALSE" by default) to auto join the dependency to the target and source tokens and words from the token table. This is a common task and involves non-trivial calls to the left_join function making it worthwhile to include as an option. The output, with the option turned on, is given by:

get_dependency(obama, get_token = TRUE)
## # A tibble: 16,592,392 × 11
##       id   sid   tid tid_target relation relation_full  word lemma sid_target
## 1      0     0     0          2     root          root              0
## 2      0     0     0          2     root          root              1
## 3      0     0     0          2     root          root              2
## 4      0     0     0          2     root          root              3
## 5      0     0     0          2     root          root              4
## 6      0     0     0          2     root          root              5
## 7      0     0     0          2     root          root              6
## 8      0     0     0          2     root          root              7
## 9      0     0     0          2     root          root              8
## 10     0     0     0          2     root          root              9
## # ... with 16,592,382 more rows, and 2 more variables: word_target ,
## #   lemma_target 

The word "ROOT" shows up in the first row, which would have been NA had sentence roots not been explicitly included in the token table.

Our parser produces universal dependencies, which have a language-agnostic set of relationship types with language-specific subsets pertaining to specific grammatical relationships with a particular language. For the same reasons that both the part of speech codes and universal part of speech codes are included, each of these relationship types have been added to the dependency table.

Named entities

Named entity recognition is the task of finding entities that can be defined by proper names, categorizing them, and standardizing their formats. The XML output of the Stanford CoreNLP pipeline places named entity information directly into their version of the token table. Doing this repeats information over every token in an entity and gives no canonical way of extracting the entirety of a single entity mention. We instead have a separate entity table, as is demanded by the normalized database structure, and record each entity mention in its own row. The schema is given by:

  • id integer. Id of the source document.
  • sid integer. Sentence id of the entity mention.
  • tid integer. Token id at the start of the entity mention.
  • tid_end integer. Token id at the end of the entity mention.
  • entity_type character. Type of entity.
  • entity character. Raw words of the named entity in the text.
  • entity_normalized character. Normalized version of the entity.

An example of the named entity table is given by:

## # A tibble: 3,166 × 7
##       id   sid   tid tid_end  entity_type              entity
## 1      0     1     2       2       PERSON             Speaker
## 2      0     1    10      10 ORGANIZATION            Congress
## 3      0     1    13      13         MISC           Americans
## 4      0     1    15      17         DATE Fifty-one years ago
## 5      0     1    19      21       PERSON     John F. Kennedy
## 6      0     3     1       1         DATE             Tonight
## 7      0     3    11      11         MISC            American
## 8      0     4     2       3     DURATION            a decade
## 9      0     5     2       2     DURATION               years
## 10     0     5    12      13       NUMBER           6 million
## # ... with 3,156 more rows, and 1 more variables: entity_normalized 

The categories available in the field entity_type are dependent on the models used in the annotation pipeline. The default English model selected for speed codes 2 and above include the categories: "LOCATION", "PERSON", "ORGANIZATION", "MISC", "MONEY", "PERCENT", "DATE" and "TIME". The last four of these also have a normalized form, given in the final field of the table. As with the coreference table, a complete representation of the entity is given as a character string due to the difficulty in reconstructing this after the fact from the token table.


Coreferences link sets of tokens that refer to the same underlying person, object, or idea. One common example is the linking of a noun in one sentence to a pronoun in the next sentence. The coreference table describes these relationships but is not strictly a table of coreferences. Instead, each row represents a single mention of an expression and gives a reference id indicating all of the other mentions that it also coreferences. In theory, a given set of tokens might have two separate mentions if it refers to two different classes of references (though this is quite rare). The document, reference, and mention ids serve as a composite key for the table. The schema is given by:

  • id integer. Id of the source document.
  • rid integer. Relation ID.
  • mid integer. Mention ID; unique to each coreference within a document.
  • mention character. The mention as raw words from the text.
  • mention_type character. One of "LIST", "NOMINAL", "PRONOMINAL", or "PROPER".
  • number character. One of "PLURAL", "SINGULAR", or "UNKNOWN".
  • gender character. One of "FEMALE", "MALE", "NEUTRAL", or "UNKNOWN".
  • animacy character. One of "ANIMATE", "INANIMATE", or "UNKNOWN".
  • sid integer. Sentence id of the coreference.
  • tid integer. Token id at the start of the coreference.
  • tid_end integer. Token id at the start of the coreference.
  • tid_head integer. Token id of the head of the coreference.

Links back into the token table for the start, end and head of the mention are given as well; these are pushed to the right of the table as they should be considered foreign keys within this table.

An example helps to explain exactly what the coreference table represents (the first row is removed as its mention is quite long and makes the table hard to read):

## # A tibble: 6,984 × 12
##       id   rid   mid       mention mention_type   number  gender   animacy
## 1      0  1537  1531   Afghanistan       PROPER SINGULAR NEUTRAL INANIMATE
## 2      0  1537  1537   Afghanistan       PROPER SINGULAR NEUTRAL INANIMATE
## 3      0  2050  2007   a democracy      NOMINAL SINGULAR NEUTRAL INANIMATE
## 4      0  2050  2050 our democracy      NOMINAL SINGULAR NEUTRAL INANIMATE
## 5      0  1796  1746             I   PRONOMINAL SINGULAR UNKNOWN   ANIMATE
## 6      0  1796  1796             I   PRONOMINAL SINGULAR UNKNOWN   ANIMATE
## 7      0  2053  2022            we   PRONOMINAL   PLURAL UNKNOWN   ANIMATE
## 8      0  2053  2023           our   PRONOMINAL   PLURAL UNKNOWN   ANIMATE
## 9      0  2053  2044            We   PRONOMINAL   PLURAL UNKNOWN   ANIMATE
## 10     0  2053  2046            we   PRONOMINAL   PLURAL UNKNOWN   ANIMATE
## # ... with 6,974 more rows, and 4 more variables: sid , tid ,
## #   tid_end , tid_head 

There is a special relationship between the reference id rid and the mention id mid. The coreference annotator selects a specific mention for each reference that gets treated as the canonical mention for the entire class. The mention id for this mention becomes the reference id for the class, as can be in the above table with rows 1, 3, 5, and 9 all corresponding to the canonical mention of their respective classes. This relationship provides a way of identifying the canonical mention within a reference class and a way of treating the coreference table as pairs of mentions rather than individual mentions joined by a given key.

The text of the mention itself is included within the table. This was done because as the mention may span several tokens it would otherwise be very difficult to extract this information from the token table. It is also possible, though not supported in the current CoreNLP pipeline, that a mention could consist of a set of non-contiguous tokens, making this field impossible to otherwise reconstruct.


The schema for sentence-level information is given by:

  • id integer. Id of the source document.
  • sid integer. Sentence id.
  • sentiment integer. Predicted sentiment class of the sentence, from 0 (worst) to 4 (best).

An example of the output can be seen in:

## # A tibble: 2,988 × 3
##       id   sid sentiment
## 1      0     0         1
## 2      0     1         3
## 3      0     2         1
## 4      0     3         1
## 5      0     4         2
## 6      0     5         1
## 7      0     6         3
## 8      0     7         1
## 9      0     8         1
## 10     0     9         1
## # ... with 2,978 more rows

The underlying sentiment model is a neural network.

Word vectors

Our final table in the data model stores the a relatively new concept of a word vector. Also known as word embeddings, these vectors are deterministic maps from the set of all available words into a high-dimensional, real valued vector space. Words with similar meanings or themes will tend to be clustered together in this high-dimensional space. For example, we would expect apple and pear to be very close to one another, with vegetables such as carrots, broccoli, and asparagus only slightly farther away. The embeddings can often be used as input features to fitting models on top of textual data. For a more detailed description of these embeddings, see the papers on either of the most well-known examples: GloVe \citep{pennington2014glove} and word2vec \citep{mikolov2013distributed}. Only the spaCy backend to \pkg{cleanNLP} currently supports word vectors; these are turned off by default because they take a significantly large amount of space to store. The embedding model used is a modification of the GloVe embeddings, mapping words into a 300-dimensional space. To compute the embeddings, set the \code{vector_flag} parameter of \code{init_spaCy} to \code{TRUE} prior to running the annotation.

Word vectors are stored in a separate table from the tokens table out of convenience rather than as a necessity of preserving the data model's normalized schema. Due to its size and the fact that the individual components of the word embedding have no intrinsic meaning, this table is stored as a matrix. ``{r} dim(get_vector(obama))

[1] 61871 303

``` The first three columns hold the keys \code{id}, \code{sid}, and \code{tid}, respectively. If no embedding is computed, the function \code{get_vector} returns an empty matrix.