add_span_quotes: Add span quotes to source-quote annotations

Description

Quotes can span multiple sentences, which makes it impossible to find them with dependency tree queries alone. This function can be used as post-processing, AFTER using tqueries to find 'source' and 'quote' nodes, to add some of these quotes.

The quotes themselves are often easy to detect due to the use of quotation marks. There are two common ways of indicating the sources.

Firstly, the source might be mentioned before the start of the quote (Steve said: "hey a quote!". "I like quotes!"). Secondly, the source might be implied in the sentence where the quote starts, or the sentence before that (Steve was mad. "What a stupid way of quoting me!").

In the first case, the source can be found with a tquery. If there is a source (source_val) in the quote_col that is linked to a part of the quote (quote_val), this function will add the rest of the quote.

In the second case, we can look for candidates near the beginning of the quote. The candidate criteria can be specified as tqueries.
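The idea that quotation marks make span quotes easy to detect can be illustrated with a minimal base R sketch. The toy token table and the helper mark_quote_spans are hypothetical and only illustrate the principle; this is not the implementation used by add_span_quotes.

```r
# Toy token table (hypothetical data, not parser output)
tokens <- data.frame(
  sentence = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3),
  token = c("Steve", "was", "mad", ".",
            '"', "What", "a", "way", ".",
            "Stop", "it", "!", '"'),
  stringsAsFactors = FALSE
)

# Hypothetical helper: a token is inside a quote if an odd number of
# quotation marks has been seen so far (the marks themselves are excluded)
mark_quote_spans <- function(token, quote_symbols = '"') {
  is_mark <- token %in% quote_symbols
  (cumsum(is_mark) %% 2 == 1) & !is_mark
}

tokens$in_quote <- mark_quote_spans(tokens$token)

# Sentences covered by the quoted tokens; more than one sentence id
# means the quote crosses a sentence boundary
unique(tokens$sentence[tokens$in_quote])
```

In this toy example the quoted tokens fall in sentences 2 and 3, which is exactly the situation a single dependency tree query cannot capture, since each tree covers only one sentence.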

Usage

add_span_quotes(
  tokens,
  text_col,
  quote_col = "quotes",
  source_val = "source",
  quote_val = "quote",
  tqueries = NULL,
  par_col = NULL,
  space_col = NULL,
  lag_sentences = 1,
  add_quote_symbols = NULL,
  quote_subset = NULL,
  copy = TRUE
)

Arguments

tokens

A tokenIndex with rsyntax annotations for 'sources' and 'quotes'

text_col

The column with the text (often 'token' or 'word')

quote_col

The column that contains the quote annotations

source_val

The value in quote_col that indicates the source

quote_val

The value in quote_col that indicates the quote

tqueries

A list of tqueries that will be performed to find source candidates. The order of the queries determines which source candidates are preferred. It makes sense to use the same value as source_val in the 'label' argument of the tquery.

par_col

If available in the parser output, the column with the paragraph id. We can assume that quotes do not span across paragraphs. By using this argument, quotes that are not properly closed (uneven number of quotation marks) will stop at the end of the paragraph.
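Why paragraph ids help with unclosed quotes can be shown with a small base R sketch. The data and variable names below are hypothetical, and the counting logic is an assumption made for illustration, not the code inside add_span_quotes.

```r
# Toy tokens (hypothetical): the quote opened in paragraph 1 is never closed
tokens <- data.frame(
  paragraph = c(1, 1, 1, 1, 2, 2, 2),
  token = c('"', "Never", "closed", ".", "A", "new", "paragraph"),
  stringsAsFactors = FALSE
)
is_mark <- tokens$token == '"'

# Without paragraph ids, the unclosed quote runs on to the end of the text
open_global <- (cumsum(is_mark) %% 2 == 1) & !is_mark

# Counting quotation marks within each paragraph resets the state
# at every paragraph boundary
n_in_par <- ave(as.integer(is_mark), tokens$paragraph, FUN = cumsum)
open_by_par <- (n_in_par %% 2 == 1) & !is_mark

which(open_global)  # token positions marked as quoted without paragraph ids
which(open_by_par)  # token positions marked as quoted with paragraph ids
```

With the per-paragraph count, the tokens of the second paragraph are no longer treated as part of the quote.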

space_col

If par_col is not used, paragraphs will be identified based on hard line breaks in the text_col. In some parsers, there is an additional "space" column that holds the whitespace and line breaks, which can be included here.

lag_sentences

The maximum number of sentences to look back to find source candidates. Default is 1, which means the source candidates have to occur in the sentence where the quote begins (lag = 0) or the sentence before that (lag = 1).
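The lag window can be sketched in base R as follows. The candidate table and the helper in_window are hypothetical and only show which sentences fall inside the window for a given lag_sentences value; how add_span_quotes ranks candidates within the window is not shown here.

```r
# Hypothetical source candidates with the sentence in which each occurs
candidates <- data.frame(
  source = c("Anna", "Bob", "Steve"),
  sentence = c(1, 3, 4),
  stringsAsFactors = FALSE
)
quote_start_sentence <- 4

# Hypothetical helper: keep candidates that occur at most
# `lag_sentences` sentences before the sentence where the quote starts
in_window <- function(candidates, quote_start, lag_sentences = 1) {
  lag <- quote_start - candidates$sentence
  candidates[lag >= 0 & lag <= lag_sentences, , drop = FALSE]
}

in_window(candidates, quote_start_sentence, lag_sentences = 0)$source  # same sentence only
in_window(candidates, quote_start_sentence, lag_sentences = 1)$source  # also the sentence before
in_window(candidates, quote_start_sentence, lag_sentences = 3)$source  # up to three sentences back
```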

add_quote_symbols

Optionally, additional punctuation symbols to treat as quotation marks. In some contexts and languages it makes sense to add single quotes, but in that case it is often necessary to also use the quote_subset argument. For instance, in spaCy (and probably other UD-based annotations), single quotes in possessives (e.g., Bob's, scholars') have a PART POS tag, whereas quotation symbols have PUNCT, NOUN, VERB, or ADJ (for some reason).

quote_subset

Optionally, an expression to be evaluated on the columns of 'tokens' for selecting or deselecting tokens that can or cannot be quotation marks. For example, pos != "PART" can be used for the example mentioned in add_quote_symbols.
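The effect of combining single quotes with a subset condition can be sketched in base R. The tokens and POS tags below are assigned by hand for illustration (they are not actual spaCy output), and the filtering shown is an assumption about the principle, not the internals of add_span_quotes.

```r
# Toy tokens with hand-assigned POS tags (hypothetical, not spaCy output)
tokens <- data.frame(
  token = c("The", "scholars", "'", "view", ":", "'", "fine", "'"),
  pos   = c("DET", "NOUN", "PART", "NOUN", "PUNCT", "PUNCT", "ADJ", "PUNCT"),
  stringsAsFactors = FALSE
)

# Treating every single quote as a quotation mark also picks up the possessive
is_single_quote <- tokens$token == "'"
which(is_single_quote)

# Evaluating a condition such as pos != "PART" on the token columns
# removes the possessive and leaves a matched pair of quotation marks
keep <- with(tokens, pos != "PART")
which(is_single_quote & keep)
```

Without the subset there is an odd number of candidate marks, so opening and closing marks would be paired incorrectly.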

copy

If TRUE, deep copy the data.table (use if output tokens do not overwrite input tokens)

Value

The tokenIndex, with the span quotes added to the quote_col annotations.

Examples

# NOT RUN {
## This function is best used after first annotating regular quotes
## Here we first apply 3 tqueries for annotating quotes in spacy tokens

tokens = tokens_spacy[tokens_spacy$doc_id == 'text6',]

verbs = c("tell", "show", "acknowledge", "admit", "affirm", "allege", 
  "announce", "assert", "attest", "avow", "call", "claim", "comment", 
  "concede", "confirm", "declare", "deny", "exclaim", "insist", "mention", 
  "note", "post","predict", "proclaim", "promise", "reply", "remark", 
  "report", "say", "speak", "state", "suggest", "talk", "tell", "think",
  "warn","write", "add")

direct = tquery(lemma = verbs, label='verb',
   children(req=FALSE, relation = c('npadvmod'), block=TRUE),
   children(relation=c('su','nsubj','agent','nmod:agent'), label='source'),
   children(label='quote'))

nosrc = tquery(pos='VERB*',
   children(relation= c('su', 'nsubj', 'agent', 'nmod:agent'), label='source'),
   children(lemma = verbs, relation='xcomp', label='verb',
     children(relation=c("ccomp","dep","parataxis","dobj","nsubjpass","advcl"), label='quote')))

according = tquery(label='quote',
   children(relation='nmod:according_to', label='source',
        children(label='verb')))

tokens = annotate_tqueries(tokens, 'quote', dir=direct, nos=nosrc, acc=according)
tokens

## now we add the span quotes. If a span quote is found, the algorithm will first
## look for already annotated sources as source candidates. If there are none,
## additional tqueries can be used to find candidates. Here we simply look for
## the most recent PERSON entity

tokens = tokens_spacy[tokens_spacy$doc_id == 'text6',]
tokens = annotate_tqueries(tokens, 'quote', dir=direct, nos=nosrc, acc=according)


last_person = tquery(entity = 'PERSON*', label='source')
tokens = add_span_quotes(tokens, 'token', 
                         quote_col = 'quote', source_val = 'source', quote_val = 'quote', 
                         tqueries=last_person)
tokens

## view as full text
syntax_reader(tokens, annotation = 'quote', value = 'source')
# }
