
Chunks annotated with the Shiny app in this R package indicate, for a chunk of text of a document, the entity that it belongs to. As text chunks can contain several words, we need a way to add this chunk category to each word of a tokenised dataset. That is what this function does. If you have a tokenised data.frame with one row per token/document which indicates the start and end position where the token is found in the text of the document, this function assigns the chunk label to each token of the document.
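The idea can be illustrated with a minimal base R sketch (toy data and a hand-rolled overlap check, not the package's actual implementation): a token receives a chunk's entity label when the token's start/end positions fall inside the chunk's start/end range, and the default label otherwise.

```r
## Toy chunk annotation: one chunk labelled LOCATION inside the document text
chunks <- data.frame(doc_id = "d1", chunk_id = 1, chunk_entity = "LOCATION",
                     start = 1, end = 8)
## Toy tokenised data: start/end give each token's position in the text
tokens <- data.frame(doc_id = "d1",
                     token = c("New", "York", "is", "expensive"),
                     start = c(1, 5, 10, 13),
                     end   = c(3, 8, 11, 21))
## Default label for tokens that are outside every chunk
tokens$chunk_entity <- "O"
## Assign the chunk entity to tokens whose range lies inside a chunk range
for (i in seq_len(nrow(chunks))) {
  hit <- tokens$doc_id == chunks$doc_id[i] &
         tokens$start >= chunks$start[i] &
         tokens$end   <= chunks$end[i]
  tokens$chunk_entity[hit] <- chunks$chunk_entity[i]
}
tokens$chunk_entity
## "New" and "York" fall inside the LOCATION chunk, the rest stay "O"
```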
# S3 method for chunkrange
merge(x, y, by.x = "doc_id", by.y = "doc_id", default_entity = "O", ...)
the data.frame y, with two columns added:
chunk_entity: the chunk entity of the token if the token falls inside a chunk defined in x. If the token is not part of any chunk, the chunk entity is set to the default value.
chunk_id: the identifier of the chunk in which the token is located.
an object of class chunkrange. A chunkrange is just a data.frame which contains one row per chunk/doc_id. It should have the columns doc_id, text, chunk_id, chunk_entity, start and end. The fields start and end indicate where the chunk of words starts and ends in the original text. The chunk_entity is a label you have assigned to the chunk (e.g. ORGANISATION / LOCATION / MONEY / LABELXYZ / ...).
a tokenised data.frame containing one row per doc_id/token. It should have the columns doc_id, start and end, where the fields start and end indicate the positions in the original text of the doc_id where the token starts and ends. See the examples.
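The start/end convention for tokens can be shown with a small base R example (a toy sentence and hand-computed character offsets, purely for illustration): substring applied to the original text with a token's start and end positions recovers that token.

```r
## Toy document text and its tokens
text   <- "New York is expensive"
tokens <- c("New", "York", "is", "expensive")
## Character positions where each token starts in the text
start  <- c(1, 5, 10, 13)
## End position = start + token length - 1
end    <- start + nchar(tokens) - 1
## substring(text, start, end) recovers each token from the original text
recovered <- substring(text, start, end)
recovered
```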
a character string with the name of a column of x which is an identifier defining the sequence. Defaults to 'doc_id'.
a character string with the name of a column of y which is an identifier defining the sequence. Defaults to 'doc_id'.
character string with the default chunk_entity to be assigned to the token if the token is not part of any chunk range. Defaults to 'O'.
not used
# \donttest{
if(require(udpipe)){
  library(udpipe)
  ## Download a UDPipe annotation model for Dutch
  udmodel <- udpipe_download_model("dutch-lassysmall")
  if(packageVersion("udpipe") >= "0.7"){
    data(airbnb_chunks, package = "crfsuite")
    airbnb_chunks <- head(airbnb_chunks, 20)
    ## Tokenise the document texts with udpipe
    airbnb_tokens <- unique(airbnb_chunks[, c("doc_id", "text")])
    airbnb_tokens <- udpipe(airbnb_tokens, object = udmodel)
    head(airbnb_tokens)
    head(airbnb_chunks)
    ## Add the entity of the chunk to the tokenised dataset
    x <- merge(airbnb_chunks, airbnb_tokens)
    x[, c("doc_id", "token", "chunk_entity")]
    table(x$chunk_entity)
  }
  ## cleanup for CRAN
  file.remove(udmodel$file_model)
} # End of main if statement running only if the required packages are installed
# }