
Chunks annotated with the Shiny app in this R package indicate, for a chunk of text of a document, the entity that it belongs to. As text chunks can contain several words, we need a way to add this chunk category to each word of a tokenised dataset. That is what this function does. If you have a tokenised data.frame with one row per token/document which indicates the start and end position where the token is found in the text of the document, this function assigns the chunk label to each token of the document.
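The idea can be illustrated with a minimal base R sketch (toy data and a hand-rolled overlap check, not the package's actual implementation): a token receives a chunk's entity label when the token's start/end positions fall inside the chunk's start/end range, and the default label otherwise.

```r
## Toy chunk annotation: one chunk labelled LOCATION inside the document text
chunks <- data.frame(doc_id = "d1", chunk_id = 1, chunk_entity = "LOCATION",
                     start = 1, end = 8)
## Toy tokenised data: start/end give each token's position in the text
tokens <- data.frame(doc_id = "d1",
                     token = c("New", "York", "is", "expensive"),
                     start = c(1, 5, 10, 13),
                     end   = c(3, 8, 11, 21))
## Default label for tokens that are outside every chunk
tokens$chunk_entity <- "O"
## Assign the chunk entity to tokens whose range lies inside a chunk range
for (i in seq_len(nrow(chunks))) {
  hit <- tokens$doc_id == chunks$doc_id[i] &
         tokens$start >= chunks$start[i] &
         tokens$end   <= chunks$end[i]
  tokens$chunk_entity[hit] <- chunks$chunk_entity[i]
}
tokens$chunk_entity
## "New" and "York" fall inside the LOCATION chunk, the rest stay "O"
```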
# S3 method for chunkrange
merge(x, y, by.x = "doc_id", by.y = "doc_id", default_entity = "O", ...)
the data.frame y, with two columns added:
chunk_entity: the chunk entity of the token if the token falls inside a chunk defined in x. If the token is not part of any chunk, the chunk entity is set to the default value.
chunk_id: the identifier of the chunk in which the token is located.
an object of class chunkrange. A chunkrange is just a data.frame which contains one row per chunk/doc_id. It should have the columns doc_id, text, chunk_id, chunk_entity, start and end. The fields start and end indicate where the chunk of words starts and ends in the original text. The chunk_entity is a label you have assigned to the chunk (e.g. ORGANISATION / LOCATION / MONEY / LABELXYZ / ...).
a tokenised data.frame containing one row per doc_id/token. It should have the columns doc_id, start and end, where the fields start and end indicate the positions in the original text of the doc_id where the token starts and ends. See the examples.
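The start/end convention for tokens can be shown with a small base R example (a toy sentence and hand-computed character offsets, purely for illustration): substring applied to the original text with a token's start and end positions recovers that token.

```r
## Toy document text and its tokens
text   <- "New York is expensive"
tokens <- c("New", "York", "is", "expensive")
## Character positions where each token starts in the text
start  <- c(1, 5, 10, 13)
## End position = start + token length - 1
end    <- start + nchar(tokens) - 1
## substring(text, start, end) recovers each token from the original text
recovered <- substring(text, start, end)
recovered
```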
a character string with the name of a column of x which is an identifier defining the sequence. Defaults to 'doc_id'.
a character string with the name of a column of y which is an identifier defining the sequence. Defaults to 'doc_id'.
character string with the default chunk_entity to be assigned to the token if the token is not part of any chunk range. Defaults to 'O'.
not used
# \donttest{
if(require(udpipe)){
  library(udpipe)
  ## Download a UDPipe annotation model for Dutch
  udmodel <- udpipe_download_model("dutch-lassysmall")
  if(packageVersion("udpipe") >= "0.7"){
    data(airbnb_chunks, package = "crfsuite")
    airbnb_chunks <- head(airbnb_chunks, 20)
    ## Tokenise the document texts with udpipe
    airbnb_tokens <- unique(airbnb_chunks[, c("doc_id", "text")])
    airbnb_tokens <- udpipe(airbnb_tokens, object = udmodel)
    head(airbnb_tokens)
    head(airbnb_chunks)
    ## Add the entity of the chunk to the tokenised dataset
    x <- merge(airbnb_chunks, airbnb_tokens)
    x[, c("doc_id", "token", "chunk_entity")]
    table(x$chunk_entity)
  }
  ## cleanup for CRAN
  file.remove(udmodel$file_model)
} # End of main if statement running only if the required packages are installed
# }