udpipe (version 0.3)

txt_nextgram: Based on a vector with a word sequence, get n-grams

Description

If you have annotated your text using udpipe_annotate, your text is tokenised into a sequence of words. Given this vector of words in sequence, getting n-grams comes down to looking at the next word, the word after that, and so forth. These words are pasted together to form an n-gram containing the current word followed by the next n - 1 words.
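The look-ahead-and-paste idea described above can be sketched in base R. This is a minimal illustration only, not the udpipe implementation: the helper name next_gram_sketch is hypothetical, and marking positions that lack enough following words as NA is a choice made for this sketch.

```r
# Hypothetical helper (not part of udpipe): builds n-grams by pasting each
# word to the n - 1 words that follow it.
next_gram_sketch <- function(x, n = 2, sep = " ") {
  n <- as.integer(n)
  stopifnot(n >= 1)
  out <- x
  for (k in seq_len(n - 1)) {
    # Word located k positions ahead; indexing past the end yields NA
    shifted <- x[seq_along(x) + k]
    gram <- paste(out, shifted, sep = sep)
    # Assumption of this sketch: incomplete n-grams at the end become NA
    gram[is.na(out) | is.na(shifted)] <- NA_character_
    out <- gram
  }
  out
}

words <- c("the", "quick", "brown", "fox")
bigrams <- next_gram_sketch(words, n = 2)
trigrams <- next_gram_sketch(words, n = 3, sep = "-")
```

With n = 1 the loop body never runs, so the input comes back unchanged. The result always has the same length as the input, which makes it convenient to store as a column next to the original words.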

Usage

txt_nextgram(x, n = 2, sep = " ")

Arguments

x

a character vector in which each element is a single term or word

n

an integer indicating the size of the n-gram. A value of 1 returns x unchanged, a value of 2 appends the next term to the current term, a value of 3 appends the next two terms to the current term, and so on

sep

a character string used as the separator when pasting the words together

Value

a character vector of the same length as x containing the n-grams

See Also

paste, shift

Examples

# NOT RUN {
x <- sprintf("%s%s", LETTERS, 1:26)
txt_nextgram(x, n = 2)

data.frame(words = x,
           bigram = txt_nextgram(x, n = 2),
           trigram = txt_nextgram(x, n = 3, sep = "-"),
           quatrogram = txt_nextgram(x, n = 4, sep = ""),
           stringsAsFactors = FALSE)
# }