ngram-tokenizers: N-gram tokenizers

Description

These functions tokenize their inputs into different kinds of n-grams. The input can be a character vector of any length, or a list of character vectors where each character vector in the list has a length of 1. See details for an explanation of what each function does.

Usage

tokenize_ngrams(x, lowercase = TRUE, n = 3L, n_min = n,
  stopwords = character(), ngram_delim = " ", simplify = FALSE)
tokenize_skip_ngrams(x, lowercase = TRUE, n = 3, k = 1,
  simplify = FALSE)

Arguments

x

A character vector or a list of character vectors to be tokenized into n-grams. If x is a character vector, it can be of any length, and each element will be tokenized separately. If x is a list of character vectors, each element must be a character vector of length 1.

lowercase

Should the tokens be made lower case?

n

The number of words in the n-gram. This must be an integer greater than or equal to 1.

n_min

The minimum number of words in the n-gram. This must be an integer greater than or equal to 1, and less than or equal to n.

stopwords

A character vector of stop words to be excluded from the n-grams.

ngram_delim

The separator between words in an n-gram.

simplify

FALSE by default so that a consistent value is returned regardless of length of input. If TRUE, then an input with a single element will return a character vector of tokens instead of a list.

k

For the skip n-gram tokenizer, the maximum skip distance between words. The function will compute skip n-grams for every skip distance from 0 to k.

Value

A list of character vectors containing the tokens, with one element in the list for each element that was passed as input. If `simplify = TRUE` and only a single element was passed as input, then the output is a character vector of tokens.
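As a small illustration of the return shapes described above, the following sketch calls tokenize_ngrams with and without simplify. It assumes the tokenizers package is installed; the variable names (one, two, as_list, as_vec, still_list) are only for this example.

```r
library(tokenizers)

one <- "How many roads must a man walk down"
two <- c(one, "Before you call him a man?")

# Default: a list with one element per input element
as_list <- tokenize_ngrams(one, n = 2)
is.list(as_list)
length(as_list)

# simplify = TRUE with a single input element: a plain character vector
as_vec <- tokenize_ngrams(one, n = 2, simplify = TRUE)
is.character(as_vec)

# simplify = TRUE with more than one input element: still a list
still_list <- tokenize_ngrams(two, n = 2, simplify = TRUE)
is.list(still_list)
length(still_list)
```

Keeping simplify = FALSE is usually the safer choice in code that may receive inputs of varying length, since the result is then always a list.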

Details

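tokenize_ngrams: Shingled n-grams, that is, contiguous sequences of words. N-grams are computed for every length from n_min to n.

tokenize_skip_ngrams: Skip n-grams, that is, sequences of n words that may have gaps between them. Skip n-grams are computed for every skip distance from 0 to k.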
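To make the roles of n and n_min concrete, here is a minimal base R sketch of shingled n-grams over a vector of words that has already been split. The helper shingle_ngrams is hypothetical and is not part of the package; it is not the package's implementation, and the order in which it returns the n-grams may differ from what tokenize_ngrams produces.

```r
# Hypothetical helper (not part of the package): builds contiguous
# n-grams of every length from n_min to n using base R only.
shingle_ngrams <- function(words, n = 3L, n_min = n, ngram_delim = " ") {
  stopifnot(n_min >= 1L, n_min <= n)
  out <- character()
  for (size in seq.int(n_min, n)) {
    # No n-gram of this size (or any larger size) fits in the input
    if (size > length(words)) break
    starts <- seq_len(length(words) - size + 1L)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + size - 1L)], collapse = ngram_delim),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

words <- c("how", "many", "roads", "must")

# Only trigrams, because n_min defaults to n
shingle_ngrams(words, n = 3L)

# Bigrams and trigrams
shingle_ngrams(words, n = 3L, n_min = 2L)
```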

These functions will strip all punctuation and normalize all whitespace to a single space character.
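The skip distance k can be illustrated in the same spirit. The base R sketch below is one simplified reading of the description of k, in which every word in an n-gram is separated from the next by the same number of skipped words, and all skip distances from 0 to k are tried. The helper skip_ngrams_sketch is hypothetical, works on words that are already split and normalized, and should not be taken as the package's algorithm or output order.

```r
# Hypothetical helper (not part of the package): for each skip distance
# from 0 to k, take n words spaced (skip + 1) positions apart.
skip_ngrams_sketch <- function(words, n = 3L, k = 1L, ngram_delim = " ") {
  out <- character()
  for (skip in 0:k) {
    step <- skip + 1L
    span <- (n - 1L) * step  # distance from the first to the last word
    if (span >= length(words)) next
    for (i in seq_len(length(words) - span)) {
      idx <- i + (0:(n - 1L)) * step
      out <- c(out, paste(words[idx], collapse = ngram_delim))
    }
  }
  out
}

words <- c("a", "b", "c", "d", "e")

# Skip distance 0 gives ordinary bigrams; skip distance 1 jumps one word
skip_ngrams_sketch(words, n = 2L, k = 1L)
```

With k = 0 the sketch reduces to ordinary contiguous n-grams, which matches the statement that skip distances start at 0.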

Examples


library(tokenizers)

song <-  paste0("How many roads must a man walk down\n",
                "Before you call him a man?\n",
                "How many seas must a white dove sail\n",
                "Before she sleeps in the sand?\n",
                "\n",
                "How many times must the cannonballs fly\n",
                "Before they're forever banned?\n",
                "The answer, my friend, is blowin' in the wind.\n",
                "The answer is blowin' in the wind.\n")

tokenize_ngrams(song, n = 4)
tokenize_ngrams(song, n = 4, n_min = 1)
tokenize_skip_ngrams(song, n = 4, k = 2)
