stringdist (version 0.9.8)

qgrams: Get a table of qgram counts from one or more character vectors.

Description

Get a table of q-gram counts from one or more character vectors.

Usage

qgrams(..., .list = NULL, q = 1L, useBytes = FALSE, useNames = !useBytes)

Value

A table with \(q\)-gram counts. The detected \(q\)-grams are used as column names and the argument names as row names. If no argument names are provided, they are generated.
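
As an illustration of the layout described above, the following minimal sketch builds a small table from two named arguments and looks up counts by row and column name. The argument names A and B and the input strings are arbitrary choices made for this example, and the order of the columns is not assumed.

```r
library(stringdist)

# Named arguments become row names; detected q-grams become column names.
tab <- qgrams(A = "hello", B = "yellow", q = 2)

rownames(tab)        # the argument names
sort(colnames(tab))  # the distinct 2-grams found across both inputs
tab["A", "ll"]       # count of "ll" in argument A
tab[, "ow"]          # count of "ow" for every argument
```

Indexing by name rather than by position avoids relying on the order in which the q-grams happen to be returned.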

Arguments

...

Any number of (named) arguments, which will be coerced to character with as.character.

.list

Will be concatenated with the ... argument(s). Useful for adding character vectors named 'q' or 'useNames'.

q

Size of the q-gram; must be non-negative.

useBytes

Compute q-grams byte-wise. useBytes=TRUE is faster but may yield different results depending on the character encoding. For ASCII input the results are identical. See also stringdist under Encoding issues.

useNames

Add q-grams as column names. If useBytes=useNames=TRUE, the q-byte sequences are represented as two hexadecimal digits per byte, separated by a vertical bar (|).
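
The use of .list described above can be sketched as follows. The list name inputs and the strings in it are made up for this example; the point is only that a vector labelled 'q' or 'useNames' cannot be passed through ... without being matched to the argument of the same name.

```r
library(stringdist)

# Character vectors that should be labelled "q" and "useNames" in the output.
# Passing them as qgrams(q = "banana", ...) would instead set the q argument.
inputs <- list(q = "banana", useNames = "bandana")

tab <- qgrams(.list = inputs, q = 2)

rownames(tab)          # the list names are used as row names
tab["q", "an"]         # "banana" contains "an" twice
tab["useNames", "nd"]  # "bandana" contains "nd" once
```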

Details

The input is converted to character. If useBytes=TRUE, each element is converted to UTF-8 and then to integer as in stringdist. Next, the data is passed to the underlying routine.

Strings with fewer than q characters and elements containing NA are skipped. Using q=0 therefore counts the number of empty strings "" occurring in each argument.
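
The skipping rule above can be checked with a small sketch. The input vector is an arbitrary example, and the behaviour shown is the one stated in this section, not something verified against every version of the package.

```r
library(stringdist)

# "ab" has fewer than 3 characters and NA is missing, so only "abcd"
# contributes 3-grams: "abc" and "bcd".
tab <- qgrams(c("ab", "abcd", NA), q = 3)

sort(colnames(tab))  # the 3-grams that were detected
sum(tab)             # total number of 3-grams counted
```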

See Also

stringdist, amatch

Examples


qgrams('hello world',q=3)

# q-grams are counted over all elements of a character vector together
qgrams(rep('hello world',2),q=3)

# to count them separately, do something like
x <- c('hello', 'world')
lapply(x,qgrams, q=3)

# output rows may be named, and you can pass any number of character vectors
x <- "I will not buy this record, it is scratched"
y <- "My hovercraft is full of eels"
z <- c("this", "is", "a", "dead","parrot")
qgrams(A = x, B = y, C = z,q=2)

# a tongue twister, showing the effects of useBytes and useNames
x <- "peter piper picked a peck of pickled peppers"
qgrams(x, q=2)
qgrams(x, q=2, useNames=FALSE)
qgrams(x, q=2, useBytes=TRUE)
qgrams(x, q=2, useBytes=TRUE, useNames=TRUE)