splitStrings: Construct unigram and bigram matrices from a vector of strings

Description

A (possibly large) vector of strings is separated into sparse pattern matrices, which allows for efficient computation on the strings.

Usage

splitStrings(strings, sep = "", bigrams = TRUE, boundary = TRUE,
	bigram.binder = "", gap.symbol = "·", left.boundary = "#", right.boundary = "#",
	simplify = FALSE)

Arguments

Value

By default, the output is a list of six elements:

segments: A vector with all split parts (i.e. all tokens) in order of occurrence, with gap symbols separating the original strings.

unigrams: A vector with all unique parts occurring in the segments.

bigrams: Only present when bigrams = TRUE. A vector with all unique bigrams.

SW: A sparse pattern matrix of class ngCMatrix specifying the distribution of segments (S) over the original strings (W, think 'words'). This matrix is only interesting in combination with the following matrices.

US: A sparse pattern matrix of class ngCMatrix specifying the distribution of the unique unigrams (U) over the tokenized segments (S).

BS: Only present when bigrams = TRUE. A sparse pattern matrix of class ngCMatrix specifying the distribution of the unique bigrams (B) over the tokenized segments (S).

When simplify = TRUE the output is a single sparse matrix of class dgCMatrix. This is basically BS %*% SW (when bigrams = TRUE) or US %*% SW (when bigrams = FALSE), with row and column names added to the matrix.
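The relation between US, SW and the simplified output can be illustrated with a small hand-built sketch. This is not the implementation of splitStrings: it uses only base R and the Matrix package, it leaves out gap symbols, boundaries and bigrams, and the variable names (parts, word_id, UW) are chosen here for illustration.

```r
library(Matrix)

strings <- c("this", "is")

# Split every string into single characters (the equivalent of sep = "").
parts <- strsplit(strings, "")
segments <- unlist(parts)                          # all tokens in order of occurrence
word_id <- rep(seq_along(parts), lengths(parts))   # index of the string each token came from
unigrams <- sort(unique(segments))                 # unique tokens

# SW: segments (rows) by strings (columns); built without x, so it is a pattern matrix.
SW <- sparseMatrix(i = seq_along(segments), j = word_id,
                   dims = c(length(segments), length(strings)))

# US: unique unigrams (rows) by segments (columns).
US <- sparseMatrix(i = match(segments, unigrams), j = seq_along(segments),
                   dims = c(length(unigrams), length(segments)))

# The product counts how often each unigram occurs in each string.
UW <- US %*% SW
dimnames(UW) <- list(unigrams, strings)
UW
```

Each column of UW is then a count vector for one string, which is the form in which strings can be compared with ordinary sparse matrix operations.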

Examples

# a simple example to see the function at work
example <- c("this","is","an","example")
splitStrings(example)
splitStrings(example, simplify = TRUE)

# a bit larger, but still quick and efficient
# taking 15526 wordforms from the English Dalby Bible and splitting them into bigrams
data(bibles)
words <- splitText(bibles$eng)$wordforms
system.time( S <- splitStrings(words, simplify = TRUE) )

# and then taking the cosine similarity between the bigram-vectors for all word pairs
system.time( sim <- cosSparse(S) )

# most similar words to "father"
sort(sim["father",], decreasing = TRUE)[1:20]
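What the cosine similarity step computes can be sketched on a toy matrix. The following is a plain Matrix-package computation of cosine similarity between columns, not the code of cosSparse, which may offer other weighting or normalisation options; the matrix S and its labels are made up for illustration.

```r
library(Matrix)

# Toy count matrix: three bigrams (rows) by two words (columns).
S <- sparseMatrix(i = c(1, 2, 2, 3), j = c(1, 1, 2, 2), x = 1,
                  dims = c(3, 2),
                  dimnames = list(c("b1", "b2", "b3"), c("w1", "w2")))

# Euclidean length of every column vector.
norms <- sqrt(colSums(S^2))

# Dot products between all column pairs, divided by the product of their lengths.
sim <- as.matrix(crossprod(S)) / outer(norms, norms)
sim
```

The two toy words share one of their two bigrams, so their similarity lies between 0 and 1, while each word has similarity 1 with itself. The result here is converted to a dense matrix, which is only sensible for small inputs.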
