
qlcMatrix (version 0.9.2)

pwMatrix: Construct 'part-whole' (pw) matrices from tokenized strings

Description

A part-whole matrix is a sparse matrix representation of a vector of strings ('wholes') split into smaller parts by a specified separator. It records which strings consist of which parts. The transformation is not very interesting by itself, but it makes more elaborate computations possible through simple matrix manipulations.

Usage

pwMatrix(strings, sep = "", gap.length = 0, gap.symbol = "·", simplify = FALSE)

Arguments

Value

By default (when simplify = FALSE) the output is a list with two elements:

M: a sparse pattern Matrix of type ngCMatrix with all input strings as columns, and all separated elements as rows.

rownames: all different characters from the strings in order (i.e. all individual tokens of the original strings).

When simplify = TRUE, only the matrix M with row and column names is returned.

Details

Internally, this is basically using strsplit and some cosmetic changes, returning a sparse matrix.
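
The idea of splitting with strsplit and storing the result as a sparse pattern matrix can be sketched as follows. This is only an illustration, not the qlcMatrix source: the helper name pw_sketch is hypothetical, and the sketch ignores the gap.length, gap.symbol and simplify arguments.

```r
library(Matrix)

# Hypothetical helper for illustration; not the actual pwMatrix implementation.
pw_sketch <- function(strings, sep = "") {
  parts   <- strsplit(strings, split = sep)  # one vector of tokens per string
  n_parts <- lengths(parts)                  # number of tokens in each string
  tokens  <- unlist(parts)                   # all tokens, in order of occurrence
  M <- sparseMatrix(
    i = seq_along(tokens),                   # one row per token
    j = rep(seq_along(strings), n_parts),    # column of the string the token came from
    dims = c(length(tokens), length(strings))
  )                                          # no x given, so a pattern matrix is returned
  list(M = M, rownames = tokens)
}

pw <- pw_sketch(c("this", "is", "an", "example"))
dim(pw$M)      # tokens by strings
colSums(pw$M)  # number of tokens per string
```

Each row has exactly one non-zero entry, so the column sums give the length of every input string in tokens.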

See Also

Used in splitStrings and splitWordlist

Examples

library(qlcMatrix)

# By itself, this function does nothing really interesting
example <- c("this","is","an","example")
pw <- pwMatrix(example)
pw

# However, making a type-token Matrix (with ttMatrix) of the rownames
# and then taking a matrix product, results in frequencies of each element in the strings
tt <- ttMatrix(pw$rownames)
distr <- (tt$M*1) %*% (pw$M*1)
rownames(distr) <- tt$rownames
colnames(distr) <- example
distr

# Use banded sparse matrix with superdiagonal ('shift matrix') to get co-occurrence counts
# of adjacent characters. Rows list first character, columns adjacent character. 
# Non-zero entries list number of co-occurrences
S <- bandSparse( n = ncol(tt$M), k = 1) * 1
TT <- tt$M * 1
( C <- TT %*% S %*% t(TT) )

# show the non-zero entries as triplets:
s <- summary(C)
first <- tt$rownames[s[,1]]
second <- tt$rownames[s[,2]]
freq <- s[,3]
data.frame(first,second,freq)
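
As a rough cross-check of the shift-matrix computation, adjacent pairs can also be counted directly in base R. The sketch assumes that the rows of the part-whole matrix follow the tokens of all strings in order, so that pairs spanning the boundary between two strings (for example 's' followed by 'i' between "this" and "is") are counted as well. The variable names are illustrative only.

```r
# All tokens of the example strings, concatenated in order
tokens <- unlist(strsplit(c("this", "is", "an", "example"), split = ""))

first  <- head(tokens, -1)   # every token except the last
second <- tail(tokens, -1)   # the token that follows it

# Cross-tabulate the pairs and keep only the combinations that occur
pairs   <- as.data.frame(table(first, second), stringsAsFactors = FALSE)
nonzero <- pairs[pairs$Freq > 0, ]
nonzero
```

The pair ('i', 's') is the only one with a count above one here, because it occurs in both "this" and "is".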
