Words: Class `"Words"`

Description

Provides the ability to find, count, and plot words of specific length in collections of strings in any sequence language.

Usage

makeWords(opstrings, K, nb = 1)
countWords(opstrings, K, alpha = NULL)
plotWords(K, m)

Value

The makeWords function returns a table of words (of length K) along with the counts of the number of times each one was seen in the input strings. The countWords function returns the same table, but with the words decoded back to the original language. The plotWords function returns a vector of the word counts for all words of length K in the list m.

Arguments

opstrings: A character vector containing a set of words that have been encoded into an alphabet where each character uses the same number of bytes in the encoding.
K: An integer; the length of the words of interest.
nb: An integer; the number of bytes used to encode each character.
alpha: A Cipher object, used to decode the word-strings.
m: A list of word-counts produced by the makeWords function.

Author

Kevin R. Coombes <krc@silicovore.com>

Details

For constructing motifs, or for producing De Bruijn graphs, we need to be able to decompose a set of input strings into "words" of a fixed length. In our application, the words are derived from long-read sequences that cross multiple breakpoints. Each breakpoint is given a unique name/label, which can be of arbitrary length in order to be meaningful to the researchers. Using the Cipher class, we encode the breakpoint names into character strings of the same size. (In the original version of this package, we used single characters. That approach eventually proved to be inadequate when we looked at long-read data from samples with a very large number of breakpoints. We then extended the package to work with two-byte codes. This solution may eventually be extended to even longer coding sequences.)

The makeWords and countWords functions take as input a vector of character strings (typically describing long-read sequences) that have already been encoded into fixed-byte-length characters. They then find all words of a given fixed length in those strings. The two functions differ only in the form of their output: makeWords returns the word counts in their encoded form, while countWords decodes the words back to the original names, provided you supply the optional Cipher argument alpha.
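The word-finding step described above can be illustrated with a minimal base-R sketch. This is not the package's implementation: sketchWords is a hypothetical helper written only for illustration, and it assumes that every encoded character occupies exactly nb single-byte (ASCII) characters of the string. It slides a window of K encoded characters along each string, moving one encoded character at a time, and tabulates the words it sees.

```r
# Hypothetical helper (not part of the package): count every word made of
# K encoded characters, where each encoded character is nb bytes wide.
# Assumes the encoded strings contain only single-byte (ASCII) characters.
sketchWords <- function(opstrings, K, nb = 1) {
  width <- K * nb                       # width of one word in bytes
  words <- unlist(lapply(opstrings, function(s) {
    n <- nchar(s)
    if (n < width) return(character(0)) # string too short to hold a word
    # window start positions, advancing one encoded character per step
    starts <- seq(1, n - width + 1, by = nb)
    substring(s, starts, starts + width - 1)
  }))
  table(words)
}

toy <- c("ABCAB", "BCA")                # made-up one-byte encoded strings
counts <- sketchWords(toy, 2)
print(counts)

wide <- sketchWords("AABBCC", 2, nb = 2) # made-up two-byte codes AA, BB, CC
print(wide)
```

Stepping the window by nb rather than by 1 keeps every word aligned on character boundaries, which matters once the codes are more than one byte wide.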

The plotWords function gives a visual representation of words of length K sorted by their frequency. The x-axis contains the sorted word list; the y-axis is the frequency. The idea is that one can quickly figure out which words are most common in the input "text".
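The kind of display described here can be approximated with base graphics. The sketch below is an assumption about the general idea, not the code used by plotWords: the counts are made-up toy values, and sorting in decreasing order is an illustrative choice, since the sort direction used by plotWords is not stated here.

```r
# Made-up toy word counts; in practice these would come from a word table.
toyCounts <- c(AB = 2, BC = 5, CA = 1)

# Sort so that the most common words come first (illustrative choice).
sorted <- sort(toyCounts, decreasing = TRUE)

# One vertical bar per word: x is the rank in the sorted list, y the count.
plot(seq_along(sorted), as.numeric(sorted), type = "h", xaxt = "n",
     xlab = "Word (sorted by frequency)", ylab = "Frequency")
axis(1, at = seq_along(sorted), labels = names(sorted))
```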

Examples


data(longreads)             # read sample data
raw <- longreads$connection # get the raw strings
alfa <- Cipher(raw)         # make a translation cipher
coded <- encode(alfa, raw)  # encode all the input strings
makeWords(coded, 3)
countWords(coded, 3, alfa)
m <- lapply(1:8, function(J) countWords(coded, J, alfa))
plotWords(3, m)
