For constructing motifs, or for producing De Bruijn graphs, we need to
be able to decompose a set of input strings into "words" of a fixed
length. In our application, the words are derived from long-read
sequences that cross multiple breakpoints. Each breakpoint is given a
unique name/label, which can be of arbitrary length in order to be
meaningful to the researchers. Using the Cipher class, we
encode the breakpoint names into character strings of the same
size. (In the original version of this package, we used single
characters. That approach eventually proved to be inadequate when we
looked at long-read data from samples with a very large number of
breakpoints. We then extended the package to work with two-byte
codes. This solution may eventually be extended to even longer coding
sequences.)
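To make the encoding idea concrete, here is a minimal sketch in Python of a fixed-width cipher. It is not the package's Cipher class. The function names (build_cipher, encode, decode), the alphabet, and the breakpoint labels are all hypothetical and chosen only for illustration. The point it shows is that once every name maps to a code of the same width, an encoded read can be split back into codes without any separator.

```python
from itertools import product
import string


def build_cipher(names, width=2, alphabet=string.ascii_letters):
    """Map each distinct breakpoint name to a unique code of `width` characters."""
    unique = sorted(set(names))
    if len(unique) > len(alphabet) ** width:
        raise ValueError("too many names for this code width")
    # Codes are generated in order: "aa", "ab", "ac", ...
    codes = ("".join(chars) for chars in product(alphabet, repeat=width))
    return dict(zip(unique, codes))


def encode(cipher, names):
    """Concatenate the codes for a sequence of breakpoint names."""
    return "".join(cipher[name] for name in names)


def decode(cipher, text, width=2):
    """Split an encoded string into fixed-width codes and map them back to names."""
    reverse = {code: name for name, code in cipher.items()}
    return [reverse[text[i:i + width]] for i in range(0, len(text), width)]


# Hypothetical breakpoint labels, in the order one long read crosses them.
read = ["chr1:1200-del", "chr8:MYC-ins", "chr1:1200-del"]
cipher = build_cipher(read)
encoded = encode(cipher, read)
```

With single-character codes this alphabet could label at most 52 breakpoints; with width=2 the same alphabet allows 52 * 52 = 2704, which is the kind of headroom the move to longer codes provides.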
The makeWords and countWords functions take as input a
vector of character strings (typically describing long-read
sequences) that have already been encoded into fixed-byte-length
codes. They then find all words of a given fixed length in those
strings. They differ only in the form of their output. The former
function returns the word counts in their encoded form; the latter
decodes them back to the original names (provided that you supply the
optional Cipher argument).
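The word-finding step can be sketched as a sliding window that advances one code at a time. The Python below is an illustration under stated assumptions, not the package's implementation: find_words, count_words, the width and decoder parameters, and the example reads are hypothetical, and the sketch does not reproduce the signatures or return types of makeWords and countWords.

```python
from collections import Counter


def find_words(encoded_reads, k, width=2):
    """Return every run of k consecutive codes found in each encoded read."""
    span = k * width  # number of characters in one word
    words = []
    for read in encoded_reads:
        # Step by `width` so that a window always starts on a code boundary.
        for start in range(0, len(read) - span + 1, width):
            words.append(read[start:start + span])
    return words


def count_words(encoded_reads, k, width=2, decoder=None):
    """Count words; if a code-to-name mapping is given, report decoded names."""
    counts = Counter(find_words(encoded_reads, k, width))
    if decoder is None:
        return counts
    decoded = Counter()
    for word, n in counts.items():
        names = tuple(decoder[word[i:i + width]] for i in range(0, len(word), width))
        decoded[names] += n
    return decoded


# Two hypothetical reads, already encoded with two-character codes.
reads = ["aaabac", "abacaa"]
decoder = {"aa": "bp1", "ab": "bp2", "ac": "bp3"}  # hypothetical names

encoded_counts = count_words(reads, k=2)
decoded_counts = count_words(reads, k=2, decoder=decoder)
```

A read shorter than K codes contributes no words, because the range of window starts is empty.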
The plotWords function gives a visual representation of words
of length K sorted by their frequency. The x-axis shows the
sorted word list; the y-axis shows the frequency. The idea is that one
can quickly see which words are most common in the input "text".
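The ranking behind such a plot can be sketched as follows. This Python is only an illustration: rank_words, text_plot, and the counts are hypothetical, decreasing order is an assumption (the description above says only that the words are sorted by frequency), and the text rendering puts words on rows rather than along the x-axis.

```python
from collections import Counter


def rank_words(counts):
    """Sort (word, count) pairs by decreasing count, breaking ties by word."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def text_plot(ranked, bar="#"):
    """Render one line per word, with a bar whose length equals its count."""
    label_width = max((len(word) for word, _ in ranked), default=0)
    return [f"{word:<{label_width}} {bar * n} {n}" for word, n in ranked]


# Hypothetical counts for encoded words of length K = 2.
counts = Counter({"abac": 5, "aaab": 2, "acaa": 2, "adae": 1})
ranked = rank_words(counts)
lines = text_plot(ranked)
```

Printing the elements of lines one per row gives a quick view of which words dominate and how quickly the frequencies fall off.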