PDict-class: PDict objects

Description

The PDict class is a container for storing a preprocessed dictionary of DNA patterns that can later be passed to the matchPDict function for fast matching against a reference sequence (the subject).

PDict is the constructor function for creating new PDict objects.

Usage

PDict(x, max.mismatch=NA, tb.start=NA, tb.end=NA, tb.width=NA, algorithm="ACtree2", skip.invalid.patterns=FALSE)

Arguments

A character vector, a DNAStringSet object or an XStringViews object with a DNAString subject.

max.mismatch

A single non-negative integer or NA. See the "Allowing a small number of mismatching letters" section below.

tb.start,tb.end,tb.width

A single integer or NA. See the "Trusted Band" section below.

algorithm

"ACtree2" (the default) or "Twobit".

skip.invalid.patterns

This argument is not supported yet (and might in fact be replaced by the filter argument very soon).

Trusted Band

What's a Trusted Band? A Trusted Band is a region defined in the original dictionary where the limitations described above will apply. Why use a Trusted Band? Because the limitations described above will apply to the Trusted Band only! For example the Trusted Band cannot contain IUPAC ambiguity codes but the "head" and the "tail" can (see below for what those are). Also with a Trusted Band, if matchPdict is called with a non-null max.mismatch value then mismatching letters will be allowed in the head and the tail. Or, if matchPdict is called with fixed="subject", then IUPAC ambiguity codes in the head and the tail will be treated as ambiguities. How to specify a Trusted Band? Use the tb.start, tb.end and tb.width arguments of the PDict constructor in order to specify a Trusted Band. This will divide each pattern in the original dictionary into three parts: a left part, a middle part and a right part. The middle part is defined by its starting and ending nucleotide positions given relatively to each pattern thru the tb.start, tb.end and tb.width arguments. It must have the same length for all patterns (this common length is called the width of the Trusted Band). The left and right parts are defined implicitely: they are the parts that remain before (prefix) and after (suffix) the middle part, respectively. Therefore three DNAStringSet objects result from this division: the first one is made of all the left parts and forms the head of the PDict object, the second one is made of all the middle parts and forms the Trusted Band of the PDict object, and the third one is made of all the right parts and forms the tail of the PDict object. In other words you can think of the process of specifying a Trusted Band as drawing 2 vertical lines on the original dictionary (note that these 2 lines are not necessarily straight lines but the horizontal space between them must be constant). When doing this, you are dividing the dictionary into three regions (from left to right): the head, the Trusted Band and the tail. Each of them is a DNAStringSet object with the same number of elements than the original dictionary and the original dictionary could easily be reconstructed from those three regions. The width of the Trusted Band must be >= 1 because Trusted Bands of width 0 are not supported. Finally note that calling PDict with tb.start=NA, tb.end=NA and tb.width=NA (the default) is equivalent to calling it with tb.start=1, tb.end=-1 and tb.width=NA, which results in a full-width Trusted Band i.e. a Trusted Band that covers the entire dictionary (no head and no tail).

Allowing a small number of mismatching letters

[TODO]

Accessor methods

In the code snippets below, x is a PDict object.

: length(x): The number of patterns in x.
: width(x): A vector of non-negative integers containing the number of letters for each pattern in x.
: names(x): The names of the patterns in x.
: head(x): The head of x or NULL if x has no head.
: tb(x): The Trusted Band defined on x.
: tb.width(x): The width of the Trusted Band defined on x. Note that, unlike width(tb(x)), this is a single integer. And because the Trusted Band has a constant width, tb.width(x) is in fact equivalent to unique(width(tb(x))), or to width(tb(x))[1].
: tail(x): The tail of x or NULL if x has no tail.

Subsetting methods

In the code snippets below, x is a PDict object.

: x[[i]]: Extract the i-th pattern from x as a DNAString object.

Other methods

In the code snippet below, x is a PDict object.

: duplicated(x): [TODO]
: patternFrequency(x): [TODO]

Details

THIS IS STILL WORK IN PROGRESS!

If the original dictionary x is a character vector or an XStringViews object with a DNAString subject, then the PDict constructor will first try to turn it into a DNAStringSet object.

By default (i.e. if PDict is called with max.mismatch=NA, tb.start=NA, tb.end=NA and tb.width=NA) the following limitations apply: (1) the original dictionary can only contain base letters (i.e. only As, Cs, Gs and Ts), therefore IUPAC ambiguity codes are not allowed; (2) all the patterns in the dictionary must have the same length ("constant width" dictionary); and (3) later matchPdict can only be used with max.mismatch=0.

A Trusted Band can be used in order to relax these limitations (see the "Trusted Band" section below).

If you are planning to use the resulting PDict object in order to do inexact matching where valid hits are allowed to have a small number of mismatching letters, then see the "Allowing a small number of mismatching letters" section below.

Two preprocessing algorithms are currently supported: algorithm="ACtree2" (the default) and algorithm="Twobit". With the "ACtree2" algorithm, all the oligonucleotides in the Trusted Band are stored in a 4-ary Aho-Corasick tree. With the "Twobit" algorithm, the 2-bit-per-letter signatures of all the oligonucleotides in the Trusted Band are computed and the mapping from these signatures to the 1-based position of the corresponding oligonucleotide in the Trusted Band is stored in a way that allows very fast lookup. Only PDict objects preprocessed with the "ACtree2" algo can then be used with matchPdict (and family) and with fixed="pattern" (instead of fixed=TRUE, the default), so that IUPAC ambiguity codes in the subject are treated as ambiguities. PDict objects obtained with the "Twobit" algo don't allow this. See ?`matchPDict-inexact` for more information about support of IUPAC ambiguity codes in the subject.

References

Aho, Alfred V.; Margaret J. Corasick (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM 18 (6): 333-340.

Examples

Run this code

  ## ---------------------------------------------------------------------
  ## A. NO HEAD AND NO TAIL (THE DEFAULT)
  ## ---------------------------------------------------------------------
  library(drosophila2probe)
  dict0 <- DNAStringSet(drosophila2probe)
  dict0                                # The original dictionary.
  length(dict0)                        # Hundreds of thousands of patterns.
  unique(nchar(dict0))                 # Patterns are 25-mers.

  pdict0 <- PDict(dict0)               # Store the original dictionary in
                                       # a PDict object (preprocessing).
  pdict0
  class(pdict0)
  length(pdict0)                       # Same as length(dict0).
  tb.width(pdict0)                     # The width of the (implicit)
                                       # Trusted Band.
  sum(duplicated(pdict0))
  table(patternFrequency(pdict0))      # 9 patterns are repeated 3 times.
  pdict0[[1]]
  pdict0[[5]]

  ## ---------------------------------------------------------------------
  ## B. NO HEAD AND A TAIL
  ## ---------------------------------------------------------------------
  dict1 <- c("ACNG", "GT", "CGT", "AC")
  pdict1 <- PDict(dict1, tb.end=2)
  pdict1
  class(pdict1)
  length(pdict1)
  width(pdict1)
  head(pdict1)
  tb(pdict1)
  tb.width(pdict1)
  width(tb(pdict1))
  tail(pdict1)
  pdict1[[3]]

Run the code above in your browser using DataLab