biogram-package: biogram - analysis of biological sequences using n-grams

Description

The biogram package provides tools for the analysis of nucleic acid and protein sequences using n-grams. Possible applications include motif discovery, feature selection, clustering, and classification.

n-grams

n-grams (k-tuples) are ordered groups of n characters derived from the input sequence(s). They may form continuous sub-sequences or be discontinuous (gapped). For example, from the nucleotide sequence AATA one can extract the following continuous 2-grams (bigrams): AA, AT and TA. Moreover, there are two possible bigrams whose elements are separated by a single gap: A_T and A_A, and one bigram whose elements are separated by two gaps: A__A.
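The AATA example can be reproduced with a few lines of base R. This is only an illustrative sketch: `extract_bigrams` is a hypothetical helper written for this page, not a biogram function, and it handles bigrams only.

```r
# Hypothetical helper (not part of biogram): list all bigrams of a sequence
# whose two elements are separated by `gap` skipped positions.
extract_bigrams <- function(seq_string, gap = 0) {
  chars <- strsplit(seq_string, "")[[1]]
  n_starts <- length(chars) - gap - 1
  # Sequence too short for this gap: no bigrams can be formed
  if (n_starts < 1) return(character(0))
  first <- seq_len(n_starts)
  # Mark every skipped position with an underscore, as in the text above
  paste0(chars[first], strrep("_", gap), chars[first + gap + 1])
}

continuous <- extract_bigrams("AATA", gap = 0)  # AA, AT, TA
one_gap    <- extract_bigrams("AATA", gap = 1)  # A_T, A_A
two_gaps   <- extract_bigrams("AATA", gap = 2)  # A__A
```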

Another important n-gram parameter is its position. Instead of just counting n-grams, one may want to count how many n-grams occur at a given position in multiple (e.g. related) sequences. For example, in the sequences AATA and AACA there is only one distinct bigram at position 1: AA, but there are two distinct bigrams at position 2: AT and AC. The following notation is used for position-specific n-grams: 1_AA, 2_AT, 2_AC.
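Position-specific counting for the two sequences above can be sketched in base R as follows. The code only mimics the position_n-gram notation used here and does not reflect how biogram implements or names its output.

```r
# Illustrative base R sketch (not biogram code): count continuous bigrams
# per starting position across several sequences.
seqs <- c("AATA", "AACA")

positioned <- unlist(lapply(seqs, function(s) {
  chars <- strsplit(s, "")[[1]]
  starts <- seq_len(length(chars) - 1)
  # Prefix each bigram with its starting position, e.g. "1_AA"
  paste0(starts, "_", chars[starts], chars[starts + 1])
}))

# Number of sequences in which each positioned bigram occurs
pos_counts <- table(positioned)
```

Here `1_AA` is counted twice (it occurs in both sequences), whereas `2_AT` and `2_AC` are each counted once.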

In the biogram package, the count_ngrams function is used for counting and extracting n-grams. Using the d argument the user can specify the distance between elements of the n-grams. The pos argument can be used to enable position specificity.
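A minimal usage sketch of count_ngrams on the toy sequences from this section is given below. The sequences and the nucleotide alphabet are made up for illustration, and the argument names (`n`, `u`, `d`, `pos`) are written out on the assumption that they match the installed version of biogram; check `?count_ngrams` before relying on them. The code is guarded so that it is skipped when the package is not installed.

```r
# Usage sketch under stated assumptions; skipped if biogram is not installed.
if (requireNamespace("biogram", quietly = TRUE)) {
  library(biogram)

  # Toy input: one sequence per row, one residue per column
  seqs <- rbind(c("A", "A", "T", "A"),
                c("A", "A", "C", "A"))
  u <- c("A", "C", "G", "T")  # all possible unigrams (assumed alphabet)

  # Continuous bigrams
  bigrams <- count_ngrams(seqs, n = 2, u = u)
  # Bigrams whose elements are separated by one gap
  gapped <- count_ngrams(seqs, n = 2, u = u, d = 1)
  # Position-specific bigrams
  positioned <- count_ngrams(seqs, n = 2, u = u, pos = TRUE)
}
```

Each result should have one row per input sequence and one column per n-gram.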

n-gram data dimensionality

n-grams suffer from the curse of dimensionality. For example, for a peptide of length 6, $20^{n}$ n-grams and $6 \times 20^{n}$ positioned n-grams are possible. Data sets of this size are hard to manage and analyze in R.
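To make the growth concrete, the two expressions above can be evaluated for small n. The snippet simply follows the formulas as stated in this section; the variable names are illustrative.

```r
# Evaluate 20^n and 6 * 20^n for n = 1, 2, 3 (formulas as given in the text)
alphabet_size <- 20   # number of amino acids
peptide_length <- 6
n <- 1:3

unpositioned <- alphabet_size^n                       # 20, 400, 8000
positioned_total <- peptide_length * alphabet_size^n  # 120, 2400, 48000

data.frame(n = n,
           ngrams = unpositioned,
           positioned_ngrams = positioned_total)
```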

The biogram package addresses both of the above-mentioned problems, data management and analysis. It exploits the fact that n-gram data are usually sparse and can therefore be represented by sparse matrices, which are stored using functionality from the slam package. To ease the selection of significant features, biogram provides the user with QuiPT, a very fast permutation test for binary data (see test_features).
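The benefit of a sparse representation can be illustrated in base R by keeping only the non-zero cells of a count matrix as (row, column, value) triplets. This is a conceptual sketch of the idea with made-up counts; it does not use slam and does not show the data structure biogram actually returns.

```r
# Conceptual sketch: triplet (row, column, value) storage of a sparse matrix.
# 3 sequences x 8000 possible trigrams, almost all counts are zero.
dense <- matrix(0L, nrow = 3, ncol = 8000)
dense[1, 17] <- 1L
dense[2, 17] <- 1L
dense[3, 4021] <- 1L

# Keep only the coordinates and values of the non-zero cells
nz <- which(dense != 0L, arr.ind = TRUE)
triplets <- data.frame(i = nz[, "row"], j = nz[, "col"], v = dense[nz])

length(dense)   # cells stored by the dense matrix
nrow(triplets)  # cells stored by the triplet form
```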

Another way of reducing dimensionality is the aggregation of sequence residues into more general groups. For example, all positively charged amino acids may be aggregated into one group. This can be done using the degenerate function.
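The idea of aggregating residues can be sketched in base R with a simple lookup table. The grouping below (K, R and H as positively charged; D and E as negatively charged) and the peptide are chosen only for illustration, and the code shows the concept rather than the behaviour of degenerate itself.

```r
# Conceptual sketch (not the degenerate function): replace each residue
# by the name of the group it belongs to.
groups <- list(positive = c("K", "R", "H"),
               negative = c("D", "E"))

# Named vector mapping residue -> group name
lookup <- setNames(rep(names(groups), lengths(groups)), unlist(groups))

peptide <- c("K", "D", "R", "E", "H")
encoded <- unname(lookup[peptide])
# Five residues are now described by an alphabet of only two symbols
```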

Examples

#use the data set shipped with the package
data(human_cleave)
#the first nine columns hold nine consecutive amino acids from cleavage sites
#degenerate the sequences to reduce the dimensionality of the problem
#(use five groups instead of 20 amino acids)
deg_seqs <- degenerate(human_cleave[, 1L:9],
                      list(`1` = c(1, 6, 8, 10, 11, 18),
                           `2` = c(2, 13, 14, 16, 17),
                           `3` = c(5, 19, 20),
                           `4` = c(7, 9, 12, 15),
                           `5` = c(3, 4)))
#extract trigrams
trigrams <- count_ngrams(deg_seqs, 3, 1L:4, pos = TRUE)
#select features that differ between the two target groups
test1 <- test_features(human_cleave[, "tar"], trigrams)
#see a summary of the results
summary(test1)
#aggregate features in groups based on their p-value
gr <- cut(test1)
#take a closer look at the most significant n-grams
#get position map of n-grams
position_ngrams(gr[[1]])
#transform n-grams to more readable form
decode_ngrams(gr[[1]])