
biogram (version 1.0)

count_ngrams: Detect And Count N-Grams In Sequences

Description

Counts all n-grams present in sequences.

Usage

count_ngrams(seq, n, u, d = 0, pos = FALSE, scale = FALSE,
  threshold = 0)

Arguments

seq
integer vector or matrix describing sequence(s).
n
integer size of n-gram.
u
unigrams (integer, numeric or character vector).
d
integer vector of distances between elements of n-gram (0 means consecutive elements). See Details.
pos
logical, if TRUE n-grams contain position information.
scale
logical, if TRUE output data is normalized. Should be used only for n-grams without position information. See Details.
threshold
integer, if not equal to 0, data is binarized into two groups (greater than or equal to threshold, smaller than threshold).
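
To illustrate the threshold argument, the base R sketch below binarizes a made-up count matrix. It does not call count_ngrams; the matrix, its column names and the 1/0 coding of the two groups are assumptions chosen for illustration, and the function's actual output format may differ.

```r
# Made-up counts: 2 sequences (rows) by 3 n-grams (columns).
counts <- matrix(c(0, 2, 1, 3, 4, 1), nrow = 2,
                 dimnames = list(NULL, c("1.1_0", "1.2_0", "2.1_0")))
threshold <- 2

# Assumed coding: 1 where the count is at least the threshold, 0 otherwise.
binarized <- (counts >= threshold) * 1L
binarized
```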

Value

Details

The distance vector d should always have length n - 1. For example, when n = 3, d = c(1, 2) means A_A__A. For n = 4, d = c(2, 0, 1) means A__AA_A. If d has length 1, it is recycled to length n - 1.
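
The layout implied by n and d can be sketched in base R. The helper gap_pattern below is hypothetical (it is not part of biogram) and only draws the pattern, using A for elements and _ for skipped positions as in the examples above.

```r
# Hypothetical helper (not part of biogram): draw the layout implied by n and d.
gap_pattern <- function(n, d = 0, symbol = "A", gap = "_") {
  if (n == 1) return(symbol)
  # Recycle a single distance to length n - 1, as described above.
  if (length(d) == 1) d <- rep(d, n - 1)
  stopifnot(length(d) == n - 1)
  # Each element after the first is preceded by d[i] gap characters.
  gaps <- vapply(d, function(k) strrep(gap, k), character(1))
  paste0(symbol, paste0(gaps, symbol, collapse = ""))
}

gap_pattern(3, c(1, 2))     # "A_A__A"
gap_pattern(4, c(2, 0, 1))  # "A__AA_A"
gap_pattern(3, 0)           # "AAA", a single distance is recycled
```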

Column names follow a specific convention. Elements of the n-gram are separated by dots. If pos = TRUE, the left side of the name is the position of the n-gram, followed by _. The right side of the name is the vector of distances used, with values separated by _.

Examples of naming convention:

  • 46_4.4.4_0_1 means trigram 44_4 on position 46.
  • 12_2.1_2 means bigram 2__1 on position 12.
  • 8_1.1.1_0_0 means continuous trigram 111 on position 8.
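
The naming convention can be unpacked with a few lines of base R. The helper parse_ngram_name below is hypothetical (it is not part of biogram) and assumes names produced with pos = TRUE whose unigrams contain neither "_" nor ".".

```r
# Hypothetical helper (not part of biogram): split a positioned n-gram name.
parse_ngram_name <- function(name) {
  parts <- strsplit(name, "_", fixed = TRUE)[[1]]
  # Assumed layout with pos = TRUE: position, elements, then n - 1 distances.
  elements <- strsplit(parts[2], ".", fixed = TRUE)[[1]]
  list(position  = as.integer(parts[1]),
       elements  = elements,
       distances = as.integer(parts[-(1:2)]))
}

str(parse_ngram_name("46_4.4.4_0_1"))
str(parse_ngram_name("12_2.1_2"))
```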

See Also

Create vector of possible n-grams: create_ngrams.

Get n-grams from analyzed sequence: seq2ngrams.

Get indices of n-grams: get_ngrams_ind.

Count n-grams for multiple values of n: count_multigrams.

Examples

#trigrams without position for nucleotides
count_ngrams(sample(1L:4, 50, replace = TRUE), 3, 1L:4, pos = FALSE)
#trigrams with position from multiple nucleotide sequences
seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
count_ngrams(seqs, 3, 1L:4, pos = TRUE)
