biogram (version 1.1)

count_multigrams: Detect and count multiple n-grams in sequences

Description

A convenient wrapper around count_ngrams for counting n-grams for multiple values of n and d in a single call.

Usage

count_multigrams(ns, ds = rep(0, length(ns)), seq, u, pos = FALSE,
  scale = FALSE, threshold = 0)

Arguments

ns
numeric vector of n-gram sizes. See Details.
ds
list of distances between elements of n-grams. Each element of the list is a vector used as the distance for the n-gram size at the same position in ns.
seq
integer vector or matrix describing sequence(s).
u
integer, numeric or character vector of all possible unigrams.
pos
logical, if TRUE position-specific n-grams are counted.
scale
logical, if TRUE output data is normalized. Should be used only for n-grams without position information. See Details.
threshold
integer, if not equal to 0, data is binarized into two groups (greater than or equal to threshold vs. smaller than threshold).
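
The split described for threshold can be pictured with a small base R sketch. It does not call biogram. The toy matrix, its column names, and the 0/1 encoding are assumptions made for illustration; the values biogram actually returns may be encoded differently.

```r
# Toy count matrix standing in for n-gram counts (hypothetical data and
# column names, not biogram's naming convention)
counts <- matrix(c(0L, 1L, 3L, 2L, 0L, 5L), nrow = 2,
                 dimnames = list(NULL, c("a", "b", "c")))
threshold <- 2L

# Assumed encoding: 1 when count >= threshold, 0 when count < threshold
binarized <- (counts >= threshold) * 1L
binarized
```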

Value

  • an integer matrix with named columns. The naming conventions are the same as in count_ngrams.

Details

ns and ds must have equal length. Each element of ds is used as the d parameter for the corresponding element of ns. For example, if ns is c(4, 4, 4), ds must be a list of length 3, and each element of that list must have length 3 or 1, as required for the d parameter of the count_ngrams function.
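
The length rules above can be checked before calling the function. The helper below is a minimal base R sketch: check_ns_ds is a hypothetical name and is not part of biogram. It assumes that a full-length d holds one distance per gap, that is n - 1 values, which is consistent with the "length 3 or 1" rule stated for n = 4.

```r
# Hypothetical helper, not part of biogram: checks that ns and ds line up
# as described in Details.
check_ns_ds <- function(ns, ds) {
  # A plain vector such as rep(0, length(ns)) is treated as one d per n
  if (!is.list(ds)) {
    ds <- as.list(ds)
  }
  if (length(ns) != length(ds)) {
    stop("ns and ds must have equal length")
  }
  for (i in seq_along(ns)) {
    # Assumption: d has either length 1 or one distance per gap (n - 1)
    ok_len <- unique(c(1, max(ns[i] - 1, 1)))
    if (!(length(ds[[i]]) %in% ok_len)) {
      stop(sprintf("element %d of ds has length %d, expected %s",
                   i, length(ds[[i]]), paste(ok_len, collapse = " or ")))
    }
  }
  invisible(TRUE)
}

# Both calls pass silently
check_ns_ds(c(4, 4, 4), list(c(2, 0, 1), c(2, 1, 0), c(0, 1, 2)))
check_ns_ds(c(3, 1), list(c(1, 0), 0))
```

A call such as check_ns_ds(c(4, 4), list(0)) stops with an error, because ds has one element while ns has two.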

Examples

library(biogram)

# 12 random sequences of length 50 over the alphabet 1:4
seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
count_multigrams(c(3, 1), list(c(1, 0), 0), seqs, 1L:4, pos = TRUE)
# if the ds parameter is not supplied, n-grams are counted for distance 0
count_multigrams(c(3, 1), seq = seqs, u = 1L:4)

# count n-grams of the same length three times, each with different distances
# between elements
count_multigrams(c(4, 4, 4), list(c(2, 0, 1), c(2, 1, 0), c(0, 1, 2)),
                 seqs, 1L:4, pos = TRUE)