Counts all n-grams or position-specific n-grams present in the input sequence(s).
count_ngrams(seq, n, u, d = 0, pos = FALSE, scale = FALSE, threshold = 0)
seq : a vector or matrix describing sequence(s).
n : integer, size of the n-gram.
u : integer, numeric or character vector of all possible unigrams.
d : integer vector of distances between elements of the n-gram (0 means consecutive elements). See Details.
pos : logical, if TRUE position-specific n-grams are counted.
scale : logical, if TRUE output data is normalized. May be applied only to the counts of n-grams without position information. See Details.
threshold : integer, if not equal to 0, data is binarized into two groups (larger than or equal to threshold vs. smaller than threshold).
a simple_triplet_matrix where columns represent n-grams and rows represent sequences. See Details for specifics of the naming convention.
The distance vector d should always have length n - 1. For example, when n = 3, d = c(1,2) means A_A__A. For n = 4, d = c(2,0,1) means A__AA_A. If d has length 1, it is recycled to length n - 1.
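The distance convention can be illustrated with a short base R sketch. The helpers expand_distances and gap_pattern are hypothetical names introduced here for illustration and are not part of the package; they only mimic the recycling rule and the A_A__A notation used above.

```r
# Hypothetical helpers (not part of the package) illustrating the distance convention.

# Recycle a length-1 distance vector to length n - 1 and check the final length.
expand_distances <- function(n, d = 0) {
  if (length(d) == 1) d <- rep(d, n - 1)
  if (length(d) != n - 1) stop("d must have length 1 or n - 1")
  d
}

# Draw the gapped pattern: "A" marks an n-gram element, "_" a skipped position.
gap_pattern <- function(n, d = 0) {
  if (n == 1) return("A")
  d <- expand_distances(n, d)
  gaps <- vapply(d, function(k) strrep("_", k), character(1))
  paste0("A", paste0(gaps, "A", collapse = ""))
}

gap_pattern(3, c(1, 2))     # "A_A__A"
gap_pattern(4, c(2, 0, 1))  # "A__AA_A"
gap_pattern(3, 1)           # "A_A_A", because d is recycled to c(1, 1)
```

Under this convention an n-gram spans n + sum(d) positions of the sequence, which is why longer distances leave fewer positions where the n-gram can start.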
n-gram names follow a specific convention and have three parts for position-specific n-grams and two parts otherwise. The parts are separated by _. The . symbol is used to separate elements within a part. The general naming scheme is POSITION_NGRAM_DISTANCE. The optional POSITION part of the name indicates the actual position of the n-gram in the sequence(s) and is present only if pos = TRUE. This part is always a single integer. The NGRAM part of the name is the sequence of elements in the n-gram. For example, 4.2.2 indicates the n-gram 422 (e.g. TCC). The DISTANCE part of the name is a vector of distance(s). For example, 0.0 indicates zero distances (continuous n-grams), while 1.2 represents distances for the n-gram A_A__A.
Examples of n-gram names:
46_4.4.4_0.1 : trigram 44_4 on position 46
12_2.1_2 : bigram 2__1 on position 12
8_1.1.1_0.0 : continuous trigram 111 on position 8
1.1.1_0.0 : continuous trigram 111 without position information
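The naming scheme can be decoded with a few lines of base R. The function parse_ngram_name below is a hypothetical helper, not a function provided by the package, and it assumes that unigram labels contain neither _ nor . characters.

```r
# Hypothetical helper (not part of the package): split an n-gram name into its parts.
parse_ngram_name <- function(name) {
  parts <- strsplit(name, "_", fixed = TRUE)[[1]]
  if (length(parts) == 3) {
    # Three parts: POSITION_NGRAM_DISTANCE
    position <- as.integer(parts[1])
    parts <- parts[-1]
  } else if (length(parts) == 2) {
    # Two parts: NGRAM_DISTANCE, no position information
    position <- NA_integer_
  } else {
    stop("unexpected n-gram name: ", name)
  }
  list(
    position = position,
    ngram = strsplit(parts[1], ".", fixed = TRUE)[[1]],
    distance = as.integer(strsplit(parts[2], ".", fixed = TRUE)[[1]])
  )
}

# Position 46, elements "4" "4" "4", distances 0 and 1
parse_ngram_name("46_4.4.4_0.1")
# No position (NA), elements "1" "1" "1", distances 0 and 0
parse_ngram_name("1.1.1_0.0")
```

Elements are kept as character strings so that the same sketch works for integer, numeric or character unigrams.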
Create vector of possible n-grams: create_ngrams.
Extract n-grams from sequence(s): seq2ngrams.
Get indices of n-grams: get_ngrams_ind.
Count n-grams for multiple values of n: count_multigrams.
Count only specified n-grams: count_specified.
# load the package that provides count_ngrams
library(biogram)
# count trigrams without position information for nucleotides
count_ngrams(sample(1L:4, 50, replace = TRUE), 3, 1L:4, pos = FALSE)
# count position-specific trigrams from multiple nucleotide sequences
# (12 sequences in rows, 50 positions in columns)
seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
ngrams <- count_ngrams(seqs, 3, 1L:4, pos = TRUE)
# convert the sparse result to a dense matrix and print it
as.matrix(ngrams)