patternMatrix: Get scores that correspond to k-mer or PWM matrix occurrence for bases in each window

Description

The function produces one or more base-pair resolution matrices of scores that describe k-mer or PWM occurrence over predefined windows of equal width. It either locates pattern hits above a specified threshold and returns a matrix filled with 1 (pattern present) and 0 (pattern absent), or returns a matrix of the scores themselves. If pattern is a character vector of length 1 or a single PWM matrix, the function returns a ScoreMatrix object; if it is a character vector of length greater than 1 or a list of PWMs, it returns a ScoreMatrixList.
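To make the shape of the output concrete, the following base R sketch builds a 0/1 matrix for a single k-mer over equal-width sequences. It is not how patternMatrix is implemented: kmer_hit_matrix is a hypothetical helper, it only handles exact matches, and marking the start position of each hit is an assumption made for illustration.

```r
# Hypothetical helper, not part of genomation: one row per sequence,
# one column per base, 1 at the start position of every exact k-mer hit.
kmer_hit_matrix <- function(seqs, kmer) {
  width <- unique(nchar(seqs))
  stopifnot(length(width) == 1)          # all windows must have equal width
  k <- nchar(kmer)
  starts <- seq_len(width - k + 1)       # every possible hit start
  m <- matrix(0L, nrow = length(seqs), ncol = width)
  for (i in seq_along(seqs)) {
    found <- substring(seqs[i], starts, starts + k - 1) == kmer
    m[i, starts[found]] <- 1L
  }
  m
}

toy <- c("ACGTACGT", "TTTTACGA", "GGGGGGGG")
hits <- kmer_hit_matrix(toy, "ACG")
hits
```

Each row corresponds to one window and each column to one base position, which is the layout a ScoreMatrix is described as having.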

Usage

patternMatrix(pattern, windows, genome = NULL, min.score = 0.8,
  asPercentage = FALSE, cores = 1)

## S4 method for signature 'character,DNAStringSet'
patternMatrix(pattern, windows, asPercentage, cores)

## S4 method for signature 'character,GRanges,BSgenome'
patternMatrix(pattern, windows, genome, cores)

## S4 method for signature 'matrix,DNAStringSet'
patternMatrix(pattern, windows, min.score, asPercentage, cores)

## S4 method for signature 'matrix,GRanges,BSgenome'
patternMatrix(pattern, windows, genome, min.score, asPercentage, cores)

## S4 method for signature 'list,DNAStringSet'
patternMatrix(pattern, windows, min.score, asPercentage, cores)

## S4 method for signature 'list,GRanges,BSgenome'
patternMatrix(pattern, windows, genome, min.score, asPercentage, cores)

Arguments

pattern

a PWM matrix, a list of PWM matrices, or a character vector of length 1 or more. A PWM matrix must have one row per nucleotide, in the order "A", "C", "G", "T". IUPAC ambiguity codes can be used in the pattern; each code matches any letter in the subject that is associated with it.

windows

GRanges or DNAStringSet object whose ranges or sequences all have the same width.

genome

BSgenome object; supplied when windows is a GRanges object (see the method signatures in Usage).

min.score

numeric or character giving the minimum score required to count a match. It can be a single number or a character string containing a percentage of the highest possible score (the default is 0.8, i.e. "80%"). If min.score is set to NULL, patternMatrix returns the scores themselves instead of a 0/1 matrix.

asPercentage

logical indicating whether scores are reported as a percentage of the maximal PWM score (TRUE) or as raw scores (FALSE). The signature in Usage sets the default to FALSE.

cores

the number of cores to use (default: 1). Using more than one core is supported only on Unix-like platforms.
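The interplay of min.score and asPercentage can be illustrated with a small base R sketch. It assumes that the highest possible score is the sum of the column maxima of the PWM and that a window's score is the sum of the matrix entries selected by its bases. The toy PWM values and the helper pwm_scores are made up for this example and are not part of genomation.

```r
# Toy PWM: rows are A, C, G, T; columns are motif positions.
# The values are invented; the best-scoring motif is "AGC".
pwm <- matrix(c(8, 1, 1, 1,
                1, 1, 8, 1,
                1, 8, 1, 1),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

# Hypothetical helper: raw score of every length-k window in one sequence.
pwm_scores <- function(dna, pwm) {
  k <- ncol(pwm)
  bases <- match(strsplit(dna, "")[[1]], rownames(pwm))  # base -> row index
  starts <- seq_len(length(bases) - k + 1)
  sapply(starts, function(s) {
    sum(pwm[cbind(bases[s:(s + k - 1)], seq_len(k))])
  })
}

max_score <- sum(apply(pwm, 2, max))        # highest possible score
raw <- pwm_scores("TTAGCAT", pwm)           # raw scores per start position
pct <- 100 * raw / max_score                # scores as percentage of maximum
hits <- as.integer(raw >= 0.8 * max_score)  # 0/1 vector for an "80%" threshold
```

Under these assumptions, raw corresponds to what asPercentage = FALSE would describe, pct to asPercentage = TRUE, and hits to the presence/absence output obtained when a threshold is given.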

Value

Returns a ScoreMatrix object or a ScoreMatrixList object.

Details

patternMatrix is based on functions from the seqPattern package: getPatternOccurrenceList, which finds the positions of a character-vector pattern in a list of sequences (a DNAStringSet object), and an adapted version of motifScanHits, which finds a PWM pattern in sequences (a DNAStringSet object).

If cores > 1, occurrences of the pattern are counted for every window in parallel.
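As a rough illustration of per-window parallelism, the sketch below distributes a counting function over sequences with parallel::mclapply, which forks worker processes and therefore runs with more than one core only on Unix-like platforms. Whether patternMatrix uses mclapply internally is an assumption here, and count_kmer is a hypothetical helper.

```r
library(parallel)

# Hypothetical helper: number of exact k-mer hits in one sequence.
count_kmer <- function(s, kmer) {
  k <- nchar(kmer)
  starts <- seq_len(nchar(s) - k + 1)
  sum(substring(s, starts, starts + k - 1) == kmer)
}

toy <- c("ACGTACGT", "TTTTACGA", "GGGGGGGG")

# Fall back to a single core where forking is unavailable.
cores <- if (.Platform$OS.type == "unix") 2L else 1L

# Each sequence (window) is processed independently of the others.
counts <- unlist(mclapply(toy, count_kmer, kmer = "ACG", mc.cores = cores))
counts
```

Because the windows are independent, the result does not depend on the number of cores; only the run time changes.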

Examples

library(genomation)  # provides patternMatrix
library(Biostrings)  # provides DNAStringSet

# consensus sequence of the CTCF motif
motif = "CCGCGNGGNGGCAG"
# Create 10 random DNA sequences, each 180 bases long
seqs = sapply(1:10,
       function(x) paste(sample(c("A","T","G","C"), 180, replace=TRUE), collapse=""))
windows = DNAStringSet(seqs)
p = patternMatrix(pattern=motif, windows=windows, min.score=0.8)
p