motifEnrichment: Motif enrichment

Description

Calculate motif enrichment using one of the available scoring algorithms and background corrections.

Usage

motifEnrichment(sequences, pwms, score = "autodetect", bg = "autodetect", cutoff = NULL, verbose = TRUE, motif.shuffles = 30, B = 1000, group.only = FALSE)

Arguments

sequences

the sequences to be scanned for enrichment. Can be either a single sequence (an object of class DNAString), or a list of DNAString objects, or a DNAStringSet object.

pwms

this parameter can take multiple values depending on the scoring scheme and background correction used. When the score parameter is set to "autodetect", the following default algorithms are used:

if pwms is a list containing either frequency matrices or PWM objects, then the "affinity" algorithm is selected. If frequency matrices are given, they are converted to PWMs using a uniform background. For best performance, convert frequency matrices to PWMs using a realistic genomic background before calling this function.
Otherwise, the appropriate scoring scheme and background correction are selected based on the class of the object (see below).
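
The effect of the background on the conversion from a frequency matrix to a PWM can be sketched in base R. This is an illustrative sketch only, not the package's own conversion routine: the count matrix, the background frequencies, the pseudocount, and the helper name freq_to_pwm are all assumptions made for this example.

```r
# Hypothetical count matrix for a 4-bp motif (rows: A, C, G, T).
# matrix() fills column by column, so each line of values is one motif position.
pfm <- matrix(c(8, 1, 0, 1,
                0, 9, 1, 0,
                1, 0, 9, 0,
                7, 1, 1, 1),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

# Assumed backgrounds: uniform, and a made-up AT-rich genomic background.
bg.uniform <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)
bg.genomic <- c(A = 0.30, C = 0.20, G = 0.20, T = 0.30)

# Hypothetical helper: counts -> log2-odds PWM relative to a background.
freq_to_pwm <- function(pfm, prior, pseudocount = 1) {
  prior <- prior[rownames(pfm)]
  # Add a background-weighted pseudocount (vector is recycled down each column).
  counts <- pfm + pseudocount * prior
  # Normalise each column to probabilities.
  probs <- sweep(counts, 2, colSums(counts), "/")
  # Log2 ratio of motif probability to background probability.
  log2(probs / prior)
}

pwm.uniform <- freq_to_pwm(pfm, bg.uniform)
pwm.genomic <- freq_to_pwm(pfm, bg.genomic)
```

With the AT-rich background assumed here, a C or G at a conserved position receives a higher log-odds weight than under the uniform background, which is why the choice of background changes downstream scores.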

score

this parameter determines which scoring scheme to use. The following schemes are available:

"autodetect" - default value. Scoring method is determined based on the type of pwms parameter.
"affinity" - use threshold-free affinity scores without a background. The pwms parameter can either be a list of frequency matrices, PWM objects, or a PWMLognBackground object.
"cutoff" - use number of motif hits above a score cutoff as a measure of enrichment. No background correction is performed. The pwms parameter can either be a list of frequency matrices, PWM objects, or a PWMCutoffBackground object.
"clover" - use the Clover algorithm (Frith et al, 2004). The Clover score of a single sequence is identical to the affinity score, while for a group of sequences it is an average of products of affinities over all sequence subsets.
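
The difference between the threshold-free "affinity" scheme and the hit-counting "cutoff" scheme can be illustrated with a small base R sketch. It is a simplification under stated assumptions, not the package's implementation: the PWM is a made-up log2-odds matrix, only the forward strand is scanned, and the helper name window_scores and the cutoff value of 2 are hypothetical.

```r
# Hypothetical log2-odds PWM of width 3 (rows: A, C, G, T) favouring "ACG".
pwm <- matrix(c( 1, -1, -1, -1,
                -1,  1, -1, -1,
                -1, -1,  1, -1),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

# Hypothetical helper: PWM score of every window of the sequence
# (forward strand only; assumes the sequence is at least as long as the motif).
window_scores <- function(sequence, pwm) {
  bases <- strsplit(sequence, "")[[1]]
  w <- ncol(pwm)
  starts <- seq_len(length(bases) - w + 1)
  vapply(starts, function(i) {
    rows <- match(bases[i:(i + w - 1)], rownames(pwm))
    # Pick one matrix entry per motif position and add them up.
    sum(pwm[cbind(rows, seq_len(w))])
  }, numeric(1))
}

scores <- window_scores("TACGACGT", pwm)

# "affinity"-style score: average odds over all windows, no threshold.
affinity <- mean(2^scores)

# "cutoff"-style score: number of windows scoring above an assumed cutoff.
hits <- sum(scores > 2)
```

In this toy case the two "ACG" windows dominate both measures, but the affinity score also accumulates small contributions from every weak window, whereas the hit count ignores anything below the cutoff.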

bg

this parameter determines which background correction to use, if any.

"autodetect" - default value. Background correction is determined based on the type of the pwms parameter.
"logn" - use a lognormal distribution background pre-computed for a set of PWMs. This requires pwms to be of class PWMLognBackground.
"z" - use a z-score for the number of significant motif hits compared to background number of hits. This requires pwms to be of class PWMCutoffBackground.
"pval" - use empirical P-value based on a set of background sequences. This requires pwms to be of class PWMEmpiricalBackground. Note that PWMEmpiricalBackground objects tend to be very large so that the empirical P-value can be calculated in reasonable time.
"ms" - shuffle columns of motif matrices and use that as the basis for P-value calculation. Note that since the sequences need to be rescanned with all of the new shuffled motifs, this can be very slow. This also works only on *individual* sequences, not groups.
"none" - no background correction
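
The idea behind the motif-shuffling ("ms") correction can be sketched in base R. This is a conceptual sketch under stated assumptions rather than the package's code: the helper names shuffle_motif and shuffle_pvalue, the toy PWM, the toy scoring function, and the add-one empirical P-value estimator are all choices made for this illustration.

```r
set.seed(1)

# Hypothetical helper: permute motif positions (columns), keeping each column intact.
shuffle_motif <- function(pwm) pwm[, sample(ncol(pwm)), drop = FALSE]

# Hypothetical helper: empirical P-value of the observed score against scores
# obtained with shuffled versions of the same motif.
# score_fun is an assumed function(pwm) that returns a single number.
shuffle_pvalue <- function(pwm, score_fun, n.shuffles = 30) {
  observed <- score_fun(pwm)
  null <- replicate(n.shuffles, score_fun(shuffle_motif(pwm)))
  # Add-one estimator so the P-value is never exactly zero (an assumption here).
  (sum(null >= observed) + 1) / (n.shuffles + 1)
}

# Toy PWM that strongly prefers the sequence "ACGT".
pwm <- matrix(-1, nrow = 4, ncol = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))
diag(pwm) <- 2

# Toy scoring function: score of the fixed sequence "ACGT" against a motif.
score_acgt <- function(m) sum(m[cbind(1:4, 1:4)])

p <- shuffle_pvalue(pwm, score_acgt, n.shuffles = 30)
```

Every shuffle requires rescoring the sequence, which is why the cost grows with the number of shuffles and with the number of motifs.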

cutoff

the score cutoff for a significant motif hit if scoring scheme "cutoff" is selected.

verbose

whether to print verbose output

motif.shuffles

number of times to shuffle motifs if using "ms" background correction

B

number of replicates when calculating empirical P-value

group.only

whether to return statistics only for the group of sequences, not individual sequences. In the case of an empirical background, the P-values for individual sequences are not calculated (thus saving time); for other backgrounds they are calculated but not returned.

Value

a MotifEnrichmentResults object containing a subset of the following elements:

"score" - scoring scheme used
"bg" - background correction used
"params" - any additional parameters
"sequences" - the set of sequences used
"pwms" - the set of pwms used
"sequence.nobg" - per-sequence scores without any background correction. For "affinity" and "clover" a matrix of mean affinity scores; for "cutoff" number of significant hits above a cutoff
"sequence.bg" - per-sequence scores after background correction. For "logn" and "pval" the P-value (smaller is better); for "z" and "ms" background corrections the z-scores (bigger is better).
"group.nobg" - aggregate scores for the whole group of sequences without background correction. For "affinity" and "clover" the mean affinity over all sequences in the set; for "cutoff" the total number of hits in all sequences.
"group.bg" - aggregate scores for the whole group of sequences with background correction. For "logn" and "pval", the P-value for the whole group (smaller is better); for "z" and "ms" the z-score for the whole set (bigger is better).
"sequence.norm" - (only for "logn") the length-normalized scores for each of the sequences, returned as values normalized from a LogN(0,1) distribution
"group.norm" - (only for "logn") similar to sequence.norm, but for the whole group of sequences

Details

This function provides an interface to all algorithms available in PWMEnrich for finding motif enrichment in a single sequence or a group of sequences, with or without background correction.

Since for all algorithms the first step involves calculating raw scores without background correction, the output always contains the scores without background correction together with (optional) background-corrected scores.

Unless otherwise specified the scores are returned both separately for each sequence (without/with background) and for the whole group of sequences (without/with background).

To use a background correction you need to supply a set of PWMs with precompiled background distribution parameters (see function makeBackground). When such an object is supplied as the pwms parameter, the scoring scheme and background correction are automatically determined.

There are additional packages with pre-computed backgrounds (e.g. see package PWMEnrich.Dmelanogaster.background).

Please refer to (Stojnic & Adryan, 2012) for more details on the algorithms.

References

R. Stojnic & B. Adryan: Identification of functional DNA motifs using a binding affinity lognormal background distribution, submitted.
MC Frith et al: Detection of functional DNA motifs via statistical over-representation, Nucleic Acids Research (2004).

Examples

if(require("PWMEnrich.Dmelanogaster.background")){
   ###
   # load the pre-compiled lognormal background
   data(PWMLogn.dm3.MotifDb.Dmel)

   # scan two sequences for motif enrichment
   sequences = list(DNAString("GAAGTATCAAGTGACCAGTAGATTGAAGTAGACCAGTC"), DNAString("AGGTAGATAGAACAGTAGGCAATGGGGGAAATTGAGAGTC"))
   res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel)

   # most enriched in both sequences (lognormal background P-value)
   head(motifRankingForGroup(res))

   # most enriched in both sequences (raw affinity, no background)
   head(motifRankingForGroup(res, bg=FALSE))

   # most enriched in the first sequence (lognormal background P-value)
   head(motifRankingForSequence(res, 1))

   # most enriched in the first sequence (raw affinity, no background)
   head(motifRankingForSequence(res, 1, bg=FALSE))

   ###
   # Load the pre-compiled background for hit-based motif counts with cutoff of P-value = 0.001
   data(PWMPvalueCutoff1e3.dm3.MotifDb.Dmel)

   res.count = motifEnrichment(sequences, PWMPvalueCutoff1e3.dm3.MotifDb.Dmel)

   # Enrichment in the whole group, z-score for the number of motif hits
   head(motifRankingForGroup(res.count))

   # First sequence, sorted by number of motif hits with P-value < 0.001
   head(motifRankingForSequence(res.count, 1, bg=FALSE))

}
