scanning: Scan Statistics

Description

The function implements the scan statistics method of Kabluchko and Spodarev (2009), Theorem 3.1.

Usage

scanning(pos, freq, file, tuningUnits, alpha = 0.1, coarsening = 1,
         minscans=0, maxscans = 0, sumscan = FALSE, perSNP = TRUE,
         colname , n, threshold, collect=!old.def, old.def=FALSE,
         max.intervals = length(alpha) * 100000,
         max.basepair.distance = 50000, exclude.negative.at.boundary = TRUE,
         maximum = TRUE, mean.freq, sd.freq, mean.n)
scan.statistics(file, tuningUnits, alpha=c(0.05, 0.01), repet=1000,
                coarsening = 1,
                minscans=0, maxscans=0, sumscan = FALSE, perSNP=TRUE, 
                colname, n, return.simu = FALSE,
                debug = FALSE, formula = FALSE, 
                old.def=FALSE,
                max.intervals = length(alpha) * 100000,
                max.basepair.distance = 50000,
                exclude.negative.at.boundary = TRUE,
                pos, freq)

Arguments

pos, freq

alternatively to the file name, two vectors, pos and freq might be given.

file

filename or list. The rda file must contain the variables pos, freq, colname, and n. Or it is a list with the same named elements.

If the extension of the filename is ‘bed’, the behaviour of the programme is different, see the details

tuningUnits

real number. The value 0 codes the case of Theorem 3.1 in Kabluchko and Spordarev (2009). A positive value codes the case of Theorem 2.1 (which is very much preferred). The case of Theorem 3.2 does not suit, hence is not coded. Good values for tuningUnits seem to be around \(0.85\).

Note that first, the frequencies are standardized. Then tuningUnits\( * \)mean\((n) / n\) is substracted.

alpha

level(s) of testing. The levels should decrease.

coarsening

integer. If the value is larger than 1 then the data are first windower'ed by length=coarsening. This is important to do if the data are fine scaled!

repet

The number of simulation to determine the threshold(s) for testing in scan.statistics; see also formula. Should be at least 100 better 1000.

minscans,maxscans

integers. The minimunm and maximum length of the window, respectively. If non-positive the window sizes are not restricted from below or above, respectively.

sumscan

logical. If TRUE the old style picture appears. Otherwise the relative number of significant intervals containing a certain point is shown.

perSNP

logical. If TRUE then the test is based on SNPs as units. If FALSE the test is based on basepairs (not programmed yet).

colname

the column of the data frame that gives the relative frequencies. The default name (i.e., if missing) is "HeterAB". Alternatively colname is a number indicating the respective column.

In case the extension of the filename equals ‘bed’, the behaviour is different, see Details.

The number of individuals, the data are based on. Usually that number is determined automatically, but might be given for safety explicitely

return.simu

logical. to do

debug

logical or 2. If not FALSE important data are saved on the disk. If debug == 2 pictures of each simulation are shown. [to do in more detail]

threshold

scanning counts the number of intervals found above the given threshold. threshold is an alternative to alpha and is used instead of alpha if both are given. This threshold is applied to the standardized frequency data.

A value around 0.8 seems to be appropriate for Christian's data whereas values around 18 are appropriate for Amanda's data.

collect

scanning can be used in two ways. If collect=FALSE essentially only the scan statistic is determined. If collect=TRUE then also all the intervalls are determined that are considered to be significant at the given alpha levels.

old.def

logical. If TRUE all the tiny snippets that have not been agglutinated yet, are also returned. If TRUE it takes a lot of memory.

Further, if TRUE, negative (modified) values are allowed at the borders of an interval.

Finally, if TRUE the parameters max.intervals, max.basepair.distance, exclude.negative.at.boundary are not considered.

max.intervals

[only if old.def=FALSE]

As the number of intervals is determined dynamically, the total number of significant intervals cannot be determined in advance. To economise a lot of copying, an upper threshold is given by the user. 100000 for each level should be large enough. If not, please contact the author.

max.basepair.distance

[only if old.def=FALSE] if a basepair distance is larger than max.basepair.distance then the significant areas are considered as two separate areas.

exclude.negative.at.boundary

logical. If TRUE negative values at boundaries are not allowed. I.e. each significant area starts and ends with a positive modified frequency.

maximum

logical. MISSING DOC

mean.freq

If given, mean.freq overwrites mean(freq)

sd.freq

If given, sd.freq overwrites sd(freq)

mean.n

If given, mean.n overwrites mean(n)

formula

if formula=TRUE then the formula of Kabluchko and Spodarev (2009) is used in scan.statististics. Otherwise, a repet number of simulations under the null hypothesis are performed to get the threshold right.

Value

scanning

returns invisibly a list that contains always

file, pos, freq, tuningUnits, alpha, n, maxscans, perSNP: the input data
above.threshold: the number of intervals showing a total sum larger than the given threshold.
threshold: corresponding to alpha, if not given explicitely
maximum: the maximum value reached scanning over all windows

if collect=TRUE then the list also contains

areas: matrix of three rows containing information of all the (overlapping) intervals where the sums exceeds the thresholds. Each interval is given by a column. First row: left end point of the interval. Second row: right end point of the interval. Third interval: maximum number of threshold that are passed.
values: the sums that correspond to the maxima in areas
significant.areas: list of matrices. For each threshold, all the overlapping intervals are joined that overlap, so that non-overlapping intervals are finally obtained.
Message: whether the null hypothesis is rejected at the lowest alpha level.

scan.statistics

returns invisibly a list containing all elements of scanning for collect=TRUE. Additionally, it contains

maxima: the maxima of repet simulated data if formula=FALSE

Details

The ideas for the code are taken from Kabluchko and Spordarev (2009) although the values are not calculated from the respective theorems. Instead, values are obtained by simulation in a procedure similar to Bootstrapping.

In case the file is a bed-file, the following differences to the standard behaviour appears:

colname must be of the form c(pos=, freq=, n=) with default value c(pos=3, freq=4, n=5)
the sign of the frequency is changed
it is not checked whether the frequencies * n equals an integer number

References

Kabluchko, Z. and Spodarev, E. (2009) Scan statistics of Levy noises and marked empirical processes. Adv. Appl. Probab. 41, 13-37.

Examples

Run this code

# NOT RUN {
 
# }
# NOT RUN {
<!-- % library(miraculix) -->
# }
# NOT RUN {
if (interactive()) {
  n <- 30
  loci <- 9000 
  positions <- 25000:15000000
} else {
  n <- 3
  loci <- 900
  positions <- 2500:1500000
}
pos <- sort(sample(positions, loci))
freq <- rpois(loci, lambda=0.3) / n

alpha <- c(0.1, 0.05, 0.01)
s <- scan.statistics(n=n, pos=pos, freq=freq, repet=100,
                     tuningUnits=0.65, alpha=alpha)
str(s)
# }

Run the code above in your browser using DataLab