The function implements the scan statistics method of Kabluchko and Spodarev (2009), Theorem 3.1.
scanning(pos, freq, file, tuningUnits, alpha = 0.1, coarsening = 1,
minscans=0, maxscans = 0, sumscan = FALSE, perSNP = TRUE,
colname , n, threshold, collect=!old.def, old.def=FALSE,
max.intervals = length(alpha) * 100000,
max.basepair.distance = 50000, exclude.negative.at.boundary = TRUE,
maximum = TRUE, mean.freq, sd.freq, mean.n)scan.statistics(file, tuningUnits, alpha=c(0.05, 0.01), repet=1000,
coarsening = 1,
minscans=0, maxscans=0, sumscan = FALSE, perSNP=TRUE,
colname, n, return.simu = FALSE,
debug = FALSE, formula = FALSE,
old.def=FALSE,
max.intervals = length(alpha) * 100000,
max.basepair.distance = 50000,
exclude.negative.at.boundary = TRUE,
pos, freq)
alternatively to the file name, two vectors,
pos and freq might be given.
filename or list.
The rda file must contain the variables pos, freq,
colname, and n. Or it is a list with the same named elements.
If the extension of the filename is ‘bed’, the behaviour of the programme is different, see the details
real number. The value 0 codes the case
of Theorem 3.1 in Kabluchko and Spordarev (2009).
A positive value codes the case of Theorem 2.1 (which is very much
preferred). The case of Theorem 3.2 does not suit, hence is not
coded.
Good values for tuningUnits seem to be around
\(0.85\).
Note that first, the frequencies are standardized.
Then tuningUnits\( * \)mean\((n) /
n\) is substracted.
level(s) of testing. The levels should decrease.
integer. If the value is larger than 1 then the
data are first windower'ed by length=coarsening.
This is important to do if the data are fine scaled!
The number of simulation to determine the
threshold(s) for testing in scan.statistics;
see also formula. Should be at least 100 better 1000.
integers. The minimunm and maximum length of the window, respectively. If non-positive the window sizes are not restricted from below or above, respectively.
logical.
If TRUE the old style picture appears. Otherwise
the relative number of significant intervals containing a certain
point is shown.
logical. If TRUE then the test is based on SNPs
as units. If FALSE the test is based on basepairs (not
programmed yet).
the column of the data frame that gives the relative
frequencies. The default name (i.e., if missing) is "HeterAB".
Alternatively colname is a number indicating the respective column.
In case the extension of the filename equals ‘bed’, the behaviour is different, see Details.
The number of individuals, the data are based on. Usually that number is determined automatically, but might be given for safety explicitely
logical. to do
logical or 2.
If not FALSE important data are saved on the disk.
If debug == 2 pictures of each simulation are shown.
[to do in more detail]
scanning counts the number of intervals found above the given
threshold. threshold is an alternative to alpha
and is used instead of alpha if both are given.
This threshold is applied to the standardized frequency data.
A value around 0.8 seems to be appropriate for Christian's data whereas values around 18 are appropriate for Amanda's data.
scanning can be used in two ways. If collect=FALSE
essentially only the scan statistic is determined.
If collect=TRUE then also all the intervalls are determined
that are considered to be significant at the given alpha
levels.
logical. If TRUE all the tiny snippets that
have not been agglutinated yet, are also returned. If TRUE it
takes a lot of memory.
Further, if TRUE, negative (modified) values are allowed at
the borders of an interval.
Finally, if TRUE the parameters max.intervals,
max.basepair.distance, exclude.negative.at.boundary
are not considered.
[only if old.def=FALSE]
As the number of intervals is determined dynamically, the total number of significant intervals cannot be determined in advance. To economise a lot of copying, an upper threshold is given by the user. 100000 for each level should be large enough. If not, please contact the author.
[only if old.def=FALSE]
if a basepair distance is larger than max.basepair.distance
then the significant areas are considered as two separate areas.
logical. If TRUE negative values at boundaries are not
allowed. I.e. each significant area starts and ends with a
positive modified frequency.
logical. MISSING DOC
If given, mean.freq overwrites mean(freq)
If given, sd.freq overwrites sd(freq)
If given, mean.n overwrites mean(n)
if formula=TRUE then
the formula of Kabluchko and Spodarev (2009) is used in
scan.statististics.
Otherwise, a repet number of simulations under the null hypothesis
are performed to get the threshold right.
scanningreturns invisibly a list that contains always
the input data
the number of intervals showing a total sum
larger than the given threshold.
corresponding to alpha, if not given explicitely
the maximum value reached scanning over all windows
collect=TRUE then the list also contains
matrix of three rows containing information of all the
(overlapping) intervals where the sums exceeds the thresholds.
Each interval is given by a column. First row: left end point of
the interval. Second row: right end point of the interval.
Third interval: maximum number of threshold that are passed.
the sums that correspond to the maxima in
areas
list of matrices.
For each threshold, all the overlapping
intervals are joined that overlap, so that non-overlapping
intervals are finally obtained.
whether the null hypothesis is rejected
at the lowest alpha level.
scan.statisticsreturns invisibly a list containing all elements of
scanning for collect=TRUE.
Additionally, it contains
the maxima of repet simulated data if
formula=FALSE
The ideas for the code are taken from Kabluchko and Spordarev (2009) although the values are not calculated from the respective theorems. Instead, values are obtained by simulation in a procedure similar to Bootstrapping.
In case the file is a bed-file, the following differences to the standard behaviour appears:
colname must be of the form
c(pos=, freq=, n=) with default value
c(pos=3, freq=4, n=5)
the sign of the frequency is changed
it is not checked whether the frequencies * n equals an integer number
Kabluchko, Z. and Spodarev, E. (2009) Scan statistics of Levy noises and marked empirical processes. Adv. Appl. Probab. 41, 13-37.
# NOT RUN {
# }
# NOT RUN {
<!-- % library(miraculix) -->
# }
# NOT RUN {
if (interactive()) {
n <- 30
loci <- 9000
positions <- 25000:15000000
} else {
n <- 3
loci <- 900
positions <- 2500:1500000
}
pos <- sort(sample(positions, loci))
freq <- rpois(loci, lambda=0.3) / n
alpha <- c(0.1, 0.05, 0.01)
s <- scan.statistics(n=n, pos=pos, freq=freq, repet=100,
tuningUnits=0.65, alpha=alpha)
str(s)
# }
Run the code above in your browser using DataLab