The function implements the scan statistics method of Kabluchko and Spodarev (2009), Theorem 3.1.
scanning(pos, freq, file, tuningUnits, alpha = 0.1, coarsening = 1,
minscans=0, maxscans = 0, sumscan = FALSE, perSNP = TRUE,
colname , n, threshold, collect=!old.def, old.def=FALSE,
max.intervals = length(alpha) * 100000,
max.basepair.distance = 50000, exclude.negative.at.boundary = TRUE,
maximum = TRUE, mean.freq, sd.freq, mean.n)scan.statistics(file, tuningUnits, alpha=c(0.05, 0.01), repet=1000,
coarsening = 1,
minscans=0, maxscans=0, sumscan = FALSE, perSNP=TRUE,
colname, n, return.simu = FALSE,
debug = FALSE, formula = FALSE,
old.def=FALSE,
max.intervals = length(alpha) * 100000,
max.basepair.distance = 50000,
exclude.negative.at.boundary = TRUE,
pos, freq)
alternatively to the file
name, two vectors,
pos
and freq
might be given.
filename or list.
The rda file must contain the variables pos
, freq
,
colname
, and n
. Or it is a list with the same named elements.
If the extension of the filename is ‘bed’, the behaviour of the programme is different, see the details
real number. The value 0 codes the case
of Theorem 3.1 in Kabluchko and Spordarev (2009).
A positive value codes the case of Theorem 2.1 (which is very much
preferred). The case of Theorem 3.2 does not suit, hence is not
coded.
Good values for tuningUnits
seem to be around
\(0.85\).
Note that first, the frequencies are standardized.
Then tuningUnits
\( * \)mean
\((n) /
n\) is substracted.
level(s) of testing. The levels should decrease.
integer. If the value is larger than 1 then the
data are first windower'ed by length=coarsening
.
This is important to do if the data are fine scaled!
The number of simulation to determine the
threshold(s) for testing in scan.statistics
;
see also formula
. Should be at least 100 better 1000.
integers. The minimunm and maximum length of the window, respectively. If non-positive the window sizes are not restricted from below or above, respectively.
logical.
If TRUE
the old style picture appears. Otherwise
the relative number of significant intervals containing a certain
point is shown.
logical. If TRUE
then the test is based on SNPs
as units. If FALSE
the test is based on basepairs (not
programmed yet).
the column of the data frame that gives the relative
frequencies. The default name (i.e., if missing) is "HeterAB".
Alternatively colname
is a number indicating the respective column.
In case the extension of the filename equals ‘bed’, the behaviour is different, see Details.
The number of individuals, the data are based on. Usually that number is determined automatically, but might be given for safety explicitely
logical. to do
logical or 2.
If not FALSE
important data are saved on the disk.
If debug == 2
pictures of each simulation are shown.
[to do in more detail]
scanning
counts the number of intervals found above the given
threshold
. threshold
is an alternative to alpha
and is used instead of alpha
if both are given.
This threshold is applied to the standardized frequency data.
A value around 0.8 seems to be appropriate for Christian's data whereas values around 18 are appropriate for Amanda's data.
scanning
can be used in two ways. If collect=FALSE
essentially only the scan statistic is determined.
If collect=TRUE
then also all the intervalls are determined
that are considered to be significant at the given alpha
levels.
logical. If TRUE
all the tiny snippets that
have not been agglutinated yet, are also returned. If TRUE
it
takes a lot of memory.
Further, if TRUE
, negative (modified) values are allowed at
the borders of an interval.
Finally, if TRUE
the parameters max.intervals
,
max.basepair.distance
, exclude.negative.at.boundary
are not considered.
[only if old.def=FALSE
]
As the number of intervals is determined dynamically, the total number of significant intervals cannot be determined in advance. To economise a lot of copying, an upper threshold is given by the user. 100000 for each level should be large enough. If not, please contact the author.
[only if old.def=FALSE
]
if a basepair distance is larger than max.basepair.distance
then the significant areas are considered as two separate areas.
logical. If TRUE
negative values at boundaries are not
allowed. I.e. each significant area starts and ends with a
positive modified frequency.
logical. MISSING DOC
If given, mean.freq
overwrites mean(freq)
If given, sd.freq
overwrites sd(freq)
If given, mean.n
overwrites mean(n)
if formula=TRUE
then
the formula of Kabluchko and Spodarev (2009) is used in
scan.statististics
.
Otherwise, a repet
number of simulations under the null hypothesis
are performed to get the threshold right.
scanning
returns invisibly a list that contains always
the input data
the number of intervals showing a total sum
larger than the given threshold
.
corresponding to alpha, if not given explicitely
the maximum value reached scanning over all windows
collect=TRUE
then the list also contains
matrix of three rows containing information of all the
(overlapping) intervals where the sums exceeds the thresholds
.
Each interval is given by a column. First row: left end point of
the interval. Second row: right end point of the interval.
Third interval: maximum number of threshold that are passed.
the sums that correspond to the maxima in
areas
list of matrices.
For each threshold
, all the overlapping
intervals are joined that overlap, so that non-overlapping
intervals are finally obtained.
whether the null hypothesis is rejected
at the lowest alpha
level.
scan.statistics
returns invisibly a list containing all elements of
scanning
for collect=TRUE
.
Additionally, it contains
the maxima of repet
simulated data if
formula=FALSE
The ideas for the code are taken from Kabluchko and Spordarev (2009) although the values are not calculated from the respective theorems. Instead, values are obtained by simulation in a procedure similar to Bootstrapping.
In case the file is a bed-file, the following differences to the standard behaviour appears:
colname
must be of the form
c(pos=, freq=, n=)
with default value
c(pos=3, freq=4, n=5)
the sign of the frequency is changed
it is not checked whether the frequencies * n equals an integer number
Kabluchko, Z. and Spodarev, E. (2009) Scan statistics of Levy noises and marked empirical processes. Adv. Appl. Probab. 41, 13-37.
# NOT RUN {
# }
# NOT RUN {
<!-- % library(miraculix) -->
# }
# NOT RUN {
if (interactive()) {
n <- 30
loci <- 9000
positions <- 25000:15000000
} else {
n <- 3
loci <- 900
positions <- 2500:1500000
}
pos <- sort(sample(positions, loci))
freq <- rpois(loci, lambda=0.3) / n
alpha <- c(0.1, 0.05, 0.01)
s <- scan.statistics(n=n, pos=pos, freq=freq, repet=100,
tuningUnits=0.65, alpha=alpha)
str(s)
# }
Run the code above in your browser using DataLab