multi.split: Calculate P-values Based on Multi-Splitting Approach

Description

Calculate p-values and confidence intervals based on the multi-splitting approach

Usage

multi.split(x, y, B = 100, fraction = 0.5, ci = TRUE, ci.level = 0.95,
            model.selector = lasso.cv,
            classical.fit = lm.pval, classical.ci = lm.ci,
            parallel = FALSE, ncores = getOption("mc.cores", 2L),
            gamma = seq(ceiling(0.05 * B) / B, 1 - 1 / B, by = 1 / B),
            args.model.selector = NULL, args.classical.fit = NULL,
            args.classical.ci = NULL,
            return.nonaggr = FALSE, return.selmodels = FALSE,
            repeat.max = 20,
            verbose = FALSE)

Arguments

numeric design matrix (without intercept).

numeric response vector.

the number of sample-splits, a positive integer.

fraction

a number in $(0, 1)$ , the fraction of data used at each sample split for the model selection process. The remaining data is used for calculating the p-values.

logical indicating if a confidence interval should be calculated for each parameter.

ci.level

(if ci is true:) a number in $(0, 1)$ , typically close to 1, the desired coverage level of the confidence intervals.

model.selector

a function to perform model selection, with default lasso.cv. The function must have at least two arguments, x (the design matrix) and y (the response vector). Return value is the index vector of selected columns. See lasso.cv and lasso.firstq for an example. Additional arguments can be passed via args.model.selector.

classical.fit

a function to calculate (classical) p-values. Default is lm.pval. The function must have at least two arguments, x (the design matrix) and y (the response vector), and return the vector of p-values. See lm.pval for an example. Additional arguments can be passed through args.classical.fit.

classical.ci

a function to calculate (classical) confidence intervals. Default is lm.ci. The function must have at least 3 arguments, x (the design matrix), y (the response vector) and level (the coverage level), and return the matrix of confidence intervals. See lm.ci for an example. Additional arguments can be passed through args.classical.ci.

parallel

logical indicating if parallelization via mclapply should be used.

ncores

number of cores used for parallelization as mc.cores in mclapply().

gamma

vector of gamma-values. In case gamma is a scalar, the value $Q_{j}$ instead of $P_{j}$ is being calculated (see reference below).

args.model.selector

named list of further arguments for function model.selector.

args.classical.fit

named list of further arguments for function classical.fit.

args.classical.ci

named list of further arguments for function classical.ci.

return.nonaggr

logical indicating if the unadjusted p-values be returned.

return.selmodels

logical indicating if the selected models (at each split) should be returned. Necessary for the clusterGroupTest() part of the result.

repeat.max

positive integer indicating the maximal number of split trials. Should not matter in regular cases, but necessary to prevent infinite loops in borderline cases.

verbose

should information be printed out while computing? (logical).

Value

pval.corr

Vector of multiple testing corrected p-values.

gamma.min

Value of gamma where minimal p-values was attained.

clusterGroupTest

Function to perform groupwise tests based on hierarchical clustering. You can either provide a distance matrix and clustering method or the output of hierarchical clustering from the function hclust as for clusterGroupBound. P-values are adjusted for multiple testing.

%% FIXME: Need Example / Demo / Vignette

References

Meinshausen, N., Meier, L. and B<U+00FC>hlmann, P. (2009) P-values for high-dimensional regression. Journal of the American Statistical Association 104, 1671--1681.

Mandozzi, J. and B<U+00FC>hlmann, P. (2015) A sequential rejection testing method for high-dimensional regression with correlated variables. To appear in the International Journal of Biostatistics. Preprint arXiv:1502.03300

Examples

Run this code

# NOT RUN {
n <-  40 # a bit small, to keep example "fast"
p <- 256
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- x[,1] * 2 + x[,2] * 2.5 + rnorm(n)

## Multi-splitting with lasso.firstq as model selector function
## 'q' must be specified
fit.multi <- multi.split(x, y, model.selector = lasso.firstq,
                         args.model.selector = list(q = 10))
fit.multi
head(fit.multi$pval.corr, 10) ## the first 10 p-values
ci. <- confint(fit.multi)
head(ci.) # the first 6
stopifnot(all.equal(ci.,
     with(fit.multi, cbind(lci, uci)), check.attributes=FALSE))

# }
# NOT RUN {
<!-- %% "too" expensive;  still run via --run-donttest -->
## Use default 'lasso.cv' (slower!!) -- incl cluster group testing:
system.time(fit.m2 <- multi.split(x, y, return.selmodels = TRUE))# 9 sec (on "i7")
head(fit.m2$pval.corr) ## the first  6  p-values
head(confint(fit.m2))  ## the first  6  95% conf.intervals

## Now do clustergroup testing
clGTst <- fit.m2$clusterGroupTest
names(envGT <- environment(clGTst))# about 14
if(!interactive()) # if you are curious (and advanced):
  print(ls.str(envGT), max = 0)
stopifnot(identical(clGTst, envGT$clusterGroupTest))
ccc <- clGTst()
str(ccc)<!-- %% MM: hmm.. -->
ccc$hh   # the clustering
has.1.or.2 <- sapply(ccc$clusters,
                function(j.set) any(c(1,2) %in% j.set))
ccc$pval[ has.1.or.2] ## all very small
ccc$pval[!has.1.or.2] ## all 1
# }
# NOT RUN {
<!-- % dont... -->
# }

Run the code above in your browser using DataLab

Get 50% off unlimited learning