Ball (version 1.3.7)

bcorsis: Ball Correlation Sure Independence Screening

Description

A generic non-parametric sure independence screening procedure based on ball correlation. Ball correlation is a generic multivariate measure of dependence in Banach spaces.

Usage

bcorsis(x, y, d = "small", weight = FALSE, method = "standard",
  distance = FALSE, parms = list(d1 = 5, d2 = 5, df = 3),
  num.threads = 2)

Arguments

x

a numeric matrix or data.frame with \(n\) rows and \(p\) columns. Each row is an observation vector and each column corresponds to an explanatory variable; generally \(p \gg n\).

y

a numeric vector, matrix, data.frame or dist object.

d

the hard cutoff rule suggests selecting \(d\) variables. Setting d = "large" or d = "small" selects n - 1 or floor(n/log(n)) variables, respectively. If d is an integer, d variables are selected. Default: d = "small"

weight

when weight = TRUE, weighted ball correlation is used instead of ball correlation. Default: weight = FALSE

method

the method used for the sure independence screening procedure; one of "standard", "lm", "gam", "interaction", or "survival". Setting method = "standard" runs the standard sure independence screening procedure based on ball correlation, while "lm" and "gam" carry out the iterative BCor-SIS procedure with ordinary linear regression and generalized additive models, respectively. The options "interaction" and "survival" are designed for detecting variables with potential linear interaction or variables associated with censored responses. Default: method = "standard"

distance

if distance = TRUE, y is treated as a distance matrix. This argument is only available when method = "standard" or method = "interaction". Default: distance = FALSE

parms

a list of parameters, only used when method = "lm" or "gam". It contains three parameters: d1, d2, and df. d1 is the number of initially selected variables, d2 is the number of variables added in each iteration, and df is the degrees of freedom of the basis in generalized additive models, used only when method = "gam". Default: parms = list(d1 = 5, d2 = 5, df = 3)

num.threads

number of threads. Default: num.threads = 2 (as shown in Usage)
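
The hard cutoff rule described for the d argument can be sketched in base R. The helper name screening_size is hypothetical and not part of the Ball package; it only mirrors the rule stated above, so treat it as an illustration rather than the package's internal code.

```r
# Hypothetical helper (not part of Ball): number of variables kept
# under the hard cutoff rule described for the `d` argument.
screening_size <- function(n, d = "small") {
  if (is.character(d)) {
    switch(d,
      large = n - 1,
      small = floor(n / log(n)),
      stop("d must be \"large\", \"small\", or an integer")
    )
  } else {
    as.integer(d)
  }
}

screening_size(150, "small")  # floor(150 / log(150))
screening_size(150, "large")  # 150 - 1
screening_size(150, 15)       # user-supplied integer
```

With n = 150, as in the examples on this page, d = "small" keeps far fewer variables than d = "large", which is why it is the more conservative default.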

Value

ix

the vector of indices selected by ball correlation sure independence screening procedure.

method

the method used.

weight

the weight used.

complete.info

a list containing at least one \(p \times 3\) matrix, in which each row corresponds to a variable and each column corresponds to a different ball correlation weight. If method = "gam" or method = "lm", complete.info is an empty list.

Details

bcorsis implements a model-free generic screening procedure, BCor-SIS, with fewer and less restrictive assumptions. The sample sizes (number of rows or length of the vector) of the two variables x and y must agree, and samples must not contain missing values.
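
The two input requirements above (matching sample sizes, no missing values) can be checked before calling bcorsis. The function check_inputs below is a hypothetical pre-check written in base R; it is not part of Ball and makes no claim about how the package validates its inputs.

```r
# Hypothetical pre-check (not part of Ball): sample sizes must agree
# and neither input may contain missing values.
check_inputs <- function(x, y) {
  n_x <- nrow(as.matrix(x))
  # A dist object stores its number of observations in the "Size" attribute.
  n_y <- if (inherits(y, "dist")) attr(y, "Size") else nrow(as.matrix(y))
  if (n_x != n_y) stop("x and y must have the same number of observations")
  if (anyNA(x) || anyNA(y)) stop("x and y must not contain missing values")
  invisible(TRUE)
}

set.seed(1)
x <- matrix(rnorm(20 * 5), nrow = 20)
y <- rnorm(20)
check_inputs(x, y)            # passes silently

y_bad <- y
y_bad[3] <- NA                # introduce a missing value
ok <- tryCatch(check_inputs(x, y_bad), error = function(e) FALSE)
ok                            # the check rejects y_bad
```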

The BCor-SIS procedure for censored responses is carried out when method = "survival". In that case, the matrix or data.frame passed to argument y must have exactly two columns: the first column is the event (failure) time and the second column is the censoring status, a dichotomous variable.
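
A two-column response of this shape can be assembled in base R as follows. The data are simulated and every variable name is illustrative; coding the status as 1 = event observed and 0 = censored is an assumption made here, since the text above only says the column is dichotomous.

```r
# Simulated censored response; all names here are illustrative.
set.seed(1)
n <- 50
event_time  <- rexp(n, rate = 0.1)    # latent failure times
censor_time <- rexp(n, rate = 0.05)   # latent censoring times

time   <- pmin(event_time, censor_time)          # observed time
status <- as.integer(event_time <= censor_time)  # assumed coding: 1 = event, 0 = censored

# First column: time; second column: censoring status.
y_surv <- cbind(time = time, status = status)

stopifnot(ncol(y_surv) == 2, all(y_surv[, "status"] %in% c(0, 1)))
head(y_surv)
```

A matrix built this way has the layout described for y when method = "survival".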

If distance = TRUE, the argument y is treated as a distance matrix; otherwise y is treated as data.
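
To make the distinction concrete, the sketch below turns a response given as data into a distance matrix using base R's dist. The bcorsis call at the end is shown only as a comment and its exact form is an assumption based on the Usage section; whether a dist object or a full matrix is preferable for y is not stated on this page.

```r
set.seed(1)
n <- 30
y <- matrix(rnorm(n * 2), nrow = n)  # bivariate response, treated as data

# Euclidean distances between the n observations of y.
dy <- dist(y)               # object of class "dist"
dy_mat <- as.matrix(dy)     # full n x n symmetric matrix with zero diagonal
dim(dy_mat)

# Assumed call shape, following the Usage section; not run here:
# res <- bcorsis(x = x, y = dy_mat, distance = TRUE)
```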

BCor-SIS is based on a recently developed universal dependence measure: ball correlation (BCor). BCor efficiently measures the dependence between two random vectors; it takes values between 0 and 1, and under some mild conditions it equals 0 if and only if the two random vectors are independent. (See the manual page for bcor.)

Theory and numerical results indicate that BCor-SIS has the following advantages:

(i) It has a strong screening consistency property without requiring finite sub-exponential moments of the data. Consequently, even when the dimensionality grows exponentially with the sample size, BCor-SIS is still almost surely able to retain the efficient variables.

(ii) It is nonparametric and has the property of robustness.

(iii) It works well for complex responses and/or predictors, such as shape or survival data.

(iv) It can extract important features even when the underlying model is complicated.

References

Wenliang Pan, Xueqin Wang, Weinan Xiao & Hongtu Zhu (2018). A Generic Sure Independence Screening Procedure. Journal of the American Statistical Association. DOI: 10.1080/01621459.2018.1462709

Jin Zhu, Wenliang Pan, Wei Zheng & Xueqin Wang (2018). Ball: An R package for detecting distribution difference and association in metric spaces. arXiv preprint arXiv:1811.03750. URL http://arxiv.org/abs/1811.03750

See Also

bcor

Examples

# NOT RUN {
############### Quick Start for bcorsis function ###############
set.seed(1)
n <- 150
p <- 3000
x <- matrix(rnorm(n * p), nrow = n)
error <- rnorm(n)
y <- 3*x[, 1] + 5*(x[, 3])^2 + error
res <- bcorsis(y = y, x = x)
head(res[["ix"]])

############### BCor-SIS: Censored Data Example ###############
data("genlung")
result <- bcorsis(x = genlung[["covariate"]], y = genlung[["survival"]], 
                  method = "survival")
index <- result[["ix"]]
top_gene <- colnames(genlung[["covariate"]])[index]
head(top_gene, n = 1)


############### BCor-SIS: Interaction Pursuing ###############
set.seed(1)
n <- 150
p <- 3000
x <- matrix(rnorm(n * p), nrow = n)
error <- rnorm(n)
y <- 3*x[, 1]*x[, 5]*x[, 10] + error
res <- bcorsis(y = y, x = x, method = "interaction")
head(res[["ix"]])

############### BCor-SIS: Iterative Method ###############
library(mvtnorm)
set.seed(1)
n <- 150
p <- 3000
sigma_mat <- matrix(0.5, nrow = p, ncol = p)
diag(sigma_mat) <- 1
x <- rmvnorm(n = n, sigma = sigma_mat)
error <- rnorm(n)
rm(sigma_mat); gc(reset = TRUE)
y <- 3*(x[, 1])^2 + 5*(x[, 2])^2 + 5*x[, 8] - 8*x[, 16] + error
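# Iterative BCor-SIS with ordinary linear regression:
res_lm <- bcorsis(y = y, x = x, method = "lm", d = 15)
res_lm[["ix"]]
# Iterative BCor-SIS with generalized additive models:
res_gam <- bcorsis(y = y, x = x, method = "gam", d = 15)
res_gam[["ix"]]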

############### Weighted BCor-SIS: Probability weight ###############
set.seed(1)
n <- 150
p <- 3000
x <- matrix(rnorm(n * p), nrow = n)
error <- rnorm(n)
y <- 3*x[, 1] + 5*(x[, 3])^2 + error
res <- bcorsis(y = y, x = x, weight = "prob")
head(res[["ix"]])
# Alternative, chisq weight:
res <- bcorsis(y = y, x = x, weight = "chisq")
head(res[["ix"]])
# }