pcalg (version 2.6-7)

pcSelect: PC-Select: Estimate subgraph around a response variable

Description

The goal is feature selection: given a response variable y and a data matrix dm, we want to know which variables are “strongly influential” on y. The type of influence is the same as in the PC algorithm, i.e., y and x (a column of dm) are associated if they remain correlated even when conditioning on any subset of the remaining columns of dm. Hence only very strong relations are found, and the selected set is typically a subset of what other feature selection techniques return. Robust correlation estimators are also available (see corMethod), which make this method robust.
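For intuition, a single test of this kind can be sketched with gaussCItest() from this package (a minimal, standalone illustration on toy data; pcSelect() performs such tests internally and is what you should call in practice):

## Sketch of one conditional-independence test of the kind pcSelect
## uses internally (illustrative only):
library(pcalg)
set.seed(42)
d <- matrix(rnorm(500 * 3), ncol = 3)      # toy data: 3 variables
suffStat <- list(C = cor(d), n = nrow(d))
## p-value for: is variable 1 independent of variable 2 given variable 3?
gaussCItest(1, 2, S = 3, suffStat)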

Usage

pcSelect(y, dm, alpha, corMethod = "standard",
         verbose = FALSE, directed = FALSE)

Arguments

y

response vector.

dm

data matrix (rows: samples/observations, columns: variables); nrow(dm) == length(y).

alpha

significance level of individual partial correlation tests.

corMethod

a string determining the method for correlation estimation via mcor(); any of the methods accepted by mcor(*, method = ..) can be used, e.g., "Qn" for one kind of robust correlation estimate (see the sketch after this argument list).

verbose

logical or in {0,1,2};

FALSE, 0: no output;

TRUE, 1: little output;

2: detailed output.

Note that such diagnostic output may make the function considerably slower.

directed

logical; should the output graph be directed?
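As a quick illustration of the corMethod argument, the standard and robust estimates can be compared directly via mcor() (a minimal sketch on made-up toy data, not taken from the Examples below):

## Sketch: standard vs. robust correlation estimation with mcor(),
## as used internally by pcSelect (illustrative only):
library(pcalg)
set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)
x[1:5, ] <- 10 * x[1:5, ]            # inject a few outliers
mcor(x, method = "standard")         # Pearson, outlier-sensitive
mcor(x, method = "Qn")               # robust Qn-based estimate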

Value

G

A logical vector indicating which columns of dm are associated with y.

zMin

The minimal z-values when testing partial correlations between y and each column of dm. The larger the number, the more consistent the edge is with the data.
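Given a fit such as pcS from the Examples below, these minimal z-values can be translated into two-sided p-values (a hedged sketch, assuming the usual standard-normal behaviour of the test statistic under the null; this is not part of the function's output):

## two-sided p-values from the minimal z-statistics (illustrative);
## assumes standard-normal behaviour under the null hypothesis
pval <- 2 * pnorm(-abs(pcS$zMin))
## small p-values correspond to large zMin, i.e. well-supported edges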

Details

This function essentially applies pc to the data matrix obtained by joining y and dm. Since the output is not concerned with edges among the columns of dm themselves, the algorithm is adapted so that only tests involving y are carried out. This typically reduces the runtime substantially and allows much larger datasets to be handled.
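Conceptually (though not literally how pcSelect() is implemented), the result corresponds to running pc() on the joined data and keeping only the edges incident to y. A hedged sketch of that comparison, using y and dm as in the Examples below:

## conceptual comparison, not the actual implementation:
d.all    <- cbind(dm, y)
suffStat <- list(C = cor(d.all), n = nrow(d.all))
pc.fit   <- pc(suffStat, indepTest = gaussCItest, alpha = 0.05,
               labels = c(paste0("x", 1:ncol(dm)), "y"))
## edges incident to "y" roughly match the variables pcSelect() selects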

References

Buehlmann, P., Kalisch, M. and Maathuis, M.H. (2010). Variable selection for high-dimensional linear models: partially faithful distributions and the PC-simple algorithm. Biometrika 97, 261--278.

See Also

pc which is the more general version of this function; pcSelect.presel which applies pcSelect() twice.

Examples

p <- 10
## generate and draw random DAG :
suppressWarnings(RNGversion("3.5.0"))
set.seed(101)
myDAG <- randomDAG(p, prob = 0.2)
if (require(Rgraphviz)) {
  plot(myDAG, main = "randomDAG(10, prob = 0.2)")
}
## generate 1000 samples of DAG using standard normal error distribution
n <- 1000
d.mat <- rmvDAG(n, myDAG, errDist = "normal")

## Let's pretend that the 10th column is the response and the first 9
## columns are explanatory variables. Which of the first 9 variables
## "cause" the tenth variable?
y <- d.mat[,10]
dm <- d.mat[,-10]
(pcS <- pcSelect(y, dm, alpha = 0.05))
## Variables 4, 5, and 6 are considered important.
## By inspecting zMin,
with(pcS, zMin[G])
## you can also see that the influence of variable 6
## is most evident from the data (its zMin is 18.64, which is quite
## large; as a rule of thumb for judging what is large, you can use
## quantiles of the standard normal distribution).
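## As reference points for "large", compare with standard-normal
## two-sided cutoffs (illustrative):
qnorm(1 - 0.05/2)   # ~ 1.96, cutoff at the 5% level
qnorm(1 - 0.01/2)   # ~ 2.58, cutoff at the 1% level
## Variable 6's zMin of 18.64 lies far beyond these cutoffs.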
