pcSelect: PC-Select: Estimate subgraph around a response variable

Description

The goal is feature selection: given a response variable $y$ and a data matrix $dm$, we want to know which variables strongly influence $y$. The notion of influence is the same as in the PC algorithm: $y$ and $x$ (a column of $dm$) are associated if they remain correlated when conditioning on any subset of the remaining columns of $dm$. Consequently, only very strong relations are found, and the selected variables are typically a subset of those returned by other feature selection techniques. Robust correlation estimates are also available (see corMethod), which make the method robust.
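To make the notion of association concrete, here is a minimal base R sketch of a single partial correlation test based on Fisher's z-transform. The helper names pcor_xy and fisher_z and the simulated data are assumptions made for illustration; this is not the code pcSelect runs internally.

```r
# Partial correlation of x and y given the columns of S (NULL for no
# conditioning), read off the inverse of the correlation matrix.
pcor_xy <- function(x, y, S = NULL) {
  M <- cbind(x, y, S)
  P <- solve(cor(M))
  -P[1, 2] / sqrt(P[1, 1] * P[2, 2])
}

# Fisher z-statistic for a partial correlation r estimated from n samples
# with k conditioning variables; roughly |N(0, 1)| under the null hypothesis
# of zero partial correlation when the data are Gaussian.
fisher_z <- function(r, n, k) sqrt(n - k - 3) * abs(atanh(r))

set.seed(1)
n <- 500
z <- rnorm(n)
x <- z + rnorm(n)   # x depends on z
y <- z + rnorm(n)   # y depends on z only, not directly on x

alpha  <- 0.05
cutoff <- qnorm(1 - alpha / 2)

# Marginally, x and y are clearly correlated (through z) ...
z_marginal <- fisher_z(pcor_xy(x, y), n, k = 0)
# ... but conditioning on z should remove most of that association.
z_conditional <- fisher_z(pcor_xy(x, y, cbind(z)), n, k = 1)

c(marginal = z_marginal, conditional = z_conditional, cutoff = cutoff)
```

In this setup x would not count as associated with y in the sense above, because one conditioning set (here z) is enough to explain the correlation away.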

Usage

pcSelect(y, dm, alpha, corMethod = "standard",
         verbose = FALSE, directed = FALSE)

Arguments

y

Response vector.

dm

Data matrix (rows: samples/observations, columns: variables); nrow(dm) == length(y).

alpha

Significance level of individual partial correlation tests.

corMethod

Either "standard" or "Qn", for standard or robust correlation estimation, respectively.

verbose

Logical or in $\{0,1,2\}$; controls how much diagnostic output is printed. Note that such output makes the function very much slower.

directed

Logical; should the output graph be directed?

Value

G

A logical vector indicating which columns of dm are associated with y.

zMin

The minimal z-values when testing partial correlations between y and each column of dm. The larger the number, the more consistent the edge is with the data.
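As a rough illustration of how the two components can be read together, the sketch below ranks the selected columns by zMin and compares them with a standard normal quantile. The values of G and zMin are made up, and the comparison assumes zMin is on the scale of a standard normal test statistic, as the rule of thumb in the Examples suggests.

```r
# Made-up values standing in for the components of a pcSelect() result.
G    <- c(FALSE, TRUE, TRUE, FALSE)
zMin <- c(0.41, 3.20, 21.30, 1.10)

# Two-sided standard normal quantile used as a yardstick for "large".
alpha  <- 0.05
cutoff <- qnorm(1 - alpha / 2)

# Indices of the selected columns, strongest evidence first.
selected <- which(G)
ranked   <- selected[order(zMin[selected], decreasing = TRUE)]

data.frame(column = ranked,
           zMin = zMin[ranked],
           above_cutoff = zMin[ranked] > cutoff)
```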

Details

This function essentially applies pc to the data matrix obtained by joining y and dm. Since the output is not concerned with edges among the columns of dm, the algorithm is adapted accordingly: only edges between y and the columns of dm are tested. As a result, the runtime is typically reduced substantially and much larger datasets can be handled.
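The adaptation can be sketched in base R as follows. This is an illustrative reading of the idea (in the spirit of the PC-simple algorithm from the reference below), not the actual pcalg implementation: the function name pc_simple_sketch, the choice to draw conditioning sets from the columns still active at the start of each level, and the simulated data are all assumptions.

```r
pc_simple_sketch <- function(y, dm, alpha) {
  n <- nrow(dm)
  p <- ncol(dm)
  C <- cor(cbind(y, dm))          # index 1 is y, index j + 1 is column j of dm
  cutoff <- qnorm(1 - alpha / 2)

  # Fisher z-statistic for the partial correlation of variables a and b given S
  z_stat <- function(a, b, S) {
    idx <- c(a, b, S)
    P <- solve(C[idx, idx])
    r <- -P[1, 2] / sqrt(P[1, 1] * P[2, 2])
    sqrt(n - length(S) - 3) * abs(atanh(r))
  }

  active <- rep(TRUE, p)          # columns still considered adjacent to y
  zMin   <- rep(Inf, p)           # smallest z-statistic seen per column
  m <- 0                          # current size of the conditioning sets
  while (sum(active) > m) {
    current <- which(active)      # snapshot of the active set for this level
    for (j in current) {
      others <- setdiff(current, j)
      # all size-m subsets of the other active columns; indexing through
      # positions avoids combn() expanding a single number into 1:n
      sets <- if (m == 0) list(integer(0)) else
        combn(length(others), m, FUN = function(i) others[i], simplify = FALSE)
      for (S in sets) {
        z <- z_stat(1, j + 1, S + 1)
        zMin[j] <- min(zMin[j], z)
        if (z < cutoff) {         # association explained away: drop column j
          active[j] <- FALSE
          break
        }
      }
    }
    m <- m + 1
  }
  names(active) <- names(zMin) <- colnames(dm)
  list(G = active, zMin = zMin)
}

# Simulated data: y depends directly on x2 only; x1 influences y through x2.
set.seed(42)
n  <- 2000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n)
x3 <- rnorm(n)
y  <- 2 * x2 + rnorm(n)
res <- pc_simple_sketch(y, cbind(x1, x2, x3), alpha = 0.05)
res
```

Note that no test ever involves a pair of columns of dm, which is where the savings over running the full PC algorithm on the joined matrix come from.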

References

Buehlmann, P., Kalisch, M. and Maathuis, M.H. (2010). Variable selection for high-dimensional linear models: partially faithful distributions and the PC-simple algorithm. Biometrika 97, 261--278.

Examples


library(pcalg)

p <- 10
## generate and draw random DAG :
set.seed(101)
myDAG <- randomDAG(p, prob = 0.2)
if (require(Rgraphviz)) {
  plot(myDAG, main = "randomDAG(10, prob = 0.2)")
}
## generate 1000 samples of DAG using standard normal error distribution
n <- 1000
d.mat <- rmvDAG(n, myDAG, errDist = "normal")

## Let's pretend that the 10th column is the response and the first 9
## columns are explanatory variables. Which of the first 9 variables
## "cause" the tenth variable?
y <- d.mat[, 10]
dm <- d.mat[, -10]
(pcS <- pcSelect(y, dm, alpha = 0.05))
## Variables 4, 5 and 6 are selected as important.
## By inspecting zMin,
with(pcS, zMin[G])
## you can also see that the influence of variable 6
## is very evident from the data (zMin is 21.32, so quite large; as
## a rule of thumb for judging what is large, you could use quantiles
## of the standard normal distribution)

## The result should be the same when using pcAlgo
resU <- pcAlgo(d.mat, alpha = 0.05, corMethod = "standard", directed = TRUE)
resU
if (require(Rgraphviz))
  plot(resU, zvalue.lwd = TRUE)
## As can be seen, pcAlgo also finds variables 4, 5 and 6 to be the
## important ones.
## Again, the edge for variable 6 is very evident from the data.
