skeleton: Estimate (Initial) Skeleton of a DAG using the PC / PC-Stable Algorithm

Description

Estimate the skeleton of a DAG without latent and selection variables using the PC Algorithm or estimate an initial skeleton of a DAG with arbitrarily many latent and selection variables using the FCI and the RFCI algorithms.

If used in the PC algorithm, it estimates the order-independent PC-stable ("stable") or original PC ("original") skeleton of a directed acyclic graph (DAG) from observational data.

When used in the FCI and RFCI algorithms, this function estimates only an initial order-independent (or PC original) skeleton. Because of the presence of latent and selection variables, to find the final skeleton those algorithms need to perform additional tests later on and consequently some edges can be further deleted.

Usage

skeleton(suffStat, indepTest, alpha, labels, p,
         method = c("stable", "original", "stable.fast"), m.max = Inf,
         fixedGaps = NULL, fixedEdges = NULL, NAdelete = TRUE,
         verbose = FALSE)

Arguments

suffStat

Sufficient statistics: List containing all necessary elements for the conditional independence decisions in the function indepTest.

indepTest

Predefined function for testing conditional independence. The function is internally called as indepTest(x,y,S,suffStat) and tests conditional independence of x and <

alpha

significance level (number in $(0,1)$ for the individual conditional independence tests.

labels

(optional) character vector of variable (or node) names. Typically preferred to specifying p.

(optional) number of variables (or nodes). May be specified if labels are not, in which case labels is set to 1:p.

method

Character string specifying method; the default, "stable" provides an order-independent skeleton, see Details below.

m.max

Maximal size of the conditioning sets that are considered in the conditional independence tests.

fixedGaps

logical symmetric matrix of dimension p*p. If entry [i,j] is true, the edge $i--j$ is removed before starting the algorithm. Therefore, this edge is guaranteed to be absent in the resulting graph.

fixedEdges

a logical symmetric matrix of dimension p*p. If entry [i,j] is true, the edge $i--j$ is never considered for removal. Therefore, this edge is guaranteed to be present in the resulting graph.

NAdelete

logical needed for the case indepTest(*) returns NA. If it is true, the corresponding edge is deleted, otherwise not.

verbose

if TRUE, detailed output is provided.

Value

An object of class "pcAlgo" (see pcAlgo) containing an estimate of the skeleton of the underlying DAG, the conditioning sets (sepset) that led to edge removals and several other parameters.

Details

Under the assumption that the distribution of the observed variables is faithful to a DAG and that there are no latent and selection variables, this function estimates the skeleton of the DAG. The skeleton of a DAG is the undirected graph resulting from removing all arrowheads from the DAG. Edges in the skeleton of a DAG have the following interpretation: There is an edge between $i$ and $j$, $i$ -- $j$, if and only if variables $i$ and $j$ are conditionally dependent given $S$ for all possible subsets $S$ of the remaining nodes.

On the other hand, the distribution of the observed variables is faithful to a DAG with arbitrarily many latent and selection variables, skeleton() estimates the initial skeleton of the DAG. Edges in this initial skeleton of a DAG have the following interpretation: There is an edge $i$ -- $j$ if and only if variables $i$ and $j$ are conditionally dependent given $S$ for all possible subsets $S$ of the neighbours of $i$ and the neighbours of $j$.

The data are not required to follow a specific distribution, but one should make sure that the conditional indepedence test used in indepTest is appropriate for the data. Pre-programmed versions of indepTest are available for Gaussian data (gaussCItest), discrete data (disCItest), and binary data (see binCItest). Users may also specify their own indepTest function.

The PC algorithm (Spirtes, Glymour and Scheines, 2000) (method = "original") is known to be order-dependent, in the sense that the output may depend on the order in which the variables are given. Therefore, Colombo and Maathuis (2013) proposed a simple modification, called PC-stable, which yields order-independent adjacencies in the skeleton, provided by pc() with the new default method = "stable". This stable variant of the algorithm is also available with the method = "stable.fast": it runs the algorithm of Colombo and Maathuis (2013) faster than method = "stable" in general, but should be regarded as an experimental option at the moment.

The algorithm starts with a complete undirected graph. In each step, it visits all pairs $(i,j)$ of adjacent nodes in the current graph, and determines based on conditional independence tests whether the edge $i--j$ should be removed. In particular, for each step $m$ ($m=0,1,\dots$) of the size of the conditioning sets, the algorithm at first determines the neighbours $a(i)$ of each node $i$ in the graph. Then, the algorithm visits all pairs $(i,j)$ of adjacent nodes in the current graph, and the edge $i--j$ is kept if and only if the null hypothesis $i$ and $j$ are conditionally independent given S rejected at significance level alpha for all subsets $S$ of size $m$ of $a(i)$ and of $a(j)$ (as judged by the function indepTest). For the "stable" method, the neighborhoods $a(i)$ are kept fixed within each value of $m$, and this makes the algorithm order-independent. Method "original", the original PC algorithm would update the neighbour list after each edge change.

The algorithm stops when $m$ is larger than the largest neighbourhood size of all nodes, or when $m$ has reached the limit m.max which may be set by the user.

Since the FCI (Spirtes, Glymour and Scheines, 2000) and RFCI (Colombo et al., 2012) algorithms are built up from the PC algorithm, they are also order-dependent in the skeleton. To resolve their order-dependence issues in the skeleton is more involved, see Colombo and Maathuis (2013). However now, with method = "stable", this function estimates an initial order-independent skeleton in these algorithms (for additional details on how to make the final skeleton of FCI fully order-independent see fci and Colombo and Maathuis (2013)).

The information in fixedGaps and fixedEdges is used as follows. The gaps given in fixedGaps are introduced in the very beginning of the algorithm by removing the corresponding edges from the complete undirected graph. Pairs $(i,j)$ in fixedEdges are skipped in all steps of the algorithm, so that these edges remain in the graph. Note: Throughout, the algorithm works with the column positions of the variables in the adjacency matrix, and not with the names of the variables.

References

D. Colombo and M.H. Maathuis (2013). Order-independent constraint-based causal structure learning, (arXiv:1211.3295v2)

D. Colombo, M. H. Maathuis, M. Kalisch, T. S. Richardson (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. Ann. Statist. 40, 294-321.

M. Kalisch and P. Buehlmann (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm, JMLR 8 613-636.

P. Spirtes, C. Glymour and R. Scheines (2000). Causation, Prediction, and Search, 2nd edition, MIT Press.

Examples

Run this code

##################################################
## Using Gaussian Data
##################################################
## Load predefined data
data(gmG)
n <- nrow    (gmG8$x)
V <- colnames(gmG8$x) # labels aka node names

## estimate Skeleton
skel.fit <- skeleton(suffStat = list(C = cor(gmG8$x), n = n),
                     indepTest = gaussCItest, ## (partial correlations)
                     alpha = 0.01, labels = V, verbose = TRUE)
if (require(Rgraphviz)) {
  ## show estimated Skeleton
  par(mfrow=c(1,2))
  plot(skel.fit, main = "Estimated Skeleton")
  plot(gmG8$g, main = "True DAG")
}

##################################################
## Using d-separation oracle
##################################################

## define sufficient statistics (d-separation oracle)
Ora.stat <- list(g = gmG8$g, jp = RBGL::johnson.all.pairs.sp(gmG8$g))
## estimate Skeleton
fit.Ora <- skeleton(suffStat=Ora.stat, indepTest = dsepTest, labels = V,
                    alpha=0.01) # <- irrelevant as dsepTest returns either 0 or 1

if (require(Rgraphviz)) {
  ## show estimated Skeleton
  plot(fit.Ora, main = "Estimated Skeleton (d-sep oracle)")
  plot(gmG8$g, main = "True DAG")
}
##################################################
## Using discrete data
##################################################
## Load data
data(gmD)
V <- colnames(gmD$x) # labels aka node names

## define sufficient statistics
suffStat <- list(dm = gmD$x, nlev = c(3,2,3,4,2), adaptDF = FALSE)

## estimate Skeleton
skel.fit <- skeleton(suffStat,
                     indepTest = disCItest, ## (G^2 statistics independence test)
                     alpha = 0.01, labels = V, verbose = TRUE)
if (require(Rgraphviz)) {
  ## show estimated Skeleton
  par(mfrow = c(1,2))
  plot(skel.fit, main = "Estimated Skeleton")
  plot(gmD$g, main = "True DAG")
}

##################################################
## Using binary data
##################################################
## Load binary data
data(gmB)
X <- gmB$x

## estimate Skeleton
skel.fm2 <- skeleton(suffStat = list(dm = X, adaptDF = FALSE),
                     indepTest = binCItest, alpha = 0.01,
                     labels = colnames(X), verbose = TRUE)
if (require(Rgraphviz)) {
  ## show estimated Skeleton
  par(mfrow = c(1,2))
  plot(skel.fm2, main = "Binary Data 'gmB': Estimated Skeleton")
  plot(gmB$g, main = "True DAG")
}

Run the code above in your browser using DataLab

State of Data and AI Literacy Report 2025