getParents: Estimate the connectivity matrix of a causal graph

Description

Estimates the connectivity matrix of a directed causal graph, using various possible methods. Supported methods at the moment are ARGES, backShift, bivariateANM, bivariateCAM, CAM, FCI, FCI+, GES, GIES, hiddenICP, ICP, LINGAM, MMHC, rankARGES, rankFci, rankGES, rankGIES, rankPC, regression, RFCI and PC.

Usage

getParents(
  X,
  environment = NULL,
  interventions = NULL,
  parentsOf = 1:ncol(X),
  method = c("arges", "backShift", "bivariateANM", "bivariateCAM", "CAM", "fci",
    "fciplus", "ges", "gies", "hiddenICP", "ICP", "LINGAM", "mmhc", "rankArges",
    "rankFci", "rankGes", "rankGies", "rankPc", "rfci", "pc", "regression")[12],
  alpha = 0.1,
  mode = c("raw", "parental", "ancestral")[1],
  variableSelMat = NULL,
  excludeTargetInterventions = TRUE,
  onlyObservationalData = FALSE,
  indexObservationalData = 1,
  returnAsList = FALSE,
  sparse = FALSE,
  directed = FALSE,
  pointConf = FALSE,
  setOptions = list(),
  assumeNoSelectionVars = TRUE,
  verbose = FALSE,
  ...
)

Arguments

A \((n\) x \(p)\)-data matrix with n observations of \(p\) variables.

environment

An optional vector of length \(n\), where the entry for observation \(i\) is an index for the environment in which observation \(i\) took place (Simplest case: entries 1 for observational data and entries 2 for interventional data of unspecified type. Encoding for observational data can be changed with indexObservationalData). Is required for methods ICP, hiddenICP and backShift.

interventions

A optional list of length \(n\). The entry for observation \(i\) is a numeric vector that specifies the variables on which interventions happened for observation \(i\) (a scalar if an intervention happened on just one variable and integer(0) if no intervention occured for this observation). Is used for methods gies, rankGies and CAM and will generate the vector environment if the latter is set to NULL. (However, this might generate too many different environments for some data sets, so a hand-picked vector environment is preferable). Is also used for ICP and hiddenICP to exclude interventions on the target variable of interest.

parentsOf

The variables for which we would like to estimate the parents. Default are all variables. Currently only used with mode = "raw". Speeds up computation for methods bivariateANM, bivariateCAM, ICP, hiddenICP and regression; for other methods only affects output. Also see variableSelMat for possibly speeding up computational time by restricting the set of potential parents for a variable.

method

A string that specfies the method to use. The methods pc (PC-algorithm), LINGAM (LINGAM), arges (Adaptively restricted greedy equivalence search), ges (Greedy equivalence search), gies (Greedy interventional equivalence search), fci (Fast causal inference), fciplus and rfci (Really fast causal inference) are imported from the package "pcalg" and are documented there in more detail, including the additional options that can be supplied via setOptions. The "rank versions" of arges, fci, ges, gies and pc are based on [1]. The method CAM (Causal additive models) is documented in the package "CAM" and the methods ICP (Invariant causal prediction), hiddenICP (Invariant causal prediction with hidden variables) are from the package "InvariantCausalPrediction". The method backShift comes from the package "backShift". The method mmhc comes from the package "bnlearn". Finally, the methods bivariateANM and bivariateCAM are for now implemented internally but will hopefully be part of another package at some point in the near future.

alpha

The level at which tests are done. This leads to confidence intervals for ICP and hiddenICP and is used internally for pc, rankPc, mmhc, fci, rankFci, fciplus and rfci. For all other methods alpha is not used.

mode

Determines output type - can be "raw" or one of the queries "isParent", "isMaybeParent", "isNoParent", "isAncestor","isMaybeAncestor", "isNoAncestor". If "raw", getParents() returns the connectivity matrix computed by the specified method in sparse matrix format if sparse is set to TRUE; else in dense matrix format (or as list if returnAsList = TRUE). The options directed and pointConf will be ignored for all modes except for "raw" if set to TRUE. The different mode types are explained in the help for getRanking.

variableSelMat

An optional logical matrix of dimension \((p\) x \(p)\). An entry TRUE for entry \((i,j)\) says that variable \(i\) should be considered as a potential parent for variable \(j\) and vice versa for FALSE. If the default value of NULL is used, all variables will be considered, but this can be very slow, especially for methods pc, ges, gies, rfci and CAM. Ignored for methods backShift, fciplus, LINGAM and CAM.

excludeTargetInterventions

When looking for parents of variable \(k\) in \(1,...,p\), set to TRUE if observations where an intervention on variable \(k\) occured should be excluded. Default is TRUE. Used in ICP and hiddenICP.

onlyObservationalData

If set to TRUE, only observational data is used. It will take the index in environment specified by indexObservationalData. If environment is NULL, all observations are used. Default is FALSE.

indexObservationalData

Index in environment that encodes observational data. Default is 1.

returnAsList

If set to TRUE, will return a list, where entry \(k\) is a list containing the estimated parents of variable \(k\). Default is FALSE.

sparse

If set to TRUE and returnAsList is FALSE, output matrix will be in sparse matrix format.

directed

If TRUE, an edge will be returned if and only if an edge has been detected to be directed. I.e. entry will be set to 0 for entry \((j,k)\) if both \(j -> k\) and \(k -> j\) are estimated (ICP, hiddenICP, regression), if \(j -- k\) is undirected (in case of CPDAGs) or if the edge type is not of type \(i --> j\) in case of PAGs. If assumeNoSelectionVars = TRUE the edge type \(i --o j\) is also considered 'directed' for methods returning PAGs. Default is FALSE. Only supported in mode "raw".

pointConf

If TRUE, numerical estimates will be returned if possible. For methods ICP and hiddenICP, these are the values in the individual confidence intervals (at chosen level alpha) that are closest to 0; for other methods these are point estimates. Some methods do not return numerical point estimates; for these the output will remain binary 0/1 (no-edge/edge). Default is FALSE. Only supported in mode "raw".

setOptions

A list that can take method-specific options; see the individual documentations of the methods for more options and their possible values.

assumeNoSelectionVars

Set to TRUE is you want to assume the absence of selection variables.

verbose

If TRUE, detailed output is provided.

...

Parameters to be passed to underlying method's function.

Value

If option returnAsList is FALSE, a sparse matrix, where a 0 entry in position (j,k) corresponds to an estimate of "no edge" j -> k, while an entry 1 corresponds to an estimated egde. If option pointConf is TRUE, the 1 entries will be replaced by numerical values that are either point estimates of the causal coefficients or confidence bounds (see above). If option returnAsList is TRUE, a list will be returned. The k-th entry in the list is the numeric vector with the indices of the estimated parents of node k.

References

Naftali Harris and Mathias Drton: PC Algorithm for Nonparanormal Graphical Models. J. Mach. Learn. Res. 14(1) 2013.

Examples

Run this code

# NOT RUN {
## load the backShift package for data generation and plotting functionality
if(require(backShift) & require(pcalg)){
  # Simulate data with connectivity matrix A with assumptions
  # 1) hidden variables present
  # 2) precise location of interventions is assumed unknown
  # 3) different environments can be distinguished
  
  ## simulate data
  myseed <- 1
  
  # sample size n
  n <- 10000
  
  # p=3 predictor variables and connectivity matrix A
  p <- 3
  labels <- c("1", "2", "3")
  A <- diag(p)*0
  A[1,2] <- 0.8
  A[2,3] <- 0.8
  A[3,1] <- -0.4
  
  # divide data in 10 different environments
  G <- 10
  
  # simulate
  simResult <- backShift::simulateInterventions(n, p, A, G, intervMultiplier = 3,
               noiseMult = 1, nonGauss = TRUE, hiddenVars = TRUE,
               knownInterventions = FALSE, fracVarInt = NULL, simulateObs = TRUE,
               seed = myseed)
  X <- simResult$X
  environment <- simResult$environment
  
  ## apply all  methods given in vector 'methods'
  ## (using all data pooled for pc/LINGAM/rfci/ges -- can be changed with option
  ## 'onlyObservationalData=TRUE')
  
  methods <- c("backShift", "LINGAM") #c("pc", "rfci", "ges")
  
  # select whether you want to run stability selection
  stability <- FALSE
  
  # arrange graphical output into a rectangular grid
  sq <- ceiling(sqrt(length(methods)+1))
  par(mfrow=c(ceiling((length(methods)+1)/sq),sq))
  
  ## plot and print true graph
  cat("\n true graph is  ------  \n" )
  print(A)
  plotGraphEdgeAttr(A, plotStabSelec = FALSE, labels = labels, thres.point = 0,
   main = "TRUE GRAPH")
  
  ## loop over all methods and compute and print/plot estimate
  for (method in methods){
    cat("\n result for method", method,"  ------  \n" )
  
    if(!stability){
      # Option 1): use this estimator as a point estimate
      Ahat <- getParents(X, environment, method=method, alpha=0.1, pointConf = TRUE)
    }else{
      # Option 2): use a stability selection based estimator
      # with expected number of false positives bounded by EV=2
      Ahat <- getParentsStable(X, environment, EV=2, method=method, alpha=0.1)
    }
  
   # print and plot estimate (point estimate thresholded if numerical estimates
   # are returned)
    print(Ahat)
    if(!stability)
      plotGraphEdgeAttr(Ahat, plotStabSelec = FALSE, labels = labels,
       thres.point = 0.05,
       main=paste("POINT ESTIMATE FOR METHOD\n", toupper(method)))
    else
      plotGraphEdgeAttr(Ahat, plotStabSelec = TRUE, labels = labels,
       thres.point = 0, main = paste("STABILITY SELECTION
       ESTIMATE\n FOR METHOD", toupper(method)))
   }
}else{
    cat("\nThe packages 'backShift' and 'pcalg' are needed for the examples to
work. Please install them.")
}
 

 
# }