miic: MIIC, causal network learning algorithm including latent variables

Description

MIIC (Multivariate Information based Inductive Causation) combines constraint-based and information-theoretic approaches to disentangle direct from indirect effects amongst correlated variables, including cause-effect relationships and the effect of unobserved latent causes.

Usage

miic(inputData = NULL, categoryOrder = NULL, trueEdges = NULL,
  blackBox = NULL, nThreads = 1, cplx = c("nml", "mdl"),
  orientation = TRUE, propagation = TRUE, latent = FALSE, neff = -1,
  edges = NULL, confidenceShuffle = 0, confidenceThreshold = 0,
  confList = NULL, verbose = FALSE)

Arguments

inputData

[a data frame] A data frame that contains the observational data. Each column corresponds to one variable and each row is a sample that gives the values for all the observed variables. The column names correspond to the names of the observed variables. The data must be discrete-like.
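Since the data must be discrete-like, continuous measurements need to be binned before they are passed as inputData. The following is a minimal sketch using base R only; the helper discretize(), the toy columns expr and dose, and the choice of three equal-width bins are illustrative assumptions, not something prescribed by miic.

```r
# Toy continuous data (hypothetical variables, for illustration only).
set.seed(1)
raw <- data.frame(expr = rnorm(50), dose = runif(50, 0, 10))

# Hypothetical helper: cut a numeric vector into n_bins equal-width bins.
discretize <- function(x, n_bins = 3) {
  cut(x, breaks = n_bins, labels = paste0("bin", seq_len(n_bins)))
}

# Apply the binning column by column; the result has one factor per variable.
inputData <- as.data.frame(lapply(raw, discretize))
str(inputData)
```

Equal-width bins are only one option; quantile-based bins or domain-specific thresholds may suit a given data set better.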

categoryOrder

[a data frame] An optional data frame giving information about how to order the various states of categorical variables. It is used to compute the signs of the edges (using the partial correlation coefficient) by sorting each variable's levels according to the given category order.

trueEdges

[a data frame] An optional data frame containing all the true edges of the graph. Each line corresponds to one edge.

blackBox

[a data frame] An optional data frame containing the pairs of variables that should be considered as independent. Each row defines one pair, with one column for each of the two variables. The variable names must match those used in the inputData data frame.
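As an illustration, a blackBox data frame can be assembled and checked against inputData in base R before calling miic. This is only a sketch: the toy variables geneA, geneB and geneC and the column names var1 and var2 are hypothetical, since the documentation only requires two columns holding variable names.

```r
# Toy data set standing in for inputData (hypothetical variables).
inputData <- data.frame(
  geneA = c("0", "1", "1", "0"),
  geneB = c("1", "1", "0", "0"),
  geneC = c("0", "0", "1", "1"),
  stringsAsFactors = FALSE
)

# One row per pair of variables to be treated as independent.
# The column names var1/var2 are an assumption made for this sketch.
blackBox <- data.frame(
  var1 = c("geneA", "geneB"),
  var2 = c("geneC", "geneC"),
  stringsAsFactors = FALSE
)

# Every name used in blackBox must be a column of inputData.
unknown <- setdiff(unlist(blackBox, use.names = FALSE), colnames(inputData))
stopifnot(length(unknown) == 0)
```

Running this check up front catches misspelled variable names before the reconstruction starts.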

nThreads

[a positive integer] When set to a value greater than 1, miic uses multithreading in the skeleton initialization phase.

cplx

[a string; c("nml", "mdl")] Because the input dataset has a finite size, the 2-point and 3-point information measures must be shifted by a complexity term. These finite size corrections can be based on the Minimum Description Length (MDL) criterion (set the option to "mdl"). In practice, the MDL complexity criterion tends to underestimate the relevance of edges connecting variables with many different categories, leading to the removal of true edges (false negatives). To avoid such biases with finite datasets, the (universal) Normalized Maximum Likelihood (NML) criterion can be used (set the option to "nml"). The default is "nml" (see Affeldt et al., UAI 2015).
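The bias of the MDL criterion towards variables with few categories can be seen from a small calculation. The sketch below assumes the usual MDL (BIC-style) complexity term for a pair of variables with no conditioning set, 0.5 * (r_x - 1) * (r_y - 1) * log(N), where r_x and r_y are the numbers of categories; the helper mdl_complexity() and the value of n_info are hypothetical and are not part of the miic API.

```r
# Assumed MDL complexity for a pair (x, y) without conditioning variables.
mdl_complexity <- function(r_x, r_y, n) {
  0.5 * (r_x - 1) * (r_y - 1) * log(n)
}

n <- 100        # number of samples
n_info <- 50    # hypothetical value of N * I(X;Y), in nats

# Same information, but variables with 2, 4 or 10 categories each.
n_levels <- c(2, 4, 10)
cplx <- mdl_complexity(n_levels, n_levels, n)

# An edge is kept only if the information exceeds the complexity.
data.frame(n_levels = n_levels,
           cplx = round(cplx, 2),
           kept = n_info - cplx > 0)
```

With the same information content, the pair of 10-category variables is rejected while the pairs with fewer categories are retained, which is the behaviour the NML criterion is meant to mitigate.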

orientation

[a boolean value] The miic network skeleton can be partially directed by orienting and propagating edge directions, based on the sign and magnitude of the conditional 3-point information of unshielded triples. The propagation procedure relies on probabilities (for more details, see Verny et al., PLoS Comp. Bio. 2017). If set to FALSE, the orientation step is not performed.

propagation

[a boolean value] If set to FALSE, the skeleton is partially oriented with only the v-structure orientations. Otherwise, the v-structure orientations are propagated to downstream undirected edges in unshielded triples, following the orientation method.

latent

[a boolean value] When set to TRUE, the network reconstruction takes hidden (latent) variables into account. Dependence between two observed variables due to a latent variable is indicated with a '6' in the adjacency matrix and in the network edges summary, and by a bidirected edge in the (partially) oriented graph.

neff

[a positive integer] The N samples given in the inputData data frame are expected to be independent. For correlated samples, such as those from time series or Monte Carlo sampling approaches, the effective number of independent samples neff can be estimated from the decay of the autocorrelation function (Verny et al., PLoS Comp. Bio. 2017). This effective number of independent samples can be provided through this parameter.
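One simple way to obtain a rough value for neff is sketched below. It is not the estimator of Verny et al.; it assumes an AR(1)-like decay of the autocorrelation and uses the textbook approximation N * (1 - rho) / (1 + rho), where rho is the lag-1 autocorrelation. The helper neff_ar1() and the toy series are hypothetical.

```r
# Rough effective sample size under an assumed AR(1)-like autocorrelation.
neff_ar1 <- function(x) {
  n <- length(x)
  # Lag-1 autocorrelation; element 1 of $acf is lag 0, element 2 is lag 1.
  rho1 <- acf(x, lag.max = 1, plot = FALSE)$acf[2]
  # Ignore negative autocorrelation so that neff never exceeds n.
  rho1 <- max(rho1, 0)
  max(1L, as.integer(floor(n * (1 - rho1) / (1 + rho1))))
}

# Toy correlated series: runs of four identical values.
x <- rep(c(0, 0, 0, 0, 1, 1, 1, 1), times = 25)
neff <- neff_ar1(x)
neff

# The result could then be supplied as miic(..., neff = neff).
```

For multivariate data, one conservative option is to compute this per variable and pass the smallest value.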

edges

[a data frame] The miic$edges object returned by an execution of the miic function. It represents the result of the skeleton step. If this object is provided, the skeleton step is skipped and the requested orientation is performed using this edges data frame.

confidenceShuffle

[a positive integer] The number of shufflings of the original dataset used to evaluate the edge-specific confidence ratio of all inferred edges.

confidenceThreshold

[a positive floating point] The threshold used to filter the less probable edges following the skeleton step. See Verny et al., PLoS Comp. Bio. 2017.

confList

[a data frame] An optional data frame containing the confFile data frame returned by a miic execution. It is useful when a second run of the same input data set has to be performed with a different confidence threshold and the same confidenceShuffle value. In this way the computations based on the randomized dataset do not need to be performed again, and the values in this data frame are used instead.

verbose

[a boolean value] If TRUE, debugging output is printed.

Value

A miic-like object that contains:

all.edges.summary: a data frame with information about the relationship between each pair of variables
- x: X node
- y: Y node
- type: contains 'N' if the edge has been removed or 'P' for retained edges. If a true edges file is given, 'P' becomes 'TP' (True Positive) or 'FP' (False Positive), while 'N' becomes 'TN' (True Negative) or 'FN' (False Negative).
- ai: the contributing nodes found by the method which participate in the mutual information between x and y, and possibly separate them.
- info: provides the final mutual information times Nxy_ai for the pair (x, y) when conditioned on the collected nodes ai.
- cplx: gives the computed complexity between the (x, y) variables taking into account the contributing nodes ai.
- Nxy_ai: gives the number of samples on which the information and the complexity have been computed. If the input dataset has no missing value, the number of samples is the same for all pairs and corresponds to the total number of samples.
- log_confidence: represents the info - cplx value. It is a way to quantify the strength of the edge (x, y).
- confidenceRatio: this column is present if the confidence cut is > 0. It is the ratio between the probability of rejecting the edge (x, y) in the dataset and the mean probability of rejecting it across the user-defined number of randomized datasets.
- infOrt: the orientation of the edge (x, y). It is the same value as in the adjacency matrix at row x and column y.
- trueOrt: the orientation of the edge (x, y) present in the true edges file (if true edges file is provided).
- isOrtOk: the consistency of the inferred graph's orientations with a reference graph (only if a true edges file is provided). Y: the orientation is consistent; N: the orientation is not consistent with the PAG derived from the given true graph.
- sign: the sign of the partial correlation between variables x and y, conditioned on the contributing nodes ai.
- partial_correlation: value of the partial correlation for the edge (x, y) conditioned on the contributing nodes ai.
retained.edges.summary: a data frame in the format of all.edges.summary containing only the inferred edges.
orientations.prob: this data frame lists the orientation probabilities of the two edges of all unshielded triples of the reconstructed network with the structure: node1 -- mid-node -- node2:
- node1: node at the end of the unshielded triplet
- p1: probability of the arrowhead node1 <- mid-node
- p2: probability of the arrowhead node1 -> mid-node
- mid-node: node at the center of the unshielded triplet
- p3: probability of the arrowhead mid-node <- node2
- p4: probability of the arrowhead mid-node -> node2
- node2: node at the end of the unshielded triplet
- NI3: 3-point (conditional) mutual information * N
AdjMatrix: the adjacency matrix is a square matrix used to represent the inferred graph. The entries of the matrix indicate whether pairs of vertices are adjacent or not in the graph. The matrix can be read as a set of (row, column) pairs where the row represents the source node and the column the target node. Since miic can reconstruct mixed networks (including directed, undirected and bidirected edges), each case is encoded by a different value:
- 1: (x, y) edge is undirected
- 2: (x, y) edge is directed as x -> y
- -2: (x, y) edge is directed as x <- y
- 6: (x, y) edge is bidirected
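The adjacency codes listed above can be turned into a readable edge list with a few lines of base R. This is a sketch on a hand-made matrix, not on real miic output: the helper decode_edges() is hypothetical, a value of 0 is assumed to mean "no edge", and the entry at (y, x) is assumed to mirror the one at (x, y), so only the upper triangle is read.

```r
# Hand-made adjacency matrix using the codes described above.
nodes <- c("A", "B", "C")
adj <- matrix(0, 3, 3, dimnames = list(nodes, nodes))
adj["A", "B"] <- 2;  adj["B", "A"] <- -2   # A -> B
adj["B", "C"] <- 6;  adj["C", "B"] <- 6    # B <-> C (latent common cause)

# Hypothetical helper: translate codes into edge symbols.
decode_edges <- function(adj) {
  symbols <- c("1" = "--", "2" = "->", "-2" = "<-", "6" = "<->")
  # Positions of non-zero entries in the upper triangle only.
  idx <- which(upper.tri(adj) & adj != 0, arr.ind = TRUE)
  data.frame(
    x = rownames(adj)[idx[, "row"]],
    y = colnames(adj)[idx[, "col"]],
    edge = unname(symbols[as.character(adj[idx])]),
    stringsAsFactors = FALSE
  )
}

edges <- decode_edges(adj)
edges
```

The same information is available per edge in the infOrt column of the edges summary, which holds the adjacency value at row x and column y.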

Details

Starting from a complete graph, the method iteratively removes dispensable edges, by uncovering significant information contributions from indirect paths, and assesses edge-specific confidences from randomization of available data. The remaining edges are then oriented based on the signature of causality in observational data.
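The idea of a dispensable edge can be illustrated with a toy chain X <- Z -> Y, in which X and Y are dependent only through Z. The sketch below uses plain empirical (conditional) mutual information in base R; it does not reproduce miic's actual procedure, which also involves complexity corrections and an iterative choice of contributing nodes. The helpers mi() and cmi() and the toy counts are assumptions made for illustration.

```r
# Empirical mutual information I(X;Y), in nats.
mi <- function(x, y) {
  joint <- unclass(table(x, y)) / length(x)
  indep <- outer(rowSums(joint), colSums(joint))
  nz <- joint > 0
  sum(joint[nz] * log(joint[nz] / indep[nz]))
}

# Empirical conditional mutual information I(X;Y|Z):
# the average of I(X;Y) within each stratum of Z, weighted by stratum size.
cmi <- function(x, y, z) {
  strata <- split(seq_along(z), z)
  sum(vapply(strata,
             function(i) length(i) / length(z) * mi(x[i], y[i]),
             numeric(1)))
}

# Toy counts in which X and Y are exactly independent given Z.
cells <- expand.grid(x = 0:1, y = 0:1, z = 0:1)
cells$n <- c(64, 16, 16, 4,   # z = 0: X and Y are mostly 0
             4, 16, 16, 64)   # z = 1: X and Y are mostly 1
d <- cells[rep(seq_len(nrow(cells)), cells$n), c("x", "y", "z")]

i_xy <- mi(d$x, d$y)                 # positive: X and Y look dependent
i_xy_given_z <- cmi(d$x, d$y, d$z)   # numerically zero: Z explains the link
c(marginal = i_xy, conditional = i_xy_given_z)
```

Here the X-Y edge would be dispensable: all of the information shared by X and Y is accounted for by the indirect path through Z.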

References

Verny et al., PLoS Comp. Bio. 2017. The preprint of the paper is available at https://miic.curie.fr/publications.php.

Examples

# NOT RUN {
library(miic)

# EXAMPLE HEMATOPOIESIS
data(hematoData)

# execute MIIC (reconstruct graph)
miic.res <- miic(inputData = hematoData, latent = TRUE,
                 confidenceShuffle = 10, confidenceThreshold = 0.001)

# plot graph
miic.plot(miic.res)
# write graph to graphml format. Note that to correctly visualize
# the network we created the miic style for Cytoscape (http://www.cytoscape.org/).

miic.write.network.cytoscape(g = miic.res, file = file.path(tempdir(), "temp"))

# EXAMPLE CANCER
data(cosmicCancer)
data(cosmicCancer_stateOrder)
# execute MIIC (reconstruct graph)
miic.res <- miic(inputData = cosmicCancer, categoryOrder = cosmicCancer_stateOrder,
                 latent = TRUE, confidenceShuffle = 100, confidenceThreshold = 0.001)

# plot graph
miic.plot(miic.res, igraphLayout = igraph::layout_on_grid)

# write graph to graphml format. Note that to correctly visualize
# the network we created the miic style for Cytoscape (http://www.cytoscape.org/).
miic.write.network.cytoscape(g = miic.res, file = file.path(tempdir(), "temp"))

# EXAMPLE OHNOLOGS
data(ohno)
data(ohno_stateOrder)
# execute MIIC (reconstruct graph)
miic.res <- miic(inputData = ohno, latent = TRUE, categoryOrder = ohno_stateOrder,
                 confidenceShuffle = 100, confidenceThreshold = 0.001)

# plot graph
miic.plot(miic.res)

# write graph to graphml format. Note that to correctly visualize
# the network we created the miic style for Cytoscape (http://www.cytoscape.org/).
miic.write.network.cytoscape(g = miic.res, file = file.path(tempdir(), "temp"))
# }