SEMml: Nodewise-predictive SEM train using Machine Learning (ML)

Description

The function converts a graph to a collection of nodewise-based models: each mediator or sink variable can be expressed as a function of its parents. Based on the assumed type of relationship, i.e. linear or non-linear, SEMml() fits a ML model to each node (variable) with non-zero incoming connectivity. The model fitting is repeated equation-by equation (r=1,...,R) times, where R is the number of mediators and sink nodes.

Usage

SEMml(
  graph,
  data,
  train = NULL,
  algo = "sem",
  vimp = FALSE,
  thr = NULL,
  verbose = FALSE,
  ...
)

Value

An S3 object of class "ML" is returned. It is a list of 5 objects:

"fit", a list of ML model objects, including: the estimated covariance matrix (Sigma), the estimated model errors (Psi), the fitting indices (fitIdx), and the signed Shapley R2 values (parameterEstimates), if shap = TRUE,
"Yhat", the matrix of continuous predicted values of graph nodes (excluding source nodes) based on training samples.
"model", a list of all the fitted nodewise-based models (sem, gam, rf, xgb or nn).
"graph", the induced DAG of the input graph mapped on data variables. If vimp = TRUE, the DAG is colored based on the variable importance measure, i.e., if abs(vimp) > thr will be highlighted in red (vimp > 0) or blue (vimp < 0).
"data", istandardized training data subset mapping graph nodes.

Arguments

graph

An igraph object.

data

A matrix with rows corresponding to subjects, and columns to graph nodes (variables).

train

A numeric vector specifying the row indices corresponding to the train dataset (default = NULL).

algo

ML method used for nodewise-network predictions. Six algorithms can be specified:

algo="sem" (default) for a linear SEM, see SEMrun.
algo="gam" for a generalized additive model, see gam.
algo="rf" for a random forest model, see ranger.
algo="xgb" for a XGBoost model, see xgboost.
algo="nn" for a small neural network model (1 hidden layer and 10 nodes), see nnet.
algo="dnn" for a large neural network model (1 hidden layers and 1000 nodes), see dnn.

vimp

A Logical value(default=FALSE). If TRUE compute the variable importance, considering: (i) the squared value of the t-statistic or F-statistic of the model parameters for "sem" or "gam"; (ii) the variable importance from the importance or xgb.importance functions for "rf" or "xgb"; (iii) the Olden's connection weights for "nn" or "dnn".

thr

A numerical value indicating the threshold to apply on the variable importance to color the graph. If thr=NULL (default), the threshold is set to thr = abs(mean(vimp)).

verbose

A logical value. If FALSE (default), the processed graph will not be plotted to screen.

...

Currently ignored.

Author

Mario Grassi mario.grassi@unipv.it

Details

By mapping data onto the input graph, SEMml() creates a set of nodewise-based models based on the directed links, i.e., a set of edges pointing in the same direction, between two nodes in the input graph that are causally relevant to each other. The mediator or sink variables can be characterized in detail as functions of their parents. An ML model (sem, gam, rf, xgb, nn, dnn) can then be fitted to each variable with non-zero inbound connectivity, taking into account the kind of relationship (linear or non-linear). With R representing the number of mediators and sink nodes in the network, the model fitting process is performed equation-by-equation (r=1,...,R) times.

References

Grassi M, Palluzzi F, Tarantino B (2022). SEMgraph: An R Package for Causal Network Analysis of High-Throughput Data with Structural Equation Models. Bioinformatics, 38 (20), 4829–4830 <https://doi.org/10.1093/bioinformatics/btac567>

Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models. London: Chapman and Hall.

Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.

Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge.

Redell, N. (2019). Shapley Decomposition of R-Squared in Machine Learning Models. arXiv: Methodology.

Examples

Run this code


# \donttest{
# Load Amyotrophic Lateral Sclerosis (ALS)
data<- alsData$exprs; dim(data)
data<- transformData(data)$data
group<- alsData$group; table (group)
ig<- alsData$graph; gplot(ig)

#...with train-test (0.5-0.5) samples
set.seed(123)
train<- sample(1:nrow(data), 0.5*nrow(data))

start<- Sys.time()
# ... rf
res1<- SEMml(ig, data, train, algo="rf", vimp=TRUE)

# ... xgb
res2<- SEMml(ig, data, train, algo="xgb", vimp=TRUE)

# ... nn
res3<- SEMml(ig, data, train, algo="nn", vimp=TRUE)

# ... gam
res4<- SEMml(ig, data, train, algo="gam", vimp=TRUE)
end<- Sys.time()
print(end-start)

# ... sem
res5<- SEMml(ig, data, train, algo="sem", vimp=TRUE)

#str(res5, max.level=2)
res5$fit$fitIdx
res5$fit$parameterEstimates
gplot(res5$graph)

#Comparison of AMSE (in train data)
rf <- res1$fit$fitIdx[2];rf
xgb<- res2$fit$fitIdx[2];xgb
nn <- res3$fit$fitIdx[2];nn
gam<- res4$fit$fitIdx[2];gam
sem<- res5$fit$fitIdx[2];sem

#Comparison of SRMR (in train data)
rf <- res1$fit$fitIdx[4];rf
xgb<- res2$fit$fitIdx[4];xgb
nn <- res3$fit$fitIdx[4];nn
gam<- res4$fit$fitIdx[4];gam
sem<- res5$fit$fitIdx[4];sem

#Comparison of VIMP (in train data)
table(E(res1$graph)$color) #rf
table(E(res2$graph)$color) #xgb
table(E(res3$graph)$color) #nn
table(E(res4$graph)$color) #gam
table(E(res5$graph)$color) #sem

#Comparison of AMSE (in test data)
print(predict(res1, data[-train, ])$PE[1]) #rf
print(predict(res2, data[-train, ])$PE[1]) #xgb
print(predict(res3, data[-train, ])$PE[1]) #nn
print(predict(res4, data[-train, ])$PE[1]) #gam
print(predict(res5, data[-train, ])$PE[1]) #sem

#...with a binary outcome (1=case, 0=control)

ig1<- mapGraph(ig, type="outcome"); gplot(ig1)
outcome<- ifelse(group == 0, -1, 1); table(outcome)
data1<- cbind(outcome, data); data1[1:5,1:5]

res6 <- SEMml(ig1, data1, train, algo="nn", vimp=TRUE)
gplot(res6$graph)
table(E(res6$graph)$color)

mse6 <- predict(res6, data1[-train, ])
yobs <- group[-train]
yhat <- mse6$Yhat[ ,"outcome"]
benchmark(yobs, yhat, thr=0, F1=TRUE)
benchmark(yobs, yhat, thr=0, F1=FALSE)
# }

Run the code above in your browser using DataLab