Introduction

The LUCIDus R package is an integrative tool for jointly estimating latent or unknown clusters/subgroups from multi-omics data and phenotypic traits. The package implements the statistical method proposed in the research paper “A Latent Unknown Clustering Integrating Multi-Omics Data (LUCID) with Phenotypic Traits”, published in Bioinformatics.

Multi-omics data and the phenotypic trait are integrated by jointly modeling their relationships through a latent cluster variable, as illustrated by the directed acyclic graph (DAG) below (a screenshot from the LUCID paper).

Let G be an n by p matrix whose columns are genetic features/environmental exposures and whose rows are observations, Z be an n by m matrix of standardized biomarkers, and Y be an n-dimensional vector of disease outcomes. Following the DAG, all three components are assumed to be linked by a categorical latent cluster variable X with K classes. With the conditional independence implied by the DAG, the joint log likelihood of the LUCID model, for observations i = 1, ..., n and clusters j = 1, ..., K, can be written as

$$l(\Theta) = \sum_{i=1}^{n} \log f(Z_i, Y_i \mid G_i; \Theta) = \sum_{i=1}^{n} \log \sum_{j=1}^{K} f(X_i = j \mid G_i; \Theta)\, f(Z_i \mid X_i = j; \Theta)\, f(Y_i \mid X_i = j; \Theta)$$

where Theta ($\Theta$) is generic notation for the parameters associated with each probability model. Additionally, we assume that X follows a multinomial distribution conditional on G, Z follows a multivariate normal distribution conditional on X, and Y follows a normal or Bernoulli distribution (depending on the type of disease outcome) conditional on X. The equation above then becomes

$$l(\Theta) = \sum_{i=1}^{n} \log \sum_{j=1}^{K} S(G_i; \beta_j)\, \phi(Z_i; \mu_j, \Sigma_j)\, f(Y_i; \Theta_j)$$

where S denotes the softmax function with cluster-specific coefficients $\beta_j$, and phi ($\phi$) denotes the probability density function (pdf) of the multivariate normal distribution with cluster-specific mean $\mu_j$ and covariance $\Sigma_j$.
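To make the distributional assumptions concrete, the sketch below simulates a small dataset from this kind of generative model using base R only. It is an illustration, not code from the package: the dimensions, the coefficient values, the identity covariance for Z, and the names `softmax`, `beta`, `mu` and `gamma` are all assumptions chosen for the example.

```r
# Toy simulation from a LUCID-like generative model (base R only).
# All parameter values below are made up for illustration.
set.seed(1)
n <- 500; p <- 3; m <- 2; K <- 2

# G: n x p matrix of genetic features / exposures
G <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Softmax: turn each row of linear predictors into probabilities
softmax <- function(eta) {
  e <- exp(eta - apply(eta, 1, max))  # subtract the row maximum for stability
  e / rowSums(e)
}

# X | G ~ multinomial; cluster 1 is the reference (coefficients fixed at 0)
beta <- cbind(rep(0, p), c(1.5, -1, 0))   # p x K coefficient matrix
pX   <- softmax(G %*% beta)               # n x K cluster probabilities
X    <- apply(pX, 1, function(pr) sample.int(K, size = 1, prob = pr))

# Z | X ~ multivariate normal; identity covariance assumed for simplicity
mu <- rbind(c(-1, 0), c(1, 2))            # K x m cluster means
Z  <- mu[X, , drop = FALSE] + matrix(rnorm(n * m), nrow = n, ncol = m)

# Y | X ~ Bernoulli (binary outcome case)
gamma <- c(0.2, 0.7)                      # P(Y = 1 | X = j)
Y <- rbinom(n, size = 1, prob = gamma[X])
```

In a real analysis X is never observed; it is kept here only so the simulated clusters can be compared with estimated ones.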

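To obtain the maximum likelihood estimates (MLE) of the model parameters, an EM algorithm is applied to handle the latent variable X. Denote the observed data as D; then the posterior probability of observation i being assigned to latent cluster j is

$$r_{ij} = P(X_i = j \mid D, \Theta) = \frac{S(G_i; \beta_j)\, \phi(Z_i; \mu_j, \Sigma_j)\, f(Y_i; \Theta_j)}{\sum_{k=1}^{K} S(G_i; \beta_k)\, \phi(Z_i; \mu_k, \Sigma_k)\, f(Y_i; \Theta_k)}$$

and the expectation of the complete-data log likelihood can be written as

$$Q(\Theta) = \sum_{i=1}^{n} \sum_{j=1}^{K} r_{ij} \left[ \log S(G_i; \beta_j) + \log \phi(Z_i; \mu_j, \Sigma_j) + \log f(Y_i; \Theta_j) \right]$$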

At each iteration, the E-step computes the expectation of the complete-data log likelihood by plugging in the posterior probabilities, and the M-step updates the parameters by maximizing this expected complete-data log likelihood. Detailed derivations of the EM algorithm for LUCID can be found elsewhere.
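The two steps can be sketched in base R for a simplified version of the model. This is a minimal illustration under stated assumptions, not the package's implementation: the covariance of Z is taken to be diagonal, the outcome is binary, and only the closed-form parts of the M-step are shown (the softmax coefficients would need a weighted multinomial regression, omitted here). The names `e_step`, `sigma` and `gamma` are hypothetical.

```r
# One E-step and a partial M-step for a toy LUCID-like model (base R only).
set.seed(2)
n <- 300; K <- 2
G <- cbind(rnorm(n))                                     # n x 1 exposure
X <- rbinom(n, 1, plogis(1.2 * G[, 1])) + 1              # true cluster (1 or 2)
Z <- cbind(rnorm(n, c(-1, 1)[X]), rnorm(n, c(0, 2)[X]))  # n x 2 biomarkers
Y <- rbinom(n, 1, c(0.2, 0.7)[X])                        # binary outcome

# Current parameter guesses (arbitrary starting values)
beta  <- cbind(0, 0.5)                     # 1 x K softmax coefficients, cluster 1 = reference
mu    <- rbind(c(-0.5, 0.5), c(0.5, 1.5))  # K x 2 cluster means
sigma <- matrix(1, K, 2)                   # K x 2 standard deviations (diagonal covariance)
gamma <- c(0.4, 0.6)                       # P(Y = 1 | X = j)

# E-step: posterior probability r[i, j] that observation i belongs to cluster j
e_step <- function(G, Z, Y, beta, mu, sigma, gamma) {
  eta <- G %*% beta                                   # n x K linear predictors
  log_r <- sapply(seq_len(ncol(beta)), function(j) {
    log_S   <- eta[, j] - log(rowSums(exp(eta)))      # log softmax term
    log_phi <- rowSums(sapply(seq_len(ncol(Z)), function(d)
      dnorm(Z[, d], mu[j, d], sigma[j, d], log = TRUE)))
    log_f   <- dbinom(Y, 1, gamma[j], log = TRUE)     # Bernoulli outcome term
    log_S + log_phi + log_f
  })
  # Normalise on the log scale so that each row of r sums to 1
  r <- exp(log_r - apply(log_r, 1, max))
  r / rowSums(r)
}

r <- e_step(G, Z, Y, beta, mu, sigma, gamma)

# M-step (closed-form pieces only): weighted means and outcome probabilities
mu_new    <- t(sapply(seq_len(K), function(j) colSums(r[, j] * Z) / sum(r[, j])))
gamma_new <- colSums(r * Y) / colSums(r)
```

Repeating the two steps until the parameters stop changing gives the EM iteration described above.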

Installation

You can install the development version from GitHub with:

install.packages("devtools")
devtools::install_github("Yinqi93/LUCIDus")

Example

library(LUCIDus)

The two main functions, est.lucid() and boot.lucid(), are used for model fitting and for estimating the standard errors (SEs) of the model parameters. You can also perform variable selection by setting tuning parameters with def.tune(). The model outputs can be summarized and visualized using summary() and plot() respectively. Predictions can be made with predict().

est.lucid() estimates latent clusters from multi-omics data; missing values in the biomarker data are allowed, and information from the outcome of interest can be integrated. For illustration, we use a test dataset with 10 genetic features (5 causal) and 10 biomarkers (5 causal).

Integrative clustering without feature selection

First, fit the model with est.lucid.

set.seed(10)
myfit <- est.lucid(G = G1, Z = Z1, Y = Y1, CoY = CovY, K = 2, family = "binary")
myfit

Check the model features.

summary(myfit)

The summary of results starts with this:

Then visualize the results with a Sankey diagram using plot():

plot(myfit)

Integrative clustering with feature selection

Run LUCID with tuning parameters to select informative features:

set.seed(10)
myfit2 <- est.lucid(G = G1, Z = Z1, Y = Y1, CoY = CovY, K = 2, family = "binary", useY = FALSE,
                    tune = def.tune(Select_Z = TRUE, Rho_Z_InvCov = 0.2, Rho_Z_CovMu = 90,
                                    Select_G = TRUE, Rho_G = 0.02))
selectG <- myfit2$select$selectG
selectZ <- myfit2$select$selectZ

Re-fit with selected features

set.seed(10)
myfit3 <- est.lucid(G = G1[, selectG], Z = Z1[, selectZ], Y = Y1, CoY = CovY, K = 2, family = "binary", useY = FALSE)
plot(myfit3)

Bootstrap method to obtain SEs for LUCID parameter estimates

set.seed(10)
myboot <- boot.lucid(G = G1[, selectG], Z = Z1[, selectZ], Y = Y1, CoY = CovY, model = myfit3, R = 50)
summary(myfit3, boot.se = myboot)

A detailed summary with 95% confidence intervals (CIs) is provided below.

For more details, see the documentation for each function in the R package.
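The idea behind bootstrap SEs can be sketched with base R. This is a generic nonparametric bootstrap with the sample mean as a stand-in estimator and a normal-approximation interval; it is not how boot.lucid() is implemented, and the data and the names `boot_est`, `se_boot` and `ci_95` are hypothetical.

```r
# Generic nonparametric bootstrap for a standard error (illustration only).
set.seed(10)
x <- rnorm(200, mean = 1, sd = 2)   # toy data standing in for a real dataset
R <- 50                             # number of bootstrap replicates

# Re-estimate the statistic on R resamples drawn with replacement
boot_est <- replicate(R, mean(sample(x, replace = TRUE)))

se_boot <- sd(boot_est)                                  # bootstrap standard error
ci_95   <- mean(x) + c(-1, 1) * qnorm(0.975) * se_boot   # normal-approximation 95% CI
```

A larger R gives a more stable SE estimate at the cost of more refits, which matters when each refit is a full EM run.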

Built With

  • devtools - Tools to Make Developing R Packages Easier
  • roxygen2 - In-Line Documentation for R

Versioning

The current version is 1.0.0.

For the versions available, see the Releases on this repository.

Authors

  • Yinqi Zhao

License

This project is licensed under the GPL-2 License.

Acknowledgments

  • Cheng Peng, Ph.D.
  • David V. Conti, Ph.D.
  • Zhao Yang, Ph.D.
  • USC IMAGE P1 Group

Install

install.packages('LUCIDus')

Monthly Downloads: 282
Version: 2.0.0
License: GPL-3
Maintainer: Yinqi Zhao
Last Published: May 18th, 2020

Functions in LUCIDus (2.0.0)

  • tune.lucid - Grid search for tuning parameters to fit the LUCID model
  • plot.lucid - Visualize the LUCID model through a Sankey diagram. This function generates a Sankey diagram for the results of integrative clustering based on a lucid object
  • est.lucid - Estimate latent unknown clusters with multi-omics data
  • print.sumlucid - Print the output of LUCID in a nicer table
  • summary.lucid - Summarize the results of LUCID model
  • print.lucid - Print the output of est.lucid
  • predict.lucid - Predict the outcome based on a fitted LUCID model
  • Y1 - Outcome Set 1
  • Y2 - Outcome Set 2
  • def.control - Control parameters for EM algorithm
  • boot.lucid - Bootstrap method of inference for LUCID
  • def.tune - Define tuning parameters of regularization for LUCID model
  • CovY - Covariates Set 1
  • Z2 - Biomarker Set 2
  • G2 - Genetic Features Set 2
  • G1 - Genetic Features Set 1
  • Z1 - Biomarker Set 1