CDSeq

CDSeq is a complete deconvolution method for dissecting bulk RNA-Seq data. Ideally, the input is a matrix of bulk RNA-Seq read counts (the same input format required by DESeq2). CDSeq simultaneously estimates the cell-type-specific gene expression profiles (GEPs) and the sample-specific cell-type proportions. Running CDSeq requires neither reference GEPs from pure cell lines nor an scRNAseq reference.

For example, suppose your bulk RNA-Seq data is a G by M matrix A, where G is the number of genes and M is the number of samples. CDSeq will output B (a G by T matrix) and C (a T by M matrix), where T is the number of cell types, B is the estimate of the cell-type-specific GEPs, and C is the estimate of the sample-specific cell-type proportions.
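The matrix shapes described above can be sketched with a small toy example in base R. This does not call CDSeq and is not its estimation procedure. It assumes, purely for illustration, an idealized linear mixture A = B C with columns normalized to sum to 1. The sizes and the variable name `T_types` are hypothetical.

```r
# Toy illustration of the matrix shapes only; this does NOT run CDSeq.
# Assumption: an idealized linear mixture A = B %*% C with normalized columns.
set.seed(1)

G <- 5        # number of genes
M <- 4        # number of samples
T_types <- 3  # number of cell types (named T_types to avoid masking R's T)

# B: G x T matrix of cell-type-specific GEPs; each column is scaled to sum to 1
B <- matrix(runif(G * T_types), nrow = G, ncol = T_types)
B <- sweep(B, 2, colSums(B), "/")

# C: T x M matrix of cell-type proportions; each column (one sample) sums to 1
C <- matrix(runif(T_types * M), nrow = T_types, ncol = M)
C <- sweep(C, 2, colSums(C), "/")

# A: G x M matrix of mixed (bulk) profiles, one column per sample
A <- B %*% C

print(dim(A))  # G rows, M columns
print(dim(B))  # G rows, T_types columns
print(dim(C))  # T_types rows, M columns
```

In a real analysis only A is observed, and CDSeq estimates B and C from it.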

Importantly, you can ask CDSeq to estimate the number of cell types, T, by providing a vector of candidate integer values. For example, if you set T = {2, 3, 4, 5, 6}, CDSeq will estimate which of these values is the most likely number of cell types.

Installation

You can install the released version of CDSeq from CRAN with:

install.packages("CDSeq")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("kkang7/CDSeq_R_Package")

To also build the vignette, use:

# install.packages("devtools")
devtools::install_github("kkang7/CDSeq_R_Package", build_vignettes = TRUE)

Known issue with macOS installation

Mac users may run into errors when installing from source because of problems with the Rcpp compiler tools. Following the instructions here may help: https://thecoatlessprofessor.com/programming/cpp/r-compiler-tools-for-rcpp-on-macos/

Example

Load package

library(CDSeq)

When the number of cell types is a scalar

## basic example code
result1 <- CDSeq(bulk_data = mixtureGEP,
                 cell_type_number = 6,
                 mcmc_iterations = 5, # increase mcmc_iterations to 700 or above
                 gene_length = as.vector(gene_length),
                 reference_gep = refGEP, # gene expression profiles of pure cell lines
                 cpu_number = 1)

When the number of cell types is a vector

The cell_type_number can also be a vector which contains different integer values. CDSeq will perform estimation for each integer in the vector and estimate the number of cell types in the mixtures. For example, one can set cell_type_number = 2:10 as follows, and CDSeq will estimate the most likely number of cell types from 2 to 10.

result2 <- CDSeq(bulk_data = mixtureGEP,
                 cell_type_number = 2:10,
                 mcmc_iterations = 5,
                 dilution_factor = 1,
                 block_number = 1,
                 gene_length = as.vector(gene_length),
                 reference_gep = refGEP, # gene expression profiles of pure cell lines
                 # Use multiple cores to save time: set cpu_number = length(cell_type_number)
                 # if there are enough cores.
                 cpu_number = 1,
                 print_progress_msg_to_file = 0)

Use single-cell data to annotate CDSeq-estimated cell types

cdseq.result <- CDSeq::CDSeq(bulk_data = pbmc_mix,
                             cell_type_number = seq(3,12,3),
                             beta = 0.5,
                             alpha = 5,
                             mcmc_iterations = 700,
                             cpu_number = 4,
                             dilution_factor = 10)

cdseq.result.celltypeassign <- cellTypeAssignSCRNA(cdseq_gep = cdseq.result$estGEP, # CDSeq-estimated cell-type-specific GEPs
                                                   cdseq_prop = cdseq.result$estProp, # CDSeq-estimated cell type proportions
                                                   sc_gep = sc_gep,         # PBMC single-cell data
                                                   sc_annotation = sc_annotation, # PBMC single-cell annotations
                                                   sc_pt_size = 3,
                                                   cdseq_pt_size = 6,
                                                   seurat_nfeatures = 100,
                                                   seurat_npcs = 50,
                                                   seurat_dims=1:5,
                                                   plot_umap = 1,
                                                   plot_tsne = 0)

Check the vignette for more details and examples: browseVignettes("CDSeq").

Contact

email: kangkai0714@gmail.com

Version: 1.0.8

License: GPL-3

Maintainer: Kai Kang

Last published: February 10th, 2021

Monthly downloads: 10

Functions in CDSeq (1.0.8)

cellTypeAssignSCRNA: Assigns CDSeq-identified cell types using single-cell RNA-seq data.

SyntheticMixtureData: Synthetic bulk RNA-seq read counts data of six cell types, PBMC mixtures built using scRNASeq, and some preliminary results.

Cell2RNA: Converts cell proportions to RNA proportions.

CDSeq-R-package: CDSeq: A package for complete deconvolution using sequencing data.

cellTypeAssign: Assigns CDSeq-identified cell types to reference profiles, using a correlation matrix computed from the cell-type-specific GEPs and the reference GEPs.

RNA2Cell: Converts RNA proportions to cell proportions.

CDSeq: Complete deconvolution using sequencing data.

cdseq.result: Output for synthetic mixtures of PBMC scRNAseq data.

gene2rpkm: Outputs the RPKM normalization of the CDSeq-estimated GEPs.

hungarian_Rcpp: Hungarian algorithm wrapper for cell type assignment. Returns the cell type assignment given reference GEPs.

gene_length: Gene length.

logpost: Computes the log posterior of the CDSeq model and outputs its value.

result1: CDSeq result of synthetic bulk RNA-seq read counts data of six cell types.

cellTypeAssignMarkerGenes: Assigns CDSeq-identified cell types using a user-provided marker gene list and plots a heatmap.

result2: CDSeq result of synthetic bulk RNA-seq read counts data of six cell types.

max_rep: Finds the element that repeats the most in a given vector and calculates its proportion.

seedMT: Mersenne Twister random number generator. cokus generates pseudorandom integers uniformly distributed in 0..(2^32 - 1).

mixtureGEP: Synthetic bulk RNA-seq read counts data of six cell types.

merge_df: Data frame holding the CDSeq-estimated cell type proportions for PBMC mixtures.

sc_gep: PBMC single-cell RNAseq read counts used for creating the synthetic PBMC mixtures.

intersection: Takes the intersection of multiple lists and returns the common set and index.

true_prop_RNA: True cell type RNA proportions.

result3: CDSeq result of synthetic bulk RNA-seq read counts data of six cell types.

read2gene: Outputs the CDSeq-estimated GEPs normalized by gene length.

gibbsSampler: Gibbs sampler for CDSeq. Returns estimated GEPs and cell type proportions.

refGEP: GEPs of six component pure cell lines.

true_prop_cell: True cell proportions of the mixtures.

true_GEP_read: True GEPs of the six component cell types, not normalized by gene length.

pbmc_mix: Synthetic bulk RNA-seq read counts data generated from PBMC single-cell data.

pbmc_ggplot: ggplot figures comparing CDSeq-estimated cell type proportions with the ground truth.

true_GEP_gene: True GEPs of the six component cell types, normalized by gene length.

sc_annotation: Cell type annotation of the PBMC single-cell data.

true_GEP_rpkm: True GEPs of the six component cell types, RPKM-normalized.

true_prop: True cell type proportions in the PBMC synthetic mixtures.