Learn R Programming

gsdensity: Density-based Gene Set Evaluation

Introduction

Investigation of gene sets among cell populations is a common step in single-cell transcriptomic studies. Several hurdles do exist in this approach, such as:

  1. the noise/sparsity of the detected transcripts per cell;
  2. the uncertainty/bias in clustering and annotation; and
  3. the necessity of manual annotation (for factorization/pattern approaches).

Here, we present 'gsdensity', a computational tool dedicated for 'pathway centric' analysis of single-cell data. Given a gene set and a cell-by-gene matrix, we want to ask: is this gene set somehow enriched by a subpopulation of the cells? If so, can we fetch the cells for downstream analysis?

The motivation is that in many cases, to analyze single cell data, there is not a standard way of finding out which cell population is the most 'interesting' and worth a deeper analysis. However, scientists of the very field always have insights regarding 'it will be interesting to look at this pathway in the data'. This is especially true when clustering and annotation is non-trivial for the dataset.

The core idea of gsdensity is that we use multiple correspondence analsysis (MCA) to co-embed cells and genes (thanks to the CelliD package) and then investigate the 'density' of the gene sets of interest in the MCA space. The position of each gene in the MCA space reflects the relationship between that gene and cells and other genes. Thus, when a gene set shows a very different 'density' in the MCA space compared with the background, it indicates that there does exist some 'affinity' between the gene set and some cell subpopulation. We can then fetch the most relevant cells for a gene set using the label propagation algorithm since the genes and cells can be placed in the same graph (nearest neighbor graph built with the MCA space), with the genes in the gene set being 'seeds'. Thus, gsdensity is not affected by any of the three hurdles brought up above:

  1. The co-embedding of cells and genes will alleviate of the sparsity of single-cell data
  2. This gsdensity method does not require any clustering information; however, when clustering/annotation is available, we can find highly specific gene sets for a cluster of interest (please check tutorials)
  3. We can use well-annotated gene sets directly as the input; there is no need for manual annotation of patterns.

Installation

Have been tested for windows and linux systems with no problem. Running on a linux server (with parallel computing) is strongly recommended when there are many gene sets to be tested.

Installation not successful in macos with M1 chip, due to a dependency of the CelliD package (https://github.com/LTLA/scuttle/issues/14#issuecomment-989653173). Will keep testing and update when this is solved. (9/17: please check Update section #2)

# First, install dependencies: Seurat, CelliD, dnet, supraHex, Rgraphviz, infotheo, anticlust, multimode, philentropy; then:

# install.packages("remotes")

#Turn off warning-error-conversion, because the tiniest warning stops installation
Sys.setenv("R_REMOTES_NO_ERRORS_FROM_WARNINGS" = "true")

#install from github
remotes::install_github("https://github.com/qingnanl/gsdensity")

Usage

We included two examples showing how to use gsdensity with scRNA-seq data and spatial transcriptomics data.

In the former one we introduced the basic workflow and main functions of gsdensity, including a quick comparison between gsdensity and gsea.

The latter one is mainly focused on how to associate gene sets with the spatial information of cells or other partition of cells.

Using gsdensity to analyze scRNA-seq data

Using gsdensity to analyze spatial transcriptomics data

Quick start

library(gsdensity)
library(ggplot2) # for plotting
library(reshape2)
library(msigdbr) # for gathering gene sets
library(Seurat)
library(SeuratData)
# library(future) # for parallel computing
# library(future.apply) # for parallel computing

# use GO_BP gene sets 
# Conver the format to 'list'

mdb_c5 <- msigdbr(species = "Homo sapiens", category = "C5")
mdb_c5_bp <- mdb_c5[mdb_c5$gs_subcat == "GO:BP", ]
gene.set.list <- list()
for (gene.set.name in unique(mdb_c5_bp$gs_name)){
        gene.set.list[[gene.set.name]] <- mdb_c5_bp[mdb_c5_bp$gs_name %in% gene.set.name, ]$gene_symbol
}

# from SeuratData
data("pbmc3k")
# Normalize the data
pbmc3k <- NormalizeData(pbmc3k)

# compute cell/gene embeddings
ce <- compute.mca(object = pbmc3k)
# find gene sets with differential density 
res <- compute.kld(coembed = ce, 
                   genes.use = intersect(rownames(ce), rownames(pbmc3k)), # this intersection is to select only genes, not cells. 
                   n.grids = 100, 
                   gene.set.list = gene.set.list,
                   gene.set.cutoff = 3,
                   n.times = 100)                   
gene.set.deviated <- res[res$p.adj < 0.05, ]$gene.set

# build nearest neighbor graph
cells <- colnames(pbmc3k)
el <- compute.nn.edges(coembed = ce, nn.use = 300)

# get label propagation probability for each cell of a gene set 'GOBP_B_CELL_ACTIVATION'
cv <- run.rwr(el = el, gene_set = gene.set.list[["GOBP_B_CELL_ACTIVATION"]], cells = cells)
# get a 'positive' or 'negative' label for each cell of this gene set
cl <- compute.cell.label(cv)

# Compute the UMAP coordinates for visualization
pbmc3k <- pbmc3k %>%
        FindVariableFeatures() %>%
        ScaleData() %>%
        RunPCA() %>%
        FindNeighbors() %>%
        RunUMAP(dims = 1:20)
# plot the cell annotation
DimPlot(pbmc3k,
        group.by = "seurat_annotations",
        label = T,
        raster = T)

# add the label propagation probability and binarized label to meta data
pbmc3k@meta.data$b_cell_activition <- cv[colnames(pbmc3k)]
pbmc3k@meta.data$b_cell_activition_bin <- cl[colnames(pbmc3k)]

# plot the probabilities
FeaturePlot(pbmc3k,
        features = "b_cell_activition",
        raster = T)

# plot the 'positive' or 'negative' labels        
DimPlot(pbmc3k,
        group.by = "b_cell_activition_bin",
        raster = T)        

Updates

  1. (9/15/22)When such error appears when running run.rwr():
Error: as(<dsCMatrix>, “dgCMatrix”) is deprecated since Matrix 1.5-0; do as(., “generalMatrix”) instead

please set:

igraph::igraph.options(sparsematrices = FALSE)

then run the code again

  1. (9/17/22) For MacOS with M1 chip, the tool can be installed with the intel build R, but not for the arm64 version of R. This has been tested with an M1-2020 computer.

  2. (9/25/22) Now we provide statistical test for spatial relevance of gene sets with 'compute.spatial.kld' and 'compute.spatial.kld.df' functions.

  3. (2/27/23) Now we added a tunable parameter, "restart", for network propagation. In most cases, using the default (0.75) will give reasonable results. When the gene list is very small or many genes from the list are undetected, and the user still want to find the relatively most relevant cells, please consider setting a lower "restart", e.g., 0.5. This will let the propagation go farther.

Copy Link

Version

Install

install.packages('gsdensity')

Monthly Downloads

1

Version

0.1.2

License

GPL-3

Maintainer

Qingnan, Liang

Last Published

March 16th, 2023

Functions in gsdensity (0.1.2)

gene.set.list

A gene set list containing multiple human GO gene sets
el_nn_search

this function is called by 'compute.nn.edges' to convert nearest neighbor identity matrix to edge list
kde2d.weighted

based on https://stat.ethz.ch/pipermail/r-help/2006-June/107405.html this is called by compute.spatial.kld to calculate the kernel density estimation in 2d space with each data point weighted.
pbmc.meta

pbmc3k meta
compute.spec.single

This is to calculate the similarity between: 1. the label propagation probability of cells for gene sets and 2. the identify of cells in a certain partition This is called by 'compute.spec'; can also run by itself
seed.mat.list

this function is used when more than one 'seed sets' will be used (when there are multiple gene sets of interest)
compute.spatial.kld.df

This function is to calculate how likely the cells relevant to multiple gene sets are randomly distributed spatially
compute.spec

This is to calculate the similarity between: 1. the label propagation probability of cells for gene sets and 2. the identify of cells in partitions
sample.spatial.kld

this function is called by 'compute.spatial.kld' to calculate the kl-divergence between cell-weighted with shuffled weight vector and the ref (all cells, unweighted)
seed.mat

4. compute label propagation from gene set to cells this function is to form a 'seed matrix' used by the dRWR function (dnet R package); the seed matrix is specifying which nodes are the sources for label propagation
coords.df

mouse brain coords
run.rwr

4.1 To calculate the label propagation probability for a gene set among cells; result in a vector (length = number of cells) reflecting the probability each cell is labeled during the propagation (relevance to the gene set)
pbmc.mtx

pbmc3k matrix
run.rwr.list

result in a matrix (number of rows = number of cells; number of columns = number of gene sets) reflecting the probability each cell is labeled during the propagation (relevance to the gene set); same idea as run.rwr but with multiple gene sets
sample.kld

this function is called by 'compute.kld' to calculate the kl-divergence between sampled (background) gene set and the ref (all) gene set
weight_df

mouse brain gene set activities
vectorized_pdist

from an excellent post: https://www.r-bloggers.com/2013/05/pairwise-distances-in-r/ enhanced the speed this function is called by 'compute.kld' to quickly compute the distance between genes to grid points
compute.db

this function is called by 'compute.kld' to aggregate the density contribution of each gene to each grid point, and then normalize the densities of grid points to 1.
compute.cell.label.df

similar to compute.cell.label; used when working with multiple gene sets
compute.spatial.kld

6. find gene sets with spatial relevance
compute.cell.label

4.2. binarize the label propagation probability in the cell population; result in a binarized vector of cells with 'nagative' and 'positive' labels; 'positive' means that the cells are relevant to the gene set
compute.grid.coords

2. compute density of gene sets of interest 2.1 compute grid point coordinates
compute.kld

2.2 compute KL-divergence (some are adapted from https://github.com/alexisvdb/singleCellHaystack/)
ce

cell embeddings for pbmc3k data
compute.mca

1. compute MCA embeddings
compute.jsd

5. compute the specificity of gene set when cell partition information is available; the information could be clustering, sample origins, or other conditions inspired by https://github.com/FloWuenne/scFunctions/blob/0d9ea609fa72210a151f7270e61bdee008e8fc88/R/calculate_rrs.R
compute.nn.edges

3. compute nearest neighbor graph for genes and cells This graph will be used for fetching the most relevant cells of a gene set