scorpion: Build gene regulatory networks from single-cell RNA-seq data using PANDA

Description

Constructs gene regulatory networks from single-cell/nuclei RNA-seq data by first applying coarse-graining to reduce sparsity, then running the PANDA (Passing Attributes between Networks for Data Assimilation) message-passing algorithm to integrate transcription factor motifs, protein-protein interactions, and gene expression data into unified regulatory networks.

Usage

scorpion(
  tfMotifs = NULL,
  gexMatrix,
  ppiNet = NULL,
  computingEngine = "cpu",
  nCores = 1,
  gammaValue = 10,
  nPC = 25,
  assocMethod = "pearson",
  alphaValue = 0.1,
  hammingValue = 0.001,
  nIter = Inf,
  outNet = c("regNet", "coregNet", "coopNet"),
  zScaling = TRUE,
  showProgress = TRUE,
  randomizationMethod = "None",
  scaleByPresent = FALSE,
  filterExpr = FALSE
)

Value

A list of 6 elements describing the inferred networks at convergence:

regNet: Regulatory network matrix (TFs × genes)
coregNet: Co-regulation network matrix (genes × genes)
coopNet: Cooperation network matrix (TFs × TFs)
numGenes: Number of genes in the network
numTFs: Number of transcription factors
numEdges: Total number of edges in regulatory network

Arguments

tfMotifs: A motif dataset (data.frame or matrix) with 3 columns: TF, target gene, and motif score. Pass NULL for co-expression analysis only.
gexMatrix: An expression dataset, with genes in the rows and barcodes (cells) in the columns.
ppiNet: A Protein-Protein-Interaction dataset (data.frame or matrix) with 3 columns: protein 1, protein 2, and interaction score. Pass NULL to disable protein interaction integration.
computingEngine: Character specifying computing device: 'cpu' or 'gpu' (if available). Default 'cpu'.
nCores: Number of processors to be used if BLAS or MPI is active.
gammaValue: Graining level of data (proportion of number of single cells in the initial dataset to the number of super-cells in the final dataset)
nPC: Number of principal components to use for construction of single-cell kNN network.
assocMethod: Association method. Must be one of 'pearson', 'spearman' or 'pcNet'.
alphaValue: Numeric update parameter (0 to 1) controlling relative contribution of prior networks. Default 0.1.
hammingValue: Numeric convergence threshold based on Hamming distance. Algorithm stops when updates fall below this. Default 0.001.
nIter: Sets the maximum number of iterations PANDA can run before exiting.
outNet: A vector containing which networks to return. Options include "regNet", "coregNet", "coopNet".
zScaling: Boolean to indicate use of Z-Scores in output. FALSE will use [0,1] scale.
showProgress: Boolean to indicate printing of output for algorithm progress.
randomizationMethod: Method by which to randomize gene expression matrix. Default "None". Must be one of "None", "within.gene", "by.genes". "within.gene" randomization scrambles each row of the gene expression matrix, "by.gene" scrambles gene labels.
scaleByPresent: Boolean to indicate scaling of correlations by percentage of positive samples.
filterExpr: Boolean to remove genes with zero expression across all cells before network inference. Default FALSE.

Author

Daniel Osorio <daniecos@uio.no>

Examples

Run this code

# Loading example data
data(scorpionTest)

# The structure of the data
str(scorpionTest)

# List of 4
# $ gex     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
# .. ..@ i       : int [1:46171] 29 32 41 43 61 170 208 245 251 269 ...
# .. ..@ p       : int [1:1955] 0 11 62 97 112 163 184 215 257 274 ...
# .. ..@ Dim     : int [1:2] 300 1954
# .. ..@ Dimnames:List of 2
# .. .. ..$ : chr [1:300] "IGHM" "IGHG2" "IGLC3" "IGLL5" ...
# .. .. ..$ : chr [1:1954] "P31-T_AAACGGGTCGGTTAAC" "P31-T_AAAGATGGTGGCCCTA" ...
# .. ..@ x       : num [1:46171] 1 1 1 1 2 2 1 1 2 1 ...
# .. ..@ factors : list()
# $ tf      :'data.frame':	371738 obs. of  3 variables:
# ..$ source_genesymbol: chr [1:371738] "MYC" "SPI1" "JUN_JUND" "FOS_JUND" ...
# ..$ target_genesymbol: chr [1:371738] "TERT" "BGLAP" "JUN" "JUN" ...
# ..$ weight           : num [1:371738] 1 1 1 1 1 1 1 1 1 1 ...
# ..- attr(*, "origin")= chr "cache"
# ..- attr(*, "url")= chr "https://omnipathdb.org/interactions? __truncated__
# $ ppi     :'data.frame':	4076 obs. of  3 variables:
# ..$ source_genesymbol: chr [1:4076] "ZIC1" "HES5" "ATOH1" "DLL1" ...
# ..$ target_genesymbol: chr [1:4076] "ATOH1" "ATOH1" "HES5" "NOTCH1" ...
# ..$ weight           : num [1:4076] 1 1 1 1 1 1 1 1 1 1 ...
# ..- attr(*, "origin")= chr "cache"
# ..- attr(*, "url")= chr "https://omnipathdb.org/interactions?__truncated__
# $ metadata:'data.frame':	1954 obs. of  4 variables:
# ..$ cell_id  : chr [1:1954] "P31-T_AAACGGGTCGGTTAAC" "P31-T_AAAGATGGTGGCCCTA"...
# ..$ donor    : chr [1:1954] "P31" "P31" "P31" "P31" ...
# ..$ region   : chr [1:1954] "T" "T" "T" "T" ...
# ..$ cell_type: Factor w/ 1 level "Epithelial": 1 1 1 1 1 1 1 1 1 1 ...

# Running SCORPION for epithelial cells from the normal tissue
# We are using alphaValue = 0.8 for testing purposes (Default = 0.1).
scorpionOutput <- scorpion(
  tfMotifs = scorpionTest$tf,
  gexMatrix = scorpionTest$gex[, scorpionTest$metadata$region == "N"],
  ppiNet = scorpionTest$ppi,
  alphaValue = 0.8
)

# -- SCORPION --------------------------------------------------------------------------------------
# + Initializing and validating
# + Verified sufficient samples
# i Normalizing networks
# i Learning Network
# i Using tanimoto similarity
# + Successfully ran SCORPION on 281 Genes and 963 TFs

# Structure of the output.
str(scorpionOutput)

# List of 6
# $ regNet  : num [1:963, 1:281] -0.1556 -0.0455 -0.1461 1.6881 0.8746 ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:963] "AATF" "ABL1" "ACSS2" "ADNP" ...
# .. ..$ : chr [1:281] "ACKR1" "ACTA2" "ACTG2" "ADAMDEC1" ...
# $ coregNet: num [1:281, 1:281] 2.02e+06 3.84 4.10 -1.26 8.81e-01 ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:281] "ACKR1" "ACTA2" "ACTG2" "ADAMDEC1" ...
# .. ..$ : chr [1:281] "ACKR1" "ACTA2" "ACTG2" "ADAMDEC1" ...
# $ coopNet : num [1:963, 1:963] 1.17e+07 -2.66 8.13 -1.31 4.95 ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:963] "AATF" "ABL1" "ACSS2" "ADNP" ...
# .. ..$ : chr [1:963] "AATF" "ABL1" "ACSS2" "ADNP" ...
# $ numGenes: int 281
# $ numTFs  : int 963
# $ numEdges: int 270603