fateBias: Computation of fate bias

Description

This function computes fate biases for single cells based on expression data from a single-cell sequencing experiment. It requires a clustering partition and, for each trajectory, a target cluster representing a committed state.

Usage

fateBias(
  x,
  y,
  tar,
  z = NULL,
  minnr = NULL,
  minnrh = NULL,
  adapt = TRUE,
  confidence = 0.75,
  nbfactor = 5,
  use.dist = FALSE,
  seed = NULL,
  nbtree = NULL,
  verbose = FALSE,
  ...
)

Arguments

x

expression data frame with genes as rows and cells as columns. Gene IDs should be given as row names and cell IDs should be given as column names. This can be a reduced expression table only including the features (genes) to be used in the analysis.

y

clustering partition. A vector with an integer cluster number for each cell. The order of the cells has to be the same as for the columns of x.

tar

vector of integers representing target cluster numbers. Each element of tar corresponds to a cluster of cells committed towards a particular mature state. One cluster per different cell lineage has to be given and is used as a starting point for learning the differentiation trajectory.

z

matrix containing cell-to-cell distances to be used in the fate bias computation. Default is NULL. In this case, a correlation-based distance is computed from x by 1 - cor(x).

minnr

integer number of cells per target cluster to be selected for classification (test set) in each iteration. For each target cluster, the minnr cells with the highest similarity to a cell in the training set are selected for classification. If z is not NULL it is used as the similarity matrix for this step. Otherwise, 1 - cor(x) is used. Default value is NULL, in which case minnr is estimated as the minimum of 20 and half the median of target cluster sizes.

minnrh

integer number of cells from the training set used for classification. From each training set, the minnrh cells with the highest similarity to the training set are selected. If z is not NULL it is used as the similarity matrix for this step. Default value is NULL, in which case minnrh is estimated as the maximum of 20 and half the median of target cluster sizes.

adapt

logical. If TRUE then the size of the test set for each target cluster is adapted based on the classification success in the previous iteration. For each target cluster, the number of successfully classified cells is determined, i.e. the number of cells with a minimum fraction of votes given by the confidence parameter for the target cluster, which gave rise to the inclusion of the cell in the test set (see minnr). Weights are then derived by dividing this number by the maximum across all clusters after adding a pseudocount of 1. The test set size minnr is rescaled for each cluster by the respective weight in the next iteration. Default is TRUE.

confidence

real number between 0 and 1. See adapt parameter. Default is 0.75.

nbfactor

positive integer number. Determines the number of trees grown for each random forest. The number of trees is given by the number of columns of the training set multiplied by nbfactor. Default value is 5.

use.dist

logical value. If TRUE then the distance matrix is used as feature matrix (i.e. z if not equal to NULL and 1 - cor(x) otherwise). If FALSE, gene expression values in x are used. Default is FALSE.

seed

integer seed for initialization. If equal to NULL then each run will yield slightly different results due to the randomness of the random forest algorithm. Default is NULL.

nbtree

integer value. If given, it explicitly specifies the number of trees for each random forest. Default is NULL.

verbose

logical. If TRUE, then print information to console.

...

additional arguments to be passed to the low level function randomForest.
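The defaults described above can be illustrated with a small base R sketch. This is only a reading of the argument descriptions, not the package's internal code: the toy data, the variable names (sizes, success, n_features and so on) and the rounding of the rescaled test set size are assumptions made for illustration, and fateBias may compute these quantities differently.

```r
# Illustrative sketch only; all data and helper names are made up.

# Correlation-based distance between cells (columns), as described for z = NULL.
x <- data.frame(c1 = c(1, 2, 3, 4), c2 = c(2, 4, 6, 8), c3 = c(4, 3, 2, 1),
                row.names = paste0("g", 1:4))
d <- 1 - cor(x)  # 0 for perfectly correlated cells, 2 for anti-correlated cells

# Hypothetical partition with four clusters, three of them target clusters.
y <- c(rep(1, 50), rep(2, 30), rep(3, 10), rep(4, 60))
tar <- c(2, 3, 4)
sizes <- as.vector(table(y)[as.character(tar)])  # target cluster sizes
half_median <- median(sizes) / 2

# Defaults as described for minnr and minnrh when they are NULL.
minnr_default <- min(20, half_median)
minnrh_default <- max(20, half_median)

# adapt = TRUE: weights from the number of successfully classified cells per
# target cluster, pseudocount of 1, divided by the maximum across clusters.
success <- c(t2 = 9, t3 = 4, t4 = 14)  # hypothetical counts
w <- (success + 1) / max(success + 1)
minnr_next <- round(minnr_default * w)  # rounding is an assumption

# Number of trees: nbfactor times the number of training set columns,
# unless nbtree is given explicitly.
n_features <- 40  # hypothetical number of training set columns
nbfactor <- 5
nbtree <- NULL
ntree <- if (is.null(nbtree)) n_features * nbfactor else nbtree
```

With these toy numbers the target cluster sizes are 30, 10 and 60, so half the median is 15; the sketch therefore uses a test set size of 15 and a training set size of 20.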

Value

A list with the following five components:

probs

a data frame with the fraction of random forest votes for each cell. Columns represent the target clusters. Column names are given by a concatenation of t and target cluster number.

votes

a data frame with the number of random forest votes for each cell. Columns represent the target clusters. Column names are given by a concatenation of t and target cluster number.

tr

list of vectors. Each component contains the IDs of all cells on the trajectory to a given target cluster. Component names are given by a concatenation of t and target cluster number.

rfl

list of randomForest objects for each iteration of the classification.

trall

vector of cell ids ordered by the random forest iteration in which they have been classified into one of the target clusters.
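To make the layout of probs and votes concrete, the following sketch builds a mock votes table by hand and derives vote fractions from it. Nothing here calls fateBias: the cell IDs and vote counts are invented, and computing the fractions as votes divided by the row total is an assumption based on the description of probs.

```r
# Mock illustration of the probs/votes layout; values are invented.
tar <- c(6, 9, 13)

# One row per cell, one column per target cluster.
votes <- data.frame(c(120, 10), c(60, 30), c(20, 160),
                    row.names = c("cellA", "cellB"))
colnames(votes) <- paste0("t", tar)  # "t6", "t9", "t13"

# Fraction of votes per cell (assumed: votes divided by the row total).
probs <- votes / rowSums(votes)

# Target cluster with the largest vote fraction for each cell.
top <- colnames(probs)[max.col(as.matrix(probs), ties.method = "first")]
names(top) <- rownames(probs)
```

In a real analysis the same column names can be used to index the result, for example fb$probs[, "t6"].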

Details

The bias is computed as the ratio of the number of random forest votes for a trajectory and the number of votes for the trajectory with the second largest number of votes. As a result, only the trajectory with the largest number of votes will receive a bias >1. The significance is computed based on counting statistics on the difference in the number of votes. A significant bias requires a p-value < 0.05. Cells are assigned to a trajectory if they exhibit a significant bias >1 for this trajectory.
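The ratio described above can be sketched for a single cell as follows. The helper vote_bias and the vote counts are hypothetical, and the significance test is left out because the exact counting statistics are not specified here; this is a literal reading of the description rather than the package's implementation.

```r
# Hypothetical helper: bias of each trajectory for one cell, defined as its
# vote count divided by the second largest vote count for that cell.
vote_bias <- function(votes) {
  second <- sort(votes, decreasing = TRUE)[[2]]
  votes / second  # Inf if the second largest count is zero
}

votes <- c(t6 = 120, t9 = 60, t13 = 20)  # invented vote counts
bias <- vote_bias(votes)

# Only the trajectory with the most votes can exceed 1.
candidate <- names(bias)[bias > 1]
```

If the two leading trajectories receive the same number of votes, every ratio is at most 1 and candidate is empty, so the cell would not be assigned to any trajectory under this reading.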

Examples

# NOT RUN {
x <- intestine$x
y <- intestine$y
tar <- c(6,9,13)
fb <- fateBias(x,y,tar,minnr=5,minnrh=20,adapt=TRUE,confidence=0.75,nbfactor=5)
head(fb$probs)
# }