pruneKnn: Function inferring a pruned knn matrix

Description

This function determines k nearest neighbours for each cell in gene expression space, and tests if the links are supported by a negative binomial joint distribution of gene expression. A probability is assigned to each link which is given by the minimum joint probability across all genes.

Usage

pruneKnn(
  expData,
  distM = NULL,
  large = TRUE,
  regNB = TRUE,
  batch = NULL,
  regVar = NULL,
  ngenes = 2000,
  span = 0.75,
  pcaComp = 100,
  algorithm = "kd_tree",
  metric = "pearson",
  genes = NULL,
  knn = 10,
  alpha = NULL,
  no_cores = NULL,
  FSelect = FALSE,
  seed = 12345,
  res = NULL
)

Value

List object of six components:

distM: Distance matrix.
dimRed: PCA transformation of expData including the first pcaComp principal components, computed on the included genes, or on variable genes only if FSelect equals TRUE. It is set to NULL if large equals FALSE.
pvM: Matrix of link probabilities between a cell and each of its k nearest neighbours. Column i shows the k nearest neighbour link probabilities for cell i in matrix x.
NN: Matrix of column indices of k nearest neighbours for each cell according to input matrix x. First entry corresponds to index of the cell itself. Column i shows the k nearest neighbour indices for cell i in matrix x.
B: List object with background model of gene expression as obtained by fitBackVar function.
regData: If regNB=TRUE this component contains a list of four components: component pearsonRes contains a matrix of the Pearson residuals computed from the negative binomial regression, component nbRegr contains a matrix with the regression coefficients, component nbRegrSmooth contains a matrix with the smoothed regression coefficients, and log10_umi is a vector with the total log10 UMI count for each cell. The regression coefficients comprise the dispersion parameter theta, the intercept, the regression coefficient beta for the log10 UMI count, and the regression coefficients of the batches (if batch is not NULL).
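
As a rough illustration of how the NN and pvM components relate, the following base R sketch turns toy versions of both matrices into a table of links and drops links with a low probability. The matrices, their values, the assumed shapes (NN with one extra first row for the cell itself), and the cutoff are all invented for illustration and are not taken from pruneKnn output.

```r
# Toy stand-ins for the NN and pvM components (values are invented).
# Assumption: NN has k + 1 rows (row 1 is the cell itself), pvM has k rows.
NN <- matrix(c(1, 2, 3,
               2, 1, 3,
               3, 4, 1,
               4, 3, 2), nrow = 3)   # column i = neighbours of cell i
pvM <- matrix(c(0.40, 0.002,
                0.35, 0.001,
                0.50, 0.003,
                0.45, 0.20), nrow = 2)

cutoff <- 0.01  # hypothetical pruning threshold, not a documented default

# One row per cell-neighbour link; the self entry in row 1 of NN is dropped.
links <- data.frame(
  cell      = rep(seq_len(ncol(NN)), each = nrow(pvM)),
  neighbour = as.vector(NN[-1, , drop = FALSE]),
  prob      = as.vector(pvM)
)

# Keep only links whose probability reaches the cutoff.
kept <- links[links$prob >= cutoff, ]
```

Reading both matrices column by column keeps each probability aligned with the neighbour index it belongs to.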

Arguments

expData: Matrix of gene expression values with genes as rows and cells as columns. These values have to correspond to unique molecular identifier counts.
distM: Optional distance matrix used for determining k nearest neighbours. Default is NULL and the distance matrix is computed using a metric given by the parameter metric.
large: logical. If TRUE then no distance matrix is required and nearest neighbours are inferred by the FNN package based on a reduced feature matrix computed by a principal component analysis. Only the first pcaComp principal components are considered. Prior to principal component analysis a negative binomial regression is performed to eliminate the dependence on the total number of transcripts per cell. The Pearson residuals of this regression serve as input for the principal component analysis after smoothing the parameter dependence on the mean by a loess regression. Default is TRUE. Recommended mode for very large datasets, where a distance matrix consumes too much memory. If distM is supplied, it is ignored when large equals TRUE.
regNB: logical. If TRUE then a negative binomial regression is performed prior to the principal component analysis if large = TRUE. See large. Default is TRUE.
batch: vector of batch variables. Component names need to correspond to valid cell IDs, i.e. column names of expData. If regNB is TRUE, then the batch variable will be regressed out simultaneously with the log10 UMI count per cell. An interaction term is included for the log10 UMI count with the batch variable. Default value is NULL.
regVar: data.frame with additional variables to be regressed out simultaneously with the log10 UMI count and the batch variable (if batch is not NULL). Column names indicate variable names (name beta is reserved for the coefficient of the log10 UMI count), and rownames need to correspond to valid cell IDs, i.e. column names of expData. Interaction terms are included for each variable in regVar with the batch variable (if batch is not NULL). Default value is NULL.
ngenes: Positive integer number. Randomly sampled number of genes (from rownames of expData) used for predicting regression coefficients (if regNB=TRUE). Smoothed coefficients are derived for all genes. Default is 2000.
span: Positive real number. Parameter for loess-regression (see large) controlling the degree of smoothing. Default is 0.75.
pcaComp: Positive integer number. Number of principal components to be included if large is TRUE. Default is 100.
algorithm: Algorithm for fast k nearest neighbour inference, using the get.knn function from the FNN package. See help(get.knn). Default is "kd_tree".
metric: Distances are computed from the expression matrix x after optionally including only genes given as argument genes or after optional feature selection (see FSelect). Possible values for metric are "pearson", "spearman", "logpearson", "euclidean". Default is "pearson". In case of the correlation based methods, the distance is computed as 1 – correlation.
genes: Vector of gene names corresponding to a subset of rownames of x. Only these genes are used for the computation of a distance matrix and for the computation of joint probabilities of nearest neighbours. Default is NULL and all genes are used.
knn: Positive integer number. Number of nearest neighbours considered for each cell. Default is 10.
alpha: Positive real number. Relative weight of a cell versus its k nearest neighbours applied for the derivation of joint probabilities. A cell receives a weight of alpha while the weights of its k nearest neighbours are determined by quadratic programming. The sum across all weights is normalized to one, and the weighted mean expression is used for computing the joint probability of a cell and each of its k nearest neighbours. These probabilities are used for the derivation of link probabilities. Larger values give more weight to the gene expression observed in a cell versus its neighbourhood. Typical values should be in the range of 0 to 10. Default is NULL. In this case, alpha is inferred by an optimization, i.e., alpha is minimized under the constraint that the gene expression in a cell does not deviate more than one standard deviation from the predicted weighted mean, where the standard deviation is calculated from the predicted mean using the background model (the average dependence of the variance on the mean expression).
no_cores: Positive integer number. Number of cores for multithreading. If set to NULL then the number of available cores minus two is used. Default is NULL.
FSelect: Logical parameter. If TRUE, then feature selection is performed prior to distance matrix calculation and VarID analysis. Default is FALSE.
seed: Integer number. Random number to initialize stochastic routines. Default is 12345.
res: Output object from pruneKnn. The rownames (genes) and colnames (cells) of the parameter expData have to be subsets of the input data used to produce this output. For example, the batch effects could have been corrected on the global dataset using the pruneKnn function, and using the output from the global run permits using regression parameters from the global analysis on specific subsets if expData contains a subset of genes and cells. Default is NULL.
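
To make the metric and knn arguments more concrete, the following base R sketch computes a correlation-based distance (1 minus the Pearson correlation) between the cells of a toy count matrix and derives a neighbour index matrix in the layout described for NN. It is only an illustration on invented data, not the implementation used by pruneKnn; the object names x, d, k and nn are chosen for this sketch, and keeping k + 1 rows (the cell itself plus k neighbours) is an assumption.

```r
# Toy count matrix: 5 genes (rows) x 4 cells (columns); values are invented.
x <- matrix(
  c(10, 12,  1,  0,
     8,  9,  2,  1,
     0,  1, 15, 14,
     1,  0, 11, 13,
     5,  6,  5,  4),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("g", 1:5), paste0("c", 1:4))
)

# Correlation-based distance between cells: 1 - Pearson correlation.
# cor() correlates columns, so cells are compared across genes.
d <- 1 - cor(x, method = "pearson")

# For each cell, order all cells by distance and keep the cell itself
# plus its k nearest neighbours. Column i of nn refers to cell i.
k <- 2
nn <- apply(d, 2, function(v) order(v)[seq_len(k + 1)])
```

Because every cell has distance zero to itself, the first row of nn holds each cell's own index, and the remaining rows hold its neighbours in order of increasing distance.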

Examples

res <- pruneKnn(intestinalDataSmall, metric = "pearson", knn = 10, alpha = 1, no_cores = 1, FSelect = FALSE)
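
The call above sets alpha = 1. As a rough sketch of what this weight means, the following base R code computes a normalized weighted mean for one gene across a cell and its neighbours. The counts are invented, and equal neighbour weights are used here in place of the quadratic-programming weights described under alpha, so this only illustrates the normalization and the effect of a larger alpha.

```r
# Invented counts for one gene: the cell itself, then its 3 nearest neighbours.
counts <- c(cell = 8, nb1 = 4, nb2 = 6, nb3 = 2)

alpha <- 1  # weight of the cell itself, as in the call above

# Assumption: equal raw neighbour weights stand in for the
# quadratic-programming weights described under alpha.
w <- c(alpha, rep(1, length(counts) - 1))
w <- w / sum(w)  # normalize so that all weights sum to one

# Weighted mean expression of the cell and its neighbourhood.
weighted_mean <- sum(w * counts)

# A larger alpha pulls the weighted mean towards the cell's own count.
w10 <- c(10, rep(1, length(counts) - 1))
w10 <- w10 / sum(w10)
weighted_mean_10 <- sum(w10 * counts)
```

With alpha = 10 the weighted mean moves closer to the cell's own count of 8 than with alpha = 1, which matches the description that larger values give more weight to the cell versus its neighbourhood.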
