RepeatedHighDim (version 2.5.0)

scTrimDist: Trim extreme cells based on kNN distance within cell types

Description

Identifies and removes extreme (outlier) cells within each cell type or cluster based on k-nearest neighbour (kNN) distances computed in the normalized high-dimensional gene expression space. Cells located in sparsely populated regions at the periphery of clusters are excluded prior to downstream analyses.

Usage

scTrimDist(
  seurat_obj,
  celltype_col,
  knn_k = 30,
  keep_frac = 0.05,
  normalization_method = "LogNormalize",
  nfeatures = 2000,
  assay = "RNA",
  npcs = 20,
  resolution = 0.5,
  log2FC_filter = 1,
  pred,
  verbose = TRUE
)

Value

A named list containing:

  • plot_outliers: ggplot object showing a t-SNE embedding with the outlier cells highlighted.

  • trimmed_object: Seurat object after trimming and reprocessing.

  • all_markers: Data frame of marker genes.

  • knn_res: List of kNN results per cell type.

Arguments

seurat_obj

A Seurat object containing single-cell expression data.

celltype_col

Character scalar specifying the column in seurat_obj@meta.data defining cell types or clusters.

knn_k

Integer specifying the number of nearest neighbours (\(K\) in Details).

keep_frac

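Numeric in (0,1) specifying the fraction of most extreme cells to remove per cell type (\(\alpha\) in Details). Despite the argument name, this is the fraction removed, not the fraction kept.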

normalization_method

Normalization method passed to Seurat::NormalizeData.

nfeatures

Number of variable features selected.

assay

Assay used for expression data extraction.

npcs

Number of principal components used downstream.

resolution

Clustering resolution for FindClusters.

log2FC_filter

Minimum log2 fold-change threshold for marker filtering. If NULL, no filtering is applied.

pred

A SingleR result object. Row names must correspond to cell barcodes; pred$labels is used for annotation.

verbose

Logical indicating whether progress messages are printed.

Details

For each cell type (or cluster), a kNN search is performed using the normalized gene expression matrix obtained from a standard Seurat preprocessing workflow. For a given cell \(i\) in cluster \(k\), the Euclidean distances \(D_{(j,i)}^k\) to its \(j = 1, \ldots, K\) nearest neighbours are computed.

The minimum distance $$ \min D_i^k = \min_{j = 1, \ldots, K} D_{(j,i)}^k $$ is used as a measure of local neighbourhood density. Cells with large minimum distances are interpreted as extreme or non-representative cells.

A fraction \(\alpha\) (specified via keep_frac) of the most extreme cells is removed per cluster, defined as cells with $$ \min D_i^k > Q_{1 - \alpha} $$ where \(Q_{1 - \alpha}\) is the \((1 - \alpha)\) quantile of the minimum kNN distance distribution within the cluster.
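The trimming rule above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helpers min_knn_dist() and flag_extreme() are hypothetical names, the toy matrix stands in for a normalized expression matrix (rows are cells), and distances are computed by brute force with dist() rather than a dedicated kNN search.

```r
# Toy "expression" matrix: rows are cells, columns are genes (assumed normalized).
set.seed(42)
expr <- rbind(
  matrix(rnorm(40 * 5), nrow = 40),             # cell type A: one dense cluster
  matrix(rnorm(40 * 5, mean = 10), nrow = 40),  # cell type B: a second cluster
  matrix(50, nrow = 1, ncol = 5)                # one cell far from every other cell
)
rownames(expr) <- paste0("cell", seq_len(nrow(expr)))
cell_type <- c(rep("A", 40), rep("B", 41))      # the far cell is labelled B

# Hypothetical helper: minimum Euclidean distance to the k nearest neighbours.
# As the formula is written, this equals the distance to the nearest neighbour.
min_knn_dist <- function(x, k) {
  stopifnot(nrow(x) >= 2)
  k <- min(k, nrow(x) - 1)      # cannot have more neighbours than other cells
  d <- as.matrix(dist(x))       # pairwise Euclidean distances
  diag(d) <- Inf                # a cell is not its own neighbour
  apply(d, 1, function(row) min(sort(row)[seq_len(k)]))
}

# Hypothetical helper: flag cells whose minimum kNN distance exceeds the
# (1 - alpha) quantile within their own cell type.
flag_extreme <- function(x, groups, k = 30, alpha = 0.05) {
  extreme <- setNames(logical(nrow(x)), rownames(x))
  for (g in unique(groups)) {
    idx <- which(groups == g)
    md <- min_knn_dist(x[idx, , drop = FALSE], k)
    cutoff <- quantile(md, probs = 1 - alpha, names = FALSE)
    extreme[idx] <- md > cutoff
  }
  extreme
}

extreme <- flag_extreme(expr, cell_type, k = 10, alpha = 0.05)
trimmed <- expr[!extreme, , drop = FALSE]  # cells kept after trimming
table(cell_type, extreme)                  # flagged cells per cell type
```

Because the cutoff is a quantile computed separately within each cell type, roughly a fraction \(\alpha\) of cells is flagged in every cell type, including types that contain no obvious outliers.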

After trimming, the remaining cells are re-normalized and reprocessed using standard Seurat workflows. Cell type annotations are assigned using a precomputed SingleR result supplied by the user via pred, and cluster-specific marker genes are identified.