scTrimDist: Trim extreme cells based on kNN distance within cell types

Description

Identifies and removes extreme (outlier) cells within each cell type or cluster based on k-nearest neighbour (kNN) distances computed in the normalized high-dimensional gene expression space. Cells located in sparsely populated regions at the periphery of clusters are excluded prior to downstream analyses.

Usage

scTrimDist(
  seurat_obj,
  celltype_col,
  knn_k = 30,
  keep_frac = 0.05,
  normalization_method = "LogNormalize",
  nfeatures = 2000,
  assay = "RNA",
  npcs = 20,
  resolution = 0.5,
  log2FC_filter = 1,
  pred,
  verbose = TRUE
)

Value

A named list containing:

plot_outliers: ggplot showing t-SNE with outliers highlighted.
trimmed_object: Seurat object after trimming and reprocessing.
all_markers: Data frame of marker genes.
knn_res: List of kNN results per cell type.

Arguments

seurat_obj: A Seurat object containing single-cell expression data.
celltype_col: Character scalar specifying the column in seurat_obj@meta.data defining cell types or clusters.
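knn_k: Integer specifying the number of nearest neighbours ($K$ in Details) used in the kNN search.
keep_frac: Numeric in (0,1) specifying the fraction of most extreme cells to remove per cell type ($\alpha$ in Details). Despite its name, this is the fraction removed, not the fraction kept; the default of 0.05 trims the 5% most extreme cells in each cell type.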
normalization_method: Normalization method passed to Seurat::NormalizeData.
nfeatures: Number of variable features selected.
assay: Assay used for expression data extraction.
npcs: Number of principal components used downstream.
resolution: Clustering resolution for FindClusters.
log2FC_filter: Minimum log2 fold-change threshold for marker filtering. If NULL, no filtering is applied.
pred: A SingleR result object. Row names must correspond to cell barcodes; pred$labels is used for annotation.
verbose: Logical indicating whether progress messages are printed.

Details

For each cell type (or cluster), a kNN search is performed using the normalized gene expression matrix obtained from a standard Seurat preprocessing workflow. For a given cell $i$ in cluster $k$, the Euclidean distances $D_{(j,i)}^k$ to its $j = 1, \ldots, K$ nearest neighbours are computed, where $K$ is set by knn_k.

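The minimum distance $$ \min D_i^k = \min_{j = 1, \ldots, K} D_{(j,i)}^k $$ that is, the distance from cell $i$ to its closest neighbour within the cluster, is used as a measure of local neighbourhood density: the larger this distance, the sparser the neighbourhood. Cells with large minimum distances are interpreted as extreme or non-representative cells.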
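
The minimum kNN distance can be illustrated on toy data with base R alone. This is a minimal sketch, not the package's implementation: the matrix `expr`, the brute-force use of `dist()`, and the variable names are assumptions made for illustration, and a dedicated kNN library would normally replace the full distance matrix on real data.

```r
# Toy data: 6 cells (rows) x 2 features; the last cell sits far from the rest.
expr <- rbind(
  c(0.0, 0.0),
  c(0.1, 0.0),
  c(0.0, 0.1),
  c(0.1, 0.1),
  c(0.2, 0.0),
  c(5.0, 5.0)
)
rownames(expr) <- paste0("cell", 1:6)

K <- 3  # number of nearest neighbours (plays the role of knn_k)

# Full pairwise Euclidean distance matrix (brute force, toy scale only).
d <- as.matrix(dist(expr, method = "euclidean"))
diag(d) <- Inf  # a cell is not its own neighbour

# One row per cell: sorted distances to its K nearest neighbours.
knn_dist <- t(apply(d, 1, function(row) sort(row)[seq_len(K)]))

# Minimum kNN distance per cell; equals the first column of knn_dist.
min_dist <- apply(knn_dist, 1, min)
```

In this toy example the isolated cell (`cell6`) receives by far the largest value in `min_dist`, which is the behaviour the trimming rule relies on.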

A fraction $\alpha$ (specified via keep_frac) of the most extreme cells is removed per cluster, defined as cells with $$ \min D_i^k > Q_{1 - \alpha} $$ where $Q_{1 - \alpha}$ is the $(1 - \alpha)$ quantile of the minimum kNN distance distribution within the cluster.
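
The per-cluster quantile rule can be sketched in base R as follows. The helper `flag_extreme()` is hypothetical and not part of the package, the simulated distances are illustrative only, and the use of R's default quantile definition (type 7) is an assumption, since the quantile type is not stated here.

```r
# Hypothetical helper: flag cells whose minimum kNN distance exceeds the
# (1 - alpha) quantile of that distance within their own cluster.
flag_extreme <- function(min_dist, cluster, alpha = 0.05) {
  stopifnot(length(min_dist) == length(cluster), alpha > 0, alpha < 1)
  # Per-cluster threshold Q_{1 - alpha}, expanded to one value per cell.
  threshold <- ave(
    min_dist, cluster,
    FUN = function(x) quantile(x, probs = 1 - alpha, names = FALSE)
  )
  min_dist > threshold
}

# Simulated minimum distances for two clusters on very different scales.
set.seed(1)
min_dist <- c(runif(100, 0, 1), runif(100, 0, 10))
cluster  <- rep(c("A", "B"), each = 100)

is_extreme <- flag_extreme(min_dist, cluster, alpha = 0.05)
table(cluster, is_extreme)
```

Because the threshold is computed separately for each cluster, both clusters lose the same share of cells even though cluster B has much larger distances overall; a single global threshold would instead remove cells almost exclusively from B.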

After trimming, the remaining cells are re-normalized and reprocessed using standard Seurat workflows. Cell type annotations are assigned using a **precomputed SingleR result** supplied by the user via pred, and cluster-specific marker genes are identified.
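
Since the row names of pred must correspond to cell barcodes, labels for the retained cells can be looked up by barcode. The sketch below uses a plain data.frame as a stand-in for a SingleR result (row names and `$labels` are accessed in the same way); the barcodes, labels, and the `kept_barcodes` vector are made up for illustration, and the function's internal matching may differ.

```r
# Stand-in for a SingleR result: row names are barcodes, $labels the annotation.
pred <- data.frame(
  labels = c("T cell", "B cell", "Monocyte", "T cell"),
  row.names = c("AAAC-1", "AAAG-1", "AACT-1", "AAGT-1"),
  stringsAsFactors = FALSE
)

# Hypothetical barcodes remaining after trimming (order differs from pred).
kept_barcodes <- c("AAGT-1", "AAAC-1", "AACT-1")

# Match by barcode rather than by position, and fail loudly on mismatches.
idx <- match(kept_barcodes, rownames(pred))
if (anyNA(idx)) stop("Some barcodes are missing from rownames(pred)")

labels <- setNames(pred$labels[idx], kept_barcodes)
```

Matching by name rather than by position matters here because trimming changes both the number and, potentially, the order of cells relative to the object on which pred was computed.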