Identifies and removes extreme (outlier) cells within each cell type or cluster based on k-nearest neighbour (kNN) distances computed in the normalized high-dimensional gene expression space. Cells located in sparsely populated regions at the periphery of clusters are excluded prior to downstream analyses.
scTrimDist(
seurat_obj,
celltype_col,
knn_k = 30,
keep_frac = 0.05,
normalization_method = "LogNormalize",
nfeatures = 2000,
assay = "RNA",
npcs = 20,
resolution = 0.5,
log2FC_filter = 1,
pred,
verbose = TRUE
)
A named list containing:
plot_outliers: ggplot showing t-SNE with outliers highlighted.
trimmed_object: Seurat object after trimming and reprocessing.
all_markers: Data frame of marker genes.
knn_res: List of kNN results per cell type.
seurat_obj: A Seurat object containing single-cell expression data.
celltype_col: Character scalar specifying the column in seurat_obj@meta.data defining cell types or clusters.
knn_k: Integer specifying the number of nearest neighbours.
keep_frac: Numeric in (0,1) specifying the fraction of most extreme cells to remove per cell type. Despite the name, this is the fraction removed, not the fraction kept; the default of 0.05 removes the most extreme 5% of cells.
normalization_method: Normalization method passed to Seurat::NormalizeData.
nfeatures: Number of variable features selected.
assay: Assay used for expression data extraction.
npcs: Number of principal components used downstream.
resolution: Clustering resolution for FindClusters.
log2FC_filter: Minimum log2 fold-change threshold for marker filtering. If NULL, no filtering is applied.
pred: A SingleR result object. Row names must correspond to cell barcodes; pred$labels is used for annotation.
verbose: Logical indicating whether progress messages are printed.
For each cell type (or cluster), a kNN search is performed using the normalized gene expression matrix obtained from a standard Seurat preprocessing workflow. For a given cell \(i\) in cluster \(k\), the Euclidean distances \(D_{(j,i)}^k\) to its \(j = 1, \ldots, K\) nearest neighbours are computed, where \(K\) is set by knn_k.
The minimum distance $$ \min D_i^k = \min_{j = 1, \ldots, K} D_{(j,i)}^k $$ is used as an inverse measure of local neighbourhood density: the larger the value, the sparser the neighbourhood. Cells with large minimum distances are interpreted as extreme or non-representative cells.
A fraction \(\alpha\) (specified via keep_frac) of the most extreme cells
is removed per cluster, defined as cells with
$$
\min D_i^k > Q_{1 - \alpha}
$$
where \(Q_{1 - \alpha}\) is the \((1 - \alpha)\) quantile of the minimum
kNN distance distribution within the cluster.
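The trimming rule above can be sketched in base R for a single cluster. This is a minimal illustration under stated assumptions, not the package's implementation: the helper name trim_cluster, the toy matrix, and the use of dist() in place of a dedicated kNN search are all introduced here for the example.

```r
# Hypothetical helper: returns a logical vector, TRUE = cell is kept.
# `x` is a cells x features matrix of normalized expression for ONE cluster.
trim_cluster <- function(x, knn_k = 30, keep_frac = 0.05) {
  n <- nrow(x)
  k <- min(knn_k, n - 1)            # cannot have more neighbours than other cells
  d <- as.matrix(dist(x))           # pairwise Euclidean distances
  diag(d) <- Inf                    # a cell is not its own neighbour
  # Distances to the k nearest neighbours of every cell (k x n matrix)
  knn_d <- apply(d, 2, function(col) sort(col)[seq_len(k)])
  knn_d <- matrix(knn_d, nrow = k)  # keep the matrix shape when k == 1
  # Minimum over j = 1..K, i.e. the distance to the closest neighbour
  min_d <- apply(knn_d, 2, min)
  # (1 - alpha) quantile of the minimum-distance distribution
  q <- quantile(min_d, probs = 1 - keep_frac, names = FALSE)
  min_d <= q                        # drop cells with min distance > Q_(1 - alpha)
}

set.seed(1)
# Toy cluster: 99 tightly grouped cells plus one far-away cell (row 100)
x <- rbind(matrix(rnorm(99 * 5), ncol = 5), rep(25, 5))
keep <- trim_cluster(x, knn_k = 10, keep_frac = 0.05)
sum(!keep)   # number of cells flagged as extreme
keep[100]    # FALSE: the isolated cell is removed
```

Because mutual nearest neighbours share the same minimum distance, ties at the quantile can leave slightly fewer than a fraction keep_frac of the cells removed in this sketch.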
After trimming, the remaining cells are re-normalized and reprocessed using standard Seurat workflows. Cell type annotations are assigned using a **precomputed SingleR result** supplied by the user, and cluster-specific marker genes are identified.
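The annotation and marker-filtering steps described above can be illustrated with a small base R sketch. The objects below are stand-ins: a plain data frame plays the role of the SingleR result (only its row names and labels column are used), the barcodes and marker values are made up, and the use of a `>=` comparison on the avg_log2FC column (the column name produced by Seurat::FindAllMarkers) is an assumption about how log2FC_filter is applied.

```r
# Stand-in for a SingleR result: row names are cell barcodes, `labels` holds calls.
pred <- data.frame(
  labels = c("T cell", "B cell", "Monocyte", "T cell"),
  row.names = c("AAAC-1", "AAAG-1", "AACT-1", "AAGG-1"),
  stringsAsFactors = FALSE
)

# Hypothetical barcodes remaining after trimming ("AAAG-1" was removed)
kept_cells <- c("AAGG-1", "AAAC-1", "AACT-1")

# Look up labels by barcode so the row order of `pred` does not matter
idx <- match(kept_cells, rownames(pred))
stopifnot(!anyNA(idx))  # every kept cell needs an annotation
cell_labels <- setNames(pred$labels[idx], kept_cells)

# Mock marker table with the `avg_log2FC` column used by Seurat::FindAllMarkers
markers <- data.frame(
  gene = c("CD3E", "MS4A1", "LYZ"),
  cluster = c("T cell", "B cell", "Monocyte"),
  avg_log2FC = c(2.1, 0.4, 1.6),
  stringsAsFactors = FALSE
)

# Assumed filtering rule: keep markers at or above the threshold; skip if NULL
log2FC_filter <- 1
if (!is.null(log2FC_filter)) {
  markers <- markers[markers$avg_log2FC >= log2FC_filter, , drop = FALSE]
}
```

Matching on barcodes rather than on position is why the row names of pred must correspond to cell barcodes: after trimming, the set and order of cells in the object need not match those in pred.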