Learn R Programming

immunarch (version 0.10.3)

airr_stats: Compute key immune repertoire statistics

Description

[Experimental]

A family of functions that extract core descriptive statistics from an ImmunData object.

Available functions

Supported methods are the following.

airr_stats_chains --- count V(D)J chains per repertoire (optionally split by locus). Quickly gauges capture depth per repertoire and, when split by locus, reveals TRA/TRB/IGH balance. Use it for QC, library-size checks, and to spot locus-specific dropouts or over-representation.

airr_stats_lengths --- count the number of sequence lengths per repertoire. Summarizes the CDR3 length distribution, a sensitive QC fingerprint of repertoire prep and selection. Helpful for detecting primer/UMI biases, comparing cohorts, and deriving length-based features for models.

airr_stats_genes - count V(D)J gene segments per repertoire, optionally split by locus and using either receptor counts or barcode/UMI counts as the measure. Profiles V/D/J gene usage to characterize repertoire composition and germline biases, with optional locus split. Useful for cohort comparisons, flagging clonal expansions, and producing ML-ready features for repertoire-level ML tasks.

Usage

airr_stats_chains(
  idata,
  locus_col = NA,
  autojoin = getOption("immundata.autojoin", TRUE),
  format = c("long", "wide")
)

airr_stats_lengths( idata, seq_col = "cdr3_aa", autojoin = getOption("immundata.autojoin", TRUE), format = c("long", "wide") )

airr_stats_genes( idata, gene_col = "v_call", level = c("receptor", "barcode"), by = c(NA, "locus"), autojoin = getOption("immundata.autojoin", TRUE), format = c("long", "wide") )

Value

airr_stats_chains Returns a tibble with columns:

  • repertoire_id -- repertoire identifier

  • locus -- TRA, TRB, IGH, ... (present only if locus_col is supplied)

  • n_chains -- number of chains

airr_stats_lengths Returns a tibble with columns:

  • repertoire_id -- repertoire identifier

  • seq_len -- lengths of sequences

  • n -- number of receptors

airr_stats_genes A tibble with columns:

  • repertoire_id - repertoire identifier

  • (optional) locus - TRA, TRB, IGH, ... (present only when by = "locus" and the locus column exists)

  • <gene_col> - the gene segment value (e.g., V gene)

  • n - the measure:

    • if level = "receptor": number of receptors carrying the gene segment

    • if level = "barcode": sum of counts across receptors for the segment

Arguments

idata

An ImmunData object.

locus_col

Column in idata$annotations that stores the locus (e.g. "locus"). If NULL or missing, the result is not split by locus.

autojoin

Logical. If TRUE, join repertoire metadata by the schema repertoire id. Change the default behaviour by calling options(immunarch.autojoin = FALSE).

format

String. One of "long" ("long" tibble with imd_repertoire_id, facet columns, and value; useful for visualizations) or "wide" (wide/unmelted table of features, with each row corresponding to a specific repertoire / pair of repertoires; useful for Machine Learning).

seq_col

Character vector with names of the columns containing sequences.

gene_col

A single column name in idata$annotations with gene segment calls (e.g., "v_call", "d_call", "j_call", "c_call"). Default is "v_call".

level

One of "receptor" or "barcode". If "receptor" (default), the function counts unique receptors (one per receptor ID) that carry a given gene segment. If "barcode", the function sums counts (e.g., cells/UMIs) per gene segment using the column defined by immundata::imd_schema("count").

by

Either NULL (no split) or "locus". When "locus", the result is further split by the locus column if present (as given by immundata::imd_schema("locus")); otherwise a warning is emitted and the split is ignored.

See Also

immundata::ImmunData

Examples

Run this code
# Limit the number of threads used by the underlying DB for this session.
# Change this only if you know what you're doing (e.g., multi-user machines, shared CI/servers).
db_exec("SET threads TO 2")

# Load data
if (FALSE) {
immdata <- get_test_idata() |> agg_repertoires("Therapy")
}

#
# airr_stats_chains
#

if (FALSE) {
airr_stats_chains(immdata)
}

#
# airr_stats_lengths
#

if (FALSE) {
airr_stats_lengths(immdata)
}

#
# airr_stats_genes
#

if (FALSE) {
# V gene usage by receptor count
airr_stats_genes(immdata, gene_col = "v_call", level = "receptor")

# V gene usage by summed cell/UMI counts (if a count column is present)
airr_stats_genes(immdata, gene_col = "v_call", level = "barcode")

# Split by locus (TRA/TRB/... if locus column exists)
airr_stats_genes(immdata, gene_col = "v_call", level = "receptor", by = "locus")
}

Run the code above in your browser using DataLab