airr_stats: Compute key immune repertoire statistics

Description

A family of functions that extract core descriptive statistics from an ImmunData object.

Available functions

Supported methods are the following.

airr_stats_chains --- count V(D)J chains per repertoire (optionally split by locus). Quickly gauges capture depth per repertoire and, when split by locus, reveals TRA/TRB/IGH balance. Use it for QC, library-size checks, and to spot locus-specific dropouts or over-representation.

airr_stats_lengths --- count the number of sequence lengths per repertoire. Summarizes the CDR3 length distribution, a sensitive QC fingerprint of repertoire prep and selection. Helpful for detecting primer/UMI biases, comparing cohorts, and deriving length-based features for models.

airr_stats_genes - count V(D)J gene segments per repertoire, optionally split by locus and using either receptor counts or barcode/UMI counts as the measure. Profiles V/D/J gene usage to characterize repertoire composition and germline biases, with optional locus split. Useful for cohort comparisons, flagging clonal expansions, and producing ML-ready features for repertoire-level ML tasks.

Usage

airr_stats_chains(
  idata,
  locus_col = NA,
  autojoin = getOption("immundata.autojoin", TRUE),
  format = c("long", "wide")
)
airr_stats_lengths(
  idata,
  seq_col = "cdr3_aa",
  autojoin = getOption("immundata.autojoin", TRUE),
  format = c("long", "wide")
)
airr_stats_genes(
  idata,
  gene_col = "v_call",
  level = c("receptor", "barcode"),
  by = c(NA, "locus"),
  autojoin = getOption("immundata.autojoin", TRUE),
  format = c("long", "wide")
)

Value

`airr_stats_chains` Returns a tibble with columns:

repertoire_id -- repertoire identifier
locus -- TRA, TRB, IGH, ... (present only if locus_col is supplied)
n_chains -- number of chains

`airr_stats_lengths` Returns a tibble with columns:

repertoire_id -- repertoire identifier
seq_len -- lengths of sequences
n -- number of receptors

`airr_stats_genes` A tibble with columns:

repertoire_id - repertoire identifier
(optional) locus - TRA, TRB, IGH, ... (present only when by = "locus" and the locus column exists)
<gene_col> - the gene segment value (e.g., V gene)
n - the measure:
- if level = "receptor": number of receptors carrying the gene segment
- if level = "barcode": sum of counts across receptors for the segment

Arguments

idata: An ImmunData object.
locus_col: Column in idata$annotations that stores the locus (e.g. "locus"). If NULL or missing, the result is not split by locus.
autojoin: Logical. If TRUE, join repertoire metadata by the schema repertoire id. Change the default behaviour by calling options(immunarch.autojoin = FALSE).
format: String. One of "long" ("long" tibble with imd_repertoire_id, facet columns, and value; useful for visualizations) or "wide" (wide/unmelted table of features, with each row corresponding to a specific repertoire / pair of repertoires; useful for Machine Learning).
seq_col: Character vector with names of the columns containing sequences.
gene_col: A single column name in idata$annotations with gene segment calls (e.g., "v_call", "d_call", "j_call", "c_call"). Default is "v_call".
level: One of "receptor" or "barcode". If "receptor" (default), the function counts unique receptors (one per receptor ID) that carry a given gene segment. If "barcode", the function sums counts (e.g., cells/UMIs) per gene segment using the column defined by immundata::imd_schema("count").
by: Either NULL (no split) or "locus". When "locus", the result is further split by the locus column if present (as given by immundata::imd_schema("locus")); otherwise a warning is emitted and the split is ignored.

Examples

Run this code

# Limit the number of threads used by the underlying DB for this session.
# Change this only if you know what you're doing (e.g., multi-user machines, shared CI/servers).
db_exec("SET threads TO 2")

# Load data
if (FALSE) {
immdata <- get_test_idata() |> agg_repertoires("Therapy")
}

#
# airr_stats_chains
#

if (FALSE) {
airr_stats_chains(immdata)
}

#
# airr_stats_lengths
#

if (FALSE) {
airr_stats_lengths(immdata)
}

#
# airr_stats_genes
#

if (FALSE) {
# V gene usage by receptor count
airr_stats_genes(immdata, gene_col = "v_call", level = "receptor")

# V gene usage by summed cell/UMI counts (if a count column is present)
airr_stats_genes(immdata, gene_col = "v_call", level = "barcode")

# Split by locus (TRA/TRB/... if locus column exists)
airr_stats_genes(immdata, gene_col = "v_call", level = "receptor", by = "locus")
}

Run the code above in your browser using DataLab

Description

Available functions

Usage

Value

airr_stats_chains Returns a tibble with columns:

airr_stats_lengths Returns a tibble with columns:

airr_stats_genes A tibble with columns:

Arguments

See Also

Examples

`airr_stats_chains` Returns a tibble with columns:

`airr_stats_lengths` Returns a tibble with columns:

`airr_stats_genes` A tibble with columns: