sig_tally: Tally a Genomic Alteration Object

Description

Tally a variation object such as a MAF and return a matrix for NMF decomposition and more. This is a generic function, so it can be extended to other mutation types. See Details for how to set sex when identifying copy number signatures. See https://osf.io/s93d5/ for how the SBS, DBS and ID (INDEL) components are generated.

Usage

sig_tally(object, ...)
# S3 method for CopyNumber
sig_tally(
  object,
  method = "Wang",
  ignore_chrs = NULL,
  feature_setting = sigminer::CN.features,
  type = c("probability", "count"),
  reference_components = FALSE,
  cores = 1,
  seed = 123456,
  min_comp = 2,
  max_comp = 15,
  min_prior = 0.001,
  model_selection = "BIC",
  threshold = 0.1,
  nrep = 1,
  niter = 1000,
  keep_only_matrix = FALSE,
  ...
)
# S3 method for MAF
sig_tally(
  object,
  mode = c("SBS", "DBS", "ID", "ALL"),
  ref_genome = NULL,
  genome_build = NULL,
  add_trans_bias = FALSE,
  ignore_chrs = NULL,
  use_syn = TRUE,
  keep_only_matrix = FALSE,
  ...
)

Arguments

object

a CopyNumber object or MAF object.

...

custom settings for operating on object. For details, see the S3 method for the corresponding class (e.g. CopyNumber).

method

method for feature classification, one of "Macintyre" ("M") or "Wang" ("W").

ignore_chrs

Chromosomes to ignore in the analysis, e.g. chrX and chrY.

feature_setting

a data.frame used for classification. Only used when method is "Wang" ("W"). Default is CN.features. Users can also supply a custom data.frame containing "feature", "min" and "max" columns. Valid features can be printed with unique(CN.features$feature).
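As a minimal sketch of a custom input, the following builds a data.frame with the three required columns and checks its shape before it would be passed as feature_setting. The feature names and ranges are placeholders, and check_feature_setting is a hypothetical helper that is not part of sigminer; the min <= max rule is likewise an assumption. Use unique(CN.features$feature) to find the valid feature names.

```r
# Hypothetical helper (not part of sigminer): basic sanity checks for a
# custom feature_setting table.
check_feature_setting <- function(x) {
  required <- c("feature", "min", "max")
  missing_cols <- setdiff(required, colnames(x))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Assumed rule: a range is only meaningful when min does not exceed max.
  if (any(x$min > x$max, na.rm = TRUE)) {
    stop("Each 'min' must not exceed its 'max'.")
  }
  invisible(x)
}

# Placeholder feature names and ranges, for illustration only.
custom_setting <- data.frame(
  feature = c("featA", "featA", "featB"),
  min = c(0, 2, 0),
  max = c(2, Inf, 1),
  stringsAsFactors = FALSE
)
check_feature_setting(custom_setting)
```

A table that passes these checks could then be supplied as sig_tally(cn, method = "W", feature_setting = custom_setting), provided its feature names are valid.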

type

one of "probability" or "count". The default, "probability", returns a matrix with the sum of posterior probabilities for each component. "count" returns a matrix with the event count assigned to each component. The two types should give similar results. Only used when method is "Macintyre".

reference_components

default is FALSE, which calculates mixture components from the CopyNumber object. Only used when method is "Macintyre".

cores

number of compute cores used to run this task. Use future::availableCores() to check how many cores are available.

seed

seed number. Only used when method is "Macintyre".

min_comp

minimum number of components to fit, default is 2. Can also be a vector of length 6, with one value per feature. Only used when method is "Macintyre".

max_comp

maximum number of components to fit, default is 15. Can also be a vector of length 6, with one value per feature. Only used when method is "Macintyre".

min_prior

the minimum relative size of components, default is 0.001. For details about custom settings, refer to the flexmix package. Only used when method is "Macintyre".

model_selection

model selection strategy, default is 'BIC'. For details about custom settings, refer to the flexmix package. Only used when method is "Macintyre".

threshold

default is 0.1. The fitted components sometimes include adjacent distributions with similar mu (two or more distributions that are very close together); this threshold is used to obtain a more meaningful fit with fewer components. Only used when method is "Macintyre".

nrep

number of runs for each number of components; only the solution with maximum likelihood is kept. Only used when method is "Macintyre".

niter

the maximum number of iterations. Only used when method is "Macintyre".

keep_only_matrix

if TRUE, keep only the matrix used for signature extraction. For a MAF object, this returns just the most useful matrix.

mode

type of mutation matrix to extract, can be one of 'SBS', 'DBS', 'ID' and 'ALL'.

ref_genome

BSgenome object or name of the installed BSgenome package, e.g. BSgenome.Hsapiens.UCSC.hg19. Default is NULL, which tries to auto-detect from installed genomes.

genome_build

genome build, 'hg19' or 'hg38'. If not set, it is guessed from ref_genome.

add_trans_bias

if TRUE, consider transcriptional bias categories. 'T:' for Transcribed (the variant is on the transcribed strand); 'U:' for Un-transcribed (the variant is on the untranscribed strand); 'B:' for Bi-directional (the variant is on both strands and is transcribed either way); 'N:' for Non-transcribed (the variant is in a non-coding region and is untranslated); 'Q:' for Questionable. NOTE: the resulting counts for the 'B' and 'N' labels differ slightly from those of SigProfilerMatrixGenerator. The reason is unknown (it may be caused by the annotation file).
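To illustrate how the strand prefixes can be used downstream, the sketch below sums a sample-by-component matrix by its prefix label. The matrix and its component names ("compA", "compB") are made up for illustration; only the 'T:', 'U:' and 'N:' prefixes come from the description above, and the real column names returned by sig_tally may differ.

```r
# Toy sample-by-component matrix; component names are hypothetical.
toy <- matrix(
  c(3, 1, 0, 2,
    1, 4, 2, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("s1", "s2"),
    c("T:compA", "U:compA", "T:compB", "N:compB")
  )
)

# Extract the label before the first colon ("T", "U", "N", ...).
strand <- sub(":.*$", "", colnames(toy))

# rowsum() groups rows, so transpose, sum by label, and transpose back.
by_strand <- t(rowsum(t(toy), group = strand))
by_strand
```

The result has one column per strand label, which makes it easy to compare, for example, transcribed and untranscribed totals per sample.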

use_syn

Logical. Whether to include synonymous variants in the analysis. Defaults to TRUE.

Value

a list containing a matrix used for NMF decomposition.

Methods (by class)

CopyNumber: Returns copy number features, components and component-by-sample matrix
MAF: Returns SBS mutation sample-by-component matrix and APOBEC enrichment

Details

To identify copy number signatures, copy number features must be derived first. Because copy number values on the sex chromosomes differ between males and females, an extra step is needed unless those chromosomes are ignored.

Two options control this. Their default values are shown below; set them in the same way (once per R session).

options(sigminer.sex = "female", sigminer.copynumber.max = NA_integer_)

If your cohort is all female, you can ignore this entirely.
If your cohort is all male, set sigminer.sex to 'male' and sigminer.copynumber.max to a proper value (ideally consistent with read_copynumber).
If your cohort contains both males and females, set sigminer.sex to a data.frame with two columns, "sample" and "sex", and set sigminer.copynumber.max to a proper value (ideally consistent with read_copynumber).
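For the mixed cohort case, a minimal sketch might look like the following. The sample names and the value 20L are assumptions chosen only for illustration; use the sample IDs in your own CopyNumber object and a maximum consistent with what you passed to read_copynumber.

```r
# Hypothetical sample IDs; replace with those in your CopyNumber object.
sex_df <- data.frame(
  sample = c("sample1", "sample2", "sample3"),
  sex = c("female", "male", "female"),
  stringsAsFactors = FALSE
)

# 20L is an assumed cap, used here only as an example value.
options(sigminer.sex = sex_df, sigminer.copynumber.max = 20L)

# Confirm what is currently set for this R session.
getOption("sigminer.sex")
getOption("sigminer.copynumber.max")
```

Because these are session-level options, they need to be set again in each new R session before calling sig_tally.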

References

Macintyre, Geoff, et al. "Copy number signatures and mutational processes in ovarian carcinoma." Nature genetics 50.9 (2018): 1262.

Mayakonda, Anand, et al. "Maftools: efficient and comprehensive analysis of somatic variants in cancer." Genome research 28.11 (2018): 1747-1756.

Roberts SA, Lawrence MS, Klimczak LJ, et al. An APOBEC Cytidine Deaminase Mutagenesis Pattern is Widespread in Human Cancers. Nature genetics. 2013;45(9):970-976. doi:10.1038/ng.2702.

Bergstrom EN, Huang MN, Mahto U, Barnes M, Stratton MR, Rozen SG, Alexandrov LB: SigProfilerMatrixGenerator: a tool for visualizing and exploring patterns of small mutational events. BMC Genomics 2019, 20:685 https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6041-2

Examples

# NOT RUN {
# Load copy number object
load(system.file("extdata", "toy_copynumber.RData",
  package = "sigminer", mustWork = TRUE
))
# }
# NOT RUN {
# Use method designed by Wang, Shixiang et al.
cn_tally_W <- sig_tally(cn, method = "W")
# Use method designed by Macintyre et al.
cn_tally_M <- sig_tally(cn, method = "M")
# }
# NOT RUN {
# Prepare SBS signature analysis
laml.maf <- system.file("extdata", "tcga_laml.maf.gz", package = "maftools")
laml <- read_maf(maf = laml.maf)
if (require("BSgenome.Hsapiens.UCSC.hg19")) {
  mt_tally <- sig_tally(
    laml,
    ref_genome = "BSgenome.Hsapiens.UCSC.hg19",
    use_syn = TRUE
  )
  mt_tally$nmf_matrix[1:5, 1:5]

  ## Use strand bias categories
  mt_tally <- sig_tally(
    laml,
    ref_genome = "BSgenome.Hsapiens.UCSC.hg19",
    use_syn = TRUE, add_trans_bias = TRUE
  )
  ## Test it by enrichment analysis
  enrich_component_strand_bias(mt_tally$nmf_matrix)
  enrich_component_strand_bias(mt_tally$all_matrices$SBS_24)
} else {
  message("Please install package 'BSgenome.Hsapiens.UCSC.hg19' firstly!")
}
# }
