filter_genesets: filter a geneset table; intersect with an array of genes-of-interest then apply cutoffs on min/max genes per geneset

Description

Filter a geneset table: intersect each geneset with an array of genes-of-interest, then apply cutoffs on the minimum and maximum number of genes per geneset.

Usage

filter_genesets(
  genesets,
  genelist,
  min_overlap = 10L,
  max_overlap = 1500L,
  max_overlap_fraction = 0.5,
  min_signif = NA,
  max_size = NA,
  dedupe = FALSE
)

Value

the input genesets table, reduced to the subset of rows (genesets) that pass the user's filter parameters

Arguments

genesets: tibble with genesets, must contain columns 'id', 'genes' and 'ngenes'
genelist: tibble with genes, must contain columns 'gene' and 'signif'. gene = character column, whose values are matched against the list column 'genes' in the genesets tibble. signif = boolean column (you can set all values to FALSE if you are not performing a Fisher-exact or hypergeometric test downstream)
min_overlap: integer, minimum number of genes in the genelist table that must match a geneset. Must be at least 1, but when using the GOAT algorithm downstream this should be set to at least 10 (default=10). For example, when set to 10, only genesets that contain at least 10 genes that are also in your genelist are retained.
max_overlap: integer, maximum number of genes in the genelist table that may match a geneset; genesets with more matching genes are removed. Set to NA to disable
max_overlap_fraction: analogous to max_overlap, but instead of limiting the geneset size (after intersection with the genelist) to a fixed N, this parameter sets the limit as a fraction of the input genelist length. For example, setting this to 0.5 will remove all genesets that contain more than half the genes in the input genelist (testing enrichment of a geneset that contains 1000 out of a total of 1200 genes from your input genelist is probably meaningless). Defaults to 0.5 (i.e. 50%)
min_signif: integer, minimum number of genes in the genelist table that are signif==TRUE and match a geneset. This is an expert setting for debugging and algorithm evaluation/benchmarking, NOT for regular geneset analyses. Be careful: this is "prefiltering" and will affect the correctness / calibration of estimated geneset p-values. For GOAT and GSEA, this is NOT RECOMMENDED and will cause bias in your dataset! Set to NA to disable (default)
max_size: integer, maximum number of genes in the geneset (i.e. prior to intersection with the user's gene list provided as genelist). Optionally, use this to remove highly generic terms. Set to NA to disable
dedupe: boolean, remove duplicate genesets (as determined after intersection with genelist)
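
Examples

The filter rules described under Arguments can be illustrated with a small base-R sketch. This is not the package's implementation: `sketch_filter` and the toy data are hypothetical, plain data frames stand in for tibbles, and min_signif is left out. It also assumes that the upper cutoffs are inclusive and that dedupe keeps the first of each set of identical genesets; the text above does not specify either detail.

```r
# Toy inputs; all identifiers and values are made up for illustration.
genesets <- data.frame(id = c("gs_a", "gs_b", "gs_c", "gs_d"),
                       stringsAsFactors = FALSE)
genesets$genes <- list(
  c("g1", "g2", "g3"),
  c("g1", "g2", "g3", "g4", "g5", "g6"),
  c("g7", "g8", "x1", "x2"),
  c("g3", "g2", "g1", "x3")
)
genesets$ngenes <- lengths(genesets$genes)

genelist <- data.frame(gene = paste0("g", 1:10), signif = FALSE,
                       stringsAsFactors = FALSE)

# Hypothetical helper that mimics the documented filters (min_signif omitted).
sketch_filter <- function(genesets, genelist, min_overlap = 10L,
                          max_overlap = 1500L, max_overlap_fraction = 0.5,
                          max_size = NA, dedupe = FALSE) {
  # genes of each geneset that are also present in the genelist
  overlap <- lapply(genesets$genes, intersect, genelist$gene)
  n_overlap <- lengths(overlap)

  keep <- n_overlap >= min_overlap
  if (!is.na(max_overlap)) keep <- keep & n_overlap <= max_overlap
  # assumption: a geneset covering exactly the given fraction is retained
  keep <- keep & n_overlap <= max_overlap_fraction * nrow(genelist)
  # max_size applies to the geneset size before intersection
  if (!is.na(max_size)) keep <- keep & genesets$ngenes <= max_size

  result <- genesets[keep, , drop = FALSE]
  if (dedupe) {
    # genesets are duplicates when their genelist overlap is identical;
    # assumption: the first occurrence is the one that is kept
    key <- vapply(overlap[keep],
                  function(g) paste(sort(g), collapse = ","), character(1))
    result <- result[!duplicated(key), , drop = FALSE]
  }
  result
}

filtered <- sketch_filter(genesets, genelist, min_overlap = 2L,
                          max_overlap = NA, dedupe = TRUE)
print(filtered$id)
```

In this toy case gs_b is removed because 6 of the 10 genelist genes match it, which exceeds the 0.5 fraction, and gs_d is removed by dedupe because its overlap with the genelist is identical to that of gs_a. The call therefore retains gs_a and gs_c.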