Usage
xGRviaGeneAnno(data.file, background.file = NULL,
format.file = c("data.frame", "bed", "chr:start-end", "GRanges"),
build.conversion = c(NA, "hg38.to.hg19", "hg18.to.hg19"), gap.max = 0,
GR.Gene = c("UCSC_knownGene", "UCSC_knownCanonical"), ontology =
c("GOBP",
"GOMF", "GOCC", "PS", "PS2", "SF", "Pfam", "DO", "HPPA", "HPMI",
"HPCM",
"HPMA", "MP", "MsigdbH", "MsigdbC1", "MsigdbC2CGP", "MsigdbC2CPall",
"MsigdbC2CP", "MsigdbC2KEGG", "MsigdbC2REACTOME", "MsigdbC2BIOCARTA",
"MsigdbC3TFT", "MsigdbC3MIR", "MsigdbC4CGN", "MsigdbC4CM",
"MsigdbC5BP",
"MsigdbC5MF", "MsigdbC5CC", "MsigdbC6", "MsigdbC7", "DGIdb"),
size.range = c(10, 2000), min.overlap = 3, which.distance = NULL,
test = c("hypergeo", "fisher", "binomial"), p.adjust.method = c("BH",
"BY", "bonferroni", "holm", "hochberg", "hommel"),
ontology.algorithm = c("none", "pc", "elim", "lea"), elim.pvalue =
0.01,
lea.depth = 2, path.mode = c("all_paths", "shortest_paths",
"all_shortest_paths"), true.path.rule = F, verbose = T,
RData.location = "http://galahad.well.ox.ac.uk/bigdata")
Arguments
data.file
an input data file, containing a list of genomic
regions to test. If the input file is formatted as a 'data.frame'
(specified by the parameter 'format.file' below), the first three
columns correspond to the chromosome (1st column), the starting
chromosome position (2nd column), and the ending chromosome position
(3rd column). If the format is indicated as 'bed' (browser extensible
data), the same as 'data.frame' format but the position is 0-based
offset from chromomose position. If the genomic regions provided are
not ranged but only the single position, the ending chromosome position
(3rd column) is allowed not to be provided. If the format is indicated
as "chr:start-end", instead of using the first 3 columns, only the
first column will be used and processed. If the file also contains
other columns, these additional columns will be ignored. Alternatively,
the input file can be the content itself assuming that input file has
been read. Note: the file should use the tab delimiter as the field
separator between columns
background.file
an input background file containing a list of
genomic regions as the test background. The file format is the same as
'data.file'. By default, it is NULL meaning all annotatable bases (ig
non-redundant bases covered by 'annotation.file') are used as
background. However, if only one annotation (eg only a transcription
factor) is provided in 'annotation.file', the background must be
provided.
format.file
the format for input files. It can be one of
"data.frame", "chr:start-end", "bed" or "GRanges"
build.conversion
the conversion from one genome build to
another. The conversions supported are "hg38.to.hg19" and
"hg18.to.hg19". By default it is NA (no need to do so)
gap.max
the maximum distance to nearby genes. Only those genes
no far way from this distance will be considered as nearby genes. By
default, it is 0 meaning that nearby genes are those overlapping with
genomic regions
GR.Gene
the genomic regions of genes. By default, it is
'UCSC_knownGene', that is, UCSC known genes (together with genomic
locations) based on human genome assembly hg19. It can be
'UCSC_knownCanonical', that is, UCSC known canonical genes (together
with genomic locations) based on human genome assembly hg19.
Alternatively, the user can specify the customised input. To do so,
first save your RData file (containing an GR object) into your local
computer, and make sure the GR object content names refer to Gene
Symbols. Then, tell "GR.Gene" with your RData file name (with or
without extension), plus specify your file RData path in
"RData.location"
ontology
the ontology supported currently. It can be "GOBP" for
Gene Ontology Biological Process, "GOMF" for Gene Ontology Molecular
Function, "GOCC" for Gene Ontology Cellular Component, "PS" for
phylostratific age information, "PS2" for the collapsed PS version
(inferred ancestors being collapsed into one with the known taxonomy
information), "SF" for SCOP domain superfamilies, "Pfam" for Pfam
domain families, "DO" for Disease Ontology, "HPPA" for Human Phenotype
Phenotypic Abnormality, "HPMI" for Human Phenotype Mode of Inheritance,
"HPCM" for Human Phenotype Clinical Modifier, "HPMA" for Human
Phenotype Mortality Aging, "MP" for Mammalian Phenotype, and Drug-Gene
Interaction database (DGIdb) for drugable categories, and the molecular
signatures database (Msigdb, including "MsigdbH", "MsigdbC1",
"MsigdbC2CGP", "MsigdbC2CPall", "MsigdbC2CP", "MsigdbC2KEGG",
"MsigdbC2REACTOME", "MsigdbC2BIOCARTA", "MsigdbC3TFT", "MsigdbC3MIR",
"MsigdbC4CGN", "MsigdbC4CM", "MsigdbC5BP", "MsigdbC5MF", "MsigdbC5CC",
"MsigdbC6", "MsigdbC7")
size.range
the minimum and maximum size of members of each term
in consideration. By default, it sets to a minimum of 10 but no more
than 2000
min.overlap
the minimum number of overlaps. Only those terms
with members that overlap with input data at least min.overlap (3 by
default) will be processed
which.distance
which terms with the distance away from the
ontology root (if any) is used to restrict terms in consideration. By
default, it sets to 'NULL' to consider all distances
test
the test statistic used. It can be "fisher" for using
fisher's exact test, "hypergeo" for using hypergeometric test, or
"binomial" for using binomial test. Fisher's exact test is to test the
independence between gene group (genes belonging to a group or not) and
gene annotation (genes annotated by a term or not), and thus compare
sampling to the left part of background (after sampling without
replacement). Hypergeometric test is to sample at random (without
replacement) from the background containing annotated and non-annotated
genes, and thus compare sampling to background. Unlike hypergeometric
test, binomial test is to sample at random (with replacement) from the
background with the constant probability. In terms of the ease of
finding the significance, they are in order: hypergeometric test >
fisher's exact test > binomial test. In other words, in terms of the
calculated p-value, hypergeometric test < fisher's exact test <
binomial test
p.adjust.method
the method used to adjust p-values. It can be
one of "BH", "BY", "bonferroni", "holm", "hochberg" and "hommel". The
first two methods "BH" (widely used) and "BY" control the false
discovery rate (FDR: the expected proportion of false discoveries
amongst the rejected hypotheses); the last four methods "bonferroni",
"holm", "hochberg" and "hommel" are designed to give strong control of
the family-wise error rate (FWER). Notes: FDR is a less stringent
condition than FWER
ontology.algorithm
the algorithm used to account for the
hierarchy of the ontology. It can be one of "none", "pc", "elim" and
"lea". For details, please see 'Note' below
elim.pvalue
the parameter only used when "ontology.algorithm" is
"elim". It is used to control how to declare a signficantly enriched
term (and subsequently all genes in this term are eliminated from all
its ancestors)
lea.depth
the parameter only used when "ontology.algorithm" is
"lea". It is used to control how many maximum depth is used to consider
the children of a term (and subsequently all genes in these children
term are eliminated from the use for the recalculation of the
signifance at this term)
path.mode
the mode of paths induced by vertices/nodes with input
annotation data. It can be "all_paths" for all possible paths to the
root, "shortest_paths" for only one path to the root (for each node in
query), "all_shortest_paths" for all shortest paths to the root (i.e.
for each node, find all shortest paths with the equal lengths)
true.path.rule
logical to indicate whether the true-path rule
should be applied to propagate annotations. By default, it sets to
false
verbose
logical to indicate whether the messages will be
displayed in the screen. By default, it sets to false for no display
RData.location
the characters to tell the location of built-in
RData files. See xRDataLoader
for details