chipenrich
to run the method on their data. A number of arguments can be provided to change the type of test, the genome build, which sets of genes to test, how peaks are assigned to genes, and other minor options.
chipenrich(peaks, out_name = "chipenrich", out_path = getwd(), genome = "hg19",
genesets = c("GOBP","GOCC","GOMF"),
locusdef = "nearest_tss", method = "chipenrich",
fisher_alt = "two.sided", use_mappability = F, mappa_file = NULL, read_length = 36,
qc_plots = T, max_geneset_size = 2000, num_peak_threshold = 1, n_cores=1)
qc_plots
is set, then a file "chipenrich_qcplots.pdf" is produced containing a number of quality control plots. If out_name
is set to NULL, no files are written, and results then must be retrieved from the list returned by chipenrich
.
getwd
.
supported_genomes
function.
supported_genesets
function.
supported_locusdefs
for a list of supported definitions built-in. If using a user-specified file, the file must have 4 columns: geneid, chrom, start, end and be tab-delimited.
supported_methods
.
read_length
option.)
method='chipenrich'
and method='fet'
) is the closest TSS to this peak (for any gene, not necessarily the gene this peak was assigned to.) method='chipenrich'
and method='fet'
) is the gene having the closest TSS to the peak (should be the same as geneid when using the nearest TSS locus definition.) method='chipenrich'
and method='fet'
) is the strand of the gene with the closest TSS. method='broadenrich'
only) the start position of the peak overlap with the gene locus.method='broadenrich'
only) the end position of the peak overlap with the gene locus.method='broadenrich'
only) the base pair overlap of the peak with the gene locus.method='broadenrich'
only) is the mean proportion of the gene loci in the gene set covered by a peak.chipenrich
.num_peak_threshold
. method='broadenrich'
only) is the number of base pairs of the gene covered by a peak.method='broadenrich'
only) is the proportion of the gene covered by a peak.chipenrich
function is used to run gene set enrichment tests on a file or data frame contaning peaks called from a ChIP-seq experiment. It can currently run the following tests:
method='chipenrich'
) is the main method, which is similar to Fisher's exact test in that it is a test of association between genes having a peak, and genes belonging to a predefined gene set, but that also adjusts for gene locus length and sequence mappability (optional). The method uses binomial smoothing splines in a generalized additive model framework, see (Wood, 2010) or the gam
function for more details.
method='broadenrich'
) is designed for use with broad peaked datasets such as histone modifications. Given a locus definition, genes are scored according to the proportion of the gene locus which is covered by a peak (or peaks). As in method='chipenrich'
, Broad-Enrich uses binomial smoothing splines to adjust for gene locus length.
method='fet'
) is the naive approach, where peaks are assigned to a gene using the locus definitions set using the locusdef
option, and then each set of genes is tested using a 2x2 table tabulating whether genes have a peak, and whether they belong to the set of genes being tested. Note that this method does not adjust for gene locus length.
The peaks
option should be either:
The locusdef
option controls how peaks are assigned to genes. A number of options are available and can be listed using supported_locusdefs
. We currently support:
nearest_tss
: assigns peaks to the gene with the closest transcription start site (TSS). Each gene locus stretches from the midpoint upstream between TSSs, and the midpoint downstream between TSSs.
nearest_gene
: assigns peaks to the nearest gene. Each gene locus stretches from the upstream midpoint between genes and the downstream midpoint between genes.
1kb
: each gene locus is defined as the region 1KB up- and down-stream of the TSS.
5kb
: each gene locus is defined as the region 5KB up- and down-stream of the TSS.
The use_mappability
option enables correction for sequence mappability. This is done by multiplying the gene locus length by the mappability of the locus to compute the mappable genome length. The log10 mappable locus length is then used in the model rather than the log10 locus length. See the vignette for a complete description of mappability.
If use_mappability
is enabled, the user should specify read_length
, which should roughly correspond to the length of sequencing reads from the ChIP-seq experiment. Mappability of sequence is calculated using "reads" of a given length. See the vignette for a description of how mappability is calculated given a gene locus and read length. A list of supported read lengths can be found by using the supported_read_lengths
function.
The user can also provide their own per-gene mappability scores using the mappa_file
option. This overrides both use_mappability
and read_length
options above. The file should contain two columns: geneid, and mappa. Gene ID should be a valid Entrez gene ID, and mappa is a mappability score between 0 and 1.
The qc_plots
option enables the automatic generation of quality control plots aimed at giving the user a better understanding of their data. Currently three plots are created:
plot_spline_length
to generate the plot separately, and for a description of the plot features.
plot_dist_to_tss
to generate the plot separately, and for a description of the plot features.
plot_gene_coverage
to generate the plot separately, and for a description of the plot features. This plot is only automatically generated for method='broadenrich'
.
chipenrich
:
supported_genomes
for a list of available genomes.
supported_genesets
for a list of available geneset databases (KEGG, GO, etc.)
supported_locusdefs
for a list of available gene locus definitions.
supported_methods
for a list of available enrichment testing methods.
For making various QC plots from your peak data:
library(chipenrich.data)
library(chipenrich)
# Run ChipEnrich using an example dataset, assigning peaks to the nearest TSS,
# testing all Biocarta and Panther pathways
data(peaks_E2F4)
results = chipenrich(peaks_E2F4,method='chipenrich',locusdef='nearest_tss',
genesets=c('biocarta_pathway','panther_pathway'),out_name=NULL)
# Get the list of peaks that were assigned to genes.
assigned_peaks = results$peaks
# Get the results of enrichment testing.
enrich = results$results
Run the code above in your browser using DataLab