
Last chance! 50% off unlimited learning
Sale ends in
Calculate gene and pathway p-values using the ARTP test and summary data.
sARTP(summary.files, pathway, family, reference, lambda,
ncases, ncontrols, nsamples, options = NULL)
a character vector of file names containing the summary results of SNPs included in one or multiple studies.
Each file must be able to be read by read.table
.
Each file must have columns called SNP
, RefAllele
, EffectAllele
, Beta
,
and at least one of SE
, P
.
An optional column called Direction
describing studies information can also be included if the
summary results were calculated from multiple studies by inverse weighting method.
Two optional columns called Chr
and Pos
are required if excluded.regions
is specified in options
.
SNPs within excluded.regions
are going to be excluded from the analysis.
If options$ambig.by.AF
is TRUE, then a column called "RAF" or "EAF" is required.
See Details
.
a character of the name of file containing definition of a pathway.
It must be able to be read by read.table
and have columns called SNP
, Gene
, and Chr
.
It can also be a data frame with the three columns.
The SNP
column can also have values of the form loc1-loc2
, where loc1
and loc2
are
base-pair locations denoting a region of SNPs to use.
a character taking values of 'gaussian'
or 'binomial'
.
a data.frame containing the paths of binary PLINK files of reference dataset.
It must have columns called bed
, bim
and fam
.
The current version allows users to specify multiple sets of bed/bim/fam PLINK files as separate rows of the data frame.
a numeric vector of inflation factors. Each file in summary.files
should have one inflation factor specified in lambda
.
a list of numeric vectors specifying sample sizes of cases of participating studies. This argument should be specified only if family == 'binomial'
, otherwise list()
. See Examples
.
a list of numeric vectors specifying sample sizes of controls of participating studies. This argument should be specified only if family == 'binomial'
, otherwise list()
. See Examples
.
a list of numeric vectors specifying total sample sizes of participating studies. This argument should be specified only if family == 'gaussian'
, otherwise list()
. See Examples
.
a list of options to control the test procedure. If NULL
, default options will be used. See options
.
sARTP
returns an object of class ARTP2
. It is a list containing the following components:
final pathway p-value accounting for multiple comparisons.
a data frame containing gene name, number of SNPs in the gene that were included in the analysis, chromosome name, and the p-value for the gene accounting for multiple comparisons.
a data frame defining the pathway that was actually tested after various filters applied.
a list containing detailed information of selected SNPs in each gene.
a character vector of genes selected by ARTP2
. They are the most promising candidates, although their statistical significance is not guaranteed.
a data frame containing SNPs excluded from the analysis and their reasons.
a data frame containing genes excluded from the analysis because they are subsets of other remaining genes. Set options$rm.gene.subset
to be FALSE
to include all genes even if they are subsets of other genes.
a list of options used in the analysis. See options
TRUE
if options$nperm
is large enougth to accurately estimate p-values, i.e., if the criteria
sqrt(pvalue*(1-pvalue)/nperm)/pvalue < 0.1
is satisfied.
a list containing necessary input for warm.start
. It can be written to a file by using the function save
, then its path can be the input of warm.start
. Loading from reference
, it also contains a data frame of genotypes of used SNPs (setup$ref.geno
), if options$keep.geno
is TRUE
.
This function computes gene and pathway p-values when only summary data is available. Only the ARTP test modified from Yu et al. (2009) is well tested and is released with this package. ARTP is the Adaptive Rank Truncated Product test.
Each file in summary.files
must contain
SNP
SNP name
RefAllele
reference allele. Can be different in studies
EffectAllele
effect allele. Can be different in studies
Beta
estimated effect in linear regression model or log odds ratio in logistic regression model
and must contain one of the optional columns
SE
estimated standard error of Beta
P
p-value of Wald's, LRT or score test for testing H_0: Beta = 0
. Can be generated by lm
, glm
, anova
in R
or other standard statistical softwares.
An optional column Direction
is encouraged to be provided by the user
Direction
a character vector indicating which studies include a SNP. Any symbol except for '?' means a SNP is included in that study. Please note that the real direction of a SNP in studies ('+' or '-') does not matter, e.g., '++-?+' and '**+?-' provide exact the same information. See Examples
.
Another two optional columns Chr
and Pos
are needed if excluded.regions
is specified in options
. ARTP2
will convert the column names to be upper case, so for example, either Beta
or BETA
or beta
are accepted. See Examples
.
Chr
chromosome.
Pos
base-pair position (bp units).
If the option ambig.by.AF
is set to 1, then the summary files must contain at least one of:
RAF
reference allele frequency.
EAF
effect allele frequency.
The order of columns in files summary.files
, pathway
or in data frame reference
are arbitrary,
and all unnecessary columns (if any) are discarded in the analysis.
Zhang H, Wheeler W, Hyland LP, Yang Y, Shi J, Chatterjee N, Yu K. (2016) A powerful procedure for pathway-based meta-analysis using summary statistics identifies 43 pathways associated with type II diabetes in European populations. PLoS Genetics 12(6): e1006122
Yu K, Li Q, Bergen AW, Pfeiffer RM, Rosenberg PS, Caporaso N, Kraft P, Chatterjee N. (2009) Pathway analysis by adaptive combination of P-values. Genet Epidemiol 33(8): 700 - 709
Zhang H, Shi J, Liang F, Wheeler W, Stolzenberg-Solomon R, Yu K. (2014) A fast multilocus test with adaptive SNP selection for large-scale genetic association studies. European Journal of Human Genetics 22: 696 - 702
# NOT RUN {
library(ARTP2)
## Path of files containing summary statistics
## Only required columns will be loaded
study1 <- system.file("extdata", package = "ARTP2", "study1.txt.gz")
study2 <- system.file("extdata", package = "ARTP2", "study2.txt.gz")
## Path of a build-in file containing pathway definition
pathway <- system.file("extdata", package = "ARTP2", "pathway.txt.gz")
## Create data frame containing paths of build-in PLINK files that are going to used as reference
## As an example, use all chromosomes
chr <- 1:22
nchr <- length(chr)
fam <- vector("character", nchr)
bim <- vector("character", nchr)
bed <- vector("character", nchr)
for(i in 1:nchr){
fam[i] <- system.file("extdata", package = "ARTP2", paste("chr", chr[i], ".fam", sep = ""))
bim[i] <- system.file("extdata", package = "ARTP2", paste("chr", chr[i], ".bim", sep = ""))
bed[i] <- system.file("extdata", package = "ARTP2", paste("chr", chr[i], ".bed", sep = ""))
}
reference <- data.frame(fam, bim, bed, stringsAsFactors = FALSE)
## Set the options.
## Accumulate signal from the top 2 SNPs in each gene
## 1e5 replicates of resampling to estimate the p-value
options <- list(inspect.snp.n = 2, nperm = 1e4,
maf = .01, HWE.p = 1e-6,
gene.R2 = .9,
id.str = "unique-pathway-id",
out.dir = getwd(), save.setup = FALSE)
## different inflation factors are adjusted in two studies
lambda <- c(1.10, 1.08)
## two summary files, so there are two elements in each of two lists ncases and ncontrols
## the first summary file includes data calculated from meta-analysis of two sub-studies,
## each with sample size 63390 (9580 cases and 53810 controls) and 5643 (2591 cases and
## 3052 controls)
## see a few rows in study1
# s <- read.table(study1, header = TRUE, as.is = TRUE, nrows = 10)
# s$Direction
## [1] "+?" "+?" "++" "+?" "+?" "+?" "+?" "+?" "+?" "++"
## sub-study1 has 9580 cases, and sub-study2 has 2591 cases
## sub-study1 has 53810 cases, and sub-study2 has 3052 cases
## '?' means a SNP is not included in that sub-study
## any other symbols means a SNP is included in that sub-study
ncases <- list()
ncontrols <- list()
ncases[[1]] <- c(9580, 2591)
ncontrols[[1]] <- c(53810, 3052)
## the second summary file includes data calculated from one sub-studies with sample size
## 61957 (7638 cases and 54319 controls)
ncases[[2]] <- 7638
ncontrols[[2]] <- 54319
# logistic regression is used in base model, thus ncases and ncontrols should be specified.
family <- 'binomial'
## pathway test with two study files
# ret <- sARTP(summary.files = c(study1, study2), pathway, family, reference, lambda,
# ncases, ncontrols, options = options)
# ret$pathway.pvalue
## [1] 0.04594541 # Mac OS
## [1] 0.05149485 # Linux with 1 thread
## [1] 0.03969603 # Linux with 32 threads
## Mac OS
# head(ret$gene.pvalue)
## Gene Chr N.SNP Pvalue
## 1 BDH2 4 10 0.000749925
## 2 UBE2D3 4 6 0.001849815
## 3 PBX2 6 22 0.003849615
## 4 PPP1R14D 15 9 0.003849615
## 5 MRPL10 17 18 0.011448855
## 6 SCYL1 11 3 0.019848015
## Linux with 1 thread
# head(ret$gene.pvalue)
## Gene Chr N.SNP Pvalue
## 1 BDH2 4 10 0.000949905
## 2 UBE2D3 4 6 0.001699830
## 3 PPP1R14D 15 9 0.003949605
## 4 PBX2 6 22 0.004299570
## 5 MRPL10 17 18 0.012448755
## 6 SCYL1 11 3 0.017148285
## Linux with 32 threads
# head(ret$gene.pvalue)
## Gene Chr N.SNP Pvalue
## 1 UBE2D3 4 6 0.000849915
## 2 BDH2 4 10 0.001049895
## 3 PPP1R14D 15 9 0.003949605
## 4 PBX2 6 22 0.004899510
## 5 MRPL10 17 18 0.012798720
## 6 SCYL1 11 3 0.015048495
## pathway test with each of two studies
# ret1 <- sARTP(summary.files = study1, pathway, family, reference, lambda[1],
# ncases[1], ncontrols[1], options = options)
# ret2 <- sARTP(summary.files = study2, pathway, family, reference, lambda[2],
# ncases[2], ncontrols[2], options = options)
# ret1$pathway.pvalue
## [1] 0.04279572 # Mac OS
## [1] 0.03519648 # Linux with 1 thread
## [1] 0.04644536 # Linux with 32 threads
# ret2$pathway.pvalue
## [1] 0.3092691 # Mac OS
## [1] 0.2870213 # Linux with 1 thread
## [1] 0.3010699 # Linux with 32 threads
##################################################
## The reference is passed as an individual-level genotype data frame
data(ref.geno)
# ret.ref <- sARTP(summary.files = c(study1, study2), pathway, family, ref.geno, lambda,
# ncases, ncontrols, options = options)
# ret.ref$pathway.pvalue == ret$pathway.pvalue
##################################################
## The reference genotype data can also be merged into a single set of PLINK files
bed <- system.file("extdata", package = "ARTP2", "ref.bed")
bim <- system.file("extdata", package = "ARTP2", "ref.bim")
fam <- system.file("extdata", package = "ARTP2", "ref.fam")
reference <- data.frame(fam, bim, bed)
# ret.comb <- sARTP(summary.files = c(study1, study2), pathway, family, reference, lambda,
# ncases, ncontrols, options = options)
# ret.comb$pathway.pvalue == ret$pathway.pvalue
################
## exclude some regions
exc.reg1 <- data.frame(Chr = c(1, 1, 22),
Pos = c(1706160, 11979231, 51052379),
Radius = c(5000, 0, 2000))
options$excluded.regions <- exc.reg1
# ret.exc1 <- sARTP(summary.files = c(study1, study2), pathway, family, reference, lambda,
# ncases, ncontrols, options = options)
# ret.exc1$pathway.pvalue
## [1] 0.04619538 # Mac OS
## [1] 0.0510449 # Linux with 1 thread
## [1] 0.04054595 # Linux with 32 threads
# sum(ret.exc1$deleted.snps$reason == 'RM_BY_REGIONS')
## or equivalently
exc.reg2 <- data.frame(Chr = c(1, 1, 22),
Start = c(1701160, 11979231, 51050379),
End = c(1711160, 11979231, 51054379))
options$excluded.regions <- exc.reg2
# ret.exc2 <- sARTP(summary.files = c(study1, study2), pathway, family, reference, lambda,
# ncases, ncontrols, options = options)
# ret.exc1$pathway.pvalue == ret.exc2$pathway.pvalue
#################
## select a subset of subjects in plink files as the reference
## options$selected.subs should be in the same format as the first column of fam file
## load character vector subj.id of 400 subjects from build-in dataset
data(subj.id, package = "ARTP2")
head(subj.id)
options$selected.subs <- subj.id
options$excluded.regions <- NULL
# ret.sel <- sARTP(summary.files = c(study1, study2), pathway, family, reference, lambda,
# ncases, ncontrols, options = options)
# ret.sel$pathway.pvalue
## [1] 0.03469653 # Mac OS
## [1] 0.05284472 # Linux with 1 thread
## [1] 0.04164584 # Linux with 32 threads
# }
Run the code above in your browser using DataLab