sARTP: ARTP test for summary data

Description

Calculate gene and pathway p-values using the ARTP test and summary data.

Usage

sARTP(summary.files, pathway, family, reference, lambda, 
      ncases, ncontrols, nsamples, options = NULL)

Arguments

summary.files

a character vector of file names containing the summary results of SNPs included in one or multiple studies. Each file must be able to be read by read.table. Each file must have columns called SNP, RefAllele, EffectAllele, Beta, and at least one of SE, P. An optional column called Direction describing studies information can also be included if the summary results were calculated from multiple studies by inverse weighting method. Two optional columns called Chr and Pos are required if excluded.regions is specified in options. SNPs within excluded.regions are going to be excluded from the analysis. If options$ambig.by.AF is TRUE, then a column called "RAF" or "EAF" is required. See Details.

pathway

a character of the name of file containing definition of a pathway. It must be able to be read by read.table and have columns called SNP, Gene, and Chr. It can also be a data frame with the three columns. The SNP column can also have values of the form loc1-loc2, where loc1 and loc2 are base-pair locations denoting a region of SNPs to use.

family

a character taking values of 'gaussian' or 'binomial'.

reference

a data.frame containing the paths of binary PLINK files of reference dataset. It must have columns called bed, bim and fam. The current version allows users to specify multiple sets of bed/bim/fam PLINK files as separate rows of the data frame.

lambda

a numeric vector of inflation factors. Each file in summary.files should have one inflation factor specified in lambda.

ncases

a list of numeric vectors specifying sample sizes of cases of participating studies. This argument should be specified only if family == 'binomial', otherwise list(). See Examples.

ncontrols

a list of numeric vectors specifying sample sizes of controls of participating studies. This argument should be specified only if family == 'binomial', otherwise list(). See Examples.

nsamples

a list of numeric vectors specifying total sample sizes of participating studies. This argument should be specified only if family == 'gaussian', otherwise list(). See Examples.

options

a list of options to control the test procedure. If NULL, default options will be used. See options.

Value

sARTP returns an object of class ARTP2. It is a list containing the following components:

pathway.pvalue

final pathway p-value accounting for multiple comparisons.

gene.pvalue

a data frame containing gene name, number of SNPs in the gene that were included in the analysis, chromosome name, and the p-value for the gene accounting for multiple comparisons.

pathway

a data frame defining the pathway that was actually tested after various filters applied.

model

a list containing detailed information of selected SNPs in each gene.

most.sig.genes

a character vector of genes selected by ARTP2. They are the most promising candidates, although their statistical significance is not guaranteed.

deleted.snps

a data frame containing SNPs excluded from the analysis and their reasons.

deleted.genes

a data frame containing genes excluded from the analysis because they are subsets of other remaining genes. Set options$rm.gene.subset to be FALSE to include all genes even if they are subsets of other genes.

options

a list of options used in the analysis. See options

accurate

TRUE if options$nperm is large enougth to accurately estimate p-values, i.e., if the criteria sqrt(pvalue*(1-pvalue)/nperm)/pvalue < 0.1 is satisfied.

setup

a list containing necessary input for warm.start. It can be written to a file by using the function save, then its path can be the input of warm.start. Loading from reference, it also contains a data frame of genotypes of used SNPs (setup$ref.geno), if options$keep.geno is TRUE.

Details

This function computes gene and pathway p-values when only summary data is available. Only the ARTP test modified from Yu et al. (2009) is well tested and is released with this package. ARTP is the Adaptive Rank Truncated Product test.

Each file in summary.files must contain

SNP SNP name
RefAllele reference allele. Can be different in studies
EffectAllele effect allele. Can be different in studies
Beta estimated effect in linear regression model or log odds ratio in logistic regression model

and must contain one of the optional columns

SE estimated standard error of Beta
P p-value of Wald's, LRT or score test for testing H_0: Beta = 0. Can be generated by lm, glm, anova in R or other standard statistical softwares.

An optional column Direction is encouraged to be provided by the user

Direction a character vector indicating which studies include a SNP. Any symbol except for '?' means a SNP is included in that study. Please note that the real direction of a SNP in studies ('+' or '-') does not matter, e.g., '++-?+' and '**+?-' provide exact the same information. See Examples.

Another two optional columns Chr and Pos are needed if excluded.regions is specified in options. ARTP2 will convert the column names to be upper case, so for example, either Beta or BETA or beta are accepted. See Examples.

Chr chromosome.
Pos base-pair position (bp units).

If the option ambig.by.AF is set to 1, then the summary files must contain at least one of:

RAF reference allele frequency.
EAF effect allele frequency.

The order of columns in files summary.files, pathway or in data frame reference are arbitrary, and all unnecessary columns (if any) are discarded in the analysis.

References

Zhang H, Wheeler W, Hyland LP, Yang Y, Shi J, Chatterjee N, Yu K. (2016) A powerful procedure for pathway-based meta-analysis using summary statistics identifies 43 pathways associated with type II diabetes in European populations. PLoS Genetics 12(6): e1006122

Yu K, Li Q, Bergen AW, Pfeiffer RM, Rosenberg PS, Caporaso N, Kraft P, Chatterjee N. (2009) Pathway analysis by adaptive combination of P-values. Genet Epidemiol 33(8): 700 - 709

Zhang H, Shi J, Liang F, Wheeler W, Stolzenberg-Solomon R, Yu K. (2014) A fast multilocus test with adaptive SNP selection for large-scale genetic association studies. European Journal of Human Genetics 22: 696 - 702

Examples

Run this code

# NOT RUN {
library(ARTP2)

## Path of files containing summary statistics
## Only required columns will be loaded
study1 <- system.file("extdata", package = "ARTP2", "study1.txt.gz")
study2 <- system.file("extdata", package = "ARTP2", "study2.txt.gz")

## Path of a build-in file containing pathway definition
pathway <- system.file("extdata", package = "ARTP2", "pathway.txt.gz")

## Create data frame containing paths of build-in PLINK files that are going to used as reference
## As an example, use all chromosomes
chr <- 1:22
nchr <- length(chr)

fam <- vector("character", nchr)
bim <- vector("character", nchr)
bed <- vector("character", nchr)

for(i in 1:nchr){
  fam[i] <- system.file("extdata", package = "ARTP2", paste("chr", chr[i], ".fam", sep = ""))
  bim[i] <- system.file("extdata", package = "ARTP2", paste("chr", chr[i], ".bim", sep = ""))
  bed[i] <- system.file("extdata", package = "ARTP2", paste("chr", chr[i], ".bed", sep = ""))
}

reference <- data.frame(fam, bim, bed, stringsAsFactors = FALSE)

## Set the options. 
## Accumulate signal from the top 2 SNPs in each gene
## 1e5 replicates of resampling to estimate the p-value
options <- list(inspect.snp.n = 2, nperm = 1e4, 
                maf = .01, HWE.p = 1e-6, 
                gene.R2 = .9, 
                id.str = "unique-pathway-id", 
                out.dir = getwd(), save.setup = FALSE)
                
## different inflation factors are adjusted in two studies
lambda <- c(1.10, 1.08)

## two summary files, so there are two elements in each of two lists ncases and ncontrols
## the first summary file includes data calculated from meta-analysis of two sub-studies, 
## each with sample size 63390 (9580 cases and 53810 controls) and 5643 (2591 cases and 
## 3052 controls)
## see a few rows in study1
# s <- read.table(study1, header = TRUE, as.is = TRUE, nrows = 10)
# s$Direction
## [1] "+?" "+?" "++" "+?" "+?" "+?" "+?" "+?" "+?" "++"
## sub-study1 has 9580 cases, and sub-study2 has 2591 cases
## sub-study1 has 53810 cases, and sub-study2 has 3052 cases
## '?' means a SNP is not included in that sub-study
## any other symbols means a SNP is included in that sub-study
ncases <- list()
ncontrols <- list()
ncases[[1]] <- c(9580, 2591)
ncontrols[[1]] <- c(53810, 3052)

## the second summary file includes data calculated from one sub-studies with sample size 
## 61957 (7638 cases and 54319 controls)
ncases[[2]] <- 7638
ncontrols[[2]] <- 54319

# logistic regression is used in base model, thus ncases and ncontrols should be specified. 
family <- 'binomial'

## pathway test with two study files
# ret <- sARTP(summary.files = c(study1, study2), pathway, family, reference, lambda, 
#              ncases, ncontrols, options = options)

# ret$pathway.pvalue
## [1] 0.04594541  # Mac OS
## [1] 0.05149485  # Linux with 1 thread
## [1] 0.03969603  # Linux with 32 threads

## Mac OS
# head(ret$gene.pvalue)
##       Gene Chr N.SNP      Pvalue
## 1     BDH2   4    10 0.000749925
## 2   UBE2D3   4     6 0.001849815
## 3     PBX2   6    22 0.003849615
## 4 PPP1R14D  15     9 0.003849615
## 5   MRPL10  17    18 0.011448855
## 6    SCYL1  11     3 0.019848015

## Linux with 1 thread
# head(ret$gene.pvalue)
##       Gene Chr N.SNP      Pvalue
## 1     BDH2   4    10 0.000949905
## 2   UBE2D3   4     6 0.001699830
## 3 PPP1R14D  15     9 0.003949605
## 4     PBX2   6    22 0.004299570
## 5   MRPL10  17    18 0.012448755
## 6    SCYL1  11     3 0.017148285

## Linux with 32 threads
# head(ret$gene.pvalue)
##       Gene Chr N.SNP      Pvalue
## 1   UBE2D3   4     6 0.000849915
## 2     BDH2   4    10 0.001049895
## 3 PPP1R14D  15     9 0.003949605
## 4     PBX2   6    22 0.004899510
## 5   MRPL10  17    18 0.012798720
## 6    SCYL1  11     3 0.015048495

## pathway test with each of two studies
# ret1 <- sARTP(summary.files = study1, pathway, family, reference, lambda[1], 
#               ncases[1], ncontrols[1], options = options)

# ret2 <- sARTP(summary.files = study2, pathway, family, reference, lambda[2], 
#               ncases[2], ncontrols[2], options = options)

# ret1$pathway.pvalue
## [1] 0.04279572  # Mac OS
## [1] 0.03519648  # Linux with 1 thread
## [1] 0.04644536  # Linux with 32 threads

# ret2$pathway.pvalue
## [1] 0.3092691  # Mac OS
## [1] 0.2870213  # Linux with 1 thread
## [1] 0.3010699  # Linux with 32 threads

##################################################
## The reference is passed as an individual-level genotype data frame

data(ref.geno)
# ret.ref <- sARTP(summary.files = c(study1, study2), pathway, family, ref.geno, lambda, 
#                  ncases, ncontrols, options = options)

# ret.ref$pathway.pvalue == ret$pathway.pvalue

##################################################
## The reference genotype data can also be merged into a single set of PLINK files

bed <- system.file("extdata", package = "ARTP2", "ref.bed")
bim <- system.file("extdata", package = "ARTP2", "ref.bim")
fam <- system.file("extdata", package = "ARTP2", "ref.fam")

reference <- data.frame(fam, bim, bed)
# ret.comb <- sARTP(summary.files = c(study1, study2), pathway, family, reference, lambda, 
#                   ncases, ncontrols, options = options)
# ret.comb$pathway.pvalue == ret$pathway.pvalue

################

## exclude some regions
exc.reg1 <- data.frame(Chr = c(1, 1, 22), 
                       Pos = c(1706160, 11979231, 51052379), 
                       Radius = c(5000, 0, 2000))
options$excluded.regions <- exc.reg1

# ret.exc1 <- sARTP(summary.files = c(study1, study2), pathway, family, reference, lambda, 
#                   ncases, ncontrols, options = options)

# ret.exc1$pathway.pvalue
## [1] 0.04619538 # Mac OS
## [1] 0.0510449  # Linux with 1 thread
## [1] 0.04054595 # Linux with 32 threads

# sum(ret.exc1$deleted.snps$reason == 'RM_BY_REGIONS')

## or equivalently
exc.reg2 <- data.frame(Chr = c(1, 1, 22), 
                       Start = c(1701160, 11979231, 51050379), 
                       End = c(1711160, 11979231, 51054379))
options$excluded.regions <- exc.reg2

# ret.exc2 <- sARTP(summary.files = c(study1, study2), pathway, family, reference, lambda, 
#                   ncases, ncontrols, options = options)
# ret.exc1$pathway.pvalue == ret.exc2$pathway.pvalue

#################

## select a subset of subjects in plink files as the reference
## options$selected.subs should be in the same format as the first column of fam file
## load character vector subj.id of 400 subjects from build-in dataset
data(subj.id, package = "ARTP2")
head(subj.id)
options$selected.subs <- subj.id
options$excluded.regions <- NULL

# ret.sel <- sARTP(summary.files = c(study1, study2), pathway, family, reference, lambda, 
#                  ncases, ncontrols, options = options)
# ret.sel$pathway.pvalue
## [1] 0.03469653 # Mac OS
## [1] 0.05284472 # Linux with 1 thread
## [1] 0.04164584 # Linux with 32 threads

# }

Run the code above in your browser using DataLab