assocTestRegression: Association tests

Description

This function performs regression based and likelihood ratio based association tests for both genotype main effects as well as interaction effects. It also computes genotype counts for association tests.

Usage

assocTestRegression(genoData, outcome, model.type,  covar.list = NULL, ivar.list = NULL, gene.action.list = NULL, dosage = FALSE, scan.chromosome.filter = NULL,  scan.exclude = NULL, CI = 0.95, robust = FALSE, LRtest = TRUE, chromosome.set = NULL, block.set = NULL,  block.size = 5000, verbose = TRUE, outfile = NULL)

Arguments

genoData

GenotypeData object, should contain phenotypes and covariates in scan annotation. Chromosomes are expected to be in contiguous blocks.

outcome

Vector (of length equal to the number of models) of names of the outcome variables for each model. These names must be in the scan annotation of genoData. e.g. c("case.cntl.status", "blood.pressure") will use "case.cntl.status" as the outcome for the first model and "blood pressure" for the second. Outcome variables must be coded as 0/1 for logistic regression.

model.type

vector (of length equal to the number of models) with the types of models to be fitted. The elements should be one of: "logistic", "linear", or "poisson". e.g. c("logistic", "linear") will perform two tests: the first using logistic regression, and the second using linear regression.

covar.list

list (of length equal to the number of models) of vectors containing the names of covariates to be used in the regression model (blank, i.e. "" if none). The default value is NULL and will include no covariates in any of the models. The covariate names must be in the scan annotation of genoData. e.g.


  covar.list() <- list();
  covar.list[[1]] <- c("age","sex");
  covar.list[[2]] <- c("");

will use both "age" and "sex" as covariates for the first model and no covariates for the second model (this regresses on only the genotype).

ivar.list

list (of length equal to the number of models) of vectors containing the names of covariates for which to include an interaction with genotype (blank, i.e. "" if none). The default value is NULL and will include no interactions in any of the models. The covariate names must be in the scan annotation of genoData. e.g.


  ivar.list() <- list();
  ivar.list[[1]] <- c("sex");
  ivar.list[[2]] <- c("");

will include a genotype*"sex" interaction term for the first model and no interactions for the second model.

gene.action.list

a list (of length equal to the number of models) of vectors containing the types of gene action models to be used in the corresponding regression model. Valid options are "additive", "dominant", and "recessive", referring to how the minor allele is treated, as well as "dominance". "additive" coding sets the marker variable for homozygous minor allele samples = 2, heterozygous samples = 1, and homozygous major allele samples = 0. "dominant" coding sets the marker variable for homozygous minor allele samples = 2, heterozygous samples = 2, and homozygous major allele samples = 0. "recessive" coding sets the marker variable for homozygous minor allele samples = 2, heterozygous samples = 0, and homozygous major allele samples = 0. "dominance" coding sets the marker variable for homozygous minor allele samples = major allele frequency, heterozygous samples = 0, and homozygous major allele samples = minor allele frequency. This coding eliminates the additive component of variance for the marker variable, leaving only the dominance component of variance. The default value is NULL, which assumes only an "additive" gene action model for every test. e.g.


  gene.action.list() <- list();
  gene.action.list[[1]] <- c("additive");
  gene.action.list[[2]] <- c("dominant", "recessive");

will run the first model using "additive" gene action, and will run the second model using both "dominant" and "recessive" gene actions.

dosage

logical for whether or not the genotype values are imputed dosages. The default value is FALSE for true genotype calls. When using imputed dosages, the gene.action must be additive, and genotype counts will not be calculated.

scan.chromosome.filter

a logical matrix that can be used to exclude some chromosomes, some scans, or some specific scan-chromosome pairs. Entries should be TRUE if that scan-chromosome pair should be included in the analysis, FALSE if not. The number of rows must be equal to the number of scans in genoData, and the number of columns must be equal to the largest integer chromosome value in genoData. The column number must match the chromosome number. e.g. A scan.chromosome.filter matrix used for an analyis when genoData has SNPs with chromosome=(1-24, 26, 27) (i.e. no Y (25) chromosome SNPs) must have 27 columns (all FALSE in the 25th column). But a scan.chromosome.filter matrix used for an analysis genoData has SNPs chromosome=(1-26) (i.e no Unmapped (27) chromosome SNPs) must have only 26 columns.

scan.exclude

an integer vector containing the IDs of entire scans to be excluded.

sets the confidence level for the confidence interval calculations. Confidence intervals are computed at every SNP; for the odds ratio when using logistic regression, for the linear trend parameter when using linear regression, and for the rate ratio when using Poisson regression. The default value is 0.95 (i.e. a 95% confidence interval). The confidence level must be between 0 and 1.

robust

logical for whether to use sandwich-based robust standard errors. The default value is FALSE, and uses model based standard errors. The standard error estimates are returned and also used for Wald Tests of significance.

LRtest

logical for whether to perform Likelihood Ratio Tests. The default value is TRUE, and performs LR tests in addition to Wald tests (which are always performed). NOTE: Performing the LR tests adds a noticeable amount of computation time.

chromosome.set

integer vector with chromosome(s) to be analyzed. Use 23, 24, 25, 26, 27 for X, XY, Y, M, Unmapped respectively.

block.set

list (of length equal to length(chromosome.set)) of vectors where every vectors contains the indices of the SNP blocks (on that chromosome) to be analyzed. e.g.

chromosome.set <- c(1,2); block.set <- list(); chr.1 <- c(1,2,3);
  chr.2 <- c(5,6,7,8); block.set$chr.1 <- chr.1; block.set$chr.2 <- chr.2;

will analyze first three block on chromosome 1 and 5th through 8th blocks on chromosome 2. The actual number of SNPs analyzed will depend on block.size. Default value is NULL. If block.set == NULL, all the SNPs on chromosomes in chromosome.set will be analyzed.

block.size

Number of SNPs to be read from genoData at once.

verbose

if TRUE (default), will print status updates while the function runs. e.g. it will print "chr 1 block 1 of 10" etc. in the R console after each block of SNPs is done being analyzed.

outfile

a character string to append in front of ".model.j.gene_action.chr.i_k.RData" for naming the output data-frames; where j is the model number, gene_action is the gene.action type, i is the first chromosome, and k is the last chromosome used in that call to the function. "chr.i_k." will be omitted if chromosome.set=NULL. If set to NULL (default), then the results are returned to the R console.

Value

snpID: snpID (from genoData) of the SNP
model.M.n: sample size for the regression
model.M.nAA: number of AA genotypes in samples
model.M.nAB: number of AB genotypes in samples
model.M.nBB: number of BB genotypes in samples
model.M.nAA.cc0: number of AA genotypes in samples with outcome coded as 0
model.M.nAB.cc0: number of AB genotypes in samples with outcome coded as 0
model.M.nBB.cc0: number of BB genotypes in samples with outcome coded as 0
model.M.nAA.cc1: number of AA genotypes in samples with outcome coded as 1
model.M.nAB.cc1: number of AB genotypes in samples with outcome coded as 1
model.M.nBB.cc1: number of BB genotypes in samples with outcome coded as 1
model.M.MAF: minor allele frequency. Note that calculation of allele frequency for the X chromosome is different than that for the autosomes and the XY (pseudo-autosomal) region. Hence if chromosome.set includes 23, genoData should provide the sex of the scan ("M" or "F") i.e. there should be a column named "sex" with "F" for females and "M" for males.
model.M.minor.allele: the minor allele. Takes values "A" or "B".
model.M.gene_action.warningOrError: report of different possible warnings or errors: 0 if controls are monomorphic (logistic regression only), 1 if cases are monomorphic (logistic refression only), 2 if all samples are monomorphic or allele frequency is NA, 9 if a warning or error occured during model fitting, NA if none
model.M.gene_action.Est.G: estimate of the regression coefficient for the genotype term. See the description in gene.action.list above for interpretation.
model.M.gene_action.SE.G: standard error of the regression coefficient estimate for the genotype term. Could be either sandwich based (robust) or model based; see description in robust.
model.M.gene_action.L95.G: lower 95% confidence limit for the genotype coefficient (95 will be replaced with whatever confidence level is chosen in CI).
model.M.gene_action.U95.G: upper 95% confidence limit for the genotype coefficient (95 will be replaced with whatever confidence level is chosen in CI).
model.M.gene_action.OR.G: odds ratio for the genotype term. This is exp(the regression coefficient). See the description in "gene.action.list" above for interpretation.
model.M.gene_action.OR_L95.G: lower 95% confidence limit for the odds ratio (95 will be replaced with whatever confidence level is chosen in CI).
model.M.gene_action.OR_U95.G: upper 95% confidence limit for the odds ratio (95 will be replaced with whatever confidence level is chosen in CI).
model.M.gene_action.RR.G: relative risk for the genotype term. This is exp(the regression coefficient). See the description in "gene.action.list" above for interpretation.
model.M.gene_action.RR_L95.G: lower 95% confidence limit for the relative risk (95 will be replaced with whatever confidence level is chosen in CI).
model.M.gene_action.RR_U95.G: upper 95% confidence limit for the relative risk (95 will be replaced with whatever confidence level is chosen in CI).
model.M.gene_action.Wald.Stat.G: value of the Wald test statistic for testing the genotype parameter
model.M.gene_action.Wald.pval.G: Wald test p-value, calculated from a Chi-Squared distribution. This can be calculated using either sandwich based robust standard errors or model based standard errors (see robust).
model.M.gene_action.LR.Stat.G: value of the Likelihood Ratio test statistic for testing the genotype parameter
model.M.gene_action.LR.pval.G: Likelihood Ratio test p-value.
model.M.gene_action.Est.G:ivar_name: estimate of the regression coefficient for the interaction between genotype and ivar_name.
model.M.gene_action.SE.G:ivar_name: standard error of the interaction regression coefficient estimate. Could be either sandwich based (robust) or model based; see description in robust.
model.M.gene_action.L95.G:ivar_name: lower 95% confidence limit for the genotype*ivar_name interaction coefficient (95 will be replaced with whatever confidence level is chosen in CI).
model.M.gene_action.U95.G:ivar_name: upper 95% confidence limit for the genotype*ivar_name interaction coefficient (95 will be replaced with whatever confidence level is chosen in CI).
model.M.gene_action.OR.G:ivar_name: odds ratio for the genotype*ivar_name interaction term. This is exp(the interaction regression coefficient). A separate odds ratio is calculated for each interaction term. See the description in "gene.action.list" above for interpretation.
model.M.gene_action.OR_L95.G:ivar_name: lower 95% confidence limit for the odds ratio (95 will be replaced with whatever confidence level is chosen in CI).
model.M.gene_action.OR_U95.G:ivar_name: upper 95% confidence limit for the odds ratio (95 will be replaced with whatever confidence level is chosen in CI).
model.M.gene_action.RR.G:ivar_name: relative risk for the genotype*ivar_name interaction term. This is exp(the interaction regression coefficient). A separate relative risk is calculated for each interaction term. See the description in "gene.action.list" above for interpretation.
model.M.gene_action.RR_L95.G:ivar_name: lower 95% confidence limit for the relative risk (95 will be replaced with whatever confidence level is chosen in CI).
model.M.gene_action.RR_U95.G:ivar_name: upper 95% confidence limit for the relative risk (95 will be replaced with whatever confidence level is chosen in CI).
model.M.gene_action.Wald.Stat.G:ivar_name: value of the Wald test statistic for testing the genotype*ivar_name interaction parameter
model.M.gene_action.Wald.pval.G:ivar_name: Wald test p-value for testing the genotype*ivar_name interaction parameter, calculated from a Chi-Squared distribution. This can be calculated using either sandwich based robust standard errors or model based standard errors (see robust).
model.M.gene_action.LR.Stat.G:ivar_name: value of the Likelihood Ratio test statistic for testing the genotype*ivar_name interaction parameter
model.M.gene_action.LR.pval.G:ivar_name: Likelihood Ratio test p-value for testing the genotype*ivar_name interaction parameter.
model.M.gene_action.Wald.Stat.G.Joint: value of the Wald test statistic for jointly testing all of the genotype parameters (main effects and interactions); a test for any genotype effect.
model.M.gene_action.Wald.pval.G.Joint: Wald test p-value for jointly testing all of the genotype parameters, calculated from a Chi-Squared distribution. This can be calculated using either sandwich based robust standard errors or model based standard errors (see robust).
model.M.gene_action.LR.Stat.G.Joint: value of the Likelihood Ratio test statistic for jointly testing all of the genotype parameters (main effects and interactions); a test for any genotype effect.
model.M.gene_action.LR.pval.G.Joint: Likelihood Ratio test p-value for jointly testing all of the genotype parameters.

Details

When using models without interaction terms, the association tests compare the model including the covariates and genotype value to the model including only the covariates (a test of genotype effect). When using a model with interaction terms, tests are performed for each of the interaction terms separately as well as a joint test of all the genotype terms (main effects and interactions) to detect any genotype effect. All tests and p-values are always computed using Wald tests with p-values computed from Chi-Squared distribtuions. The option of using either sandwich based robust standard errors (which make no model assumptions) or using model based standard errors for the confidence intervals and Wald tests is specified by the robust parameter. The option of also performing equivalent Likelihood Ratio tests is available and is specified by the LRtest parameter. Three types of regression models are available: "logistic", "linear", or "poisson". Multiple models can be run at the same time by putting multiple arguments in the outcome, model.type, covar.list, ivar.list, and gene.action.list parameters. For each model, available gene action models are "additive", "dominant", "recessive", and "dominance." See above for the correct usage of each of these. For logistic regression models, if the SNP is monomorphic in either cases or controls, then the slope parameter is not well-defined. In this situation, an error message will be returned (see model.N.gene_action.warningOrError in the Value section below for details), and the regression of this SNP will not be performed. If a test of significance is still desired for these SNPs, we suggest performing either a Fisher's Exact Test using the assocTestFisherExact function provided in GWASTools or performing a trend test (using model.type = "linear" in this function).

Individual samples can be included or excluded from the analysis using the scan.exclude parameter. Individual chromosomes can be included or excluded by specifying the indices of the chromosomes to be included in the chromosome.set parameter. Specific chromosomes for specific samples can be included or excluded using the scan.chromosome.filter parameter. The inclusion or exclusion of specific blocks of SNP's on each chromosome can be specified using the block.set parameter. Note that the actual SNP's included or excluded will change according to the value of block.size. Both scan.chromosome.filter and scan.exclude may be used together. If a scan is excluded in EITHER, then it will be excluded from the analysis, but it does NOT need to be excluded in both. This design allows for easy filtering of anomalous scan-chromosome pairs using the scan.chromosome.filter matrix, but still allows easy exclusion of a specific group of scans (e.g. males or Caucasians) using scan.exclude. This function allows for the usage of imputed dosages in place of genotypes in the additive model by specifying dosage = TRUE.

Examples

Run this code

# The following example would perform 3 tests (from 2 models): 
# the first a logistic regression of case.cntl.status on genotype, age, and sex, including an interaction term between genotype and sex, using additive gene action; 
# the second a linear regression of blood pressure on genotype using dominant gene action, 
# and the third, a linear regression of blood pressure on genotype again, but this time using recessive gene action.
# This test would only use chromosome 21.  
# It would perform both robust Wald tests using sandwich based robust standard errors as well as Likelihood Ratio tests.

# an example of a scan chromosome matrix
# desiged to eliminate duplicated individuals
# and scans with missing values of sex
library(GWASdata)
data(illumina_scan_annot)
scanAnnot <- ScanAnnotationDataFrame(illumina_scan_annot)
samp.chr.matrix <- matrix(TRUE,nrow(scanAnnot),26)
dup <- duplicated(scanAnnot$subjectID)
samp.chr.matrix[dup | is.na(scanAnnot$sex),] <- FALSE

# additionally, exclude YRI subjects
scan.exclude <- scanAnnot$scanID[scanAnnot$race == "YRI"]

# create some variables for the scans
scanAnnot$sex <- as.factor(scanAnnot$sex)
scanAnnot$age <- rnorm(nrow(scanAnnot),mean=40, sd=10)
scanAnnot$case.cntl.status <- rbinom(nrow(scanAnnot),1,0.4)
scanAnnot$blood.pressure[scanAnnot$case.cntl.status==1] <- rnorm(sum(scanAnnot$case.cntl.status==1),mean=100,sd=10)
scanAnnot$blood.pressure[scanAnnot$case.cntl.status==0] <- rnorm(sum(scanAnnot$case.cntl.status==0),mean=90,sd=5)

# create data object
gdsfile <- system.file("extdata", "illumina_geno.gds", package="GWASdata")
gds <- GdsGenotypeReader(gdsfile)
genoData <-  GenotypeData(gds, scanAnnot=scanAnnot)

# set regression variables and models
outcome <- c("case.cntl.status","blood.pressure")

covar.list <- list()
covar.list[[1]] <- c("age","sex")
covar.list[[2]] <- c("")

ivar.list <- list();
ivar.list[[1]] <- c("sex");
ivar.list[[2]] <- c("");
  
model.type <- c("logistic","linear")

gene.action.list <- list()
gene.action.list[[1]] <- c("additive")
gene.action.list[[2]] <- c("dominant", "recessive")

chr.set <- 21

outfile <- tempfile()

assocTestRegression(genoData,
                    outcome = outcome,
                    model.type = model.type,
                    covar.list = covar.list,
                    ivar.list = ivar.list,
                    gene.action.list = gene.action.list,
                    scan.chromosome.filter = samp.chr.matrix,
                    scan.exclude = scan.exclude,
                    CI = 0.95,
                    robust = TRUE,
                    LRtest = TRUE,
                    chromosome.set = chr.set,
                    outfile = outfile)

model1 <- getobj(paste(outfile, ".model.1.additive.chr.21_21.RData", sep=""))
model2 <- getobj(paste(outfile, ".model.2.dominant.chr.21_21.RData", sep=""))
model3 <- getobj(paste(outfile, ".model.2.recessive.chr.21_21.RData", sep=""))

close(genoData)
unlink(paste(outfile, "*", sep=""))

# In order to run the test on all chromosomes, it is suggested to run the function in parallel.
# To run the function in parallel the following unix can be used:
# R --vanilla --args 21 22 < assoc.analysis.r >logfile.txt &
# where the file assoc.analysis.r will include commands similar to this example 
# where chromosome.set and/or block.set can be passed to R using --args
# Here, tests on chromosomes 21 and 22 are performed; these could be replaced by any set of chromosomes
# these values are retrieved in R by putting a
# chr.set <- as.numeric(commandArgs(trailingOnly=TRUE))
# command in assoc.analysis.r

Run the code above in your browser using DataLab