Learn R Programming

MatrixEQTL (version 2.1.0)

Matrix_eQTL_main: Main function for fast eQTL analysis in MatrixEQTL package

Description

Matrix_eQTL_engine function tests association of every row of the snps dataset with every row of the gene dataset using a linear regression model defined by the useModel parameter (see below). The testing procedure accounts for extra covariates in cvrt parameter. The errorCovariance parameter can be set to the error variance-covariance matrix to account for heteroskedastic and/or correlated errors. Associations significant at pvOutputThreshold (pvOutputThreshold.cis) levels are saved to output_file_name (output_file_name.cis), with corresponding estimates of effect size (slope coefficient), test statistics, p-values, and q-values (false discovery rate). Matrix eQTL can perform separate analysis for local (cis) and distant (trans) eQTLs. For such analysis one has to set the cis-analysis specific parameters pvOutputThreshold.cis > 0, cisDist, snpspos and {genepos} in the call of Matrix_eQTL_main function. A gene-SNP pair is considered local if the distance between them is less or equal to cisDist. The genomic location of genes and SNPs is defined by data frames snpspos and {genepos}. Depending on p-value thresholds pvOutputThreshold and pvOutputThreshold.cis Matrix eQTL runs in one of three different modes:
  • SetpvOutputThreshold > 0andpvOutputThreshold.cis = 0(or useMatrix_eQTL_engine) to perform eQTL analysis without using gene/SNP locations. Associations significant at thepvOutputThresholdlevel are be recorded inoutput_file_nameand in the returned object.
  • SetpvOutputThreshold = 0andpvOutputThreshold.cis > 0to perform eQTL analysis for local gene-SNP pairs only. Local associations significant atpvOutputThreshold.cislevel will be recorded inoutput_file_name.cisand in the returned object.
  • SetpvOutputThreshold > 0andpvOutputThreshold.cis > 0to perform eQTL analysis with separate p-value thresholds for local and distant eQTLs. Distant and local associations significant at corresponding thresholds are recorded inoutput_file_nameandoutput_file_name.cisrespectively and in the returned object. In this case the false discovery rate is calculated separately for these two sets of eQTLs.
Matrix_eQTL_engine is a wrapper for Matrix_eQTL_main for eQTL analysis without regard to gene/SNP location and provided for compatibility with the previous versions of the package. The parameter pvalue.hist allows to record information sufficient to create a histogram or QQ-plot of all the p-values (see plot).

Usage

Matrix_eQTL_main(	snps, 
                  gene, 
                  cvrt = SlicedData$new(), 
                  output_file_name = "", 
                  pvOutputThreshold = 1e-5,
                  useModel = modelLINEAR, 
                  errorCovariance = numeric(), 
                  verbose = TRUE, 
                  output_file_name.cis = "", 
                  pvOutputThreshold.cis = 0,
                  snpspos = NULL, 
                  genepos = NULL,
                  cisDist = 1e6,
                  pvalue.hist = FALSE,
                  min.pv.by.genesnp = FALSE,
                  noFDRsaveMemory = FALSE)

Matrix_eQTL_engine(snps, 
                   gene, 
                   cvrt = SlicedData$new(), 
                   output_file_name, 
                   pvOutputThreshold = 1e-5, 
                   useModel = modelLINEAR, 
                   errorCovariance = numeric(), 
                   verbose = TRUE,
                   pvalue.hist = FALSE,
                   min.pv.by.genesnp = FALSE,
                   noFDRsaveMemory = FALSE)

Arguments

snps
SlicedData object with genotype information. Can be real-valued for linear models and must take at most 3 distinct values for ANOVA unless the number of ANOVA categories is set to a higher number (see useM
gene
SlicedData object with gene expression information. Must have columns matching those of snps.
cvrt
SlicedData object with additional covariates. Can be an empty SlicedData object in case of no covariates. The constant is always included in the model and would cause an error if included in
output_file_name
character, connection, or NULL. If not NULL, significant associations are saved to this file (all significant associations if pvOutputThreshold=0 or only distant if pvOutputThresho
output_file_name.cis
character, connection, or NULL. If not NULL, significant local associations are saved to this file. If the file with this name exists, it is overwritten.
pvOutputThreshold
numeric. Significance threshold for all/distant tests.
pvOutputThreshold.cis
numeric. Same as pvOutputThreshold, but for local eQTLs.
useModel
integer. Eigher modelLINEAR, modelANOVA, or modelLINEAR_CROSS.
  1. SetuseModel = modelLINEARto model the effect of the genotype as additive lin
errorCovariance
numeric. The error covariance matrix. Use numeric() for homoskedastic independent errors.
verbose
logical. Set to TRUE to display more detailed report about the progress.
snpspos
data.frame object with information about SNP locations, must have 3 columns - SNP name, chromosome, and position, like this: ccc{ snpid chr pos Snp_01 1 721289 Snp_02 1 752565 ... ... .
genepos
data.frame with information about transcript locations, must have 4 columns - the name, chromosome, and positions of the left and right ends, like this: cccc{ geneid chr left right Gene_01 1 721289 7
cisDist
numeric. SNP-gene pairs within this distance are considered local. The distance is measured from the nearest end of the gene. SNPs within a gene are always considered local.
pvalue.hist
logical, numerical, or "qqplot". Defines whether and how the distribution of p-values is recorded in the returned object. If pvalue.hist = FALSE, the information is not recorded and the analysis is
min.pv.by.genesnp
logical. Set min.pv.by.genesnp = TRUE to record the minimum p-value for each SNP and each gene in the returned object. The minimum p-values are recorded even if if they are above the corresponding thresholds of pvOutputThre
noFDRsaveMemory
logical. Set noFDRsaveMemory = TRUE to save significant gene-SNP pairs directly to the output files, reduce memory footprint and skip FDR calculation. The eQTLs are not recorded in the returned object if noFDRsaveMemory = T

Value

  • The detected eQTLs are saved in output_file_name and/or output_file_name.cis if they are not NULL. The method also returns a list with a summary of the performed analysis.
  • paramKeeps all input parameters and also records the number of degrees of freedom for the full model.
  • time.in.secTime difference between the start and the end of the analysis (in seconds).
  • allInformation about all detected eQTLs.
  • cisInformation about detected local eQTLs.
  • transInformation about detected distant eQTLs.
  • The elements all, cis, and trans may contain the following components [object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Details

Note that the columns of gene, snps, and cvrt must match. If they do not match in the input files, use ColumnSubsample method to subset and/or reorder them.

References

The package website: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/

See Also

The code below is the sample code for eQTL analysis NOT using gene/SNP locations. See MatrixEQTL_cis_code for sample code for eQTL analysis that separates local and distant tests.

Examples

Run this code
# Matrix eQTL by Andrey A. Shabalin
# http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/
# 
# Be sure to use an up to date version of R and Matrix eQTL.

# source("Matrix_eQTL_R/Matrix_eQTL_engine.r");
library(MatrixEQTL)

## Location of the package with the data files.
base.dir = find.package('MatrixEQTL');

## Settings

# Linear model to use, modelANOVA, modelLINEAR, or modelLINEAR_CROSS
useModel = modelLINEAR; # modelANOVA, modelLINEAR, or modelLINEAR_CROSS

# Genotype file name
SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");

# Gene expression file name
expression_file_name = paste(base.dir, "/data/GE.txt", sep="");

# Covariates file name
# Set to character() for no covariates
covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");

# Output file name
output_file_name = tempfile();

# Only associations significant at this level will be saved
pvOutputThreshold = 1e-2;

# Error covariance matrix
# Set to numeric() for identity.
errorCovariance = numeric();
# errorCovariance = read.table("Sample_Data/errorCovariance.txt");


## Load genotype data

snps = SlicedData$new();
snps$fileDelimiter = "t";      # the TAB character
snps$fileOmitCharacters = "NA"; # denote missing values;
snps$fileSkipRows = 1;          # one row of column labels
snps$fileSkipColumns = 1;       # one column of row labels
snps$fileSliceSize = 2000;      # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name);

## Load gene expression data

gene = SlicedData$new();
gene$fileDelimiter = "t";      # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1;          # one row of column labels
gene$fileSkipColumns = 1;       # one column of row labels
gene$fileSliceSize = 2000;      # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name);

## Load covariates

cvrt = SlicedData$new();
cvrt$fileDelimiter = "t";      # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values;
cvrt$fileSkipRows = 1;          # one row of column labels
cvrt$fileSkipColumns = 1;       # one column of row labels
if(length(covariates_file_name)>0) {
  cvrt$LoadFile(covariates_file_name);
}

## Run the analysis

me = Matrix_eQTL_engine(
    snps = snps,
    gene = gene,
    cvrt = cvrt,
    output_file_name = output_file_name,
    pvOutputThreshold = pvOutputThreshold,
    useModel = useModel, 
    errorCovariance = errorCovariance, 
    verbose = TRUE,
    pvalue.hist = TRUE,
    min.pv.by.genesnp = FALSE,
    noFDRsaveMemory = FALSE);
		
unlink(output_file_name);

## Results:

cat('Analysis done in: ', me$time.in.sec, 'seconds', '');
cat('Detected eQTLs:', '');
show(me$all$eqtls)

## Plot the histogram of all p-values

plot(me)

Run the code above in your browser using DataLab